A solo Product Designer & Developer built a structured AI development workflow over 5.5 months on a production app. Custom agents, skills, and commands that evolved alongside the project and the tools themselves.
TL;DR
I used AI tools from day one on a production Flutter app, a professional practice that evolved over 5.5 months across 4 phases. When every AI tool failed on BLE device provisioning, I solved it manually. That drew the first clear effectiveness boundary. I built 11 agents, 14 skills, and 8 commands, each one crafted for this project's domain and refined as I used them. Not templates. Not one-time setups. Living artifacts that got better every time they fell short. 30+ AI-co-authored commits held to the same quality bar: 703 tests passing, no exceptions.
73+ PRs merged · ~195 CI builds · 11 agents, 14 skills, 8 commands · 30+ AI-co-authored commits · 13 AI-generated branches verifiable in git

By the end, the whole team had independently adopted the AI workflow.
Quick Facts
| Field | Detail |
|---|---|
| Company | Sunday Light Limited (London, UK) |
| Role | Product Designer & Developer (sole contributor on companion app) |
| Team | CEO/Product Owner, backend engineering, firmware engineering |
| Platform | iOS + Android (Flutter) |
| Timeline | Oct 2025 to Mar 2026 (5.5 months of workflow evolution) |
| Tools | Claude Code, GitHub Copilot, Copilot SWE agent, ChatGPT, Gemini, Flutter, CodeMagic CI/CD |
The Client
Sunday Light makes premium indoor sunlight machines. A companion app provisions new devices over BLE, controls brightness and color temperature, and manages presets and automations. I owned the entire app: design and development. (See the flagship case study for the full product story.)

The Challenge
"The combination of UX expertise and Flutter development is rare, most agencies have separate designers and developers." (CEO)
As sole designer and developer on a production app, I needed speed. The team moved fast, and customers needed to provision, control, and automate their lights. But quality mattered just as much: the app ships to real users on the App Store and controls premium professional lighting hardware. The quality bar had to match the hardware standard.
AI tools promise acceleration. Using them on production code isn't plug-and-play, though. The tools changed every month, with new models and new agent features. I needed an AI-assisted workflow that could keep up with rapidly changing tools while maintaining the same quality bar as manual work.

What I Built
A production governance system. 11 agent definitions that encode domain context: BLE provisioning constraints, Riverpod state management patterns, IoT device management patterns. 14 skill definitions that standardize repeatable workflows. 8 commands for routine operations. Plus an effectiveness boundary that categorizes every task type by AI suitability.
None of these were written once and left alone. I built this system across 4 phases, and every artifact got revised as I discovered where it fell short. An agent that produced inconsistent naming got tighter conventions. A skill that missed edge cases got additional validation steps. The workflow at month 5 looked nothing like month 1, and neither did the individual agents and skills that powered it.


The Process
The AI workflow wasn't designed upfront. It evolved through 4 phases as the project and the AI tools matured. Each phase built on lessons from the previous one.
Phase 1 was ad hoc: ChatGPT for research, Copilot for code completion, general-purpose prompting with no structure. The BLE provisioning failure, in which every AI tool I tried sent me in loops for 3 days, drew the first effectiveness boundary and forced a deliberate approach.
Phase 2 introduced structured context. CLAUDE.md conventions encoded codebase architecture and naming patterns. Quality improved measurably.
Phase 3 built the full governance system: 11 agents, 14 skills, 8 commands.
Phase 4 saw team-wide adoption after Copilot SWE agent launched, with human and AI work streams running in parallel, sometimes on the same day.

Decision 1
Mapped the AI Effectiveness Boundary: When Every AI Tool Failed on BLE, the First Line Was Drawn
Context
Three weeks into the project, BLE device provisioning hit a wall. The task required forking a native Espressif library, resolving Swift version mismatches, and building a 3-layer bridge architecture where native iOS and Android wrappers are consumed by Flutter via event streams.
I tried every AI tool available: ChatGPT, Gemini, GitHub Copilot. None could help. "They kept taking me in loops." After 3 days, I solved it manually by forking the library and writing the bridge myself.
What I chose and why
That failure turned "use AI tools" from a general strategy into a deliberate practice. After BLE, every task got an explicit assessment: can AI handle this reliably, or does it require human expertise?


A second boundary appeared two months later, when an AI-suggested code change conflicted with the device state machine. The team's review process caught it before it reached production, confirming that human review stays essential for hardware-adjacent code.
| AI handles well | AI fails on |
|---|---|
| Bug fixes against known test failures | Novel problems crossing technology boundaries (BLE) |
| Accessibility annotations (WCAG rules are explicit) | Architecture decisions requiring cross-cutting judgment |
| Boilerplate and code formatting | Hardware protocol integration |
| Test writing for existing code | Design judgment and product direction |
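
The boundary table can be read as a lookup that runs before any task is delegated. The following is a minimal Python sketch of that idea, not the project's actual tooling: the category keys and the `route_task` helper are hypothetical names that paraphrase the table rows, and the rule that unmapped task types default to manual work is an assumption about how a conservative boundary would behave.

```python
from enum import Enum


class Route(Enum):
    AI = "ai"
    MANUAL = "manual"


# Hypothetical category keys paraphrasing the rows of the boundary table.
BOUNDARY = {
    "bug_fix_known_test_failure": Route.AI,
    "accessibility_annotation": Route.AI,
    "boilerplate_formatting": Route.AI,
    "tests_for_existing_code": Route.AI,
    "cross_technology_novel_problem": Route.MANUAL,
    "architecture_decision": Route.MANUAL,
    "hardware_protocol_integration": Route.MANUAL,
    "design_and_product_direction": Route.MANUAL,
}


def route_task(category: str) -> Route:
    """Return who takes a task; unmapped categories stay manual (assumed default)."""
    return BOUNDARY.get(category, Route.MANUAL)


if __name__ == "__main__":
    for name in ("accessibility_annotation", "hardware_protocol_integration", "unmapped"):
        print(f"{name}: {route_task(name).value}")
```

Defaulting to manual keeps a new, unassessed task type from being delegated by accident; it only moves to the AI column after an explicit assessment.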

What I gave up
Time. Mapping effectiveness boundaries is time not spent shipping features. But every future task delegation is faster and more reliable because the boundary is explicit, not guessed.
Decision 2
Built Production AI Governance Across 4 Project Phases
Context
Knowing where AI works is one thing. Making it work reliably on a production codebase is another, especially when the tools themselves keep changing. I needed a system that encoded codebase architecture, naming conventions, and domain patterns so AI agents could produce work consistent with manual quality.
Options considered
| Option | Approach | Verdict |
|---|---|---|
| Ad hoc prompting | Copy-paste context into each AI session | No consistency; context lost between sessions; quality varies wildly |
| Single instruction file | One CLAUDE.md or copilot-instructions.md | Better than nothing, but doesn't scale as the codebase grows |
| Layered governance system | Agents encode domain context, skills handle repeatable workflows, commands cover routine operations | Chosen |
What I chose and why
I built the governance system in phases alongside the project, and kept refining it as I learned what worked and what didn't:

| Phase | Timeline | What changed | Key addition | What got refined |
|---|---|---|---|---|
| 1 | Oct-Nov 2025 | AI tools from day one. ChatGPT for research, Copilot for completion | BLE failure establishes first boundary | Nothing yet, learning what doesn't work |
| 2 | Dec 2025-Jan 2026 | Claude Code for larger changes; observed team running multiple Claude agents | CLAUDE.md conventions; structured context improves quality | Copilot instructions rewritten after build system governance gaps |
| 3 | Jan-Feb 2026 | Claude Code agent capabilities mature | 11 agents, 14 skills, 8 commands | Agents specialized to Flutter/Riverpod patterns, IoT constraints, API versioning patterns |
| 4 | Feb 2026 | Copilot SWE agent launches; team-wide adoption | Multi-tool delegation across backend, firmware, and product | Skills and commands tightened after real parallel usage revealed gaps |
No phase only added new artifacts. Each one also improved the existing ones. When an agent produced code that violated the project's Riverpod conventions, I updated its context. When a skill missed API versioning patterns, I added them. The governance system was a living codebase, not a configuration file.
Structuring AI workflows turned out to be information architecture. Which context does each agent need? What's the right level of autonomy for each task type? How do you keep things current when AI tools release new capabilities every few weeks?


What I gave up
Time invested in governance infrastructure instead of shipping features. Every agent definition and skill template is time not spent writing application code. But each one makes the next task faster. By month 5, delegating a bug fix took minutes instead of the 30+ minutes of context-setting it required in month 1.
Decision 3
Trusted AI with Production Code
Context
The team needed speed. The product couldn't be provisioned, controlled, or automated without the companion app. But delegating production work to AI carries real risk: inconsistent code, introduced bugs, architecture violations.
What I chose and why
The question was whether to trust AI with production code, and what systems would make that trust justified.
Agent definitions encoded codebase architecture, naming conventions, and patterns specific to this project. Not generic Flutter agents, but agents that knew about this app's Riverpod provider structure, its API versioning patterns, its IoT device management layer. Skill definitions standardized repeatable workflows. CLAUDE.md conventions gave AI agents the context to produce work matching manual quality. All of these were refined continuously. When an agent's output missed a pattern, the agent got updated before the next task.
| Task type | Delegated to | Governance layer |
|---|---|---|
| Bug fixes against known test failures | AI agent | Skill definitions + test suite validation |
| Accessibility annotations | AI agent | Agent context + automated a11y test suites |
| Code formatting and style | AI agent | CLAUDE.md conventions + linter enforcement |
| Architecture decisions | Manual | Human judgment required |
| Novel integrations (BLE, Siri) | Manual | Human expertise required |
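
One way to picture the governance layer column is as a merge gate keyed by task type. The sketch below is illustrative only: the task keys, check names, and the `can_merge` helper are hypothetical, it models only the automated half of each governance layer, and the requirement that every change is also human-reviewed is an assumption drawn from the review process described earlier, not a documented rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delegation:
    delegate: str            # "ai" or "manual"
    checks: tuple[str, ...]  # automated checks that must pass before merge


# Hypothetical keys and check names paraphrasing the delegation table.
DELEGATIONS = {
    "bug_fix": Delegation("ai", ("test_suite",)),
    "accessibility": Delegation("ai", ("a11y_tests",)),
    "formatting": Delegation("ai", ("linter",)),
    "architecture": Delegation("manual", ()),
    "novel_integration": Delegation("manual", ()),
}


def can_merge(task_type: str, passed_checks: set[str], human_reviewed: bool) -> bool:
    """Allow a merge only for a known task type, with review and all its checks green."""
    rule = DELEGATIONS.get(task_type)
    if rule is None or not human_reviewed:
        return False
    return all(check in passed_checks for check in rule.checks)


if __name__ == "__main__":
    print(can_merge("bug_fix", {"test_suite"}, human_reviewed=True))
    print(can_merge("formatting", {"linter"}, human_reviewed=False))
```

Keeping the checks in data rather than in prompts means the same gate applies whether a person or an agent authored the change.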
Two days proved the system worked:

Feb 25, the busiest single day. I shipped test repair (14 pre-existing failures fixed), two version releases, slider/toggle jitter fixes, per-device interaction lock implementation, full Siri Shortcuts (a 10-step native iOS implementation), and performance optimizations. Simultaneously, 4 bug fix PRs ran on the Copilot SWE agent, handling routine fixes across the codebase. The governance system handled the delegation. I focused on architecture work; the AI handled the routine stuff.

Mar 3, the accessibility audit. 62 items audited, 60 resolved. Flutter widgets are invisible to assistive technology by default, so every interactive element needed explicit annotation. Image labels went from 0/9 to 9/9. Six new test suites added. AI agents handled the repetitive accessibility annotations (semantic labels, touch target adjustments, contrast checks) while I made the structural decisions about navigation order and screen reader flow.
The proof: 30+ AI-co-authored commits across 10 production releases. 13 AI-generated branches verifiable in git. 703 tests passing, same bar for everything.


Reflection
AI fluency is a practice, and so are the tools you build around it. The AI tools that failed in October worked by February, partly because the tools improved and partly because I got better at using them. The agents and skills I built also improved. Each one went through multiple revisions as I discovered gaps and better ways to encode domain knowledge. The workflow evolved because I kept re-evaluating what was possible, and kept refining the artifacts that made it work.
The governance system outlasts me on the project. The agents, skills, and commands aren't personal shortcuts or one-time configurations. They're production artifacts encoding months of project knowledge. A new team member inherits the same domain context, quality gates, and workflow patterns without starting from scratch. They inherit artifacts that have been battle-tested through real usage.
This is information architecture work. Structuring AI workflows is the same systems thinking that drives good IA: deciding what context each agent needs, how much autonomy to grant, and where to draw the guardrails. Once I started thinking about it that way, the design decisions got clearer.
AI handles volume. Humans handle judgment. The hardest problems in this project (BLE provisioning, slider state management architecture, the Living Light CCT-aware UI) were all solved manually. AI accelerated the repetitive work: accessibility annotations, bug fix throughput, test coverage. Sorting tasks into those two categories before starting saves more time than any individual tool.


