When AI Tools Fail (and When They Don't)

Built a structured AI development workflow with custom agents, skills, and a governance framework that evolved over 5.5 months on a production Flutter app.

AI · Mobile · Flutter · Governance

Role: Product Designer (solo, AI-governed build)

Built structured AI governance: 11 agents, 14 skills, 8 commands

30+ AI-co-authored commits held to the same quality bar: 703 tests passing

Mapped AI effectiveness boundary across 4 project phases

13 AI-generated branches verifiable in git


A solo Product Designer & Developer built a structured AI development workflow over 5.5 months on a production app: custom agents, skills, and commands that evolved alongside the project and the tools themselves.

TL;DR

I used AI tools from day one on a production Flutter app, a professional practice that evolved over 5.5 months across 4 phases. When every AI tool failed on BLE device provisioning, I solved it manually. That drew the first clear effectiveness boundary. I built 11 agents, 14 skills, and 8 commands, each one crafted for this project's domain and refined as I used them. Not templates. Not one-time setups. Living artifacts that got better every time they fell short. 30+ AI-co-authored commits held to the same quality bar: 703 tests passing, no exceptions.

73+ PRs merged, ~195 CI builds · 11/14/8 agents, skills, commands · 30+ AI-co-authored commits · 13 AI-generated branches verifiable in git

Governance architecture overview: agents, skills, and commands forming a layered AI workflow system

By the end, the whole team had independently adopted the AI workflow.

Quick Facts


Company	Sunday Light Limited (London, UK)
Role	Product Designer & Developer (sole contributor on companion app)
Team	CEO/Product Owner, backend engineering, firmware engineering
Platform	iOS + Android (Flutter)
Timeline	Oct 2025 to Mar 2026 (5.5 months of workflow evolution)
Tools	Claude Code, GitHub Copilot, Copilot SWE agent, ChatGPT, Gemini, Flutter, CodeMagic CI/CD

The Client

Sunday Light makes premium indoor sunlight machines. A companion app provisions new devices over BLE, controls brightness and color temperature, and manages presets and automations. I owned the entire app: design and development. (See the flagship case study for the full product story.)

Sunday Light, premium indoor sunlight machine in a modern kitchen with landscape view

The Challenge

"The combination of UX expertise and Flutter development is rare, most agencies have separate designers and developers." (CEO)

As sole designer and developer on a production app, speed matters. The team moved fast, and customers needed to provision, control, and automate their lights. But quality matters just as much: the app ships to real users on the App Store, controlling premium professional lighting hardware. The quality bar had to match the hardware standard.

AI tools promise acceleration. Using them on production code isn't plug-and-play, though. The tools changed every month, with new models and new agent features. I needed an AI-assisted workflow that could keep up with rapidly changing tools while maintaining the same quality bar as manual work.

Key metrics: PRs merged, AI-co-authored commits, tests passing, governance artifacts

What I Built

A production governance system. 11 agent definitions that encode domain context: BLE provisioning constraints, Riverpod state management patterns, IoT device management patterns. 14 skill definitions that standardize repeatable workflows. 8 commands for routine operations. Plus an effectiveness boundary that categorizes every task type by AI suitability.

None of these were written once and left alone. I built this system across 4 phases, and every artifact got revised as I discovered where it fell short. An agent that produced inconsistent naming got tighter conventions. A skill that missed edge cases got additional validation steps. The workflow at month 5 looked nothing like month 1, and neither did the individual agents and skills that powered it.

Governance system detail: agent definitions, skill templates, and command structures

Annotated agent definition file showing domain context, constraints, and task boundaries

The Process

The AI workflow wasn't designed upfront. It evolved through 4 phases as the project and the AI tools matured. Each phase built on lessons from the previous one.

Phase 1 was ad-hoc: ChatGPT for research, Copilot for code completion, general-purpose prompting with no structure. The BLE provisioning failure, where every AI tool failed after 3 days of loops, drew the first effectiveness boundary and forced a deliberate approach.

Phase 2 introduced structured context. CLAUDE.md conventions encoded codebase architecture and naming patterns. Quality improved measurably.

Phase 3 built the full governance system: 11 agents, 14 skills, 8 commands.

Phase 4 saw team-wide adoption after Copilot SWE agent launched, with human and AI work streams running in parallel on the same day.

Process flow: how the AI workflow evolved over 4 phases

Decision 1

Mapped the AI Effectiveness Boundary: When Every AI Tool Failed on BLE, the First Line Was Drawn

Context

Three weeks into the project, BLE device provisioning hit a wall. The task required forking a native Espressif library, resolving Swift version mismatches, and building a 3-layer bridge architecture where native iOS and Android wrappers are consumed by Flutter via event streams.

I tried every AI tool available: ChatGPT, Gemini, GitHub Copilot. None could help; they kept taking me in loops. After 3 days, I solved it manually by forking the library and writing the bridge myself.

What I chose and why

That failure turned "use AI tools" from a general strategy into a deliberate practice. After BLE, every task got an explicit assessment: can AI handle this reliably, or does it require human expertise?

AI failure timeline: key moments where AI tools failed and boundaries were established

BLE provisioning screens: the native bridge architecture AI couldn't solve

A second boundary appeared two months later, when an AI-suggested code change conflicted with the device state machine. The team's review process caught it before it reached production, confirming that human review stays essential for hardware-adjacent code.

AI handles well	AI fails on
Bug fixes against known test failures	Novel problems crossing technology boundaries (BLE)
Accessibility annotations (WCAG rules are explicit)	Architecture decisions requiring cross-cutting judgment
Boilerplate and code formatting	Hardware protocol integration
Test writing for existing code	Design judgment and product direction

AI effectiveness boundary: tasks categorized by AI suitability

What I gave up

Time. Mapping effectiveness boundaries is time not spent shipping features. But every future task delegation is faster and more reliable because the boundary is explicit, not guessed.

Decision 2

Built Production AI Governance Across 4 Project Phases

Context

Knowing where AI works is one thing. Making it work reliably on a production codebase is another, especially when the tools themselves keep changing. I needed a system that encoded codebase architecture, naming conventions, and domain patterns so AI agents could produce work consistent with manual quality.

Options considered

Option	Approach	Verdict
Ad hoc prompting	Copy-paste context into each AI session	No consistency; context lost between sessions; quality varies wildly
Single instruction file	One CLAUDE.md or copilot-instructions.md	Better than nothing, but doesn't scale as the codebase grows
Layered governance system	Agents encode domain context, skills handle repeatable workflows, commands cover routine operations	Our choice

What I chose and why

I built the governance system in phases alongside the project, and kept refining it as I learned what worked and what didn't:

Phase evolution: how the AI workflow matured over 5.5 months

Phase	Timeline	What changed	Key addition	What got refined
1	Oct-Nov 2025	AI tools from day one. ChatGPT for research, Copilot for completion	BLE failure establishes first boundary	Nothing yet, learning what doesn't work
2	Dec 2025-Jan 2026	Claude Code for larger changes; observed team running multiple Claude agents	CLAUDE.md conventions; structured context improves quality	Copilot instructions rewritten after build system governance gaps
3	Jan-Feb 2026	Claude Code agent capabilities mature	11 agents, 14 skills, 8 commands	Agents specialized to Flutter/Riverpod patterns, IoT constraints, API versioning patterns
4	Feb 2026	Copilot SWE agent launches; team-wide adoption	Multi-tool delegation across backend, firmware, and product	Skills and commands tightened after real parallel usage revealed gaps

No phase only added new artifacts; each one also improved the existing ones. When an agent produced code that violated the project's Riverpod conventions, I updated its context. When a skill missed API versioning patterns, I added them. The governance system was a living codebase, not a configuration file.

Structuring AI workflows turned out to be information architecture. Which context does each agent need? What's the right level of autonomy for each task type? How do you keep things current when AI tools release new capabilities every few weeks?

Delegation tree: how tasks flow from human judgment to AI agents based on complexity

Quality gate diagram: verification checkpoints ensuring AI output meets production standards

What I gave up

Time invested in governance infrastructure instead of shipping features. Every agent definition and skill template is time not spent writing application code. But each one makes the next task faster. By month 5, delegating a bug fix took minutes instead of the 30+ minutes of context-setting it required in month 1.

Decision 3

Trusted AI with Production Code

Context

The team needed speed. The product couldn't be provisioned, controlled, or automated without the companion app. But delegating production work to AI carries real risk: inconsistent code, introduced bugs, architecture violations.

What I chose and why

The question was whether to trust AI with production code, and what systems would make that trust justified.

Agent definitions encoded codebase architecture, naming conventions, and patterns specific to this project. Not generic Flutter agents, but agents that knew about this app's Riverpod provider structure, its API versioning patterns, its IoT device management layer. Skill definitions standardized repeatable workflows. CLAUDE.md conventions gave AI agents the context to produce work matching manual quality. All of these were refined continuously. When an agent's output missed a pattern, the agent got updated before the next task.

Task type	Delegated to	Governance layer
Bug fixes against known test failures	AI agent	Skill definitions + test suite validation
Accessibility annotations	AI agent	Agent context + automated a11y test suites
Code formatting and style	AI agent	CLAUDE.md conventions + linter enforcement
Architecture decisions	Manual	Human judgment required
Novel integrations (BLE, Siri)	Manual	Human expertise required
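The table above can be read as a routing rule plus a merge gate. Here is a minimal sketch of that idea in Python, not the project's actual tooling: the task keys, check names, the `human_review` check for manual rows, and the conservative default for unmapped tasks are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Delegate(Enum):
    AI_AGENT = "AI agent"
    MANUAL = "Manual"


@dataclass(frozen=True)
class Rule:
    delegate: Delegate
    required_checks: tuple[str, ...]  # governance layers that must pass


# Rows mirror the delegation table; keys and check names are hypothetical
# identifiers invented for this sketch.
RULES = {
    "bug_fix_known_test_failure": Rule(Delegate.AI_AGENT, ("skill_definition", "test_suite")),
    "accessibility_annotation": Rule(Delegate.AI_AGENT, ("agent_context", "a11y_tests")),
    "formatting_and_style": Rule(Delegate.AI_AGENT, ("claude_md_conventions", "linter")),
    "architecture_decision": Rule(Delegate.MANUAL, ("human_review",)),
    "novel_integration": Rule(Delegate.MANUAL, ("human_review",)),
}

# Assumption: any task type that has not been explicitly mapped stays manual.
DEFAULT_RULE = Rule(Delegate.MANUAL, ("human_review",))


def route(task_type: str) -> Rule:
    """Look up who handles a task type and which checks gate the result."""
    return RULES.get(task_type, DEFAULT_RULE)


def may_merge(task_type: str, passed_checks: set[str]) -> bool:
    """True only when every check required for this task type has passed."""
    return set(route(task_type).required_checks) <= passed_checks


print(route("accessibility_annotation").delegate.value)  # AI agent
print(route("ble_provisioning").delegate.value)  # Manual (unmapped task)
```

Defaulting unmapped work to manual keeps the boundary explicit: a task type only reaches an AI agent once someone has deliberately added a row for it.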

Two days proved the system worked:

The busiest single day: parallel human and AI work streams on Feb 25

Feb 25, the busiest single day. I shipped test repair (14 pre-existing failures fixed), two version releases, slider/toggle jitter fixes, per-device interaction lock implementation, full Siri Shortcuts (a 10-step native iOS implementation), and performance optimizations. Simultaneously, 4 bug fix PRs ran on the Copilot SWE agent, handling routine fixes across the codebase. The governance system handled the delegation. I focused on architecture work; the AI handled the routine stuff.

Accessibility audit results: 62 items audited, 60 resolved

Mar 3, the accessibility audit. 62 items audited, 60 resolved. Flutter renders its own widgets rather than native controls, so any element without semantics information (custom controls, icons, images) is invisible to assistive technology until it is explicitly annotated. Image labels went from 0/9 to 9/9. Six new test suites added. AI agents handled the repetitive accessibility annotations (semantic labels, touch target adjustments, contrast checks) while I made the structural decisions about navigation order and screen reader flow.

The proof: 30+ AI-co-authored commits across 10 production releases. 13 AI-generated branches verifiable in git. 703 tests passing, same bar for everything.

AI-generated branches in git: 13 branches showing verifiable AI contributions

Team-wide adoption: the governance system scaling beyond one developer

Reflection

AI fluency is a practice, and so are the tools you build around it. The AI tools that failed in October worked by February, partly because the tools improved and partly because I got better at using them. The agents and skills I built also improved. Each one went through multiple revisions as I discovered gaps and better ways to encode domain knowledge. The workflow evolved because I kept re-evaluating what was possible, and kept refining the artifacts that made it work.

The governance system outlasts me on the project. The agents, skills, and commands aren't personal shortcuts or one-time configurations. They're production artifacts encoding months of project knowledge. A new team member inherits the same domain context, quality gates, and workflow patterns without starting from scratch. They inherit artifacts that have been battle-tested through real usage.

This is information architecture work. Structuring AI workflows is the same systems thinking that drives good IA: deciding what context each agent needs, how much autonomy to grant, and where to draw the guardrails. Once I started thinking about it that way, the design decisions got clearer.

AI handles volume. Humans handle judgment. The hardest problems in this project (BLE provisioning, slider state management architecture, the Living Light CCT-aware UI) were all solved manually. AI accelerated the repetitive work: accessibility annotations, bug fix throughput, test coverage. Sorting tasks into those two categories before starting saves more time than any individual tool.
