
When AI Tools Fail (and When They Don't)

Built a structured AI development workflow with 11 custom agents, 14 skills, and a governance framework that evolved over 5.5 months on a production Flutter app.

AI · Mobile · Flutter · Governance
Role: Product Designer (solo, AI-governed build)
Built structured AI governance: 11 agents, 14 skills, 8 commands
30+ AI-co-authored commits held to the same quality bar: 703 tests passing
Mapped AI effectiveness boundary across 4 project phases
13 AI-generated branches verifiable in git

A solo Product Designer & Developer built a structured AI development workflow over 5.5 months on a production app. Custom agents, skills, and commands that evolved alongside the project and the tools themselves.


TL;DR

I used AI tools from day one on a production Flutter app, a professional practice that evolved over 5.5 months across 4 phases. When every AI tool failed on BLE device provisioning, I solved it manually. That drew the first clear effectiveness boundary. I built 11 agents, 14 skills, and 8 commands, each one crafted for this project's domain and refined as I used them. Not templates. Not one-time setups. Living artifacts that got better every time they fell short. 30+ AI-co-authored commits held to the same quality bar: 703 tests passing, no exceptions.

73+ PRs merged, ~195 CI builds · 11/14/8 agents, skills, commands · 30+ AI-co-authored commits · 13 AI-generated branches verifiable in git

Governance architecture overview: agents, skills, and commands forming a layered AI workflow system

By the end, the whole team had independently adopted the AI workflow.


Quick Facts

Company: Sunday Light Limited (London, UK)
Role: Product Designer & Developer (sole contributor on companion app)
Team: CEO/Product Owner, backend engineering, firmware engineering
Platform: iOS + Android (Flutter)
Timeline: Oct 2025 to Mar 2026 (5.5 months of workflow evolution)
Tools: Claude Code, GitHub Copilot, Copilot SWE agent, ChatGPT, Gemini, Flutter, CodeMagic CI/CD

The Client

Sunday Light makes premium indoor sunlight machines. A companion app provisions new devices over BLE, controls brightness and color temperature, and manages presets and automations. I owned the entire app: design and development. (See the flagship case study for the full product story.)

Sunday Light, premium indoor sunlight machine in a modern kitchen with landscape view


The Challenge

"The combination of UX expertise and Flutter development is rare, most agencies have separate designers and developers." (CEO)

As sole designer and developer on a production app, speed matters. The team moved fast, and customers needed to provision, control, and automate their lights. But quality matters just as much: the app ships to real users on the App Store, controlling premium professional lighting hardware. The quality bar had to match the hardware standard.

AI tools promise acceleration. Using them on production code isn't plug-and-play, though. The tools changed every month, with new models and new agent features. I needed an AI-assisted workflow that could keep up with rapidly changing tools while maintaining the same quality bar as manual work.

Key metrics: PRs merged, AI-co-authored commits, tests passing, governance artifacts


What I Built

A production governance system. 11 agent definitions that encode domain context: BLE provisioning constraints, Riverpod state management patterns, IoT device management patterns. 14 skill definitions that standardize repeatable workflows. 8 commands for routine operations. Plus an effectiveness boundary that categorizes every task type by AI suitability.

None of these were written once and left alone. I built this system across 4 phases, and every artifact got revised as I discovered where it fell short. An agent that produced inconsistent naming got tighter conventions. A skill that missed edge cases got additional validation steps. The workflow at month 5 looked nothing like month 1, and neither did the individual agents and skills that powered it.

Governance system detail: agent definitions, skill templates, and command structures

Annotated agent definition file showing domain context, constraints, and task boundaries
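
As a rough illustration of what one agent definition captures, here is a minimal Python sketch. It is not the project's actual file format: the class, its field names, and the example agent name are all assumptions made for illustration. Only the ideas (domain context, constraints, task boundaries, and revision after a shortfall) come from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AgentDefinition:
    """Hypothetical shape of one agent definition; all field names are assumptions."""

    name: str
    domain_context: list[str]  # what the agent must know about the project
    constraints: list[str]     # rules its output must follow
    out_of_scope: list[str]    # task types it hands back to a human
    revisions: list[str] = field(default_factory=list)  # why it was changed

    def revise(self, reason: str, new_constraint: str) -> None:
        """Record a shortfall and tighten the agent before the next task."""
        self.constraints.append(new_constraint)
        self.revisions.append(reason)


agent = AgentDefinition(
    name="state-management",  # hypothetical agent name
    domain_context=["Riverpod state management patterns"],
    constraints=["Follow the project's naming conventions"],
    out_of_scope=["BLE provisioning"],
)

# An agent that produced inconsistent naming gets a tighter convention.
agent.revise(
    reason="Produced inconsistent naming",
    new_constraint="Provider names follow one documented pattern",
)
```

Keeping the revision reasons next to the constraints is one way to make the "living artifact" idea concrete: each rule can be traced back to the failure that motivated it.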


The Process

The AI workflow wasn't designed upfront. It evolved through 4 phases as the project and the AI tools matured. Each phase built on lessons from the previous one.

Phase 1 was ad-hoc: ChatGPT for research, Copilot for code completion, general-purpose prompting with no structure. The BLE provisioning failure, where every AI tool failed after 3 days of loops, drew the first effectiveness boundary and forced a deliberate approach.

Phase 2 introduced structured context. CLAUDE.md conventions encoded codebase architecture and naming patterns. Quality improved measurably.

Phase 3 built the full governance system: 11 agents, 14 skills, 8 commands.

Phase 4 saw team-wide adoption after Copilot SWE agent launched, with human and AI work streams running in parallel on the same day.

Process flow: how the AI workflow evolved over 4 phases


Decision 1

Mapped the AI Effectiveness Boundary: When Every AI Tool Failed on BLE, the First Line Was Drawn

Context

Three weeks into the project, BLE device provisioning hit a wall. The task required forking a native Espressif library, resolving Swift version mismatches, and building a 3-layer bridge architecture where native iOS and Android wrappers are consumed by Flutter via event streams.

I tried every AI tool available: ChatGPT, Gemini, GitHub Copilot. None could help. "They kept taking me in loops." After 3 days, I solved it manually by forking the library and writing the bridge myself.

What I chose and why

That failure turned "use AI tools" from a general strategy into a deliberate practice. After BLE, every task got an explicit assessment: can AI handle this reliably, or does it require human expertise?

AI failure timeline: key moments where AI tools failed and boundaries were established

BLE provisioning screens: the native bridge architecture AI couldn't solve

A second boundary appeared two months later, when an AI-suggested code change conflicted with the device state machine. The team's review process caught it before it reached production, confirming that human review stays essential for hardware-adjacent code.

AI handles well | AI fails on
Bug fixes against known test failures | Novel problems crossing technology boundaries (BLE)
Accessibility annotations (WCAG rules are explicit) | Architecture decisions requiring cross-cutting judgment
Boilerplate and code formatting | Hardware protocol integration
Test writing for existing code | Design judgment and product direction

AI effectiveness boundary: tasks categorized by AI suitability
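
The boundary can be treated as an explicit lookup, not a gut call made per task. The sketch below is one possible encoding, not the project's tooling: the category keys are illustrative labels for the rows of the table above, and the rule that unmapped categories default to a human is an assumption.

```python
from enum import Enum


class Suitability(Enum):
    AI = "ai"
    HUMAN = "human"


# Categories paraphrase the boundary table; the key names are illustrative.
BOUNDARY = {
    "bug_fix_known_test_failure": Suitability.AI,
    "accessibility_annotation": Suitability.AI,
    "boilerplate_formatting": Suitability.AI,
    "tests_for_existing_code": Suitability.AI,
    "novel_cross_technology_problem": Suitability.HUMAN,
    "architecture_decision": Suitability.HUMAN,
    "hardware_protocol_integration": Suitability.HUMAN,
    "design_judgment": Suitability.HUMAN,
}


def assess(task_category: str) -> Suitability:
    """Return who should take the task.

    Assumption: a category that has not been mapped yet goes to a human,
    so the boundary only widens after it has been tested deliberately.
    """
    return BOUNDARY.get(task_category, Suitability.HUMAN)
```

For example, `assess("accessibility_annotation")` returns `Suitability.AI`, while a category that has never been assessed falls through to `Suitability.HUMAN`.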

What I gave up

Time. Mapping effectiveness boundaries is time not spent shipping features. But every future task delegation is faster and more reliable because the boundary is explicit, not guessed.


Decision 2

Built Production AI Governance Across 4 Project Phases

Context

Knowing where AI works is one thing. Making it work reliably on a production codebase is another, especially when the tools themselves keep changing. I needed a system that encoded codebase architecture, naming conventions, and domain patterns so AI agents could produce work consistent with manual quality.

Options considered

Option | Approach | Verdict
Ad hoc prompting | Copy-paste context into each AI session | No consistency; context lost between sessions; quality varies wildly
Single instruction file | One CLAUDE.md or copilot-instructions.md | Better than nothing, but doesn't scale as the codebase grows
Layered governance system | Agents encode domain context, skills handle repeatable workflows, commands cover routine operations | Our choice

What I chose and why

I built the governance system in phases alongside the project, and kept refining it as I learned what worked and what didn't:

Phase evolution: how the AI workflow matured over 5.5 months

Phase | Timeline | What changed | Key addition | What got refined
1 | Oct-Nov 2025 | AI tools from day one. ChatGPT for research, Copilot for completion | BLE failure establishes first boundary | Nothing yet, learning what doesn't work
2 | Dec-Jan 2026 | Claude Code for larger changes; observed team running multiple Claude agents | CLAUDE.md conventions; structured context improves quality | Copilot instructions rewritten after build system governance gaps
3 | Jan-Feb 2026 | Claude Code agent capabilities mature | 11 agents, 14 skills, 8 commands | Agents specialized to Flutter/Riverpod patterns, IoT constraints, API versioning patterns
4 | Feb 2026 | Copilot SWE agent launches; team-wide adoption | Multi-tool delegation across backend, firmware, and product | Skills and commands tightened after real parallel usage revealed gaps

No phase only added new artifacts. Each one also improved the existing ones. When an agent produced code that violated the project's Riverpod conventions, I updated its context. When a skill missed API versioning patterns, I added them. The governance system was a living codebase, not a configuration file.

Structuring AI workflows turned out to be information architecture. Which context does each agent need? What's the right level of autonomy for each task type? How do you keep things current when AI tools release new capabilities every few weeks?

Delegation tree: how tasks flow from human judgment to AI agents based on complexity

Quality gate diagram: verification checkpoints ensuring AI output meets production standards

What I gave up

Time invested in governance infrastructure instead of shipping features. Every agent definition and skill template is time not spent writing application code. But each one makes the next task faster. By month 5, delegating a bug fix took minutes instead of the 30+ minutes of context-setting it required in month 1.


Decision 3

Trusted AI with Production Code

Context

The team needed speed. The product couldn't be provisioned, controlled, or automated without the companion app. But delegating production work to AI carries real risk: inconsistent code, introduced bugs, architecture violations.

What I chose and why

The question was whether to trust AI with production code, and what systems would make that trust justified.

Agent definitions encoded codebase architecture, naming conventions, and patterns specific to this project. Not generic Flutter agents, but agents that knew about this app's Riverpod provider structure, its API versioning patterns, its IoT device management layer. Skill definitions standardized repeatable workflows. CLAUDE.md conventions gave AI agents the context to produce work matching manual quality. All of these were refined continuously. When an agent's output missed a pattern, the agent got updated before the next task.

Task type | Delegated to | Governance layer
Bug fixes against known test failures | AI agent | Skill definitions + test suite validation
Accessibility annotations | AI agent | Agent context + automated a11y test suites
Code formatting and style | AI agent | CLAUDE.md conventions + linter enforcement
Architecture decisions | Manual | Human judgment required
Novel integrations (BLE, Siri) | Manual | Human expertise required
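
One way to read the governance layer column is as a pair: something that guides the agent before it writes code, and something that must pass before its output is accepted. The sketch below encodes that reading as an acceptance gate. The split into context and verification, the task-type keys, and the function name are assumptions for illustration, not the project's actual tooling.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GovernanceLayer:
    context: str       # what guides the agent before it writes code
    verification: str  # what must pass before its output is accepted


# Only the delegable rows of the table appear here; keys are illustrative.
LAYERS = {
    "bug_fix": GovernanceLayer("skill definitions", "test suite"),
    "accessibility_annotation": GovernanceLayer("agent context", "a11y test suites"),
    "formatting": GovernanceLayer("CLAUDE.md conventions", "linter"),
}


def accept(task_type: str, passed: set[str]) -> bool:
    """Accept AI output only for delegable task types whose verification passed."""
    layer = LAYERS.get(task_type)
    if layer is None:
        # Architecture decisions and novel integrations stay manual.
        return False
    return layer.verification in passed
```

Under this sketch, `accept("bug_fix", {"test suite"})` is true, while an architecture decision is rejected no matter which checks passed, because it was never delegable in the first place.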

Two days proved the system worked:

The busiest single day: parallel human and AI work streams on Feb 25

Feb 25, the busiest single day. I shipped test repair (14 pre-existing failures fixed), two version releases, slider/toggle jitter fixes, per-device interaction lock implementation, full Siri Shortcuts (a 10-step native iOS implementation), and performance optimizations. Simultaneously, 4 bug fix PRs ran on the Copilot SWE agent, handling routine fixes across the codebase. The governance system handled the delegation. I focused on architecture work; the AI handled the routine stuff.

Accessibility audit results: 62 items audited, 60 resolved

Mar 3, the accessibility audit. 62 items audited, 60 resolved. Custom Flutter widgets and images are largely invisible to assistive technology unless they are annotated, so every interactive element needed explicit annotation. Image labels went from 0/9 to 9/9. Six new test suites added. AI agents handled the repetitive accessibility annotations (semantic labels, touch target adjustments, contrast checks) while I made the structural decisions about navigation order and screen reader flow.

The proof: 30+ AI-co-authored commits across 10 production releases. 13 AI-generated branches verifiable in git. 703 tests passing, same bar for everything.

AI-generated branches in git: 13 branches showing verifiable AI contributions

Team-wide adoption: the governance system scaling beyond one developer


Reflection

AI fluency is a practice, and so are the tools you build around it. The AI tools that failed in October worked by February, partly because the tools improved and partly because I got better at using them. The agents and skills I built also improved. Each one went through multiple revisions as I discovered gaps and better ways to encode domain knowledge. The workflow evolved because I kept re-evaluating what was possible, and kept refining the artifacts that made it work.

The governance system outlasts me on the project. The agents, skills, and commands aren't personal shortcuts or one-time configurations. They're production artifacts encoding months of project knowledge. A new team member inherits the same domain context, quality gates, and workflow patterns without starting from scratch. They inherit artifacts that have been battle-tested through real usage.

This is information architecture work. Structuring AI workflows is the same systems thinking that drives good IA: deciding what context each agent needs, how much autonomy to grant, and where to draw the guardrails. Once I started thinking about it that way, the design decisions got clearer.

AI handles volume. Humans handle judgment. The hardest problems in this project (BLE provisioning, slider state management architecture, the Living Light CCT-aware UI) were all solved manually. AI accelerated the repetitive work: accessibility annotations, bug fix throughput, test coverage. Sorting tasks into those two categories before starting saves more time than any individual tool.
