Pupil

AI-powered app exploration and testing
Demo video - exploring & generating tests on Linear demo
Exploration Dashboard
Static export of the Pupil dashboard for exploration sessions that ran on Linear and YouTube: video playback, screen coverage, action timeline, and exploration report
What Pupil does

Pupil is a vision-based AI QA agent built on Claude Code; point it at any web or mobile app and it explores the interface autonomously. The agent takes screenshots, identifies interactive elements using visual parsing, works through the UI, and builds a navigation graph of every screen it finds. When it finishes, you get a coverage report with proposed test cases, documented screens, flagged issues, and a full video recording with action markers so you can scrub to any moment.
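The navigation graph the agent builds can be pictured as a small data structure like the sketch below. Names and fields here are illustrative, not Pupil's actual internals: screens are keyed by a layout fingerprint, edges record which action caused each transition, and coverage is the fraction of known elements that have been exercised.

```python
from dataclasses import dataclass, field


@dataclass
class Screen:
    """A discovered screen, keyed by a fingerprint of its element layout."""
    fingerprint: str
    elements: list[str]                      # labels of interactive elements seen
    tested: set[str] = field(default_factory=set)  # elements already exercised


class NavigationGraph:
    """Directed graph of screens; edges record which action caused each transition."""

    def __init__(self) -> None:
        self.screens: dict[str, Screen] = {}
        self.edges: list[tuple[str, str, str]] = []  # (from_fp, action, to_fp)

    def add_screen(self, screen: Screen) -> bool:
        """Register a screen; returns True only if it was not seen before."""
        if screen.fingerprint in self.screens:
            return False
        self.screens[screen.fingerprint] = screen
        return True

    def add_transition(self, from_fp: str, action: str, to_fp: str) -> None:
        self.edges.append((from_fp, action, to_fp))

    def coverage(self) -> float:
        """Fraction of all known interactive elements that have been tested."""
        total = sum(len(s.elements) for s in self.screens.values())
        done = sum(len(s.tested) for s in self.screens.values())
        return done / total if total else 0.0
```

A coverage report is then just a walk over `screens` and `edges`, which is why screenshots and video markers fall out of the exploration as a side effect.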

What a human can do now that they couldn't before: run exhaustive exploratory testing across multiple applications simultaneously, unattended. A human can click through an app and take notes, but they can't do it across ten apps in parallel, they can't do it for eight hours without fatigue, and they don't produce a structured coverage report with screenshots and video as a side effect of the work itself. A single developer can kick off a 30-minute run for a quick surface check or an 8-hour run overnight for deep coverage, across as many apps as they want, and come back to full reports in the morning. This matters most for applications that resist conventional testing: canvas renderers, emulators, and third-party UIs where you have no access to the DOM.

What the AI is responsible for: every navigation decision during exploration. The agent decides what to tap next, identifies when it has reached a new screen by fingerprinting element layouts, parses screen content using OmniParser to annotate what it sees, and proposes test cases based on observed behavior. It also detects anomalies like slow loads, error messages, and unexpected states by comparing before-and-after screenshots of each action.
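Fingerprinting element layouts, as described above, can be done by quantizing element bounding boxes to a coarse grid and hashing the result, which makes the fingerprint robust to small pixel jitter between screenshots. This is a minimal sketch under assumed inputs (normalized boxes from the visual parser), not Pupil's actual implementation:

```python
import hashlib


def screen_fingerprint(elements: list[tuple[float, float, float, float]],
                       grid: int = 24) -> str:
    """Fingerprint a screen by the coarse layout of its interactive elements.

    `elements` is a list of (x, y, w, h) boxes normalized to [0, 1].
    Quantizing each coordinate to a `grid`-cell resolution means two
    screenshots of the same screen hash identically despite minor jitter.
    """
    cells = sorted(
        (round(x * grid), round(y * grid), round(w * grid), round(h * grid))
        for x, y, w, h in elements
    )
    digest = hashlib.sha256(repr(cells).encode()).hexdigest()
    return digest[:16]  # short id is enough to key a navigation graph
```

Anomaly detection can then reuse the same machinery in reverse: if an action's before-and-after screenshots fingerprint identically when a transition was expected, or differ when none was, the action is flagged for the report.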

Where the AI must stop: it cannot judge correctness or usefulness. It can tell you a screen exists and document every element on it, but not always whether that screen matters or whether an error message is expected. Proposed tests are drafts. The critical decision that must remain human is deciding which tests are worth keeping and which assertions are accurate, because that requires understanding what the product is for, what users care about, and which flows are business-critical; the agent sees structure, not meaning.

What would break first at scale: state-space explosion. In more complex apps, navigation paths branch beyond what the agent can visit in any reasonable time. Pupil currently uses a greedy strategy (prioritize untested elements, prefer transitions to new screens). Beyond a certain complexity, you would need goal-directed exploration or human-guided scope narrowing to keep runs tractable. The second bottleneck is visual parsing reliability in dense or exotic UIs, where overlapping elements produce noisy annotations that degrade screen fingerprinting accuracy.
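The greedy strategy above (prioritize untested elements, prefer transitions to new screens) reduces to a two-key scoring rule. The sketch below is one way to express it; the argument names and tie-breaking are illustrative assumptions, not Pupil's code:

```python
def pick_next_action(elements: list[str],
                     tested: set[str],
                     destinations: dict[str, str],
                     visited: set[str]) -> str:
    """Greedy exploration policy: pick the most promising element to tap.

    First key:  prefer elements never tapped before.
    Second key: prefer elements whose known destination screen is
                not yet in the visited set (unknown destinations count
                as potentially new).
    """
    def score(el: str) -> tuple[bool, bool]:
        untested = el not in tested
        leads_to_new = destinations.get(el) not in visited
        return (untested, leads_to_new)

    return max(elements, key=score)
```

The scale limit follows directly from this shape: the policy only looks one step ahead, so once branching outgrows the time budget, a goal-directed planner or a human-supplied scope (e.g. "only explore settings flows") has to prune the frontier instead.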