Pupil

AI-powered app exploration and testing
Demo video - exploring & generating tests on Linear demo
Exploration Dashboard
Static export of the Pupil dashboard for exploration sessions that ran on Linear and YouTube: video playback, screen coverage, action timeline, and exploration report
What Pupil does

Pupil is a vision-based AI QA agent built on Claude Code; point it at any web or mobile app and it explores the interface autonomously. The agent takes screenshots, identifies interactive elements using visual parsing, works through the UI, and builds a navigation graph of every screen it finds. When it finishes, you get a coverage report with proposed test cases, documented screens, flagged issues, and a full video recording with action markers so you can scrub to any moment.
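The navigation graph the agent builds can be pictured as a small data structure like the sketch below. Names and fields here are illustrative, not Pupil's actual internals: screens are keyed by a layout fingerprint, edges record which action caused each transition, and coverage is the fraction of known elements that have been exercised.

```python
from dataclasses import dataclass, field


@dataclass
class Screen:
    """A discovered screen, keyed by a fingerprint of its element layout."""
    fingerprint: str
    elements: list[str]                      # labels of interactive elements seen
    tested: set[str] = field(default_factory=set)  # elements already exercised


class NavigationGraph:
    """Directed graph of screens; edges record which action caused each transition."""

    def __init__(self) -> None:
        self.screens: dict[str, Screen] = {}
        self.edges: list[tuple[str, str, str]] = []  # (from_fp, action, to_fp)

    def add_screen(self, screen: Screen) -> bool:
        """Register a screen; returns True only if it was not seen before."""
        if screen.fingerprint in self.screens:
            return False
        self.screens[screen.fingerprint] = screen
        return True

    def add_transition(self, from_fp: str, action: str, to_fp: str) -> None:
        self.edges.append((from_fp, action, to_fp))

    def coverage(self) -> float:
        """Fraction of all known interactive elements that have been tested."""
        total = sum(len(s.elements) for s in self.screens.values())
        done = sum(len(s.tested) for s in self.screens.values())
        return done / total if total else 0.0
```

A coverage report is then just a walk over `screens` and `edges`, which is why screenshots and video markers fall out of the exploration as a side effect.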

What a human can do now that they couldn't before: run exhaustive exploratory testing across multiple applications simultaneously, unattended. A human can click through an app and take notes, but they can't do it across ten apps in parallel, they can't do it for eight hours without fatigue, and they don't produce a structured coverage report with screenshots and video as a side effect of the work itself. A single developer can kick off a 30-minute run for a quick surface check or an 8-hour run overnight for deep coverage, across as many apps as they want, and come back to full reports in the morning. This matters most for applications that resist conventional testing: canvas renderers, emulators, and third-party UIs where you have no access to the DOM.

What the AI is responsible for: every navigation decision during exploration. The agent decides what to tap next, identifies when it has reached a new screen by fingerprinting element layouts, parses screen content using OmniParser to annotate what it sees, and proposes test cases based on observed behavior. It also detects anomalies like slow loads, error messages, and unexpected states by comparing before-and-after screenshots of each action.
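Fingerprinting element layouts, as described above, can be done by quantizing element bounding boxes to a coarse grid and hashing the result, which makes the fingerprint robust to small pixel jitter between screenshots. This is a minimal sketch under assumed inputs (normalized boxes from the visual parser), not Pupil's actual implementation:

```python
import hashlib


def screen_fingerprint(elements: list[tuple[float, float, float, float]],
                       grid: int = 24) -> str:
    """Fingerprint a screen by the coarse layout of its interactive elements.

    `elements` is a list of (x, y, w, h) boxes normalized to [0, 1].
    Quantizing each coordinate to a `grid`-cell resolution means two
    screenshots of the same screen hash identically despite minor jitter.
    """
    cells = sorted(
        (round(x * grid), round(y * grid), round(w * grid), round(h * grid))
        for x, y, w, h in elements
    )
    digest = hashlib.sha256(repr(cells).encode()).hexdigest()
    return digest[:16]  # short id is enough to key a navigation graph
```

Anomaly detection can then reuse the same machinery in reverse: if an action's before-and-after screenshots fingerprint identically when a transition was expected, or differ when none was, the action is flagged for the report.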

Where the AI must stop: it cannot judge correctness or usefulness. It can tell you a screen exists and document every element on it, but not always whether that screen matters or whether an error message is expected. Proposed tests are drafts. The critical decision that must remain human is deciding which tests are worth keeping and which assertions are accurate, because that requires understanding what the product is for, what users care about, and which flows are business-critical; the agent sees structure, not meaning.

What would break first at scale: state-space explosion. In more complex apps, navigation paths branch beyond what the agent can visit in any reasonable time. Pupil currently uses a greedy strategy (prioritize untested elements, prefer transitions to new screens). Beyond a certain complexity, you would need goal-directed exploration or human-guided scope narrowing to keep runs tractable. The second bottleneck is visual parsing reliability in dense or exotic UIs, where overlapping elements produce noisy annotations that degrade screen fingerprinting accuracy.
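The greedy strategy above (prioritize untested elements, prefer transitions to new screens) reduces to a two-key scoring rule. The sketch below is one way to express it; the argument names and tie-breaking are illustrative assumptions, not Pupil's code:

```python
def pick_next_action(elements: list[str],
                     tested: set[str],
                     destinations: dict[str, str],
                     visited: set[str]) -> str:
    """Greedy exploration policy: pick the most promising element to tap.

    First key:  prefer elements never tapped before.
    Second key: prefer elements whose known destination screen is
                not yet in the visited set (unknown destinations count
                as potentially new).
    """
    def score(el: str) -> tuple[bool, bool]:
        untested = el not in tested
        leads_to_new = destinations.get(el) not in visited
        return (untested, leads_to_new)

    return max(elements, key=score)
```

The scale limit follows directly from this shape: the policy only looks one step ahead, so once branching outgrows the time budget, a goal-directed planner or a human-supplied scope (e.g. "only explore settings flows") has to prune the frontier instead.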