Android Recording Format for Agents
This page describes the current Android recording output that Clawperator can produce on device.
Use it for:
- understanding what a raw recording contains
- deciding how an agent should consume a pulled recording
- knowing which fields are reliable and which are best-effort
- understanding current proof-of-concept limits before building replay or skill-authoring workflows on top
This page intentionally focuses on the recording data itself. It does not document a public host-side retrieval API because that surface is still in progress.
Recordings are started and stopped by dispatching start_recording and
stop_recording action types through clawperator execute.
Current status
The current proof of concept can record usable interaction traces for non-system-navigation flows.
What is working today:
- window_change
- click
- scroll
- text_change
- synchronous snapshot capture for step-candidate accessibility events
What is intentionally deferred:
- normalized Back / Home / Recents semantics
- reliable press_key with key: "back" capture as a required contract
- any parser behavior that depends on system-gesture inference
System navigation evidence may still appear in raw recordings, but agents should not currently assume those events are normalized or portable across devices.
Recording file shape
Android recordings are newline-delimited JSON.
- line 1 is always a recording_header
- all following lines are individual event records
- seq is the authoritative ordering field
- ts is wall-clock event time in epoch milliseconds
Example:
{"type":"recording_header","schemaVersion":1,"sessionId":"demo-001","startedAt":1710000000000,"operatorPackage":"com.clawperator.operator.dev"}
{"ts":1710000000000,"seq":0,"type":"window_change","packageName":"com.android.settings","className":"com.android.settings.Settings","title":"Settings","snapshot":"<hierarchy .../>"}
{"ts":1710000000800,"seq":1,"type":"click","packageName":"com.android.settings","resourceId":"com.android.settings:id/dashboard_tile","text":"Display","contentDesc":null,"bounds":{"left":0,"top":400,"right":1080,"bottom":560},"snapshot":"<hierarchy .../>"}
{"ts":1710000002100,"seq":2,"type":"scroll","packageName":"com.android.settings","resourceId":null,"scrollX":0,"scrollY":420,"maxScrollX":0,"maxScrollY":2800,"snapshot":null}
{"ts":1710000002600,"seq":3,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linkin park waiting for the end emily","snapshot":null}
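The file shape above can be read with a few lines of code. The following is a minimal sketch, not part of Clawperator: the function name parse_recording is hypothetical, and it assumes the recording has already been pulled to the host as text by whatever means is available.

```python
import json


def parse_recording(text):
    """Split NDJSON recording text into (header, events sorted by seq)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty recording")
    header = json.loads(lines[0])
    if header.get("type") != "recording_header":
        raise ValueError("line 1 is not a recording_header")
    events = [json.loads(ln) for ln in lines[1:]]
    # seq is the authoritative ordering field; ts is only wall-clock time.
    events.sort(key=lambda e: e["seq"])
    return header, events


# Sample lines trimmed from the example above, deliberately out of order.
sample = "\n".join([
    '{"type":"recording_header","schemaVersion":1,"sessionId":"demo-001",'
    '"startedAt":1710000000000,"operatorPackage":"com.clawperator.operator.dev"}',
    '{"ts":1710000000800,"seq":1,"type":"click","packageName":"com.android.settings",'
    '"text":"Display","snapshot":"<hierarchy .../>"}',
    '{"ts":1710000000000,"seq":0,"type":"window_change","packageName":"com.android.settings",'
    '"title":"Settings","snapshot":"<hierarchy .../>"}',
])
header, events = parse_recording(sample)
print(header["sessionId"], [e["type"] for e in events])
```

Sorting on seq rather than ts keeps the order stable even when two events share a timestamp.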
Header record
The first line identifies the file and schema.
Fields:
- type: always recording_header
- schemaVersion: file schema version
- sessionId: session identifier chosen or generated when recording started
- startedAt: epoch milliseconds for session start
- operatorPackage: package that produced the recording
Agents should use the header to:
- reject unsupported schema versions
- correlate the file to a session identifier
- understand whether the file came from the debug or release operator variant
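A header check along these lines might look like the sketch below. The names check_header and SUPPORTED_SCHEMA_VERSIONS are hypothetical, and treating a ".dev" package suffix as the debug variant is an assumption based only on the example header on this page.

```python
# Assumption: this consumer only understands schema version 1.
SUPPORTED_SCHEMA_VERSIONS = {1}


def check_header(header):
    """Validate a parsed header dict and return the fields an agent needs."""
    if header.get("type") != "recording_header":
        raise ValueError("not a recording header")
    version = header.get("schemaVersion")
    if version not in SUPPORTED_SCHEMA_VERSIONS:
        raise ValueError(f"unsupported schemaVersion: {version!r}")
    package = header.get("operatorPackage") or ""
    return {
        "session_id": header.get("sessionId"),
        # Assumption: debug builds use a ".dev" package suffix.
        "is_debug_build": package.endswith(".dev"),
    }


info = check_header({
    "type": "recording_header",
    "schemaVersion": 1,
    "sessionId": "demo-001",
    "startedAt": 1710000000000,
    "operatorPackage": "com.clawperator.operator.dev",
})
print(info)
```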
Event types
The current recording runtime emits these event categories.
window_change
Represents screen or window transitions.
Common fields:
- packageName
- className
- title
- snapshot
Use this event when an agent needs to understand:
- which app or screen came to the foreground
- what transition happened before a later click or text entry
click
Represents a tap-like activation on a UI element.
Common fields:
- packageName
- resourceId
- text
- contentDesc
- bounds
- snapshot
This is usually the most useful action record for replay-style reasoning.
scroll
Represents a view scroll event.
Common fields:
- packageName
- resourceId
- scrollX
- scrollY
- maxScrollX
- maxScrollY
- snapshot: null
Scroll is captured because it is meaningful context, but it is intentionally cheap. It does not carry a tree snapshot.
text_change
Represents text changing inside an editable control.
Common fields:
- packageName
- resourceId
- text
- snapshot: null
Important: the emitted text value is the whole visible text state after that
change, not just the last character.
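Because each text_change carries the full text state, an agent usually only needs the last value of a typing burst. The sketch below shows one way to collapse such bursts. The helper name final_text_per_run is hypothetical, and treating any unbroken sequence of text_change events as one burst is a simplification, since resourceId may be null and cannot reliably separate fields.

```python
def final_text_per_run(events):
    """Return the last text value of each unbroken run of text_change events."""
    results = []
    in_run = False
    for event in events:
        if event.get("type") == "text_change":
            if in_run:
                # A later full-text state replaces the earlier one.
                results[-1] = event.get("text")
            else:
                results.append(event.get("text"))
                in_run = True
        else:
            # Any other event type ends the current run.
            in_run = False
    return results


typed = final_text_per_run([
    {"seq": 0, "type": "text_change", "text": "l"},
    {"seq": 1, "type": "text_change", "text": "li"},
    {"seq": 2, "type": "text_change", "text": "lin"},
    {"seq": 3, "type": "click", "text": "Edit suggestion"},
    {"seq": 4, "type": "text_change", "text": "linkin park e"},
    {"seq": 5, "type": "text_change", "text": "linkin park emily"},
])
print(typed)
```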
Snapshot semantics
Snapshots are present on step-candidate accessibility events:
- window_change
- click
Snapshots are intentionally null on high-rate events:
- scroll
- text_change
The current implementation uses synchronous capture on the accessibility event thread for step-candidate events. That improves snapshot correctness relative to an async design, but it does not make snapshots perfect.
Agents should treat snapshot as:
- best-effort context about what was visible around the interaction
- useful for understanding the target element and nearby UI
- not guaranteed to be an exact pre-interaction ground truth frame
The authoritative basis for a new action is still the live snapshot from the current device state.
How agents should use recordings
The right mental model is:
- recordings are context, not executable scripts
- raw event fields are hints, not guaranteed selectors
- live device state still decides what to do next
Recommended usage pattern:
- Read the recording in seq order.
- Use window_change and click events to understand the intended path.
- Treat snapshot as supporting context for what the user likely saw.
- Derive the actual next selector from the current device snapshot, not from recorded fields alone.
- Use scroll and text_change as behavioral context rather than assuming they are directly replayable in a one-to-one way.
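One way to apply this pattern is to reduce a recording to a short outline before handing it to an agent. The following sketch is illustrative only: summarize_path and its label fallbacks are hypothetical choices, and the outline is meant as context, not as something to replay.

```python
def summarize_path(events):
    """Outline window_change and click events as steps; count the rest as context."""
    outline = []
    context = {"scroll": 0, "text_change": 0}
    for event in sorted(events, key=lambda e: e["seq"]):
        kind = event.get("type")
        if kind == "window_change":
            # title can be missing, so fall back to less readable fields.
            label = event.get("title") or event.get("className") or event.get("packageName")
            outline.append(f"screen: {label}")
        elif kind == "click":
            label = (event.get("text") or event.get("contentDesc")
                     or event.get("resourceId") or "(unlabeled)")
            outline.append(f"tap: {label}")
        elif kind in context:
            # scroll and text_change are kept as behavioral context only.
            context[kind] += 1
    return outline, context


outline, context = summarize_path([
    {"seq": 1, "type": "click", "text": "Display", "contentDesc": None},
    {"seq": 0, "type": "window_change", "title": "Settings",
     "className": "com.android.settings.Settings"},
    {"seq": 2, "type": "scroll", "scrollY": 420},
    {"seq": 3, "type": "text_change", "text": "linkin park"},
])
print(outline, context)
```

The selector for each step would still be derived from a live snapshot at execution time.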
Field reliability guidance
Some fields are much more stable than others.
Usually helpful:
- seq
- type
- packageName
- snapshot when present
- text on visible user-facing controls
- bounds as spatial context
Frequently missing or unstable:
- resourceId on many third-party apps
- contentDesc on many controls
- title on some window transitions
- resourceId on text_change
Agents should prefer:
- live current snapshot
- package and nearby UI context
- visible text and content description
- resource IDs when they exist and appear stable
- recorded bounds only as supporting evidence
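The recorded fields of a click can be turned into an ordered list of hints to check against the live snapshot. This is a sketch under assumptions: selector_hints is a hypothetical helper, and it reads the list above as a rough priority order, which this page does not state explicitly.

```python
def selector_hints(click_event):
    """List recorded hints for a click, strongest first. All need live re-verification."""
    hints = []
    for field in ("text", "contentDesc", "resourceId"):
        value = click_event.get(field)
        if value:  # skips None and empty strings
            hints.append((field, value))
    bounds = click_event.get("bounds")
    if bounds:
        # Bounds go last: supporting evidence only.
        hints.append(("bounds", bounds))
    return hints


hints = selector_hints({
    "seq": 1,
    "type": "click",
    "packageName": "com.android.settings",
    "resourceId": "com.android.settings:id/dashboard_tile",
    "text": "Display",
    "contentDesc": None,
    "bounds": {"left": 0, "top": 400, "right": 1080, "bottom": 560},
})
print([name for name, _ in hints])
```

An empty result means the recorded event alone cannot suggest a target, and the agent has to rely entirely on the live snapshot and nearby UI context.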
Real text-input example
On a physical Samsung device, a manual YouTube search recording captured a useful text-input sequence:
- manual character entry emitted repeated text_change events
- autocomplete selection emitted a normal click
- more manual typing after autocomplete emitted more text_change events
Excerpt:
{"seq":17,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"l","snapshot":null}
{"seq":18,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"li","snapshot":null}
{"seq":19,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"lin","snapshot":null}
{"seq":28,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linking park","snapshot":null}
{"seq":37,"type":"click","packageName":"com.google.android.youtube","resourceId":"com.google.android.youtube:id/edit_suggestion","text":"Edit suggestion linkin park waiting for the end","contentDesc":"Edit suggestion linkin park waiting for the end","snapshot":"<hierarchy .../>"}
{"seq":38,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linkin park waiting for the end e","snapshot":null}
{"seq":42,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linkin park waiting for the end emily","snapshot":null}
What this means for agent design:
- text entry is more tractable than system-navigation normalization in the current proof of concept
- autocomplete is visible as a click target, not a hidden IME-only effect
- text_change may have resourceId: null, so surrounding UI context matters more than the event alone
Current gotchas
System navigation is not normalized yet
Back / Home / Recents behavior is device- and navigation-mode-dependent in current recordings.
Do not assume:
- a Back action will appear as press_key
- a Home gesture has one universal raw shape
- system UI events are portable across Samsung, emulator, and future devices
text_change may be underspecified
Text changes can be useful even when they lack resourceId, but that means
the event alone may not identify the correct field to target later.
Scroll and text events do not carry snapshots
That is intentional. High-rate event categories stay cheap by omitting tree capture.
Recordings are lossy
They preserve identifiable interaction moments, not full human intent or every intermediate visual state.
Agents should expect:
- missing timing nuance
- missing IME-specific details
- occasional nulls
- OEM-specific variation
What recordings are good for right now
The current recording format is already useful for:
- understanding a user-demonstrated navigation path
- turning a demonstrated flow into agent context
- grounding replay attempts in actual observed UI state
- bootstrapping skill authoring for stable non-system-navigation flows
It is not yet the right contract to assume for:
- normalized system gesture replay
- device-portable Back / Home / Recents semantics
- one-to-one deterministic playback with no live state verification