Android Recording Format for Agents

This page describes the current Android recording output that Clawperator can produce on device.

Use it for:

understanding what a raw recording contains
deciding how an agent should consume a pulled recording
knowing which fields are reliable and which are best-effort
understanding current proof-of-concept limits before building replay or skill-authoring workflows on top

This page intentionally focuses on the recording data itself. It does not document a public host-side retrieval API because that surface is still in progress.

Recordings are started and stopped by dispatching start_recording and stop_recording action types through clawperator execute.

Current status

The current proof of concept can record usable interaction traces for non-system-navigation flows.

What is working today:

window_change
click
scroll
text_change
synchronous snapshot capture for step-candidate accessibility events

What is intentionally deferred:

normalized Back / Home / Recents semantics
reliable press_key key: "back" capture as a required contract
any parser behavior that depends on system-gesture inference

System navigation evidence may still appear in raw recordings, but agents should not currently assume those events are normalized or portable across devices.

Recording file shape

Android recordings are newline-delimited JSON.

line 1 is always a recording_header
all following lines are individual event records
seq is the authoritative ordering field
ts is wall-clock event time in epoch milliseconds

Example:

{"type":"recording_header","schemaVersion":1,"sessionId":"demo-001","startedAt":1710000000000,"operatorPackage":"com.clawperator.operator.dev"}
{"ts":1710000000000,"seq":0,"type":"window_change","packageName":"com.android.settings","className":"com.android.settings.Settings","title":"Settings","snapshot":"<hierarchy .../>"}
{"ts":1710000000800,"seq":1,"type":"click","packageName":"com.android.settings","resourceId":"com.android.settings:id/dashboard_tile","text":"Display","contentDesc":null,"bounds":{"left":0,"top":400,"right":1080,"bottom":560},"snapshot":"<hierarchy .../>"}
{"ts":1710000002100,"seq":2,"type":"scroll","packageName":"com.android.settings","resourceId":null,"scrollX":0,"scrollY":420,"maxScrollX":0,"maxScrollY":2800,"snapshot":null}
{"ts":1710000002600,"seq":3,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linkin park waiting for the end emily","snapshot":null}

Header record

The first line identifies the file and schema.

Fields:

type: always recording_header
schemaVersion: file schema version
sessionId: session identifier chosen or generated when recording started
startedAt: epoch milliseconds for session start
operatorPackage: package that produced the recording

Agents should use the header to:

reject unsupported schema versions
correlate the file to a session identifier
understand whether the file came from the debug or release operator variant

Event types

The current recording runtime emits these event categories.

`window_change`

Represents screen or window transitions.

Common fields:

packageName
className
title
snapshot

Use this event when an agent needs to understand:

which app or screen came to the foreground
what transition happened before a later click or text entry

`click`

Represents a tap-like activation on a UI element.

Common fields:

packageName
resourceId
text
contentDesc
bounds
snapshot

This is usually the most useful action record for replay-style reasoning.

`scroll`

Represents a view scroll event.

Common fields:

packageName
resourceId
scrollX
scrollY
maxScrollX
maxScrollY
snapshot: null

Scroll is captured because it is meaningful context, but it is intentionally cheap. It does not carry a tree snapshot.

`text_change`

Represents text changing inside an editable control.

Common fields:

packageName
resourceId
text
snapshot: null

Important: the emitted text value is the whole visible text state after that change, not just the last character.

Snapshot semantics

Snapshots are present on step-candidate accessibility events:

window_change
click

Snapshots are intentionally null on high-rate events:

scroll
text_change

The current implementation uses synchronous capture on the accessibility event thread for step-candidate events. That improves snapshot correctness relative to an async design, but it does not make snapshots perfect.

Agents should treat snapshot as:

best-effort context about what was visible around the interaction
useful for understanding the target element and nearby UI
not guaranteed to be an exact pre-interaction ground truth frame

The authoritative basis for a new action is still the live snapshot from the current device state.

How agents should use recordings

The right mental model is:

recordings are context, not executable scripts
raw event fields are hints, not guaranteed selectors
live device state still decides what to do next

Recommended usage pattern:

Read the recording in seq order.
Use window_change and click events to understand the intended path.
Treat snapshot as supporting context for what the user likely saw.
Derive the actual next selector from the current device snapshot, not from recorded fields alone.
Use scroll and text_change as behavioral context rather than assuming they are directly replayable in a one-to-one way.

Field reliability guidance

Some fields are much more stable than others.

Usually helpful:

seq
type
packageName
snapshot when present
text on visible user-facing controls
bounds as spatial context

Frequently missing or unstable:

resourceId on many third-party apps
contentDesc on many controls
title on some window transitions
resourceId on text_change

Agents should prefer:

live current snapshot
package and nearby UI context
visible text and content description
resource IDs when they exist and appear stable
recorded bounds only as supporting evidence

Real text-input example

On a physical Samsung device, a manual YouTube search recording captured a useful text-input sequence:

manual character entry emitted repeated text_change events
autocomplete selection emitted a normal click
more manual typing after autocomplete emitted more text_change events

Excerpt:

{"seq":17,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"l","snapshot":null}
{"seq":18,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"li","snapshot":null}
{"seq":19,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"lin","snapshot":null}
{"seq":28,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linking park","snapshot":null}
{"seq":37,"type":"click","packageName":"com.google.android.youtube","resourceId":"com.google.android.youtube:id/edit_suggestion","text":"Edit suggestion linkin park waiting for the end","contentDesc":"Edit suggestion linkin park waiting for the end","snapshot":"<hierarchy .../>"}
{"seq":38,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linkin park waiting for the end e","snapshot":null}
{"seq":42,"type":"text_change","packageName":"com.google.android.youtube","resourceId":null,"text":"linkin park waiting for the end emily","snapshot":null}

What this means for agent design:

text entry is more tractable than system-navigation normalization in the current proof of concept
autocomplete is visible as a click target, not a hidden IME-only effect
text_change may have resourceId: null, so surrounding UI context matters more than the event alone

Current gotchas

Back / Home / Recents behavior is device- and navigation-mode-dependent in current recordings.

Do not assume:

a Back action will appear as press_key
a Home gesture has one universal raw shape
system UI events are portable across Samsung, emulator, and future devices

`text_change` may be underspecified

Text changes can be useful even when they lack resourceId, but that means the event alone may not identify the correct field to target later.

Scroll and text events do not carry snapshots

That is intentional. High-rate event categories stay cheap by omitting tree capture.

Recordings are lossy

They preserve identifiable interaction moments, not full human intent or every intermediate visual state.

Agents should expect:

missing timing nuance
missing IME-specific details
occasional nulls
OEM-specific variation

What recordings are good for right now

The current recording format is already useful for:

understanding a user-demonstrated navigation path
turning a demonstrated flow into agent context
grounding replay attempts in actual observed UI state
bootstrapping skill authoring for stable non-system-navigation flows

It is not yet the right contract to assume for:

normalized system gesture replay
device-portable Back / Home / Recents semantics
one-to-one deterministic playback with no live state verification

Android Recording Format for Agents

Current status

Recording file shape

Header record

Event types

window_change

click

scroll

text_change

Snapshot semantics

How agents should use recordings

Field reliability guidance

Real text-input example

Current gotchas

System navigation is not normalized yet

text_change may be underspecified

Scroll and text events do not carry snapshots

Recordings are lossy

What recordings are good for right now

Related pages

`window_change`

`click`

`scroll`

`text_change`

`text_change` may be underspecified