Short answer
Browser agent reliability is not a vibe check; it is four numbers you can audit every hour: session startup success and p95 latency, first action success, completion rate without retries, and how often humans or missing evidence stall the run. Steel’s remote benchmark hits 0.89 s average startup, 1.09 s p95, and 1.34 s p99 across 5,000 runs with 100 percent success; that is 1.70×–8.95× faster than the other providers we tested, so treat 1.2 s p95 and 99.5 percent success as the first red lines. When a metric breaks, check which plan tier hit it: Steel Local is effectively a single-session control plane while Steel Cloud Starter, Pro, and Enterprise tiers stretch into the tens or hundreds of concurrent sessions, so annotate every scoreboard row with the plan and region before trusting an average.
Instead of counting how many jobs eventually finished, tag each Steel session with your queue ID, emit stage timers from sessions.create through release, and record whether each failure had a replay, logs, and an operator intervention. Publish those metrics with plan-tier utilization so alerts fire before startup p95 drifts past benchmark values; the loop shows whether you should fix proxies, selectors, profiles, or human approval policies before touching model prompts.
What to measure
Steel benchmark baselines
| Signal | Steel baseline | Why it matters |
|---|---|---|
| Startup success & p95 | 100% success, 0.89 s average, 1.09 s p95, 1.34 s p99 across 5,000 runs. | Treat 1.2 s p95 as your red line; exceeding it means the control plane is starving queues before automation even touches the DOM. |
| Control-plane tax | 229 ms total for create + release (25.6% of lifecycle). | When this number doubles, retries and warm pools burn your concurrency budget instead of shipping work. |
First action (connect + first goto) | 665 ms average for connect plus the first DOM-ready navigation. | Spikes here mean anti-bot controls or layout drift is choking agents before any useful work happens. |
These baselines are the default scoreboard targets until you log fresher numbers in your own environment.
Reliability metrics that matter
| Metric | Why it matters | How to calculate | Steel signal |
|---|---|---|---|
| Startup success & p95 | Control plane health determines whether you ever reach the page; queue cascades begin when creations fail or p95 crosses 1.2 s. Steel’s benchmark baseline is 0.89 s avg, 1.09 s p95 with 0 failures over 5,000 runs. | successful_creates / total_creates plus percentile on (session_created_at → first_connect_at) per region and plan. | Sessions API create/release telemetry, Pricing/Limits caps, remote-browser-benchmark reference runs. |
| First action success | Anti-bot walls and selector drift usually appear on the first DOM mutation, not the final output. Catching it here prevents wasted browser hours. | Count navigation or page.goto completions on the first action per session, split by proxy, UA, and mobile mode flags. | agentLogs timeline + Playwright event hooks + sessionContext metadata. |
| Workflow completion vs retries | Reliability is “done on first run”; retries mask failure costs and double proxy, CAPTCHA, and human review spend. | completed_without_retry / total_runs and avg_retries_per_success, segmented by workflow. | Job queue + session.id + releaseAll to ensure cleanup before re-queueing. |
| Manual intervention rate | Every approval drains latency budgets and exposes UX issues. Keep the rate low by diagnosing root causes early. | (human_interventions / total_runs) * 100, recorded via embed approvals or ops dashboards. | debugUrl embed callbacks + approval logs + operator queue metadata. |
| Plan-tier saturation | Steel Local is effectively one concurrent session while Steel Cloud Starter/Pro stretch into the tens or hundreds; once a tier hits 90 percent of its concurrency or RPS cap, startup p95 will spike before success drops. | Track (active_sessions / plan_cap) and (requests_per_second / plan_cap) per plan tier; alert when either crosses 0.9 for 5 minutes. | Pricing/Limits table for caps + Sessions API list to count active sessions; annotate scoreboard rows with plan tiers. |
| Evidence coverage | If a failure lacks replay/logs/files, it cannot improve reliability. Evidence coverage should be 100 percent for failed runs. | failures_with_replay / total_failures plus storage lag for logs/files exports. | HLS replay URLs, MP4 exports, Files API attachments, agentLogs pagination. |
Segment every metric
A blended reliability number hides which region, plan tier, or proxy pool poisoned the curve. Slice the scoreboard before you trust it.
| Dimension | Why it matters | Implementation hook |
|---|---|---|
| Region & data center | ISPs throttle differently; a noisy geography can sink your p95 without impacting other customers. | Tag sessions with region on create, then chart startup and first-action metrics per region before aggregating. |
| Plan tier & warm pool size | Steel Local (~1 concurrent session) vs Steel Cloud (hundreds), plus Starter vs Pro caps, control how many sessions can boot before the queue saturates. | Join session metrics with plan metadata from Pricing/Limits and alert when any tier crosses 90% of its concurrency or RPS ceiling. |
| Proxy or fingerprint class | Headless vs headful, mobile mode vs desktop, or residential vs datacenter proxies fail in different ways. | Store proxyPool, fingerprintMode, and mobileMode flags on each session and pivot success rates accordingly. |
| Workflow or domain | Selectors and approvals are domain-specific; hiding them inside an average keeps flaky flows invisible. | Derive a workflow tag from the job queue and publish per-domain slices in the same dashboard. |
| Human intervention reason | Approvals are not free; trending “KYC stuck” vs “OTP missing” guides fixes faster than knowing the blended rate. | Require structured reason codes in your approval UI and count them alongside the intervention metric. |
Plan-tier guardrails (from Pricing/Limits)
Use the current Pricing/Limits table to annotate each slice with the right ceiling so alerts trigger before a plan starves the queue. These are the published caps today; if they change, update the scoreboard in lockstep.
| Plan tier | Concurrent sessions cap | Requests per second cap | Reliability response |
|---|---|---|---|
| Hobby ($0) | 5 | 1 | Treat it as a single-developer sandbox; if startup success or p95 drift, move production jobs off this tier immediately. |
| Starter ($29) | 10 | 2 | Alert as soon as a workflow holds ≥9 sessions or 90% of RPS for 5 minutes; pre-warm Steel Cloud pools or stage upgrades before approvals pile up. |
| Developer ($99) | 20 | 5 | Slice metrics per region to avoid burning the cap on one geography, and sweep stuck sessions with releaseAll when utilization sticks above 90 percent. |
| Pro ($499) | 100 | 10 | Publish per-region dashboards; most regressions here come from evidence export lag or manual approvals blocking releases rather than raw concurrency. |
| Enterprise | Custom | Custom | Mirror the contracted ceiling and stop trusting “unlimited” without on-call automation; negotiate burst headroom so alerts stay meaningful. |
Why completion rates lie
A green “workflow succeeded” light usually hides the 30 seconds you spent waiting for a retry, the three CAPTCHA solves it consumed, or the operator who had to click Continue inside a debugUrl. Reliability falls apart when you only measure the last stage:
- Queues can drain minutes before a session ever boots when concurrency caps (5, 10, 20, 100) are saturated.
- Anti-bot blocks often kill the first navigation, so counting final output means you miss the step where you should rotate proxies or switch to headful mode.
- Retries reset session IDs, which nukes the evidence trail unless you store it per attempt.
Measure every stage. If startup is healthy but first actions fail, you need better fingerprints. If first actions pass but completion tanks, explore selectors, credentials, or human approvals. If completion looks perfect but evidence coverage drops below 100 percent, you are flying blind during postmortems.
Instrument the loop
- Tag sessions with job context. Store
session.id, workflow name, proxy mode, and profile ID alongside your queue metadata so metrics can pivot on any dimension. - Capture stage timers. Record
(create → connect),(connect → first action),(first action → completion), and(completion → release)durations; export them to Prometheus or whatever time-series system you already trust. - Log agent actions. Call
/v1/sessions/{id}/agent-logs(or stream live logs) to align model steps, CDP commands, and human inputs with timestamps. - Store evidence immediately. Save HLS URLs, MP4 downloads, screenshots, and scraped files using the Files API so every failure has replayable proof even after the 14-day retention window.
- Track interventions as events. Your approval UIs should emit structured telemetry when someone takes control via
debugUrlor a live embed, including reason codes and elapsed time. - Automate cleanup. Use
releaseandreleaseAllafter every run, and record whether cleanup succeeded; stuck sessions distort both cost and perceived reliability. - Segment by region and plan. Break out success, latency, and intervention metrics by region, proxy type, and plan tier so a noisy geography or price plan cannot hide systemic regressions.
- Publish a live scoreboard. Pipe these numbers into a shared dashboard (Grafana, Metabase, spreadsheets), pin the target + alert threshold for each metric, and assign a single owner so reliability drift is visible without digging through traces.
- Wire plan caps to alerts. Pull concurrency and RPS ceilings straight from Pricing/Limits and annotate every scoreboard row with the plan tier; alert early when startup p95 creeps upward while a plan is within 10 percent of saturation so ops can flip workloads from Steel Local (~1 concurrent session) to Steel Cloud (100+), or between Starter/Pro tiers, before prompts waste browser slots.
Stage instrumentation cheat sheet
| Stage | Fields to capture | Metric unlocked |
|---|---|---|
| Create → connect | session.id, queue/job ID, region, plan tier, proxy pool, timestamps for sessions.create and first CDP connect | Startup success %, p95 latency, plan saturation alerts |
| First connect → first action | agentLogs event ID, action type (goto, selector click), DOM target, fingerprint mode (headful/mobile), proxy label | First-action success %, domain-by-domain anti-bot diagnosis |
| First action → completion | Workflow name, profile ID, credentials namespace, retry counter, navigation + form checkpoints | Single-pass completion %, retry rate, flaky selectors or auth drift |
| Completion → release | Cleanup duration, release status, releaseAll result, stuck-session flag | Control-plane tax %, stranded session detection, concurrency reclaim rate |
| Failure evidence | Replay URL, MP4 export path, Files API payload IDs, log export status | Evidence coverage %, audit readiness, incident reproducibility |
| Manual intervention | Approver ID, reason code, elapsed review time, embed/debugUrl path, resumed session ID | Intervention rate, approval backlog trends, human-time cost per workflow |
Session tag dictionary
Standardize the tags you emit on every session so slicing never turns into regex archaeology later.
| Tag | Purpose | Example values |
|---|---|---|
jobId | Join queue records, retries, and metrics without guessing which run maps to which artifact. | claims-sync-2026-03-31-00012 |
workflow | Segment reliability per business flow or domain and keep alert routing clean. | kyc-verify, billing-export, claims-sync |
region | Compare startup and first-action health per data center before trusting blended numbers. | us-east-1, eu-west-2, ap-south-1 |
planTier | Watch for starter/pro plan saturation and know when to upgrade or shard workloads. | starter, pro, enterprise |
proxyPool | Spot when one proxy vendor or residency class tanks first-action success. | steel-resi, byo-dc, headful-lithuania |
fingerprintMode | Track headless/headful/mobile toggles so anti-bot fixes have evidence. | headful, stealth, mobile |
profileId | Tie completion and retry metrics to the underlying auth state. | merchant-portal-prod, issuer-sandbox |
mobileMode | Confirm when mobile DOMs were requested so DOM selectors and evidence stay aligned. | true, false |
approvalReason | Trend manual interventions by root cause rather than by gut feel. | otp-missing, kyc-review, compliance |
evidenceStatus | Ensure replay/log exports completed; alert when anything ships without proof. | complete, missing-replay, missing-logs |
Retire fake reliability proxies
Vanity dashboards hide the root cause. Kill these summary metrics so the scoreboard reflects reality.
| Proxy signal | Why it lies | What to track instead |
|---|---|---|
| Jobs completed per hour | Retries, manual approvals, and stuck cleanup still count as “success,” so you congratulate a workflow that burned double the time and proxies. | Startup success, single-pass completion, and manual intervention rate per workflow and region. |
| Queue depth or wait time | A long queue tells you demand beats supply, not why sessions fail to boot. | Startup p95 per region/plan plus first-action success segmented by proxy pool. |
| CPU or RAM usage | Host metrics improve even when workflows die on OTP or CAPTCHA steps; ops fix VMs instead of selectors. | Evidence coverage, approval reason codes, and retry rate; these point straight to automation bugs. |
| Prompt/response count | LLM turn counts fluctuate with copy tweaks but say nothing about browser survival. | Completion latency per attempt and retries per success so you see wasted browser hours. |
Manual intervention event schema
Audit every approval the same way you log incidents: structured fields keep reviews auditable and help you replay what happened inside debugUrl or live embeds.
| Field | Why it matters | Example |
|---|---|---|
approverId | Tie the decision back to a human so you can audit access and follow up on outliers. | ops-lead-07 |
reasonCode | Buckets interventions into actionable categories instead of Slack anecdotes. | otp-missing, kyc-handoff, layout-shift |
sessionId | Keeps the approval linked to the evidence, retries, and cleanup events. | sess_a12bc3 |
entryTimestamp | Shows how long the run waited before someone took control. | 2026-03-31T12:04:22Z |
elapsedMs | Helps quantify the human-time tax per workflow and set staffing thresholds. | 18500 |
resumeAction | Tells on-call whether the same session resumed, restarted, or was killed. | resume_same_session |
embedPath | Records whether the takeover came from debugUrl, an iframe, or a custom review UI. | embed/live/claims |
Evidence artifact log schema
Treat replay, log, and file exports as first-class records; missing any of them should block your reliability publish just like a failed metric.
| Artifact field | Purpose | Example |
|---|---|---|
replayUrl | Links reviewers to the HLS playlist or MP4 so they can see the failure without rerunning it. | https://hls.steel.dev/.../master.m3u8 |
mp4Path | Guarantees you can keep the proof past Steel’s retention window by mirroring to your storage. | s3://agents/runs/claims-sync-12.mp4 |
logsExportId | Confirms agent logs finished exporting; without it you cannot align model actions to DOM state. | log_exp_92ff |
filesAssetIds | Tracks downloaded spreadsheets, PDFs, or screenshots tied to the run. | file_123, file_124 |
status | Shows whether the evidence bundle is complete, partial, or missing to trigger alerts. | complete |
expiresAt | Forces teams to pull artifacts before the 14-day default window closes. | 2026-04-14T00:00:00Z |
Example: tag and emit metrics in code
import { SteelClient } from "@steeldev/sdk";
import { chromium } from "playwright";
const metricsTags = { workflow, region, plan, proxyPool };
const createdAt = Date.now();
const client = new SteelClient({ apiKey: process.env.STEEL_API_KEY });
const session = await client.sessions.create({ tags: { jobId, workflow } });
const connectStart = Date.now();
const browser = await chromium.connectOverCDP(session.cdpUrl);
metrics.observe("session_startup_ms", connectStart - createdAt, metricsTags);
const page = await browser.newPage();
await page.goto(firstUrl, { waitUntil: "domcontentloaded" });
metrics.observe("first_action_ms", Date.now() - connectStart, metricsTags);
try {
await runWorkflow(page);
metrics.increment("workflow_completion", { ...metricsTags, result: "success" });
} catch (error) {
metrics.increment("workflow_completion", { ...metricsTags, result: "failure" });
await uploadReplayEvidence(session.id); // Files API helper keeps evidence coverage at 100%.
throw error;
} finally {
await client.sessions.release(session.id);
}
Tag the session the moment it exists, emit each stage duration as soon as it finishes, and treat replay uploads plus releases as part of the same function. This keeps every metric synchronized with artifacts and prevents retries from erasing the evidence trail.
Scoreboard payload example
Ship one normalized record per workflow slice so alerting and dashboards stay consistent.
{
"metric": "startup_success",
"value": 0.997,
"windowMinutes": 5,
"tags": {
"workflow": "claims-sync",
"region": "us-east-1",
"planTier": "pro",
"proxyPool": "steel-resi"
},
"evidence": {
"replayUrl": "https://hls.steel.dev/runs/claims-sync-12/master.m3u8",
"logsExportId": "log_exp_92ff",
"missingEvidence": false
},
"alert": {
"threshold": 0.995,
"action": "scale_control_plane"
}
}Scoreboard payload fields
| Field | Why it exists | Notes |
|---|---|---|
metric | Names the signal so Grafana, Metabase, or alerting routers can fan out without guessing column names. | Keep it stable (startup_success, first_action_success, etc.) so downstream joins stay simple. |
value | Stores the aggregated measurement you are comparing to the alert threshold. | Use ratios (0.997) or milliseconds depending on the metric; mixing units muddies alerts. |
windowMinutes | Documents the roll-up window so humans know whether this was a 1-minute spike or a 1-hour regression. | Match your alert windows (5, 15, 30) and update it when playbooks change. |
tags.* | Provide segmentation context (workflow, region, plan tier, proxy pool) so you can page the right owner. | Emit every tag defined in the dictionary; dashboards stay filterable without custom SQL. |
evidence.* | Links the metric to replay/log artifacts so responders can inspect proof before muting alarms. | Set missingEvidence to true anytime a replay/log export failed; that alone should trigger an incident. |
alert.threshold | Shows the contract baked into the runbook so nobody debates severity mid-incident. | Pair with the same target you published in the reliability scoreboard table. |
alert.action | Tells on-call owners exactly what to do (scale pool, rotate proxies, pause queue) when the alert fires. | Keep it short and imperative so it can be pasted directly into PagerDuty or Slack. |
Reliability scoreboard example
Keep one aggregated view for executives, but publish per-region, per-proxy, and per-plan slices so a noisy geography cannot hide a failing control plane.
| Metric | Target | Alert threshold | Owner & response |
|---|---|---|---|
| Startup success & p95 | ≥99.5%, p95 <1.2 s (Steel benchmark: 0.89 s avg / 1.09 s p95 / 1.34 s p99) | Success <99% or p95 ≥1.3 s for 5 min | Control-plane on-call: scale plan or warm pool, inspect control plane logs, fail open to Steel Cloud. |
| First action success | ≥97% | Drops <95% per domain | Anti-bot owner: rotate proxies, test headful/mobile mode, refresh selectors. |
| Completion without retry | ≥90% single-pass | Retries ≥1 per success | Workflow maintainer: split flows, persist profiles, fix flaky inputs. |
| Manual intervention rate | ≤5 per 100 runs | ≥10 per 100 runs or 3 consecutive spikes | Trust desk lead: pair approval reason with bug tickets; fix automation before adding staff. |
| Plan-tier saturation | <90% of concurrency or RPS cap per tier | ≥90% for 5 consecutive minutes | Control-plane + growth owners: shift jobs off Steel Local (~1 session) onto Steel Cloud (100+), or move Starter workloads to Pro/Enterprise tiers before queues stall. |
| Evidence coverage | 100% of failures | Any miss | Observability owner: halt deploys, patch replay/log exporters, treat as incident. |
Alert wiring checklist
Tie each metric to an automated action so nobody has to discover regressions in Slack.
| Condition | Trigger window | Automated action |
|---|---|---|
| Startup success <99% or p95 ≥1.3 s | 5-minute rolling window per region/plan | Page control-plane on-call, auto-scale warm pools, and temporarily cap queue fan-out. |
| Plan tier ≥90% of its concurrency or RPS cap | 5-minute rolling window per plan tier | Auto-scale the pool or shift workloads to the next Steel Cloud tier before saturation drags startup p95 above the benchmark, then notify owners to rebalance queue fan-out. |
| First action success dips below 95% for a domain | 15-minute rolling domain slice | Flip impacted workflows to headful or mobile mode and force proxy rotation until green again. |
| Manual intervention rate ≥10 per 100 runs | 30-minute rolling workflow slice | Pause the workflow queue, create a bug ticket with the last replay, and require incident review before resuming. |
| Evidence coverage <100% for failures | Immediate detection | Block deploys, open a Sev-2, and backfill missing replays/log exports before allowing new releases. |
Run a daily reliability review
A scoreboard only matters if someone reads it. Hold a 10-minute cadence that keeps metrics, evidence, and owners honest:
- Export the last 24 hours of scoreboard rows per workflow, region, and proxy slice; highlight any metric that crossed its alert threshold even once.
- Pull the top three manual intervention reason codes and pin their replays plus agent logs to the bug tracker before the next queue resumes.
- Verify evidence coverage stayed at 100 percent; if any run is missing replay or logs, open an incident before green-lighting deployments.
- Compare startup success by plan tier against Pricing/Limits; if a cap is within 10 percent, scale warm pools or fail workloads over to Steel Cloud before the next spike.
| Ritual | Cadence | Questions to answer | Output |
|---|---|---|---|
| Scoreboard export | Daily (10 min) | Which workflow-region slices crossed targets or alert thresholds? | Highlighted table ready for owners to triage. |
| Intervention triage | Daily | Which reason codes consumed the most human time, and do they map to known bugs? | Ticket list with replay/log links per incident. |
| Evidence audit | Daily | Did any failed run miss replays, logs, or files? | Incident report plus export backlog if any gap exists. |
| Plan-cap check | Daily | Which plan tiers or warm pools exceeded 90% of concurrency/RPS? | Action list to shift jobs between Steel Local vs Cloud or across Starter/Pro tiers. |
Interpret and act on the numbers
| Metric | Healthy range | Escalate when | First move |
|---|---|---|---|
| Startup success & p95 | ≥99.5% success, p95 <1.2 s, p99 <1.4 s | Success dips, p95 rises, or one region lags | Check plan caps, warm pool size, or proxy health; fall back to Steel Cloud if self-hosted control plane stalls. |
| First action success | ≥97% of sessions clear the first navigation | Drops below 95% or spikes per domain | Switch to headful, enable mobile mode, or attach managed proxies/CAPTCHA solving only on affected flows. |
| Completion without retry | ≥90% single-pass completion, retries ≤0.3 per success | Retries exceed 1 per success or certain flows hog retries | Instrument selectors and form states, persist profiles, or split risky flows into smaller, checkpointed steps. |
| Manual intervention rate | ≤5 per 100 runs for sensitive actions | Hits double digits or trends upward week over week | Patch automation logic, add pre-flight validation, or introduce staged approvals with better evidence to reduce escalations. |
| Evidence coverage | 100% of failed runs have replay + logs | Any gap appears | Fail open by shipping logs/replays to your own storage automatically; treat missing evidence as sev-worthy. |
Use these thresholds to trigger automation: alert when startup success falls for 5 minutes, pause a queue when manual approvals blow past the limit, or ship an incident review when evidence coverage dips at all.
Steel telemetry map
| Signal | Where it lives | Notes |
|---|---|---|
| Session lifecycle | sessions.create, sessions.get, sessions.release, releaseAll, plan caps from Pricing/Limits | Gives raw success/latency plus concurrency saturation data. |
| Agent decisions | GET /v1/sessions/{id}/agent-logs | Aligns prompts, actions, and DOM results for both automation and audit trails. |
| Live + replay evidence | debugUrl live viewer, HLS playback, MP4 downloads | Embed in ops tools for real-time approvals; store MP4 for forensic review. |
| Artifacts + files | Files API uploads/downloads | Persist structured outputs, DOM snapshots, and attachments without re-running sessions. |
| Context + profiles | sessionContext, Profiles API, Credentials API | Show whether failures correlate with stale auth or profile drift; helps size retries vs repairs. |
Trade-offs and limits
- Steel Cloud caps a single session at 24 hours; jobs longer than that must checkpoint progress into Files API or storage and resume in a fresh session.
- Session replays, logs, and artifacts stick around for up to 14 days on Pro plans. Export anything compliance-sensitive before that window closes.
- Managed proxies and CAPTCHA solving carry per-GB and per-1k charges; reliable metrics should separate baseline runs from stealth-heavy ones so cost spikes are obvious.
- Human approvals still require your own ACLs around
debugUrland embeds; Steel provides URLs, not full access control. - Steel Local is roughly a single-session runway while Steel Cloud plans ship managed stealth plus hundreds of concurrent browsers, so plan-cap alerts should always state which side of that boundary triggered.
Works for X, not yet for Y
- Works for teams already tagging each queue job with
session.id, so startup, action, and completion metrics match the evidence you review. - Works for ops groups layering approvals on top of Steel Cloud because interventions become structured events rather than Slack screenshots.
- Not yet for shops that cannot emit unique job IDs or store replays; the loop depends on correlating metrics to artifacts you can audit later.
- Not yet for workflows that routinely exceed the 24-hour session cap or run disconnected from the network; you will lose telemetry before the run ends.
Next steps
- Wire your metrics sink to the Pricing and Limits table so alarms fire before you exceed concurrency or RPS caps.
- Revisit the Sessions API overview to confirm which lifecycle events you emit and how to tag them.
- Download the Remote Browser Benchmark raw results so your scoreboard targets inherit the same evidence before you tune thresholds.
- Pair this page with Why Browser Agents Fail in Production so every failed run stays tied to concrete failure modes and evidence expectations.
- Grab a Steel Cloud API key at https://app.steel.dev so you can tag sessions with these metrics on your next run.
Humans use Chrome. Agents use Steel.