Browser Automation for RAG Data Collection

Show RAG teams how Steel sessions, Files, and observability keep browser data collection repeatable, auditable, and ready for knowledge base ingestion.

RAG pipelines only produce trustworthy answers when every crawl, scrape, or agent step leaves a trail: where the content came from, which selectors captured it, which attachments were downloaded, and what human approved the export. Steel sessions spin up in under a second, stay live for up to 24 hours, and preserve portal trust inside reusable profiles, so each run can tag its evidence and send the final payload downstream without guessing what actually happened.

Instead of bolting Playwright or Selenium onto an ad hoc Chrome fleet, treat Steel as the managed browser tier for knowledge ingestion. Credal already processes more than 6 million URLs a month just to keep enterprise knowledge bases current, and teams like Stack AI or Zapier learned the hard way that public-site scraping eventually hits authenticated docs, JavaScript-only releases, or delta tracking that needs a long-lived browser. Steel keeps that complexity inside an API: credentials stay vaulted, downloads land in a /files mount, observers get a wrapped debugUrl, and everything releases with HLS replays plus agent logs you can audit alongside your vector store commits.

Workflow snapshot

WorkflowTypical targetsTraceability failure modeSteel move
Public docs and changelog ingestionMarketing sites, changelog pages, multi-framework docs that render via ReactDOM snapshots drift, edits ship without proof, and datasets lose selector contextTag each session with source + version, stream DOM -> markdown, store selectors in metadata, export replays for the approval packet
Authenticated knowledge syncSaaS help centers behind SSO, customer-only release notes, private forumsPassword sharing, MFA drift, and no audit trail for who pulled the dataSeed one profile per tenant, inject credentials via namespaces + optional TOTP, mirror replays/logs/files into your storage on release
Field intelligence sweepsPricing tables, talent pages, marketplace listingsCrawlers get throttled, proxies go stale, humans cannot replay what changedUse managed proxies and CAPTCHA solving, keep per-workflow metadata to diff runs, run sessions.release() plus evidence export after each batch
Dataset lifts and download-heavy runsInteractive dashboards, CSV exports, PDF bundlesFiles land on random disks, attribution is lost, compliance cannot rebuild the dataUpload prompts or context via Global Files, mount /files for downloads, call sessions.files.downloadArchive and attach the archive hash to the ingestion job

Why RAG ingestion fights automation

  • JS-heavy docs hide actual copy behind hydration and client routing, so naive HTTP-only crawlers miss the rendered state entirely.
  • Incremental updates demand diffing: without per-run metadata and replays you cannot prove which delta fed the embedding job or why a chunk changed.
  • Authenticated sources tie logins to actual humans, meaning MFA resets, IP drift, and audit questions whenever secrets live inside task prompts.
  • Compliance teams expect evidence: if you cannot hand them logs, HLS, and downloaded files on request, the pipeline gets paused until someone re-runs the crawl manually.
  • Browser fleets die under load. Local Chrome farms stall after a dozen concurrent sessions, while Steel Cloud plans start in the tens and scale into the hundreds with managed proxies and CAPTCHA solving.
  1. Plan the crawl boundary. Define the domains, max depth, and change detection method before you create sessions so every run emits intentional metadata like { sourceSlug, version, jobId }.
  2. Create tagged sessions. Call client.sessions.create with metadata and persistProfile: true whenever the source requires login. Keep names consistent so evidence queries stay simple.
  3. Reuse profiles for gated sources. Seed a profile manually, finish MFA, then reuse the profileId until either the site forces a reauth or the 30 day inactivity timer hits. Release stale profiles to stay under the 300 MB cap.
  4. Inject secrets safely. Store credentials with namespaces (plus totpSecret if needed) and fetch them inside the workflow instead of embedding them into planner prompts or job configs.
  5. Mount datasets through Files. Upload seed documents or filters to Global Files, mount them into /files, and rely on sessions.files.downloadArchive plus files.upload to push outputs back into your storage tier.
  6. Record normalized artifacts. Render DOM to markdown, save selector contracts, hash exports, and capture HLS replays plus agent logs as part of the same queue item so humans can reenact any ingestion later.
  7. Release and scale intentionally. Chain sessions.release, replay download, log export, and Files mirroring. Move from Steel Local (~1 session) to Steel Cloud Starter/Pro once the queue needs tens or hundreds of live sessions.

Steel surfaces that matter here

SurfaceWhat it providesWhy it matters to RAG teams
Sessions + ProfilesSub-second cold starts, up to 24 hour lifetimes, persistent auth with per-profile metadataLets you hold trust cookies, locale settings, and feature flags steady across repeated crawls without retyping credentials
Files API (Session + Global)Deterministic /files mount plus automatic promotion of session files to global storage on releaseKeeps datasets, attachments, and raw exports in one place with hashes you can reuse inside downstream ingestion jobs
Credentials APIVaulted secrets, namespace scoping, optional TOTP fieldsRemoves passwords from prompts, scopes access to each workflow, and documents when secrets were last used
Observability stackLive viewer, optional interactive control, HLS replay export, agent logsGives reviewers proof of what the agent saw before accepting a dataset and an artifact they can attach to release notes
Metadata + LogsStructured metadata object on sessions plus agent log export APIAllows ingestion jobs to link replays, selectors, and dataset hashes back to a single ID when auditors ask
Deployment optionsSteel Local for air-gapped or dev runs, Steel Cloud for managed proxies, CAPTCHA solving, and higher concurrencyKeeps sensitive data on-prem when required while still letting production RAG crawlers burst to hundreds of sessions

Traceability checklist

ControlOwnerAction
Dataset lineageData engineeringTag every session with job IDs and source slugs, archive HLS + logs, and store selector manifests next to the embedding batch
Evidence retentionOpsMirror HLS playlists, agent logs, and Files archives into your storage before the 7 or 14 day clock on your plan expires
Credential hygieneSecurityRotate secrets in the Credentials API, enforce namespace policies, and delete credentials or profiles when an operator leaves
Viewer accessApp + SecurityWrap debugUrl behind your SSO, default interactive=false, and log every escalation for human-in-loop approvals
Dataset approvalsKnowledge opsRequire a human to watch the replay or review the normalized markdown before pushing embeddings into production
Concurrency limitsInfraMonitor plan caps (Steel Local ~1 session, Cloud Starter in the tens, Pro >=100) and queue runs accordingly so crawls never silently stall

Works for / Not yet

Works for

  • Dynamic marketing sites, changelog hubs, and docs portals that render via JS or require cookie-sticky previews
  • Authenticated customer portals or partner-only help centers where you control the logins and can store evidence safely
  • Download-heavy datasets such as CSV exports, PDF bundles, or asset archives that must attach to ingestion tickets

Not yet

  • Sources without any browser surface (pure API feeds, data warehouses)
  • Sites whose terms or regulators forbid capturing replays or storing page content outside their network
  • Flows that rely on physical tokens or hardware keys with no SMS or TOTP fallback for automation

Next step

Pick one knowledge source that keeps breaking your RAG ingestion, seed a Steel profile for it, run the crawl with session metadata plus Files exports, and store the resulting HLS replay alongside the dataset. The docs to start with are docs.steel.dev/overview/sessions-api/overview, docs.steel.dev/overview/files-api/overview, and docs.steel.dev/overview/credentials-api/overview.

Humans use Chrome. Agents use Steel.