Browser Automation for RAG Data Collection

RAG pipelines only produce trustworthy answers when every crawl, scrape, or agent step leaves a trail: where the content came from, which selectors captured it, which attachments were downloaded, and what human approved the export. Steel sessions spin up in under a second, stay live for up to 24 hours, and preserve portal trust inside reusable profiles, so each run can tag its evidence and send the final payload downstream without guessing what actually happened.

Instead of bolting Playwright or Selenium onto an ad hoc Chrome fleet, treat Steel as the managed browser tier for knowledge ingestion. Credal already processes more than 6 million URLs a month just to keep enterprise knowledge bases current, and teams like Stack AI or Zapier learned the hard way that public-site scraping eventually hits authenticated docs, JavaScript-only releases, or delta tracking that needs a long-lived browser. Steel keeps that complexity inside an API: credentials stay vaulted, downloads land in a /files mount, observers get a wrapped debugUrl, and everything releases with HLS replays plus agent logs you can audit alongside your vector store commits.

Workflow snapshot

Workflow	Typical targets	Traceability failure mode	Steel move
Public docs and changelog ingestion	Marketing sites, changelog pages, multi-framework docs that render via React	DOM snapshots drift, edits ship without proof, and datasets lose selector context	Tag each session with source + version, stream DOM -> markdown, store selectors in metadata, export replays for the approval packet
Authenticated knowledge sync	SaaS help centers behind SSO, customer-only release notes, private forums	Password sharing, MFA drift, and no audit trail for who pulled the data	Seed one profile per tenant, inject credentials via namespaces + optional TOTP, mirror replays/logs/files into your storage on release
Field intelligence sweeps	Pricing tables, talent pages, marketplace listings	Crawlers get throttled, proxies go stale, humans cannot replay what changed	Use managed proxies and CAPTCHA solving, keep per-workflow metadata to diff runs, run `sessions.release()` plus evidence export after each batch
Dataset lifts and download-heavy runs	Interactive dashboards, CSV exports, PDF bundles	Files land on random disks, attribution is lost, compliance cannot rebuild the data	Upload prompts or context via Global Files, mount `/files` for downloads, call `sessions.files.downloadArchive` and attach the archive hash to the ingestion job

Why RAG ingestion fights automation

JS-heavy docs hide actual copy behind hydration and client routing, so naive HTTP-only crawlers miss the rendered state entirely.
Incremental updates demand diffing: without per-run metadata and replays you cannot prove which delta fed the embedding job or why a chunk changed.
Authenticated sources tie logins to actual humans, meaning MFA resets, IP drift, and audit questions whenever secrets live inside task prompts.
Compliance teams expect evidence: if you cannot hand them logs, HLS, and downloaded files on request, the pipeline gets paused until someone re-runs the crawl manually.
Browser fleets die under load. Local Chrome farms stall after a dozen concurrent sessions, while Steel Cloud plans start in the tens and scale into the hundreds with managed proxies and CAPTCHA solving.

Recommended browser pattern

Plan the crawl boundary. Define the domains, max depth, and change detection method before you create sessions so every run emits intentional metadata like { sourceSlug, version, jobId }.
Create tagged sessions. Call client.sessions.create with metadata and persistProfile: true whenever the source requires login. Keep names consistent so evidence queries stay simple.
Reuse profiles for gated sources. Seed a profile manually, finish MFA, then reuse the profileId until either the site forces a reauth or the 30 day inactivity timer hits. Release stale profiles to stay under the 300 MB cap.
Inject secrets safely. Store credentials with namespaces (plus totpSecret if needed) and fetch them inside the workflow instead of embedding them into planner prompts or job configs.
Mount datasets through Files. Upload seed documents or filters to Global Files, mount them into /files, and rely on sessions.files.downloadArchive plus files.upload to push outputs back into your storage tier.
Record normalized artifacts. Render DOM to markdown, save selector contracts, hash exports, and capture HLS replays plus agent logs as part of the same queue item so humans can reenact any ingestion later.
Release and scale intentionally. Chain sessions.release, replay download, log export, and Files mirroring. Move from Steel Local (~1 session) to Steel Cloud Starter/Pro once the queue needs tens or hundreds of live sessions.

Steel surfaces that matter here

Surface	What it provides	Why it matters to RAG teams
Sessions + Profiles	Sub-second cold starts, up to 24 hour lifetimes, persistent auth with per-profile metadata	Lets you hold trust cookies, locale settings, and feature flags steady across repeated crawls without retyping credentials
Files API (Session + Global)	Deterministic `/files` mount plus automatic promotion of session files to global storage on release	Keeps datasets, attachments, and raw exports in one place with hashes you can reuse inside downstream ingestion jobs
Credentials API	Vaulted secrets, namespace scoping, optional TOTP fields	Removes passwords from prompts, scopes access to each workflow, and documents when secrets were last used
Observability stack	Live viewer, optional interactive control, HLS replay export, agent logs	Gives reviewers proof of what the agent saw before accepting a dataset and an artifact they can attach to release notes
Metadata + Logs	Structured `metadata` object on sessions plus agent log export API	Allows ingestion jobs to link replays, selectors, and dataset hashes back to a single ID when auditors ask
Deployment options	Steel Local for air-gapped or dev runs, Steel Cloud for managed proxies, CAPTCHA solving, and higher concurrency	Keeps sensitive data on-prem when required while still letting production RAG crawlers burst to hundreds of sessions

Traceability checklist

Control	Owner	Action
Dataset lineage	Data engineering	Tag every session with job IDs and source slugs, archive HLS + logs, and store selector manifests next to the embedding batch
Evidence retention	Ops	Mirror HLS playlists, agent logs, and Files archives into your storage before the 7 or 14 day clock on your plan expires
Credential hygiene	Security	Rotate secrets in the Credentials API, enforce namespace policies, and delete credentials or profiles when an operator leaves
Viewer access	App + Security	Wrap `debugUrl` behind your SSO, default `interactive=false`, and log every escalation for human-in-loop approvals
Dataset approvals	Knowledge ops	Require a human to watch the replay or review the normalized markdown before pushing embeddings into production
Concurrency limits	Infra	Monitor plan caps (Steel Local ~1 session, Cloud Starter in the tens, Pro >=100) and queue runs accordingly so crawls never silently stall

Works for / Not yet

Works for

Dynamic marketing sites, changelog hubs, and docs portals that render via JS or require cookie-sticky previews
Authenticated customer portals or partner-only help centers where you control the logins and can store evidence safely
Download-heavy datasets such as CSV exports, PDF bundles, or asset archives that must attach to ingestion tickets

Not yet

Sources without any browser surface (pure API feeds, data warehouses)
Sites whose terms or regulators forbid capturing replays or storing page content outside their network
Flows that rely on physical tokens or hardware keys with no SMS or TOTP fallback for automation

Next step

Pick one knowledge source that keeps breaking your RAG ingestion, seed a Steel profile for it, run the crawl with session metadata plus Files exports, and store the resulting HLS replay alongside the dataset. The docs to start with are docs.steel.dev/overview/sessions-api/overview, docs.steel.dev/overview/files-api/overview, and docs.steel.dev/overview/credentials-api/overview.

Humans use Chrome. Agents use Steel.