RAG pipelines only produce trustworthy answers when every crawl, scrape, or agent step leaves a trail: where the content came from, which selectors captured it, which attachments were downloaded, and which human approved the export. Steel sessions spin up in under a second, stay live for up to 24 hours, and preserve portal trust inside reusable profiles, so each run can tag its evidence and send the final payload downstream without anyone guessing what actually happened.
Instead of bolting Playwright or Selenium onto an ad hoc Chrome fleet, treat Steel as the managed browser tier for knowledge ingestion. Credal already processes more than 6 million URLs a month just to keep enterprise knowledge bases current, and teams like Stack AI or Zapier learned the hard way that public-site scraping eventually hits authenticated docs, JavaScript-only releases, or delta tracking that needs a long-lived browser. Steel keeps that complexity inside an API: credentials stay vaulted, downloads land in a /files mount, observers get a wrapped debugUrl, and everything releases with HLS replays plus agent logs you can audit alongside your vector store commits.
Workflow snapshot
| Workflow | Typical targets | Traceability failure mode | Steel move |
|---|---|---|---|
| Public docs and changelog ingestion | Marketing sites, changelog pages, multi-framework docs that render via React | DOM snapshots drift, edits ship without proof, and datasets lose selector context | Tag each session with source + version, stream DOM -> markdown, store selectors in metadata, export replays for the approval packet |
| Authenticated knowledge sync | SaaS help centers behind SSO, customer-only release notes, private forums | Password sharing, MFA drift, and no audit trail for who pulled the data | Seed one profile per tenant, inject credentials via namespaces + optional TOTP, mirror replays/logs/files into your storage on release |
| Field intelligence sweeps | Pricing tables, talent pages, marketplace listings | Crawlers get throttled, proxies go stale, humans cannot replay what changed | Use managed proxies and CAPTCHA solving, keep per-workflow metadata to diff runs, run sessions.release() plus evidence export after each batch |
| Dataset lifts and download-heavy runs | Interactive dashboards, CSV exports, PDF bundles | Files land on random disks, attribution is lost, compliance cannot rebuild the data | Upload prompts or context via Global Files, mount /files for downloads, call sessions.files.downloadArchive and attach the archive hash to the ingestion job |
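The last row's "attach the archive hash to the ingestion job" step could look roughly like the sketch below. It does not call the Steel SDK: `archive_bytes` stands in for whatever the archive download returns, and `IngestionJob` with its field names is a hypothetical record invented for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class IngestionJob:
    """Hypothetical record for one ingestion run; field names are illustrative."""
    job_id: str
    source_slug: str
    evidence: dict = field(default_factory=dict)


def attach_archive_hash(job: IngestionJob, archive_bytes: bytes) -> str:
    """Hash the downloaded archive and record the digest on the job."""
    digest = hashlib.sha256(archive_bytes).hexdigest()
    job.evidence["archive_sha256"] = digest
    job.evidence["archive_size_bytes"] = len(archive_bytes)
    return digest


# archive_bytes is a placeholder for the bytes of a real session archive.
archive_bytes = b"example archive contents"
job = IngestionJob(job_id="job-001", source_slug="acme-docs")
digest = attach_archive_hash(job, archive_bytes)
```

Storing the digest next to the job ID lets a reviewer confirm later that the archive in storage is the same one the embedding batch consumed.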
Why RAG ingestion fights automation
- JS-heavy docs hide actual copy behind hydration and client routing, so naive HTTP-only crawlers miss the rendered state entirely.
- Incremental updates demand diffing: without per-run metadata and replays you cannot prove which delta fed the embedding job or why a chunk changed.
- Authenticated sources tie logins to actual humans, meaning MFA resets, IP drift, and audit questions whenever secrets live inside task prompts.
- Compliance teams expect evidence: if you cannot hand them logs, HLS, and downloaded files on request, the pipeline gets paused until someone re-runs the crawl manually.
- Browser fleets die under load. Local Chrome farms stall after a dozen concurrent sessions, while Steel Cloud plans start in the tens of concurrent sessions and scale into the hundreds with managed proxies and CAPTCHA solving.
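The diffing point above can be made concrete with a small sketch. It assumes each run yields a mapping from a chunk ID to normalized text; the chunk IDs and sample content are invented, and nothing here depends on Steel itself.

```python
import hashlib


def chunk_hashes(chunks: dict[str, str]) -> dict[str, str]:
    """Map each chunk ID to a SHA-256 digest of its whitespace-trimmed text."""
    return {
        chunk_id: hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        for chunk_id, text in chunks.items()
    }


def diff_runs(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two runs' hash maps and list added, removed, and changed chunk IDs."""
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "changed": sorted(k for k in shared if previous[k] != current[k]),
    }


run_1 = chunk_hashes({"intro": "Welcome to v1.", "install": "pip install acme"})
run_2 = chunk_hashes(
    {"intro": "Welcome to v2.", "install": "pip install acme", "faq": "New section"}
)
delta = diff_runs(run_1, run_2)
```

Persisting each run's hash map with the session metadata is one way to show which delta fed a given embedding job.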
Recommended browser pattern
- Plan the crawl boundary. Define the domains, max depth, and change detection method before you create sessions so every run emits intentional metadata like { sourceSlug, version, jobId }.
- Create tagged sessions. Call client.sessions.create with metadata and persistProfile: true whenever the source requires login. Keep names consistent so evidence queries stay simple.
- Reuse profiles for gated sources. Seed a profile manually, finish MFA, then reuse the profileId until either the site forces a reauth or the 30-day inactivity timer hits. Release stale profiles to stay under the 300 MB cap.
- Inject secrets safely. Store credentials with namespaces (plus totpSecret if needed) and fetch them inside the workflow instead of embedding them into planner prompts or job configs.
- Mount datasets through Files. Upload seed documents or filters to Global Files, mount them into /files, and rely on sessions.files.downloadArchive plus files.upload to push outputs back into your storage tier.
- Record normalized artifacts. Render DOM to markdown, save selector contracts, hash exports, and capture HLS replays plus agent logs as part of the same queue item so humans can reenact any ingestion later.
- Release and scale intentionally. Chain sessions.release, replay download, log export, and Files mirroring. Move from Steel Local (~1 session) to Steel Cloud Starter/Pro once the queue needs tens or hundreds of live sessions.
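As a rough sketch of the first two steps, the helper below builds the arguments a tagged session-create call might take. The metadata keys and the persistProfile flag follow the names used in the text; how the Steel SDK actually spells them in Python is an assumption, so the code only builds a plain dict and never calls the SDK. `CrawlRun` and `evidence_key` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlRun:
    """Hypothetical description of one planned crawl."""
    source_slug: str
    version: str
    job_id: str
    requires_login: bool = False


def session_params(run: CrawlRun) -> dict:
    """Build arguments for a session-create call (parameter names assumed)."""
    params = {
        "metadata": {
            "sourceSlug": run.source_slug,
            "version": run.version,
            "jobId": run.job_id,
        }
    }
    if run.requires_login:
        # Only gated sources need a persisted profile.
        params["persistProfile"] = True
    return params


def evidence_key(run: CrawlRun, artifact: str) -> str:
    """Derive a consistent storage key so evidence queries stay simple."""
    return f"{run.source_slug}/{run.version}/{run.job_id}/{artifact}"


run = CrawlRun("acme-docs", "2024-06-01", "job-001", requires_login=True)
params = session_params(run)
replay_key = evidence_key(run, "replay.m3u8")
```

Deriving both the session metadata and the storage keys from one `CrawlRun` object keeps replays, logs, and file archives addressable by the same job ID.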
Steel surfaces that matter here
| Surface | What it provides | Why it matters to RAG teams |
|---|---|---|
| Sessions + Profiles | Sub-second cold starts, lifetimes of up to 24 hours, persistent auth with per-profile metadata | Lets you hold trust cookies, locale settings, and feature flags steady across repeated crawls without retyping credentials |
| Files API (Session + Global) | Deterministic /files mount plus automatic promotion of session files to global storage on release | Keeps datasets, attachments, and raw exports in one place with hashes you can reuse inside downstream ingestion jobs |
| Credentials API | Vaulted secrets, namespace scoping, optional TOTP fields | Removes passwords from prompts, scopes access to each workflow, and documents when secrets were last used |
| Observability stack | Live viewer, optional interactive control, HLS replay export, agent logs | Gives reviewers proof of what the agent saw before accepting a dataset and an artifact they can attach to release notes |
| Metadata + Logs | Structured metadata object on sessions plus agent log export API | Allows ingestion jobs to link replays, selectors, and dataset hashes back to a single ID when auditors ask |
| Deployment options | Steel Local for air-gapped or dev runs, Steel Cloud for managed proxies, CAPTCHA solving, and higher concurrency | Keeps sensitive data on-prem when required while still letting production RAG crawlers burst to hundreds of sessions |
Traceability checklist
| Control | Owner | Action |
|---|---|---|
| Dataset lineage | Data engineering | Tag every session with job IDs and source slugs, archive HLS + logs, and store selector manifests next to the embedding batch |
| Evidence retention | Ops | Mirror HLS playlists, agent logs, and Files archives into your own storage before your plan's 7- or 14-day retention clock runs out |
| Credential hygiene | Security | Rotate secrets in the Credentials API, enforce namespace policies, and delete credentials or profiles when an operator leaves |
| Viewer access | App + Security | Wrap debugUrl behind your SSO, default interactive=false, and log every escalation for human-in-loop approvals |
| Dataset approvals | Knowledge ops | Require a human to watch the replay or review the normalized markdown before pushing embeddings into production |
| Concurrency limits | Infra | Monitor plan caps (Steel Local ~1 session, Cloud Starter in the tens, Pro >=100) and queue runs accordingly so crawls never silently stall |
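For the concurrency row, one way to keep a queue under a plan cap is a semaphore around each run. This is a minimal sketch: `fake_crawl` is a stand-in worker that sleeps instead of opening a browser session, and the cap of 3 is an arbitrary example value, not a Steel plan limit.

```python
import asyncio


async def run_with_cap(jobs, worker, max_sessions: int):
    """Run worker(job) for every job, never exceeding max_sessions at once."""
    gate = asyncio.Semaphore(max_sessions)

    async def guarded(job):
        async with gate:
            return await worker(job)

    # gather preserves input order in its results.
    return await asyncio.gather(*(guarded(job) for job in jobs))


# Counters used by the demo worker to record peak concurrency.
active = 0
peak = 0


async def fake_crawl(job: str) -> str:
    """Stand-in for real session work; tracks how many runs overlap."""
    global active, peak
    active += 1
    peak = max(peak, active)
    await asyncio.sleep(0.01)
    active -= 1
    return f"{job}:done"


jobs = [f"job-{i}" for i in range(10)]
results = asyncio.run(run_with_cap(jobs, fake_crawl, max_sessions=3))
```

Runs beyond the cap wait for a slot rather than failing, which makes a stalled crawl visible as queue depth instead of a silent timeout.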
Works for / Not yet
Works for
- Dynamic marketing sites, changelog hubs, and docs portals that render via JS or require cookie-sticky previews
- Authenticated customer portals or partner-only help centers where you control the logins and can store evidence safely
- Download-heavy datasets such as CSV exports, PDF bundles, or asset archives that must attach to ingestion tickets
Not yet
- Sources without any browser surface (pure API feeds, data warehouses)
- Sites whose terms or regulators forbid capturing replays or storing page content outside their network
- Flows that rely on physical tokens or hardware keys with no SMS or TOTP fallback for automation
Next step
Pick one knowledge source that keeps breaking your RAG ingestion, seed a Steel profile for it, run the crawl with session metadata plus Files exports, and store the resulting HLS replay alongside the dataset. The docs to start with are docs.steel.dev/overview/sessions-api/overview, docs.steel.dev/overview/files-api/overview, and docs.steel.dev/overview/credentials-api/overview.
Humans use Chrome. Agents use Steel.