Gemini Computer Use With Steel

Connect Gemini Computer Use to Steel sessions for managed browsers, replay-grade observability, and anti-bot help without rebuilding your agent loop.

Keep Gemini Computer Use exactly as it is and hand its browser actions to Steel. Create one Steel session, route Gemini's normalized-coordinate actions through that session's Computer API, and you get sub-second startup, sessions that run for up to 24 hours, and a replayable record of every run, all without touching your prompts or orchestrator.

Steel adds what Gemini leaves to you: live viewer links, replay-ready screenshots, CAPTCHA routing, proxy management, and a cleanup contract that frees concurrency the second a run ends. Pair the Gemini 2.5 Computer Use reasoning stack with Steel sessions so you can watch each action, rerun failures with evidence, and keep your task queue honest.

What stays the same

Gemini concern | What you keep | Notes
Task prompts and reasoning | Same gemini-2.5-computer-use-preview model, same system prompt, same task payload | Steel never touches your Google credentials or conversation state
Tool contract | Gemini's single computer-use tool with normalized 0-1000 coordinates stays exactly as provided | You only replace the backend that translates coordinates into browser actions
Safety gating | Existing safety confirmations and reviewer prompts | Steel just surfaces the viewer link so reviewers can watch the action they are approving
Queue + hosting | Your Python or Node loop, cron, or worker stack | Steel is another API client sitting next to google-genai

What Steel adds

Steel surface | Why it matters for Computer Use | How to wire it
Session lifecycle | Fast startup and a 24-hour session cap keep Gemini loops running without relaunching Chrome | session = client.sessions.create({dimensions:{width:1280,height:768}, blockAds:true, timeout:900000}) then client.sessions.release(session.id) in finally
Observability | Viewer URL, replay, and agent logs make every click reviewable | Log session.sessionViewerUrl (session.session_viewer_url in Python), store it beside the Gemini response ID, and pull client.sessions.logs.list(session.id) after runs
Computer API | Deterministic mapping for click, type, scroll, wait, navigate, and take_screenshot responses | Forward each Gemini function_call to client.sessions.computer(session.id, body) and return the base64 PNG to Gemini
Anti-bot and CAPTCHA tooling | Managed proxies plus a CAPTCHA queue keep loops from stalling on login walls | Set useProxy and region, then poll client.sessions.captchas.status(session.id) when the response flags a challenge
Release discipline | Releasing sessions publishes the replay, frees plan-cap slots, and locks observability records | Treat release success as a metric; call sessions.release on both happy and unhappy paths
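
The release discipline in the last row can be sketched as a small wrapper. This is a minimal sketch rather than a helper shipped by Steel: SessionsApi, SessionHandle, RunOutcome, and withSteelSession are hypothetical names that mirror only the two calls named in the table (sessions.create and sessions.release), and the sketch assumes the real steel-sdk client satisfies that interface.

```typescript
// Hypothetical minimal view of the two session calls used in this article.
// Assumption: the real steel-sdk `client.sessions` object fits this shape.
interface SessionHandle {
  id: string;
  sessionViewerUrl: string;
}

interface SessionsApi {
  create(options: Record<string, unknown>): Promise<SessionHandle>;
  release(id: string): Promise<unknown>;
}

interface RunOutcome<T> {
  sessionId: string;
  viewerUrl: string;
  result?: T;
  error?: unknown;
  releaseSuccess: boolean;
}

// Creates a session, runs the agent loop, and always attempts a release.
async function withSteelSession<T>(
  sessions: SessionsApi,
  run: (session: SessionHandle) => Promise<T>,
): Promise<RunOutcome<T>> {
  const session = await sessions.create({
    dimensions: { width: 1280, height: 768 },
    blockAds: true,
    timeout: 900_000, // milliseconds (15 minutes), same value as the table row
  });
  const outcome: RunOutcome<T> = {
    sessionId: session.id,
    viewerUrl: session.sessionViewerUrl,
    releaseSuccess: false,
  };
  try {
    outcome.result = await run(session);
  } catch (error) {
    outcome.error = error; // unhappy path: keep the error, still release below
  } finally {
    try {
      await sessions.release(session.id);
      outcome.releaseSuccess = true;
    } catch {
      // Release failed; releaseSuccess stays false so it can be tracked as a metric.
    }
  }
  return outcome;
}
```

Returning the outcome instead of rethrowing keeps the release result and the task error in one place, so a worker can log both before deciding whether to re-queue the job.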

Minimal integration path

  1. Install the official SDKs: npm install steel-sdk @google/genai dotenv for Node, or pip install steel-sdk google-genai python-dotenv for Python, plus any other dependencies your TypeScript or Python runtime needs.
  2. Load .env values for STEEL_API_KEY, GEMINI_API_KEY, and a default TASK. Keep the quickstart's MODEL = "gemini-2.5-computer-use-preview-10-2025" constant so both runtimes stay aligned.
  3. Create a Steel client and session with the same viewport Gemini expects:
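    import "dotenv/config"; // load STEEL_API_KEY from .env before the client is built
    import { Steel } from "steel-sdk";
    const steel = new Steel({ steelAPIKey: process.env.STEEL_API_KEY! });
    // Match the 1280x768 viewport that the coordinate helpers assume.
    const session = await steel.sessions.create({
      dimensions: { width: 1280, height: 768 },
      blockAds: true,
      timeout: 900_000, // milliseconds (15 minutes)
    });
    // Log the live viewer link so a human can follow the run.
    console.log(`Viewer: ${session.sessionViewerUrl}`);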
  4. Mirror the helper from the docs: keep MAX_COORDINATE = 1000, add denormalizeX and denormalizeY functions, and normalize Gemini key combos before handing them to Steel's Computer API.
  5. In your Gemini loop, capture every function_call, translate it through the helper, and invoke steel.sessions.computer(session.id, actionPayload) so each action returns a PNG screenshot and optional URL back to Gemini.
  6. Wrap execution in try/finally so sessions.release(session.id) always runs. Print both the viewer link and replay link for humans who need to verify the outcome.
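
Step 4 can be sketched as follows. This is an illustrative version, not the helper from the docs: MAX_COORDINATE, denormalizeX, and denormalizeY are the names used in the steps, while the clamping to the last valid pixel, the KEY_ALIASES table, and normalizeKeyCombo are assumptions made for the example. Check the quickstart for the key names Steel's Computer API actually expects.

```typescript
const MAX_COORDINATE = 1000; // Gemini emits coordinates on a 0-1000 grid
const VIEWPORT_WIDTH = 1280;
const VIEWPORT_HEIGHT = 768;

// Scale a normalized value to pixels, then clamp into [0, size - 1].
// The clamp is an assumption of this sketch so that 1000 maps to the last pixel.
function denormalize(value: number, size: number): number {
  const pixel = Math.round((value / MAX_COORDINATE) * size);
  return Math.min(Math.max(pixel, 0), size - 1);
}

function denormalizeX(x: number): number {
  return denormalize(x, VIEWPORT_WIDTH);
}

function denormalizeY(y: number): number {
  return denormalize(y, VIEWPORT_HEIGHT);
}

// Hypothetical alias table: maps common spellings to one canonical key name.
const KEY_ALIASES: Record<string, string> = {
  ctrl: "Control",
  control: "Control",
  cmd: "Meta",
  command: "Meta",
  meta: "Meta",
  alt: "Alt",
  option: "Alt",
  shift: "Shift",
  enter: "Enter",
  return: "Enter",
  esc: "Escape",
  escape: "Escape",
};

// Split a combo such as "control+a" into canonical key names.
function normalizeKeyCombo(combo: string): string[] {
  return combo
    .split("+")
    .map((key) => key.trim())
    .filter((key) => key.length > 0)
    .map((key) => KEY_ALIASES[key.toLowerCase()] ?? key);
}
```

For example, a click Gemini reports at (500, 500) lands at pixel (640, 384) in the 1280x768 viewport, and "control+a" becomes ["Control", "a"].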

Mirror the helper structure in TypeScript and Python

  • System prompt: Keep the same <BROWSER_ENV> block from the quickstarts so Gemini knows it is driving a Steel-managed Chromium instance with internet access.
  • Coordinate helpers: The TS denormalizeX/denormalizeY methods and the Python _denormalize_x/_denormalize_y pair both map normalized coordinates to the 1280x768 viewport. Reuse them verbatim.
  • Action router: Copy the switch/if ladder from agent.ts or agent.py. Every Gemini action (click_at, scroll_document, type_text_at, navigate, drag_and_drop, wait_5_seconds) already has the Steel API payload defined. Keep the screenshot flag on so observability stays in sync.
  • Logging: The helper prints each action and logs the viewer link. Extend that log with your job IDs so you can correlate Gemini reasoning, Steel evidence, and downstream approvals.
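
The action router described above might look roughly like this. The Gemini action names and argument names (x, y, destination_x, destination_y, text, direction, url) follow Gemini's Computer Use tool, but the ComputerAction payload is a placeholder shape invented for this sketch, not Steel's documented schema. Copy the real payloads from agent.ts or agent.py before using it.

```typescript
interface GeminiFunctionCall {
  name: string;
  args: Record<string, unknown>;
}

// HYPOTHETICAL payload shape, for illustration only.
interface ComputerAction {
  action: string;
  coordinates?: [number, number];
  path?: [number, number][];
  text?: string;
  url?: string;
  direction?: string;
  durationMs?: number;
  screenshot: boolean; // keep on so every step returns evidence
}

// Convert one normalized 0-1000 value into pixels for the given axis size.
const toPixel = (value: unknown, size: number): number => {
  const n = Number(value);
  if (!Number.isFinite(n)) throw new Error(`Invalid coordinate: ${String(value)}`);
  return Math.round((n / 1000) * size);
};

function routeAction(
  call: GeminiFunctionCall,
  width = 1280,
  height = 768,
): ComputerAction {
  const { name, args } = call;
  // Read a pair of normalized arguments and return a pixel point.
  const point = (xKey: string, yKey: string): [number, number] => [
    toPixel(args[xKey], width),
    toPixel(args[yKey], height),
  ];
  switch (name) {
    case "click_at":
      return { action: "click", coordinates: point("x", "y"), screenshot: true };
    case "type_text_at":
      return {
        action: "type",
        coordinates: point("x", "y"),
        text: String(args["text"] ?? ""),
        screenshot: true,
      };
    case "scroll_document":
      return { action: "scroll", direction: String(args["direction"] ?? "down"), screenshot: true };
    case "navigate":
      return { action: "navigate", url: String(args["url"] ?? ""), screenshot: true };
    case "drag_and_drop":
      return {
        action: "drag",
        path: [point("x", "y"), point("destination_x", "destination_y")],
        screenshot: true,
      };
    case "wait_5_seconds":
      return { action: "wait", durationMs: 5000, screenshot: true };
    default:
      // Fail loudly so an unmapped action never silently becomes a no-op.
      throw new Error(`Unsupported Gemini action: ${name}`);
  }
}
```

Keeping the router a pure function (function call in, payload out) makes it easy to unit test without a live session; the network call to the Computer API stays in the loop that consumes its output.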

Pair Gemini Computer Use with Steel observability

Signal | Steel hook | Why it matters
Live viewer | session.sessionViewerUrl (session.session_viewer_url in Python) | Share with operators so they can watch Gemini's actions in real time and pause high-risk ones
Replay | Same viewer URL after release | Gives you a permanent artifact to debug or escalate without rerunning a flaky task
Agent logs | client.sessions.logs.list(session.id) | Store log excerpts next to Gemini transcripts so you can diff retries and see why a click misfired
CAPTCHA status | client.sessions.captchas.status(session.id) | Pause Gemini actions until Steel clears the challenge, then resume with context
Release metrics | Track sessions.release success per job | Prevent orphaned sessions from soaking up concurrency limits and keep plan usage auditable
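
Pausing the loop while a CAPTCHA is pending can be sketched as a generic poll with a deadline. The helper below takes the status check as a callback because this article does not show the response shape of client.sessions.captchas.status; waitUntilClear, PollOptions, and the isCleared callback are hypothetical names, and how you interpret the status response inside that callback is left to your integration.

```typescript
interface PollOptions {
  intervalMs: number; // delay between status checks
  timeoutMs: number; // give up after this long
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Resolves true once isCleared() reports success, false if the deadline passes first.
async function waitUntilClear(
  isCleared: () => Promise<boolean>,
  { intervalMs, timeoutMs }: PollOptions,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (await isCleared()) return true;
    // Stop if the next check could not start before the deadline.
    if (Date.now() + intervalMs > deadline) return false;
    await sleep(intervalMs);
  }
}
```

In the agent loop, the callback would wrap the CAPTCHA status call for the current session. A false result is a signal to stop sending Gemini actions and escalate to an operator through the viewer link rather than retry blindly.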

Fit and trade-offs

Works best for

  • Teams already calling Gemini Computer Use who just need a reliable browser backend with replay evidence.
  • Agents that require human approvals, postmortems, or escalations; Steel's viewer and logs make that evidence one click away.
  • Workloads where the normalized-coordinate logic already works and needs no changes, but your own Chrome runtime keeps crashing under load.

Not yet ideal when

  • You need a desktop app surface outside Chromium; Steel only supplies browsers today.
  • Runs exceed the 24-hour session cap or need more concurrency than your Steel plan currently offers.
  • Your org cannot enable the gemini-2.5-computer-use-preview capability yet; Steel cannot sidestep Google's access controls.

Go-live checklist

  • Steel and Gemini keys, the default TASK, and viewport settings from .env stored in your secrets store.
  • Action router tested in both TS and Python quickstarts from docs.steel.dev/integrations/gemini-computer-use so future edits stay grounded in live code.
  • Logs capture session ID, viewer URL, replay URL, Gemini response ID, and a release_success flag.
  • CAPTCHA routing tested on at least one high-friction site so your queue does not stall when Gemini hits a challenge.
  • Observability review ritual in place: operators watch the viewer for sensitive steps and replay failed jobs before re-queuing them.
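
The logging item in the checklist can be made concrete with a small record builder. This is a sketch under assumptions: the field list comes from the checklist, but RunRecord, buildRunRecord, missingFields, and the jobId field are hypothetical names chosen for the example, and the record is written as one JSON line only because that is a common log format.

```typescript
interface RunRecord {
  jobId: string; // your own queue or job identifier (assumption)
  sessionId: string;
  viewerUrl: string;
  replayUrl: string;
  geminiResponseId: string;
  release_success: boolean;
  loggedAt: string; // ISO 8601 timestamp
}

// Stamp the record with a timestamp; `now` is injectable for testing.
function buildRunRecord(
  input: Omit<RunRecord, "loggedAt">,
  now: Date = new Date(),
): RunRecord {
  return { ...input, loggedAt: now.toISOString() };
}

// Names of the string fields that are empty, so a pre-launch check can fail fast.
function missingFields(record: RunRecord): string[] {
  const required: (keyof RunRecord)[] = [
    "jobId",
    "sessionId",
    "viewerUrl",
    "replayUrl",
    "geminiResponseId",
  ];
  return required.filter((field) => String(record[field]).trim() === "");
}

// One JSON object per line keeps the log easy to grep and to ingest.
function toLogLine(record: RunRecord): string {
  return JSON.stringify(record);
}
```

Running missingFields against a sample job before go-live is a cheap way to confirm that every item the checklist names actually reaches your logs.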

Next step: run the cookbook sample once, keep the viewer link in your logs, and layer CAPTCHA monitoring before scaling the queue. Humans use Chrome. Agents use Steel.