
Scrape

Fetch a URL as raw HTML, clean markdown, or structured content. Accessed via client.scrape.

create

client.scrape.create(opts: ScrapeOpts): Promise<unknown>

POST /v1/scrape — create a scrape job
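A minimal sketch of calling create, assuming a client object that exposes the signature shown above. The names ScrapeOptsSketch, ScrapeCreator, scrapeAsMarkdown, and fakeClient are illustrative stand-ins introduced here, not SDK exports, and the fake client only echoes its input so the example runs without network access. How the real client is constructed is outside the scope of this sketch.

```typescript
// Subset of ScrapeOpts, copied from the options list in this reference.
type ScrapeFormat = 'rawHtml' | 'html' | 'markdown' | 'links' | 'images';

interface ScrapeOptsSketch {
  url: string;
  formats?: ScrapeFormat[];
  timeout_ms?: number;
  custom?: Record<string, unknown>;
}

// Hypothetical structural type: only the part of the client this sketch needs.
interface ScrapeCreator {
  scrape: { create(opts: ScrapeOptsSketch): Promise<unknown> };
}

// Requests markdown plus links for one page and returns the raw result.
// The documented return type is Promise<unknown>, so callers must narrow
// the value before reading fields from it.
async function scrapeAsMarkdown(client: ScrapeCreator, url: string): Promise<unknown> {
  const opts: ScrapeOptsSketch = {
    url,
    formats: ['markdown', 'links'],
    timeout_ms: 30000,
    // Echoed back on the success envelope, per the description of `custom`.
    custom: { requestedBy: 'docs-example' },
  };
  return client.scrape.create(opts);
}

// Stand-in client (NOT the SDK) that records what it was sent.
const sentOpts: ScrapeOptsSketch[] = [];
const fakeClient: ScrapeCreator = {
  scrape: {
    async create(opts) {
      sentOpts.push(opts);
      return { echoedUrl: opts.url };
    },
  },
};
```

With a real client, the same scrapeAsMarkdown helper would be passed that client in place of fakeClient.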

get

client.scrape.get(id: ScrapeId): Promise<unknown>

GET /v1/scrape/{id} — poll a scrape by ID
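Polling with get could look roughly like the sketch below. It rests on assumptions that this reference does not confirm: that ScrapeId can be treated as a string, and that the polled payload carries a string status field whose terminal values match the 'completed' and 'failed' event names listed under ScrapeOpts. The helpers readStatus, sleep, and pollScrape are hypothetical names introduced for illustration.

```typescript
// ASSUMPTION: ScrapeId is treated as a string; the SDK's real type may differ.
type ScrapeId = string;

// Hypothetical structural type: only the part of the client this sketch needs.
interface ScrapeGetter {
  scrape: { get(id: ScrapeId): Promise<unknown> };
}

// ASSUMPTION: the payload has a string `status` field. Because get() is
// documented as returning Promise<unknown>, the value is narrowed defensively.
function readStatus(payload: unknown): string | undefined {
  if (typeof payload === 'object' && payload !== null && 'status' in payload) {
    const status = (payload as { status: unknown }).status;
    return typeof status === 'string' ? status : undefined;
  }
  return undefined;
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls until the job reports a terminal status or the attempt budget runs out.
async function pollScrape(
  client: ScrapeGetter,
  id: ScrapeId,
  intervalMs = 1000,
  maxAttempts = 30,
): Promise<unknown> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const payload = await client.scrape.get(id);
    const status = readStatus(payload);
    if (status === 'completed') return payload;
    if (status === 'failed') throw new Error(`scrape ${id} failed`);
    // Wait between attempts, but not after the final one.
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`scrape ${id} did not finish after ${maxAttempts} attempts`);
}
```

A fixed interval keeps the sketch short; a production caller might prefer backoff, or the webhook_url option in place of polling.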

bulk

create

client.scrape.bulk.create(opts: Record<string, unknown>): Promise<unknown>

POST /v1/scrape/bulk — create a bulk scrape job

get

client.scrape.bulk.get(id: BulkId): Promise<unknown>

GET /v1/scrape/bulk/{id} — get a bulk scrape job

Options (ScrapeOpts)

  • url (required): string
  • formats: Array<'rawHtml' | 'html' | 'markdown' | 'links' | 'images'>
  • country: string
  • cookies: Array<Record<string, unknown>>
  • headers: Record<string, string>
  • delay_ms: number
  • timeout_ms: number
  • async: boolean
  • webhook_url: string
  • events: Array<'queued' | 'completed' | 'failed'>
  • markdownMode: MarkdownMode — Markdown processing mode. article = article extraction (default), raw = minimal cleanup, llm = compact LLM-optimised output.
  • markdownQuery: string — BM25 query string for relevance-ranked content filtering. Omit or leave empty to disable.
  • markdownLinks: MarkdownLinks — Link rendering style in the markdown output.
  • markdownCompact: boolean — Collapse excessive whitespace for a more compact output.
  • markdownFilterImages: boolean — Filter low-signal images from the markdown output.
  • markdownIncludeMedia: boolean — When true, formats.links and formats.images return ScrapeScoredLink[] / ScrapeScoredImage[] (rich objects) instead of string[], and a top-level tables array is included. Only effective when markdown is in formats.
  • markdownIncludeWarnings: boolean — When true, the response includes a top-level warnings array of ScrapeWarning objects. Only effective when markdown is in formats.
  • markdownIncludeStats: boolean — When true, the response includes a top-level stats object with ScrapeStats (chars, tokens, blocks). Only effective when markdown is in formats.
  • cache_ttl: string | 0 — How long a freshly fetched URL may be served from cache. '0'/0 disables cache, 'Nh'/'Nd' set a TTL (capped at 168h / 7d). Default '48h'. Honoured on the synchronous path only — the async path accepts the value but does not currently act on it.
  • custom: Record<string, unknown> — User-supplied JSON payload, echoed back on the success envelope so callers can correlate the response to caller-side state (job IDs, batch metadata). Capped at 4096 UTF-8 bytes after JSON serialization. Does NOT affect cache-key inputs — two requests differing only in custom share the same cache slot.
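The cache_ttl and custom limits described above can be checked on the caller's side before a request is sent. The sketch below is one way to do that. The helper names (cacheTtlHours, customPayloadBytes, fitsCustomLimit) are hypothetical, and two behaviours are assumptions rather than documented facts: that values above the cap are clamped to 168 hours rather than rejected, and that any string other than '0', 'Nh', or 'Nd' is invalid.

```typescript
const MAX_TTL_HOURS = 168;     // cap stated in this reference (168h / 7d)
const DEFAULT_TTL_HOURS = 48;  // default stated in this reference ('48h')
const MAX_CUSTOM_BYTES = 4096; // limit stated for `custom`

// Converts a cache_ttl value to an effective TTL in hours.
// Returns 0 when caching is disabled.
function cacheTtlHours(value?: string | 0): number {
  if (value === undefined) return DEFAULT_TTL_HOURS;
  if (value === 0 || value === '0') return 0;
  const match = /^(\d+)([hd])$/.exec(value);
  // ASSUMPTION: anything other than '0', 'Nh', or 'Nd' is treated as invalid.
  if (!match) throw new Error(`unsupported cache_ttl: ${value}`);
  const hours = Number(match[1]) * (match[2] === 'd' ? 24 : 1);
  // ASSUMPTION: "capped" is read as clamping, not rejection.
  return Math.min(hours, MAX_TTL_HOURS);
}

// Size of `custom` in UTF-8 bytes after JSON serialization,
// which is the quantity the 4096-byte limit is defined on.
function customPayloadBytes(custom: Record<string, unknown>): number {
  return new TextEncoder().encode(JSON.stringify(custom)).length;
}

function fitsCustomLimit(custom: Record<string, unknown>): boolean {
  return customPayloadBytes(custom) <= MAX_CUSTOM_BYTES;
}
```

Measuring bytes rather than string length matters for non-ASCII payloads: { a: 'é' } serializes to 9 characters but 10 UTF-8 bytes.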