Scrape a URL

Fetches and processes a URL, returning content in one or more formats wrapped in a ScrapeEnvelope. Simple requests use the HTTP fast-path (~500ms); complex requests (delay_ms > 0, cookies, custom headers, wait conditions) are routed to headless Chromium and return HTTP 202 (3-15s). When `webhook_url` is provided and the scrape is async, webhook deliveries include `X-RapidCrawl-Event: scrape.completed` or `X-RapidCrawl-Event: scrape.failed` so receivers can dispatch on the header without parsing the body shape.

Authorization

AuthorizationRequiredBearer <token>

API key. Obtain from POST /v1/account/api-keys.

In: header

Request Body

application/jsonRequired

urlRequiredstring

Format: "uri"

formatsarray<string>

Default: ["rawHtml"]

countrystring

Default: "US"Pattern: "^[A-Z]{2}$"

cookiesarray<object>

Default: []

headersobject

Default: {}

delay_msinteger

Default: 0Minimum: 0Maximum: 10000

timeout_msinteger

Default: 30000Minimum: 1000Maximum: 60000

asyncboolean

Default: false

webhook_urlstring

Webhook endpoint URL (HTTPS only; http:// rejected with HTTP 422)

Pattern: "^https://"Format: "uri"

eventsarray<string>

Default: ["completed","failed"]

markdownModestring

Markdown processing mode. article=article extraction (default), raw=minimal cleanup, llm=compact LLM-optimised output.

Value in: "article" | "raw" | "llm"

markdownQuerystring

BM25 query string for relevance-ranked content filtering. Omit or leave empty to disable.

Maximum length: 200

markdownLinksstring

Link rendering style in the markdown output.

Value in: "inline" | "references" | "none"

markdownCompactboolean

Collapse excessive whitespace for a more compact output.

markdownFilterImagesboolean

Filter low-signal images from the markdown output.

markdownIncludeMediaboolean

When true, formats.links and formats.images return ScrapeScoredLink[] / ScrapeScoredImage[] (rich objects) instead of string[], and a top-level tables array is included. Only effective when markdown is in formats.

markdownIncludeWarningsboolean

When true, the response includes a top-level warnings array of ScrapeWarning objects. Only effective when markdown is in formats.

markdownIncludeStatsboolean

When true, the response includes a top-level stats object with ScrapeStats (chars, tokens, blocks). Only effective when markdown is in formats.

removeBase64Imagesboolean

When true (default), base64-encoded data: image sources are stripped from the HTML before markdown conversion and from the html format output. Set to false to preserve base64 images in the pipeline output. Does not affect rawHtml, which always returns the original HTML unchanged.

Default: true

cache_ttlstring | integer

How long a freshly fetched URL may be served from cache. 0 (string or integer) skips cache entirely, Nh/Nd set a TTL (capped at 168 h / 7 d). Honoured on the synchronous path only — the async branch accepts the field for forward compatibility but does not currently act on it.

Default: "48h"

customobject

User-supplied JSON payload, echoed back on the success envelope (sync, async, webhook). Capped at 4096 UTF-8 bytes (Buffer.byteLength). Distinct from the system-owned metadata column. Does NOT affect cache-key inputs — two requests differing only in custom resolve to the same cache entry.

Default: {}

curl -X POST "https://api.bytekit.com/v1/scrape" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

Scrape completed synchronously.

{
  "status": "success",
  "scrapeId": "sc_01j9abc123",
  "url": "https://example.com",
  "finalUrl": "https://example.com/",
  "contentType": "text/html; charset=UTF-8",
  "contentLength": 1256,
  "statusCode": 200,
  "retrievedAt": "2025-01-15T12:00:00Z",
  "formats": {
    "rawHtml": "<html><body><h1>Hello World</h1><p>Example page with a <a href=\"/about\">link</a> and an image.</p></body></html>",
    "html": "<html><body><h1>Hello World</h1><p>Example page with a <a href=\"/about\">link</a> and an image.</p></body></html>",
    "markdown": "# Hello World\n\nExample page with a [link](/about) and an image.\n\n![Alt text](https://example.com/photo.png)",
    "links": [
      {
        "url": "https://example.com/about",
        "text": "link",
        "isInternal": true,
        "rel": null
      }
    ],
    "images": [
      {
        "url": "https://example.com/photo.png",
        "alt": "Alt text",
        "width": 800,
        "height": 600,
        "format": "png",
        "score": 0.95
      }
    ]
  },
  "metadata": {
    "title": "Example Domain",
    "description": "An example page for demonstration",
    "language": "en",
    "canonicalUrl": "https://example.com",
    "ogTitle": "Example Domain",
    "ogDescription": null,
    "ogImage": null,
    "byline": "Jane Doe",
    "publishedAt": "2025-01-15T09:00:00Z"
  },
  "tables": [
    {
      "kind": "data",
      "gfm": "| Col1 | Col2 |\n|------|------|\n| A | B |",
      "rows": 2,
      "cols": 2
    }
  ],
  "warnings": [
    {
      "code": "lossy_table",
      "message": "Table at index 3 was simplified",
      "element": "<table>...</table>"
    }
  ],
  "stats": {
    "chars": 1256,
    "tokens": 312,
    "blocks": 4
  }
}

Scrape a URL

Authorization

Request Body

Response

TypeScript