Scrape a URL

Fetches and processes a URL, returning content in one or more formats wrapped in a ScrapeEnvelope. Simple requests use the HTTP fast-path (~500ms); complex requests (delay_ms > 0, cookies, custom headers, wait conditions) are routed to headless Chromium and return HTTP 202 (3-15s). When `webhook_url` is provided and the scrape is async, webhook deliveries include `X-RapidCrawl-Event: scrape.completed` or `X-RapidCrawl-Event: scrape.failed` so receivers can dispatch on the header without parsing the body shape.

POST
/v1/scrape

Authorization

AuthorizationRequiredBearer <token>

API key. Obtain from POST /v1/account/api-keys.

In: header

Request Body

application/jsonRequired
urlRequiredstring
Format: "uri"
formatsarray<string>
Default: ["rawHtml"]
countrystring
Default: "US"Pattern: "^[A-Z]{2}$"
cookiesarray<object>
Default: []
headersobject
Default: {}
delay_msinteger
Default: 0Minimum: 0Maximum: 10000
timeout_msinteger
Default: 30000Minimum: 1000Maximum: 60000
asyncboolean
Default: false
webhook_urlstring

Webhook endpoint URL (HTTPS only; http:// rejected with HTTP 422)

Pattern: "^https://"Format: "uri"
eventsarray<string>
Default: ["completed","failed"]
markdownModestring

Markdown processing mode. article=article extraction (default), raw=minimal cleanup, llm=compact LLM-optimised output.

Value in: "article" | "raw" | "llm"
markdownQuerystring

BM25 query string for relevance-ranked content filtering. Omit or leave empty to disable.

Maximum length: 200
markdownLinksstring

Link rendering style in the markdown output.

Value in: "inline" | "references" | "none"
markdownCompactboolean

Collapse excessive whitespace for a more compact output.

markdownFilterImagesboolean

Filter low-signal images from the markdown output.

markdownIncludeMediaboolean

When true, formats.links and formats.images return ScrapeScoredLink[] / ScrapeScoredImage[] (rich objects) instead of string[], and a top-level tables array is included. Only effective when markdown is in formats.

markdownIncludeWarningsboolean

When true, the response includes a top-level warnings array of ScrapeWarning objects. Only effective when markdown is in formats.

markdownIncludeStatsboolean

When true, the response includes a top-level stats object with ScrapeStats (chars, tokens, blocks). Only effective when markdown is in formats.

removeBase64Imagesboolean

When true (default), base64-encoded data: image sources are stripped from the HTML before markdown conversion and from the html format output. Set to false to preserve base64 images in the pipeline output. Does not affect rawHtml, which always returns the original HTML unchanged.

Default: true
cache_ttlstring | integer

How long a freshly fetched URL may be served from cache. 0 (string or integer) skips cache entirely, Nh/Nd set a TTL (capped at 168 h / 7 d). Honoured on the synchronous path only — the async branch accepts the field for forward compatibility but does not currently act on it.

Default: "48h"
customobject

User-supplied JSON payload, echoed back on the success envelope (sync, async, webhook). Capped at 4096 UTF-8 bytes (Buffer.byteLength). Distinct from the system-owned metadata column. Does NOT affect cache-key inputs — two requests differing only in custom resolve to the same cache entry.

Default: {}
curl -X POST "https://api.bytekit.com/v1/scrape" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

Scrape completed synchronously.

{
  "status": "success",
  "scrapeId": "sc_01j9abc123",
  "url": "https://example.com",
  "finalUrl": "https://example.com/",
  "contentType": "text/html; charset=UTF-8",
  "contentLength": 1256,
  "statusCode": 200,
  "retrievedAt": "2025-01-15T12:00:00Z",
  "formats": {
    "rawHtml": "<html><body><h1>Hello World</h1><p>Example page with a <a href=\"/about\">link</a> and an image.</p></body></html>",
    "html": "<html><body><h1>Hello World</h1><p>Example page with a <a href=\"/about\">link</a> and an image.</p></body></html>",
    "markdown": "# Hello World\n\nExample page with a [link](/about) and an image.\n\n![Alt text](https://example.com/photo.png)",
    "links": [
      {
        "url": "https://example.com/about",
        "text": "link",
        "isInternal": true,
        "rel": null
      }
    ],
    "images": [
      {
        "url": "https://example.com/photo.png",
        "alt": "Alt text",
        "width": 800,
        "height": 600,
        "format": "png",
        "score": 0.95
      }
    ]
  },
  "metadata": {
    "title": "Example Domain",
    "description": "An example page for demonstration",
    "language": "en",
    "canonicalUrl": "https://example.com",
    "ogTitle": "Example Domain",
    "ogDescription": null,
    "ogImage": null,
    "byline": "Jane Doe",
    "publishedAt": "2025-01-15T09:00:00Z"
  },
  "tables": [
    {
      "kind": "data",
      "gfm": "| Col1 | Col2 |\n|------|------|\n| A | B |",
      "rows": 2,
      "cols": 2
    }
  ],
  "warnings": [
    {
      "code": "lossy_table",
      "message": "Table at index 3 was simplified",
      "element": "<table>...</table>"
    }
  ],
  "stats": {
    "chars": 1256,
    "tokens": 312,
    "blocks": 4
  }
}