REST API Reference

The Whisper API provides a Deepgram-compatible REST interface for pre-recorded audio transcription and model management.

Endpoints Overview

Method	Endpoint	Description
POST	`/v1/listen`	Transcribe audio (file upload or URL)
GET	`/v1/models`	List available models (no auth)
GET	`/ping`	Health check (no auth)
POST	`/v1/auth/test-token`	Generate test token (dev only)
WS	`/v1/listen`	Live streaming (see Streaming docs)

POST `/v1/listen` — Transcribe Audio

The primary transcription endpoint. Accepts audio either as a binary file upload or as a URL in a JSON body. Upload size is capped by MAX_AUDIO_UPLOAD_BYTES (default 50 MiB).

Request Parameters (Query String)

Parameter	Type	Default	Description
`model`	`string`	`tiny.en`	Model to use. See `GET /v1/models` for options.
`language`	`string`	`en`	BCP-47 language code for transcription.
`prompt`	`string`	`null`	Context/vocabulary prompt to guide the model (e.g., `"TURNIPS, MUTTON"`).
`start`	`integer`	`0`	Offset in milliseconds — skip audio before this point.
`duration`	`integer`	`null`	Maximum duration in milliseconds to process.
`response_format`	`string`	`json`	Response format: `json`, `srt`, or `vtt`.
`diarize`	`boolean`	`false`	Enable speaker separation (best with stereo audio).
`utterances`	`boolean`	`false`	Return speech interval metadata.
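
As a rough illustration of how the parameters above combine into a request URL, here is a minimal Python sketch. The helper name `build_listen_url` is hypothetical, the base URL is taken from the curl examples in this page, and sending booleans as lowercase `true`/`false` is an assumption about how the server parses them.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:7860"  # host and port used by the curl examples


def build_listen_url(base_url=BASE_URL, **params):
    """Return the /v1/listen URL with the given query parameters.

    Parameters set to None are omitted so the server default applies.
    Booleans are sent as "true"/"false" (an assumption, not confirmed
    by this reference).
    """
    query = {}
    for name, value in params.items():
        if value is None:
            continue
        if isinstance(value, bool):
            value = "true" if value else "false"
        query[name] = value
    url = f"{base_url}/v1/listen"
    return f"{url}?{urlencode(query)}" if query else url


print(build_listen_url(model="tiny.en", start=1500, diarize=True))
```

`urlencode` percent-encodes values such as the `prompt` text, so commas and spaces in a vocabulary prompt are safe to pass.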

Request Headers

Header	Value	Required
`Authorization`	`Token <your_api_key>`	Yes
`Content-Type`	`audio/wav`, `audio/mpeg`, `audio/mp4`, etc.	For file upload
`Content-Type`	`application/json`	For URL-based transcription

File Upload

Send raw audio bytes as the request body:

curl -X POST 'http://localhost:7860/v1/listen?model=tiny.en&language=en' \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav
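
The same upload can be sketched in Python with only the standard library. The function name `build_upload_request` is hypothetical, and the size check is a client-side convenience that mirrors the documented 50 MiB default; the limit actually configured on the server may differ.

```python
import urllib.request

# Mirrors the documented default of MAX_AUDIO_UPLOAD_BYTES (50 MiB).
# This is a client-side guess; the server may be configured differently.
DEFAULT_MAX_UPLOAD_BYTES = 50 * 1024 * 1024


def build_upload_request(
    audio_bytes,
    api_key,
    content_type="audio/wav",
    url="http://localhost:7860/v1/listen?model=tiny.en&language=en",
    max_bytes=DEFAULT_MAX_UPLOAD_BYTES,
):
    """Build (but do not send) a POST request carrying raw audio bytes."""
    if len(audio_bytes) > max_bytes:
        raise ValueError(f"audio is {len(audio_bytes)} bytes, limit is {max_bytes}")
    return urllib.request.Request(
        url,
        data=audio_bytes,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )


# Sending the request needs a running server:
# with open("audio.wav", "rb") as f:
#     req = build_upload_request(f.read(), "YOUR_API_KEY")
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(resp.read().decode("utf-8"))
```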

URL-Based Transcription

Send a JSON body with the audio file URL:

curl -X POST 'http://localhost:7860/v1/listen?model=tiny.en' \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.mp3"}'
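
A Python equivalent of the curl call above might look like the following sketch. `build_url_request` is a hypothetical helper name; the only body field it sends is `url`, as in the example.

```python
import json
import urllib.request


def build_url_request(
    audio_url,
    api_key,
    url="http://localhost:7860/v1/listen?model=tiny.en",
):
    """Build (but do not send) a POST request asking the server to fetch audio_url."""
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
    )


# Sending the request needs a running server:
# req = build_url_request("https://example.com/audio.mp3", "YOUR_API_KEY")
# with urllib.request.urlopen(req, timeout=120) as resp:
#     print(resp.read().decode("utf-8"))
```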

URL fetch rules (SSRF protection)

When you pass {"url": "..."}, the server downloads the file on your behalf. To reduce the risk of server-side request forgery (SSRF), where a caller tricks the server into requesting internal addresses, the implementation:

Allows only http and https URLs.
Resolves the hostname and rejects addresses that are not publicly routable (for example loopback, private RFC1918 ranges, link-local, and IPv6 ULA).
Enforces a maximum download size (MAX_AUDIO_DOWNLOAD_BYTES, default 50 MiB).
Does not follow redirects by default (AUDIO_URL_FOLLOW_REDIRECTS=false). If you enable redirects, for example for CDN short links, the server re-checks the host of the final URL after following them.
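
The scheme and address checks described above can be approximated as follows. This is a sketch of the general technique, not the server's actual code: `is_url_allowed` and `resolve_host` are hypothetical names, and treating "publicly routable" as Python's `ipaddress` `is_global` property is an assumption. The size limit and redirect handling are not shown.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def resolve_host(host):
    """Return every IP address the hostname resolves to (uses system DNS)."""
    return [info[4][0] for info in socket.getaddrinfo(host, None)]


def is_url_allowed(url, resolver=resolve_host):
    """Return True only for http(s) URLs whose host resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = parts.hostname
    if not host:
        return False
    try:
        addresses = resolver(host)
    except OSError:  # includes DNS resolution failures
        return False
    if not addresses:
        return False
    for addr in addresses:
        ip = ipaddress.ip_address(addr.split("%")[0])  # drop any IPv6 zone id
        # Loopback, RFC1918, link-local and IPv6 ULA addresses are all non-global.
        if not ip.is_global:
            return False
    return True
```

Rejecting the URL when any resolved address is non-public is the conservative choice. A check like this is only effective if the download then connects to one of the validated addresses rather than resolving the name a second time.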

JSON Response Schema

When response_format=json (default), the response follows the Deepgram format:

{
  "metadata": {
    "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "created": "2026-03-29T10:30:00.000000Z",
    "duration": 10.43,
    "channels": 1,
    "sha256": "abc123def456..."
  },
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "And so my fellow Americans ask not what your country can do for you ask what you can do for your country",
            "confidence": 0.98,
            "words": [
              {
                "word": "and",
                "start": 0.0,
                "end": 0.32,
                "confidence": 0.97
              },
              {
                "word": "so",
                "start": 0.32,
                "end": 0.56,
                "confidence": 0.99
              }
            ]
          }
        ]
      }
    ]
  }
}
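
To show how a client might read this structure, the sketch below pulls the transcript and word timings out of a response body. `extract_transcript` is a hypothetical helper; it assumes the first alternative of the chosen channel is the one you want.

```python
import json


def extract_transcript(response_text, channel=0):
    """Return (transcript, [(word, start, end), ...]) for one channel."""
    data = json.loads(response_text)
    best = data["results"]["channels"][channel]["alternatives"][0]
    words = [(w["word"], w["start"], w["end"]) for w in best.get("words", [])]
    return best["transcript"], words


sample = json.dumps({
    "metadata": {"duration": 10.43, "channels": 1},
    "results": {"channels": [{"alternatives": [{
        "transcript": "And so",
        "confidence": 0.98,
        "words": [
            {"word": "and", "start": 0.0, "end": 0.32, "confidence": 0.97},
            {"word": "so", "start": 0.32, "end": 0.56, "confidence": 0.99},
        ],
    }]}]},
})

text, words = extract_transcript(sample)
print(text)
for word, start, end in words:
    print(f"{start:.2f}-{end:.2f} {word}")
```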

SRT Response

When response_format=srt, raw subtitle text is returned:

1
00:00:00,000 --> 00:00:03,200
And so my fellow Americans

2
00:00:03,200 --> 00:00:06,800
ask not what your country can do for you

3
00:00:06,800 --> 00:00:10,430
ask what you can do for your country

VTT Response

When response_format=vtt, WebVTT text is returned:

WEBVTT

00:00:00.000 --> 00:00:03.200
And so my fellow Americans

00:00:03.200 --> 00:00:06.800
ask not what your country can do for you

00:00:06.800 --> 00:00:10.430
ask what you can do for your country
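
In the two samples above, the cue timestamps differ only in the millisecond separator: a comma for SRT and a period for VTT. The sketch below converts a time in seconds (as used in the JSON response) to either form; `format_timestamp` is a hypothetical helper, not part of the API.

```python
def format_timestamp(seconds, separator=","):
    """Format seconds as HH:MM:SS<sep>mmm; use "," for SRT and "." for VTT."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


print(format_timestamp(10.43))       # 00:00:10,430
print(format_timestamp(3.2, "."))    # 00:00:03.200
```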

GET `/v1/models` — List Models

Returns metadata for each .bin model file present under MODELS_DIR. No authentication required.

curl -X GET 'http://localhost:7860/v1/models'

Response:

{
  "models": [
    {
      "name": "whisper-tiny.en",
      "model_id": "tiny.en",
      "description": "Tiny English-only Whisper model (~75MB). Fastest inference, good for English.",
      "language": "en",
      "version": "ggml-v1",
      "file": "ggml-tiny.en.bin",
      "size_bytes": 78643200
    }
  ],
  "count": 1
}
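
A client could use this response to check that a model exists before passing it as the `model` query parameter. The sketch below works on an already-parsed response; `available_model_ids` is a hypothetical helper name.

```python
def available_model_ids(models_response):
    """Return the model_id of every entry in a parsed /v1/models response."""
    return [m["model_id"] for m in models_response.get("models", [])]


parsed = {
    "models": [
        {
            "name": "whisper-tiny.en",
            "model_id": "tiny.en",
            "language": "en",
            "file": "ggml-tiny.en.bin",
            "size_bytes": 78643200,
        }
    ],
    "count": 1,
}

ids = available_model_ids(parsed)
print(ids)
print("tiny.en" in ids)
```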

GET `/ping` — Health Check

A simple health check endpoint (no authentication required):

curl http://localhost:7860/ping

{
  "ping": "pong",
  "status": "healthy"
}

Error Responses

Status Code	Description
`400`	Bad request — invalid parameters, unsupported format, URL policy violation, or download failure
`401`	Unauthorized — missing or invalid API key
`413`	Payload too large — upload body exceeds `MAX_AUDIO_UPLOAD_BYTES`
`415`	Unsupported media type — unrecognized content type
`422`	Validation error — malformed request body
`500`	Internal server error — transcription or conversion failed
`504`	Gateway timeout — `whisper-cli` or ffmpeg exceeded configured timeout

Concurrent transcriptions wait on an internal limiter (MAX_CONCURRENT_TRANSCRIPTIONS). Requests beyond the limit are queued rather than rejected with 503, so a queued request fails only if the client or a proxy times out before its turn comes.
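
The queue-instead-of-reject behavior can be illustrated with a semaphore. This is a generic sketch of the pattern using `asyncio`, not the server's actual limiter; `run_limited` and `fake_transcription` are hypothetical names, and the limit of 2 is an arbitrary example value.

```python
import asyncio


async def run_limited(jobs, max_concurrent):
    """Run coroutine functions with at most max_concurrent active at once.

    Returns the results in order plus the peak number of active jobs.
    """
    limiter = asyncio.Semaphore(max_concurrent)
    active = 0
    peak = 0

    async def worker(job):
        nonlocal active, peak
        async with limiter:  # extra callers wait here instead of failing
            active += 1
            peak = max(peak, active)
            try:
                return await job()
            finally:
                active -= 1

    results = await asyncio.gather(*(worker(job) for job in jobs))
    return results, peak


async def fake_transcription():
    await asyncio.sleep(0.01)  # stands in for the real transcription work
    return "ok"


results, peak = asyncio.run(run_limited([fake_transcription] * 5, max_concurrent=2))
print(results, peak)
```

Because waiting callers hold their connection open, client and proxy timeouts should be set with the expected queueing delay in mind.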

Error Response Format

{
  "detail": "Invalid or missing API key"
}
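
A client can surface the `detail` field when a request fails. The sketch below is one way to do that; `describe_error` is a hypothetical helper, and it falls back to the raw body when the response is not a JSON object (for example, an error page generated by a proxy).

```python
import json


def describe_error(status, body_text):
    """Return a one-line description built from an error response."""
    try:
        detail = json.loads(body_text).get("detail", body_text)
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object: show the body unchanged.
        detail = body_text
    return f"HTTP {status}: {detail}"


print(describe_error(401, '{"detail": "Invalid or missing API key"}'))
```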
