Whisper API

The Whisper API provides a Deepgram-compatible REST interface for pre-recorded audio transcription and model management.


Endpoints Overview

| Method | Endpoint | Description |
|--------|----------|-------------|
| POST | /v1/listen | Transcribe audio (file upload or URL) |
| GET | /v1/models | List available models (no auth) |
| GET | /ping | Health check |
| POST | /v1/auth/test-token | Generate test token (dev only) |
| WS | /v1/listen | Live streaming (see Streaming docs) |

POST /v1/listen — Transcribe Audio

The primary transcription endpoint. Accepts audio either as a binary file upload or as a URL in a JSON body. Upload size is capped by MAX_AUDIO_UPLOAD_BYTES (default 50 MiB).
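A client can check a file against the upload cap before sending it and so avoid a 413 response. The sketch below assumes the server runs with the default 50 MiB limit; the function name is hypothetical, and a deployment that overrides MAX_AUDIO_UPLOAD_BYTES would need the matching value.

```python
import os

# Default cap from the docs: 50 MiB. A deployment may override it.
MAX_AUDIO_UPLOAD_BYTES = 50 * 1024 * 1024


def fits_upload_limit(size_bytes, limit=MAX_AUDIO_UPLOAD_BYTES):
    """Return True when a body of size_bytes does not exceed the cap."""
    return size_bytes <= limit


def file_fits_upload_limit(path, limit=MAX_AUDIO_UPLOAD_BYTES):
    """Same check, using the on-disk size of a local file."""
    return fits_upload_limit(os.path.getsize(path), limit)


print(MAX_AUDIO_UPLOAD_BYTES)
```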

Request Parameters (Query String)

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| model | string | tiny.en | Model to use. See GET /v1/models for options. |
| language | string | en | BCP-47 language code for transcription. |
| prompt | string | null | Context/vocabulary prompt to guide the model (e.g., "TURNIPS, MUTTON"). |
| start | integer | 0 | Offset in milliseconds; audio before this point is skipped. |
| duration | integer | null | Maximum duration in milliseconds to process. |
| response_format | string | json | Response format: json, srt, or vtt. |
| diarize | boolean | false | Enable speaker separation (best with stereo audio). |
| utterances | boolean | false | Return speech interval metadata. |
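As an illustration of how these parameters combine into a query string, here is a minimal Python sketch. The helper name listen_url is hypothetical, the base URL mirrors the local address used in the curl examples, and sending booleans as lowercase true/false is an assumption about how the server parses them.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:7860"  # assumed local deployment, as in the curl examples


def listen_url(base_url=BASE_URL, **params):
    """Build a /v1/listen URL, leaving out parameters whose value is None."""
    query = {}
    for key, value in params.items():
        if value is None:
            continue  # omit the parameter so the server applies its default
        # Assumption: booleans are sent as lowercase "true"/"false".
        query[key] = str(value).lower() if isinstance(value, bool) else value
    qs = urlencode(query)
    return f"{base_url}/v1/listen" + (f"?{qs}" if qs else "")


url = listen_url(model="tiny.en", language="en", start=1500,
                 duration=None, utterances=True)
print(url)
```

Here start=1500 asks the server to skip the first 1.5 seconds, and duration is left out so no upper bound is applied.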

Request Headers

| Header | Value | Required |
|--------|-------|----------|
| Authorization | Token <your_api_key> | Yes |
| Content-Type | audio/wav, audio/mpeg, audio/mp4, etc. | For file upload |
| Content-Type | application/json | For URL-based transcription |

File Upload

Send raw audio bytes as the request body:

curl -X POST 'http://localhost:7860/v1/listen?model=tiny.en&language=en' \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: audio/wav" \
  --data-binary @audio.wav

URL-Based Transcription

Send a JSON body with the audio file URL:

curl -X POST 'http://localhost:7860/v1/listen?model=tiny.en' \
  -H "Authorization: Token YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/audio.mp3"}'
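The two request styles differ only in the body and the Content-Type header. The Python sketch below builds both with the standard library; the function names are hypothetical and the URL assumes the same local server as the curl examples. Building a request does not touch the network; only send does.

```python
import json
import urllib.request

API_URL = "http://localhost:7860/v1/listen?model=tiny.en"  # assumed local deployment


def upload_request(audio_bytes, api_key, content_type="audio/wav"):
    """Request whose body is the raw audio bytes (file upload mode)."""
    return urllib.request.Request(
        API_URL,
        data=audio_bytes,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )


def url_request(audio_url, api_key):
    """Request whose body is a JSON object naming the audio URL."""
    body = json.dumps({"url": audio_url}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=120):
    """Send the request and decode a JSON response (response_format=json)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A call such as send(url_request("https://example.com/audio.mp3", "YOUR_API_KEY")) would then return the decoded JSON response, provided the server is reachable.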

URL fetch rules (SSRF protection)

When you pass {"url": "..."}, the server downloads the file for you. To reduce SSRF risk (callers tricking your server into hitting internal addresses), the implementation:

  • Allows only http and https URLs.
  • Resolves the hostname and rejects addresses that are not publicly routable (for example loopback, private RFC1918 ranges, link-local, and IPv6 ULA).
  • Enforces a maximum download size (MAX_AUDIO_DOWNLOAD_BYTES, default 50 MiB).
  • By default does not follow redirects (AUDIO_URL_FOLLOW_REDIRECTS=false). If you enable redirects for CDN short links, the server re-checks the final URL host after redirects.
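The scheme and address rules above can be sketched in Python as follows. This illustrates the idea and is not the server's actual code: the function names are hypothetical, and "publicly routable" is approximated here with the standard library's is_global property. The download size cap and redirect handling are not covered.

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_ip(address):
    """True only for globally routable addresses (not loopback, private, link-local, ULA)."""
    return ipaddress.ip_address(address).is_global


def resolve_host(hostname):
    """Return every IP address the hostname resolves to."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    return [info[4][0] for info in infos]


def check_audio_url(url, resolver=resolve_host):
    """Raise ValueError if the URL breaks the fetch rules; otherwise return its addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError("only http and https URLs are allowed")
    if not parts.hostname:
        raise ValueError("URL has no hostname")
    addresses = resolver(parts.hostname)
    if not addresses:
        raise ValueError("hostname did not resolve")
    for address in addresses:
        # Reject if any resolved address is non-public, not just the first one.
        if not is_public_ip(address):
            raise ValueError(f"{address} is not publicly routable")
    return addresses
```

The resolver argument exists so the check can be exercised without DNS. A production fetcher would also need to connect to the address it validated, since a second lookup could return a different one.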

JSON Response Schema

When response_format=json (default), the response follows the Deepgram format:

{
  "metadata": {
    "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "created": "2026-03-29T10:30:00.000000Z",
    "duration": 10.43,
    "channels": 1,
    "sha256": "abc123def456..."
  },
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "And so my fellow Americans ask not what your country can do for you ask what you can do for your country",
            "confidence": 0.98,
            "words": [
              {
                "word": "and",
                "start": 0.0,
                "end": 0.32,
                "confidence": 0.97
              },
              {
                "word": "so",
                "start": 0.32,
                "end": 0.56,
                "confidence": 0.99
              }
            ]
          }
        ]
      }
    ]
  }
}
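To show how a client might walk this structure, the Python sketch below pulls the transcript, confidence, and word timings from the first channel's first alternative. The function name is hypothetical, and the sample is a shortened copy of the response above.

```python
import json


def summarize(response):
    """Return (transcript, confidence, words) for the first channel's first alternative."""
    alternative = response["results"]["channels"][0]["alternatives"][0]
    # Each word becomes a (text, start_seconds, end_seconds) tuple.
    words = [(w["word"], w["start"], w["end"]) for w in alternative.get("words", [])]
    return alternative["transcript"], alternative["confidence"], words


sample = json.loads("""
{
  "metadata": {"duration": 10.43, "channels": 1},
  "results": {"channels": [{"alternatives": [{
    "transcript": "And so",
    "confidence": 0.98,
    "words": [
      {"word": "and", "start": 0.0, "end": 0.32, "confidence": 0.97},
      {"word": "so", "start": 0.32, "end": 0.56, "confidence": 0.99}
    ]
  }]}]}
}
""")

transcript, confidence, words = summarize(sample)
print(transcript, confidence, words)
```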

SRT Response

When response_format=srt, raw subtitle text is returned:

1
00:00:00,000 --> 00:00:03,200
And so my fellow Americans

2
00:00:03,200 --> 00:00:06,800
ask not what your country can do for you

3
00:00:06,800 --> 00:00:10,430
ask what you can do for your country

VTT Response

When response_format=vtt:

WEBVTT

00:00:00.000 --> 00:00:03.200
And so my fellow Americans

00:00:03.200 --> 00:00:06.800
ask not what your country can do for you

00:00:06.800 --> 00:00:10.430
ask what you can do for your country
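The SRT and VTT bodies above differ mainly in the millisecond separator (comma versus period), the numbered cues in SRT, and the WEBVTT header. The Python sketch below renders both from the same (start, end, text) segments, with times in seconds. The function names are hypothetical, and this is not necessarily how the server produces its output.

```python
def format_timestamp(seconds, separator=","):
    """Format seconds as HH:MM:SS<sep>mmm, rounding to whole milliseconds."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{separator}{ms:03d}"


def to_srt(segments):
    """Render numbered SRT cues separated by blank lines."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{format_timestamp(start)} --> {format_timestamp(end)}\n{text}"
        )
    return "\n\n".join(blocks) + "\n"


def to_vtt(segments):
    """Render WebVTT cues; timestamps use a period before the milliseconds."""
    cues = [
        f"{format_timestamp(start, '.')} --> {format_timestamp(end, '.')}\n{text}"
        for start, end, text in segments
    ]
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


segments = [
    (0.0, 3.2, "And so my fellow Americans"),
    (3.2, 6.8, "ask not what your country can do for you"),
    (6.8, 10.43, "ask what you can do for your country"),
]
print(to_srt(segments))
print(to_vtt(segments))
```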

GET /v1/models — List Models

Returns metadata for each .bin model file present under MODELS_DIR. No authentication required.

curl -X GET 'http://localhost:7860/v1/models'

Response:

{
  "models": [
    {
      "name": "whisper-tiny.en",
      "model_id": "tiny.en",
      "description": "Tiny English-only Whisper model (~75MB). Fastest inference, good for English.",
      "language": "en",
      "version": "ggml-v1",
      "file": "ggml-tiny.en.bin",
      "size_bytes": 78643200
    }
  ],
  "count": 1
}
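A client can use this listing to confirm that a model is installed before passing its model_id to /v1/listen. The Python sketch below uses hypothetical helper names and a shortened copy of the response above.

```python
def find_model(models_response, model_id):
    """Return the entry whose model_id matches, or None if it is not installed."""
    for model in models_response["models"]:
        if model["model_id"] == model_id:
            return model
    return None


def size_mib(model):
    """Convert size_bytes to MiB."""
    return model["size_bytes"] / (1024 * 1024)


listing = {
    "models": [
        {
            "name": "whisper-tiny.en",
            "model_id": "tiny.en",
            "language": "en",
            "file": "ggml-tiny.en.bin",
            "size_bytes": 78643200,
        }
    ],
    "count": 1,
}

model = find_model(listing, "tiny.en")
print(model["file"], size_mib(model))
```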

GET /ping — Health Check

A simple health check endpoint (no authentication required):

curl http://localhost:7860/ping

Response:

{
  "ping": "pong",
  "status": "healthy"
}

Error Responses

| Status Code | Description |
|-------------|-------------|
| 400 | Bad request: invalid parameters, unsupported format, URL policy violation, or download failure |
| 401 | Unauthorized: missing or invalid API key |
| 413 | Payload too large: upload body exceeds MAX_AUDIO_UPLOAD_BYTES |
| 415 | Unsupported media type: unrecognized content type |
| 422 | Validation error: malformed request body |
| 500 | Internal server error: transcription or conversion failed |
| 504 | Gateway timeout: whisper-cli or ffmpeg exceeded configured timeout |

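Concurrent transcriptions are gated by an internal limiter (MAX_CONCURRENT_TRANSCRIPTIONS). When the limit is reached, additional requests wait in a queue rather than being rejected with 503. A queued request fails only if the client or an intermediate proxy times out before its turn comes.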
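The queue-instead-of-reject behavior can be pictured with a semaphore. The Python sketch below illustrates the pattern only; it is not the server's code. The limit of 2 is an arbitrary example value, and the sleep stands in for the real transcription work.

```python
import asyncio

MAX_CONCURRENT_TRANSCRIPTIONS = 2  # illustrative value only


async def transcribe(job_id, limiter, stats):
    """Wait for a free slot, then run one simulated transcription."""
    async with limiter:  # requests beyond the limit wait here instead of failing
        stats["running"] += 1
        stats["peak"] = max(stats["peak"], stats["running"])
        await asyncio.sleep(0.01)  # stand-in for the actual transcription
        stats["running"] -= 1
    return job_id


async def run_jobs(count):
    """Start all jobs at once and report results plus the peak concurrency seen."""
    limiter = asyncio.Semaphore(MAX_CONCURRENT_TRANSCRIPTIONS)
    stats = {"running": 0, "peak": 0}
    results = await asyncio.gather(
        *(transcribe(i, limiter, stats) for i in range(count))
    )
    return results, stats["peak"]


results, peak = asyncio.run(run_jobs(5))
print(results, peak)
```

All five jobs complete, but no more than two run at the same time; the rest wait for a slot.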

Error Response Format

{
  "detail": "Invalid or missing API key"
}
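A client can pair the status code with the detail field to produce a readable error. The Python sketch below uses a hypothetical exception class and helper. It tolerates bodies that are not JSON and detail values that are not plain strings, since the docs only show the string form.

```python
import json

# Short meanings taken from the status code table.
STATUS_MEANINGS = {
    400: "Bad request",
    401: "Unauthorized",
    413: "Payload too large",
    415: "Unsupported media type",
    422: "Validation error",
    500: "Internal server error",
    504: "Gateway timeout",
}


class TranscriptionError(Exception):
    """Carries the HTTP status and the server's detail message, if any."""

    def __init__(self, status, detail):
        self.status = status
        self.detail = detail
        meaning = STATUS_MEANINGS.get(status, "Unexpected error")
        super().__init__(f"{status} {meaning}: {detail}" if detail else f"{status} {meaning}")


def parse_error(status, body):
    """Build a TranscriptionError from a status code and raw response body."""
    try:
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):
        detail = None  # body was not JSON, or not a JSON object
    if detail is not None and not isinstance(detail, str):
        detail = json.dumps(detail)  # keep structured detail readable
    return TranscriptionError(status, detail)


error = parse_error(401, b'{"detail": "Invalid or missing API key"}')
print(error)
```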