The Whisper API provides a Deepgram-compatible REST interface for pre-recorded audio transcription and model management.
Endpoints Overview
| Method | Endpoint | Description |
|---|---|---|
| POST | /v1/listen | Transcribe audio (file upload or URL) |
| GET | /v1/models | List available models (no auth) |
| GET | /ping | Health check |
| POST | /v1/auth/test-token | Generate test token (dev only) |
| WS | /v1/listen | Live streaming (see Streaming docs) |
POST /v1/listen — Transcribe Audio
The primary transcription endpoint. Accepts audio either as a binary file upload or as a URL in a JSON body. Upload size is capped by MAX_AUDIO_UPLOAD_BYTES (default 50 MiB).
Request Parameters (Query String)
| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | tiny.en | Model to use. See GET /v1/models for options. |
| language | string | en | BCP-47 language code for transcription. |
| prompt | string | null | Context/vocabulary prompt to guide the model (e.g., "TURNIPS, MUTTON"). |
| start | integer | 0 | Offset in milliseconds; audio before this point is skipped. |
| duration | integer | null | Maximum duration in milliseconds to process. |
| response_format | string | json | Response format: json, srt, or vtt. |
| diarize | boolean | false | Enable speaker separation (best with stereo audio). |
| utterances | boolean | false | Return speech interval metadata. |
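As an illustration of how the parameters above combine into a query string, here is a minimal Python sketch. The helper name `build_listen_query` is hypothetical, and the `true`/`false` spelling for booleans is an assumption; the table does not say which spellings the server accepts.

```python
from urllib.parse import urlencode

# Defaults copied from the parameter table above.
DEFAULTS = {
    "model": "tiny.en",
    "language": "en",
    "prompt": None,
    "start": 0,
    "duration": None,
    "response_format": "json",
    "diarize": False,
    "utterances": False,
}


def build_listen_query(**overrides):
    """Return a query string with only the parameters that differ from the defaults."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    if overrides.get("response_format", "json") not in ("json", "srt", "vtt"):
        raise ValueError("response_format must be json, srt, or vtt")

    pairs = []
    for name, value in overrides.items():
        if value is None or value == DEFAULTS[name]:
            continue  # omit unset values and values equal to the default
        if isinstance(value, bool):
            value = "true" if value else "false"  # assumed boolean spelling
        pairs.append((name, value))
    return urlencode(pairs)


print(build_listen_query(language="de", diarize=True, start=1500))
# language=de&diarize=true&start=1500
```

Omitting values that match the defaults keeps URLs short, and `urlencode` takes care of escaping free-text values such as `prompt`.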
Request Headers
| Header | Value | Required |
|---|---|---|
| Authorization | Token <your_api_key> | Yes |
| Content-Type | audio/wav, audio/mpeg, audio/mp4, etc. | For file upload |
| Content-Type | application/json | For URL-based transcription |
File Upload
Send raw audio bytes as the request body:
curl -X POST 'http://localhost:7860/v1/listen?model=tiny.en&language=en' \
-H "Authorization: Token YOUR_API_KEY" \
-H "Content-Type: audio/wav" \
--data-binary @audio.wav
URL-Based Transcription
Send a JSON body with the audio file URL:
curl -X POST 'http://localhost:7860/v1/listen?model=tiny.en' \
-H "Authorization: Token YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com/audio.mp3"}'
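The two request styles above can also be built from Python. The sketch below uses only the standard library and constructs the request without sending it. The helper name `build_listen_request` is hypothetical, and the client-side size check simply mirrors the documented default of MAX_AUDIO_UPLOAD_BYTES; a server configured with a different cap would behave differently.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # host and port taken from the curl examples
MAX_UPLOAD_BYTES = 50 * 1024 * 1024  # documented default of MAX_AUDIO_UPLOAD_BYTES


def build_listen_request(api_key, query="", *, audio=None, content_type=None, url=None):
    """Build (but do not send) a POST /v1/listen request for raw bytes or a URL."""
    if (audio is None) == (url is None):
        raise ValueError("pass exactly one of audio= or url=")
    if audio is not None:
        if content_type is None:
            raise ValueError("content_type is required for file uploads")
        if len(audio) > MAX_UPLOAD_BYTES:
            raise ValueError("audio exceeds the default upload cap; expect HTTP 413")
        body = audio
    else:
        # URL-based transcription always uses a JSON body.
        body = json.dumps({"url": url}).encode("utf-8")
        content_type = "application/json"

    endpoint = f"{BASE_URL}/v1/listen" + (f"?{query}" if query else "")
    return urllib.request.Request(
        endpoint,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": content_type,
        },
    )


request = build_listen_request(
    "YOUR_API_KEY", "model=tiny.en", url="https://example.com/audio.mp3"
)
print(request.get_method(), request.full_url)
# POST http://localhost:7860/v1/listen?model=tiny.en
```

To send it, pass the request to `urllib.request.urlopen(request, timeout=...)`. A generous timeout is advisable because transcription requests may wait in the server's queue.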
URL fetch rules (SSRF protection)
When you pass {"url": "..."}, the server downloads the file for you. To reduce SSRF risk (callers tricking your server into hitting internal addresses), the implementation:
- Allows only http and https URLs.
- Resolves the hostname and rejects addresses that are not publicly routable (for example loopback, private RFC 1918 ranges, link-local, and IPv6 ULA).
- Enforces a maximum download size (MAX_AUDIO_DOWNLOAD_BYTES, default 50 MiB).
- By default does not follow redirects (AUDIO_URL_FOLLOW_REDIRECTS=false). If you enable redirects for CDN short links, the server re-checks the final URL host after redirects.
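The scheme and address rules above can be approximated in a few lines of Python. This is a sketch of the general technique, not the server's actual code: the function names are hypothetical, and it relies on `ipaddress`'s `is_global` property as a stand-in for "publicly routable".

```python
import ipaddress
import socket
from urllib.parse import urlsplit


def is_public_address(addr):
    """True if the IP literal is globally routable (not loopback, private, link-local, ULA)."""
    return ipaddress.ip_address(addr).is_global


def check_audio_url(url):
    """Raise ValueError if the URL breaks the scheme or address rules; return resolved IPs."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"scheme not allowed: {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("URL has no hostname")

    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Resolve every A/AAAA record; one non-public address is enough to reject.
    infos = socket.getaddrinfo(parts.hostname, port, type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    for addr in addresses:
        if not is_public_address(addr):
            raise ValueError(f"{parts.hostname} resolves to non-public address {addr}")
    return addresses


print(is_public_address("8.8.8.8"))    # True
print(is_public_address("127.0.0.1"))  # False
print(is_public_address("10.0.0.5"))   # False
```

A check like this is only sound if the download then connects to one of the addresses that were validated. Resolving the name a second time at connect time would reopen the door to DNS rebinding.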
JSON Response Schema
When response_format=json (default), the response follows the Deepgram format:
{
  "metadata": {
    "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
    "created": "2026-03-29T10:30:00.000000Z",
    "duration": 10.43,
    "channels": 1,
    "sha256": "abc123def456..."
  },
  "results": {
    "channels": [
      {
        "alternatives": [
          {
            "transcript": "And so my fellow Americans ask not what your country can do for you ask what you can do for your country",
            "confidence": 0.98,
            "words": [
              {
                "word": "and",
                "start": 0.0,
                "end": 0.32,
                "confidence": 0.97
              },
              {
                "word": "so",
                "start": 0.32,
                "end": 0.56,
                "confidence": 0.99
              }
            ]
          }
        ]
      }
    ]
  }
}
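A client typically needs only the transcript and word timings from this structure. The following sketch shows one way to pull them out; the helper names are hypothetical, the sample is a trimmed copy of the response above, and the word timestamps are assumed to be in seconds (consistent with the 10.43 duration shown).

```python
def first_alternative(response):
    """Return the first alternative of the first channel from a JSON response."""
    return response["results"]["channels"][0]["alternatives"][0]


def words_between(response, start, end):
    """Return the words whose [start, end] interval lies inside the window (seconds)."""
    words = first_alternative(response).get("words", [])
    return [w["word"] for w in words if w["start"] >= start and w["end"] <= end]


# Trimmed copy of the sample response above.
sample = {
    "metadata": {"duration": 10.43, "channels": 1},
    "results": {"channels": [{"alternatives": [{
        "transcript": "And so",
        "confidence": 0.98,
        "words": [
            {"word": "and", "start": 0.0, "end": 0.32, "confidence": 0.97},
            {"word": "so", "start": 0.32, "end": 0.56, "confidence": 0.99},
        ],
    }]}]},
}

print(first_alternative(sample)["transcript"])  # And so
print(words_between(sample, 0.3, 1.0))          # ['so']
```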
SRT Response
When response_format=srt, raw subtitle text is returned:
1
00:00:00,000 --> 00:00:03,200
And so my fellow Americans

2
00:00:03,200 --> 00:00:06,800
ask not what your country can do for you

3
00:00:06,800 --> 00:00:10,430
ask what you can do for your country
VTT Response
When response_format=vtt:
WEBVTT

00:00:00.000 --> 00:00:03.200
And so my fellow Americans

00:00:03.200 --> 00:00:06.800
ask not what your country can do for you

00:00:06.800 --> 00:00:10.430
ask what you can do for your country
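The SRT and VTT samples differ only in the millisecond separator, the cue numbering, and the WEBVTT header. For clients that prefer to request JSON and render subtitles themselves, the sketch below shows how such output could be produced from (start, end, text) segments. It is an illustration with hypothetical helper names, not the server's rendering code.

```python
def format_timestamp(seconds, fmt="srt"):
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    total_ms = round(seconds * 1000)  # round to avoid float artifacts such as 10429.999
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    sep = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{millis:03d}"


def to_cues(segments, fmt="srt"):
    """Render (start, end, text) tuples as an SRT or WebVTT document."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, 1):
        timing = f"{format_timestamp(start, fmt)} --> {format_timestamp(end, fmt)}"
        # SRT cues carry a sequence number; it is omitted here for WebVTT.
        lines = [str(index), timing, text] if fmt == "srt" else [timing, text]
        blocks.append("\n".join(lines))
    header = "" if fmt == "srt" else "WEBVTT\n\n"
    return header + "\n\n".join(blocks) + "\n"


print(format_timestamp(10.43))        # 00:00:10,430
print(format_timestamp(3.2, "vtt"))   # 00:00:03.200
```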
GET /v1/models — List Models
Returns metadata for each .bin model file present under MODELS_DIR. No authentication required.
curl -X GET 'http://localhost:7860/v1/models'
Response:
{
  "models": [
    {
      "name": "whisper-tiny.en",
      "model_id": "tiny.en",
      "description": "Tiny English-only Whisper model (~75MB). Fastest inference, good for English.",
      "language": "en",
      "version": "ggml-v1",
      "file": "ggml-tiny.en.bin",
      "size_bytes": 78643200
    }
  ],
  "count": 1
}
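Before sending a transcription request, a client may want to confirm that the model it intends to use is installed. The sketch below works on a decoded response like the one above. The helper names are hypothetical, and it assumes the model query parameter of POST /v1/listen takes the model_id value, which the shared tiny.en default suggests but the text does not state outright.

```python
def find_model(listing, model_id):
    """Return the entry whose model_id matches, or None if it is not installed."""
    for entry in listing.get("models", []):
        if entry.get("model_id") == model_id:
            return entry
    return None


def size_mib(entry):
    """Convert size_bytes to MiB, rounded to one decimal place."""
    return round(entry["size_bytes"] / (1024 * 1024), 1)


# Decoded copy of the sample response above (description omitted).
listing = {
    "models": [
        {
            "name": "whisper-tiny.en",
            "model_id": "tiny.en",
            "language": "en",
            "version": "ggml-v1",
            "file": "ggml-tiny.en.bin",
            "size_bytes": 78643200,
        }
    ],
    "count": 1,
}

entry = find_model(listing, "tiny.en")
print(entry["file"], size_mib(entry))  # ggml-tiny.en.bin 75.0
```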
GET /ping — Health Check
A simple health check endpoint (no authentication required):
curl http://localhost:7860/ping
{
  "ping": "pong",
  "status": "healthy"
}
Error Responses
| Status Code | Description |
|---|---|
| 400 | Bad request: invalid parameters, unsupported format, URL policy violation, or download failure |
| 401 | Unauthorized: missing or invalid API key |
| 413 | Payload too large: upload body exceeds MAX_AUDIO_UPLOAD_BYTES |
| 415 | Unsupported media type: unrecognized content type |
| 422 | Validation error: malformed request body |
| 500 | Internal server error: transcription or conversion failed |
| 504 | Gateway timeout: whisper-cli or ffmpeg exceeded configured timeout |
Concurrent transcriptions are bounded by an internal limiter (MAX_CONCURRENT_TRANSCRIPTIONS). Requests over the limit are queued rather than rejected with 503, so a waiting request is dropped only if the client or proxy times out before its turn.
Error Response Format
{
  "detail": "Invalid or missing API key"
}
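On the client side, the status code and the detail field can be combined into a single readable message. The sketch below is one possible approach with a hypothetical helper name. It assumes error bodies are JSON objects with a detail key, as in the example above, and falls back gracefully when they are not.

```python
import json

# Status codes and meanings copied from the error table above.
STATUS_MEANINGS = {
    400: "bad request",
    401: "unauthorized",
    413: "payload too large",
    415: "unsupported media type",
    422: "validation error",
    500: "internal server error",
    504: "gateway timeout",
}


def describe_error(status, body):
    """Combine the documented meaning of a status code with the body's detail field."""
    meaning = STATUS_MEANINGS.get(status, "unexpected status")
    try:
        detail = json.loads(body).get("detail")
    except (ValueError, AttributeError):
        detail = None  # body was empty, not JSON, or not a JSON object
    if detail is None:
        return f"{status} {meaning}"
    return f"{status} {meaning}: {detail}"


print(describe_error(401, b'{"detail": "Invalid or missing API key"}'))
# 401 unauthorized: Invalid or missing API key
print(describe_error(504, b""))
# 504 gateway timeout
```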