OpenAI Whisper is an open-source automatic speech recognition (ASR) model trained on 680,000 hours of multilingual audio. It can transcribe speech in 99 languages and translate it into English, with accuracy that approaches human transcribers on clean audio. Unlike cloud ASR services, Whisper runs entirely locally: no API keys, no per-minute billing, no data leaving your machine. In 2026, faster-whisper (built on CTranslate2) runs Whisper at roughly 4-8x the speed of the original implementation on CPU while using less memory, making near-real-time transcription feasible on consumer hardware.
This guide covers installation, transcription with word-level timestamps, language detection, audio translation, speaker diarisation with pyannote, real-time streaming transcription, and building a FastAPI transcription service.
Whisper is available in five sizes (tiny, base, small, medium, and large-v3) that trade speed and memory for accuracy. The large-v3 model achieves the best accuracy but needs roughly 10 GB of VRAM. For most production use cases, medium (about 5 GB VRAM) hits a good accuracy/speed balance. Install the original package via pip, or use faster-whisper for CTranslate2-optimised inference.
# Official OpenAI Whisper
pip install openai-whisper
# System ffmpeg (required: Whisper calls the ffmpeg binary to decode audio;
# the ffmpeg-python package is not needed by current releases)
# Ubuntu: sudo apt install ffmpeg
# macOS: brew install ffmpeg
# Windows: winget install ffmpeg
# Faster-Whisper (recommended for production)
pip install faster-whisper
# Model sizes and approximate VRAM requirements:
# tiny     (~39M params,   ~1 GB VRAM, fastest)
# base     (~74M params,   ~1 GB VRAM)
# small    (~244M params,  ~2 GB VRAM)
# medium   (~769M params,  ~5 GB VRAM)
# large-v3 (~1550M params, ~10 GB VRAM, best accuracy)
As a rule of thumb on CPU, use tiny or base for near-real-time transcription. medium on CPU takes roughly 3-5x the audio duration; faster-whisper with INT8 quantisation cuts this by about 4x.
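The sizing guidance above can be turned into a small helper. This is only an illustrative sketch: `pick_model` and `estimate_cpu_seconds` are hypothetical names, the VRAM figures are the approximate ones from the table above, and the slowdown factors restate the rough numbers quoted in this section rather than measured benchmarks.

```python
# Approximate VRAM needs in GB, copied from the table above (not measured).
# Ordered from largest to smallest so the first fit is the most accurate model.
MODEL_VRAM_GB = [
    ("large-v3", 10),
    ("medium", 5),
    ("small", 2),
    ("base", 1),
    ("tiny", 1),
]


def pick_model(available_vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits."""
    for name, needed_gb in MODEL_VRAM_GB:
        if available_vram_gb >= needed_gb:
            return name
    # Nothing fits: fall back to the smallest model (it may run on CPU instead).
    return "tiny"


def estimate_cpu_seconds(audio_seconds: float,
                         slowdown: float = 4.0,
                         int8_speedup: float = 1.0) -> float:
    """Rough wall-clock estimate for CPU transcription.

    slowdown is the assumed multiple of audio duration (3-5x for medium);
    int8_speedup is the assumed gain from faster-whisper with INT8 (about 4x).
    """
    return audio_seconds * slowdown / int8_speedup


print(pick_model(8))                                        # 8 GB GPU
print(estimate_cpu_seconds(600, slowdown=4.0, int8_speedup=4.0))
```

Treat the result as a starting point and benchmark on your own hardware before committing to a model size.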
The original Whisper API is straightforward: load a model, call transcribe() with an audio file path. Whisper handles MP3, WAV, M4A, FLAC, and most audio formats automatically via ffmpeg. The result includes the full text, detected language, and segment-level timestamps by default.
import whisper
import json
# Load model (downloads on first run and caches to ~/.cache/whisper)
model = whisper.load_model("medium")
# Basic transcription
result = model.transcribe("interview.mp3")
print(result["text"])
print(f"Detected language: {result['language']}")
# Access segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:6.2f}s → {end:6.2f}s] {text}")
# Save as SRT subtitle file
def to_srt(segments, output_path):
    with open(output_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
def format_timestamp(seconds: float) -> str:
    # Work in whole milliseconds so float error cannot truncate 1.001 s to ",000"
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
to_srt(result["segments"], "subtitles.srt")
print("Saved subtitles.srt")
# Also save raw JSON for downstream processing
# (explicit UTF-8 because ensure_ascii=False writes non-ASCII text as-is)
with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
Standard Whisper provides segment-level timestamps. Enabling word_timestamps=True adds per-word timing information, which is essential for subtitle synchronisation, highlight clipping, and training data annotation. The option works in openai-whisper itself; third-party libraries such as stable-ts or whisper-timestamped refine the word boundaries when you need tighter alignment.
import stable_whisper  # pip install stable-ts
# stable-ts provides more accurate word timestamps
model = stable_whisper.load_model("medium")
result = model.transcribe(
    "podcast_episode.mp3",
    regroup=True,  # Regroup segments for better readability
    word_timestamps=True,
)
# Iterate word-level data
for segment in result.segments:
    for word in segment.words:
        print(f"{word.start:.2f}s – {word.end:.2f}s '{word.word}' conf={word.probability:.2f}")
# Export word-level subtitles and raw timing data
result.to_ass("video_subtitles.ass")  # Advanced SubStation Alpha format
result.save_as_json("word_timestamps.json")
# Standard whisper word timestamps
import whisper
model = whisper.load_model("medium")
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    # "words" is only present when word_timestamps=True
    for word_data in segment.get("words", []):
        print(f"{word_data['start']:.2f}s {word_data['word']}")
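For subtitle synchronisation, per-word timings usually need to be regrouped into short cues. The following is a minimal sketch of one way to do that, not part of Whisper or stable-ts: `words_to_cues`, `max_chars`, and `max_gap` are hypothetical names, and the default limits are arbitrary starting values.

```python
def words_to_cues(words, max_chars=42, max_gap=0.8):
    """Group (start, end, text) word tuples into subtitle cues.

    A new cue starts when adding the next word would exceed max_chars,
    or when the pause before that word is longer than max_gap seconds.
    """
    cues = []
    current = None  # [start, end, text] of the cue being built
    for start, end, text in words:
        text = text.strip()
        if not text:
            continue
        if current is not None:
            too_long = len(current[2]) + 1 + len(text) > max_chars
            long_pause = start - current[1] > max_gap
            if too_long or long_pause:
                cues.append(tuple(current))
                current = None
        if current is None:
            current = [start, end, text]
        else:
            current[1] = end
            current[2] += " " + text
    if current is not None:
        cues.append(tuple(current))
    return cues


# Example with hand-written timings (seconds)
sample = [(0.0, 0.4, " Hello"), (0.4, 0.9, " world"),
          (2.5, 2.9, " Next"), (2.9, 3.3, " cue")]
for cue_start, cue_end, cue_text in words_to_cues(sample):
    print(f"{cue_start:.1f} -> {cue_end:.1f}: {cue_text}")
```

With openai-whisper output, the input list can be built from each segment's "words" entries as `(w["start"], w["end"], w["word"])`.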
Whisper can auto-detect the spoken language and optionally translate it to English in a single pass — without a separate translation model. This is useful for multilingual content pipelines where you want an English transcript regardless of the source language. The task="translate" flag switches from transcription to translation mode.
import whisper
model = whisper.load_model("medium")
# Auto-detect language before full transcription
audio = whisper.load_audio("speech_french.mp3")
audio_pad = whisper.pad_or_trim(audio)  # Detection uses the first 30 seconds
# n_mels comes from the model so the same code also works with large-v3
mel = whisper.log_mel_spectrogram(audio_pad, n_mels=model.dims.n_mels).to(model.device)
_, probs = model.detect_language(mel)
detected_lang = max(probs, key=probs.get)
confidence = probs[detected_lang]
print(f"Detected language: {detected_lang} (confidence: {confidence:.1%})")
# Top 5 language candidates
top5 = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:5]
for lang, prob in top5:
    print(f" {lang}: {prob:.1%}")
# Transcribe in original language
transcription = model.transcribe("speech_french.mp3", language="fr")
print("French transcript:", transcription["text"])
# Translate to English
translation = model.transcribe(
    "speech_french.mp3",
    task="translate",  # Translate to English
    language="fr",  # Source language (optional, auto-detected if omitted)
)
print("English translation:", translation["text"])
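Language detection returns a probability per language, so a pipeline can decide what to do when the top candidate is weak. The helper below is a hedged sketch of that decision: `choose_language`, `min_confidence`, and `fallback` are hypothetical names, and the 0.5 threshold is an arbitrary example value, not a Whisper recommendation.

```python
def choose_language(probs, min_confidence=0.5, fallback=None):
    """Pick a language code from a {code: probability} mapping.

    Returns the most likely code when its probability reaches
    min_confidence; otherwise returns fallback. A fallback of None
    leaves detection to transcribe() itself.
    """
    if not probs:
        return fallback
    best = max(probs, key=probs.get)
    return best if probs[best] >= min_confidence else fallback


# Hand-written probabilities standing in for model.detect_language() output
print(choose_language({"fr": 0.92, "en": 0.05}))
print(choose_language({"fr": 0.40, "it": 0.35}, fallback="en"))
```

The returned value can be passed straight to `model.transcribe(path, language=...)`.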
Faster-Whisper reimplements Whisper using CTranslate2, a C++ inference engine that supports INT8 quantisation. It reaches roughly 4-8x speedup over the original PyTorch implementation on CPU and 2-3x on GPU, depending on hardware and model size. This makes large-v3 practical on consumer hardware and enables near-real-time transcription of long audio files.
from faster_whisper import WhisperModel
import time
# Load with INT8 quantisation on CPU (roughly 4x faster than the original)
model = WhisperModel(
    "large-v3",
    device="cpu",  # or "cuda"
    compute_type="int8",  # INT8 quantised: lower memory, accuracy usually close to full precision
    num_workers=4,  # Lets transcribe() calls from multiple threads run in parallel
)
audio_file = "long_interview.mp3"
start = time.time()
segments, info = model.transcribe(
    audio_file,
    beam_size=5,
    language=None,  # Auto-detect
    word_timestamps=True,
    vad_filter=True,  # Voice activity detection: skip silence
    vad_parameters=dict(min_silence_duration_ms=500),
)
print(f"Detected language: {info.language} ({info.language_probability:.1%})")
full_text = []
# segments is a generator: decoding happens lazily inside this loop
for segment in segments:
    text = segment.text.strip()
    full_text.append(text)
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s] {text}")
elapsed = time.time() - start
print(f"\nTranscribed in {elapsed:.1f}s")
print("\n".join(full_text))
vad_filter=True uses Silero VAD to skip silent segments before processing. This can cut transcription time by 30-60% for audio with lots of pauses or music between speech.
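To check how much VAD actually helps on your own audio, compare the processing time with the audio length and with the amount of speech left after filtering. The sketch below shows the arithmetic only; `summarise_run` is a hypothetical helper, and it assumes you can supply the three durations yourself (recent faster-whisper releases appear to expose them as `info.duration` and `info.duration_after_vad`, but verify this against your installed version).

```python
def summarise_run(audio_seconds, speech_seconds, elapsed_seconds):
    """Summarise one transcription run.

    real_time_factor < 1 means faster than real time;
    silence_removed is the share of the audio that VAD skipped.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return {
        "real_time_factor": elapsed_seconds / audio_seconds,
        "silence_removed": 1.0 - speech_seconds / audio_seconds,
    }


# Example: a 10-minute file with 7 minutes of speech, processed in 150 s
stats = summarise_run(audio_seconds=600.0, speech_seconds=420.0, elapsed_seconds=150.0)
print(f"RTF: {stats['real_time_factor']:.2f}, silence removed: {stats['silence_removed']:.0%}")
```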
Whisper identifies what was said but not who said it. Combining Whisper with pyannote.audio adds speaker diarisation — labelling each speech segment with a speaker ID. The typical pipeline: Whisper transcribes the audio, pyannote segments by speaker, then you align the two outputs by timestamp.
from pyannote.audio import Pipeline as DiarizationPipeline
from faster_whisper import WhisperModel
import torch
# Pyannote requires a HuggingFace token (free, after accepting the model terms on the Hub)
HF_TOKEN = "hf_your_token_here"
# Load diarisation pipeline
diarize_pipeline = DiarizationPipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,  # Argument name used by pyannote.audio 3.x
)
diarize_pipeline.to(torch.device("cuda"))  # Run diarisation on the same GPU as Whisper
# Load Whisper
whisper_model = WhisperModel("medium", device="cuda", compute_type="float16")
audio_file = "meeting_recording.mp3"
# Step 1: Diarise (identify speaker turns)
diarization = diarize_pipeline(audio_file)
# Step 2: Transcribe with timestamps
segments, _ = whisper_model.transcribe(audio_file, word_timestamps=True)
segments = list(segments)  # Consume the generator so decoding finishes here
# Step 3: Align speakers to transcript segments
def get_speaker(diarization, start, end):
    """Find the dominant speaker for a time range."""
    speaker_times = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap_start = max(turn.start, start)
        overlap_end = min(turn.end, end)
        if overlap_end > overlap_start:
            speaker_times[speaker] = speaker_times.get(speaker, 0) + (overlap_end - overlap_start)
    return max(speaker_times, key=speaker_times.get) if speaker_times else "UNKNOWN"
print("=== MEETING TRANSCRIPT ===\n")
current_speaker = None
for seg in segments:
    speaker = get_speaker(diarization, seg.start, seg.end)
    if speaker != current_speaker:
        current_speaker = speaker
        print(f"\n[{speaker}]")
    print(f" [{seg.start:.1f}s] {seg.text.strip()}")
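Segment-level alignment can mislabel a segment that spans a speaker change. One possible refinement is to apply the same overlap rule to each word instead. The sketch below demonstrates the idea on plain tuples so it runs without any models; `assign_speakers` and `merge_runs` are hypothetical helpers, and the tuple layouts are assumptions made for illustration.

```python
def assign_speakers(words, turns):
    """Label each word with the speaker whose turn overlaps it most.

    words: list of (start, end, text) tuples
    turns: list of (start, end, speaker) tuples
    Words that overlap no turn are labelled "UNKNOWN".
    """
    labelled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, text.strip()))
    return labelled


def merge_runs(labelled):
    """Collapse consecutive words from the same speaker into one line."""
    lines = []
    for speaker, text in labelled:
        if lines and lines[-1][0] == speaker:
            lines[-1] = (speaker, lines[-1][1] + " " + text)
        else:
            lines.append((speaker, text))
    return lines


# Hand-written example data (seconds)
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [(0.1, 0.5, "Hi"), (0.6, 1.0, "there"), (2.1, 2.6, "Hello")]
for speaker, line in merge_runs(assign_speakers(words, turns)):
    print(f"[{speaker}] {line}")
```

In the pipeline above, `turns` could be built from `diarization.itertracks(yield_label=True)` and `words` from each segment's `words` list.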
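Wrap Whisper in a FastAPI service to transcribe audio files uploaded via HTTP. Use background tasks for long audio files so the HTTP response returns immediately with a job ID that the client can poll. FastAPI background tasks run inside the web server process and the job store below lives in memory, so this version suits a single worker; for production workloads, move jobs to a task queue (Celery) and uploaded audio to object storage (S3).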
from fastapi import FastAPI, UploadFile, File, BackgroundTasks, HTTPException
from faster_whisper import WhisperModel
import uuid, aiofiles
from pathlib import Path
app = FastAPI(title="Whisper Transcription API")
UPLOAD_DIR = Path("uploads")
UPLOAD_DIR.mkdir(exist_ok=True)
# In-memory job store (use Redis in production)
jobs: dict[str, dict] = {}
@app.on_event("startup")
async def startup():
    # Load the model once per process instead of once per request
    app.state.model = WhisperModel("medium", device="cpu", compute_type="int8")
def transcribe_job(job_id: str, audio_path: str):
    jobs[job_id]["status"] = "processing"
    try:
        segments, info = app.state.model.transcribe(audio_path, vad_filter=True)
        text = " ".join(seg.text.strip() for seg in segments)
        jobs[job_id].update({"status": "done", "text": text, "language": info.language})
    except Exception as e:
        jobs[job_id].update({"status": "error", "error": str(e)})
    finally:
        # Always remove the uploaded file, whether the job succeeded or failed
        Path(audio_path).unlink(missing_ok=True)
@app.post("/transcribe")
async def transcribe(background_tasks: BackgroundTasks, file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    # Keep only the base name so a crafted filename cannot escape UPLOAD_DIR
    safe_name = Path(file.filename or "upload").name
    audio_path = UPLOAD_DIR / f"{job_id}_{safe_name}"
    async with aiofiles.open(audio_path, "wb") as f:
        await f.write(await file.read())
    jobs[job_id] = {"status": "queued"}
    background_tasks.add_task(transcribe_job, job_id, str(audio_path))
    return {"job_id": job_id, "status": "queued"}
@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
    if job_id not in jobs:
        # Return a real 404 rather than a 200 response with an error body
        raise HTTPException(status_code=404, detail="Job not found")
    return jobs[job_id]
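On the client side, the job ID returned by /transcribe has to be polled until the job finishes. Here is a minimal polling sketch under stated assumptions: `wait_for_job` is a hypothetical helper, and `fetch_status` stands for any callable that returns the JSON body of GET /jobs/{job_id} as a dict, so the loop can be tried without a running server.

```python
import time


def wait_for_job(fetch_status, job_id, interval=1.0, timeout=600.0, sleep=time.sleep):
    """Poll until the job reaches 'done' or 'error', or the timeout passes.

    fetch_status: callable taking a job ID and returning the job dict.
    sleep: injectable so the loop can be exercised without real delays.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job.get("status") in ("done", "error"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job {job_id} did not finish within {timeout}s")
        sleep(interval)


# Simulated server responses standing in for real HTTP calls
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "done", "text": "hello", "language": "en"},
])
result = wait_for_job(lambda _job_id: next(responses), "demo-job", sleep=lambda _s: None)
print(result["status"], result["text"])
```

Against a live service, `fetch_status` could wrap an HTTP client call to the /jobs endpoint; the host and port depend on how you run the app.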