OpenAI Whisper: Speech Recognition and Transcription Guide

OpenAI Whisper is an open-source automatic speech recognition (ASR) model trained on 680,000 hours of multilingual audio. It transcribes speech in 99 languages, translates it into English, and approaches human-level accuracy on clean audio. Unlike cloud ASR services, the open-source model runs entirely locally: no API keys, no per-minute billing, no data leaving your machine. As of 2026, faster-whisper (built on CTranslate2) runs Whisper several times faster than the original implementation (the project reports up to about 4x at the same accuracy) while using less memory, making near-real-time transcription feasible on consumer hardware.

This guide covers installation, transcription with word-level timestamps, language detection, audio translation, speaker diarisation with pyannote, real-time streaming transcription, and building a FastAPI transcription service.

Installation and Model Sizes
Basic Transcription
Word-Level Timestamps
Language Detection and Translation
Faster-Whisper Optimisation
Speaker Diarisation
Building a Transcription API

Installation and Model Sizes

Whisper is available in five sizes (tiny, base, small, medium, and large-v3) that trade speed and memory for accuracy. The large-v3 model achieves the best accuracy but requires about 10 GB of VRAM. For most production use cases, medium (about 5 GB of VRAM) hits a good accuracy/speed balance. Install the official package via pip, or use faster-whisper for CTranslate2-optimised inference.

# Official OpenAI Whisper
pip install openai-whisper

# System ffmpeg (required): Whisper calls the ffmpeg binary to decode audio,
# so no separate Python ffmpeg package needs to be installed by hand
# Ubuntu: sudo apt install ffmpeg
# macOS:  brew install ffmpeg
# Windows: winget install ffmpeg

# Faster-Whisper (recommended for production)
pip install faster-whisper

# Model sizes and VRAM requirements:
# tiny    (~39M params,  ~1 GB VRAM, fastest)
# base    (~74M params,  ~1 GB VRAM)
# small   (~244M params, ~2 GB VRAM)
# medium  (~769M params, ~5 GB VRAM)
# large-v3(~1550M params,~10 GB VRAM, best accuracy)

CPU tip: On CPU, use tiny or base for near-real-time transcription. The medium model on CPU takes roughly 3-5x the audio duration; faster-whisper with INT8 quantisation can cut that by up to about 4x, depending on hardware.
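
The size and VRAM figures above can be turned into a small selection helper. The sketch below is illustrative only: `pick_model` and its `headroom_gb` margin are hypothetical names, and the VRAM numbers are the approximate ones listed in this section, not measured values.

```python
# Approximate VRAM needs (GB) copied from the list above, smallest to largest.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]


def pick_model(available_vram_gb: float, headroom_gb: float = 0.5) -> str:
    """Return the largest model whose approximate VRAM need fits the budget.

    headroom_gb is an assumed safety margin for activations and other
    processes. Falls back to "tiny" when nothing fits (e.g. CPU-only hosts).
    """
    budget = available_vram_gb - headroom_gb
    choice = "tiny"
    for name, need_gb in MODEL_VRAM_GB:
        if need_gb <= budget:
            choice = name  # entries are ordered by size, so keep the last fit
    return choice


print(pick_model(8))   # an 8 GB card
print(pick_model(24))  # a 24 GB card
```

The returned name can be passed straight to whisper.load_model() or WhisperModel().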

Basic Transcription

The original Whisper API is straightforward: load a model, call transcribe() with an audio file path. Whisper handles MP3, WAV, M4A, FLAC, and most audio formats automatically via ffmpeg. The result includes the full text, detected language, and segment-level timestamps by default.

import whisper
import json

# Load model — downloads on first run and caches to ~/.cache/whisper
model = whisper.load_model("medium")

# Basic transcription
result = model.transcribe("interview.mp3")
print(result["text"])
print(f"Detected language: {result['language']}")

# Access segments with timestamps
for segment in result["segments"]:
    start = segment["start"]
    end = segment["end"]
    text = segment["text"].strip()
    print(f"[{start:6.2f}s → {end:6.2f}s]  {text}")

# Save as SRT subtitle file
def to_srt(segments, output_path):
    with open(output_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")

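def format_timestamp(seconds: float) -> str:
    # Round to whole milliseconds first: truncating the float fraction
    # can be off by 1 ms (2.3 s would otherwise give ",299")
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"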

to_srt(result["segments"], "subtitles.srt")
print("Saved subtitles.srt")

# Also save raw JSON for downstream processing
with open("transcript.json", "w", encoding="utf-8") as f:
    json.dump(result, f, indent=2, ensure_ascii=False)
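
Browsers expect WebVTT rather than SRT for HTML5 video captions, and the same segment list converts with only small changes. The following is a minimal sketch, not part of Whisper's API: `vtt_timestamp` and `to_vtt` are hypothetical helpers, and the sample segments are made-up data in the shape of result["segments"].

```python
def vtt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS.mmm (WebVTT uses a dot before milliseconds)."""
    total_ms = round(seconds * 1000)
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}.{ms:03d}"


def to_vtt(segments) -> str:
    """Build a WebVTT document from Whisper-style segment dicts."""
    lines = ["WEBVTT", ""]  # mandatory header followed by a blank line
    for seg in segments:
        lines.append(f"{vtt_timestamp(seg['start'])} --> {vtt_timestamp(seg['end'])}")
        lines.append(seg["text"].strip())
        lines.append("")  # blank line terminates each cue
    return "\n".join(lines)


# Made-up segments in the same shape as result["segments"]
sample_segments = [
    {"start": 0.0, "end": 2.3, "text": " Hello and welcome."},
    {"start": 2.3, "end": 5.04, "text": " Today we talk about Whisper."},
]
print(to_vtt(sample_segments))
```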

Word-Level Timestamps

Standard Whisper returns segment-level timestamps. Passing word_timestamps=True to transcribe() adds per-word timing, which is essential for subtitle synchronisation, highlight clipping, and training data annotation. The built-in option works out of the box; third-party libraries such as whisper-timestamped or stable-ts refine the timings when you need tighter alignment.

import stable_whisper  # pip install stable-ts

# stable-ts provides more accurate word timestamps
model = stable_whisper.load_model("medium")

result = model.transcribe(
    "podcast_episode.mp3",
    regroup=True,         # Regroup segments for better readability
    word_timestamps=True,
)

# Iterate word-level data
for segment in result.segments:
    for word in segment.words:
        print(f"{word.start:.2f}s – {word.end:.2f}s  '{word.word}'  conf={word.probability:.2f}")

# Export word-highlighted subtitles and the raw word timings
result.to_ass("video_subtitles.ass")  # Advanced SubStation Alpha format
result.save_as_json("word_timestamps.json")

# Standard whisper word timestamps
import whisper
model = whisper.load_model("medium")
result = model.transcribe("audio.mp3", word_timestamps=True)
for segment in result["segments"]:
    for word_data in segment.get("words", []):
        print(f"{word_data['start']:.2f}s  {word_data['word']}")
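
Highlight clipping, mentioned above, comes down to finding the time span of a phrase in the word list. A minimal sketch, assuming segments shaped like the standard Whisper output with word_timestamps=True; `normalise` and `find_phrase` are hypothetical helpers and the sample data is invented.

```python
import string


def normalise(token: str) -> str:
    """Lowercase and strip surrounding whitespace and punctuation."""
    return token.strip().strip(string.punctuation).lower()


def find_phrase(segments, phrase: str):
    """Return (start, end) in seconds of the first match of phrase, or None."""
    target = [normalise(t) for t in phrase.split()]
    if not target:
        return None
    # Flatten all words across segments so a phrase may span a segment boundary
    words = [w for seg in segments for w in seg.get("words", [])]
    tokens = [normalise(w["word"]) for w in words]
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start"], words[i + len(target) - 1]["end"]
    return None


# Invented data in the shape of result["segments"]
sample = [
    {"words": [
        {"word": " Speech", "start": 0.0, "end": 0.4},
        {"word": " recognition", "start": 0.4, "end": 1.1},
    ]},
    {"words": [
        {"word": " is", "start": 1.1, "end": 1.3},
        {"word": " hard.", "start": 1.3, "end": 1.8},
    ]},
]
print(find_phrase(sample, "recognition is hard"))
```

The returned pair can be handed to ffmpeg or a video editor to cut the clip; adding a small padding on each side usually avoids clipped syllables.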

Language Detection and Translation

Whisper can auto-detect the spoken language and optionally translate it to English in a single pass — without a separate translation model. This is useful for multilingual content pipelines where you want an English transcript regardless of the source language. The task="translate" flag switches from transcription to translation mode.

import whisper

model = whisper.load_model("medium")

# Auto-detect language before full transcription
audio = whisper.load_audio("speech_french.mp3")
audio_pad = whisper.pad_or_trim(audio)  # detection uses the first 30 seconds
# n_mels must match the model (80 for most sizes, 128 for large-v3)
mel = whisper.log_mel_spectrogram(audio_pad, n_mels=model.dims.n_mels).to(model.device)

_, probs = model.detect_language(mel)
detected_lang = max(probs, key=probs.get)
confidence = probs[detected_lang]
print(f"Detected language: {detected_lang} (confidence: {confidence:.1%})")

# Top 5 language candidates
top5 = sorted(probs.items(), key=lambda x: x[1], reverse=True)[:5]
for lang, prob in top5:
    print(f"  {lang}: {prob:.1%}")

# Transcribe in original language
transcription = model.transcribe("speech_french.mp3", language="fr")
print("French transcript:", transcription["text"])

# Translate to English
translation = model.transcribe(
    "speech_french.mp3",
    task="translate",  # Translate to English
    language="fr",     # Source language (optional — auto-detected if omitted)
)
print("English translation:", translation["text"])
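
Detection on a short or noisy clip can be unreliable, so a pipeline may want to trust the detected language only above some confidence. The sketch below assumes a probability dict like the one returned by detect_language(); `choose_language` and the 0.6 threshold are hypothetical choices, not Whisper defaults.

```python
from typing import Optional


def choose_language(
    probs: dict,
    min_confidence: float = 0.6,
    fallback: Optional[str] = None,
) -> Optional[str]:
    """Return the top language if it is confident enough, else the fallback.

    Returning None leaves the decision to transcribe(), which auto-detects
    when language is omitted.
    """
    if not probs:
        return fallback
    lang = max(probs, key=probs.get)
    return lang if probs[lang] >= min_confidence else fallback


# Invented probabilities for illustration
confident = {"fr": 0.93, "it": 0.04, "es": 0.03}
ambiguous = {"no": 0.41, "sv": 0.38, "da": 0.21}
print(choose_language(confident))
print(choose_language(ambiguous, fallback="en"))
```

The result can be passed as the language argument, for example model.transcribe(path, language=choose_language(probs)).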

Faster-Whisper Optimisation

faster-whisper reimplements Whisper inference on CTranslate2, a C++ inference engine that supports INT8 quantisation. The project reports up to about 4x faster inference than the original PyTorch implementation at the same accuracy, with lower memory use; the actual gain depends on hardware, model size, and compute type. This makes large-v3 practical on consumer hardware and enables near-real-time transcription of long audio files.

from faster_whisper import WhisperModel
import time

# Load with INT8 quantisation on CPU
model = WhisperModel(
    "large-v3",
    device="cpu",           # or "cuda"
    compute_type="int8",    # INT8 weights: much lower memory, usually a small accuracy cost
    num_workers=4,          # Lets several threads call transcribe() in parallel
)

audio_file = "long_interview.mp3"
start = time.time()

segments, info = model.transcribe(
    audio_file,
    beam_size=5,
    language=None,          # Auto-detect
    word_timestamps=True,
    vad_filter=True,        # Voice activity detection — skip silence
    vad_parameters=dict(min_silence_duration_ms=500),
)

print(f"Detected language: {info.language} ({info.language_probability:.1%})")

full_text = []
for segment in segments:
    text = segment.text.strip()
    full_text.append(text)
    print(f"[{segment.start:.1f}s → {segment.end:.1f}s]  {text}")

elapsed = time.time() - start
print(f"\nTranscribed in {elapsed:.1f}s")
print("\n".join(full_text))

VAD filter: Setting vad_filter=True uses Silero VAD to drop non-speech regions before decoding. For audio with long pauses or music between speech, this can cut transcription time by roughly 30-60%; the saving depends on how much of the file is non-speech.
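
To check what VAD buys you on your own audio, compare the file duration with the duration left after filtering and with the wall-clock time. A minimal sketch: `summarise_run` is a hypothetical helper. It assumes the two durations come from `info.duration` and `info.duration_after_vad`, which recent faster-whisper releases appear to expose on the info object; verify those attribute names against your installed version.

```python
def summarise_run(audio_seconds: float, speech_seconds: float,
                  elapsed_seconds: float) -> dict:
    """Summarise how much audio VAD removed and how fast the run was.

    audio_seconds:   total length of the input file
    speech_seconds:  length remaining after the VAD filter
    elapsed_seconds: wall-clock time measured around the segment loop
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return {
        # Share of the file that VAD discarded as non-speech
        "removed_fraction": 1 - speech_seconds / audio_seconds,
        # Below 1.0 means faster than real time
        "real_time_factor": elapsed_seconds / audio_seconds,
    }


# Invented numbers: a 10-minute file with 6 minutes of speech, done in 2 minutes
print(summarise_run(600.0, 360.0, 120.0))
```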

Speaker Diarisation

Whisper identifies what was said but not who said it. Combining Whisper with pyannote.audio adds speaker diarisation — labelling each speech segment with a speaker ID. The typical pipeline: Whisper transcribes the audio, pyannote segments by speaker, then you align the two outputs by timestamp.

from pyannote.audio import Pipeline as DiarizationPipeline
from faster_whisper import WhisperModel
import torch

# Pyannote requires a HuggingFace token (free — accept model terms on the Hub)
HF_TOKEN = "hf_your_token_here"

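# Load diarisation pipeline
# (pyannote.audio 3.x takes use_auth_token; newer releases may expect token= instead)
diarize_pipeline = DiarizationPipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=HF_TOKEN,
)
diarize_pipeline.to(torch.device("cuda"))  # pipelines load on CPU by default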

# Load Whisper
whisper_model = WhisperModel("medium", device="cuda", compute_type="float16")

audio_file = "meeting_recording.mp3"

# Step 1: Diarise — identify speaker turns
diarization = diarize_pipeline(audio_file)

# Step 2: Transcribe with timestamps
segments, _ = whisper_model.transcribe(audio_file, word_timestamps=True)
segments = list(segments)

# Step 3: Align speakers to transcript segments
def get_speaker(diarization, start, end):
    """Find the dominant speaker for a time range."""
    speaker_times = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap_start = max(turn.start, start)
        overlap_end = min(turn.end, end)
        if overlap_end > overlap_start:
            speaker_times[speaker] = speaker_times.get(speaker, 0) + (overlap_end - overlap_start)
    return max(speaker_times, key=speaker_times.get) if speaker_times else "UNKNOWN"

print("=== MEETING TRANSCRIPT ===\n")
current_speaker = None
for seg in segments:
    speaker = get_speaker(diarization, seg.start, seg.end)
    if speaker != current_speaker:
        current_speaker = speaker
        print(f"\n[{speaker}]")
    print(f"  [{seg.start:.1f}s] {seg.text.strip()}")
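
Segment-level alignment mislabels text when two people speak within one Whisper segment. Since the transcription above already requests word timestamps, speakers can be assigned per word instead. The sketch below works on plain tuples so it is independent of either library; `speaker_at` and `label_words` are hypothetical helpers. Turns could be built as (turn.start, turn.end, speaker) from itertracks(), and words as (w.start, w.end, w.word.strip()) from each segment's words.

```python
def speaker_at(turns, t: float) -> str:
    """Return the speaker whose turn contains time t, or "UNKNOWN"."""
    for start, end, speaker in turns:
        if start <= t < end:
            return speaker
    return "UNKNOWN"


def label_words(words, turns):
    """Group consecutive words by speaker, judged at each word's midpoint.

    words: list of (start, end, text); turns: list of (start, end, speaker).
    Returns a list of (speaker, joined_text) runs in time order.
    """
    runs = []
    for start, end, text in words:
        speaker = speaker_at(turns, (start + end) / 2)
        if runs and runs[-1][0] == speaker:
            runs[-1][1].append(text)  # same speaker continues
        else:
            runs.append((speaker, [text]))  # speaker change starts a new run
    return [(speaker, " ".join(parts)) for speaker, parts in runs]


# Invented data for illustration
turns = [(0.0, 2.0, "SPEAKER_00"), (2.0, 4.0, "SPEAKER_01")]
words = [(0.1, 0.5, "Hi"), (0.6, 1.0, "there"), (2.1, 2.6, "Hello"), (5.0, 5.2, "hm")]
for speaker, text in label_words(words, turns):
    print(f"[{speaker}] {text}")
```

With overlapping turns this picks the first matching turn, which is a simplification; the overlap-weighted approach in get_speaker() above is the more robust choice when overlaps are common.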

Building a Transcription API

Wrap Whisper in a FastAPI service to transcribe audio files uploaded via HTTP. Use background tasks for long audio files so the HTTP response returns immediately with a job ID. This pattern scales to production — add a task queue (Celery) and object storage (S3) for enterprise workloads.

from fastapi import FastAPI, UploadFile, File, BackgroundTasks, HTTPException
from faster_whisper import WhisperModel
import uuid
import aiofiles
from pathlib import Path

app = FastAPI(title="Whisper Transcription API")
UPLOAD_DIR = Path("uploads")
UPLOAD_DIR.mkdir(exist_ok=True)

# In-memory job store (use Redis in production)
jobs: dict[str, dict] = {}

@app.on_event("startup")
async def startup():
    # Load the model once at startup and share it across requests
    app.state.model = WhisperModel("medium", device="cpu", compute_type="int8")

def transcribe_job(job_id: str, audio_path: str):
    jobs[job_id]["status"] = "processing"
    try:
        segments, info = app.state.model.transcribe(audio_path, vad_filter=True)
        # segments is a lazy generator: decoding happens while joining
        text = " ".join(seg.text.strip() for seg in segments)
        jobs[job_id].update({"status": "done", "text": text, "language": info.language})
    except Exception as e:
        jobs[job_id].update({"status": "error", "error": str(e)})
    finally:
        Path(audio_path).unlink(missing_ok=True)

@app.post("/transcribe")
async def transcribe(background_tasks: BackgroundTasks, file: UploadFile = File(...)):
    job_id = str(uuid.uuid4())
    # Keep only the base name so a crafted filename cannot escape UPLOAD_DIR
    safe_name = Path(file.filename or "audio").name
    audio_path = UPLOAD_DIR / f"{job_id}_{safe_name}"
    async with aiofiles.open(audio_path, "wb") as f:
        await f.write(await file.read())
    jobs[job_id] = {"status": "queued"}
    background_tasks.add_task(transcribe_job, job_id, str(audio_path))
    return {"job_id": job_id, "status": "queued"}

@app.get("/jobs/{job_id}")
async def get_job(job_id: str):
    if job_id not in jobs:
        # Return a real 404 instead of a 200 with an error body
        raise HTTPException(status_code=404, detail="Job not found")
    return jobs[job_id]
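
On the client side, the job-ID pattern means polling /jobs/{job_id} until the status is terminal. A minimal sketch of that loop with exponential backoff: `wait_for_job` and its parameters are hypothetical, and the HTTP call is injected as a callable so the example does not depend on a particular client library.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], dict],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll fetch_status() until the job leaves the queued/processing states.

    fetch_status should return the parsed JSON body of GET /jobs/{job_id}.
    Raises TimeoutError if the job is still pending when the timeout expires.
    """
    deadline = time.monotonic() + timeout_s
    delay = initial_delay_s
    while True:
        job = fetch_status()
        # Anything other than a pending status ("done", "error", or an
        # unexpected body) is treated as terminal and handed to the caller
        if job.get("status") not in ("queued", "processing"):
            return job
        if time.monotonic() + delay > deadline:
            raise TimeoutError("transcription job did not finish in time")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # back off: 1s, 2s, 4s, ... capped


# Simulated server responses so the sketch runs without a live service
responses = iter([
    {"status": "queued"},
    {"status": "processing"},
    {"status": "done", "text": "hello world", "language": "en"},
])
print(wait_for_job(lambda: next(responses), sleep=lambda s: None))
```

With a real client, fetch_status would wrap the HTTP request, for example a lambda that issues a GET to the job URL and returns the decoded JSON.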