podscripter

Overview
podscripter is a lightweight tool designed to transcribe audio using OpenAI’s Whisper model inside a Docker container. It supports multiple languages with automatic language detection, including English (en), Spanish (es), French (fr), and German (de).
podscripter enables users to generate accurate transcriptions locally, making it perfect for platforms like LingQ where text and audio integration can boost comprehension.
Features
- Local Processing: No API keys or usage limits; run everything on your own machine.
- Dockerized Environment: Easily install and run the tool in an isolated container.
- Flexible Input: Supports both audio files (MP3, WAV, etc.) and video files (MP4, etc.).
- Multiple Output Formats: Choose between TXT (sentence-separated) or SRT (subtitles).
- Automatic Language Detection: Automatically detects the language of your audio content by default.
- Primary Language Support: English (en), Spanish (es), French (fr), German (de). Other languages are experimental.
- Advanced Punctuation Restoration: Uses Sentence-Transformers for intelligent punctuation restoration, with automatic spaCy-based capitalization.
- Batch Processing: Transcribe multiple files using simple shell loops.
- Powered by Whisper: Uses OpenAI’s Whisper model for accurate speech recognition.
- Hugging Face Integration: Leverages Hugging Face models and caches for local, offline workflows.
Quickstart
Minimal setup and a single run:
# Build image
docker build -t podscripter .
# Create cache folders (first time only)
mkdir -p audio-files models/sentence-transformers models/huggingface
# Transcribe one file (TXT output). Replace example.mp3 with your file.
docker run --rm \
-v $(pwd):/app \
-v $(pwd)/models/sentence-transformers:/root/.cache/torch/sentence_transformers \
-v $(pwd)/models/huggingface:/root/.cache/huggingface \
-v $(pwd)/audio-files:/app/audio-files \
podscripter python3 /app/podscripter.py \
/app/audio-files/example.mp3 --output_dir /app/audio-files
Notes:
- Compute type defaults to `auto` (no CLI flag needed).
- VAD is enabled by default. You can disable it with `--no-vad` or adjust padding with `--vad-speech-pad-ms <int>`.
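For example, the same Quickstart command with wider VAD padding (the padding value here is illustrative, not a recommendation):
docker run --rm \
-v $(pwd):/app \
-v $(pwd)/models/sentence-transformers:/root/.cache/torch/sentence_transformers \
-v $(pwd)/models/huggingface:/root/.cache/huggingface \
-v $(pwd)/audio-files:/app/audio-files \
podscripter python3 /app/podscripter.py \
/app/audio-files/example.mp3 --output_dir /app/audio-files \
--vad-speech-pad-ms 400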
Requirements
- Docker-compatible system (macOS, Linux, Windows with WSL)
Installation
1. Install Prerequisites
Make sure you have the following tools installed on your system:
- Docker — need help installing? See the beginner guide: Docker installation for Windows, macOS, and Ubuntu
- Git
2. Clone the Repository
Open a terminal and run:
git clone https://github.com/algernon725/podscripter.git
cd podscripter
3. Set Up Required Folders
Create folders to store audio files and model data:
mkdir -p audio-files
mkdir -p models/sentence-transformers models/huggingface
This creates the necessary directory structure for caching models:
- `models/huggingface/` - Hugging Face cache (includes Faster-Whisper model repos)
- `models/sentence-transformers/` - caches sentence embedding models for punctuation restoration
4. Build the Docker Image
Build the container image that will run the transcription tool:
docker build -t podscripter .
5. Start the Docker Container
Run the container and mount the folders you just created:
docker run -it \
-v $(pwd)/models/sentence-transformers:/root/.cache/torch/sentence_transformers \
-v $(pwd)/models/huggingface:/root/.cache/huggingface \
-v $(pwd)/audio-files:/app/audio-files \
podscripter
This opens an interactive terminal inside the container. You’ll run all transcription commands from here.
Alternative: Use the caching script
./docker-run-with-cache.sh
💡 Model Caching: The first run will download models (~1-2 GB). Subsequent runs will use cached models for faster startup.
⚙️ NLP Capitalization: The image enables spaCy-based capitalization by default (`NLP_CAPITALIZATION=1`). To disable it per run, pass `-e NLP_CAPITALIZATION=0` to `docker run`.
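For example, a one-off run with capitalization disabled (this mirrors the Quickstart command):
docker run --rm \
-e NLP_CAPITALIZATION=0 \
-v $(pwd):/app \
-v $(pwd)/models/sentence-transformers:/root/.cache/torch/sentence_transformers \
-v $(pwd)/models/huggingface:/root/.cache/huggingface \
-v $(pwd)/audio-files:/app/audio-files \
podscripter python3 /app/podscripter.py \
/app/audio-files/example.mp3 --output_dir /app/audio-files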
Usage
Basic Usage
From inside the Docker container, run:
python podscripter.py <media_file> --output_dir <output_dir> \
[--language <code>|auto] [--output_format {txt|srt}] [--single] \
[--compute-type {auto,int8,int8_float16,int8_float32,float16,float32}] \
[--beam-size <int>] [--no-vad] [--vad-speech-pad-ms <int>] \
[--quiet|--verbose]
Example:
To transcribe example.mp3 using default settings (auto-detect language, txt output):
python podscripter.py audio-files/example.mp3 --output_dir audio-files
Example with video file:
To transcribe example.mp4:
python podscripter.py audio-files/example.mp4 --output_dir audio-files
Examples
One example per scenario to keep things concise.
TXT (default, auto-detect language)
python podscripter.py audio-files/example.mp3 --output_dir audio-files
SRT (subtitles)
python podscripter.py audio-files/example.mp3 --output_dir audio-files --output_format srt
Single-call (no manual chunking)
python podscripter.py audio-files/example.mp3 --output_dir audio-files --single
Use `--single` when your hardware can handle longer files in a single call; it provides the best context continuity. The default mode uses overlapped chunking with VAD.
Expected output snippets
English (TXT):
Hello everyone, welcome to our show!
Today, we’re going to talk about travel tips.
Spanish (TXT):
Hola a todos, ¡bienvenidos a Españolistos!
Hoy vamos a hablar de algunos consejos de viaje.
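Spanish (SRT): entries follow standard SubRip formatting; the timestamps below are illustrative, not actual output:
1
00:00:00,000 --> 00:00:03,500
Hola a todos, ¡bienvenidos a Españolistos!

2
00:00:03,500 --> 00:00:07,000
Hoy vamos a hablar de algunos consejos de viaje.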
Options
| Argument | Description |
|---|---|
| `media_file` | Path to the audio or video file (e.g. `audio-files/example.mp3` or `audio-files/example.mp4`) |
| `--output_dir` | Directory where the transcription file will be saved |
| `--language` | Language code. Primary: `en`, `es`, `fr`, `de`; others are experimental. Default: `auto` (auto-detect) |
| `--output_format` | Output format: `txt` or `srt` (default: `txt`) |
| `--single` | Bypass manual chunking and process the full file in one call |
| `--compute-type` | Compute type for faster-whisper: `auto`, `int8`, `int8_float16`, `int8_float32`, `float16`, `float32` (default: `auto`) |
| `--beam-size` | Beam size for decoding (default: 3) |
| `--no-vad` | Disable VAD filtering (default: VAD enabled) |
| `--vad-speech-pad-ms` | Speech padding in milliseconds when VAD is enabled (default: 200) |
| `--quiet` / `--verbose` | Toggle log verbosity (default: `--verbose`) |
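Flags can be combined. For example, forcing Spanish with SRT output and a wider beam (the flag values here are illustrative):
python podscripter.py audio-files/example.mp3 --output_dir audio-files \
--language es --output_format srt --beam-size 5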
Supported Languages
podscripter supports automatic language detection and manual language selection for the following languages:
| Language | Code | Language | Code |
|---|---|---|---|
| English | `en` | Spanish | `es` |
| French | `fr` | German | `de` |
Note: Whisper can transcribe many additional languages, but only the four listed above have project-level optimization and tests. Other languages are considered experimental.
Automatic NLP Capitalization (spaCy)
Punctuation restoration uses Sentence-Transformers with automatic spaCy-based capitalization that capitalizes named entities and proper nouns for English, Spanish, French, and German.
- Enabled by default (`NLP_CAPITALIZATION=1`); spaCy models are included in the Docker image.
- Automatically capitalizes names, places, and organizations while preserving language-specific connectors like “de”, “del”, “y”, etc. (see the example below).
- For unsupported languages, falls back to the English model.
This feature is CPU-only and uses cached spaCy “sm” models baked into the image.
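As a quick illustration (a hypothetical before/after pair, not a recorded run), note how the connector “de” stays lowercase while the surrounding names and places are capitalized:
Input:  hola, soy andrea de santander, colombia
Output: Hola, soy Andrea de Santander, Colombia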
Development
See tests/README.md for details on running tests.
Batch Transcription: All Media Files
To transcribe all .mp3 and .mp4 files in the audio-files folder with auto-detection (default), run this from inside the container:
for f in audio-files/*.{mp3,mp4}; do
  [ -e "$f" ] || continue  # skip unmatched glob patterns (e.g. when no .mp4 files are present)
  python podscripter.py "$f" --output_dir audio-files
done
Why Use This?
When learning a new language, especially through podcasts, having accurate, aligned transcriptions is essential for comprehension and retention. Many language learning apps impose monthly transcription limits or rely on cloud-based AI. This tool gives you full control over your data, with no recurring costs, and the power of Whisper, all on your own hardware.
Model Caching
Podscripter caches models locally to avoid repeated downloads. Cache locations are created during Installation → “Set Up Required Folders” and are mounted into the container in the run commands above. In short:
- Faster-Whisper models are cached via the Hugging Face Hub under `models/huggingface/` (look for `hub/` entries like `Systran/faster-whisper-*`)
- Sentence-Transformers models are cached under `models/sentence-transformers/`
Note: The Sentence-Transformers loader first attempts to load from the local cache and prefers offline use when the cache is present (avoiding network calls). When caches are warm, you may set `HF_HOME` and/or `HF_HUB_OFFLINE=1` to run fully offline.
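For example, a fully offline run once the caches are warm (this mirrors the Quickstart command and assumes the same mounts):
docker run --rm \
-e HF_HUB_OFFLINE=1 \
-v $(pwd):/app \
-v $(pwd)/models/sentence-transformers:/root/.cache/torch/sentence_transformers \
-v $(pwd)/models/huggingface:/root/.cache/huggingface \
-v $(pwd)/audio-files:/app/audio-files \
podscripter python3 /app/podscripter.py \
/app/audio-files/example.mp3 --output_dir /app/audio-files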
To clear cache and re-download models:
rm -rf models/sentence-transformers/* models/huggingface/*
Output
Transcriptions are saved as sentence-separated `.txt` files or `.srt` subtitle files in the specified output directory.
Testing
Run the test suite inside Docker with caches mounted. See tests/README.md for details.
Quick run (default selection):
docker run --rm \
-e NLP_CAPITALIZATION=1 \
-v $(pwd):/app \
-v $(pwd)/models/sentence-transformers:/root/.cache/torch/sentence_transformers \
-v $(pwd)/models/huggingface:/root/.cache/huggingface \
-v $(pwd)/audio-files:/app/audio-files \
podscripter python3 /app/tests/run_all_tests.py
Optional groups via env flags: `RUN_ALL=1`, `RUN_MULTILINGUAL=1`, `RUN_TRANSCRIPTION=1`, `RUN_DEBUG=1`.
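For example, to additionally run the multilingual group (reusing the command above):
docker run --rm \
-e NLP_CAPITALIZATION=1 \
-e RUN_MULTILINGUAL=1 \
-v $(pwd):/app \
-v $(pwd)/models/sentence-transformers:/root/.cache/torch/sentence_transformers \
-v $(pwd)/models/huggingface:/root/.cache/huggingface \
-v $(pwd)/audio-files:/app/audio-files \
podscripter python3 /app/tests/run_all_tests.py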
Troubleshooting
- HTTP 429 (rate limiting) during model loads: ensure cache volumes are mounted; the app prefers offline mode when caches exist. If needed, set `HF_HOME` and consider `HF_HUB_OFFLINE=1` inside the container when caches are warm.