gitwtfhub

xzf-thu/mega-asr

170 · Python · Audience: researcher · Complexity: 4/5 · Active · Setup: hard

TL;DR

A speech recognition foundation model from Tsinghua University, tuned for noisy, far-field, in-the-wild audio. The authors claim up to roughly 30 percent lower word error rate (WER) than systems such as Whisper and Qwen3-ASR on hard clips.

Mindmap

mindmap
  root((Mega-ASR))
    Inputs
      Noisy audio recordings
      Far-field microphone clips
    Outputs
      Transcribed text
      Lower WER scores
    Use Cases
      Transcribe field recordings
      Benchmark against Whisper
      Research robust ASR
    Tech Stack
      Python
      PyTorch
      Hugging Face

Why anyone would actually build this

REASON 1

Transcribe noisy real-world audio with strong background interference

REASON 2

Benchmark in-the-wild speech recognition against Whisper and Qwen3-ASR

REASON 3

Fine-tune an ASR foundation model on custom acoustic conditions

REASON 4

Research A2S-SFT and DG-WGPO training recipes

Stack

Python · PyTorch · Hugging Face

Spinning it up

Difficulty · hard · Time to first run · 1 day+

This is a foundation model trained on 2.6M samples, so expect to need a GPU, Hugging Face downloads, and a multi-step inference setup.

Wtf does this do

MEGA-ASR is a speech recognition model from a group at Tsinghua University aimed at transcribing audio captured in messy real-world conditions, rather than the clean studio recordings that most speech models are tested on. The README frames it as a foundation model for what the authors call in-the-wild speech recognition, meaning audio with background noise, far-field microphones, obstructions, echoes and reverberation, recording artifacts, electronic distortion, and dropped pieces of transmission.

The training set is described as 2.6 million samples covering 7 atomic acoustic conditions and 54 compound scenarios where those conditions stack on top of each other. The authors report up to roughly 30 percent gains over leading open and closed source models on these harder cases.

Two training techniques are named in the README: A2S-SFT for supervised fine-tuning, and a reinforcement learning step called DG-WGPO. The README does not explain what those acronyms stand for or how they work in detail, so a non-technical reader will mostly take them as the labels of the recipes used.

Most of the README is a side-by-side comparison table where short audio clips are transcribed by MEGA-ASR and by other systems, including Qwen3-ASR, Gemini-3-Pro, Seed-ASR, and Whisper. Each row shows the ground-truth text, each model's transcription, and a Word Error Rate score. In the examples shown, MEGA-ASR produces lower error rates on the hard clips while the other systems often return empty output, hallucinate unrelated text, or drop large portions of the sentence.

The project links out to a technical report on arXiv, the Voices-in-the-Wild-2M training dataset on Hugging Face, the model weights on Hugging Face, a separate benchmark repository called Voices-in-the-Wild-Bench, and a project page. The README in this repository is mostly the marketing-style introduction and the comparison samples.
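
The Word Error Rate score in the comparison table can be made concrete with a small sketch. The code below is not taken from the MEGA-ASR repository. It is the standard word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function names `word_error_rate` and `relative_wer_reduction` are hypothetical, the lowercase-and-split normalization is an assumption (real evaluations usually also strip punctuation), and reading the roughly 30 percent figure as a relative reduction is likewise an assumption rather than something the README confirms.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction by which candidate_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - candidate_wer) / baseline_wer


if __name__ == "__main__":
    truth = "turn the lights off"
    print(word_error_rate(truth, "turn the lights off"))  # 0.0, exact match
    print(word_error_rate(truth, ""))                     # 1.0, empty output
    print(word_error_rate(truth, "turn lights"))          # 0.5, two dropped words
```

The three failure modes described above map onto this metric directly: an empty transcription scores 1.0, dropped words count as deletions, and hallucinated text adds insertions, which is why WER can exceed 1.0 when a model emits more wrong words than the reference contains.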

Yoink these prompts

Prompt 1
Show me how to load the Mega-ASR weights from Hugging Face and transcribe a noisy wav file
Prompt 2
Compare Mega-ASR and Whisper on three hard clips and compute Word Error Rate for each
Prompt 3
Summarize what the A2S-SFT and DG-WGPO training steps in Mega-ASR are doing at a high level
Prompt 4
Build a small benchmark script around Voices-in-the-Wild-Bench to evaluate my own ASR model

Don't trust strangers blindly. Verify against the repo.