I have been playing around with whisper.cpp for some time now, going back to September 2023 and the NORDUnet community workshop, where someone from the University of Oslo presented on using Whisper to generate subtitles for their video platform. Having heard about whisper.cpp, I spun it up on my MacBook Pro after the talk and was quite impressed, as most people are when they try Whisper for the first time. The talk had been about running Whisper with PyTorch on an Nvidia H100, yet here I was running it on an M2 Max MacBook, and the speed was not horrible. I even grabbed some of the other attendees to see how it handled Swedish, Danish and English, and for a quick hallway demo it worked surprisingly well: not perfect in any way, but good for a first-pass transcription. It even handled multiple people talking, though not perfectly.

Fast forward a bit: whisper.cpp has grown, and OpenAI released multiple versions of their large model, culminating in large-v3-turbo. People also started releasing fine-tunes; the National Library of Norway released a model tuned on 66,000 hours of speech. They even released ggml versions, the model format that whisper.cpp (and llama.cpp) uses, which made it super easy to test out on a Mac. The Royal Library of Sweden also released a Whisper fine-tune trained on 50,000 hours of Swedish, but they did not release a ggml version.

Luckily, converting a PyTorch model is fairly easy: if you have an h5 or a PyTorch pickle file, whisper.cpp ships conversion scripts. But pickled Python objects are dangerous, since you are side-loading a binary blob and hoping it does not contain anything nefarious (you should never load untrusted pickles), so models released these days tend to be in the safetensors format.
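
To make the difference concrete, here is a minimal sketch (the file names below are placeholders, not part of the conversion): torch.load on a pickle checkpoint unpickles arbitrary Python objects, while safetensors.torch.load_file only reads a small JSON header plus raw tensor data and hands back a plain dict of tensors.

from safetensors.torch import load_file

# A pickle checkpoint would be loaded with torch.load(), which can
# execute code embedded in the file while unpickling:
#   state = torch.load("model.bin", map_location="cpu")

# A safetensors file is parsed without running any code; load_file()
# returns a plain dict mapping tensor names to tensors.
tensors = load_file("model.safetensors", device="cpu")
print(len(tensors), "tensors loaded")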

Converting the model

First we set up a virtual Python environment and install the requirements from the whisper.cpp models folder.

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install -U -r requirements-coreml.txt
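
Before editing the conversion script, it can be worth checking that the venv actually has torch and safetensors available; this is just a quick hedge, since I have not verified exactly which packages requirements-coreml.txt pulls in. If the import fails, a plain pip install safetensors fixes it.

import torch
from safetensors.torch import load_file  # only here to confirm the import works

print("torch", torch.__version__)
print("safetensors import OK")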

In this case the KBLab model is released in an h5-like layout, just with safetensors in place of a pickled model, so we can work off of convert-h5-to-ggml.py: make a copy and call it convert-safetensors-to-ggml.py.

The first thing to change is to import load_file from safetensors.

# ...

# from transformers import WhisperForConditionalGeneration
from safetensors.torch import load_file

Next we need to change the way the model is loaded.

# ...
# Add this block to handle a missing or null 'max_length'
if "max_length" not in hparams or not hparams["max_length"]:
    hparams["max_length"] = hparams.get("max_target_positions", 448)

# model = WhisperForConditionalGeneration.from_pretrained(dir_model)
st_file = dir_model / "model.safetensors"
if st_file.exists():
    list_vars = load_file(str(st_file), device="cpu")
else:
    print("Error: safetensors model file not found:", st_file)
    sys.exit(1)

First we add an extra default for hparams, since config.json might have max_length set to null.

Then, instead of using WhisperForConditionalGeneration, we use safetensors.torch.load_file.

And finally we need to remove or comment out the line:

list_vars = model.state_dict()
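
Before running the full conversion it can help to confirm that load_file really hands back the same kind of mapping that model.state_dict() used to provide, i.e. tensor name to torch.Tensor. A minimal sketch, assuming the KBLab model has already been downloaded into kb-whisper-large/ as shown in the commands below:

from pathlib import Path
from safetensors.torch import load_file

# load_file returns a plain dict of name -> tensor, which is exactly
# what the rest of the conversion script expects in list_vars
list_vars = load_file(str(Path("kb-whisper-large") / "model.safetensors"), device="cpu")
for name, tensor in list(list_vars.items())[:5]:
    print(name, tuple(tensor.shape), tensor.dtype)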

That is it; we should now be able to convert the Whisper model from KBLab.

# remember to source the venv
$ cd whisper.cpp/models
$ git clone https://github.com/openai/whisper
$ git clone https://huggingface.co/KBLab/kb-whisper-large

# if you don't have Git LFS installed, you need to fetch the model.safetensors file manually
$ cd kb-whisper-large
$ curl -L -o model.safetensors https://huggingface.co/KBLab/kb-whisper-large/resolve/main/model.safetensors
$ cd ..

# let's make the output directory
$ mkdir kb-whisper-large-ggml

# finally, convert the kb-whisper-large safetensors model to ggml
$ python3 convert-safetensors-to-ggml.py kb-whisper-large/ whisper/ kb-whisper-large-ggml/

That's it: kb-whisper-large-ggml/ggml-model.bin can now be used with whisper.cpp.

# be in the whisper.cpp main folder
$ ./build/bin/whisper-cli -m models/kb-whisper-large-ggml/ggml-model.bin -ovtt -otxt -f test.wav -of test

Running it on my Mac, a 20-minute wav file takes 2 minutes to process, so roughly 10x realtime.

The Whisper models still hallucinate things like applause, or decide that the end of the audio should include a "ScanPic, translated by Someone" credit or a copyright notice, which clearly shows that the original Whisper model was trained on movies and their subtitles. But as mentioned, they are a good first pass.

This post draws from the information in this GitHub Issue.