Running OpenAI Whisper Model on Docker with GPU Support

Egemen Gulpinar
5 min read · Aug 18, 2024


Hi folks, in this article I talk about how to run the Whisper large-v3 speech-to-text (STT) model in a Docker container with GPU support. This guide should also help you deploy your other deep learning models. Please follow each step carefully. Here we go!

Let’s start by checking that the CUDA drivers and toolkit are installed correctly.

nvidia-smi
nvcc --version

If both commands report your driver and CUDA versions without errors, everything is fine. Otherwise, you should remove everything and reinstall.
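
If you want an extra sanity check on the host before containerizing, here is a minimal sketch using PyTorch (assuming torch is already installed on the host):

import torch

# Confirm that PyTorch can see the GPU and report the CUDA/cuDNN versions
# it was built against; these should be compatible with the driver shown by nvidia-smi.
print("CUDA available:", torch.cuda.is_available())
print("CUDA version (torch build):", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())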

Then you can use this Whisper template that I created earlier for my personal projects. I used the faster-whisper library and the large-v3 model.

import datetime
import time
import re

import torch
from faster_whisper import WhisperModel
from faster_whisper.vad import VadOptions


class WhisperInference:
    def __init__(self, thread_size: int):
        model_size = "large-v3"
        # Print CUDA diagnostics so the container logs show whether the GPU is visible.
        print("----CUDA------>", torch.cuda.is_available())
        print("----CUDA Current Device------>", torch.cuda.current_device())
        print("----CUDA Is Initialized------>", torch.cuda.is_initialized())
        print("----CUDA Memory Allocated------>", torch.cuda.memory_allocated())
        print("----CUDA Memory Reserved------>", torch.cuda.memory_reserved())
        print("----CUDA Memory Summary------>", torch.cuda.memory_summary())
        self.thread_size = thread_size
        self.vad_options = VadOptions(threshold=0.55)
        # Note: the device name must be lowercase "cuda".
        self.model = WhisperModel(model_size, device="cuda", compute_type="float16", cpu_threads=thread_size)

    def run_inference(self, clip_language: str, ad_id: int):
        print("clip language --> ", clip_language)
        segments, info = self.model.transcribe("../tmp/output.mp3", beam_size=3, language=clip_language,
                                               vad_filter=True, vad_parameters=self.vad_options)
        segments = list(segments)
        print("Detected language '%s' with probability %f" % (info.language, info.language_probability))
        return_subtitle = []
        return_transcript = []
        # handle segments in whatever format you want
        return segments
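
Here is a minimal usage sketch for the class above; the thread count, language code, ad_id, and the subtitle formatting are illustrative assumptions, not part of the original template:

# Hypothetical usage: transcribe ../tmp/output.mp3 and build simple subtitle lines.
whisper = WhisperInference(thread_size=4)
segments = whisper.run_inference(clip_language="en", ad_id=1)

subtitles = []
for segment in segments:
    # Each segment carries start/end timestamps (in seconds) and the recognized text.
    subtitles.append(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text.strip()}")

print("\n".join(subtitles))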

For the compute_type parameter, you can follow the faster-whisper/CTranslate2 quantization documentation to trade some accuracy for speed, or vice versa.

The settings above are just what I use; change or edit whatever you like.
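
For example, a lighter quantization such as int8_float16 is a common way to reduce GPU memory use and speed up inference at a small accuracy cost; this is just a sketch, not part of my original setup:

from faster_whisper import WhisperModel

# Same large-v3 model, but with quantized weights: a smaller memory footprint
# and faster inference than plain float16, with a slight accuracy trade-off.
fast_model = WhisperModel("large-v3", device="cuda", compute_type="int8_float16")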

Next, let’s create our docker-compose.yml file.

version: '3.8'
services:
  app:
    build:
      context: .
    environment:
      - PYTHONUNBUFFERED=1
      - LD_LIBRARY_PATH=/usr/local/lib/python3.12/site-packages/nvidia/cublas/lib:/usr/local/lib/python3.12/site-packages/nvidia/cudnn/lib
    volumes:
      - .:/XXX_VOLUME_XXX
    ports:
      - "5000:5000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: always

In the docker-compose.yml file, LD_LIBRARY_PATH is very important. I used a slightly hacky way to find this library path: run your container in interactive mode and print the lines below, which gives you a path you can copy and paste into your .yml file. As I said, this is a hack and I don’t really recommend relying on it. The path will look similar to mine if you use base Python on Ubuntu, but if you use a different environment or Python version, you should change it.

If this path is not set correctly, you can run the code below inside your container to find the correct one. Interactive mode or reading it from the logs, whichever option you prefer is fine.

import os
import nvidia.cublas.lib
import nvidia.cudnn.lib

# Print the two directories that should go into LD_LIBRARY_PATH, already joined with ":".
print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))

Next, in the “deploy” section of our docker-compose.yml file, you should check these settings (a quick in-container check follows the list):

  • driver : nvidia
  • count : 1
  • capabilities : [gpu]
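
To verify that the container really sees the GPU with these settings, here is a small sketch you could add to the project (the file name gpu_check.py is an assumption) and run with something like docker compose exec app python gpu_check.py:

# gpu_check.py (hypothetical helper): confirm the GPU is visible inside the container.
import torch

if torch.cuda.is_available():
    print("GPU visible:", torch.cuda.get_device_name(0))
else:
    print("No GPU visible inside the container; re-check the deploy/devices settings.")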

The Dockerfile should look like this:

# Use Python 3.12 as the base image
FROM python:3.12

# Environment variables
ENV PYTHONUNBUFFERED=1

# Set the working directory
WORKDIR /livad-ad-speech-recognition

# Install essential packages
RUN apt-get update && apt-get install -y \
    ffmpeg \
    curl \
    gnupg2 \
    ca-certificates

# Add NVIDIA Container Toolkit repositories
RUN curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
    && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

# Update package lists and install NVIDIA Container Toolkit
RUN apt-get update \
    && apt-get install -y nvidia-container-toolkit

# Download and install CUDA and cuDNN libraries
RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb && \
    dpkg -i cuda-keyring_1.0-1_all.deb && \
    apt-get update && apt-get upgrade -y && \
    apt-get install -y libcudnn8 libcudnn8-dev

# Install Python dependencies
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt

# Copy the cuDNN files check and copy script
COPY copy_cudnn_files.py /usr/local/bin/copy_cudnn_files.py
RUN python3 /usr/local/bin/copy_cudnn_files.py

# Copy application files
COPY scripts scripts
COPY main.py main.py
COPY src src

# Copy the Docker entrypoint script and make it executable
COPY docker-entrypoint.sh docker-entrypoint.sh
RUN chmod +x docker-entrypoint.sh

# Set the entry point
ENTRYPOINT ["./docker-entrypoint.sh"]

requirements.txt:

torch
torchaudio
torchvision
pybind11
python-dotenv
faster-whisper
nvidia-cudnn-cu11
nvidia-cublas-cu11
numpy

With these settings in place, let’s build:

docker-compose up --build -d

Important Note

Most people hit the “Could not load library libcudnn_ops_infer.so.8. Error: libcudnn_ops_infer.so.8” error. I spent a lot of time finding a good solution to this tricky problem. Adding the lines below to your Dockerfile should fix the error (you may need to change the versions to match your device).

RUN wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.0-1_all.deb && \
    dpkg -i cuda-keyring_1.0-1_all.deb && \
    apt-get update && apt-get upgrade -y && \
    apt-get install -y libcudnn8 libcudnn8-dev

You shouldn’t skip this step; it has to happen during the Docker build. If the error persists, you can try another workaround like the script below:

import os
import shutil


def copy_cudnn_files():
    try:
        # Find the location of libcudnn_ops_infer.so.8
        cudnn_path = None
        for root, dirs, files in os.walk('/usr/'):
            if 'libcudnn_ops_infer.so.8' in files:
                cudnn_path = os.path.join(root, 'libcudnn_ops_infer.so.8')
                break

        # If the file is found, copy it to /usr/lib/
        if cudnn_path:
            cudnn_lib_dir = os.path.dirname(cudnn_path)
            dest_dir = '/usr/lib/'
            for file in os.listdir(cudnn_lib_dir):
                if file.startswith('libcud'):
                    full_file_name = os.path.join(cudnn_lib_dir, file)
                    if os.path.isfile(full_file_name):
                        shutil.copy(full_file_name, dest_dir)
            print(f"Library files successfully copied to {dest_dir}.")
        else:
            print("Required libcudnn_ops_infer.so.8 file not found.")

    except Exception as e:
        print(f"Error occurred during file copy: {e}")


# Start the file copy process
copy_cudnn_files()

This workaround copies the relevant files into the /usr/lib/ path, which lets the main script find the library on its library path.

During your test stage, watch the CUDA logs (the prints in WhisperInference.__init__ above) to see whether everything is running correctly. For that, I really recommend installing “Portainer” and checking the output under the “Container Logs” tab.
