
ML Inference Performance on GPU and CPU across different batch sizes.

A comparison of ML inference speed and memory consumption across various batch sizes on both GPU and CPU.

June 09, 2024

08m Read

By: Abhilaksh Singh Reen

Table of Contents

The Model

CPU and GPU

Testing Code

CPU Inference

GPU Inference

Batched GPU Inference

Batched CPU Inference: Any Good?

Model Dependence

Conclusion

When deploying an ML model to production, it is important to get the inference process running as fast as possible. GPUs excel at parallel processing. In today's article, we compare the inference speed of a PyTorch model running on a CPU and on a GPU. We also test how processing multiple images in parallel, as batches, on the GPU affects the speed.

The Model

Inside the project directory, let's create a folder called src. Inside this folder, create a new file called model.py. Here, we'll define a very simple model based on the UNet++ architecture, using the segmentation_models_pytorch package. Inside the file, we create a class called Model with four main methods:

1. __init__: initializes the model, loads the weights, and creates the preprocessing transform.

2. preprocess: takes a list of images and returns a batch that can be passed to the forward function.

3. postprocess: takes a single model output and returns a binary mask, with an option to resize it.

4. forward: the function used for inference.

from segmentation_models_pytorch import UnetPlusPlus
import torch
import torch.nn.functional as F
from torchvision import transforms


class Model:
    def __init__(self, weights_path, use_cuda=False):
        self.using_cuda = use_cuda and torch.cuda.is_available()
        self.device = torch.device("cuda") if self.using_cuda else torch.device("cpu")
        self.dtype = torch.cuda.FloatTensor if self.using_cuda else torch.Tensor

        self.model = UnetPlusPlus(
            encoder_name="resnet34",
            encoder_depth=4,
            encoder_weights="imagenet",
            decoder_channels=(256, 128, 64, 32),
            in_channels=1,
            classes=1,
            activation=None,
            aux_params=None
        )

        self.model = self.model.to(self.device)
        self.model.load_state_dict(torch.load(weights_path, map_location=self.device))
        self.model.eval()  # inference mode: use running BatchNorm statistics and disable dropout

        self.preprocessing_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize(size=(256, 256))
        ])

        if use_cuda and not self.using_cuda:
            print("CUDA is not available to Torch, using CPU instead.")

    def preprocess(self, images):
        transformed_images = torch.stack([self.preprocessing_transform(image) for image in images])
        transformed_images = transformed_images.type(self.dtype)

        batched_images = transformed_images.to(self.device)

        return batched_images

    def postprocess(self, output, output_size=(256, 256)):
        output = output[0]  # (1, H, W) -> (H, W): drop the channel dimension
        output = output.ge(0.5)  # threshold the probabilities into a boolean mask
        # Resize the mask to the requested size using nearest-neighbour interpolation.
        output = F.interpolate(output.unsqueeze(0).unsqueeze(0).float(), size=output_size, mode="nearest")
        output = output.bool().squeeze(0).squeeze(0)

        if self.using_cuda:
            output = output.cpu()  # move the mask back to host memory before converting to numpy

        output = output.numpy()

        return output

    def forward(self, preprocessed_images):
        output = self.model.forward(preprocessed_images)  # raw logits, shape (N, 1, 256, 256)
        output = torch.sigmoid(output)  # convert logits to probabilities
        output = output.detach()  # detach from the autograd graph

        return output

CPU and GPU

In the following three lines, we define self.using_cuda, self.device, and self.dtype based on whether the constructor was asked to use CUDA and whether CUDA is actually available to torch.

self.using_cuda = use_cuda and torch.cuda.is_available()
self.device = torch.device("cuda") if self.using_cuda else torch.device("cpu")
self.dtype = torch.cuda.FloatTensor if self.using_cuda else torch.Tensor

We move the model to the device, load the weights, and put the model in eval mode for inference.

self.model = self.model.to(self.device)
self.model.load_state_dict(torch.load(weights_path, map_location=self.device))
self.model.eval()

And, in the preprocessing function, we transfer our batched images to the device as well.

batched_images = transformed_images.to(self.device)

In the src directory, we'll make another file config.py to store the directories we will be using for testing.

from os.path import dirname, join as path_join


inputs_dir = path_join(dirname(dirname(__file__)), "inputs")
weights_dir = path_join(dirname(dirname(__file__)), "weights")
outputs_dir = path_join(dirname(dirname(__file__)), "outputs")

Let's also create these three folders. Here's what the project's directory structure should look like after doing so:

├───inputs
│       sample_image.png
│
├───outputs
│       sample_mask.png
│
├───src
│       config.py
│       model.py
│       test.py
│
└───weights
        weights_001.pt
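
The article assumes that inputs/ already contains the test images and that weights/ holds a trained checkpoint named weights_001.pt. If you only want to reproduce the timing runs, a minimal sketch along these lines can fill the folders with placeholder data (this helper and its file name, generate_placeholders.py, are my own assumption and not part of the original project; an untrained checkpoint is enough for timing, though the predicted masks will be meaningless).

# src/generate_placeholders.py (hypothetical): create the folders, write random
# grayscale test images, and save a randomly initialized checkpoint for timing runs.
from os import makedirs
from os.path import join as path_join

import cv2
import numpy as np
import torch
from segmentation_models_pytorch import UnetPlusPlus

from config import inputs_dir, outputs_dir, weights_dir

for directory in (inputs_dir, outputs_dir, weights_dir):
    makedirs(directory, exist_ok=True)

for i in range(100):  # increase to 5000 to match the runs in this article
    image = np.random.randint(0, 256, size=(512, 512), dtype=np.uint8)
    cv2.imwrite(path_join(inputs_dir, f"sample_{i:04d}.png"), image)

# The architecture must match the one defined in model.py so that load_state_dict succeeds.
net = UnetPlusPlus(
    encoder_name="resnet34",
    encoder_depth=4,
    encoder_weights="imagenet",
    decoder_channels=(256, 128, 64, 32),
    in_channels=1,
    classes=1,
)
torch.save(net.state_dict(), path_join(weights_dir, "weights_001.pt"))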

Testing Code

Let's work on the src/test.py file. Our input images are located in the inputs folder and model weights are in the weights folder.

from os import listdir
from os.path import join as path_join
from time import time

import cv2
import numpy as np

from config import inputs_dir, outputs_dir, weights_dir
from model import Model


if __name__ == "__main__":
    test_images = [cv2.imread(path_join(inputs_dir, image_name), cv2.IMREAD_GRAYSCALE) for image_name in listdir(inputs_dir)]
    print("Images loaded.")

    model = Model(path_join(weights_dir, "weights_001.pt"))
    print("Model loaded.")

Next, we can iterate over each of the test images, preprocess them, run a forward pass, postprocess, and save the output.

start_time = time()

outputs = []
for i, test_image in enumerate(test_images):
    preprocessed_image = model.preprocess([test_image])
    output = model.forward(preprocessed_image)
    output = output.detach()
    outputs.append(output)

end_time = time()
time_elapsed = end_time - start_time
print(
    f"Num images: {len(test_images)}, "
    f"Batch size: {1}, "
    f"Time taken: {time_elapsed} s, "
    f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
)

for i, output in enumerate(outputs):
    boolean_array = model.postprocess(output[0], (512, 512))
    binary_image = boolean_array.astype(np.uint8) * 255
    file_path = path_join(outputs_dir, f"{i}.png")
    cv2.imwrite(file_path, binary_image)

Here's the entire test.py file:

from os import listdir
from os.path import join as path_join
from time import time

import cv2
import numpy as np

from config import inputs_dir, outputs_dir, weights_dir
from model import Model


if __name__ == "__main__":
    test_images = []
    all_image_names = listdir(inputs_dir)
    for i, image_name in enumerate(all_image_names):
        image = cv2.imread(path_join(inputs_dir, image_name), cv2.IMREAD_GRAYSCALE)
        test_images.append(image)

        if (i + 1) % 500 == 0:
            print(f"Loaded images: {i + 1} / {len(all_image_names)}")
    print("Images loaded.")

    model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=False)
    print("Model loaded.")

    start_time = time()

    outputs = []
    for i, test_image in enumerate(test_images):
        preprocessed_image = model.preprocess([test_image])
        output = model.forward(preprocessed_image)
        output = output.detach()
        outputs.append(output)

    end_time = time()
    time_elapsed = end_time - start_time
    print(
        f"Num images: {len(test_images)}, "
        f"Batch size: {1}, "
        f"Time taken: {time_elapsed} s, "
        f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
    )

    for i, output in enumerate(outputs):
        boolean_array = model.postprocess(output[0], (512, 512))
        binary_image = boolean_array.astype(np.uint8) * 255
        file_path = path_join(outputs_dir, f"{i}.png")
        cv2.imwrite(file_path, binary_image)

CPU Inference

Right now, in the line where we initialize the model, we pass use_cuda=False. Strictly speaking, we don't need to, because that is the default.

model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=False)

Let's run the script.

For 5,000 images, it took 6674.206 seconds (01h51m14s). That's about 1.3348 s per image, or 0.7491 FPS.

Num images: 5,000 | Device: CPU

| Batch Size | Time Taken | Time Taken per Image (s) | FPS |
| --- | --- | --- | --- |
| 1 | 01h51m14s | 1.3348 | 0.7491 |

GPU Inference

Before we try to run our model on a GPU, we should make sure that CUDA is indeed available to torch. Open up a new Python interpreter in your environment and run the following lines:

import torch

torch.cuda.is_available()

You should get True in the output.
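
Optionally, you can also check how many GPUs torch can see and which one it will use; these are standard torch calls, not something from the original article.

import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.device_count())      # number of GPUs visible to torch
print(torch.cuda.get_device_name(0))  # name of the first GPU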

Let's now run the inference on the machine's GPU. Since we have already written the logic in model.py to handle this, we can change the inference device by changing a single line in test.py, where we initialize the model.

model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=True)

Let's run this and see how much of a speed improvement we got.

Num images: 5,000 | Device: GPU

| Batch Size | Time Taken (s) | Time Taken per Image (ms) | FPS |
| --- | --- | --- | --- |
| 1 | 49.6585 | 9.9317 | 100.6876 |

Whoa! That's roughly a 134.4x improvement. The MythBusters made a good illustration of this kind of speed comparison. GPUs mean business when it comes to crunching numbers.

Batched GPU Inference

Let's take a look at the preprocess function in model.py.

def preprocess(self, images):
    transformed_images = torch.stack([self.preprocessing_transform(image) for image in images])
    transformed_images = transformed_images.type(self.dtype)

    batched_images = transformed_images.to(self.device)

    return batched_images

Here, we are accepting a list of images, transforming them using the preprocessing transform, and then combining them into a single tensor using torch.stack. We can call forward with this single tensor and the output tensor will include the result for each of the inputs i.e. the output is also a batch.
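
As a quick illustration of the shapes involved (a sketch, assuming model is the Model instance created in test.py and that the inputs are 512x512 grayscale images):

import numpy as np

dummy_images = [np.zeros((512, 512), dtype=np.uint8) for _ in range(4)]  # four grayscale images

batch = model.preprocess(dummy_images)  # tensor of shape (4, 1, 256, 256) on the selected device
outputs = model.forward(batch)          # tensor of shape (4, 1, 256, 256), one output per input
masks = [model.postprocess(o, (512, 512)) for o in outputs]  # four (512, 512) boolean arrays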

In test.py, we'll create a variable called batch_size to define how many images are preprocessed and sent for inference at once. Then, we can create batches from the test_images list as follows:

batch_size = 4
test_batches = [test_images[i:i+batch_size] for i in range(0, len(test_images), batch_size)]

Let's iterate over batch sizes from 1 to 32 (at 32, my GPU ran out of memory) and print the time taken for each.

for batch_size in [1, 2, 4, 8, 16, 32]: #, 64, 128, 256]:
    test_batches = [test_images[i:i+batch_size] for i in range(0, len(test_images), batch_size)]
    outputs = []

    start_time = time()
    preprocessed_batches = [model.preprocess(batch) for batch in test_batches]

    for i, preprocessed_batch in enumerate(preprocessed_batches):
        output = model.forward(preprocessed_batch)
        output = output.detach()
        outputs.append(output)

        if (i + 1) % 25 == 0:
            print(f"Processed: {i} / {len(preprocessed_batches)}")

    end_time = time()
    time_elapsed = end_time - start_time

    print(
        f"Num images: {len(test_images)}, "
        f"Batch size: {batch_size}, "
        f"Time taken: {time_elapsed} s, "
        f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
    )

Here's the entire test.py file:

from os import listdir
from os.path import join as path_join
from time import time

import cv2
import numpy as np

from config import inputs_dir, outputs_dir, weights_dir
from model import Model


if __name__ == "__main__":
    test_images = [cv2.imread(path_join(inputs_dir, image_name), cv2.IMREAD_GRAYSCALE) for image_name in listdir(inputs_dir)]
    print("Images loaded.")

    model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=True)
    print("Model loaded.")

    for batch_size in [1, 2, 4, 8, 16, 32]: #, 64, 128, 256]:
        test_batches = [test_images[i:i+batch_size] for i in range(0, len(test_images), batch_size)]
        outputs = []

        start_time = time()
        preprocessed_batches = [model.preprocess(batch) for batch in test_batches]

        for i, preprocessed_batch in enumerate(preprocessed_batches):
            output = model.forward(preprocessed_batch)
            output = output.detach()
            outputs.append(output)

            if (i + 1) % 25 == 0:
                print(f"Processed: {i} / {len(preprocessed_batches)}")

        end_time = time()
        time_elapsed = end_time - start_time

        print(
            f"Num images: {len(test_images)}, "
            f"Batch size: {batch_size}, "
            f"Time taken: {time_elapsed} s, "
            f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
        )

        for i, output in enumerate(outputs):
            for j, single_output in enumerate(output):
                boolean_array = model.postprocess(single_output, (512, 512))
                binary_image = boolean_array.astype(np.uint8) * 255
                file_path = path_join(outputs_dir, f"{i * batch_size + j}.png")
                cv2.imwrite(file_path, binary_image)

Run the code and you should see logs in the console saying how long it took. I've tabulated my results (for 5,000 input images):

Num images: 5,000 | Device: GPU

| Batch Size | Time Taken (s) | Time Taken per Image (ms) | FPS | CUDA Memory Consumption (GB) | Improvement over Batch Size 1 | Improvement over Previous Row |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 46.9190 | 9.3838 | 106.5666 | 0.9051 | 1x | - |
| 2 | 40.4961 | 8.0992 | 123.4689 | 1.1397 | 1.1586x | 1.1586x |
| 4 | 39.8241 | 7.9648 | 125.5524 | 1.5774 | 1.1781x | 1.0168x |
| 8 | 39.3241 | 7.8648 | 127.1488 | 2.6414 | 1.1931x | 1.0127x |
| 16 | 39.2899 | 7.8579 | 127.2442 | 4.4764 | 1.1940x | 1.00087x |

The biggest improvement comes from going from batch size 1 to 2, and that step also has the smallest increase in memory consumption.
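
The CUDA memory numbers in the table were recorded separately. Here is a sketch of how the peak memory for a given batch size can be measured with torch.cuda.max_memory_allocated; this instrumentation is my assumption and is not part of the test.py shown above.

import torch

# Assumes `model` and `test_batches` are defined as in the loop above.
torch.cuda.reset_peak_memory_stats()

for batch in test_batches:
    output = model.forward(model.preprocess(batch))

torch.cuda.synchronize()
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak CUDA memory: {peak_gb:.4f} GB")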

Batched CPU Inference: Any Good?

I'll set use_cuda to False where we load the model and run the inference again, with the same batch sizes.

model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=False)

Num images: 5,000 | Device: CPU

| Batch Size | Time Taken | Time Taken per Image (s) | FPS |
| --- | --- | --- | --- |
| 1 | 01h51m14s | 1.3348 | 0.7491 |
| 2 | 01h47m19s | 1.2878 | 0.7765 |
| 4 | 01h42m44s | 1.2327 | 0.8112 |
| 8 | 01h40m47s | 1.2095 | 0.8267 |
| 16 | 01h38m59s | 1.1877 | 0.8419 |

In this case, there is a trend of decreasing inference time with larger batch sizes, but the improvement is not significant.

Model Dependence

The improvement we get from running a model on the GPU, or from running batched inference, depends highly on how the model has been constructed. Here are some tests I ran on another model (these were on an RTX 4090, with much smaller images, and a model for a different task).

Num images: 21,846 | Device: GPU

| Batch Size | Time Taken (s) | Time Taken per Image (ms) | FPS | CUDA Memory Consumption (GB) | Improvement over Batch Size 1 | Improvement over Previous Row |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 219.56 | 10.05 | 99.50 | 0.2628 | 1x | - |
| 2 | 134.41 | 6.15 | 162.60 | 0.3487 | 1.6341x | 1.6341x |
| 4 | 76.88 | 3.5193 | 284.14 | 0.4893 | 2.8556x | 1.7474x |
| 8 | 44.47 | 2.0360 | 491.15 | 0.8018 | 4.9361x | 1.6933x |
| 16 | 26.64 | 1.2196 | 819.94 | 1.3409 | 8.2406x | 1.6694x |
| 32 | 19.37 | 0.8871 | 1,127.26 | 2.3565 | 11.3292x | 1.3748x |
| 64 | 14.69 | 0.6725 | 1,486.98 | 4.4815 | 14.9445x | 1.3191x |
| 128 | 12.84 | 0.5881 | 1,700.39 | 9.6925 | 17.0893x | 1.1435x |
| 256 | 12.35 | 0.5656 | 1,768.03 | 19.2745 | 17.7691x | 1.0397x |

The scaling behaviour for this model and input is very different: batching keeps paying off up to much larger batch sizes before the gains flatten out.
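
The derived columns in these tables follow directly from the measured wall-clock times. For reference, this is the arithmetic, shown here with the RTX 4090 timings from the table above; small differences from the tabulated values are rounding.

num_images = 21_846
timings = {1: 219.56, 2: 134.41, 4: 76.88, 8: 44.47, 16: 26.64, 32: 19.37, 64: 14.69, 128: 12.84, 256: 12.35}

baseline = timings[1]
previous = None
for batch_size, seconds in timings.items():
    per_image_ms = 1000 * seconds / num_images  # time taken per image, in ms
    fps = num_images / seconds                  # images processed per second
    vs_batch_1 = baseline / seconds             # improvement over batch size 1
    vs_previous = previous / seconds if previous else None  # improvement over the previous row
    print(batch_size, round(per_image_ms, 4), round(fps, 2), round(vs_batch_1, 4),
          round(vs_previous, 4) if vs_previous else "-")
    previous = seconds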

Conclusion

In this post, we have seen that GPUs are significantly faster than CPUs for ML inference tasks. We have also seen that processing multiple inputs in parallel, i.e. in a batch, decreases the inference time. The relative decrease is considerable at smaller batch sizes but diminishes as the batch size grows. The effect on performance also depends on the model architecture and its implementation.

Or maybe you want to serve models that run on the client side; in that case, this article may interest you, where we run a model in the user's web browser using TensorFlow.js.

See you next time :)
