A comparison of ML inference speed and memory consumption across various batch sizes on both GPU and CPU.
June 09, 2024
08m Read
By: Abhilaksh Singh Reen
When deploying your ML model to production, it is important to get the inference process running as fast as possible, and GPUs excel at exactly the kind of parallel processing this requires. In today's article, we compare the inference speeds of running a PyTorch model on a CPU and a GPU. We also test how processing multiple images in parallel, as batches, affects the speed on the GPU.
Inside the project directory, let's create a folder called src. Inside this folder, create a new file called model.py. Here, we'll define a very simple model based on the UNet++ architecture, using the segmentation_models_pytorch package. Inside the file, we create a class called Model that has four main methods:
1. __init__: initializes the model, loads the weights, and creates the preprocessing transform.
2. preprocess: takes a list of images and returns a batch tensor that can be passed to the forward function.
3. postprocess: takes a single model output and returns a binary mask, with an option to resize it.
4. forward: runs the model on a preprocessed batch and returns its output.
from segmentation_models_pytorch import UnetPlusPlus
import torch
import torch.nn.functional as F
from torchvision import transforms
class Model:
    def __init__(self, weights_path, use_cuda=False):
        # Use CUDA only if it was requested and is actually available.
        self.using_cuda = use_cuda and torch.cuda.is_available()
        self.device = torch.device("cuda") if self.using_cuda else torch.device("cpu")
        self.dtype = torch.cuda.FloatTensor if self.using_cuda else torch.Tensor

        self.model = UnetPlusPlus(
            encoder_name="resnet34",
            encoder_depth=4,
            encoder_weights="imagenet",
            decoder_channels=(256, 128, 64, 32),
            in_channels=1,
            classes=1,
            activation=None,
            aux_params=None
        )
        self.model = self.model.to(self.device)
        self.model.load_state_dict(torch.load(weights_path, map_location=self.device))
        # Switch to inference mode: disables dropout and uses BatchNorm's running statistics.
        self.model.eval()

        self.preprocessing_transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Resize(size=(256, 256))
        ])

        if use_cuda and not self.using_cuda:
            print("CUDA is not available to Torch, using CPU instead.")

    def preprocess(self, images):
        # Transform each image and stack the results into a single batch tensor on the target device.
        transformed_images = torch.stack([self.preprocessing_transform(image) for image in images])
        transformed_images = transformed_images.type(self.dtype)
        batched_images = transformed_images.to(self.device)
        return batched_images

    def postprocess(self, output, output_size=(256, 256)):
        # Threshold the predicted mask, resize it, and return it as a boolean NumPy array.
        output = output[0]
        output = output.ge(0.5)
        output = F.interpolate(output.unsqueeze(0).unsqueeze(0).float(), size=output_size, mode="nearest")
        output = output.bool().squeeze(0).squeeze(0)
        if self.using_cuda:
            output = output.cpu()
        output = output.numpy()
        return output

    def forward(self, preprocessed_images):
        output = self.model.forward(preprocessed_images)
        output = torch.sigmoid(output)
        output = output.detach()
        return output
In the following three lines, we define self.using_cuda, self.device, and self.dtype based on whether the constructor was asked to use CUDA and whether CUDA is actually available to torch.
self.using_cuda = use_cuda and torch.cuda.is_available()
self.device = torch.device("cuda") if self.using_cuda else torch.device("cpu")
self.dtype = torch.cuda.FloatTensor if self.using_cuda else torch.Tensor
We then move the model to the device and load the weights, mapping them to the same device:
self.model = self.model.to(self.device)
self.model.load_state_dict(torch.load(weights_path, map_location=self.device))
And in the preprocess function, we transfer the batched images to the device as well.
batched_images = transformed_images.to(self.device)
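To make the interface concrete, here is a minimal usage sketch of the Model class; the file paths and the sample image are placeholders for illustration, and the relative paths assume the script lives in src.

import cv2
import numpy as np

from model import Model

# Placeholder paths for illustration; adjust them to wherever your weights and images live.
model = Model("../weights/weights_001.pt", use_cuda=False)

image = cv2.imread("../inputs/sample_image.png", cv2.IMREAD_GRAYSCALE)
batch = model.preprocess([image])                # shape: [1, 1, 256, 256]
output = model.forward(batch)                    # shape: [1, 1, 256, 256]
mask = model.postprocess(output[0], (512, 512))  # 512 x 512 boolean NumPy array

cv2.imwrite("../outputs/sample_mask.png", mask.astype(np.uint8) * 255)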
In the src directory, we'll make another file, config.py, to store the paths of the directories we will be using for testing.
from os.path import dirname, join as path_join
inputs_dir = path_join(dirname(dirname(__file__)), "inputs")
weights_dir = path_join(dirname(dirname(__file__)), "weights")
outputs_dir = path_join(dirname(dirname(__file__)), "outputs")
Let us also create these three folders. Here is what the project's directory structure should look like after doing so (a small snippet to create the folders from code follows the tree).
├───inputs
│ sample_image.png
│
├───outputs
│ sample_mask.png
│
├───src
│ config.py
│ model.py
│ test.py
│
└───weights
weights_001.pt
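If you prefer to create the folders from code rather than by hand, here is a small sketch (run it from inside src so that config.py is importable) that builds them from the paths we just defined:

from os import makedirs

from config import inputs_dir, outputs_dir, weights_dir

# Create the three directories if they don't already exist.
for directory in [inputs_dir, outputs_dir, weights_dir]:
    makedirs(directory, exist_ok=True)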
Let's work on the src/test.py file. Our input images are located in the inputs folder and model weights are in the weights folder.
from os import listdir
from os.path import join as path_join
from time import time
import cv2
import numpy as np
from config import inputs_dir, outputs_dir, weights_dir
from model import Model
if __name__ == "__main__":
    test_images = [cv2.imread(path_join(inputs_dir, image_name), cv2.IMREAD_GRAYSCALE) for image_name in listdir(inputs_dir)]
    print("Images loaded.")

    model = Model(path_join(weights_dir, "weights_001.pt"))
    print("Model loaded.")
Next, we can iterate over each of the test images, preprocess them, run a forward pass, postprocess, and save the output.
    start_time = time()

    outputs = []
    for i, test_image in enumerate(test_images):
        preprocessed_image = model.preprocess([test_image])
        output = model.forward(preprocessed_image)
        output = output.detach()
        outputs.append(output)

    end_time = time()
    time_elapsed = end_time - start_time
    print(
        f"Num images: {len(test_images)}, "
        f"Batch size: {1}, "
        f"Time taken: {time_elapsed} s, "
        f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
    )

    for i, output in enumerate(outputs):
        boolean_array = model.postprocess(output[0], (512, 512))
        binary_image = boolean_array.astype(np.uint8) * 255
        file_path = path_join(outputs_dir, f"{i}.png")
        cv2.imwrite(file_path, binary_image)
Here's the entire test.py file:
from os import listdir
from os.path import join as path_join
from time import time
import cv2
import numpy as np
from config import inputs_dir, outputs_dir, weights_dir
from model import Model
if __name__ == "__main__":
    test_images = []
    all_image_names = listdir(inputs_dir)
    for i, image_name in enumerate(all_image_names):
        image = cv2.imread(path_join(inputs_dir, image_name), cv2.IMREAD_GRAYSCALE)
        test_images.append(image)

        if (i + 1) % 500 == 0:
            print(f"Loaded images: {i + 1} / {len(all_image_names)}")

        # Uncomment to do a quick run on a small subset of the images.
        # if (i + 1) % 50 == 0:
        #     break
    print("Images loaded.")

    model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=False)
    print("Model loaded.")

    start_time = time()

    outputs = []
    for i, test_image in enumerate(test_images):
        preprocessed_image = model.preprocess([test_image])
        output = model.forward(preprocessed_image)
        output = output.detach()
        outputs.append(output)

    end_time = time()
    time_elapsed = end_time - start_time
    print(
        f"Num images: {len(test_images)}, "
        f"Batch size: {1}, "
        f"Time taken: {time_elapsed} s, "
        f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
    )

    for i, output in enumerate(outputs):
        boolean_array = model.postprocess(output[0], (512, 512))
        binary_image = boolean_array.astype(np.uint8) * 255
        file_path = path_join(outputs_dir, f"{i}.png")
        cv2.imwrite(file_path, binary_image)
Right now, in the line where we initialize the model, we pass the use_cuda parameter as False. We don't actually need to do that, since False is the default.
model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=False)
Let's run the script.
For 5,000 images, it took around 6674.206 seconds (01h51m14s). This is around 1.3348 s per image or 0.7491 FPS.
Batch Size | Time Taken | Time Taken per Image (s) | FPS |
---|---|---|---|
1 | 01h51m14s | 1.3348 | 0.7491 |
Before we try to run our model on a GPU, we should make sure that CUDA is indeed available to torch. Open up a new Python interpreter in your environment and run the following lines:
import torch
torch.cuda.is_available()
You should get True in the output.
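Optionally, you can also check which GPU torch sees:

import torch

print(torch.cuda.is_available())      # should print True
print(torch.cuda.get_device_name(0))  # prints the name of GPU 0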
Let's now convert our code to run the inference on the machine's GPU. Since we have already written the logic to handle this in model.py, we can change the inference device by changing a single line in test.py, where we initialize our model.
model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=True)
Let's run this and see how much of a speed improvement we got.
Batch Size | Time Taken (s) | Time Taken per Image (ms) | FPS |
---|---|---|---|
1 | 49.6585 | 9.9317 | 100.6876 |
Woah! That's a 134.41 times improvement. The MythBusters made a good illustration of this speed comparison. GPUs mean business when it comes to crunching numbers.
Let's take a look at the preprocess function in model.py.
def preprocess(self, images):
    transformed_images = torch.stack([self.preprocessing_transform(image) for image in images])
    transformed_images = transformed_images.type(self.dtype)
    batched_images = transformed_images.to(self.device)
    return batched_images
Here, we are accepting a list of images, transforming them using the preprocessing transform, and then combining them into a single tensor using torch.stack. We can call forward with this single tensor and the output tensor will include the result for each of the inputs i.e. the output is also a batch.
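As a quick sanity check, here is a small sketch (assuming the Model instance from test.py is available as model) showing that preprocess turns a list of four images into a single batched tensor and that the model returns one output per image:

import numpy as np

# Four hypothetical grayscale images; any size works, since preprocess resizes them to 256 x 256.
images = [np.random.randint(0, 256, size=(512, 512), dtype=np.uint8) for _ in range(4)]

batch = model.preprocess(images)
print(batch.shape)    # torch.Size([4, 1, 256, 256])

outputs = model.forward(batch)
print(outputs.shape)  # torch.Size([4, 1, 256, 256])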
In test.py, we'll create a variable called batch_size to define how many images are preprocessed and sent for inference at once. Then, we can create batches from the test_images list as follows:
batch_size = 4
test_batches = [test_images[i:i+batch_size] for i in range(0, len(test_images), batch_size)]
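Note that the last batch may be smaller than batch_size when the number of images is not an exact multiple of it. A quick sketch of the slicing with hypothetical numbers:

items = list(range(10))
batch_size = 4

batches = [items[i:i+batch_size] for i in range(0, len(items), batch_size)]
print([len(batch) for batch in batches])  # [4, 4, 2]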
Let's iterate over batch sizes from 1 to 32 (at 32, my GPU ran out of memory) and print the time taken for each.
for batch_size in [1, 2, 4, 8, 16, 32]:  # , 64, 128, 256
    test_batches = [test_images[i:i+batch_size] for i in range(0, len(test_images), batch_size)]

    outputs = []

    start_time = time()

    preprocessed_batches = [model.preprocess(batch) for batch in test_batches]
    for i, preprocessed_batch in enumerate(preprocessed_batches):
        output = model.forward(preprocessed_batch)
        output = output.detach()
        outputs.append(output)

        if (i + 1) % 25 == 0:
            print(f"Processed: {i} / {len(preprocessed_batches)}")

    end_time = time()
    time_elapsed = end_time - start_time
    print(
        f"Num images: {len(test_images)}, "
        f"Batch size: {batch_size}, "
        f"Time taken: {time_elapsed} s, "
        f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
    )
Here's the entire test.py file:
from os import listdir
from os.path import join as path_join
from time import time
import cv2
import numpy as np
from config import inputs_dir, outputs_dir, weights_dir
from model import Model
if __name__ == "__main__":
    test_images = [cv2.imread(path_join(inputs_dir, image_name), cv2.IMREAD_GRAYSCALE) for image_name in listdir(inputs_dir)]
    print("Images loaded.")

    model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=True)
    print("Model loaded.")

    for batch_size in [1, 2, 4, 8, 16, 32]:  # , 64, 128, 256
        test_batches = [test_images[i:i+batch_size] for i in range(0, len(test_images), batch_size)]

        outputs = []

        start_time = time()

        preprocessed_batches = [model.preprocess(batch) for batch in test_batches]
        for i, preprocessed_batch in enumerate(preprocessed_batches):
            output = model.forward(preprocessed_batch)
            output = output.detach()
            outputs.append(output)

            if (i + 1) % 25 == 0:
                print(f"Processed: {i} / {len(preprocessed_batches)}")

        end_time = time()
        time_elapsed = end_time - start_time
        print(
            f"Num images: {len(test_images)}, "
            f"Batch size: {batch_size}, "
            f"Time taken: {time_elapsed} s, "
            f"Time taken per image: {1000 * time_elapsed / len(test_images)} ms"
        )

    for i, output in enumerate(outputs):
        for j, single_output in enumerate(output):
            boolean_array = model.postprocess(single_output, (512, 512))
            binary_image = boolean_array.astype(np.uint8) * 255
            file_path = path_join(outputs_dir, f"{i * batch_size + j}.png")
            cv2.imwrite(file_path, binary_image)
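If you run out of GPU memory while sweeping batch sizes (as happened here at batch size 32), it can help to drop references to the preprocessed batches and release PyTorch's cached memory between iterations. A sketch, assuming an extra import torch at the top of test.py:

import torch

for batch_size in [1, 2, 4, 8, 16, 32]:
    test_batches = [test_images[i:i+batch_size] for i in range(0, len(test_images), batch_size)]
    preprocessed_batches = [model.preprocess(batch) for batch in test_batches]

    # ... run and time the forward passes as above ...

    # Drop the references to the large GPU tensors and ask PyTorch to release
    # its cached memory before moving on to the next batch size.
    del preprocessed_batches
    torch.cuda.empty_cache()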
Run the code and you should see logs in the console saying how long it took. I've tabulated my results (for 5,000 input images):
Batch Size | Time Taken (s) | Time Taken per Image (ms) | FPS | CUDA Memory Consumption (GB) | Improvement over Batch Size 1 | Improvement over Previous Row |
---|---|---|---|---|---|---|
1 | 46.9190 | 9.3838 | 106.5666 | 0.9051 | 1x | - |
2 | 40.4961 | 8.0992 | 123.4689 | 1.1397 | 1.1586x | 1.1586x |
4 | 39.8241 | 7.9648 | 125.5524 | 1.5774 | 1.1781x | 1.0168x |
8 | 39.3241 | 7.8648 | 127.1488 | 2.6414 | 1.1931x | 1.0127x |
16 | 39.2899 | 7.8579 | 127.2442 | 4.4764 | 1.1940x | 1.00087x |
The biggest improvement comes from going from batch size 1 to 2, which is also the step with the smallest increase in memory consumption.
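If you want to record the CUDA memory consumption yourself, one option is PyTorch's allocator statistics; this is a sketch and may not be exactly how the numbers above were obtained.

import torch

# Reset the peak-memory counter before running a batch size ...
torch.cuda.reset_peak_memory_stats()

# ... run the preprocessing and inference loop for that batch size here ...

# ... then read the peak memory allocated by PyTorch on the GPU.
peak_gb = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"Peak CUDA memory: {peak_gb:.4f} GB")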
Now I'll set use_cuda to False where we load the model and run the inference again with the same batches.
model = Model(path_join(weights_dir, "weights_001.pt"), use_cuda=False)
Batch Size | Time Taken | Time Taken per Image (s) | FPS |
---|---|---|---|
1 | 01h51m14s | 1.3348 | 0.7491 |
2 | 01h47m19s | 1.2878 | 0.7765 |
4 | 01h42m44s | 1.2327 | 0.8112 |
8 | 01h40m47s | 1.2095 | 0.8267 |
16 | 01h38m59s | 1.1877 | 0.8419 |
In this case, there is a trend of decreasing inference time with larger batch sizes, but the improvement is not significant, since the CPU cannot exploit the extra parallelism the way a GPU can.
The improvement you get from running a model on a GPU, or from running batched inference, depends highly on how the model has been constructed. Here are some tests that I ran on another model (these were on an RTX 4090, with much smaller images and a model for a different task).
Batch Size | Time Taken (s) | Time Taken per Image (ms) | FPS | CUDA Memory Consumption (GB) | Improvement over Batch Size 1 | Improvement over Previous Row |
---|---|---|---|---|---|---|
1 | 219.56 | 10.05 | 99.50 | 0.2628 | 1x | - |
2 | 134.41 | 6.15 | 162.60 | 0.3487 | 1.6341x | 1.6341x |
4 | 76.88 | 3.5193 | 284.14 | 0.4893 | 2.8556x | 1.7474x |
8 | 44.47 | 2.0360 | 491.15 | 0.8018 | 4.9361x | 1.6933x |
16 | 26.64 | 1.2196 | 819.94 | 1.3409 | 8.2406x | 1.6694x |
32 | 19.37 | 0.8871 | 1,127.26 | 2.3565 | 11.3292x | 1.3748x |
64 | 14.69 | 0.6725 | 1,486.98 | 4.4815 | 14.9445x | 1.3191x |
128 | 12.84 | 0.5881 | 1,700.39 | 9.6925 | 17.0893x | 1.1435x |
256 | 12.35 | 0.5656 | 1,768.03 | 19.2745 | 17.7691x | 1.0397x |
The improvements for this model and input are very different: here, batching keeps paying off well beyond batch size 16.
In this post, we have seen how GPUs are significantly faster than CPUs for ML inference tasks. We have also seen that processing multiple inputs in parallel, i.e., as a batch, decreases the inference time per image. The relative decrease is considerable among smaller batch sizes but diminishes as the batch size grows. The effect on performance also depends on the model architecture and its implementation.
Or maybe you want to serve models that run on the client side; in that case, this article may interest you, where we run a model in the user's web browser using TensorFlow.js.
See you next time :)