Semantic Image Segmentation with DeepLabv3-pytorch | by Vinayak Nayak | Dec, 2020
We will be using opencv to interface a webcam for reading in input from our screens and we’ll use matplotlib’s pyplot module to render the processed video feed to output.
opencv’s VideoCapture object is used to get the image input for video. If you have multiple webcams you could create multiple such objects by passing the appropriate index; by default nowadays, most monitors have one inbuilt camera which could be indexed at 0th position.
Subsequently, opencv reads images in a BGR format but while rendering we need to show it in RGB format; so we’ve written a tiny function that captures a frame in realtime and converts it from BGR format to RGB format above. With this, we’re set with the input preprocessing steps. Let’s look at how we’ll set the stage for output now.
Why use matplotlib.pyplot and not cv2.imshow? There are some inconsistencies with cv2.imshow when it comes to ubuntu distributions; I had one such issue which caused me to look at alternate methods and since pyplot is easy to implement and mostly included with all major python distributions like anaconda by default, I thought of using pyplot for rendering the output.
So basically we set up two subplots as seen in the gif on the top; one to see the blurred version and another one to actually look at the labels mask which the deeplab model has predicted. We switch on the interactive mode for pyplot and display the image captured directly at the very beginning of the stream. Now, that we have the stage set, let’s discuss the part to obtain predictions from the deeplab-v3 model.
The model offered at torch-hub for segmentation is trained on PASCAL VOC dataset which contains 20 different classes of which the most important one for us is the person class with label 15.
Using the above code we can download the model from torch-hub and use it for our segmentation task. Note that since this is based on top of a Resnet-101 model which is quite heavy, the inference will take time and rendering will lag heavily if you do not have a medium sized GPU if not a high end one. Now that we have the model loaded, let’s discuss how we could get predictions from it.
We will first load the image using Pillow or directly get it from the VideoCapture object of cv2 defined above. Once we have this, we will normalize it using imagenet stats and convert it to a tensor. Then if we have a GPU available at our disposal, we can move the tensor to GPU. All pre-trained models expect input images in mini-batches of 3-channel RGB images of shape
(N, 3, H, W), where
N is the number of images,
W are height and width of the two images respectively. Since our video capture object captures single frames, it’s output is
(3, H, W). We therefore unsqueeze the tensor along the first dimension to make it
(1, 3, H, W).
Once we do that, we then put the model in eval mode and without autograd perform a forward pass through the same. The forward pass gives
out objects from which, the
out object is of interest to us during inference. The
aux values have loss value per pixel and are useful during train time. So, we get the out object and do a argmax along the 1st dimension to obtain the labels map which is of the same height and width as the original image with a single channel. This mask could be used for segmentation which we’ll look at in the next section.
Read More …