Roadmap to Computer Vision


By Pier Paolo Ippolito, The University of Southampton




Computer Vision (CV) is nowadays one of the main applications of Artificial Intelligence (e.g. Image Recognition, Object Tracking, Multilabel Classification). In this article, I'll walk you through some of the main steps which compose a Computer Vision system.

A standard representation of the workflow of a Computer Vision system is:

  1. A set of images enters the system.
  2. A Feature Extractor is applied in order to pre-process and extract features from these images.
  3. A Machine Learning system uses the extracted features in order to train a model and make predictions.

We will now briefly walk through some of the main processes our data might go through in each of these three different steps.


Images Enter the System

When trying to implement a CV system, we need to take into consideration two main components: the image acquisition hardware and the image processing software. One of the main requirements to meet in order to deploy a CV system is to test its robustness. Our system should, in fact, be invariant to environmental changes (such as changes in illumination, orientation, scaling) and able to perform its designed task repeatably. In order to satisfy these requirements, it might be necessary to apply some form of constraints to either the hardware or software of our system (e.g. remotely controlling the lighting environment).

Once an image is acquired from a hardware device, there are many possible ways to numerically represent colours (Colour Spaces) within a software system. Two of the most famous colour spaces are RGB (Red, Green, Blue) and HSV (Hue, Saturation, Value). One of the main advantages of using an HSV colour space is that, by taking just the HS components, we can make our system illumination-invariant (Figure 1).
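As a minimal sketch of why dropping the V channel helps, the standard-library `colorsys` module can convert two versions of the same colour, one under full and one under halved illumination (a purely multiplicative change, which is an assumption of this toy example):

```python
import colorsys

# A pixel and the same surface under half the illumination:
# every RGB channel is scaled by the same factor.
bright = (0.8, 0.4, 0.4)
dark = (0.4, 0.2, 0.2)

h1, s1, v1 = colorsys.rgb_to_hsv(*bright)
h2, s2, v2 = colorsys.rgb_to_hsv(*dark)

print(h1 == h2, s1 == s2)  # True True: H and S are unchanged
print(v1, v2)              # 0.8 0.4: only V tracks the illumination
```

Since only V changes with brightness, features built from the H and S channels alone are unaffected by this kind of lighting variation.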


Figure 1: RGB vs HSV colour spaces [1]


Feature Extractor


Image Pre-processing

Once an image enters a system and is represented by using a colour space, we can then apply different operators on the image in order to improve its representation:

  • Point Operators: we use all the points in an image to create a transformed version of the original image (in order to make the content within an image explicit, without changing its content). Some examples of Point Operators are: Intensity Normalisation, Histogram Equalisation and Thresholding. Point Operators are commonly used in order to help visualise an image better for human vision, but they don't necessarily provide any advantage for a Computer Vision system.
  • Group Operators: in this case, we take a group of points from the original image in order to create a single point in the transformed version of the image. This type of operation is typically performed by using Convolution. Different types of kernels can be convolved with the image in order to obtain our transformed result (Figure 2). Some examples are: Direct Averaging, Gaussian Averaging and the Median Filter. Applying a convolution operation to an image can, as a result, decrease the amount of noise in the image and improve smoothing (although this can also end up slightly blurring the image). Since we are using a group of points in order to create a single new point in the new image, the dimensions of the new image will necessarily be lower than those of the original one. One solution to this problem is to apply either zero padding (setting the pixel values to zero) or to use a smaller template at the border of the image. One of the main limitations of using convolution is its execution speed when working with large template sizes; one possible solution to this problem is to use a Fourier Transform instead.
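The group-operator case above can be sketched with a few lines of NumPy: a direct-averaging (box) kernel convolved over a noisy image, with no padding, so the shrinking of the output and the noise reduction are both visible (the image here is random stand-in data, not a real photograph):

```python
import numpy as np

def convolve2d(image, kernel):
    """Direct 2-D convolution with no padding ('valid' mode):
    the output shrinks by (kernel size - 1) in each dimension."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
noisy = 0.5 + 0.1 * rng.standard_normal((32, 32))  # flat image + noise
box = np.full((3, 3), 1 / 9)                       # direct-averaging kernel

smoothed = convolve2d(noisy, box)
print(smoothed.shape)                 # (30, 30): smaller than the 32x32 input
print(smoothed.std() < noisy.std())   # True: averaging reduces the noise
```

Swapping `box` for a Gaussian kernel gives Gaussian Averaging; the Median Filter replaces the weighted sum with a median over the same neighbourhood.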


Once an image has been pre-processed, we can then apply more advanced techniques in order to try to extract the edges and shapes within it, by using methods such as First Order Edge Detection (e.g. Prewitt Operator, Sobel Operator, Canny Edge Detector) and Hough Transforms.
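As a small illustration of first-order edge detection, the horizontal Sobel kernel applied (as a valid cross-correlation) to a synthetic step image responds only where the intensity jumps:

```python
import numpy as np

# Step image: dark left half, bright right half -> one vertical edge
img = np.zeros((8, 8))
img[:, 4:] = 1.0

# Sobel kernel for horizontal intensity changes (vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

def valid_conv(image, k):
    """Slide the kernel over the image with no padding."""
    kh, kw = k.shape
    return np.array([[np.sum(image[i:i + kh, j:j + kw] * k)
                      for j in range(image.shape[1] - kw + 1)]
                     for i in range(image.shape[0] - kh + 1)])

gx = valid_conv(img, sobel_x)
print(np.abs(gx).max())          # 4.0: strong response at the step
print(np.abs(gx[:, 0]).sum())    # 0.0: no response in the flat region
```

The Prewitt operator works the same way with uniform weights, and Canny adds Gaussian smoothing, non-maximum suppression and hysteresis thresholding on top of such gradient estimates.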


Feature Extraction

Once an image has been pre-processed, there are four main types of Feature Morphologies which can be extracted from it by using a Feature Extractor:

  • Global Features: the whole image is analysed as one, and a single feature vector comes out of the feature extractor. A simple example of a global feature can be a histogram of binned pixel values.
  • Grid or Block-Based Features: the image is split into different blocks, and features are extracted from each of the blocks. One of the main techniques used in order to extract features from blocks of an image is Dense SIFT (Scale Invariant Feature Transform). This type of feature is used prevalently to train Machine Learning models.
  • Region-Based Features: the image is segmented into different regions (e.g. using techniques such as thresholding or K-Means Clustering, and then connecting them into segments using Connected Components), and a feature is extracted from each of these regions. Features can be extracted by using region and boundary description techniques such as Moments and Chain Codes.
  • Local Features: multiple single interest points are detected in the image, and features are extracted by analysing the pixels neighbouring the interest points. Two of the main types of interest points which can be extracted from an image are corners and blobs; these can be extracted by using methods such as the Harris & Stephens Detector and Laplacian of Gaussians. Features can finally be extracted from the detected interest points by using techniques such as SIFT (Scale Invariant Feature Transform). Local Features are often used in order to match images to build a panorama/3D reconstruction or to retrieve images from a database.
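The simplest of the four morphologies, the global feature, can be sketched directly: one histogram of binned pixel values for the whole image (the image here is random stand-in data, and pixel values are assumed to lie in [0, 1]):

```python
import numpy as np

def global_histogram_feature(image, n_bins=8):
    """Global feature: a single fixed-length vector for the whole image,
    built by binning pixel intensities (values assumed in [0, 1])."""
    hist, _ = np.histogram(image, bins=n_bins, range=(0.0, 1.0))
    return hist / hist.sum()  # normalise so images of any size compare

rng = np.random.default_rng(1)
img = rng.random((16, 16))

feat = global_histogram_feature(img)
print(feat.shape)  # (8,): one feature vector per image, whatever its size
```

Grid-based features apply the same idea (or a descriptor such as Dense SIFT) per block and concatenate the results; local features such as SIFT instead describe small patches around detected interest points.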

Once a set of discriminative features has been extracted, we can then use them in order to train a Machine Learning model to make inference. Feature descriptors can be easily applied in Python using libraries such as OpenCV.


Machine Learning

One of the main concepts used in Computer Vision to classify an image is the Bag of Visual Words (BoVW). In order to construct a Bag of Visual Words, we need first of all to create a vocabulary by extracting all the features from a set of images (e.g. using grid-based features or local features). Successively, we can then count the number of times an extracted feature appears in an image and build a frequency histogram from the results. Using the frequency histogram as a basic template, we can finally classify whether an image belongs to the same class or not by comparing their histograms (Figure 3).

This process can be summarised in the following few steps:

  1. We first build a vocabulary by extracting the different features from a dataset of images using feature extraction algorithms such as SIFT and Dense SIFT.
  2. Secondly, we cluster all the features in our vocabulary using algorithms such as K-Means or DBSCAN and use the cluster centroids in order to summarise our data distribution.
  3. Finally, we can construct a frequency histogram from each image by counting the number of times different features from the vocabulary appear in the image.
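The three steps above can be sketched in NumPy. Random vectors stand in for the pooled SIFT descriptors of step 1, a few hand-rolled K-Means iterations play the role of step 2 (in practice one would use a library implementation such as scikit-learn's `KMeans`), and step 3 bins one image's descriptors against the resulting vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (assumed already done): descriptors pooled from a training set,
# e.g. 128-D SIFT vectors; random stand-ins here
train_descriptors = rng.random((200, 128))

# Step 2: cluster the pooled descriptors; centroids become "visual words"
k = 10
words = train_descriptors[rng.choice(len(train_descriptors), k, replace=False)]
for _ in range(10):  # a few plain K-Means iterations
    dists = ((train_descriptors[:, None] - words) ** 2).sum(-1)
    labels = dists.argmin(axis=1)
    words = np.array([train_descriptors[labels == c].mean(0)
                      if np.any(labels == c) else words[c]
                      for c in range(k)])

# Step 3: represent one image as a frequency histogram over the vocabulary
def bovw_histogram(descriptors, words):
    nearest = ((descriptors[:, None] - words) ** 2).sum(-1).argmin(axis=1)
    hist = np.bincount(nearest, minlength=len(words)).astype(float)
    return hist / hist.sum()

image_descriptors = rng.random((30, 128))  # descriptors of one image
hist = bovw_histogram(image_descriptors, words)
print(hist.shape)  # (10,): one bin per visual word
```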

New images can then be classified by repeating this same process for each image we want to classify and then using any classification algorithm to find out which image in our vocabulary resembles our test image the most.
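The final comparison can be as simple as a nearest-neighbour search over stored class histograms. The histograms and class names below are made up purely for illustration, and the L1 distance is just one of several reasonable choices (chi-squared and histogram intersection are common alternatives):

```python
import numpy as np

def histogram_distance(h1, h2):
    """L1 distance between two normalised BoVW histograms;
    smaller means more similar images."""
    return np.abs(h1 - h2).sum()

# Hypothetical per-class template histograms over a 3-word vocabulary
templates = {"cat": np.array([0.7, 0.2, 0.1]),
             "car": np.array([0.1, 0.3, 0.6])}
test_hist = np.array([0.6, 0.3, 0.1])  # histogram of the test image

# Nearest-neighbour classification over the stored templates
label = min(templates, key=lambda c: histogram_distance(templates[c], test_hist))
print(label)  # cat
```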


Figure 3: Bag of Visual Words [2]


Nowadays, thanks to the creation of Artificial Neural Network architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), it has been possible to devise an alternative workflow for Computer Vision (Figure 4).


Figure 4: Computer Vision Workflow [3]


In this case, the Deep Learning algorithm incorporates both the Feature Extraction and Classification steps of the Computer Vision workflow. When using Convolutional Neural Networks, each layer of the neural network applies different feature extraction techniques at its own level of description (e.g. Layer 1 detects edges, Layer 2 finds shapes in an image, Layer 3 segments the image, etc.) before providing the feature vectors to the dense layer classifier.
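A bare-bones NumPy forward pass can illustrate this structure (a single untrained convolutional layer feeding a dense classifier; a real model would stack many such layers and learn the kernels and weights, typically with a framework such as PyTorch or TensorFlow):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_layer(image, kernels):
    """One convolutional layer: each kernel produces a feature map
    (after training, early layers tend to respond to edge-like patterns)."""
    kh, kw = kernels.shape[1:]
    h, w = image.shape
    maps = np.zeros((len(kernels), h - kh + 1, w - kw + 1))
    for n, k in enumerate(kernels):
        for i in range(maps.shape[1]):
            for j in range(maps.shape[2]):
                # ReLU activation on top of the convolution response
                maps[n, i, j] = max((image[i:i + kh, j:j + kw] * k).sum(), 0.0)
    return maps

image = rng.random((8, 8))                    # toy single-channel image
kernels = rng.standard_normal((4, 3, 3))      # 4 (normally learned) 3x3 filters

features = conv_layer(image, kernels).reshape(-1)  # feature vector for the classifier
dense_w = rng.standard_normal((features.size, 2))  # dense layer, 2 classes
logits = features @ dense_w
print(features.shape, logits.shape)  # (144,) (2,)
```

The key point of the workflow in Figure 4 is visible even in this sketch: the same model maps raw pixels to a feature vector and then to class scores, with no hand-designed feature extractor in between.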

Further applications of Machine Learning in Computer Vision include areas such as Multilabel Classification and Object Recognition. In Multilabel Classification, we aim to construct a model able to correctly identify how many objects there are in an image and what classes they belong to. In Object Recognition, instead, we aim to take this concept a step further by also identifying the position of the different objects in the image.



If you want to keep updated with my latest articles and projects, follow me on Medium and subscribe to my mailing list.



[1] Modular robot used as a beach cleaner, Felippe Roza, ResearchGate. Accessed at: https://www.researchgate.net/figure/RGB-left-and-HSV-right-color-spaces_fig1_310474598
[2] Bag of visual words in OpenCV, Vision & Graphics Group, Jan Kundrac. Accessed at:
[3] Deep Learning Vs. Traditional Computer Vision, Haritha Thilakarathne, NaadiSpeaks. Accessed at:

Bio: Pier Paolo Ippolito is a final year MSc Artificial Intelligence student at The University of Southampton. He is an AI Enthusiast, Data Scientist and RPA Developer.

Original. Reposted with permission.


