What is computer vision? AI for images and video

On Aug 30, 2020

Computer vision identifies and often locates objects in digital images and videos. Since living organisms process images with their visual cortex, many researchers have taken the architecture of the mammalian visual cortex as a model for neural networks designed to perform image recognition. The biological research goes back to the 1950s.

Computer vision has progressed remarkably over the last 20 years. While not yet perfect, some computer vision systems achieve 99% accuracy on the tasks they were built for, and others run acceptably on mobile devices.

The breakthrough in the neural network field for vision was Yann LeCun’s 1998 LeNet-5, a seven-level convolutional neural network for recognition of handwritten digits digitized in 32×32 pixel images. To analyze higher-resolution images, the LeNet-5 network would need to be expanded to more neurons and more layers.

Today’s best image classification models can identify diverse catalogs of objects at HD resolution in color. In addition to pure deep neural networks (DNNs), people sometimes use hybrid vision models, which combine deep learning with classical machine-learning algorithms that perform specific sub-tasks.

Deep learning has also been applied successfully to vision problems beyond basic image classification, including image classification with localization, object detection, object segmentation, image style transfer, image colorization, image reconstruction, image super-resolution, and image synthesis.

How does computer vision work?

Computer vision algorithms usually rely on convolutional neural networks, or CNNs. CNNs typically use convolutional, pooling, ReLU, fully connected, and loss layers to simulate a visual cortex.

The convolutional layer slides small learned filters across the image and computes a weighted sum over each small overlapping region, which is the discrete counterpart of taking an integral. The pooling layer performs a form of non-linear down-sampling. ReLU layers apply the non-saturating activation function f(x) = max(0,x).
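To make these three operations concrete, here is a minimal sketch in plain Python, with no framework involved. The function names, the 5×5 toy image, and the hand-picked 2×2 edge filter are all assumptions chosen for illustration; a real CNN learns its filter weights during training and runs these operations on optimized tensor libraries.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image (stride 1, no padding) and take the
    weighted sum of each overlapping region. Like most deep learning
    libraries, this does not flip the kernel (strictly, cross-correlation)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Apply f(x) = max(0, x) to every element."""
    return [[max(0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Down-sample by keeping the maximum of each non-overlapping
    size x size block (leftover rows and columns are dropped)."""
    out_h = len(feature_map) // size
    out_w = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy 5x5 image: a bright vertical stripe (1) on a dark background (0).
image = [[0, 1, 1, 0, 0] for _ in range(5)]

# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [[-1, 1],
          [-1, 1]]

features = conv2d(image, kernel)   # 4x4 map; negative at the bright-to-dark edge
activated = relu(features)         # negative responses are zeroed
pooled = max_pool(activated)       # 2x2 map after down-sampling

print(features[0])  # [2, 0, -2, 0]
print(pooled)       # [[2, 0], [2, 0]]
```

The filter fires positively where the image goes from dark to bright and negatively where it goes from bright to dark; ReLU discards the negative response, and pooling halves the resolution while keeping the strongest activation in each block.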

In a fully connected layer, the neurons have connections to all activations in the previous layer. A loss layer specifies how training penalizes the deviation between the predicted and true labels; for classification, this is typically a softmax followed by a cross-entropy loss.
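As a rough illustration of the last two layer types, the sketch below implements a fully connected layer and a softmax cross-entropy loss in plain Python. It assumes the common pairing of softmax with cross-entropy; the weights, biases, and input activations are made-up values, not taken from any trained model.

```python
import math


def fully_connected(inputs, weights, biases):
    """Every output neuron is connected to every input activation:
    out[k] = biases[k] + sum_i weights[k][i] * inputs[i]."""
    return [b + sum(w * x for w, x in zip(row, inputs))
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1.
    Subtracting the maximum first keeps exp() numerically stable."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """Penalty is the negative log of the probability given to the true label."""
    return -math.log(probs[true_class])


# Hypothetical activations from the previous layer and hand-picked parameters
# for a three-class output layer.
activations = [1.0, 2.0]
weights = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 1.0]]
biases = [0.0, 0.0, -3.0]

logits = fully_connected(activations, weights, biases)  # [1.0, 2.0, 0.0]
probs = softmax(logits)

loss_if_class_1 = cross_entropy(probs, 1)  # true label matches the top score
loss_if_class_2 = cross_entropy(probs, 2)  # true label got the lowest score

print(logits)
print(loss_if_class_1 < loss_if_class_2)  # True
```

The loss is small when the network assigns high probability to the true label and grows as that probability shrinks, which is the signal training uses to adjust the weights.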

Computer vision training datasets

There are many public image datasets that are useful for training vision models. The simplest, and one of the oldest, is MNIST, which contains 70,000 handwritten digits in 10 classes: 60K for training and 10K for testing. MNIST is an easy dataset to model, even using a laptop with no acceleration hardware. CIFAR-10 and Fashion-MNIST are similar 10-class datasets. SVHN (Street View House Numbers) is a set of 600K images of real-world house numbers extracted from Google Street View.

COCO is a larger-scale dataset for object detection, segmentation, and captioning, with 330K images in 80 object categories. ImageNet contains about 1.5 million images with bounding boxes and labels, illustrating about 100K phrases from WordNet. Open Images contains about nine million URLs to images, with about 5K labels.

Google, Azure, and AWS all have their own vision models trained against very large image databases. You can use these as is, or run transfer learning to adapt these models to your own image datasets. You can also perform transfer learning using models based on ImageNet and Open Images. The advantages of transfer learning over building a model from scratch are that it is much faster (hours rather than weeks) and that it gives you a more accurate model. You’ll still need 1,000 images per label for the best results, although you can sometimes get away with as few as 10 images per label.
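The idea behind transfer learning can be sketched in a few lines of plain Python: keep the pretrained feature-extraction layers frozen and train only a new classification head on your own labeled images. Everything in this toy is an assumption made for illustration: the two-row FROZEN_WEIGHTS matrix stands in for a deep pretrained network, the four-value "images" stand in for real pixels, and the head is a simple logistic regression. In practice you would do this with a framework such as TensorFlow or PyTorch, or through a cloud service.

```python
import math

# Stand-in for a pretrained backbone. These hypothetical weights are fixed
# and never updated; a real backbone would be a deep CNN trained on a large
# dataset such as ImageNet.
FROZEN_WEIGHTS = [[1.0, -1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, -1.0]]


def extract_features(pixels):
    """Frozen layers: a linear map followed by ReLU. Nothing is trained here."""
    return [max(0.0, sum(w * p for w, p in zip(row, pixels)))
            for row in FROZEN_WEIGHTS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.5):
    """Train only the new head (logistic regression) on the frozen features,
    using plain stochastic gradient descent."""
    features = [extract_features(s) for s in samples]
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            pred = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = pred - y  # gradient of the cross-entropy loss w.r.t. the logit
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict(pixels, weights, bias):
    x = extract_features(pixels)
    score = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
    return 1 if score >= 0.5 else 0


# Tiny made-up dataset: two examples per label.
samples = [[1.0, 0.0, 0.0, 0.0], [0.9, 0.1, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.8, 0.1]]
labels = [0, 0, 1, 1]

head_weights, head_bias = train_head(samples, labels)
predictions = [predict(s, head_weights, head_bias) for s in samples]
print(predictions)
```

Because only the small head is trained, the optimization has few parameters to fit, which is one reason transfer learning can work with far fewer images and far less compute than training a full network from scratch.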

Computer vision applications

While computer vision isn’t perfect, it’s often good enough to be practical. A good example is vision in self-driving automobiles.

Waymo, formerly the Google self-driving car project, claims tests on seven million miles of public roads and the ability to navigate safely in daily traffic. There has been at least one accident involving a Waymo van; the software was not believed to be at fault, according to police.

Tesla has three models of self-driving car. In 2018 a Tesla SUV in self-driving mode was involved in a fatal accident. The report on the accident said that the driver (who was killed) had his hands off the steering wheel despite multiple warnings from the console, and that neither the driver nor the software tried to brake to avoid hitting the concrete barrier. The software has since been upgraded to require rather than suggest that the driver’s hands be on the steering wheel.

Amazon Go stores are checkout-free self-service retail stores where the in-store computer vision system detects when shoppers pick up or return stock items; shoppers are identified by and charged through an Android or iPhone app. When the Amazon Go software misses an item, the shopper can keep it for free; when the software falsely registers an item taken, the shopper can flag the item and get a refund for that charge.

In healthcare, there are vision applications for classifying certain features in pathology slides, chest x-rays, and other medical imaging systems. A few of these have demonstrated value when compared to skilled human practitioners, some enough for regulatory approval. There’s also a real-time system for estimating patient blood loss in an operating or delivery room.

There are useful vision applications for agriculture (agricultural robots, crop and soil monitoring, and predictive analytics), banking (fraud detection, document authentication, and remote deposits), and industrial monitoring (remote wells, site security, and work activity).

There are also applications of computer vision that are controversial or even deprecated. One is face recognition, which when used by government can be an invasion of privacy, and which often has a training bias that tends to misidentify non-white faces. Another is deepfake generation, which is more than a little creepy when used for pornography or the creation of hoaxes and other fraudulent images.

Computer vision frameworks and models

Most deep learning frameworks have substantial support for computer vision, including Python-based frameworks TensorFlow (the leading choice for production), PyTorch (the leading choice for academic research), and MXNet (Amazon’s framework of choice). OpenCV is a specialized library for computer vision that leans toward real-time vision applications and takes advantage of MMX and SSE instructions when they are available; it also has support for acceleration using CUDA, OpenCL, OpenGL, and Vulkan.

Amazon Rekognition is an image and video analysis service that can identify objects, people, text, scenes, and activities, including facial analysis and custom labels. The Google Cloud Vision API is a pretrained image analysis service that can detect objects and faces, read printed and handwritten text, and build metadata into your image catalog. Google AutoML Vision allows you to train custom image models. Both Amazon Rekognition Custom Labels and Google AutoML Vision perform transfer learning.

The Microsoft Computer Vision API can identify objects from a catalog of 10,000, with labels in 25 languages. It also returns bounding boxes for identified objects. The Azure Face API does face detection that perceives faces and attributes in an image, person identification that matches an individual in your private repository of up to one million people, and perceived emotion recognition. The Face API can run in the cloud or on the edge in containers.

IBM Watson Visual Recognition can classify images from a pre-trained model, allow you to train custom image models with transfer learning, perform object detection with object counting, and train for visual inspection. Watson Visual Recognition can run in the cloud, or on iOS devices using Core ML.