Object detection

Objects detected with OpenCV's Deep Neural Network (dnn) module using a YOLOv3 model trained on the COCO dataset, capable of detecting objects of 80 common classes

Object detection is a computer technology related to computer vision and image processing that deals with detecting instances of semantic objects of a certain class (such as humans, buildings, or cars) in digital images and videos. [1] Well-researched domains of object detection include face detection and pedestrian detection. Object detection has applications in many areas of computer vision, including image retrieval and video surveillance.
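
As a concrete illustration of the pipeline shown in the image above, here is a minimal sketch of running a pretrained YOLOv3 detector through OpenCV's dnn module. The file names (yolov3.cfg, yolov3.weights, coco.names, desk.jpg) are assumptions; the model and class files must be obtained separately.

```python
import cv2
import numpy as np

# Assumed local files (not bundled with OpenCV): yolov3.cfg, yolov3.weights, coco.names.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")
classes = open("coco.names").read().strip().split("\n")

img = cv2.imread("desk.jpg")
h, w = img.shape[:2]

# YOLOv3 expects a 416x416 input blob with pixel values scaled to [0, 1].
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(net.getUnconnectedOutLayersNames())

boxes, confidences, class_ids = [], [], []
for output in outputs:
    for det in output:                 # det = [cx, cy, bw, bh, objectness, 80 class scores]
        scores = det[5:]
        cid = int(np.argmax(scores))
        conf = float(scores[cid])
        if conf > 0.5:
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(conf)
            class_ids.append(cid)

# Non-maximum suppression removes overlapping duplicate boxes.
indices = cv2.dnn.NMSBoxes(boxes, confidences, 0.5, 0.4)
for i in np.array(indices).flatten():
    x, y, bw, bh = boxes[i]
    print(classes[class_ids[i]], confidences[i], (x, y, bw, bh))
```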

Uses

Detection of objects on a road (Simon Bolivar Avenue, Quito)

Object detection is widely used in computer vision tasks such as image annotation, [2] vehicle counting, [3] activity recognition, [4] face detection, face recognition, and video object co-segmentation. It is also used in tracking objects, for example tracking a ball during a football match, tracking the movement of a cricket bat, or tracking a person in a video.

Often, the test images are sampled from a different data distribution, making the object detection task significantly more difficult. [5] To address the challenges caused by the domain gap between training and test data, many unsupervised domain adaptation approaches have been proposed. [5] [6] [7] [8] [9] A simple and straightforward solution for reducing the domain gap is to apply an image-to-image translation approach, such as CycleGAN. [10] Among other uses, cross-domain object detection is applied in autonomous driving, where models can be trained on a vast number of video-game scenes, since the labels can be generated without manual labor.
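
The translation-based idea can be sketched in a few lines. This is a hypothetical training step, not a specific published method: `generator` stands in for a pretrained CycleGAN-style source-to-target network, and `detector` is assumed to return a training loss given images and box annotations.

```python
import torch

def translate_then_train_step(detector, generator, images, targets, optimizer):
    """One hypothetical training step: translate labeled source-domain images
    (e.g. video-game frames) toward the target domain before feeding the detector.
    Boxes stay valid because image-to-image translation preserves geometry."""
    with torch.no_grad():
        images = generator(images)     # pixel appearance changes, labels do not
    loss = detector(images, targets)   # assumed to return a scalar training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```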

Concept

Every object class has its own special features that help in classifying the class – for example, all circles are round. Object class detection uses these special features. For example, when looking for circles, objects that lie at a particular distance from a point (i.e. the center) are sought, as in the Hough-transform sketch below. Similarly, when looking for squares, objects whose sides are perpendicular at the corners and of equal length are needed. A similar approach is used for face identification, where eyes, nose, and lips can be located and features such as skin color and the distance between the eyes can be measured.
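
The circle criterion maps directly onto the circle Hough transform, which votes over candidate (center, radius) combinations. A minimal sketch with OpenCV, assuming an input file shapes.png:

```python
import cv2
import numpy as np

img = cv2.imread("shapes.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.medianBlur(gray, 5)   # smoothing reduces spurious circle votes

# Each detected circle is exactly the "points at a fixed distance from a
# center" criterion described above, found by voting in (center, radius) space.
circles = cv2.HoughCircles(gray, cv2.HOUGH_GRADIENT, dp=1, minDist=20,
                           param1=100, param2=30, minRadius=5, maxRadius=100)
if circles is not None:
    for x, y, r in np.round(circles[0]).astype(int):
        print(f"circle at ({x}, {y}) with radius {r}")
```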

Methods

Simplified example of training a neural network for object detection: the network is trained on multiple images known to depict starfish and sea urchins, which are correlated with "nodes" that represent visual features. The starfish match with a ringed texture and a star outline, whereas most sea urchins match with a striped texture and an oval shape. However, the instance of a ring-textured sea urchin creates a weakly weighted association between the two.
Subsequent run of the network on an input image: [11] the network correctly detects the starfish. However, the weakly weighted association between ringed texture and sea urchin also confers a weak signal to the latter from one of two intermediate nodes. In addition, a shell that was not included in the training gives a weak signal for the oval shape, also resulting in a weak signal for the sea urchin output. These weak signals may result in a false positive for sea urchin.
In reality, textures and outlines would not be represented by single nodes, but rather by associated weight patterns of multiple nodes.

Methods for object detection generally fall into either neural-network-based or non-neural approaches. Non-neural approaches first define features using a method such as the histogram of oriented gradients (HOG), [12] then use a technique such as a support vector machine (SVM) to do the classification; a minimal sketch of this classic pipeline follows. Neural techniques, on the other hand, can perform end-to-end object detection without specifically defined features, and are typically based on convolutional neural networks (CNN).
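
OpenCV ships a pretrained HOG-plus-linear-SVM pedestrian detector, which makes the non-neural pipeline easy to demonstrate. The input file street.jpg is an assumption:

```python
import cv2

# Classic non-neural pipeline: hand-defined HOG features + a linear SVM.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("street.jpg")
# Slide a detection window over the image at multiple scales.
boxes, weights = hog.detectMultiScale(img, winStride=(8, 8), scale=1.05)
for (x, y, w, h), score in zip(boxes, weights):
    print(f"pedestrian at ({x}, {y}, {w}, {h}) score={float(score):.2f}")
```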

Related Research Articles

Image segmentation

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.
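
The per-pixel labeling idea can be shown in miniature with thresholding followed by connected components, where every foreground region receives its own integer label. The input file cells.png is an assumption:

```python
import cv2

img = cv2.imread("cells.png", cv2.IMREAD_GRAYSCALE)

# Otsu thresholding assigns each pixel a foreground/background label...
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# ...and connected components give every foreground region a distinct label,
# i.e. pixels sharing a label form one segment.
num_labels, labels = cv2.connectedComponents(binary)
print(f"{num_labels - 1} segments found; label image shape: {labels.shape}")
```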

Template matching is a technique in digital image processing for finding small parts of an image which match a template image. It can be used for quality control in manufacturing, navigation of mobile robots, or edge detection in images.
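
A minimal template-matching sketch with OpenCV, sliding the template over the image and scoring each position by normalized cross-correlation (the file names are assumptions):

```python
import cv2

image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.png", cv2.IMREAD_GRAYSCALE)

# Score every template position; the global maximum is the best match.
result = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
_, max_val, _, max_loc = cv2.minMaxLoc(result)
print(f"best match at {max_loc} with score {max_val:.3f}")
```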

Deep learning

Deep learning is the subset of machine learning methods based on artificial neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Learning can be supervised, semi-supervised, or unsupervised.

MNIST database

The MNIST database is a large database of handwritten digits that is commonly used for training various image processing systems. The database is also widely used for training and testing in the field of machine learning. It was created by "re-mixing" the samples from NIST's original datasets. The creators felt that since NIST's training dataset was taken from American Census Bureau employees, while the testing dataset was taken from American high school students, it was not well-suited for machine learning experiments. Furthermore, the black-and-white images from NIST were normalized to fit into a 28×28 pixel bounding box and anti-aliased, which introduced grayscale levels.
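
The dataset ships with Keras, so the 28×28 layout and grayscale levels described above can be inspected directly:

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
print(x_train.shape)   # (60000, 28, 28) training images
print(x_test.shape)    # (10000, 28, 28) test images
print(x_train.dtype, x_train.min(), x_train.max())  # uint8 grayscale levels 0..255
```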

Convolutional neural network

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself via filter (kernel) optimization. The vanishing and exploding gradients seen during backpropagation in earlier networks are mitigated by using regularized weights over fewer connections. For example, a single neuron in a fully connected layer would require 10,000 weights to process an image of 100 × 100 pixels, whereas a cascaded 5 × 5 convolution kernel needs only 25 shared weights to process 5 × 5 tiles of the image. Higher-layer features are extracted from wider context windows, compared to lower-layer features.
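
The parameter-count comparison can be verified directly in PyTorch:

```python
import torch.nn as nn

# One fully connected neuron over a 100x100 image: one weight per pixel.
fc = nn.Linear(100 * 100, 1)
# One 5x5 convolution kernel: 25 shared weights, reused at every position.
conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=5)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(fc))    # 10001 (10,000 weights + 1 bias)
print(count(conv))  # 26 (25 weights + 1 bias)
```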

Saliency map

In computer vision, a saliency map is an image that highlights the region on which people's eyes focus first. The goal of a saliency map is to reflect the degree of importance of a pixel to the human visual system. For example, in an image of a landscape, a viewer first looks at a fort and light clouds, so these regions should be highlighted on the saliency map. Saliency maps engineered in artificial or computer vision are typically not the same as the actual saliency map constructed by biological or natural vision.
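
A quick way to compute one such engineered map is OpenCV's spectral-residual saliency, which requires the opencv-contrib-python package; the input file landscape.jpg is an assumption:

```python
import cv2

img = cv2.imread("landscape.jpg")
# Spectral-residual static saliency (from the opencv-contrib modules).
saliency = cv2.saliency.StaticSaliencySpectralResidual_create()
ok, saliency_map = saliency.computeSaliency(img)   # float map in [0, 1]
if ok:
    cv2.imwrite("saliency.png", (saliency_map * 255).astype("uint8"))
```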

AlexNet

AlexNet is a convolutional neural network (CNN) architecture designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor at the University of Toronto.

The CIFAR-10 dataset is a collection of images that are commonly used to train machine learning and computer vision algorithms. It is one of the most widely used datasets for machine learning research. The CIFAR-10 dataset contains 60,000 32×32 color images in 10 different classes. The 10 different classes represent airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks. There are 6,000 images of each class.
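
Like MNIST, CIFAR-10 ships with Keras, so the stated image counts and sizes are easy to check:

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.cifar10.load_data()
print(x_train.shape, x_test.shape)   # (50000, 32, 32, 3) (10000, 32, 32, 3)
# 10 classes x 6,000 images = 60,000 images in total, as described above.
print(len(set(int(y) for y in y_train.flatten())))  # 10
```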

Neural architecture search

Neural architecture search (NAS) is a technique for automating the design of artificial neural networks (ANN), a widely used model in the field of machine learning. NAS has been used to design networks that are on par with or outperform hand-designed architectures. Methods for NAS can be categorized according to the search space, search strategy, and performance estimation strategy used.

U-Net

U-Net is a convolutional neural network that was developed for biomedical image segmentation at the Computer Science Department of the University of Freiburg. The network is based on a fully convolutional neural network whose architecture was modified and extended to work with fewer training images and to yield more precise segmentation. Segmentation of a 512 × 512 image takes less than a second on a modern GPU.

A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.
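
The weight-sharing idea reduces to applying one encoder to both inputs and comparing the embeddings by a distance. A minimal sketch with an assumed toy encoder:

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """The same encoder (shared weights) embeds both inputs; similarity is
    then just a distance between the two embedding vectors."""
    def __init__(self, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(28 * 28, 256), nn.ReLU(), nn.Linear(256, dim)
        )

    def forward(self, a, b):
        ea, eb = self.encoder(a), self.encoder(b)        # identical weights for both
        return nn.functional.pairwise_distance(ea, eb)   # small value = similar pair

net = SiameseNet()
a, b = torch.rand(1, 1, 28, 28), torch.rand(1, 1, 28, 28)
print(net(a, b))  # in practice, one embedding is often precomputed as a baseline
```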

An event camera, also known as a neuromorphic camera, silicon retina or dynamic vision sensor, is an imaging sensor that responds to local changes in brightness. Event cameras do not capture images using a shutter as conventional (frame) cameras do. Instead, each pixel inside an event camera operates independently and asynchronously, reporting changes in brightness as they occur, and staying silent otherwise.

In the domain of physics and probability, the filters, random fields, and maximum entropy (FRAME) model is a Markov random field model of stationary spatial processes, in which the energy function is the sum of translation-invariant potential functions that are one-dimensional non-linear transformations of linear filter responses. The FRAME model was originally developed by Song-Chun Zhu, Ying Nian Wu, and David Mumford for modeling stochastic texture patterns, such as grasses, tree leaves, brick walls, water waves, etc. This model is the maximum entropy distribution that reproduces the observed marginal histograms of responses from a bank of filters, where for each filter tuned to a specific scale and orientation, the marginal histogram is pooled over all the pixels in the image domain. The FRAME model has also been proved equivalent to the micro-canonical ensemble, which was named the Julesz ensemble. A Gibbs sampler is adopted to synthesize texture images by drawing samples from the FRAME model.
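
From the description above, the model's maximum-entropy (Gibbs) form can be written out; this is a sketch from the stated definitions, with F_k the k-th filter in the bank, λ_k the learned one-dimensional potential functions, and Z(Λ) the normalizing partition function:

```latex
% Energy = sum over filters k and pixel locations x of a 1-D nonlinear
% transformation \lambda_k of the linear filter response (F_k * I)(x).
p(I;\Lambda) \;=\; \frac{1}{Z(\Lambda)}
\exp\Big\{ -\sum_{k=1}^{K} \sum_{x} \lambda_k\big( (F_k * I)(x) \big) \Big\}
```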

Video super-resolution

Video super-resolution (VSR) is the process of generating high-resolution video frames from given low-resolution video frames. Unlike single-image super-resolution (SISR), the main goal is not only to restore fine details while preserving coarse ones, but also to preserve motion consistency.
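
The distinction is easiest to see against the naive baseline that real VSR methods improve on: upscaling each frame independently, which ignores temporal consistency entirely. A sketch with an assumed input file input_lowres.mp4:

```python
import cv2

# Naive per-frame baseline: upscale each frame independently with bicubic
# interpolation. VSR methods instead aggregate information across
# neighboring frames to keep motion consistent.
cap = cv2.VideoCapture("input_lowres.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hi = cv2.resize(frame, None, fx=4, fy=4, interpolation=cv2.INTER_CUBIC)
    # ...feed `hi` to a video writer or a quality metric here
cap.release()
```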

Self-supervised learning

Self-supervised learning (SSL) is a paradigm in machine learning where a model is trained on a task using the data itself to generate supervisory signals, rather than relying on external labels provided by humans. In the context of neural networks, self-supervised learning aims to leverage inherent structures or relationships within the input data to create meaningful training signals. SSL tasks are designed so that solving them requires capturing essential features or relationships in the data. The input data is typically augmented or transformed in a way that creates pairs of related samples. One sample serves as the input, and the other is used to formulate the supervisory signal. This augmentation can involve introducing noise, cropping, rotation, or other transformations. Self-supervised learning more closely imitates the way humans learn to classify objects.
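
The pair-creation step described above is a few lines with torchvision; two random augmentations of the same image form a related pair, with no human labels involved (the file photo.jpg is an assumption):

```python
import torchvision.transforms as T
from PIL import Image

# One view serves as the input; the other supplies the supervisory signal.
augment = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.ColorJitter(0.4, 0.4, 0.4),
    T.ToTensor(),
])

img = Image.open("photo.jpg").convert("RGB")
view_a, view_b = augment(img), augment(img)   # same content, different transforms
```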

A vision transformer (ViT) is a transformer designed for computer vision. Transformers were introduced in 2017 and have found widespread use in natural language processing. In 2020, they were adapted for computer vision, yielding ViT. The basic structure is to break an input image into a series of patches, tokenize them, and then apply the tokens to a standard Transformer architecture.
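
The patch-tokenization step can be sketched in plain PyTorch (the 224-pixel input, 16-pixel patch size, and 768-dimensional embedding are conventional ViT choices, used here only for illustration):

```python
import torch
import torch.nn as nn

# Patchify: a 224x224 image cut into 16x16 patches gives 14*14 = 196 tokens.
img = torch.rand(1, 3, 224, 224)
patch = 16
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1,3,14,14,16,16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, 3 * patch * patch)

# A linear projection turns each flattened patch into a token embedding.
embed = nn.Linear(3 * patch * patch, 768)
tokens = embed(patches)
print(tokens.shape)   # torch.Size([1, 196, 768]) -> input to a standard Transformer
```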

Small object detection is a particular case of object detection in which various techniques are employed to detect small objects in digital images and videos. "Small objects" are objects with a small pixel footprint in the input image. In areas such as aerial imagery, state-of-the-art object detection techniques have underperformed because of such small objects.
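
One common technique (used here only as an illustrative sketch, not a specific published method) is to run the detector on overlapping tiles at full resolution instead of downscaling the whole image, then merge the per-tile detections, e.g. with NMS:

```python
def tile_image(img_h, img_w, tile=640, overlap=128):
    """Return (x1, y1, x2, y2) offsets of overlapping tiles covering the image,
    so a detector can be run per tile without shrinking small objects."""
    step = tile - overlap
    offsets = []
    for y in range(0, max(img_h - overlap, 1), step):
        for x in range(0, max(img_w - overlap, 1), step):
            offsets.append((x, y, min(x + tile, img_w), min(y + tile, img_h)))
    return offsets

print(tile_image(1080, 1920))  # e.g. tiles covering a full-HD aerial frame
```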

Xiaoming Liu is a Chinese-American computer scientist and an academic. He is a Professor in the Department of Computer Science and Engineering, MSU Foundation Professor as well as Anil K. and Nandita Jain Endowed Professor of Engineering at Michigan State University.

References

  1. Dasiopoulou, Stamatia, et al. "Knowledge-assisted semantic video object detection." IEEE Transactions on Circuits and Systems for Video Technology 15.10 (2005): 1210–1224.
  2. Ling Guan; Yifeng He; Sun-Yuan Kung (1 March 2012). Multimedia Image and Video Processing. CRC Press. pp. 331–. ISBN 978-1-4398-3087-1.
  3. Alsanabani, Ala; Ahmed, Mohammed; AL Smadi, Ahmad (2020). "Vehicle Counting Using Detecting-Tracking Combinations: A Comparative Analysis". 2020 the 4th International Conference on Video and Image Processing. pp. 48–54. doi:10.1145/3447450.3447458. ISBN 9781450389075. S2CID 233194604.
  4. Wu, Jianxin, et al. "A scalable approach to activity recognition based on object use." 2007 IEEE 11th International Conference on Computer Vision. IEEE, 2007.
  5. Oza, Poojan; Sindagi, Vishwanath A.; VS, Vibashan; Patel, Vishal M. (2021-07-04). "Unsupervised Domain Adaptation of Object Detectors: A Survey". arXiv:2105.13502 [cs.CV].
  6. Khodabandeh, Mehran; Vahdat, Arash; Ranjbar, Mani; Macready, William G. (2019-11-18). "A Robust Learning Approach to Domain Adaptive Object Detection". arXiv:1904.02361 [cs.LG].
  7. Soviany, Petru; Ionescu, Radu Tudor; Rota, Paolo; Sebe, Nicu (2021-03-01). "Curriculum self-paced learning for cross-domain object detection". Computer Vision and Image Understanding. 204: 103166. arXiv:1911.06849. doi:10.1016/j.cviu.2021.103166. ISSN 1077-3142. S2CID 208138033.
  8. Menke, Maximilian; Wenzel, Thomas; Schwung, Andreas (October 2022). "Improving GAN-based Domain Adaptation for Object Detection". 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). pp. 3880–3885. doi:10.1109/ITSC55140.2022.9922138. ISBN 978-1-6654-6880-0. S2CID 253251380.
  9. Menke, Maximilian; Wenzel, Thomas; Schwung, Andreas (2022-08-31). "AWADA: Attention-Weighted Adversarial Domain Adaptation for Object Detection". arXiv:2208.14662 [cs.CV].
  10. Zhu, Jun-Yan; Park, Taesung; Isola, Phillip; Efros, Alexei A. (2020-08-24). "Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks". arXiv:1703.10593 [cs.CV].
  11. Ferrie, Chris; Kaiser, Sarah (2019). Neural Networks for Babies. Sourcebooks. ISBN 1492671207.
  12. Dalal, Navneet (2005). "Histograms of oriented gradients for human detection" (PDF). Computer Vision and Pattern Recognition. 1.
  13. Girshick, Ross (2014). "Rich feature hierarchies for accurate object detection and semantic segmentation" (PDF). Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE. pp. 580–587. arXiv:1311.2524. doi:10.1109/CVPR.2014.81. ISBN 978-1-4799-5118-5. S2CID 215827080.
  14. Girshick, Ross (2015). "Fast R-CNN" (PDF). Proceedings of the IEEE International Conference on Computer Vision. pp. 1440–1448. arXiv:1504.08083. Bibcode:2015arXiv150408083G.
  15. Ren, Shaoqing (2015). "Faster R-CNN". Advances in Neural Information Processing Systems. arXiv:1506.01497.
  16. Pang, Jiangmiao; Chen, Kai; Shi, Jianping; Feng, Huajun; Ouyang, Wanli; Lin, Dahua (2019-04-04). "Libra R-CNN: Towards Balanced Learning for Object Detection". arXiv:1904.02701v1 [cs.CV].
  17. Liu, Wei (October 2016). "SSD: Single Shot MultiBox Detector". Computer Vision – ECCV 2016. Lecture Notes in Computer Science. Vol. 9905. pp. 21–37. arXiv:1512.02325. doi:10.1007/978-3-319-46448-0_2. ISBN 978-3-319-46447-3. S2CID 2141740.
  18. Zhang, Shifeng (2018). "Single-Shot Refinement Neural Network for Object Detection". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4203–4212. arXiv:1711.06897. Bibcode:2017arXiv171106897Z.
  19. Lin, Tsung-Yi (2020). "Focal Loss for Dense Object Detection". IEEE Transactions on Pattern Analysis and Machine Intelligence. 42 (2): 318–327. arXiv:1708.02002. Bibcode:2017arXiv170802002L. doi:10.1109/TPAMI.2018.2858826. PMID 30040631. S2CID 47252984.
  20. Zhu, Xizhou (2018). "Deformable ConvNets v2: More Deformable, Better Results". arXiv:1811.11168 [cs.CV].
  21. Dai, Jifeng (2017). "Deformable Convolutional Networks". arXiv:1703.06211 [cs.CV].