Maximally stable extremal regions

Last updated November 14, 2022

In computer vision, maximally stable extremal regions (MSER) are used as a method of blob detection in images. This technique was proposed by Matas et al.^[1] to find correspondences between image elements from two images with different viewpoints. This method of extracting a comprehensive number of corresponding image elements contributes to the wide-baseline matching, and it has led to better stereo matching and object recognition algorithms.

Terms and definitions

Image $I$ is a mapping $I:D\subset \mathbb {Z} ^{2}\to S$ . Extremal regions are well defined on images if:

$S$ is totally ordered (total, antisymmetric and transitive binary relations $\leq$ exist).
An adjacency relation $A\subset D\times D$ is defined. We will denote that two points are adjacent as $pAq$ .

Region $Q$ is a contiguous (aka connected) subset of $D$ . (For each $p,q\in Q$ there is a sequence $p,a_{1},a_{2},..,a_{n},q$ such as $pAa_{1},a_{1}Aa_{2},\dots ,a_{n-1}Aa_{n},a_{n}Aq$ .) Note that under this definition the region can contain "holes" (for example, a ring-shaped region is connected, but its internal circle is not the part of $Q$ ).

(Outer) region boundary $\partial Q=\{q\in D\setminus Q:\exists p\in Q:qAp\}$ , which means the boundary $\partial Q$ of $Q$ is the set of pixels adjacent to at least one pixel of $Q$ but not belonging to $Q$ . Again, in case of regions with "holes", the region boundary is not obliged to be connected subset of $D$ (a ring has inner bound and outer bound which do not intersect).

Extremal region $Q\subset D$ is a region such that either for all $p\in Q,q\in \partial Q:I(p)>I(q)$ (maximum intensity region) or for all $p\in Q,q\in \partial Q:I(p)<I(q)$ (minimum intensity region). As far as $S$ is totally ordered, we can reformulate these conditions as $\min(I(p))>\max(I(q))$ for maximum intensity region and $\max(I(p))<\min(I(q))$ for minimum intensity region, respectively. In this form we can use a notion of a threshold intensity value which separates the region and its boundary.

Maximally stable extremal region Let $Q_{i}$ an extremal region such as all points on it have an intensity smaller than $i\in S$ . Note $Q_{i}\subset Q_{i+\Delta }$ for all positive $\Delta \in S$ . Extremal region $Q_{i*}$ is maximally stable if and only if $|Q_{i+\Delta }\setminus Q_{i-\Delta }|/|Q_{i}|$ has a local minimum at $i*$ . (Here $|\cdot |$ denotes cardinality). $\Delta \in S$ is here a parameter of the method.

The equation checks for regions that remain stable over a certain number of thresholds. If a region $Q_{i+\Delta }$ is not significantly larger than a region $Q_{i-\Delta }$ , region $Q_{i}$ is taken as a maximally stable region.

The concept more simply can be explained by thresholding. All the pixels below a given threshold are 'black' and all those above or equal are 'white'. Given a source image, if a sequence of thresholded result images $I_{t}$ is generated where each image $t$ corresponds to an increasing threshold t, first a white image would be seen, then 'black' spots corresponding to local intensity minima will appear then grow larger. A maximally stable extremal region is found when size of one of these black areas is the same (or near the same) than in previous image.

These 'black' spots will eventually merge, until the whole image is black. The set of all connected components in the sequence is the set of all extremal regions. In that sense, the concept of MSER is linked to the one of component tree of the image.^[2] The component tree indeed provide an easy way for implementing MSER.^[3]

Extremal regions

Extremal regions in this context have two important properties, that the set is closed under...

continuous transformation of image coordinates. This means it is affine invariant and it doesn't matter if the image is warped or skewed.
monotonic transformation of image intensities. The approach is of course sensitive to natural lighting effects as change of day light or moving shadows.

Advantages of MSER

Because the regions are defined exclusively by the intensity function in the region and the outer border, this leads to many key characteristics of the regions which make them useful. Over a large range of thresholds, the local binarization is stable in certain regions, and have the properties listed below.

Invariance to affine transformation of image intensities
Covariance to adjacency preserving (continuous) transformation $T:D\to D$ on the image domain
Stability: only regions whose support is nearly the same over a range of thresholds is selected.
Multi-scale detection without any smoothing involved, both fine and large structure is detected.
Note, however, that detection of MSERs in a scale pyramid improves repeatability, and number of correspondences across scale changes.^[4]
The set of all extremal regions can be enumerated in worst-case $O(n)$ , where $n$ is the number of pixels in the image.^[5]

Comparison to other region detectors

In Mikolajczyk et al.,^[6] six region detectors are studied (Harris-affine, Hessian-affine, MSER, edge-based regions, intensity extrema, and salient regions). A summary of MSER performance in comparison to the other five follows.

Region density – in comparison to the others MSER offers the most variety detecting about 2600 regions for a textured blur scene and 230 for a light changed scene, and variety is generally considered to be good. Also MSER had a repeatability of 92% for this test.
Region size – MSER tended to detect many small regions, versus large regions which are more likely to be occluded or to not cover a planar part of the scene. Though large regions may be slightly easier to match.
Viewpoint change – MSER outperforms the five other region detectors in both the original images and those with repeated texture motifs.
Scale change – Following Hessian-affine detector, MSER comes in second under a scale change and in-plane rotation.
Blur – MSER proved to be the most sensitive to this type of change in image, the only area in which this type of detection is lacking.
Note however that this evaluation did not make use of multi-resolution detection, which has been shown to improve repeatability under blur.^[4]
Light change – MSER showed the highest repeatability score for this type of scene, with all the other having good robustness as well.

MSER consistently resulted in the highest score through many tests, proving it to be a reliable region detector.^[6]

Implementation

The original algorithm of Matas et al.^[1] is $O(n\,\log(\log(n)))$ in the number $n\,$ of pixels. It proceeds by first sorting the pixels by intensity. This would take $O(n)\,$ time, using BINSORT. After sorting, pixels are marked in the image, and the list of growing and merging connected components and their areas is maintained using the union-find algorithm. This would take $O(n\,\log(\log(n)))$ time. In practice these steps are very fast. During this process, the area of each connected component as a function of intensity is stored producing a data structure. A merge of two components is viewed as termination of existence of the smaller component and an insertion of all pixels of the smaller component into the larger one. In the extremal regions, the 'maximally stable' ones are those corresponding to thresholds where the relative area change as a function of relative change of threshold is at a local minimum, i.e. the MSER are the parts of the image where local binarization is stable over a large range of thresholds.^[1]^[6]

The component tree is the set of all connected components of the thresholds of the image, ordered by inclusion. Efficient (quasi-linear whatever the range of the weights) algorithms for computing it do exist.^[2] Thus this structure offers an easy way for implementing MSER.^[3]

More recently, Nister and Stewenius have proposed a truly (if the weight are small integers) worst-case $O(n)\,$ method in,^[5] which is also much faster in practice. This algorithm is similar to the one of Ph. Salembier et al.^[7]

Robust wide-baseline algorithm

The purpose of this algorithm is to match MSERs to establish correspondence points between images. First MSER regions are computed on the intensity image (MSER+) and on the inverted image (MSER-). Measurement regions are selected at multiple scales: the size of the actual region, 1.5x, 2x, and 3x scaled convex hull of the region. Matching is accomplished in a robust manner, so it is better to increase the distinctiveness of large regions without being severely affected by clutter or non-planarity of the region's pre-image. A measurement taken from an almost planar patch of the scene with stable invariant description are called a 'good measurement'. Unstable ones or those on non-planar surfaces or discontinuities are called 'corrupted measurements'. The robust similarity is computed: For each $M_{A}^{i}$ on region $A,k$ regions $B_{1},\dots ,B_{k}$ from the other image with the corresponding i-th measurement $M_{B_{1}}^{i},\dots ,M_{B_{k}}^{i}$ nearest to $M_{A}^{i}$ are found and a vote is cast suggesting correspondence of A and each of $B_{1},\dots ,B_{k}$ . Votes are summed over all measurements, and using probability analysis, 'good measurements' can be picked out as the 'corrupt measurements' will likely spread their votes randomly. By applying RANSAC to the centers of gravity of the regions, a rough epipolar geometry can be computed. An affine transformation between pairs of potentially corresponding regions is computed, and correspondences define it up to a rotation, which is then determined by epipolar lines. The regions are then filtered, and the ones with correlation of their transformed images above a threshold are chosen. RANSAC is applied again with a more narrow threshold, and the final epipolar geometry is estimated by the eight-point algorithm.

This algorithm can be tested here (Epipolar or homography geometry constrained matches): WBS Image Matcher

Use in text detection

The MSER algorithm has been used in text detection by Chen by combining MSER with Canny edges. Canny edges are used to help cope with the weakness of MSER to blur. MSER is first applied to the image in question to determine the character regions. To enhance the MSER regions any pixels outside the boundaries formed by Canny edges are removed. The separation of the later provided by the edges greatly increase the usability of MSER in the extraction of blurred text.^[8]

An alternative use of MSER in text detection is the work by Shi using a graph model. This method again applies MSER to the image to generate preliminary regions. These are then used to construct a graph model based on the position distance and color distance between each MSER, which is treated as a node. Next the nodes are separated into foreground and background using cost functions. One cost function is to relate the distance from the node to the foreground and background. The other penalizes nodes for being significantly different from its neighbor. When these are minimized the graph is then cut to separate the text nodes from the non-text nodes.^[9]

To enable text detection in a general scene, Neumann uses the MSER algorithm in a variety of projections. In addition to the greyscale intensity projection, he uses the red, blue, and green color channels to detect text regions that are color distinct but not necessarily distinct in greyscale intensity. This method allows for detection of more text than solely using the MSER+ and MSER- functions discussed above.^[10]

Extensions and adaptations

The MSER algorithm has been adapted to colour images, by replacing thresholding of the intensity function with agglomerative clustering, based on colour gradients.^[11]
The MSER algorithm can be used to detect regions based on color as opposed to intensity. This is done by Chavez by creating an intensity function for red, green, and blue in the HSV color space. The MSER algorithm is then run five times; over the three color pseudo-intensities and then over the grey scale intensities using the standard MSER+ and MSER- functions.^[12]
The MSER algorithm can be used to track colour objects, by performing MSER detection on the Mahalanobis distance to a colour distribution.^[3]
By detecting MSERs in multiple resolutions, robustness to blur, and scale change can be improved.^[4]

Other applications

External links

VLFeat, an open source computer vision library in C (with a MEX interface to MATLAB), including an implementation of MSER
OpenCV, an open source computer vision library in C/C++, including an implementation of Linear Time MSER
Detector Repeatabilty Study, Kristian Mikolajczyk Binaries (Win/Linux to compute MSER/HarrisAffine... . Binary used in his repeatability study.
Linear Time MSER Implementation, Charles Dubout, C++ implementation of MSER as a blob detector

Related Research Articles

Digital image processing is the use of a digital computer to process digital images through an algorithm. As a subcategory or field of digital signal processing, digital image processing has many advantages over analog image processing. It allows a much wider range of algorithms to be applied to the input data and can avoid problems such as the build-up of noise and distortion during processing. Since images are defined over two dimensions digital image processing may be modeled in the form of multidimensional systems. The generation and development of digital image processing are mainly affected by three factors: first, the development of computers; second, the development of mathematics ; third, the demand for a wide range of applications in environment, agriculture, military, industry and medical science has increased.

Edge detection includes a variety of mathematical methods that aim at identifying edges, curves in a digital image at which the image brightness changes sharply or, more formally, has discontinuities. The same problem of finding discontinuities in one-dimensional signals is known as step detection and the problem of finding signal discontinuities over time is known as change detection. Edge detection is a fundamental tool in image processing, machine vision and computer vision, particularly in the areas of feature detection and feature extraction.

The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a wide range of edges in images. It was developed by John F. Canny in 1986. Canny also produced a computational theory of edge detection explaining why the technique works.

In digital image processing and computer vision, image segmentation is the process of partitioning a digital image into multiple image segments, also known as image regions or image objects. The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

The scale-invariant feature transform (SIFT) is a computer vision algorithm to detect, describe, and match local features in images, invented by David Lowe in 1999. Applications include object recognition, robotic mapping and navigation, image stitching, 3D modeling, gesture recognition, video tracking, individual identification of wildlife and match moving.

In digital image processing, thresholding is the simplest method of segmenting images. From a grayscale image, thresholding can be used to create binary images.

<span class="mw-page-title-main">Corner detection</span>

Corner detection is an approach used within computer vision systems to extract certain kinds of features and infer the contents of an image. Corner detection is frequently used in motion detection, image registration, video tracking, image mosaicing, panorama stitching, 3D reconstruction and object recognition. Corner detection overlaps with the topic of interest point detection.

In computer vision, blob detection methods are aimed at detecting regions in a digital image that differ in properties, such as brightness or color, compared to surrounding regions. Informally, a blob is a region of an image in which some properties are constant or approximately constant; all the points in a blob can be considered in some sense to be similar to each other. The most common method for blob detection is convolution.

Compressed sensing is a signal processing technique for efficiently acquiring and reconstructing a signal, by finding solutions to underdetermined linear systems. This is based on the principle that, through optimization, the sparsity of a signal can be exploited to recover it from far fewer samples than required by the Nyquist–Shannon sampling theorem. There are two conditions under which recovery is possible. The first one is sparsity, which requires the signal to be sparse in some domain. The second one is incoherence, which is applied through the isometric property, which is sufficient for sparse signals.

Haar-like features are digital image features used in object recognition. They owe their name to their intuitive similarity with Haar wavelets and were used in the first real-time face detector.

The Kadir–Brady saliency detector extracts features of objects in images that are distinct and representative. It was invented by Timor Kadir and J. Michael Brady in 2001 and an affine invariant version was introduced by Kadir and Brady in 2004 and a robust version was designed by Shao et al. in 2007.

In the fields of computer vision and image analysis, the Harris affine region detector belongs to the category of feature detection. Feature detection is a preprocessing step of several algorithms that rely on identifying characteristic points or interest points so to make correspondences between images, recognize textures, categorize objects or build panoramas.

The Hessian affine region detector is a feature detector used in the fields of computer vision and image analysis. Like other feature detectors, the Hessian affine detector is typically used as a preprocessing step to algorithms that rely on identifiable, characteristic interest points.

Region growing is a simple region-based image segmentation method. It is also classified as a pixel-based image segmentation method since it involves the selection of initial seed points.

The principal curvature-based region detector, also called PCBR is a feature detector used in the fields of computer vision and image analysis. Specifically the PCBR detector is designed for object recognition applications.

In the fields of computer vision and image analysis, the scale-invariant feature operator is an algorithm to detect local features in images. The algorithm was published by Förstner et al. in 2009.

Features from accelerated segment test (FAST) is a corner detection method, which could be used to extract feature points and later used to track and map objects in many computer vision tasks. The FAST corner detector was originally developed by Edward Rosten and Tom Drummond, and was published in 2006. The most promising advantage of the FAST corner detector is its computational efficiency. Referring to its name, it is indeed faster than many other well-known feature extraction methods, such as difference of Gaussians (DoG) used by the SIFT, SUSAN and Harris detectors. Moreover, when machine learning techniques are applied, superior performance in terms of computation time and resources can be realised. The FAST corner detector is very suitable for real-time video processing application because of this high-speed performance.

Foreground detection is one of the major tasks in the field of computer vision and image processing whose aim is to detect changes in image sequences. Background subtraction is any technique which allows an image's foreground to be extracted for further processing.

In computer vision, rigid motion segmentation is the process of separating regions, features, or trajectories from a video sequence into coherent subsets of space and time. These subsets correspond to independent rigidly moving objects in the scene. The goal of this segmentation is to differentiate and extract the meaningful rigid motion from the background and analyze it. Image segmentation techniques labels the pixels to be a part of pixels with certain characteristics at a particular time. Here, the pixels are segmented depending on its relative movement over a period of time i.e. the time of the video sequence.

The Harris corner detector is a corner detection operator that is commonly used in computer vision algorithms to extract corners and infer features of an image. It was first introduced by Chris Harris and Mike Stephens in 1988 upon the improvement of Moravec's corner detector. Compared to the previous one, Harris' corner detector takes the differential of the corner score into account with reference to direction directly, instead of using shifting patches for every 45 degree angles, and has been proved to be more accurate in distinguishing between edges and corners. Since then, it has been improved and adopted in many algorithms to preprocess images for subsequent applications.

References

1 2 3 J. Matas, O. Chum, M. Urban, and T. Pajdla. "Robust wide baseline stereo from maximally stable extremal regions." Proc. of British Machine Vision Conference, pages 384-396, 2002.
1 2 L. Najman and M. Couprie: "Building the component tree in quasi-linear time" Archived 2011-04-09 at the Wayback Machine ; IEEE Transactions on Image Processing, Volume 15, Numbers 11, 2006, pp 3531-3539
1 2 3 Donoser, M. and Bischof, H. Efficient Maximally Stable Extremal Region (MSER) Tracking CVPR, 2006.
1 2 3 Forssen, P-E. and Lowe, D.G. "Shape Descriptors for Maximally Stable Extremal Regions" Archived 2011-06-10 at the Wayback Machine ICCV, 2007.
1 2 Nister, D. and Stewenius, H., "Linear Time Maximally Stable Extremal Regions", ECCV, 2008.
1 2 3 K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, T. Kadir and L. Van Gool: "A Comparison of Affine Region Detectors"; International Journal of Computer Vision, Volume 65, Numbers 1-2 / November, 2005, pp 43-72
↑ Salembier, Philippe; A. Oliveras; L. Garrido (1998). "Anti-extensive Connected Operators for Image and Sequence Processing". IEEE Transactions on Image Processing. 7 (4): 555–570. doi:10.1109/83.663500. hdl: 2117/90134 . PMID 18276273. Archived from the original on 2012-04-25. Retrieved 2011-11-17.
↑ Chen, Huizhong; Tsai, Sam; Schroth, Georg; Chen, David; Grzeszczuk, Radek; Girod, Bernd. "Robust Text Detection in Natural Images with Edge-enhanced Maximally Stable Extremal Regions". Proc. IEEE International Conference on Image Processing 2011.
↑ Shi, Cunzhao; Wang, Chunheng; Xiao, Baihua; Gao, Song (15 January 2013). "Scene Text Detection Using Graph Model Built Upon Maximally Stable Extremal Regions". Pattern Recognition Letters. 34 (2): 107–116. doi:10.1016/j.patrec.2012.09.019.
↑ Neumann, Lukas; Matas, Jiri (2011). "A Method for Text Localization and Recognition in Real-World Images". Accv 2010: 770–783.
↑ Forssen, P-E. Maximally Stable Colour Regions for Recognition and Matching Archived 2011-06-10 at the Wayback Machine , CVPR, 2007.
↑ Chavez, Aaron; Gustafson, David (2011). "Color-Based Extensions to MSERs". Isvc 2011. Lecture Notes in Computer Science. 6939: 358–366. doi:10.1007/978-3-642-24031-7_36. ISBN 978-3-642-24030-0.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[matas-1] 1 2 3 J. Matas, O. Chum, M. Urban, and T. Pajdla. "Robust wide baseline stereo from maximally stable extremal regions." Proc. of British Machine Vision Conference, pages 384-396, 2002.

[compTree-2] 1 2 L. Najman and M. Couprie: "Building the component tree in quasi-linear time" Archived 2011-04-09 at the Wayback Machine ; IEEE Transactions on Image Processing, Volume 15, Numbers 11, 2006, pp 3531-3539

[tracking-3] 1 2 3 Donoser, M. and Bischof, H. Efficient Maximally Stable Extremal Region (MSER) Tracking CVPR, 2006.

[descriptor-4] 1 2 3 Forssen, P-E. and Lowe, D.G. "Shape Descriptors for Maximally Stable Extremal Regions" Archived 2011-06-10 at the Wayback Machine ICCV, 2007.

[stewenius_nister-5] 1 2 Nister, D. and Stewenius, H., "Linear Time Maximally Stable Extremal Regions", ECCV, 2008.

[compare-6] 1 2 3 K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, T. Kadir and L. Van Gool: "A Comparison of Affine Region Detectors"; International Journal of Computer Vision, Volume 65, Numbers 1-2 / November, 2005, pp 43-72

[AntiExt-7] Salembier, Philippe; A. Oliveras; L. Garrido (1998). "Anti-extensive Connected Operators for Image and Sequence Processing". IEEE Transactions on Image Processing. 7 (4): 555–570. doi:10.1109/83.663500. hdl: 2117/90134 . PMID 18276273. Archived from the original on 2012-04-25. Retrieved 2011-11-17.

[chen-8] Chen, Huizhong; Tsai, Sam; Schroth, Georg; Chen, David; Grzeszczuk, Radek; Girod, Bernd. "Robust Text Detection in Natural Images with Edge-enhanced Maximally Stable Extremal Regions". Proc. IEEE International Conference on Image Processing 2011.

[shi-9] Shi, Cunzhao; Wang, Chunheng; Xiao, Baihua; Gao, Song (15 January 2013). "Scene Text Detection Using Graph Model Built Upon Maximally Stable Extremal Regions". Pattern Recognition Letters. 34 (2): 107–116. doi:10.1016/j.patrec.2012.09.019.

[neumann-10] Neumann, Lukas; Matas, Jiri (2011). "A Method for Text Localization and Recognition in Real-World Images". Accv 2010: 770–783.

[mscr-11] Forssen, P-E. Maximally Stable Colour Regions for Recognition and Matching Archived 2011-06-10 at the Wayback Machine , CVPR, 2007.

[chavez-12] Chavez, Aaron; Gustafson, David (2011). "Color-Based Extensions to MSERs". Isvc 2011. Lecture Notes in Computer Science. 6939: 358–366. doi:10.1007/978-3-642-24031-7_36. ISBN 978-3-642-24030-0.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]