Document mosaicing


Document mosaicing is a process that stitches multiple, overlapping snapshot images of a document together to produce one large, high-resolution composite. The document is slid by hand under a stationary, over-the-desk camera until every part of the document has passed through the camera's field of view. As the document slides under the camera, its motion is coarsely tracked by the vision system. Snapshots are taken periodically, such that successive snapshots overlap by about 50%. The system then finds the overlapping pairs and stitches them together repeatedly until all pairs are combined into a single composite of the document. [1]


Document mosaicing can be divided into four main processes.

Tracking (simple correlation process)

In this process, the motion of the document slid under the camera is coarsely tracked by the system using simple correlation. In the first frame, a small patch is extracted from the centre of the image as a correlation template, as shown in Figure 1. In the next frame, correlation is computed over a search area four times the size of the patch, and the peak in the correlation function indicates the motion of the paper. The template is then resampled from this frame, and tracking continues until the template reaches the edge of the document. At that point another snapshot is taken, and the tracking process repeats until the whole document has been imaged. The snapshots are stored in an ordered list to facilitate pairing the overlapping images in later processes.
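
A minimal sketch of this tracking step, assuming grayscale frames as NumPy arrays; the 32-pixel patch size is an illustrative choice, and OpenCV's normalised cross-correlation stands in for whatever correlation measure the original system used:

```python
import cv2
import numpy as np

def track_motion(prev_frame, next_frame, patch_size=32):
    """Track the paper's motion between frames: correlate a central
    patch of prev_frame over a search window (four times the patch
    area) in next_frame; the correlation peak gives the motion."""
    h, w = prev_frame.shape
    cy, cx = h // 2, w // 2
    half = patch_size // 2

    # Correlation template from the centre of the previous frame.
    template = prev_frame[cy - half:cy + half, cx - half:cx + half]

    # Search window with twice the side length (four times the area),
    # centred on the template's previous position.
    sy, sx = cy - patch_size, cx - patch_size
    search = next_frame[sy:sy + 2 * patch_size, sx:sx + 2 * patch_size]

    # The peak of the correlation function indicates the paper's motion.
    corr = cv2.matchTemplate(search, template, cv2.TM_CCOEFF_NORMED)
    _, _, _, peak = cv2.minMaxLoc(corr)          # peak = (x, y)

    dx = (sx + peak[0]) - (cx - half)
    dy = (sy + peak[1]) - (cy - half)
    return dx, dy
```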

Feature detection for efficient matching

Aligning one image with another requires finding the transformation that relates them. There are two main approaches to this registration problem: direct, correlation-based methods and feature-based methods. [2] [3]

In document mosaicing, each image is segmented into a hierarchy of columns, lines, and words so that these organised sets of features can be matched across images. Skew angle estimation and the finding of columns, lines and words are examples of the feature detection operations involved.

Skew angle estimation

First, the angle that the rows of text make with the image raster lines (the skew angle) is estimated; it is assumed to lie in the range ±20°. A small patch of text is selected from the image at random and rotated through this range until the variance of the pixel intensities summed along the raster lines is maximised. [4] See Figure 2.

To make the estimate robust, the system performs this calculation on many image patches and derives the final estimate as the average of the individual angles, weighted by the variance of the pixel intensities of each patch.
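
A minimal sketch of this estimator, assuming a grayscale NumPy image; the patch size, number of patches and 1° angular step are illustrative choices:

```python
import numpy as np
from scipy.ndimage import rotate

def patch_skew(patch, max_angle=20.0, step=1.0):
    """Estimate the skew of one text patch: rotate it through
    +/-max_angle and keep the angle that maximises the variance of
    the row sums (pixel intensities summed along raster lines)."""
    best_angle, best_var = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(patch, angle, reshape=False, order=1)
        var = rotated.sum(axis=1).var()
        if var > best_var:
            best_angle, best_var = angle, var
    return best_angle, best_var

def estimate_skew(image, patch_size=100, n_patches=10, seed=0):
    """Combine many randomly placed patches: the final estimate is
    the average of the per-patch angles weighted by their variances."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    angles, weights = [], []
    for _ in range(n_patches):
        y = rng.integers(0, h - patch_size)
        x = rng.integers(0, w - patch_size)
        angle, var = patch_skew(image[y:y + patch_size, x:x + patch_size])
        angles.append(angle)
        weights.append(var)
    return float(np.average(angles, weights=weights))
```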

Finding columns, lines and words

In this operation, the de-skewed document is segmented into a hierarchy of columns, lines and words. Sensitivity to illumination and page coloration is removed by applying a Sobel operator to the de-skewed image and thresholding the output to obtain a binary-gradient, de-skewed image. [5] See Figure 3.

The operation can be roughly separated into three steps, sketched in code after the list: column segmentation, line segmentation and word segmentation.

  1. Columns are easily segmented from the binary-gradient, de-skewed image by summing pixels vertically, as shown in Figure 4.
  2. The baselines of each row are segmented in the same way, but by summing pixels horizontally within each column.
  3. Finally, individual words are segmented by applying the vertical summing process within each segmented row.
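
A minimal sketch of the three steps using projection profiles, assuming a grayscale de-skewed NumPy image; the gradient threshold and the gap widths used to separate columns and words are illustrative choices:

```python
import cv2
import numpy as np

def binary_gradient(deskewed, thresh=50):
    """Sobel operator plus threshold: removes sensitivity to
    illumination and page coloration (illustrative threshold)."""
    gx = cv2.Sobel(deskewed, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(deskewed, cv2.CV_64F, 0, 1)
    return (np.hypot(gx, gy) > thresh).astype(np.uint8)

def runs(profile, min_gap=1):
    """(start, end) index pairs of the non-zero runs of a 1-D
    projection profile; gaps narrower than min_gap are bridged."""
    nz = np.flatnonzero(profile)
    if nz.size == 0:
        return []
    breaks = np.flatnonzero(np.diff(nz) > min_gap)
    starts = np.concatenate(([nz[0]], nz[breaks + 1]))
    ends = np.concatenate((nz[breaks], [nz[-1]]))
    return list(zip(starts, ends + 1))

def segment(binary, column_gap=20, word_gap=4):
    """Hierarchy of columns -> lines -> words by projection profiles."""
    page = []
    for c0, c1 in runs(binary.sum(axis=0), min_gap=column_gap):  # 1. columns
        column = binary[:, c0:c1]
        lines = []
        for r0, r1 in runs(column.sum(axis=1)):                  # 2. lines
            line = column[r0:r1, :]
            words = runs(line.sum(axis=0), min_gap=word_gap)     # 3. words
            lines.append(((r0, r1), words))
        page.append(((c0, c1), lines))
    return page
```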

These segmentations are important because the document mosaic is created by matching the lower-right corners of words in overlapping image pairs. Moreover, the segmentation operation reliably organises the words of each image into a hierarchy of rows and columns.

The segmentation operation involves a considerable amount of summing over the binary-gradient, de-skewed images. This is done efficiently by constructing a matrix of partial sums [6] whose elements are given by

$$ s_{mn} = \sum_{i \le m} \sum_{j \le n} b_{ij}, $$

where $b_{ij}$ is the value of pixel $(i, j)$ of the binary-gradient, de-skewed image. The matrix of partial sums is calculated in one pass through the image, and the sum over any rectangular region can then be obtained from just four of its entries. [6]
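
A sketch in NumPy, with illustrative helper names; `partial_sums` builds the matrix and `rect_sum` recovers an arbitrary rectangular sum from four entries by inclusion-exclusion:

```python
import numpy as np

def partial_sums(binary):
    """Matrix of partial sums: S[m, n] = sum of binary[i, j]
    over all i <= m and j <= n."""
    return binary.cumsum(axis=0).cumsum(axis=1)

def rect_sum(S, r0, c0, r1, c1):
    """Sum of binary[r0:r1+1, c0:c1+1], via inclusion-exclusion
    on four entries of the partial-sum matrix."""
    total = int(S[r1, c1])
    if r0 > 0:
        total -= int(S[r0 - 1, c1])
    if c0 > 0:
        total -= int(S[r1, c0 - 1])
    if r0 > 0 and c0 > 0:
        total += int(S[r0 - 1, c0 - 1])
    return total
```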

Establishing correspondences

The two images are now organised into a hierarchy of linked lists: each page holds a list of columns, each column a list of rows, and each row a list of words.

At the bottom of the structure, the length of each word is recorded, so that establishing correspondences between two images reduces to searching the corresponding structures for groups of words with matching lengths.
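
One way to picture this hierarchy is as nested lists of records; a sketch with illustrative type and field names:

```python
from dataclasses import dataclass, field

@dataclass
class Word:
    length: int                                  # word length in pixels, used for matching

@dataclass
class Row:
    words: list = field(default_factory=list)    # Word records, left to right

@dataclass
class Column:
    rows: list = field(default_factory=list)     # Row records, top to bottom

@dataclass
class Page:
    columns: list = field(default_factory=list)  # Column records, left to right
```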

Seed match finding

Seed match finding is done by comparing each row in image 1 with each row in image 2, word by word. If the lengths (in pixels) of two words (one from each image) and of their immediate neighbours agree within a predefined tolerance (5 pixels, for example), the words are assumed to match. Two rows are assumed to match if there are three or more word matches between them. The seed match finding operation terminates when two pairs of consecutive row matches are found.
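
A sketch of this search, assuming each row is given simply as a list of word lengths in pixels; the function names and the simple match count are illustrative simplifications:

```python
TOLERANCE = 5  # pixel tolerance on word lengths (example value)

def word_match(row1, row2, i, j, tol=TOLERANCE):
    """Word i of row1 matches word j of row2 if their lengths and the
    lengths of their immediate neighbours agree within the tolerance."""
    for d in (-1, 0, 1):
        a, b = i + d, j + d
        if not (0 <= a < len(row1) and 0 <= b < len(row2)):
            return False
        if abs(row1[a] - row2[b]) > tol:
            return False
    return True

def rows_match(row1, row2):
    """Two rows are assumed to match if they share three or more
    word matches."""
    count = sum(word_match(row1, row2, i, j)
                for i in range(len(row1)) for j in range(len(row2)))
    return count >= 3

def find_seed(rows1, rows2):
    """Stop as soon as two consecutive pairs of rows match; returns
    the (row index in image 1, row index in image 2) of the seed."""
    for i in range(len(rows1) - 1):
        for j in range(len(rows2) - 1):
            if rows_match(rows1[i], rows2[j]) and \
               rows_match(rows1[i + 1], rows2[j + 1]):
                return i, j
    return None
```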

Match list building

After a seed match has been found, the next step is to build the match list, which supplies the corresponding points of the two images. This is done by searching for further matching pairs of rows outward from the seed rows.

Image mosaicing

Figure 5: Mosaicing of two document images. Blurring is evident in the affine mosaic (b), but not in the mosaic constructed using a plane-to-plane projectivity (a). Close-ups of typical seams of (a) and (b) are shown in (c) and (d) respectively.

Given the list of corresponding points of the two images, the next step is to find the transformation between the overlapping portions of the images. Assuming a pinhole camera model, the transformation between pixels $(u, v)$ of image 1 and pixels $(u', v')$ of image 2 is a plane-to-plane projectivity [7]:

$$ u' = \frac{au + bv + c}{gu + hv + 1}, \qquad v' = \frac{du + ev + f}{gu + hv + 1}. \qquad \text{(Eq. 1)} $$

The eight parameters of the projectivity can be found from four pairs of matching points. The RANSAC regression technique [8] is used to reject outlying matches and to estimate the projectivity from the remaining good matches.
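
A minimal sketch using OpenCV's RANSAC-based homography estimator rather than the original system's own implementation; the reprojection threshold is an illustrative choice:

```python
import cv2
import numpy as np

def estimate_projectivity(pts1, pts2):
    """Estimate the plane-to-plane projectivity mapping image 1
    into image 2 from matched word corners, rejecting outliers
    with RANSAC. pts1, pts2: N x 2 arrays of corresponding points."""
    H, inlier_mask = cv2.findHomography(
        np.asarray(pts1, dtype=np.float32),
        np.asarray(pts2, dtype=np.float32),
        method=cv2.RANSAC,
        ransacReprojThreshold=3.0,   # illustrative outlier threshold
    )
    return H, inlier_mask

# Warping image 1 into image 2's coordinate system (cf. Eq. 1):
#   mosaic = cv2.warpPerspective(image1, H, (width, height))
```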

The projectivity is fine-tuned using correlation at the corners of the overlapping portion to obtain four correspondences to sub-pixel accuracy. Image 1 is then transformed into image 2's coordinate system using Eq. 1. A typical result is shown in Figure 5.

Composing many images

Finally, the whole page composition is built up by mapping all the images into the coordinate system of an "anchor" image, which is normally the one nearest the page center. The transformations to the anchor frame are calculated by concatenating the pair-wise transformations found earlier. The raw document mosaic is shown in Figure 6.
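
Concatenating the pairwise transformations amounts to multiplying 3×3 projectivity matrices along the chain that links each image to the anchor; a sketch, where `pairwise` is an assumed bookkeeping structure holding the matrices found earlier:

```python
import numpy as np

def to_anchor(pairwise, chain):
    """Concatenate pairwise projectivities along a chain of image
    indices ending at the anchor: pairwise[(i, j)] is the 3x3 matrix
    mapping image i into image j's frame, and chain is e.g.
    [3, 2, 1, 0] for image 3 with anchor image 0."""
    H = np.eye(3)
    for i, j in zip(chain, chain[1:]):
        H = pairwise[(i, j)] @ H   # apply each hop after the previous
    return H                       # maps chain[0] into the anchor frame
```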

However, non-consecutive images may also overlap, which poses a problem for purely pairwise registration. It can be solved by building hierarchical sub-mosaics. As shown in Figure 7, image 1 and image 2 are registered, as are image 3 and image 4, creating two sub-mosaics. These two sub-mosaics are then stitched together in a further mosaicing step.

Applied areas

The technique of document mosaicing can be applied in a variety of areas.



References

  1. Zappalá, Anthony; Gee, Andrew; Taylor, Michael (1999). "Document mosaicing". Image and Vision Computing. 17 (8): 589–595. doi:10.1016/S0262-8856(98)00178-4.
  2. Mann, S.; Picard, R. W. (1995). "Video orbits of the projective group: A new perspective on image mosaicing". Technical Report (Perceptual Computing Section), MIT Media Laboratory (338). CiteSeerX 10.1.1.56.6000.
  3. Brown, L. G. (1992). "A survey of image registration techniques". ACM Computing Surveys. 24 (4): 325–376. CiteSeerX 10.1.1.35.2732. doi:10.1145/146370.146374. S2CID 14576088.
  4. Bloomberg, Dan S.; Kopec, Gary E.; Dasari, Lakshmi (1995). "Measuring document image skew and orientation" (PDF). In Vincent, Luc M.; Baird, Henry S. (eds.). Document Recognition II. Proceedings of the SPIE. Vol. 2422. pp. 302–315. Bibcode:1995SPIE.2422..302B. doi:10.1117/12.205832. S2CID 5106427.
  5. Taylor, M. J.; Zappala, A.; Newman, W. M.; Dance, C. R. (1999). "Documents through cameras". Image and Vision Computing. 17 (11): 831–844. doi:10.1016/S0262-8856(98)00155-3.
  6. Preparata, F. P.; Shamos, M. I. (1985). Computational Geometry: An Introduction. Monographs in Computer Science. Springer-Verlag. ISBN 9780387961316.
  7. Mundy, J. L.; Zisserman, A. (1992). "Appendix: Projective geometry for machine vision". Geometric Invariance in Computer Vision. Cambridge, MA: MIT Press. CiteSeerX 10.1.1.17.1329. ISBN 9780262132855.
  8. Fischler, Martin A.; Bolles, Robert C. (1981). "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography" (PDF). Communications of the ACM. 24 (6): 381–395. doi:10.1145/358669.358692. S2CID 972888.
  9. Wellner, P. (1993). "Interacting with paper on the DigitalDesk". Communications of the ACM. 36 (7): 87–97. CiteSeerX 10.1.1.53.7526. doi:10.1145/159544.159630. S2CID 207174911.
  10. Szeliski, R. (1996). "Video mosaics for virtual environments". IEEE Computer Graphics and Applications. 16 (2): 22–30. doi:10.1109/38.486677.
