FaceNet is a facial recognition system developed by Florian Schroff, Dmitry Kalenichenko and James Philbin, a group of researchers affiliated with Google. The system was first presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [1] FaceNet uses a deep convolutional neural network to learn a mapping (also called an embedding) from a set of face images to 128-dimensional Euclidean space; the similarity between two face images is then assessed by the square of the Euclidean distance between the corresponding normalized vectors in that space. The system used the triplet loss function as its cost function and introduced a new online triplet mining method. It achieved an accuracy of 99.63%, the highest score on the Labeled Faces in the Wild dataset under the unrestricted with labeled outside data protocol. [2]
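The similarity measure can be sketched in a few lines. This is an illustrative snippet with random vectors standing in for actual network outputs, not the FaceNet implementation:

```python
import numpy as np

def l2_normalize(v):
    # Scale a vector to unit Euclidean length, as FaceNet does with embeddings.
    return v / np.linalg.norm(v)

def squared_distance(a, b):
    # Squared Euclidean distance between two normalized embeddings.
    d = a - b
    return float(np.dot(d, d))

# Hypothetical 128-dimensional embeddings standing in for network outputs.
rng = np.random.default_rng(0)
emb1 = l2_normalize(rng.normal(size=128))
emb2 = l2_normalize(rng.normal(size=128))

# For unit vectors the squared distance lies in [0, 4]:
# 0 for identical embeddings, larger values for dissimilar faces.
print(squared_distance(emb1, emb1))  # 0.0
print(squared_distance(emb1, emb2))  # some value in (0, 4]
```

A threshold on this squared distance then decides whether two images show the same person.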
The structure of the FaceNet face recognition system is represented schematically in Figure 1.
For training, the researchers used as input batches of about 1,800 images, in which each identity was represented by about 40 images, together with several randomly sampled images of other identities. These batches were fed to a deep convolutional neural network, which was trained using stochastic gradient descent with standard backpropagation and the Adaptive Gradient (AdaGrad) optimizer. The learning rate was initially set to 0.05 and was lowered while finalizing the model.
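The AdaGrad update rule can be sketched as follows. This is a generic illustration on a toy one-dimensional objective, not the authors' training code; the function names and the toy objective are assumptions:

```python
import numpy as np

def adagrad_step(params, grads, cache, lr=0.05, eps=1e-8):
    # AdaGrad accumulates squared gradients per parameter and divides each
    # step by the square root of that running sum, so frequently updated
    # parameters receive smaller steps over time.
    cache = cache + grads ** 2
    params = params - lr * grads / (np.sqrt(cache) + eps)
    return params, cache

# Toy example: minimize f(w) = w^2 starting from w = 5, using the paper's
# initial learning rate of 0.05.
w = np.array([5.0])
cache = np.zeros_like(w)
for _ in range(500):
    grad = 2.0 * w                    # gradient of w^2
    w, cache = adagrad_step(w, grad, cache)
print(w)  # has moved toward the minimum at 0
```

Because the accumulated cache only grows, the effective step size decays over training, which mimics (crudely) the lowered learning rate mentioned above.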
The researchers used two types of architectures, which they called NN1 and NN2, and explored their trade-offs. The practical differences between the models lie in their numbers of parameters and FLOPS. The details of the NN1 model are presented in the table below.
Layer | Size-in (rows × cols × #filters) | Size-out (rows × cols × #filters) | Kernel (rows × cols, stride) | Parameters | FLOPS
---|---|---|---|---|---
conv1 | 220×220×3 | 110×110×64 | 7×7×3, 2 | 9K | 115M
pool1 | 110×110×64 | 55×55×64 | 3×3×64, 2 | 0 | —
rnorm1 | 55×55×64 | 55×55×64 | — | 0 | —
conv2a | 55×55×64 | 55×55×64 | 1×1×64, 1 | 4K | 13M
conv2 | 55×55×64 | 55×55×192 | 3×3×64, 1 | 111K | 335M
rnorm2 | 55×55×192 | 55×55×192 | — | 0 | —
pool2 | 55×55×192 | 28×28×192 | 3×3×192, 2 | 0 | —
conv3a | 28×28×192 | 28×28×192 | 1×1×192, 1 | 37K | 29M
conv3 | 28×28×192 | 28×28×384 | 3×3×192, 1 | 664K | 521M
pool3 | 28×28×384 | 14×14×384 | 3×3×384, 2 | 0 | —
conv4a | 14×14×384 | 14×14×384 | 1×1×384, 1 | 148K | 29M
conv4 | 14×14×384 | 14×14×256 | 3×3×384, 1 | 885K | 173M
conv5a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv5 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
conv6a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv6 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
pool4 | 14×14×256 | 7×7×256 | 3×3×256, 2 | 0 | —
concat | 7×7×256 | 7×7×256 | — | 0 | —
fc1 | 7×7×256 | 1×32×128 | maxout p=2 | 103M | 103M
fc2 | 1×32×128 | 1×32×128 | maxout p=2 | 34M | 34M
fc7128 | 1×32×128 | 1×1×128 | — | 524K | 0.5M
L2 | 1×1×128 | 1×1×128 | — | 0 | —
Total | | | | 140M | 1.6B
The loss function used in the FaceNet system was called the "triplet loss function". This was a novel idea introduced by the developers of the FaceNet system. The function is defined on triplets of training images of the form $(A, P, N)$. In such a triplet, $A$ (called the "anchor image") denotes an image of a person, $P$ (called the "positive image") denotes some other image of the same person, and $N$ (called the "negative image") denotes an image of a different person. Let $x$ be some image and let $f(x)$ be the embedding of $x$ in the 128-dimensional Euclidean space. It shall be assumed that the L2-norm of $f(x)$ is unity. (The L2-norm of a vector $x$ in a finite-dimensional Euclidean space is denoted by $\lVert x \rVert$.) We pick $m$ such triplets from the training data set; let $(A^{(i)}, P^{(i)}, N^{(i)})$ denote a typical triplet. The training is to ensure that, after learning, the following condition, called the "triplet constraint", is satisfied by all triplets in the training data set:

$$\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 + \alpha < \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2$$

where $\alpha$ is a constant called the margin, whose value has to be set manually; it was set to 0.2.

Thus, the function to be minimized is the following function, called the triplet loss function:

$$L = \sum_{i=1}^{m} \max\Bigl( \lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 - \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2 + \alpha,\ 0 \Bigr)$$
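The triplet loss can be sketched directly from its definition. This is an illustrative re-implementation with hand-built embeddings, not the paper's code:

```python
import numpy as np

def triplet_loss(anchors, positives, negatives, alpha=0.2):
    # Each argument is an (m, d) array of embeddings; alpha is the margin.
    # A triplet contributes only when it violates the triplet constraint.
    pos_d2 = np.sum((anchors - positives) ** 2, axis=1)   # ||f(A)-f(P)||^2
    neg_d2 = np.sum((anchors - negatives) ** 2, axis=1)   # ||f(A)-f(N)||^2
    return float(np.sum(np.maximum(pos_d2 - neg_d2 + alpha, 0.0)))

# Two orthogonal unit embeddings as a toy example.
e1, e2 = np.eye(128)[0], np.eye(128)[1]

# Satisfied triplet (positive coincides with the anchor): loss is 0.
print(triplet_loss(e1[None], e1[None], e2[None]))  # 0.0
# Violating triplet (positive far, negative identical): loss is 2 + alpha.
print(triplet_loss(e1[None], e2[None], e1[None]))  # 2.2
```

The `max(…, 0)` hinge means triplets that already satisfy the constraint contribute nothing to the gradient.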
In general the number of triplets of the form $(A^{(i)}, P^{(i)}, N^{(i)})$ is very large. To make computations faster, the Google researchers considered only those triplets that violate the triplet constraint. For a given anchor image $A^{(i)}$, they chose the positive image $P^{(i)}$ for which $\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2$ is maximal (such a positive image was called a "hard positive image") and the negative image $N^{(i)}$ for which $\lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2$ is minimal (such a negative image was called a "hard negative image"). Since using the whole training data set to determine the hard positive and hard negative images was computationally expensive and infeasible, the researchers experimented with several methods for selecting the triplets.
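The hard-exemplar selection can be sketched as naive batch-hard mining within a mini-batch. This is a simplified illustration (the paper in fact favored semi-hard negatives, and all names here are assumptions):

```python
import numpy as np

def mine_hard_triplets(embeddings, labels):
    # For each anchor in a mini-batch, pick the hardest positive (the
    # same-identity embedding at maximal squared distance) and the hardest
    # negative (the different-identity embedding at minimal squared distance).
    d2 = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=2)
    triplets = []
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False                # exclude the anchor itself
        diff = labels != lab
        if not same.any() or not diff.any():
            continue                   # no valid positive or negative exists
        hard_pos = int(np.argmax(np.where(same, d2[i], -np.inf)))
        hard_neg = int(np.argmin(np.where(diff, d2[i], np.inf)))
        triplets.append((i, hard_pos, hard_neg))
    return triplets

# Toy batch: two identities, two images each.
emb = np.array([[1.0, 0.0],
                [0.8, 0.6],
                [0.0, 1.0],
                [-1.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(mine_hard_triplets(emb, labels))
# [(0, 1, 2), (1, 0, 2), (2, 3, 1), (3, 2, 1)]
```

Mining within the mini-batch rather than the whole data set keeps the distance matrix small enough to recompute at every step.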
On the widely used Labeled Faces in the Wild (LFW) dataset, the FaceNet system achieved an accuracy of 99.63% which is the highest score on LFW in the unrestricted with labeled outside data protocol. [2] On YouTube Faces DB the system achieved an accuracy of 95.12%. [1]
In mathematics, the absolute value or modulus of a real number $x$, denoted $|x|$, is the non-negative value of $x$ without regard to its sign. Namely, $|x| = x$ if $x$ is a positive number, $|x| = -x$ if $x$ is negative, and $|0| = 0$. For example, the absolute value of 3 is 3, and the absolute value of −3 is also 3. The absolute value of a number may be thought of as its distance from zero.
In mathematics, convolution is a mathematical operation on two functions that produces a third function that expresses how the shape of one is modified by the other. The term convolution refers to both the result function and to the process of computing it. It is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. The choice of which function is reflected and shifted before the integral does not change the integral result. The integral is evaluated for all values of shift, producing the convolution function.
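The discrete analogue of this definition can be illustrated with NumPy; the arrays below are arbitrary examples:

```python
import numpy as np

# Discrete convolution: (f * g)[n] = sum over m of f[m] * g[n - m].
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

print(np.convolve(f, g))  # values: 0, 1, 2.5, 4, 1.5
# The choice of which sequence is reflected and shifted does not matter:
print(np.convolve(g, f))  # identical result
```

The symmetry shown by the second call is the discrete counterpart of the statement above that the integral is unchanged by swapping which function is reflected.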
In geometry, a Cartesian coordinate system in a plane is a coordinate system that specifies each point uniquely by a pair of real numbers called coordinates, which are the signed distances to the point from two fixed perpendicular oriented lines, called coordinate lines, coordinate axes or just axes of the system. The point where they meet is called the origin and has (0, 0) as coordinates.
Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid's Elements, it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean spaces of any positive integer dimension n, which are called Euclidean n-spaces when one wants to specify their dimension. For n equal to one or two, they are commonly called respectively Euclidean lines and Euclidean planes. The qualifier "Euclidean" is used to distinguish Euclidean spaces from other spaces that were later considered in physics and modern mathematics.
In mathematics, a normed vector space or normed space is a vector space over the real or complex numbers on which a norm is defined. A norm is a generalization of the intuitive notion of "length" in the physical world. If $V$ is a vector space over $K$, where $K$ is a field equal to $\mathbb{R}$ or to $\mathbb{C}$, then a norm on $V$ is a map $V \to \mathbb{R}$, typically denoted by $\lVert \cdot \rVert$, satisfying the following four axioms: non-negativity ($\lVert x \rVert \ge 0$), positive definiteness ($\lVert x \rVert = 0$ if and only if $x = 0$), absolute homogeneity ($\lVert sx \rVert = |s| \, \lVert x \rVert$ for every scalar $s$), and the triangle inequality ($\lVert x + y \rVert \le \lVert x \rVert + \lVert y \rVert$).
In mathematics and physics, a vector space is a set whose elements, often called vectors, may be added together and multiplied ("scaled") by numbers called scalars. Scalars are often real numbers, but can be complex numbers or, more generally, elements of any field. The operations of vector addition and scalar multiplication must satisfy certain requirements, called vector axioms. Real vector space and complex vector space are kinds of vector spaces based on different kinds of scalars: real coordinate space or complex coordinate space.
In Euclidean geometry, an affine transformation or affinity is a geometric transformation that preserves lines and parallelism, but not necessarily Euclidean distances and angles.
In mathematics, a ball is the solid figure bounded by a sphere; it is also called a solid sphere. It may be a closed ball or an open ball.
In mathematics, a function from a set X to a set Y assigns to each element of X exactly one element of Y. The set X is called the domain of the function and the set Y is called the codomain of the function.
An eigenface is the name given to a set of eigenvectors when used in the computer vision problem of human face recognition. The approach of using eigenfaces for recognition was developed by Sirovich and Kirby and used by Matthew Turk and Alex Pentland in face classification. The eigenvectors are derived from the covariance matrix of the probability distribution over the high-dimensional vector space of face images. The eigenfaces themselves form a basis set of all images used to construct the covariance matrix. This produces dimension reduction by allowing the smaller set of basis images to represent the original training images. Classification can be achieved by comparing how faces are represented by the basis set.
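The eigenface construction can be sketched with a principal-component decomposition; the "images" below are random vectors standing in for real face data:

```python
import numpy as np

# Toy data: 6 "images", each flattened to a 16-dimensional vector.
rng = np.random.default_rng(0)
faces = rng.normal(size=(6, 16))

# Center the data; the eigenvectors of the covariance matrix are the
# right singular vectors of the centered data matrix.
mean_face = faces.mean(axis=0)
centered = faces - mean_face
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt                       # each row is one eigenface

# Dimension reduction: represent each face by 3 coefficients instead of 16.
coeffs = centered @ eigenfaces[:3].T            # shape (6, 3)
approx = mean_face + coeffs @ eigenfaces[:3]    # reconstruction from the basis
```

Classification then compares faces by their coefficient vectors in this reduced basis rather than by raw pixels.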
In mathematics, the magnitude or size of a mathematical object is a property which determines whether the object is larger or smaller than other objects of the same kind. More formally, an object's magnitude is the displayed result of an ordering of the class of objects to which it belongs.
In classical Euclidean geometry, a point is a primitive notion that models an exact location in space, and has no length, width, or thickness. In modern mathematics, a point is considered as an element of some set, a point set. A space is a point set with some additional structure. An isolated point has no other neighboring points in a given subset.
In mathematics, a square is the result of multiplying a number by itself. The verb "to square" is used to denote this operation. Squaring is the same as raising to the power 2, and is denoted by a superscript 2; for instance, the square of 3 may be written as 3², which is the number 9. In some cases when superscripts are not available, as for instance in programming languages or plain text files, the notations x^2 (caret) or x**2 may be used in place of x². The adjective which corresponds to squaring is quadratic.
In mathematics, a norm is a function from a real or complex vector space to the non-negative real numbers that behaves in certain ways like the distance from the origin: it commutes with scaling, obeys a form of the triangle inequality, and is zero only at the origin. In particular, the Euclidean distance in a Euclidean space is defined by a norm on the associated Euclidean vector space, called the Euclidean norm, the 2-norm, or, sometimes, the magnitude of the vector. This norm can be defined as the square root of the inner product of a vector with itself.
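For instance, the Euclidean norm computed as the square root of a vector's inner product with itself (a routine NumPy illustration):

```python
import numpy as np

v = np.array([3.0, 4.0])

# Euclidean norm: square root of the inner product of v with itself.
norm = np.sqrt(np.dot(v, v))
print(norm)                 # 5.0
print(np.linalg.norm(v))    # 5.0, the equivalent library call
```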
In mathematics, the real coordinate space or real coordinate n-space, of dimension n, denoted $\mathbb{R}^n$, is the set of the n-tuples of real numbers, that is the set of all sequences of n real numbers. Special cases are called the real line $\mathbb{R}^1$, the real coordinate plane $\mathbb{R}^2$, and the real coordinate three-dimensional space $\mathbb{R}^3$. With component-wise addition and scalar multiplication, it is a real vector space, and its elements are called coordinate vectors.
In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point. More precisely, an $n$-dimensional manifold, or $n$-manifold for short, is a topological space with the property that each point has a neighborhood that is homeomorphic to an open subset of $n$-dimensional Euclidean space.
Mean shift is a non-parametric feature-space mathematical analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing.
Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification.
A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.
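The weight-sharing idea can be sketched minimally; the encoder, shapes, and names here are arbitrary assumptions, not a real face model:

```python
import numpy as np

def encode(x, w):
    # Both branches of the Siamese pair apply the SAME weights w.
    return np.tanh(x @ w)

def siamese_distance(x1, x2, w):
    # Compare the two comparable output vectors with a Euclidean distance.
    return float(np.linalg.norm(encode(x1, w) - encode(x2, w)))

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))
a, b = rng.normal(size=8), rng.normal(size=8)

print(siamese_distance(a, a, w))      # 0.0: identical inputs, shared weights
print(siamese_distance(a, b, w) > 0)  # True for distinct inputs
```

Precomputing `encode(x1, w)` once and reusing it is exactly the baseline-comparison pattern described above.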
Triplet loss is a loss function for machine learning algorithms where a reference input is compared to a matching input and a non-matching input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. An early formulation equivalent to triplet loss was introduced for metric learning from relative comparisons by M. Schultz and T. Joachims in 2003.