FaceNet

FaceNet is a facial recognition system developed by Florian Schroff, Dmitry Kalenichenko and James Philbin, a group of researchers affiliated with Google. The system was first presented at the IEEE Conference on Computer Vision and Pattern Recognition in 2015. [1] FaceNet uses a deep convolutional neural network to learn a mapping (also called an embedding) from a set of face images to the 128-dimensional Euclidean space; the similarity between two face images is then assessed as the square of the Euclidean distance between the corresponding normalized vectors in that space. The system used the triplet loss function as its cost function and introduced a new online triplet mining method. It achieved an accuracy of 99.63% on the Labeled Faces in the Wild dataset, the highest score under the unrestricted with labeled outside data protocol. [2]
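For illustration, the following minimal sketch (assuming NumPy; the embedding vectors and the decision threshold are hypothetical stand-ins, not values from the paper) shows how two normalized 128-dimensional embeddings would be compared:

```python
import numpy as np

def normalize(v):
    """Project a vector onto the unit hypersphere (L2-norm = 1)."""
    return v / np.linalg.norm(v)

def squared_l2_distance(f1, f2):
    """Squared Euclidean distance between two embedding vectors."""
    return float(np.sum((f1 - f2) ** 2))

# Hypothetical 128-dimensional embeddings of two face images.
rng = np.random.default_rng(0)
f1 = normalize(rng.standard_normal(128))
f2 = normalize(rng.standard_normal(128))

# Faces are judged same/different by thresholding the squared distance;
# the threshold below is illustrative, not a value from the paper.
THRESHOLD = 1.1
same_person = squared_l2_distance(f1, f2) < THRESHOLD
```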

Structure

Basic structure

The structure of the FaceNet face recognition system is represented schematically in Figure 1.

Figure 1: Overall structure of the FaceNet face recognition system

For training, the researchers used input batches of about 1800 images, in which each identity was represented by about 40 images of the same person, together with several randomly sampled images of other identities. These batches were fed to a deep convolutional neural network, which was trained by stochastic gradient descent with standard backpropagation and the Adaptive Gradient (AdaGrad) optimizer. The learning rate was initially set to 0.05 and later lowered while finalizing the model.
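A minimal sketch of this training setup, assuming PyTorch; the network and the loss here are trivial placeholders standing in for the real CNN and the triplet loss described below, not the authors' code:

```python
import torch

# Placeholder for the deep CNN (e.g. NN1); only the 220x220x3 input and
# 128-dimensional output match the paper.
model = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(220 * 220 * 3, 128),
)

# AdaGrad with the initial learning rate of 0.05 quoted above; the schedule
# for lowering it later is not specified here.
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.05)

def train_step(images):
    """One gradient-descent step with standard backpropagation on a mini-batch."""
    embeddings = model(images)
    loss = embeddings.pow(2).mean()  # placeholder; FaceNet minimizes the triplet loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

train_step(torch.randn(16, 3, 220, 220))  # a stand-in mini-batch of face crops
```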

Structure of the CNN

The researchers used two types of architectures, which they called NN1 and NN2, and explored their trade-offs. The practical differences between the models lie in their numbers of parameters and FLOPS. The details of the NN1 model are presented in the table below.

Structure of the CNN used in the model NN1 in the FaceNet face recognition system

Layer  | Size-in (rows × cols × #filters) | Size-out (rows × cols × #filters) | Kernel (rows × cols, stride) | Parameters | FLOPS
conv1  | 220×220×3  | 110×110×64 | 7×7×3, 2   | 9K   | 115M
pool1  | 110×110×64 | 55×55×64   | 3×3×64, 2  | 0    |
rnorm1 | 55×55×64   | 55×55×64   |            | 0    |
conv2a | 55×55×64   | 55×55×64   | 1×1×64, 1  | 4K   | 13M
conv2  | 55×55×64   | 55×55×192  | 3×3×64, 1  | 111K | 335M
rnorm2 | 55×55×192  | 55×55×192  |            | 0    |
pool2  | 55×55×192  | 28×28×192  | 3×3×192, 2 | 0    |
conv3a | 28×28×192  | 28×28×192  | 1×1×192, 1 | 37K  | 29M
conv3  | 28×28×192  | 28×28×384  | 3×3×192, 1 | 664K | 521M
pool3  | 28×28×384  | 14×14×384  | 3×3×384, 2 | 0    |
conv4a | 14×14×384  | 14×14×384  | 1×1×384, 1 | 148K | 29M
conv4  | 14×14×384  | 14×14×256  | 3×3×384, 1 | 885K | 173M
conv5a | 14×14×256  | 14×14×256  | 1×1×256, 1 | 66K  | 13M
conv5  | 14×14×256  | 14×14×256  | 3×3×256, 1 | 590K | 116M
conv6a | 14×14×256  | 14×14×256  | 1×1×256, 1 | 66K  | 13M
conv6  | 14×14×256  | 14×14×256  | 3×3×256, 1 | 590K | 116M
pool4  | 14×14×256  | 7×7×256    | 3×3×256, 2 | 0    |
concat | 7×7×256    | 7×7×256    |            | 0    |
fc1    | 7×7×256    | 1×32×128   | maxout p=2 | 103M | 103M
fc2    | 1×32×128   | 1×32×128   | maxout p=2 | 34M  | 34M
fc7128 | 1×32×128   | 1×1×128    |            | 524K | 0.5M
L2     | 1×1×128    | 1×1×128    |            | 0    |
Total  |            |            |            | 140M | 1.6B
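To illustrate how the table's rows map onto layers, here is a sketch of the first few rows (through conv2), assuming PyTorch; the padding values are assumptions chosen to reproduce the listed output sizes, not taken from the paper:

```python
import torch
import torch.nn as nn

# Layer arguments follow the "Kernel (rows × cols, stride)" column above;
# this is an illustration, not the original implementation.
nn1_stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # conv1: 220×220×3 -> 110×110×64
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),        # pool1: -> 55×55×64
    nn.LocalResponseNorm(size=5),                            # rnorm1
    nn.Conv2d(64, 64, kernel_size=1, stride=1),              # conv2a: 1×1 bottleneck
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 192, kernel_size=3, stride=1, padding=1),  # conv2: -> 55×55×192
    nn.ReLU(inplace=True),
)

x = torch.randn(1, 3, 220, 220)
assert nn1_stem(x).shape == (1, 192, 55, 55)  # matches the conv2 size-out row
```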

Triplet loss function

The triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative of a different identity.

The loss function that was used in the FaceNet system was called the "triplet loss function". This was a novel idea introduced by the developers of the FaceNet system. This function is defined using certain triplets $(A^{(i)}, P^{(i)}, N^{(i)})$ of training images. In this triplet, $A^{(i)}$ (called an "anchor image") denotes the image of a person, $P^{(i)}$ (called a "positive image") denotes some other image of the same person, and $N^{(i)}$ (called a "negative image") denotes an image of some person other than the person shown in $A^{(i)}$. Let $x$ be some image and let $f(x)$ be the embedding of $x$ in the 128-dimensional Euclidean space. It is assumed that the L2-norm of $f(x)$ is unity. (The L2-norm of a vector $x$ in a finite-dimensional Euclidean space is denoted by $\lVert x \rVert_2$.) We pick such triplets from the training data set; let there be $m$ such triplets and let $(A^{(i)}, P^{(i)}, N^{(i)})$ be a typical triplet. The training is to ensure that, after learning, the following condition, called the "triplet constraint", is satisfied by all triplets $(A^{(i)}, P^{(i)}, N^{(i)})$ in the training data set:

$$\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 + \alpha < \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2$$

where $\alpha$ is a constant called the margin, whose value has to be set manually. Its value was set to 0.2.

Thus, the function to be minimized is the following function, called the triplet loss function:

$$L = \sum_{i=1}^{m} \max\!\left( \lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 - \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2 + \alpha,\; 0 \right)$$
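As an illustration, here is a minimal sketch of this formula, assuming PyTorch; the tensors f_a, f_p and f_n are hypothetical stand-ins for the embeddings $f(A^{(i)})$, $f(P^{(i)})$ and $f(N^{(i)})$ of a batch of $m$ triplets:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss over m triplets of unit-norm embeddings of shape (m, 128)."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)  # ||f(A) - f(P)||^2
    d_neg = (f_a - f_n).pow(2).sum(dim=1)  # ||f(A) - f(N)||^2
    return torch.clamp(d_pos - d_neg + alpha, min=0).sum()

# Example with m = 4 randomly generated, L2-normalized embeddings.
f = lambda t: F.normalize(t, dim=1)
loss = triplet_loss(f(torch.randn(4, 128)), f(torch.randn(4, 128)), f(torch.randn(4, 128)))
```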

Selection of triplets

In general, the number of triplets of the form $(A^{(i)}, P^{(i)}, N^{(i)})$ is very large. To make computations faster, the Google researchers considered only those triplets which violate the triplet constraint. For a given anchor image $A^{(i)}$, they chose the positive image $P^{(i)}$ for which $\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2$ is maximal (such a positive image was called a "hard positive image") and the negative image $N^{(i)}$ for which $\lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2$ is minimal (such a negative image was called a "hard negative image"). Since using the whole training data set to determine the hard positive and hard negative images was computationally expensive and infeasible, the researchers experimented with several methods for selecting the triplets, as sketched below.
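One such method, mining the hardest positive and negative within each mini-batch, can be sketched as follows. This is an illustrative reconstruction assuming PyTorch, not the authors' code; hardest_triplets is a hypothetical helper, and it assumes every anchor has at least one same-identity partner in the batch:

```python
import torch

def hardest_triplets(embeddings, labels):
    """Per anchor, pick the hardest positive and hardest negative in the batch.

    embeddings: (n, 128) unit-norm embeddings; labels: (n,) identity ids.
    """
    # Pairwise squared Euclidean distances between all embeddings in the batch.
    d = torch.cdist(embeddings, embeddings).pow(2)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool)

    # Hard positive: maximal distance among same-identity pairs (excluding self).
    hard_pos = d.masked_fill(~same | eye, float('-inf')).argmax(dim=1)

    # Hard negative: minimal distance among different-identity pairs.
    hard_neg = d.masked_fill(same, float('inf')).argmin(dim=1)
    return hard_pos, hard_neg
```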

Performance

On the widely used Labeled Faces in the Wild (LFW) dataset, the FaceNet system achieved an accuracy of 99.63%, the highest score on LFW under the unrestricted with labeled outside data protocol. [2] On YouTube Faces DB the system achieved an accuracy of 95.12%. [1]

References

  1. Florian Schroff; Dmitry Kalenichenko; James Philbin (2015). "FaceNet: A Unified Embedding for Face Recognition and Clustering" (PDF). The Computer Vision Foundation. Retrieved 4 October 2023.
  2. Erik Learned-Miller; Gary Huang; Aruni RoyChowdhury; Haoxiang Li; Gang Hua (April 2016). "Labeled Faces in the Wild: A Survey". Advances in Face Detection and Facial Image Analysis (PDF). Springer. pp. 189–248. Retrieved 5 October 2023.