FaceNet is a facial recognition system developed by Florian Schroff, Dmitry Kalenichenko and James Philbin, a group of researchers affiliated with Google. The system was first presented at the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). [1] FaceNet uses a deep convolutional neural network to learn a mapping (also called an embedding) from a set of face images to 128-dimensional Euclidean space; the similarity between two face images is then assessed by the square of the Euclidean distance between the corresponding normalized vectors in that space. The system used the triplet loss function as its cost function and introduced a new online triplet mining method. It achieved an accuracy of 99.63%, the highest score on the Labeled Faces in the Wild dataset under the unrestricted with labeled outside data protocol. [2]
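The similarity measure can be sketched in a few lines. This is an illustrative snippet with random vectors standing in for actual network outputs, not the FaceNet implementation:

```python
import numpy as np

def l2_normalize(v):
    # Scale a vector to unit Euclidean length, as FaceNet does with embeddings.
    return v / np.linalg.norm(v)

def squared_distance(a, b):
    # Squared Euclidean distance between two normalized embeddings.
    d = a - b
    return float(np.dot(d, d))

# Hypothetical 128-dimensional embeddings standing in for network outputs.
rng = np.random.default_rng(0)
emb1 = l2_normalize(rng.normal(size=128))
emb2 = l2_normalize(rng.normal(size=128))

# For unit vectors the squared distance lies in [0, 4]:
# 0 for identical embeddings, larger values for dissimilar faces.
print(squared_distance(emb1, emb1))  # 0.0
print(squared_distance(emb1, emb2))  # some value in (0, 4]
```

A threshold on this squared distance then decides whether two images show the same person.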
The structure of the FaceNet face recognition system is represented schematically in Figure 1.
For training, the researchers used as input batches of about 1,800 images, in which each identity was represented by about 40 images, together with several randomly sampled images of other identities. These batches were fed to a deep convolutional neural network, which was trained using stochastic gradient descent with standard backpropagation and the Adaptive Gradient (AdaGrad) optimizer. The learning rate was initially set to 0.05 and was lowered while finalizing the model.
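The AdaGrad update rule can be sketched as follows. This is a generic illustration on a toy one-dimensional objective, not the authors' training code; the function names and the toy objective are assumptions:

```python
import numpy as np

def adagrad_step(params, grads, cache, lr=0.05, eps=1e-8):
    # AdaGrad accumulates squared gradients per parameter and divides each
    # step by the square root of that running sum, so frequently updated
    # parameters receive smaller steps over time.
    cache = cache + grads ** 2
    params = params - lr * grads / (np.sqrt(cache) + eps)
    return params, cache

# Toy example: minimize f(w) = w^2 starting from w = 5, using the paper's
# initial learning rate of 0.05.
w = np.array([5.0])
cache = np.zeros_like(w)
for _ in range(500):
    grad = 2.0 * w                    # gradient of w^2
    w, cache = adagrad_step(w, grad, cache)
print(w)  # has moved toward the minimum at 0
```

Because the accumulated cache only grows, the effective step size decays over training, which mimics (crudely) the lowered learning rate mentioned above.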
The researchers used two types of architectures, which they called NN1 and NN2, and explored their trade-offs. The practical differences between the models lie in their numbers of parameters and FLOPS. The details of the NN1 model are presented in the table below.
Layer | Size-in (rows × cols × #filters) | Size-out (rows × cols × #filters) | Kernel (rows × cols, stride) | Parameters | FLOPS
---|---|---|---|---|---
conv1 | 220×220×3 | 110×110×64 | 7×7×3, 2 | 9K | 115M
pool1 | 110×110×64 | 55×55×64 | 3×3×64, 2 | 0 | —
rnorm1 | 55×55×64 | 55×55×64 | — | 0 | —
conv2a | 55×55×64 | 55×55×64 | 1×1×64, 1 | 4K | 13M
conv2 | 55×55×64 | 55×55×192 | 3×3×64, 1 | 111K | 335M
rnorm2 | 55×55×192 | 55×55×192 | — | 0 | —
pool2 | 55×55×192 | 28×28×192 | 3×3×192, 2 | 0 | —
conv3a | 28×28×192 | 28×28×192 | 1×1×192, 1 | 37K | 29M
conv3 | 28×28×192 | 28×28×384 | 3×3×192, 1 | 664K | 521M
pool3 | 28×28×384 | 14×14×384 | 3×3×384, 2 | 0 | —
conv4a | 14×14×384 | 14×14×384 | 1×1×384, 1 | 148K | 29M
conv4 | 14×14×384 | 14×14×256 | 3×3×384, 1 | 885K | 173M
conv5a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv5 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
conv6a | 14×14×256 | 14×14×256 | 1×1×256, 1 | 66K | 13M
conv6 | 14×14×256 | 14×14×256 | 3×3×256, 1 | 590K | 116M
pool4 | 14×14×256 | 7×7×256 | 3×3×256, 2 | 0 | —
concat | 7×7×256 | 7×7×256 | — | 0 | —
fc1 | 7×7×256 | 1×32×128 | maxout p=2 | 103M | 103M
fc2 | 1×32×128 | 1×32×128 | maxout p=2 | 34M | 34M
fc7128 | 1×32×128 | 1×1×128 | — | 524K | 0.5M
L2 | 1×1×128 | 1×1×128 | — | 0 | —
Total | | | | 140M | 1.6B
The loss function used in the FaceNet system was called the "triplet loss function". This was a novel idea introduced by the developers of the FaceNet system. The function is defined on triplets of training images of the form $(A, P, N)$. In such a triplet, $A$ (called the "anchor image") denotes an image of a person, $P$ (called the "positive image") denotes some other image of the same person, and $N$ (called the "negative image") denotes an image of a different person. Let $x$ be some image and let $f(x)$ be the embedding of $x$ in the 128-dimensional Euclidean space. It shall be assumed that the L2-norm of $f(x)$ is unity. (The L2-norm of a vector $x$ in a finite-dimensional Euclidean space is denoted by $\lVert x \rVert$.) We pick $m$ such triplets from the training data set; let $(A^{(i)}, P^{(i)}, N^{(i)})$ denote a typical triplet. The training is to ensure that, after learning, the following condition, called the "triplet constraint", is satisfied by all triplets in the training data set:

$$\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 + \alpha < \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2$$

where $\alpha$ is a constant called the margin, whose value has to be set manually; it was set to 0.2.

Thus, the function to be minimized is the following function, called the triplet loss function:

$$L = \sum_{i=1}^{m} \max\Bigl( \lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2 - \lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2 + \alpha,\ 0 \Bigr)$$
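The triplet loss can be sketched directly from its definition. This is an illustrative re-implementation with hand-built embeddings, not the paper's code:

```python
import numpy as np

def triplet_loss(anchors, positives, negatives, alpha=0.2):
    # Each argument is an (m, d) array of embeddings; alpha is the margin.
    # A triplet contributes only when it violates the triplet constraint.
    pos_d2 = np.sum((anchors - positives) ** 2, axis=1)   # ||f(A)-f(P)||^2
    neg_d2 = np.sum((anchors - negatives) ** 2, axis=1)   # ||f(A)-f(N)||^2
    return float(np.sum(np.maximum(pos_d2 - neg_d2 + alpha, 0.0)))

# Two orthogonal unit embeddings as a toy example.
e1, e2 = np.eye(128)[0], np.eye(128)[1]

# Satisfied triplet (positive coincides with the anchor): loss is 0.
print(triplet_loss(e1[None], e1[None], e2[None]))  # 0.0
# Violating triplet (positive far, negative identical): loss is 2 + alpha.
print(triplet_loss(e1[None], e2[None], e1[None]))  # 2.2
```

The `max(…, 0)` hinge means triplets that already satisfy the constraint contribute nothing to the gradient.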
In general the number of triplets of the form $(A^{(i)}, P^{(i)}, N^{(i)})$ is very large. To make computations faster, the Google researchers considered only those triplets that violate the triplet constraint. For a given anchor image $A^{(i)}$, they chose the positive image $P^{(i)}$ for which $\lVert f(A^{(i)}) - f(P^{(i)}) \rVert_2^2$ is maximal (such a positive image was called a "hard positive image") and the negative image $N^{(i)}$ for which $\lVert f(A^{(i)}) - f(N^{(i)}) \rVert_2^2$ is minimal (such a negative image was called a "hard negative image"). Since using the whole training data set to determine the hard positive and hard negative images was computationally expensive and infeasible, the researchers experimented with several methods for selecting the triplets.
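The hard-exemplar selection can be sketched as naive batch-hard mining within a mini-batch. This is a simplified illustration (the paper in fact favored semi-hard negatives, and all names here are assumptions):

```python
import numpy as np

def mine_hard_triplets(embeddings, labels):
    # For each anchor in a mini-batch, pick the hardest positive (the
    # same-identity embedding at maximal squared distance) and the hardest
    # negative (the different-identity embedding at minimal squared distance).
    d2 = np.sum((embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=2)
    triplets = []
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False                # exclude the anchor itself
        diff = labels != lab
        if not same.any() or not diff.any():
            continue                   # no valid positive or negative exists
        hard_pos = int(np.argmax(np.where(same, d2[i], -np.inf)))
        hard_neg = int(np.argmin(np.where(diff, d2[i], np.inf)))
        triplets.append((i, hard_pos, hard_neg))
    return triplets

# Toy batch: two identities, two images each.
emb = np.array([[1.0, 0.0],
                [0.8, 0.6],
                [0.0, 1.0],
                [-1.0, 0.0]])
labels = np.array([0, 0, 1, 1])
print(mine_hard_triplets(emb, labels))
# [(0, 1, 2), (1, 0, 2), (2, 3, 1), (3, 2, 1)]
```

Mining within the mini-batch rather than the whole data set keeps the distance matrix small enough to recompute at every step.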
On the widely used Labeled Faces in the Wild (LFW) dataset, the FaceNet system achieved an accuracy of 99.63% which is the highest score on LFW in the unrestricted with labeled outside data protocol. [2] On YouTube Faces DB the system achieved an accuracy of 95.12%. [1]
In mathematics, the absolute value or modulus of a real number $x$, denoted $|x|$, is the non-negative value of $x$ without regard to its sign. Namely, $|x| = x$ if $x$ is a positive number, $|x| = -x$ if $x$ is negative, and $|0| = 0$. For example, the absolute value of 3 is 3, and the absolute value of −3 is also 3. The absolute value of a number may be thought of as its distance from zero.
In mathematics, convolution is a mathematical operation on two functions that produces a third function that expresses how the shape of one is modified by the other. The term convolution refers to both the result function and to the process of computing it. It is defined as the integral of the product of the two functions after one is reflected about the y-axis and shifted. The choice of which function is reflected and shifted before the integral does not change the integral result. The integral is evaluated for all values of shift, producing the convolution function.
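The discrete analogue of this definition can be illustrated with NumPy; the arrays below are arbitrary examples:

```python
import numpy as np

# Discrete convolution: (f * g)[n] = sum over m of f[m] * g[n - m].
f = np.array([1.0, 2.0, 3.0])
g = np.array([0.0, 1.0, 0.5])

print(np.convolve(f, g))  # values: 0, 1, 2.5, 4, 1.5
# The choice of which sequence is reflected and shifted does not matter:
print(np.convolve(g, f))  # identical result
```

The symmetry shown by the second call is the discrete counterpart of the statement above that the integral is unchanged by swapping which function is reflected.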
In geometry, a Cartesian coordinate system in a plane is a coordinate system that specifies each point uniquely by a pair of real numbers called coordinates, which are the signed distances to the point from two fixed perpendicular oriented lines, called coordinate lines, coordinate axes or just axes of the system. The point where they meet is called the origin and has (0, 0) as coordinates.
Euclidean space is the fundamental space of geometry, intended to represent physical space. Originally, in Euclid's Elements, it was the three-dimensional space of Euclidean geometry, but in modern mathematics there are Euclidean spaces of any positive integer dimension n, which are called Euclidean n-spaces when one wants to specify their dimension. For n equal to one or two, they are commonly called respectively Euclidean lines and Euclidean planes. The qualifier "Euclidean" is used to distinguish Euclidean spaces from other spaces that were later considered in physics and modern mathematics.
In mathematics, a normed vector space or normed space is a vector space over the real or complex numbers on which a norm is defined. A norm is a generalization of the intuitive notion of "length" in the physical world. If $V$ is a vector space over $K$, where $K$ is a field equal to $\mathbb{R}$ or to $\mathbb{C}$, then a norm on $V$ is a map $V \to \mathbb{R}$, typically denoted by $\lVert \cdot \rVert$, satisfying the following four axioms: non-negativity ($\lVert x \rVert \ge 0$), positive definiteness ($\lVert x \rVert = 0$ if and only if $x = 0$), absolute homogeneity ($\lVert sx \rVert = |s| \, \lVert x \rVert$ for every scalar $s$), and the triangle inequality ($\lVert x + y \rVert \le \lVert x \rVert + \lVert y \rVert$).
In mathematics and physics, a vector space is a set whose elements, often called vectors, may be added together and multiplied ("scaled") by numbers called scalars. Scalars are often real numbers, but can be complex numbers or, more generally, elements of any field. The operations of vector addition and scalar multiplication must satisfy certain requirements, called vector axioms. Real vector space and complex vector space are kinds of vector spaces based on different kinds of scalars: real coordinate space or complex coordinate space.
In Euclidean geometry, an affine transformation or affinity is a geometric transformation that preserves lines and parallelism, but not necessarily Euclidean distances and angles.
In mathematics, a ball is the solid figure bounded by a sphere; it is also called a solid sphere. It may be a closed ball or an open ball.
In mathematics, a function from a set X to a set Y assigns to each element of X exactly one element of Y. The set X is called the domain of the function and the set Y is called the codomain of the function.
An eigenface is the name given to a set of eigenvectors when used in the computer vision problem of human face recognition. The approach of using eigenfaces for recognition was developed by Sirovich and Kirby and used by Matthew Turk and Alex Pentland in face classification. The eigenvectors are derived from the covariance matrix of the probability distribution over the high-dimensional vector space of face images. The eigenfaces themselves form a basis set of all images used to construct the covariance matrix. This produces dimension reduction by allowing the smaller set of basis images to represent the original training images. Classification can be achieved by comparing how faces are represented by the basis set.
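The eigenface construction can be sketched with a principal-component decomposition; the "images" below are random vectors standing in for real face data:

```python
import numpy as np

# Toy data: 6 "images", each flattened to a 16-dimensional vector.
rng = np.random.default_rng(0)
faces = rng.normal(size=(6, 16))

# Center the data; the eigenvectors of the covariance matrix are the
# right singular vectors of the centered data matrix.
mean_face = faces.mean(axis=0)
centered = faces - mean_face
_, _, vt = np.linalg.svd(centered, full_matrices=False)
eigenfaces = vt                       # each row is one eigenface

# Dimension reduction: represent each face by 3 coefficients instead of 16.
coeffs = centered @ eigenfaces[:3].T            # shape (6, 3)
approx = mean_face + coeffs @ eigenfaces[:3]    # reconstruction from the basis
```

Classification then compares faces by their coefficient vectors in this reduced basis rather than by raw pixels.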
In mathematics, the magnitude or size of a mathematical object is a property which determines whether the object is larger or smaller than other objects of the same kind. More formally, an object's magnitude is the displayed result of an ordering of the class of objects to which it belongs.
In classical Euclidean geometry, a point is a primitive notion that models an exact location in space, and has no length, width, or thickness. In modern mathematics, a point is considered as an element of some set, a point set. A space is a point set with some additional structure. An isolated point has no other neighboring points in a given subset.
In mathematics, a square is the result of multiplying a number by itself. The verb "to square" is used to denote this operation. Squaring is the same as raising to the power 2, and is denoted by a superscript 2; for instance, the square of 3 may be written as 3², which is the number 9. In some cases when superscripts are not available, as for instance in programming languages or plain text files, the notations x^2 (caret) or x**2 may be used in place of x². The adjective which corresponds to squaring is quadratic.
In mathematics, a norm is a function from a real or complex vector space to the non-negative real numbers that behaves in certain ways like the distance from the origin: it commutes with scaling, obeys a form of the triangle inequality, and is zero only at the origin. In particular, the Euclidean distance in a Euclidean space is defined by a norm on the associated Euclidean vector space, called the Euclidean norm, the 2-norm, or, sometimes, the magnitude of the vector. This norm can be defined as the square root of the inner product of a vector with itself.
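For instance, the Euclidean norm computed as the square root of a vector's inner product with itself (a routine NumPy illustration):

```python
import numpy as np

v = np.array([3.0, 4.0])

# Euclidean norm: square root of the inner product of v with itself.
norm = np.sqrt(np.dot(v, v))
print(norm)                 # 5.0
print(np.linalg.norm(v))    # 5.0, the equivalent library call
```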
In mathematics, the real coordinate space or real coordinate n-space, of dimension n, denoted $\mathbb{R}^n$, is the set of the n-tuples of real numbers, that is the set of all sequences of n real numbers. Special cases are called the real line $\mathbb{R}^1$, the real coordinate plane $\mathbb{R}^2$, and the real coordinate three-dimensional space $\mathbb{R}^3$. With component-wise addition and scalar multiplication, it is a real vector space, and its elements are called coordinate vectors.
In mathematics, a manifold is a topological space that locally resembles Euclidean space near each point. More precisely, an $n$-dimensional manifold, or $n$-manifold for short, is a topological space with the property that each point has a neighborhood that is homeomorphic to an open subset of $n$-dimensional Euclidean space.
Mean shift is a non-parametric feature-space mathematical analysis technique for locating the maxima of a density function, a so-called mode-seeking algorithm. Application domains include cluster analysis in computer vision and image processing.
Similarity learning is an area of supervised machine learning in artificial intelligence. It is closely related to regression and classification, but the goal is to learn a similarity function that measures how similar or related two objects are. It has applications in ranking, in recommendation systems, visual identity tracking, face verification, and speaker verification.
A Siamese neural network is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, thus forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints but can be described more technically as a distance function for locality-sensitive hashing.
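The weight-sharing idea can be sketched minimally; the encoder, shapes, and names here are arbitrary assumptions, not a real face model:

```python
import numpy as np

def encode(x, w):
    # Both branches of the Siamese pair apply the SAME weights w.
    return np.tanh(x @ w)

def siamese_distance(x1, x2, w):
    # Compare the two comparable output vectors with a Euclidean distance.
    return float(np.linalg.norm(encode(x1, w) - encode(x2, w)))

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4))
a, b = rng.normal(size=8), rng.normal(size=8)

print(siamese_distance(a, a, w))      # 0.0: identical inputs, shared weights
print(siamese_distance(a, b, w) > 0)  # True for distinct inputs
```

Precomputing `encode(x1, w)` once and reusing it is exactly the baseline-comparison pattern described above.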
Triplet loss is a loss function for machine learning algorithms where a reference input is compared to a matching input and a non-matching input. The distance from the anchor to the positive input is minimized, and the distance from the anchor to the negative input is maximized. An early formulation equivalent to triplet loss was introduced for metric learning from relative comparisons by M. Schultz and T. Joachims in 2003.