This paper presents Arc2Face, an identity-conditioned face foundation model which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g., FFHQ) lack sufficient identities to reconstruct arbitrary subjects. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, only needs the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.
We use a straightforward design to condition the pretrained Stable Diffusion model on ID features. The ArcFace embedding is processed by the text encoder, using a frozen pseudo-prompt for compatibility, which projects it into the CLIP latent space for cross-attention control. Both the encoder and the UNet are optimized on a million-scale FR dataset (after upsampling), followed by additional fine-tuning on high-quality datasets, without any text annotations.
Through extensive fine-tuning, we effectively transform the text encoder into a face encoder tailored to projecting ArcFace embeddings into the CLIP latent space. The resulting model adheres exclusively to ID embeddings, disregarding its initial language guidance.
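To make the conditioning scheme concrete, the sketch below illustrates one way to inject an ID vector through a frozen pseudo-prompt using standard transformers APIs. It is a minimal sketch under our own assumptions: the placeholder token `<id>`, the prompt wording, the checkpoint name, and the projection layer are illustrative, not the released implementation.

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# SD 1.x text encoder; the checkpoint name here is illustrative.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a placeholder token for the identity; its embedding row will be
# overwritten with the projected ArcFace vector.
tokenizer.add_tokens(["<id>"])
text_encoder.resize_token_embeddings(len(tokenizer))
id_token = tokenizer.convert_tokens_to_ids("<id>")

# Assumed linear map from the 512-d ArcFace space to the CLIP token
# embedding dimension; in practice it is trained jointly with the encoder.
arcface_proj = torch.nn.Linear(512, text_encoder.config.hidden_size)

@torch.no_grad()
def id_to_prompt_embeds(arcface_embedding: torch.Tensor) -> torch.Tensor:
    """Map a (1, 512) ArcFace embedding to CLIP hidden states for the UNet."""
    # Write the projected ID vector into the placeholder's embedding row
    # (a per-sample weight write keeps the sketch short; a real pipeline
    # would inject input embeddings directly to stay batchable).
    text_encoder.get_input_embeddings().weight[id_token] = \
        arcface_proj(arcface_embedding)[0]
    ids = tokenizer("photo of a <id> person", padding="max_length",
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    # The fine-tuned encoder turns the pseudo-prompt plus ID token into the
    # conditioning sequence consumed by the UNet cross-attention layers.
    return text_encoder(input_ids=ids).last_hidden_state
```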
Given the ID embedding as input, Arc2Face can generate diverse, realistic images of any subject with state-of-the-art ID retention.
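Sampling then reduces to a standard Stable Diffusion call with the ID-derived embeddings supplied in place of text conditioning. A minimal sketch using the diffusers API, reusing the `id_to_prompt_embeds` helper above; the weights path is a placeholder:

```python
from diffusers import StableDiffusionPipeline

# Placeholder path: the fine-tuned text-encoder/UNet weights would replace
# the vanilla Stable Diffusion checkpoint here.
pipe = StableDiffusionPipeline.from_pretrained("path/to/arc2face-weights")

# arcface_embedding: a (1, 512) tensor from a face recognition network.
prompt_embeds = id_to_prompt_embeds(arcface_embedding)

# Passing prompt_embeds bypasses tokenization entirely, so generation is
# driven purely by the identity conditioning.
images = pipe(prompt_embeds=prompt_embeds,
              num_images_per_prompt=4,
              num_inference_steps=25).images
```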
We can additionally control the pose and expression using a 3DMM.
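The caption leaves the mechanism unspecified; one natural realization is a ControlNet branch fed with renderings of the fitted 3DMM. The sketch below is written under that assumption: the checkpoint path and the `render_3dmm` helper are hypothetical, and `prompt_embeds` comes from the helper above.

```python
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Hypothetical ControlNet trained on 3DMM renderings (e.g., rendered face
# meshes under the target pose/expression); the path is a placeholder.
controlnet = ControlNetModel.from_pretrained("path/to/3dmm-controlnet")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "path/to/arc2face-weights", controlnet=controlnet)

# render_3dmm is an assumed helper that rasterizes the fitted 3D face model
# with the desired pose and expression parameters into a PIL image.
cond_image = render_3dmm(pose_params, expression_params)
images = pipe(prompt_embeds=prompt_embeds, image=cond_image).images
```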
Text-augmented methods achieve impressive stylizations; however, they require detailed prompts. We evaluate their ID-conditioning ability using the generic prompt “photo of a person”.
@inproceedings{paraperas2024arc2face,
  title={Arc2Face: A Foundation Model for ID-Consistent Human Faces},
  author={Paraperas Papantoniou, Foivos and Lattas, Alexandros and Moschoglou, Stylianos and Deng, Jiankang and Kainz, Bernhard and Zafeiriou, Stefanos},
  booktitle={Proceedings of the European Conference on Computer Vision (ECCV)},
  year={2024}
}