Arc2Face: A Foundation Model of Human Faces

¹Imperial College London, UK
²FAU Erlangen-Nürnberg, Germany

TL;DR: We introduce a large dataset of high-resolution facial images with consistent ID and intra-class variability, and an ID-conditioned face model trained on it, which:
   🔥 generates high-quality images of any subject, given only their ArcFace embedding, within a few seconds
   🔥 offers superior ID similarity compared to existing text-based models
   🔥 is built on top of Stable Diffusion and can be extended to different input modalities, e.g. pose/expression

Abstract

This paper presents Arc2Face, an identity-conditioned face foundation model, which, given the ArcFace embedding of a person, can generate diverse photo-realistic images with a higher degree of face similarity than existing models. Despite previous attempts to decode face recognition features into detailed images, we find that common high-resolution datasets (e.g. FFHQ) lack sufficient identities to reconstruct any subject. To that end, we meticulously upsample a significant portion of the WebFace42M database, the largest public dataset for face recognition (FR). Arc2Face builds upon a pretrained Stable Diffusion model, yet adapts it to the task of ID-to-face generation, conditioned solely on ID vectors. Deviating from recent works that combine ID with text embeddings for zero-shot personalization of text-to-image models, we emphasize the compactness of FR features, which can fully capture the essence of the human face, as opposed to hand-crafted prompts. Crucially, text-augmented models struggle to decouple identity and text, usually necessitating some description of the given face to achieve satisfactory similarity. Arc2Face, however, needs only the discriminative features of ArcFace to guide the generation, offering a robust prior for a plethora of tasks where ID consistency is of paramount importance. As an example, we train an FR model on synthetic images from our model and achieve superior performance to existing synthetic datasets.

Overview

We use a straightforward design to condition the pre-trained Stable Diffusion model on ID features. The ArcFace embedding is wrapped in a frozen pseudo-prompt and passed through the text encoder, which projects it into the CLIP latent space used for cross-attention control. Both the encoder and the UNet are optimized on a million-scale FR dataset (after upsampling), followed by additional fine-tuning on high-quality datasets, without any text annotations.

(Figure: Arc2Face method overview)

Through extensive fine-tuning, we effectively transform the text encoder into a face encoder tailored to projecting ArcFace embeddings into the CLIP latent space. The resulting model is guided exclusively by ID embeddings, disregarding its original language conditioning.
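To make the mechanism concrete, below is a minimal sketch of pseudo-prompt conditioning using the standard CLIP tokenizer and text encoder from Hugging Face transformers. The placeholder token, the prompt template, and the randomly initialized 512-to-768 projection are illustrative stand-ins for the fine-tuned components of the released model, not the actual implementation.

# Minimal sketch: inject an ArcFace vector into a frozen pseudo-prompt and encode
# it into the CLIP latent space (illustrative, not the released implementation).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a placeholder token that will carry the identity.
tokenizer.add_tokens("<id>")
text_encoder.resize_token_embeddings(len(tokenizer))

# A 512-d ArcFace embedding (random stand-in here), mapped to the CLIP token
# dimension by a projection that is learned during Arc2Face training.
arcface = torch.nn.functional.normalize(torch.randn(1, 512), dim=-1)
proj = torch.nn.Linear(512, text_encoder.config.hidden_size)

with torch.no_grad():
    idx = tokenizer.convert_tokens_to_ids("<id>")
    text_encoder.get_input_embeddings().weight[idx] = proj(arcface)[0]
    ids = tokenizer("photo of a <id> person", padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length, return_tensors="pt").input_ids
    cond = text_encoder(ids).last_hidden_state   # (1, 77, 768)

# `cond` is the sequence that the UNet cross-attention layers attend to; in Arc2Face
# both the encoder and the UNet are fine-tuned so that it carries identity only,
# with no residual language semantics.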

ID-consistent generation

Given an ID embedding as input, Arc2Face can generate diverse, realistic images of any subject with state-of-the-art ID retention.
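Below is a sketch of how generation from a single ID embedding might look with diffusers and insightface. The Stable Diffusion 1.5 identifier, the placeholder token, and the randomly initialized projection are assumptions for illustration; in the released model, the fine-tuned face encoder and UNet replace the corresponding components and the projection is learned.

# Illustrative generation sketch (checkpoint identifiers and the projection are
# placeholders for the released Arc2Face components).
import numpy as np
import torch
from PIL import Image
from insightface.app import FaceAnalysis
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Extract the ArcFace embedding of the reference subject.
app = FaceAnalysis(providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))
img = np.array(Image.open("reference.jpg").convert("RGB"))[:, :, ::-1]   # RGB -> BGR
face = max(app.get(img), key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
id_emb = torch.tensor(face.normed_embedding, device=device)[None]        # (1, 512), L2-normalized

# 2) Load Stable Diffusion; in Arc2Face the text encoder and UNet are replaced
#    by the fine-tuned face encoder / UNet from the release.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", safety_checker=None).to(device)

# 3) Project the ID embedding through the encoder via the frozen pseudo-prompt
#    (as in the Overview sketch) and generate without any text prompt.
def encode_id(pipe, id_emb):
    tok, enc = pipe.tokenizer, pipe.text_encoder
    if "<id>" not in tok.get_vocab():
        tok.add_tokens("<id>")
        enc.resize_token_embeddings(len(tok))
    proj = torch.nn.Linear(id_emb.shape[-1], enc.config.hidden_size).to(device)  # learned in the real model
    with torch.no_grad():
        enc.get_input_embeddings().weight[tok.convert_tokens_to_ids("<id>")] = proj(id_emb)[0]
        ids = tok("photo of a <id> person", padding="max_length", truncation=True,
                  max_length=tok.model_max_length, return_tensors="pt").input_ids.to(device)
        return enc(ids)[0]                                                # (1, 77, 768)

images = pipe(prompt_embeds=encode_id(pipe, id_emb), num_inference_steps=25,
              guidance_scale=3.0, num_images_per_prompt=4).images
images[0].save("sample.png")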

Arc2Face + ControlNet

We can additionally control pose and expression with a ControlNet conditioned on renderings of a 3DMM.
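The sketch below illustrates how such a ControlNet could be attached with diffusers. The depth ControlNet checkpoint is used purely as a stand-in (Arc2Face trains its own ControlNet on 3DMM renderings), the conditioning image file is illustrative, and the ID conditioning tensor stands in for the projected ArcFace embedding from the previous sketch.

# Sketch: spatial control via ControlNet; checkpoints and inputs are stand-ins.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# A public depth ControlNet used only as a stand-in; Arc2Face trains its own
# ControlNet on renderings of the fitted 3D face model (pose/expression).
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, safety_checker=None
).to(device)

# The conditioning image encodes the target pose/expression (e.g. a shaded or
# depth rendering of the 3DMM); identity is still supplied via prompt_embeds.
cond_image = Image.open("3dmm_render.png").convert("RGB")        # illustrative file
prompt_embeds = torch.randn(1, 77, 768, device=device)           # stand-in for the projected ID embedding
images = pipe(prompt_embeds=prompt_embeds, image=cond_image,
              num_inference_steps=25, guidance_scale=3.0).images
images[0].save("controlled_sample.png")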

Comparison with text-based methods

Text-augmented methods achieve impressive stylizations; however, they require detailed prompts. We evaluate their ID-conditioning ability using the abstract prompt “photo of a person”.
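For reference, ID retention between a generated image and its reference can be measured as the cosine of their ArcFace embeddings. The snippet below is a minimal sketch using insightface; the file names are illustrative.

# Sketch: ID similarity as the cosine of ArcFace embeddings (file names illustrative).
import numpy as np
from PIL import Image
from insightface.app import FaceAnalysis

app = FaceAnalysis(providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

def arcface_embedding(path):
    img = np.array(Image.open(path).convert("RGB"))[:, :, ::-1]   # RGB -> BGR
    faces = app.get(img)
    assert faces, f"no face detected in {path}"
    # keep the largest detected face
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    return face.normed_embedding                                   # already L2-normalized

ref = arcface_embedding("reference.jpg")
sims = [float(np.dot(ref, arcface_embedding(f"generated_{i}.png"))) for i in range(4)]
print("mean ID similarity:", np.mean(sims))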

BibTeX

@misc{paraperas2024arc2face,
      title={Arc2Face: A Foundation Model of Human Faces}, 
      author={Foivos Paraperas Papantoniou and Alexandros Lattas and Stylianos Moschoglou and Jiankang Deng and Bernhard Kainz and Stefanos Zafeiriou},
      year={2024},
      eprint={2403.11641},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}