Recently, we have seen rising interest among consumers and researchers in integrated augmented reality (AR) experiences using real-time face feature generation and editing capabilities in mobile applications, including short videos, virtual reality, and gaming. As a result, there is a growing demand for lightweight, yet high-quality face generation and editing models, which are often based on generative adversarial network (GAN) techniques. However, the majority of GAN models suffer from high computational complexity and the need for a large training dataset. In addition, it is also important to employ GAN models responsibly.
In this post, we introduce MediaPipe FaceStylizer, an efficient design for few-shot face stylization that addresses the aforementioned model complexity and data efficiency challenges while being guided by Google’s Responsible AI Principles. The model consists of a face generator and a face encoder used as GAN inversion to map the image into latent code for the generator. We introduce a mobile-friendly synthesis network for the face generator with an auxiliary head that converts features to RGB at each level of the generator to generate high-quality images from coarse to fine granularities. We also carefully designed the loss functions for the aforementioned auxiliary heads and combined them with the common GAN loss functions to distill the student generator from the teacher StyleGAN model, resulting in a lightweight model that maintains high generation quality. The proposed solution is available in open source through MediaPipe. Users can fine-tune the generator to learn a style from one or a few images using MediaPipe Model Maker, and deploy to on-device face stylization applications with the customized model using MediaPipe FaceStylizer.
Few-shot on-device face stylization
An end-to-end pipeline
Our goal is to build a pipeline that lets users adapt MediaPipe FaceStylizer to different styles by fine-tuning the model with a few examples. To enable such a face stylization pipeline, we built the pipeline with a GAN inversion encoder and an efficient face generator model (see below). The encoder and generator pipeline can then be adapted to different styles via a few-shot learning process. The user first sends one or a few similar samples of the style images to MediaPipe Model Maker to fine-tune the model. The fine-tuning process freezes the encoder module and only fine-tunes the generator. The training process samples multiple latent codes close to the encoding output of the input style images as the input to the generator. The generator is then trained to reconstruct an image of a person’s face in the style of the input style image by optimizing a joint adversarial loss function that also accounts for style and content. With such a fine-tuning process, MediaPipe FaceStylizer can adapt to the customized style, which approximates the user’s input. It can then be applied to stylize test images of real human faces.
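The latent-sampling step of the fine-tuning process can be sketched as follows. This is a minimal illustration in plain Python, not the actual implementation: the latent dimension and `sigma` are hypothetical stand-ins (the real model operates in StyleGAN’s latent space).

```python
import random

def sample_latents_near(w_style, n_samples, sigma=0.1):
    """Sample latent codes in a Gaussian neighborhood of the style image's
    encoder output; the perturbed codes are fed to the generator during
    fine-tuning so it learns the style rather than memorizing one code."""
    return [
        [w + random.gauss(0.0, sigma) for w in w_style]
        for _ in range(n_samples)
    ]

# e.g. a hypothetical 8-dimensional latent from the frozen encoder
w = [0.0] * 8
codes = sample_latents_near(w, n_samples=4)
```

The generator would then be optimized over these sampled codes with the joint adversarial, style, and content losses described above.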
The StyleGAN model family has been widely adopted for face generation and various face editing tasks. To support efficient on-device face generation, we based the design of our generator on StyleGAN. This generator, which we call BlazeStyleGAN, is similar to StyleGAN in that it also contains a mapping network and a synthesis network. However, since the synthesis network of StyleGAN is the main contributor to the model’s high computation complexity, we designed and employed a more efficient synthesis network. The improved efficiency and generation quality are achieved by:
- Reducing the latent feature dimension in the synthesis network to a quarter of the resolution of the counterpart layers in the teacher StyleGAN,
- Designing multiple auxiliary heads to transform the downscaled features to the image domain to form a coarse-to-fine image pyramid to evaluate the perceptual quality of the reconstruction, and
- Skipping all but the final auxiliary head at inference time.
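The control flow implied by the last two points can be sketched as below. This is a toy structural illustration, not the real network: blocks and to-RGB heads are arbitrary callables standing in for convolutional layers.

```python
def run_synthesis(blocks, rgb_heads, x, training=True):
    """Coarse-to-fine synthesis sketch: each block refines features at one
    resolution; an auxiliary to-RGB head per level builds an image pyramid
    used by the training losses, while inference keeps only the final head."""
    pyramid = []
    for i, (block, to_rgb) in enumerate(zip(blocks, rgb_heads)):
        x = block(x)
        if training or i == len(blocks) - 1:
            pyramid.append(to_rgb(x))
    return pyramid

# toy stand-ins: "features" are ints, blocks double them, heads negate them
blocks = [lambda v: v * 2] * 3
heads = [lambda v: -v] * 3
train_pyramid = run_synthesis(blocks, heads, 1, training=True)
infer_outputs = run_synthesis(blocks, heads, 1, training=False)
```

At training time every level contributes an image to the pyramid; at inference only the final head runs, which is what removes the auxiliary heads’ cost on device.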
With the newly designed architecture, we train the BlazeStyleGAN model by distilling it from a teacher StyleGAN model. We use a multi-scale perceptual loss and adversarial loss in the distillation to transfer the high-fidelity generation capability from the teacher model to the student BlazeStyleGAN model, and also to mitigate the artifacts from the teacher model.
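The shape of such a distillation objective can be sketched as follows. The loss-term callables here are toy stand-ins (the real losses are perceptual and discriminator networks), and the weighting is hypothetical:

```python
def distillation_loss(student_pyramid, teacher_pyramid, perceptual, adversarial,
                      adv_weight=1.0):
    """Multi-scale distillation sketch: a perceptual term between student and
    teacher outputs at each pyramid level, plus an adversarial term on the
    student's finest output."""
    perc = sum(perceptual(s, t) for s, t in zip(student_pyramid, teacher_pyramid))
    return perc + adv_weight * adversarial(student_pyramid[-1])

# toy scalar "images" and stand-in loss terms
loss = distillation_loss(
    student_pyramid=[1.0, 2.0, 3.0],
    teacher_pyramid=[1.5, 2.0, 2.0],
    perceptual=lambda s, t: abs(s - t),   # stand-in for a perceptual metric
    adversarial=lambda img: 0.25,         # stand-in for a discriminator loss
)
```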
More details of the model architecture and training scheme can be found in our paper.
|Visual comparison between face samples generated by StyleGAN and BlazeStyleGAN. The images in the first row are generated by the teacher StyleGAN. The images in the second row are generated by the student BlazeStyleGAN. The faces generated by BlazeStyleGAN have visual quality comparable to the images generated by the teacher model. Some results demonstrate that the student BlazeStyleGAN suppresses the artifacts from the teacher model in the distillation.|
In the figure above, we demonstrate some sample results of our BlazeStyleGAN. Compared with the face images generated by the teacher StyleGAN model (top row), the images generated by the student BlazeStyleGAN (bottom row) maintain high visual quality and further reduce artifacts produced by the teacher, due to the loss function design in our distillation.
An encoder for efficient GAN inversion
To support image-to-image stylization, we also introduced an efficient GAN inversion as the encoder to map input images to the latent space of the generator. The encoder is built on a MobileNet V2 backbone and trained with natural face images. The loss is defined as a combination of an image perceptual quality loss, which measures the content difference, style similarity, and embedding distance, as well as the L1 loss between the input images and the reconstructed images.
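The structure of this combined encoder loss can be sketched as below. The individual loss terms and weights here are toy stand-ins for the perceptual, style, and embedding networks; images are flattened into plain lists for illustration.

```python
def encoder_loss(x, x_rec, content, style, embedding,
                 weights=(1.0, 1.0, 1.0, 1.0)):
    """Encoder training loss sketch: perceptual terms (content difference,
    style similarity, embedding distance) plus a pixel-wise L1 term between
    the input image and the image reconstructed through the generator."""
    w_c, w_s, w_e, w_l1 = weights
    l1 = sum(abs(a - b) for a, b in zip(x, x_rec)) / len(x)
    return (w_c * content(x, x_rec) + w_s * style(x, x_rec)
            + w_e * embedding(x, x_rec) + w_l1 * l1)

# toy "images" and stand-in loss terms
val = encoder_loss(
    x=[0.0, 1.0], x_rec=[0.5, 1.0],
    content=lambda a, b: 0.1,
    style=lambda a, b: 0.2,
    embedding=lambda a, b: 0.3,
)
```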
We document model complexity in terms of parameter count and FLOPs in the following table. Compared to the teacher StyleGAN (33.2M parameters), BlazeStyleGAN (generator) significantly reduces the model complexity, with only 2.01M parameters and 1.28G FLOPs for output resolution 256×256. Compared to StyleGAN-1024 (generating images of size 1024×1024), BlazeStyleGAN-1024 reduces both model size and computation complexity by 95% with no notable quality difference, and can even suppress the artifacts from the teacher StyleGAN model.
| Model | Image Size | #Params (M) | FLOPs (G) |
| --- | --- | --- | --- |
|Model complexity measured by parameter count and FLOPs.|
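As a quick sanity check on the 256×256 figures quoted above (this only uses the parameter counts given in the text; the teacher’s FLOPs and the 1024-resolution figures are not listed here):

```python
teacher_params_m = 33.2   # teacher StyleGAN, millions of parameters (from the text)
student_params_m = 2.01   # BlazeStyleGAN at 256x256 output (from the text)
reduction = 1.0 - student_params_m / teacher_params_m
print(f"parameter reduction at 256x256: {reduction:.0%}")
```

This gives roughly a 94% parameter reduction at 256×256; the 95% figure in the text refers to the 1024-resolution comparison.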
We benchmarked the inference time of MediaPipe FaceStylizer on various high-end mobile devices and report the results in the table below. Both BlazeStyleGAN-256 and BlazeStyleGAN-512 achieved real-time performance on all GPU devices, running in less than 10 ms on a high-end phone’s GPU. BlazeStyleGAN-256 can also achieve real-time performance on the iOS devices’ CPU.
| Device | BlazeStyleGAN-256 (ms) | Encoder-256 (ms) |
| --- | --- | --- |
| iPhone 13 Pro | 7.22 | 5.41 |
| Samsung Galaxy S10 | 17.01 | 12.70 |
| Samsung Galaxy S20 | 8.95 | 8.20 |
|Latency benchmark of the BlazeStyleGAN, face encoder, and the end-to-end pipeline on various mobile devices.|
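Latency numbers like the ones above come down to averaging wall-clock time over repeated inference calls after a warm-up. A generic harness in that spirit is sketched below; `mean_latency_ms` and the toy workload are illustrative, not the harness used to produce the table.

```python
import time

def mean_latency_ms(fn, arg, warmup=3, runs=20):
    """Average wall-clock latency (in ms) of a single-argument callable,
    e.g. a stylizer's inference function. Warm-up runs are excluded so
    one-time initialization cost doesn't skew the average."""
    for _ in range(warmup):
        fn(arg)
    start = time.perf_counter()
    for _ in range(runs):
        fn(arg)
    return (time.perf_counter() - start) / runs * 1000.0

# toy workload standing in for model inference
latency = mean_latency_ms(lambda x: sum(range(1000)), None)
```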
The model was trained with a high-diversity dataset of human faces and is expected to perform fairly across different human faces. Our fairness evaluation demonstrates that the model performs well and in a balanced way with regard to gender, skin tone, and age.
Face stylization visualization
Some face stylization results are demonstrated in the following figure. The images in the top row (in orange boxes) represent the style images used to fine-tune the model. The images in the left column (in green boxes) are the natural face images used for testing. The 2×4 matrix of images represents the output of MediaPipe FaceStylizer, blending the natural faces in the left-most column with the corresponding face styles in the top row. The results demonstrate that our solution can achieve high-quality face stylization for several popular styles.
|Sample results of our MediaPipe FaceStylizer.|
MediaPipe FaceStylizer will be released to public users in MediaPipe Solutions. Users can leverage MediaPipe Model Maker to train a customized face stylization model using their own style images. After training, the exported bundle of TFLite model files can be deployed to applications across platforms (Android, iOS, Web, Python, etc.) using the MediaPipe Tasks FaceStylizer API in just a few lines of code.
This work is made possible through a collaboration spanning several teams across Google. We’d like to acknowledge contributions from Omer Tov, Yang Zhao, Andrey Vakunov, Fei Deng, Ariel Ephrat, Inbar Mosseri, Lu Wang, Chuo-Ling Chang, Tingbo Hou, and Matthias Grundmann.