% \vfill\eject
\section{Discussion}
We present a method to generate realistic video sequences of faces from a single photograph,
which can then be used to replace the face in a source/driver video sequence. To the best of our knowledge, our method is the first to leverage GANs to produce realistic, dynamic textures of a subject from a single target image.
\paragraph{Limitations and Future Work}
Though we are able to infer dynamic textures, the input target face is assumed to be free of extreme specular highlights and pronounced shadowing; if present, these can cause the texture extraction phase following~\cite{f2f} to produce artifacts. Because fitting the facial geometry precisely from a single viewpoint is a highly underconstrained problem, the extracted texture of the target subject may be improperly registered in extreme cases where this fitting is insufficiently accurate. Imperfect fitting can also miss transient expressions, such as blinking.
The target image must be of sufficiently high resolution to generate appropriate details for the corresponding expressions. If the target face is largely non-frontal or otherwise occluded, the captured textures will be incomplete, which causes artifacts in the synthesis. The source sequence, however, may be non-frontal, provided the angle is not so extreme that the face tracking method of~\cite{f2f} fails. Our method produces reasonable results in non-occluded regions but cannot synthesize unseen parts; the method of~\cite{saito2016} could be applied to infer the invisible face regions before the detail transfer to address this. Our compositing process assumes that both the source and target are front-facing, but additional non-frontal synthesis and retargeting results without compositing are included in the supplementary materials.
Limited appearance variation in the training corpus is also an issue. Though our data augmentation mitigates this, the generated wrinkles and deformations will not be as sharp or as strong when the target's appearance differs greatly from those in our dataset. We believe that a larger dataset with even greater appearance variation would address this. Lastly, our method synthesizes each frame independently, which in some cases results in minor temporal incoherence. This could be addressed by solving for multiple frames simultaneously, or by applying temporal smoothing as a post-processing step.
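As a rough illustration of the latter option, the following is a minimal sketch of exponential moving-average smoothing over independently synthesized frames; it is not part of our pipeline, and the frame representation and the weight \texttt{alpha} are assumptions for illustration only:
\begin{verbatim}
# Hedged sketch (not from the paper): exponential moving average
# over synthesized frames to reduce frame-to-frame flicker.
# Assumes each frame is an HxWx3 float array in [0, 1]; `alpha`
# is a hypothetical smoothing weight.
import numpy as np

def temporal_smooth(frames, alpha=0.8):
    """Return an EMA-smoothed copy of a list of frames."""
    smoothed = [np.asarray(frames[0], dtype=np.float32)]
    for frame in frames[1:]:
        frame = np.asarray(frame, dtype=np.float32)
        # Blend the current frame with the previous smoothed frame;
        # larger alpha preserves more of the current frame's detail.
        smoothed.append(alpha * frame + (1.0 - alpha) * smoothed[-1])
    return smoothed
\end{verbatim}
Such smoothing trades a small amount of per-frame sharpness for temporal stability, whereas solving for multiple frames simultaneously would avoid that trade-off at greater computational cost.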
% \vfill\eject