Speaker 1: We present Rendering with Style, a method that combines traditional rendering with very recent neural rendering approaches for high-quality face rendering. Researchers have spent more than a decade investigating techniques to capture and model faces, typically using multiview camera setups, from which multiview stereo and temporal tracking methods yield high-quality 3D geometry with consistent topology over time. A major problem with all of these methods is that they reconstruct only the skin pixels and largely ignore the eyes, hair, and inner mouth. In parallel to geometry reconstruction, a host of techniques has been developed for appearance capture with face relighting in mind, but again, the effort has centered on facial skin rather than the complete head. As a result, there is still a huge gap between what can easily be captured and rendered and a final photorealistic digital double, complete with hair, eyes, and inner mouth. Closing this gap usually takes a lot of manual work from skilled artists. The goal of our work is to take a step towards bridging this gap automatically.

Our method starts with a high-quality ray-traced facial skin render and automatically creates a complete digital render of the full head. Importantly, we build on top of the traditional graphics pipeline in order to leverage the many years of development in high-quality skin modeling and rendering. Our approach for completing the face renders makes use of recent advances in neural rendering. Specifically, StyleGAN2 is a generative network that can synthesize an essentially unlimited number of photorealistic face images. Our method introduces a new algorithm to project a partial face render into the latent space of StyleGAN2 and then composite the results into a final render. This is achieved through an optimization that we describe in the paper. Importantly, it is the first such optimization that considers a batch of input frames at the same time and generates plausible, photorealistic completions in which the identity of the subject, including the hair, eyes, and even the background, is consistent across the frames, while still matching the desired skin render in expression, pose, and rendered illumination.

The optimized StyleGAN2 projections closely match the target ray-traced face in the skin area, though not perfectly, as some high-frequency details simply cannot be reproduced by the network. As we said, we wish to leverage today's high-quality face rendering methods, so we blend and composite the ray-traced skin onto the neural render. Thanks to our optimization, the neural render is close enough that this compositing step remains very realistic.

We now demonstrate several different results. First, to show the overall power and versatility of the method, here are individual digital human renders created with our approach. Each one looks photorealistic and was generated with widespread traditional methods for ray-tracing skin, followed by our neural rendering approach to obtain the complete digital human. Since we build our method on top of traditional face rendering, we can leverage the main pipeline concepts, like blendshape-based facial expressions. Here you can see that our method successfully renders a full set of expressions, consistent with the given pose and illumination.
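Returning to the projection and compositing steps described above, here is a minimal PyTorch sketch of what a batched StyleGAN2 latent optimization with a masked skin loss could look like. The shared-latent-plus-offset parameterization, the plain L1 loss, and the generator interface are illustrative assumptions rather than the paper's exact formulation, which likely includes additional terms such as perceptual losses.

```python
import torch
import torch.nn.functional as F


def project_batch(generator, targets, skin_masks,
                  num_latents=18, latent_dim=512,
                  steps=1000, lr=0.05, consistency_weight=1.0):
    """Project a batch of skin-only renders into the latent space of a pretrained
    StyleGAN2 generator while keeping a single shared identity across the batch.

    generator  : module mapping W+ latents (B, num_latents, latent_dim) to images
                 (B, 3, H, W); the exact interface depends on the StyleGAN2
                 implementation and is assumed here.
    targets    : (B, 3, H, W) ray-traced skin renders, one per frame.
    skin_masks : (B, 1, H, W) masks that are 1 on skin pixels and 0 elsewhere.
    """
    batch_size = targets.shape[0]
    # A shared latent plus small per-frame offsets is one way to keep hair, eyes,
    # and background consistent across frames (an assumed parameterization).
    w_shared = torch.randn(1, num_latents, latent_dim, requires_grad=True)
    w_offsets = torch.zeros(batch_size, num_latents, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([w_shared, w_offsets], lr=lr)

    for _ in range(steps):
        w = w_shared + w_offsets                 # (B, num_latents, latent_dim)
        renders = generator(w)                   # assumed call signature
        # Match the ray-traced skin only where the mask is valid.
        skin_loss = F.l1_loss(renders * skin_masks, targets * skin_masks)
        # Penalize per-frame deviation from the shared identity latent.
        consistency_loss = consistency_weight * w_offsets.pow(2).mean()
        loss = skin_loss + consistency_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (w_shared + w_offsets).detach()


def composite(neural_render, skin_render, skin_mask):
    """Blend the ray-traced skin back over the neural render (simple alpha blend)."""
    return skin_mask * skin_render + (1.0 - skin_mask) * neural_render
```

In practice the latents would typically be initialized near the generator's mean latent, and each final frame would be obtained as `composite(generator(w), skin_render, skin_mask)`.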
Here are additional results for sequences of other subjects. For context, just one of the input skin renders is shown on the left for each person. Another aspect that is trivial to control is illumination. Here we render three images while rotating the light to the left and to the right. The final composited render perfectly matches the ray-traced faces and plausibly completes the scene with the corresponding changes in illumination. And here is another example.

We now move on to more challenging illumination conditions, generated from real-world environment maps. Here we have three different subjects, rendered in three different environment maps, and we show the consistency of the identity, including hair and background, across different expressions. We also demonstrate the ability to partially control the background in-painting by supplying a small margin of constraint pixels around the boundary of the image. The neural render does not match the background constraint perfectly, but it is only meant as a guide, since optimizing for realistic human components like the hair, eyes, and teeth is the main goal. More challenging is to maintain a consistent identity while changing the environment lighting. Here we see the same identity rendered under five drastically different lighting conditions. Despite the complexity of this problem, our method plausibly fills in the eyes, hair, and background to match the artist-directed illumination.

Another parameter that is easy to control is the camera viewpoint. Realistically in-painting consistent hair, ears, eyes, and background during camera motion is difficult, since our method relies on the StyleGAN2 network and is therefore limited in the range of viewpoints it can handle. Even so, it achieves plausible renders under these varying viewpoints. Here is a second example, which illustrates some limitations in achieving consistency in the hairstyle, eyes, and clothing. We further push the limits of the method by varying multiple scene components at the same time, for example changing the illumination or the viewpoint during a dialogue sequence. In the last row, we vary all the scene components, including expression, viewpoint, and illumination, at the same time, and we see some degradation in the in-painting results. Here we show the scenario of artistic modification of all the components individually, but in a single optimization, for one identity. Our method naturally in-paints the non-skin pixels with consistent hair, eyes, and background.

We now show results on animation sequences. Here is an example of an expression animation, where we simply blend between different expressions for two different subjects. Our method creates mostly stable, complete neural face renders given only the ray-traced skin pixels shown in the upper left corner. Here are two captured dialogue performances rendered with our method. Although some temporal instability artifacts do exist, our method generates largely consistent, photorealistic neural face renders. Here we animate the light direction for two different subjects. The neural renders are temporally smooth and closely reflect the change in scene illumination.
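The background guide mentioned above can be pictured as one extra soft loss term: constraint pixels on a thin image border are matched with a low weight, so they steer the in-painting without dominating the skin term. The mask construction, margin, and weight below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def border_mask(height, width, margin=16):
    """Mask that is 1 on a thin border of `margin` pixels and 0 in the interior."""
    mask = torch.zeros(1, 1, height, width)
    mask[..., :margin, :] = 1.0
    mask[..., -margin:, :] = 1.0
    mask[..., :, :margin] = 1.0
    mask[..., :, -margin:] = 1.0
    return mask


def guided_loss(render, skin_target, skin_mask, background_target, bg_weight=0.1):
    """Skin-matching term plus a soft background guide on the image border.

    The low bg_weight reflects that the border pixels only guide the in-painting
    rather than constrain it exactly.
    """
    b_mask = border_mask(render.shape[-2], render.shape[-1]).to(render.device)
    skin_term = F.l1_loss(render * skin_mask, skin_target * skin_mask)
    border_term = F.l1_loss(render * b_mask, background_target * b_mask)
    return skin_term + bg_weight * border_term
```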
One of the most challenging situations is animating the camera viewpoint while consistently solving for the neural in-painting. As the projections of 3D features move rather drastically across the image plane, our optimization attempts to maintain temporal continuity, but some artifacts do remain, particularly at the widest camera angles; the center of the animation is much more stable. In this example, we fix the viewpoints to non-frontal angles and then animate dialogue performances. And now we show animated versions of varying multiple scene components at the same time. For example, here we vary the expression as well as the illumination. And here we vary the expression, illumination, and viewpoint. As mentioned earlier, some artifacts begin to appear when varying too many components in complex scenes.

In addition to rendering faces for entertainment, our method has a second major benefit in the area of dataset generation for deep learning applications. Recall that every photorealistic result we generate has underlying, corresponding geometry and appearance maps, and was rendered from known camera viewpoints under known illumination. This ground-truth information can be vital for training downstream applications such as monocular 3D face reconstruction, facial recognition, or scene understanding. Every result render can therefore be considered a data sample, and we can generate many variations of many different individuals. Furthermore, even for a single person rendered with a single expression, viewpoint, and illumination, we can generate random variations of the photoreal render by varying the randomization seed during optimization. Such a technique could be very useful in training facial recognition applications. Finally, we can go even further than generating data for people who were scanned in 3D. Using a recently developed variational autoencoder trained on a large database of 3D faces, we can generate random but plausible 3D face meshes and render them with our approach. In this way we can easily create a virtually unlimited number of fully synthetic, photorealistic face renders with the underlying ground-truth data. As such, our new rendering approach is an invaluable tool for dataset generation in many fields.

In summary, we have presented Rendering with Style, an approach that combines traditional rendering with neural StyleGAN face generation for high-quality facial rendering. Our method leverages current technology for facial skin capture, modeling, and rendering, and automatically creates complete photorealistic face renders that match the desired identity, expression, and scene configuration. This approach has applications in facial rendering for film and entertainment, saving manual artist labor, and in data generation for different fields of deep learning. Thanks a lot for watching.
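As a supplement to the dataset-generation discussion above, here is a minimal sketch of how each photoreal render could be paired with the ground truth it was created from, and how varying the random seed yields multiple completions of the same configuration. `ray_trace_skin` and `complete_render` are hypothetical placeholders for the traditional skin renderer and the neural completion step; they are not functions from the paper.

```python
import torch


def make_dataset_samples(mesh, appearance_maps, cameras, lights, seeds,
                         ray_trace_skin, complete_render):
    """Pair every photoreal render with its known geometry, camera, and lighting.

    ray_trace_skin and complete_render are hypothetical callables standing in for
    the traditional skin renderer and the neural completion step.
    """
    samples = []
    for camera in cameras:
        for light in lights:
            # Traditional ray tracing yields the skin pixels and a skin mask.
            skin_render, skin_mask = ray_trace_skin(mesh, appearance_maps, camera, light)
            for seed in seeds:
                # Different seeds during the latent optimization give different
                # plausible completions of hair, eyes, and background.
                torch.manual_seed(seed)
                image = complete_render(skin_render, skin_mask)
                samples.append({
                    "image": image,               # photoreal render (training input)
                    "mesh": mesh,                 # ground-truth 3D geometry
                    "appearance": appearance_maps,
                    "camera": camera,             # known viewpoint
                    "light": light,               # known illumination
                    "seed": seed,
                })
    return samples
```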