Recovering accurate 3D human pose and shape from images is a key component of human-centric computer vision applications. Although recent developments in deep learning have led to many technological advances, estimating the 3D pose and shape of the whole body together with the hands and face is difficult, and recovering them for multiple people from a single in-the-wild image remains a great challenge.
In this talk, I will present my group's recent developments and results on this problem. First, I will introduce a new, fully learning-based 3D multi-person body pose estimation framework, in which the relative distance of each person from the camera is efficiently estimated using camera geometry and deep context features. Next, I will introduce the extension of this framework to 3D multi-person whole-body shape estimation. To estimate an accurate mesh for each body part, a 3D positional pose-guided 3D rotational pose prediction network is proposed that utilizes both joint-specific local features and global features. The proposed framework integrates the 3D poses and shapes of the body and hands with facial expressions. Experimental results demonstrate how effectively our new methods work in real scenarios.
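To make the camera-geometry idea concrete, here is a rough, hypothetical PyTorch sketch, not the method presented in the talk: the module name DepthFromBBox, the canonical-height assumption, and the correction head are all illustrative. It turns the pinhole relation between a person's assumed real-world height, their height in pixels, and the focal length into an initial depth estimate, which a small network conditioned on deep context features then corrects.

    import torch
    import torch.nn as nn

    class DepthFromBBox(nn.Module):
        # Hypothetical sketch: combine a pinhole-geometry depth prior with a
        # learned, feature-dependent correction factor.
        def __init__(self, feat_dim=256, canonical_height_m=1.7):
            super().__init__()
            self.canonical_height_m = canonical_height_m  # assumed "real" person height
            self.correction = nn.Sequential(
                nn.Linear(feat_dim, 128), nn.ReLU(),
                nn.Linear(128, 1),
            )

        def forward(self, focal_px, bbox_height_px, context_feat):
            # Pinhole prior: an object of height H at distance d spans about f * H / d
            # pixels in the image, so a first depth estimate is d0 = f * H / h.
            d0 = focal_px * self.canonical_height_m / bbox_height_px
            # Deep context features predict a multiplicative correction (e.g. for
            # children, seated people, or truncation), kept positive via exp().
            gamma = self.correction(context_feat).squeeze(-1).exp()
            return gamma * d0

    # Usage sketch: depth = DepthFromBBox()(focal_px, bbox_height_px, context_feat)
    # with focal_px, bbox_height_px of shape (B,) and context_feat of shape (B, 256).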
Deep networks excel at learning patterns from large amounts of data. On the other hand, many geometric vision tasks are specified as optimization problems. To seamlessly combine deep learning and geometric vision, it is vital to perform learning and geometric optimization end-to-end. Towards this aim, we present BPnP, a novel network module that backpropagates gradients through a Perspective-n-Point (PnP) solver to guide the parameter updates of a neural network. Based on implicit differentiation, we show that the gradients of a “self-contained” PnP solver can be derived accurately and efficiently, as if the optimizer block were a differentiable function. We validate BPnP by incorporating it into a deep model that can learn camera intrinsics, camera extrinsics (poses), and 3D structure from training datasets. Further, we develop an end-to-end trainable pipeline for object pose estimation, which achieves greater accuracy by combining feature-based heatmap losses with 2D-3D reprojection errors. Since our approach can be extended to other optimization problems, our work paves the way towards performing learnable geometric vision in a principled manner. Visit the project page (https://github.com/BoChenYS/BPnP) for our PyTorch implementation of BPnP.
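To illustrate the implicit-differentiation idea behind backpropagating through a “self-contained” solver, here is a rough PyTorch sketch; it is not the authors' BPnP implementation (see the repository above for that), and the helper names ImplicitPnP, reproj_residuals, and rodrigues are hypothetical. The forward pass calls OpenCV's solvePnP; the backward pass applies the implicit function theorem to the stationarity condition of the reprojection cost to obtain gradients with respect to the 2D keypoints.

    import numpy as np
    import torch
    import cv2

    def hat(v):
        # Skew-symmetric matrix of a 3-vector (built with torch.stack to keep the graph).
        zero = v.new_zeros(())
        return torch.stack([
            torch.stack([zero, -v[2], v[1]]),
            torch.stack([v[2], zero, -v[0]]),
            torch.stack([-v[1], v[0], zero]),
        ])

    def rodrigues(rvec):
        # Differentiable axis-angle to rotation matrix conversion.
        theta = rvec.norm().clamp_min(1e-8)
        S = hat(rvec / theta)
        I = torch.eye(3, dtype=rvec.dtype)
        return I + torch.sin(theta) * S + (1.0 - torch.cos(theta)) * (S @ S)

    def reproj_residuals(y, pts2d, pts3d, K):
        # Stacked reprojection residuals for a 6-DoF pose y = (rvec, tvec).
        R, t = rodrigues(y[:3]), y[3:]
        cam = pts3d @ R.t() + t
        proj = cam @ K.t()
        proj = proj[:, :2] / proj[:, 2:3]
        return (proj - pts2d).reshape(-1)

    class ImplicitPnP(torch.autograd.Function):
        # Hypothetical sketch: backpropagate through a non-differentiable PnP solver
        # via implicit differentiation of the reprojection cost (CPU tensors assumed).

        @staticmethod
        def forward(ctx, pts2d, pts3d, K):
            p2 = pts2d.detach().double().numpy()
            p3 = pts3d.detach().double().numpy()
            k = K.detach().double().numpy()
            _, rvec, tvec = cv2.solvePnP(p3, p2, k, None)  # "self-contained" solver
            y = torch.from_numpy(np.concatenate([rvec.ravel(), tvec.ravel()])).to(pts2d.dtype)
            ctx.save_for_backward(pts2d, pts3d, K, y)
            return y

        @staticmethod
        def backward(ctx, grad_y):
            pts2d, pts3d, K, y = ctx.saved_tensors
            p3, k = pts3d.detach(), K.detach()
            # At the optimum, the stationarity condition g(y*, x) = df/dy = 0 holds,
            # so the implicit function theorem gives dy/dx = -(dg/dy)^{-1} (dg/dx).
            y = y.detach().requires_grad_(True)
            x = pts2d.detach().requires_grad_(True)
            f = 0.5 * reproj_residuals(y, x, p3, k).pow(2).sum()
            g = torch.autograd.grad(f, y, create_graph=True)[0]  # (6,)
            H = torch.stack([torch.autograd.grad(g[i], y, retain_graph=True)[0]
                             for i in range(6)])                 # dg/dy, (6, 6)
            Jx = torch.stack([torch.autograd.grad(g[i], x, retain_graph=True)[0].reshape(-1)
                              for i in range(6)])                # dg/dx, (6, 2N)
            grad_x = -(grad_y @ torch.linalg.solve(H, Jx)).reshape(pts2d.shape)
            return grad_x, None, None  # sketch: gradients w.r.t. the 2D keypoints only

    # Usage sketch: pose = ImplicitPnP.apply(pts2d, pts3d, K)
    # with pts2d of shape (N, 2), pts3d of shape (N, 3), and K of shape (3, 3).

The same pattern extends to gradients with respect to the 3D points or the intrinsics by differentiating the stationarity condition with respect to those inputs as well.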
Imagine a futuristic version of Google Street View that could dial up any possible place in the world, at any possible time. Effectively, such a service would be a recording of the plenoptic function—the hypothetical function described by Adelson and Bergen that captures all light rays passing through space at all times. While the plenoptic function is completely impractical to capture in its totality, every photo ever taken represents a sample of this function. I will present recent methods we’ve developed to reconstruct the plenoptic function from sparse space-time samples of photos—including Street View itself, as well as tourist photos of famous landmarks. The results of this work include the ability to take a single photo and synthesize a full dawn-to-dusk timelapse video, as well as compelling 4D view synthesis capabilities where a scene can simultaneously be explored in space and time.
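For reference, the full plenoptic function of Adelson and Bergen records the intensity of light seen from every viewing position, in every viewing direction, at every wavelength and time, and can be written as

    P(\theta, \phi, \lambda, t, V_x, V_y, V_z)

where (\theta, \phi) is the viewing direction, \lambda the wavelength, t the time, and (V_x, V_y, V_z) the viewing position; dropping wavelength and time leaves the familiar 5D function of position and direction.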
Noah Snavely is an associate professor of Computer Science at Cornell University and Cornell Tech, and also a researcher at Google Research. Noah’s research interests are in computer vision and graphics, in particular 3D understanding and depiction of scenes from images. Noah is the recipient of a PECASE, a Microsoft New Faculty Fellowship, an Alfred P. Sloan Fellowship, and a SIGGRAPH Significant New Researcher Award.
In this talk, I will show several recent results of my group on learning neural implicit 3D representations, departing from the traditional paradigm of representing 3D shapes explicitly using voxels, point clouds, or meshes. Implicit representations have a small memory footprint and allow for modeling arbitrary 3D topologies at (theoretically) arbitrary resolution in a continuous function space. I will show the capabilities and limitations of these approaches in the context of reconstructing 3D geometry (Occupancy Networks), texture (Texture Fields / Surface Light Fields), and motion (Occupancy Flow). I will further demonstrate Differentiable Volumetric Rendering (DVR) for learning implicit 3D models using only 2D supervision, through implicit differentiation of the level-set constraint. Finally, I will introduce GRAF, a generative model for neural radiance fields that produces 3D-consistent, photo-realistic renderings from unstructured and unposed image collections of various objects.
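As a minimal sketch of the implicit-representation idea (not the exact Occupancy Networks architecture; the class name and layer sizes below are made up), an occupancy field can be a small MLP that maps a continuous 3D query point, conditioned on a latent shape code, to an occupancy probability; a mesh can then be extracted by evaluating the field on a dense grid and running marching cubes on the thresholded values.

    import torch
    import torch.nn as nn

    class OccupancyField(nn.Module):
        # Hypothetical minimal implicit occupancy field: maps continuous 3D query
        # points, conditioned on a latent shape code, to occupancy probabilities.
        def __init__(self, latent_dim=128, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),  # occupancy logit per query point
            )

        def forward(self, points, z):
            # points: (B, N, 3) query coordinates, z: (B, latent_dim) shape code
            z = z.unsqueeze(1).expand(-1, points.shape[1], -1)
            logits = self.net(torch.cat([points, z], dim=-1)).squeeze(-1)
            return torch.sigmoid(logits)

    # Usage sketch: occ = OccupancyField()(torch.rand(2, 1024, 3), torch.zeros(2, 128))

Because the field is a continuous function, the query resolution is chosen at inference time rather than fixed by a voxel grid, which is where the small memory footprint and (theoretically) arbitrary resolution come from.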
Andreas Geiger is a professor at the University of Tübingen and a group leader at the Max Planck Institute for Intelligent Systems. Prior to this, he was a visiting professor at ETH Zürich and a research scientist at MPI-IS. He studied at KIT, EPFL, and MIT, and received his PhD from KIT in 2013. His research interests lie at the intersection of 3D reconstruction, motion estimation, scene understanding, and sensory-motor control. He maintains the KITTI vision benchmark.