NeRS Paper Figures — Video Versions




Figure 1

Panels, top: Nearest Neigh. Input | Initial Car Mesh | Output Radiance | Predicted Texture | Illum. of Mean Texture
Panels, bottom (per object): NN Input | Output | Interpolation
3D view synthesis in the wild. Top: From 8 multi-view internet images of a truck and a coarse initial mesh, we recover the camera poses, 3D shape, texture, and illumination. Bottom: We demonstrate the scalability of our approach on a wide variety of indoor and outdoor objects, such as a microwave, motorcycle, mayonnaise bottle, rice cooker, and computer monitor.


Figure 5

Panels: Nearest Neighbor Input | Initial Mesh | NeRS (Ours) | Shape Interpolation
Qualitative results on various household objects. We demonstrate the versatility of our approach on an espresso machine, a bottle of ketchup, a game controller, and a fire hydrant. Each instance has 7-10 input views. We find that a coarse, cuboid mesh is sufficient as an initialization to learn detailed shape and texture. Here, we visualize 360-degree viewpoints of the initial mesh and output radiance, as well as the interpolation between the initial and predicted shapes.
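The shape interpolation shown here is straightforward to reproduce once the initial and predicted meshes share a vertex ordering, which holds when the output shape is a deformation of the coarse template. Below is a minimal sketch of that blending; the function name and step count are illustrative, not part of the released code.

```python
import numpy as np

def interpolate_shapes(verts_init, verts_pred, num_steps=8):
    """Linearly blend initial and predicted vertex positions.

    Assumes both meshes share the same vertex ordering and face list,
    which holds when the predicted shape is a deformation of the template.
    """
    verts_init = np.asarray(verts_init, dtype=np.float32)
    verts_pred = np.asarray(verts_pred, dtype=np.float32)
    for t in np.linspace(0.0, 1.0, num_steps):
        # t = 0 gives the coarse initial mesh, t = 1 the learned shape.
        yield (1.0 - t) * verts_init + t * verts_pred
```

Rendering each intermediate vertex set with the fixed face list produces the interpolation frames in the video.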


Figure 6

Panels, row 1: NN Training View | NeRS | IDR | NeRF* | MetaNeRF | MetaNeRF-ft
Panels, row 2: NN Training View | NeRS | IDR | NeRF* | NeRS w/ NeRF-style View-dep.
Qualitative comparison with fixed cameras. We evaluate all baselines on the task of novel view synthesis on Multi-view Marketplace Cars, trained and tested with fixed, pseudo-ground-truth cameras. Since we do not have ground truth cameras, we manually tune the cameras optimized by NeRS over all images and treat these as pseudo-ground truth. We compare against IDR, which extracts a surface from an SDF representation but struggles to produce a view-consistent output given limited input views. We train a modified version of NeRF that is more competitive with sparse views (NeRF*). We also evaluate a meta-learned initialization of NeRF, with and without fine-tuning until convergence, but find poor results, perhaps due to the domain shift from ShapeNet cars. Finally, we visualize an ablation of NeRS that directly conditions the radiance on position and viewing direction, similar to NeRF (NeRS w/ NeRF-style View-dep.). We find that this ablation exhibits similar ghosting behavior to NeRF, suggesting that the texture-illumination factorization serves as a useful regularizer. The red truck has 16 total views, while the blue SUV has 8 total views. Note that these results are trained with all images from the instance rather than all-but-one as in the main paper.
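To make the ablation concrete, the sketch below contrasts the two radiance parameterizations discussed above: directly conditioning color on the surface point and viewing direction (NeRF-style), versus a view-independent texture modulated by an illumination term. The layer sizes, inputs, and the multiplicative combination are illustrative assumptions, not the exact NeRS architecture.

```python
import torch
import torch.nn as nn

class NeRFStyleRadiance(nn.Module):
    """Ablation: condition radiance directly on surface point and view direction."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, points, view_dirs):
        return self.mlp(torch.cat([points, view_dirs], dim=-1))

class FactorizedRadiance(nn.Module):
    """Texture-illumination factorization: a view-independent texture is
    modulated by a shading term that depends on normal and view direction."""
    def __init__(self, hidden=128):
        super().__init__()
        self.texture = nn.Sequential(          # albedo from surface position
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )
        self.illumination = nn.Sequential(     # scalar shading from normal + view dir
            nn.Linear(6, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Softplus(),
        )

    def forward(self, points, normals, view_dirs):
        albedo = self.texture(points)
        shade = self.illumination(torch.cat([normals, view_dirs], dim=-1))
        return albedo * shade
```

Because the factorized model cannot paint arbitrary view-dependent colors, it constrains the fit in the way described above, whereas the NeRF-style variant is free to overfit each training view.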


Figure 7

Panels: NN Training View | NeRS | IDR | NeRF* | NeRS w/ NeRF-style View-dep.
Qualitative comparison with trainable, approximate cameras. We evaluate all baselines on the task of in-the-wild novel view synthesis. Since ground truth cameras may not be recoverable in general for in-the-wild scenes, we recover approximate cameras using an off-the-shelf approach and allow each method to refine the camera poses over the course of optimization. We find that NeRS generalizes better than the baselines in this unconstrained but more realistic setup. The red truck has 16 total views while the blue SUV has 8 total views. Note that these results are trained with all images from the instance rather than all-but-one as in the main paper.
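The trainable-camera setup amounts to treating small pose offsets around the off-the-shelf estimates as learnable parameters and letting the reconstruction loss refine them. A minimal sketch under that assumption is below; the axis-angle parameterization and class name are illustrative, not the specific camera model used by any of the compared methods.

```python
import torch
import torch.nn as nn

def axis_angle_to_matrix(w):
    """Rodrigues' formula for a batch of axis-angle vectors of shape (N, 3)."""
    theta = w.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    k = w / theta
    K = torch.zeros(w.shape[0], 3, 3, device=w.device)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    eye = torch.eye(3, device=w.device).expand_as(K)
    s, c = torch.sin(theta)[..., None], torch.cos(theta)[..., None]
    return eye + s * K + (1 - c) * (K @ K)

class RefinableCameras(nn.Module):
    """Learnable rotation/translation offsets around approximate initial cameras."""
    def __init__(self, R_init, t_init):
        super().__init__()
        self.register_buffer("R_init", R_init)   # (N, 3, 3) initial rotations
        self.register_buffer("t_init", t_init)   # (N, 3) initial translations
        self.delta_rot = nn.Parameter(torch.zeros(R_init.shape[0], 3))
        self.delta_trans = nn.Parameter(torch.zeros_like(t_init))

    def forward(self):
        R = axis_angle_to_matrix(self.delta_rot) @ self.R_init
        t = self.t_init + self.delta_trans
        return R, t
```

During optimization these offsets receive gradients through the differentiable renderer alongside the shape and appearance networks, so the approximate cameras are refined over the course of training.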


Figure 8

Panels (per listing): NN Training View | NeRS | Illumin. of Mean Texture
Qualitative results on our in-the-wild multi-view Marketplace Cars dataset. Here we visualize the NeRS outputs for 3 listings from MVMC.

Figure 10

Panels: Nearest Neighbor Training View | NeRS no View-Dep. | NeRS (Ours) | Illumination of Mean Texture | Environment Map
Comparison with NeRS trained without view dependence. Here we compare the full NeRS with a NeRS trained without any view dependence, rendering using only the texture predictor and not the neural environment map. We find that NeRS trained without view dependence cannot capture lighting effects when they are inconsistent across images. We also visualize the environment maps and the illumination of the mean texture. The environment maps show that the light comes primarily from one side for the first car, uniformly from all directions for the second car, and strongly from the front left for the third car.
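The "Illumination of Mean Texture" panels isolate the lighting component by holding the texture fixed. A plausible recipe, assuming the rendering factors into a per-point albedo times a shading term, is to replace the albedo with its average color and re-render; the helper below is that assumption, not the paper's exact visualization code.

```python
import torch

def illumination_of_mean_texture(albedo, shading):
    """Render the predicted shading on top of a single flat color.

    albedo:  (N, 3) per-sample colors from the texture predictor
    shading: (N, 1) or (N, 3) per-sample illumination from the environment map
    Replacing the texture with its mean leaves only the lighting visible.
    """
    mean_albedo = albedo.mean(dim=0, keepdim=True)
    return mean_albedo * shading
```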


Figure 11

Panels: Nearest Neighbor Training View | Mask Carving using Initial Cameras | Mask Carving using Pre-trained Cameras | Mask Carving using Optimized Cameras | Learned Shape Model | NeRS (Ours)
Shapes from Silhouettes using Volume Carving. We compare shapes carved from the silhouettes of the training views with the shape model learned by our approach. We construct a voxel grid of size 128x128x128 and keep only the voxels that lie inside the masks when projected using the off-the-shelf cameras ("Initial Cameras"), the pre-trained cameras from Stage 1 ("Pre-trained Cameras"), and the final cameras after Stage 4 ("Optimized Cameras"). We compare this with the shape model output by the shape predictor. We show the nearest neighbor training view and the final NeRS rendering for reference. While volume carving can appear reasonable given sufficiently accurate cameras, we find that the shape model learned by NeRS is qualitatively a better reconstruction. In particular, the model learns to correctly recover the right side of the pickup truck and reconstructs the side-view mirrors from texture cues, suggesting that a joint optimization of shape and appearance is useful. Also, we note that the more "accurate" optimized cameras are themselves outputs of NeRS.
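The carving procedure can be written down compactly: project every voxel center into each view and keep only those that land inside every silhouette. The sketch below assumes simple 3x4 projection matrices and boolean masks; it illustrates the idea rather than the exact implementation used for the figure.

```python
import numpy as np

def carve_voxels(masks, projections, grid_min, grid_max, resolution=128):
    """Silhouette-based volume carving.

    masks:       list of (H, W) boolean silhouettes, one per training view
    projections: list of 3x4 world-to-pixel projection matrices (an assumed
                 stand-in for whatever camera model is actually used)
    grid_min/grid_max: (3,) corners of the axis-aligned box to carve
    """
    # Centers of a resolution^3 voxel grid, in homogeneous coordinates.
    axes = [np.linspace(grid_min[i], grid_max[i], resolution) for i in range(3)]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    points = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).reshape(-1, 4)

    occupied = np.ones(points.shape[0], dtype=bool)
    for mask, P in zip(masks, projections):
        pix = points @ P.T                     # (N, 3) homogeneous pixel coords
        depth = pix[:, 2]
        valid = depth > 1e-8                   # ignore points behind the camera
        u = pix[:, 0] / np.where(valid, depth, 1.0)
        v = pix[:, 1] / np.where(valid, depth, 1.0)
        h, w = mask.shape
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        ui = np.clip(u, 0, w - 1).astype(int)
        vi = np.clip(v, 0, h - 1).astype(int)
        # A voxel survives only if it lands inside this view's silhouette.
        occupied &= inside & mask[vi, ui]

    return occupied.reshape(resolution, resolution, resolution)
```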