Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild

Jason Y. Zhang* 1
Sam Pepose* 2
Hanbyul Joo 2
Deva Ramanan 1,3
Jitendra Malik 2,4
Angjoo Kanazawa 4

1 Carnegie Mellon University
2 Facebook AI Research
3 Argo AI
4 University of California, Berkeley

We present PHOSA: Perceiving Human-Object Spatial Arrangements. From a single RGB image, we recover plausible spatial arrangements of humans and objects by reasoning about their intrinsic scale and human-object interaction. Here, we demonstrate our approach on images captured in unconstrained outdoor environments across a wide range of object categories.


We present a method that infers spatial arrangements and shapes of humans and objects in a globally consistent 3D scene, all from a single image in-the-wild captured in an uncontrolled environment. Notably, our method runs on datasets without any scene- or object-level 3D supervision. Our key insight is that considering humans and objects jointly gives rise to "3D common sense" constraints that can be used to resolve ambiguity. In particular, we introduce a scale loss that learns the distribution of object size from data; an occlusion-aware silhouette re-projection loss to optimize object pose; and a human-object interaction loss to capture the spatial layout of objects with which humans interact. We empirically validate that our constraints dramatically reduce the space of likely 3D spatial configurations. We demonstrate our approach on challenging, in-the-wild images of humans interacting with large objects (such as bicycles, motorcycles, and surfboards) and handheld objects (such as laptops, tennis rackets, and skateboards). We quantify the ability of our approach to recover human-object arrangements and outline remaining challenges in this relatively unexplored domain.
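To make the three constraints named above more concrete, here is a minimal NumPy sketch of what such terms could look like. It is an illustration under stated assumptions, not the paper's implementation: the function names, the Gaussian form of the scale prior, the centroid-distance form of the interaction term, and the equal default weights are all hypothetical choices made for this example.

```python
import numpy as np


def occlusion_aware_silhouette_loss(rendered, target_mask, occluder_mask):
    """Mean squared silhouette error over pixels not covered by other instances.

    Assumed formulation: pixels hidden by an occluder are excluded, so the
    object is not penalized for extending behind something in front of it.
    """
    valid = ~np.asarray(occluder_mask, dtype=bool)
    diff = (np.asarray(rendered, dtype=float) - np.asarray(target_mask, dtype=float)) ** 2
    n_valid = valid.sum()
    if n_valid == 0:
        return 0.0
    return float((diff * valid).sum() / n_valid)


def scale_prior_loss(scale, mean_scale, std_scale):
    """Squared z-score of a scale under a per-category Gaussian prior (assumption)."""
    return float(((scale - mean_scale) / std_scale) ** 2)


def interaction_loss(human_center, object_center):
    """Squared distance between 3D centroids of an interacting pair (assumption)."""
    d = np.asarray(human_center, dtype=float) - np.asarray(object_center, dtype=float)
    return float(d @ d)


def total_loss(sil, scale, inter, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms; the weights are placeholder values."""
    w_sil, w_scale, w_inter = weights
    return w_sil * sil + w_scale * scale + w_inter * inter


if __name__ == "__main__":
    # A 2x2 rendered object whose right column is hidden behind an occluder.
    rendered = np.zeros((4, 4))
    rendered[1:3, 1:3] = 1.0
    target = np.zeros((4, 4))
    target[1:3, 1:2] = 1.0  # only the visible left column appears in the mask
    occluder = np.zeros((4, 4), dtype=bool)
    occluder[1:3, 2:3] = True

    masked = occlusion_aware_silhouette_loss(rendered, target, occluder)
    unmasked = occlusion_aware_silhouette_loss(rendered, target, np.zeros((4, 4), dtype=bool))
    print(masked, unmasked)
```

In this toy case the masked loss is zero because the only disagreement lies behind the occluder, while the unmasked variant penalizes the hidden pixels. That is the intuition behind making a silhouette term occlusion-aware.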



Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild

Jason Y. Zhang*, Sam Pepose*, Hanbyul Joo, Deva Ramanan, Jitendra Malik, and Angjoo Kanazawa
    title = {Perceiving 3D Human-Object Spatial Arrangements from a Single Image in the Wild},
    author = {Zhang, Jason Y. and Pepose, Sam and Joo, Hanbyul and Ramanan, Deva and Malik, Jitendra and Kanazawa, Angjoo},
    booktitle = {European Conference on Computer Vision (ECCV)},
    year = {2020},

Narrated Videos

Short Video (1 minute)

Full Video (10 minutes)


Model overview figure


We thank Georgia Gkioxari and Shubham Tulsiani for insightful discussion and Victoria Dean and Gengshan Yang for useful feedback. We also thank Senthil Purushwalkam for deadline reminders. This work was funded in part by the CMU Argo AI Center for Autonomous Vehicle Research.