
We present JOG3R, a unified framework that fine-tunes a video generation model jointly with a 3D point map estimation task. JOG3R improves the 3D-consistency of the generated videos compared to the pre-trained video diffusion transformer (OpenSora in our experiments), as shown by the warped feature maps on the left and the scores on the right, computed using MEt3R (MEt3R: Measuring Multi-View Consistency in Generated Images, CVPR 2025). Lower scores indicate higher 3D-consistency across frames.
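To make the metric concrete, here is a minimal sketch of the scoring idea in Python: features from one frame are warped into another frame using the estimated 3D geometry, and the masked feature dissimilarity is averaged over mutually visible pixels. This is not the official MEt3R implementation; the function and its inputs are simplified placeholders.

```python
import torch
import torch.nn.functional as F

def met3r_style_score(feat_ref, feat_warped, mask):
    """Feature dissimilarity between a frame's features and features
    warped into it from another frame via estimated 3D geometry.
    Lower values mean the two frames agree in 3D.

    feat_ref, feat_warped: (C, H, W) feature maps (placeholders for
    features extracted and warped by the actual pipeline).
    mask: (H, W) bool map of pixels visible in both views; occluded
    or out-of-frame pixels are excluded from the average.
    """
    sim = F.cosine_similarity(feat_ref, feat_warped, dim=0)  # (H, W), in [-1, 1]
    dissim = (1.0 - sim) / 2.0                               # map to [0, 1]
    return dissim[mask].mean()

# Toy usage with random tensors standing in for real warped features.
C, H, W = 64, 32, 32
score = met3r_style_score(torch.randn(C, H, W), torch.randn(C, H, W),
                          torch.rand(H, W) > 0.2)
```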



Abstract

Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test whether intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, we find only a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. We therefore propose to train jointly for the two tasks, using both photometric generation and 3D-aware errors. Specifically, we observe that state-of-the-art video generation and camera pose estimation networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates of competitive quality while generating 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.



Text Guided Video Generation (T2V) Results


We generate 180 videos using the captions in the testing split of the RealEstate10K dataset. For each text prompt, we generate videos using 3 models: (i) pre-trained OpenSora (pre-trained OS), (ii) OpenSora fine-tuned on our dataset (fine-tuned OS), and (iii) our method JOG3R. We report FID/FVD against the real images/videos in the RealEstate10K testing split, as well as the MEt3R metric, where we use JOG3R to estimate point maps. We provide sampled generated videos along with the corresponding MEt3R error maps. Please note that MEt3R errors are computed based on visibility, discarding the yellow regions that appear at the boundaries due to camera motion.
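As a rough illustration of this protocol (not the actual evaluation code), the per-model MEt3R comparison boils down to a loop like the following, where generate and score_met3r are hypothetical stand-ins for the generation and metric code; FID/FVD are computed separately over the full sets of frames/videos.

```python
from statistics import mean

def evaluate_t2v(prompts, models, generate, score_met3r):
    """Evaluation protocol sketch: for every test prompt, generate one
    video with each model and average the per-video MEt3R score per
    model. `generate(model, prompt)` and `score_met3r(video)` are
    hypothetical callables standing in for the real pipeline."""
    scores = {name: [] for name in models}
    for prompt in prompts:
        for name, model in models.items():
            video = generate(model, prompt)
            scores[name].append(score_met3r(video))
    # Lower mean MEt3R indicates higher 3D-consistency for that model.
    return {name: mean(vals) for name, vals in scores.items()}
```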




a screened in porch with furniture and a ceiling fan
pre-trained OS
fine-tuned OS
JOG3R

a view of a living room and dining room
pre-trained OS
fine-tuned OS
JOG3R

a laundry room with a washer and dryer in it
pre-trained OS
fine-tuned OS
JOG3R

a living room with a couch, desk and chair
pre-trained OS
fine-tuned OS
JOG3R

a patio with a table, chairs and an umbrella
pre-trained OS
fine-tuned OS
JOG3R

a dining room with a table, chairs and a refrigerator
pre-trained OS
fine-tuned OS
JOG3R

a backyard with a pool and hot tub overlooking the water
pre-trained OS
fine-tuned OS
JOG3R

a kitchen with a refrigerator and a dining room table
pre-trained OS
fine-tuned OS
JOG3R

a living room filled with furniture and a television
pre-trained OS
fine-tuned OS
JOG3R

a living room filled with furniture and a dining room table
pre-trained OS
fine-tuned OS
JOG3R



Text Guided Video & Camera Generation (T2V+C) Results

Besides videos, JOG3R also generates the corresponding camera paths. Below we visualize the generated path as well as the estimated correspondences across frames. For each frame pair, we visualize only 10 correspondences to avoid clutter, as sketched below.
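The 10-correspondence cap is purely a readability choice when overlaying matches on the frames. A minimal sketch of the subsampling step, assuming hypothetical (N, 2) arrays of matched pixel coordinates, could look like this:

```python
import numpy as np

def subsample_correspondences(pts_a, pts_b, k=10, seed=0):
    """Keep only k matched point pairs per frame pair so the overlay
    stays readable. pts_a, pts_b: (N, 2) arrays of matched pixel
    coordinates in the two frames (hypothetical inputs, e.g. from
    the estimated point maps)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pts_a), size=min(k, len(pts_a)), replace=False)
    return pts_a[idx], pts_b[idx]
```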



a backyard with steps leading up to a blue house
generated video
correspondences from T2V+C
generated camera


a basketball court in the backyard of a house
generated video
correspondences from T2V+C
generated camera


a dining room table with chairs and a view of the water
generated video
correspondences from T2V+C
generated camera


a view of a kitchen and living room in a new home
generated video
correspondences from T2V+C
generated camera