We present JOG3R, a unified framework that fine-tunes a video generation model jointly with a 3D point map estimation task. JOG3R improves the 3D-consistency of the generated videos compared to the pre-trained video diffusion transformer (OpenSora in our experiments), as shown by the warped feature maps on the left and the scores on the right computed with MEt3R (MEt3R: Measuring Multi-View Consistency in Generated Images, CVPR 2025). Lower scores indicate higher 3D-consistency across frames.
Abstract
Emergent capabilities of image generators have led to many impactful zero- or few-shot applications. Inspired by this success, we investigate whether video generators similarly exhibit 3D-awareness. Using structure-from-motion as a 3D-aware task, we test whether intermediate features of a video generator (OpenSora in our case) can support camera pose estimation. Surprisingly, we find only a weak correlation between the two tasks. Deeper investigation reveals that although the video generator produces plausible video frames, the frames themselves are not truly 3D-consistent. We therefore propose to train jointly for the two tasks, using photometric generation and 3D-aware errors. Specifically, we find that SoTA video generation and camera pose estimation networks share common structures, and propose an architecture that unifies the two. The proposed unified model, named JOG3R, produces camera pose estimates of competitive quality while generating 3D-consistent videos. In summary, we propose the first unified video generator that is 3D-consistent, generates realistic video frames, and can potentially be repurposed for other 3D-aware tasks.
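To make the joint objective concrete, the sketch below combines a standard diffusion denoising (photometric) loss with a point-map regression loss on the 3D branch. It is a minimal illustration only: the module interface, helper names, and loss weight (`add_noise`, `lambda_3d`) are assumptions for this sketch, not the released implementation.

```python
# Minimal sketch of a joint photometric + 3D-aware training objective,
# assuming a shared backbone that returns both the denoising prediction
# and per-frame 3D point maps. Names (model.add_noise, lambda_3d) are
# hypothetical and not taken from the released code.
import torch
import torch.nn.functional as F

def joint_loss(model, x0_latents, text_emb, gt_pointmaps, lambda_3d=0.5):
    # Sample a diffusion timestep and noise the clean video latents.
    t = torch.randint(0, 1000, (x0_latents.shape[0],), device=x0_latents.device)
    noise = torch.randn_like(x0_latents)
    xt = model.add_noise(x0_latents, noise, t)  # forward diffusion step

    # One forward pass yields both heads: noise prediction and point maps.
    pred_noise, pred_pointmaps = model(xt, t, text_emb)

    # Photometric (generation) term: standard denoising objective.
    loss_gen = F.mse_loss(pred_noise, noise)

    # 3D-aware term: regress per-frame point maps against pseudo ground truth.
    loss_3d = F.l1_loss(pred_pointmaps, gt_pointmaps)

    return loss_gen + lambda_3d * loss_3d
```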
Text Guided Video Generation (T2V) Results
We generate 180 videos using the captions in the testing split of the RealEstate10K dataset. For each text prompt, we generate videos using three models: (i) the pre-trained OpenSora (pre-trained OS), (ii) OpenSora fine-tuned on our dataset (fine-tuned OS), and (iii) our method, JOG3R. We report FID/FVD against the real images/videos in the RealEstate10K testing split, as well as the MEt3R metric, where we use JOG3R to estimate point maps. We provide sampled generated videos along with the corresponding MEt3R error maps. Note that MEt3R errors are computed based on visibility, discarding the yellow regions that appear at the frame boundary due to camera motion.
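As a rough illustration of the frame-level part of this evaluation, the snippet below computes FID between real and generated frames with torchmetrics. The directory paths and the frame-loading helper are placeholders; FVD and MEt3R are computed with their respective implementations and are not reproduced here.

```python
# Sketch of the FID part of the evaluation using torchmetrics.
# Paths and the frame-loading helper are placeholders.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def load_frames(frame_dir):
    """Hypothetical helper: returns a uint8 tensor of shape (N, 3, H, W)."""
    raise NotImplementedError

fid = FrechetInceptionDistance(feature=2048)

real_frames = load_frames("realestate10k_test_frames/")  # placeholder path
fake_frames = load_frames("generated_frames/")           # placeholder path

fid.update(real_frames, real=True)
fid.update(fake_frames, real=False)
print("FID:", fid.compute().item())
```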
[Video gallery: ten example prompts, each shown as a three-column comparison of pre-trained OS | fine-tuned OS | JOG3R, with the corresponding MEt3R error maps.]
Text Guided Video & Camera Generation (T2V+C) Results
Besides videos, JOG3R also generates the corresponding camera paths. Below, we visualize the generated camera path as well as the estimated correspondences across frames. For each frame pair, we visualize only 10 correspondences to avoid clutter.
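This sparse visualization can be reproduced with a few lines of OpenCV. The sketch below assumes matched keypoints are already available as pixel coordinates; the `matches` array and the input frames are placeholders, not part of the released code.

```python
# Sketch: draw a small random subset of correspondences between two frames.
# `frame_a`, `frame_b` are H x W x 3 uint8 images and `matches` is an (N, 4)
# array of (x_a, y_a, x_b, y_b) pixel matches; both are placeholders here.
import numpy as np
import cv2

def draw_sparse_matches(frame_a, frame_b, matches, num=10, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(matches), size=min(num, len(matches)), replace=False)

    # Stack the two frames side by side and draw a line per sampled match.
    canvas = np.concatenate([frame_a, frame_b], axis=1)
    offset = frame_a.shape[1]
    for xa, ya, xb, yb in matches[idx]:
        color = tuple(int(c) for c in rng.integers(0, 255, size=3))
        cv2.line(canvas, (int(xa), int(ya)), (int(xb) + offset, int(yb)), color, 2)
        cv2.circle(canvas, (int(xa), int(ya)), 4, color, -1)
        cv2.circle(canvas, (int(xb) + offset, int(yb)), 4, color, -1)
    return canvas
```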
[Gallery: four examples, each shown as generated video | correspondences from T2V+C | generated camera path.]