JOG3R: On Unifying Video Generation
and Camera Pose Estimation

Chun-Hao Paul Huang1, Jae Shin Yoon1, Hyeonho Jeong1,2, Niloy Mitra1,3, Duygu Ceylan1

1Adobe Research, 2KAIST, 3UCL



Motivation

To study the 3D awareness of features in video diffusion transformers, we first propose a unified architecture that does video generation and camera pose estimation jointly.
Our novel combined network, JOG3R, stands for JOint Generation and 3D camera Reconstruction.
It can perform text-to-video generation (T2V), camera pose estimation from videos (V2C), and both jointly in a single pass (T2V+C).
We provide an extensive study and analysis of these tasks in the JOG3R framework and share our insights.
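To make the three inference modes concrete, here is a minimal sketch of how one backbone can serve two decoding heads. This is a hypothetical illustration, not the released JOG3R API: the class name, method names, and the stand-in feature/frame shapes are all assumptions; in the actual model the backbone is a video diffusion transformer.

```python
import numpy as np

class JOG3RSketch:
    """Hypothetical sketch of JOG3R's three modes (NOT the paper's real code).

    One shared backbone produces spatio-temporal features; a video head and
    a camera head decode those features into frames and per-frame poses.
    """

    def __init__(self, num_frames=16, feat_dim=8, seed=0):
        self.num_frames = num_frames
        self.feat_dim = feat_dim
        self.rng = np.random.default_rng(seed)

    def _backbone(self, conditioning):
        # Shared features (stand-in: random values instead of diffusion steps).
        return self.rng.standard_normal((self.num_frames, self.feat_dim))

    def _video_head(self, feats):
        # Decode features into frames (stand-in: tiny 4x4 RGB frames).
        return np.zeros((self.num_frames, 4, 4, 3))

    def _camera_head(self, feats):
        # Decode features into per-frame 4x4 camera-to-world matrices.
        return np.stack([np.eye(4) for _ in range(self.num_frames)])

    def t2v(self, prompt):
        """Text-to-video generation."""
        return self._video_head(self._backbone(prompt))

    def v2c(self, video):
        """Camera pose estimation from an input video."""
        return self._camera_head(self._backbone(video))

    def t2v_plus_c(self, prompt):
        """Joint generation and camera reconstruction: one forward pass,
        two outputs from the same shared features."""
        feats = self._backbone(prompt)
        return self._video_head(feats), self._camera_head(feats)
```

The key design point the sketch mirrors is that T2V+C reuses one set of backbone features for both heads, rather than running generation and pose estimation as two separate networks.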
In the example below, for each frame pair, we visualize only 10 correspondences to avoid clutter.

generated video from JOG3R
correspondences estimated during generating videos
camera trajectory estimated during generating videos
V2C qualitative results: ours vs. DUSt3R variants

JOG3R can reconstruct 3D cameras from both real videos and generated videos.
The estimated cameras are better than those of pretrained DUSt3R and on par with fine-tuned DUSt3R on RealEstate10K.


Input: real videos (Figure 4 in the main paper)

input video
ground truth camera trajectory
our camera trajectory
pretrained DUSt3R's trajectory
fine-tuned DUSt3R's trajectory


Input: generated videos (no ground truth available)

a basketball court in the backyard of a house
our camera trajectory
pretrained DUSt3R's trajectory
fine-tuned DUSt3R's trajectory
from-scratch trained DUSt3R's trajectory
a modern home with glass walls and patio furniture
our camera trajectory
pretrained DUSt3R's trajectory
fine-tuned DUSt3R's trajectory
from-scratch trained DUSt3R's trajectory
T2V & T2V+C qualitative results

All videos in this section are generated from JOG3R (Figure 6 in the main paper).
We additionally generate camera paths jointly with the videos (T2V+C) and confirm they are nearly identical
to the paths estimated by running V2C on the generated videos (T2V->V2C) (Figure 5 in the main paper).
For each pair of frames, we visualize only 10 correspondences to avoid clutter.
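Checking that two camera paths are "nearly identical" amounts to comparing per-frame translations and rotations. Below is a minimal sketch of such a comparison; the function name and the specific error definitions (mean translation gap, mean rotation geodesic distance) are our assumptions for illustration, not necessarily the exact metric used in the paper.

```python
import numpy as np

def trajectory_agreement(poses_a, poses_b):
    """Compare two camera trajectories frame by frame.

    poses_a, poses_b: (N, 4, 4) arrays of camera-to-world matrices,
    assumed to be expressed in the same world frame.
    Returns (mean translation error, mean rotation error in radians).
    Hypothetical helper, not the paper's exact evaluation code.
    """
    # Translation gap: Euclidean distance between camera centers per frame.
    t_err = np.linalg.norm(poses_a[:, :3, 3] - poses_b[:, :3, 3], axis=1).mean()

    # Rotation gap: geodesic angle of the relative rotation Ra @ Rb^T.
    r_rel = np.einsum("nij,nkj->nik", poses_a[:, :3, :3], poses_b[:, :3, :3])
    cos_angle = np.clip((np.trace(r_rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    r_err = np.arccos(cos_angle).mean()

    return t_err, r_err
```

With trajectories from T2V+C as `poses_a` and from T2V->V2C as `poses_b`, near-zero values of both errors would correspond to the "nearly identical" observation above.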

a backyard with steps leading up to a blue house
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C
a basketball court in the backyard of a house
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C
a patio with chairs and tables in front of a house
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C
a view of a kitchen and living room in a new home
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C
a dining room table with chairs and a view of the water
correspondences from T2V+C
correspondences from T2V->V2C
camera poses from T2V+C
camera poses from T2V->V2C