Zo3T: Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training

Ruicheng Zhang1,2, Jun Zhou1, Zunnan Xu1, Zihao Liu1, Jiehui Huang3,
Mingyang Zhang4, Yu Sun2, Xiu Li1*
1Tsinghua University   2Sun Yat-sen University
3The Hong Kong University of Science and Technology   4China University of Geosciences

† Equal contribution   * Corresponding author

Abstract

Trajectory-guided image-to-video (I2V) generation aims to synthesize videos that adhere to user-specified motion instructions. Existing methods typically rely on computationally expensive fine-tuning on scarce annotated datasets. Although some zero-shot methods attempt trajectory control in the latent space, they may yield unrealistic motion, as they neglect 3D perspective and create a misalignment between the manipulated latents and the network's noise predictions. To address these challenges, we introduce Zo3T, a novel zero-shot test-time-training framework for trajectory-guided generation with three core innovations. First, we incorporate a 3D-Aware Kinematic Projection, leveraging inferred scene depth to derive perspective-correct affine transformations for target regions. Second, we introduce Trajectory-Guided Test-Time LoRA, a mechanism that dynamically injects ephemeral LoRA adapters into the denoising network and optimizes them jointly with the latent state. Driven by a regional feature consistency loss, this co-adaptation effectively enforces motion constraints while allowing the pre-trained model to locally adapt its internal representations to the manipulated latent, thereby ensuring generative fidelity and on-manifold adherence. Finally, we develop Guidance Field Rectification, which refines the denoising path by optimizing the conditional guidance field through a one-step lookahead strategy, ensuring efficient generative progression towards the target trajectory. Zo3T significantly enhances 3D realism and motion accuracy in trajectory-controlled I2V generation, demonstrating superior performance over existing training-based and zero-shot approaches.
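To make the 3D-Aware Kinematic Projection concrete, the sketch below shows one way an inferred depth map can yield a perspective-correct affine transform for a dragged region: under a pinhole camera, apparent size scales inversely with depth, so a region moved toward the camera should also grow. This is a minimal illustration of the idea in the abstract, not the authors' code; kinematic_affine and the stand-in depth map are our own assumptions.

import numpy as np

def kinematic_affine(start_xy, end_xy, depth_map):
    """Perspective-correct 2x3 affine for a region dragged from start_xy to end_xy.

    The region is rescaled by depth(start)/depth(end), since apparent size
    is inversely proportional to depth, and translated to the target point.
    """
    x0, y0 = start_xy
    x1, y1 = end_xy
    d0 = float(depth_map[y0, x0])       # inferred depth at the source point
    d1 = float(depth_map[y1, x1])       # inferred depth at the destination
    s = d0 / max(d1, 1e-6)              # perspective scale factor
    # Scale about the source point, then translate it onto the target point.
    return np.array([
        [s, 0.0, x1 - s * x0],
        [0.0, s, y1 - s * y0],
    ])

# Usage: warp the target region's latent patch at each trajectory keyframe.
depth = np.random.rand(64, 64) + 0.5    # stand-in for a monocular depth estimate
A = kinematic_affine((10, 20), (40, 30), depth)
print(A @ np.array([10.0, 20.0, 1.0]))  # source point maps to ~[40., 30.]

Points near the source point follow it while being uniformly rescaled, which is the perspective-correct behavior the abstract describes; a plain 2D drag would instead use s = 1.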

Pipeline Framework

Pipeline Framework Overview
An overview of our zero-shot trajectory-guided video generation framework. Our method optimizes a pre-trained video diffusion model at specific denoising timesteps via two key stages. First, Test-Time Training (TTT) jointly adapts the latent state and an ephemeral LoRA adapter to maintain semantic consistency along the trajectory. Second, Guidance Field Rectification refines the denoising direction using a one-step lookahead optimization to ensure precise path execution.
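The toy sketch below illustrates how the two stages could interact within a single denoising timestep. It runs against a stand-in denoiser so it is self-contained; ToyDenoiser, region_feature, the squared-error loss, step sizes, and iteration counts are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class ToyDenoiser(nn.Module):
    """Stand-in for a pre-trained video diffusion backbone with a LoRA branch."""
    def __init__(self, dim=16):
        super().__init__()
        self.base = nn.Linear(dim, dim)                 # frozen pre-trained weights
        self.lora_A = nn.Parameter(torch.zeros(4, dim)) # ephemeral low-rank adapter,
        self.lora_B = nn.Parameter(torch.randn(16, 4) * 0.01)  # re-initialized per sample

    def forward(self, z):
        return self.base(z) + z @ self.lora_A.t() @ self.lora_B.t()

def region_feature(z, region):
    return z[..., region]            # crude proxy for features at a trajectory region

model = ToyDenoiser()
for p in model.base.parameters():    # only latent + adapter are trained at test time
    p.requires_grad_(False)

z_t = torch.randn(1, 16).requires_grad_(True)     # current latent state
src, dst = slice(0, 4), slice(8, 12)              # source/target regions on the path
ref = region_feature(model(z_t), src).detach()    # reference features to preserve

# Stage 1: Test-Time Training -- co-adapt the latent and the ephemeral LoRA so
# features at the target region stay consistent with the source region.
opt = torch.optim.Adam([z_t, model.lora_A, model.lora_B], lr=1e-2)
for _ in range(20):
    opt.zero_grad()
    loss = (region_feature(model(z_t), dst) - ref).pow(2).mean()
    loss.backward()
    opt.step()

# Stage 2: Guidance Field Rectification -- one-step lookahead: form a rough
# clean-latent prediction from the current guidance, score it at the target
# region, and nudge the guidance before committing the denoising step.
guidance = model(z_t).detach().requires_grad_(True)
opt_g = torch.optim.Adam([guidance], lr=1e-2)
for _ in range(5):
    opt_g.zero_grad()
    z0_hat = z_t.detach() - guidance              # toy one-step x0 prediction
    loss = (region_feature(z0_hat, dst) - ref).pow(2).mean()
    loss.backward()
    opt_g.step()
z_prev = z_t.detach() - 0.1 * guidance.detach()   # take the rectified step

The key design point mirrored here is that the base weights never change: only the latent, the short-lived adapter, and the per-step guidance are optimized, which is what keeps the method zero-shot.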

Demo Gallery

Comparison with Other Methods

Comparison Set 1

Condition Image

Condition for Comparison Set 1

Ours

DragAnything

DragNUWA

ObjCtrl-2.5D

SG-I2V

Comparison Set 2

Condition Image

Condition for Comparison Set 2

Ours

DragAnything

DragNUWA

ObjCtrl-2.5D

SG-I2V

Side-by-Side Demos

Acknowledgments

We thank the authors of SG-I2V and DragAnything for releasing their code and inspiring this project.

Citation

@article{zhang2025zeroshot3dawaretrajectoryguidedimagetovideo,
  title={Zero-shot 3D-Aware Trajectory-Guided image-to-video generation via Test-Time Training}, 
  author={Ruicheng Zhang and Jun Zhou and Zunnan Xu and Zihao Liu and Jiehui Huang and Mingyang Zhang and Yu Sun and Xiu Li},
  journal={arXiv preprint arXiv:2509.06723},
  year={2025},
  url={https://arxiv.org/abs/2509.06723}, 
}