Synthesizing Light Field Video from Monocular Video

Shrisudhan Govindarajan
Prasan Shedligeri
Kaushik Mitra

ECCV 2022 (Oral)



The hardware challenges associated with light-field (LF) imaging have made it difficult for consumers to access its benefits, such as post-capture focus and aperture control. Learning-based techniques that solve the ill-posed problem of LF reconstruction from sparse (1, 2 or 4) views have significantly reduced the requirement for complex hardware. LF video reconstruction from sparse views poses a special challenge, as acquiring ground truth for training these models is hard. Hence, we propose a self-supervised learning-based algorithm for LF video reconstruction from monocular videos. We use self-supervised geometric, photometric and temporal consistency constraints inspired by a recent self-supervised technique for LF video reconstruction from stereo video. Additionally, we propose three key techniques that are relevant to our monocular video input. First, we propose an explicit disocclusion handling technique that encourages the network to inpaint disoccluded regions in an LF frame using information from adjacent input temporal frames. This is crucial for a self-supervised technique, as a single input frame does not contain any information about the disoccluded regions. Second, we propose an adaptive low-rank representation that provides a significant boost in performance by tailoring the representation to each input scene. Finally, we propose a novel refinement block that exploits the available LF image data through supervised learning to further refine the reconstruction quality. Our qualitative and quantitative analysis demonstrates the significance of each of the proposed building blocks, as well as superior results compared to previous state-of-the-art monocular LF reconstruction techniques. We further validate our algorithm by reconstructing LF videos from monocular videos acquired using a commercial GoPro camera.




Our proposed algorithm takes as input a sequence of the previous, current and next frames, along with the disparity of the light field to be synthesized. A recurrent LF synthesis network first predicts an intermediate low-rank representation for the corresponding LF frame. An adaptive TD layer takes the same sequence and the low-rank representation as input and outputs the light-field frame. A set of self-supervised cost functions is then imposed on the predicted light field for end-to-end training of the recurrent network and the adaptive TD layer. Finally, a refinement block takes the predicted light field and the current frame as input and outputs a refined LF.
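To make the low-rank step concrete, here is a minimal sketch of how a layered, rank-R representation could be expanded into light-field views. It assumes that "TD" denotes a tensor-display-style model, in which each view is a sum over R components of a product of shifted layers; the page itself does not spell this out. The function name `td_reconstruct`, the array layout, the integer `depths` and the wrap-around shifts are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def td_reconstruct(layers, depths, angular=3):
    """Expand a layered low-rank representation into light-field views.

    layers : array of shape (K, R, H, W), K layers with R components each
             (hypothetical layout, chosen for illustration).
    depths : K integers; layer k is shifted by depths[k] pixels per unit
             of angular offset (assumed integer to keep the sketch simple).
    Returns an array of shape (angular, angular, H, W).
    """
    K, R, H, W = layers.shape
    half = angular // 2
    lf = np.zeros((angular, angular, H, W))
    for iu, u in enumerate(range(-half, half + 1)):
        for iv, v in enumerate(range(-half, half + 1)):
            view = np.ones((R, H, W))
            for k, d in enumerate(depths):
                # Shift layer k according to its depth and the view offset.
                # np.roll wraps around at the borders; a real system would
                # pad or resample instead.
                view *= np.roll(layers[k], shift=(d * u, d * v), axis=(1, 2))
            # Sum the R rank-1 components to obtain this view.
            lf[iu, iv] = view.sum(axis=0)
    return lf

rng = np.random.default_rng(0)
layers = rng.random((3, 2, 8, 8))          # K=3 layers, rank R=2
lf = td_reconstruct(layers, depths=(-1, 0, 1), angular=3)
print(lf.shape)                             # (3, 3, 8, 8)
```

In this sketch the `depths` argument is the place where a scene-dependent choice could enter; how the adaptive TD layer actually tailors the representation to each scene is not described on this page.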


Paper and Supplementary Material

S. Govindarajan, P. Shedligeri, K. Mitra
Synthesizing Light Field Video from Monocular Video
ECCV 2022 [oral]



Light Field image comparison

Ground Truth | Ours | Li et al. | Srinivasan et al.

Light Field video comparison (temporal consistency)

Ours | Li et al.

Light Field synthesis from GoPro monocular camera

Monocular image | Li et al. | Ours (1x baseline) | Ours (2x baseline)

Refocusing application with Synthesized Light Field Video

Normal video | Far focus (1x baseline) | Far focus (2x baseline) | Near focus (1x baseline) | Near focus (2x baseline)
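Post-capture refocusing of a light field is commonly done by shift-and-sum: each sub-aperture view is shifted in proportion to its angular offset and the views are averaged, so points at the chosen disparity align and appear sharp. The sketch below illustrates that generic technique on a grayscale light field; the function name `refocus`, the `(U, V, H, W)` layout and the integer `slope` parameter are assumptions made for illustration, and it is not necessarily how the videos above were rendered.

```python
import numpy as np

def refocus(lf, slope):
    """Shift-and-sum refocusing of a grayscale light field.

    lf    : array of shape (U, V, H, W) of sub-aperture views.
    slope : integer pixel shift per unit of angular offset; it selects
            the disparity plane that is brought into focus.
    Returns the refocused image of shape (H, W).
    """
    U, V, H, W = lf.shape
    cu, cv = U // 2, V // 2
    acc = np.zeros((H, W))
    for u in range(U):
        for v in range(V):
            dy, dx = slope * (u - cu), slope * (v - cv)
            # Integer shifts with wrap-around keep the sketch short;
            # sub-pixel interpolation would be used in practice.
            acc += np.roll(lf[u, v], shift=(dy, dx), axis=(0, 1))
    return acc / (U * V)

# Synthetic check: a scene lying entirely at disparity 2 is recovered
# exactly when refocusing at slope 2.
base = np.arange(64.0).reshape(8, 8)
lf = np.stack([
    np.stack([np.roll(base, (-2 * (u - 1), -2 * (v - 1)), axis=(0, 1))
              for v in range(3)])
    for u in range(3)
])
print(np.allclose(refocus(lf, 2), base))  # True
```

Under this model, a wider baseline produces larger disparities between views, so out-of-focus regions blur more strongly for the same focus setting.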

