Lumiere: Google Research's New AI Video Generation Model
On January 29, 2024, Thomas Calvi reported on Lumiere, the latest video generation model unveiled by Google Research. Lumiere employs a probabilistic diffusion model built on a spatio-temporal U-Net architecture to create realistic, coherent 5-second videos from text prompts or still images. Users can customize the style of these videos or create cinemagraphs by animating only selected parts of an image.
Image generation models such as Adobe Firefly, DALL-E, Midjourney, Imagen, and Stable Diffusion have captured the public imagination and quickly gained popularity, making video generation the next frontier. Meta AI ventured into this domain in October 2022 with Make-A-Video, and NVIDIA's Toronto AI Lab introduced a high-resolution text-to-video synthesis model built on Stability AI's open-source Stable Diffusion. Stability AI itself followed in November 2023 with Stable Video Diffusion, a notably efficient model.
Video generation presents a more complex challenge than image creation, adding a temporal dimension to the spatial one. The model must accurately generate each pixel and predict its evolution over time to produce a fluid and coherent video.
For Lumiere, Google Research, which had also contributed to the W.A.L.T video generation model released a month earlier, chose an innovative approach to overcome the specific challenges of training text-to-video models.
Lumiere consists of a base model and a spatial super-resolution model. The base model generates low-resolution video clips by processing the video signal across multiple spatio-temporal scales, leveraging a pre-trained text-to-image model. The spatial super-resolution model then upscales these clips, using a multi-diffusion technique to ensure the overall continuity of the result.
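At a high level, the cascade can be pictured as follows. This is a minimal sketch, not Google's code: `base_model` and `ssr_model` are hypothetical callables, the window and stride sizes are illustrative, and MultiDiffusion is approximated here by averaging denoised results over overlapping temporal windows.

```python
import torch

def generate_video(base_model, ssr_model, prompt, window=16, stride=8):
    """Sketch of a Lumiere-style cascade: a base model emits a low-res clip
    covering the full duration, then a spatial super-resolution (SSR) model
    upscales it window by window, blending overlaps for consistency."""
    # Base pass: full temporal extent at low spatial resolution,
    # shape (C, T, H, W) -- e.g. (3, 80, 128, 128) for a 5 s clip at 16 fps.
    low_res = base_model(prompt)
    C, T, H, W = low_res.shape

    up = torch.zeros(C, T, H * 4, W * 4)   # accumulator for the upscaled clip
    weight = torch.zeros(1, T, 1, 1)       # per-frame overlap counts

    # SSR pass: overlapping temporal windows, averaged where they overlap
    # so the result stays coherent across window boundaries.
    for t0 in range(0, max(T - window, 0) + 1, stride):
        chunk = low_res[:, t0:t0 + window]
        up[:, t0:t0 + window] += ssr_model(chunk, prompt)
        weight[:, t0:t0 + window] += 1.0

    return up / weight.clamp(min=1.0)
```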
Researchers explain, "We introduce a spatio-temporal U-Net architecture that generates the full temporal duration of the video in a single pass through the model. This contrasts with existing video models that synthesize distant keyframes followed by temporal super-resolution, an approach that inherently complicates achieving global temporal coherence."
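To make the single-pass idea concrete, here is a hedged PyTorch sketch of a space-time U-Net stage: the clip's full duration is present in one forward pass, and the temporal axis is downsampled and re-expanded just like the spatial axes. Channel counts, depth, and layer choices are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTimeUNet(nn.Module):
    """Toy space-time U-Net: unlike keyframe-based pipelines, the whole
    clip is processed at once, with time compressed alongside space."""
    def __init__(self, channels=64):
        super().__init__()
        self.encode = nn.Conv3d(3, channels, kernel_size=3, padding=1)
        # Stride 2 on (T, H, W): downsample time and space together.
        self.down = nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1)
        self.mid = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.decode = nn.Conv3d(channels, 3, kernel_size=3, padding=1)

    def forward(self, video):                 # video: (B, 3, T, H, W)
        h = F.silu(self.encode(video))
        skip = h
        h = F.silu(self.down(h))              # (B, C, T/2, H/2, W/2)
        h = F.silu(self.mid(h))
        h = F.interpolate(h, size=skip.shape[2:], mode="trilinear")
        return self.decode(h + skip)          # same (B, 3, T, H, W) as input

x = torch.randn(1, 3, 16, 64, 64)             # a 16-frame low-res clip
print(SpaceTimeUNet()(x).shape)               # torch.Size([1, 3, 16, 64, 64])
```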
Applications
Lumiere can be adapted for a variety of video content creation and editing tasks, such as generating stylized videos, image-to-video generation, video inpainting, outpainting, and creating cinemagraphs.
Inpainting involves realistically filling or restoring missing or damaged parts of a video. It can be used to remove unwanted objects, repair artifacts or corrupted areas in a video, or even create special effects.
Video outpainting, on the other hand, refers to extending or adding content beyond the existing video boundaries. It allows for the addition of elements to enlarge the scene, create smooth transitions between video clips, or add decorative or contextual elements.
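Both operations can be thought of as mask-conditioned generation: a per-pixel mask marks which regions of each frame to keep and which to synthesize. Below is a minimal sketch of building such masks, assuming a hypothetical mask-conditioned editor `model(video, mask)` rather than Lumiere's actual interface.

```python
import torch

def make_inpaint_mask(T, H, W, box):
    """Mask = 1 where content must be synthesized. For inpainting,
    that's an interior region (here a static box on every frame)."""
    mask = torch.zeros(1, T, H, W)
    y0, y1, x0, x1 = box
    mask[:, :, y0:y1, x0:x1] = 1.0
    return mask

def make_outpaint_mask(T, H, W, pad):
    """For outpainting, the original frames are kept and a border of
    `pad` pixels around them is synthesized to enlarge the scene."""
    mask = torch.ones(1, T, H + 2 * pad, W + 2 * pad)
    mask[:, :, pad:pad + H, pad:pad + W] = 0.0  # keep the original interior
    return mask

# Usage with the hypothetical mask-conditioned editor:
#   edited = model(video, make_inpaint_mask(T, H, W, box=(40, 80, 60, 120)))
```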
Evaluations
Lumiere was evaluated on a set of 113 text prompts and on the UCF101 dataset, achieving competitive results in terms of Fréchet Video Distance (FVD) and Inception Score (IS). In a user study, participants preferred Lumiere over competing methods for its visual quality and motion coherence.
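For reference, Fréchet Video Distance compares the statistics of features extracted from real and generated videos (typically with a pretrained I3D network) using the closed-form Fréchet distance between two Gaussians. Here is a minimal sketch of that final computation, assuming the feature extraction happens upstream:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """feats_*: (N, D) arrays of video features, e.g. I3D embeddings.
    FVD = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2))."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    s_r = np.cov(feats_real, rowvar=False)
    s_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s_r @ s_g)
    if np.iscomplexobj(covmean):        # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(s_r + s_g - 2.0 * covmean)
```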
While the model demonstrated strong performance, researchers caution, "Our main goal in this work is to enable novice users to generate visual content in a creative and flexible manner. However, there's a risk of misuse for creating false or harmful content with our technology. We believe it's crucial to develop and apply tools to detect biases and malicious use cases to ensure safe and equitable use."
Incorporating cutting-edge technologies like Lumiere into e-commerce platforms such as PrestaShop could revolutionize product showcases and advertising. Imagine pairing Google's generative AI models with PrestaShop to dynamically generate product videos, enhancing user engagement and driving sales with visually captivating content.