The recent reveal of OpenAI’s Sora model, which generates videos from text, made headlines around the world. And understandably so, because it truly is something amazing.
But I was not too surprised by the announcement. I wrote about the emergence of text-to-video generative AI on this blog 16 months ago! See here: AI Video Generation (Text-To-Video Translation). So, I knew it was only a matter of time before one of the big players released something of this calibre.
What did surprise me, however, was something that seemingly went under the radar just two weeks ago: an announcement from Google’s DeepMind research team of an AI model that generates video games from single example images. The original academic paper, entitled “Genie: Generative Interactive Environments”, was published on 23 February 2024.
With Genie, Google is coining a new term: “generative interactive environments (Genie), whereby interactive, playable environments can be generated from a single image prompt”.
What does this mean? Simple: you provide Genie with an example image (hand-drawn, if you like) and you can then play a 2D platformer game set inside the environment you created.
Here are some examples. The first image is a human-drawn sketch; the image that follows is a short video showing somebody playing a video game inside the world depicted in the first image:


Here’s another one that starts off with a hand-drawn picture:


Real-world pictures (photos) work as well! Once again, the second image is a short snippet of somebody actually moving a character with a controller inside a generated video game.


See Google’s announcement for more great examples.
The title of my post states “Text-to-Video Game Translation”. If the only input permitted is a single image, how does “text-to-video game” fit here? The idea is that text-to-image models/generators like DALL-E or Stable Diffusion could be used to convert your initial text prompt into an image, and that image could then be fed into Genie.
Very cool.
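Genie itself hasn’t been released publicly, so there is no real API to call. But purely as a hypothetical sketch of that pipeline: the Stable Diffusion step below uses the real Hugging Face diffusers library, while the genie object and its create_environment method are entirely made up.

```python
# Hypothetical text -> image -> playable-world pipeline.
from diffusers import StableDiffusionPipeline

# Real step: turn a text prompt into an image with Stable Diffusion.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
image = pipe("a 2D platformer level set in a candy-cane forest").images[0]

# Imagined step: hand the image to Genie as the world prompt.
# `genie` and `create_environment` are invented names -- no public
# Genie interface exists at the time of writing.
environment = genie.create_environment(prompt_image=image)
```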
Video Game Quality
Now, the generated video game quality isn’t perfect. It certainly leaves a lot to be desired. Also, you can only play the game at 1 frame per second (FPS). Typical video games run at 30–60 FPS, so seeing the screen change only once per second is no fun. However, the game is being generated on the fly, as you play it. So, if you press one of eight possible buttons on a gamepad, the next frame will be a freshly generated response to your chosen action.
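To make the mechanics concrete, here’s a minimal sketch of that play loop. Every object in it (genie, gamepad, screen) is a hypothetical stand-in, since no Genie interface has been made public; it simply illustrates the frame-by-frame, action-conditioned generation the paper describes.

```python
import time

# Hypothetical play loop: each frame is generated on the fly as a
# response to one of 8 discrete controller actions.
frame = genie.first_frame(prompt_image)      # invented API
while True:
    action = gamepad.poll()                  # an integer in 0..7
    frame = genie.next_frame(frame, action)  # model "renders" the reaction
    screen.show(frame)
    time.sleep(1.0)                          # roughly 1 FPS, as reported
```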
Still, it’s not super exciting. But just as my first post on text-to-video generative AI introduced the whole idea of AI-generated videos, I’m doing the same thing now: this is what’s currently being worked on, so there might be more exciting stuff coming just around the corner – in 16 months, perhaps? For example this: “We focus on videos of 2D platformer games and robotics but our method is general and should work for any type of domain, and is scalable to ever larger Internet datasets.” (quoted from here)
There’s more coming. You heard it here first!
Other Works
For full disclosure, I want to mention that this isn’t the first time people have dabbled in generative AI for video games. Nvidia, for example, released GameGAN in 2020, which could produce clones of games like Pac-Man.
The difference with Google’s model is that it was trained entirely in an unsupervised manner from unlabelled internet videos. Genie learned, just from the videos themselves, which elements on the screen were being controlled by a player, what the corresponding controls were, and which elements were merely part of the scrolling background. Nvidia, on the other hand, used video input paired with descriptions of the actions taken as its training material. Creating a labelled dataset of actions paired with video outcomes is a laborious process. Like I said, Google did its training raw: on 30,000 hours of nothing but internet videos of hundreds of 2D platformer games.
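For intuition about how that unsupervised setup might work, here’s a toy sketch. This is my own drastic simplification, not the paper’s architecture (Genie actually uses a video tokenizer and spatiotemporal transformers): a latent action model has to compress every frame-to-frame change into one of just 8 discrete codes, and a dynamics model has to predict the next frame from the current frame plus that code. No action labels appear anywhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

DIM, NUM_ACTIONS = 256, 8  # Genie's latent action codebook has 8 entries

class LatentActionModel(nn.Module):
    """Infers a discrete latent 'action' that explains frame_t -> frame_t1."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(2 * DIM, DIM)
        self.codebook = nn.Embedding(NUM_ACTIONS, DIM)

    def forward(self, frame_t, frame_t1):
        z = self.encoder(torch.cat([frame_t, frame_t1], dim=-1))
        # Vector-quantise: snap z to its nearest codebook entry, forcing
        # each transition to be explained by one of only 8 actions.
        dists = torch.cdist(z.unsqueeze(1), self.codebook.weight.unsqueeze(0))
        z_q = self.codebook(dists.argmin(dim=-1).squeeze(1))
        # Straight-through estimator so gradients reach the encoder
        # (a real VQ-VAE also adds codebook/commitment losses).
        return z + (z_q - z).detach()

class DynamicsModel(nn.Module):
    """Predicts next-frame features from the current frame plus a latent action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(2 * DIM, DIM)

    def forward(self, frame_t, action):
        return self.net(torch.cat([frame_t, action], dim=-1))

# One unsupervised training step: no labels, just consecutive frames.
lam, dyn = LatentActionModel(), DynamicsModel()
frame_t, frame_t1 = torch.randn(4, DIM), torch.randn(4, DIM)  # stand-in features
loss = F.mse_loss(dyn(frame_t, lam(frame_t, frame_t1)), frame_t1)
loss.backward()
```

The neat consequence is that the learned codes apparently end up corresponding to consistent, recognisable controls (jump, move left, and so on) even though nobody ever labelled them.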