Animate Anyone can change the pose of a photo’s subject and make it move

As if deepfakes in images weren’t bad enough, everyone who posts a photo of themselves online will soon have to deal with generated videos of themselves, since bad actors can now puppeteer people more effectively than ever thanks to Animate Anyone.

The new generative video technique comes from researchers at Alibaba Group’s Institute for Intelligent Computing. It is a significant advance over earlier image-to-video systems like DreamPose and DisCo, which were impressive in their day but are now outdated.

Animate Anyone’s capabilities are by no means new, but the system has navigated the tricky transition from experimental demo to something good enough that people will assume it’s real and won’t even bother to examine it closely.

Image-to-video models like this one start by extracting details from a reference image, such as a fashion photo of a model wearing a dress for sale: facial features, patterns, and pose. A series of frames is then generated in which those details are mapped onto slightly different poses, which can themselves be motion-captured or extracted from another video.
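
To make that pipeline concrete, here is a minimal sketch in Python. Every name and function in it is a hypothetical placeholder standing in for a trained neural network; it shows the shape of the approach, not Animate Anyone’s actual code.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class AppearanceFeatures:
    """Details lifted from the reference image: face, clothing, textures."""
    embedding: np.ndarray


def extract_appearance(reference_image: np.ndarray) -> AppearanceFeatures:
    # Placeholder: a real system runs a trained image encoder here.
    return AppearanceFeatures(embedding=reference_image.mean(axis=(0, 1)))


def extract_pose_sequence(driving_video: list[np.ndarray]) -> list[np.ndarray]:
    # Placeholder: a real system runs a pose estimator (keypoint or
    # skeleton maps) over motion-capture data or another video's frames.
    return list(driving_video)


def render_frame(appearance: AppearanceFeatures, pose: np.ndarray) -> np.ndarray:
    # Placeholder: a real system runs a generative model (typically a
    # diffusion model) conditioned on both appearance and target pose.
    return np.zeros_like(pose) + appearance.embedding


def animate(reference_image: np.ndarray, driving_video: list[np.ndarray]) -> list[np.ndarray]:
    """Map the reference subject's appearance onto each target pose."""
    appearance = extract_appearance(reference_image)
    poses = extract_pose_sequence(driving_video)
    return [render_frame(appearance, pose) for pose in poses]
```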

Earlier models demonstrated that this could be done, but with numerous problems. Hallucination was a major issue, since the model has to invent plausible details, like how a person’s hair or sleeve moves when they turn. That produced plenty of deeply odd images that undermined the credibility of the final video. But the idea persisted, and Animate Anyone improves on it significantly, though it is still far from flawless.

The paper highlights a new intermediate step that “enables the model to comprehensively learn the relationship with the reference image in a consistent feature space, which significantly contributes to the improvement of appearance detail preservation”. With both fundamental and intricate details preserved better, the generated frames downstream have a stronger ground truth to work from and come out better for it.
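
The quote doesn’t spell out the mechanism, but one plausible reading is an attention step that lets the generator query the reference image’s features directly, which works precisely because both sets of features live in the same (“consistent”) space. The sketch below is purely illustrative; the class name and layout are assumptions, not the paper’s actual architecture.

```python
import torch
import torch.nn as nn


class ReferenceFusion(nn.Module):
    """Illustrative guess only: inject reference-image features into the
    generator's features via cross-attention in a shared feature space."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, gen_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
        # gen_feats: (batch, n_gen_tokens, dim)  features being generated
        # ref_feats: (batch, n_ref_tokens, dim)  features of the reference image
        # The generator "looks up" appearance details (hair, clothing,
        # texture) in the reference, then keeps them via a residual add.
        fused, _ = self.attn(query=gen_feats, key=ref_feats, value=ref_feats)
        return self.norm(gen_feats + fused)


# Example: fuse 64 generator tokens with 64 reference tokens of width 320.
fusion = ReferenceFusion(dim=320)
out = fusion(torch.randn(1, 64, 320), torch.randn(1, 64, 320))  # -> (1, 64, 320)
```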

They demonstrate the results in a few settings: fashion models striking arbitrary poses without losing their shape or the design of their clothes, a 2D anime character coming to life and dancing convincingly, and more.

The results are far from perfect, particularly around the hands and eyes, which pose special difficulties for generative models. And the poses rendered most faithfully are those closest to the original; if the subject turns around, for instance, the model struggles to keep up. But it is a significant improvement over the prior state of the art, which produced many more artifacts or lost crucial information entirely, like the color of a person’s clothes or hair.

The idea that a bad actor or producer could make you do almost anything with just a single high-quality photo of you is unsettling. For now, the technology is too complex and buggy for general use, but things don’t tend to stay that way for long in the AI world.

At least for now, the team isn’t releasing the code to the public. On their GitHub page, the creators state, “We are actively working on preparing the demo and code for public release. Although we cannot commit to a specific release date at this very moment, please be certain that the intention to provide access to both the demo and our source code is firm”.

With deepfakes, we had already begun to worry about the spread of photos and videos showing a person doing things they never did. Now the deception can extend to the whole body, simulating poses and movements the subject never made.

Where before you would take a video and paste a face onto it to make that person the protagonist, now you can fabricate someone’s movements from a single photo. The degree to which photos and video can now be altered means they can no longer easily be used as evidence. Add the ability to clone an individual’s voice, and the falsification of reality reaches a whole new level.

What is true and what is false are becoming increasingly hard to tell apart, so we need to stay sharp and trust less and less of what we see and hear at first glance.