Staged demo video hurts trust in new multimodal model

Following months of teasers, Google has revealed Gemini, its next-generation artificial intelligence model, intended to compete directly with OpenAI’s GPT models.

The announcement caught the IT community off guard, since there had been rumors that problems with multilingual support had delayed the release. However, only the mid-tier of Gemini’s three models was launched right away.

According to this article, Gemini comes in three different versions. The largest, Ultra, is capable of “seeing the world the way humans do” through text, images, audio, and video. The second, Pro, drives Google Bard and offers capabilities comparable to those of the free version of ChatGPT.

The most unexpected news was Google Gemini Nano, a small AI model that runs entirely on an Android phone and can generate text, hold conversations, and analyze or summarize content.

Google Gemini

Large language models are now the dominant models in artificial intelligence. With their ability to produce many types of content and handle natural language interactions, they power applications such as Microsoft Copilot, ChatGPT, and Bard.

Gemini, the first offering from the combination of all of Google’s AI teams, including the British AI lab DeepMind, was trained from the ground up to be multimodal. This means that text, code, audio, video, and images were all part of a single training dataset, whereas other models are stitched together after being trained independently on different kinds of data.

Only the Gemini Ultra variant, which needs the most advanced chips and a data center to run, has the full set of capabilities. Google also unveiled Pro and Nano, two smaller versions that run faster, on less expensive hardware, and even locally on devices. Gemini Pro, which is integrated into the most recent version of Google Bard, is currently the only version that is generally accessible.

According to Google, Pro is comparable to OpenAI’s GPT-3.5, the previous-generation model that powers ChatGPT’s free tier. Given that Gemini Nano is integrated into the Google Pixel 8 Pro, you may already have used it without even knowing. Developers can also build its capabilities into their own apps. Google, however, has decided to delay the release of the Ultra model until next year, to allow for more thorough safety testing and to ensure the model is aligned with human values.
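The article does not show what this integration looks like in practice. As a rough, illustrative sketch (not from the article or Google’s documentation): the generally accessible route is the cloud-hosted Gemini Pro API, for example via the google-generativeai Python package, while on-device Nano access on Android goes through a separate, Android-specific system service. Assuming a valid API key from Google AI Studio, a minimal Pro call might look like this:

```python
import google.generativeai as genai

# Hypothetical placeholder key; a real key would come from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# "gemini-pro" is the cloud-hosted, text-only Pro model described in the article.
model = genai.GenerativeModel("gemini-pro")

# A single text prompt returns a generated completion.
response = model.generate_content(
    "Summarize the differences between the Gemini Ultra, Pro, and Nano models."
)
print(response.text)
```

This sketch covers the cloud API only; the on-device Nano path uses a different interface and is not shown here.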

The next steps

Next year, Gemini Ultra will be the center of attention, thanks to its use in several products, such as Duet AI, the toolset that drives generative AI in Workspace, and a new iteration of Google’s chatbot called Bard Advanced.

Nonetheless, the Nano version may end up reaching even more people. Thousands of Play Store apps will use it to power text generation, content analysis, summarization, and other features, and it will enhance translation and transcription capabilities and improve Android search results.

The demo

Google’s new Gemini AI model received positive feedback after its big premiere. However, users may lose faith in the company’s technology, or its ethics, after learning that the most spectacular Gemini demo was essentially staged.

It’s easy to understand why a video titled “Hands-on with Gemini: Interacting with Multimodal AI” received one million views in the last 24 hours. The striking demonstration “highlights some of our favorite interactions with Gemini,” showing the multimodal model’s adaptability and responsiveness to a range of inputs, and its ability to understand and combine linguistic and visual information.

As reported here, the video starts by following a duck sketch as it progresses from a scribble to a finished drawing, which Gemini notes is an unrealistic color for a duck. The model then expresses surprise upon seeing a blue toy duck, and goes on to respond to several spoken questions about that toy. The demonstration then moves on to other impressive feats, such as tracking a ball in a cup-switching game, recognizing shadow-puppet gestures, reordering sketches of planets, and so on.

Even though the video warns that “latency has been reduced and Gemini outputs have been shortened,” everything still seems incredibly responsive. Overall, it is a very impressive display of capability in the field of multimodal understanding.

There is only one issue: the video is fake. As Google explains, “We created the demo by capturing footage in order to test Gemini’s capabilities on a wide range of challenges. Then we prompted Gemini using still image frames from the footage and prompting via text.” It was Parmy Olson of Bloomberg who first drew attention to the discrepancy.

Thus, while Gemini may be able to perform some of the tasks Google shows in the video, it could not do them in real time or in the way the video implies. What actually happened was a sequence of carefully tuned text prompts paired with still images, selected and presented in a way that misrepresents the true nature of the interaction. To be fair, the video description includes a link to a related blog post where you can view some of the real prompts and responses used.
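To make the distinction concrete, here is a rough sketch (not taken from the article or from Google’s blog post) of what prompting a multimodal model with a still frame plus text looks like, assuming the google-generativeai Python package and a hypothetical frame file extracted from the footage:

```python
import google.generativeai as genai
from PIL import Image

# Hypothetical placeholder key; a real key would come from Google AI Studio.
genai.configure(api_key="YOUR_API_KEY")

# A single still frame (hypothetical file name), not a live video stream.
frame = Image.open("duck_sketch_frame.png")

# "gemini-pro-vision" accepts mixed image-and-text input.
model = genai.GenerativeModel("gemini-pro-vision")

# One static image paired with one written question (an invented example prompt).
response = model.generate_content(
    [frame, "What is being drawn here, and is the color realistic?"]
)
print(response.text)
```

Each such call is a one-off exchange over a static image; the video, by contrast, presents the results as a fluid spoken conversation happening in real time.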

The world of AI has changed dramatically since OpenAI released GPT-3, and the arrival of ChatGPT marked the beginning of a new era. Since then, Google has been trying to catch up with and overtake OpenAI, initially criticizing the release of such technology so early, then finding itself obliged to raise the bar. Although Gemini’s potential suggests a further leap forward for multimodal AI, the misstep Google made by exaggerating its capabilities is not a good sign for the company. Nevertheless, in the next few years, AI will be massively integrated into every aspect of technology, with all the pros and cons that entails.