Google Gemini is a next-generation AI model aimed at replacing Google’s existing AI architectures, including PaLM 2. It distinguishes itself from other large language models by not being trained solely on text. Google designed the model with multimodal capabilities in mind, pointing towards a more general-purpose future for AI. It can generate and process text, images, and other types of data such as diagrams and maps [1].
Gemini allows Google’s services to analyze or generate text, images, audio, video, and other data types simultaneously. Unlike previous AI tools, which are restricted to a single data type, Gemini is a “multimodal” model capable of processing more than one data type at a time [1].
Gemini can also combine visual and textual data to generate more than one type of data simultaneously. Imagine an AI that not only writes the content of a magazine but also designs the layout and graphics for it. Or an AI that can summarize an entire newspaper or podcast based on the topics that matter most to you [1].
A multimodal AI model like Gemini is built from several main components working together, starting with an encoder and a decoder. When given an input with more than one data type (like a piece of text and an image), the encoder extracts the relevant details from each data type (modality) separately. The AI then uses an attention mechanism to look for important features or patterns in the extracted data, effectively directing its focus to the most relevant parts of the input. Finally, the AI fuses the information it has learned from the different data types to make a prediction [1].
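To make that encode, attend, fuse, and predict flow concrete, here is a minimal sketch of a multimodal classifier in PyTorch. The layer sizes, module names, and the choice of cross-attention between text and image features are illustrative assumptions that follow the general pipeline described above, not Gemini’s actual architecture.

```python
# Minimal sketch of a multimodal encoder-fusion model: separate encoders per
# modality, cross-attention between them, fusion, and a prediction head.
# All sizes and design choices here are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalClassifier(nn.Module):
    def __init__(self, text_vocab=10000, embed_dim=256, num_classes=10):
        super().__init__()
        # One encoder per modality: text tokens and image pixels are mapped
        # into the same embedding dimension so they can be related later.
        self.text_encoder = nn.Sequential(
            nn.Embedding(text_vocab, embed_dim),
            nn.TransformerEncoder(
                nn.TransformerEncoderLayer(embed_dim, nhead=4, batch_first=True),
                num_layers=2,
            ),
        )
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),      # -> (B, 32, 8, 8)
            nn.Flatten(start_dim=2),      # -> (B, 32, 64) "patch" features
        )
        self.image_proj = nn.Linear(64, embed_dim)
        # Cross-attention: text features attend to image features, so the
        # model focuses on the image regions relevant to the text.
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads=4,
                                                batch_first=True)
        # Fusion of both modalities, followed by a prediction head.
        self.head = nn.Sequential(
            nn.Linear(embed_dim * 2, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, num_classes),
        )

    def forward(self, text_tokens, image):
        t = self.text_encoder(text_tokens)              # (B, T, D)
        v = self.image_proj(self.image_encoder(image))  # (B, 32, D)
        attended, _ = self.cross_attn(query=t, key=v, value=v)
        # Fuse pooled text and attended-image representations, then predict.
        fused = torch.cat([t.mean(dim=1), attended.mean(dim=1)], dim=-1)
        return self.head(fused)

model = MultimodalClassifier()
logits = model(torch.randint(0, 10000, (2, 16)),  # a batch of token IDs
               torch.randn(2, 3, 64, 64))         # a batch of RGB images
print(logits.shape)  # torch.Size([2, 10])
```

A production multimodal model would be far larger and trained end to end on paired data, but the structure is the same: encode each modality, let attention pick out what matters, fuse, and predict.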
Google has confirmed that Gemini will come in different sizes, but the exact technical details are not yet known. The smallest model could even fit on a typical smartphone, making it ideal for generative AI on the go. However, it is more likely that Gemini will first come to the Bard chatbot and other Google services. Gemini is currently still in its training phase, after which it will move on to fine-tuning and safety improvements [1].