
New Technologies for AI Solutions

2024 is gearing up to become the year of AI as OpenAI and Google showcase new technology

May has undoubtedly been the high point of AI innovation in 2024, with AI giants OpenAI and Google fighting it out in the AI ring. In the left corner, we have OpenAI launching GPT-4 Omni, its heaviest-hitting AI model to date in terms of capabilities and speed, and the company’s first fully-fledged take on a multimodal AI model, all at a lower price tag. In the right corner, we have one of the tech world’s hardest hitters, none other than Google, which announced a slew of AI goodies during its trademark annual event: Google I/O.

It remains to be seen who will win this AI contest and take the title of reigning AI champion. While the AI tech titans trade blows, Humans.ai is not lagging behind: we are implementing the latest technologies developed by the two powerhouses to bring the tech into the hands of real people. We continue to expand our business model, mixing technology developed in-house with open-source solutions and new OpenAI and Google features.

Humans.ai Leverages Google and OpenAI’s AI Models

At Humans.ai, we’re always pushing the boundaries of what’s possible with artificial intelligence. We’re excited to share how we’re integrating some of the most advanced AI models from Google and OpenAI into our ecosystem. This integration marks a significant milestone in our journey to provide unparalleled AI solutions that are not only powerful but also accessible to everyone.

Bringing GPT-4o and Gemini into Our Fold

We’re thrilled to announce that we’re integrating GPT-4o and Google’s latest version of Gemini via API. These integrations are set to revolutionize our services by drastically improving response times and overall performance. Our agentic solutions are model-agnostic, meaning they don’t rely on any single AI model; that flexibility to mix and compare results from multiple agents allows us to deliver even better initial responses.
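To make the model-agnostic approach concrete, here is a minimal sketch of how a single prompt can be fanned out to both GPT-4o and Gemini 1.5 Pro so the answers can be compared side by side. The ask_all helper is a hypothetical simplification, not our production code; only the public openai and google-generativeai client APIs are assumed.

```python
# Minimal sketch: fanning one prompt out to two model backends.
# Assumes OPENAI_API_KEY and GOOGLE_API_KEY are set in the environment.
import os

import google.generativeai as genai
from openai import OpenAI

openai_client = OpenAI()  # picks up OPENAI_API_KEY automatically
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
gemini = genai.GenerativeModel("gemini-1.5-pro")

def ask_gpt4o(prompt: str) -> str:
    resp = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def ask_gemini(prompt: str) -> str:
    return gemini.generate_content(prompt).text

def ask_all(prompt: str) -> dict[str, str]:
    """Fan the same prompt out to every backend and collect the answers."""
    return {"gpt-4o": ask_gpt4o(prompt), "gemini-1.5-pro": ask_gemini(prompt)}

if __name__ == "__main__":
    answers = ask_all("Summarize the benefits of model-agnostic AI agents.")
    for model_name, answer in answers.items():
        print(f"--- {model_name} ---\n{answer}\n")
```

In a real agentic pipeline the comparison step would be richer, for example scoring the candidate answers with a judge model or business-specific heuristics before choosing one.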

Elevating Performance with GPT-4o

GPT-4o is a game-changer for us. It significantly boosts our results on Massive Multitask Language Understanding (MMLU), a benchmark that matters for any system requiring high-level language comprehension. This improvement means our AI can provide more accurate and insightful responses across a wide range of applications.

Moreover, with the integration of OpenAI’s vision model, our web crawling capabilities have received a major upgrade. This enhancement allows for faster, more conversational interactions with complex online content, making it easier for our users to find the information they need quickly and efficiently.
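As an illustration of that upgrade, a crawled page can be screenshotted and handed to GPT-4o together with a question about its contents. The sketch below uses the public Chat Completions image-input format; the file name and the question are placeholders.

```python
# Hypothetical sketch: asking GPT-4o about a screenshot of a crawled page.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("page_screenshot.png", "rb") as f:  # placeholder file name
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What pricing plans does this page list?"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{image_b64}"},
            },
        ],
    }],
)
print(resp.choices[0].message.content)
```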

Harnessing the Power of Gemini

Google’s Gemini brings a 2 million token context window into our toolkit, allowing our solutions to handle a much larger set of documents than ever before. This capability is particularly useful in scenarios where precise information retrieval is essential. Additionally, the context caching feature helps us reduce costs for our clients by reusing tokens more efficiently.
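In practice, that means a large document set can be uploaded and cached once, then queried repeatedly without paying for the same tokens again. The sketch below follows the google-generativeai context-caching API; the pinned model version, file name, and TTL are illustrative assumptions, and cached content must meet a minimum token count.

```python
# Hypothetical sketch: cache a large document once, query it repeatedly.
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key="...")  # placeholder API key

# Upload a large document and pin it in the cache for one hour.
doc = genai.upload_file("big_corpus.txt")  # placeholder file name
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",  # caching requires a pinned model version
    contents=[doc],
    ttl=datetime.timedelta(hours=1),
)

# Later questions reuse the cached tokens instead of resending the document.
model = genai.GenerativeModel.from_cached_content(cached_content=cache)
print(model.generate_content("List the key dates mentioned in the corpus.").text)
```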

Tailored Solutions for Diverse Needs

We believe in an inclusive approach to AI, which means we cater to diverse linguistic and cultural needs. One of our standout integrations is the Navarasa family of models, derived from Google’s Gemma. Navarasa 2.0 is specifically fine-tuned for Indic languages, ensuring that users in India can interact with our AI in their native languages, receiving accurate and culturally relevant responses.
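For readers who want to try this themselves, Navarasa 2.0 checkpoints are published on Hugging Face and load with the standard transformers workflow. The model identifier and the Hindi prompt below are illustrative assumptions; check the Telugu-LLM-Labs pages for the exact repository names.

```python
# Hypothetical sketch: prompting a Navarasa 2.0 checkpoint in an Indic language.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model id; verify the exact repository name on Hugging Face.
model_id = "Telugu-LLM-Labs/Indic-gemma-7b-finetuned-sft-Navarasa-2.0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "भारत की राजधानी क्या है?"  # "What is the capital of India?" in Hindi
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```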

Continuously Exploring New Horizons

Our commitment to innovation doesn’t stop at integrating high-profile models. We’re constantly experimenting with powerful new models like MiniCPM-V, a series of end-side (on-device) multimodal LLMs designed for vision-language understanding. On published benchmarks, these models can outperform proprietary ones like GPT-4V-1106, Gemini Pro, Qwen-VL-Max, and Claude 3, showcasing our dedication to leveraging the best technologies available.

At Humans.ai, we’re dedicated to making AI more powerful and accessible. By integrating state-of-the-art models from Google and OpenAI, we’re setting new standards in the industry. Our approach ensures that we provide cutting-edge solutions tailored to meet the diverse needs of our global user base. We’re excited about the future and committed to bringing the best AI capabilities to everyone, no matter their technical expertise. Stay tuned for more updates as we continue to innovate and lead the way in AI development.

Explained in depth: What Google and OpenAI presented

GPT-4o takes the world by storm

On May 13, 2024, OpenAI surprised everyone with a spring update, unveiling its new flagship model, GPT-4o. The model is faster and cheaper than GPT-4 Turbo, and it combines text, vision, and audio in one. Interestingly, GPT-4o was released just a day before Google I/O, a timing that hardly seems coincidental.

What’s really impressive about GPT-4o is its remarkably human-like conversational ability. By default it speaks with a female voice, but it can adapt its tone to match the needs of any interaction, whether you want a relaxed tone for bedtime stories, a dramatic flair for storytelling, or a more neutral and informative style for professional conversations. This flexibility makes GPT-4o not just a powerful communication tool but also a highly personalized one, enhancing user engagement across various applications.

The GPT-4o model is available now, but the conversational features aren’t yet open to the public. While that’s a bit of a letdown, it’s exciting to hear that OpenAI is in talks to bring this tech to the iPhone. AI rival Google is similarly working to integrate its AI model, Gemini, into the iPhone ecosystem. Negotiations are currently underway, with both OpenAI and Google racing to develop a model that is not only highly intelligent but also efficient and cost-effective enough to operate on mobile devices.

Omni’s integration of multimodal capabilities (text, vision, and audio) marks a significant step forward in AI’s ability to interact with the world. The model can understand and generate responses based on images and spoken input, offering a richer and more interactive user experience. It’s also fast, responding to audio inputs in as little as 232 milliseconds, with an average of around 320 milliseconds, roughly the pace of human conversation. GPT-4o matches GPT-4 Turbo’s performance on English text and coding tasks while making significant improvements in understanding non-English text. Plus, it’s much faster and 50% cheaper to use via the API, and it excels in vision and audio comprehension compared to its predecessors.

Key Features of GPT-4o

Multimodal Inputs

GPT-4 Omni stands out with its ability to process and respond to diverse input formats, making it perfect for applications that demand cross-modal intelligence. Whether it’s text, audio, images, or video, GPT-4 Omni seamlessly integrates them all, pushing the boundaries of what’s possible in AI.

Real-Time Interaction

Experience lightning-fast responses with GPT-4o. Enhanced for real-time interaction, it significantly accelerates user engagement without compromising on accuracy. This means more fluid and dynamic conversations, ideal for interactive applications.
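In practice, that real-time feel usually comes from streaming: rather than waiting for the full completion, the client renders tokens as they arrive. A minimal sketch with the OpenAI Python client, where the prompt is just a placeholder:

```python
# Minimal sketch: streaming GPT-4o output token by token for responsive UIs.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain multimodal AI in two sentences."}],
    stream=True,  # deltas arrive as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:  # some chunks carry no text (e.g. the final one)
        print(delta, end="", flush=True)
print()
```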

Improved Non-English Language Understanding

This iteration of GPT-4 offers superior performance in processing non-English languages, expanding its usability on a global scale. It ensures more accurate and nuanced understanding and generation of multilingual content, making it a truly international AI.

Enhanced Audio and Visual Processing

With advanced capabilities in understanding and generating audio and visual content, GPT-4o is a powerful tool for both creative and technical fields. Whether you’re working on music, video production, or complex visual data analysis, GPT-4o enhances your creative and technical output with precision and efficiency.

Google goes all in on the large language model race

On May 14, 2024, the tech world was abuzz with excitement as Google hosted its annual developer conference, Google I/O. This year’s event showcased a series of remarkable announcements, with Google making significant strides to close the gap with its AI competitor, OpenAI. The highlight of the conference was the introduction of Gemini 1.5 Pro, a revolutionary AI model capable of handling a 2 million token context window.

This impressive feature allows Gemini 1.5 Pro to process extensive amounts of data, equivalent to two hours of video content or 60,000 lines of code. This expansion in context capacity represents a significant leap in the AI field, enabling a more comprehensive understanding and generation of language.

However, managing such large context windows can be costly. To address this, Google introduced a feature called context caching. This innovation significantly reduces the expense by allowing the reuse of tokens at a fraction of the cost. In AI models, tokens and context windows are critical components that influence the ability to process and generate language.

AI models function by breaking down words into tokens, analyzing them, and then producing responses in tokens, which are subsequently converted into human-understandable words. The context window acts as the model’s memory, with larger windows providing greater memory capacity. This enables the AI to understand and respond more accurately by retaining more information from the dialogue.

A larger context window means the AI can utilize more tokens in a single interaction, enhancing its ability to deliver accurate and contextually relevant responses. Essentially, the more tokens available within the context window, the more data can be input into a query, allowing the AI model to generate more precise and useful results.
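To make the token arithmetic concrete, here is a small sketch that counts tokens the way GPT-4o does, using the tiktoken library and its o200k_base encoding (the sample sentence is arbitrary):

```python
# Counting tokens with tiktoken's o200k_base encoding, the one used by GPT-4o.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

text = "The context window acts as the model's memory."
tokens = enc.encode(text)

print(f"{len(tokens)} tokens: {tokens}")
print(enc.decode(tokens))  # round-trips back to the original text
```

Every token in a prompt, in cached context, and in the generated answer counts against the context window, which is why window size and token pricing matter together.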

Google is pointing out that LLMs are swiftly advancing beyond their English-centric origins. As the field evolves, there is an increasing emphasis on multilingual datasets and the development of models that cater to the diverse linguistic tapestry of our world. However, ensuring fair representation and consistent performance across all languages, especially those with limited data and computational resources, remains a significant challenge.

Google’s Gemma family of open models is designed to meet these challenges head-on, facilitating the development of projects in languages other than English. Equipped with a robust tokenizer and an extensive token vocabulary, Gemma is adept at handling a wide array of languages, making it a powerful tool for global linguistic diversity.

Addressing the complexities of language, particularly in a linguistically diverse country like India, presents a fascinating challenge. In India, languages can vary significantly every few kilometers, creating a unique problem for technology that aims to understand and cater to such diversity. Traditional AI models often struggle with this, as they are typically designed with a narrower cultural focus. However, Google’s Gemma offers a revolutionary solution.

One of Gemma’s standout features is its exceptionally powerful tokenizer, capable of handling hundreds of thousands of words, symbols, and characters across various alphabets and language systems. This expansive vocabulary is crucial for adapting Gemma to projects like Navarasa, which is specifically trained for Indic languages.
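A quick way to see that breadth is to run text from different scripts through the Gemma tokenizer, whose vocabulary is on the order of 256,000 entries. The sketch below assumes access to the gated google/gemma-7b repository on Hugging Face (its license must be accepted first):

```python
# Sketch: how Gemma's large vocabulary tokenizes different scripts.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/gemma-7b")  # gated; accept the license first
print(f"vocabulary size: {tok.vocab_size}")

samples = {
    "English": "Hello, how are you?",
    "Hindi": "नमस्ते, आप कैसे हैं?",
    "Tamil": "வணக்கம், எப்படி இருக்கிறீர்கள்?",
}
for lang, text in samples.items():
    print(f"{lang}: {len(tok.encode(text))} tokens")
```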

Navarasa 2.0, a fine-tuned model based on Google’s Gemma, exemplifies the potential of culturally rooted AI. It allows people to interact with technology in their native languages and receive responses in the same language. It seems that Google’s goal is to create an AI model that includes everyone from all corners of India, ensuring that no one is left behind.

The vision behind Navarasa 2.0 and Gemma is to harness AI technology so that it is accessible and beneficial to everyone. Navarasa 2.0 highlights Gemma’s potential to drive innovation and inclusivity in AI, ensuring that even less-represented languages receive the computational attention they deserve.

Google showcased its AI lineup at the I/O 2024 conference

During Google I/O 2024, Google also introduced Project Astra, a groundbreaking AI assistant that leverages the capabilities of Gemini to create more natural and intuitive interactions. In a compelling demo, an employee used her phone camera to ask the AI for help locating her glasses, which it did. Google hails Project Astra as “the future of AI assistants,” designed to facilitate seamless conversation and quick responses through advanced information caching and video processing. Project Astra’s ability to interpret and respond to spoken queries using the phone’s camera exemplifies Google’s commitment to enhancing user experience. While impressive, the demonstration revealed a slight latency in response time and a more robotic voice compared to OpenAI’s model.

In addition to Project Astra, Google announced significant upgrades to Google Photos with the integration of Gemini. The new “Ask Photos” feature, set to roll out experimentally in the coming months, allows users to search for specific images or recall details from their gallery via text input. This feature showcases Gemini’s multimodal capabilities, enabling it to understand the context and content of photos to provide detailed information, such as identifying past camping locations or the expiry dates of vouchers.

To foster innovation, Google launched a competition for developers, with the winner of the best Gemini-powered app receiving an electric DeLorean, the iconic car from the 80s classic “Back to the Future.” To aid in this challenge, Google introduced Firebase Genkit, a new tool that simplifies the creation of AI-enabled API endpoints, making it easier for developers to integrate advanced AI features into their applications.

Moreover, Google announced that Project IDX, an AI-assisted workspace for full-stack, multiplatform app development in the cloud, is now open to the public. Project IDX supports a wide array of frameworks, languages, and services, and integrates seamlessly with popular Google products, streamlining the development workflow and enhancing productivity.

On the hardware front, Google revealed new advancements, including Trillium, the TPU processor generation designed to run AI models beyond the current Gemini large language model, and Axion, a new ARM-based CPU for data centers. These innovations underscore Google’s dedication to pushing the boundaries of AI and computing technology.

Finally, Google also introduced Veo, a generative video model designed to rival OpenAI’s Sora. Veo represents a significant leap forward in video generation technology, producing high-quality, 1080p resolution videos that extend beyond a minute and encompass a diverse range of cinematic and visual styles. Veo seems to excel at capturing the nuances and tones of prompts, offering an unprecedented level of creative control. It understands and implements various cinematic effects, such as time lapses and aerial shots, making it a versatile tool for different creative needs.

According to Google, Veo aims to democratize video production, making it accessible to everyone from seasoned filmmakers to aspiring creators and educators. This powerful model opens new possibilities for storytelling, education, and beyond, enabling users to bring their visions to life with greater ease and precision.

While Veo is undeniably impressive, showcasing significant advancements over the past year, it still seems to lag slightly behind OpenAI’s Sora in certain aspects. Nonetheless, Google’s commitment to pushing the boundaries of generative video technology is evident, and Veo promises to be a formidable player in the rapidly evolving landscape of AI-driven content creation.
