Google’s New AI: Translating Vision and Language into Robotic Actions

Google DeepMind introduces RT-2, a new vision-language-action model aimed at advancing robotics. Trained on text and images from the web, RT-2 can directly output robotic actions, essentially "speaking robot".

Key Points:

  • RT-2 is a vision-language-action model trained on web-based text and images, capable of directly outputting robotic actions.
  • Unlike chatbots, robots need "grounding" in the real world and in their own capabilities: they must recognize and understand objects in context and, most importantly, know how to interact with them.
  • RT-2 allows a single model both to perform the complex reasoning seen in foundation models and to output robotic actions. It demonstrates that, with a small amount of robot training data, the system can transfer concepts embedded in its language and vision training data into direct robot actions, even for tasks it has never been trained on (see the sketch after this list).
  • RT-2 shows tremendous promise for more general-purpose robots. While significant work remains before robots can be helpful in human-centered environments, RT-2 brings an exciting future for robotics within reach.
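
To make "directly outputting robotic actions" concrete: in the RT-2 paper, an action is serialized as a short string of discretized integer tokens, so the same model that emits words can emit actions. Below is a minimal, illustrative Python sketch of decoding such a token string back into continuous robot commands. The 256-bin discretization follows the paper, but the field layout, value ranges, and function names are simplified assumptions for illustration, not DeepMind's actual code.

```python
# Illustrative sketch (not DeepMind's code): RT-2 represents a robot action
# as a short string of discretized integer tokens, letting a language model
# emit actions the same way it emits words.

NUM_BINS = 256  # bin count per the RT-2 paper; field layout below is assumed

def decode_action(token_string: str) -> dict:
    """Map a model-emitted token string back to continuous robot commands."""
    tokens = [int(t) for t in token_string.split()]
    terminate, dx, dy, dz, droll, dpitch, dyaw, gripper = tokens

    def debin(token: int, low: float, high: float) -> float:
        # Undo uniform discretization: token in [0, NUM_BINS) -> value in [low, high]
        return low + (token / (NUM_BINS - 1)) * (high - low)

    return {
        "terminate": bool(terminate),
        # Hypothetical workspace ranges, in meters and radians
        "delta_position": [debin(t, -0.1, 0.1) for t in (dx, dy, dz)],
        "delta_rotation": [debin(t, -0.5, 0.5) for t in (droll, dpitch, dyaw)],
        "gripper": debin(gripper, 0.0, 1.0),  # 0 = closed, 1 = open
    }

# Example: a token string a model might output when "speaking robot"
print(decode_action("0 128 91 241 5 101 127 217"))
```

Representing actions as ordinary text is what lets one model handle both reasoning and control: fine-tuning on robot trajectories teaches it a new "dialect" rather than requiring a separate output head.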

For more information, visit the full article on the Google DeepMind Blog.
