Researchers from Google Robotics and the Technical University of Berlin have developed PaLM-E, a visual language model trained on text and image data that takes commands and translates them into instructions for the individual components of a robot, such as a gripper arm. PaLM-E is an acronym of "Pathways Language Model" and "Embodied"; the Pathways Language Model (PaLM) is Google's latest language model.

According to the research paper, it is currently the largest visual language model, with 562 billion parameters. The "E" for "Embodied" indicates that this version of the Pathways Language Model was trained for the tasks and modalities of a robot; the language model is thus "embodied" in the form of a robot.

The research paper on PaLM-E describes an application example: after the command "Bring me rice chips from the drawer", a white platform equipped with a gripper and a camera moves through a room and stops at the drawer. The gripper arm then extends, opens the drawer, grabs a pack of rice chips and places it on the tabletop. According to Google, all navigation instructions come from the language model itself.
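The paper does not publish the control interface, but the basic loop can be pictured roughly as in the sketch below. Everything here is hypothetical (`plan_and_execute`, `camera`, `language_model`, `robot` are invented names); it only illustrates the idea that the language model itself emits each next step, which the robot's low-level controllers then execute.

```python
def plan_and_execute(command, camera, language_model, robot):
    """Hypothetical loop: ask the model for the next step, execute it, repeat."""
    history = []
    while True:
        observation = camera.capture()  # current camera image of the scene
        # The (assumed) model call returns a short instruction such as
        # "go to drawer", "open drawer", "pick up rice chips" or "done".
        step = language_model.next_step(command, observation, history)
        if step == "done":
            break
        robot.execute(step)             # low-level controllers carry out the step
        history.append(step)


# Example command from the paper's demo (hypothetical objects):
# plan_and_execute("Bring me rice chips from the drawer",
#                  camera, palm_e, mobile_manipulator)
```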

Unlike its predecessor PaLM, which was trained only on text, PaLM-E draws its training data from text, images and readings from other robot sensors. The inputs to PaLM-E are "multimodal sentences": combinations of text and image data that come, for example, from the robot's camera. A multimodal sentence might read "What happened between <img_1> and <img_2>?", where "<img>" stands for an image file. The already trained visual language model provides an answer to this. If the input is a command such as "Bring me rice chips from the drawer", the language model can generate a series of decisions that let the robot navigate through the room and perform actions.
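How such an interleaved input could be assembled is sketched below. This is not Google's code: `text_embedder` and the precomputed `image_embeddings` are assumptions standing in for the language model's word embeddings and the output of a vision encoder.

```python
import torch

def build_multimodal_sentence(tokens, image_embeddings, text_embedder):
    """
    Sketch of a "multimodal sentence": wherever the token sequence contains
    the placeholder "<img>", the matching image embedding is spliced in
    between the text token embeddings, so the language model receives one
    interleaved sequence of vectors instead of plain text.
    """
    parts = []
    images = iter(image_embeddings)             # one vector per "<img>" placeholder
    for token in tokens:
        if token == "<img>":
            parts.append(next(images))          # vision-encoder output, shape (d_model,)
        else:
            parts.append(text_embedder(token))  # ordinary word embedding, shape (d_model,)
    return torch.stack(parts)                   # shape (sequence_length, d_model)


# Hypothetical usage for the example above:
# build_multimodal_sentence(
#     ["What", "happened", "between", "<img>", "and", "<img>", "?"],
#     [frame_1_embedding, frame_2_embedding],
#     word_embedding_lookup,
# )
```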

According to Google, PaLM-E achieves the highest score currently reported on OK-VQA, a benchmark that tests a model's accuracy with 14,005 open-ended questions about images. PaLM-E is applicable to multiple robot types and multiple modalities, such as image data from a camera or the position of a mounted gripper arm. The artificial intelligence has both visual and linguistic abilities: it can label pictures and detect objects in a room, but also quote poems.
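Because the questions are open-ended, answers are compared against those of human annotators. The soft accuracy metric commonly used by VQA-style benchmarks looks roughly like the sketch below; this is an illustration, not the official evaluation script.

```python
def vqa_soft_accuracy(predicted, human_answers):
    """
    Soft accuracy as used in VQA-style benchmarks: a predicted answer gets
    full credit if at least three human annotators gave the same answer,
    and partial credit otherwise.
    """
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Example: 4 of 10 annotators answered "rice chips" -> accuracy 1.0
# vqa_soft_accuracy("rice chips", ["rice chips"] * 4 + ["snack"] * 6)
```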

With language models, robotics researchers are trying to build systems that do not depend on task-specific training. The result would be robots that can navigate unstructured and changing environments, for example to complete everyday tasks such as cleaning.

Such an everyday robot requires "positive knowledge transfer": the ability to apply skills and knowledge from a learned task to an unfamiliar one, so that what has already been learned makes new skills easier to acquire. The skills a musician develops while playing the guitar, for instance, make it easier to later learn the violin. According to the research paper, PaLM-E exhibits this property: "PaLM-E, when trained on different tasks and datasets simultaneously, leads to significantly higher performance compared to models trained separately on individual tasks."
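What "trained on different tasks and datasets simultaneously" could look like in practice is sketched below. The `model.loss` interface and the task mixture are assumptions; the point is only that one shared set of weights sees batches from robotics, captioning and question-answering data in alternation.

```python
import random

def joint_training_step(model, optimizer, dataloaders):
    """
    Hypothetical multi-task training step: sample a batch from a randomly
    chosen task so that gradients from every dataset update the same,
    shared model parameters.
    """
    task = random.choice(list(dataloaders))   # e.g. "robot_planning", "vqa", "captioning"
    batch = next(dataloaders[task])
    loss = model.loss(batch)                  # assumed loss interface of the shared model
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return task, loss.item()
```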

In addition, PaLM-E shows "emergent capabilities" that do not obviously follow from the relationships and patterns in its training data, for example "the ability to reach a conclusion across multiple frames, even though the model is only trained with prompts showing a single frame".

