Robotic navigation assisted by large language models


Science & Technology, (Commonwealth Union) – Researchers have developed a method that employs language-based inputs rather than expensive visual data to guide a robot through a multi-step navigation task.

Enabling a robot to carry out even basic navigation functions has proved challenging.

Researchers indicated that for an AI agent, achieving seamless navigation is easier said than done. Current methods typically employ multiple hand-crafted machine learning models to address various aspects of the task, demanding extensive human effort and expertise. These techniques rely on visual representations for navigation decisions, requiring vast amounts of visual data for training, which is often difficult to obtain.

To address these challenges, researchers from MIT and the MIT-IBM Watson AI Lab developed a navigation method that transforms visual representations into language descriptions. These descriptions are then processed by a large language model that handles all components of the multistep navigation task.

Instead of encoding visual features from images of a robot’s surroundings, which is computationally demanding, their method generates text captions describing the robot’s point of view. A large language model uses these captions to predict the robot’s actions in response to language-based instructions.

This language-based approach allows the use of a large language model to efficiently create vast amounts of synthetic training data. Although this method does not outperform those using visual features, it excels in scenarios where visual data is insufficient for training. The researchers discovered that integrating their language-based inputs with visual signals enhances navigation performance.

“By purely using language as the perceptual representation, ours is a more straightforward approach. Since all the inputs can be encoded as language, we can generate a human-understandable trajectory,” says Bowen Pan, an electrical engineering and computer science (EECS) graduate student and a lead author of a paper on this approach.

Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Philip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and colleagues at the MIT-IBM Watson AI Lab and Dartmouth College. The research is set to be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.

Given that large language models are the most powerful machine-learning models available, the researchers aimed to integrate them into the complex task of vision-and-language navigation, according to Pan.

However, these models are designed to process text-based inputs and cannot directly handle visual data from a robot’s camera. Therefore, the team had to devise a way to use language as an intermediary.

Their approach involves employing a simple captioning model to generate text descriptions of the robot’s visual observations. These captions are then combined with language-based instructions and input into a large language model, which determines the next navigation step for the robot.

The large language model generates a caption describing the scene the robot should observe after completing each step. This caption updates the trajectory history, enabling the robot to track its progress.

The model iterates these steps to create a trajectory guiding the robot towards its goal, one step at a time.
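As a rough, non-authoritative sketch of this loop, the Python below assumes hypothetical stand-in functions (`caption_image` for the captioning model, `query_llm` for the large language model, and `get_observation` and `execute_action` for the robot’s interface); none of these names come from the paper, and the prompt wording is illustrative only.

```python
# Minimal sketch of the caption-then-decide navigation loop described above.
# All function arguments are hypothetical stand-ins, not the researchers' code.

def navigate(instruction, get_observation, execute_action,
             caption_image, query_llm, max_steps=20):
    trajectory_history = []  # running text record of actions and expected views

    for step in range(max_steps):
        # 1. Describe the robot's current view in plain language.
        caption = caption_image(get_observation())

        # 2. Ask the language model for the next action, given the instruction,
        #    the history so far, and the current scene description.
        prompt = (
            f"Instruction: {instruction}\n"
            f"History: {' '.join(trajectory_history)}\n"
            f"Current view: {caption}\n"
            "What should the robot do next?"
        )
        action = query_llm(prompt)

        if action.strip().lower() == "stop":
            break

        # 3. Execute the chosen action, then record it together with a predicted
        #    caption of the scene the robot should observe after the step.
        execute_action(action)
        expected_view = query_llm(
            f"Describe the scene the robot should observe after: {action}"
        )
        trajectory_history.append(f"Step {step + 1}: {action}. {expected_view}")

    return trajectory_history
```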

To streamline the process, researchers developed templates to standardize how observation information is presented to the model. This information is formatted as a series of choices the robot can make based on its surroundings.

For example, a caption might state: “To your 30-degree left is a door with a potted plant beside it; to your back is a small office with a desk and a computer.” The model then decides whether the robot should move toward the door or the office.
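As a hedged illustration of that template idea, the short sketch below formats a list of (direction, description) observations as numbered choices for the model to pick from; the function name `format_choices` and the exact wording are assumptions for demonstration, not the researchers’ actual template.

```python
# Illustrative only: one possible way to turn surrounding observations into a
# multiple-choice prompt, in the spirit of the templates described above.

def format_choices(observations):
    """observations: list of (direction, description) pairs."""
    lines = ["You can choose one of the following directions:"]
    for i, (direction, description) in enumerate(observations, start=1):
        lines.append(f"({i}) To your {direction} is {description}.")
    lines.append("Which option should the robot take?")
    return "\n".join(lines)


print(format_choices([
    ("30-degree left", "a door with a potted plant beside it"),
    ("back", "a small office with a desk and a computer"),
]))
```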

“One of the biggest challenges was figuring out how to encode this kind of information into language in a proper way to make the agent understand what the task is and how they should respond,” explained Pan.

When they put this approach to the test, they found that although it did not outperform vision-based techniques, it provided several advantages.

The findings mark a promising step for AI-driven navigation in robots, particularly in settings where visual training data is scarce.
