Robotic navigation assisted by large language models


Science & Technology, (Commonwealth Union) – Researchers have developed a method that employs language-based inputs rather than expensive visual data to guide a robot through a multi-step navigation task.

Enabling a robot to carry out even basic navigation functions has proved challenging.

Researchers indicated that for an AI agent, achieving seamless navigation is easier said than done. Current methods typically employ multiple hand-crafted machine learning models to address various aspects of the task, demanding extensive human effort and expertise. These techniques rely on visual representations for navigation decisions, requiring vast amounts of visual data for training, which is often difficult to obtain.

To address these challenges, researchers from MIT and the MIT-IBM Watson AI Lab developed a navigation method that transforms visual representations into language descriptions. These descriptions are then processed by a large language model that handles all components of the multistep navigation task.

Instead of encoding visual features from images of a robot’s surroundings, which is computationally demanding, their method generates text captions describing the robot’s point of view. A large language model uses these captions to predict the robot’s actions in response to language-based instructions.

This language-based approach allows the use of a large language model to efficiently create vast amounts of synthetic training data. Although this method does not outperform those using visual features, it excels in scenarios where visual data is insufficient for training. The researchers discovered that integrating their language-based inputs with visual signals enhances navigation performance.

“By purely using language as the perceptual representation, ours is a more straightforward approach. Since all the inputs can be encoded as language, we can generate a human-understandable trajectory,” says Bowen Pan, an electrical engineering and computer science (EECS) graduate student and a lead author of a paper on this approach.

Pan’s co-authors include his advisor, Aude Oliva, director of strategic industry engagement at the MIT Schwarzman College of Computing, MIT director of the MIT-IBM Watson AI Lab, and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory (CSAIL); Philip Isola, an associate professor of EECS and a member of CSAIL; senior author Yoon Kim, an assistant professor of EECS and a member of CSAIL; and colleagues at the MIT-IBM Watson AI Lab and Dartmouth College. The research is set to be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.

Given that large language models are the most powerful machine-learning models available, the researchers aimed to integrate them into the complex task of vision-and-language navigation, according to Pan.

However, these models are designed to process text-based inputs and cannot directly handle visual data from a robot’s camera. Therefore, the team had to devise a way to use language as an intermediary.

Their approach involves employing a simple captioning model to generate text descriptions of the robot’s visual observations. These captions are then combined with language-based instructions and input into a large language model, which determines the next navigation step for the robot.

The large language model generates a caption describing the scene the robot should observe after completing each step. This caption updates the trajectory history, enabling the robot to track its progress.

The model iterates these steps to create a trajectory guiding the robot towards its goal, one step at a time.
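As a rough, non-authoritative sketch of this loop, the Python below assumes hypothetical stand-in functions (`caption_image` for the captioning model, `query_llm` for the large language model, and `get_observation` and `execute_action` for the robot’s interface); none of these names come from the paper, and the prompt wording is illustrative only.

```python
# Minimal sketch of the caption-then-decide navigation loop described above.
# All function arguments are hypothetical stand-ins, not the researchers' code.

def navigate(instruction, get_observation, execute_action,
             caption_image, query_llm, max_steps=20):
    trajectory_history = []  # running text record of actions and expected views

    for step in range(max_steps):
        # 1. Describe the robot's current view in plain language.
        caption = caption_image(get_observation())

        # 2. Ask the language model for the next action, given the instruction,
        #    the history so far, and the current scene description.
        prompt = (
            f"Instruction: {instruction}\n"
            f"History: {' '.join(trajectory_history)}\n"
            f"Current view: {caption}\n"
            "What should the robot do next?"
        )
        action = query_llm(prompt)

        if action.strip().lower() == "stop":
            break

        # 3. Execute the chosen action, then record it together with a predicted
        #    caption of the scene the robot should observe after the step.
        execute_action(action)
        expected_view = query_llm(
            f"Describe the scene the robot should observe after: {action}"
        )
        trajectory_history.append(f"Step {step + 1}: {action}. {expected_view}")

    return trajectory_history
```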

To streamline the process, researchers developed templates to standardize how observation information is presented to the model. This information is formatted as a series of choices the robot can make based on its surroundings.

For example, a caption might state: “To your 30-degree left is a door with a potted plant beside it; to your back is a small office with a desk and a computer.” The model then decides whether the robot should move toward the door or the office.
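As a hedged illustration of that template idea, the short sketch below formats a list of (direction, description) observations as numbered choices for the model to pick from; the function name `format_choices` and the exact wording are assumptions for demonstration, not the researchers’ actual template.

```python
# Illustrative only: one possible way to turn surrounding observations into a
# multiple-choice prompt, in the spirit of the templates described above.

def format_choices(observations):
    """observations: list of (direction, description) pairs."""
    lines = ["You can choose one of the following directions:"]
    for i, (direction, description) in enumerate(observations, start=1):
        lines.append(f"({i}) To your {direction} is {description}.")
    lines.append("Which option should the robot take?")
    return "\n".join(lines)


print(format_choices([
    ("30-degree left", "a door with a potted plant beside it"),
    ("back", "a small office with a desk and a computer"),
]))
```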

“One of the biggest challenges was figuring out how to encode this kind of information into language in a proper way to make the agent understand what the task is and how they should respond,” explained Pan.

When they put this approach to the test, they found that although it did not outperform vision-based techniques, it provided several advantages.

The findings mark a promising step for AI-driven navigation in robots, particularly in settings where visual training data is scarce.
