A Vision-language Model and Reinforcement Learning-based Zero-shot Action Generation Framework for an Autonomous Robot

  • Dae Myeong Hong
  • Doukhi Oualid
  • Jae Ho Lee
  • Linfeng Wang
  • Deok Jin Lee*

*Corresponding author for this work

Research output: Contribution to journal › Journal article › Peer-review

Abstract

This paper presents a novel methodology for zero-shot action generation using vision-language models. By combining bootstrapping language-image pre-training (BLIP) with reinforcement learning, specifically proximal policy optimization (PPO), we establish an approach that maximizes the similarity between text-image pairs to determine optimal robotic actions. The methodology generalizes effectively from simulation to real-world scenarios for tasks such as human recognition and stair detection. In simulation, the robot detected a human with a similarity of 0.86 in 18.4 s and identified stairs with a similarity of 0.87 in 19.2 s; in real-world experiments, it achieved a similarity of 0.88 for human recognition in 20.3 s and 0.89 for stair recognition in 21.1 s, confirming the robustness and adaptability of the framework across diverse environments.
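The core loop the abstract describes (score the current camera view against a language prompt with a BLIP-style model, then let a PPO agent maximize that score as its reward) can be sketched compactly. The following is a minimal illustrative sketch, not the authors' implementation: `vl_similarity` is a stub standing in for the BLIP image-text score, the environment is a toy 1-D approach task, and the PPO update is reduced to a simplified clipped-surrogate step on a tabular softmax policy. All names and hyperparameters here are assumptions for illustration.

```python
"""Sketch: a vision-language similarity score used as the RL reward.

Assumed, not from the paper: vl_similarity() stubs the BLIP image-text
score; the environment is a toy 1-D corridor; PPO is reduced to a
tabular clipped-surrogate update without a critic.
"""
import numpy as np

rng = np.random.default_rng(0)

N_STATES, N_ACTIONS = 10, 2        # positions 0..9; actions: 0 = left, 1 = right
TARGET = 9                          # state whose "view" best matches the prompt
CLIP_EPS, LR, GAMMA = 0.2, 0.1, 0.99

logits = np.zeros((N_STATES, N_ACTIONS))  # tabular softmax policy parameters

def vl_similarity(state: int, prompt: str) -> float:
    """Stand-in for a BLIP image-text matching score in [0, 1]:
    highest when the robot's view (its state) matches the prompt."""
    return max(0.0, 1.0 - abs(state - TARGET) / N_STATES)

def policy(state: int) -> np.ndarray:
    p = np.exp(logits[state] - logits[state].max())
    return p / p.sum()

def rollout(prompt: str, horizon: int = 20):
    """Collect (state, action, old_prob, reward); reward IS the similarity."""
    s, traj = 0, []
    for _ in range(horizon):
        probs = policy(s)
        a = rng.choice(N_ACTIONS, p=probs)
        s_next = int(np.clip(s + (1 if a == 1 else -1), 0, N_STATES - 1))
        traj.append((s, a, probs[a], vl_similarity(s_next, prompt)))
        s = s_next
    return traj

def ppo_update(traj) -> None:
    # Discounted returns serve as a crude advantage estimate (no critic).
    G, returns = 0.0, []
    for *_, r in reversed(traj):
        G = r + GAMMA * G
        returns.append(G)
    returns = np.asarray(returns[::-1])
    adv = (returns - returns.mean()) / (returns.std() + 1e-8)
    for (s, a, old_p, _), A in zip(traj, adv):
        ratio = policy(s)[a] / old_p
        # Clipped surrogate: step only while the ratio stays in the trust region.
        if (A > 0 and ratio < 1 + CLIP_EPS) or (A < 0 and ratio > 1 - CLIP_EPS):
            grad = np.eye(N_ACTIONS)[a] - policy(s)   # d log pi(a|s) / d logits
            logits[s] += LR * A * grad

for _ in range(200):
    ppo_update(rollout("a person standing ahead"))

print(f"P(move right | start) after training: {policy(0)[1]:.2f}")
```

In a real system, `vl_similarity` would wrap a pre-trained BLIP image-text matching head over camera frames, and the tabular policy would be the robot's actual PPO policy network; what carries over from the paper's framing is using the text-image similarity itself as the reward signal.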

Original language: English
Pages (from-to): 621-628
Number of pages: 8
Journal: Journal of Institute of Control, Robotics and Systems
Volume: 31
Issue number: 6
DOIs
State: Published - 2025

Keywords

  • BLIP
  • PPO (Proximal Policy Optimization)
  • robotic actions
  • Sim2Real
  • vision-language models
  • zero-shot learning

Quacquarelli Symonds (QS) Subject Topics

  • Computer Science & Information Systems
  • Mathematics
