Abstract
This paper presents a novel methodology for zero-shot action generation using vision-language models. By combining bootstrapping language-image pre-training (BLIP) with reinforcement learning via proximal policy optimization (PPO), we establish an approach that maximizes the similarity between text-image pairs to determine optimal robotic actions. The methodology generalizes effectively in simulation-to-real-world (Sim2Real) scenarios for tasks such as human recognition and stair detection. In simulation, the robot detected a human with 0.86 similarity in 18.4 s and identified stairs with 0.87 similarity in 19.2 s; in real-world experiments, it achieved 0.88 similarity for human recognition in 20.3 s and 0.89 similarity for stair recognition in 21.1 s, confirming the robustness and adaptability of the framework in diverse environments.
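The core idea of the abstract, scoring candidate actions by the similarity between a text prompt and the image each action would yield, can be sketched as below. This is a minimal, hedged illustration: the toy embeddings and the `best_action` helper are stand-ins invented for this example, whereas the actual framework obtains embeddings from BLIP and trains the action policy with PPO rather than greedy selection.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def best_action(text_emb, action_view_embs):
    """Pick the action whose resulting camera-view embedding best
    matches the text-prompt embedding. In the paper's setting this
    similarity score would also serve as the PPO reward signal."""
    scores = {a: cosine_similarity(text_emb, e)
              for a, e in action_view_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy embeddings standing in for BLIP text/image encoder outputs.
text_emb = [0.9, 0.1, 0.2]          # e.g. embedding of "a person ahead"
views = {
    "move_forward": [0.8, 0.2, 0.1],  # view after moving forward
    "turn_left":    [0.1, 0.9, 0.3],  # view after turning left
}
action, score = best_action(text_emb, views)
```

Here `move_forward` is selected because its view embedding is most aligned with the prompt; in the full framework, PPO would learn a policy that maximizes this similarity-based reward over time rather than picking greedily at each step.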
| Original language | English |
|---|---|
| Pages (from-to) | 621-628 |
| Number of pages | 8 |
| Journal | Journal of Institute of Control, Robotics and Systems |
| Volume | 31 |
| Issue number | 6 |
| DOIs | |
| State | Published - 2025 |
Keywords
- BLIP
- PPO (Proximal Policy Optimization)
- robotic actions
- Sim2Real
- vision-language models
- zero-shot learning
Quacquarelli Symonds (QS) Subject Topics
- Computer Science & Information Systems
- Mathematics
Title: A Vision-language Model and Reinforcement Learning-based Zero-shot Action Generation Framework for an Autonomous Robot