Naturalistic Speech-Guided Semantic Segmentation for Assistive Wearable Systems


Voice is the most natural hands-free modality for interacting with a wearable assistive system. A visually impaired user navigating an unfamiliar room should be able to simply say “door” or “chair” and immediately see the relevant objects highlighted—without memorising codes or navigating menus.

Wang et al. [1] recently demonstrated an audio-controlled edge semantic segmentation system that takes a step in this direction: users issue voice commands to select which object class is segmented from a live camera feed, with the result displayed as an edge overlay suitable for visual neuroprostheses. The system runs in real time on an NVIDIA Jetson AGX Orin. However, due to the lack of a suitable spoken object-name dataset, the current interface requires users to speak a digit mapped to each class (e.g., “five” → door), which the authors identify as a key limitation to be addressed in future work.

This project proposes to close that gap by integrating OpenAI’s Whisper automatic speech recognition (ASR) model [2] into the pipeline. Whisper is a large-scale Transformer trained on 680,000 hours of multilingual audio and can transcribe arbitrary spoken words out of the box, so in many cases it should handle class-name recognition without any task-specific fine-tuning. Should fine-tuning prove necessary (e.g., for robustness to environmental noise or domain-specific vocabulary), our group has direct experience adapting the Whisper architecture to novel audio tasks [3].
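To make the proposed integration concrete, the sketch below shows one way a spoken command could be transcribed with the open-source openai-whisper package and mapped to a segmentation class. The class list, audio path, checkpoint choice, and fuzzy-matching threshold are illustrative assumptions for this proposal, not details of the system in [1].

```python
# Minimal sketch: map a spoken command to an object class via Whisper.
# The class list, audio file, and matching threshold are placeholders.
import difflib
import whisper

# Hypothetical subset of classes the segmentation model can produce.
CLASS_NAMES = ["door", "chair", "table", "person", "stairs"]

# Small English-only checkpoint; larger models trade latency for accuracy.
model = whisper.load_model("base.en")


def spoken_command_to_class(audio_path: str) -> str | None:
    """Transcribe a short voice command and return the closest class name."""
    result = model.transcribe(audio_path)
    text = result["text"].strip().lower()
    # Fuzzy match so minor mis-transcriptions ("doors", "the chair") still resolve.
    matches = difflib.get_close_matches(text, CLASS_NAMES, n=1, cutoff=0.6)
    return matches[0] if matches else None


if __name__ == "__main__":
    target = spoken_command_to_class("command.wav")
    print(f"Selected class: {target}")
```

In practice the recognised class name would replace the digit lookup in the existing pipeline, leaving the segmentation and edge-overlay stages unchanged.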

Contact

jayers (at) ethz.ch