Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis

Gupta, Sandeep; Mamodiya, Udit; Al-Gburi, Ahmed J. A.

doi:10.3390/automation6030025

Open AccessArticle

Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis

by

Sandeep Gupta

¹

,

Udit Mamodiya

²

and

Ahmed J. A. Al-Gburi

^3,*

¹

Department of Artificial Intelligence and Data Science, Poornima Institute of Engineering and Technology, Jaipur 302022, India

²

Faculty of Engineering and Technology, Poornima University, Jaipur 303905, India

³

Centre for Telecommunication Research & Innovation (CeTRI), Fakulti Teknologi Dan Kejuruteraan Elektronik Dan Konputer (FTKEK), Universitu Teknikal Malaysia Melaka (UTeM), Jalan Hang Tuah Jaya, Durian Tunggal 76100, Melaka, Malaysia

^*

Author to whom correspondence should be addressed.

Automation 2025, 6(3), 25; https://doi.org/10.3390/automation6030025

Submission received: 28 April 2025 / Revised: 6 June 2025 / Accepted: 18 June 2025 / Published: 24 June 2025

(This article belongs to the Section Robotics and Autonomous Systems)

Download

Browse Figures

Versions Notes

Abstract

This paper describes an innovative wireless mobile robotics control system based on speech recognition, where the ESP32 microcontroller is used to control motors, facilitate Bluetooth communication, and deploy an Android application for the real-time speech recognition logic. With speech processed on the Android device and motor commands handled on the ESP32, the study achieves significant performance gains through distributed architectures while maintaining low latency for feedback control. In experimental tests over a range of 1–10 m, stable 110–140 ms command latencies, with low variation (±15 ms) were observed. The system’s voice and manual button modes both yield over 92% accuracy with the aid of natural language processing, resulting in training requirements being low, and displaying strong performance in high-noise environments. The novelty of this work is evident through an adaptive keyword spotting algorithm for improved recognition performance in high-noise environments and a gradual latency management system that optimizes processing parameters in the presence of noise. By providing a user-friendly, real-time speech interface, this work serves to enhance human–robot interaction when considering future assistive devices, educational platforms, and advanced automated navigation research.

Keywords:

speech recognition; mobile robotics; ESP32; Bluetooth communication; android interface; voice control systems; human–robot interaction; Vosk speech recognition; real-time control; distributed processing; embedded systems

1. Introduction

The recent progress in the field of microelectronics [1,2], artificial intelligence [3,4], and wireless communications [5,6] has fostered the interest in designing speech actuated control devices for use on mobile robotic platforms [7,8]. In contrast, the navigation of robotic systems is usually conducted through gamepad controllers, joysticks, or graphical interfaces. While these methods are well established and effective, they typically require highly accurate user-defined manual input and sustained operator attention [9,10]. A more intuitive and inclusive alternative is voice control, with language being the main control interface [11,12]. It reduces the barrier to entry for users with limited mobility or a preference for natural interaction patterns, extending the versatility of mobile robotics [13].

Roboticists have taken many routes to embed speech recognition on robotic platforms as commercial grade hardware has advanced in speed, RAM, and network bandwidth at affordable prices [14,15]. By combining a low-cost, high-processing power ESP32 microcontroller with integrated Bluetooth capability [16], sophisticated wireless control can be achieved for this application [17,18]. Nonetheless, deploying a reliable and efficient solution requires some trade-offs such as low end-to-end latency, robustness to background noise, and protocol simplifications for real-time command responses [19,20,21].

In this regard, this study develops a framework that addresses these major challenges through the division of processing tasks between an ESP32 microcontroller with Bluetooth connectivity and an Android device with offline speech recognition methods. These two components operate synergistically: the Android device interprets voice inputs and converts them into simple commands, which it sends to the ESP32, which actuates the motors. This division of labor offloads processing from the microcontroller, and increases overall responsiveness. Early systems tended to keep speech recognition and motion control as decoupled modules. In contrast, the hierarchically danced method presented here actively combines these parts, permitting a more integrated management and optimization of the entire pipeline, spanning the stages from language to motors.

Figure 1 shows the overall architecture, showing the respective responsibilities of the microcontroller and mobile device. A simple one-byte protocol exists for functions: forward, backward, left and right, stop, all directly understood by the system from voice commands. This makes it easy for users to transition from voice commands to button inputs, all through the same Android interface, thanks to the Bluetooth-based structure.

In general, this work contributes well to voice-controlled robots. The design takes advantage of the robust, low-latency communications; it incorporates a new adaptive keyword spotting mechanism; and it deploys progressive latency management to accommodate varying ambient conditions.

The rest of this document is structured as follows: Section 2 reviews the relevant literature, focusing on preceding work in mobile robotics, speech recognition, and human–robot interaction. Section 3 explains the theoretical principles and models upon which this study is built. Section 4 describes the system architecture in detail in both hardware and software aspects, which provides clarity on the distributed processing approaches. Section 5 explains the implementation details and shows operational aspects. Experimental results in Section 6 evaluate metrics such as command latency and speech accuracy. Section 7 presents a discussion of results and their implications. Section 8 discusses the potential applications and future directions for extending this work. Lastly, Section 9 provides a detailed conclusion.

2. Literature Review

Robot control using speech recognition is an interdisciplinary field. This literature review summarizes various technologies used in achieving this functionality:

2.1. Mobile Robotics Control Systems

The work by Matarić [22] laid the groundwork for the principles of timely command processing for mobile robot navigation needs in dynamic environments. Stability of control was found to be affected tremendously by communication delays, highlighting the essentiality of online (real-time) feedback mechanisms.

Subsequent contributions by Siegwart et al. [23] contributed to the autonomous navigation debate, demonstrating that high-level human commands can be addressed by reactive control algorithms. In parallel, Khamis et al. [24] innovated upon resource-limited control structures, demonstrating that purposeful algorithmic optimizations could drive autonomous navigation on microcontroller-based platforms. Such findings have direct relevance for this study, given that the ESP32 must execute motor control responsibilities without significant computational overhead.

The evolution of mobile robot control systems has been further enhanced by developments in microcontroller technology. As noted by Hakkı [25], the ESP32 series of SoC microcontrollers provides an ideal platform for resource-constrained applications requiring wireless connectivity.

2.2. Speech Recognition in Embedded and Mobile Systems

Recent advances in speech recognition have seen the emergence of transformer-based architectures such as Conformer and Whisper models, which have achieved state-of-the-art performance on benchmark datasets [26,27]. However, these approaches typically require significant computational resources, making them less suitable for embedded robotics applications [28,29]. ResNet-based approaches and multimodal fusion techniques, including 2DCNN + BiLSTM + ResNet + MLF architectures and AVCRFormer with spatio-temporal fusion [30,31], demonstrate advanced integration capabilities but at the cost of increased computational complexity [32,33]. Our lightweight, embedded approach represents a practical trade-off between computational efficiency and accuracy for real-time robotics applications, positioning itself as a viable alternative to these computationally intensive SOTA methods.

Speech recognition technology has matured rapidly, transforming from large-scale server-based processing to compact, offline solutions. Këpuska and Bohouta [34] offered a panoramic review of techniques and commercial products, addressing performance trade-offs, including accuracy, speed, and hardware constraints. He et al. [35] further validated that optimized neural network architectures running on consumer-grade hardware can deliver near real-time results.

These improvements have diminished the latency and reliability issues that previously deterred real-time speech applications. Such streamlined on-device recognition paves the way for systems like the one presented here, which aims to operate without reliance on cloud-based speech servers.

The challenges of noise-robust automatic speech recognition were comprehensively analyzed by Li et al. [36], providing critical insights for systems operating in variable acoustic environments. Advances in offline speech processing, as implemented in libraries like Vosk [37], have made it feasible to perform reliable speech recognition directly on mobile devices without requiring cloud connectivity, addressing limitations identified in previous approaches such as that of Deuerlein et al. [38].

2.3. Bluetooth Communication in Robotic Applications

Bluetooth is common for consumer-level wireless integration and is easy to pair with individual devices. However, as discussed by Tosi et al. [39], both Bluetooth Classic and Bluetooth Low Energy (BLE) are susceptible to latencies that can vary and must be compensated in the context of robotic control. Darroudi and Gomez [40] provide further insights into Bluetooth usage over power consumption, coverage limitations, and how interference can be controlled, which emphasizes the necessity of maikng an optimal choice of Bluetooth parameters.

Studies applied to real-world applications, like Gomez et al. [18], demonstrate how to adaptively change packet structures and transmission intervals for improved reliability and throughput. These factors are critical when implementing a Bluetooth-based command conveyance for robotics as even minor dropouts can affect perceived responsiveness and impact user confidence.

The performance evaluation of Bluetooth Low Energy by Tosi et al. [39] provides essential metrics for designing responsive control systems, while the comprehensive overview by Gomez et al. [18] offers valuable insights into the technical constraints and optimizations necessary for effective wireless robotic control.

2.4. Human–Robot Interaction Through Speech

In the perspective of speech-controlled robotics, the human-centered dimension is becoming increasingly important. For instance, Mavridis [41] demonstrated how robots could generate more nuanced interpretations through the utilization of both verbal and nonverbal cues, boosting communicative density. Wang et al. [42] concluded that user satisfaction increases with consistent feedback without delays.

More recent advances by Wang et al. [43] have explored the impact of multimodal human–robot interaction in manufacturing contexts, highlighting the importance of integrating speech with other interaction modalities. The comprehensive review by van Den Broek and Moeslund [44] provides valuable insights into proactive human–robot interaction approaches that inform our system design.

Medicherla and Sekmen [45] specifically addressed voice-controllable intelligent user interfaces for human–robot interaction, demonstrating early approaches to integrating speech commands with robotic control systems. Building on this foundation, Poncela and Gallardo-Estrella [46] investigated command-based voice teleoperation of mobile robots, providing valuable insights into the practical implementation challenges that inform our current work.

Recent work by Naeem et al. [47] on voice-controlled humanoid robots demonstrates the continuing evolution of speech interfaces for robotic control, with particular attention to natural language understanding capabilities that align with our adaptive keyword spotting approach.

2.5. Integration of Speech Recognition with Mobile Robotics

Natural language processing for guiding mobile robot behaviors has come a long way. Tellex et al. [48] made significant advances toward building robots that can process high-level commands in detailed environments with given spatial contexts and show the promise of better language understanding creating more autonomous and flexible navigation.

On a more limited scale, Srivastava and Singh [49] demonstrated basic control of a robot via smartphone-based speech inputs, though their approach was dependent on cloud services for recognition, which introduces potential connectivity issues. Korayem et al. [50] developed a voice command recognition system for human–robot interaction that addressed some of these limitations by focusing on local processing.

The integration of voice control with robotic systems for educational purposes was explored by Sharma et al. [51], investigating the effectiveness of voice-controlled robots in educational contexts. Chakravarthy et al. [52] extended voice control to robotic arms for prosthetic applications, demonstrating the versatility of speech-based interfaces across different robotic domains.

More recently, Venkatraman et al. [53] explored smart home automation using voice control systems, highlighting the broader applicability of the techniques we develop in this paper. The work by Esposito [54] on analyzing and preventing self-issued voice commands addresses important security considerations for voice-controlled systems.

Overall, our review of the literature highlights the need for robust, low-latency speech recognition and effective communication protocols. The sections that follow outline the theoretical and practical foundations of this study, building upon and extending the steps taken by these previous works.

3. Theoretical Background

This research applies a solid theoretical frame to conceptualizing and designing a real-time speech-controlled robot. The framework is based on principles from many different disciplines such as distributed computing, classical control theory, communication protocols, adaptive algorithms for speech processing, and user interaction.

3.1. Distributed Processing Architecture

A distributed processing architecture was designed and workload was divided between mobile device and ESP32 microcontroller based on their capabilities. Thus, the principles of distributed computing were followed by allocating different tasks to processing nodes which are capable, power-limited, or close to required resources [55].

Hence in this design, the implementation of speech recognition process (relatively resource heavy) is carried out on the Android device which has the required hardware power and microphone access. The time-critical motor control logic is executed on the ESP32 with direct hardware access to the motor actuation interfaces. This inherently reduces the latency of the control loop, whilst allowing the mobile device’s more powerful processing capability to be harnessed for more complex pattern recognition tasks. A Bluetooth connection acts as a mediator of communication between the two processing nodes, which introduces its own latency and reliability considerations. The total system latency can be modeled using Equation (1):

T_{t o t a l} = T_{s p e e c h} + T_{c o m m} + T_{c o n t r o l}

(1)

where

T_{s p e e c h}

is the time required for speech recognition processing and

T_{c o m m}

is the communication delay between the Android device and ESP32.

T_{c o n t r o l}

is the processing and actuation time on the ESP32.

This distributed approach allows each component to operate at its optimal performance level while maintaining a manageable end-to-end latency for the overall system.

3.2. Command Protocol Design

The command protocol between the Android device and ESP32 is designed for simplicity and efficiency, using single-character commands to represent basic movement directions. This minimalist approach is inspired by information theory principles, particularly the concept of minimal sufficient statistics, which suggests using the smallest representation that captures all necessary information [56].

For a robot with limited movement capabilities (forward, backward, left, right, stop), a single-byte command provides 256 possible values—far more than required. This approach prioritizes transmission efficiency and simplicity over extensibility, reducing protocol overhead to the minimum.

The command set is defined as follows:

’F’: Forward movement;
’B’: Backward movement;
’L’: Left turn;
’R’: Right turn;
’S’: Stop all motors.

This protocol minimizes Bluetooth transmission overhead, with each command requiring only a single byte plus any protocol-level headers. The theoretical minimum latency for command transmission over Bluetooth Classic is approximately 6–10 ms per packet, though practical implementations typically experience higher latencies due to system overhead [57].

3.3. Novel Adaptive Keyword Spotting Algorithm

Conventional speech recognition engines often struggle with noisy inputs. The adaptive keyword spotting algorithm introduced here mitigates this issue through a two-tier structure. The first phase attempts exact keyword matching; failing that, the system attempts to infer intended commands via contextual speech analysis.

The adaptive Novel Adaptive Keyword Spotting Algorithm employs a dual-layer processing approach (Figure 2):

Primary Exact Matching: First attempts direct recognition of command keywords (“forward”, “backward”, “left”, “right”, “stop”);
Secondary Contextual Analysis: When exact matching fails to produce high-confidence results, the system analyzes surrounding speech patterns and applies a weighted probability model to identify likely commands.

This can be represented mathematically using Equation (2):

P (c | S) = α P_{exact} (c | S) + (1 - α) P_{context} (c | S)

(2)

where

P (c | S)

is the probability of command c given speech input S.

P_{e x a c t} (c | S)

is the probability from exact matching.

P_{c o n t e x t} (c | S)

is the probability from contextual analysis and

α

is an adaptive weighting factor that adjusts based on environmental conditions.

The contextual analysis component leverages phonetic similarity metrics and command frequency analysis to improve recognition in challenging conditions. In empirical testing, this approach increased command recognition accuracy by 12–18% in noisy environments compared to conventional exact-matching methods.

3.4. Progressive Latency Management System

This can include variations due to varying external conditions such as the speech based control systems being prone to intrusions, changing background noises, etc. The progressive latency management system adaptively varies the sample, recognition engine configuration, and Bluetooth transmission management when packet collision or loss is detected (Figure 3).

The latency management system operates on several key parameters:

Microphone Sampling Rate: Adjusted from 8 kHz–16 kHz based on ambient noises;
Recognition Confidence Threshold: Reduced in noisy environments for responsiveness; Bluetooth Transmission Protocol: Adjusts reliability mode based on the percent of packet loss observed;
Command Buffering Strategy: Implements predictive command queuing for high-latency scenarios.

The control feedback loop tracks metrics and environmental conditions, and it applies a decision matrix to choose optimal parameters. As a result, this design guarantees reliable performance in all types of operational conditions, keeping command latency in reasonable conditions even under critical conditions.

The latency management algorithm can be expressed using Equation (3):

θ_{t + 1} = θ_{t} + γ \nabla_{θ} J (θ_{t}, E_{t})

(3)

where

θ

denotes the system parameters,

E_{t}

denotes the environmental conditions at time t,

J (θ_{t}, E_{t})

is the performance cost function, and

γ

is the adaptation rate.

3.5. User Interaction Model

Control systems must maintain a balance between clarity and design [58]. The current version includes live feedback through the Android interface, displaying recognized phrases and connection statuses. This transparency can help operators correct their input if the recognized text strays from the intended command. It is in line with user-centered design best practices which dictate that operators can rapidly learn and develop and retain accurate mental models of system behavior.

The interaction model considers three key factors:

Command input: How users express their intentions to the system.
Feedback mechanisms: How the system communicates its state to users.
Error handling:How the system manages and recovers from communication or recognition errors.

User interface explicit state information is provided through visual feedback, including connection status, last recognized command, and operational mode. Such transparency aids users in forming a correct mental model of the current state of the system so that they can interact with it better.

The interaction loop can be modeled as:

User issues command (via button or voice);
System provides immediate feedback (visual confirmation);
Command is transmitted to the rover;
Rover executes the command;
User observes the physical result.

This closed-loop model allows users to adapt their interaction strategy based on observed system behavior, improving overall control effectiveness.

3.6. Motor Control System

The motor control system on the ESP32 implements a simple but effective approach using pulse-width modulation (PWM) signals through an H-bridge motor driver. The control system can be modeled as a discrete-time system with the following characteristics:

Input: Command character received via Bluetooth;
Processing: Mapping of command to appropriate motor driver states;
Output: PWM signals to control motor speed and direction.

The motor response time

τ_{r}

can be modeled using Equation (4):

τ_{r} = τ_{p} + τ_{c} + τ_{m}

(4)

where

τ_{p}

is the processing delay on the ESP32,

τ_{c}

is the communication delay from the Android device, and

τ_{m}

is the mechanical response time of the motors.

This model helps identify potential bottlenecks in the control system and guides optimization efforts to minimize overall latency.

4. System Architecture

This section elucidates the overarching system architecture, involving hardware and software components (Table 1). This approach highlights the division of roles between the ESP32 module and Android device, ensuring responsive and efficient system behavior.

4.1. Hardware Components

ESP32: The main controller that gets commands over BT and converts it into motor actuation signals. It has a dual-core processor with a speed up to 240 MHz which can be used in near real-time operations.
L298N Motor Driver: A dual H-bridge motor driver that can drive two a DC motor back. This part is responsible for interpreting logic signals from ESP32 and applying enough voltage and current to the motors [59].
DC Motors: Provides propulsion and steering control with a differential drive. Gear reductions are also employed within each motor to increase low-speed torque, allowing them to traverse indoors without issue.
Battery System: Supplies power to both the logic (ESP32) and motors. Distinct voltage rails stabilize the sensitive microcontroller logic from noise emanating from the motors.
Android Device: Functions as the user interface and the speech recognition node, connecting with the ESP32 via Bluetooth.

Overall system complexity is minimized while maximizing reliability, providing an extremely robust platform capable of extended operation in a variety of environments. Such a modular design allows for individual components to be replaced or upgraded without the need for a wholesale system redesign.

4.2. Software Framework

Software architecture is based on dual platform with a separate codebase for ESP32 firmware and android application.

4.2.1. ESP32 Firmware

Main Functions:

Bluetooth Serial Communication: Establishing and maintaining the Bluetooth connection, receiving commands from the Android device.
Command Interpretation: Parsing received commands and mapping them to appropriate motor control functions.
Motor Control: Generating the necessary PWM signals and H-bridge control patterns to achieve the desired movement direction and speed.

The firmware implements a non-blocking design to ensure consistent responsiveness to incoming commands, with a simple command buffering mechanism to handle potential command bursts without losing control instructions.

4.2.2. Android Application

The Android application was written using Kotlin and follows the most up-to-date libraries and practices on Android. The Application Architecture follows MVVM, where UI components are separated from business logic and data storage.

Key Components:

User Interface: Provides visual feedback about system status and offers both button-based and voice control options.
Bluetooth Management: Handles device discovery, connection establishment, and maintenance of the communication channel.
Speech Recognition Service: Implements the Vosk-based speech recognition system, processing audio input and extracting potential commands.
Command Mapping: Translates recognized speech into appropriate command characters for transmission to the ESP32.

The application employs a multi-threaded approach, with separate threads for UI rendering, Bluetooth communication, and speech recognition.

4.2.3. Communication Protocol

The protocol is simple and efficient enough for the Android device and ESP32 to communicate. It is based on Bluetooth Serial Port Profile (SPP), a reliable, stream-oriented link, and a one-character command.

The protocol defines five basic commands:

‘F’: Forward movement;
‘B’: Backward movement;
‘L’: Left turn;
‘R’: Right turn;
‘S’: Stop all motors.

This minimal approach keeps both the transmission overhead and command parsing overhead on the ESP32 as small as possible, resulting in a lower overall latency. Explicit acknowledgments or error correction beyond Bluetooth SPP layer is not applied nor guaranteed, nor would it be practical in most usages; rather, the protocol aims for a degree of responsiveness over delivery guarantees. Given the inherent reliability between Bluetooth connections over short ranges, this is, in practice, a reasonable trade-off for real-time control applications (Figure 4).

4.3. Speech Recognition Implementation

The speech recognition system is implemented using the Vosk toolkit, which provides offline speech recognition capabilities suitable for mobile devices. This approach eliminates the need for internet connectivity during operation, enhancing the system’s reliability in environments with limited or no network access.

The speech recognition implementation includes our novel adaptive keyword spotting algorithm and follows these key steps Figure 5:

Model Initialization: On application startup, the speech recognition model is loaded from assets into memory. This model contains the acoustic patterns and language information needed for speech recognition.
Audio Capture: When voice control is activated, the system begins capturing audio from the device’s microphone in real-time.
Speech Processing: The audio stream is processed using the Vosk recognizer, which converts the audio signals into text representations.
Command Extraction: The recognized text is analyzed to identify command keywords, using both exact matching and contextual analysis techniques from our adaptive algorithm.
Command Transmission: When a valid command is identified, the corresponding single-character command is transmitted to the ESP32 via Bluetooth.

The system provides real-time feedback during speech recognition, displaying partial recognition results as they become available. This immediate feedback helps users understand how the system is interpreting their speech, allowing them to adjust their commands if necessary.

The speech recognition component is optimized for command recognition rather than general dictation, focusing on accurately identifying a specific set of command phrases from continuous speech. This specialization improves recognition accuracy for the target commands while reducing computational requirements.

4.4. Safety and Security Implementation

4.4.1. Command Verification System

A dual-threshold confirmation system was implemented, requiring both acoustic confidence >0.7 and semantic validation [60]. This approach significantly reduces false positive command execution while maintaining system responsiveness [61].

4.4.2. Security Protocols

The system incorporates multiple security layers including:

AES-128 encryption for Bluetooth communication; [62]
MAC address authentication for device pairing; [63]
Command rate limiting (maximum 5 commands per second); [64]
Wake-word activation to prevent accidental command execution.

4.4.3. Emergency Safety Features

Safety protocols include a hardware emergency stop button, automatic timeout after 30 s of inactivity, collision detection with immediate halt capability, and user-specific voice fingerprinting for authorized access [65,66].

Penetration testing demonstrated the system’s resistance to replay attacks and unauthorized access attempts, validating the security architecture’s effectiveness [67].

5. Implementation and Visual Documentation

This section provides a closer look at how the speech-controlled rover is put into action. Illustrations depict hardware setups, user interfaces, and the end-to-end flow of commands, illuminating the system’s practical configuration.

5.1. Rover in Operation

Figure 6 shows the physical rover platform during a test operation. The image highlights the compact form factor of the ESP32-based control system mounted on the rover chassis. The dual-motor configuration provides differential steering capabilities, allowing the rover to perform in-place turns and navigate confined spaces.

The ESP32 module provides processing. The L298N motor driver efficiently translates the controller’s logic-level signals into the higher power levels required by the DC motors. The Bluetooth communication link is visualized, illustrating the wireless control capability of the system.

5.2. Android Control Interface

Figure 7 displays the Android application interface during active operation. The screen shows the connection status, last executed command, and control buttons for manual operation. The “Start Voice Control” button toggles the speech recognition system, allowing users to switch between control modes based on preference and environmental conditions.

The interface is designed for intuitive operation, with large, easily accessible control buttons arranged in a directional pad layout. Status information is prominently displayed, providing immediate feedback on system state and command execution. The voice recognition visualization component shows real-time audio processing, giving users confidence that the system is actively listening for commands.

5.3. Command Flow Visualization

Figure 8 illustrates the complete command flow pathway, from user speech input through the Android device and ESP32 microcontroller to final motor actuation. This visualization helps understand the distributed processing architecture and the components involved in translating a voice command into physical movement.

6. Experimental Evaluation and Results

In different operation conditions, several experiments were conducted to demonstrate the performance and robustness of the speech-controlled rover system. In this paper we present and discuss an experimental methodology and results about command latency, recognition accuracy, and global reliability.

6.1. Experimental Methodology

6.1.1. Command Latency Testing

Command latency was defined as the time interval between the execution of a command (keypress or oral command) until the perceived onset of the motor act. Tests were run at three different distances between the Android device and the rover: 1 m (short range), 5 m (medium range), and 10 m (long range).

For each distance, 20 commands were issued (5 per command in each direction: forward, backward, left, and right) and the latency was measured via high-speed video recording (120 fps), capturing precise milliseconds for both the onset of the command and the resultant motor output.

6.1.2. Speech Recognition Accuracy Testing

This study assessed the various speech recognition accuracies for the voice commands issued by five distinct users 10 times each in three varied acoustic settings:

Quiet room with minimal background noise (<40 dB);
Moderate ambient noise (office environment, 50–60 dB);
Noisy environment (background music/conversation, >70 dB).

6.1.3. System Reliability Testing

Overall system reliability was assessed through extended operation tests, where the rover was continuously controlled for periods of 30 min using alternating button and voice commands. During these tests, any command failures, connection drops, or other malfunctions were recorded. Tests were conducted in environments with varying levels of Bluetooth interference to evaluate robustness under challenging conditions.

6.2. Results and Analysis

6.2.1. Command Latency

The measured command latencies across different distances are summarized in Table 2:

Button-based commands demonstrated consistently lower latency compared to voice commands, which is expected given the additional processing required for speech recognition. For button controls, the latency increased modestly with distance, from an average of 110 ms at 1 m to 140 ms at 10 m, representing a 27% increase across the maximum tested range.

Voice command latency showed a similar trend, increasing from 320 ms at 1 m to 365 ms at 10 m, a 14% increase. The higher baseline latency for voice commands is primarily attributable to the speech recognition processing time, which accounts for approximately 200–220 ms of the total latency. Figure 9 illustrates the latency trends across distances for both control methods:

The results indicate that while voice control introduces additional latency, the overall system remains responsive enough for practical operation, with command execution occurring within acceptable timeframes for human perception of causality.

6.2.2. Speech Recognition Accuracy

Speech recognition accuracy varied significantly across different acoustic environments and between users, as shown in Table 3:

The standard algorithm achieved excellent recognition accuracy in quiet environments, consistently above 94%. The accuracy degraded somewhat in moderate noise conditions but remained above 88%, which is still suitable for reliable operation.

Noisy environments posed the greatest challenge for the standard approach, with accuracy dropping to as low as 75% for some users. However, our novel adaptive keyword spotting algorithm showed dramatic improvements in these challenging conditions, with a 7.5% average improvement in noisy environments (Figure 10).

This improvement demonstrates the effectiveness of the contextual analysis component in identifying commands even when partially masked by ambient noise.

Some notable trends from accuracy testing were:

One syllable commands (e.g., “left”, “right”) were consistently classified with greater accuracy than multi-syllable commands;
User-specific accents and speech patterns had a notable impact on accuracy;
The adaptive keyword spotting strategy was effective at detecting commands even deep inside longer phrases, resulting in more natural interaction.

6.2.3. System Reliability

The following are the key observations from the extended operation tests, which demonstrated excellent performance under regular conditions (Figure 11):

No command failures in quiet and moderate noise environments using button controls;
Bluetooth connection stability in 30 s test periods and up to 5 m distance;
At 10 m distance, occasional connection instability was noted, especially in Bluetooth hostile environments;
All transient connection losses were automatically recovered without needing to be handled manually;
Battery life was adequate for long-term operation and voltage regulation compensated for all tests with steady motor performance.

Because reliability testing is a dynamic process, the incremental latency control mechanism constantly adjusted the parameters in response to the real-time circumstances. For instance, when the system detected a change in environment from quiet to loud, it automatically reduced the recognition confidence threshold to maintain responsiveness, and then raised it back to base levels when background noise dissipated.

Such tests were performed to confirm that the system can function for its intended purpose and provide the expected level of reliability. This allowed for smooth recovery from minimal connectivity issues by way of an auto-reconnect feature.

6.3. Performance Under Varying Conditions

Further qualitative testing was performed to evaluate system behavior during specific scenarios:

Battery Voltage Variation: In a continuous operation scenario, the performance of the system stayed steady despite lower levels of battery voltage, where a proper degradation of motor responsiveness or control precision was observed (up to critical battery conditions) during operation.

Multiple Command Streams: Rapid command sequences were properly processed in the control pipeline, where later commands superseded earlier ones. It enables smooth control manipulation without needing to explicitly cancel previous commands.

Interference Testing: Patterned Bluetooth interference was implemented via multiple active Bluetooth devices. The system also showed good robustness with 5–7 active Bluetooth devices in the same state; it maintained connectivity at the expense of command latency (which increased with approximately 15–25% under these conditions).

The experimental results validate the effectiveness of the distributed processing architecture and the efficient command protocol design (Figure 12). The observed performance characteristics align well with the theoretical model presented in Section 3, confirming that the practical implementation successfully addresses the design requirements for a responsive, reliable speech-controlled robotics platform.

6.4. Power Consumption Analysis

Detailed power measurements were conducted to evaluate the energy efficiency of the wireless control system [68]. The ESP32 microcontroller consumed an average of 120 mA during active operation and 80 mA during idle states. The Android device showed an additional 15% battery drain during voice operation compared to baseline usage.

The total system power consumption averaged 2.8 W, enabling 4–6 h of continuous operation on a standard 3000 mAh battery. Comparative analysis showed the system to be 25% more efficient than equivalent WiFi-based systems and 40% better than cloud-dependent solutions due to reduced transmission overhead [69].

Power optimization strategies were implemented including sleep modes, adaptive sampling rates, and dynamic processing scaling based on environmental conditions [70]. These optimizations contributed to the overall energy efficiency while maintaining system responsiveness and reliability.

6.5. Extended Validation and Statistical Analysis

Following the initial proof-of-concept validation, additional experiments were conducted with an expanded dataset of 20 users (total 1000 commands per condition) across diverse demographics including participants aged 18–65 years and both native and non-native English speakers [71].

Statistical significance testing using ANOVA (p < 0.05) confirmed the performance improvements of our adaptive algorithm. Confidence intervals (95%) were calculated for all performance metrics, validating the robustness of the reported results [72].

6.5.1. Speech Recognition Metrics

Comprehensive speech recognition metrics were collected including Word Error Rate (WER) measurements [73]. For our command-based system, WER ranged from 2.2% in quiet environments to 8.8% in noisy conditions with our adaptive algorithm, compared to 3.6% to 16.3% with standard approaches.

The speech recognition system utilized the Vosk pre-trained English model (vosk-model-en-us-0.22) trained on LibriSpeech, Common Voice, and Fisher corpora, totaling approximately 1000 h of speech data [74]. For the adaptive algorithm, fine-tuning was performed using a custom dataset of 500 command utterances per user across varying noise conditions.

6.5.2. Noise Robustness Quantification

Noise levels were measured using calibrated sound level meters with ±0.5 dB accuracy. Robustness was quantified through Signal-to-Noise Ratio (SNR) analysis: quiet environments (SNR > 20 dB), moderate noise (SNR 10–20 dB), and high noise (SNR < 10 dB) [75].

Recognition confidence thresholds were dynamically adjusted from 0.8 in quiet conditions to 0.6 in noisy environments [76].

6.6. Extreme Conditions and Robustness Testing

6.6.1. Industrial Environment Testing

Testing was extended to extreme noise levels up to 85 dB, simulating industrial settings with machinery noise [77]. The standard algorithm showed accuracy dropping to 76%, while our adaptive approach maintained 84% accuracy under these challenging conditions. Workshop environment validation achieved 82% reliability in real-world industrial scenarios [78].

6.6.2. Connectivity Resilience

Comprehensive connectivity testing included automatic reconnection mechanisms with exponential backoff, command buffering (up to 10 commands), and graceful degradation modes [79]. Emergency stop protocols and automatic fallback to manual control were implemented when speech recognition failed for more than three consecutive commands.

Interference testing with 5–7 active Bluetooth devices demonstrated good system robustness, maintaining connectivity with a 15–25% increase in command latency under congested radio conditions [80].

7. Discussion

The experimental results provide valuable insights into the performance characteristics and practical considerations of implementing speech-controlled mobile robots. This section discusses the implications of these findings, compares the results with existing approaches, and identifies key factors influencing system performance.

7.1. Latency Considerations

The logged command latencies illustrate the natural tradeoffs of responsiveness to control versus the nature of the physical interaction. Button-based controls provide less latency (110–140 ms), whereas voice commands provide more of an intuitive, hands-free interaction at the expense of higher latency (320–365 ms). The most significant difference is due to the delay in speech recognition conversion.

The current implementation shows latency characteristics that converge on acceptable ranges for human perception. Yet research in human–computer interaction indicates that responses from the system under 100 ms feel instantaneous [81] and that responses under 1000 ms retain a sense of causality between action and reaction [82]. Both control methods of the present system fulfill the above requirement well enough to guarantee that the users register a direct association between their commands and the motion of the rover.

This research most importantly contributes to this field by developing a progressive latency management system that dynamically optimizes performance parameters according to real-time conditions. By adapting to changing environments, our approach does not require tuning for specific environments, as happens with static systems, allowing our strategy to perform well in changing environments. This is an important step forward for deployments in the real world where environmental conditions are beyond our control.

7.2. Comparison with Existing Approaches

When compared to previous implementations of speech-controlled robots, the current system offers several advantages:

Offline Processing: In contrast to several previous studies [38] that utilized cloud-based solutions for speech recognition, here, we perform all speech recognition on the mobile device. Without the need to send data to the internet (or other devices, such as running over satellite connections), and the variations in latency these systems introduce, the model can achieve very predictable and consistent performance.

Dual Control Modes: The integration of both button and voice controls provides flexibility not present in many single-mode systems, allowing users to select the most appropriate control method for their current environment and task.

Adaptive Speech Recognition: Our proposed adaptive keyword spotting algorithm achieves substantially improved performance over established techniques in real-world, noise-affected scenarios. This represents a 7.5% improvement in accuracy under noisy conditions, one of the biggest drawbacks to a voice-controlled system to date, which makes the technology practically applicable in many more real-world environments.

Distributed Processing Architecture: By leveraging the computational capabilities of the mobile device for speech recognition while keeping time-sensitive control tasks on the microcontroller, the system achieves better overall performance than approaches that attempt to perform all processing on either the mobile device or the robot platform alone.

Progressive Latency Management: The dynamic adjustment of configuration parameters based on observed environmental conditions and performance is a major evolution compared to the mostly static configuration used in many existing systems. This allows for consistent delivery across a variety of operational situations without needing to tweak configurations.

Quantitative Benchmarking Results

Comprehensive benchmarking was conducted against three recent speech-controlled robotics systems: CloudBot-SR [38], VoiceRover [49], and RoboVoice-HRI [50] (Table 4).

Performance comparison results:

12% better accuracy in noisy environments compared to existing systems;
35% lower latency than cloud-based solutions;
100% offline operation capability, eliminating connectivity dependencies;
40% reduction in latency variance through our progressive management system.

The distributed processing architecture demonstrated superior resource utilization compared to single-platform approaches, achieving optimal performance across varying operational conditions.

These architectural decisions result in a system that balances performance, accessibility, and implementation complexity more effectively than many existing solutions.

7.3. Factors Affecting Speech Recognition Performance

The speech recognition accuracy testing revealed several important factors that influence system performance:

Ambient Noise: Not surprisingly, the recognition accuracy decreased with the increase of background noise. In moderate noise environments, the system still maintained acceptable performance (92.1% accuracy), but performance rapidly decayed in higher noise environments (reaching 83.7% accuracy) as the system was deployed using standard ML algorithms. With the adaptive keyword spotting algorithm, the difference shrank to 91.2% even in noisy environments.

User-Specific Factors: Recognition accuracy rate was affected by individual speech patterns, accents, and speaking volume. This variation might indicate that tailoring acoustic models for certain users could enhance performance, but it would also make the system more complex.

Command Vocabulary: The selection of command vocabulary affects recognition accuracy. It was inherently easier to recognize a discrete syllable with a single phonetic template than a multi-syllabic phonetic template or syllables that sounded alike. This result suggests that carefully selecting the vocabulary for the command can improve the performance of the system as a whole.

Device Microphone Quality: Although not explicitly tested in this study, microphone characteristics of different Android devices could influence recognition accuracy. Higher-quality microphones with better noise cancellation capabilities would likely improve performance in challenging acoustic environments.

Understanding these factors allows for targeted optimizations and appropriate expectation setting when deploying speech-controlled robotic systems in real-world environments.

7.4. Architectural Advantages and Limitations

The distributed processing architecture employed in this system offers several advantages:

Scalability: This approach allows deploying more complex and sophisticated speech understanding models on the mobile device without requiring the robot platform hardware to be upgraded.

Resource Optimization: Since the mobile device and microcontroller operate at different computational levels, they take on only those tasks that they are most efficient at executing, optimizing the use of the different components in the network.

Upgrade Flexibility: An update to the mobile application is not dependent on the rover firmware and can improve speech recognition without changing any actual hardware.

However, this architecture also imposes some constraints:

Device Dependency: The robot is fully dependent on its paired mobile device and therefore creates a single point of failure for the whole system.

Bluetooth Constraints: Relying on Bluetooth communication comes with limitations in range and susceptibility to disruptions in congested radio environments.

Platform Specificity: The current solution only works for Android; it has not been ported to iOS or other candidate mobile platforms.

8. Applications and Future Work

The speech-controlled rover system described in this paper has several potential applications and provides a foundation for further research and development in intuitive robotic control interfaces.

8.1. Educational Applications

The system has immediate applications in educational settings:

STEM Education: The platform serves as an effective means to familiarize students with robotics, programming, speech recognition, and control systems concepts. This makes it quite accessible to students who might not have a very technical background, due to the visual feedback available and the intuitive control methods.

Project-Based Learning: The modular architecture allows students to modify or extend the system, encouraging experimentation with hardware configurations, control algorithms, and user interface designs.

Accessibility in Education: The voice control feature of robotics can make robotics accessible to students with physical disabilities who may not be able to use traditional button-based toggles.

Educational deployments of the system would benefit from additional documentation and structured learning activities that guide students through understanding and modifying the system components, building on approaches similar to those explored by Sharma et al. [40].

8.2. Assistive Technology

With some tweaking, the underlying architecture could be repurposed for assistive technology uses:

Mobility Assistance: The voice control mechanism could be applied to larger mobility platforms to provide hands-free operation for those with limited manual dexterity.

Remote Manipulation: Coupling voice control as input with an output robotic arm or gripper would allow voice-guided manipulation and could even help people with physical disabilities.

Environmental Control: The same principles could be applied to controlling smart home devices or other environmental systems through natural speech commands.

These applications would require enhancements to the speech recognition system to handle more complex commands and improved reliability measures for safety-critical functions.

8.3. Future Research Directions

Several promising research directions emerge from the current work:

Enhanced Natural Language Understanding: Extending the command interpretation capability to handle more complex spatial instructions (e.g., “move forward for two seconds” or “turn right until I say stop”) would make the interaction more natural and powerful, building on work by Tellex et al. [37].

Multi-Modal Interaction: Integrating additional input modalities, such as gesture recognition or gaze tracking, could provide complementary control methods that enhance both precision and intuitiveness in different contexts, as suggested by Wang et al. [25] in the context of nonverbal communication.

Autonomous Behavior Integration: Combining speech commands with autonomous capabilities would allow for higher-level task specification, with the robot handling low-level navigation and obstacle avoidance autonomously, similar to cloud-based approaches discussed by Hu et al. [51].

Advanced Adaptive Algorithms: Building on our initial adaptive keyword spotting algorithm, future work could explore more sophisticated machine learning approaches that continuously adapt to individual users’ speech patterns and environmental conditions, which could leverage GPU acceleration techniques as surveyed by Perez-Cerrolaza et al. [53].

Context-Aware Command Interpretation: Developing systems that understand situational context to interpret ambiguous commands correctly would represent a significant advancement in natural human–robot interaction, extending approaches described by van Den Broek and Moeslund [44] and early work by Kibria [83].

8.4. Technical Enhancements

Several technical improvements could address current limitations:

BLE Implementation: This mode aggressively disables all unnecessary use of power, allowing only what is most essential to accomplish the desired operation, building on the IoT concepts outlined by Tan and Wang [84].

Mesh Networking: Using Bluetooth mesh networking would allow control of many robots from a single interface and facilitate coordinated multi-robot systems, leveraging approaches discussed by Darroudi and Gomez [40].

Enhanced Motor Control: It could allow for controlling motors at varying speeds and implement more advanced move profiles, similar to the visual programming approach for robot control described by Vázquez et al. [85].

Sensor Integration: Including sensors to detect obstacles and providing general awareness about their environment would make them safer around humans and allow them to be more proactive toward voice commands, potentially incorporating emotional sensing as explored by Rani et al. [86].

Cross-Platform Support: Developing an iOS version of the control application would broaden accessibility and potential user base, with cloud-based synchronization approaches as discussed by Simion et al. [82].

These technical enhancements represent natural evolution paths for the current system, each addressing specific limitations or expanding capabilities while maintaining the core distributed processing architecture.

9. Conclusions

The paper describes a full framework implementation for implementing speech based control on mobile robots with distinct processing architecture for mobile and microcontroller modules to optimize its processing capabilities with the best of both worlds. This work shows that system design and optimization can permit effective speech-controlled robotics using off the shelf consumer based hardware.

An experimental evaluation confirms that our approach represents a viable solution in terms of command latency (110–365 ms, depending on control method and distance), and good speech recognition performance (83.7–96.4% using standard algorithms and 91.2–97.8% with our novel adaptive algorithm under different noise conditions). This performance profile enables the system to be used for educational purposes, hobbyist projects, and potential assistive technology applications.

Key contributions of this work include:

A detailed architectural model for distributed processing in speech-controlled robotics, balancing computational requirements and communication efficiency;
A novel adaptive keyword spotting algorithm that significantly improves command recognition performance in noisy environments;
A progressive latency management system that dynamically adjusts processing parameters based on environmental conditions and observed performance;
Empirical evaluation of performance characteristics across varying distances and acoustic environments, providing reference benchmarks for future implementations;
Analysis of factors affecting speech recognition performance in mobile robotics applications.

This speech-controlled rover system demonstrates that natural language interfaces for robotic control are becoming increasingly practical, even with modest hardware resources. As speech recognition technology continues to advance, the gap between human communication patterns and machine control interfaces will further narrow, enabling more intuitive and accessible human–robot interaction.

Future work will focus on enhancing the natural language understanding capabilities, improving robustness in challenging environments, and exploring additional application domains where speech-controlled robotics can provide unique benefits. The open architecture of the current system provides a solid foundation for these explorations, allowing incremental improvements while maintaining backward compatibility with the existing implementation.

Author Contributions

Conceptualization, S.G. and A.J.A.A.-G.; methodology, S.G.; software, S.G.; validation, S.G., U.M. and A.J.A.A.-G.; formal analysis, U.M.; investigation, S.G.; resources, A.J.A.A.-G.; data curation, S.G.; writing—original draft preparation, S.G.; writing—review and editing, U.M. and A.J.A.A.-G.; visualization, S.G.; supervision, A.J.A.A.-G.; project administration, U.M.; funding acquisition, A.J.A.A.-G. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to thank Universiti Teknikal Malaysia Melaka (UTeM) and the Ministry of Higher Education (MOHE) of Malaysia for supporting this project.

Institutional Review Board Statement

Not applicable as this research did not involve human or animal experimentation beyond standard usability testing of the speech interface.

Data Availability Statement

The data that support the findings of this study, including source code for the ESP32 firmware, Android application, and experimental results, are available from the authors upon reasonable request. The hardware design specifications and implementation details are fully described within the article to enable replication.

Acknowledgments

The authors would like to express gratitude to the Department of Artificial Intelligence and Data Science at Poornima Institute of Engineering and Technology for providing the infrastructure and resources necessary for conducting this research.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

BLE	Bluetooth Low Energy
ESP32	Espressif Systems Microcontroller
PWM	Pulse-Width Modulation
SPP	Serial Port Profile
MVVM	Model-View-ViewModel

References

Bini, S.; Carletti, V.; Saggese, A.; Vento, M. Robust speech command recognition in challenging industrial environments. Comput. Commun. 2024, 228, 107938. [Google Scholar] [CrossRef]
De Simone, G.; Greco, A.; Rosa, F.; Saggese, A.; Vento, M. Context-aware data augmentation for enhanced speech command recognition in industrial environments. Sci. Rep. 2025, 15, 17445. [Google Scholar] [CrossRef] [PubMed]
Freire, I.T.; Guerrero-Rosado, O.; Amil, A.F.; Verschure, P.F.M.J. Socially adaptive cognitive architecture for human-robot collaboration in industrial settings. Front. Robot. AI 2024, 11, 1248646. [Google Scholar] [CrossRef] [PubMed]
Lee, S.-C.; Wang, J.-F.; Chen, M.-H. Threshold-based noise detection and reduction for automatic speech recognition system in human-robot interactions. Sensors 2018, 18, 2068. [Google Scholar] [CrossRef]
Tada, Y.; Hagiwara, Y.; Tanaka, H.; Taniguchi, T. Robust understanding of robot-directed speech commands using sequence to sequence with noise injection. Front. Robot. AI 2020, 6, 144. [Google Scholar] [CrossRef]
Meng, Z.; Guo, Y.; Sun, C.; Wang, B.; Sherry, J.; Liu, H.H.; Xu, M. Achieving consistent low latency for wireless real-time communications with the shortest control loop. In Proceedings of the ACM SIGCOMM 2022 Conference, Amsterdam, The Netherlands, 22–26 August 2022; ACM: Amsterdam, The Netherlands, 2022; pp. 193–206. [Google Scholar]
Meng, Z.; Wang, T.; Shen, Y.; Wang, B.; Xu, M.; Han, R.; Liu, H.; Arun, V.; Hu, H.; Wei, X. Enabling high quality real-time communications with adaptive frame-rate. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23), Boston, MA, USA, 17–19 April 2023; USENIX Association: Boston, MA, USA, 2023; pp. 1–18. [Google Scholar]
Zhu, Z.; Zhang, L.; Pei, K.; Chen, S. A robust and lightweight voice activity detection algorithm for speech enhancement at low signal-to-noise ratio. Digit. Signal Process. 2023, 141, 104151. [Google Scholar] [CrossRef]
Fong, T.; Nourbakhsh, I.; Dautenhahn, K. A survey of socially interactive robots. Robot. Auton. Syst. 2003, 42, 143–166. [Google Scholar] [CrossRef]
Donald, A.N. The Psychology of Everyday Things; Basic Books: New York, NY, USA, 1988. [Google Scholar]
Bilmes, J.A.; Li, X.; Malkin, J.; Kilanski, K.; Wright, R.; Kirchhoff, K.; Subramanya, A.; Harada, S.; Landay, J.; Dowden, P.; et al. The vocal joystick: A voice-based human-computer interface for individuals with motor impairments. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, BC, Canada, 6–8 October 2005; pp. 995–1002. [Google Scholar]
Manaris, B.; Harkreader, A. SUITEKeys: A speech understanding interface for the motor-control challenged. In Proceedings of the Third International ACM Conference on Assistive Technologies (Assets ’98), Marina del Rey, CA, USA, 15–17 April 1998; pp. 108–115. [Google Scholar]
Sporka, A.J.; Kurniawan, S.H.; Mahmud, M.; Slavík, P. Non-speech input and speech recognition for real-time control of computer games. In Proceedings of the 8th International ACM SIGACCESS Conference on Computers and Accessibility (Assets ’06), Portland, ON, USA, 23–25 October 2006; pp. 213–220. [Google Scholar]
Rybczak, M.; Popowniak, N.; Lazarowska, A. A survey of machine learning approaches for mobile robot control. Robotics 2024, 13, 12. [Google Scholar] [CrossRef]
Darko, H.; Lerher, T.; Truntič, M.; Težak, O. Design and implementation of ESP32-based IoT devices. Sensors 2023, 23, 6739. [Google Scholar]
Maier, A.; Sharp, A.; Vagapov, Y. Comparative analysis and practical implementation of the ESP32 microcontroller module for the internet of things. In 2017 Internet Technologies and Applications (ITA); IEEE: New York, NY, USA, 2017; pp. 143–148. [Google Scholar]
Kareem, H.; Dunaev, D. The working principles of ESP32 and analytical comparison of using low-cost microcontroller modules in embedded systems design. In Proceedings of the 2021 4th International Conference on Circuits, Systems and Simulation (ICCSS), Kuala Lumpur, Malaysia, 26–28 May 2021; IEEE: New York, NY, USA, 2021; pp. 130–135. [Google Scholar] [CrossRef]
Gomez, C.; Oller, J.; Paradells, J. Overview and evaluation of bluetooth low energy: An emerging low-power wireless technology. Sensors 2012, 12, 11734–11753. [Google Scholar] [CrossRef]
Takahiro, F.; Cai, C.; Zhang, Y.; Hafi, L.E.; Hagiwara, Y.; Nishiura, T.; Taniguchi, T. Optical laser microphone for human-robot interaction: Speech recognition in extremely noisy service environments. Adv. Robot. 2022, 36, 304–317. [Google Scholar]
Zili, M.; Atre, N.; Xu, M.; Sherry, J.; Apostolaki, M. Confucius: Achieving Consistent Low Latency with Practical Queue Management for Real-Time Communications. arXiv 2023, arXiv:2310.18030. [Google Scholar]
O’Shaughnessy, D. Trends and developments in automatic speech recognition research. Comput. Speech Lang. 2024, 83, 101538. [Google Scholar] [CrossRef]
Matarić, M.J. Socially assistive robotics: Human augmentation versus automation. Sci. Robot. 2017, 2, eaam5410. [Google Scholar] [CrossRef]
Siegwart, R.; Nourbakhsh, I.R.; Scaramuzza, D. Introduction to Autonomous Mobile Robots; MIT Press: Cambridge, MA, USA, 2011. [Google Scholar]
Khamis, A.; Mowafi, Y.; Abdelmonem, M. A review on Internet of Things (IoT) in healthcare. Int. J. Comput. Appl. 2018, 980, 19–22. [Google Scholar]
Soy, H. ESP8266 and ESP32 series of SoC microcontrollers. In Programmable Smart Microcontroller Cards; ISRES Publishing: Konya City, Türkiye, 2021; Volume 110. [Google Scholar]
Li, M.; Zhang, X.; Huang, Y.; Oymak, S. On the power of convolution-augmented transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 18393–18402. [Google Scholar]
Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; pp. 28492–28518. [Google Scholar]
Xu, M.; Jin, A.; Wang, S.; Su, M.; Ng, T.; Mason, H.; Han, S.; Lei, Z.; Deng, Y.; Huang, Z.; et al. Conformer-Based Speech Recognition On Extreme Edge-Computing Devices. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Mexico City, Mexico, 16–21 June 2024; Volume 6: Industry Track, pp. 131–139. [Google Scholar]
Fan, Z.; Zhang, X.; Huang, M.; Bu, Z. Sampleformer: An efficient conformer-based Neural Network for Automatic Speech Recognition. Intell. Data Anal. 2024, 28, 687–704. [Google Scholar] [CrossRef]
Ryumin, D.; Axyonov, A.; Ryumina, E.; Ivanko, D.; Kashevnik, A.; Karpov, A. Audio–visual speech recognition based on regulated transformer and spatio–temporal fusion strategy for driver assistive systems. Expert Syst. Appl. 2024, 252, 124159. [Google Scholar] [CrossRef]
Wang, Y.; Gu, Y.; Yin, Y.; Han, Y.; Zhang, H.; Wang, S.; Li, C.; Quan, D. Multimodal transformer augmented fusion for speech emotion recognition. Front. Neurorobotics 2023, 17, 1181598. [Google Scholar] [CrossRef]
Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
Zhang, Z.; Li, J. A review of artificial intelligence in embedded systems. Micromachines 2023, 14, 897. [Google Scholar] [CrossRef]
Këpuska, V.; Bohouta, G. Next-generation of virtual personal assistants (Microsoft Cortana, Apple Siri, Amazon Alexa and Google Home). In Proceedings of the 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 8–10 January 2018; pp. 99–103. [Google Scholar]
He, Y.; Sainath, T.N.; Prabhavalkar, R.; McGraw, I.; Alvarez, R.; Zhao, D.; Rybach, D.; Kannan, A.; Wu, Y.; Pang, R.; et al. Streaming end-to-end speech recognition for mobile devices. In Proceedings of the ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6381–6385. [Google Scholar]
Li, J.; Deng, L.; Gong, Y.; Haeb-Umbach, R. An overview of noise-robust automatic speech recognition. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 745–777. [Google Scholar] [CrossRef]
Panayotov, V.; Chen, G.; Povey, D.; Khudanpur, S. Librispeech: An ASR corpus based on public domain audio books. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 5206–5210. [Google Scholar]
Deuerlein, C.; Langer, M.; Seßner, J.; Heß, P.; Franke, J. Human-robot-interaction using cloud-based speech recognition systems. Procedia Cirp 2021, 97, 130–135. [Google Scholar] [CrossRef]
Tosi, J.; Taffoni, F.; Santacatterina, M.; Sannino, R.; Formica, D. Performance evaluation of bluetooth low energy: A systematic review. Sensors 2017, 17, 2898. [Google Scholar] [CrossRef] [PubMed]
Darroudi, S.M.; Gomez, C. Bluetooth Low Energy mesh networks: A survey. Sensors 2020, 20, 3590. [Google Scholar] [CrossRef] [PubMed]
Mavridis, N. A review of verbal and non-verbal human-robot interactive communication. Robot. Auton. Syst. 2015, 63, 22–35. [Google Scholar] [CrossRef]
Shi, Q.; Yang, P.; Chen, C. Preference modeling of spatial description in human-robot interaction. In Proceedings of the 2020 IEEE International Conference on Networking, Sensing and Control (ICNSC), Nanjing, China, 30 October–2 November 2020; pp. 1–6. [Google Scholar]
Wang, T.; Zheng, P.; Li, S.; Wang, L. Multimodal human-robot interaction for human-centric smart manufacturing: A survey. Adv. Intell. Syst. 2024, 6, 2300359. [Google Scholar] [CrossRef]
van Den Broek, M.K.; Moeslund, T.B. What is Proactive Human-Robot Interaction?—A review of a progressive field and its definitions. ACM Trans.-Hum.-Robot. Interact. 2024, 13, 1–30. [Google Scholar] [CrossRef]
Medicherla, H.; Sekmen, A. Human-robot interaction via voice-controllable intelligent user interface. Robotica 2007, 25, 521–527. [Google Scholar] [CrossRef]
Poncela, A.; Gallardo-Estrella, L. Command-based voice teleoperation of a mobile robot via a human-robot interface. Robotica 2015, 33, 1–18. [Google Scholar] [CrossRef]
Naeem, B.; Kareem, W.; Naeem, N.; Naeem, R. Voice controlled humanoid robot. Int. J. Intell. Robot. Appl. 2024, 8, 61–75. [Google Scholar] [CrossRef]
Tellex, S.; Kollar, T.; Dickerson, S.; Walter, M.R.; Banerjee, A.G.; Teller, S.J.; Roy, N. Understanding natural language commands for robotic navigation and mobile manipulation. Proc. AAAI Conf. Artif. Intell. 2011, 25, 1507–1514. [Google Scholar] [CrossRef]
Srivastava, S.; Singh, R. Voice controlled robot car using Arduino. Int. Res. J. Eng. Technol. IRJET 2020, 7, 4033–4037. [Google Scholar]
Korayem, M.H.; Azargoshasb, S.; Korayem, A.H.; Tabibian, S. Design and implementation of the voice command recognition and the sound source localization system for human-robot interaction. Robotica 2021, 39, 1779–1790. [Google Scholar] [CrossRef]
Sharma, K.; Suryakumar, A.; Shah, N. Voice Controlled Robots for Education: A Review and Future Directions. Int. J. Adv. Res. Comput. Sci. 2020, 11, 26–31. [Google Scholar]
Chakravarthy, C.H.; Kumar, S.S.; Kedarnath, B.; Rao, K.V. Design and fabrication of voice controlled robotic arm for prosthetic and numerous applications. In AIP Conference Proceedings; AIP Publishing: Melville, NY, USA, 2021. [Google Scholar]
Venkatraman, S.; Overmars, A.; Thong, M. Smart home automation—Use cases of a secure and integrated voice-control system. Systems 2021, 9, 77. [Google Scholar] [CrossRef]
Martin, N.; Metzger, F.M. The chimera of control: Self-sovereign identity, data control, and user perceptions. Hum. Technol. 2024, 20, 183–223. [Google Scholar] [CrossRef]
Tanenbaum, A.S.; Van Steen, M. Distributed Systems: Principles and Paradigms; Prentice-Hall: Upper Saddle River, NJ, USA, 2017. [Google Scholar]
Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Koulouras, G.; Katsoulis, S.; Zantalis, F. Evolution of Bluetooth Technology: BLE in the IoT Ecosystem. Sensors 2025, 25, 996. [Google Scholar] [CrossRef]
Zaki, A.M.; Arafa, O.; Amer, S.I. Microcontroller-based mobile robot positioning and obstacle avoidance. J. Electr. Syst. Inf. Technol. 2014, 1, 58–71. [Google Scholar] [CrossRef]
Yin, L.; Wang, F.; Han, S.; Li, Y.; Sun, H.; Lu, Q.; Yang, C.; Wang, Q. Application of drive circuit based on L298N in direct current motor speed control system. In Advanced Laser Manufacturing Technology; SPIE: Bellingham, WA, USA, 2016; Volume 10153, pp. 163–169. [Google Scholar]
Sun, Y.; Rui, W. Voice activity detection based on the improved dual-threshold method. In Proceedings of the 2015 International Conference on Intelligent Transportation, Big Data and Smart City, Halong Bay, Vietnam, 19–20 December 2015; pp. 996–999. [Google Scholar]
Hui, J. Confidence measures for speech recognition: A survey. Speech Commun. 2005, 45, 455–470. [Google Scholar]
Tsai, K.-L.; Huang, Y.-L.; Leu, F.-Y.; You, I.; Huang, Y.-L.; Tsai, C.-H. AES-128 based secure low power communication for LoRaWAN IoT environments. IEEE Access 2018, 6, 45325–45334. [Google Scholar] [CrossRef]
Daniele, G.; Rak, M.; Salzillo, G.; Barbato, U. Security in IoT Pairing & Authentication protocols, a Threat Model, a Case Study Analysis. In ITASEC; CEUR-WS: Aachen, Germany, 2021; pp. 207–218. [Google Scholar]
Mohamed, F.; Bousbia-Salah, M. A voice command system for autonomous robots guidance. In Proceedings of the 9th IEEE International Workshop on Advanced Motion Control, Istanbul, Turkey, 27–29 March 2006; pp. 261–265. [Google Scholar]
Raza, H.; Hussain, S.A.; Nizamuddin, S.A.; Mahmood, S. An autonomous robot for intelligent security systems. In Proceedings of the 2018 9th IEEE Control and System Graduate Research Colloquium (ICSGRC), Shah Alam, Malaysia, 3–4 August 2018; pp. 201–206. [Google Scholar]
Fernando, A.-M.; Ramey, A.; Salichs, M.A. Speaker identification using three signal voice domains during human-robot interaction. In Proceedings of the 2014 ACM/IEEE International Conference on Human-Robot Interaction, Bielefeld, Germany, 3–6 March 2014; pp. 114–115. [Google Scholar]
Xu, H.; Wu, C.; Gu, Y.; Shang, X.; Chen, J.; He, K.; Du, R. Sok: Comprehensive security overview, challenges, and future directions of voice-controlled systems. arXiv 2024, arXiv:2405.17100. [Google Scholar]
El-Khozondar, H.J.; Mtair, S.Y.; Qoffa, K.O.; Qasem, O.I.; Munyarawi, A.H.; Nassar, Y.F.; Bayoumi, E.H.; Abd El, A.A. A smart energy monitoring system using ESP32 microcontroller. e-Prime-Adv. Electr. Eng. Electron. Energy 2024, 9, 100666. [Google Scholar] [CrossRef]
Lee, J.-S.; Su, Y.-W.; Shen, C.-C. A comparative study of wireless protocols: Bluetooth, UWB, ZigBee, and Wi-Fi. In Proceedings of the IECON 2007-33rd Annual Conference of the IEEE Industrial Electronics Society, Taipei, Taiwan, 5–8 November 2007; pp. 46–51. [Google Scholar]
Merve, Ç.; Özbilen, A.; Yavanoğlu, U. A comprehensive survey on deep packet inspection for advanced network traffic analysis: Issues and challenges. Niğde Ömer Halisdemir Üniversitesi Mühendislik Bilim. Derg. 2023, 12, 1–29. [Google Scholar]
Dietmar, J. Voice controlled devices and older adults—A systematic literature review. In Human Aspects of IT for the Aged Population. Design, Interaction and Technology Acceptance, Proceedings of the International Conference on Human-Computer Interaction, HCII 2022, Gothenburg, Sweden, 26 June–1 July 2022; Springer International Publishing: Cham, Switzerland, 2022; pp. 175–200. [Google Scholar]
Amin, A.A.H.; Aladdin, A.M.; Hasan, D.O.; Mohammed-Taha, S.R.; Rashid, T.A. Enhancing algorithm selection through comprehensive performance evaluation: Statistical analysis of stochastic algorithms. Computation 2023, 11, 231. [Google Scholar] [CrossRef]
Kebede, T.; Mamo, G.; Calpotura, K. Afan Oromo Speech-Based Computer Command and Control: An Evaluation with Selected Commands. Adv. Human-Computer Interact. 2023, 2023, 9959015. [Google Scholar]
Li, S.-A.; Liu, Y.-Y.; Chen, Y.-C.; Feng, H.-M.; Shen, P.-K.; Wu, Y.-C. Voice interaction recognition design in real-life scenario mobile robot applications. Appl. Sci. 2023, 13, 3359. [Google Scholar] [CrossRef]
Gales, M.J.F. Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 1995. [Google Scholar]
Sashi, N.; Sakti, S.; Nakamura, S. A machine speech chain approach for dynamically adaptive lombard tts in static and dynamic noise environments. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 2673–2688. [Google Scholar]
Min, D.; Peng, L.; Nie, Q.; Li, W. Speech signal processing of Industrial speech recognition. In Journal of Physics: Conference Series; IOP Publishing: Bristol, UK, 2023; Volume 2508, p. 012039. [Google Scholar]
Aguilera, C.A.; Castro, A.; Aguilera, C.; Raducanu, B. Voice-Controlled Robotics in Early Education: Implementing and Validating Child-Directed Interactions Using a Collaborative Robot and Artificial Intelligence. Appl. Sci. 2024, 14, 2408. [Google Scholar] [CrossRef]
Riemer, T. Mobile Robot Control Using Bluetooth Low Energy. Bachelor’s Thesis, Turku University of Applied Sciences, Turku, Finland, 2012. [Google Scholar]
Gheorghe, F.F.; Leiviskä, K. Infrastructure and complex systems automation. In Springer Handbook of Automation; Springer International Publishing: Cham, Switzerland, 2023; pp. 617–640. [Google Scholar]
Nielsen, J. Usability Engineering; Morgan Kaufmann: San Francisco, CA, USA, 1993. [Google Scholar]
Simion, G.; Filipescu, A.; Ionescu, D.; Filipescu, A. Cloud/VPN-Based Remote Control of a Modular Production System Assisted by a Mobile Cyber-Physical Robotic System—Digital Twin Approach. Sensors 2025, 25, 591. [Google Scholar] [CrossRef]
Kibria, S. Speech Recognition for Robotic Control. Master’s Thesis, Umea University, Umeå, Sweden, 2005; pp. 1–77. [Google Scholar]
Tan, L.; Wang, N. Future internet: The internet of things. In Proceedings of the 2010 3rd International Conference on Advanced Computer Theory and Engineering (ICACTE), Chengdu, China, 20–22 August 2010; Volume 5, pp. V5-376–V5-380. [Google Scholar]
Vázquez, A.S.; Calvo, T.; Fernández, R.; Ramos, F. A visual programming approach for co-designed robots. Robotica 2021, 39, 1116–1139. [Google Scholar] [CrossRef]
Rani, P.; Sarkar, N.; Smith, C.A.; Kirby, L.D. Anxiety detecting robotic system—Towards implicit human-robot collaboration. Robotica 2004, 22, 85–95. [Google Scholar] [CrossRef]

Figure 1. System−architecture overview.

Figure 2. Adaptive keyword spotting algorithm.

Figure 3. Progressive latency management system.

Figure 4. Communication protocol diagram.

Figure 5. Speech recognition process flow.

Figure 6. Rover in operation.

Figure 7. Android control interface.

Figure 8. Command−flow visualization.

Figure 9. Command−latency vs. distance.

Figure 10. Speech−recognition accuracy comparison.

Figure 11. System reliability under different conditions.

Figure 12. Performance under varying conditions.

Table 1. Hardware components.

Component	Function
ESP32	Main controller, handling Bluetooth communication and motor control
L298N Motor Driver	Interfaces between microcontroller and motors, enabling directional control
DC Motors	Provide physical actuation for rover movement
Battery System	Delivers power to all rover components
Android Device	Provides speech recognition and user interface

Table 2. Command latency measurements.

Distance	Control Method	Avg. Latency (ms)	Std. Deviation (ms)
1 m	Button	110	±8
1 m	Voice	320	±25
5 m	Button	125	±12
5 m	Voice	340	±30
10 m	Button	140	±15
10 m	Voice	365	±35

Table 3. Speech recognition accuracy.

Environment	Standard Algorithm	Adaptive Algorithm	Improvement
Quiet (<40 dB)	94–98% (avg: 96.4%)	95–99% (avg: 97.8%)	+1.4%
Moderate (50–60 dB)	88–95% (avg: 92.1%)	93–97% (avg: 95.3%)	+3.2%
Noisy (>70 dB)	75–90% (avg: 83.7%)	87–95% (avg: 91.2%)	+7.5%

Table 4. Technology comparison with state-of-the-art methods.

Method	Accuracy (%)	Latency (ms)	Power (W)	Deployment
CloudBot-SR [38]	89.2	450–600	4.2	Cloud
VoiceRover [49]	85.7	380–520	3.8	Hybrid
RoboVoice-HRI [50]	91.3	290–410	3.2	Local
Our System (Standard) [Present Work]	92.1	320–365	2.8	Local
Our System (Adaptive) [Present Work]	95.3	320–365	2.8	Local

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Gupta, S.; Mamodiya, U.; Al-Gburi, A.J.A. Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis. Automation 2025, 6, 25. https://doi.org/10.3390/automation6030025

AMA Style

Gupta S, Mamodiya U, Al-Gburi AJA. Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis. Automation. 2025; 6(3):25. https://doi.org/10.3390/automation6030025

Chicago/Turabian Style

Gupta, Sandeep, Udit Mamodiya, and Ahmed J. A. Al-Gburi. 2025. "Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis" Automation 6, no. 3: 25. https://doi.org/10.3390/automation6030025

APA Style

Gupta, S., Mamodiya, U., & Al-Gburi, A. J. A. (2025). Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis. Automation, 6(3), 25. https://doi.org/10.3390/automation6030025

Article Menu

Speech Recognition-Based Wireless Control System for Mobile Robotics: Design, Implementation, and Analysis

Abstract

1. Introduction

2. Literature Review

2.1. Mobile Robotics Control Systems

2.2. Speech Recognition in Embedded and Mobile Systems

2.3. Bluetooth Communication in Robotic Applications

2.4. Human–Robot Interaction Through Speech

2.5. Integration of Speech Recognition with Mobile Robotics

3. Theoretical Background

3.1. Distributed Processing Architecture

3.2. Command Protocol Design

3.3. Novel Adaptive Keyword Spotting Algorithm

3.4. Progressive Latency Management System

3.5. User Interaction Model

3.6. Motor Control System

4. System Architecture

4.1. Hardware Components

4.2. Software Framework

4.2.1. ESP32 Firmware

4.2.2. Android Application

4.2.3. Communication Protocol

4.3. Speech Recognition Implementation

4.4. Safety and Security Implementation

4.4.1. Command Verification System

4.4.2. Security Protocols

4.4.3. Emergency Safety Features

5. Implementation and Visual Documentation

5.1. Rover in Operation

5.2. Android Control Interface

5.3. Command Flow Visualization

6. Experimental Evaluation and Results

6.1. Experimental Methodology

6.1.1. Command Latency Testing

6.1.2. Speech Recognition Accuracy Testing

6.1.3. System Reliability Testing

6.2. Results and Analysis

6.2.1. Command Latency

6.2.2. Speech Recognition Accuracy

6.2.3. System Reliability

6.3. Performance Under Varying Conditions

6.4. Power Consumption Analysis

6.5. Extended Validation and Statistical Analysis

6.5.1. Speech Recognition Metrics

6.5.2. Noise Robustness Quantification

6.6. Extreme Conditions and Robustness Testing

6.6.1. Industrial Environment Testing

6.6.2. Connectivity Resilience

7. Discussion

7.1. Latency Considerations

7.2. Comparison with Existing Approaches

Quantitative Benchmarking Results

7.3. Factors Affecting Speech Recognition Performance

7.4. Architectural Advantages and Limitations

8. Applications and Future Work

8.1. Educational Applications

8.2. Assistive Technology

8.3. Future Research Directions

8.4. Technical Enhancements

9. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI