### **3. Prototype**

A neck-mounted wearable prototype was developed and used for classifying neck movement, mouth movement, and speech. The prototype consists of a sensor affixed to the neck and connected to a microcontroller. The microcontroller wirelessly transfers the data collected from the sensor via Bluetooth to the user's paired smartphone. On the smartphone, the time-series data is filtered in real time, classified, and then used as input to a software application. Figure 1 provides an overview of the wearable system and the interactions of its components.

**Figure 1.** Overview of the prototype system's components, with the sensor placed on the neck and the wearable hardware placed on the collar, communicating data to a smartphone for processing and for interfacing with the application.

E-textile and flex sensors were investigated as potential candidates for the prototype. E-textiles can be used as capacitive sensors or as resistive sensors. With the capacitive method, the e-textile worked well as a proximity sensor for detecting when the sensor was near human skin. However, once the sensor was in contact with or in close proximity to the skin, the sensor data became saturated and neither provided useful features nor responded to movements. Using the e-textile as a resistive sensor was more successful in producing features when actively bending or pulling the material.

The flex sensor proved to be the most appropriate for sensing the neck. The flex sensor acts as a flexible potentiometer whose resistance increases as the bend angle increases. Unlike the e-textile, which did not return to a static level after deformation and was prone to noise, the flex sensor performed reliably under bending and returned to a stable level when straight.

A variety of positions for the sensor around the neck, chin, and side of the face were explored, with the neck proving the most practical in terms of data collection and ease of wear.

The hardware of the final prototype consists of an inexpensive (approximately USD 10) flex sensor, whose change in resistance signals a change in its bend. The flex sensor was held against the neck by weaving it under a small piece of paper taped to the neck. An Arduino microcontroller collected the data from the sensor and wirelessly transmitted it to a smartphone for processing and display. Both an Arduino Nano and an Arduino Mega 2560 were used in the experiments.

A simple moving average (SMA) filter was used to smooth the measured resistance signal. An SMA filter replaces the current data value with the unweighted mean of the k previous points in the data stream, in effect smoothing the data by flattening the impact of noise and artifacts that lie outside the broader trend of the data. As the window size is decreased, the smoothing effect is reduced; in this application, a window size that is too small can result in artifacts and/or noise in the time-series data being improperly classified as a neck movement event. As the window size is increased, the impact of noise and artifacts is reduced, but the likelihood that relevant information is filtered out increases; with a window size that is too large, there is the risk of delaying the recognition of neck movement events or even missing the events altogether. A window size of k = 40 was selected, which roughly maps to one second of data.
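As a concrete illustration, the following is a minimal sketch of such a causal SMA filter in Python; the sample values are hypothetical, and the window is taken over the most recent k samples in the stream:

```python
from collections import deque

def sma_filter(stream, k=40):
    """Causal simple moving average: each output is the unweighted
    mean of the (up to) k most recent samples seen so far."""
    window = deque(maxlen=k)
    for sample in stream:
        window.append(sample)
        yield sum(window) / len(window)

# Example: smoothing a short run of hypothetical resistance readings
# containing one noise spike, with a small window for illustration.
raw = [512, 514, 980, 515, 513, 516]
smoothed = list(sma_filter(raw, k=3))
print(smoothed)
```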

### **4. Head Tilt Detection**

In a series of experiments, two types of flex sensors in a variety of positions on the neck are evaluated to determine the feasibility of differentiating and classifying head tilt and positioning.

In the experiments conducted, both a short sensor in three different positions and a long sensor were considered. For each sensor and placement, 10 experiments were conducted per head tilt, each with a duration of 30 s. The tilts were held static for the entire 30 s. For each experiment, approximately 1100 data points were collected.

### *4.1. Flex Sensor Types and Placement*

Two types of flex sensors are considered: a short sensor and a long sensor. With the short sensor, three different placements are considered: a low placement, a center placement, and a high placement. The low placement is at the bottom of the neck, closest to the collar, as shown in Figure 2a. The center placement is directly over the larynx, at the middle of the neck, as shown in Figure 2b. The high placement is at the top of the throat, closest to the chin, as shown in Figure 2c. The long sensor spans the three positions along the neck, from the base of the neck to under the chin, as shown in Figure 3.

**Figure 2.** (**a**) Low, (**b**) center, and (**c**) high placement of the short flex sensor along the center line of the neck.

**Figure 3.** The placement of the long flex sensor along the center line of the neck.

### *4.2. Data Visualization*

We visualize here some of the data collected across the various placements of the sensors and for different head tilts. Figures 4–6, respectively, display the collected resistance data over a 30-s time frame for the first three classes of head tilts, namely down, forward/no tilt, and up, for each placement of the short sensor, namely the low, center, and high placements. Figure 7 displays the collected resistance data over a 30-s time frame for the long sensor, across the same three classes of head tilts. The data shown have been filtered using a moving average filter.
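For readers wishing to reproduce this kind of plot, a minimal sketch follows; the traces are synthetic stand-ins for the filtered recordings, and the sampling rate is derived from the roughly 1100 points collected per 30-s trial:

```python
import numpy as np
import matplotlib.pyplot as plt

fs = 1100 / 30.0                      # ~36.7 Hz, from ~1100 points per 30-s trial
t = np.arange(0, 30, 1 / fs)

# Synthetic stand-ins for the filtered resistance trace of each tilt class;
# the offsets and noise levels are illustrative, not measured values.
traces = {
    "down":    30 + np.random.randn(len(t)),
    "forward": 40 + np.random.randn(len(t)),
    "up":      55 + np.random.randn(len(t)),
}

for label, data in traces.items():
    plt.plot(t, data, label=label)

plt.xlabel("Time (s)")
plt.ylabel("Filtered resistance (arbitrary units)")
plt.legend()
plt.show()
```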

**Figure 4.** Filtered head-tilt data with the low placement of the short sensor.

**Figure 5.** Filtered head-tilt data with the center placement of the short sensor.

**Figure 6.** Filtered head-tilt data with the high placement of the short sensor.

**Figure 7.** Filtered head-tilt data with the long sensor.

The short sensor in the low placement and the long sensor (Figures 4 and 7, respectively) show the clearest distinction between the three classes. Therefore, these two configurations were further evaluated using all five classes of head tilts, namely down, forward, up, right, and left. The collected resistance data over a 30-s time frame are shown in Figures 8 and 9, respectively.

**Figure 8.** Filtered head-tilt data with the low placement of the short sensor, with right and left tilts added.

**Figure 9.** Filtered head-tilt data with the long sensor, with right and left tilts added.

### *4.3. Head Tilt Detection Machine Learning Results*

We first evaluated the accuracy of classifying a three-class dictionary of head tilts and then the accuracy of classifying an expanded five-class dictionary. The classification results are presented in this subsection.

Three different classical machine learning (ML) classifiers were considered, specifically logistic regression, support vector machine (SVM), and random forest. The labeled dataset was partitioned into a training set and a held-out test set with an 80:20 ratio. To assess the consistency of the models, *k*-fold cross-validation was performed: fivefold cross-validation on the training set, with a random fourth of the examples in each training fold used for validation during hyper-parameter tuning. For all the classical ML models, the Scikit-learn library in Python was used.
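A minimal sketch of this evaluation protocol using Scikit-learn follows; the feature matrix, labels, and hyper-parameter grid are placeholders, as the tuned parameters are not detailed here:

```python
import numpy as np
from sklearn.model_selection import (GridSearchCV, ShuffleSplit,
                                     cross_val_score, train_test_split)
from sklearn.ensemble import RandomForestClassifier

# Placeholder feature matrix and head-tilt labels standing in for the
# collected sensor data (three classes: down, forward/no tilt, up).
rng = np.random.default_rng(0)
X, y = rng.random((500, 10)), rng.integers(0, 3, 500)

# 80:20 partition into training and held-out test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

# Hyper-parameter tuning: a random fourth of each training fold is
# held out for validation. The parameter grid here is an assumption.
inner = ShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
model = GridSearchCV(RandomForestClassifier(random_state=0),
                     param_grid={"n_estimators": [50, 100, 200]},
                     cv=inner)

# Fivefold cross-validation of the training set to check consistency.
scores = cross_val_score(model, X_train, y_train, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Final fit on the full training set, then held-out test accuracy.
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.3f}")
```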

All four configurations, i.e., the long sensor and the three (low, center, and high) placements of the short sensor, were evaluated using the three head tilts (down, forward/no tilt, and up).

Table 1 displays the fivefold accuracy by model and sensor placement. In all cases, logistic regression was insufficient for classifying the three-class dictionary. The short sensor in the low placement and the long sensor yielded the best results. In both cases, random forest is the best-performing model, with test accuracies reaching ~83.4% and ~96% for the short sensor in the low placement and the long sensor, respectively.

**Table 1.** Fivefold training, cross-validation, and held-out test accuracy of classical ML models with different feature sets. The bold font denotes the cases with the highest accuracy for that model. These results are for the three-class dictionary.


Two additional classes were then added to the best-performing configurations: the user's head facing right and the user's head facing left.

Table 2 shows the performance of the short sensor with low placement and of the long sensor when classifying against this five-class dictionary. As with the previous results, random forest had the best performance, with a test accuracy of ~83% for the short sensor and ~91% for the long sensor.

**Table 2.** Fivefold training, cross-validation, and held-out test accuracy of classical ML models with different feature sets. The bold font denotes the cases with the highest accuracy for that model. These results are for the five-class dictionary that includes facing right and facing left.


Table 3 shows the confusion matrix for the short sensor with low placement with the random forest classifier. The largest source of misclassifications is the up data points, with only 65 out of 157 labels predicted correctly.

**Table 3.** Five-class confusion matrix for the short sensor with low placement. Rows represent actual class and columns represent predicted class.


Table 4 shows the confusion matrix for the long sensor using the random forest classifier. With the long sensor, only 17 out of 182 up data points are mislabeled. The largest confusion is between left and right tilts.

From the confusion matrices, a neck gesture language can be created. The most frequent or most important gestures can be assigned to the head tilts that achieve the highest classification accuracy, in terms of both sensitivity and specificity. For example, the following mapping of neck gestures would be appropriate for the social media app Instagram: while on their feeds, users would tilt their heads forward to signal scrolling and would turn their heads to the side, either right or left, to 'like' an image.
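Such a mapping could be expressed as a simple lookup from predicted tilt class to application action; the sketch below is illustrative, and the class names and actions are assumptions:

```python
# Hypothetical mapping from predicted head-tilt class to an app action.
GESTURE_MAP = {
    "forward": "scroll_feed",
    "right": "like_image",
    "left": "like_image",
}

def dispatch(predicted_tilt):
    # Tilts with lower classification accuracy (e.g., 'up' for the short
    # sensor) are deliberately left unmapped, so that misclassifications
    # do not trigger unintended actions.
    return GESTURE_MAP.get(predicted_tilt, "no_action")
```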


**Table 4.** Five-class confusion matrix for the long sensor. Rows represent actual class and columns represent predicted class.

### **5. Speech and Mouth Movement Detection**

In this section, we explore a larger range of opportunities that the neck-mounted sensor can provide, in addition to the head gesture detection detailed in Section 4. Section 5.1 addresses speech detection using the prototype, differentiating speech from static breathing. Section 5.2 addresses mouth movement classification, namely determining how many times the mouth has been opened and closed. Section 5.3 tackles the challenging task of speech classification using only the detection of movement in the neck.

Speech and mouth movement detection provide contextual information that can be used to trigger or to mute the head tilt interface. For instance, if the system detects that the user is talking, then the user's head tilts are not relayed to the application software.
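A sketch of this gating logic follows; the function and event names are illustrative:

```python
def relay_tilt(tilt_event, is_talking):
    """Relay a recognized head tilt to the application unless the
    user is currently speaking (hypothetical gating logic)."""
    if is_talking:
        return None        # mute the head-tilt interface during speech
    return tilt_event      # otherwise pass the event to the application
```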

### *5.1. Speech Detection*

Figure 10 shows an example sensor reading from static breathing and from talking, specifically saying 'hello', on the same graph. The visualization demonstrates that the presence of speech can potentially be differentiated from static breathing using only the data collected from the flex sensor on the neck-mounted prototype.

**Figure 10.** Sensor readings from static breathing and saying 'hello'.

Using the neck-mounted prototype, an experiment was conducted to determine whether static breathing can indeed be differentiated from speech. Three-second-long samples of both static breathing and of saying 'hello' were collected with the prototype's flex sensor. A total of 60 samples, 30 of each class, were collected. The samples were classified using k-nearest neighbors (k-NN) with dynamic time warping (DTW), with k set to 3.

Dynamic time warping measures the similarity between two time-series signals, which may vary in speed and in length. It computes the minimal distance between the signals, allowing for warping of the time axis, with similar signals incurring a lower cost than dissimilar signals.

Each test signal is compared against all the training signals, and the DTW cost between the test signal and each training signal is calculated. The k nearest neighbors, i.e., the training signals with the lowest DTW costs, are then used to classify the test signal.
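A minimal sketch of the DTW cost computation and the k-NN classification described above follows; this is a direct dynamic-programming implementation, and in practice an optimized library implementation could be substituted:

```python
import numpy as np

def dtw_cost(a, b):
    """Minimal-cost alignment of two 1-D time series, allowing
    warping of the time axis; lower cost means more similar."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_dtw_predict(test_signal, train_signals, train_labels, k=3):
    """Label a test signal by majority vote among the k training
    signals with the lowest DTW cost."""
    costs = [dtw_cost(test_signal, s) for s in train_signals]
    nearest = np.argsort(costs)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```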

Table 5 shows the confusion matrix for the classification results. The overall accuracy of the classification was 83.3% with 3 of the 30 talking samples misclassified as breathing.

**Table 5.** Two-class confusion matrix for static breathing and talking. Rows represent actual class and columns represent predicted class.


### *5.2. Mouth Movement Classification*

In another experiment, the classification of mouth movements without the generation of any sound was examined. The mouth was opened and closed without any sound being generated. A four-class dictionary was used: static breathing (no mouth movement), opening and closing the mouth once, opening and closing the mouth twice, and opening and closing the mouth three times.

Three-second-long samples were collected with the prototype's flex sensor, for a total of 60 samples, 15 of each class. The samples were classified using k-nearest neighbors (k-NN) with dynamic time warping, with k set to 3.

Table 6 shows the confusion matrix for the classification results. The overall accuracy of the classification was 67.5%. Most of the misclassifications involved the static breathing class. By considering each sample's peak-to-valley amplitude, this misclassification could be decreased.
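As a sketch, such an amplitude gate could precede the DTW-based classification; the threshold value is an assumption that would be tuned on training data:

```python
import numpy as np

def peak_to_valley(sample):
    """Peak-to-valley amplitude of a filtered 3-s sample."""
    return float(np.max(sample) - np.min(sample))

def looks_like_static_breathing(sample, threshold=5.0):
    # The threshold is hypothetical and would be tuned on training data;
    # static breathing should produce a much smaller peak-to-valley
    # amplitude than opening and closing the mouth.
    return peak_to_valley(sample) < threshold
```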

**Table 6.** Four-class confusion matrix for mouth movements. Rows represent actual class and columns represent predicted class.


### *5.3. Speech Classification*

The final experiments explored speech classification. Two speech classification experiments were carried out, each with a set of four different sentences or phrases spoken while the prototype was affixed to the neck and the flex sensor captured the neck activity.

For each of the two experiments, three-second-long samples were collected with the prototype's flex sensor. For the first experiment, with sentences, a total of 40 samples were collected, 10 of each class. The sentences used in the experiment were "I am a user who is talking right now"; "This is me talking with a sensor attached"; "Who am I talking to at this very moment?"; and "Can you recognize what I am saying while attached to a sensor?" For the second experiment, with famous idioms, a total of 80 samples were collected, 20 of each class. The idioms used in the experiment were "a blessing in disguise"; "cut somebody some slack"; "better late than never"; and "a dime a dozen." The samples were classified using k-nearest neighbors (k-NN) with dynamic time warping, with k set to 3.

Tables 7 and 8 show the confusion matrices for the classification results of the two experiments. The overall accuracies were 62.5% and 32.5%, respectively.

**Table 7.** Four-class confusion matrix for spoken sentences. Rows represent actual class and columns represent predicted class.


**Table 8.** Four-class confusion matrix for spoken phrases. Rows represent actual class and columns represent predicted class.

