Article

Voice-Controlled Intelligent Personal Assistant for Call-Center Automation in the Uzbek Language

1 Department of Computer Engineering, Gachon University, Sujeong-gu, Seongnam-si 13120, Republic of Korea
2 Department of Information Technologies, Samarkand Branch of Tashkent University of Information Technologies Named after Muhammad al-Khwarizmi, Tashkent 140100, Uzbekistan
* Author to whom correspondence should be addressed.
Electronics 2023, 12(23), 4850; https://doi.org/10.3390/electronics12234850
Submission received: 6 October 2023 / Revised: 22 November 2023 / Accepted: 29 November 2023 / Published: 30 November 2023
(This article belongs to the Section Computer Science & Engineering)

Abstract
The demand for customer support call centers has surged across various sectors due to the pandemic. Yet, the constraints of round-the-clock human services and fluctuating wait times pose challenges in fully meeting customer needs. In response, there is a growing need for automated customer service systems that can provide responses tailored to specific domains and in customers' native languages, particularly in developing nations such as Uzbekistan, where call center usage is on the rise. Our system, "UzAssistant," is designed to recognize user voices and accurately present customer issues in standardized Uzbek, as well as vocalize the responses to voice queries. It employs feature extraction and recurrent neural network (RNN)-based models for effective automatic speech recognition, achieving 96.4% accuracy in real-time tests with 56 participants. Additionally, the system incorporates a sentence similarity assessment method and a text-to-speech (TTS) synthesis feature specifically for the Uzbek language. The TTS component utilizes the WaveNet architecture to convert Uzbek text into speech.

1. Introduction

1.1. Research Context and Motivation

The growing popularity of artificial intelligence (AI) has been notably marked by the rise of voice assistants (VAs). Prominent examples include Amazon Echo, Google Assistant, Microsoft Cortana, and Apple Siri. These AI-driven voice assistants are revolutionizing how people engage with technology. McCue’s research indicates that 27% of internet users globally utilize voice search [1,2], and forecasts suggest a doubling in the use of in-home voice assistants from 2018 to 2023 [3]. Many specialists believe that voice assistants will complement traditional computing devices like PCs and laptops, especially for practical shopping tasks [4]. Despite ongoing concerns about privacy and security, the use of voice assistants is on the rise. Thanks to advancements in natural language processing and machine learning, these assistants are capable of conducting complex conversations and handling multiple tasks simultaneously [5]. The continued development of this technology is likely to bring significant changes in the way humans interact with machines.
As technology continues to advance, an increasing number of people are turning to virtual voice assistants to help them with simple tasks and answer basic questions. With the integration and development of voice recognition and natural language processing algorithms, these systems have become increasingly automated and efficient. The ability to control multiple functionalities of smart devices through voice commands has made voice assistants such as Siri, Alexa, Cortana, and Google Assistant convenient and essential parts of our daily lives.
Automatic speech recognition and synthesis systems are widely used in various call center applications such as automatic call distribution (ACD), interactive voice response (IVR) systems, and personnel management programs. These systems enable the use of intelligent voice (IVR) menus to handle incoming calls, automatic telephone calls to customers with a voice interface, conversion of phone conversations to text, real-time recommendations to call center employees during conversations, recognition of customer emotions during phone conversations, and improved employee productivity. With the help of these systems, call centers can efficiently handle tasks such as evaluating customer satisfaction with the quality of the service provided.

1.2. Research Aims and Contributions

Our research indicates that combining various neural network architectures can significantly enhance the precision of automated customer service systems for the Uzbek language and its various dialects. By employing a blend of RNN encoder-decoder, DNN-CTC, E2E-transformer, and E2E-conformer models, we are able to develop both statistical and neural network-driven language models within automatic speech recognition (ASR) frameworks. These models are adept at accurately identifying customer vocal requests and delivering responses tailored to the Uzbek context [6,7,8,9,10,11,12]. Such an approach is poised to not only boost customer satisfaction but also enable call centers to manage a larger influx of customer queries more effectively.
In the era of AI-supported management, many routine tasks are no longer difficult. Voice programs are also useful for determining a destination, translating text, and finding necessary information through a simple voice message. However, such systems typically require users to speak either English or Russian [13,14,15,16].
Live conversations are the norm when communicating with clients in call centers. This means that speech recognition technology needs to work in quasi-real time or even in real time. Some of the main benefits of using speech recognition systems in call center services include improving call center efficiency by reducing the time required to handle calls, enhancing customer service by providing more accurate information and faster responses, lowering the risk of errors and misunderstandings during conversations, and allowing call center operators to focus more on conversations and less on taking notes [17,18,19,20].
In call centers, communication between the operator and the client is carried out in the form of live conversations, so speech recognition must occur on a quasi-real-time scale if not in real time. The main purposes of speech recognition systems in call center services [21,22,23,24,25] include:
  • A significant reduction in waiting (handling) time, which lowers labor costs.
  • Reducing call time by a factor of 1.5–2 by shortening the time the operator spends entering information.
  • Reducing operators' working time on complex calls, owing to the ability to answer simple questions automatically.
  • Creating the ability to work with customers 24/7 (even on holidays).
  • Verifying customers' voices by asking one or two simple questions, which is especially important in the banking sector to protect against theft of personal cards and confidential documents.
  • The ability to handle a large number of short calls (e.g., in bookmakers' call centers).
  • The ability to replace a complex and error-prone IVR system operating in tone mode.
  • The ability to use speech recognition as a source of additional information, not only during conversations but also during further analysis of the call. In particular, this analysis helps increase the main indicator, first call resolution (FCR).
  • Problem resolution in a single call, which reduces callbacks and simultaneously increases customer satisfaction, in turn lowering operating costs.
Through these articulated aims, this study aims to address the gaps in the existing literature, offer efficient solutions for the present challenges, and pave the way for future advancements in the field of natural language processing (NLP). The key contributions of this study are as follows:
  • Speaker recognition in varied environments: The study tested a speaker recognition module in different environments, observing the accuracy of the module under varying conditions. For example, we achieved an impressive 96.4% accuracy in real-time tests with 56 participants.
  • Use of the Deep Speech 2 model: We used the Deep Speech 2 model for extracting MFCC features from utterances.
  • Automatic speech recognition (ASR) model: The ASR model used in the study converts consumer speech into text. It was trained using an RNN-based end-to-end speech recognition architecture on large Uzbek-automatic speech recognition training data (USC).
  • Sentence summarizing: The system includes a sentence summarizing component that uses the BERT sentence transformer for embedding sentences and search queries, achieving an average accuracy of 85.27%.
  • Database management: The system incorporates three types of databases—personal information database (PID), generic information database (GID), and credential information database (CID), which play crucial roles in managing user data and queries.
  • Development and implementation of an Uzbek speech synthesizer rooted in natural voice for call centers:
    Objective: To seamlessly integrate a speech synthesizer calibrated to the phonetic intricacies of the Uzbek language into the telephonic interfaces of call centers.
    Operational mechanism: Upon receiving a call, the synthesized voice mechanism initiates a dialogue with the caller, efficiently garnering the requisite information, and thus minimizing the preliminary conversational stages traditionally facilitated by human operators.
    Anticipated impact: The incorporation of this synthesizer is projected to considerably alleviate the operational burdens shouldered by call center representatives, rendering the process more streamlined and expeditious.
  • Challenges and practical implications of speech recognition in public service call centers:
    Context: The burgeoning integration of speech synthesizers into public service-oriented call centers is challenging. Their primary function often pivots on vocalizing the results stemming from voice-initiated queries, such as ascertaining the status of administrative applications.
    Technical nuances: While the conceptual frameworks of these systems are undeniably innovative, it is imperative to comprehend the intricacies associated with their seamless operation, ranging from linguistic variations to background noise interference.
    Operational benefits: Despite the potential obstacles, the judicious deployment of such systems can increase the efficacy of voice response mechanisms, thereby ensuring that callers receive precise and timely information.
  • Enhancing call center efficiency through automated speech recognition and synthesis:
    Premise: The swift and accurate resolution of client inquiries is at the heart of contemporary call center dynamics. Therefore, automated speech recognition and synthesis are of paramount importance.
    Research scope: This study delves into the ramifications of implementing these speech technologies, particularly in contexts that require simple procedural updates, such as tracking the status of an application.
    Projected outcomes: Preliminary data suggest that astute deployment of these systems could precipitate a reduction in manual operator involvement by a substantial 20–25%. Furthermore, it paves the way for uninterrupted 24/7 customer service, bolstering operational efficiency and augmenting customer satisfaction.

1.3. Structure of the Paper

The structure of this paper is laid out in the following manner: Section 2 provides an overview of the current prevalent methods. In Section 3, we delve into the specific methodology employed by the Uzbek voice-controlled intelligent personal assistant. Section 4 focuses on the deployment and evaluation of our proposed system, offering a comparative analysis with existing methods. The paper concludes with Section 5, where we encapsulate the main findings and summarize the key points of our discussion.

2. Related Work

Recently, the use of AI-driven solutions in business operations has skyrocketed [26]. This infusion of AI introduces advanced cognitive features similar to human abilities, including automation, image recognition, problem solving, and informed decision making [27]. These features are brought to life through tools such as chatbots, intelligent virtual interfaces, robotic machinery, and other digital aids [28]. These tools serve a dual purpose. They can elevate individual productivity by replacing human components in certain tasks. This makes them invaluable in areas such as education, healthcare, management, and industrial production [27]. For example, AI-powered data management systems can supersede conventional record management, aiding healthcare professionals in organizing and analyzing patient data for better decision making. In healthcare, robots can help with surgical procedures, attend to elderly patients, and oversee medication regimes [29]. Within the industrial realm, AI adoption can streamline production processes leading to higher output [30]. In terms of handling data, AI’s capacity to swiftly process and depict intricate data boosts organizational efficacy and simplifies decision-making processes [31]. The swift processing ability of AI systems surpasses human limitations [32]. However, the growing dependency on machines and AI has ushered in debates about their ethical and moral consequences [33].
Schwenk et al. [34] pioneered the application of artificial neural networks (ANNs) in language modeling, contrasting an ANN-driven n-gram model with a refined Kneser–Ney smoothed approach, informed by a corpus exceeding 550 million words. Instead of utilizing the entire vocabulary, they focused on the most frequently used words for their ANN-based language model (LM). Their approach involved training a neural network on a large dataset by randomly selecting text segments for each training iteration. For speech recognition, an n-gram LM was employed, while a neural network-based LM was used for reevaluating word sequences, achieving a 0.5% decrease in word misrecognition. Mikolov et al. [35] introduced an RNN-based language modeling approach to streamline training. They categorized less common words into unique groups based on frequency of occurrence. In their speech recognition experiments, they utilized a 5-gram LM with Kneser–Ney smoothing as a baseline, then reevaluated the top 100 predictions using an RNN-based LM. This RNN implementation resulted in an 18% reduction in word error rate (WER) compared to the 5-gram LM, while simultaneously reducing the model’s complexity.
Huang et al. [36] developed a recurrent neural network (RNN) based language model (LM) for the initial decoding stage in Bing’s voice search. They recommended employing the RNN-based LM particularly when the n-gram LM’s predictions were significantly accurate. To boost processing efficiency, they incorporated a key-value hash table cache. This approach lowered the word error rate (WER) from 25.3% to 23.2%. Additionally, they enhanced the system by reweighting recognition lattices using the RNN-based LM. Optimal results were achieved by combining the RNN-based LM with a foundational 4-gram model for lattice generation, followed by rescoring with a similar model, which brought the WER down to 22.7% with an interpolation coefficient of 0.3. Sundermeyer et al. [37] investigated the efficacy differences between LMs using feedforward artificial neural networks (ANNs) and RNNs. They tested three neural network LM setups: (1) a feedforward ANN built with LIMSI-2013 software, focusing on commonly used words; (2) a clustering approach with a feedforward ANN using the complete word pair; and (3) clustering with an RNN. These LMs were trained on a 27 million word corpus, forming 200 classes for ANN clustering based on word frequency, with hidden layer sizes ranging from 300 to 500 units, adjusted according to validation data results. They used an n-gram model for deriving the LM from an ANN system, achieving a WER reduction of 1.5% in training and 1.4% in testing. In these evaluations, feedforward ANNs were less effective than RNNs, with the RNN showing a 0.4% enhancement over the feedforward ANN in test scenarios. Morioka et al. [38] introduced an LM that utilizes variable length contexts. In speech recognition tests with an extensive dictionary, this model demonstrated a reduction in both perplexity and WER.
In a seminal research endeavor, Hardy et al. [39] proposed a dialogue management system characterized by its ability to facilitate spoken language interactions with users. This system, which seamlessly integrates automatic speech recognition, text-to-speech synthesis, a sophisticated dialogue manager, and an expansive information database, holds promise for revolutionizing telephone-based automated customer service paradigms. In a subsequent scholarly investigation, Zweig et al. [40] introduced an avant-garde quality monitoring apparatus tailored for call centers. This innovative mechanism amalgamates the advantages of the speaker recognition module, maximal entropy classification, state-of-the-art pattern recognition technology, and automatic speech recognition, thereby promising unparalleled robustness. Venturing further into the realm of customer experience, Mclean et al. [41] embarked on an empirical study that leveraged a web-based survey methodology, garnering insights from 302 participants. This study sought to unravel the intricate tapestry of the determinants underpinning customer satisfaction in real-time chat service encounters. In another academically rigorous study, Warnapura et al. [42] proposed an AI-infused architecture engineered to deliver diverse information modalities to customers, spanning texts, voice outputs, and emails. Harnessing the power of sentiment analysis for user response classification in conjunction with the capabilities of natural language processing (NLP) and automatic speech recognition, they conceived an automated system of remarkable resilience and efficacy.
Mansurov and his team [43] recently introduced UzBERT, a BERT-based model tailored for the Uzbek language. They developed this model using a specially compiled news corpus of over 142 million words. The model’s effectiveness in masked language modeling was benchmarked against the multilingual BERT (mBERT). UzBERT was trained with objectives like masked language modeling (MLM) and next sentence prediction (NSP), and it incorporated hyperparameters such as a dropout probability of 0.1 and a GeLU activation function. The model’s architecture mirrored the original BERT design, featuring 12 layers, 768 hidden units, 12 attention heads, and a total of 110 million parameters, with a vocabulary size of 30,000 tokens. Out of the 142 million words in the dataset, 140 million were allocated for training, while the remaining 2 million were reserved for validation.
Despite the growing popularity and advancements in hybrid CTC/attention ASR systems, particularly in low-resource languages, their application to Central Asian languages like Turkish and Uzbek remains limited. Ren et al. [44] introduced a novel feature extraction method using CNNs, termed multiscale parallel convolution (MSPC). This technique utilizes convolution kernels of varying sizes to capture features at different scales, combined with a bidirectional long short-term memory (Bi-LSTM) network to boost the accuracy and stability of the end-to-end model. They also incorporated a fine-tuned BERT model to initialize their RNN language model, integrating it during the decoding phase.
Further exploring the End2End approach for the Uzbek language, Mamatov et al. [45] developed an Uzbek speech recognition system. Their approach involved evaluating existing speech recognition methods to identify the most effective one. They trained various models using a diverse dataset, including 432 h of audiobook recordings and 72 h of audio clips featuring sayings and maxims, voiced by a total of 174 speakers, to create an extensive database.
Several methods, such as Doc2Vec and Word2Vec, have been used by scholars to gauge the similarity between sentences. Doc2Vec, built on the foundation of the Word2Vec model, is adept at encapsulating the semantic essence of sentences or paragraphs [46]. Similarly, Word2Vec is a neural network-driven model that is proficient in depicting words within a high-dimensional space, thus encapsulating their semantic nuances and contextual significance [47].
In conclusion, despite the substantial academic attention garnered by natural language processing (NLP) and speech recognition disciplines, their empirical integration into call-center milieus remains conspicuously under-examined. However, it is plausible to posit that research findings from cognate domains, when judiciously adapted to the nuances of call center operations, could offer valuable insights.

3. Methodology

3.1. Workflow

Our methodology was meticulously designed to optimize results and achieve our set goals. The project was divided into four principal segments: speech recognition, text summarization, sentence similarity analysis, and text-to-speech (TTS) conversion. For the speech recognition part, we utilized an advanced deep learning framework suitable for this purpose (Deep Speech 2). Text summarization was handled using the seq2seq model, while the doc2vec model was instrumental in assessing sentence similarity. Lastly, for converting text into spoken words, we employed TTS technology, specifically using the WaveNet deep learning framework, which is known for its high-quality speech synthesis capabilities (as shown in Figure 1).
Text-to-speech (TTS) technology simulates human-like speech by transforming written text into audible sound through advanced machine learning methods. This technology is particularly useful for developing voice-operated robots and interactive voice response (IVR) systems, offering businesses a cost-effective and efficient solution by automating sound generation and eliminating the need for manual audio recording and editing.
The quality of TTS-generated speech has significantly improved, achieving a natural sound through meticulous refinement of various elements such as tone, smoothness, accent placement, pauses, and intonation. There are two primary methods for achieving this: Concatenative TTS, which stitches together pre-recorded audio snippets, and Parametric TTS, which uses a probabilistic model to determine the acoustic characteristics of a sound signal based on the input text. Concatenative TTS is known for its high-quality output but requires extensive data for training the machine learning models. Parametric TTS, in contrast, can produce speech that closely resembles human speech with less data requirement [48].

3.2. Recognition of Speakers

In our speaker recognition approach, we utilized the Mel-frequency cepstral coefficients (MFCC) method [49] for extracting features from audio signals, focusing primarily on low-frequency components. Additionally, we developed a feature-matching technique that works in tandem with a maximization algorithm based on the Gaussian mixture model (GMM) [50]. For the purpose of speaker adaptation, our system collects a 20-s voice sample from users. It then employs feature extraction techniques to create a unique GMM profile for each individual, which aids in making precise similarity assessments. Furthermore, we implemented a threshold mechanism to identify and differentiate unregistered speakers. This significantly improves the system’s accuracy by reducing the chances of misidentification. The functioning of the speaker recognition module is depicted in Figure 2.
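For concreteness, the sketch below outlines this MFCC-plus-GMM enrollment and identification flow in Python. The library choices (librosa, scikit-learn), the number of mixture components, and the rejection threshold are illustrative assumptions rather than the exact configuration used in UzAssistant.

```python
# Minimal sketch of the MFCC + GMM speaker-recognition idea described above.
# librosa / scikit-learn, the mixture count, and the threshold are assumptions.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=20):
    """Load audio and return frame-level MFCC features (frames x coefficients)."""
    signal, _ = librosa.load(wav_path, sr=sr)
    return librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc).T

def enroll_speaker(wav_path, n_components=16):
    """Fit a per-speaker GMM on the ~20 s enrollment sample."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    gmm.fit(extract_mfcc(wav_path))
    return gmm

def identify(wav_path, speaker_gmms, reject_threshold=-45.0):
    """Score a test utterance against every enrolled GMM; reject unknown speakers."""
    feats = extract_mfcc(wav_path)
    scores = {name: gmm.score(feats) for name, gmm in speaker_gmms.items()}
    best = max(scores, key=scores.get)
    # Threshold mechanism for unregistered speakers (value is illustrative only).
    return best if scores[best] > reject_threshold else "unknown"
```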

3.3. Automatic Speech Recognition

We utilized cutting-edge technology to ensure the accurate and efficient transcription of consumer speech. We employed an ASR model trained on a publicly available annotated Uzbek voice corpus dataset using a state-of-the-art architecture called Deep Speech 2 [51]. This architecture utilizes HPC techniques and batch normalization to achieve 7× faster training compared to its predecessor, while employing a unique optimization curriculum known as SortaGrad [50].
\hat{l} = \arg\max_{l \in \mathrm{Align}(x, y)} \prod_{t=1}^{T} p_{\mathrm{ctc}}(l_t \mid x; \theta)
To synchronize the text transcription with the input frames, we employ the function Align(x,y), which maps every potential pairing of characters from the transcription y with frames in the input x.
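To illustrate how per-frame CTC probabilities of this kind are turned into text, the sketch below performs standard greedy CTC decoding (argmax per frame, then collapsing repeats and blanks). It is a generic example rather than the Deep Speech 2 decoder itself; the blank index and the toy alphabet are assumptions.

```python
# Greedy CTC decoding sketch consistent with the alignment equation above.
import torch

BLANK = 0  # index of the CTC blank symbol (assumption)

def greedy_ctc_decode(log_probs, idx2char):
    """log_probs: (T, num_labels) per-frame log-probabilities from the acoustic model."""
    best_path = torch.argmax(log_probs, dim=-1).tolist()   # argmax over labels per frame
    decoded, prev = [], None
    for idx in best_path:
        if idx != prev and idx != BLANK:                    # collapse repeats, drop blanks
            decoded.append(idx2char[idx])
        prev = idx
    return "".join(decoded)

# Example with random scores over a 5-character toy alphabet.
idx2char = {1: "s", 2: "a", 3: "l", 4: "o", 5: "m"}
frames = torch.randn(50, len(idx2char) + 1).log_softmax(dim=-1)
print(greedy_ctc_decode(frames, idx2char))
```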

3.4. Summarization of Sentences

We developed a system that applies an Uzbek summarization of sentences method, aimed at shortening sentences and distilling key information from customer replies. This system harnesses the Seq2Seq [52] summarization method, enhanced with an attention mechanism, which successfully brought down the training loss to 0.001. The efficacy of this method was confirmed through our experimental evaluations, showing promising outcomes. The architecture of our summarization of sentences system is depicted in Figure 3 [52].
(1)
Data Collection and Processing Techniques
The Uzbek Bank speech corpus (UBSC v1.0) is a comprehensive dataset used for speech-to-text conversion. It includes 108 h of recorded Uzbek speech data in .wav format from 863 speakers of different ages, genders, dialects, education levels, and accents. The dataset was also used for text summarization and includes 35 k short articles and 24 k short summaries for evaluation purposes. Additionally, 322 questions were generated from the City Bank website and social network pages to measure sentence similarity.
After collecting the data, we processed them in four steps. First, the sentences were broken down into a series of tokens. Next, we expanded the contractions to their full forms and eliminated all stop words and punctuation marks. Following this, we implemented lemmatization to transform the words to their base forms. Subsequently, we categorized the words according to their parts of speech. These steps enhanced our ability to analyze the data thoroughly and extract significant insights.
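A minimal sketch of these four preprocessing steps is shown below. The Uzbek stop-word list, contraction map, suffix-based lemmatizer, and POS tagger are simplified placeholders, not the actual resources used in our pipeline.

```python
# Sketch of the four preprocessing steps: tokenization, contraction expansion and
# stop-word/punctuation removal, lemmatization, and POS tagging. All resources here
# are illustrative placeholders.
import re

UZ_STOP_WORDS = {"va", "ham", "bilan", "uchun"}        # illustrative subset only
CONTRACTIONS = {"b-n": "bilan"}                         # hypothetical expansion map

def naive_lemmatize(token):
    # Placeholder: strip a few common suffixes; a real system would use a morphological analyzer.
    for suffix in ("lari", "lar", "ning", "dan", "ga"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def naive_pos_tag(token):
    return "NOUN"                                       # placeholder tag

def preprocess(sentence):
    tokens = re.findall(r"[\w'-]+", sentence.lower())   # 1) tokenize (punctuation dropped)
    tokens = [CONTRACTIONS.get(t, t) for t in tokens    # 2) expand contractions,
              if t not in UZ_STOP_WORDS]                #    remove stop words
    lemmas = [naive_lemmatize(t) for t in tokens]       # 3) reduce words to base forms
    return [(t, naive_pos_tag(t)) for t in lemmas]      # 4) attach POS labels

print(preprocess("Kartadan pul yechish uchun nima qilish kerak?"))
```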
(2)
Model Architecture
During the training process, we employed an RNN encoder-decoder in tandem with the Seq2Seq approach, augmented by an attention mechanism, to effectively condense the articles. This architectural design comprises three primary segments: encoder, attention, and decoder modules. The function of the encoder is to transform a sequence into a consistent context vector, capturing the semantic abstraction of the entire article. This context vector acts as the foundational state for the decoder, interfacing with its hidden layers despite disparities in the timestamps of the encoder and decoder. Given an input sequence in which ‘a’ denotes the target set of sentences and ‘b’ the source set, the most likely word vector sequence is given by:
\arg\max_{b} \, p(a \mid b)
A sequence-to-sequence framework fortified with the Bahdanau attention mechanism was applied to align the fixed-length output. For the embedding process, we harness a pre-established Uzbek word vector, named the “uz w2c model”, which transmutes words into their numeric counterparts. The terminal representation of each word is vectorized, which is pivotal for model training.
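The additive (Bahdanau) attention step can be summarized by the PyTorch sketch below, in which a learned score aligns each decoder state with the encoder outputs and produces a context vector. The dimensions shown are illustrative assumptions, not the trained model’s exact sizes.

```python
# Compact sketch of Bahdanau (additive) attention between an RNN encoder and decoder.
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, attn_dim=128):
        super().__init__()
        self.w_enc = nn.Linear(enc_dim, attn_dim)
        self.w_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (batch, dec_dim); enc_outputs: (batch, src_len, enc_dim)
        scores = self.v(torch.tanh(
            self.w_enc(enc_outputs) + self.w_dec(dec_state).unsqueeze(1)))  # (batch, src_len, 1)
        weights = torch.softmax(scores, dim=1)                              # attention distribution
        context = (weights * enc_outputs).sum(dim=1)                        # weighted context vector
        return context, weights.squeeze(-1)

# Usage: the context vector is concatenated with the decoder input at each step.
attn = BahdanauAttention(enc_dim=256, dec_dim=256)
context, w = attn(torch.randn(2, 256), torch.randn(2, 40, 256))
```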

3.5. Text-to-Speech Synthesis

Our proposed model was designed to interact with customers by responding to a range of verbal inputs in an audio format. Given our focus on the state services center sector in Uzbekistan, it is noteworthy that the majority of users communicate in the Uzbek language. To convert text into audible speech, we employed a TTS module. Central to its efficacy is the use of DeepMind’s WaveNet [53], renowned for its optimal accuracy. The architecture of the WaveNet model is shown in Figure 4. In addition, the model can generate speech responses in real time, making it extremely useful for real-world applications.
WaveNet is an advanced neural network designed to produce raw audio and was developed by DeepMind, an AI company based in London. Presented in a September 2016 paper, WaveNet can create human-like voice sounds that are quite realistic by using a neural network trained on actual spoken voice data. When tested with American English and Mandarin, it surpassed Google’s top text-to-speech systems at the time. However, as of 2016, the synthesized speech from WaveNet was still not as authentic sounding as genuine human speech. The capability of WaveNet to generate raw audio waveforms allows it to replicate various types of sounds, encompassing both speech and music.
Creating a top-tier synthetic TTS database necessitates ensuring that the speech output from the source TTS model aligns precisely with phonetic pronunciations. To achieve this, we implemented a Tacotron 2 decoder, equipped with a phoneme alignment method [54]. This approach is adept at precisely synchronizing phoneme sequences with their corresponding acoustic features. In this setup, an external model dedicated to duration prediction determines the length of each phoneme based on linguistic attributes. Following this, the Tacotron 2 decoder is responsible for producing the relevant acoustic features. Subsequently, these features are transformed into speech signals by a WaveNet-based neural excitation vocoder. Within this vocoder, a WaveNet-based mixture density network [55] operates, adhering to the principles of human speech production mechanisms [56]. This results in the stable and accurate generation of speech signals [53,57].
Within the TTS framework, the advanced time-frequency trajectory excitation vocoder functions by capturing a range of features every 5 milliseconds [58]. This includes a diverse array of elements: line spectral frequencies spanning 40 dimensions, along with the fundamental frequency, energy levels, a voicing indicator, a 32-dimensional waveform exhibiting gradual evolution, and a rapidly altering 4-dimensional waveform. Collectively, these components constitute a detailed 79-dimensional feature vector.
The source TTS’s acoustic model is composed of three distinct parts: a context analyzer, a context encoder, and a Tacotron decoder. Initially, the context analyzer processes the input text to extract 354-dimensional phoneme-level linguistic feature vectors, which include 330 categorical and 24 numerical contexts. Following this, a duration predictor, comprising three fully connected layers with unit counts of 1024, 512, and 256, and an LSTM layer with 128 memory blocks, calculates the duration of each phoneme. These phoneme-level features are then scaled up to match frame-level dimensions. The context encoder further refines these features by passing the frame-level linguistic features through three convolution layers with 10 × 1 kernels and 512 channels, a bidirectional LSTM with 512 memory blocks, and fully connected layers with 512 units each. Subsequently, the Tacotron decoder, which includes a PreNet, PostNet, and a primary unidirectional LSTM, takes over to produce the acoustic features. The PreNet, consisting of two fully connected layers with 256 units each, processes the previously generated acoustic features. These outputs, along with those from a context-embedding module, are then routed through two unidirectional LSTM layers with 1024 memory blocks and followed by two projection layers with 79 units to create the acoustic features. Lastly, the PostNet, which is made up of five convolution layers with 5 × 1 kernels and 512 channels, incorporates residual elements into the acoustic features to enhance the precision of generation.
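To make the duration-prediction stage concrete, the sketch below mirrors the layer sizes quoted above (fully connected layers of 1024, 512, and 256 units followed by an LSTM with 128 memory blocks) in PyTorch. The framework choice and training details are assumptions for illustration only.

```python
# Duration-predictor sketch following the sizes described in the text.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, linguistic_dim=354):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(linguistic_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
        )
        self.lstm = nn.LSTM(256, 128, batch_first=True)
        self.out = nn.Linear(128, 1)   # predicted duration (in frames) per phoneme

    def forward(self, phoneme_feats):
        # phoneme_feats: (batch, num_phonemes, 354) phoneme-level linguistic features
        h, _ = self.lstm(self.fc(phoneme_feats))
        return self.out(h).squeeze(-1)

durations = DurationPredictor()(torch.randn(1, 12, 354))   # -> (1, 12)
```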
In setting up WaveNet, the dilation factors were arranged in a sequence from [2^0, 2^1, …, 2^9], and this sequence was repeated three times. This configuration led to the formation of 30 layers of residual blocks and a receptive field comprising 3067 samples. Within each of these residual blocks, convolution layers with 128 channels were utilized. The system was designed to output two dimensions, specifically to calculate the mean and the standard deviation for a Gaussian distribution. Additionally, a weight normalization method was employed, ensuring that all weight vectors were normalized to a standard length [5].
Enhancing the spectral definition of the synthesized speech involved applying a spectral domain sharpening filter, set with a coefficient value of 0.95, as a post-processing measure. Furthermore, for producing clearer speech audio, the scale parameter generated by WaveNet in the voiced sections was decreased by a ratio of 0.87.
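The dilation schedule and the resulting depth can be checked with the short sketch below. The kernel size of two and the receptive-field formula are common WaveNet conventions assumed here, so the computed value may differ slightly from the 3067 samples reported for our exact configuration.

```python
# Sketch of the dilation schedule: factors 2^0 ... 2^9 repeated three times -> 30 layers.
KERNEL_SIZE = 2                                  # assumption: typical causal WaveNet kernel
dilations = [2 ** i for i in range(10)] * 3      # [1, 2, 4, ..., 512] x 3

def receptive_field(dilations, kernel_size):
    # Each dilated causal convolution adds (kernel_size - 1) * dilation samples of context.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(len(dilations))                            # 30 residual blocks
print(receptive_field(dilations, KERNEL_SIZE))   # approximate receptive field in samples
```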

3.6. Database

The system utilized three distinct databases. The first, known as the personal information database (PID), gathers various user details such as name, national identity (NID) number, mobile number, date of birth, recorded response, speech-tailored GMM-based model, and the time of the call. This collected data is subsequently utilized for verification, where the system cross-references it with the personal information submitted by the user. Table 1 illustrates a sample layout of a PID.
Another component of the system is the generic information database (GID), which houses a collection of commonly asked questions (FAQs) pertinent to the e-commerce sector. New users, who do not yet have access to the credential information database (CID) that safeguards sensitive information, primarily interact with the GID. Table 2 and Table 3 display examples of the GID’s contents and the layout of the CID, respectively.
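A minimal sketch of this three-database layout, using SQLite purely for illustration, is given below. Column names follow Tables 1–3; the storage engine and exact schema are our assumptions.

```python
# Illustrative schema for the PID / GID / CID databases (SQLite chosen for the sketch).
import sqlite3

conn = sqlite3.connect("uzassistant.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS pid (          -- personal information database
    full_name TEXT, nid_number TEXT, birth_date TEXT,
    response_text TEXT, gmm_model_path TEXT, call_time TEXT);
CREATE TABLE IF NOT EXISTS gid (          -- generic information database (FAQ)
    user_response TEXT, solution TEXT);
CREATE TABLE IF NOT EXISTS cid (          -- credential information database
    user_response TEXT, domain TEXT, predefined_questions TEXT);
""")
conn.execute("INSERT INTO gid VALUES (?, ?)",
             ("Karta PIN kodini unutdim?",
              "Karta ochilgan bank filialiga murojaat qilishingiz lozim."))
conn.commit()
```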

4. Experiment and Result Analysis

4.1. WaveNet Model Accuracy

In this study, we employed the character error rate (CER) as a benchmark for evaluation. The CER gauges the efficacy of an automatic speech recognition (ASR) system, reflecting the proportion of characters inaccurately identified. A lower CER signifies superior performance, with a rate of zero indicating flawless results. Following a 24 h training period on the WaveNet model via Colab Pro+ using an Nvidia A100 GPU, the most favorable CER of 0.064907 was reached at the 18,400th step, with training and validation losses of 0.105120 and 0.332718, respectively (Table 4).
\mathrm{CER} = \frac{S + D + I}{N}
where S, D, and I denote the numbers of substituted, deleted, and inserted characters, respectively, and N is the total number of characters in the reference transcription.
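The following short Python function computes the CER exactly as defined above by counting substitutions, deletions, and insertions along the Levenshtein alignment; it is a generic reference implementation rather than the evaluation script used in our experiments.

```python
# Reference implementation of CER = (S + D + I) / N via Levenshtein distance.
def cer(reference: str, hypothesis: str) -> float:
    n, m = len(reference), len(hypothesis)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # deletions
    for j in range(m + 1):
        dp[0][j] = j                      # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution / match
    return dp[n][m] / max(n, 1)

print(cer("hisob ochish", "hisob ochis"))   # one deletion -> CER ~ 0.083
```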
Constraints related to GPU capabilities limited us to a 50 h session on Colab Pro+. To compensate for this, we augmented our training sets to include 10,000, 15,000, and 25,000 sample inputs, respectively (Table 5). The intent was to demonstrate the correlation between an increased number of sample inputs and improvements in training loss, validation loss, and CER. The premise was that longer training durations lead to better outcomes across all metrics.

4.2. Seq2Seq Model-Based Summary Prediction

We sourced our sample data from the Uzbek dataset for Uzbek text summarization. This visualization presents the word count distribution for both the articles and their summaries. Articles peaked at sixty words, whereas summaries typically ranged between five and ten words. During the training period of the seq2seq model, the step loss was 3.3613 and the value loss was 2.9232 (Figure 5).
Our developed speaker recognition system underwent testing in two distinct settings: one with background noise and another with studio-level sound quality. We conducted tests using 56 samples across eight sequential stages for each setting, monitoring the system’s accuracy as the number of samples grew. In both scenarios, the system successfully identified all seven individuals when the sample count was limited to seven. However, with an increase in sample size to 14, the system encountered a recognition issue with one individual in the noisy setting, leading to a slight drop in accuracy to 96.4%. While we recorded a 96.4% accuracy rate in real-world conditions, it is important to note that the system’s effectiveness might diminish in larger-scale, real-time environments (Figure 6).

4.3. Text Summarization Using the Seq2Seq Model

Through our evaluation employing the cosine similarity algorithm, we assessed the degree of resemblance between the customer’s inquiry and our existing dataset. We compared each question in the dataset with a customer query to determine the most similar questions and answers. This process takes approximately one to two minutes to complete. Our system was designed to handle different ways of asking questions; therefore, we tested it using various question formats to ensure its accuracy. However, we acknowledge that there is always room for improvement, and we are continually working to enhance the system with the available resources. We are confident that our system generates accurate output and provides answers that are relevant to the questions asked. Table 6 and Table 7 show examples of how it works in practice.
The sentence summarization model was trained with meticulous attention to detail, utilizing an RNN size of 256, batch size of two, learning rate of 0.001, and probability rate ranging from 0.65 to 0.75 with the “Adam” optimizer. The model was rigorously tested on the Uzbek dataset, and the results were impressive, with an average loss of 0.004. These parameters were carefully chosen by the main author to ensure optimal performance and accurate summarization of sentences. Overall, the Sentence Summarization model is a highly effective tool for summarizing large amounts of text into concise and informative summaries.
According to Table 8, we utilized the weight of the BERT sentence transformer called paraphrase-mpnet-base-v2, which was trained on USC, CC100-uzbek, voice-recognition-Uzbek, and xls-r-uzbek-cv8 datasets with an average accuracy of 85.27%.
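The retrieval step can be sketched as follows with the sentence-transformers library and the paraphrase-mpnet-base-v2 weights named above: FAQ questions and the customer query are embedded and ranked by cosine similarity. The example questions are illustrative, and the additional fine-tuning on the Uzbek datasets listed in Table 8 is omitted here.

```python
# Sketch of embedding-based question retrieval with cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-mpnet-base-v2")
faq_questions = [
    "Sizlarda qanday kredit turlari mavjud?",      # What types of loans do you have?
    "Karta PIN kodini unutdim, nima qilay?",       # I forgot my card PIN, what should I do?
]
faq_embeddings = model.encode(faq_questions, convert_to_tensor=True)

query = "Siz kredit berasizmi?"                    # Do you give credit?
query_embedding = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_embedding, faq_embeddings)[0]   # similarity to each FAQ entry
best = int(scores.argmax())
print(faq_questions[best], float(scores[best]))
```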
To assess the precision of our system, we randomly selected ten questions and constructed a minimum of four variants for each question to evaluate the system responses (Figure 7). Our observations revealed that the system performed exceptionally well for certain queries and provided appropriate responses. However, it failed to deliver accurate answers to the others. The outcomes are summarized in the table below. Based on our evaluation, we estimated the system accuracy as 82.5%. We are confident that increasing the number of variable questions will further improve the accuracy of our system.
In our experiments, we used various techniques to expand the corpus artificially. These techniques include adding noise (AN), changing the audio reading speed (SP), and masking (SA) in the spectral domain along the time and frequency axes. Additionally, we evaluated the effectiveness of using language models (LM) at the decoding stage.
For adding noise, we selected Gaussian noise with an amplitude of σ = 0.01. We also varied the audio playback speed by factors of 0.9, 1.0, and 1.1. In the spectral domain, we applied two masks with a maximum width of T = 40 along the time axis and two masks with a maximum width of F = 30 along the frequency axis. The results of these experiments are presented in Table 9.
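For reference, the sketch below shows one way to realize these three augmentation strategies with the stated parameters (Gaussian noise with σ = 0.01, speed factors of 0.9/1.0/1.1, and two time masks of width at most 40 plus two frequency masks of width at most 30). The library calls and the use of time stretching for speed perturbation are assumptions, not our exact implementation.

```python
# Data-augmentation sketch: additive Gaussian noise, speed perturbation, SpecAugment-style masks.
import numpy as np
import librosa

def add_noise(signal, sigma=0.01):
    return signal + np.random.normal(0.0, sigma, size=signal.shape)

def speed_perturb(signal, factor):
    # factor of 0.9 / 1.0 / 1.1 as in the experiments; > 1 speeds the audio up
    return librosa.effects.time_stretch(signal, rate=factor)

def spec_augment(spec, num_masks=2, max_t=40, max_f=30):
    # spec: (freq_bins, time_steps) spectrogram; apply time and frequency masks in place.
    spec = spec.copy()
    freq_bins, time_steps = spec.shape
    for _ in range(num_masks):
        t0 = np.random.randint(0, max(1, time_steps - max_t))
        spec[:, t0:t0 + np.random.randint(1, max_t + 1)] = 0.0   # time mask
        f0 = np.random.randint(0, max(1, freq_bins - max_f))
        spec[f0:f0 + np.random.randint(1, max_f + 1), :] = 0.0   # frequency mask
    return spec
```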
During the experiments, the Deep Speech 2 model demonstrated superior performance in terms of word error rate (WER) and character error rate (CER). On a test set comprising 5474 samples, the Deep Speech 2 model achieved a WER of 13.8% and a CER of 5.22%, indicating its effectiveness in accurately processing and transcribing speech.

5. Conclusions and Future Work

UzAssistant represents a groundbreaking step for Uzbekistan’s banking sector. It offers a transformative approach to how banking services are accessed and utilized, enhancing efficiency, convenience, and inclusivity. This automated voice chat system stands to elevate the customer experience, broaden financial inclusion, and reduce operational expenses for banks by offering round-the-clock client services without the need for additional staff.
A key benefit of UzAssistant is its ability to facilitate customer interactions in their native Uzbek language, greatly enhancing user satisfaction. However, automated voice chat systems do face certain challenges that could affect the user experience, including a limited range of vocabulary, challenges in speech recognition accuracy, constraints in multi-modal interaction, a somewhat impersonal or mechanical tone, and potential technical issues.
Despite these challenges, the potential for further advancement is significant, particularly in the Uzbek market. There are numerous avenues for future research, such as incorporating more multi-modal interaction capabilities, employing advanced machine learning methods and larger datasets for training, enabling real-time response generation, adding additional functionalities and features, and customizing experiences based on users’ past interactions and behaviors.
Recent progress in text-to-speech (TTS) technologies has made it possible to produce more lifelike automated voices in Uzbek. By integrating additional functionalities and features, such as managing intricate transactions and requests or offering tailored recommendations based on a customer’s banking history, this technology can be significantly enhanced. Implementing feedback mechanisms within the system to gather client opinions on its precision and effectiveness can provide valuable insights for further improvements. Automated Uzbek voice chat systems in banking can address these challenges and explore these future research directions to create a more immersive and effective customer experience, ultimately benefiting both the customers and the banking institutions.

Author Contributions

Conceptualization, A.M. and I.K.; methodology, A.M.; software, A.M. and I.K.; validation, A.M. and I.K.; formal analysis, A.M.; investigation, A.M. and I.K.; resources, I.K. and A.M.; data curation, A.M. and I.K.; writing—original draft preparation, A.M. and I.K.; writing—review and editing, J.C.; visualization, A.M.; supervision, J.C.; project administration, J.C.; funding acquisition, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the ‘Customized technology partner’ project funded by the Korea Ministry of SMEs and Startups in 2023 (project No. 202305440001). This work was also supported by the Gachon University research fund of 2021 (GCU-202106350001).

Informed Consent Statement

Informed consent was obtained from all participants involved in the study.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Guzman, A.L. Voices in and of the machine: Source orientation toward mobile virtual assistants. Comput. Hum. Behav. 2018, 90, 343–350. [Google Scholar] [CrossRef]
  2. McCue, T.J. Okay Google: Voice Search Technology and the Rise of Voice Commerce. Forbes Online. 2018. Available online: https://www.forbes.com/sites/tjmccue/2018/08/28/okay-google-voice-search-technology-and-the-rise-of-voice-commerce/#57eca9124e29 (accessed on 28 January 2018).
  3. Juniper Research. Voice Assistants Used in Smart Homes to Grow 1000%, Reaching 275 Million by 2023, as Alexa Leads the Way. 2018. Available online: https://www.juniperresearch.com/press/press-releases/voice-assistants-used-in-smart-homes (accessed on 25 June 2018).
  4. Gartner. “Digital Assistants will Serve as the Primary Interface to the Connected Home” Gartner Online. 2016. Available online: https://www.gartner.com/newsroom/id/3352117 (accessed on 12 September 2016).
  5. Hoy, M.B. Alexa, Siri, Cortana, and more: An introduction to voice assistants. Med. Ref. Serv. Q. 2018, 37, 81–88. [Google Scholar] [CrossRef] [PubMed]
  6. Sergey, O. Listens and Understands: How Automatic Speech Recognition Technology Works [Electronic Resource]. Available online: https://mcs.mail.ru/blog/slushaet-i-ponimaet-kak-rabotaet-tehnologija-avtomaticheskogo-raspoznavanija-rechi (accessed on 3 April 2023).
  7. Mukhamadiyev, A.; Khujayarov, I.; Djuraev, O.; Cho, J. Automatic Speech Recognition Method Based on Deep Learning Approaches for Uzbek Language. Sensors 2022, 22, 3683. [Google Scholar] [CrossRef] [PubMed]
  8. Mukhamadiyev, A.; Mukhiddinov, M.; Khujayarov, I.; Ochilov, M.; Cho, J. Development of Language Models for Continuous Uzbek Speech Recognition System. Sensors 2023, 23, 1145. [Google Scholar] [CrossRef] [PubMed]
  9. Ochilov, M. Social network services-based approach to speech corpus creation. TUIT News 2021, 1, 21–31. [Google Scholar]
  10. Musaev, M.; Mussakhojayeva, S.; Khujayorov, I.; Khassanov, Y.; Ochilov, M.; Varol, H.A. USC: An Open-Source Uzbek Speech Corpus and Initial Speech Recognition Experiments. In Proceedings of the Speech and Computer 23rd International Conference, SPECOM 2021, St. Petersburg, Russia, 27–30 September 2021. [Google Scholar]
  11. Khujayarov, I.S.; Ochilov, M.M. Analysis of methods of acoustic modeling of speech signals based on neural networks. TUIT News 2020, 2, 2–15. [Google Scholar]
  12. Musaev, M.; Khujayorov, I.; Ochilov, M. Image approach to speech recognition on CNN. In Proceedings of the 2019 International Conference on Frontiers of Neural Networks (ICFNN 2019), Rome, Italy, 26–28 July 2019; pp. 1–6. [Google Scholar]
  13. Sundar, S.S.; Jung, E.H.; Waddell, F.T.; Kim, K.J. Cheery companions or serious assistants? Role and demeanour congruity as predictors of robot attraction and use intentions among senior citizens. Int. J. Hum. Comput. Stud. 2017, 97, 88–97. [Google Scholar] [CrossRef]
  14. Balakrishnan, J.; Dwivedi, Y.K. Conversational commerce: Entering the next stage of AI-powered digital assistants. Ann. Oper. Res. 2021, 290, 1–35. [Google Scholar] [CrossRef]
  15. Liao, Y.; Vitak, J.; Kumar, P.; Zimmer, M.; Kritikos, K. Understanding the role of privacy and trust in intelligent personal assistant adoption. In Proceedings of the 14th International Conference, iConference, Washington, DC, USA, 31 March–3 April 2019. [Google Scholar]
  16. Moriuchi, E. Okay, Google!: An empirical study on voice assistants on consumer engagement and loyalty. Psychol. Mark. 2019, 36, 489–501. [Google Scholar] [CrossRef]
  17. McLean, G.; Osei-Frimpong, K. Hey Alexa… examine the variables influencing the use of artificial intelligent in-home voice assistants. Comput. Hum. Behav. 2019, 99, 28–37. [Google Scholar] [CrossRef]
  18. Pantano, E.; Pizzi, G. Forecasting artificial intelligence on online customer assistance: Evidence from chatbot patents analysis. J. Retail. Consum. Serv. 2020, 55, 102096. [Google Scholar] [CrossRef]
  19. Smith, S. Voice Assistants Used in Smart Homes to Grow 1000%, Reaching 275 Million by 2023, as Alexa Leads the Way. Juniper Research. 2018. Available online: https://www.juniperresearch.com/press/voice-assistants-in-smart-homes-reach-275m-2023 (accessed on 25 June 2018).
  20. Goasduff, L. Chatbots will Appeal to Modern Workers. Gartner. 2019. Available online: https://www.gartner.com/smarterwithgartner/chatbots-will-appeal-to-modern-workers (accessed on 31 July 2019).
  21. Swoboda, C. COVID-19 Is Making Alexa And Siri A Hands-Free Necessity. Forbes. 2020. Available online: https://www.forbes.com/sites/chuckswoboda/2020/04/06/covid-19-is-making-alexa-and-siri-a-hands-free-necessity/?sh=21a1fe391fa7 (accessed on 6 April 2020).
  22. Barnes, S.J. Information management research and practice in the post-COVID19 world. Int. J. Inf. Manag. 2020, 55, 102175. [Google Scholar] [CrossRef] [PubMed]
  23. Carroll, N.; Conboy, K. Normalising the “new normal”: Changing tech-driven work practices under pandemic time pressure. Int. J. Inf. Manag. 2020, 55, 102186. [Google Scholar] [CrossRef] [PubMed]
  24. Papagiannidis, S.; Harris, J.; Morton, D. WHO led the digital transformation of your company? A reflection of IT related challenges during the pandemic. Int. J. Inf. Manag. 2020, 55, 102166. [Google Scholar] [CrossRef]
  25. Marikyan, D.; Papagiannidis, S.; Alamanos, E. A systematic review of the smart home literature: A user perspective. Technol. Forecast. Soc. Chang. 2019, 138, 139–154. [Google Scholar] [CrossRef]
  26. Abbet, C.; M’hamdi, M.; Giannakopoulos, A.; West, R.; Hossmann, A.; Baeriswyl, M.; Musat, C. Churn intent detection in multilingual chatbot conversations and social media. In Proceedings of the 22nd Conference on Computational Natural Language Learning, CoNLL 2018, Brussels, Belgium, 31 October–1 November 2018; Korhonen, A., Titov, I., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA; pp. 161–170. [Google Scholar]
  27. Benbya, H.; Davenport, T.H.; Pachidi, S. Artificial intelligence in organizations: Current state and future opportunities. MIS Q. Exec. 2020, 19, 9–21. [Google Scholar] [CrossRef]
  28. Fernandes, T.; Oliveira, E. Understanding consumers’ acceptance of automated technologies in service encounters: Drivers of digital voice assistants adoption. J. Bus. Res. 2021, 122, 180–191. [Google Scholar] [CrossRef]
  29. Hamet, P.; Tremblay, J. Artificial intelligence in medicine. Metabolism 2017, 69, S36–S40. [Google Scholar] [CrossRef]
  30. Li, B.-H.; Hou, B.-C.; Yu, W.-T.; Lu, X.-B.; Yang, C.-W. Applications of artificial intelligence in intelligent manufacturing: A review. Front. Inf. Technol. Electron. Eng. 2017, 18, 86–96. [Google Scholar] [CrossRef]
  31. Olshannikova, E.; Ometov, A.; Koucheryavy, Y.; Olsson, T. Visualizing Big Data with augmented and virtual reality: Challenges and research agenda. J. Big Data 2015, 2, 22. [Google Scholar] [CrossRef]
  32. Young, A.G.; Majchrzak, A.; Kane, G.C. Organizing workers and machine learning tools for a less oppressive workplace. Int. J. Inf. Manag. 2021, 59, 102353. [Google Scholar] [CrossRef]
  33. Kane, G.C.; Young, A.G.; Majchrzak, A.; Ransbotham, S. Avoiding an oppressive future of machine learning: A design theory for emancipatory assistants. MIS Q. 2021, 45, 371–396. [Google Scholar] [CrossRef]
  34. Schwenk, H.; Gauvain, J.L. Training neural network language models on very large corpora. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, BC, Canada, 6–8 October 2005; pp. 201–208. [Google Scholar]
  35. Mikolov, T.; Karafiát, M.; Burget, L.; Cernocký, J.; Khudanpur, S. Recurrent neural network based language model. In Interspeech; Johns Hopkins University: Baltimore, MD, USA, 2010; Volume 3, pp. 1045–1048. [Google Scholar]
  36. Huang, Z.; Zweig, G.; Dumoulin, B. Cache Based Recurrent Neural Network Language Model Inference for First Pass Speech Recognition. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 4–9 May 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 6354–6358. [Google Scholar]
  37. Sundermeyer, M.; Oparin, I.; Gauvain, J.L.; Freiberg, B.; Schlüter, R.; Ney, H. Comparison of Feedforward and Recurrent Neural Network Language Models. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 8430–8434. [Google Scholar]
  38. Morioka, T.; Iwata, T.; Hori, T.; Kobayashi, T. Multiscale Recurrent Neural Network Based Language Model. In Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, Dresden, Germany, 6–10 September 2015; ISCA Speech: Dublin, Ireland, 2015. [Google Scholar]
  39. Hardy, H.; Strzalkowski, T.; Wu, M. Dialogue Management for an Automated Multilingual Call Center. State University of New York at Albany, Institute for Informatics, Logics and Security Studies, 2003. Available online: https://aclanthology.org/W03-0704.pdf (accessed on 1 January 2003).
  40. Zweig, G.; Siohan, O.; Saon, G.; Ramabhadran, B.; Povey, D.; Mangu, L.; Kingsbury, B. Automated quality monitoring for call centers using speech and NLP technologies. In Proceedings of the Human Language Technology Conference of the NAACL, New York, NY, USA, 4–6 June 2006; Companion Volume: Demonstrations. [Google Scholar]
  41. McLean, G.; Osei-Frimpong, K. Examining satisfaction with the experience during a live chat service encounter-implications for website providers. Comput. Hum. Behav. 2017, 76, 494–508. [Google Scholar] [CrossRef]
  42. Warnapura, A.K.; Rajapaksha, D.S.; Ranawaka, H.P.; Fernando, P.S.S.J.; Kasthuriarachchi, K.T.S.; Wijendra, D. Automated Customer Care Service System for Finance Companies. In Research and Publication of Sri Lanka Institute of Information Technology (SLIIT)’; NCTM: Reston, VA, USA, 2014; p. 8. [Google Scholar]
  43. Mansurov, B.; Mansurov, A. Uzbert: Pretraining a bert model for uzbek. arXiv 2021, arXiv:2108.09814. [Google Scholar]
  44. Ren, Z.; Yolwas, N.; Slamu, W.; Cao, R.; Wang, H. Improving Hybrid CTC/Attention Architecture for Agglutinative Language Speech Recognition. Sensors 2022, 22, 7319. [Google Scholar] [CrossRef] [PubMed]
  45. Mamatov, N.S.; Niyozmatova, N.A.; Abdullaev, S.S.; Samijonov, A.N.; Erejepov, K.K. Speech Recognition Based on Transformer Neural Networks. In Proceedings of the 2021 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan, 3–5 November 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–5. [Google Scholar]
  46. Le, Q.; Mikolov, T. Distributed representations of sentences and documents. In Proceedings of the International Conference on Machine Learning, Beijing, China, 22–24 June 2014; PMLR: Westminster, UK, 2014; pp. 1188–1196. [Google Scholar]
  47. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 2013, arXiv:1301.3781. [Google Scholar]
  48. Khamdamov, U.; Mukhiddinov, M.; Akmuradov, B.; Zarmasov, E. A Novel Algorithm of Numbers to Text Conversion for Uzbek Language TTS Synthesizer. In Proceedings of the 2020 International Conference on Information Science and Communications Technologies (ICISCT), Tashkent, Uzbekistan, 4–6 November 2020; pp. 1–5. [Google Scholar] [CrossRef]
  49. Zhao, Q.; Tu, D.; Xu, S.; Shao, H.; Meng, Q. Natural human-robot interaction for elderly and disabled healthcare application. In Proceedings of the 2014 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Belfast, UK, 2–5 November 2014; IEEE: Piscataway, NJ, USA; pp. 39–44. [Google Scholar]
  50. Yan, H.; Ang, M.H.; Poo, A.N. A survey on perception methods for human–robot interaction in social robots. Int. J. Soc. Robot. 2014, 6, 85–119. [Google Scholar] [CrossRef]
  51. Amodei, D.; Ananthanarayanan, S.; Anubhai, R.; Bai, J.; Battenberg, E.; Case, C.; Casper, J.; Catanzaro, B.; Cheng, Q.; Chen, G.; et al. Deep speech 2: End-to-end speech recognition in english and mandarin. In Proceedings of the International Conference on Machine Learning, New York, NY, USA, 19–24 June 2016; PMLR: Westminster, UK, 2016; pp. 173–182. [Google Scholar]
  52. Sultana, M.; Chakraborty, P.; Choudhury, T. Bengali Abstractive News Summarization Using Seq2Seq Learning with Attention; Cyber Intelligence and Information Retrieval; Springer: Singapore, 2022; pp. 279–289. [Google Scholar]
  53. Oord, A.V.D.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. Wavenet: A generative model for raw audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
  54. Okamoto, T.; Toda, T.; Shiga, Y.; Kawai, H. TacotronBased Acoustic Model Using Phoneme Alignment for Practical Neural Text-to-Speech Systems. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Singapore, 14–18 December 2019; pp. 214–221. [Google Scholar]
  55. Bishop, C.M. Mixture density networks. Tech. Rep. 1994, 1–26. Available online: https://research.aston.ac.uk/en/publications/mixture-density-networks (accessed on 6 April 2020).
  56. Quatieri, T.F. Discrete-Time Speech Signal Processing: Principles and Practice; Pearson Education India: Noida, India, 2006. [Google Scholar]
  57. Tamamori, A.; Hayashi, T.; Kobayashi, K.; Takeda, K.; Toda, T. Speaker-dependent WaveNet vocoder. In Proceedings of the Interspeech, Stockholm, Sweden, 20–24 August 2017; pp. 1118–1122. [Google Scholar]
  58. Song, E.; Soong, F.K.; Kang, H.-G. Effective Spectral and Excitation Modeling Techniques for LSTM-RNN-Based Speech Synthesis Systems. IEEE/ACM Trans. Audio Speech Lang. Process. 2017, 25, 2152–2161. [Google Scholar] [CrossRef]
Figure 1. Schematic representation of UzAssistant components.
Figure 1. Schematic representation of UzAssistant components.
Electronics 12 04850 g001
Figure 2. Procedure for the speaker recognition system.
Figure 2. Procedure for the speaker recognition system.
Electronics 12 04850 g002
Figure 3. Structure of the sentence summarization system.
Figure 4. WaveNet architecture.
Figure 5. Training and testing loss values.
Figure 6. Comparison of the speaker recognition module’s accuracy in two distinct environments.
Figure 7. Estimating model accuracy.
Table 1. Personal information database (PID).
Full Name | National ID (NID) Number | Birth Date | Text Format of Recorded Response | GMM-Model Utilized | Call Time
Abror Salimov | 32007914010096 | 20 July 1991 | Hisob ochish uchun (To open an account)… | Person1.gmm | 13 August 2023, 06:41 p.m.
Zafar Odilov | 32806884050115 | 28 June 1988 | Kartadan pul yechish (Withdraw money from the card)… | Person2.gmm | 14 August 2023, 06:56 p.m.
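For illustration, the PID in Table 1 can be viewed as a keyed record store in which the NID number identifies the caller and the GMM file links the entry to the speaker recognition module. The following minimal Python sketch is not part of the published system; the class and helper names (PIDRecord, pid_lookup) are hypothetical and simply mirror the columns of Table 1.

from dataclasses import dataclass
from typing import Optional

@dataclass
class PIDRecord:
    """One row of the personal information database (PID), mirroring Table 1."""
    full_name: str
    nid_number: str          # National ID (NID) number, used here as the lookup key
    birth_date: str
    recorded_response: str   # text format of the caller's recorded response
    gmm_model: str           # file name of the caller's GMM speaker model
    call_time: str

# Hypothetical in-memory store keyed by NID number (first row of Table 1).
pid_db = {
    "32007914010096": PIDRecord(
        "Abror Salimov", "32007914010096", "20 July 1991",
        "Hisob ochish uchun ...", "Person1.gmm", "13 August 2023, 06:41 p.m."),
}

def pid_lookup(nid_number: str) -> Optional[PIDRecord]:
    """Return the stored record for a caller, or None if the NID is unknown."""
    return pid_db.get(nid_number)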
Table 2. Generic information database (GID).
User Response | Solution
Karta PIN kodini unutdim? (Forgot your card PIN?) | Karta ochilgan bank filialiga shaxsingizni tasdiqlovchi hujjat bilan murojaat qilishingiz lozim (You should apply to the branch of the bank where the card was opened with an identity document).
SMS-xabarnoma xizmatini ulash (Connect the SMS notification service) | Milliy valyutadagi plastik kartasi uchun sms-xabarnomani infokiosk orqali ulash mumkin (An SMS notification for a plastic card in national currency can be connected through an infokiosk).
Bank xizmatlaridan qaysi kunlarda foydalanish mumkin? (On what days can bank services be used?) | Bayram kunlaridan tashqari haftaning dushanbadan juma kunigacha foydalanishingiz mumkin. (You can use it from Monday to Friday, except holidays.)
…
Table 3. Credential information database (CID).
User Response | Domain | Predefined Questions from the System Based on Specific Domain
Bu bankda mening hisob raqamim bor. Men hisobimdagi oxirgi balansni bilmoqchiman. (I have an account with this bank. I want to know the last balance of my account.) | Balance issue | • Hisob raqamingiz nima? (What is your account number?) • NID raqamingizni kiriting (Enter your NID number)
Mening kartam yo’qolgan. Men nima qilishim mumkin? (My card is lost. What can I do?) | Card issue | • Karta raqamingizni ayting (Enter your card number) • NID raqamingizni kiriting (Enter your NID number)
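Tables 2 and 3 together suggest a two-tier dialogue flow: generic questions are answered directly from the GID, while credential-related requests are routed to a domain whose predefined follow-up questions are read from the CID. The sketch below illustrates only the CID side under simplifying assumptions: the domain labels and question strings come from Table 3, but the keyword-based route_domain helper is invented for the example, whereas the actual system selects the domain with a sentence-similarity model rather than keywords.

# Predefined follow-up questions per domain, taken from the CID rows in Table 3.
follow_up_questions = {
    "Balance issue": [
        "Hisob raqamingiz nima? (What is your account number?)",
        "NID raqamingizni kiriting (Enter your NID number)",
    ],
    "Card issue": [
        "Karta raqamingizni ayting (Enter your card number)",
        "NID raqamingizni kiriting (Enter your NID number)",
    ],
}

# Toy keyword routing, used only to keep the example self-contained.
KEYWORDS = {"balans": "Balance issue", "hisob": "Balance issue", "karta": "Card issue"}

def route_domain(user_response: str) -> list:
    """Return the predefined questions for the first matching domain, else an empty list."""
    text = user_response.lower()
    for keyword, domain in KEYWORDS.items():
        if keyword in text:
            return follow_up_questions[domain]
    return []  # fall back to the generic (GID) flow

print(route_domain("Mening kartam yo’qolgan. Men nima qilishim mumkin?"))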
Table 4. WaveNet model training.
Step | Training Loss | Validation Loss | CER
400 | 6.319100 | 3.322290 | 0.970519
800 | 2.128300 | 0.678408 | 0.207603
1200 | 0.879500 | 0.433805 | 0.141339
1600 | 0.736400 | 0.364600 | 0.121398
17,600 | 0.108070 | 0.337852 | 0.065240
18,000 | 0.106200 | 0.334034 | 0.065009
18,400 | 0.105120 | 0.332718 | 0.064907
Table 5. Comparison between various sample inputs.
Sample Inputs | Hours of Training | Loss during Training | Loss during Validation | Character Error Rate (CER)
10,000 | 20 | 1.830201 | 1.340452 | 0.840124
15,000 | 30 | 0.880291 | 0.64385 | 0.306281
25,000 | 50 | 0.105120 | 0.332718 | 0.064907
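For context, the character error rate (CER) reported in Tables 4 and 5 is conventionally defined as the edit (Levenshtein) distance between the recognized text and the reference transcript divided by the number of reference characters. The short Python sketch below implements that standard definition; it is illustrative only and is not the evaluation code used in the paper.

def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

# Example with a short Uzbek phrase containing one substituted character.
print(round(cer("kartadan pul yechish", "kartadan pul yechesh"), 3))  # 0.05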
Table 6. Variations in questions and corresponding system responses.
Sample Question: Siz Kredit Berasizmi? (Do You Give Credit?)
Variations | Response | Correct/Incorrect
Sizlarda qanday kredit turlari mavjud? (What types of loans do you have?) | Siz banklardan avtokredit, mikroqarz, iste’mol kreditlari, ipoteka kreditlaridan foydalanishingiz mumkin. (You can use auto loans, microloans, consumer loans, and mortgage loans from banks.) | Correct
Mikroqarzni qanday rasmiylashtirsam bo’ladi? (How can I formalize a microloan?) | Mikroqarzni rasmiylashtirish uchun kredit menejeriga murojaat etish, kreditni rasmiylashtirishga ariza yozish va bank tomonidan talab etiladigan barcha zarur hujjatlarni taqdim etish lozim. (In order to issue a microloan, it is necessary to contact the credit manager, write an application for issuing a loan and submit all the necessary documents required by the bank.) | Correct
Kreditingiz haqida bilmoqchimisiz? (Want to know about your credit?) | Agar bizning bankda hisob raqamingiz bo’lsa, siz debit karta ochishga ariza berishingiz mumkin. (If you have an account number in our bank, you can apply for opening a debit card.) | Incorrect
Kredit to’lovini bankka bormay amalga oshirsam bo’ladimi? (Can I pay the loan without going to the bank?) | Albatta! Identifikatsiyadan o’tkazilgan ilova orqali mijozlar kredit to’lovlarini amalga oshirishi hamda monitoring qilib borishi mumkin. (Of course! Through the authenticated application, customers can make and monitor loan payments.) | Correct
Table 7. Variations in questions and corresponding system responses.
Sample Question: Karta Ochishim Uchun Nima Qilishim Kerak? (What Do I Need to Do to Open a Card?)
Variations | Response | Correct/Incorrect
Kartaga sms xabarnoma yoqish tartibi qanday? (What is the procedure for enabling SMS notifications on the card?) | Uzcard kartasi sms xabarnomasi bankomat orqali yoqish mumkun. Humo kartasining sms xabarnomasi karta ochilgan filial orqali yoqiladi. (It is possible to activate the Uzcard card SMS notification through the ATM. SMS notification of the Humo card is enabled through the branch where the card was opened.) | Correct
Karta PIN kodini unutganda nima qilishim kerak? (What should I do if I forget my card PIN?) | Agar Siz kartangizning PIN-kodini unutgan bo’lsangiz karta ochilgan bank filialiga shaxsingizni tasdiqlovchi hujjat bilan murojaat qilishingiz lozim. (If you have forgotten the PIN code of your card, you should contact the branch of the bank where the card was opened with your identity document.) | Correct
Karta yo’qotilganda qanday bloklash mumkin? (How to block when the card is lost?) | Kartani blokdan chiqarish uchun karta ochilgan filialga shaxsingizni tasdiqlovchi hujjat bilan murojaat etishingiz lozim. (To unblock the card, you need to contact the branch where the card was opened with your identity document.) | Incorrect
Bank filiallariga bormasdan kartaga onlayn buyurtma berish imkoni mavjudmi? (Is it possible to order a card online without going to bank branches?) | Milliy va xorijiy valyutadagi barcha kartalarga Milliy ilova yordamida masofadan turib buyurtma berishingiz mumkin. (You can order all cards in national and foreign currency remotely using the National application.) | Correct
Table 8. Accuracy of the “paraphrase-mpnet-base-v2” sentence transformer in generating sentence embeddings and embedding paragraphs for search queries.
Dataset | Accuracy (%)
Our USC dataset | 91.29
CC100-uzbek | 88.96
voice-recognition-Uzbek | 78.62
xls-r-uzbek-cv8 | 82.24
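Assuming the paraphrase-mpnet-base-v2 checkpoint from the sentence-transformers library (the model named in Table 8), the query-to-FAQ matching step can be sketched as an embedding plus cosine-similarity search. This is a minimal illustration under that assumption, not the exact retrieval pipeline evaluated in the table; the candidate questions and the 0.6 threshold are invented for the example.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-mpnet-base-v2")

# Candidate questions from the generic information database (illustrative subset).
faq_questions = [
    "Karta PIN kodini unutdim?",
    "SMS-xabarnoma xizmatini ulash",
    "Bank xizmatlaridan qaysi kunlarda foydalanish mumkin?",
]
faq_embeddings = model.encode(faq_questions, convert_to_tensor=True)

def best_match(query: str, threshold: float = 0.6):
    """Return the stored question most similar to the query, or None if nothing is close enough."""
    query_emb = model.encode(query, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, faq_embeddings)[0]
    best = int(scores.argmax())
    return faq_questions[best] if float(scores[best]) >= threshold else None

# Toy query; the highest-scoring FAQ entry is returned only if it clears the threshold.
print(best_match("PIN kodimni eslay olmayapman"))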
Table 9. Experimental results in model training. (⨯—Not available, ✓—Available).
Model | LM | ANS | PSA | Valid CER | Valid WER | Test CER | Test WER
E2E-BLSTM |  |  |  | 13.9 | 43.2 | 15.1 | 44.7
 |  |  |  | 14.8 | 30.1 | 15.2 | 31.9
 |  |  |  | 13.8 | 27.7 | 15.5 | 31.2
 |  |  |  | 12.7 | 24.8 | 12.3 | 27.8
 |  |  |  | 10.6 | 22.6 | 11.5 | 23.9
DNN-CTC |  |  |  | 13.1 | 35.1 | 10.9 | 32.7
 |  |  |  | 10.9 | 21.3 | 9.0 | 25.4
 |  |  |  | 7.3 | 19.4 | 8.1 | 23.9
 |  |  |  | 7.2 | 20.2 | 8.7 | 25.4
 |  |  |  | 6.0 | 17.1 | 6.5 | 21.9
E2E-Conformer |  |  |  | 9.3 | 40.5 | 12.6 | 44.2
 |  |  |  | 8.2 | 32.6 | 10.3 | 28.6
 |  |  |  | 8.1 | 30.3 | 9.9 | 27.2
 |  |  |  | 7.9 | 24.1 | 9.2 | 24.4
 |  |  |  | 7.6 | 23.1 | 8.9 | 22.3
Deep Speech 2 |  |  |  | 12.0 | 36.7 | 10.2 | 34.6
 |  |  |  | 11.3 | 26.4 | 9.3 | 25.9
 |  |  |  | 8.9 | 20.6 | 7.1 | 20.7
 |  |  |  | 7.2 | 17.5 | 5.9 | 16.9
 |  |  |  | 5.4 | 15.1 | 5.22 | 13.8
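The CER and WER values in Table 9 can be reproduced for any reference/hypothesis pair with an off-the-shelf metric package. The fragment below uses the jiwer library purely as an illustration of how such scores are computed; jiwer is an assumption of this sketch and is not mentioned in the paper, and the sentences are invented.

# Assumes the third-party jiwer package is installed (pip install jiwer);
# it is not part of the paper's toolchain.
import jiwer

references = ["kartadan pul yechish uchun nima qilishim kerak"]
hypotheses = ["kartadan pul yechish uchun nima qilish kerak"]

print("WER:", round(jiwer.wer(references, hypotheses), 3))  # word error rate
print("CER:", round(jiwer.cer(references, hypotheses), 3))  # character error rate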