1. Introduction
In the field of civil aviation, ensuring flight safety and efficiency necessitates the establishment of effective communication at aerodromes and within airspace. The Very High Frequency Communication System (VHF COMM) plays a crucial role in civil aviation communications, using voice as the medium and VHF radio signals as the carrier for information transmission. Air traffic controllers use the air traffic control system to communicate with and coordinate multiple aircraft in the airspace. As shown in
Figure 1, the voice commands of controllers are transmitted via wire from the air traffic control system to the interphone systems and are then broadcast over VHF COMM in the very high frequency band to the airspace. When aircraft receive commands, they transmit them back wirelessly to the aerodrome, where they are stored in the interphone systems and relayed to the air traffic controller [
1].
The Very High Frequency Communication System (VHF COMM) operates in the 30–300 MHz frequency band. The VHF band is also used in the maritime domain, for example in the A1 sea area of the Global Maritime Distress and Safety System (GMDSS); in aviation, VHF COMM is the principal system for communication between air traffic controllers and pilots. According to International Civil Aviation Organization standards, the frequencies used in civil aviation range from 118.000 to 151.975 MHz with a channel spacing of 25 kHz; the 121.600 to 121.925 MHz band is primarily used for ground control. VHF COMM uses half-duplex amplitude modulation with a minimum transmission power of 20 W; because its surface waves attenuate rapidly, propagation is primarily via space waves, the communication distance is limited to line of sight, and the link is susceptible to tropospheric conditions and terrain. Additionally, communication can be affected by noise interference, channel compression, and voice signal distortion due to weather, encoders, and channel noise [
2].
In the civil aviation system, the Very High Frequency (VHF) Communication System serves as the primary network for aircraft communication, and the accuracy of two-way voice communication between aircraft crews and the various control positions is essential. During operations with high complexity and accident risk, such as takeoff and landing, the dialogues between pilots and controllers must be recorded continuously, so that missed or incorrect control information can be detected by comparing the instructions issued by controllers with those read back by pilots. The structured recording of voice communication between air traffic control and crews is therefore critical [
3].
Since the crew in the airspace cannot effectively perceive the specific situation of the entire airspace, such as drone conflicts and potential navigation conflicts, it is necessary to rely on the ground air traffic management units for control services. The process of providing services, according to the standard air traffic control procedures published by ICAO, can be divided into the following three steps:
Surveillance and Identification: Controllers identify aircraft through surveillance facilities or air–ground dialogues.
Issuance and Execution of Commands: Controllers issue commands to aircraft after a reasonable analysis of airspace dynamics and aircraft dynamics, and pilots execute the commands effectively upon receiving them.
Monitoring of Commands: Controllers use dialogues with pilots to further monitor the implementation of commands.
These three steps form a cycle that constitutes the entire air traffic management service. It is evident that surveillance and communication underpin the whole air traffic management process: effective air traffic safety management requires effective surveillance and communication.
In the air traffic surveillance of general large airports, secondary surveillance radar (SSR) and ADS-B equipment are often used to monitor aircraft status effectively. However, at most medium and small airports, due to the lack of sufficient surveillance equipment, procedural control based on position reports is adopted. Compared with the highly automated and visualized radar control method, procedural control relies on pilots reporting their positions, and controllers build a model of the entire airspace through dialogues with pilots. In this setting, the air traffic management service depends heavily on the controllers’ understanding of the airspace and their communication abilities, making air–ground cooperation (AGC) the most critical high-risk human-in-the-loop (HITL) segment. Whether AI-based methods can be used to assist surveillance therefore becomes the key to improving procedural control services.
Correspondingly, achieving voice-based assisted surveillance can also be roughly divided into the following three steps:
Voice Segmentation: Real-time voice is processed into individual, meaningful voice segments that computers can process, answering “when someone speaks”.
Voice Understanding: Voice is recognized as text and the speaker of each command is identified, answering “who speaks when”.
Semantic Analysis and Display: The text content is understood and abstracted into a command vector representing the specific meaning of the command; modeling analysis then determines whether the command affects future flights and provides a visual display similar to radar control, answering “who speaks what and when”.
Therefore, in research on civil aviation air–ground communication, achieving automated voice segmentation and accurately matching segments to the corresponding speakers, i.e., solving the “who speaks when” problem, has become a focus of this field [
4,
5].
The task of mapping voices to speakers is a classic “Who Spoke When” problem, addressed through speaker diarization networks. Researchers have extensively investigated various network architectures for automating voice segmentation and clustering tasks, yielding significant achievements. Tae Jin Park et al. reviewed the development of speaker diarization, noting its early use in air traffic control dialogue segmentation and clustering [
6]. Federico Landini et al. utilized a Bayesian Hidden Markov Model for the speaker clustering of x-vector sequences in the VBx system [
7]. Hervé Bredin and others [
5,
8,
9] developed pyannote.audio, a neural network processing system for speaker identification and diarization that used advanced machine learning and deep learning techniques to achieve a top performance in multiple competitions and datasets. Zuluaga-Gomez et al. [
10] combined voice activity detection with the BERT model to detect speaker roles and changes by segmenting the text of voice recognition software. Bonastre et al. [
11] introduced the ALIZE speaker diarization system, which is written in C++ and designed and implemented using traditional method-based models. In practical applications, speaker diarization technology has been used in scenarios such as telephone customer service, where the use of deep learning has strengthened the technology, with a focus on modular and end-to-end approaches [
12].
In summary, the current speaker-logging process typically consists of two steps: first, slicing the input long speech data, which is typically implemented using a voice activity detection (VAD) algorithm; second, clustering the sliced audio to achieve a one-to-one correspondence between the audio and the speaker. Currently, VAD algorithms frequently employ endpoint discrimination algorithms based on statistical models, such as ALIZE [
11], and machine learning endpoint discrimination algorithms based on waveform recognition, such as pyannote.audio [
8]. In noisy environments, statistically based endpoint discrimination algorithms typically perform poorly. Although waveform recognition-based machine learning endpoint discrimination algorithms can improve silence detection techniques under complex background noise by enhancing neural networks, simple machine learning networks are overly sensitive to speaker breathing gaps, resulting in too many small segments being cut from a continuous piece of audio. Currently, mainstream clustering algorithms are based on x-vector speaker recognition technology, including VBx [
7] and pyannote.audio [
8]. The main idea is to augment variable-length speech with noise and reverb, then map it through deep neural networks into fixed-length vectors, and finally recognize the speaker by clustering the vectors after a secondary mapping. However, such algorithms have high requirements for speech quality, requiring distinct voiceprints from different speakers. The traditional modulation, transmission, and demodulation operations used in radiotelephone communications using very high frequency (VHF) systems not only cause voiceprint blurring (filtering by filters in analog signal processing), but also introduce strong noise (poor communication environments), all of which can have a significant impact on the performance of current speaker-logging systems.
While speaker diarization technology originated in the context of voice segmentation for radiotelephone communications, its application and development in civil aviation face challenges due to high noise interference in radiotelephone conversations and complex multi-party communication scenarios. The existing research has used text processing methods for speaker detection, but it has not fully integrated the characteristics of civil aviation radiotelephone communications with speaker voiceprint features. This paper aims to address the half-duplex nature of radiotelephone communications, multi-speaker environments, and complex background noise by developing a comprehensive radiotelephone communication speaker diarization system that meets the application needs of speaker diarization in radiotelephone communications.
Given the complex nature of radiotelephone half-duplex communications, which are characterized by multiple speakers and noisy environments, this study proposes a novel speaker diarization network suitable for airports whose aircraft are not equipped with CPDLC or do not use CPDLC as the primary communication method.
2. Contribution
As shown in
Figure 2, our network comprises three main components: voice activity detection (VAD), end-to-end radiotelephone speaker separation (EESS), and flight prior knowledge-based text-related clustering (PKTC).
The VAD model used in this study has been redesigned with an attention mechanism to filter out the silence and background noises common in radiotelephone communications. The difference in voice signal transmission pathways between controllers and pilots—wired into the intercom system for controllers and wireless for pilots—has resulted in distinct voice signal characteristics. Using current advanced end-to-end speaker segmentation models based on voiceprints enables an effective separation and pre-clustering of communications between the two parties. As a result, the EESS model is an adaptation of the radiotelephone speaker segmentation model based on [
5], fine-tuned to pre-cluster voice signals into two categories: pilots and controllers. The PKTC model is innovative in that it introduces a text-feature speaker-clustering model based on the textual content of radiotelephone communications and prior flight knowledge. This includes graph construction based on prior knowledge, graph correction with radar data, and probability optimization to cluster pilot voices into speaker classes identified by call signs. In summary, this paper presents a novel Radiotelephone Communications Speaker Diarization Network that is tailored to the unique characteristics of radiotelephone communications.
To assess the effectiveness of the network, this paper compiled a collection of real and continuous radiotelephone communication voice data. Based on this, the ATCSPEECH dataset was constructed to train and test speaker diarization tasks. The dataset contains 5347 voice recordings from 106 speakers, including 11 air traffic controllers who held the same position at different times and 95 pilots. The data includes 14 h of recordings from a specific airport, from 7 a.m. to 9 p.m.
The remainder of this paper is organized as follows.
Section 2 uses a piece of typical voice data from real radiotelephone communication to demonstrate three features of the Radiotelephone Communication Speaker Diarization Network, as well as the challenges in extracting these features. Based on these challenges,
Section 3 proposes a novel radiotelephone communication speaker diarization system made up of three modules: voice activity detection (VAD), end-to-end radiotelephone speaker separation (EESS), and flight prior knowledge-based text-related clustering (PKTC).
Section 4 introduces the ATCSPEECH database, which we created for the radiotelephone communication speaker diarization task.
Section 5 trains and compares the network using the public AMI and ATCO2 PROJECT databases, followed by ablation studies on various network modules. Finally,
Section 6 summarizes the contents of this article.
3. Preliminary Analysis
In radiotelephone communication, a single controller in one location communicates with multiple pilots via the Very High Frequency Communication System (VHF COMM).
Figure 3 depicts a segment of real radiotelephone communication voice data captured from the airport intercom system. As shown in
Figure 3, voice communication has the following characteristics during transmission:
Characteristic 1: Radiotelephone communication operates in half-duplex mode, which requires speakers to press the Push to Talk (PTT) device button to speak, allowing only one speaker to speak at a time. When neither party presses the talk button, the intercom system records absolute silence, as shown in the figures with the empty waveform and empty spectrogram. When both parties press the talk button at the same time, they are unable to communicate effectively, and the system records a segment of chaotic noise that does not belong to any speaker. The duration of this noise is typically correlated with the time during which both parties press the button simultaneously, and it is usually very brief; its waveform and spectrogram are shown in the figures as the Crackle Waveform and Crackle Spectrogram, respectively. Such conflicts are common at congested airports.
Characteristic 2: The audio processed during speaker logging is collected from the intercom system. The controller’s voice is transmitted from the control tower to the intercom system via wired transmission, resulting in minimal noise and a clear voiceprint, as shown by the waveform green marker and ATC speech spectrogram in the figure. However, because the controller uses the microphone at such a close distance, “popping” may occur. In the waveform, this is represented by an overly full and enlarged shape. Pilots’ voices enter the intercom system via wireless transmission, as indicated by the waveform red marker and pilot speech spectrogram in the figures. The pilots’ end voice introduces noise that varies with communication distance, the presence of obstacles in the channel, flight phases, and weather conditions, making the pilots’ end voice noisier. This noise appears in the spectrogram as a series of red lines spanning from left to right, indicating constant noise in those frequency bands.
Characteristic 3: During transmission, pilots’ voices are filtered through a band-pass filter, which is represented by wave cancellation in the blue box of the pilot speech spectrogram. This results in a sparse and blurry voiceprint for the pilots, which differs significantly from that of the controllers. When a controller communicates with multiple pilots, the pilots’ voices all exhibit weakened voiceprints, causing their voiceprints to become similar and making it difficult for voiceprint-based speaker-clustering models to distinguish between them [
12].
The challenges associated with Characteristic 1 include the need for the voice activity detection (VAD) module to distinguish not only silence but also noise generated during conflicts between the parties in conversation. The challenge with Characteristic 2 stems from the significant differences in voiceprints between controllers and pilots, with a focus on how to use this feature to create an effective speech separation model. For Characteristic 3, the challenge is that relying solely on acoustic models to distinguish pilots’ voices may be ineffective. Thus, in order to aid in speech separation, additional features such as text must be incorporated into the speaker diarization system design. To address these challenges in air–ground communication scenarios, this paper proposes a speaker diarization system consisting of three modules: voice activity detection (VAD), end-to-end radiotelephone speaker separation (EESS), and flight prior knowledge-based text-related clustering (PKTC). The VAD module has been enhanced with an attention mechanism to effectively filter out silence and noise, as described in Characteristic 1. The EESS model fine-tunes an end-to-end speaker segmentation model to distinguish between controller and pilot voiceprints by pre-clustering voices into two categories: pilots and controllers. Given Characteristics 2 and 3, a text-related clustering model based on flight prior knowledge is designed to accurately cluster pilot voices to the corresponding call sign, effectively resolving speaker identification errors in speaker diarization logs caused by blurred voiceprints.
4. Proposed Framework
The model proposed in this paper consists of three components: voice activity detection (VAD), end-to-end radiotelephone speaker separation (EESS), and flight prior knowledge-based text-related clustering (PKTC).
4.1. Voice Activity Detection
The voice activity detection (VAD) model attempts to slice the input speech signal and filter out noise caused by equipment conditions to the greatest extent possible. The VAD model developed in this paper serves as the system’s entry point and is optimized for radiotelephone communication scenarios.
Figure 4 shows how the model uses a local temporal network based on Bi-LSTM to filter out noise-generated intervals [
13]. It also includes a global temporal network based on self-attention that preserves silent intervals caused by the speaker’s breathing [
14]. This network enables the segmentation of long speech signals into individual radiotelephone command speech segments.
Due to the use of Push to Talk (PTT) devices, air–ground communication can be treated as turn-based speech, meaning that only one person speaks at a time. The fundamental reason the VAD architecture had to be redesigned is that in air–ground communication there is “silence” when no one is speaking, but there are also “crackle” sounds caused by button presses, and both should be filtered out by the VAD system. Traditional energy-spectrum-based VAD models are very effective at filtering “silence”, but because “crackle” has a non-zero energy spectrum, energy-based segmentation models treat it as a valid utterance and retain it. We therefore chose a waveform-based VAD framework: although “crackle” carries energy, it appears as isolated, sharp spikes in the waveform, which a waveform-based VAD model can effectively identify and remove. Additionally, owing to the limited temporal context of the Bi-LSTM model, Bi-LSTM-based VAD systems cannot effectively recognize the silent intervals caused by breathing, leading to a continuous speech segment being split into two or more sentences. We therefore designed a self-attention-based global temporal network that extends the model’s “receptive field” so as to retain the silent intervals caused by the speaker’s breathing.
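As a minimal illustration of this design choice (not the paper's implementation; the sample rate, frame length, signals, and thresholds below are arbitrary assumptions), the following sketch contrasts a plain frame-energy test with a simple waveform-level crest-factor test on synthetic speech, silence, and an isolated “crackle” spike:

```python
# Toy comparison: an energy threshold keeps the crackle frame,
# while a waveform-level peak-to-RMS (crest) test rejects it.
import numpy as np

rng = np.random.default_rng(0)
sr = 16000                        # assumed sample rate
frame = sr // 100                 # 10 ms frame (160 samples)

speech  = 0.3 * np.sin(2 * np.pi * 200 * np.arange(frame) / sr)   # tonal, speech-like
silence = 0.001 * rng.standard_normal(frame)
crackle = np.zeros(frame)
crackle[frame // 2] = 0.9                                          # isolated sharp spike

for name, x in [("speech", speech), ("silence", silence), ("crackle", crackle)]:
    energy = np.mean(x ** 2)
    crest = np.max(np.abs(x)) / (np.sqrt(energy) + 1e-12)          # peak-to-RMS ratio
    energy_vad = energy > 1e-4                  # energy test alone keeps the crackle
    waveform_vad = energy_vad and crest < 10.0  # spike-aware test removes it
    print(f"{name:8s} energy={energy:.2e} crest={crest:5.1f} "
          f"energy-VAD={energy_vad} waveform-aware={waveform_vad}")
```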
Assume that the audio input consists of $N$ sample points, denoted as $x = (x_1, x_2, \ldots, x_N)$. This speech segment is then sliced with a stride of $s$ samples. First, each audio segment is convolved using the SincNet convolution method mentioned in [15], resulting in $y$:

$$ y[n] = x[n] * g[n; \theta] $$

where $*$ represents the convolution operation and $g[n; \theta]$ denotes a convolutional band-pass filter of length $L$, whose learnable parameters $\theta = \{f_1, f_2\}$ are defined below:

$$ g[n; f_1, f_2] = \big( 2 f_2\, \mathrm{sinc}(2 \pi f_2 n) - 2 f_1\, \mathrm{sinc}(2 \pi f_1 n) \big)\, w[n] $$

In this context, $f_2$ and $f_1$ represent each convolutional band-pass filter’s upper and lower frequency limits, respectively, while $w[n]$ represents a Hamming window function, defined to have the same length as the filter and utilized to mitigate the effects of spectral leakage.
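For illustration, the sketch below builds one such Hamming-windowed sinc band-pass filter and convolves it with a toy waveform, following the SincNet formulation of [15]; the filter length, cut-off frequencies, and sample rate are assumptions chosen only for this example (in SincNet the cut-offs are the learnable parameters, whereas here they are fixed):

```python
# Sketch of a SincNet-style band-pass filter g[n; f1, f2] and the convolution y = x * g.
import math
import torch

def sinc_bandpass(f1, f2, length=251, sample_rate=8000):
    """Hamming-windowed difference of two sinc low-pass filters (ideal band-pass)."""
    n = torch.arange(-(length // 2), length // 2 + 1, dtype=torch.float32)
    f1n, f2n = f1 / sample_rate, f2 / sample_rate      # normalized cut-off frequencies

    def sinc(v):                                       # sin(v)/v with sinc(0) = 1
        return torch.where(v == 0, torch.ones_like(v), torch.sin(v) / v)

    band = 2 * f2n * sinc(2 * math.pi * f2n * n) - 2 * f1n * sinc(2 * math.pi * f1n * n)
    window = torch.hamming_window(length, periodic=False)   # mitigates spectral leakage
    return band * window

x = torch.randn(1, 1, 8000)                                  # one second of toy audio
g = sinc_bandpass(torch.tensor(300.0), torch.tensor(3400.0)).view(1, 1, -1)
y = torch.nn.functional.conv1d(x, g, padding=g.shape[-1] // 2)
print(y.shape)                                               # torch.Size([1, 1, 8000])
```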
After being convolved with SincNet, the signal is absolute-value-pooled and normalized before being activated using the Leaky ReLU function. The processed audio signal is then fed into a two-layer Bi-LSTM network to extract short-term features. Assuming that the feature extracted at timestep $t$ is the vector $h_t$, the specific steps are as follows:

$$ i_t = \sigma(W_i [h_{t-1}, x_t] + b_i), \quad f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \quad o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) $$

$$ \tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c), \quad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \quad h_t = o_t \odot \tanh(c_t) $$

where $t$ represents the index of each timestep, $\sigma$ stands for the sigmoid function, $i_t$, $f_t$, and $o_t$ represent the input, forget, and output gates, respectively, and $c_t$ and $h_t$ denote the cell state and hidden output, respectively. $W_{(\cdot)}$ and $b_{(\cdot)}$ represent trainable weight matrices and bias vectors, while $\odot$ indicates the element-wise multiplication of corresponding vectors. Finally, by concatenating the feature vector obtained from the forward computation with the vector derived from the backward sequence, we obtain the result of the feature extraction for this layer:

$$ h_t = \big[ \overrightarrow{h_t}\, ;\, \overleftarrow{h_t} \big] $$
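In PyTorch, the forward/backward concatenation of a two-layer Bi-LSTM can be written compactly; the feature and hidden dimensions below are illustrative assumptions, not values from the paper:

```python
# Local temporal block sketch: a 2-layer Bi-LSTM whose forward and backward
# hidden states are concatenated at every timestep.
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=60, hidden_size=128, num_layers=2,
                 batch_first=True, bidirectional=True)

feats = torch.randn(4, 200, 60)     # (batch, timesteps, per-stride features)
h, _ = bilstm(feats)                # forward/backward outputs already concatenated
print(h.shape)                      # torch.Size([4, 200, 256])
```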
To address the issues of gradient explosion and vanishing encountered by the Bi-LSTM network when processing long-duration data [16,17], as well as the network’s oversensitivity to silent gaps caused by speaker breathing, this paper introduces a self-attention-based network for the further processing of time-related features. The input sequence data are embedded in a positional vector space of dimension $d_{p}$ to track the relative order of each part within the sequence, which is then concatenated with the original $d$-dimensional data prior to encoding. The encoder consists primarily of two parts: the Multi-Head Attention (MHA) and the feed-forward layer. The MHA mechanism consists of several scaled dot-product attention units. Given a sequence vector, each attention unit computes contextual information about a specific token and combines it with a weighted set of similar tokens. During encoding, the network mainly learns three weight matrices: the key weight $W^{K}$, the value weight $W^{V}$, and the query weight $W^{Q}$. These three weight matrices enable us to compute the attention representation of all vectors:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V $$

where $K^{T}$ is the transpose of matrix $K$ and $d_k$ represents the dimensionality of the vector features, used to stabilize the gradient. The SoftMax function is used for weight normalization. During computation, the outputs of the individual heads are concatenated, projected by a linear layer, and fed through a feed-forward layer activated by the GELU function:

$$ \mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q}, K W_i^{K}, V W_i^{V}) $$

where $\mathrm{head}_i$ represents the attention output generated by the $i$-th head.
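A minimal sketch of this global block, with positional embeddings concatenated to the Bi-LSTM features followed by multi-head scaled dot-product attention and a GELU feed-forward layer, is shown below; all dimensions are illustrative assumptions:

```python
# Global temporal block sketch: positional embedding concat + MHA + GELU feed-forward.
import torch
import torch.nn as nn

d_model, d_pos, n_heads, T = 256, 32, 4, 200

x = torch.randn(1, T, d_model)                            # Bi-LSTM output features
pos = nn.Embedding(T, d_pos)(torch.arange(T)).unsqueeze(0)
z = torch.cat([x, pos], dim=-1)                           # track relative order

mha = nn.MultiheadAttention(embed_dim=d_model + d_pos,    # softmax(QK^T / sqrt(d_k)) V
                            num_heads=n_heads, batch_first=True)
attn_out, _ = mha(z, z, z)

ffn = nn.Sequential(nn.Linear(d_model + d_pos, d_model + d_pos), nn.GELU())
out = ffn(attn_out)
print(out.shape)                                          # torch.Size([1, 200, 288])
```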
Finally, the processed feature vectors undergo a binary classification task to determine whether each stride contains a valid human voice signal. At this point, the network outputs the audio segmented into units based on this determination.
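The per-stride decisions can then be grouped into speech segments by merging consecutive positive strides; a minimal sketch (the stride length and threshold are assumed values) follows:

```python
# Turn per-stride speech probabilities into (start, end) segments in seconds.
import numpy as np

def frames_to_segments(probs, stride_s=0.02, threshold=0.5):
    speech = probs > threshold
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i * stride_s                      # segment opens
        elif not is_speech and start is not None:
            segments.append((start, i * stride_s))    # segment closes
            start = None
    if start is not None:
        segments.append((start, len(speech) * stride_s))
    return segments

probs = np.array([0.1, 0.8, 0.9, 0.7, 0.2, 0.1, 0.6, 0.9, 0.3])
print(frames_to_segments(probs))   # [(0.02, 0.08), (0.12, 0.16)] up to float rounding
```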
4.2. End-to-End Radiotelephone Speaker Separation
The end-to-end radiotelephone speaker separation model requires processing the audio segments divided by the voice activity detection model, clustering pilots and controllers into two groups based on the acoustic characteristics of each command segment. The Pyannote-based end-to-end speaker segmentation model [
9] is fine-tuned to cluster speech.
With simple pre-training, the binary clustering model based on voiceprint features can easily cluster pilot and controller voices. In actual long-duration radiotelephone communication, the communication time established between individual pilots and controllers is short, particularly in scenarios where multiple pilots communicate with controllers. Given the high volume of flights at airports, with numerous and unknown speakers, conventional speaker diarization systems struggle to handle the audio. As a result, this model incorporates the flight prior knowledge-based text-related clustering model to improve the robustness of the speaker diarization task in multi-pilot scenarios.
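The pre-clustering idea can be illustrated with a generic two-class split of segment embeddings; the sketch below uses synthetic embeddings and k-means as a stand-in and is not the paper's fine-tuned pyannote-based EESS model:

```python
# Stand-in sketch: split segment embeddings (e.g., x-vectors) into two groups,
# reflecting the clean wired controller channel vs. the noisy VHF pilot channel.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
controller = rng.normal(loc=+1.0, scale=0.2, size=(10, 8))   # clean, wired channel
pilot      = rng.normal(loc=-1.0, scale=0.6, size=(25, 8))   # noisy, wireless channel
embeddings = np.vstack([controller, pilot])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print(labels)   # one label per segment: controller class vs. pilot class
```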
4.3. Flight Prior Knowledge-Based Text-Related Clustering
To address the challenges of dealing with a large and indeterminate number of speakers in long-duration radiotelephone communication, as well as the fuzziness of pilot voiceprints, which makes designing a voiceprint-based speaker clustering system difficult, this system introduces flight prior knowledge-based text-related clustering. It re-clusters speakers based on radiotelephone command texts, which include call signs.
In standard radiotelephone communications, commands from the pilot’s side should contain the aircraft’s call sign, and if a command includes multiple call signs, the last call sign is definitively that of the aircraft [
18,
19]. Additionally, airlines coordinate flights through the Air Coordination Council, with coordination cycles measured in years divided into spring and autumn seasons, resulting in flights appearing in fixed cycles. Martin Kocour et al. [
20] found that airport radar data can be used to correct the textual results of voice recognition when extracting call signs. Therefore, based on prior knowledge, pilot call signs can be matched and corrected to address errors caused by voice recognition systems [
21]. This paper presents a flight prior knowledge-based text-related clustering model, which performs call sign recognition, correction, and extraction for clustering. The algorithm’s specific process is as follows:
First, the call sign recognition task is defined as follows: given an acoustic observation sequence $X$, the process involves finding the most likely corresponding call sign string sequence $W^{*}$. The specific process is described as follows:

$$ W^{*} = \arg\max_{W} P(X \mid W)\, P(W) $$

where $P(X \mid W)$ represents the probability constraints of the acoustic model, while $P(W)$ represents the probability constraints of the language model.

Secondly, the call sign correction task is defined as follows: given a recognized string $W$, the required string sequence $W^{\prime}$ is determined by correcting a small number of its characters.
The algorithm’s primary workflow is divided into three parts: graph construction based on prior information, graph correction based on radar data, and probability optimization.
4.3.1. Graph Construction Based on Prior Knowledge
The algorithm begins with aviation data that have been revised by the aviation coordination council, and it constructs a dictionary graph as follows: Initially, it collects all possible flight call signs that may appear at the airport, splitting them into three-letter airline codes and four-digit flight numbers. Separate graphs are generated for airline codes and flight numbers. The processed call signs are then organized by the character sequence, with the preceding letter as the head and the following letter as the tail, forming a directed edge. Finally, the outdegree of each node’s letter is tallied, and the results are normalized to determine the transition probabilities for each character state.
$$ P(s_j \mid s_i) = \frac{d(s_i, s_j)}{\sum_{k} d(s_i, s_k)} $$

where $P(s_j \mid s_i)$ represents the probability of transitioning from the current state $s_i$ to state $s_j$, $d(s_i, s_j)$ denotes the outdegree count of transitions from state $s_i$ to state $s_j$, and $\sum_{k} d(s_i, s_k)$ represents the total outdegree from state $s_i$ to any state $s_k$. The final output is depicted on the left side of Figure 5.
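A minimal sketch of this graph construction, with a toy call sign list standing in for the seasonal schedule, is given below:

```python
# Build the dictionary graph: directed character edges with normalized
# transition probabilities P(s_j | s_i) = d(s_i, s_j) / sum_k d(s_i, s_k).
from collections import defaultdict

def build_graph(strings):
    counts = defaultdict(lambda: defaultdict(int))
    for s in strings:
        for head, tail in zip(s, s[1:]):          # consecutive characters form an edge
            counts[head][tail] += 1
    return {head: {tail: c / sum(tails.values()) for tail, c in tails.items()}
            for head, tails in counts.items()}

call_signs = ["CCA1234", "CCA4369", "CSN3456"]    # toy seasonal schedule
airline_graph = build_graph(cs[:3] for cs in call_signs)   # 3-letter airline codes
number_graph  = build_graph(cs[3:] for cs in call_signs)   # 4-digit flight numbers
print(airline_graph["C"])   # {'C': 0.4, 'A': 0.4, 'S': 0.2}
```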
4.3.2. Graph Correction Based on Radar Data
The algorithm then uses data obtained from radar to correct the dictionary graph, with the following procedure: it updates the graph with the most recent flight call signs reported by the radar at the time of query. As before, each call sign is first split into the three-letter airline code and the four-digit flight number, and each part is inserted into the corresponding graph. For every character pair, with the preceding letter serving as the head and the following letter as the tail, all transition probabilities of the current head node are updated: the transition probability of the edge ending at the observed tail node is multiplied by the weight $\alpha$ and increased by the weight $\beta$, while the transition probabilities of edges not ending at the tail node are multiplied by $\alpha$, with $\alpha + \beta = 1$:

$$ P^{\prime}(s_j \mid s_i) = \begin{cases} \alpha\, P(s_j \mid s_i) + \beta, & s_j \text{ is the observed tail node} \\ \alpha\, P(s_j \mid s_i), & \text{otherwise} \end{cases} $$

where $\alpha$ reflects the degree of correlation between the radar data and the control commands: the call signs appearing in the radar data are positively correlated with those in the control commands whenever $\beta > 0$, and the larger $\alpha$ is, the smaller this correlation. In different scenarios, the correlation between the most recent radar call signs and the control-command call signs varies; for example, at busy airports with a high volume of aircraft movements, this correlation is weaker, necessitating a larger value of $\alpha$. The value of $\alpha$ should therefore be adjusted during pre-training in real-world test scenarios, and a fixed value determined in this way was used in this experiment.
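The correction step can be sketched as follows; the value of $\alpha$ used here is purely illustrative and is not the value used in the paper's experiments:

```python
# Radar-based graph correction: for each (head, tail) edge of a call sign seen on
# radar, every outgoing edge of the head is scaled by alpha and the observed edge
# additionally receives beta = 1 - alpha, so each row remains normalized.
def radar_correct(graph, call_sign, alpha=0.8):
    beta = 1.0 - alpha
    for head, tail in zip(call_sign, call_sign[1:]):
        row = graph.setdefault(head, {})
        for k in row:                               # shrink all outgoing edges
            row[k] *= alpha
        row[tail] = row.get(tail, 0.0) + beta       # boost the observed edge
    return graph

graph = {"C": {"C": 0.4, "A": 0.4, "S": 0.2}}
radar_correct(graph, "CS", alpha=0.8)
print(graph["C"])   # approx. {'C': 0.32, 'A': 0.32, 'S': 0.36}
```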
4.3.3. Probability Optimization
During a call sign-decoding operation, assume that character $c_{i-1}$ has already been decoded and character $c_i$ is awaiting decoding, with the potential decoding results for $c_i$ forming the character set $C$. In this context, the probability of decoding $c_i$ to a candidate $w \in C$ is $P_{\mathrm{dec}}(w)$. Multiplying this decoding probability by the corresponding state transition probability in the graph therefore gives the score of each potential decoding outcome at that moment, and the speech recognition task is updated as follows:

$$ c_i^{*} = \arg\max_{w \in C} P_{\mathrm{dec}}(w)\, P(w \mid c_{i-1}) $$

Similarly, for the correction task, assume that the call sign character to be corrected is $c_i$ with the character value $w$, the set of potential correction results is $C^{\prime}$, the preceding character is $c_{i-1}$, and the succeeding character is $c_{i+1}$. In the dictionary graph, the transition probability from character state $c_{i-1}$ to state $w$ is $P(w \mid c_{i-1})$, and from character state $w$ to state $c_{i+1}$ is $P(c_{i+1} \mid w)$. Whenever $P(w \mid c_{i-1})\, P(c_{i+1} \mid w) = 0$, the call sign sequence is erroneous; a correction should then be made, with the corrected character chosen so that this context probability is maximized. Thus,

$$ c_i^{\prime} = \arg\max_{w^{\prime} \in C^{\prime}} P(w^{\prime} \mid c_{i-1})\, P(c_{i+1} \mid w^{\prime}) $$
Figure 5 (right) shows that the flight CCA4369 has now been added to the simulated radar information. After probability optimization, call sign characters will tend to select characters that have appeared in prior information, with a greater emphasis on call sign characters that have recently appeared in radar information.
Figure 6 illustrates the specific operational process of this algorithm using a simple example.
This algorithm enhances the accuracy of call signs in the text.
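The decoding and correction rules above can be sketched with a toy graph as follows; the probabilities and candidate sets are illustrative only:

```python
# Graph-weighted decoding and zero-probability correction for call sign characters.
def decode_char(prev, candidates, graph):
    """candidates: {char: recognizer probability}; returns the graph-weighted argmax."""
    return max(candidates,
               key=lambda w: candidates[w] * graph.get(prev, {}).get(w, 0.0))

def correct_char(prev, curr, nxt, graph):
    """Replace curr when its context probability under the dictionary graph is zero."""
    def ctx(w):
        return graph.get(prev, {}).get(w, 0.0) * graph.get(w, {}).get(nxt, 0.0)
    if ctx(curr) > 0.0:
        return curr
    return max(graph.get(prev, {}), key=ctx, default=curr)

graph = {"C": {"C": 0.5, "S": 0.5}, "S": {"N": 1.0}}
print(decode_char("C", {"G": 0.6, "S": 0.4}, graph))   # 'S': the graph vetoes 'G'
print(correct_char("C", "G", "N", graph))              # 'G' is corrected to 'S'
```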
Finally, for processed speech text, call sign matching is performed. In radiotelephone communication, pilots often append the aircraft’s call sign to the end of a command. As shown in
Figure 7, the text is first reversed and the reversed airline code is matched; the four characters preceding the match (the reversed flight number) are appended, and the result is reversed again, yielding the call sign recognized from the speech.
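A minimal sketch of this reverse-matching step, with a toy airline-code list, is shown below:

```python
# Extract the trailing call sign: reverse the text, locate the reversed airline
# code, keep the four digits that precede it in the reversed text, reverse back.
AIRLINE_CODES = {"CCA", "CSN", "CES"}   # illustrative subset

def extract_call_sign(text):
    rev = text.replace(" ", "")[::-1]
    for code in AIRLINE_CODES:
        idx = rev.find(code[::-1])      # first match = last call sign in the command
        if idx != -1:
            candidate = rev[max(0, idx - 4): idx + 3]   # 4 digits + reversed code
            return candidate[::-1]
    return None

print(extract_call_sign("climb to 8900 maintain CCA4369"))   # CCA4369
```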
Re-clustering based on call sign texts allows us to easily overcome the clustering challenges caused by the sparsity of pilot voiceprints, as well as the difficulties of clustering in scenarios involving multiple pilots.
7. Conclusions
This study introduces a novel radiotelephone communications speaker diarization network based on the unique characteristics of air–ground communication scenarios. The network comprises three major components: voice activity detection (VAD), end-to-end radiotelephone speaker separation (EESS), and flight prior knowledge-based text-related clustering (PKTC). The VAD module uses attention mechanisms to effectively filter out silence and noise during air–ground communications. The end-to-end speaker segmentation model precisely segments speech and divides it into two categories: pilots and controllers. One of this paper’s novel contributions is the prior knowledge-based text-related clustering model, which accurately clusters the segmented pilot speech to the corresponding call sign-identified speakers.
Furthermore, to train and test speaker diarization tasks, this study collected real, continuous air–ground communication voice data, resulting in the ATCSPEECH dataset. This dataset contains 5347 voice recordings from 106 speakers, including 11 controllers and 95 pilots, for a total duration of approximately 14 h.
After fine-tuning and training on the proprietary ATCSPEECH dataset, as well as the public ATCO2 PROJECT and AMI datasets, the network significantly outperformed the baseline models. Ablation studies confirmed the strong robustness of each network module. Overall, the model proposed in this study is effectively optimized for the characteristics of air–ground communication and performed exceptionally well in speaker diarization tasks, providing valuable references for future research and applications in the aviation communication field. In addition, the system can adapt to significant changes in voiceprints caused by inconsistencies in VHF COMM equipment, or handle data from an auxiliary VHF COMM receiver, by fine-tuning the end-to-end radiotelephone speaker separation (EESS) module or by using an unbalanced fine-tuning training strategy (in which the controller-class data greatly outnumber the pilot-class data during fine-tuning).
In future research, we plan to investigate integrating automatic speech recognition (ASR) and speaker diarization into a jointly trained, end-to-end design to reduce the model’s training loss and improve overall performance. The network could also be jointly optimized with downstream tasks, such as pilot profiling and controller career lifecycle monitoring, to improve overall task performance. In addition, large language models could be integrated so that speech vectors are fed directly into the model, bypassing explicit speech recognition and directly producing downstream results, thereby reducing the system’s response time.