1. Introduction
The application of Artificial Intelligence (AI) to the study of animal vocalizations has transformed how researchers and farmers assess the health and welfare of livestock. In particular, acoustic analysis combined with advanced machine learning (ML) techniques offers a promising non-invasive strategy for real-time monitoring of poultry welfare [1,2]. This approach addresses several limitations of traditional observation-based methods, which are often subjective, time-consuming, and potentially stressful for the animals [3,4]. By leveraging AI-driven acoustic analysis within precision livestock farming practices, researchers have the opportunity to enhance productivity while simultaneously improving ethical standards in poultry farming [5,6].
Traditional welfare assessments in poultry rely heavily on physical and behavioral observations, such as feather condition, posture, vocal frequency, or changes in feeding activity. While these qualitative assessments are widely used, they are prone to variability among observers and lack the continuous monitoring necessary for timely interventions [7]. In contrast, AI models designed for vocal signal detection provide automated surveillance capable of identifying subtle changes in vocal patterns associated with stress, fear, disease, or distress. These systems reduce human–animal interactions during monitoring, thereby minimizing stress responses triggered by observers and preserving natural bird behavior [8,9].
1.1. AI for Acoustic Analysis in Poultry Welfare
Recent advances in machine learning—particularly deep learning—have enabled significant breakthroughs in detecting and classifying animal vocalizations [10]. Historically, much of this work has relied on specialized Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), which operate on time–frequency representations (spectrograms) of recorded calls [11,12]. For example:
- (a) Successful identification of bird species from field recordings [13,14].
- (b) Real-time monitoring of farm animal distress signals [15,16].
- (c) Early detection of poultry diseases through cough-like vocal patterns [17,18].
Despite their effectiveness in classification tasks such as identifying distress calls or categorizing alarm calls [19,20], these models often require carefully curated datasets labeled by domain experts. The reliance on large labeled datasets remains a bottleneck for broader adoption. Transfer learning techniques have emerged as a solution to this challenge by repurposing pretrained models originally developed for human speech recognition to detect specific avian vocal patterns [21,22]. For instance, CNNs trained on thousands of hours of human speech have been successfully adapted to identify avian vocalizations with reduced domain-specific training requirements.
However, while such models are powerful for classification, they fail to capture the semantic complexity inherent in animal communication [23]. Chickens produce a variety of vocalizations—such as alarm calls, food calls, comfort calls, and distress calls—that reflect not only immediate threats or resource availability but also underlying emotional or physiological states [24,25]. Deciphering these nuanced signals is critical for enabling farmers to respond appropriately and improve both animal welfare and farm efficiency [26].
1.2. State of the Art in Bioacoustic Methods
The field of bioacoustics encompasses research not only on poultry but also on wildlife and other livestock species. CNNs remain central to many state-of-the-art approaches, particularly for species identification in acoustic ecology [27,28]. These methods frequently achieve performance metrics exceeding 90% accuracy for species detection tasks and demonstrate robustness against environmental noise when using advanced architectures like hybrid CNN-RNN models or Attention-based Inception networks [29].
More recently, researchers have explored Transformer architectures capable of processing sequences of acoustic frames over extended time windows—a feature that is particularly relevant for analyzing social and emotional contexts in vocalizations [30,31]. However, applying these methods to poultry welfare remains limited. Current classifiers primarily distinguish between “distress” and “non-distress” calls or “healthy” versus “unhealthy” sounds without providing deeper semantic interpretations or emotional insights into the vocalizations [32].
Large-scale Automatic Speech Recognition (ASR) models like DeepSpeech, wav2vec, and Whisper have revolutionized human speech recognition by leveraging massive amounts of textual and audio data. Whisper stands out for its ability to handle multilingual inputs and noisy environments with high accuracy [33,34]. Although Whisper was designed specifically for human language processing, its Transformer-based architecture may be capable of detecting structural similarities in non-human vocalizations—such as pitch changes or rhythmic patterns—if suitably fine-tuned or guided [35]. This raises an intriguing question: can a model trained purely on human language be adapted to extract meaningful text-like representations from chicken vocalizations?
1.3. The Potential of Large-Scale ASR Models (Whisper) for Poultry
Whisper’s Transformer architecture excels at capturing long-range dependencies within sequential data through its self-attention mechanism. While its front-end acoustic processing is optimized for human phonetics and lexicons, there is potential for its high-level representations to encode non-linguistic information such as pitch transitions or amplitude envelopes. If these patterns emerge consistently as pseudo-linguistic tokens during transcription—despite being nonsensical—they could still provide meaningful insights into stress or contentment levels in chickens when analyzed using sentiment tools originally designed for textual data [36,37].
Early proofs-of-concept have demonstrated partial success in applying large-scale Transformers to other animal communication systems such as whale calls [38,39]. While direct translation is not feasible due to the absence of linguistic meaning in animal sounds, hidden layers within these models appear capable of clustering similar call types based on consistent acoustic features. This phenomenon highlights an important insight: advanced ASR systems can act as generalized audio encoders rather than strictly speech decoders [40].
1.4. Gaps, Challenges, and Ethical Considerations
Despite promising results from early studies applying AI to animal vocalizations, several significant challenges remain:
- (a) Data Requirements: Training large-scale Transformers like Whisper requires extensive high-quality labeled recordings annotated by poultry experts—a time-intensive process that limits scalability [41,42].
- (b) Background Noise: Commercial farms are inherently noisy environments with overlapping bird calls and mechanical sounds such as ventilation fans. While Whisper is robust against noise in human speech contexts, farm-specific noise poses unique challenges that complicate analysis [43].
- (c) Ethical Constraints: Inducing stress or disease states in flocks solely for data collection purposes raises ethical concerns. Additionally, variations in farm layouts and climates reduce the feasibility of standardized sampling protocols across different facilities [44].
- (d) Interpretation vs. Translation: Even if Whisper generates consistent transcriptions like “nihai nihai”, these strings lack semantic meaning in human language. Extracting actionable insights requires careful correlation with validated welfare metrics such as physiological indicators or behavioral observations [45].
Successfully addressing these challenges would advance animal welfare science significantly by creating pipelines for continuous real-time monitoring that reduce labor demands while improving early detection capabilities and overall welfare standards.
1.5. Objectives and Contributions
Building on the above context, the present study aims to bridge state-of-the-art AI approaches with practical poultry welfare needs. Specifically, we investigate whether OpenAI’s Whisper—a large-scale model trained on massive human speech datasets—can be harnessed or adapted to interpret chicken vocalizations. The overarching hypothesis is that certain repetitive or distinctive token sequences in Whisper’s output will align with known behavioral states (e.g., pre- vs. post-stress, healthy vs. unhealthy, quiet vs. noisy environments). By applying standard sentiment analysis methods to these automatically generated texts, we explore the potential to gauge changes in emotional valence or stress. Four key objectives guide this work:
- (1) Adapt Whisper for Non-Human Audio—Assess whether Whisper can produce stable “transcriptions” of chicken vocalizations. Even if the “words” do not match any natural language, their consistency across multiple audio samples could encode stress or welfare signals.
- (2) Sentiment and Emotional Analysis—Apply standard NLP sentiment tools—originally built for English text—to the token sequences, checking for systematic shifts that map onto conditions like disease, fear, or contentment.
- (3) Compare Across Conditions—Measure how the distribution of these token-based “sentiments” diverges between healthy vs. unhealthy birds, or pre- vs. post-stress. If the approach is sensitive, it may capture early welfare changes more effectively than visual inspection.
- (4) Generate Insights for Future Fine-Tuning—Evaluate how domain mismatch impacts performance and highlight strategies—such as domain-specific lexicon augmentation or partial re-training—to refine such models for poultry bioacoustics.
In the broader context of AI and agriculture, this research offers conceptual and practical benefits. Conceptually, it expands the potential of large-scale Transformer-based ASR models beyond human speech, suggesting a path to “universal” acoustic recognition systems that can decode communication in numerous species [46]. Practically, real-time welfare monitoring reduces reliance on invasive tests, ensures earlier interventions for disease or stress, and fosters a more humane environment in large-scale poultry operations [47,48].
Overall, the intersection of machine listening, bioacoustics, and poultry welfare stands at a pivotal juncture. By harnessing the latest AI techniques—both specialized CNN-based classifiers and emerging Transformer-based ASR—we can move closer to a future where farm management decisions are informed by continuous, data-driven, and empathetic insights into the lives of the animals under our care.
2. Materials and Methods
2.1. Datasets
Two open-access datasets were used to evaluate the feasibility of applying Whisper to poultry vocalizations. The first dataset (Dataset 1) was collected at the CARUS animal facility at Wageningen University and comprises audio recordings from fifty-two Super Nick chickens housed in three separate cages. Mild stressors, such as an umbrella opening or pre-recorded dog barking, were introduced at controlled intervals, enabling the collection of both baseline and post-stress vocalizations. These recordings, detailed in [49], are publicly available through Zenodo. The dataset captures a range of stress responses, offering both controlled and ecologically relevant vocal samples for analysis.
The second dataset (Dataset 2), described in [50], contains 346 audio recordings obtained at the Bowen University poultry research farm. Each file is categorized as healthy, noisy, or unhealthy, based on observed behaviors and acoustic characteristics. Healthy recordings consist of typical chicken vocalizations without audible distress, while noisy recordings include background disruptions such as vehicular sounds or human chatter. In contrast, unhealthy recordings feature vocal indicators of potential health issues, including coughing and snoring. The recordings, which range in duration from 5 to 60 s, represent a diverse set of environmental and health-related conditions. This dataset is freely accessible on Mendeley Data. Details on the characteristics, experimental conditions, and composition of both datasets used in this study are summarized in Table 1.
The classification of stress-related vocalizations in this study was based on prior literature rather than direct physiological measurements. Existing research has established well-documented categorizations of poultry distress calls, linking distinct vocal patterns—such as increased frequency, amplitude, and duration—to stressors like predator threats, handling, and environmental disturbances [6,45]. Additionally, previous studies have identified characteristic alarm calls, discomfort vocalizations, and illness-related sounds in chickens, which have been manually annotated in earlier datasets [9,25]. Given these established associations, our labeling process adheres to previously validated classification frameworks rather than relying on new expert annotations or physiological markers.
However, we recognize that directly associating acoustic patterns with welfare states requires further validation through behavioral observations and physiological monitoring. Future work should incorporate expert-annotated datasets and sensor-based welfare indicators to enhance the accuracy and reliability of automated stress detection models. All experimental protocols related to these datasets were approved by the respective institutional review boards. In Dataset 1, stress induction was conducted under strict ethical guidelines, with measures such as controlled exposure to stimuli implemented to minimize discomfort to the chickens.
2.2. Analysis Pipeline
To unify both datasets under a single processing strategy, we designed a multi-step pipeline that addresses audio preparation, transcription via Whisper, and downstream text-based analyses.
2.2.1. Preprocessing
All audio files were converted to a standardized format using FFmpeg (16 kHz sampling rate, single-channel mono). Background noise was then removed with the Spectral De-noise module of iZotope RX (iZotope, Inc., Cambridge, MA, USA; Version 11.2.0), a critical preprocessing step to ensure clarity in the vocalization data. To achieve consistent intensity across all recordings, the sounds were subsequently normalized using Sound Forge Pro (Magix Software GmbH, Germany; Version 17.0.3.177), further refining the quality of the audio data for analysis. Each file was then segmented into shorter clips of 1–5 s to facilitate computational efficiency. These segments were chosen because abrupt changes in vocalization patterns often occur within a few seconds, making shorter clips more amenable to detailed, per-clip analysis.
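The FFmpeg-based conversion and segmentation can be expressed as a minimal sketch, assuming FFmpeg is available on the system path; the file names and the fixed 5 s clip length are illustrative, and the commercial de-noising (iZotope RX) and normalization (Sound Forge Pro) steps are performed in their respective applications rather than in code:

```python
# Sketch of the FFmpeg preprocessing: resample to 16 kHz mono, then cut into short clips.
import subprocess
from pathlib import Path

def preprocess(in_file: str, out_dir: str, clip_seconds: int = 5) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    mono = out / "converted_16k_mono.wav"
    # Resample to 16 kHz and downmix to a single channel
    subprocess.run(["ffmpeg", "-y", "-i", in_file,
                    "-ar", "16000", "-ac", "1", str(mono)], check=True)
    # Split the converted file into fixed-length clips for per-clip analysis
    subprocess.run(["ffmpeg", "-y", "-i", str(mono),
                    "-f", "segment", "-segment_time", str(clip_seconds),
                    "-c", "copy", str(out / "clip_%04d.wav")], check=True)

preprocess("barn_recording.wav", "clips")  # illustrative input file and output folder
```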
2.2.2. Transcription with Whisper
The preprocessed audio segments were fed into the base version of OpenAI’s Whisper model, which generates text tokens traditionally intended to represent words in human speech. Although Whisper’s training data derive solely from human languages, we hypothesized that stable patterns or repeated tokens might emerge in the output when presented with non-human sounds. In practice, Whisper often produced “nonsensical” strings—for instance, repeated fragments of text or unusual characters. However, the consistency of these fragments across segments with similar acoustic properties was of particular interest, as it could reflect underlying acoustic structures such as intensity, pitch, and harmonic content.
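A minimal sketch of this transcription step, using the open-source openai-whisper package and an illustrative clip path (the exact scripts used in the pipeline may differ), is:

```python
# Transcribe one preprocessed clip with the "base" Whisper checkpoint.
import whisper

model = whisper.load_model("base")           # base version of Whisper, as used in this study
result = model.transcribe("clips/clip_0001.wav")  # language detection left at its default
print(result["text"])                        # e.g. a repeated fragment such as "nihai nihai"
```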
2.2.3. Sentiment and Textual Analysis
Following transcription, all text outputs were structured and analyzed in Python (Version 3.13.1). Using the SentimentIntensityAnalyzer from the NLTK library, each transcription was assigned positive, negative, and neutral sentiment scores. Although the underlying lexicon was originally developed for English text, we examined whether relative shifts in positivity or negativity correlated with changes in vocal intensity or distress levels.
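A minimal sketch of this scoring step, assuming the VADER lexicon has been downloaded through NLTK and using an illustrative transcription string, is:

```python
# Score a Whisper transcription with NLTK's VADER-based sentiment analyzer.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time lexicon download
sia = SentimentIntensityAnalyzer()

transcription = "nihai nihai nihai"          # illustrative Whisper output
scores = sia.polarity_scores(transcription)  # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
print(scores)
```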
The rationale for sentiment analysis stems from the hypothesis that Whisper’s phonetic patterns and structured token sequences may encode latent acoustic features linked to poultry welfare. Although sentiment scores do not directly reflect emotional states in chickens, they serve as proxies for stress-induced vocal changes. Prior research has linked increased vocalization intensity, higher pitch, and repetitive distress calls with heightened physiological arousal [50]. By analyzing sentiment shifts across experimental conditions, we aim to capture variations in acoustic expression that align with known behavioral indicators of welfare.
To further validate this approach, future iterations should integrate direct acoustic feature extraction methods—such as Mel-frequency cepstral coefficients (MFCCs) and spectral entropy measures—alongside Whisper’s sentiment-based outputs. Additionally, correlating sentiment scores with external welfare metrics, such as heart rate, body temperature, or corticosterone levels, would provide stronger biological validation for AI-driven poultry welfare monitoring.
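As an illustration of this proposed extension (not part of the present pipeline), such features could be computed with a library like librosa; the dependency and file name are assumptions, and spectral flatness is used here as a stand-in for an entropy-like measure:

```python
# Illustrative sketch only: MFCC and spectral-flatness features for a single clip.
import librosa
import numpy as np

y, sr = librosa.load("clips/clip_0001.wav", sr=16000)      # load at the pipeline's 16 kHz rate
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, frames) coefficient matrix
flatness = librosa.feature.spectral_flatness(y=y)          # tonal vs. noise-like energy per frame
print(mfccs.mean(axis=1), float(np.mean(flatness)))
```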
Beyond sentiment analysis, additional textual features—including n-gram frequencies (unigram, bigram, trigram) and part-of-speech tags—were extracted to explore potential syntactic or “vocabulary-like” structures in Whisper’s outputs. These analyses were conducted using NumPy, Pandas, and Scikit-learn, while Matplotlib (Version 3.9) and Seaborn (Version 0.9.1) were employed for visualization through distribution plots, boxplots, and heatmaps.
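A minimal sketch of these n-gram and part-of-speech analyses, using illustrative placeholder transcripts (the required NLTK tagger resource name varies by version, so both are requested), is:

```python
# N-gram counts and English POS tags over placeholder Whisper outputs.
import nltk
from sklearn.feature_extraction.text import CountVectorizer

nltk.download("averaged_perceptron_tagger", quiet=True)       # POS model (older NLTK resource name)
nltk.download("averaged_perceptron_tagger_eng", quiet=True)    # newer NLTK resource name

transcripts = ["nihai nihai going", "kichi kichi room", "nihai going going"]  # illustrative

# Unigram-to-trigram frequencies pooled over all transcripts
vec = CountVectorizer(ngram_range=(1, 3))
counts = vec.fit_transform(transcripts)
print(dict(zip(vec.get_feature_names_out(), counts.sum(axis=0).A1)))

# English part-of-speech tags, knowingly mismatched to the pseudo-tokens
print([nltk.pos_tag(t.split()) for t in transcripts])
```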
2.2.4. Statistical and Visualization Tools
To gauge the similarity of transcribed tokens across different experimental conditions (e.g., pre- vs. post-stress, healthy vs. unhealthy, or quiet vs. noisy), we computed cosine similarity scores between the text embeddings. This approach allowed us to determine how closely the “transcriptions” aligned or diverged within each category. Furthermore, we used Latent Dirichlet Allocation (LDA) topic modeling to discern whether any overarching “topics” or clusters of tokens emerged consistently in subsets of the data (e.g., all post-stress clips forming one distinct cluster). This higher-level structure helps identify patterns that sentiment scores alone may not capture.
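A minimal sketch of both analyses, using illustrative transcripts and an arbitrary choice of two topics, is:

```python
# Cosine similarity between TF-IDF vectors and LDA topic modeling of transcripts.
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import LatentDirichletAllocation

transcripts = ["nihai nihai nihai", "kichi room going", "nihai nihai going"]  # illustrative

# Pairwise cosine similarity between TF-IDF representations of the transcripts
tfidf = TfidfVectorizer().fit_transform(transcripts)
print(cosine_similarity(tfidf))

# LDA topic modeling on raw token counts
counts = CountVectorizer().fit_transform(transcripts)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts))   # per-transcript topic proportions
```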
2.3. Comparisons with Prior Acoustic Analysis
Whisper’s outputs are not literal chicken “words”, making standard speech recognition metrics such as word error rate (WER) inapplicable. Instead, we compared our approach to prior research that employed CNNs, Hidden Markov Models (HMMs), and random forests to detect chicken distress calls or diagnose avian diseases. These studies report classification accuracies ranging from 67% to over 90%, depending on task complexity [15,16]. However, our objective is not to surpass these models in word recognition but to assess whether Whisper-generated “transcriptions” exhibit sentiment and textual patterns that correlate with known welfare conditions (healthy vs. unhealthy, pre- vs. post-stress). Therefore, our performance evaluation prioritizes coherence and interpretability of sentiment shifts over literal transcription accuracy.
While Whisper shows promise in extracting meaningful patterns from non-human vocalizations, a direct benchmark against domain-specific avian models would enhance its comparative value. However, such a comparison presents inherent challenges. CNNs and RNN-based models for bioacoustics rely on spectrogram-based feature extraction and require large-scale labeled datasets for supervised learning. In contrast, Whisper operates as a sequence-to-sequence ASR model, producing text-like outputs that do not directly align with conventional classification metrics such as accuracy, precision, recall, or F1-score. Consequently, standard speech recognition performance evaluations do not apply to non-linguistic vocalizations.
Despite these differences, and as noted above, prior studies using CNNs and HMMs for poultry distress calls have reported 67–90% classification accuracy [15,16]. Given that Whisper has not been fine-tuned for avian data, a direct comparison would require a domain adaptation step, which falls beyond the scope of this feasibility study. Instead, our analysis focuses on interpreting Whisper’s token outputs and their correlation with welfare indicators. Future work should explore a hybrid approach, integrating CNN-based acoustic features with Whisper’s text outputs to balance interpretability and classification accuracy in poultry vocalization analysis.
2.4. Limitations and Potential Biases
The domain mismatch between human speech and chicken vocalizations stands out as the most significant limitation. Whisper’s underlying weights prioritize human phonetic and linguistic structures, thus producing text tokens (e.g., “nihai”, “kichi”) that have no direct meaning in poultry communication. The practical assumption here is that consistent token repetition or sentiment skew could correlate with acoustic features indicative of stress or well-being. Another limitation pertains to annotation. The datasets used in this study provide broad labels such as “healthy”, “unhealthy”, and “noisy”, but do not include granular annotations that differentiate alarm calls, threat calls, or various social calls. Future work would benefit from more extensive labeling efforts to precisely link model output to each vocalization’s ecological or emotional function.
Background noise remains a technical challenge in large-scale operations. Despite applying basic noise reduction, farm environments can contain overlapping calls and irregular mechanical sounds that potentially reduce the model’s sensitivity to subtle changes in chicken vocalizations. Addressing this would require domain-adaptive noise filtering or targeted fine-tuning of Whisper’s acoustic front-end. Finally, any sentiment analysis performed with an English-based lexicon inherently lacks domain-specific calibration for animal calls. Consequently, the positive or negative sentiment scores reported here should be interpreted with caution. They do not directly map onto the emotional states of the chickens but serve as proxies for changes in vocal intensity or patterning, potentially linked to welfare conditions. Despite these caveats, the results offer an intriguing proof-of-concept for deploying large-scale ASR models in non-human acoustic domains.
3. Results
The core aim of this research was to investigate whether OpenAI’s Whisper model—originally trained on large-scale human speech corpora—could generate text outputs that meaningfully capture acoustic variations in chicken vocalizations. Two main datasets served as testbeds: one focusing on chickens before and after stress induction (Dataset 1), and the other comparing healthy, noisy, and unhealthy conditions (Dataset 2). Throughout the experiments, Whisper frequently produced “nonsensical” text tokens (e.g., “nihai”, “kichi”, “going”) when transcribing chicken calls. Despite the apparent domain mismatch, careful analysis of token patterns, sentiment scores, linguistic metrics, and topic clusters revealed consistent correlations with known stressors or health statuses.
3.1. Whisper Output Characteristics and Consistency
Whisper’s outputs for chicken vocalizations typically comprised repetitive tokens or fragments that bore no linguistic resemblance to human words. Nevertheless, these tokens emerged in a highly consistent manner across recordings with similar acoustic features.
Figure 1 illustrates how loud or urgent calls—often occurring in post-stress or unhealthy contexts—regularly generated repeated sequences (for instance, “nihai nihai”), whereas calmer baseline calls produced a broader, less repetitive token distribution; the repeated tokens and character usage thus reflect distinct acoustic profiles across conditions. This consistency suggests that Whisper’s latent feature extraction was sensitive to amplitude, pitch, or call duration patterns, even though it was never trained on avian data.
In Dataset 1, chickens recorded immediately after mild stressors (e.g., an opened umbrella) exhibited more frequent repeated tokens and stronger amplitude signals. Meanwhile, pre-stress recordings yielded token outputs that were more varied, suggesting lower overall vocal intensity. A similar trend emerged in Dataset 2: “unhealthy” chickens that coughed or exhibited rale-like sounds tended to produce repetitive token strings, whereas healthy calls were more diverse. Importantly, the presence of random punctuation and foreign characters rose significantly in “noisy” environments with overlapping background sounds. Although this phenomenon contributed additional “nonsense” to the transcripts, the patterns themselves remained stable within each noisy clip, implying that Whisper’s decoder was reacting systematically to the acoustic confusion.
3.2. Sentiment Analysis Across Experimental Conditions
To probe whether these domain-mismatched transcriptions might reflect actual differences in welfare states, we applied an English-oriented sentiment scoring tool (NLTK’s SentimentIntensityAnalyzer) to each token sequence.
3.2.1. Pre- vs. Post-Stress (Dataset 1)
Segments recorded in the immediate aftermath of stress induction displayed a pronounced increase in “negative” sentiment scores, consistent with heightened vocal urgency or agitation. Meanwhile, targeted interventions, such as partial covers that calm the flock, occasionally led to lower negative sentiment, presumably reflecting more subdued call patterns. Although these “negative” labels do not literally map to avian distress, they highlight an alignment between token repetition and the sentiment analyzer’s classification of urgency or intensity.
3.2.2. Healthy vs. Noisy vs. Unhealthy (Dataset 2)
Recordings taken in noisy environments exhibited higher negative polarity scores than healthy ones, suggesting that Whisper’s repeated punctuation or foreign glyphs—prompted by background chaos—triggered the analyzer’s negative weighting. In “unhealthy” segments, the sentiment distribution was more variable; some clips approached strongly negative scores, possibly from coughing or labored breathing, while others remained closer to neutral. This variation (Figure 2) underscores the complexity of linking raw token repetition to highly heterogeneous physiological conditions. Still, across both datasets, elevated “negative” scores loosely tracked the presence of disruptive noise, stress, or health problems. It should be noted that absolute sentiment values may be misleading, whereas relative differences are informative.
3.3. Linguistic Metrics and Topic Modeling Insights
Despite the absence of genuine semantics, several conventional textual analyses uncovered systematic differences tied to each experimental condition. Character and word frequency counts, for example, highlighted how tokens like “nihai”, “kichi”, “going”, or “room” disproportionately clustered in high-distress or high-volume recordings.
Figure 3 and Figure 4 explore these frequencies further, revealing that “nihai” often topped the token list in post-stress calls, while calmer sessions showed a more balanced vocabulary. When part-of-speech (POS) tags were assigned—albeit with an English-trained NLP tagger—post-stress transcripts contained higher counts of “adjectives” or “present participles”, presumably because repeated tokens or morphological variants (e.g., “nihaiii”, “goinggg”) triggered these labels.
To determine whether these token patterns might form distinct clusters, we employed Latent Dirichlet Allocation (LDA). Topic modeling identified recurring sets of tokens that separated pre-stress from post-stress segments and healthy from unhealthy vocalizations. These “topics” significantly overlapped with repeated token fragments often associated with negative sentiment. Such clustering supports the notion that the same acoustic indicators leading to repeated text outputs also cause the model to group these transcripts into distinct “themes”. Even though the tokens themselves have no dictionary meaning, their distribution across topics reflects underlying acoustic conditions.
Figure 3 (Panel B) shows an example of how repeated “nihai” sequences coalesce into a post-stress cluster, while more diverse tokens populate pre-stress or healthy topics.
3.4. Text Similarity and Condition Separation
To quantify the similarity of transcripts, we vectorized each token sequence (for instance, via TF-IDF weighting) and calculated cosine similarity.
Figure 4 depicts the resulting heatmaps, in which segments from the same condition (e.g., all healthy recordings) consistently group together, while segments from opposing conditions (e.g., healthy vs. unhealthy) occupy distinct regions. This aligns with prior findings that repeated or intense calls yield characteristic textual proxies, whereas quieter, stable vocalizations map onto more varied or sparse outputs. Post-stress or unhealthy states thus exhibit lower textual similarity to calmer, baseline conditions.
These clustering results parallel the phenomenon observed in CNN-based spectrogram classification, where distress calls cluster together in high-dimensional feature space. Here, however, the separation arises from a Transformer-based ASR converting amplitude and pitch features into repeated tokens that NLP similarity metrics interpret as “closely related documents.” This underscores a key point: large-scale speech models can function as generalized acoustic encoders, capturing enough detail to differentiate stress or illness conditions in poultry even without specialized training on avian calls.
Extended visualizations—covering bigram/trigram word clouds, correlation matrices, and more detailed LDA plots—appear in the Supplemental File (Figures S1–S9). For example, Figures S1 and S2 expand on the token distributions across different stress levels, while Figures S3–S9 present additional topic modeling results, multimodal correlation plots, and character-level frequency charts. Together with the parts-of-speech ratios and multi-dimensional scaling analyses reported there, these supplemental items corroborate the main findings and provide finer-grained evidence of how Whisper’s transcriptions reflect acoustic changes in multiple poultry contexts.
4. Discussion
This study indicates that a human-oriented ASR model (Whisper) can generate text outputs consistently correlated with variations in chicken vocal patterns across stress, noise, and health conditions. Despite the clear mismatch in language domain, the model’s acoustic front-end appears adept at encoding pitch, amplitude, and spectral cues into stable sequences of tokens. When these tokens are analyzed using standard NLP sentiment and clustering tools, they reveal differences that align with known events (e.g., stress induction, background chaos) or health issues (e.g., coughing, rales).
One of the most striking outcomes is that Whisper, trained solely on human speech, repeatedly produced token strings like “nihai nihai” whenever birds emitted louder or more rapid calls. Although these strings hold no genuine meaning in English, their repetitive nature effectively captures “signature” acoustic features typical of urgent or distressed vocalizations. Similar phenomena were noted in pilot studies applying speech encoders to whale calls or bat echolocation signals. The ability of large-scale ASR models to act as generic feature extractors challenges conventional assumptions that these models are confined to processing human phonemes and syntax. Instead, they appear to extract high-level temporal or frequency cues that can generalize to non-human vocal repertoires, at least for the purpose of classification or monitoring.
From a practical standpoint, such an approach offers a non-invasive, real-time monitoring strategy for poultry farms. Rather than relying on labor-intensive visual inspections or invasive physiological sampling, farmers could place microphones to record calls continuously and run Whisper locally. Sudden spikes in negative sentiment or repeated tokens might alert them to health problems, predatory threats, or environmental stressors (e.g., equipment failure). This could enable early intervention—potentially mitigating large-scale losses and reducing antibiotic use. Moreover, this method avoids the need for extensive custom labeling of chicken calls before initial deployment. The nonsensical textual outputs become relevant only insofar as they are stable or repeated under certain conditions.
Despite these benefits, several constraints and areas for improvement arise. Domain mismatch remains the largest hurdle. Because Whisper is not trained on avian calls, it occasionally “hallucinates” random punctuation, foreign letters, or morphological expansions. In quiet or intermediate conditions, these artifacts can reduce the clarity of the textual signals. Additionally, the broad “negative”, “positive”, and “neutral” labels from standard English-based sentiment dictionaries do not necessarily map onto genuine emotional states in chickens. Such labels are better interpreted as proxies for amplitude or repetition in the calls. A domain-specific lexicon or direct correlation with validated ethograms would likely yield a more accurate measure of real distress or well-being.
Noise contamination in commercial barns also poses a challenge. While Whisper handles moderate noise well for human speech, the chaotic overlaps of machinery, ventilation, and hundreds of clucking birds may degrade recognition. Preliminary filtering or specialized microphone arrays might isolate individual calls more effectively, boosting interpretive accuracy. Additionally, multi-speaker diarization—commonly used in human teleconferencing—could help segregate overlapping calls from different birds, further refining the textual output.
Comparison with prior poultry bioacoustics research underscores the novelty of this approach. Traditional methods typically employ Convolutional Neural Networks (CNNs), Hidden Markov Models (HMMs), or random forests trained on spectrograms or carefully annotated data. These strategies can achieve high accuracy (often above 90%) for specific tasks (e.g., distress-call detection), but they often require substantial domain-specific labeling. By contrast, the Whisper-based pipeline leverages an enormous pretrained model “as-is”, with minimal or no additional training. It does not compete directly with fine-grained classification accuracy, yet it provides a fast route to real-time detection of broad changes in vocalization patterns. Looking ahead, partial fine-tuning of Whisper’s acoustic layers on a smaller, labeled set of avian calls could improve the clarity of the textual tokens, potentially bridging some of the gap to specialized models.
A final point of interest is the concept of a future “chicken translator”. While genuine translation is unlikely—chickens do not produce language in the human sense—these results hint at advanced acoustic systems that might map repeated calls to well-defined categories such as alarm calls, comfort calls, or social chatter. Integrating other data streams (camera-based posture analysis, physiological sensors) could enrich the interpretability of repeated tokens and provide robust multimodal indicators of a flock’s welfare. For instance, consistent spikes in “nihai nihai” tokens plus decreased feeding activity might indicate significant stress or illness, prompting a more targeted veterinary check.
Taken as a whole, this proof-of-concept demonstrates that a large-scale human ASR model can be repurposed—without specialized domain re-training—to capture meaningful patterns in chicken vocalizations. While certainly not a replacement for domain-tuned architectures, it opens new directions for rapid, non-invasive acoustic surveillance in precision livestock farming. The next phase of research might incorporate partial fine-tuning, advanced noise mitigation, and more precise linking of token repetition to specific behavioral or physiological states. In time, these innovations could substantially improve early-warning systems, reduce mortality through prompt intervention, and support more humane and effective management of poultry welfare.
5. Conclusions
The findings demonstrate that OpenAI’s Whisper, a Transformer-based model originally developed for human speech recognition, can effectively decode acoustic features from chicken vocalizations. Despite inherent domain differences, Whisper reliably extracted consistent patterns from poultry vocalizations associated with specific emotional and physiological states. Sentiment analysis further revealed systematic changes in acoustic signals corresponding to recognized stress and welfare conditions, effectively distinguishing periods of distress from states of calmness. The observed correlation between Whisper-generated sentiment shifts and known stress indicators indicates that Transformer-based architectures, even without species-specific fine-tuning, can serve as powerful generalized acoustic analyzers in animal welfare research.
Several promising areas emerge for future research to enhance model accuracy and robustness. Firstly, fine-tuning the Whisper model using annotated chicken vocalization datasets could significantly improve sensitivity to poultry-specific acoustic cues. Creating carefully annotated vocalization databases that cover diverse environmental conditions, varied flock sizes, and different breeds would further refine acoustic feature detection capabilities. Additionally, developing a multimodal validation framework that integrates physiological markers, such as heart rate variability, corticosterone levels, and thermographic imaging, along with behavioral observations through video analysis, could markedly improve the biological interpretability of acoustic findings. Another promising direction is the application of anomaly detection techniques, such as Isolation Forest or Dynamic Time Warping (DTW), to detect unusual vocal patterns that conventional sentiment-based methods might overlook. Moreover, conducting rigorous null-model analyses with synthetic or shuffled vocalizations would help establish clearer baseline comparisons.
Ultimately, combining Transformer-based acoustic analysis with these complementary approaches holds substantial potential to revolutionize precision poultry farming, enabling real-time monitoring systems capable of significantly enhancing welfare, operational efficiency, productivity, and overall sustainability.