Article

Do Stop Words Matter in Bug Report Analysis? Empirical Findings Using Deep Learning Models Across Duplicate, Severity, and Priority Classification

1 Department of Computer Applied Mathematics, Hankyong National University, Anseong-si 17579, Republic of Korea
2 Department of Computer Applied Mathematics, Computer System Institute, Hankyong National University, Anseong-si 17579, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(16), 9178; https://doi.org/10.3390/app15169178
Submission received: 29 July 2025 / Revised: 18 August 2025 / Accepted: 19 August 2025 / Published: 20 August 2025

Abstract

As software systems continue to increase in complexity and scale, the number of reported bugs also grows. Bug reports are essential artifacts in software maintenance, supporting critical tasks such as detecting duplicate reports, predicting bug severity, and assigning priority levels. Although stop word removal is a common text preprocessing step in natural language processing, its effectiveness in deep learning-based bug report analysis has not been thoroughly evaluated. This study investigates the impact of stop word removal on three core bug report classification tasks. The analysis uses a dataset containing over 1.9 million bug reports from eight large-scale open-source projects: Eclipse, FreeBSD, GCC, Gentoo, Kernel, RedHat, Sourceware, and WebKit. Five deep learning models are applied: convolutional neural networks, long short-term memory networks, gated recurrent units, Transformers, and BERT. Each model is evaluated with and without stop word removal during preprocessing. The results show that the F1-score difference was below 0.01 in over 85% of comparisons, indicating that stop word removal has little to no effect on predictive performance across the eight projects. Average F1-scores remain consistent across all tasks and models, at 0.36 for duplicate detection, 0.33 for severity prediction, and 0.33 for priority prediction. Statistical significance tests confirm that the observed differences are not meaningful across datasets or model types. The findings suggest that stop word removal is not necessary in deep learning-based bug report analysis. Omitting this step can simplify preprocessing pipelines without reducing accuracy, particularly in large-scale, real-world software engineering applications.

1. Introduction

Bug reports are essential elements in the software development and maintenance lifecycle. They document issues encountered during system operation and provide valuable information for identifying, prioritizing, and resolving defects. As software becomes more widely used and complex, the number of bug reports submitted to tracking systems increases accordingly. A significant portion of these reports are duplicates, which leads to inefficiencies in the triage process. Research by Banerjee et al. [1] showed that in projects such as Eclipse [2] and Mozilla [3], more than 80 and 350 reports were received per day, respectively. Duplicate bug reports account for 42% of all reports (Mukherjee et al. [4]) in bug tracking systems such as Bugzilla [5]. Developers need to evaluate each report to determine whether it is a new issue or a previously reported issue. This process results in substantial time and resource consumption, which highlights the need for more efficient duplicate detection systems.
In addition to identifying duplicate reports, predicting the severity and priority of reported bugs is critical for effective software maintenance. Severity refers to the technical impact of a bug on system functionality, while priority indicates how urgently the issue should be addressed, considering user experience and operational requirements. High-severity bugs often affect core functionalities and require immediate attention. High-priority bugs typically align with business or project goals and demand timely resolution to minimize disruption.
To reduce manual effort and improve classification consistency, numerous automated approaches have been proposed. These include the application of deep learning techniques such as convolutional neural networks (CNNs) [6], recurrent neural networks (RNNs) [7], and Transformer-based architectures. While these methods have achieved promising results in individual tasks like duplicate detection or severity prediction, most prior studies operate on limited datasets or narrow scopes, limiting their generalizability across diverse bug tracking environments.
One commonly used text preprocessing step in NLP pipelines is stop word removal, where frequent functional words (e.g., “the”, “is”, “of”) are discarded. This technique is often assumed to improve learning efficiency by reducing input noise and dimensionality. However, in the context of deep learning, where models can capture hierarchical and contextual semantics, its benefit remains unclear. This question is particularly relevant in software engineering, where bug reports include domain-specific language, informal descriptions, and high lexical variability.
This paper systematically evaluates whether stop word removal has a measurable impact on model performance in bug report classification tasks. We examine this across three core tasks: duplicate detection, severity prediction, and priority prediction, using a dataset of over 1.9 million bug reports from eight open-source software projects: Eclipse, FreeBSD [8], GCC [9], Gentoo [10], Kernel [11], RedHat [12], Sourceware [13], and WebKit [14].
To address this question, we apply five widely used deep learning architectures: CNN, LSTM [15], GRU [16], Transformer [17], and BERT [18]. Each model is trained and evaluated under two preprocessing settings—with and without stop word removal.
Our study reveals that stop word removal has minimal influence on classification performance across tasks and models. These insights challenge conventional assumptions about preprocessing and suggest that certain steps can be reconsidered or simplified in large-scale, deep learning-based bug report analysis.

2. Background Knowledge

2.1. Bug Report and Bug Life Cycle

Bug reports are essential artifacts in software engineering, as they document detailed information about defects encountered during system operation. A typical bug report includes a unique identifier (bug ID), a brief title or summary, severity and priority levels, the current status of the issue, and a more detailed description of the problem. Figure 1 provides an example of a bug report submitted to the GCC issue tracking system [19].
Once submitted, bug reports follow a structured process known as the bug life cycle, which is customizable to match the needs of the organization [20]. This life cycle defines the various states that a bug may transition through, from initial discovery to eventual resolution [21]. When a user detects an issue, they submit a bug report to the tracking system. The report is then assigned to a developer, who is responsible for verifying and addressing the defect.
As the bug progresses through the system, its status is updated according to predefined categories. Common statuses include the following:
  • Open: The bug has been reported and is pending investigation or assignment.
  • Closed: The issue has been resolved, verified, or determined to be invalid or unreproducible.
  • Reopened: A previously closed bug is reopened because the fix was ineffective or the problem has reoccurred.
  • Deferred: The bug is considered low priority or out of scope for the current release and is scheduled for future consideration.
A visual representation of this life cycle is shown in Figure 2. Understanding the stages and transitions of the bug life cycle is important for effective defect tracking, resolution planning, and communication among development team members.
This foundational understanding of bug reports and their life cycle provides essential context for tasks such as duplicate detection, severity classification, and priority prediction, which are addressed in subsequent sections of this study.

2.2. Bug Report Preprocessing

Preprocessing is a fundamental step in preparing bug report text for automated analysis. It involves transforming raw textual data into a more structured and machine-readable form that can be effectively processed by learning algorithms. Common preprocessing techniques include tokenization, lemmatization, punctuation removal, and stop word removal [22].
Stop words refer to frequently occurring words, such as articles, prepositions, and conjunctions, that are often considered semantically uninformative in natural language processing tasks. These words are typically removed to reduce noise and improve computational efficiency. Although stop word removal is widely adopted in traditional NLP workflows, its necessity and effectiveness in domain-specific tasks such as bug report analysis have not been extensively studied.
This study explores whether stop word removal is beneficial for downstream bug classification tasks, including duplicate detection, severity prediction, and priority prediction. The broader goal is to evaluate the practical value of this preprocessing step in the context of deep learning-based software engineering applications.

2.3. Feature Selection in Bug Reports

Feature selection plays a critical role in text classification by identifying the most informative terms or attributes from the input data [23]. In the context of bug report analysis, effective feature selection can enhance model performance and reduce computational cost by filtering out irrelevant or redundant information.
Techniques such as term frequency-inverse document frequency (TF-IDF) and statistical tests like chi-square are commonly used to rank features according to their relevance to the target classification task. These selected features help models focus on meaningful patterns in the data rather than being influenced by noise or high-dimensional sparsity.
While stop word removal is often assumed to improve the feature selection process by eliminating non-discriminative terms, its actual impact in software-specific contexts remains unclear. This study includes an investigation into whether retaining or removing stop words significantly influences feature selection and model performance across multiple bug classification tasks.

2.4. Bug Duplicate Detection

Bug duplicate detection is the process of identifying whether a newly reported issue has already been submitted to the bug tracking system. This task is critical in software maintenance, as duplicate reports can lead to redundant triage efforts, wasted resources, and delayed resolutions. An overview of the duplicate detection process is illustrated in Figure 3.
Traditionally, duplicate detection has been approached using textual similarity metrics, heuristic-based matching, or machine learning algorithms [24,25,26]. More recent studies [27,28] have adopted neural network models that learn semantic representations of bug reports to improve detection accuracy. These models typically rely on the textual content of the report, such as summaries and descriptions, to assess similarity between new and existing entries.
Figure 4 presents an illustrative example of a duplicate report pair from the Eclipse bug repository. Although the two reports use different terminology, they refer to the same underlying issue.
The summary of the original bug report (#30959) is “NullPointerException on rename field,” while the duplicate report (#31082) is titled “NPE when renaming field” [29,30]. After preprocessing, the first report may be reduced to tokens such as “nullpointerexception,” “rename,” and “field.” The second report yields similar terms like “npe,” “rename,” and “field.” Despite differences in wording, both reports describe the same defect, which shows the importance of semantic-aware models in duplicate detection.
While prior work has achieved notable success in improving duplicate detection performance, preprocessing techniques such as stop word removal are often applied without evaluating their actual necessity. In this study, we examine the role of stop word removal within duplicate detection as part of a broader investigation into its effect on bug report classification tasks.

2.5. Bug Severity Prediction

Bug severity prediction refers to the process of estimating how seriously a reported defect affects the functionality and stability of a software system. This task is important because it helps development teams allocate resources efficiently and resolve the most critical issues first. An overview of the bug severity prediction process is shown in Figure 5.
Severity levels are typically categorized into multiple classes, each reflecting the degree of functional disruption caused by a bug. These categories assist in establishing resolution priorities and ensuring that mission-critical defects are addressed promptly. Figure 6 provides representative examples of typical severity levels used in many software projects.
  • S1 (Very Severe): Issues that cause major failures, such as system crashes, memory leaks, or complete data corruption.
  • S2 (Severe): Defects that affect core functionality or essential features but do not lead to total system failure.
  • S3 (Moderate): Problems such as interface inconsistencies, logical errors, or boundary condition issues that affect user workflows without halting operation.
  • S4 (Minor): Cosmetic issues, including typographical errors or formatting inconsistencies, that do not impact software usability or correctness.
Prior research has applied various machine learning and deep learning techniques to automate severity classification. These methods aim to reduce manual workload, increase consistency, and enhance the speed of triaging processes. Typical models rely on textual information in bug reports, such as summaries and descriptions, to infer severity levels.
Although preprocessing steps like stop word removal are commonly used in text classification tasks [31,32], their effectiveness in software engineering domains is not well understood. This study investigates the influence of stop word removal on severity prediction models as part of a broader analysis of preprocessing choices in bug report classification.

2.6. Bug Priority Prediction

Bug priority prediction plays a vital role in managing software maintenance workflows efficiently. It involves assigning urgency levels to reported bugs based on their impact on system functionality, user experience, and operational continuity. As development teams often handle a high volume of bug reports, accurate prioritization helps ensure that critical issues are addressed promptly while less urgent problems are scheduled appropriately.
Figure 7 provides an overview of the bug priority prediction process, which is often supported by machine learning models. These models are designed to analyze the textual content of bug reports and estimate the appropriate priority level based on learned patterns from labeled data.
Bug reports are commonly classified into five priority levels, as shown in Figure 8. These levels help guide developers in allocating resources and scheduling bug fixes based on urgency and severity.
  • P1 (Highest Priority): Critical defects that cause major system failures or render essential functions inoperable. Immediate attention is required.
  • P2 (High Priority): Significant issues that affect key components but may not prevent system operation. These issues should be addressed promptly.
  • P3 (Medium Priority): Defects that affect non-core functionality or lead to unexpected results without causing major disruption.
  • P4 (Low Priority): Minor usability or cosmetic issues that do not interfere with normal operation and can be resolved at a later time.
  • P5 (Lowest Priority): Trivial or unclear problems that have minimal or no impact on system behavior or user experience.
In recent years, deep learning approaches have been applied to automate bug priority classification. While many models rely on common natural language preprocessing techniques such as stop word removal, there has been limited analysis of whether such steps are necessary or beneficial in this specific task. This study addresses that gap by examining the influence of stop word removal on priority prediction within a broader bug report analysis framework.

3. Our Approach

This study proposes a unified methodology for analyzing software bug reports with the goal of automating three core classification tasks: duplicate detection, severity prediction, and priority prediction. The primary focus of the methodology is to evaluate the influence of stop word removal on model performance across these tasks. The approach includes structured phases, from data acquisition and preprocessing to feature engineering, model training, and evaluation. A visual overview of the proposed workflow is provided in Figure 9.

3.1. Data Extraction and Preprocessing

To support a comprehensive and comparative analysis, we collected bug reports from eight open-source projects: Eclipse, FreeBSD, GCC, Gentoo, Kernel, RedHat, Sourceware, and WebKit. These projects were selected to ensure coverage of a wide range of software domains, including operating systems, compilers, web engines, and enterprise applications. Each report contains structured and unstructured fields relevant to the target tasks, such as summary, description, and associated labels indicating duplicate status, severity level, and priority level.
The preprocessing phase was designed to prepare the raw textual data for use with deep learning models, while enabling direct comparison between two distinct conditions: with stop word removal and without it. The following steps were applied to each report:
  • Tokenization: Text fields were segmented into individual word-level tokens.
  • Lemmatization: Each token was normalized to its base or root form to reduce morphological variance.
  • Stop Word Removal (optional): Common function words that are typically considered semantically uninformative, such as “the,” “is,” and “at,” were removed in one version of the dataset.
This preprocessing step resulted in two parallel datasets: one where stop words were retained and another where they were removed. These dual datasets allowed for controlled experimentation to isolate the effect of stop word removal on model performance across all three tasks. An example of the transformation process during preprocessing is illustrated in Figure 10.
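As a rough illustration, the sketch below (Python, using NLTK) shows how the two dataset variants can be produced from the same report text. The function and example names are ours, NLTK resources (punkt, wordnet, stopwords) are assumed to be available, and the actual stop word list used in the study is larger (see Section 4.2).

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Baseline English stop word list; the study extends this with domain terms.
stop_words = set(stopwords.words("english"))

def preprocess(text, remove_stop_words):
    """Tokenize, lemmatize, and optionally drop stop words."""
    tokens = nltk.word_tokenize(text.lower())
    lemmas = [lemmatizer.lemmatize(tok) for tok in tokens if tok.isalnum()]
    if remove_stop_words:
        lemmas = [tok for tok in lemmas if tok not in stop_words]
    return lemmas

report = "NullPointerException on rename field"
kept = preprocess(report, remove_stop_words=False)    # variant with stop words retained
removed = preprocess(report, remove_stop_words=True)  # variant with stop words removed
```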
This structured preprocessing framework enabled consistent input preparation across all models and tasks, forming the foundation for subsequent training, evaluation, and statistical comparison phases.

3.2. Feature Selection

Feature selection is a critical step in text classification tasks, as it reduces input dimensionality and computational complexity while improving model generalization. In the context of this study, feature selection was used to identify the most informative words in bug reports for the tasks of duplicate detection, severity prediction, and priority prediction.
To quantify the relevance of each term, we applied two widely used techniques: term frequency-inverse document frequency (TF-IDF) [33] and the chi-square statistical test [34]. TF-IDF measures the importance of a word within a corpus, while the chi-square test assesses the statistical association between a word and a category. Combining the two captures feature importance more comprehensively and makes the selection more principled. TF-IDF transforms the raw textual data into weighted numerical vectors that emphasize terms with high discriminatory power relative to their frequency in the corpus. The chi-square test evaluates the statistical dependence between each term and the target class label, providing a complementary measure of importance based on observed versus expected frequency distributions.
In our pipeline, both TF-IDF and Chi-square values were computed for each word in the corpus. The product of these two values was then used to rank features, and the top-ranked terms were retained for model input. This combined approach ensures that selected features are both statistically and semantically significant in relation to the classification tasks.
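A minimal sketch of this combined ranking with scikit-learn is shown below; the toy documents, labels, and the value of top_k are illustrative and not taken from the study's data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

docs = [
    "missing container constructors and adaptors",
    "npe when renaming field",
    "nullpointerexception on rename field",
    "cosmetic typo in dialog title",
]
labels = [0, 1, 1, 0]  # illustrative binary class labels

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Mean TF-IDF weight of each term across the corpus.
tfidf_score = np.asarray(X.mean(axis=0)).ravel()

# Chi-square statistic of each term with respect to the class label.
chi2_score, _ = chi2(X, labels)

# Rank terms by the product of the two scores and keep the top-k.
combined = tfidf_score * chi2_score
top_k = 3
top_terms = np.array(vectorizer.get_feature_names_out())[np.argsort(combined)[::-1][:top_k]]
print(top_terms)
```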
Figure 11 illustrates an example of the feature selection process applied to a bug report from the GCC project (ID: #65092). In this case, high-ranking terms such as “missing,” “container,” “constructors,” and “adaptors” were identified. These terms often indicate specific types of software issues, such as incomplete functionality, structural problems in data handling, or design pattern mismatches. Their inclusion in the final feature set enhances the model’s ability to capture task-relevant patterns in the input data.
This feature selection strategy was applied uniformly to both versions of the dataset, with and without stop word removal. This consistency enabled a fair comparison of model performance under each preprocessing condition in the subsequent evaluation phase.

3.3. Model Training and Evaluation

Five widely adopted deep learning architectures were used in this study: convolutional neural networks (CNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), Transformer models, and bidirectional encoder representations from Transformers (BERT). Each model was applied to all three bug report classification tasks: duplicate detection, severity prediction, and priority prediction.
To evaluate the effect of stop word removal, every model was trained on two versions of the dataset. One version was preprocessed with stop words removed, while the other retained all tokens, including stop words. This experimental design allowed for a controlled assessment of how stop word removal influences model performance across different architectures and tasks.
The following subsections describe the architectural characteristics of each model and their role in the classification framework. An overview of the CNN model is presented first.

3.3.1. Convolutional Neural Networks (CNNs)

CNNs are commonly used in natural language processing tasks to identify local patterns in text sequences. In the context of bug report classification, CNNs transform tokenized input into an embedding matrix and apply convolutional operations to capture n-gram features and spatial dependencies within the text.
The architecture consists of the following components:
  • Input Layer: Accepts preprocessed text in the form of token indices or embedding vectors.
  • Convolutional Layer: Applies filters to extract local feature patterns such as key phrases or term co-occurrences.
  • Pooling Layer: Reduces the dimensionality of feature maps and preserves the most salient features.
  • Fully Connected Layer: Aggregates extracted features to form a dense representation suitable for classification.
  • Output Layer: Produces the final prediction for each task using a softmax or sigmoid function, depending on the classification type.
Figure 12 provides a schematic representation of the CNN architecture used in this study.
This architecture is well-suited for identifying contextually important term combinations in relatively short texts such as bug report summaries and descriptions.

3.3.2. Long Short-Term Memory (LSTM)

Long short-term memory networks (LSTM) are a variant of recurrent neural networks (RNNs) specifically designed to capture long-range dependencies in sequential data. This capability makes LSTMs particularly suitable for text classification tasks, where the relationship between tokens across positions in a sequence can significantly influence prediction accuracy.
The LSTM architecture incorporates memory cells and gating mechanisms to regulate the flow of information. It includes a forget gate that determines whether previous information should be discarded, an input gate that assesses which new information should be added, and an output gate that decides which part of the memory should be propagated to the next time step. These components enable the model to maintain and update contextual information across long sequences without suffering from vanishing or exploding gradients.
Figure 13 illustrates the core components of the LSTM architecture used in this study.
$X_t$ is the input at time $t$, $H_{t-1}$ represents the hidden state at the previous time step, $H_t$ is the hidden state, $h_t$ represents the output hidden state, $C_{t-1}$ represents the memory cell at the previous time step, and $C_t$ is the current cell state. The forget gate $F_t$, input gate $I_t$, and output gate $O_t$ control the flow of information using sigmoid activations $\sigma$. $\tilde{C}_t$ denotes the candidate cell state generated using tanh. Multiplications and additions are element-wise operations across vectors.
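For reference, the standard LSTM gate equations in this notation are given below; the weight matrices $W$ and biases $b$ are learned parameters not listed in the text, and $\odot$ denotes element-wise multiplication:

$$
\begin{aligned}
F_t &= \sigma(W_F [H_{t-1}, X_t] + b_F), \qquad I_t = \sigma(W_I [H_{t-1}, X_t] + b_I) \\
\tilde{C}_t &= \tanh(W_C [H_{t-1}, X_t] + b_C), \qquad C_t = F_t \odot C_{t-1} + I_t \odot \tilde{C}_t \\
O_t &= \sigma(W_O [H_{t-1}, X_t] + b_O), \qquad h_t = O_t \odot \tanh(C_t)
\end{aligned}
$$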
LSTM models were trained on both the stop word removal and non-removal versions of the dataset, allowing for a direct comparison of their performance under each condition.

3.3.3. Gated Recurrent Unit (GRU)

Gated recurrent units (GRUs) are a streamlined variant of LSTM networks, designed to achieve similar sequential modeling performance with fewer parameters. GRUs simplify the architecture by combining the forget and input gates into a single update gate, and by eliminating the separate memory cell used in LSTM models.
The GRU architecture consists of an update gate that controls how much of the previous hidden state should be retained, and a reset gate that determines how much of the past information should be considered when computing the new candidate state. These mechanisms allow GRUs to capture relevant temporal dependencies while offering faster training and reduced computational complexity.
Figure 14 shows the architecture of the GRU model applied in our experiments.
$x_t$ is the input vector, and $h_t$ is the current hidden state. $r_t$ is the reset gate and $z_t$ is the update gate, both using the sigmoid activation $\sigma$. $\tilde{h}_t$ is the candidate hidden state calculated with tanh. The final hidden state $h_t$ is a gated combination of $h_{t-1}$ and $\tilde{h}_t$, controlled by $z_t$.
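The corresponding standard GRU update in this notation is as follows, where the weight matrices $W$ and biases $b$ are learned parameters and $\odot$ denotes element-wise multiplication:

$$
\begin{aligned}
z_t &= \sigma(W_z [h_{t-1}, x_t] + b_z), \qquad r_t = \sigma(W_r [h_{t-1}, x_t] + b_r) \\
\tilde{h}_t &= \tanh(W_h [r_t \odot h_{t-1}, x_t] + b_h), \qquad h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
$$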
Like the other models in this study, GRUs were evaluated under both preprocessing settings. This ensured consistent analysis of how stop word removal affects model performance in sequence-based architectures.

3.3.4. Transformer

Transformer-based models have become a foundational architecture in natural language processing due to their ability to capture long-range dependencies through attention mechanisms. Unlike recurrent or convolutional models, the Transformer architecture is entirely built on multi-head self-attention and positional encoding, which allows for highly parallelizable computations and improved learning of contextual relationships in sequences.
In this study, we used a standard Transformer encoder-decoder configuration for bug report classification. The encoder processes input sequences by embedding tokens and their positional information, and applies stacked self-attention and feedforward layers. The decoder uses attention mechanisms to generate predictions based on the encoded representations. Although originally designed for sequence-to-sequence tasks, the Transformer was adapted here for classification purposes.
Figure 15 illustrates the core structure of the Transformer model used in our experiments.
The Transformer model was evaluated on both preprocessed versions of the dataset, with and without stop word removal, to measure its sensitivity to preprocessing decisions across the three classification tasks.

3.3.5. BERT (Bidirectional Encoder Representations from Transformers)

BERT is a pre-trained language model based on a bidirectional Transformer encoder that has shown strong performance across a wide range of NLP tasks. Unlike unidirectional models, BERT captures context from both the left and right of each token, which enables deeper semantic understanding of text sequences.
BERT’s architecture consists of multiple stacked Transformer encoder layers, each employing self-attention and feedforward components. The input representation combines token embeddings, positional embeddings, and segment embeddings. The model is fine-tuned on downstream tasks by modifying the output layer according to the classification objective.
In this study, BERT was fine-tuned separately for duplicate detection, severity prediction, and priority prediction tasks. As with the other models, training was conducted under both preprocessing conditions to assess the effect of stop word removal.
Figure 16 shows a high-level overview of the BERT architecture used in our experiments.
The use of BERT enables a comparison between traditional models trained from scratch and a powerful pre-trained language model within the same experimental framework.

3.3.6. Training Configuration

All models in this study were trained using consistent training settings to ensure comparability across architectures and preprocessing conditions. The training process was designed to support both binary and multi-class classification tasks relevant to bug report analysis.
The loss functions were selected based on the nature of each task. Binary cross-entropy was used for duplicate detection, which is a binary classification problem. For severity and priority prediction, which involve multiple categorical labels, categorical cross-entropy was applied. These loss functions quantify the difference between predicted and true labels and are standard in classification tasks.
The Adam optimizer was employed to update model weights during training. Adam is widely used for its adaptive learning rate capabilities and combines the strengths of AdaGrad and RMSProp to provide efficient and stable convergence.
The input features consisted of tokenized and lemmatized text extracted from bug report summaries and descriptions. Two versions of the input data were prepared for each experiment. One version included stop words, and the other excluded them. Tokenization was used to segment the text into individual word units, while lemmatization mapped each word to its base or canonical form. These preprocessing steps ensured that the models received normalized and structured inputs for learning.
This training configuration was applied uniformly across all models and tasks to enable reliable evaluation of the impact of stop word removal, which was conducted during the experimental analysis in Section 4.

3.4. Implementation Details

All models were implemented using Python 3.8 and TensorFlow 2.13 with the Keras 2.13 backend. The training and evaluation were executed on a machine equipped with an NVIDIA RTX 4090 GPU. Mixed-precision training was enabled to optimize GPU memory utilization and speed up computation.

3.4.1. Model Configuration

The following deep learning architectures were implemented and trained from scratch:
CNN: Consists of a 1D convolutional layer (64 filters, kernel size = 3, ReLU activation, padding = ‘same’), followed by batch normalization, max pooling, a dense layer (64 units), and dropout (rate = 0.1).
LSTM: Two stacked LSTM layers (each with 64 units, return_sequences = True), followed by batch normalization, a dense layer (64 units), and dropout.
GRU: Same structure as LSTM, replacing LSTM units with GRU units.
Transformer: Four Transformer encoder blocks with multi-head attention (4 heads, key dimension 16), feed-forward layers (128 → 16), layer normalization, and dropout. Positional encoding was applied using sinusoidal patterns.
BERT: Similar to the Transformer configuration, but uses positional embeddings implemented as an embedding layer (kept non-trainable) initialized with a truncated normal distribution (stddev = 0.02).
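The following Keras sketch shows how the CNN configuration above could be assembled. The output layer and activation are chosen per task (sigmoid for duplicate detection, softmax for severity or priority), the embedding dimension, vocabulary size, and sequence length follow Section 4.3, and any detail not stated in the text (e.g., the pooling size) is an assumption rather than the study's exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(vocab_size=1000, seq_len=120, num_classes=2):
    """CNN text classifier loosely following the configuration listed above."""
    last_units = 1 if num_classes == 2 else num_classes
    last_act = "sigmoid" if num_classes == 2 else "softmax"
    return models.Sequential([
        layers.Input(shape=(seq_len,)),
        layers.Embedding(input_dim=vocab_size, output_dim=16),  # embedding dimension 16
        layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=2),   # pooling size is an assumption
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(last_units, activation=last_act),
    ])

model = build_cnn()  # binary variant, e.g., duplicate detection
```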

3.4.2. Text Processing and Vectorization

Tokenization: nltk.word_tokenize was used.
Stopword Removal: Custom list of stopwords (over 1000 terms) optionally applied based on experimental settings.
TF-IDF and Feature Selection: After tokenization, TF-IDF vectors were computed and the top 100 features were selected using chi-squared feature selection (SelectKBest(chi2)).
Padding: Token sequences were padded to the 95th percentile length across the dataset using pad_sequences.
Vocabulary Size: Dynamically determined from token frequency.
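A short sketch of the padding step, with illustrative token-index sequences in place of the real data:

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Token-index sequences produced by the tokenizer (illustrative values).
sequences = [[4, 87, 12], [9, 3], [15, 2, 44, 8, 91, 7]]

# Pad or truncate every sequence to the 95th percentile of observed lengths.
max_len = int(np.percentile([len(s) for s in sequences], 95))
padded = pad_sequences(sequences, maxlen=max_len, padding="post", truncating="post")
```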

3.4.3. Training Settings

Cross-Validation: 10-fold stratified cross-validation using KFold (random_state = 42).
Learning Rate Scheduling: ReduceLROnPlateau with factor = 0.2 and min_lr = 0.0001.
Loss Function: Binary cross-entropy.
Optimizer: Adam.
Batch Size: 256.
Epochs: 1.
Dropout Rate: 0.1 (applied after dense or attention layers).
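Putting these settings together, a simplified version of the training loop might look like the sketch below. The dummy data, the use of StratifiedKFold for the stratified folds, and the reuse of the build_cnn sketch from Section 3.4.1 are our assumptions for illustration, not the study's exact code.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Dummy token-index matrix and binary labels standing in for the real dataset.
rng = np.random.default_rng(42)
X = rng.integers(0, 1000, size=(2000, 120))
y = rng.integers(0, 2, size=2000)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scheduler = ReduceLROnPlateau(monitor="val_loss", factor=0.2, min_lr=0.0001)

fold_scores = []
for train_idx, val_idx in kfold.split(X, y):
    model = build_cnn()  # sketch from Section 3.4.1
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X[train_idx], y[train_idx],
              validation_data=(X[val_idx], y[val_idx]),
              batch_size=256, epochs=1, callbacks=[scheduler])
    fold_scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0))
```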

3.4.4. Imbalanced Data Handling

Oversampling: SMOTE (version 0.10.1) from imblearn was applied to balance classes.
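A minimal sketch of this resampling step with imbalanced-learn, using synthetic data in place of the real feature matrix:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Illustrative imbalanced data standing in for the vectorized bug reports.
X_train, y_train = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
print(Counter(y_train), Counter(y_res))  # classes are balanced after resampling
```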

3.4.5. Evaluation Metrics

Macro-averaged precision, recall, F1-score, ROC AUC, and precision–recall AUC were computed using scikit-learn. ROC curves and PR curves were averaged over all folds.
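A sketch of how these metrics can be computed with scikit-learn for one fold of a binary task; the labels and scores shown are illustrative:

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

# Illustrative gold labels, hard predictions, and positive-class probabilities.
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])
y_score = np.array([0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.3, 0.7])

precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
roc_auc = roc_auc_score(y_true, y_score)
pr_auc = average_precision_score(y_true, y_score)  # precision-recall AUC
```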

4. Experimental Results

4.1. Overview

This study investigates the effectiveness of deep learning models in three essential bug report classification tasks: duplicate detection, severity prediction, and priority prediction. The experimental evaluation is conducted using bug report data collected from eight open-source software projects, namely Eclipse, FreeBSD, GCC, Gentoo, Kernel, RedHat, Sourceware, and WebKit. These projects represent a broad range of software domains, including operating systems, compilers, development tools, server platforms, and web rendering engines.
A primary objective of the study is to assess whether the removal of stop words during preprocessing has a meaningful impact on model performance. To explore this, we trained and evaluated each model under two preprocessing configurations: one with stop word removal and one with stop words retained. This experimental setup enables a direct comparison of the models’ effectiveness under each condition across all three classification tasks.

4.2. Data Collection and Preprocessing

A total of 1,901,084 bug reports were collected from the eight open-source projects. The dataset spans reports submitted between 1995 and 2023, covering a diverse set of software systems. Each bug report includes textual fields such as titles and descriptions, along with categorical labels indicating whether the report is a duplicate, its severity level, and its assigned priority. A single report may be associated with more than one of these categories, making it suitable for use in multiple classification tasks.
Table 1 provides a summary of the number of collected reports for each project. The diversity in project types and time spans offers a comprehensive foundation for evaluating the generalizability of the models across various software environments.
The preprocessing pipeline involved multiple steps to prepare the textual data for model training. These steps included tokenization, lemmatization, and conditional stop word removal. The stop word list used in this study consisted of 1158 English stop words; it is based on the standard NLTK list, extended with 960 frequently occurring but uninformative domain terms such as “fix”, “system”, and “http”, compiled from standard NLP resources [35]. Following preprocessing, feature selection was conducted to extract the most relevant terms for each task, using methods such as TF-IDF and chi-square scoring.
This structured preprocessing and feature engineering process ensured that all models were trained on high-quality, task-relevant input features under consistent experimental conditions.

4.3. Model Training and Evaluation

All models in this study were trained using a consistent set of hyperparameters to ensure comparability across architectures. The maximum sequence length was set to 120 tokens, the embedding dimension was fixed at 16, and the vocabulary size was limited to the top 1000 most frequent tokens in the corpus. The models evaluated included convolutional neural networks (CNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), Transformer models, and BERT.
To evaluate generalization performance, we used 10-fold cross-validation [36]. This approach divides the dataset into 10 equally sized subsets. For each iteration, one subset is used for validation while the remaining nine are used for training. This process is repeated 10 times, and the average performance across all folds is computed. This method provides a robust estimate of the models’ effectiveness on unseen data and reduces the risk of overfitting to a particular data split.
Model performance was assessed using several well-established metrics: precision, recall, F1-score [37], the area under the receiver operating characteristic curve (ROC-AUC), and the area under the precision–recall curve (PR-AUC) [38]. Each of these metrics captures a different aspect of classification performance, particularly in the context of imbalanced data.
$$\text{Precision} = \frac{TP}{TP + FP}$$
$$\text{Recall} = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
$$\text{ROC-AUC} = \int_0^1 TPR(FPR)\, d(FPR)$$
$$TPR = \frac{TP}{TP + FN}$$
$$FPR = \frac{FP}{FP + TN}$$
$$\text{PR-AUC} = \int_0^1 \text{Precision}(\text{Recall})\, d(\text{Recall})$$
  • Precision: the proportion of correctly predicted positive cases among all predicted positive cases. It reflects how many of the predicted duplicates, severe bugs, or high-priority reports are actually correct.
  • Recall: the proportion of correctly predicted positive cases among all actual positive cases. It indicates how many of the true duplicates, severe bugs, or high-priority reports were successfully identified.
  • F1-Score: the harmonic mean of precision and recall, balancing the two in a single measure.
  • ROC-AUC: quantifies the model’s ability to distinguish between classes across different threshold settings. It is especially useful for binary classification tasks like duplicate detection.
  • PR-AUC: focuses on performance in detecting rare or minority classes, such as high-severity or high-priority bugs, which are often underrepresented in real-world datasets.
In addition to these metrics, we also tracked true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs) for each task. From these, we computed the true positive rate (TPR) and false positive rate (FPR), which further characterize model behavior in terms of sensitivity and specificity.
This comprehensive set of evaluation criteria ensures that our assessment accounts for both the accuracy and the reliability of each model under different preprocessing conditions and task settings.

4.4. Research Questions

To systematically evaluate the model performance and the impact of stop word removal, we formulated the following three research questions (RQ):
  • RQ1. How effectively do deep learning models perform in detecting duplicate bug reports and predicting severity and priority levels across diverse open-source projects?
Traditional NLP-based techniques for bug report analysis often rely on handcrafted features and bag-of-words representations, which may fail to capture the contextual and semantic richness of the reports. Deep learning models offer the potential to model semantic similarity and linguistic nuances more effectively. This question assesses the overall predictive performance of deep learning models on three key bug report analysis tasks: duplicate detection, severity classification, and priority assignment. The goal is to evaluate the models’ generalizability and reliability across multiple datasets and project domains.
  • RQ2. To what extent does stop word removal affect the performance of deep learning models on bug report classification tasks?
This question examines whether the inclusion or exclusion of stop words during preprocessing significantly impacts model performance in duplicate detection, severity prediction, and priority prediction. The analysis compares results obtained with and without stop word removal across multiple models and datasets.
  • RQ3. How does the performance of the models with stop word removal compare to those without, in relation to standard baseline architectures such as CNN, LSTM, GRU, Transformer, and BERT?
This question benchmarks the relative effectiveness of stop word removal by comparing model outputs under both preprocessing conditions. The comparison is conducted across five widely used deep learning architectures to determine whether stop word removal offers any measurable performance advantage.
These research questions will guide the study in evaluating the model’s capabilities and understanding the effects of preprocessing choices on its performance.

4.5. Results

This section presents and analyzes the experimental results in relation to the three research questions formulated in Section 4.4. Detailed numerical results can be found in Appendix A, Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9, Table A10, Table A11, Table A12, Table A13, Table A14, Table A15, Table A16, Table A17, Table A18, Table A19 and Table A20. The corresponding statistical significance tests are summarized in Table A21, Table A22 and Table A23.
  • Answer for RQ1.
To assess the overall effectiveness of the proposed approach, we measured its performance on three key tasks using bug reports from eight open-source projects: Eclipse, FreeBSD, GCC, Gentoo, Kernel, RedHat, Sourceware, and WebKit. The evaluation was based on standard classification metrics, including precision, recall, and F1-score.
The results showed that the proposed model performed consistently across all tasks and datasets:
  • Duplicate Detection: All five deep learning models (CNN, LSTM, GRU, Transformer, and BERT) achieved an average F1-score of 0.36 across the eight projects.
  • Severity Prediction: The average F1-score across all models was 0.33, reflecting stable performance in classifying the impact level of reported bugs.
  • Priority Prediction: The models also achieved an average F1-score of 0.33 in predicting the urgency of resolving each bug report.
These outcomes show that the proposed model is capable of handling all three tasks reliably, and its performance remains stable across different open-source projects and model architectures.
  • Answer for RQ2.
To examine the impact of stop word removal during preprocessing, we conducted a comparative experiment using two versions of each dataset. One version included stop words, and the other excluded them. The performance of all five deep learning models was evaluated under both conditions.
The findings indicate that stop word removal had minimal influence on model performance. For all tasks and models, the differences in F1-scores were negligible:
  • For duplicate detection, the F1-score was 0.36 both with and without stop word removal.
  • For severity prediction, the F1-score remained at 0.33 in both cases.
  • For priority prediction, the F1-score also remained stable at 0.33 regardless of whether stop words were removed.
These results were consistent across all model types and project datasets. The minor fluctuations observed, typically within a range of 0.01 to 0.03, are not practically significant. Therefore, the decision to include or exclude stop word removal may be based on implementation preference rather than expected performance gains.
  • Answer for RQ3.
To benchmark the proposed model, we compared its results with those of widely used deep learning baselines: CNN, LSTM, GRU, Transformer, and BERT. The comparison was based on the same performance metrics used in RQ1 and RQ2.
The evaluation showed that the proposed approach achieved results that are equivalent to, or in some cases slightly better than, the baseline models:
  • For duplicate detection, both the proposed model and all baseline models achieved an average F1-score of 0.36.
  • For severity prediction, the F1-score remained stable at 0.33 across all models.
  • For priority prediction, the proposed model also matched the baseline performance with an average F1-score of 0.33.
These results indicate that the proposed model is at least as effective as the individual deep learning architectures when applied independently. The unified approach provides consistent performance across tasks and datasets, demonstrating its robustness and generalizability in bug report classification.
  • Statistical Hypothesis Testing
To determine whether the differences in model performance due to stop word removal were statistically significant, we formulated a series of hypotheses and conducted formal statistical tests.
The null hypotheses were established as follows.
H1₀~H15₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Eclipse.
H16₀~H30₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in FreeBSD.
H31₀~H45₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Gentoo.
H46₀~H60₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in GCC.
H61₀~H75₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Kernel.
H76₀~H90₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in RedHat.
H91₀~H105₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Sourceware.
H106₀~H120₀: There is no significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in WebKit.
The alternative hypotheses to the null hypotheses are as follows.
H1a~H15a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Eclipse.
H16a~H30a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in FreeBSD.
H31a~H45a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Gentoo.
H46a~H60a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in GCC.
H61a~H75a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Kernel.
H76a~H90a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in RedHat.
H91a~H105a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in Sourceware.
H106a~H120a: There is a significant difference in duplicate bug report detection, severity bug report prediction, and priority bug report prediction between the proposed models of CNN, LSTM, GRU, Transformer, and BERT with and without stopword removal in WebKit.
Hypotheses were grouped by task (duplicate detection, severity prediction, and priority prediction) and project (Eclipse, FreeBSD, Gentoo, GCC, Kernel, RedHat, Sourceware, and WebKit). A total of 240 null hypotheses were formulated (120 per preprocessing condition), covering all combinations of task, dataset, and model.
To test these hypotheses, we applied two types of statistical tests depending on the distribution of the performance metrics:
  • The t-test [39] was used when the performance data followed a normal distribution. This test compares the mean values of two groups to determine if the difference is statistically significant.
  • The Wilcoxon signed-rank test [40], a non-parametric alternative, was applied when normality could not be assumed. This test evaluates whether the median difference between paired samples is significantly different from zero.
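The sketch below illustrates this test-selection logic with SciPy. The per-fold F1-scores shown are illustrative, and the use of the Shapiro-Wilk test is our assumption about how normality was assessed.

```python
from scipy import stats

# Illustrative per-fold F1-scores for one model/task/project pair.
f1_with_removal = [0.36, 0.35, 0.37, 0.36, 0.35, 0.36, 0.37, 0.36, 0.35, 0.36]
f1_without_removal = [0.36, 0.36, 0.36, 0.35, 0.36, 0.36, 0.36, 0.37, 0.35, 0.36]

# Paired t-test when both samples look normal, Wilcoxon signed-rank test otherwise.
normal = (stats.shapiro(f1_with_removal).pvalue > 0.05 and
          stats.shapiro(f1_without_removal).pvalue > 0.05)

if normal:
    result = stats.ttest_rel(f1_with_removal, f1_without_removal)
else:
    result = stats.wilcoxon(f1_with_removal, f1_without_removal)

reject_null = result.pvalue < 0.05  # 0.05 significance threshold used in the study
```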
For example, in Table A21, hypothesis H5₀ corresponds to the performance of the BERT model on the Eclipse dataset for duplicate detection. The p-value obtained was 0.9549, which exceeds the 0.05 threshold. This result supports the null hypothesis, indicating that stop word removal does not have a statistically significant effect in this case.
Across all tested hypotheses, the vast majority produced p-values greater than 0.05. This confirms that the differences observed in performance between models trained with and without stop word removal were not statistically significant. Consequently, the null hypotheses were retained and the alternative hypotheses were rejected.
These statistical findings reinforce the earlier empirical results. They provide strong evidence that stop word removal does not yield meaningful performance improvements in deep learning-based bug report analysis. Accordingly, the preprocessing step of removing stop words can be omitted without compromising the effectiveness of the models.
Across all tests, p-values consistently exceeded 0.63, with the majority above 0.90, as detailed in Table A21, Table A22 and Table A23. These results confirm that stop word removal did not lead to any statistically significant performance changes.
Given the large number of comparisons, there is a potential risk of inflated Type I error rates due to multiple testing. Applying a Bonferroni [41] correction for 240 tests yields a highly conservative adjusted significance threshold of α = 0.05/240 ≈ 0.000208. All p-values remain well above the threshold, supporting the finding that removing stop words does not significantly affect the performance of deep learning models in bug report classification.

5. Discussion

5.1. Experiment Results

This study examined whether stop word removal significantly influences the performance of deep learning models in bug report analysis. Experiments were conducted using a large-scale dataset compiled from eight open-source projects: Eclipse, FreeBSD, GCC, Gentoo, Kernel, RedHat, Sourceware, and WebKit. Five widely used deep learning models were evaluated: convolutional neural networks (CNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), Transformer models, and BERT. Each model was trained and tested on two versions of the dataset, one with stop word removal applied during preprocessing and one without.
The models were evaluated on three core bug report analysis tasks: duplicate detection, severity prediction, and priority prediction. The results showed that the inclusion or exclusion of stop words had minimal effect on model performance across all tasks. Specifically, the average F1-score for duplicate detection was 0.36. For both severity prediction and priority prediction, the average F1-score was 0.33. These values remained consistent regardless of whether stop words were removed during preprocessing.
To further examine the robustness of these observations, statistical significance testing was conducted using p-values. All p-values obtained were greater than 0.05, indicating that the differences in performance between models trained with and without stop word removal were not statistically significant. This confirms that the presence or absence of stop words does not meaningfully affect model accuracy in this context.
A closer inspection of the experimental results revealed only minor fluctuations in F1-scores between the two preprocessing settings:
  • For duplicate detection, the difference in F1-scores typically ranged between 0.01 and 0.05. For instance, in the Eclipse dataset using the CNN model (Table A1), the F1-score was 0.34 with stop word removal and 0.35 without.
  • For severity prediction, the difference was generally around 0.01 across models and datasets. In Table A2, the RedHat dataset with the LSTM model showed an F1-score of 0.34 when stop words were removed and 0.33 when they were retained.
  • For priority prediction, differences in F1-scores were also limited, with most variations falling between 0.01 and 0.03. In Table A11, for example, the Kernel dataset using CNN produced an F1-score of 0.38 with stop word removal and 0.35 without.
These findings suggest that removing stop words has little to no effect on the models’ ability to learn meaningful representations from bug report text. As a result, stop word removal can be treated as an optional preprocessing step in deep learning-based bug report analysis. The decision to remove or retain stop words may be guided more by practical constraints, such as preprocessing time or system complexity, rather than expected gains in performance.
Three main insights can be drawn from the results.
  • The limited impact of stop word removal indicates that the models are already effective at identifying relevant semantic features without needing to discard common words.
  • The consistency in performance suggests that duplicate and non-duplicate bug reports may share a common core vocabulary, which reduces the influence of stop words in distinguishing between them.
  • The deep learning models used in this study, particularly BERT, Transformer, and LSTM, are designed to capture contextual information and exhibit robustness to uninformative tokens. This likely diminishes the importance of removing stop words during preprocessing.
The overall experimental evidence supports the conclusion that stop word removal does not enhance the performance of deep learning models in the classification of bug reports. Therefore, it is reasonable to simplify preprocessing workflows by retaining stop words, especially when working with large datasets in practical software engineering environments.

5.2. Detailed Analysis of Experimental Results

To understand the performance variation patterns in more detail, we conducted a deeper analysis of the results for different projects and input types. Table A1, Table A2, Table A3, Table A4, Table A5, Table A6, Table A7, Table A8, Table A9, Table A10, Table A11, Table A12, Table A13, Table A14, Table A15, Table A16, Table A17, Table A18, Table A19 and Table A20 and Figure A1, Figure A2, Figure A3, Figure A4, Figure A5, Figure A6, Figure A7, Figure A8, Figure A9, Figure A10, Figure A11, Figure A12, Figure A13, Figure A14, Figure A15, Figure A16, Figure A17, Figure A18, Figure A19 and Figure A20 show the experimental results of five deep learning models on eight open-source projects, covering four types of data input: summary and description, summary only, description only, and comment only, and the F1 score performance on three tasks: duplicate report detection, severity prediction, and priority prediction.
Overall, in most projects, removing or retaining stop words has a very limited impact on model performance. Taking Table A1 and Figure A1 as an example, in the duplicate report detection task on the FreeBSD dataset, the F1-score dropped slightly from 0.51 (stop words removed) to 0.50 (stop words retained), while on the Eclipse and Gentoo datasets the score increased slightly by 0.01.
Across individual models, the CNN showed the largest single difference, observed in duplicate report detection on the comment-only data of the WebKit project (Table A16): the F1-score dropped from 0.38 (stop words removed) to 0.33 (stop words retained), a difference of 0.05. In contrast, about 89% of the F1-score differences were zero or below 0.01, further indicating that stop word handling has no statistically significant effect on model performance and no practically observable trend of improvement or decline.
Models like Transformer and BERT show the least fluctuation, suggesting robustness to noisy tokens. In contrast, CNN and LSTM are more sensitive in certain tasks, especially with comment-only inputs.
Across all eight projects and three tasks, the F1 score differences rarely exceed 0.03, and in more than 85% of comparisons, the differences are below 0.01. This trend suggests that removing stop words introduces only minor performance changes that are statistically insignificant and practically negligible.
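The aggregate figures above (the largest observed difference and the share of comparisons below 0.01) can be computed from the paired F1-scores with a few lines of NumPy; the arrays in the sketch below are illustrative placeholders rather than the full result set.

```python
import numpy as np

# Paired F1-scores for the same project/task/model/input configuration,
# with and without stop word removal (illustrative values only).
f1_removed  = np.array([0.34, 0.51, 0.33, 0.35, 0.38, 0.33])
f1_retained = np.array([0.35, 0.50, 0.33, 0.36, 0.33, 0.33])

diff = np.abs(f1_removed - f1_retained)
print("largest |dF1|:", diff.max())                      # e.g., 0.05
print("share with |dF1| < 0.01:", np.mean(diff < 0.01))  # fraction of negligible changes
```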

5.3. Threats to Validity

Internal Validity: Internal validity refers to the degree to which the observed outcomes can be attributed to the experimental design rather than to uncontrolled variables or systematic errors. In this study, potential threats to internal validity may arise from inaccuracies in data collection, preprocessing, or feature selection. For example, if data were incorrectly labeled or key textual features were improperly extracted, the results could be biased or inconsistent. To reduce these risks, we employed well-established tools and techniques. Specifically, we used NumPy [42] for efficient and consistent data preprocessing, which helped to identify and remove irrelevant entries and mitigate the effect of missing values. In addition, we applied 10-fold cross-validation to ensure that the models were evaluated on a diverse and representative distribution of the data. This technique reduces the likelihood of over-fitting to a particular subset and improves the robustness of the evaluation results.
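For illustration, the sketch below outlines a stratified 10-fold cross-validation loop using scikit-learn; `build_model`, `X`, and `y` are placeholders for the actual classifiers and feature matrices used in the experiments, so this is a minimal sketch of the protocol rather than the exact implementation.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def cross_validate(build_model, X, y, n_splits=10, seed=42):
    """Stratified 10-fold cross-validation reporting mean and std of macro F1."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()                 # fresh, untrained model for each fold
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[test_idx])
        scores.append(f1_score(y[test_idx], preds, average="macro"))
    return float(np.mean(scores)), float(np.std(scores))
```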
However, despite these measures, the reported F1-scores remain modest (0.33–0.36) across tasks. This could be attributed to several internal factors:
(1) severe class imbalance, particularly in the duplicate detection task (e.g., 11,539 duplicates vs. 231,068 non-duplicates), which may not be fully addressed by SMOTE due to its synthetic nature (a minimal oversampling sketch follows this list);
(2) linguistic ambiguity and noise in bug reports, including jargon and inconsistent phrasing;
(3) simplified model architectures without advanced tuning, domain-specific embeddings, or attention mechanisms.
These factors collectively constrain the internal validity of conclusions drawn about model performance improvements from preprocessing steps such as stop word removal.
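As a minimal sketch of the oversampling step referenced in point (1), the example below applies SMOTE from the imbalanced-learn package to a small synthetic training set; the data and class counts are illustrative stand-ins for the much larger duplicate-detection data, and the oversampling is applied to the training split only.

```python
import numpy as np
from collections import Counter
from imblearn.over_sampling import SMOTE

# Tiny synthetic stand-in for the (much larger) duplicate-detection training split:
# 200 non-duplicates (label 0) vs. 20 duplicates (label 1).
rng = np.random.default_rng(42)
X_train = np.vstack([rng.normal(0.0, 1.0, (200, 5)), rng.normal(1.0, 1.0, (20, 5))])
y_train = np.array([0] * 200 + [1] * 20)

print("before:", Counter(y_train))
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print("after: ", Counter(y_res))  # minority class oversampled with synthetic samples
```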
External Validity: External validity concerns the extent to which the study’s findings can be generalized beyond the specific datasets and contexts used. Our experiments were conducted using bug reports from eight open-source software projects, which span various domains such as operating systems, compilers, and web browsers. While this diversity provides a reasonably broad testing ground, the models may not generalize well to proprietary, small-scale, or domain-specific projects that differ substantially in report structure or language usage. Although the use of cross-validation improves the reliability of within-sample evaluations, it does not address variation across different organizational settings. Future research should aim to replicate these findings on industrial datasets and across additional project types to confirm their generalizability.
Construct Validity: Construct validity relates to how well the chosen evaluation metrics reflect the true effectiveness of the models in performing the intended tasks. In this study, we used commonly accepted metrics in the field, including precision, recall, and F1-score, to evaluate model performance. While these metrics offer valuable insights, they may not capture all dimensions of performance, such as robustness to class imbalance or interpretability in practical applications. In particular, the F1-score can appear modest even when precision or recall is high for one class but low for another, especially in imbalanced contexts. To mitigate this, we reported AUC-ROC and AUC-PR values where appropriate and presented per-class analysis in appendices. However, future work should incorporate error analysis, calibration metrics, or task-specific cost functions to better understand model behavior and further improve the construct validity of the evaluation framework.
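The metrics reported in Appendix A (precision, recall, F1, ROC, and PR) can be computed with scikit-learn as sketched below; the labels, hard predictions, and positive-class probabilities are illustrative placeholders for a single fold rather than values from the study.

```python
from sklearn.metrics import (average_precision_score, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Illustrative labels, hard predictions, and positive-class probabilities for one fold.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]
y_prob = [0.2, 0.6, 0.8, 0.4, 0.3, 0.9, 0.1, 0.7]

print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall:   ", recall_score(y_true, y_pred, average="macro"))
print("F1:       ", f1_score(y_true, y_pred, average="macro"))
print("ROC-AUC:  ", roc_auc_score(y_true, y_prob))            # threshold-free ranking quality
print("PR-AUC:   ", average_precision_score(y_true, y_prob))  # more informative under imbalance
```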
Despite the above limitations and the slight fluctuations observed in certain specific cases (such as comment-only input with the CNN model), the comprehensive analysis of all projects and models, together with the results of the statistical significance tests, shows that our core conclusion still holds: stop word removal has no significant impact on the performance of deep learning-based bug report classification.
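The significance tests summarized in Table A21, Table A22 and Table A23 (a paired t-test and a Wilcoxon test, here assumed to be the paired signed-rank variant) can be reproduced with SciPy as in the sketch below; the two arrays are illustrative placeholders for the paired F1-scores behind a single hypothesis.

```python
import numpy as np
from scipy import stats

# Paired F1-scores for one hypothesis: the same configurations evaluated
# with and without stop word removal (illustrative values only).
f1_removed  = np.array([0.34, 0.51, 0.33, 0.35, 0.38, 0.33, 0.34, 0.39])
f1_retained = np.array([0.35, 0.50, 0.33, 0.36, 0.33, 0.33, 0.34, 0.38])

t_stat, p_t = stats.ttest_rel(f1_removed, f1_retained)   # paired t-test
w_stat, p_w = stats.wilcoxon(f1_removed, f1_retained)    # Wilcoxon signed-rank test
print(f"paired t-test p = {p_t:.4f}, Wilcoxon p = {p_w:.4f}")
# A p-value above 0.05 means the null hypothesis of "no difference" is retained.
```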

6. Related Work

6.1. Bug Duplicate Detection

The problem of detecting duplicate bug reports has received significant attention in prior research. Various approaches have been proposed to improve detection accuracy by leveraging both textual and behavioral features of bug reports.
Chaparro et al. [43] introduced a method that reformulates queries by combining observed software behaviors with bug report titles. Their technique improved duplicate detection accuracy by 56.6% to 78%, depending on the dataset. Kukkar et al. [44] designed a convolutional neural network (CNN)-based model that extracts semantic features to distinguish between duplicate and non-duplicate reports. Chauhan et al. [45] proposed the DENATURE framework, which integrates information retrieval techniques to achieve a prediction accuracy of 88.81%. Rocha et al. [46] used a combination of individual and clustered bug report representations, resulting in an average recall of 85%. Xie et al. [47] employed a CNN framework to model similarity between bug reports, showing significant improvements over traditional string-matching methods. He et al. [48] developed a Dual-Channel CNN that integrates feature matrices from pairs of bug reports, achieving high accuracy across several open-source datasets.
While these studies have shown strong performance in detecting duplicate bug reports, they typically focus on this task in isolation. Most of them do not consider related tasks such as severity classification and priority prediction, which are also critical for effective bug triaging. Moreover, prior work has generally adopted stop word removal during preprocessing as a standard step, without evaluating its necessity or impact on model performance.
In contrast, the present study adopts a unified framework that evaluates deep learning models across three interconnected tasks: duplicate detection, severity prediction, and priority prediction. Using a dataset of over 1.9 million bug reports from eight open-source projects, our study systematically examines whether removing stop words affects model performance across all tasks. The results show that stop word removal has little to no impact on classification accuracy, and statistical testing confirms that these differences are not significant.
Therefore, while earlier work has primarily focused on optimizing duplicate detection using various neural network architectures, our contribution lies in evaluating the broader implications of preprocessing decisions. Specifically, we show that stop word removal may not be necessary for deep learning-based bug report analysis, which provides practical value for streamlining preprocessing pipelines in large-scale software maintenance systems.

6.2. Bug Severity Prediction

Predicting the severity of software bugs is a crucial step in managing defect resolution and ensuring system reliability. Prior studies have introduced various machine learning techniques to enhance the accuracy of severity classification.
Mashhadi et al. [49] applied large language models, including CodeBERT, to the severity prediction task. Their method showed substantial performance gains, with improvements ranging from 29 to 140 percent compared to traditional models. Shatnawi and Alazzam [50] developed a severity assessment tool based on machine learning, which achieved improved classification accuracy through the use of over-sampling techniques and feature selection. Ramay et al. [51] proposed a deep learning approach that combines emotion analysis and natural language processing techniques, resulting in state-of-the-art performance in severity classification.
Although these studies have advanced severity prediction using increasingly powerful algorithms, they generally assume standard preprocessing steps, such as stop word removal, without explicitly evaluating their necessity or impact. Unlike these works, our study systematically investigates the effect of stop word removal on severity prediction performance. The analysis is conducted within a unified experimental framework that also includes duplicate detection and priority prediction tasks. This broader approach enables a more comprehensive understanding of how text preprocessing influences severity classification in bug report analysis.

6.3. Bug Priority Prediction

Predicting the priority of bug reports is essential for managing maintenance resources and scheduling resolution activities. Several researchers have developed classification models to automate this process.
Bani-Salameh et al. [52] applied a recurrent neural network with an LSTM architecture and reported superior results in comparison to support vector machines (SVMs) and k-nearest neighbors (KNNs), particularly in terms of accuracy, area under the curve (AUC), and F-measure. Rathnayake et al. [53] used natural language processing techniques to preprocess bug reports for CNN-based models and achieved a prediction accuracy of 71 percent. In addition, researchers such as Zhang et al. [54], Umer et al. [55], and Choudhary et al. [56] developed various machine learning, deep learning, and fuzzy logic-based methods that consistently outperformed traditional rule-based approaches in priority classification tasks.
While these studies show strong performance in priority prediction, they tend to address the task in isolation and often adopt standard preprocessing routines without critical evaluation. Our study differs by examining priority prediction in conjunction with duplicate detection and severity classification. More importantly, we evaluate the impact of stop word removal across all three tasks, which provides a more integrated perspective on model behavior and preprocessing strategy.

6.4. Comparison with Our Study

Many existing studies in bug report analysis rely on stop word removal as a default preprocessing step. However, few have questioned whether this step is necessary in software-specific contexts, where domain vocabulary and task-specific relevance may differ from general natural language processing tasks.
Our study fills this gap by systematically evaluating the effect of stop word removal across three classification tasks: duplicate detection, severity prediction, and priority prediction. Unlike previous research, which often treats these tasks independently and focuses on model design, we adopt a unified experimental framework using consistent datasets and evaluation metrics. The analysis spans multiple text segments of bug reports, including summary, description, and comments.
The results of our experiments indicate that stop word removal has a negligible impact on model performance for all tasks and architectures considered. Statistical tests confirm that the observed differences in F1-scores are not significant. These findings suggest that stop word removal may not be a necessary preprocessing step in deep learning-based bug report analysis. Eliminating it can simplify the preprocessing pipeline without compromising model effectiveness.
By providing a task-comprehensive and empirically validated analysis of preprocessing effects, our study contributes practical insights that are directly applicable to real-world software maintenance workflows.

7. Conclusions

Accurate and efficient analysis of bug reports is essential for maintaining software quality and supporting timely defect resolution. This study evaluated the performance of deep learning models on three critical bug report analysis tasks: duplicate detection, severity prediction, and priority prediction. The primary objective was to assess the practical impact of stop word removal during preprocessing, a step that is widely adopted in natural language processing workflows but rarely questioned in the context of software engineering tasks.
To conduct this evaluation, we used a large-scale dataset of over 1.9 million bug reports collected from eight open-source projects. Five commonly used deep learning models were applied: convolutional neural networks (CNNs), long short-term memory networks (LSTMs), gated recurrent units (GRUs), Transformers, and BERT. Each model was trained and tested on datasets prepared both with and without stop word removal.
The results showed that stop word removal had minimal impact on model performance. Across all tasks and model types, the average F1-scores remained consistent. Specifically, duplicate detection yielded an average F1-score of 0.36, while severity prediction and priority prediction both achieved average F1-scores of 0.33. These scores are comparable to the baselines, indicating that the evaluated models remain effective for these tasks across the eight open-source projects. Statistical significance tests confirmed that the differences between models trained with and without stop word removal were not meaningful. These findings indicate that the exclusion of stop words does not enhance or degrade the effectiveness of deep learning models for bug report classification tasks.
Based on these results, we conclude that stop word removal is not necessary for effective bug report analysis using deep learning. Omitting this step can simplify preprocessing pipelines, particularly in large-scale or automated maintenance settings, without sacrificing accuracy. This contributes to a more streamlined and practical approach to bug triaging in real-world environments.
Future work will focus on expanding the dataset to include additional projects from diverse domains and incorporating more advanced text representations. We also plan to explore the impact of other preprocessing steps, such as stemming, abbreviation normalization, or domain-specific token filtering, to further refine the effectiveness and generalizability of bug report analysis models.

Author Contributions

Writing—original draft, J.J.; Writing—review & editing, G.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Korea National University Development Project (2025) at Hankyong National University.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Acknowledgments

This research was supported by the Korea National University Development Project (2025) at Hankyong National University.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. CNN performance on summary + description input with/without stopword removal.
Model: CNN
Project | Task | Apply Stopword Removal (Precision, Recall, F1, ROC, PR) | Not Apply Stopword Removal (Precision, Recall, F1, ROC, PR)
Eclipse | Duplicate | 0.35 0.50 0.34 0.53 0.54 | 0.38 0.50 0.35 0.57 0.56
Eclipse | Severity | 0.25 0.50 0.33 0.50 0.53 | 0.27 0.50 0.33 0.51 0.56
Eclipse | Priority | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.55
FreeBSD | Duplicate | 0.35 0.58 0.51 0.71 0.71 | 0.38 0.57 0.50 0.69 0.70
FreeBSD | Severity | 0.27 0.50 0.33 0.50 0.55 | 0.25 0.50 0.33 0.51 0.53
FreeBSD | Priority | 0.25 0.50 0.33 0.51 0.57 | 0.26 0.50 0.33 0.52 0.59
GCC | Duplicate | 0.35 0.50 0.33 0.53 0.53 | 0.37 0.50 0.33 0.52 0.53
GCC | Severity | 0.25 0.50 0.33 0.49 0.53 | 0.25 0.50 0.33 0.51 0.53
GCC | Priority | 0.26 0.50 0.33 0.50 0.50 | 0.26 0.50 0.33 0.50 0.51
Gentoo | Duplicate | 0.38 0.50 0.35 0.56 0.58 | 0.36 0.50 0.36 0.57 0.58
Gentoo | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.35 0.51 0.50
Gentoo | Priority | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.50
Kernel | Duplicate | 0.38 0.50 0.34 0.50 0.56 | 0.38 0.50 0.34 0.50 0.53
Kernel | Severity | 0.26 0.50 0.33 0.52 0.53 | 0.30 0.50 0.33 0.51 0.53
Kernel | Priority | 0.26 0.50 0.33 0.51 0.52 | 0.31 0.50 0.34 0.50 0.50
RedHat | Duplicate | 0.32 0.52 0.39 0.61 0.59 | 0.38 0.51 0.38 0.57 0.56
RedHat | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.27 0.50 0.33 0.55 0.61
RedHat | Priority | 0.25 0.50 0.33 0.50 0.50 | 0.27 0.50 0.33 0.52 0.63
Sourceware | Duplicate | 0.34 0.50 0.34 0.49 0.49 | 0.38 0.50 0.34 0.51 0.50
Sourceware | Severity | 0.25 0.50 0.33 0.50 0.49 | 0.27 0.50 0.33 0.51 0.50
Sourceware | Priority | 0.32 0.50 0.33 0.50 0.49 | 0.28 0.50 0.33 0.50 0.55
WebKit | Duplicate | 0.36 0.50 0.33 0.63 0.62 | 0.35 0.50 0.34 0.63 0.62
WebKit | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.30 0.50 0.33 0.51 0.74
WebKit | Priority | 0.26 0.50 0.33 0.50 0.50 | 0.27 0.50 0.33 0.50 0.75
Table A2. LSTM performance on summary + description input with/without stopword removal.
Model: LSTM
Project | Task | Apply Stopword Removal (Precision, Recall, F1, ROC, PR) | Not Apply Stopword Removal (Precision, Recall, F1, ROC, PR)
Eclipse | Duplicate | 0.38 0.50 0.33 0.54 0.54 | 0.25 0.52 0.35 0.62 0.60
Eclipse | Severity | 0.26 0.50 0.33 0.50 0.51 | 0.25 0.50 0.34 0.51 0.50
Eclipse | Priority | 0.25 0.50 0.33 0.52 0.50 | 0.25 0.50 0.33 0.51 0.53
FreeBSD | Duplicate | 0.70 0.71 0.69 0.76 0.77 | 0.68 0.70 0.69 0.76 0.77
FreeBSD | Severity | 0.25 0.50 0.33 0.50 0.52 | 0.25 0.50 0.33 0.51 0.53
FreeBSD | Priority | 0.25 0.50 0.33 0.53 0.50 | 0.25 0.50 0.33 0.51 0.50
GCC | Duplicate | 0.39 0.51 0.35 0.60 0.58 | 0.35 0.50 0.33 0.50 0.50
GCC | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.51 0.52
GCC | Priority | 0.25 0.50 0.34 0.50 0.51 | 0.25 0.50 0.33 0.51 0.52
Gentoo | Duplicate | 0.46 0.50 0.34 0.60 0.61 | 0.46 0.51 0.34 0.60 0.60
Gentoo | Severity | 0.25 0.50 0.33 0.50 0.51 | 0.25 0.50 0.33 0.51 0.50
Gentoo | Priority | 0.25 0.50 0.33 0.50 0.55 | 0.25 0.50 0.33 0.51 0.50
Kernel | Duplicate | 0.51 0.50 0.35 0.55 0.54 | 0.52 0.56 0.38 0.62 0.58
Kernel | Severity | 0.25 0.50 0.33 0.51 0.58 | 0.25 0.50 0.33 0.49 0.60
Kernel | Priority | 0.25 0.50 0.33 0.50 0.60 | 0.25 0.50 0.33 0.49 0.55
RedHat | Duplicate | 0.66 0.61 0.68 0.68 0.65 | 0.70 0.67 0.65 0.75 0.72
RedHat | Severity | 0.25 0.50 0.34 0.53 0.52 | 0.25 0.50 0.33 0.50 0.64
RedHat | Priority | 0.25 0.50 0.33 0.52 0.52 | 0.25 0.50 0.33 0.50 0.66
Sourceware | Duplicate | 0.48 0.50 0.34 0.50 0.50 | 0.48 0.50 0.34 0.50 0.50
Sourceware | Severity | 0.25 0.50 0.33 0.50 0.56 | 0.25 0.50 0.33 0.50 0.60
Sourceware | Priority | 0.25 0.50 0.33 0.51 0.61 | 0.25 0.50 0.33 0.50 0.62
WebKit | Duplicate | 0.44 0.51 0.33 0.66 0.66 | 0.53 0.50 0.33 0.65 0.65
WebKit | Severity | 0.25 0.50 0.33 0.51 0.52 | 0.25 0.50 0.33 0.50 0.75
WebKit | Priority | 0.25 0.50 0.33 0.50 0.51 | 0.25 0.50 0.33 0.50 0.73
Table A3. GRU performance on summary + description input with/without stopword removal.
Model GRU
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.480.500.330.540.540.450.500.330.560.56
Severity0.250.500.330.500.570.250.500.330.490.50
Priority0.250.500.330.510.500.250.500.330.500.59
FreeBSDDuplicate0.440.500.330.530.500.430.500.330.500.50
Severity0.250.500.330.500.500.250.500.330.500.63
Priority0.250.500.340.510.500.250.500.330.500.55
GCCDuplicate0.330.500.340.540.540.350.500.330.500.50
Severity0.250.500.330.490.500.250.500.330.510.51
Priority0.250.500.330.500.500.250.500.330.500.51
GentooDuplicate0.320.500.350.550.500.290.500.330.560.56
Severity0.250.500.330.500.500.250.500.330.500.50
Priority0.250.500.330.500.500.250.500.330.500.50
KernelDuplicate0.630.580.340.660.630.570.500.350.520.50
Severity0.250.500.330.500.620.250.500.330.500.58
Priority0.250.500.330.500.620.250.500.330.500.64
RedHatDuplicate0.650.590.530.590.540.670.620.580.670.62
Severity0.250.500.330.500.500.250.500.330.500.71
Priority0.250.500.330.500.500.250.500.330.500.66
SourcewareDuplicate0.420.500.330.510.510.410.500.330.560.50
Severity0.250.500.330.500.660.250.500.330.500.63
Priority0.250.500.330.500.680.250.500.330.500.59
WebKitDuplicate0.380.500.330.650.640.270.500.330.650.65
Severity0.260.500.330.510.510.250.500.330.500.73
Priority0.250.500.330.500.500.250.500.330.500.69
Table A4. Transformer performance on summary + description input with/without stopword removal.
Model Transformer
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.430.500.330.510.550.350.500.330.550.56
Severity0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.34 0.50 0.55
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.55 0.50
FreeBSDDuplicate0.330.500.330.560.550.360.500.330.550.50
Severity0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.27 0.50 0.34 0.50 0.50 0.25 0.50 0.33 0.50 0.50
GCCDuplicate0.280.500.360.500.500.320.500.350.550.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.51 0.56 0.25 0.50 0.33 0.52 0.56
GentooDuplicate0.260.500.340.560.550.280.500.340.500.56
Severity0.25 0.50 0.34 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.35 0.50 0.50
KernelDuplicate0.480.500.340.500.490.480.500.350.500.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.50
Priority0.26 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.50
RedHatDuplicate0.430.500.350.560.540.370.500.340.500.67
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
SourcewareDuplicate0.500.500.340.490.500.490.500.330.500.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
WebKitDuplicate0.400.510.360.590.600.430.500.330.610.62
Severity0.26 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A5. BERT performance on summary + description input with/without stopword removal.
Model BERT
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.340.500.330.500.520.370.500.330.580.57
Severity0.250.500.330.500.500.250.500.330.520.50
Priority0.250.500.330.490.520.250.500.330.500.69
FreeBSDDuplicate0.330.500.340.500.520.320.500.330.500.50
Severity0.250.500.330.500.500.250.500.330.500.50
Priority0.260.500.330.500.500.250.500.330.500.51
GCCDuplicate0.320.500.350.500.500.320.500.330.500.50
Severity0.250.500.330.500.500.250.500.340.500.50
Priority0.250.500.330.490.500.250.500.330.500.50
GentooDuplicate0.290.500.330.520.590.320.500.340.590.51
Severity0.250.500.330.500.500.250.500.330.500.50
Priority0.250.500.330.500.500.250.500.330.500.50
KernelDuplicate0.530.520.380.570.540.520.500.340.580.55
Severity0.25 0.50 0.33 0.52 0.50 0.25 0.50 0.33 0.51 0.50
Priority0.26 0.50 0.34 0.51 0.50 0.25 0.50 0.33 0.51 0.50
RedHatDuplicate0.620.530.410.580.550.640.590.440.670.64
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.53 0.52
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.52 0.51
SourcewareDuplicate0.440.500.340.490.490.390.500.330.500.50
Severity0.28 0.50 0.33 0.50 0.49 0.29 0.50 0.33 0.51 0.50
Priority0.26 0.50 0.33 0.50 0.49 0.26 0.50 0.33 0.51 0.50
WebKitDuplicate0.300.500.330.600.590.420.500.330.580.58
Severity0.26 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.49 0.50
Table A6. CNN performance on summary-only input with/without stopword removal.
Model CNN
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.420.500.330.510.510.540.500.340.530.53
Severity0.250.500.340.520.500.250.500.330.520.50
Priority0.250.500.330.500.520.250.500.330.510.50
FreeBSDDuplicate0.530.500.330.550.560.480.510.360.650.66
Severity0.250.500.330.500.720.250.500.330.500.50
Priority0.250.500.330.500.500.250.500.330.500.50
GCCDuplicate0.420.500.330.540.550.360.500.330.540.55
Severity0.250.500.330.520.500.250.500.330.500.69
Priority0.250.500.330.510.690.250.500.330.520.50
GentooDuplicate0.430.500.330.530.530.500.500.330.540.54
Severity0.250.500.330.510.500.250.500.330.500.50
Priority0.250.500.330.500.500.250.500.330.500.50
KernelDuplicate0.310.500.330.510.510.340.500.340.530.52
Severity0.250.500.330.490.490.250.500.330.490.50
Priority0.260.500.330.510.500.250.500.330.500.51
RedHatDuplicate0.450.500.330.490.490.500.500.330.500.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.52 0.59
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.59
SourcewareDuplicate0.490.500.340.490.490.450.500.330.500.52
Severity0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.49 0.53
Priority0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.49 0.49
WebKitDuplicate0.400.500.330.510.510.330.500.330.530.54
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.67
Priority0.25 0.50 0.33 0.50 0.51 0.25 0.50 0.33 0.50 0.67
Table A7. LSTM performance on summary-only input with/without stopword removal.
Model LSTM
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.340.500.330.510.510.570.500.330.530.53
Severity0.25 0.50 0.33 0.50 0.54 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.50 0.69
FreeBSDDuplicate0.570.550.460.690.700.570.550.470.680.68
Severity0.26 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.59 0.25 0.50 0.33 0.52 0.50
GCCDuplicate0.400.510.340.600.590.390.510.350.600.60
Severity0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.55
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
GentooDuplicate0.470.500.340.600.610.460.500.330.610.60
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.510.500.350.550.540.530.500.340.530.52
Severity0.25 0.50 0.33 0.50 0.58 0.25 0.50 0.33 0.49 0.58
Priority0.25 0.50 0.33 0.50 0.62 0.25 0.50 0.33 0.50 0.65
RedHatDuplicate0.620.500.350.550.540.620.510.370.580.56
Severity0.25 0.50 0.33 0.50 0.50 0.26 0.50 0.33 0.50 0.71
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.73
SourcewareDuplicate0.390.500.330.510.500.380.500.330.500.50
Severity0.25 0.50 0.33 0.50 0.64 0.25 0.50 0.33 0.50 0.56
Priority0.25 0.50 0.33 0.50 0.58 0.25 0.50 0.33 0.50 0.66
WebKitDuplicate0.520.500.340.560.560.430.500.330.550.56
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.55
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.55
Table A8. GRU performance on summary-only input with/without stopword removal.
Model GRU
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.410.500.330.500.500.500.500.330.530.53
Severity0.26 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.26 0.50 0.33 0.50 0.50
FreeBSDDuplicate0.390.500.330.500.500.330.500.330.520.56
Severity0.25 0.50 0.33 0.50 0.70 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
GCCDuplicate0.330.500.330.560.560.320.500.340.550.55
Severity0.25 0.50 0.34 0.50 0.58 0.26 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
GentooDuplicate0.310.500.330.590.580.330.500.350.590.59
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.51
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.480.500.340.520.530.500.500.340.510.51
Severity0.25 0.50 0.33 0.50 0.63 0.25 0.50 0.33 0.50 0.65
Priority0.25 0.50 0.33 0.50 0.66 0.25 0.50 0.33 0.50 0.61
RedHatDuplicate0.580.500.340.530.520.610.510.360.560.54
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.73
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.72
SourcewareDuplicate0.450.500.340.500.500.390.500.330.500.50
Severity0.25 0.50 0.33 0.50 0.59 0.25 0.50 0.33 0.50 0.59
Priority0.25 0.50 0.33 0.50 0.67 0.25 0.50 0.33 0.50 0.63
WebKitDuplicate0.550.500.330.550.560.570.500.330.570.58
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.75
Priority0.25 0.50 0.33 0.50 0.59 0.25 0.50 0.33 0.50 0.75
Table A9. Transformer performance on summary-only input with/without stopword removal.
Model Transformer
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.370.500.340.500.510.360.500.330.510.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.51 0.51 0.25 0.50 0.33 0.51 0.50
FreeBSDDuplicate0.320.500.330.510.510.330.500.330.560.52
Severity0.25 0.50 0.33 0.51 0.49 0.25 0.50 0.33 0.50 0.68
Priority0.25 0.50 0.33 0.50 0.52 0.25 0.50 0.33 0.51 0.50
GCCDuplicate0.310.500.340.500.550.330.500.350.520.55
Severity0.25 0.50 0.33 0.50 0.49 0.25 0.50 0.33 0.50 0.59
Priority0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.49 0.50
GentooDuplicate0.260.500.330.550.520.290.500.340.500.51
Severity0.25 0.50 0.33 0.50 0.51 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.34 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.370.500.340.500.510.510.500.350.500.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
RedHatDuplicate0.480.510.410.510.500.500.500.410.500.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
SourcewareDuplicate0.400.500.330.500.500.410.500.340.500.55
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
WebKitDuplicate0.440.500.330.540.550.460.500.330.570.57
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.51
Table A10. BERT performance on summary-only input with/without stopword removal.
Model BERT
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.480.500.330.520.510.470.500.330.550.56
Severity0.25 0.50 0.33 0.50 0.54 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.53 0.50 0.25 0.50 0.33 0.50 0.50
FreeBSDDuplicate0.350.500.330.560.550.330.500.340.500.52
Severity0.26 0.50 0.33 0.50 0.50 0.26 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.26 0.50 0.33 0.50 0.50
GCCDuplicate0.290.500.340.500.500.290.500.330.510.51
Severity0.25 0.50 0.33 0.52 0.50 0.25 0.50 0.33 0.50 0.69
Priority0.27 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.50
GentooDuplicate0.250.500.340.550.590.260.500.350.510.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.520.500.340.490.490.410.500.340.490.49
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.26 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.50
RedHatDuplicate0.580.500.340.500.500.600.500.340.510.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
SourcewareDuplicate0.430.500.330.500.510.410.500.330.520.52
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
WebKitDuplicate0.390.500.330.530.530.520.500.330.520.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A11. CNN performance on description-only input with/without stopword removal.
Model: CNN
Project | Task | Apply Stopword Removal (Precision, Recall, F1, ROC, PR) | Not Apply Stopword Removal (Precision, Recall, F1, ROC, PR)
Eclipse | Duplicate | 0.35 0.50 0.33 0.55 0.55 | 0.33 0.50 0.34 0.55 0.55
Eclipse | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.50
Eclipse | Priority | 0.25 0.50 0.33 0.52 0.50 | 0.25 0.50 0.33 0.50 0.50
FreeBSD | Duplicate | 0.63 0.59 0.45 0.70 0.69 | 0.57 0.54 0.43 0.69 0.69
FreeBSD | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.50
FreeBSD | Priority | 0.25 0.50 0.33 0.50 0.56 | 0.25 0.50 0.33 0.50 0.62
GCC | Duplicate | 0.34 0.50 0.33 0.54 0.54 | 0.41 0.50 0.33 0.53 0.53
GCC | Severity | 0.25 0.50 0.33 0.51 0.50 | 0.25 0.50 0.33 0.52 0.50
GCC | Priority | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.52
Gentoo | Duplicate | 0.30 0.50 0.33 0.57 0.57 | 0.40 0.50 0.36 0.58 0.58
Gentoo | Severity | 0.25 0.50 0.33 0.53 0.51 | 0.25 0.50 0.33 0.50 0.50
Gentoo | Priority | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.50
Kernel | Duplicate | 0.47 0.50 0.34 0.50 0.51 | 0.52 0.50 0.34 0.51 0.51
Kernel | Severity | 0.29 0.50 0.35 0.51 0.51 | 0.32 0.50 0.34 0.51 0.51
Kernel | Priority | 0.39 0.50 0.38 0.49 0.50 | 0.27 0.50 0.35 0.50 0.50
RedHat | Duplicate | 0.60 0.51 0.38 0.60 0.59 | 0.62 0.55 0.40 0.65 0.64
RedHat | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.31 0.50 0.33 0.51 0.74
RedHat | Priority | 0.25 0.50 0.33 0.50 0.50 | 0.36 0.50 0.33 0.50 0.75
Sourceware | Duplicate | 0.40 0.50 0.34 0.51 0.51 | 0.36 0.50 0.33 0.51 0.52
Sourceware | Severity | 0.31 0.50 0.34 0.50 0.50 | 0.30 0.50 0.33 0.51 0.52
Sourceware | Priority | 0.33 0.50 0.33 0.52 0.51 | 0.25 0.50 0.33 0.51 0.50
WebKit | Duplicate | 0.34 0.50 0.33 0.61 0.60 | 0.30 0.50 0.33 0.63 0.61
WebKit | Severity | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.50
WebKit | Priority | 0.25 0.50 0.33 0.50 0.50 | 0.25 0.50 0.33 0.50 0.50
Table A12. LSTM performance on description-only input with/without stopword removal.
Model LSTM
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.480.510.370.610.590.490.510.350.600.60
Severity0.25 0.50 0.33 0.50 0.52 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.50
FreeBSDDuplicate0.360.500.360.580.590.370.500.360.590.60
Severity0.25 0.50 0.33 0.50 0.55 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.50
GCCDuplicate0.390.500.340.550.580.400.500.350.600.60
Severity0.25 0.50 0.33 0.50 0.69 0.25 0.50 0.33 0.50 0.59
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
GentooDuplicate0.370.500.330.610.610.350.500.330.590.59
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.540.520.390.600.570.600.550.370.630.50
Severity0.26 0.50 0.33 0.50 0.59 0.26 0.50 0.33 0.49 0.54
Priority0.25 0.50 0.33 0.50 0.60 0.25 0.50 0.33 0.50 0.57
RedHatDuplicate0.680.650.640.720.690.710.690.670.750.72
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.73
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.72
SourcewareDuplicate0.490.500.340.500.500.470.500.330.510.55
Severity0.28 0.50 0.33 0.50 0.64 0.25 0.50 0.33 0.50 0.58
Priority0.25 0.50 0.33 0.50 0.57 0.26 0.50 0.33 0.49 0.59
WebKitDuplicate0.420.520.380.630.620.570.530.400.650.64
Severity0.25 0.50 0.33 0.52 0.56 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.50
Table A13. GRU performance on description-only input with/without stopword removal.
Model GRU
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.420.500.330.580.570.410.500.330.600.59
Severity0.25 0.50 0.33 0.51 0.55 0.25 0.50 0.33 0.50 0.69
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
FreeBSDDuplicate0.320.500.330.520.500.350.500.340.500.51
Severity0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.51 0.61
Priority0.25 0.50 0.33 0.50 0.51 0.25 0.50 0.33 0.50 0.50
GCCDuplicate0.340.500.330.530.540.330.500.330.540.54
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.52
Priority0.25 0.50 0.33 0.52 0.50 0.25 0.50 0.33 0.50 0.50
GentooDuplicate0.360.500.330.520.500.330.500.330.490.50
Severity0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.420.500.340.520.520.420.500.340.520.52
Severity0.25 0.50 0.33 0.50 0.59 0.25 0.50 0.33 0.50 0.60
Priority0.25 0.50 0.33 0.50 0.62 0.25 0.50 0.33 0.50 0.65
RedHatDuplicate0.640.580.620.600.550.670.640.620.680.62
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.72
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.73
SourcewareDuplicate0.460.500.340.510.510.460.500.330.500.51
Severity0.25 0.50 0.33 0.50 0.66 0.25 0.50 0.33 0.50 0.65
Priority0.25 0.50 0.33 0.50 0.60 0.25 0.50 0.33 0.50 0.68
WebKitDuplicate0.390.500.330.620.610.360.500.340.630.62
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A14. Transformer performance on description-only input with/without stopword removal.
Model Transformer
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.280.500.330.560.560.290.500.330.500.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.64
Priority0.25 0.50 0.33 0.51 0.55 0.25 0.50 0.33 0.51 0.50
FreeBSDDuplicate0.350.500.330.500.510.350.500.330.510.51
Severity0.27 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.61
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
GCCDuplicate0.400.500.340.500.490.410.500.330.510.51
Severity0.25 0.50 0.33 0.50 0.55 0.25 0.50 0.33 0.52 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.34 0.50 0.50
GentooDuplicate0.360.500.340.500.500.350.500.330.500.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.52 0.50 0.25 0.50 0.33 0.50 0.51
KernelDuplicate0.500.500.380.500.540.460.500.350.520.50
Severity0.25 0.50 0.33 0.50 0.54 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.50
RedHatDuplicate0.450.500.360.510.530.540.520.350.520.46
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
SourcewareDuplicate0.430.500.360.500.500.420.500.350.520.52
Severity0.26 0.50 0.34 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
WebKitDuplicate0.400.510.350.600.600.350.500.330.580.59
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A15. BERT performance on description-only input with/without stopword removal.
Model BERT
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.350.500.340.590.590.350.500.330.560.56
Severity0.25 0.50 0.33 0.50 0.54 0.25 0.50 0.33 0.52 0.50
Priority0.26 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.74
FreeBSDDuplicate0.350.500.330.550.560.340.500.330.520.52
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.51 0.64
GCCDuplicate0.330.500.330.500.500.320.500.330.510.51
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.69 0.25 0.50 0.33 0.49 0.54
GentooDuplicate0.310.500.330.550.550.320.500.330.500.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.51 0.55 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.450.500.340.550.550.420.500.330.560.52
Severity0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.52 0.50 0.25 0.50 0.33 0.51 0.50
RedHatDuplicate0.420.500.330.560.550.410.500.340.520.52
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
SourcewareDuplicate0.360.500.360.500.500.330.500.350.540.56
Severity0.26 0.50 0.33 0.50 0.49 0.25 0.50 0.33 0.50 0.49
Priority0.25 0.50 0.33 0.50 0.49 0.25 0.50 0.33 0.49 0.49
WebKitDuplicate0.350.500.340.500.510.330.500.330.560.56
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A16. CNN performance on comment-only input with/without stopword removal.
Model CNN
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.290.500.330.560.560.300.500.350.510.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.61
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.34 0.50 0.50
FreeBSDDuplicate0.700.690.690.760.770.740.730.730.790.80
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.56
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.34 0.50 0.50
GCCDuplicate0.400.500.330.600.590.400.500.330.600.60
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.54 0.55
Priority0.25 0.50 0.34 0.51 0.55 0.25 0.50 0.33 0.50 0.50
GentooDuplicate0.460.510.360.630.630.390.500.340.630.63
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.540.510.360.560.540.460.510.360.520.59
Severity0.25 0.50 0.33 0.51 0.52 0.28 0.50 0.33 0.51 0.51
Priority0.25 0.50 0.33 0.49 0.49 0.25 0.50 0.33 0.49 0.50
RedHatDuplicate0.780.780.780.820.790.780.780.780.810.78
Severity0.25 0.50 0.33 0.50 0.50 0.29 0.50 0.33 0.51 0.58
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.52 0.61
SourcewareDuplicate0.490.500.370.500.510.430.500.350.500.51
Severity0.31 0.50 0.33 0.50 0.50 0.28 0.50 0.33 0.49 0.51
Priority0.25 0.50 0.33 0.50 0.54 0.27 0.50 0.33 0.50 0.50
WebKitDuplicate0.500.530.380.620.640.270.500.330.630.66
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A17. LSTM performance on comment-only input with/without stopword removal.
Model LSTM
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.380.500.340.590.580.350.500.330.520.56
Severity0.25 0.50 0.33 0.50 0.51 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.51 0.52 0.25 0.50 0.33 0.50 0.55
FreeBSDDuplicate0.350.500.330.520.520.330.500.330.500.52
Severity0.25 0.50 0.33 0.52 0.50 0.25 0.50 0.33 0.51 0.61
Priority0.25 0.50 0.33 0.49 0.49 0.25 0.50 0.33 0.50 0.50
GCCDuplicate0.320.500.340.500.520.330.500.340.550.56
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.51
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
GentooDuplicate0.290.500.330.550.510.320.500.330.500.52
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.51 0.50
KernelDuplicate0.490.500.350.550.540.580.510.370.560.54
Severity0.26 0.50 0.33 0.50 0.57 0.25 0.50 0.33 0.49 0.62
Priority0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.61
RedHatDuplicate0.630.570.400.660.640.620.520.390.560.54
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.72
Priority0.25 0.50 0.33 0.50 0.50 0.26 0.50 0.33 0.50 0.71
SourcewareDuplicate0.440.500.330.500.500.420.500.330.510.50
Severity0.25 0.50 0.33 0.51 0.60 0.26 0.50 0.33 0.50 0.63
Priority0.25 0.50 0.33 0.49 0.59 0.25 0.50 0.33 0.49 0.64
WebKitDuplicate0.410.500.330.580.570.450.500.330.600.60
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A18. GRU performance on comment-only input with/without stopword removal.
Model GRU
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.420.500.330.580.570.420.500.340.580.58
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.49
FreeBSDDuplicate0.320.500.330.500.540.320.500.330.510.51
Severity0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.52
GCCDuplicate0.330.500.330.530.540.340.500.330.500.52
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.55
GentooDuplicate0.320.500.340.510.510.310.500.330.510.52
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.65
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.520.500.340.520.520.420.500.340.520.52
Severity0.25 0.50 0.33 0.50 0.62 0.25 0.50 0.33 0.50 0.64
Priority0.25 0.50 0.33 0.50 0.64 0.25 0.50 0.33 0.50 0.59
RedHatDuplicate0.640.580.520.600.550.670.640.520.680.62
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.72
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.70
SourcewareDuplicate0.460.500.340.520.500.450.500.340.510.51
Severity0.25 0.50 0.33 0.50 0.61 0.25 0.50 0.33 0.50 0.65
Priority0.25 0.50 0.33 0.50 0.66 0.25 0.50 0.33 0.50 0.63
WebKitDuplicate0.390.500.330.620.610.380.500.330.620.62
Severity0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A19. Transformer performance on comment-only input with/without stopword removal.
Model Transformer
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.280.500.330.560.560.280.500.330.550.52
Severity0.25 0.50 0.33 0.50 0.51 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
FreeBSDDuplicate0.290.500.330.510.520.290.500.330.520.52
Severity0.25 0.50 0.33 0.51 0.56 0.25 0.50 0.33 0.49 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.57
GCCDuplicate0.400.500.330.570.550.380.500.330.500.51
Severity0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.49 0.50
Priority0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.50 0.51
GentooDuplicate0.310.500.330.550.560.320.500.330.550.55
Severity0.25 0.50 0.33 0.50 0.50 0.26 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.500.500.380.500.540.460.500.350.520.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
RedHatDuplicate0.450.500.360.500.530.540.520.350.520.56
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
SourcewareDuplicate0.530.500.360.500.500.520.500.350.510.50
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.49 0.50
WebKitDuplicate0.410.510.350.600.600.350.500.330.580.59
Severity0.26 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A20. BERT performance on comment-only input with/without stopword removal.
Model BERT
Apply Stopword RemovalNot Apply Stopword Removal
ProjectTaskPrecisionRecallF1ROCPRPrecisionRecallF1ROCPR
EclipseDuplicate0.350.500.330.520.520.360.500.330.550.51
Severity0.25 0.50 0.33 0.51 0.51 0.25 0.50 0.33 0.51 0.51
Priority0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.50 0.50
FreeBSDDuplicate0.280.500.330.500.500.280.500.330.520.51
Severity0.25 0.50 0.33 0.49 0.49 0.25 0.50 0.33 0.50 0.55
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.34 0.51 0.56
GCCDuplicate0.330.500.340.550.560.330.500.330.500.51
Severity0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.49 0.25 0.50 0.33 0.49 0.69
GentooDuplicate0.360.500.330.500.540.350.500.340.560.56
Severity0.25 0.50 0.34 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
KernelDuplicate0.350.500.330.590.600.360.500.330.540.56
Severity0.25 0.50 0.33 0.51 0.50 0.25 0.50 0.33 0.51 0.50
Priority0.25 0.50 0.33 0.49 0.50 0.25 0.50 0.33 0.51 0.50
RedHatDuplicate0.570.530.420.490.490.570.560.450.490.49
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
SourcewareDuplicate0.320.500.330.500.510.330.500.330.520.52
Severity0.27 0.50 0.33 0.51 0.51 0.25 0.50 0.33 0.51 0.51
Priority0.25 0.50 0.33 0.51 0.51 0.25 0.50 0.33 0.52 0.51
WebKitDuplicate0.350.500.330.510.540.270.500.330.540.57
Severity0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.54
Priority0.25 0.50 0.33 0.50 0.50 0.25 0.50 0.33 0.50 0.50
Table A21. Statistical test results (t-test, Wilcoxon) for duplicate detection performance differences due to stopword removal.
Hypothesis p-Value Result | Hypothesis p-Value Result | Hypothesis p-Value Result
H1 0.9944 (t) Accept | H2 0.9234 (w) Accept | H3 0.9238 (w) Accept
H4 0.9550 (w) Accept | H5 0.9549 (t) Accept | H6 0.9934 (t) Accept
H7 0.9838 (w) Accept | H8 0.9731 (w) Accept | H9 0.7903 (t) Accept
H10 0.8318 (w) Accept | H11 0.9835 (t) Accept | H12 0.9250 (t) Accept
H13 0.8620 (t) Accept | H14 0.9675 (t) Accept | H15 0.9492 (w) Accept
H16 0.9779 (t) Accept | H17 0.9389 (t) Accept | H18 0.9303 (t) Accept
H19 0.9089 (t) Accept | H20 0.9593 (w) Accept | H21 0.8965 (t) Accept
H22 0.9635 (t) Accept | H23 0.9529 (t) Accept | H24 0.6545 (t) Accept
H25 0.8565 (t) Accept | H26 0.8865 (t) Accept | H27 0.9556 (w) Accept
H28 0.9517 (w) Accept | H29 0.9995 (w) Accept | H30 0.9545 (t) Accept
H31 0.9970 (t) Accept | H32 0.9994 (t) Accept | H33 0.9954 (t) Accept
H34 0.6988 (t) Accept | H35 0.7965 (t) Accept | H36 0.9677 (t) Accept
H37 0.9638 (w) Accept | H38 0.9550 (w) Accept | H39 0.9626 (t) Accept
H40 0.9324 (t) Accept
Table A22. Statistical test results for severity prediction performance differences due to stopword removal.
Hypothesis p-Value Result | Hypothesis p-Value Result | Hypothesis p-Value Result
H41 0.9826 (t) Accept | H42 0.9590 (w) Accept | H43 0.7940 (t) Accept
H44 0.9306 (t) Accept | H45 0.9830 (t) Accept | H46 0.9851 (t) Accept
H47 0.8837 (t) Accept | H48 0.8697 (t) Accept | H49 0.9018 (t) Accept
H50 0.9819 (t) Accept | H51 0.9360 (t) Accept | H52 0.8971 (t) Accept
H53 0.9367 (t) Accept | H54 0.9632 (t) Accept | H55 0.9836 (t) Accept
H56 0.9183 (t) Accept | H57 0.9607 (w) Accept | H58 0.9721 (t) Accept
H59 0.9543 (t) Accept | H60 0.9153 (w) Accept | H61 0.9775 (w) Accept
H62 0.9015 (t) Accept | H63 0.9859 (t) Accept | H64 0.9699 (w) Accept
H65 0.9776 (t) Accept | H66 0.8544 (w) Accept | H67 0.9544 (w) Accept
H68 0.8582 (t) Accept | H69 0.9452 (w) Accept | H70 0.9339 (t) Accept
H71 0.8080 (w) Accept | H72 0.9681 (t) Accept | H73 0.9958 (t) Accept
H74 0.8232 (w) Accept | H75 0.8938 (w) Accept | H76 0.7578 (w) Accept
H77 0.8982 (t) Accept | H78 0.9386 (t) Accept | H79 0.9568 (w) Accept
H80 0.9169 (t) Accept
Table A23. Statistical test results for priority prediction performance differences due to stopword removal.
Hypothesis p-Value Result | Hypothesis p-Value Result | Hypothesis p-Value Result
H81 0.9719 (t) Accept | H82 0.9126 (t) Accept | H83 0.9701 (t) Accept
H84 0.9648 (t) Accept | H85 0.9564 (t) Accept | H86 0.8379 (t) Accept
H87 0.9672 (t) Accept | H88 0.8774 (t) Accept | H89 0.8283 (t) Accept
H90 0.8210 (t) Accept | H91 0.9140 (t) Accept | H92 0.9670 (t) Accept
H93 0.9248 (t) Accept | H94 0.9527 (t) Accept | H95 0.9853 (t) Accept
H96 0.9820 (t) Accept | H97 0.9479 (t) Accept | H98 0.8891 (t) Accept
H99 0.9457 (t) Accept | H100 0.8779 (t) Accept | H101 0.9521 (w) Accept
H102 0.8726 (t) Accept | H103 0.9979 (t) Accept | H104 0.9284 (t) Accept
H105 0.9875 (t) Accept | H106 0.8701 (t) Accept | H107 0.8946 (t) Accept
H108 0.8522 (t) Accept | H109 0.9585 (t) Accept | H110 0.9508 (t) Accept
H111 0.8978 (w) Accept | H112 0.9339 (w) Accept | H113 0.9952 (w) Accept
H114 0.9066 (w) Accept | H115 0.6370 (t) Accept | H116 0.9201 (t) Accept
H117 0.9077 (t) Accept | H118 0.8913 (t) Accept | H119 0.9214 (t) Accept
H120 0.9583 (w) Accept
Figure A1. CNN performance on summary + description input with/without stopword removal.
Figure A2. LSTM performance on summary + description input with/without stopword removal.
Figure A3. GRU performance on summary + description input with/without stopword removal.
Figure A4. Transformer performance on summary + description input with/without stopword removal.
Figure A5. BERT performance on summary + description input with/without stopword removal.
Figure A6. CNN performance on summary-only input with/without stopword removal.
Figure A7. LSTM performance on summary-only input with/without stopword removal.
Figure A8. GRU performance on summary-only input with/without stopword removal.
Figure A9. Transformer performance on summary-only input with/without stopword removal.
Figure A10. BERT performance on summary-only input with/without stopword removal.
Figure A11. CNN performance on description-only input with/without stopword removal.
Figure A12. LSTM performance on description-only input with/without stopword removal.
Figure A13. GRU performance on description-only input with/without stopword removal.
Figure A14. Transformer performance on description-only input with/without stopword removal.
Figure A15. BERT performance on description-only input with/without stopword removal.
Figure A16. CNN performance on comment-only input with/without stopword removal.
Figure A17. LSTM performance on comment-only input with/without stopword removal.
Figure A18. GRU performance on comment-only input with/without stopword removal.
Figure A19. Transformer performance on comment-only input with/without stopword removal.
Figure A20. BERT performance on comment-only input with/without stopword removal.

References

  1. Banerjee, S.; Syed, Z.; Helmick, J.; Culp, M.; Ryan, K.; Cukic, B. Automated Triaging of very Large Bug Repositories. Inf. Softw. Technol. 2017, 89, 1–13. [Google Scholar] [CrossRef]
  2. Eclipse. Available online: https://bugs.eclipse.org/bugs (accessed on 5 July 2024).
  3. Mozilla. Available online: https://bugzilla.mozilla.org/ (accessed on 5 July 2024).
  4. Mukherjee, U.; Rahman, M.M. Understanding the Impact of Domain Term Explanation on Duplicate Bug Report Detection. arXiv 2025, arXiv:2503.18832. [Google Scholar] [CrossRef]
  5. Bugzilla. Available online: https://bugzilla.org/ (accessed on 5 July 2024).
  6. Kim, Y. Convolutional Neural Networks for Sentence Classification. arXiv 2014, arXiv:1408.5882. [Google Scholar] [CrossRef]
  7. Wang, L.; Zhang, L.; Jiang, J. Duplicate Question Detection with Deep Learning in Stack Overflow. IEEE Access 2020, 8, 25964–25975. [Google Scholar] [CrossRef]
  8. FreeBSD. Available online: https://bugs.freebsd.org/bugzilla/ (accessed on 5 July 2024).
  9. GCC. Available online: https://gcc.gnu.org/bugzilla/ (accessed on 5 July 2024).
  10. Gentoo. Available online: https://bugs.gentoo.org/ (accessed on 5 July 2024).
  11. Kernel. Available online: https://bugzilla.kernel.org/ (accessed on 5 July 2024).
  12. RedHat. Available online: https://bugzilla.redhat.com/ (accessed on 5 July 2024).
  13. Sourceware. Available online: https://sourceware.org/bugzilla/ (accessed on 5 July 2024).
  14. Webkit. Available online: https://bugs.webkit.org/ (accessed on 5 July 2024).
  15. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  16. Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations Using RNN Encoder-Decoder for Statistical Machine Translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
  17. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
  19. GCC #65092. Available online: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65092 (accessed on 5 July 2024).
  20. Bugzilla Documentation. Available online: https://bugzilla.readthedocs.io/en/5.2/using/editing.html (accessed on 5 July 2024).
  21. Eren, Ç.; Şahin, K.; Tüzün, E. Analyzing Bug Life Cycles to Derive Practical Insights. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering, Oulu, Finland, 14–16 June 2023; pp. 162–171. [Google Scholar]
  22. Kao, A.; Poteet, S.R. Natural Language Processing and Text Mining; Springer: London, UK, 2007. [Google Scholar]
  23. Xia, Y.; Hua, J.; Dougherty, E.R. Quantification of the Impact of Feature Selection on the Variance of Cross-Validation Error Estimation. EURASIP J. Bioinform. Syst. Biol. 2007, 2007, 1–11. [Google Scholar] [CrossRef]
  24. Nguyen, A.T.; Nguyen, T.T.; Nguyen, T.N.; Lo, D.; Sun, C. Duplicate Bug Report Detection with a Combination of Information Retrieval and Topic Modeling. In Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering (ASE), Essen, Germany, 3–7 September 2012; pp. 70–79. [Google Scholar]
  25. Sun, C.; Lo, D.; Wang, X.; Jiang, J.; Khoo, S.-C. A Discriminative Model Approach for Accurate Duplicate Bug Report Retrieval. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering (ICSE), Cape Town, South Africa, 2–8 May 2010; Volume 1, pp. 45–54. [Google Scholar]
  26. Wang, X.; Zhang, L.; Xie, T.; Anvik, J.; Sun, J. An Approach to Detecting Duplicate Bug Reports Using Natural Language and Execution Information. In Proceedings of the 30th International Conference on Software Engineering (ICSE), Leipzig, Germany, 10–18 May 2008; pp. 461–470. [Google Scholar]
  27. Meng, Q.; Zhang, X.; Ramackers, G.; Visser, J. Combining Retrieval and Classification: Balancing Efficiency and Accuracy in Duplicate Bug Report Detection. arXiv 2024, arXiv:2404.14877. [Google Scholar] [CrossRef]
  28. Patil, A.; Jadon, A. Auto-Labelling of Bug Report Using Natural Language Processing. In Proceedings of the 2023 IEEE 8th International Conference for Convergence in Technology (I2CT), Pune, India, 7–9 April 2023; pp. 1–7. [Google Scholar]
  29. Eclipse #30959. Available online: https://bugs.eclipse.org/bugs/show_bug.cgi?id=30959 (accessed on 5 July 2024).
  30. Eclipse #31082. Available online: https://bugs.eclipse.org/bugs/show_bug.cgi?id=31082 (accessed on 5 July 2024).
  31. Gupta, S.; Gupta, S.K. A Systematic Study of Duplicate Bug Report Detection. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 578–589. [Google Scholar] [CrossRef]
  32. Al-Msie, R.F. BushraDBR: An Automatic Approach to Retrieving Duplicate Bug Reports. arXiv 2024, arXiv:2407.04707. [Google Scholar]
  33. Ramos, J. Using TF-IDF to Determine Word Relevance in Document Queries. In Proceedings of the First Instructional Conference on Machine Learning; Citeseer: Princeton, NJ, USA, 2003; Volume 242, pp. 29–48. [Google Scholar]
  34. McHugh, M.L. The Chi-Square Test of Independence. Biochem. Med. 2013, 23, 143–149. [Google Scholar] [CrossRef]
  35. Stopword List. Available online: https://gist.github.com/rg089/35e00abf8941d72d419224cfd5b5925d (accessed on 5 July 2024).
  36. Berrar, D. Cross-validation. In Encyclopedia of Bioinformatics and Computational Biology, 2nd ed.; Elsevier: Amsterdam, The Netherlands, 2019; pp. 542–545. [Google Scholar]
  37. Goutte, C.; Gaussier, E. A Probabilistic Interpretation of Precision, Recall and F-score, with Implication for Evaluation. In Proceedings of the European Conference on Information Retrieval, Santiago de Compostela, Spain, 21–23 March 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 345–359. [Google Scholar]
  38. Davis, J.; Goadrich, M. The Relationship between Precision-Recall and ROC Curves. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 233–240. [Google Scholar]
  39. Gravetter, F.J.; Wallnau, L.B. Introduction to the T statistic. Essent. Statist. Behav. Sci. 2014, 8, 252. [Google Scholar]
  40. Woolson, R.F. Wilcoxon signed-rank test. In Wiley Encyclopedia of Clinical Trials; Wiley: Hoboken, NJ, USA, 2007; p. 13. [Google Scholar]
  41. Bland, J.M.; Altman, D.G. Multiple Significance Tests: The Bonferroni Method. BMJ 1995, 310, 170. [Google Scholar] [CrossRef]
  42. Harris, C.R.; Millman, K.J.; Van Der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Oliphant, T.E. Array Programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
  43. Chaparro, O.; Florez, J.M.; Singh, U.; Marcus, A. Reformulating Queries for Duplicate Bug Report Detection. In Proceedings of the IEEE 26th International Conference on Software Analysis, Evolution and Reengineering (SANER), Hangzhou, China, 24–27 February 2019; pp. 218–229. [Google Scholar]
  44. Kukkar, A.; Mohana, R.; Kumar, Y.; Nayyar, A.; Bilal, M.; Kwak, K.S. Duplicate Bug Report Detection and Classification System Based on Deep Learning Technique. IEEE Access 2020, 8, 200749–200763. [Google Scholar] [CrossRef]
  45. Chauhan, R.; Sharma, S.; Goyal, A. DENATURE: Duplicate Detection and Type Identification in Open Source Bug Repositories. Int. J. Syst. Assur. Eng. Manag. 2023, 14, 1–18. [Google Scholar] [CrossRef]
  46. Rocha, T.M.; Carvalho, A.L.D.C. SiameseQAT: A Semantic Context-Based Duplicate Bug Report Detection Using Replicated Cluster Information. IEEE Access 2021, 9, 44610–44630. [Google Scholar] [CrossRef]
  47. Xie, Q.; Wen, Z.; Zhu, J.; Gao, C.; Zheng, Z. Detecting Duplicate Bug Reports with Convolutional Neural Networks. In Proceedings of the IEEE 25th Asia-Pacific Software Engineering Conference (APSEC), Nara, Japan, 4–7 December 2018; pp. 416–425. [Google Scholar]
  48. He, J.; Xu, L.; Yan, M.; Xia, X.; Lei, Y. Duplicate Bug Report Detection Using Dual-Channel Convolutional Neural Networks. In Proceedings of the 28th International Conference on Program Comprehension, Seoul, Republic of Korea, 13–15 July 2020; pp. 117–127. [Google Scholar]
  49. Mashhadi, E.; Ahmadvand, H.; Hemmati, H. Method-Level Bug Severity Prediction using Source Code Metrics and LLMs. In Proceedings of the IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), Florence, Italy, 9–12 October 2023; pp. 635–646. [Google Scholar]
  50. Shatnawi, M.Q.; Alazzam, B. An Assessment of Eclipse Bugs’ Priority and Severity Prediction Using Machine Learning. Int. J. Commun. Netw. Inf. Secur. 2022, 4, 62–69. [Google Scholar] [CrossRef]
  51. Ramay, W.Y.; Umer, Q.; Yin, X.C.; Zhu, C.; Illahi, I. Deep Neural Network-Based Severity Prediction of Bug Reports. IEEE Access 2019, 7, 46846–46857. [Google Scholar] [CrossRef]
  52. Bani-Salameh, H.; Sallam, M.; Al Shboul, B. A Deep-Learning-Based Bug Priority Prediction Using RNN-LSTM Neural Networks. E-Inform. Softw. Eng. J. 2021, 15, 29–45. [Google Scholar]
  53. Rathnayake, R.M.D.S.; Kumara, B.T.G.S.; Ekanayake, E.M.U.W.J.B. CNN-Based Priority Prediction of Bug Reports. In Proceedings of the IEEE International Conference on Decision Aid Sciences and Application (DASA), Online, 7–8 December 2021; pp. 299–303. [Google Scholar]
  54. Zhang, W.; Challis, C. Automatic Bug Priority Prediction Using DNN Based Regression. In Advances in Natural Computation, Fuzzy Systems and Knowledge Discovery, Proceedings of the International Conference on Natural Computation, Fuzzy Systems and Knowledge Discovery, Xi'an, China, 1–3 August 2020; Springer: Cham, Switzerland, 2020; Volume 1, pp. 333–340. [Google Scholar]
  55. Umer, Q.; Liu, H.; Sultan, Y. Emotion-Based Automated Priority Prediction for Bug Reports. IEEE Access 2018, 6, 35743–35752. [Google Scholar] [CrossRef]
  56. Choudhary, P.A.; Singh, S. Neural Network Based Bug Priority Prediction Model using Text Classification Techniques. Int. J. Adv. Res. Comput. Sci. 2017, 8, 1315. [Google Scholar]
Figure 1. Example bug report structure from GCC tracker (from [19]).
Figure 2. Standard lifecycle phases of software bug management.
Figure 3. System architecture for bug duplicate detection.
Figure 4. Example pair of duplicate bug reports in Eclipse.
Figure 5. Workflow for severity classification of bug reports.
Figure 6. Classification criteria for bug severity levels.
Figure 7. Bug report pipeline for priority prediction.
Figure 8. Examples of bug priority levels and their definitions.
Figure 9. Proposed framework for bug report analysis and prediction.
Figure 10. Illustrative example of preprocessing steps for bug reports.
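Figure 10 illustrates the text preprocessing applied to bug reports. A minimal sketch of producing the two preprocessing variants compared in this study (stop words kept vs. removed) is given below; the regular-expression tokenizer and the small inline stop word set are illustrative assumptions, whereas the experiments use the published stop word list cited as [35].

```python
# Minimal sketch: lowercase, keep alphanumeric tokens, optionally drop stop words.
import re

# Illustrative subset only; the study relies on a published stop word list [35].
STOPWORDS = {"the", "a", "an", "is", "are", "when", "from", "of", "to", "in"}

def preprocess(text, remove_stopwords=False):
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return " ".join(tokens)

summary = "The editor crashes when a file is opened from the menu"
print(preprocess(summary))                         # stop words kept
print(preprocess(summary, remove_stopwords=True))  # stop words removed
```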
Figure 11. Feature selection process using TF-IDF and Chi-square methods.
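Figure 11 combines TF-IDF term weighting with a Chi-square test for feature selection. The sketch below shows one common scikit-learn realization of this step; the vocabulary cap and the number of selected features are illustrative assumptions rather than the study's exact configuration.

```python
# Minimal sketch: TF-IDF weighting followed by Chi-square feature selection.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts  = ["editor crashes on startup", "null pointer exception in parser",
          "ui freezes when saving", "compiler segfault on template code"]
labels = [1, 0, 1, 0]  # e.g., duplicate vs. non-duplicate

tfidf = TfidfVectorizer(max_features=5000)   # TF-IDF weight per term
X = tfidf.fit_transform(texts)

selector = SelectKBest(chi2, k=3)            # keep the k highest-scoring terms
X_selected = selector.fit_transform(X, labels)
print(X_selected.shape)
```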
Figure 12. CNN architecture for classification with convolution and pooling layers.
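Figure 12 depicts a convolutional text classifier built from convolution and pooling layers. The Keras sketch below outlines one possible realization; the embedding size, filter count, and kernel width are illustrative values, not the paper's exact hyperparameters.

```python
# Minimal sketch of a CNN text classifier (illustrative hyperparameters).
import tensorflow as tf

vocab_size, max_len, num_classes = 20000, 200, 2

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, 128),          # token embeddings
    tf.keras.layers.Conv1D(128, kernel_size=5, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),                 # pool over positions
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```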
Figure 13. LSTM architecture for sequential bug report analysis.
Figure 14. Architecture of the GRU-based model for bug report classification.
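Figures 13 and 14 encode the token sequence with recurrent layers (LSTM and GRU, respectively). A minimal sketch of such a recurrent classifier follows; swapping the LSTM layer for a GRU layer gives the GRU variant, and all layer sizes are illustrative assumptions.

```python
# Minimal sketch of a recurrent bug report classifier (illustrative sizes).
import tensorflow as tf

vocab_size, max_len, num_classes = 20000, 200, 2

def build_rnn_classifier(cell="lstm"):
    Recurrent = tf.keras.layers.LSTM if cell == "lstm" else tf.keras.layers.GRU
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(max_len,)),
        tf.keras.layers.Embedding(vocab_size, 128),
        Recurrent(128),   # final hidden state summarizes the report text
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

lstm_model = build_rnn_classifier("lstm")
gru_model = build_rnn_classifier("gru")
```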
Figure 15. Transformer-based text classification model for bug reports using multi-head self-attention and positional encoding.
Figure 16. BERT model architecture adapted for bug report classification with fine-tuning on downstream tasks.
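Figure 16 adapts a pretrained BERT encoder with a classification head that is fine-tuned on the downstream tasks. The sketch below shows one common way to load such a model and score a pair of bug report texts with Hugging Face Transformers; the checkpoint name, label count, and maximum sequence length are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of BERT-based classification (illustrative checkpoint and labels).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

reports = ["Editor crashes when opening large files",
           "Crash on opening a big file in the editor"]
batch = tokenizer(reports, padding=True, truncation=True,
                  max_length=128, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch).logits   # one score per class per report
print(logits.softmax(dim=-1))
```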
Table 1. Dataset summary: bug report statistics across eight open-source projects.

| Project Type       | Project    | Time Frame        | Number of Reports |
|--------------------|------------|-------------------|-------------------|
| Development Tool   | Eclipse    | 28/02/02–21/08/23 | 559,680           |
| Development Tool   | GCC        | 24/06/02–15/08/23 | 94,589            |
| Server             | Sourceware | 27/10/00–16/08/23 | 30,710            |
| Web Browser Engine | WebKit     | 01/01/01–15/08/23 | 242,605           |
| Operating System   | FreeBSD    | 03/06/95–21/08/23 | 262,410           |
| Operating System   | Gentoo     | 09/11/02–23/08/23 | 536,646           |
| Operating System   | Kernel     | 20/04/08–17/12/14 | 14,366            |
| Operating System   | Red Hat    | 31/03/00–29/08/07 | 160,178           |
| Total              |            |                   | 1,901,084         |