
Optimizing Accuracy, Recall, Specificity, and Precision Using ILP

1
Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran
2
Department of Mathematics and Statistics, Brock University, St. Catharines, ON L2S 3A1, Canada
*
Author to whom correspondence should be addressed.
Mathematics 2025, 13(7), 1059; https://doi.org/10.3390/math13071059
Submission received: 23 September 2024 / Revised: 6 March 2025 / Accepted: 11 March 2025 / Published: 25 March 2025
(This article belongs to the Special Issue Statistical Forecasting: Theories, Methods and Applications)

Abstract: Accuracy, recall, specificity, and precision are key performance measures for binary classifiers. To obtain these measures, the probabilities generated by classifiers must be converted into deterministic labels using a threshold. Exhaustive search methods can be computationally expensive, prompting the need for a more efficient solution. We propose an integer linear programming (ILP) formulation to find the threshold that maximizes any linear combination of these measures. Simulations and experiments on four real-world datasets demonstrate that our approach identifies the optimal threshold orders of magnitude faster than an exhaustive search. This work establishes ILP as an efficient tool for optimizing classifier performance.
MSC:
90C10; 68T05; 62H30; 90C90

1. Introduction

In binary classification tasks, selecting an appropriate threshold to convert a classifier's probabilistic outputs into deterministic class labels is necessary for optimizing performance [1,2]. Common performance measures—accuracy, recall, specificity, and precision—each capture different aspects of a classifier’s effectiveness and are used in applications such as medical diagnosis [3], fraud detection [4], spam filtering [5], ecology [6], and earthquake occurrence prediction [7].
In many binary classification tasks, the goal is to optimize a single performance measure, such as accuracy or recall. However, there are critical situations where it is essential to maximize two or more performance measures simultaneously to ensure balanced and robust classifier performance [8,9,10]. For example, in medical diagnostics, recall is needed to detect as many cases of a disease as possible, while specificity is necessary to avoid misdiagnosing healthy individuals. Precision helps ensure that when the model flags a condition, it is highly likely to be correct, and accuracy ensures overall reliability across both positive and negative cases. Similarly, fields such as fraud detection, spam filtering, and cybersecurity often involve high-stakes decisions where errors on either side—positive or negative—can have serious consequences, making it critical to balance all four measures effectively.
Threshold selection often involves trade-offs among these performance measures [11]. Traditional methods for threshold selection, including heuristic approaches and exhaustive searches, can be computationally intensive and may become impractical for large datasets due to their quadratic or sub-quadratic time complexity [1].
Several studies have investigated advanced strategies for threshold selection that go beyond single-metric optimization. Berger and Guda [12] proposed a fixed-point method for optimizing macro-averaged precision and recall. Pillai et al. [13] improved F-measure optimization in multi-label classification. Tasche [14] introduced a plug-in approach to maximize precision and recall at the top-K predictions. Arora et al. [15] incorporated uncertainty estimates into threshold selection, achieving better precision–recall trade-offs. Koseoglu et al. [16] developed a mixed-integer linear programming (MILP) framework for adaptable thresholding across classifiers, while Sanchez [17] applied game theory to identify robust operating points.
We propose an integer linear programming (ILP) formulation to efficiently determine the optimal threshold that maximizes any linear combination of accuracy, recall, specificity, and precision. By reformulating the non-linear components of these performance measures into linear expressions, our approach leverages ILP solvers to efficiently address the optimization problem, ensuring both scalability and computational feasibility. Unlike previous methods that may focus on a single performance measure or require exhaustive computations, the ILP formulation provides a flexible framework capable of balancing multiple performance metrics simultaneously. We validate the efficiency of this approach through simulations, demonstrating substantial improvements over exhaustive search methods. This approach provides a scalable solution for binary classification threshold optimization and lays the foundation for extending these techniques to multi-class tasks, enabling impactful trade-offs among multiple metrics.

2. Problem Formulation and Results

Given a binary-labeled dataset, consisting of positive and negative target values, and a binary classifier, true positives and true negatives are the correctly classified instances with positive and negative target values, respectively. Similarly, negative and positive instances incorrectly predicted by the classifier are called false positives and false negatives, respectively.
Denote the number of true positive, true negative, false positive, and false negative instances by $TP$, $TN$, $FP$, and $FN$, respectively. Accuracy [18], recall [19], specificity [20], and precision [19] are defined as
$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \quad \mathrm{recall} = \frac{TP}{TP + FN}, \quad \mathrm{specificity} = \frac{TN}{TN + FP}, \quad \mathrm{precision} = \frac{TP}{TP + FP}. \tag{1}$$
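As a quick illustration of the definitions in (1), the four measures can be computed directly from the confusion-matrix counts. This is a minimal sketch of ours (the function name and the example counts are illustrative, not from the paper):

```python
def binary_metrics(tp, tn, fp, fn):
    """Compute the four measures defined in (1) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "recall": tp / (tp + fn),        # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),     # positive predictive value
    }

# Illustrative counts: 40 true positives, 45 true negatives, 5 FP, 10 FN.
m = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```
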
While calculating the performance measures introduced in (1) requires deterministic labels, classification models often produce probabilistic outputs. Hence, one must apply a real-valued threshold, between 0 and 1, to the probabilities generated by a classifier to obtain deterministic labels. The problem is then to find the optimal threshold that maximizes accuracy (resp. recall, specificity, or precision) or, in general, a linear combination of them:
$$\text{maximize} \quad \alpha\,\frac{TP + TN}{TP + TN + FP + FN} + \beta\,\frac{TP}{TP + FN} + \gamma\,\frac{TN}{TN + FP} + \zeta\,\frac{TP}{TP + FP}, \tag{2}$$
where $\alpha, \beta, \gamma, \zeta \in \mathbb{R}$ are arbitrary coefficients. We propose an integer linear programming (ILP) formulation to solve this problem.
Consider a set of $N$ instances with a binary target vector $y \in \{0, 1\}^N$, where 0 and 1 represent negative and positive labels. Denote the numbers of positive and negative instances by $N_+$ and $N_-$. Suppose $p_i$ is the probability generated by a binary classifier for the $i$th instance. Let $\hat{y}_i$ be a binary decision variable that represents the label of the $i$th instance after applying the threshold $\tau$ to $p_i$. More precisely, if $p_i < \tau$, then $\hat{y}_i$ is 0 and is 1 otherwise. Then, according to (1), accuracy, recall, specificity, and precision can be written as follows:
$$\mathrm{accuracy} = 1 - \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|, \quad \mathrm{recall} = \frac{1}{N_+}\sum_{i=1}^{N} y_i \hat{y}_i, \quad \mathrm{specificity} = \frac{1}{N_-}\sum_{i=1}^{N} (1 - y_i)(1 - \hat{y}_i), \quad \mathrm{precision} = \frac{\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} \hat{y}_i}. \tag{3}$$
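The sum form in (3) can be checked numerically against the count-based definitions in (1). A small sketch of ours, assuming 0/1 labels and the thresholding rule from the text (the example data are illustrative):

```python
def metrics_from_labels(y, y_hat):
    """Evaluate accuracy, recall, specificity, and precision via the sums in (3)."""
    n = len(y)
    n_pos = sum(y)
    n_neg = n - n_pos
    acc = 1 - sum(abs(yi - yh) for yi, yh in zip(y, y_hat)) / n
    rec = sum(yi * yh for yi, yh in zip(y, y_hat)) / n_pos
    spec = sum((1 - yi) * (1 - yh) for yi, yh in zip(y, y_hat)) / n_neg
    prec = sum(yi * yh for yi, yh in zip(y, y_hat)) / sum(y_hat)
    return acc, rec, spec, prec

y = [1, 1, 0, 0, 1]
p = [0.9, 0.4, 0.2, 0.7, 0.8]
tau = 0.5
y_hat = [1 if pi >= tau else 0 for pi in p]   # thresholding rule: y_hat = 0 iff p < tau
acc, rec, spec, prec = metrics_from_labels(y, y_hat)
```

Here TP = 2, TN = 1, FP = 1, FN = 1, so the sums reproduce accuracy 3/5, recall 2/3, specificity 1/2, and precision 2/3, matching (1).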
This results in the following non-linear optimization formulation:
$$\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{N} \left( -\frac{\alpha}{N}\,|y_i - \hat{y}_i| + \frac{\beta}{N_+}\, y_i \hat{y}_i + \frac{\gamma}{N_-}\,(1 - y_i)(1 - \hat{y}_i) \right) + \zeta\,\frac{\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} \hat{y}_i} \\
\text{subject to:} \quad & \hat{y}_i = \begin{cases} 0, & p_i < \tau \\ 1, & p_i \geq \tau \end{cases} \qquad 1 \leq i \leq N, \\
& \hat{y}_i \in \{0, 1\} \qquad 1 \leq i \leq N, \\
& 0 \leq \tau \leq 1,
\end{aligned} \tag{4}$$
where $\hat{y}_i$ is a binary decision variable, and $\tau$ is a continuous variable that takes a value between 0 and 1. (The constant term $\alpha$ arising from the accuracy expression in (3) is dropped, since it does not affect the maximizer.)
To make the term $|y_i - \hat{y}_i|$ linear, we introduce a new decision variable $z_i$ along with two new constraints for each $i$, listed in (5). More precisely, constraint sets (1) and (2) force $z_i$ to be greater than or equal to $|y_i - \hat{y}_i|$; since $z_i$ enters the objective with a negative coefficient, the solver drives $z_i$ down to exactly $|y_i - \hat{y}_i|$ at optimality. We define $z_i$ as a continuous variable to speed up the optimization process, although it will finally take the value 0 or 1.
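This standard absolute-value linearization can be verified by brute force: for every 0/1 pair, the smallest $z$ satisfying both $z \geq \hat{y} - y$ and $z \geq y - \hat{y}$ equals $|y - \hat{y}|$. A tiny check of ours (not from the paper's code):

```python
# Smallest feasible z under z >= y_hat - y and z >= y - y_hat, for all 0/1 pairs.
tight_z = {(y, yh): max(yh - y, y - yh) for y in (0, 1) for yh in (0, 1)}
```
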
Moreover, to linearize the $\hat{y}_i$ conditions in (4), we combine the conditions on $\tau$ and replace them with two new constraint sets for each $i$ in (5). These constraints encode the relation between $\hat{y}_i$ and $\tau$. For example, if $p_i = 0.7$ and $\tau = 0.5$ for an arbitrary instance $i$, then $\hat{y}_i$ is forced to be 1 by the constraint $0.7\,\hat{y}_i + 0.5 > 0.7$, while $(0.7 - 1)\,\hat{y}_i - 0.5 \geq -1$ is satisfied.
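The behavior of these two constraint sets can be checked exhaustively: for a given $p_i$ and $\tau$, the only binary value satisfying both $p_i \hat{y}_i + \tau > p_i$ and $(p_i - 1)\hat{y}_i - \tau \geq -1$ matches the thresholding rule. A small sketch of ours under these assumptions:

```python
def feasible_labels(p, tau):
    """Return the binary labels satisfying both threshold constraints for one instance."""
    return [yh for yh in (0, 1)
            if p * yh + tau > p             # constraint set 3
            and (p - 1) * yh - tau >= -1]   # constraint set 4

labels_above = feasible_labels(0.7, 0.5)  # p >= tau: only y_hat = 1 is feasible
labels_below = feasible_labels(0.3, 0.5)  # p <  tau: only y_hat = 0 is feasible
```
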
The term $\sum_{i=1}^{N} y_i \hat{y}_i / \sum_{i=1}^{N} \hat{y}_i$, which corresponds to the precision measure, can be linearized using the Charnes–Cooper transformation [21]. However, due to the complicated form of the resulting objective function and the large number of new constraints, we drop the precision measure from (5) by setting $\zeta = 0$. Later, we linearize an optimization formulation whose objective function contains only the precision measure (see (7) and (8)).
Note that other performance measures, such as the false positive rate (FPR) and the false negative rate (FNR), can also be linearized and incorporated into the optimization problem (5) by applying the same approach we used for accuracy, recall, specificity, and precision:
$$\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{N} \left( -\frac{\alpha}{N}\, z_i + \frac{\beta}{N_+}\, y_i \hat{y}_i + \frac{\gamma}{N_-}\,(1 - y_i)(1 - \hat{y}_i) \right) \\
\text{subject to:} \quad & z_i + \hat{y}_i \geq y_i \qquad 1 \leq i \leq N \quad (\text{constraint set } 1) \\
& z_i - \hat{y}_i \geq -y_i \qquad 1 \leq i \leq N \quad (\text{constraint set } 2) \\
& p_i \hat{y}_i + \tau > p_i \qquad 1 \leq i \leq N \quad (\text{constraint set } 3) \\
& (p_i - 1)\hat{y}_i - \tau \geq -1 \qquad 1 \leq i \leq N \quad (\text{constraint set } 4) \\
& \hat{y}_i \in \{0, 1\} \qquad 1 \leq i \leq N \\
& z_i \in \mathbb{R} \qquad 1 \leq i \leq N \\
& 0 \leq \tau \leq 1.
\end{aligned} \tag{5}$$
The proposed ILP formulation in the above equation does not output τ = 0 as the optimal threshold, and this case must be considered separately.
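For reference, the exhaustive-search baseline that (5) is compared against can be sketched as follows: evaluate the weighted objective at every candidate threshold on a fixed decimal grid and keep the best. This is our own illustrative implementation (with precision dropped, i.e., ζ = 0), not the paper's C++ code:

```python
def best_threshold(y, p, alpha, beta, gamma, precision=2):
    """Exhaustive search over a decimal grid of thresholds for the weighted objective."""
    n_pos = sum(y)
    n_neg = len(y) - n_pos
    best = (-1.0, None)
    steps = 10 ** precision
    for k in range(steps + 1):
        tau = k / steps
        y_hat = [1 if pi >= tau else 0 for pi in p]
        tp = sum(yi * yh for yi, yh in zip(y, y_hat))
        tn = sum((1 - yi) * (1 - yh) for yi, yh in zip(y, y_hat))
        acc = (tp + tn) / len(y)
        rec = tp / n_pos
        spec = tn / n_neg
        score = alpha * acc + beta * rec + gamma * spec
        if score > best[0]:
            best = (score, tau)
    return best

# Accuracy-only objective (alpha = 1): any tau in (0.4, 0.6] separates perfectly.
score, tau = best_threshold([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1], 1.0, 0.0, 0.0)
```
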
An optimization formulation that aims to maximize only the precision measure ($\alpha = \beta = \gamma = 0$ and $\zeta = 1$) can be written in matrix form as follows:
$$\begin{aligned}
\text{maximize} \quad & \frac{c^T x}{d^T x} \\
\text{subject to:} \quad & A x \leq b, \\
& x \geq 0,
\end{aligned} \tag{7}$$
where $x \in \mathbb{R}^{N+1}$ and $A \in \mathbb{R}^{2N \times (N+1)}$ are the variable vector and coefficient matrix, defined as
$$x = \begin{pmatrix} \hat{y}_1 \\ \vdots \\ \hat{y}_N \\ \tau \end{pmatrix}, \quad
c = \begin{pmatrix} y_1 \\ \vdots \\ y_N \\ 0 \end{pmatrix}, \quad
d = \begin{pmatrix} 1 \\ \vdots \\ 1 \\ 0 \end{pmatrix}, \quad
A = \begin{pmatrix}
-p_1 & & 0 & -1 \\
& \ddots & & \vdots \\
0 & & -p_N & -1 \\
1 - p_1 & & 0 & 1 \\
& \ddots & & \vdots \\
0 & & 1 - p_N & 1
\end{pmatrix}, \quad
b = \begin{pmatrix} -p_1 - \epsilon \\ \vdots \\ -p_N - \epsilon \\ 1 \\ \vdots \\ 1 \end{pmatrix}, \tag{6}$$
where $\epsilon$ is a sufficiently small positive number that enforces the strict inequality in constraint set 3. Using the Charnes–Cooper transformation and introducing new decision variables $w \in \mathbb{R}^{N+1}$ and $t \in \mathbb{R}$, we can convert the non-linear optimization formulation in (7) into the ILP problem (8):
$$\begin{aligned}
\text{maximize} \quad & c^T w \\
\text{subject to:} \quad & A w \leq b\,t, \\
& d^T w = 1, \\
& w \geq 0, \\
& t \geq 0.
\end{aligned} \tag{8}$$
By solving the above problem, one can retrieve the values of the main decision variables via $x = \frac{1}{t}\, w$.
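The Charnes–Cooper substitution can be sanity-checked numerically: for any feasible $x$ with $d^T x > 0$, setting $t = 1/(d^T x)$ and $w = t\,x$ preserves the objective value and satisfies the normalization constraint $d^T w = 1$. A small sketch with illustrative numbers of our own choosing:

```python
# Toy fractional objective (c^T x)/(d^T x) and its Charnes-Cooper image.
c = [1.0, 0.0, 0.0]
d = [1.0, 1.0, 0.0]
x = [1.0, 1.0, 0.3]          # e.g. (y1_hat, y2_hat, tau)

dot = lambda u, v: sum(a * b for a, b in zip(u, v))
t = 1.0 / dot(d, x)          # t = 1 / (d^T x)
w = [t * xi for xi in x]     # w = t * x

fractional = dot(c, x) / dot(d, x)   # original objective value
linear = dot(c, w)                   # transformed objective value c^T w
```
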
To further reduce the computational complexity of the ILP formulations in (5) and (8), we consider the case where the number of unique ordered pairs $(y_i, p_i)$, $1 \leq i \leq N$, is significantly smaller than the total number of instances $N$. Suppose that $M$ is the number of distinct ordered pairs and $m_i$ is the multiplicity of the $i$th distinct pair. The following ILP formulation is equivalent to (5), with a noticeably smaller number of constraints and decision variables when $M \ll N$:
$$\begin{aligned}
\text{maximize} \quad & \sum_{i=1}^{M} m_i \left( -\frac{\alpha}{N}\, z_i + \frac{\beta}{N_+}\, y_i \hat{y}_i + \frac{\gamma}{N_-}\,(1 - y_i)(1 - \hat{y}_i) \right) \\
\text{subject to:} \quad & z_i + \hat{y}_i \geq y_i \qquad 1 \leq i \leq M \\
& z_i - \hat{y}_i \geq -y_i \qquad 1 \leq i \leq M \\
& p_i \hat{y}_i + \tau > p_i \qquad 1 \leq i \leq M \\
& (p_i - 1)\hat{y}_i - \tau \geq -1 \qquad 1 \leq i \leq M \\
& \hat{y}_i \in \{0, 1\} \qquad 1 \leq i \leq M \\
& z_i \in \mathbb{R} \qquad 1 \leq i \leq M \\
& 0 \leq \tau \leq 1.
\end{aligned} \tag{9}$$
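The reduction from $N$ instances to $M$ distinct $(y_i, p_i)$ pairs amounts to a multiset count: identical pairs are grouped and their multiplicities $m_i$ recorded, so every per-instance term can be weighted by $m_i$ instead of being repeated. A sketch of ours (names and data are illustrative):

```python
from collections import Counter

def collapse(y, p):
    """Group identical (label, probability) pairs and record their multiplicities."""
    counts = Counter(zip(y, p))
    pairs = sorted(counts)                  # the M distinct (y_i, p_i) pairs
    mults = [counts[pair] for pair in pairs]  # multiplicities m_i
    return pairs, mults

y = [1, 1, 0, 1, 0, 0]
p = [0.8, 0.8, 0.2, 0.8, 0.2, 0.5]
pairs, mults = collapse(y, p)   # N = 6 instances collapse to M = 3 distinct pairs
```
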

3. Simulations

The proposed ILP-based approach in (9) was compared with an exhaustive search method with a time complexity of $O(N^2)$ in terms of average execution time (Figure 1). The experiment was implemented in C++, and the open-source mixed-integer linear programming (MILP) solver GLPK [22] was used to solve the ILP problem. The solver takes a decimal precision as input, which sets the granularity of the threshold $\tau$ in our problem. The results demonstrate that as the number of instances ($N$) increases, the ILP method quickly overtakes the exhaustive search in terms of execution time. Moreover, the slight performance drop in the ILP method when increasing the decimal precision from two to three had minimal impact on this advantage.
Note that the exhaustive search method used in the simulation can be improved by, for example, using advanced data structures, but so can the ILP approach. Further investigation of the efficiency of the proposed method is left for future work.
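One such improvement is to sort the instances by predicted probability once and then sweep the threshold across the sorted list, updating the confusion counts incrementally; this brings the exhaustive search down to $O(N \log N)$. A sketch of this idea for the accuracy-only objective, written by us under the stated assumptions (distinct probabilities; ties would need extra care):

```python
def sweep_best_accuracy(y, p):
    """O(N log N) threshold sweep: sort by p, update TP/TN incrementally."""
    order = sorted(range(len(y)), key=lambda i: p[i])
    # Start with tau = 0: every instance is predicted positive.
    tp = sum(y)
    tn = 0
    best_acc, best_tau = (tp + tn) / len(y), 0.0
    for i in order:
        # Raising tau just past p[i] flips instance i to a negative prediction.
        if y[i] == 1:
            tp -= 1
        else:
            tn += 1
        acc = (tp + tn) / len(y)
        if acc > best_acc:
            best_acc, best_tau = acc, min(p[i] + 1e-9, 1.0)
    return best_acc, best_tau

acc, tau = sweep_best_accuracy([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.1])
```
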

4. Real-World Applications

We compared the performance of the ILP approach to that of an exhaustive search in four real-world case studies. While the primary focus of this paper was the speed of the ILP approach, we also provide a detailed discussion on the selection of weights for accuracy, recall, and specificity in this section. For the sake of simplicity, we set the precision weight to zero (see Section 2).

4.1. Medical Diagnosis

We demonstrate the practical applicability of our ILP-based threshold optimization method using the Breast Cancer Wisconsin (Diagnostic) dataset [24]. This dataset is commonly used for binary classification tasks in medical diagnosis, where the goal is to classify tumors as benign or malignant based on various features computed from digitized images of fine needle aspirate (FNA) of breast masses.

4.1.1. Dataset, Preprocessing, and Classifier Training

The Breast Cancer Wisconsin dataset contains 569 instances with 30 numerical features each. The target variable indicates whether a tumor is benign (0) or malignant (1). To ensure robust evaluation, we split the dataset into a training set (70%) and a validation set (30%). The training set was used to build and optimize the classifier, while the validation set provided an unbiased assessment of its performance.
We employed a logistic regression classifier, which is widely used in medical diagnosis due to its interpretability. Logistic regression generates probabilistic outputs, making it suitable for threshold optimization. The classifier predicts the probability that a given tumor is malignant, which is then converted into a binary label using a threshold.

4.1.2. Optimization Using ILP

In medical diagnosis, the consequences of misclassification can vary in severity. For instance, false negatives (failing to identify a malignant tumor) can lead to delayed treatment, which can be life-threatening. On the other hand, false positives (misclassifying a benign tumor as malignant) can result in unnecessary stress for patients and costly follow-up procedures.
To optimize the classifier’s threshold, we applied our ILP-based method to maximize a linear combination of the following performance measures: accuracy, recall, and specificity. Each measure captures a different aspect of classification performance:
  • Accuracy: Ensures overall performance across both benign and malignant cases.
  • Recall: Measures the ability to correctly identify malignant tumors, which is crucial for minimizing false negatives.
  • Specificity: Focuses on correctly identifying benign tumors, reducing unnecessary follow-up procedures.
In this context, recall is the most critical measure, as missing a malignant tumor (a false negative) has severe consequences. Therefore, we assigned the highest weight to recall. However, specificity and accuracy are also important to maintain a balanced performance. We set the weights as follows:
  • α = 0.2 (accuracy): A moderate weight to maintain overall performance.
  • β = 0.6 (recall): The highest priority was given to recall to ensure that malignant cases were identified as often as possible.
  • γ = 0.2 (specificity): Specificity was considered, but with a lower weight than recall, as reducing false positives is important but less critical than detecting malignant cases.
These weights reflect the emphasis on minimizing the risk of false negatives while still considering false positives.

4.1.3. Results and Discussion

Using the probabilities predicted by the logistic regression model on the validation set, we applied our ILP formulation (5) and the exhaustive search to find the optimal threshold τ that maximized the weighted performance metrics.
We used a decimal precision of three in the solver. Both the ILP optimization and the exhaustive search yielded an optimal threshold of τ = 0.316. The execution time for ILP (0.01 s) was approximately half that of the exhaustive search (0.018 s). As illustrated in Figure 1, the advantage of ILP in terms of execution time becomes more pronounced with larger datasets, a trend examined further in subsequent sections.
The optimal threshold (τ = 0.316) achieved the best balance between recall and specificity based on the predefined weights (Table 1). Specifically, the optimized threshold significantly enhanced recall, increasing it from 94.34% (at the default threshold of τ = 0.5) to 97.64%. This improvement indicates that the model successfully identified a greater number of malignant tumors. Although specificity experienced a slight reduction from 97.76% to 95.80%, this trade-off is acceptable in medical diagnostics, where failing to detect malignant tumors (false negatives) is considerably more consequential than incorrectly classifying benign ones (false positives). While the overall accuracy remained high, the weighted performance metric—which accounted for the specified weights—improved from 95.45% to 97.04%, further confirming that the optimized threshold offers a superior balance between recall and specificity.
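The weighted metric values reported above can be reproduced directly from the table entries with the chosen weights (α = 0.2, β = 0.6, γ = 0.2); the function below is our own arithmetic check:

```python
def weighted_metric(acc, rec, spec, alpha=0.2, beta=0.6, gamma=0.2):
    """Weighted combination of accuracy, recall, and specificity (in percent)."""
    return alpha * acc + beta * rec + gamma * spec

default_score = weighted_metric(96.49, 94.34, 97.76)    # tau = 0.5
optimized_score = weighted_metric(96.49, 97.64, 95.80)  # tau = 0.316
```
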
The weights α , β , γ , and ζ can be adjusted based on the specific needs and risk considerations of different medical scenarios. For instance, if resources for follow-up diagnostics are limited, one could increase the weight of specificity to reduce false positives. Alternatively, if minimizing the risk of missed diagnoses is the top priority, the weight on recall can be further increased.

4.2. Earthquake Occurrence Prediction

Earthquake occurrence prediction aims to determine whether an earthquake will happen within a specific time frame and region, a task which is crucial for early warning systems. This problem is commonly formulated as a binary classification task, where each spatiotemporal unit is labeled as earthquake-occurring (1) or non-earthquake-occurring (0). Prediction models utilize seismic and geophysical features such as historical earthquake records, ground deformation, and tectonic stress accumulation.

4.2.1. Dataset and Classifier Training

We utilized the historical spatiotemporal dataset from the United States Geological Survey (USGS) platform [25], focusing on a seismically active region in East Asia bounded by 75–119° E longitude and 23–45° N latitude. The study area was divided into nine equal sub-regions covering the period from October 1966 to September 2016. For evaluation, we applied temporal partitioning: the most recent 30% of instances were designated as the validation set (1584 samples), and the remaining 70% as the training set (3696 samples). A K-Nearest Neighbors (KNN) [26] model was then employed for the binary classification task.

4.2.2. Optimization Using ILP

In the context of earthquake occurrence prediction, the importance of the metrics can be explained as follows:
  • Recall: This is the most critical metric, because missing an earthquake (false negative) could lead to catastrophic consequences by failing to issue a timely warning or take preventive actions.
  • Specificity: While false positives (predicting an earthquake when there is none) should be minimized, they are generally less costly than missing a real earthquake. However, maintaining a reasonable level of specificity is still important to avoid unnecessary alarms and resources being diverted to false events.
  • Accuracy: While accuracy is a useful overall measure, it can be misleading, especially in imbalanced datasets.
Hence, we assigned the following weights to the metrics:
  • α = 0.1 (accuracy);
  • β = 0.6 (recall);
  • γ = 0.3 (specificity).

4.2.3. Results and Discussion

Similar to the previous case, we employed both ILP optimization and an exhaustive search to determine the optimal threshold τ that maximized the weighted performance metric, utilizing the predicted probabilities generated by the classifier. Using a decimal precision of four, both ILP optimization and the exhaustive search identified an optimal threshold of τ = 0.3. ILP completed the process in 0.035 s, making it approximately four times faster than the exhaustive search, which required 0.137 s.
The optimal threshold (τ = 0.3) provided a better balance among the three metrics compared to the default threshold of τ = 0.5 (Table 2). Specifically, the most important metric, recall, improved significantly from 72.65% to 84.65%, resulting in an enhancement of more than 3% in the weighted performance metric.

4.3. Spam Email Detection

Spam email detection is a crucial task in cybersecurity, which is aimed at distinguishing spam (unwanted or malicious emails) from legitimate (ham) emails. With the increasing volume of spam messages, which often contain phishing attempts, malware, or fraudulent content, effective classification is essential for protecting users from security threats and reducing inbox clutter.

4.3.1. Dataset and Classifier Training

We utilized the spam-email dataset from the Kaggle platform [27], which comprises 5695 public email samples. A 40% portion of the dataset (2278 samples) was reserved for evaluation. Initially, common text preprocessing techniques, including stop word removal, lemmatization, and stemming, were applied to the raw email texts. Next, the TF–IDF (Term Frequency–Inverse Document Frequency) method was employed to extract features from each processed email. Finally, a feedforward neural network with three layers was implemented for the binary classification task using the binary cross-entropy loss function.
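The TF–IDF feature extraction step can be sketched in a few lines. This is a minimal stdlib illustration of ours using the plain tf = count/length and idf = ln(N/df) variant (libraries differ in smoothing and normalization), not the paper's pipeline:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Plain TF-IDF: tf = count/len(doc), idf = ln(N / document_frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    features = []
    for doc in docs:
        counts = Counter(doc)
        features.append({term: (cnt / len(doc)) * math.log(n_docs / doc_freq[term])
                         for term, cnt in counts.items()})
    return features

# Toy tokenized "emails" for illustration only.
docs = [["free", "offer", "offer"], ["hello", "friend"], ["free", "friend"]]
vecs = tf_idf(docs)
```
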

4.3.2. Optimization Using ILP

In this context, the importance of the metrics can be ranked as follows:
  • Specificity: It is critical to minimize false positives (misclassifying legitimate emails as spam), especially if important emails are inadvertently sent to the spam folder.
  • Recall: Missing a spam email (false negative) means that a harmful or unwanted message could reach the user’s inbox, which is a concern for security and user experience.
  • Accuracy: Similar to the previous case study, here, accuracy may not be reliable in imbalanced datasets, where spam emails are fewer than non-spam emails.
Therefore, we assigned the following weights to the metrics:
  • α = 0.1 (accuracy);
  • β = 0.3 (recall);
  • γ = 0.6 (specificity).

4.3.3. Results and Discussion

With a decimal precision of three, both ILP optimization and the exhaustive search determined the optimal threshold to be τ = 0.376. Regarding execution time, ILP finished the process in 0.073 s, over three times faster than the exhaustive search, which took 0.259 s. Using higher decimal precisions resulted in the same weighted metric. The optimized threshold (τ = 0.376) provided a more favorable balance among the three metrics than the default threshold of τ = 0.5, yielding an improvement of over 1% in the weighted metric (Table 3).

4.4. Sentiment Analysis of Movie Reviews

Sentiment analysis of movie reviews involves using natural language processing (NLP) to determine whether the sentiment expressed in a review is positive or negative. This helps filmmakers, marketers, and production companies understand audience reception, improve marketing strategies, make informed production decisions, and predict a movie’s success.

4.4.1. Dataset and Classifier Training

We used the IMDB-Review dataset [28], consisting of 50,000 reviews from the IMDB platform. In total, 25,000 samples ( 50 % of the total dataset) were designated as the validation set for evaluation. As with the previous task, standard text preprocessing methods were applied to the raw reviews. The BERT model [29] was then leveraged to extract features from the text, followed by the use of a two-layer classification head for the binary sentiment analysis task.

4.4.2. Optimization Using ILP

For this problem with balanced data, we considered accuracy to be more important than the other metrics, as it captures the overall performance of the model. Therefore, we assigned the following weights to the metrics:
  • α = 0.7 (accuracy);
  • β = 0.2 (recall);
  • γ = 0.1 (specificity).

4.4.3. Results and Discussion

Using a decimal precision of three, both ILP optimization and the exhaustive search identified the optimal threshold as τ = 0.455. In terms of execution time, ILP completed the process in 1.53 s, more than twenty times faster than the exhaustive search, which required 35.34 s. The optimized threshold (τ = 0.455) provided a better balance across the three metrics than the default threshold of τ = 0.5, leading to an improvement of approximately 0.6% in the weighted metric (Table 4).

5. Conclusions

We have introduced an integer linear programming (ILP) approach to optimize thresholds in binary classification, aiming to maximize any linear combination of accuracy, recall, specificity, and precision by linearizing their non-linear components. Our simulations demonstrate that the ILP-based method significantly outperformed exhaustive search techniques in computational efficiency, particularly for large datasets, making it highly suitable for real-world applications in healthcare, finance, cybersecurity, and earthquake prediction, where specific performance measures are prioritized. This approach allows practitioners to tailor threshold optimization to their specific needs, such as maximizing recall in medical diagnostics or precision in spam detection. Future work may extend this formulation to multi-class classification problems, incorporate additional performance metrics like the F1 score or area under the ROC curve (AUC), and integrate the ILP approach into broader machine learning pipelines, further enhancing its practical utility.

Author Contributions

A.M. performed the analysis, conducted the simulations, and led the writing; P.R. supervised the study and contributed to the writing. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

All codes and generated instances are available at the online repository https://github.com/Arash-Mari-Oriyad/optimal_threshold (accessed on 24 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Fawcett, T. An introduction to ROC analysis. Pattern Recognit. Lett. 2006, 27, 861–874. [Google Scholar] [CrossRef]
  2. Hanley, J.A.; McNeil, B.J. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982, 143, 29–36. [Google Scholar]
  3. Kononenko, I. Machine learning for medical diagnosis: History, state of the art and perspective. Artif. Intell. Med. 2001, 23, 89–109. [Google Scholar] [CrossRef]
  4. Phua, C.; Lee, V.; Smith, K.; Gayler, R. A comprehensive survey of data mining-based fraud detection research. arXiv 2010, arXiv:1009.6119. [Google Scholar]
  5. Metsis, V.; Androutsopoulos, I.; Paliouras, G. Spam filtering with Naive Bayes—Which Naive Bayes? In Proceedings of the Third Conference on Email and Anti-Spam (CEAS), Mountain View, CA, USA, 27–28 July 2006.
  6. Heggerud, C.M.; Xu, J.; Wang, H.; Lewis, M.A.; Zurawell, R.W.; Loewen, C.J.; Vinebrooke, R.D.; Ramazi, P. Predicting imminent cyanobacterial blooms in lakes using incomplete timely data. Water Resour. Res. 2024, 60, e2023WR035540. [Google Scholar] [CrossRef]
  7. Wang, Q.; Guo, Y.; Yu, L.; Li, P. Earthquake prediction based on spatio-temporal data mining: An LSTM network approach. IEEE Trans. Emerg. Top. Comput. 2017, 8, 148–158. [Google Scholar] [CrossRef]
  8. Cullerne Bown, W. Sensitivity and Specificity versus Precision and Recall, and Related Dilemmas. J. Classif. 2024, 41, 402–426. [Google Scholar]
  9. Zeng, S.; Li, X.; Liu, Y.; Huang, Q.; He, Y. Automatic Annotation Diagnostic Framework for Nasopharyngeal Carcinoma via Pathology–Fidelity GAN and Prior-Driven Classification. Bioengineering 2024, 11, 739. [Google Scholar] [CrossRef]
  10. Powers, D.M. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar]
  11. Alvarez, S.A. An Exact Analytical Relation Among Recall, Precision, and Classification Accuracy in Information Retrieval; Technical Report BCCS-02-01; Boston College: Boston, MA, USA, 2002; pp. 1–22. [Google Scholar]
  12. Berger, C.; Guda, C. Threshold Optimization for F-measure and Macro-averaged Precision and Recall. arXiv 2020, arXiv:2001.05647. [Google Scholar] [CrossRef]
  13. Pillai, I.; Fumera, G.; Roli, F. On the Detection of Thresholds in Multi-label Classification. Pattern Recognit. Lett. 2013, 34, 513–519. [Google Scholar]
  14. Tasche, D. A Plug-in Approach to Maximizing Precision at the Top and Recall at the Top. arXiv 2018, arXiv:1804.03077. [Google Scholar]
  15. Arora, G.; Merugu, S.; Saladi, A.; Rastogi, R. Leveraging Uncertainty Estimates to Improve Classifier Performance. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
  16. Koseoglu, B.; Traverso, L.; Topiwalla, M.; Kraev, E.; Szopory, Z. OTLP: Output Thresholding Using Mixed Integer Linear Programming. arXiv 2024, arXiv:2405.11230. [Google Scholar]
  17. Sanchez, I.E. Optimal Threshold Estimation for Binary Classifiers Using Game Theory. F1000Research 2016, 5, 2762. [Google Scholar] [CrossRef]
  18. Huang, J.; Ling, C.X. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans. Knowl. Data Eng. 2005, 17, 299–310. [Google Scholar] [CrossRef]
  19. Juba, B.; Le, H.S. Precision-Recall versus Accuracy and the Role of Large Data Sets. Proc. AAAI Conf. Artif. Intell. 2019, 33, 4039–4048. [Google Scholar] [CrossRef]
  20. Glaros, A.G.; Kline, R.B. Understanding the accuracy of tests with cutting scores: The sensitivity, specificity, and predictive value model. J. Clin. Psychol. 1988, 44, 1013–1023. [Google Scholar] [CrossRef]
  21. Charnes, A.; Cooper, W.W. Programming with linear fractional functionals. Nav. Res. Logist. Q. 1962, 9, 181–186. [Google Scholar] [CrossRef]
  22. Makhorin, A. GNU Linear Programming Kit (GLPK). Available online: https://www.gnu.org/software/glpk (accessed on 15 January 2025).
  23. Çalik, S.; Güngör, M. On the expected values of the sample maximum of order statistics from a discrete uniform distribution. Appl. Math. Comput. 2004, 157, 695–700. [Google Scholar] [CrossRef]
  24. Wolberg, W.H.; Street, W.N.; Mangasarian, O.L. Breast cancer Wisconsin (diagnostic) data set. UCI Mach. Learn. Repos. 1992. [Google Scholar]
  25. United States Geological Survey (USGS). Available online: https://www.usgs.gov/ (accessed on 15 January 2025).
  26. Altman, N.S. An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992, 46, 175–185. [Google Scholar] [CrossRef]
  27. jackksoncsie. Spam Email Dataset. Available online: https://www.kaggle.com/datasets/jackksoncsie/spam-email-dataset/data (accessed on 15 January 2025).
  28. Hugging Face. IMDB Dataset. Available online: https://huggingface.co/datasets/imdb (accessed on 25 January 2025).
  29. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019. [Google Scholar]
Figure 1. Comparison between the ILP and exhaustive search methods in terms of average execution time. Each curve shows the ratio of the ILP method's average execution time to that of the exhaustive search, on a logarithmic scale, with α = 1, β = 0, γ = 0, and ζ = 0. The x-axis gives the number of instances in thousands. The red and blue curves correspond to predicted values with at most two and three decimal digits, respectively. All experiments were repeated 100 times. The binary labels y_i were drawn from the discrete uniform distribution [23] on {0, 1}, and the predictions p_i from the continuous uniform distribution on the interval [0, 1].
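As a hypothetical illustration of the simulation setup described in the caption of Figure 1, the following Python sketch generates synthetic labels and predictions in the same way (binary labels from a discrete uniform on {0, 1}, predictions from a continuous uniform on [0, 1] with two decimal digits) and runs the exhaustive-search baseline that the ILP method is compared against. All variable names and the accuracy-only objective (α = 1, β = γ = ζ = 0) follow the caption; the code itself is our own sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)               # binary labels from a discrete uniform on {0, 1}
p = np.round(rng.uniform(0, 1, size=n), 2)   # predictions with at most two decimal digits

def accuracy(y, p, tau):
    """Accuracy of the deterministic labels obtained by thresholding p at tau."""
    labels = (p >= tau).astype(int)
    return np.mean(labels == y)

# Exhaustive search: evaluate every candidate threshold on the grid of
# two-decimal values and keep the one with the highest accuracy.
candidates = np.linspace(0.0, 1.0, 101)
best_tau = max(candidates, key=lambda t: accuracy(y, p, t))
```

Since the predictions have two decimal digits, the grid of 101 candidate thresholds is exhaustive; with three decimal digits the grid grows to 1001 points, which is the regime where the ILP formulation's speed advantage becomes pronounced.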
Table 1. Performance metrics comparison between default and optimized thresholds with a decimal precision of three for medical diagnosis.
| Threshold | Recall | Specificity | Accuracy | Weighted Metric |
|---|---|---|---|---|
| Default (τ = 0.5) | 94.34% | 97.76% | 96.49% | 95.45% |
| Optimized (τ = 0.316) | 97.64% | 95.80% | 96.49% | 97.04% |
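The weighted metric reported in the tables is a linear combination of the individual performance measures evaluated at a given threshold. As a sketch, the measures can be computed from the confusion matrix as below; the weights shown are hypothetical placeholders, not the combination used in the paper.

```python
import numpy as np

def threshold_metrics(y, p, tau):
    """Recall, specificity, and accuracy of labels obtained by thresholding p at tau."""
    pred = (p >= tau).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    tn = np.sum((pred == 0) & (y == 0))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return recall, specificity, accuracy

def weighted_metric(y, p, tau, w):
    """Linear combination of the measures; the weights w are illustrative only."""
    r, s, a = threshold_metrics(y, p, tau)
    return w["recall"] * r + w["specificity"] * s + w["accuracy"] * a

# Hypothetical weights for illustration.
weights = {"recall": 0.5, "specificity": 0.25, "accuracy": 0.25}
```

Comparing `weighted_metric` at the default threshold 0.5 and at an optimized threshold reproduces the kind of comparison reported in Tables 1–4.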
Table 2. Performance metrics comparison between default and optimized thresholds, with a decimal precision of four for earthquake occurrence prediction.
| Threshold | Recall | Specificity | Accuracy | Weighted Metric |
|---|---|---|---|---|
| Default (τ = 0.5) | 72.65% | 73.84% | 73.23% | 73.06% |
| Optimized (τ = 0.3) | 84.65% | 60.69% | 72.92% | 76.29% |
Table 3. Performance metrics comparison between default and optimized thresholds, with a decimal precision of three for spam email detection.
| Threshold | Recall | Specificity | Accuracy | Weighted Metric |
|---|---|---|---|---|
| Default (τ = 0.5) | 93.77% | 99.82% | 98.29% | 97.85% |
| Optimized (τ = 0.376) | 98.62% | 99.12% | 98.99% | 98.95% |
Table 4. Performance metrics comparison between default and optimized thresholds, with a decimal precision of three for sentiment analysis of movie reviews.
| Threshold | Recall | Specificity | Accuracy | Weighted Metric |
|---|---|---|---|---|
| Default (τ = 0.5) | 74.03% | 83.59% | 78.81% | 78.33% |
| Optimized (τ = 0.455) | 78.38% | 79.55% | 78.97% | 78.91% |

Share and Cite

Marioriyad, A.; Ramazi, P. Optimizing Accuracy, Recall, Specificity, and Precision Using ILP. Mathematics 2025, 13, 1059. https://doi.org/10.3390/math13071059

