1.2.1. Citation Function Categories: Development and Utilization
Many papers have reviewed the literature on categorization by citation function. This section instead focuses on listing the proposed categories, understanding the basis on which they were arranged, and the utility of selecting particular categories. This information is helpful for developing a categorization scheme grounded in previous studies.
Various categories of citation functions have been proposed. The earliest typology of citation function was proposed by Moravcsik and Murugesan in 1975 [12]. They divided the citation function into four paired categories: conceptual or operational, organic or perfunctory, evolutionary or juxtapositional, and confirmative or negational. These initial categories were first adopted by Chubin and Moitra [13] in a content analysis of references as an adjunct to citation counting. In studies using machine learning (ML) for citation function classification, the most straightforward scheme consists of three categories: background/information, methods, and result comparisons [14]. This scheme was chosen because it reflects the structure of scientific articles, is applicable for exploring topics, and is easy to implement using ML. Another scheme consists of four categories, namely (Weak)ness, Compare and Contrast (CoCo), (Pos)itive, and (Neut)ral [11]. Unfortunately, only the CoCo category, defined as comparing methods or results, is easy to comprehend. This scheme is structured to classify the citation and provenance functions simultaneously, but its further objectives were not explained. Bakhti et al. [15] proposed a new scheme consisting of five citation functions: useful, contrast, mathematical, correct, and neutral. This general categorization scheme can be used across various fields of science, is easily distinguished by human annotators, serves shared implementation needs, and can be used for building automatic classification models.
Three studies have divided citation functions into six nearly similar categories. Yousif et al. [7] compiled categories consisting of criticism, comparison, use, substantiation, basis, and neutral. Perier-Camby et al. [10] created a new scheme consisting of background, motivation, uses, extension, comparison or contrast, and future. Zhao et al. [8] divided the functions into use, extend, produce, compare, introduction, and other. Four of the six categories compiled by these studies are nearly identical, namely use, compare, extend, and background/introduce. Of the three studies, only one explains the basis for its selection of categories, namely previous literature, but it does not specify how the selection proceeded. Two of the studies completed the automatic classification process without stating how the resulting classification systems were to be utilized; only one explained that its classification scheme was developed to construct a reference recommendation system [7].
Teufel et al. [9] produced a more detailed citation function categorization scheme by translating a general scheme into more specific categories. The scheme consists of four general categories, which are then broken down into 12 further categories. The general categories are weakness, contrast, positive, and neutral. Contrast is divided into four categories related to the section in which the comparison occurs, such as methods and results. The positive class is divided into six categories that indicate the type of agreement that occurs, such as with problems, ideas, and methods. Rachman et al. [6] modified this scheme to create a document summarization system.
Citation function categorization schemes vary according to the purpose of the classification to be carried out. Only a small part of the literature focuses on the basis, development, and evaluation of new schemes; most studies develop schemes that are used directly in the classification process by humans and machines. Assessments of the results, both in terms of inter-annotator agreement and machine accuracy, are not always related to the scheme's quality, as they can also reflect the classification model's accuracy. This paper therefore refers to the literature [9] that evaluates a scheme first, before using it to classify large amounts of data.
1.2.2. Classification Scheme Evaluation Methods
Many studies have conducted citation classification, but few have focused on evaluating the schemes themselves. In general, the evaluation of a classification scheme in linguistics aims to ensure that the compiled dataset can be used appropriately in the classification process. Seaghdha [16] proposed several criteria for evaluating annotation schemes that are relevant to all classifications concerned with the meaning of language: (1) the developed categories must account, to the greatest degree possible, for the characteristics of the dataset; (2) there must be coherence (i.e., clarity) in concepts and category boundaries to prevent overlap; (3) there must be a balanced distribution between classes; (4) there must be ease of use, arising from a coherent and simple scheme supported by detailed guidelines; and (5) there must be utility, in the sense that the categories should provide information that can be used further. Some of these criteria can be measured quantitatively when evaluating a scheme. Coherence and ease of use are evaluated by measuring agreement and disagreement between classifiers, as sketched below. Balance can be seen in the resulting distribution of citations, and its effect can be assessed using an automatic classification algorithm. The categories' coverage can be seen by comparing several schemes, ranging from simple to complex, against one corpus, but this analysis is rarely done. Utility remains subjective, because the objectives behind classification schemes differ.
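To make the quantitative side of such an evaluation concrete, the following is a minimal sketch, assuming a Python environment with scikit-learn, of how inter-annotator agreement (Cohen's kappa) and class balance can be checked. The two-annotator setup and the citation labels are hypothetical illustrations, not data from the studies cited here.

```python
# A minimal sketch of the quantitative checks described above:
# inter-annotator agreement (coherence/ease of use) and class balance.
# The labels below are hypothetical examples, not data from the paper.
from collections import Counter

from sklearn.metrics import cohen_kappa_score

# Citation-function labels assigned independently by two annotators.
annotator_a = ["use", "compare", "background", "use", "extend", "background"]
annotator_b = ["use", "compare", "background", "compare", "extend", "background"]

# Cohen's kappa corrects raw agreement for agreement expected by chance;
# values near 1.0 suggest a coherent, easy-to-apply scheme.
kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")

# Class balance: a heavily skewed distribution signals that the scheme
# may need merged or redefined categories before automatic classification.
print(Counter(annotator_a))
```

A kappa well below the commonly cited 0.67-0.8 range would suggest revising category definitions or annotation guidelines before large-scale annotation.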
Most evaluation studies use measurements that refer to the above criteria. Boldrini et al. [17] evaluated manual categorizations using correlations and agreement between categories. Ritz et al. [18] used a similar method and suggested intensive training of annotators before classification to reduce the error rate. Teufel et al. [9] evaluated classification schemes using an agreement value between three annotators, calculated and discussed annotation errors in a narrative, and compared machines' classification results with those of humans. A similar method was used by Palmer et al. [19] and Ovrelid [20] to compare automatic and manual classification results and to check which category had the most errors; the results identified the classes that are most difficult for humans and machines to identify. Ibanez and Ohtani [21] describe the design, evaluation, and improvement process for classification schemes in more detail: they analyzed the types of disagreement that can arise, improved the guidelines, created new schemes to improve reliability, and then produced gold-standard classifications.
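One way to operationalize the error comparisons described above is a confusion matrix between gold (human) labels and machine predictions, from which the category with the most errors can be read off. The sketch below assumes scikit-learn; the label set and example annotations are hypothetical.

```python
# Sketch: locate the citation-function category with the most errors by
# cross-tabulating gold (human) labels against machine predictions.
# Labels here are hypothetical illustrations.
from sklearn.metrics import confusion_matrix

labels = ["use", "compare", "extend", "background"]
gold = ["use", "compare", "extend", "background", "use", "compare"]
predicted = ["use", "background", "extend", "background", "compare", "compare"]

cm = confusion_matrix(gold, predicted, labels=labels)

# Off-diagonal counts in each row are that category's errors.
for row, label in zip(cm, labels):
    errors = row.sum() - row[labels.index(label)]
    print(f"{label}: {errors} misclassified of {row.sum()}")
```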
1.2.3. Automatic Classification Evaluation
Automatic classification via ML and deep learning (DL) is a widely used method for evaluating citation classification schemes. DL is increasingly common, but shallow learning is still used as the primary comparative baseline. Rachman et al. [6] developed a linear Support Vector Machine (SVM) classifier for citation functions based on a modified version of Teufel's scheme; using a small amount of data, it produced an F1-score of 68%. Taskin and Al [22], using an enormous amount of data with a Binomial Naïve Bayes (NB) classifier, achieved an F1-score of 78%. Both results were better than those of an experiment conducted by Zhao et al. [23] using a Long Short-Term Memory (LSTM) classifier, which achieved an F1-score of 63%. Bakhti et al. [15] used Convolutional Neural Networks (CNN) to achieve an F1-score of 63%, and Perier-Camby et al. [10], using the Biattentive Classification Network (BCN) model with ELMo (Embeddings from Language Models), achieved an F1-score of 58%. These results show that DL does not always provide superior results; many factors influence the outcome, such as the classification scheme and the dataset. Nevertheless, CNN, LSTM, and their variants are the most widely used classifiers in citation function classification.
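For reference, a shallow-learning baseline of the kind used in these studies can be as simple as a linear SVM over TF-IDF features of the citing sentence. The following is a minimal sketch assuming scikit-learn; the sentences, labels, and hyperparameters are illustrative stand-ins, not the setups of the cited experiments.

```python
# Minimal shallow-learning baseline for citation function classification:
# TF-IDF features of the citing sentence fed to a linear SVM.
# Sentences and labels are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

sentences = [
    "We adopt the method of [12] to segment the corpus.",
    "Unlike [7], our model does not require labeled data.",
    "Prior work has studied citation behavior extensively [3].",
    "Our results outperform the baseline reported in [9].",
] * 25  # repeated only to give the toy split enough samples
labels = ["use", "compare", "background", "compare"] * 25

X_train, X_test, y_train, y_test = train_test_split(
    sentences, labels, test_size=0.2, random_state=42
)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(X_train, y_train)

# Macro F1 weights every citation function equally, which matters
# because citation-function classes are typically imbalanced.
print(f1_score(y_test, model.predict(X_test), average="macro"))
```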
DL methods that have recently been used with better results include Recurrent Neural Networks (RNN), CNN, BiLSTM, and a combination of CNN and BiLSTM. Su et al. [11] used a CNN and achieved an accuracy of 69%. Cohan et al. [14], using GloVe and ELMo embeddings with a BiLSTM in a scaffolding structure, achieved an F1-score of 67% on a small dataset (1941 sentences); the score rose to 84% with 11,020 sentences. Experiments with big data (>50,000 sentences) were conducted by Zhao et al. [8], who used an RNN and a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to achieve an F1-score of 78%. Meanwhile, Yousif et al. [7] used a combination of CNN and BiLSTM to achieve an F1-score of 88%, among the highest scores achieved in any reported experiment. The present study uses several of these previously used classifiers to evaluate the proposed classification scheme.
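As an illustration of the CNN and BiLSTM combination mentioned above, the sketch below chains a convolutional feature extractor with a bidirectional LSTM for sentence-level citation function classification. It assumes TensorFlow/Keras; the vocabulary size, sequence length, number of classes, and layer sizes are illustrative assumptions rather than the configurations of the cited studies.

```python
# Sketch of a CNN + BiLSTM sentence classifier of the kind combined in
# the studies above. Hyperparameters are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20_000   # assumed vocabulary size
MAX_LEN = 50          # assumed max tokens per citing sentence
NUM_CLASSES = 6       # e.g., a six-category citation-function scheme

model = tf.keras.Sequential([
    layers.Input(shape=(MAX_LEN,), dtype="int32"),
    layers.Embedding(VOCAB_SIZE, 128),           # learned word embeddings
    layers.Conv1D(128, 5, activation="relu"),    # local n-gram features
    layers.MaxPooling1D(2),
    layers.Bidirectional(layers.LSTM(64)),       # sentence-level context
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In this arrangement, the convolutional layer captures local cue phrases (e.g., "we adopt", "in contrast to"), while the BiLSTM integrates those features across the whole citing sentence.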