1. Introduction
Abbreviations see widespread, everyday use across multiple languages and domains. Their ease of use and time-saving benefits have encouraged their adoption in both professional and casual contexts. Barnett and Doubleday analyzed the titles and abstracts of scientific literature published between 1950 and 2019 for acronym usage; they found an increase from 0.7 acronyms per 100 words in 1950 to 2.4 per 100 words in 2019 [1].
The growing number of abbreviations in the professional sciences alone causes clarity issues and misunderstandings when readers try to recover the long-form definitions. A survey by Sheppard et al. [2] found more than 2286 abbreviations in use across 25 clinical handout sheets. According to Tariq and Sharma, over the last twenty years approximately 7000 to 10,000 people have died from medical mistakes every year in the United States alone, and misunderstandings about abbreviations have contributed to roughly 5% of these deaths [3]. Reducing the ambiguity of abbreviations could therefore save a meaningful share of those lives, and Liu et al. showed that expanding abbreviations into their definitions substantially improved patients' comprehension of their health records [4].
While abbreviations can pose an immediate and fatal danger in the medical field, ambiguity around them plagues every text-heavy field, including computer science. Source code relies on many creative shortenings of words and phrases to speed up typing. Incorrect or excessive abbreviations, however, can severely hurt the readability and comprehensibility of code [5]. Jiang et al. found that overly long identifiers in code hindered reader comprehension and recommended frequent use of abbreviations to improve understanding; Hales et al. countered that this invites more instances of misuse, which, when paired with esoteric jargon, can leave readers alienated and overwhelmed, increasing confusion and misunderstanding [6].
Before modifying their properties to suit professional usage, we must first establish what acronyms and abbreviations are. An acronym is a type of abbreviation in which a phrase or sequence of words is shortened to the initials of each major word (e.g., DOE stands for Department of Energy). An abbreviation, more generally, is a shortening of a single word, for example by contracting parts of it (e.g., dept stands for department). Each type of abbreviation tends to follow a loose set of rules and cues that hint to the reader that it is an abbreviation. Taghva and Gilbreth created the first rule-based machine algorithm to detect acronyms. Their algorithm required a candidate acronym to be at least three letters long and its full nomenclature or definition to appear within a certain window of text [7]. This later developed into a machine learning approach that used hidden Markov models (HMMs) to select the definition within a proportionate window size [8]. In a later section, we modify some of these general-purpose rules to better suit the medical domain and contraction-type abbreviations.
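As a rough illustration of such rule-based detection, the following sketch (a simplified version in the spirit of Taghva and Gilbreth's rules, not their original implementation; the function name is our own) treats any all-capital token of at least three letters as a candidate acronym and searches a preceding window, proportional to the acronym's length, for a run of words whose initials spell it out:

```python
def find_acronym_definitions(text, window_factor=2):
    """Heuristic acronym detector: an all-caps token of >= 3 letters is a
    candidate, and a definition is a run of preceding words whose initials
    spell the acronym, within a window proportional to the acronym length."""
    results = {}
    tokens = text.split()
    for i, tok in enumerate(tokens):
        word = tok.strip("().,;")
        if len(word) >= 3 and word.isalpha() and word.isupper():
            window = tokens[max(0, i - window_factor * len(word)):i]
            # Scan the window for a span of words whose initials match.
            for start in range(len(window)):
                span = window[start:start + len(word)]
                if len(span) == len(word) and \
                   "".join(w[0].upper() for w in span) == word:
                    results[word] = " ".join(span)
    return results

print(find_acronym_definitions("The Department of Energy (DOE) funds research."))
```

Even this toy version illustrates why such rules break down in practice: authors who never capitalize or parenthesize their acronyms, or who define them far from first use, evade the heuristics entirely.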
Currently, there are three main approaches to handling abbreviations in machine reading comprehension. The first uses rule-based or statistical models, such as HMMs or decision trees, which rely on the cues and inferences present in the document to select the best definition. These approaches often do well when data are structured and uniform, but fail in real-life scenarios, because abbreviations do not follow a uniform set of rules observed by every author. In particular, these methods require some regulation of usage, such as writing acronyms in all capital letters or introducing them with a parenthesized definition. Some models use prior contextual knowledge to determine the best expansion, or attempt to find the definition nearby.
The second approach is word sense disambiguation (WSD). This approach typically uses the context to choose, from a set of potential definitions, the one that makes the most sense. Sultan et al. considered parts of speech (POS) to determine whether candidate definitions fit their grammatical slot; in addition, they built pipelines of alignment modules to measure how closely words are related, based on their syntactic dependencies [9]. Other WSD researchers adopted graph-based approaches such as PageRank [10,11] to map relationships between pages and words. This developed further as researchers began representing word senses as vertices and semantic relationships derived from WordNet as edges, constructing a disambiguation graph [12]. The disambiguation graph captures how words relate to one another in context and offers deeper insight into which senses fit which contexts. In a similar spirit, term frequency-inverse document frequency (TF-IDF) has been used to relate context to frequency of occurrence [13,14]. TF-IDF is a common indexing method for analyzing relevance, with applications in areas such as web page retrieval and recommendation systems, but it can also be used to analyze text and the context of words. Turtel and Shasha used TF-IDF for acronym disambiguation [15], and Li et al. paired word embeddings with TF-IDF for the same task [16].
The third approach is to use a transformer model such as BERT (bidirectional encoder representations from transformers) [17,18]. Traditionally, recurrent neural networks with long short-term memory units were the standard for natural language models; BERT has since replaced most of them. In essence, BERT improves on recurrent models by attending to context in both directions rather than processing text sequentially. Previously, we demonstrated some of the power of BERT in generating abbreviation definitions [19] and further tested it successfully on ambiguous definitions [13]. Daza et al. used a SloBERTa model with an additional single neural layer to tackle abbreviation disambiguation for Slovenian biographical lexicons [20]. A further benefit of BERT-style models is the availability of variants pre-trained on domain-specific data, including, but not limited to, RoBERTa [21], SciBERT [22], and ALBERT [23].
Within this scope, many limitations remain. The most salient problem is the lack of domain-specific and centralized data. For proprietary and personal data, especially in the medical field, there are few datasets that cover all definitions together with their multiple abbreviated forms. Additionally, ambiguous definitions can exist across domains, with no accounting for these disparities. There has been some progress toward consolidating publicly available data, such as the MeDAL database, which comprises 14,393,619 medical articles and abstracts. Wen et al. used this dataset to pre-train several machine learning models, with results showing improved accuracy in defining abbreviations [24]. Alternatively, Skreta et al. combined data sampling and reverse sampling (RS) to create their dataset automatically, without human aid [25].
Abbreviations, with their convenience and efficiency gains in all text-related fields, will continue to see prolific professional use. Inevitably, this leads to misunderstanding and misinterpretation due to ambiguous definitions and over-use. However, with insight into how abbreviations are commonly formed, and by keeping to a standard set of abbreviation guidelines, we could minimize these potential misreadings. In this paper, we introduce a novel way to generate ad hoc abbreviations, produce their definitions, and reverse engineer their candidate definitions. Our work is still at an early stage, but we have already found several interesting statistics, such as higher definition-retrieval rates for abbreviations retaining 40% of their characters, and trends such as a 93% omission rate for vowels. These preliminary results can help lay foundations and motivate rules for contraction-type abbreviations.
2. Materials and Methods
To fully tackle abbreviation disambiguation, the ideal scenario would be a dictionary mapping every possible abbreviation to its definitions. Realistically, this is not feasible, as new abbreviations and variants are constantly being coined. However, it may be possible to generate all, or nearly all, abbreviations likely to be used, and to assign them sufficient context to support WSD and the selection of the best candidate expansion.
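The generation direction can be illustrated with a toy rule (a deliberately simplistic, hypothetical sketch, not our actual generation procedure): produce candidate contractions of a word by dropping subsets of its interior vowels, a pattern consistent with the high vowel-omission rate noted in the Introduction:

```python
from itertools import combinations

VOWELS = set("aeiou")

def vowel_drop_abbreviations(word):
    """Generate candidate contractions by dropping non-empty subsets of
    interior vowels, always keeping the first and last characters."""
    interior = range(1, len(word) - 1)
    vowel_idx = [i for i in interior if word[i] in VOWELS]
    out = set()
    for r in range(1, len(vowel_idx) + 1):
        for drop in combinations(vowel_idx, r):
            out.add("".join(ch for i, ch in enumerate(word) if i not in drop))
    return sorted(out)

print(vowel_drop_abbreviations("department"))  # includes "dprtmnt"
```

A word with k interior vowels yields 2^k - 1 candidates under this rule, so a real generator must prune aggressively or score candidates by plausibility rather than enumerate blindly.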
Currently, most work and data have focused on acronyms; much less has been done on contraction-based abbreviations. Contractions are the subset of abbreviations in which a single word, rather than a phrase, is shortened. Like acronyms, contraction-type abbreviations can be framed as a matching problem, but at the character rather than the word level. Pennel et al. introduced a normalization model for contractions found in Twitter data [26]. Their method leans toward a statistical model for detection and relies on manual annotation to create reasonable abbreviations.
Traditionally, acronyms have distinctive characteristics, such as consisting entirely or mostly of capital letters, being annotated within parentheses, and having their long-form definitions provided nearby. Contractions have traits that are harder for a computer to parse: they may include punctuation inside the word, resemble misspellings, and contain letter sequences that do not allow fluid pronunciation. What makes this type of abbreviation harder still to expand is the absence of a provided definition. The underlying goal of contraction expansion is to identify which long-form definition the shortened string could plausibly belong to.
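This plausibility check can be sketched as a character-level subsequence test (an illustrative simplification; the function names are our own, not part of a published system): a contraction soundly fits a long form if its characters occur in the word in order, starting from the same first letter:

```python
def could_expand_to(contraction, word):
    """Return True if `contraction` is an in-order character subsequence of
    `word` that preserves the first letter (e.g., 'dept' -> 'department')."""
    c, w = contraction.lower(), word.lower()
    if not c or not w or c[0] != w[0]:
        return False
    it = iter(w)
    # `ch in it` advances the iterator, enforcing in-order matching.
    return all(ch in it for ch in c)

def candidate_expansions(contraction, vocabulary):
    """Rank plausible long forms; shorter expansions are preferred as a
    naive tie-breaker (a real system would use context, as in WSD)."""
    hits = [w for w in vocabulary if could_expand_to(contraction, w)]
    return sorted(hits, key=len)

vocab = ["department", "deployment", "depart", "dependent", "message"]
print(candidate_expansions("dept", vocab))
```

Note that the subsequence test alone is highly ambiguous ("dept" matches "deployment" and "dependent" as well as "department"), which is precisely why the surrounding context must drive the final selection.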
Figure 1 shows our experimental design process. In the later sections, we will elaborate more on the data generation process, word selection, abbreviation process, and reconstruction of the definitions.