**1. Introduction**

*Mycobacterium* species that do not cause tuberculosis are referred to as non-tuberculous mycobacteria (NTM) and are ubiquitous in nature. NTM cause pulmonary diseases in which organisms of the *Mycobacterium avium* complex (MAC) are widely distributed [1]. The incidence of infection caused by *M. avium* is higher than that of other *Mycobacterium* species. For example, a literature survey showed that the rate of pulmonary infection in Japan is sevenfold greater for *M. avium* than for any other *Mycobacterium* species [2]. MAC consists of two closely related species, *M. intracellulare* and *M. avium* [3]. Furthermore, *M. avium* comprises four subspecies: *M. avium* subsp. *paratuberculosis* (MAP), *M. avium* subsp. *avium* (MAA), *M. avium* subsp. *silvaticum* (MAS) and *M. avium* subsp. *hominissuis* (MAH), each of which is host specific. The first two subspecies cause avian infection, while the third causes diseases in wild livestock and the last is the most common pathogen in humans and other mammals, including pigs, and therefore has a large economic impact [4].

MAH is an opportunistic pathogen that causes disseminated and pulmonary infections in immunocompromised patients, such as those with AIDS, leukemia or lung diseases and those undergoing chemotherapy [5,6]. Both bacterial virulence factors and host-related risk factors contribute to MAC pulmonary diseases. The prevalence of the disease is relatively high in women; however, much about the bacterial virulence factors remains unknown [7]. Environmental risk factors also arise when patients with MAC pulmonary disease are exposed to soil at home or in soil pots [8]. The disease is characterized by adherence to the respiratory mucosa, formation of biofilms [9] and lesions in the epithelial lining of the lungs [7].

MAC pulmonary diseases are treated with macrolide-based multidrug therapy, comprising macrolides (clarithromycin or azithromycin) in combination with rifampin, ethambutol, aminoglycosides (streptomycin or amikacin) and ciprofloxacin [10,11]. However, emerging virulent strains are resistant to these antibiotics [12]. Consequently, these life-threatening pathogens pose a serious challenge for scientists seeking to combat emerging antibiotic resistance. In fact, the emerging strains are capable of becoming more virulent and more tolerant to existing drugs [13]. However, the application of genomics has revolutionized the field of drug discovery by providing increased information about microbial genomes as well as the human genome [14]. This genomic information helps to reveal the mechanisms through which pathogens cause infection. Finding novel and unique drug targets is one possible alternative approach to overcoming infections caused by such drug-resistant pathogens. Similarly, finding therapeutic drugs to combat infections by lethal organisms is the most widely applied method, albeit with limited success against drug-resistant pathogens [15]. In this scenario, advances in computational biology and bioinformatics tools have paved the way for proposing new and unique drug targets using the subtractive genomics strategy. In the subtractive genomics approach, the genomes of the host and the pathogen are compared, and proteins that are unique to the pathogen (absent from the host) and vital to its survival are proposed as drug targets [16,17]. This strategy identifies genes that are absent in the host, so-called "non-host" genes, but that must be present in the pathogen for its survival, replication and sustainability. Additionally, these non-host genes play crucial roles in unique metabolic pathways and mechanisms. Therefore, when the pathogen's metabolic targets are hit by therapeutic compounds, the therapy should affect the function of the pathogen without altering the host biology [18,19]. Disruption of the essential genes is expected to eventually overcome the pathogen's infection. Recently, several studies have applied the same approach to identify potential drug targets of *Acinetobacter baumannii* [20], *Helicobacter pylori* [21], *Mycobacterium* species [22], *Pseudomonas aeruginosa* [23] and others [24–28]. Such computational studies help to minimize experimental effort by rapidly prioritizing drug targets. For example, using the information retrieved from such computational studies, a life scientist can express only the prioritized target gene (the one predicted to be a potential drug target), which saves the cost of extra experiments and accelerates research.

### **2. Results and Discussion**

To identify unique and potentially druggable targets of *M. avium* subsp. *hominissuis* (MAH), the subtractive genomics method was used, which is the most applicable approach for prioritizing potential drug targets [18,29–31].

### *2.1. Removal of Duplicate Sequences after Proteome Retrieval*

Three strains of MAH, i.e., MAH-TH135, OCU466 and A5, were selected from the available non-redundant strains of *M. avium* subsp. *hominissuis* in the UniProt database. Their complete proteomes were downloaded in FASTA format in February 2019. On applying the CD-HIT algorithm with an 80% identity threshold, 20 sequences were identified as paralogous out of 4614 proteins in MAH-TH135, 54 out of 5165 in MAH-OCU466 and 14 out of 4502 proteins in the A5 strain. CD-HIT clustered the paralogous sequences and hence reduced the total number of sequences for each strain. After this step, the sequence datasets comprised 4596, 5111 and 4488 protein sequences for the MAH-TH135, OCU466 and A5 strains, respectively.
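The redundancy-removal step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used in the study: the file names are hypothetical, the word size `-n 5` is an assumed setting (the text states only the 80% identity threshold), and the helper names `cd_hit_command` and `count_clusters` are introduced here for illustration. The second helper reads the `.clstr` report that CD-HIT writes alongside its output and counts how many sequences were merged into an existing cluster.

```python
def cd_hit_command(fasta_in, fasta_out, identity=0.80):
    """Build a CD-HIT command line (it is not executed here).

    -c is the sequence identity threshold; -n is the word size
    (5 is an assumed choice, suitable for thresholds of 0.7 and above).
    """
    return ["cd-hit", "-i", fasta_in, "-o", fasta_out,
            "-c", str(identity), "-n", "5"]


def count_clusters(clstr_lines):
    """Return (clusters, sequences) from the lines of a CD-HIT .clstr file.

    Each cluster starts with a '>Cluster N' header; every other
    non-empty line describes one member sequence.
    """
    clusters = 0
    sequences = 0
    for line in clstr_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            clusters += 1
        else:
            sequences += 1
    return clusters, sequences


# Toy .clstr content with invented identifiers (not data from the study).
example = (
    ">Cluster 0\n"
    "0\t350aa, >protA... *\n"
    "1\t348aa, >protB... at 91.50%\n"
    ">Cluster 1\n"
    "0\t120aa, >protC... *\n"
)
clusters, sequences = count_clusters(example.splitlines())
removed = sequences - clusters  # members merged into a representative
print(cd_hit_command("strain.fasta", "strain_nr.fasta"))
print(clusters, sequences, removed)
```

Counting from the cluster report rather than from the output FASTA file makes it easy to see which sequences were grouped together, which is useful when checking that the chosen identity threshold behaves as intended.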

### *2.2. Searching of Essential, Non-Homologous and Druggable Proteins*

In this step, protein sequences that were present only in the pathogen were segregated. Thus, by applying a subtractive approach, sequences that showed similarity to human host proteins were excluded. The non-paralogous sequences retained from the previous step were subjected to BLASTp against the complete human proteome, and the resultant file was parsed. Only the sequences that showed "no hits found" were retained, giving a total of 3151, 3619 and 3072 non-homologous sequences in the MAH-TH135, OCU466 and A5 strains, respectively.
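The host-subtraction filter can be sketched as below. This is a hedged illustration that assumes BLASTp was run with tabular output (`-outfmt 6`), in which queries without any hit simply do not appear in the report. The text instead describes parsing for "no hits found", so the report format used here is a swapped-in assumption. The function names and the toy identifiers are hypothetical.

```python
def fasta_ids(fasta_text):
    """Return sequence identifiers (first token after '>') in file order."""
    return [line[1:].split()[0]
            for line in fasta_text.splitlines()
            if line.startswith(">") and line[1:].strip()]


def queries_with_hits(blast_tab_lines):
    """Collect query IDs (column 1) that have at least one reported hit."""
    hits = set()
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hits.add(line.split("\t")[0])
    return hits


def non_homologous(query_ids, blast_tab_lines):
    """Keep the queries that never appear in the BLAST report."""
    hits = queries_with_hits(blast_tab_lines)
    return [q for q in query_ids if q not in hits]


# Toy inputs with invented identifiers (not data from the study).
toy_fasta = ">p1 hypothetical protein\nMKT\n>p2\nMAA\n>p3\nMLL\n"
toy_blast = "\t".join(["p2", "human_X", "45.0", "200", "100", "3",
                       "1", "200", "5", "204", "1e-40", "150"])
kept = non_homologous(fasta_ids(toy_fasta), toy_blast.splitlines())
print(kept)
```

In this sketch any reported hit marks a query as host-like; if a similarity cut-off was applied at this stage, it would have to be added as an extra condition, since the text does not state one.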

The Database of Essential Genes (DEG) provides information on essential genes of Gram-positive and Gram-negative bacteria determined by experimental methods (http://www.essentialgene.org/). The essentiality of the non-homologous proteins was inferred from their homology with sequences in the DEG. To do this, the parsed results of each strain from the previous step were subjected to BLASTp against the DEG with a threshold of 10<sup>−5</sup>. The BLASTp results identified 1360, 1451 and 1352 essential protein sequences in MAH-TH135, OCU466 and A5, respectively. These sequences were considered vital for the pathogen's life cycle. They include functional, non-functional and uncharacterized proteins, and they were further characterized using different bioinformatics tools.
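The essentiality filter can be sketched in the same way. The sketch below assumes that the 10<sup>−5</sup> threshold is an E-value cut-off and that the BLASTp report against the DEG uses the default 12-column tabular layout (`-outfmt 6`), where the E-value is the eleventh column. Both points are assumptions, as are the function name `essential_candidates` and the toy identifiers.

```python
def essential_candidates(blast_tab_lines, max_evalue=1e-5):
    """Map each query to its best (lowest) E-value at or below the cut-off.

    Expects default BLAST tabular columns: qseqid, sseqid, pident, length,
    mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    best = {}
    for line in blast_tab_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue and evalue < best.get(query, float("inf")):
            best[query] = evalue
    return best


# Toy report with invented identifiers (not data from the study).
toy_rows = [
    ["p1", "DEG_a", "38.2", "250", "140", "5", "1", "250", "3", "252", "2e-30", "120"],
    ["p1", "DEG_b", "30.0", "100", "65", "2", "10", "110", "1", "100", "1e-06", "50"],
    ["p3", "DEG_c", "25.0", "60", "40", "1", "1", "60", "1", "60", "0.5", "20"],
]
toy_report = ["\t".join(row) for row in toy_rows]
essential = essential_candidates(toy_report)
print(sorted(essential))
```

Here `p1` is retained because at least one of its hits passes the cut-off, while `p3` is dropped because its only hit is too weak. Keeping the best E-value per query makes it possible to rank the candidates later.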

### *2.3. Characterization of Essential Non-Homologous Proteins*
