# **Molecular World Today and Tomorrow: Recent Trends in Biological Sciences**

Edited by Wajid Zaman

Printed Edition of the Special Issue Published in *International Journal of Molecular Sciences*

www.mdpi.com/journal/ijms

## **Molecular World Today and Tomorrow: Recent Trends in Biological Sciences**


Editor

**Wajid Zaman**

MDPI Basel Beijing Wuhan Barcelona Belgrade Manchester Tokyo Cluj Tianjin

*Editor* Wajid Zaman Department of Life Sciences Yeungnam University Gyeongsan Korea, South

*Editorial Office* MDPI St. Alban-Anlage 66 4052 Basel, Switzerland

This is a reprint of articles from the Special Issue published online in the open access journal *International Journal of Molecular Sciences* (ISSN 1422-0067) (available at: www.mdpi.com/journal/ijms/special_issues/Biological_Sciences).

For citation purposes, cite each article independently as indicated on the article page online and as indicated below:

LastName, A.A.; LastName, B.B.; LastName, C.C. Article Title. *Journal Name* **Year**, *Volume Number*, Page Range.

**ISBN 978-3-0365-6997-0 (Hbk) ISBN 978-3-0365-6996-3 (PDF)**

© 2023 by the authors. Articles in this book are Open Access and distributed under the Creative Commons Attribution (CC BY) license, which allows users to download, copy and build upon published articles, as long as the author and publisher are properly credited, which ensures maximum dissemination and a wider impact of our publications.

The book as a whole is distributed by MDPI under the terms and conditions of the Creative Commons license CC BY-NC-ND.



## **About the Editor**

#### **Wajid Zaman**

Dr. Wajid Zaman completed his Ph.D. at the Institute of Botany, Chinese Academy of Sciences, China. Currently, he is working as an International Research Professor at the Department of Life Sciences, Yeungnam University, Republic of Korea. Previously he worked as an Assistant Researcher/Professor at Lushan Botanical Garden, Chinese Academy of Sciences, China. He is interested in developing a career that combines teaching and research while maintaining his interest in Biological Sciences with advanced technologies.

## *Article* **Nonspecific Amyloid Aggregation of Chicken Smooth-Muscle Titin: In Vitro Investigations**

**Alexander G. Bobylev <sup>1,\*</sup>, Elmira I. Yakupova <sup>2</sup>, Liya G. Bobyleva <sup>1</sup>, Nikolay V. Molochkov <sup>1</sup>, Alexander A. Timchenko <sup>3</sup>, Maria A. Timchenko <sup>4</sup>, Hiroshi Kihara <sup>5</sup>, Alexey D. Nikulin <sup>3</sup>, Azat G. Gabdulkhakov <sup>3</sup>, Tatiana N. Melnik <sup>3</sup>, Nikita V. Penkov <sup>6</sup>, Michail Y. Lobanov <sup>3</sup>, Alexey S. Kazakov <sup>4</sup>, Miklós Kellermayer <sup>7</sup>, Zsolt Mártonfalvi <sup>7</sup>, Oxana V. Galzitskaya <sup>1,3</sup> and Ivan M. Vikhlyantsev <sup>1,8,\*</sup>**


**Abstract:** A giant multidomain protein of striated and smooth vertebrate muscles, titin, consists of tandems of immunoglobulin (Ig)- and fibronectin type III (FnIII)-like domains representing β-sandwiches, as well as of disordered segments. Chicken smooth muscles express several titin isoforms of ~500–1500 kDa. Using various structural-analysis methods, we investigated in vitro nonspecific amyloid aggregation of the high-molecular-weight isoform of chicken smooth-muscle titin (SMTHMW, ~1500 kDa). As confirmed by X-ray diffraction analysis, under near-physiological conditions, the protein formed amorphous amyloid aggregates with a quaternary cross-β structure within a relatively short time (~60 min). As shown by circular dichroism and Fourier-transform infrared spectroscopy, the quaternary cross-β structure, unlike in other amyloidogenic proteins, formed without changes in the SMTHMW secondary structure. SMTHMW aggregates partially disaggregated upon increasing the ionic strength above the physiological level. Based on the data obtained, it is not the complete protein but its particular domains/segments that are likely involved in the formation of intermolecular interactions during SMTHMW amyloid aggregation. The discovered properties of titin position this protein as an object of interest for studying amyloid aggregation in vitro and expanding our views of the fundamentals of amyloidogenesis.

**Keywords:** smooth muscle titin; protein aggregates; amyloid aggregation; amyloids; cross-β

#### **1. Introduction**

It is known that, to perform their biological functions normally, newly synthesized proteins must fold into a certain three-dimensional structure [1,2]. The folded structures of proteins are only moderately stable. Certain factors, such as genetic mutations or disturbances of protein synthesis and degradation, can lead to incorrect protein folding (misfolding) followed by the formation of pathological aggregates [3]. Protein misfolding is a rather common phenomenon associated most often with the development of diseases such as amyloidoses [4]. With the development of amyloidosis in humans or animals, the protein loses its native conformation to form amyloids, aggregates of incorrectly folded forms of protein having a specific structure. Amyloid aggregates, as well as their intermediate forms (mainly oligomers), may lead to cellular death [5–10].

**Citation:** Bobylev, A.G.; Yakupova, E.I.; Bobyleva, L.G.; Molochkov, N.V.; Timchenko, A.A.; Timchenko, M.A.; Kihara, H.; Nikulin, A.D.; Gabdulkhakov, A.G.; Melnik, T.N.; et al. Nonspecific Amyloid Aggregation of Chicken Smooth-Muscle Titin: In Vitro Investigations. *Int. J. Mol. Sci.* **2023**, *24*, 1056. https://doi.org/10.3390/ijms24021056

Academic Editor: Wajid Zaman

Received: 28 November 2022 Revised: 29 December 2022 Accepted: 29 December 2022 Published: 5 January 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Amyloid aggregates of many proteins have been found in various tissues and organs of humans and animals with amyloidosis. These diseases include, in particular, liver amyloidosis, Alzheimer's disease, Parkinson's disease, type II diabetes, and prion diseases, as well as systemic amyloidoses [11–14]. Amyloids have a number of specific characteristics, such as the ability to bind to Congo Red and thioflavin T; they are resistant to proteases and insoluble in most solvents [14]. The main property of amyloids, regardless of the type of aggregates they form, is the presence of a quaternary cross-β structure [15,16].

Besides pathological amyloids, there are functional amyloids—protein aggregates that perform certain functions. Functional amyloids are found in many species of living organisms from plants, fungi, and protozoa to higher animals, including humans [17–27].

Despite extensive investigation, there is as yet no clear understanding of the molecular mechanisms of amyloid aggregation. The differences between the mechanisms of formation of functional and pathological amyloids have yet to be resolved. This is probably due to the complexity of the object of study. To obtain the required information about amyloids, it is necessary to use an array of methods, such as nuclear magnetic resonance spectroscopy, transmission electron microscopy (TEM), cryo-electron microscopy, atomic-force microscopy (AFM), X-ray diffraction, small-angle X-ray scattering (SAXS), and Fourier-transform infrared spectroscopy (FTIR). This is often technically difficult, since each of the methods has its own limitations regarding the object of study.

Titin (also called connectin [28,29]) is one of the most difficult objects for studying aggregation properties. In vitro studies on isolated titin preparations revealed a tendency of this protein to form various sorts of intermolecular interactions that led to the formation of oligomers and aggregates. The aggregation of titin in vitro was first investigated in 1993 [30]. It was shown that, in solutions of low ionic strength (0.1 M KCl, near-neutral pH of 6.5), striated muscle titin assembles into higher-order aggregates [30]. In 2003, atomic-force microscopy revealed that, in a solution containing 25 mM imidazole–HCl, 0.2 M KCl, 4 mM MgCl2, 1 mM EGTA, 0.01% NaN3, 1 mM dithiothreitol (DTT), 20 μg/mL leupeptin, 10 μM E-64, pH 7.4, isolated titin molecules are capable of self-assembly into oligomers within a mere 10 min [31]. The observed titin oligomers could be divided into bimolecular species and species consisting of several molecules. In each case, however, a globular head was observed, with which other titin molecules interacted, thus forming an oligomer [31].

In vitro investigations of the specificity of aggregation between titin domains conducted in 2005 concluded that the ability for aggregation increases when the identity of amino acid sequences of the domains is greater than 40% [32]. In 2015, the in vitro aggregation of some titin Ig-domains obtained by a recombinant method was studied [33]. The neighboring identical titin domains were shown to be able to form misfolded structures [33].

The experiments in vitro with molecular simulations carried out in 2015 showed that during the refolding of tandem repeats (I27–I27 and I27–I28 immunoglobulin-like domains of titin), independent of sequence identity, more than half of all molecules transiently formed a wide range of misfolded conformations [33]. Simulations suggested that a large fraction of these misfolds resemble an intramolecular amyloid-like state named "intramolecular amyloids" [33]. These authors also reported that, during the refolding of tandem repeats, the development of a strand-swapped long-lived misfolded state without an amyloid-like structure is possible [33].

In 2016, the ability of a low-molecular-weight (500 kDa) isoform or possibly a truncated form of chicken smooth-muscle titin (SMTLMW) to form in vitro amyloid aggregates was described [34]. In 2018, the ability of that isoform to form in vitro amyloid aggregates with different properties in two solutions differing in ionic strength was found [35].

In the present work, using a number of structural analysis methods, we investigated the in vitro formation of an amyloid structure by a chicken smooth-muscle titin isoform with a molecular weight of ~1500–1600 kDa (SMTHMW). A peculiar feature of this protein and, in particular, this isoform of titin relative to other amyloid proteins studied is its large size. One titin molecule is comparable in size to the protofibrils of some amyloid proteins. Thus, it was of interest to elucidate whether this high-molecular-weight isoform of smooth-muscle titin was capable of forming aggregates, what their structural features were, and whether the aggregation was of the amyloid or non-amyloid type. The results obtained point to nonspecific amyloid aggregation of the protein.

#### **2. Results**

#### *2.1. DLS Analysis of SMT Aggregations*

Based on our previous research into SMTLMW [34,35] and SMTHMW [36], we were aware of the conditions required to form amyloid aggregates of these proteins. In this work, to characterize the rate of chicken titin aggregation, we investigated the kinetics of SMTHMW aggregation. We used dynamic light scattering (DLS), a simple and convenient method for studying the aggregation kinetics of amyloid proteins [34,35,37]. The method also makes it possible to analyze the reversibility of aggregation, which has been shown earlier for SMTLMW [35].

Figure 1a shows a change in the autocorrelation function of the scattered light upon the formation of SMTHMW aggregates in a solution containing 0.15 M glycine–KOH at pH 7.0–7.5 over 60 min. During the first 20 min of incubation, the correlation function *g*<sub>1</sub>(*t*) was observed to decay almost mono-exponentially (Figure 1a), with the subsequent emergence of a shoulder at high correlation times. This is indicative of the formation of large aggregates with smaller diffusion coefficients. After 40 min, the correlation function *g*<sub>1</sub>(*t*) featured a greater decay time and a more pronounced shoulder (Figure 1a) at high correlation times, which indicated an increase in aggregation. A pronounced correlation-function shoulder was observed at 40, 50, and 60 min of incubation (Figure 1a).

Several peaks reflecting the dimensions of SMTHMW aggregates were obtained by the correlation function analysis. Prior to the formation of aggregates, only two well-resolved peaks with the average *R*<sub>h</sub> of approximately 16 nm (the dominating peak of approximately 73%) and 86 nm (the minor peak of approximately 27%) were observed (Figure 1b, 0 min). The first peak corresponds, most likely, to SMTHMW molecules, whereas the second can be indicative of their insignificant aggregation. During the first 20 min of incubation, we observed a shift of the peaks and an increase in the ratio of the volumes of the larger- and smaller-size fraction, indicating the development of the titin aggregation process (*R*<sub>h</sub> ~53 nm and ~479 nm) (Figure 1b).

After 40 min of the experiment, a stable fraction of larger aggregates with a hydrodynamic radius of ~1175 nm (96.7%) appeared. This fraction was also observed after 50 and 60 min of the experiment (Figure 1b). Another peak, indicating the formation of titin aggregates with the average *R*<sub>h</sub> of approximately 3813 nm, emerged after 50 min of incubation and, after over 60 min, *R*<sub>h</sub> was more than 4475 nm. It should be noted that the peak with *R*<sub>h</sub> of approximately 4500 nm was at the limit of the range for this method. Therefore, the formation of larger SMTHMW aggregates cannot be excluded (Figure 1b).
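The link between the diffusion coefficients mentioned above and the hydrodynamic radii reported here is the Stokes–Einstein relation, *R*<sub>h</sub> = *k*<sub>B</sub>*T*/(6πη*D*). The following is a minimal sketch of that conversion, not the analysis performed by the instrument software. The temperature (25 °C) is taken from the Figure 1 caption; the viscosity of pure water at 25 °C (0.89 mPa·s) is an assumption standing in for the unreported viscosity of the glycine–KOH buffer, and the function names are illustrative.

```python
import math

BOLTZMANN = 1.380649e-23  # Boltzmann constant, J/K


def hydrodynamic_radius_nm(diffusion_m2_s, temperature_k=298.15,
                           viscosity_pa_s=0.89e-3):
    """Stokes-Einstein: R_h = k_B * T / (6 * pi * eta * D), returned in nm.

    The default viscosity is that of pure water at 25 degrees C (an assumption).
    """
    if diffusion_m2_s <= 0:
        raise ValueError("diffusion coefficient must be positive")
    radius_m = BOLTZMANN * temperature_k / (
        6.0 * math.pi * viscosity_pa_s * diffusion_m2_s)
    return radius_m * 1e9


def diffusion_coefficient_m2_s(radius_nm, temperature_k=298.15,
                               viscosity_pa_s=0.89e-3):
    """Inverse relation: D = k_B * T / (6 * pi * eta * R_h), in m^2/s."""
    if radius_nm <= 0:
        raise ValueError("radius must be positive")
    return BOLTZMANN * temperature_k / (
        6.0 * math.pi * viscosity_pa_s * radius_nm * 1e-9)


# Radii reported in the text (nm): larger particles diffuse more slowly.
for r_h in (16.0, 86.0, 1175.0):
    print(r_h, diffusion_coefficient_m2_s(r_h))
```

Because *D* is inversely proportional to *R*<sub>h</sub>, growth from ~16 nm to ~1175 nm corresponds to a roughly 70-fold slower diffusion, which is what shifts the correlation function toward longer decay times.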

The next task was to elucidate the ability of SMTHMW aggregates to disaggregate upon increasing ionic strength. Figure 1c presents data on the partial disaggregation of SMTHMW aggregates after incubation in a solution containing 0.6 M KCl and 30 mM KH2PO4 (pH 7.0). A decrease in the percentage of particles with larger hydrodynamic radii and an increase in the number of particles with smaller hydrodynamic radii (Figure 1c) were observed.

**Figure 1.** (**a**) Evolution of field autocorrelation functions *g*<sub>1</sub>(*t*) of light scattered during SMTHMW amyloid aggregate formation (pH 7.0–7.5, *T* = 25 °C). Aggregation was observed in a solution containing 0.15 M glycine–KOH, pH 7.0–7.5; (**b**) size distributions of SMTHMW particles. Generation of large aggregates and their time-dependent growth are shown; (**c**) distribution of SMTHMW particles after 1 h disaggregation (for clarity, the graph shows monodispersed SMTHMW). The data reflect the distributions corresponding to certain time points. Three independent experiments were carried out.

#### *2.2. Electron and Atomic-Force Microscopy of SMTHMW Aggregates*

Based on the DLS data presented above, investigations were conducted using two time intervals (1 h and 24 h). Figure 2 shows AFM images of SMTHMW molecules in a solution of 0.6 M KCl, 30 mM KH2PO4, 1 mM DTT, 0.1 M NaN3, and pH 7.0. It was found that, as in the case of striated-muscle titin [31,38,39], the smooth-muscle titin molecules had a filamentous shape with threads 3–4 nm thick and ~300 nm (on average) long with a globular head on one end (Figure 2a, inside the white squares). Besides individual molecules, oligomers of this protein were also found: several titin molecules interacting with one another, often forming a thickening in the center (Figure 2b). Similar oligomers have been found earlier in titin preparations isolated from rabbit skeletal muscles [31].

Figure 3 shows electron micrographs of negatively stained SMTHMW aggregates formed in a solution containing 0.15 M glycine–KOH, pH 7.0–7.5, at 4 °C. SMTHMW formed amorphous aggregates after 1 h and 24 h incubation (Figure 3). Aggregates became larger after 24 h of incubation. In a number of cases, filamentous structures of about 4 nm in diameter (represented, most likely, by titin filaments) were observed between amorphous aggregates (Figure 3c).

**Figure 2.** AFM images of molecular SMTHMW obtained in a solution of 0.6 M KCl, 30 mM KH2PO4, 1 mM DTT, 0.1 M NaN3, and pH 7.0. In the field of view, there are filamentous protein monomers (marked with white squares) (**a**), which assemble into oligomers forming a central thickening (**b**).

**Figure 3.** Electron microscopy of SMTHMW aggregates formed in a solution containing 0.15 M glycine–KOH at pH 7.0–7.5 at 4 °C. (**a**) 1 h aggregation of SMTHMW. (**b**,**c**) 24 h aggregation of SMTHMW. Protein filaments of a diameter of about 4 nm between amorphous aggregates (shown by red arrows on the insert) can be seen. The most representative data, obtained as a result of 10 independent experiments, are given. Scale bar, 100 nm.

According to AFM, after 1 h of aggregation, SMTHMW aggregates looked like large amorphous structures several micrometers long and up to 350 nm high (Figure 4a). After 24 h of aggregation, they had the form of branching flattened structures of more than 10 μm in length and 200–250 nm in height (Figure 4b,c).

**Figure 4.** Atomic-force microscopy of SMTHMW aggregates in a solution containing 0.15 M glycine and pH 7.0–7.5. (**a**) SMTHMW after 1 h aggregation. (**b**,**c**) SMTHMW after 24 h aggregation. (**d**,**g**) SMTHMW aggregates after 1 h disaggregation; (**e**,**f**) SMTHMW aggregates after 24 h disaggregation. Scale bar in (**g**), 1 μm. The most representative data, obtained as a result of 3 independent experiments, are given.

To determine their capability of disaggregation, SMTHMW aggregates formed during 1 and 24 h were transferred to conditions with increased ionic strength (0.6 M KCl, 30 mM KH2PO4, 1 mM DTT, 0.1 M NaN3, and pH 7.0). Figure 4d presents the results of 1 h disaggregation. As can be seen in the figure, the SMTHMW aggregates turned into much smaller spherical amorphous aggregates of about 100–200 nm in diameter and 100–120 nm in height. Apart from amorphous aggregates, individual filaments occurred on the substrate (Figure 4g); judging by their size, they may be bundles of SMTHMW molecules.

Figure 4e,f shows the disaggregation results for SMTHMW aggregates formed for 24 h. Spherical amorphous aggregates up to 23 nm in height can be seen, as well as a network of threads with a height of 5–6 nm.

#### *2.3. Circular Dichroism*

Figure 5a illustrates the circular dichroism (CD) spectrum of SMTHMW before and after the formation of aggregates. No changes in the secondary structure were detected upon the formation of SMTHMW aggregates: after chromatography, the preparation had 6 ± 6% α-helices, 41 ± 6% β-structures, and 53 ± 6% turns and disordered structure, whereas aggregated SMTHMW had 5 ± 6% α-helices, 42 ± 6% β-structures, and 53 ± 6% turns and disordered structure. Thus, in both cases, we detected a high content of β-structure and disordered structure.

**Figure 5.** Investigation of the structure of SMTHMW aggregates by various methods. (**a**) CD spectrum of SMTHMW; (**b**) FTIR spectra of titin and its aggregates at 20 °C. Protein concentration was 10 mg/mL; (**c**,**d**) thioflavin T (ThT) staining of SMTHMW aggregates formed in a solution containing 0.15 M glycine–KOH at pH 7.0–7.5. (**c**) 1 h formation of aggregates; (**d**) 24 h formation of aggregates; (**e**,**f**) X-ray diffraction of SMTHMW aggregates formed after 1 h (**e**) and 24 h (**f**).

#### *2.4. Fourier-Transform Infrared Spectroscopy*

Figure 5b presents FTIR data obtained at 20 °C and corrected for the spectral contribution of water vapor and CO2. The experimental data were analyzed following the principles described in [40]. The obtained estimates of the content of secondary structure elements in samples of titin and its aggregates are given in Table 1. Like the circular dichroism data, the FTIR data indicate that the secondary structure of the protein does not change during SMTHMW aggregation.

**Table 1.** Secondary structure content in chicken titin samples calculated by the FTIR method.


The results of two experiments are presented. The means and standard deviations are given.

#### *2.5. Association of SMTHMW Aggregates with Thioflavin T*

To identify the amyloid nature of SMTHMW aggregates, we investigated their binding to thioflavin T. A significant increase in ThT fluorescence intensity in the presence of SMTHMW aggregates formed over 1 h (Figure 5c) and 24 h (Figure 5d) was recorded and compared with that in the presence of monodispersed SMTHMW. After disaggregation, the ThT fluorescence in the presence of SMTHMW aggregates was at the same level as that of monodispersed SMTHMW after 1 and 24 h of the experiment.

#### *2.6. X-ray Diffraction of SMTHMW Aggregates*

The amyloid cross-β structure of SMTHMW aggregates was revealed by X-ray diffraction (Figure 5e,f). A 1 h aggregation featured a diffuse reflex at ~10 Å and a relatively sharp reflex at ~4.7 Å (Figure 5e). X-ray diffraction of SMTHMW aggregates after a 24 h aggregation revealed a diffuse reflex at ~10 Å and a sharp reflex at ~4.8 Å (Figure 5f). The detected reflections can be ascribed to a cross-β structure. Thus, the presence of a cross-β structure identified by X-ray diffraction analysis confirms that SMTHMW aggregates are amyloids.
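The reflexes above are quoted as spacings *d*; Bragg's law, *n*λ = 2*d* sin θ, relates them to the scattering angle at which they appear on the detector. The sketch below illustrates that conversion only. The X-ray wavelength is not given in this section, so the Cu Kα value of 1.5418 Å used as the default is an assumption, and the function names are illustrative.

```python
import math

CU_K_ALPHA_ANGSTROM = 1.5418  # assumed wavelength; not stated in the text


def two_theta_deg(d_spacing, wavelength=CU_K_ALPHA_ANGSTROM, order=1):
    """Scattering angle 2*theta (degrees) for a spacing d (angstroms)."""
    ratio = order * wavelength / (2.0 * d_spacing)
    if not 0.0 < ratio <= 1.0:
        raise ValueError("no Bragg reflection for this d, wavelength and order")
    return 2.0 * math.degrees(math.asin(ratio))


def d_spacing_from_two_theta(angle_deg, wavelength=CU_K_ALPHA_ANGSTROM, order=1):
    """Spacing d (angstroms) recovered from a scattering angle 2*theta."""
    theta = math.radians(angle_deg / 2.0)
    return order * wavelength / (2.0 * math.sin(theta))


# Spacings reported for the cross-beta reflexes (angstroms).
for d in (4.7, 4.8, 10.0):
    print(d, round(two_theta_deg(d), 2))
```

The smaller inter-strand spacing (~4.7 to 4.8 Å) scatters at a larger angle than the ~10 Å inter-sheet spacing, which is why the two reflexes appear as separate rings.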

#### *2.7. Small-Angle X-ray Scattering (SAXS)*

SAXS was used to obtain information about the conformation of molecules in solution. For such high-molecular-weight structures as titin, this method makes it possible to approximately estimate their internal conformation from the slope (tan *A*) of the log *I*–log *Q* dependence. It is known that, in absolute value, tan *A* = 1 for a rod-shaped conformation, tan *A* = 2 for a planar conformation, and tan *A* = 4 for globular particles. Using the DAMMIF program [41], it is also possible to visualize the approximate three-dimensional structure of the protein.

From these data (Figure 6), it follows that molecular SMTHMW has a flat shape (plate conformation) (tan *A* = −2.7) (Figure 6a). The latter is clearly visible in the insert (Figure 6c). The shape of aggregated SMTHMW particles is also close to plate-like (more charged plate conformation) (tan *A* = −2.65) (Figure 6b).
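The slope analysis described above can be sketched as a straight-line fit in log–log coordinates. This is an illustration under stated assumptions, not the authors' processing pipeline: the data are synthetic, the function names are hypothetical, and the reference magnitudes (1, 2, 4) are the ones quoted in the text.

```python
import numpy as np

# Reference slope magnitudes quoted in the text for limiting shapes.
REFERENCE_SLOPES = {"rod": 1.0, "planar": 2.0, "globular": 4.0}


def loglog_slope(q, intensity):
    """Least-squares slope of log10(I) versus log10(Q)."""
    q = np.asarray(q, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    if q.shape != intensity.shape or q.size < 2:
        raise ValueError("q and intensity must have equal length >= 2")
    if np.any(q <= 0) or np.any(intensity <= 0):
        raise ValueError("q and intensity must be positive for a log-log fit")
    slope, _intercept = np.polyfit(np.log10(q), np.log10(intensity), 1)
    return float(slope)


def nearest_shape(slope):
    """Name of the reference shape whose slope magnitude is closest."""
    magnitude = abs(slope)
    return min(REFERENCE_SLOPES,
               key=lambda name: abs(REFERENCE_SLOPES[name] - magnitude))


# Synthetic power-law curve (not measured data): I = 3 * Q**(-2.7).
q = np.logspace(-2, -1, 50)
intensity = 3.0 * q ** -2.7
slope = loglog_slope(q, intensity)
print(round(slope, 2), nearest_shape(slope))
```

A slope magnitude between 2 and 4 does not match any limiting case exactly; picking the nearest reference value is a simplification of the qualitative reading given in the text.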

#### *2.8. Differential Scanning Calorimetry of SMTHMW*

The thermal stability of SMTHMW aggregates was elucidated by differential scanning calorimetry (DSC). Figure 7 presents typical temperature dependences of excess heat capacity of SMTHMW in monomeric and aggregated forms. As seen in the figure, the curves have heat absorption peaks that might correspond to cooperative disruption of the structure. Repeated heating of the preparations confirms that the process is irreversible. The heat absorption peak maximum temperature of molecular titin, *T*<sub>m</sub>, was 317.7 ± 0.1 K; the value of transition calorimetric enthalpy, Δ*H*<sub>cal</sub>, was 870 ± 90 kJ mol<sup>−1</sup>. The heat absorption peak maximum temperature of titin aggregates, *T*<sub>m</sub>, was 321.7 ± 0.1 K; the value of transition calorimetric enthalpy, Δ*H*<sub>cal</sub>, was 1090 ± 110 kJ mol<sup>−1</sup>.

#### *2.9. SMT Amino Acid Sequence Identity*

To reveal segments with large amino acid sequence identity that, according to available data, have an increased tendency for aggregation [33], we calculated the identity in the amino acid sequence between adjacent pairs of domains in smooth-muscle titin. Chicken titin (UniProtKB—A6BM71\_CHICK) was chosen for the calculations. Calculations carried out using the BLAST program showed that the average identity in the amino acid sequence between neighboring FnIII domains does not exceed 33 ± 7%, and between neighboring Ig domains, it does not exceed 20 ± 11%, which is a relatively low indicator (Table 2, Supplementary File S1). In two cases, however, domains with an identity higher than 50% were observed.
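To make the identity measure concrete, the sketch below computes percent identity for two already aligned sequences and summarizes several values as mean ± standard deviation. It is a simplified stand-in, not the BLAST program itself: it assumes the alignment has been produced elsewhere, counts identical residues over all alignment columns, and uses short hypothetical sequences rather than real titin domains.

```python
from statistics import mean, stdev


def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical residues as a percentage of alignment columns.

    Both strings must come from the same pairwise alignment and therefore
    have equal length; a column containing a gap never counts as a match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(aligned_a, aligned_b)
                  if x == y and x != gap)
    return 100.0 * matches / len(aligned_a)


def summarize(identities):
    """Mean and sample standard deviation of a list of identities."""
    return mean(identities), stdev(identities)


# Hypothetical pre-aligned fragments standing in for neighboring domains.
pairs = [("ACDEFGHIKL", "ACDEYGHIRL"), ("MKT-VLSA", "MRTQVLTA")]
values = [percent_identity(a, b) for a, b in pairs]
print(values, summarize(values))
```

Applied to every adjacent domain pair, the same summary would give figures of the form quoted above (for example, 33 ± 7%), provided the alignments and the identity definition match those used by the authors.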

**Figure 6.** Investigation of the structure of SMTHMW aggregates by small-angle X-ray scattering. (**a**) The scattering curve for molecular SMTHMW in logarithmic representation log *I*–log *Q*, where *I* is the scattering intensity in 0.6 M KCl, *c* = 0.5 mg/mL (non-aggregated protein form), *Q* is the scattering vector, tan *A* = −2.7; (**b**) SMTHMW in 0.15 M Gly (aggregated); *Q*, scattering vector; tan *A* = −2.65; (**c**) SMTHMW in 0.6 M KCl, *c* = 0.5 mg/mL; dark line, approximation by the DAMMIF program.

**Figure 7.** Typical temperature dependence of excess heat capacity of SMTHMW in monomeric (black) and aggregated (blue) forms. Experiments were carried out 2 times for both the monomeric and aggregated SMTHMW forms. The obtained curves coincided in temperature *T*<sub>m</sub> to an accuracy of 0.1 K. The relative error in determining calorimetric enthalpy, Δ*H*<sub>cal</sub>, did not exceed 10%.



**Table 2.** Identity of the chicken-titin amino acid sequence.

FnIII-like = fibronectin type-III-like. This sequence corresponded to the I-band region of the skeletal muscle sarcomere, which is involved in extension and contraction, between the Z-line and the A–I junction [42]. The complete sequence for the smooth muscle isoform of titin or full-sized chicken skeletal titin is not available in the literature.

#### *2.10. Calculation of Unstructured Areas in the SMT Molecule*

The disorder was revealed by analyzing the amino acid sequence UniProtKB A6BM71\_CHICK using IsUnstruct, a specialized program for predicting the natural disorder of proteins [43,44]. The average disorder of the titin fragment was 90% for PRK12323, 90% for Ig-like, 30% for FnIII-like, and 55% for "none" (Figure 8).

**Figure 8.** Schematic of the chicken titin domain structure aligned with the protein disorder prediction plots produced by the IsUnstruct program [44]. Probabilities of ≥0.5 mean a disorder.
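One way to turn per-residue disorder probabilities into a regional percentage is to count residues at or above the 0.5 threshold given in the Figure 8 caption. The sketch below shows that calculation on made-up numbers. Whether the averages reported above were obtained exactly this way is an assumption, and the function names, region names, and probabilities are hypothetical; the code does not call IsUnstruct.

```python
def percent_disordered(probabilities, threshold=0.5):
    """Percentage of residues whose disorder probability is >= threshold."""
    probs = list(probabilities)
    if not probs:
        raise ValueError("no probabilities supplied")
    disordered = sum(1 for p in probs if p >= threshold)
    return 100.0 * disordered / len(probs)


def percent_by_region(probabilities, regions, threshold=0.5):
    """Per-region percentages.

    `regions` maps a region name to a half-open (start, end) index range
    into `probabilities`.
    """
    return {name: percent_disordered(probabilities[start:end], threshold)
            for name, (start, end) in regions.items()}


# Hypothetical per-residue probabilities and region boundaries.
probs = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.4, 0.3]
regions = {"region_a": (0, 4), "region_b": (4, 8)}
print(percent_by_region(probs, regions))
```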

#### **3. Discussion**

In the present work, we investigated the amyloidogenic propensity of the smooth-muscle isoform of the giant protein titin. According to our results, three titin isoforms, with molecular weights of ~500, ~1200, and ~1500–1600 kDa, were isolated from chicken gizzard muscle tissue. Previously, we have shown the 500 kDa splice-isoform of titin or its truncated fragment (SMTLMW) to form in vitro aggregates with amyloid properties and structure [34,35]. In this study, we show that the smooth-muscle titin isoform with MW~1500–1600 kDa and an isoform or a proteolytic fragment of this protein with MW~1200 kDa (SMTHMW) also form aggregates in vitro. Detailed research into the SMTHMW aggregation process was conducted to better understand the changes occurring in this protein, which is undoubtedly involved in smooth muscle contraction.

Using DLS, we found that, upon decreasing ionic strength, chicken SMTHMW formed large aggregates with a hydrodynamic radius of ~4500 nm over 1 h (Figure 1a). EM and AFM showed SMTHMW aggregates formed over 1 h to be amorphous (Figures 3 and 4). In general, chicken SMTHMW aggregation was fast, which made it impossible to determine the lag period. The high aggregation rate of SMTHMW (three times as high as that for SMTLMW [34]) was, apparently, due to the oligomeric forms of the protein and its monomers present, as shown by AFM, in the high ionic-strength solution (Figure 2). Our results are consistent with the literature data showing that the presence of oligomeric forms at the initial stage of aggregation accelerates the process [45].

An increase in ThT fluorescence detected upon binding of the dye to SMTHMW aggregates an hour after their formation indicates the amyloid nature of the structures formed (Figure 5c,d). This is also supported by the X-ray diffraction data: the presence of reflexes at 4.7 and 10 Å confirms the amyloid nature of SMTHMW aggregates formed after both 1 and 24 h of incubation (Figure 5e,f). The X-ray diffraction data are indicative of the presence of a quaternary cross-β structure in SMTHMW aggregates. It should be noted that both reflexes are circular and partially blurred. Nevertheless, the reflex representing the distance between the beta strands in the sheet was observed both after 1 h (4.7 Å, Figure 5e) and after 24 h (4.8 Å, Figure 5f) of aggregation. The 10 Å reflex representing the distance between the beta sheets was blurred but was also present in both cases. According to numerous literature data, such reflexes are characteristic of amyloid or amyloid-like structures [46].

An uncharacteristic feature of the amyloid aggregation of titin is the formation of aggregates without changes in the secondary structure, which was confirmed by two independent techniques, CD and FTIR (Figure 5a,b, Table 1). According to the literature data, in vitro experiments with some proteins have shown that, prior to the formation of amyloids, the structure of their molecules must undergo a transformation such as α-helix to β-sheet or random coil to β-sheet [47–49]. We revealed no such changes in the amyloid aggregation of chicken SMTHMW. Similar results have been obtained earlier for chicken SMTLMW [34,35] and for skeletal myosin binding protein C [46], which also consists of Ig-like and FnIII-like domains. It appears that the ability to form amyloid aggregates without changes in the secondary structure of molecules is a characteristic feature of the above-mentioned multidomain proteins, which distinguishes them from most other amyloid proteins.

Taking into account the EM data on protein filaments between amorphous SMTHMW aggregates (Figure 3c) and the data on the presence of protein "filaments" after disaggregation of SMTHMW aggregates (Figure 4e–g), we suggested that not the entire protein but only its particular domains were involved in intermolecular interactions in the process of amyloid SMTHMW aggregation. The proposed SMTHMW scheme of stacking to form an amyloid structure during the aggregation is given in Figure 9. In particular, segments that have a disordered structure and those whose structure is amyloid are shown.

Is there any confirmation of this assumption? In the literature, there are comparative data obtained by cryoelectron microscopy on the structure of functional amyloid aggregates of Orb2 and pathological aggregates of amyloid β-peptide [50]. Those authors have shown that only a small part of the molecule is involved in the formation of the amyloid nucleus in the functional amyloids of Orb2, whereas most of the molecule remains dynamically disordered. Formation of the amyloid nucleus in pathological amyloid β-peptide, on the contrary, involves most of its molecule. Given these data and the revealed properties of SMTHMW aggregates, they can be classified as functional, which is indirectly confirmed by their ability to disaggregate with increasing ionic strength (Figures 1 and 4). It should only be noted that this type of aggregation is a model, since neither functional nor pathological SMTHMW aggregates have been found in living cells.

It should also be understood that, due to the huge size and complex structure of the titin molecule, its complete transition to the amyloid form is hardly possible. However, its particular segments are, most likely, capable of forming an amyloid structure. This assumption can be supported by literature data on the unfolding of individual titin domains [51]. In particular, those authors have shown that the stepwise unfolding/folding of titin immunoglobulin (Ig) domains occurs in the elastic I band region of intact myofibrils at physiological sarcomere lengths and forces of 6–8 pN [52]. For this reason, we consider the formation of amyloid segments to occur exactly between partially opened domains of neighboring titin molecules.

It has also been shown that most proteins have amyloidogenic segments [53,54]. We calculated the number of amyloidogenic segments in randomly selected domains of a chicken titin fragment, predicted using the FoldAmyloid program (three Ig-like and three FnIII-like). From the data obtained (Supplementary File S2), there are at least two amyloidogenic segments in each of the counted domains; furthermore, for example, the Ig-like (I10) domain has five such segments. These data indicate a high potential propensity of titin, inherent in its domains, to form amyloid aggregates.

When discussing the type of SMTHMW aggregates, it is necessary to dwell in more detail on the DSC data, which are indicative of yet another feature of titin's amyloid aggregation. It is known that amyloid fibrils or aggregates of several proteins melt and dissociate at temperatures of the order of 75–100 °C, which manifests itself in the form of characteristic endothermic transitions on the thermograms [55–59]. It has been shown that amyloid fibrils are more resistant to temperature than the native protein (their transition temperatures differ by about 30–40 K) [59]. At the same time, it has been shown that the enthalpy of the melting of amyloid fibrils is less than that of the native protein. Based on this, it has been concluded that the density of intermolecular interactions in the amyloid structure is lower than in the native protein [59]. Our experiments showed that the melting point of the aggregated form of SMTHMW is slightly higher (by only 4 K) than that of the molecular form of the protein (Figure 7). Moreover, the melting enthalpies of the aggregated and non-aggregated forms of SMTHMW practically did not differ. These data suggest that the total binding energy did not change during the formation of SMTHMW aggregates. Attention should also be paid to the heat absorption peaks themselves. If we compare our data for SMT with the data for globular proteins, the heat absorption peak would be expected to be much larger. It is possible that the observed heat absorption peak corresponds to the energy spent on the breakdown of the aggregates or oligomers of the molecular and aggregated forms of the protein, or even of its particular domains. If this is the case, then we failed to register the melting point, and it is above 373 K. Nevertheless, the obtained data are of interest. In summary, in this temperature range (up to 373 K), the molecular form of the protein and its aggregates have relatively similar stability.

**Figure 9.** A schematic of the internal structure of aggregates with a cross-β-sheet structure, formed by smooth-muscle titin. (**a**) The proposed structure of SMT aggregates based on circular dichroism, FTIR and X-ray diffraction data, suggesting the presence of a large amount of disorder. (**b**) A beta sheet and the distance of 4.7 Å between the beta strands. (**c**) The beta sheets and the distance of 10 Å between them are shown separately.

Regarding the type of amyloid aggregation (functional or pathological), it is worth noting that no cases of the transformation of functional amyloids into pathological ones have been recorded. Most probably, molecular mechanisms of protection against such a transformation developed in the course of evolution. For example, it is known that, in multidomain proteins with homologous tandem repeats, the neighboring domains have a low identity in the amino acid sequence. This feature, which apparently formed as a result of evolutionary pressure, prevents incorrect protein stacking and subsequent aggregation [33]. It is generally accepted that a low tendency for aggregation is characteristic of those proteins in which the identity between domains is less than 40% [33]. Data on the identity of individual titin domains in the amino acid sequence are known [33]. We calculated the identity of amino acid sequences for chicken titin (Table 2). Calculations showed that this fragment contains only two domains with an identity greater than 50% (Supplementary File S1). The average identity of the amino acid sequence between pairs of neighboring FnIII-like or Ig-like domains does not exceed 40% (Table 2). Thus, it can be concluded that chicken SMT has a low tendency towards aggregation. It has been shown, however, that upon sarcomere elongation, titin domains are capable of unfolding in situ [52], thereby exposing hidden hydrophobic sites, which may lead to its aggregation.

In conclusion, regarding the possible role of the revealed changes for the muscle in vivo, it is necessary to recall one of the early works in which the authors show, using AFM, that repeated mechanical cycles of the extension/relaxation of multidomain proteins can lead to the formation of misfolded structures formed by two neighboring domains [60]. The misfolding was completely reversible and changed the mechanical topology of the domains, while maintaining the same stability as in the original folded state. The authors conclude that multidomain proteins can assume a new state of incorrect stacking. These data and the data we obtained can be important from the viewpoint of a better understanding of the functioning of multidomain muscle proteins, in particular titin, in the sarcomere and muscle as a whole. It is not to be ruled out that structural changes occurring in this protein during the muscle extension/contraction cycle are involved not only in the fine-tuning of elastic properties but also in changing the contractile response of the muscle. Future in vivo studies of titin structural modifications and misfolding will reveal more subtle nuances of the involvement of this protein in the functioning of muscle cells.

#### **4. Conclusions**

In summary, in vitro nonspecific amyloid aggregation of chicken SMTHMW was found. Over relatively short times (within 1 h), the protein formed amorphous amyloid aggregates with a quaternary cross-β structure without undergoing changes in the secondary structure. The thermal stability of SMTHMW aggregates practically did not differ from that observed in protein preparations containing both monomers and oligomers. Amyloid aggregates of SMTHMW disaggregated almost completely upon an increase in the ionic strength of the solution. Amyloid aggregates of this type are functional rather than pathological, so a similar aggregation of titin in vivo to perform certain functions cannot be ruled out. However, the involvement of this protein in the formation of amyloid deposits has not been shown yet, which positions this protein as a model object to study the process of the amyloid aggregation of a non-pathological type. Our data expand the views about the fundamentals of amyloidogenesis. In addition, disclosing the mechanisms of the formation of functional and pathological amyloids and understanding their differences at the structural level can give an idea of how the cell regulates amyloid aggregation for its desired functioning, avoiding the manifestation of the toxicity of pathological amyloids.

#### **5. Materials and Methods**

#### *5.1. Purification of Chicken Gizzard SMT*

Smooth-muscle titin was prepared from chicken gizzard by the method described in [61] with our modifications described in [34]. SMTHMW was purified by gel filtration on a Sepharose-CL2B column equilibrated in a buffer containing 0.6 M KCl, 30 mM KH2PO4, 1 mM DTT, 0.1 M NaN3, pH 7.0. Protein concentration was determined by a SPECORD UV VIS spectrophotometer using the extinction coefficient (*E*<sub>280</sub><sup>1 mg/mL</sup>) of 1.37 for titin [62]. For this research, the protein has been isolated more than 20 times.

#### *5.2. SDS-PAGE and Mass Spectrometry Analysis of Titin*

The presence of SMTHMW in the sample was confirmed by sodium dodecyl sulfate–polyacrylamide gel electrophoresis (SDS-PAGE) (Supplementary File S3) and mass spectrometry analysis, the data of which are described in [36]. The molecular weight of SMTHMW was assessed by TotalLab software v1.11 (Supplementary File S3). Two protein bands are visible, which are most likely SMTHMW isoforms, or the lower band is a proteolytic fragment of this protein (Supplementary File S3, gels 1–4). According to the densitometry data, the molecular weight of the upper band of the protein is ~1635 ± 245 kDa; its content is ~68.5%. The molecular weight of the lower band is ~1245 ± 189 kDa; its content is ~31.5%. The SDS-PAGE of titin was performed using a separating gel containing 6.5–7% polyacrylamide prepared as described [62]. The gels were stained with Coomassie Brilliant Blue G-250 and R-250 mixed at a 1:1 ratio. For shotgun mass spectrometry analysis, the sample was solubilized into a buffer (4% sodium dodecyl sulfate in 0.1 M Tris-HCl pH 7.6, 0.1 M dithiothreitol) and incubated for 5 min at 95 °C as described [36,63]. The samples were sonicated (4 × 30 s at 20 W; ME220, Covaris, Woburn, MA, USA), centrifuged (5 min, 16,000× *g*), and the supernatant was collected. The YM-30 filter (Millipore, Ireland) was used for alkylation and trypsinolysis (14 h, 2 μg of trypsin (Trypsin Gold, Promega, Madison, WI, USA)) according to the FASP method [64]. Peptides were desalted using C18 microcolumns and subjected to HPLC–MS/MS analysis using the HPLC Ultimate 3000 RSLCnano system (Thermo Scientific, Waltham, MA, USA) as described [36,63].

#### *5.3. Conditions for the Formation of SMT Aggregates*

Purified SMTHMW in a column buffer (0.6 M KCl, 30 mM KH2PO4, 1 mM DTT, 0.1 M NaN3, pH 7.0) was used to form aggregates. SMT aggregates (concentration, 0.2–0.4 mg/mL) were formed by dialysis in Sigma-Aldrich cellulose membrane tubing (size, 25 × 16 mm) for 1 and 24 h at 4 °C against a solution containing 0.15 M glycine–KOH, pH 7.0–7.5. In disaggregation experiments, SMT aggregates were dialyzed for 1 and 24 h against the column buffer.

#### *5.4. Dynamic Light Scattering Experiments*

DLS experiments were conducted according to a protocol described in [34,35]. For the DLS analysis of SMT aggregation, a protein sample in a buffer containing 0.6 M KCl, 30 mM KH2PO4, 1 mM DTT, 0.1 M NaNO3, pH 7.0, at an initial concentration of 1 mg/mL, was transferred into a solution of 0.15 M glycine–KOH, pH 7.0–7.5, by gradual dilution to a final concentration of 0.1 mg/mL to decrease ionic strength. Further steps were as in [35]. The collected autocorrelation functions were converted into particle-size distributions using the general-purpose algorithm provided with the ZS Zetasizer Nano (Malvern Instruments Ltd., Malvern, UK) used in this experiment. Particle-size distributions obtained from alternative inversion algorithms yielded comparable results. The dynamic viscosity of the protein solutions, determined using an SV-10 Sine-wave Vibro Viscometer (A&D Company Ltd., Tokyo, Japan) at 25 °C, was 0.92 cP. This value was taken into consideration when measuring the particle dimensions in SMT samples collected 60 min after the dialysis. The analyzed scattering volume, with a beam cross-section of ~100 μm and accounting for the protein concentration used, contained about 10 billion SMT protein molecules, the signal from which was measured. The correlation function signal was accumulated over 15 cycles of 15 s each (as described in [34,35]). The results were obtained from three independent experiments.
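DLS sizing relies on the Stokes-Einstein relation, which links the measured translational diffusion coefficient to a hydrodynamic radius and explains why the solution viscosity has to be taken into account. The sketch below is only an illustration of that relation, not the Zetasizer algorithm: the function name and the example diffusion coefficient are hypothetical, whereas the viscosity (0.92 cP) and temperature (25 °C) are the values stated above.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # exact SI value of the Boltzmann constant


def hydrodynamic_radius_nm(diffusion_m2_s, temperature_k=298.15, viscosity_cp=0.92):
    """Stokes-Einstein radius (nm): R_h = k_B * T / (6 * pi * eta * D)."""
    if diffusion_m2_s <= 0:
        raise ValueError("diffusion coefficient must be positive")
    viscosity_pa_s = viscosity_cp * 1e-3  # 1 cP = 1 mPa*s
    radius_m = BOLTZMANN_J_PER_K * temperature_k / (
        6 * math.pi * viscosity_pa_s * diffusion_m2_s
    )
    return radius_m * 1e9


# Hypothetical diffusion coefficient, chosen only to exercise the function.
example_d = 4.0e-12  # m^2/s
print(f"R_h = {hydrodynamic_radius_nm(example_d):.1f} nm")
```

Because the radius scales inversely with viscosity, using the value for pure water instead of the measured 0.92 cP would shift every reported size by the same factor.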

#### *5.5. Transmission Electron Microscopy*

A drop of aggregated protein suspension at a concentration of 0.1 mg/mL was applied to a carbon-coated collodion film (2% collodion solution in amyl acetate (Sigma-Aldrich, St. Louis, MO, USA)) on a copper grid (Sigma-Aldrich, St. Louis, MO, USA) and negatively stained with 2% aqueous uranyl acetate (SPI-Chem., West Chester, PA, USA). Samples were examined under a JEM-100B electron microscope (JEOL Ltd., Tokyo, Japan). Samples obtained from five independent protein isolations were analyzed; many different fields of view were analyzed.

#### *5.6. Atomic-Force Microscopy*

For AFM measurements, titin aggregates were attached to freshly cleaved mica. An aliquot of titin (10–20 μL) was pipetted onto the mica surface and incubated at room temperature for 10 min. Unbound protein was washed away by extensive rinsing with distilled water, then by blowing gently with a stream of high-purity N2 gas. The noncontact mode (alternating current or AC mode) AFM images of titin aggregates bound to the mica surface were acquired with a Cypher ES AFM instrument (Asylum Research, Santa Barbara, CA, USA). Scanning was performed at high set-point values (0.8–1.2 V) to avoid the binding of the sample to the cantilever tip. Silicon nitride cantilevers (Olympus) were used for scanning in air (AC160TS, resonance frequency~300 kHz). At a typical scanning frequency of 0.7–1.4 Hz, we collected 512 × 512 pixel or 1024 × 1024 pixel height-, amplitude-, and phase-contrast images.

To prepare samples of SMTHMW aggregates for AFM, 2 μL of the protein was transferred to freshly cleaved mica and incubated for 5 min. The sample was then washed three times for 30 s in a drop of distilled water deionized by a type I Milli-Q system and dried in the air. AFM imaging was performed using an AFM Ntegra-Vita microscope (NT-MDT, Russia) in noncontact (tapping) mode in air. The typical scan rate was 0.5–1 Hz. Measurements were carried out using NSG03 cantilevers with a resonance frequency of 47–150 kHz and a guaranteed tip curvature radius of 10 nm. The processing and presentation of AFM images were performed using Nova software 1.0.26 (NT-MDT, Moscow Region, Russia) and Gwyddion 2.4450 software (http://gwyddion.net/download-old.php accessed on 17 April 2018). Experiments were replicated by independently executing the process of protein preparation, incubation at 37 °C, and sample analysis three times. AFM images of a buffer solution (0.15 M glycine–KOH, pH 7.0–7.5) containing no titin are presented in Supplementary File S4. Samples obtained from three independent protein isolations were analyzed; many different fields of view were analyzed.

#### *5.7. Circular Dichroism*

SMTHMW was dialyzed for 24 h against a buffer containing 0.15 M glycine–KOH, pH 7.0–7.5. The CD spectra prior to and after SMTHMW aggregation were recorded in a Jasco J-815 spectrometer (JASCO Inc., Tokyo, Japan) using 0.1 cm optical path-quartz cells and wavelengths of 250–190 nm.

Three repeats of the spectrum were taken for each investigated sample. Data processing and graphical representation were performed in the SigmaPlot program. Based on the absorption spectrum, the exact protein concentration was calculated using the formula: *C* = ABS<sub>280</sub>/(*l* · *K*<sub>e</sub>), where ABS<sub>280</sub> is the absorption value at a wavelength of 280 nm, *l* is the optical path length in cm, and *K*<sub>e</sub> is the extinction coefficient.
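The concentration formula above is a direct application of the Beer-Lambert law. A minimal sketch follows; the function name and the example absorbance are hypothetical, and the default extinction coefficient of 1.37 is the value for titin given in Section 5.1, assumed here to refer to a 1 mg/mL solution in a 1 cm cell.

```python
def protein_concentration_mg_ml(abs_280, path_cm, ext_coeff=1.37):
    """Concentration (mg/mL) from A280: C = ABS280 / (l * Ke)."""
    if path_cm <= 0 or ext_coeff <= 0:
        raise ValueError("path length and extinction coefficient must be positive")
    return abs_280 / (path_cm * ext_coeff)


# Hypothetical reading in a 0.1 cm cell (the path length used for CD).
print(protein_concentration_mg_ml(abs_280=0.0274, path_cm=0.1))
```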

Three spectra in the far UV region obtained for the investigated sample were averaged and smoothed in the spectropolarimeter software (Spectra Manager Version 2, Spectra Analysis Version 2.02.06 (Build 1) Spectra Analysis Jasco). A similar procedure was done for the three spectra obtained for the buffer solution. The averaged spectrum of the buffer solution was subtracted from the obtained averaged spectrum of the investigated sample. The value of molar ellipticity [Θ] was calculated by the formula:

$$[\Theta]_{\lambda} = \frac{\Theta_{\lambda} \cdot \text{RMW}}{l \cdot c}$$

where Θ<sub>λ</sub> is the measured value of ellipticity at the wavelength λ, millidegrees; RMW, the average molecular weight of the residue, calculated from the amino acid sequence; *l*, the optical path length, mm; *c*, protein concentration, mg/mL.
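As a worked illustration of the ellipticity formula, the sketch below converts a raw reading into molar ellipticity using the units defined above (millidegrees, mm, mg/mL). The function name and all numerical inputs are hypothetical; only the 1 mm path length corresponds to the 0.1 cm cells mentioned in this section.

```python
def molar_ellipticity(theta_mdeg, rmw, path_mm, conc_mg_ml):
    """[Theta] = Theta * RMW / (l * c), with l in mm and c in mg/mL."""
    if path_mm <= 0 or conc_mg_ml <= 0:
        raise ValueError("path length and concentration must be positive")
    return theta_mdeg * rmw / (path_mm * conc_mg_ml)


# Hypothetical inputs: -10 mdeg signal, residue weight 110 Da, 0.2 mg/mL protein.
value = molar_ellipticity(theta_mdeg=-10.0, rmw=110.0, path_mm=1.0, conc_mg_ml=0.2)
print(value)  # in deg*cm^2/dmol for these units
```

With the path length in mm and the concentration in mg/mL, no extra factor of 10 is needed, because 10 × *l* (cm) equals *l* (mm).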

The secondary structure was calculated using the CONTIN/LL module of the CDPro program [65]. The root-mean-square deviation (RMSD) according to the CDPro program did not exceed 6%.

#### *5.8. Fourier-Transform Infrared Spectroscopy*

Measurements were carried out on a Thermo Scientific Nicolet 6700 FT-IR spectrometer, equipped with the Smart Proteus accessory with a Peltier cuvette holder, in transmission mode in a cuvette of crystalline calcium fluoride with an optical pathlength of 4 μm, using a liquid nitrogen-cooled MCT detector. Spectra were scanned in the wavenumber range from 650 to 4000 cm<sup>−1</sup> at a resolution of 1 cm<sup>−1</sup>, with averaging over 256 spectra. The device was calibrated according to the manufacturer's instructions.

The IR spectra of solutions of titin preparations in the corresponding buffer and the spectra of the buffer itself were measured at 20 °C. The concentration of the protein was 10 mg/mL. The optical path length of the CaF2 cuvette was calculated for each measurement based on the optical density of the test sample at 3404 cm<sup>−1</sup>, using the water absorption value at an optical pathlength of 1 μm equal to 0.533 AU, adjusted for the protein concentration in the sample [66]. The optical pathlength of the cuvette was 4.52 ± 0.04 μm. The IR spectrum of the protein preparation was measured twice; the buffer spectrum was also registered twice. The IR spectrum of the buffer (0.6 M KCl, 30 mM KH2PO4, 1 mM DTT, 0.1 M NaNO3, pH 7.0, for molecular SMT; and 0.15 M glycine–KOH, pH 7.0–7.5, for the aggregated form of the protein) was subtracted from each protein spectrum, taking into account the difference in the values of the optical path length in the measurements. Each difference spectrum was adjusted for the spectral contribution of water vapor and CO2, followed by the analysis in the wavenumber range of 1725 to 1481 cm<sup>−1</sup> for the content of secondary structure elements in the protein, following the principles described in [40]. A sample obtained from three different isolations of the protein was used. The obtained estimates of secondary structure elements in the protein were averaged over the results of two measurements. The standard deviations of the values of the secondary structure elements in the protein are given.

#### *5.9. Fluorescence Analysis with Thioflavin T*

The amyloid nature of the SMTHMW aggregates was estimated by the intensity of thioflavin T (ThT) fluorescence (1 ThT:5 SMT (*w*/*w*)). Fluorescence was measured at λex = 440 nm and λem = 488 nm using a Cary Eclipse spectrophotometer (Varian, Palo Alto, CA, USA). Four independent series of measurements were carried out. The amyloid nature was assessed from the difference of fluorescence intensity between the non-aggregated and aggregated forms of the protein. This method was used as a control of the amyloid nature of titin aggregates after each isolation of the protein.

#### *5.10. X-ray Diffraction*

SMTHMW aggregates for X-ray diffraction analysis were prepared after a 1 h and 24 h incubation at 4 °C in an experimental solution. A sample obtained from three different isolations of the protein was used. Then the aggregates were concentrated to more than 10 mg/mL by centrifugation at 12,000 rpm for 60 min. Droplets of this preparation were placed between the ends of wax-coated glass capillaries (approximately 1 mm in diameter) separated by a gap of approximately 1.5 mm. Fiber diffraction images were collected using a Microstar X-ray generator with HELIOX optics equipped with a Platinum135 CCD detector (X8 Proteum system, Bruker AXS) at the Institute of Protein Research, Russian Academy of Sciences, Pushchino. Cu Kα radiation, λ = 1.54 Å (1 Å = 0.1 nm), was used. The samples were positioned at a right angle to the X-ray beam using a four-axis kappa goniometer. Different exposures and different oscillation angles were used; the sample itself was irradiated in different orientations.
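The cross-β spacings discussed in this article (4.7 Å between β-strands and 10 Å between β-sheets, Figure 9) relate to diffraction angles through Bragg's law, 2*d* sin θ = *n*λ. The following sketch, whose function names are hypothetical, shows where such reflections would be expected for the Cu Kα wavelength stated above; it does not describe the processing actually applied to the diffraction images.

```python
import math

CU_K_ALPHA_A = 1.54  # wavelength in angstroms, as stated in the text


def bragg_two_theta_deg(d_spacing_a, wavelength_a=CU_K_ALPHA_A, order=1):
    """Scattering angle 2*theta (degrees) for a given d-spacing (angstroms)."""
    sin_theta = order * wavelength_a / (2.0 * d_spacing_a)
    if not 0.0 < sin_theta <= 1.0:
        raise ValueError("no Bragg reflection for this spacing and wavelength")
    return 2.0 * math.degrees(math.asin(sin_theta))


def d_spacing_a(two_theta_deg, wavelength_a=CU_K_ALPHA_A, order=1):
    """Inverse relation: d-spacing (angstroms) from the scattering angle 2*theta."""
    return order * wavelength_a / (2.0 * math.sin(math.radians(two_theta_deg / 2.0)))


for d in (4.7, 10.0):
    print(f"d = {d} A -> 2theta = {bragg_two_theta_deg(d):.2f} deg")
```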

#### *5.11. Small Angle X-ray Scattering*

SAXS experiments were carried out on a small-angle camera of the Photon Factory (Tsukuba, Japan). The protein solution in a thermostatted cuvette with mica windows was irradiated by X-rays of a wavelength of 1.503 Å at 23 °C. The distance between the sample and the detector was 2.35 m. The range of the measured scattering vectors was *Q* = 0.008–0.2 Å<sup>−1</sup> (*Q* = 4π sin θ/λ, where λ is the X-ray radiation wavelength and 2θ is the scattering angle). X-ray scattering was detected by a PILATUS 100K two-dimensional X-ray detector. The shape of the particles was estimated from the slope of the dependence of log *I* on log *Q*, where *I* is the scattering intensity and *Q* is the scattering vector modulus [67]. Log *I*–log *Q* dependences in SAXS data were approximated by linear regression. Correlation coefficients (*R*<sup>2</sup>) ranged from 0.92 to 0.99.
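The slope analysis described above can be sketched as an ordinary least-squares fit in log-log coordinates. The code below is an illustration under stated assumptions: the function names are hypothetical, the intensity curve is synthetic (a pure power law, not measured data), and only the wavelength and the *Q* range are taken from the text.

```python
import math


def scattering_vector(two_theta_deg, wavelength_a=1.503):
    """Q = 4*pi*sin(theta)/lambda, where 2*theta is the scattering angle."""
    return 4.0 * math.pi * math.sin(math.radians(two_theta_deg) / 2.0) / wavelength_a


def loglog_fit(q_values, intensities):
    """Least-squares line through (log10 Q, log10 I); returns slope, intercept, R^2."""
    xs = [math.log10(q) for q in q_values]
    ys = [math.log10(i) for i in intensities]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic power-law curve I = 5 * Q^-2 over the Q range given in the text.
q = [0.008 + 0.004 * k for k in range(49)]  # 0.008 ... 0.2 inverse angstroms
i = [5.0 * value ** -2.0 for value in q]
slope, intercept, r2 = loglog_fit(q, i)
print(f"slope = {slope:.3f}, R^2 = {r2:.4f}")
```

For real data the fit would normally be restricted to the *Q* interval in which the log-log dependence is linear, which is consistent with the reported *R*<sup>2</sup> values being below 1.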

#### *5.12. Differential Scanning Calorimetry*

DSC measurements were made on a SCAL-1 precision scanning microcalorimeter (Scal Co. Ltd., Pushchino, Russia) with 0.33 mL glass cells at a scanning rate of 1 K/min and under a pressure of 2.5 atm [68]. The experiments were performed in 0.15 M glycine–KOH at pH 7.0–7.5. The protein concentrations were 1.2 mg/mL. The experimental calorimetric traces were adjusted for the calorimetric baseline, and the molar partial heat capacity functions were calculated in a standard manner. The excess heat capacity was evaluated by the subtraction of the linearly extrapolated initial and final heat capacity functions with correction for the difference of these functions by using a sigmoidal baseline [69]. DSC experiments were carried out twice for both the monomeric and aggregated SMTHMW forms. The obtained curves coincided in temperature *T*<sub>m</sub> to an accuracy of 0.1 K. The relative error in determining the calorimetric enthalpy, Δ*H*<sub>cal</sub>, did not exceed 10%.

#### *5.13. Calculation of the Identity of the Amino Acid Sequence and Disordered Regions in the SMTHMW Molecule*

The SMTHMW amino-acid sequence identity was calculated by the BLAST program. The data were retrieved from the UniProtKB databases: UniProtKB—A6BM71\_CHICK.
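To make the identity criterion concrete, the sketch below counts identical positions in a pair of already aligned sequences and compares the result with the 40% threshold discussed in this article. It is not the BLAST procedure used here (BLAST computes its own local alignment); the function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Identical positions as a percentage of aligned columns (gap-gap columns ignored)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == gap and b == gap)]
    if not columns:
        raise ValueError("alignment has no informative columns")
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


def below_identity_threshold(identity_percent, threshold=40.0):
    """True when a domain pair falls below the identity threshold cited in the text."""
    return identity_percent < threshold


# Toy aligned fragments, not real titin domains.
identity = percent_identity("ACDEFGHIKL", "ACDQFG-IKV")
print(identity, below_identity_threshold(identity))
```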

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms24021056/s1.

**Author Contributions:** A.G.B. conceived and designed the experiments; A.G.B., E.I.Y., L.G.B., N.V.M., A.A.T., M.A.T., H.K., A.D.N., A.G.G., T.N.M., N.V.P., M.Y.L., A.S.K., Z.M., O.V.G. and I.M.V. performed the experiments; A.G.B., E.I.Y., L.G.B., N.V.M., A.A.T., M.A.T., T.N.M., N.V.P., A.S.K., M.K., O.V.G. and I.M.V. analyzed the data; A.G.B., M.K. and I.M.V. wrote the manuscript; A.G.B., E.I.Y., N.V.M., A.A.T., M.A.T., T.N.M. and A.S.K. prepared the figures; A.G.B., M.K., O.V.G. and I.M.V. reviewed the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Russian Science Foundation, grant number 22-24-00805, by the Hungarian National Research, Development and Innovation Office (K135360 to M.K.) and by grants from the Hungarian National Research, Development and Innovation Office: FK128956 and 2019-2.1.7-ERA-NET-2020-00013 to Z.M.

**Institutional Review Board Statement:** The manuscript does not contain clinical studies or patient data.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All data generated or analyzed in the course of this research (including files of additional information) were incorporated into the article and supplementary files.

**Acknowledgments:** The research used the equipment of the ITEB RAS Shared Facilities Centre "Structural and Functional Studies of Biosystems" (https://www.ckp-rf.ru/catalog/ckp/3037/ accessed on 29 December 2022), Electron Microscopy Core Facilities of the Pushchino Biological Research Centre (https://www.ckp-rf.ru/catalog/ckp/670266/ accessed on 29 December 2022).

**Conflicts of Interest:** The authors declare that there is no conflict of interest regarding this work.

#### **Abbreviations**

SMTHMW, high-molecular-weight isoform of smooth-muscle titin (~1500 kDa); SMTLMW, low-molecular-weight isoform of smooth-muscle titin (~500 kDa); Ig, titin immunoglobulin-like domain; FnIII, titin fibronectin-like type III domains; CD, circular dichroism; FTIR, Fourier-transform infrared spectroscopy; DLS, dynamic light scattering; DTT, dithiothreitol; SDS–PAGE, sodium dodecyl sulfate–polyacrylamide gel electrophoresis; ThT, thioflavin T; TEM, transmission electron microscopy; AFM, atomic-force microscopy; SAXS, small-angle X-ray scattering; DSC, differential scanning calorimetry.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

## *Article* **Genome-Wide Identification of the** *SUN* **Gene Family in Melon (***Cucumis melo***) and Functional Characterization of Two** *CmSUN* **Genes in Regulating Fruit Shape Variation**

**Ming Ma, Suya Liu, Zhiwei Wang, Ran Shao, Jianrong Ye, Wei Yan, Hailing Lv, Agula Hasi \* and Gen Che \***

Key Laboratory of Herbage & Endemic Crop Biology, Ministry of Education, School of Life Sciences, Inner Mongolia University, Hohhot 010070, China

**\*** Correspondence: hasiagula@imu.edu.cn (A.H.); chegen@imu.edu.cn (G.C.)

**Abstract:** Melon (*Cucumis melo*) is an important economic crop cultivated worldwide. A unique *SUN* gene family plays a crucial role in regulating plant growth and fruit development, but many *SUN* family genes and their functions have not been well-characterized in melon. In the present study, we performed genome-wide identification and bioinformatics analysis and identified 24 *CmSUN* family genes that contain an intact and conserved IQ67 domain in the melon genome. Transcriptome data analysis and qRT-PCR results showed that most *CmSUN*s are specifically enriched in melon reproductive organs, such as young flowers and ovaries. Through genetic transformation in melons, we found that overexpression of *CmSUN23-24* and *CmSUN25-26-27c* led to an increased fruit shape index, suggesting that they act as essential regulators in melon fruit shape variation. Subcellular localization revealed that the CmSUN23-24 protein is located in the cytoplasmic membrane. A direct interaction between CmSUN23-24 and a Calmodulin protein CmCaM5 was found by yeast two-hybrid assay, which indicated their participation in the calcium signal transduction pathway in regulating plant growth. These findings revealed the molecular characteristics, expression profile, and functional pattern of the *CmSUN* genes, and may provide the theoretical basis for the genetic improvement of melon fruit breeding.

**Keywords:** melon; CmSUN; IQ67 domain; expression analysis; overexpression phenotype; fruit shape regulation; protein interaction

#### **1. Introduction**

Melon (*Cucumis melo* L.) is a globally cultivated horticulture crop, bearing sweet and pleasant fruit with high nutritional value [1–3]. The long-term domestication process and wide distribution have produced multiple melon cultivars with diversified fruit traits, especially the fruit shape and size [4]. Fruit shape index (FSI) refers to the ratio of the fruit's longitudinal diameter to the transverse diameter, and is an important agronomic trait that affects melon consumption [5]. Melon fruit shape can vary between round, oval, and long, and FSI has always been used as a fruit quality screening standard to classify different market groups and satisfy consumer preferences [6]. In melon, the molecular mechanism underlying the outstanding diversity of FSI needs to be further studied [3]. The establishment of fruit shape is mediated by cell differentiation and cell expansion, which are driven by the activity of multiple endogenous genes [7]. Considerable research on tomato has found that the *SUN* gene is one of the pivotal regulators in fruit shape regulation [8,9]. Overexpression of *SlSUN*/*SlIQD12* leads to long slender fruits with an increased number of vertical cells and a reduced number of horizontal cells, and *SlSUNs* regulate the expression of genes involved in cell division, cell wall development, and the auxin pathway [10,11].

*SUN* genes encode IQD proteins, plant-specific calmodulin-binding proteins characterized by the IQ67 domain, a region of 67 amino acid residues that contains up to three copies of the IQ motif separated in precise spacing by short sequences of 11 and 15 amino acid residues. The IQ67 domain usually contains 1~3 core 'IQ' motifs, 1~3 '1-8-14' motifs, and 1~4 '1-5-10' motifs [12,13]. The conserved motifs can bind to multifunctional calmodulin/calmodulin-like (CaM/CmL) proteins [14,15]. CaM/CmL transmits signals to downstream IQD receptor proteins through different forms of interaction with calcium signals [16–22]. In *Arabidopsis thaliana*, the *SUN* family genes *AtIQD11*, *AtIQD14,* and *AtIQD16* have also been shown to regulate cell morphology by elongating or distorting cell shape [23].

**Citation:** Ma, M.; Liu, S.; Wang, Z.; Shao, R.; Ye, J.; Yan, W.; Lv, H.; Hasi, A.; Che, G. Genome-Wide Identification of the *SUN* Gene Family in Melon (*Cucumis melo*) and Functional Characterization of Two *CmSUN* Genes in Regulating Fruit Shape Variation. *Int. J. Mol. Sci.* **2022**, *23*, 16047. https://doi.org/10.3390/ijms232416047

Academic Editor: Wajid Zaman

Received: 4 November 2022 Accepted: 10 December 2022 Published: 16 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

*SUN* family genes participate in many aspects of plant development. The *Arabidopsis AtIQD1* gene was the first to be discovered [12], and can activate the expression of glucosinolate metabolic pathway-related genes, thus increasing the glucosinolate content and improving disease resistance [13]. Knockout of *AtIQD5* leads to variations in cell size, the cytoskeleton, and microtubule stability [24]. *AtIQD18* is mediated by auxin signaling, and its overexpression results in a loose cell morphology phenotype [25]. The role of IQD proteins in microtubules may be mediated by a domain of unknown function, DUF4005 [26]. *SUN* genes are thus involved in regulating cell shape and cell arrangement, which are closely related to plant development and the fruit shape index.

In recent years, the *SUN* family genes have been identified in many species. There are 33 *SUN* members in *Arabidopsis* [12], 29 members in *Oryza sativa* [12], 33 members in *Solanum lycopersicum* [27], 67 members in soybean [28], 26 members in *Zea mays* [29], and 22 members in *Cucumis sativus* L. [30]. *CsSUN*, a homologous gene of tomato *SlSUN*/*SlIQD12*, was the candidate gene for the cucumber fruit shape QTL *FS1.2*, and its expression was lower in round-fruit cucumber than in long-fruit cucumber [31]. In watermelon, a 159 bp deletion in *ClSUN25-26-27a* underlies the difference between elongated fruit and spherical fruit [32,33]. Pan and Monforte reported 21 and 24 melon *SUN* members in previous studies, respectively [30,34]. In *Oryza sativa*, *OsIQD14* can control grain shape by regulating cell shape [35]. In other plants, such as soybean, maize, and *Arabidopsis*, IQD gene expression can be positively or negatively regulated by hormone treatments and abiotic stress [28,29,36].

In this study, we performed genome-wide identification and bioinformatics exploration of melon *SUN* family genes. The *CmSUN*s expression profile was analyzed by high-throughput sequencing data and verified by qRT-PCR. In the genome-wide association analysis of 297 melon germplasm, *CmSUN23-24* and *CmSUN25-26-27c* were presumed to be underlying the key loci of melon fruit shape [37]. Performing functional analyses through genetic transformation in melon, we found that overexpression of *CmSUN23-24* and *CmSUN25-26-27c* resulted in fruit shape variation. Furthermore, we identified a CaM protein that can directly interact with CmSUN23-24, suggesting that they contribute to fruit shape regulation through a regulatory gene network in melon.

#### **2. Results**

#### *2.1. CmSUN Family Gene Identification and Sequence Analyses*

A total of 24 CmSUN members were identified by searching the Hidden Markov model PF00612 of the IQ motifs and blasting probe sequences in the melon protein database, including the 21 CmSUNs consistent with the previous study [30,34]. In the CmSUN family, CmSUN9-10a is the shortest protein, consisting of 261 amino acids (aa), and the longest protein, CmSUN32b, is 845 aa in length. Physical and chemical analyses showed that the molecular weights of the corresponding proteins range from 29.9 to 93.4 kilodaltons (kDa), and the theoretical isoelectric points range from 5.59 (CmSUN32b) to 10.87 (CmSUN13-14a) (Table 1). The average theoretical isoelectric point of the 24 CmSUN proteins was 9.94, indicating that they contain many basic amino acids.

The 24 *CmSUN*s were unevenly distributed on 12 chromosomes (Figure 1). The gene structure analyses showed that all *CmSUN*s contained exons and introns. The number of exons in *CmSUN*s varies from 2 to 6. *CmSUN19b*, *19c*, *25-26-27a*, and *25-26-27b* have three exons of similar lengths arranged in a 'medium–short–long' pattern; *CmSUN1-2a*, *1-2b*, *3*, *4*, *6*, *7-8*, *11*, *13-14a*, *13-14b*, and *17-18a* have five exons of similar lengths arranged in a 'short–medium–medium–short–long' pattern; *CmSUN5*, *30-31*, and *32b* have six exons of similar lengths arranged in a 'short–medium–medium–short–long–short' pattern (Figure 2). The strong similarity within each of the three groups of *CmSUN* genes may indicate conservation of their evolution and mode of function.


**Table 1.** Information and physicochemical properties of 24 *CmSUN* members.

**Figure 1.** Distribution of 24 *CmSUN* genes on melon chromosomes.

By analyzing the CmSUN protein domains, we found that arginine (R), isoleucine (I), glutamine (Q), and leucine (L) sites are highly conserved in different species (Figure 3A). Multiple sequence alignments of CmSUNs showed that all of the proteins contain a typical IQ67 conserved domain with 1~3 core 'IQ' motifs (IQXXXRGXXXR), or the less-restricted motifs ([ILV]QXXXRXXXR[RK]), 1~3 '1-8-14' motifs ([FILVW]XXXXXX[FAILVW]XXXXX[FILVW]), and 1~4 '1-5-10' motifs ([FILVW]XXX[FILV]XXXX[FILVW]). The three IQ motifs are precisely separated by 11 and 15 amino acids [12] (Figure 3B).
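The motif definitions above translate directly into regular expressions. The following Python sketch is only an illustration, assuming that each X stands for any single residue and that overlapping occurrences should be reported; the function name `find_motifs` and the toy sequences are hypothetical and are not part of the original analysis.

```python
import re

# Motif patterns transcribed from the text; "X" is assumed to mean any residue.
MOTIFS = {
    "IQ_core": r"IQ.{3}RG.{3}R",
    "IQ_relaxed": r"[ILV]Q.{3}R.{3}R[RK]",
    "1-8-14": r"[FILVW].{6}[FAILVW].{5}[FILVW]",
    "1-5-10": r"[FILVW].{3}[FILV].{4}[FILVW]",
}


def find_motifs(seq):
    """Return the 0-based start positions of each motif class in a protein sequence."""
    seq = seq.upper()
    hits = {}
    for name, pattern in MOTIFS.items():
        # A zero-width lookahead lets overlapping occurrences be reported.
        hits[name] = [m.start() for m in re.finditer(f"(?=(?:{pattern}))", seq)]
    return hits


if __name__ == "__main__":
    # Hypothetical toy sequence containing one core IQ motif.
    for motif, starts in find_motifs("MMIQAAARGAAARMM").items():
        print(motif, starts)
```

A scan of this kind only flags candidate motifs; the domain integrity described in the text was assessed by multiple sequence alignment.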

**Figure 2.** Gene structure of 24 *CmSUN* genes. The pink boxes represent exons, the black lines represent introns, and the blue boxes represent UTRs.

**Figure 3.** CmSUN protein sequence analyses. (**A**) The logos show the conserved sites in the IQ67 domain of CmSUN sequences. (**B**) Multiple sequence alignment of 24 CmSUN proteins. The red lines represent the 'IQ' core motif; the blue lines represent the '1-8-14' motif; the green lines represent the '1-5-10' motif; '11 aa' and '15 aa' represent the amino acid intervals between the three 'IQ' motifs.

#### *2.2. Phylogenetic Tree Analysis of SUN Family Proteins*

The phylogenetic tree was constructed by aligning the 33 AtSUNs in *Arabidopsis* [12], 22 CsSUNs in *Cucumis sativus* [30], 33 SlSUNs in *Solanum lycopersicum* [27], and 24 CmSUNs in *Cucumis melo*. The SUN family members can be divided into five clades, I~V (Figure 4). CmSUN25-26-27a, 25-26-27b, and 25-26-27c cluster in the same clade as the fruit shape regulator CsSUN25-26-27a [31]. CmSUN23-24 and CmSUN25-26-27c are homologous to CsSUN23-24 and CsSUN25-26-27c, which also participate in regulating fruit shape in cucumber [38]. CmSUN11 is homologous to the cell shape-related AtSUN11; CmSUN1-2a and 1-2b cluster in the same clade as AtSUN1, the founding SUN member in *Arabidopsis*.

**Figure 4.** Phylogenetic tree analyses of SUN proteins from *Arabidopsis*, *Solanum lycopersicum*, *Cucumis sativus*, and *Cucumis melo*. Blue represents AtSUNs, green represents CsSUNs, yellow represents CmSUNs, and red represents SlSUNs. 'I~V' indicate the different subfamilies of SUN members in the four species. The asterisks indicate the functionally studied CmSUN genes.

#### *2.3. Expression Profiling of CmSUN Genes*

The heat map of *CmSUN* expression was generated from the transcriptomic analyses of different melon tissues (Figure 5A). Significant differences in relative expression among the ten tissues are indicated by letter notation. *CmSUN* transcripts accumulated to relatively high levels in melon stem, ovary, and flower. In the late developmental stages of the fruit, such as the growing stage (18 days after pollination, DAP), ripening stage (36 DAP), climacteric stage (determined according to breathalyzer), and post-climacteric stage (48 h after the climacteric stage), the *CmSUN*s displayed low expression levels. We further verified the expression patterns of *CmSUN* genes by performing qRT-PCR in different melon tissues. The qRT-PCR results were consistent with the transcriptome analyses (Figure 5B). *CmSUN25-26-27a* has the highest expression in female flowers and may have an active role in regulating gynoecium development; *CmSUN13-14a* and *CmSUN25-26-27c* are highly expressed in male flowers; *CmSUN1-2b*, *CmSUN3*, *CmSUN17-18a*, *CmSUN23-24*, and *CmSUN30-31* showed their highest expression in the ovary, suggesting a potential role in regulating fruit development.

**Figure 5.** Expression analysis of *CmSUN*s in melon. (**A**) Heat map of the expression of 24 *CmSUN* genes in different melon tissues. (**B**) Expression profiles of selected *CmSUN* genes in melon. Rt, S, L, FF, MF, O, G, R, C, and P represent root, stem, leaf, female flower, male flower, ovary, growing stage, ripening stage, climacteric stage, and post-climacteric stage, respectively. Three biological replicates and three technical replicates were performed for each qRT-PCR analysis. 'a–g' indicate significant differences among the tissues.

#### *2.4. Overexpression of CmSUN23-24 and CmSUN25-26-27c Resulted in Melon Fruit Shape Variation*

*CmSUN23-24* and *CmSUN25-26-27c* were proposed as candidate genes for a melon fruit shape QTL [37]. To verify their role in fruit shape regulation, we transformed melon with each gene driven by the 35S promoter. We obtained six *CmSUN25-26-27c*-overexpressing transgenic lines, and two lines with representative phenotypes were chosen for further characterization. Compared with wild-type (WT) plants, the vertical diameter of *CmSUN25-26-27c-Oe* mature fruits was significantly increased, while the horizontal diameter showed no difference (Figure 6A–C). Thus, the fruit shape index (FSI) was higher in the transgenic lines than in WT (Figure 6D). In *CmSUN25-26-27c-Oe* lines, the FSI was about 1.23 (Oe-L1, n > 20) and 1.21 (Oe-L2, n > 20), and the average FSI of 1.22 was 11.9% larger than that in WT (1.09, n > 100). We measured the vertical and horizontal diameters of the transgenic fruits at 1~7 DAP, the crucial period for determining melon fruit shape. The data showed similar fruit enlargement curves in the WT and transgenic lines (Figure 7A–C). *CmSUN25-26-27c-Oe* lines had a higher FSI than WT at early fruit elongation stages, particularly at 1 DAP and 3 DAP (Figure 7A–C). We then examined *CmSUN25-26-27c* expression in the ovaries at 1 DAP and found a twofold elevation in the overexpression lines (Figure 7D). In the ovaries at 3 DAP, *CmSUN25-26-27c* expression was increased about 7.2-fold and 3.3-fold in Oe-L1 and Oe-L2 compared to the WT, respectively (Figure 7E). A scatterplot analysis was performed to examine the relationship between FSI and *CmSUN25-26-27c* expression, and the coefficient of determination (R<sup>2</sup> = 0.76) indicated a significant correlation between gene expression and FSI (Figure 7F). In the transcriptomic analyses, *CmSUN25-26-27c* transcripts also accumulated to high levels in the male flower (Figure 7G). The phenotypic analyses suggest that *CmSUN25-26-27c* positively regulates melon fruit elongation and functions in the early developmental phase.
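The arithmetic behind the reported fruit shape index values can be checked with a short Python sketch. It assumes that the FSI is the ratio of vertical to horizontal diameter, which is how the index is used in this section; the function names are hypothetical.

```python
def fruit_shape_index(vertical_cm, horizontal_cm):
    """Fruit shape index, assumed here to be vertical diameter / horizontal diameter."""
    if horizontal_cm <= 0:
        raise ValueError("horizontal diameter must be positive")
    return vertical_cm / horizontal_cm


def percent_increase(value, reference):
    """Relative increase of `value` over `reference`, in percent."""
    return (value - reference) / reference * 100.0


if __name__ == "__main__":
    # Mean FSI values reported in the text for the two Oe lines and for WT.
    mean_oe = (1.23 + 1.21) / 2
    print(f"mean FSI of Oe lines: {mean_oe:.2f}")
    print(f"increase over WT: {percent_increase(mean_oe, 1.09):.1f}%")
```

Working from the rounded line means reproduces the 11.9% figure quoted above; per-fruit measurements would be needed to recompute the significance tests.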

**Figure 6.** Phenotypic analysis of *CmSUN25-26-27c*-overexpressing transgenic lines. (**A**) Phenotypic observation of the fruits in transgenic lines. (**B**) Fruit vertical diameter significantly increased in transgenic lines. (**C**) Fruit horizontal diameter comparison between transgenic lines and WT. (**D**) Statistical analyses of fruit shape index in transgenic lines and WT. Scale bars = 2 cm. Oe-L1 and Oe-L2 indicate *CmSUN25-26-27c*-overexpressing transgenic lines. \*\*\*\* *p* value < 0.0001.

**Figure 7.** Statistical data of *CmSUN25-26-27c*-overexpressing transgenic fruits. (**A**,**B**) Vertical and horizontal diameter comparison of the fruits at 1 DAP (day after pollination), 3 DAP, 5 DAP, and 7 DAP in transgenic lines and WT. (**C**) Statistical analyses of ovary shape index in transgenic lines and WT at 1~7 DAP. (**D**,**E**) qRT-PCR analyses of *CmSUN25-26-27c* expression in the ovary at 1 DAP and 3 DAP. Oe-L1 and Oe-L2 indicate *CmSUN25-26-27c*-overexpressing transgenic lines. (**F**) Scatterplot analysis showing the relationship between FSI and *CmSUN25-26-27c* expression. (**G**) Schematic model of relative expression of *CmSUN25-26-27c* in different melon tissues. \* *p* value < 0.05; \*\* *p* value < 0.01; \*\*\* *p* value < 0.001; \*\*\*\* *p* value < 0.0001.

In *CmSUN23-24-Oe* lines, an increased vertical diameter and an unchanged horizontal diameter resulted in a higher fruit shape index (Figure 8A–D). The average FSI (1.18, n > 40) of the mature fruits in *CmSUN23-24-Oe* lines was 8.3% larger than in WT (1.09, n > 100) (Figure 8D). *CmSUN23-24-Oe* lines also had increased vertical diameter and FSI at early developmental stages (Figure 9A–C). *CmSUN23-24* expression was dramatically increased at 1 DAP and 3 DAP in ovaries of the severe line (Figure 9D,E). *CmSUN23-24* expression showed a weak correlation (R<sup>2</sup> = 0.31) with the FSI (Figure 9F). The transcriptomic analyses in different melon tissues showed that *CmSUN23-24* transcripts accumulated to high levels in female flowers and the ovary (Figure 9G). The phenotype of *CmSUN23-24-Oe* lines is similar to that of *CmSUN25-26-27c-Oe* lines, indicating that the two genes function redundantly in mediating melon fruit size development.
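For readers who want to reproduce this kind of expression-versus-FSI comparison, the sketch below computes a coefficient of determination in plain Python. It assumes a simple linear relationship, in which case R<sup>2</sup> equals the squared Pearson correlation; the fitting procedure actually used for the scatterplots is not stated in this section, and the example values are invented for illustration only.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    return sxy ** 2 / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical relative expression levels and FSI values, not data from the study.
    expression = [1.0, 2.1, 3.2, 6.8, 7.4]
    fsi = [1.08, 1.12, 1.15, 1.17, 1.25]
    print(f"R^2 = {r_squared(expression, fsi):.2f}")
```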

**Figure 8.** Phenotypic analysis of *CmSUN23-24*-overexpressing transgenic lines. (**A**) Phenotypic observation of the fruits in transgenic lines. (**B**) Fruit vertical diameter significantly increased in transgenic lines. (**C**) Fruit horizontal diameter comparison between transgenic lines and WT. (**D**) Statistical analyses of fruit shape index in transgenic lines and WT. Scale bars = 2 cm. Oe-L1 and Oe-L2 indicate *CmSUN23-24*-overexpressing transgenic lines. \*\* *p* value < 0.01; \*\*\* *p* value < 0.001.

**Figure 9.** Statistical data of *CmSUN23-24*-overexpressing transgenic fruits. (**A**,**B**) Vertical and horizontal diameter comparison of the fruits at 1 DAP, 3 DAP, 5 DAP, and 7 DAP in transgenic lines and WT. (**C**) Statistical analyses of ovary shape index in transgenic lines and WT at 1~7 DAP. (**D**,**E**) qRT-PCR analyses of *CmSUN23-24* expression in the ovary at 1 DAP and 3 DAP. Oe-L1 and Oe-L2 indicate *CmSUN23-24*-overexpressing transgenic lines. (**F**) Scatterplot analysis showing the positive relationship between FSI and *CmSUN23-24* expression. (**G**) Schematic model of relative expression of *CmSUN23-24* in different melon tissues. \* *p* value < 0.05; \*\* *p* value < 0.01; \*\*\* *p* value < 0.001.

#### *2.5. Yeast Two-Hybrid and Subcellular Localization*

The transcriptional activation activity of the CmSUN23-24 and CmSUN25-26-27c proteins was examined. We found that neither CmSUN23-24 nor CmSUN25-26-27c showed transcriptional activation activity (Figure 10A). CmCaM5 and CmCmL11 are homologous to AtCaM1, which can interact with several IQD proteins. The Y2H results showed that CmSUN23-24 interacted directly with the CmCaM5 protein (Figure 10B). However, CmSUN25-26-27c did not interact with either CmCaM5 or CmCmL11. We performed subcellular localization in tobacco leaves and found that CmSUN23-24 was localized to the cell membrane (Figure 10C).

**Figure 10.** CmSUN23-24 interacts with CmCaM5. (**A**) Transcriptional activation of CmSUN23-24 and CmSUN25-26-27c protein. pGBKT7-53 + pGADT7-T and pGBKT7-lam + pGADT7-T served as positive and negative controls, respectively. The values 10<sup>0</sup>, 10<sup>1</sup> and 10<sup>2</sup> indicate different concentration solutions; -Trp/-His/-Ade indicates DO supplement; X-α-gal was a chromogenic substrate of yeast galactosidase. (**B**) Yeast two-hybrid assays were performed between CmSUN23-24, CmSUN25-26-27c, and CmCaM5. (**C**) Subcellular localization of CmSUN23-24 in tobacco leaves. Bright indicates light field; GFP indicates green fluorescence excitation field; mCherry indicates fluorescence excitation field; Merged indicates the overlapping of the three fields. Scale bar = 20 μm.

#### **3. Discussion**

*SUN* family genes participate in plant disease defense [13], stress resistance, plant development [28,29,36], cell arrangement, and cytoskeleton regulation [8,9,24,25]. *Arabidopsis* [12] and *Solanum lycopersicum* [27] each have about 30 members, and soybean has 67 members [28]. In previous studies, Monforte identified 24 CmSUNs in melon by blasting the protein sequences of known tomato SUN genes against the melon genome [34]; Pan reported 21 proteins containing both IQ67 and DUF4005 domains and identified CmSUNs by their homology with the *Arabidopsis* genes [30]. The integrity of the IQ67 domain is crucial for binding to the Ca<sup>2+</sup> receptors Calmodulin/Calmodulin-like (CaM/CmL) [39]. However, some SUN members were identified incompletely because of the low conservation of the IQ67 domain in sequence and quantity during evolution. In this study, we considered proteins containing an intact, typical conserved IQ67 domain with three IQ motifs precisely separated by 11 and 15 amino acid residues to be CmSUNs, following the same criteria used for *SUN* family gene identification in *Arabidopsis* and *Oryza sativa* [12]. We characterized 24 *CmSUN* members, including 21 genes consistent with previous work. Moreover, we found two other protein families, Myosin (7 members) and CaMTA (2 members), and several other proteins that contained IQ motifs. Notably, the nine genes of the Myosin and CaMTA families contain 1–3 IQ motifs but have distinct amino acid spacing patterns. These two gene families encode long proteins (744–1523 aa) with high molecular weights (84.7–172.8 kD) and a low average isoelectric point (7.6). CmSUNs are relatively short, have a high average isoelectric point (9.9), and contain a typical conserved IQ67 domain and a conserved exon-intron structure. During evolution, the first IQ motif has been the most conserved in sequence, whereas the second and third IQ motifs are more variable [12]. In our preliminary results, MELO3C005949.2.1 and MELO3C005636.2.1, which contain two IQ motifs with the specific intervals, were also annotated as IQD proteins. However, considering the absence of the first IQ motif and their low sequence conservation, we did not designate these two genes as *CmSUN* members. Taken together, our identification of *CmSUN* family members in melon provides a foundation for further characterization of their function and molecular mechanism.

Expression analyses in melon tissues showed that *CmSUN* transcripts mainly accumulated in the lateral organs and the young fruits. Most members tended to be highly expressed in flower and ovary, and their expression gradually declined in fruit at the late developmental stages. In cucumber, *CsSUN30-31* had a high expression level in various tissues and may be involved in many aspects of plant growth. The expression of *CsSUN*, *CsSUN17-18b*, and three other *SUN* genes was correlated with cucumber fruit development. *CsSUN23-24* and *CsSUN25-26-27c* expression was significantly higher in long fruit than in short fruit, while *CsSUN21a* and two other genes had much higher expression in short fruit [38], suggesting that they play opposite roles in regulating fruit elongation. In tomato, *SlSUN1* and *SlSUN28* showed high expression in growing fruits; *SlSUN5*, *SlSUN11*, *SlSUN12*, *SlSUN21*, *SlSUN22*, and *SlSUN27* were actively expressed during vegetative growth [27]. The expression pattern of *SlSUN*s in tomato was similar to that of *CmSUN*s in different melon tissues, which may imply a conserved function in horticultural plant development.

In the functional analyses, we found that overexpression of *CmSUN23-24* and *CmSUN25-26-27c* resulted in elongated fruits and an increased fruit shape index, suggesting that they have an important role in melon fruit shape regulation. In cucumber, *CsSUN* (*CsSUN25-26-27a*) is a positive regulator of fruit elongation, and this phenotype is consistent with its expression in fruits of different lengths [38]. In watermelon, *ClSUN25-26-27a* is a pivotal factor in determining round or elongated fruit shape [32,33]. In melon, *CmSUN25-26-27c* is homologous to *CsSUN25-26-27a*, *CsSUN25-26-27c*, and *ClSUN25-26-27a*. Overexpression of *CmSUN23-24* and *CmSUN25-26-27c* altered the fruit shape at the early ovary developmental stage. The increased expression level, higher fruit shape index, and the correlation between them in melon transgenic fruits indicate that *CmSUN23-24* and *CmSUN25-26-27c* mediate fruit shape index variation.

*SUN* family genes usually act as downstream targets in the calcium signal transduction pathway, which contributes to various physiological activities in plant development [40–43]. Most *CmSUN*s are rich in basic and hydrophobic amino acids, which favors their binding to the Ca<sup>2+</sup> receptors Calmodulin/Calmodulin-like (CaM/CmL) [23,35]. Using the yeast two-hybrid assay, we found that CmSUN23-24 can directly bind to CmCaM5. CmSUN23-24 localized to the cell membrane, where it may be regulated by the CaM and Ca<sup>2+</sup> binding complex in Ca<sup>2+</sup> signal transduction and thereby participate in melon plant development. In *Arabidopsis*, IQD1 localized to microtubules and interacted with the KLCR1 (kinesin light chain-related) protein or CaM2 to recruit them for cell division regulation [39]. AtIQD5 and AtIQD18 can also regulate cell size and cell shape by interacting with microtubulin and thereby affecting microtubule dynamics [24,25]. The association between IQD and microtubulin was mediated by a DUF4005 domain [26]. CmSUN23-24 and CmSUN25-26-27c both contain the DUF4005 domain. Therefore, we speculate that CmSUN23-24 positively regulates melon fruit elongation through the CmCaM5-dependent Ca<sup>2+</sup> signal transduction pathway and may positively regulate microtubules. Although CmSUN25-26-27c has potential CaM binding sites, there was no interaction between CmSUN25-26-27c and CmCaM5, implying that other CmCaMs/CmLs may work with CmSUN25-26-27c and participate in calcium signaling and microtubule regulation.

#### **4. Materials and Methods**

#### *4.1. Identification of SUN Gene Family Members*

The melon protein sequence file (CM3.6.1\_pep) was downloaded from CuGenDB (http://cucurbitgenomics.org/, accessed on 5 January 2022) [44], which is based on the melon genome reported by Garcia-Mas [45]. The Hidden Markov model file (PF00612) of the IQ motif was downloaded from the Pfam database (http://pfam.xfam.org/, accessed on 5 January 2022) [46]. The IQD protein sequences of *Arabidopsis*, *Solanum lycopersicum*, and *Cucumis sativus* were obtained from a previous study [12]. We searched for melon IQD proteins with HMMER (http://www.hmmer.org/, accessed on 6 January 2022) [47] using the Hidden Markov model (PF00612), and blasted the melon protein data in CuGenDB using IQD sequences from *Arabidopsis*, tomato, and cucumber as probes. Redundant sequences were removed, and the remaining sequences were checked on the Pfam (http://pfam.xfam.org/, accessed on 9 January 2022) and SMART (http://smart.embl-heidelberg.de/, accessed on 9 January 2022) [48] websites to verify the presence of the IQ motifs. The integrity of the IQ67 conserved domain was confirmed by multiple sequence alignment in MEGA7 [49]. Protein sequences with three IQ motifs separated by 11 and 15 amino acid residues were identified as CmSUNs.
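The final selection step, merging the HMM and BLAST candidates and keeping proteins whose three IQ motifs are separated by 11 and 15 residues, can be sketched in Python as follows. This is a simplified illustration, not the authors' pipeline: it assumes the strict core motif IQXXXRGXXXR for all three positions and counts the gap from the end of one motif to the start of the next, whereas the study verified domain integrity by alignment in MEGA7. The function names and identifiers in the example are hypothetical.

```python
import re

# Strict core IQ motif; the capturing group inside the lookahead records each match span.
IQ_PATTERN = re.compile(r"(?=(IQ.{3}RG.{3}R))")


def merge_hits(hmm_hits, blast_hits):
    """Combine candidate protein IDs from both searches, removing duplicates."""
    return sorted(set(hmm_hits) | set(blast_hits))


def iq_gaps(seq):
    """Number of residues between consecutive IQ motif matches (end of one to start of next)."""
    spans = [(m.start(1), m.end(1)) for m in IQ_PATTERN.finditer(seq.upper())]
    return [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]


def has_iq67_spacing(seq):
    """True if exactly three IQ motifs are found, separated by 11 and 15 residues."""
    return iq_gaps(seq) == [11, 15]


if __name__ == "__main__":
    motif = "IQAAARGAAAR"
    # Hypothetical toy protein with the expected 11 and 15 residue spacers.
    toy = "MM" + motif + "A" * 11 + motif + "A" * 15 + motif + "MM"
    print(merge_hits(["protA", "protB"], ["protB", "protC"]))
    print(iq_gaps(toy), has_iq67_spacing(toy))
```

Real family members often carry relaxed variants of the second and third motifs, so a filter this strict would miss them; a production version would need a broader pattern or an alignment-based check.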

#### *4.2. Sequence Analysis and Phylogenetic Tree Construction*

Open reading frame (ORF) length, molecular weight (MW), and theoretical isoelectric point (pI) of *CmSUN* members were analyzed with the ExPASy online website (http://web.expasy.org/protparam/, accessed on 12 January 2022) [50]. Location information of *CmSUN* members on the chromosomes was obtained from CuGenDB, and the chromosome map was drawn with MG2C (http://mg2c.iask.in/mg2c\_v2.0/, accessed on 16 January 2022) [51]. The CmSUN protein sequences were aligned with the Clustal W program in MEGA7 and visualized with Jalview [52]. Conserved sites of the CmSUN protein domain were predicted with the online software WEBLOGO (http://weblogo.berkeley.edu/, accessed on 18 January 2022) [53]. The *CmSUN* gene structure information (intron-exon) was obtained from the GFF file (CM3.6.1\_gene.gff3) in CuGenDB and visualized with GSDS (http://gsds.cbi.pku.edu.cn/, accessed on 20 January 2022) [54]. The neighbor-joining method in MEGA7 was used to construct a phylogenetic tree with 1000 bootstrap replicates. SUN members in *Arabidopsis*, *Cucumis sativus*, *Solanum lycopersicum*, and *Cucumis melo* were compared. The phylogenetic tree was visualized with the iTOL online software (https://itol.embl.de/, accessed on 28 January 2022) [55].

#### *4.3. Plant Materials and Growth Conditions*

The melon (*Cucumis melo* cv. Hetao) inbred lines used in this study were cultivated in Dengkou county in the Inner Mongolia region. All plant materials were provided by the Hasi lab at Inner Mongolia University, Hohhot, China. The melon seedlings were grown in an artificial climate chamber at 60% humidity with 16 h light (25 °C) and 8 h dark (18 °C). Female flowers were self-pollinated on the day of anthesis. The female flower, male flower, ovary, fruits from four different developmental stages, roots, stems, and leaves were collected for expression analyses. Samples from the young lateral root, the second or third section of the stem, the third to fifth true leaves, the unpollinated female flower and male flowers at anthesis, and the fruit mesocarp were frozen in liquid nitrogen and stored at −80 °C. The tissues were sampled from different plants at the same growth phase.

#### *4.4. Expression Analysis and Quantitative Real-Time PCR*

Total RNA was extracted from different melon tissues. Raw reads for the RNA-Seq data were taken from our previous work [56]. The expression levels of *CmSUN*s in the ten tissues and stages described above were obtained from statistical analyses of these data. The heat map was drawn using TBtools software [57] with default parameters.

The relative expression levels of *CmSUN23-24* and *CmSUN25-26-27c* in the transgenic melon were verified by qRT-PCR. Primers were designed with Primer 5 and are listed in Supplemental Table S1. RNA was reverse-transcribed into cDNA using the PrimeScript™ RT Reagent Kit and gDNA Eraser Kit (Takara Bio, Shiga, Japan). SYBR® Premix Ex Taq™ II (Takara Bio, Shiga, Japan) and a 96-well Chromo4 Real-Time PCR System were used for qRT-PCR, and the 2<sup>−ΔΔCT</sup> method was used to analyze expression levels. Three biological replicates and three technical replicates were performed for each gene.
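The relative quantification step can be written out as a short Python sketch of the standard 2<sup>−ΔΔCT</sup> calculation. The Ct values in the example are invented, the choice of reference gene is not specified in this section, and the function name is hypothetical.

```python
def relative_expression(ct_target_sample, ct_ref_sample,
                        ct_target_control, ct_ref_control):
    """Fold change of a target gene by the 2^(-ddCt) method.

    dCt  = Ct(target) - Ct(reference gene), computed per condition
    ddCt = dCt(sample) - dCt(control)
    """
    delta_ct_sample = ct_target_sample - ct_ref_sample
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_sample - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical mean Ct values for an overexpression line (sample) and WT (control).
    fold = relative_expression(22.0, 18.0, 24.0, 18.0)
    print(f"fold change relative to control: {fold:.1f}")
```

In practice the Ct values entering the formula would be means over the technical replicates, and the calculation would be repeated for each biological replicate.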

#### *4.5. Gene Cloning and Plant Transformation*

The full-length coding sequences of *CmSUN23-24* (MELO3C006884) and *CmSUN25-26-27c* (MELO3C013004) were amplified with gene-specific primers. The primers are listed in Supplemental Table S1. The restriction sites *Xba*I and *Bam*HI were used for constructing the overexpression vector. The target gene was recombined into the overexpression vector pCAMBIA-1305 driven by the 35S promoter. To obtain transgenic plants, the recombinant plasmids were diluted to 100 ng/μL with 2 × SSC solution (sodium citrate buffer, pH = 7.0) and ddH2O. Seven hours after artificial pollination, 10 μL of plasmid solution was injected into the melon fruits as described previously [58]. Fruits of the T1 generation transgenic plants were sampled and tested by PCR with a pair of specific primers designed on the pCAMBIA-1305 vector.

#### *4.6. Subcellular Localization*

The full-length coding sequence of *CmSUN23-24* without the termination codon was fused upstream of GFP in the vector pCAMBIA-1300 and transformed into *Agrobacterium* strain GV3101. The primers are listed in Supplemental Table S1. The lower epidermis of tobacco leaves was infiltrated for subcellular localization experiments. A confocal laser-scanning microscope was used to observe the fluorescence signals under 488 nm excitation.

#### *4.7. Yeast Two-Hybrid Assay*

The full-length coding sequences of *CmSUN23-24* and *CmSUN25-26-27c* were cloned into pGBKT7 vectors. The full-length coding sequences of *CmCaM5* (MELO3C014698) and *CmCmL11* (MELO3C006491) were cloned into the pGBKT7 and pGADT7 vectors, respectively. The pGBKT7 recombinant plasmids were transformed into the yeast strain AH109 (TaKaRa). SD/-Trp/-His/-Ade solid medium with 4 mg/mL X-α-gal and without X-α-gal was used for the activation test. Yeast suspensions at concentrations of 10<sup>0</sup>, 10<sup>−1</sup>, and 10<sup>−2</sup> were used for selecting interacting combinations. pGBKT7-53+pGADT7-T served as the positive control; pGBKT7-lam+pGADT7-T was used as the negative control. CmSUN23-24-pGBKT7 or CmSUN25-26-27c-pGBKT7 was co-transformed with CmCaM5-pGADT7 or CmCmL11-pGADT7.

#### **5. Conclusions**

We identified 24 *SUN* family genes in melon and performed a comprehensive analysis. All 24 CmSUN members contained a typical IQ67 domain with IQ motifs precisely separated by 11 and 15 amino acids. Most *CmSUN* transcripts accumulated in the melon stem, ovary, and flower. The biological function of *CmSUN23-24* and *CmSUN25-26-27c* in regulating melon fruit shape was revealed by gene transformation analysis. The protein interaction between CmSUN23-24 and CmCaM5 suggested their participation in the Ca<sup>2+</sup> signal transduction pathway. Future studies on the function and molecular mechanism of *CmSUN*s would be valuable for uncovering the gene regulation of fruit development in melon.

**Supplementary Materials:** The supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms232416047/s1.

**Author Contributions:** Conceptualization, A.H. and G.C.; data curation, J.Y., W.Y. and H.L.; formal analysis, M.M. and S.L.; funding acquisition, A.H. and G.C.; investigation, M.M.; methodology, S.L., Z.W. and R.S.; resources, A.H. and G.C.; supervision, A.H. and G.C.; validation, M.M. and Z.W.; visualization, M.M. and S.L.; writing—original draft, M.M.; writing—review & editing, M.M. and G.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by National Natural Science Foundation of China (31860563 and 32202513), Applied Technology Research and Development Foundation of Inner Mongolia Autonomous Region (2021PT0001), Natural Science Foundation of Inner Mongolia Autonomous Region (2021BS03002), and the Inner Mongolia University High-Level Talent Research Program (10000-21311201/056).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** All the data and plant materials related to this work can be obtained by contacting the corresponding author Gen Che (chegen@imu.edu.cn).

**Acknowledgments:** We thank all the colleagues in our laboratory for useful discussions and technical assistance. We are very grateful to the editor and reviewers for critically evaluating the manuscript and providing constructive comments for its improvement.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Screening Key Genes and Biological Pathways in Nasopharyngeal Carcinoma by Integrated Bioinformatics Analysis**

**Junhu Tai <sup>1</sup>, Jaehyung Park <sup>1</sup>, Munsoo Han <sup>1,2</sup> and Tae Hoon Kim <sup>1,2,</sup>\***


**Abstract:** The purpose of this study was to identify the hub genes and biological pathways of nasopharyngeal carcinoma (NPC) through bioinformatics analysis and to find potential new therapeutic targets. In this study, three datasets were downloaded from the Gene Expression Omnibus (GEO), and differentially expressed genes (DEGs) between NPC and normal tissues were analyzed using the GEO2R online tool. Volcano and heat maps of the DEGs were visualized using the hiplot database. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analyses of the upregulated and downregulated DEGs were performed using the DAVID database. Finally, we established a protein-protein interaction (PPI) network using the STRING database and showed the differential expression of hub genes between the normal and tumor tissues. In all, 109, 371, and 221 upregulated DEGs and 139, 226, and 520 downregulated DEGs were obtained in datasets GSE40290, GSE61218, and GSE53819, respectively, and 18 common differential genes, named co-DEGs, were screened in the three datasets. The most abundant biological GO terms of the co-DEGs included inflammatory response, among others. The KEGG pathway enrichment analysis showed that co-DEGs mainly participated in the interleukin (IL)-17 signaling pathway, among others. Finally, we identified four hub genes using PPI analysis and observed that three of them were highly expressed in tumor tissues. In this study, the hub genes of NPC, such as PTGS2, and pathways such as IL-17 signaling, were identified through bioinformatics analysis, which may be potential new therapeutic targets for NPC.

**Keywords:** nasopharyngeal carcinoma; bioinformatics; genes

#### **1. Introduction**

Nasopharyngeal carcinoma (NPC) is an aggressive head and neck cancer that forms in the tissues of the nasopharynx with high malignancy, which often occurs in the pharyngeal recess, and is relatively rare compared with other cancers [1]. In 2018, approximately 130,000 new NPC cases and 73,000 related deaths were reported in Southeast Asia [2]. The occurrence and development of NPC are related to various factors, including Epstein-Barr virus (EBV) infection [3]. EBV-associated NPC is extremely sensitive to radiation therapy, while squamous histological subtypes are much less sensitive. There are also significant differences in clinical manifestations and treatment responses between undifferentiated and squamous variants [4]. Concurrent chemoradiotherapy is one of the main treatment methods for NPC; however, the emergence of chemotherapy resistance and high incidence of adverse events limit its application [5].

Bioinformatics is an analytical method that uses mathematical, statistical, and computational methods to process and analyze biological data, which differs from traditional laboratory work [6]. For example, Song et al. analyzed the key genes of NPC using bioinformatics [7], and Yue et al. expounded on the differentially expressed genes (DEGs) in NPC tissues and their correlation with the recurrence and metastasis of NPC [8]. Although many studies have focused on identifying biomarkers related to NPC, a more comprehensive analysis is needed to explore better molecular targets to treat NPC and clarify its biological pathways.

**Citation:** Tai, J.; Park, J.; Han, M.; Kim, T.H. Screening Key Genes and Biological Pathways in Nasopharyngeal Carcinoma by Integrated Bioinformatics Analysis. *Int. J. Mol. Sci.* **2022**, *23*, 15701. https://doi.org/10.3390/ijms232415701

Academic Editors: Weg M. Ongkeko and Wajid Zaman

Received: 7 November 2022 Accepted: 9 December 2022 Published: 11 December 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The purpose of this study was to screen DEGs and co-DEGs in multiple datasets by analyzing NPC-related datasets in the Gene Expression Omnibus (GEO) database [9], and to conduct gene ontology (GO) terminology and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis. Finally, hub genes were obtained by constructing a protein-protein interaction (PPI) network, and the expression of hub genes in NPC was studied.

#### **2. Results**

#### *2.1. Identification of DEGs in the Three GEO Data Sets*

DEGs were defined as follows: genes with *p* < 0.05 and fold change > 2 were considered upregulated, and genes with *p* < 0.05 and fold change < −2 were considered downregulated. After online analysis using GEO2R, 109, 371, and 221 upregulated DEGs and 139, 226, and 520 downregulated DEGs were obtained in datasets GSE40290, GSE61218, and GSE53819, respectively. The corresponding volcano maps of GSE40290 (Figure 1a), GSE61218 (Figure 1b), and GSE53819 (Figure 1c) are shown. Heat maps of the top 20 DEGs in GSE40290 (Figure 2a), GSE61218 (Figure 2b), and GSE53819 (Figure 2c) were generated. By cross-analyzing the three datasets, eight co-upregulated genes (Figure 3a) and 10 co-downregulated genes (Figure 3b) were identified, and a scaled Venn diagram was used to visualize the overlap.
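The thresholding and cross-dataset overlap described above can be sketched in a few lines of Python. This is a minimal illustration rather than the GEO2R workflow itself: the function names, the placeholder gene tables, and the signed fold-change convention (downregulation reported as a fold change below −2) are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple


def classify_deg(p_value: float, fold_change: float,
                 p_cut: float = 0.05, fc_cut: float = 2.0) -> str:
    """Label one gene as 'up', 'down', or 'ns' (not significant).

    Assumes a signed fold change: positive when expression is higher in
    tumor tissue, negative when it is lower (an assumption of this sketch).
    """
    if p_value >= p_cut:
        return "ns"
    if fold_change > fc_cut:
        return "up"
    if fold_change < -fc_cut:
        return "down"
    return "ns"


def split_degs(table: Dict[str, Tuple[float, float]]) -> Tuple[Set[str], Set[str]]:
    """Split {gene: (p_value, fold_change)} into up- and downregulated sets."""
    up = {g for g, (p, fc) in table.items() if classify_deg(p, fc) == "up"}
    down = {g for g, (p, fc) in table.items() if classify_deg(p, fc) == "down"}
    return up, down


def co_degs(*gene_sets: Set[str]) -> Set[str]:
    """Genes shared by every dataset (the central region of a Venn diagram)."""
    return set.intersection(*gene_sets) if gene_sets else set()


# Hypothetical placeholder tables standing in for three GEO2R result files.
ds1 = {"GENE_A": (0.001, 3.2), "GENE_B": (0.010, -2.5), "GENE_C": (0.200, 4.0)}
ds2 = {"GENE_A": (0.004, 2.1), "GENE_B": (0.030, -3.0), "GENE_C": (0.001, 2.6)}
ds3 = {"GENE_A": (0.020, 5.0), "GENE_B": (0.001, -2.2), "GENE_D": (0.001, 2.9)}

ups, downs = zip(*(split_degs(d) for d in (ds1, ds2, ds3)))
co_up = co_degs(*ups)
co_down = co_degs(*downs)
```

A gene enters the co-DEG set only if it passes both cut-offs in every dataset, which is why the co-DEG lists are much shorter than the per-dataset lists.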

**Figure 1.** Volcano maps of DEGs in GSE40290 (**a**), GSE61218 (**b**), and GSE53819 (**c**).

#### *2.2. GO and KEGG Pathway Enrichment Analysis of Upregulated DEGs*

We used GO and KEGG pathway enrichment analyses to characterize the functional effects of each dataset and co-DEGs in the three datasets. The most abundant GO terms in the upregulated DEGs of GSE40290 included signal transduction, cell adhesion, extracellular matrix organization, and so on (Figure 4a). The most abundant GO terms in the upregulated DEGs of GSE61218 included cell division, signal transduction, cell cycle, and so on (Figure 4b). The most abundant GO terms in the upregulated DEGs of GSE53819 included signal transduction, positive regulation of transcription from RNA polymerase II promoter, cell adhesion, and so on (Figure 4c). The most abundant GO terms in the co-upregulated DEGs in the three datasets included inflammatory response, extracellular matrix organization, cell-cell signaling, and so on (Figure 4d).

KEGG analysis results showed that the upregulated DEGs of GSE40290 were significantly enriched in pathways in cancer, neuroactive ligand-receptor interaction, protein digestion and absorption, and so on (Figure 5a); upregulated DEGs of GSE61218 were significantly enriched in the cell cycle, pathways in cancer, PI3K-Akt signaling pathway, and so on (Figure 5b); upregulated DEGs of GSE53819 were significantly enriched in cytokine-cytokine receptor interaction, IL-17 signaling pathway, human papillomavirus infection, and so on (Figure 5c); and co-upregulated DEGs in the three datasets were significantly enriched in the IL-17 signaling pathway, rheumatoid arthritis, viral protein interaction with cytokine and cytokine receptor, and so on (Figure 5d).
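As a rough illustration of what "significantly enriched" measures, the sketch below computes a one-sided hypergeometric p-value for the overlap between a DEG list and a single annotation term. This is a generic over-representation test, not DAVID's exact scoring procedure, and the function name and all counts in the example are hypothetical.

```python
from math import comb


def enrichment_p_value(background: int, term_genes: int,
                       deg_count: int, overlap: int) -> float:
    """P(X >= overlap) when `deg_count` genes are drawn without replacement
    from `background` genes, of which `term_genes` carry the annotation."""
    total = comb(background, deg_count)
    upper = min(deg_count, term_genes)
    tail = sum(
        comb(term_genes, i) * comb(background - term_genes, deg_count - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 150 annotated to one term,
# 100 DEGs submitted, 8 of which carry the annotation.
p = enrichment_p_value(20000, 150, 100, 8)
```

In practice, such p-values are computed for many terms at once, so a multiple-testing correction is usually applied before terms are ranked.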

**Figure 2.** Heat maps of the top 20 DEGs in GSE40290 (**a**), GSE61218 (**b**), and GSE53819 (**c**).

**Figure 3.** Venn diagram of the upregulated DEGs (**a**) and downregulated DEGs (**b**).

#### *2.3. GO and KEGG Pathway Enrichment Analyses of Downregulated DEGs*

The most abundant GO terms in the downregulated DEGs of GSE40290 included immune response, innate immune response, cell adhesion, and so on (Figure 6a). The most abundant GO terms in the downregulated DEGs of GSE61218 included cilium movement, cilium assembly, spermatogenesis, and so on (Figure 6b). The most abundant GO terms in the downregulated DEGs of GSE53819 included cell adhesion, cilium assembly, cilium movement, and so on (Figure 6c). The most abundant GO terms in the co-downregulated DEGs in the three datasets included immune response, activation of GTPase activity, and cilium assembly (Figure 6d).

**Figure 4.** GO analyses of upregulated DEGs in GSE40290 (**a**), GSE61218 (**b**), GSE53819 (**c**), and co-upregulated DEGs (**d**) in the three datasets.

**Figure 5.** KEGG analyses of the upregulated DEGs in GSE40290 (**a**), GSE61218 (**b**), GSE53819 (**c**), and co-upregulated DEGs (**d**) in the three datasets.

**Figure 6.** GO analyses of downregulated DEGs in GSE40290 (**a**), GSE61218 (**b**), GSE53819 (**c**), and co-downregulated DEGs (**d**) in the three datasets.

KEGG analysis results showed that the downregulated DEGs of GSE40290 were significantly enriched in amyotrophic lateral sclerosis, drug metabolism-cytochrome P450, hematopoietic cell lineage, and so on (Figure 7a); downregulated DEGs of GSE61218 were significantly enriched in cytokine-cytokine receptor interaction, pathways of neurodegeneration-multiple diseases, drug metabolism-cytochrome P450, and so on (Figure 7b); downregulated DEGs of GSE53819 were significantly enriched in hematopoietic cell lineage, chemokine signaling pathway, cytokine-cytokine receptor interaction, and so on (Figure 7c); and co-downregulated DEGs in the three datasets were most significantly enriched in the B-cell receptor signaling pathway and hematopoietic cell lineage (Figure 7d).

**Figure 7.** KEGG analyses of downregulated DEGs in GSE40290 (**a**), GSE61218 (**b**), GSE53819 (**c**), and co-downregulated DEGs (**d**) in the three datasets.

#### *2.4. PPI Network Construction of Co-DEGs*

We submitted the eight co-upregulated DEGs and 10 co-downregulated DEGs from GSE40290, GSE61218, and GSE53819 to the STRING database for PPI analysis to identify hub genes. DEGs with a connectivity (node degree) of at least eight were selected as hub genes. Four hub genes were identified: prostaglandin-endoperoxide synthase 2 (PTGS2) with a degree of 10, chemokine (C-C motif) ligand 21 (CCL21) with a degree of 9, and matrix metalloproteinase (MMP) 1 and MMP3, each with a degree of 8 (Figure 8).
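Hub selection by connectivity amounts to counting each node's interaction partners and applying a cut-off. The sketch below shows that step on a small, hypothetical edge list; the function names and the toy network are illustrative, and the inclusive threshold is an assumption chosen so that nodes sitting exactly at the cut-off are retained.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def node_degrees(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count distinct interaction partners per node in an undirected network."""
    # Sorting each pair makes (a, b) and (b, a) the same edge; self-loops are dropped.
    unique = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    degree: Counter = Counter()
    for a, b in unique:
        degree[a] += 1
        degree[b] += 1
    return dict(degree)


def hub_genes(edges: Iterable[Tuple[str, str]], min_degree: int) -> List[str]:
    """Return nodes whose degree is at least `min_degree`, highest degree first."""
    deg = node_degrees(edges)
    hubs = [g for g, d in deg.items() if d >= min_degree]
    return sorted(hubs, key=lambda g: (-deg[g], g))


# Hypothetical toy network; a real edge list would come from a STRING export.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"),
             ("G2", "G3"), ("G3", "G1"), ("G4", "G5")]
toy_hubs = hub_genes(toy_edges, min_degree=3)
```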

#### *2.5. Expression of Selected Hub Genes in Tumor Tissues*

The UALCAN database was used to analyze the TCGA head and neck squamous cancer (HNSC) dataset. PTGS2 showed higher expression in tumor tissues (Figure 9a), whereas CCL21 was not differentially expressed between normal and tumor tissues (Figure 9b). MMP1 (Figure 9c) and MMP3 (Figure 9d) also showed higher expression in tumor tissues. Finally, using the Human Protein Atlas (HPA) database, we confirmed the expression of PTGS2 (Figure 10a) and MMP3 (Figure 10c) in various cancer tissues and the high expression of PTGS2 (Figure 10b) and MMP3 (Figure 10d) in HNSC.

**Figure 8.** PPI network for co-DEGs in the three datasets.

**Figure 9.** Different expressions of PTGS2 (**a**), CCL21 (**b**), MMP1 (**c**), and MMP3 (**d**) between normal and tumor tissues, \*\*\*: *p* < 0.001.

**Figure 10.** Expression of PTGS2 in various cancer tissues (**a**), strong PTGS2 staining in HNSC (**b**), expression of MMP3 in various cancer tissues (**c**), and strong MMP3 staining in HNSC (**d**).

#### **3. Discussion**

Various cancers, including NPC, are among the major causes of human death worldwide. Globalization and an increase in various risk factors may further aggravate this situation. Bioinformatics methods are rapidly being developed for the analysis of biological data, especially in the analysis of large datasets, and have become an area of interest for researchers. Transforming biological data into knowledge through bioinformatics methods for study and analysis is more time-saving, efficient, and cost-effective than traditional methods [10]. In this study, we analyzed three datasets (GSE40290, GSE61218, and GSE53819) related to NPC using microarray data and identified 18 co-DEGs.

GO enrichment analysis showed that the most abundant GO terms in the co-upregulated DEGs in the three datasets included inflammatory response, extracellular matrix organization, and cell-cell signaling. Li et al. showed that the EBV M81 strain isolated from NPC induced chronic inflammation in its target cells, which resulted in an increase in virus production. They explained the relationship between M81 virus replication and chemokines involved in inflammation and carcinogenesis [11]. Consistent with their findings, inflammatory response was significantly enriched in the co-upregulated DEGs in our study. A clinical study that included 30 patients with NPC and 20 controls found that the process of tumor invasion and metastasis can be effectively reduced by controlling the activity of MMPs and extracellular matrix components [12]. This indicates that extracellular matrix components may play a promoting role in the development of NPC, which is consistent with our results. Exosomal microRNAs from cancer cells play a key role in mediating cell-cell signaling and tumor microenvironment crosstalk. Lu et al. identified the inhibitory role of tumor-derived, exosome-related miR-9 in NPC tumorigenesis [13]. In our results, the upregulated GO terms included cell-cell signaling, which is in line with the result obtained by Lu et al. KEGG analysis showed that the co-upregulated DEGs in the three datasets were significantly enriched in the IL-17 signaling pathway, rheumatoid arthritis, and so on. Wang et al. showed that IL-17 could activate the p38-NF-κB signaling pathway and promote the migration and invasion of NPC cells [14], which is consistent with our finding that co-upregulated DEGs were most enriched in the IL-17 signaling pathway. It has been pointed out that in patients with rheumatoid arthritis, the antibody response to EBV-induced cell antigens is significantly higher than that in healthy individuals [15], which is also consistent with our finding.

The most abundant GO terms in the co-downregulated DEGs in the three datasets were immune response, activation of GTPase activity, and cilium assembly. NPC tumorigenesis is significantly associated with genetic susceptibility. Recent epidemiological and large-scale genome-wide association studies have demonstrated an association between HLA class I genes and the risk of NPC. HLA class I genes encode molecules that initiate the host immune response against malignant cells, but studies have found that high-risk people with several specific HLA haplotypes mount an inefficient immune response to persistent EBV infection [16], which is consistent with our results. Jiang et al. showed that low GTPase expression is related to an increase in signal transduction, cell movement, and metastatic behavior of NPC cells [17]. This is consistent with the finding of our study that GO terms enriched in the co-downregulated DEGs included activation of GTPase activity. A previous study that analyzed gene expression data from NPC and non-NPC nasopharyngeal tissues through a comprehensive pathway analysis showed that the loss of function of the axonemal dynein complex in patients with NPC leads to impaired ciliary function, which in turn leads to poor mucociliary clearance and respiratory tract infection [18]; this is also consistent with our results. The co-downregulated DEGs in the three datasets were most significantly enriched in the B-cell receptor signaling pathway and hematopoietic cell lineage. Morrison et al. showed that LMP2A, which is expressed in most EBV-related tumors, including NPC, maintains virus latency by blocking the activation and signaling of B-cell receptors [19], which is consistent with our finding that co-downregulated DEGs were most enriched in the B-cell receptor signaling pathway. A study of novel aberrant methylation, differentially expressed genes, and pathways in NPC reported that hypermethylated, low-expression genes were significantly enriched in hematopoietic cell lineage [20], which is also consistent with our finding.

Based on the PPI network, we screened the four genes with the highest node degrees: PTGS2, CCL21, MMP1, and MMP3. HNSC develops from the mucous lining of the upper respiratory and digestive tract, including the nasal cavity, paranasal sinus, oropharynx, larynx, and so on. However, NPC is a specific entity distinct from HNSC: its disease behavior differs from that of HNSC, and so does its treatment strategy [21]. Since there is no separate NPC dataset in the TCGA database, we could only use the HNSC dataset for analysis. Our analysis using the HNSC dataset in the TCGA database revealed that the expression of PTGS2, MMP1, and MMP3 was significantly higher in tumor tissues. Through the HPA database, we also found that PTGS2 and MMP3 are expressed in various cancers, including head and neck cancers. The results of antibody staining showed that PTGS2 and MMP3 were strongly expressed in HNSC. A recent meta-analysis identified the upregulation of PTGS2, MMP1, and MMP3 in NPC tissues and showed that dysregulation of the nasal epithelial barrier and a dysregulated immune response are key components in the pathogenesis of NPC [22]. Another meta-analysis found that the overexpression of PTGS2 was significantly associated with a low survival rate in patients with NPC [23]. One study detected high expression of PTGS2 in patients with NPC and distant metastasis and showed that PTGS2 was related to the migration and invasion of NPC cells, in addition to the low survival rate of patients with NPC [24]. A study involving 56 healthy individuals and 114 patients with NPC explored the correlation between PTGS2 gene polymorphism and the occurrence of NPC [25]. It found that PTGS2 gene polymorphism was related to susceptibility to NPC, and that both smoking and EBV infection, which are the main risk factors for NPC, can affect PTGS2 gene polymorphism. Some studies have confirmed that the upregulation of MMP1 is related to lymph node metastasis in NPC [26]. Additionally, studies have shown that MMP1 is significantly associated with the risk of NPC [27]. Song et al. confirmed the upregulation of MMP1 in NPC tissues and cell lines by RT-qPCR and western blotting and found that knockdown of the MMP1 gene significantly inhibited cell proliferation and enhanced apoptosis [28]. A study that measured the mRNA and protein levels of MMP3 in NPC tissues and cells found that the concentration and enzymatic activity of MMP3 in the NPC group were much higher [29]. Another study showed that the overexpression of MMP3 in NPC epithelial cells increased EBV-induced epithelial cell migration and invasion in an in vitro cell model [30]. One study aimed to analyze the co-deregulated genes and their transcriptional regulators in lung cancer [31]; its authors used the Connectivity Map to find putative repurposing drugs for selected hub genes. Although we also tried to discover putative repurposing drugs by the same method, the results were not satisfactory. In future research, with more data and deeper analysis, we plan to complete this unfinished search for putative repurposing drugs.

#### **4. Materials and Methods**

#### *4.1. Microarray Data*

The NCBI-GEO is a public database (https://www.ncbi.nlm.nih.gov/geo/) (accessed on 28 September 2022). By searching for keywords such as "nasopharyngeal carcinoma" and "Homo sapiens", we obtained three datasets: GSE40290, comprising 25 NPC tissues and 8 normal tissues; GSE61218, comprising 10 NPC tissues and 6 normal tissues; and GSE53819, comprising 18 NPC tissues and 18 normal tissues. All three can be downloaded and analyzed with GEO2R.

#### *4.2. Identification of DEGs and Data Visualization*

We analyzed DEGs between NPC and normal tissues in the GSE40290, GSE61218, and GSE53819 datasets using the GEO2R tool. Volcano maps and heat maps for each dataset were drawn using the hiplot platform (https://hiplot-academic.com/) (accessed on 29 September 2022) [32].

#### *4.3. GO and KEGG Pathway Enrichment Analysis of Up- and Downregulated DEGs*

DAVID [33] is a web server for gene lists, functional enrichment analysis, and functional annotation (https://david.ncifcrf.gov/) (accessed on 29 September 2022). We used the latest version of the DAVID database (version 7.0) for GO and KEGG pathway enrichment analyses of the upregulated and downregulated DEGs.

#### *4.4. PPI Network Construction of Up- and Downregulated DEGs*

STRING [34] is an online resource database used to obtain protein association networks (https://string-db.org/) (accessed on 29 September 2022). The content in the database is precomputed and can be downloaded separately by users. We used version 11.5 of STRING for PPI analysis.

#### *4.5. Analyzing the Expression of Hub Genes in Tumor*

The UALCAN [35] data portal is an interactive web resource (http://ualcan.path.uab.edu/) (accessed on 29 September 2022). The Cancer Genome Atlas (TCGA) transcriptome and clinical patient data were used to study the differential expression of hub genes in normal and tumor tissues through this data portal. Protein-related data were obtained from the HPA (https://www.proteinatlas.org/) (accessed on 29 November 2022).

#### **5. Conclusions**

In conclusion, our analysis identified hub genes and signaling pathways associated with NPC. This provides information for exploring the pathogenesis, identifying molecular targets, and clarifying the biological pathways of NPC. However, further experiments are needed to verify and explore the functions of these genes.

**Author Contributions:** Writing-original draft preparation; corrections after review: J.T.; data curation: J.P.; methodology: M.H.; review and editing: T.H.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Basic Science Research Program of the National Research Foundation of Korea, funded by the Ministry of Science and Technology and the Ministry of Science, ICT and Future Planning (2017R1A2B2003575 and NRF-2020R1A2C1006398); the Ministry of Science and ICT, Korea, under the ICT Creative Consilience program (IITP-2023-2020-0-01819), supervised by the Institute for Information and Communications Technology Planning and Evaluation (IITP); and the Korea Health Technology R&D Project (HI17C0387, HR22C1302) through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare. This research was supported by a Korean University grant and a grant from the Korea University Medical Center and Anam Hospital in Seoul, Republic of Korea.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The data presented in this study are openly available in GEO datasets.

**Conflicts of Interest:** The authors declare no conflict of interest.


## *Communication* **PSSNet—An Accurate Super-Secondary Structure for Protein Segmentation**

**Denis V. Petrovsky, Vladimir R. Rudnev, Kirill S. Nikolsky, Liudmila I. Kulikova, Kristina M. Malsagova \* , Arthur T. Kopylov and Anna L. Kaysheva**

> Biobanking Group, Branch of Institute of Biomedical Chemistry "Scientific and Education Center", 109028 Moscow, Russia

**\*** Correspondence: kristina.malsagova86@gmail.com; Tel.: +7-499-764-98-78

**Abstract:** A super-secondary structure (SSS) is a spatially unique ensemble of secondary structural elements that determine the three-dimensional shape of a protein and its function, rendering SSSs attractive as folding cores. Understanding known types of SSSs is important for developing a deeper understanding of the mechanisms of protein folding. Here, we propose a universal PSSNet machine-learning method for SSS recognition and segmentation. For various types of SSS segmentation, this method uses key characteristics of SSS geometry, including the lengths of secondary structural elements and the distances between them, torsion angles, spatial positions of Cα atoms, and primary sequences. Using four types of SSSs (βαβ-unit, α-hairpin, β-hairpin, αα-corner), we showed that extensive SSS sets could be reliably selected from the Protein Data Bank and AlphaFold 2.0 database of protein structures.

**Keywords:** super-secondary structure; data bank; AlphaFold 2.0; graph neural network; machine learning; protein features

#### **1. Introduction**

Protein folding mechanisms have fascinated scientists for half a century [1–3]. According to the "nucleation–condensation" model of protein folding, self-folding proteins, such as molecular chaperones [4], that do not participate in the protein machinery, become unstructured tangles immediately after translation. Folding nuclei (a time-limiting stage) are formed and condensed in coils, and the process is completed by spontaneous packing into a native three-dimensional structure [5–7]. In relation to this concept, attention has focused on simple motifs such as super-secondary structures (SSSs), which comprise several secondary structure elements with unique and compact folding of a polypeptide chain. Super-secondary structures serve as a bridge between the secondary and tertiary structure of a protein and are probably autonomously stable (i.e., stable outside the protein globule) [8,9].

The use of SSSs to solve biomedical problems is rather desirable, as the alpha-helical and beta-hairpin types of SSSs can serve as initial unique structures for the construction of protein epitope mimetics (PEMs) [10,11]. These PEMs mimic the structural and conformational properties of their target epitopes (SSS), as well as their biological activity (protein–protein and protein–nucleic acid interactions). It is possible to optimize biological activity to maintain antimicrobial activity, for example, by transferring an epitope from a recombinant to a synthetic scaffold [11].

In previous studies, we reported the possibility of studying SSSs in aberrant protein forms caused by post-translational modifications (PTMs). We observed that PTMs detected in patients with various types of cancer are frequently localized in SSSs (alpha-structural motifs, beta-hairpins) [12]. A comprehensive study of the known SSS types is therefore essential for gaining deeper insight into protein folding mechanisms and for solving some challenges in biomedical research [9].

**Citation:** Petrovsky, D.V.; Rudnev, V.R.; Nikolsky, K.S.; Kulikova, L.I.; Malsagova, K.M.; Kopylov, A.T.; Kaysheva, A.L. PSSNet—An Accurate Super-Secondary Structure for Protein Segmentation. *Int. J. Mol. Sci.* **2022**, *23*, 14813. https://doi.org/10.3390/ijms232314813

Academic Editor: Wajid Zaman

Received: 28 October 2022 Accepted: 24 November 2022 Published: 26 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Predicting the secondary and super-secondary structures of proteins from their 3D structures (PDB, AlphaFold) is becoming a top priority in structural biology research. Numerous prediction approaches are currently known, but the most commonly used are based on (1) probabilistic models, such as kernel density estimation (KDE) [13,14] and naïve Bayes [15]; (2) linear classifiers, such as support vector machines (SVM) [16–18]; and (3) machine learning methods [9,19]. The first two approaches perform poorly on huge amounts of data, with relatively low classification and semantic segmentation accuracy (60–75%) [16,20].

Neural networks (NN) have recently been applied to the problem of structure classification and segmentation. Neural networks are typically designed to classify and/or predict one or two types of SSSs, though several NN-utilizing methods are now capable of predicting β-hairpin and βαβ-units (StackSSSPred) [21–23]. The following main groups of machine learning models are most widely used:


Here, we present a new approach to classify different types of SSSs, specifically βαβ-unit, α-hairpin, β-hairpin, and αα-corner. The approach was tested on standard format files extracted from the public Protein Data Bank (PDB) [9]. The neural network PSSNet (Protein Secondary Structure Segmentation) was implemented on a new deep learning architecture that combines CGN, convolutional neural network (CNN), and (bidirectional) recurrent neural network (RNN) predictions. The proposed architecture achieves an accuracy of 84% and supports a wide range of valuable annotations for over 1.9 million SSSs available in the open-access knowledge base at https://psskb.org/ (accessed on 28 October 2022). In addition to secondary structure prediction, PSSNet can also be applied to the prediction of free energy, solvent accessibility, and contact maps, and to the search for stable protein structures.

#### **2. Results and Discussion**

#### *2.1. SSS Segmentation Using the PSSNet Model*

This model was the basis for filling gaps in the open knowledge base of SSSs (available at https://psskb.org/, accessed on 28 October 2022). After training the model, we applied it to the complete datasets maintained in the PDB and AlphaFold. The results were selectively assessed by expert researchers, entered into a database, and made publicly available. The numbers of SSSs identified by the model are listed in Table 1.

**Table 1.** Number of SSSs recognized in open knowledge databases (PDB and AlphaFold).


The model was built according to the proposed architecture and combined high accuracy and performance. We assessed the performance of the model engine by comparing its quality with that of networks with distinct architectures, i.e., CurveNet [28] and DGCNN36 networks (Table 2). Such networks are among the top 10 utilized in "3D point cloud" classification and segmentation [29]. Training and evaluation of the results were carried out on the same datasets. Plots of the loss function and IOU versus iterations are provided in Supplementary Materials (model evaluation metrics section).
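For readers unfamiliar with the metric, intersection over union (IoU) for a segmentation task can be computed from per-residue labels as sketched below. This is a minimal, generic illustration: the function name, the two-class toy labels, and the choice to average over classes are assumptions, since the text does not state exactly how the reported mean IoU was aggregated.

```python
from typing import List, Sequence


def mean_iou(y_true: Sequence[int], y_pred: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes for per-residue labels.

    Classes that appear in neither the reference nor the prediction are skipped.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have the same length")
    ious: List[float] = []
    for c in range(num_classes):
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical labels: 1 marks residues inside a motif, 0 marks everything else.
truth = [0, 0, 1, 1, 1, 0, 0, 0]
pred = [0, 1, 1, 1, 0, 0, 0, 0]
score = mean_iou(truth, pred, num_classes=2)
```

In this toy case the predicted segment is shifted by one residue, so the motif class overlaps on two of four residues and the background class on four of six.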


**Table 2.** Comparison between our proposed and other models (mean IOU).

The predictive power of a machine learning model is mainly determined by its feature representation and feature extraction algorithms. The models considered in our comparative experiment operate only with the 3D coordinates of protein atoms; however, these features were barely sufficient to provide acceptable recognition accuracy (Table 2). In contrast, our model operates with an extended set of specific structural features, encompassing torsion angles, the spatial positions of atoms in the amino acid sequence, and the primary protein sequence. The convolutional filters of our neural network blocks capture and generalize the local geometric features of the protein sequence well, and the subsequent Bi-GRU blocks capture the global feature context.

#### *2.2. Practical Evaluation of the Model: Key Issues*

The proposed model was empirically tested to assess its accuracy and to identify key issues that might arise in structure segmentation and classification. A random sample of 5000 SSS structures from PDB and AlphaFold was consolidated, and topology compliance with the studied motifs was examined.

The main issues that arose from the classification and segmentation of structural motifs were as follows (Figure 1):


Although the model validation results were quite satisfactory (IoU = 0.92), the estimated accuracy ranged from 0.83 to 0.85. We therefore identified several problems specific to the topology of particular types of SSS.


on a meaningfully larger representative sample that covers all such elements; the retraining and sample collecting are currently in progress.

**Figure 1.** Evaluation of the model on empirical consolidated datasets (random sampling of 5000 elements for each type of SSS) of (**a**) βαβ-unit, (**b**) β-hairpin, (**c**) α-hairpin and (**d**) αα-corner, (**e**) heat map for the problem of extra α-helix.

Despite the difference in accuracy between the actual and validation datasets, the model managed SSS segmentation and classification tasks well. The difference in accuracy suggests relatively high folding variability among the structures downloaded from PDB and AlphaFold. Hence, the training dataset must be sufficiently extended, especially in terms of negative examples, to improve the accuracy of segmentation and classification.

Super-secondary structures are composed of simple combinations of α-helices, β-hairpins, and short loops with a well-defined hydrophobic core involved in SSS stabilization. The loop–helix, β-hairpin, and Greek key motifs are prominent representatives of SSSs [30–32]. Characterization of such structures allows us to collect a catalog of autonomously stable protein motifs and archetypes [30,31]. These structures also serve as promising objects of study in protein physics (the study of folding), bioengineering (the development of peptide mimetics), and biomedicine (the study of conformational changes in aberrant forms of proteins compared with intact forms).

Medical proteomic research is mainly focused on the extensive study of the molecular basis of diseases associated with the appearance of aberrant forms of proteins that are not normally found under healthy conditions. Aberrant proteins are frequently caused by genetic polymorphisms, alternative splicing, and PTMs [33,34]. Such disease-associated structural changes can be localized in different types of SSSs. Numerous aberrant forms of proteins involve severe structural changes, including isomers of beta-amyloid in Alzheimer's disease [35], splice isoforms of osteopontins b and c in prostate cancer [36], amino acid substitutions in protein C7 in type II diabetes [37], and PTMs of proteins in oncological diseases [12].

Here, we present a new approach to frame the problem of SSS recognition and segmentation based on the geometric characteristics of structures and spatial relationships within a protein sequence. The main advantage of our method is its low demand for computing resources. We used a standard personal desktop computer with a typical GeForce GTX 1650 (4 GB) video card for training and data processing in the PDB and AlphaFold 2.0 databases. The PSSNet model also achieved high recognition accuracy (mean IoU > 0.84; F1 > 0.08) and was able to annotate >1.9 million SSSs of the βαβ-unit, α-hairpin, β-hairpin, and αα-corner types. This opens up wide scope for investigation using the PSSKB resource. The model does not require a large training set: sets of 2000 specimens were used to train it. Plots of GPU memory usage as a function of protein sequence length can be found in Supplementary Materials (GPU memory usage section).

A distinctive feature of our model is its ability to recognize and segment SSSs within a protein sequence of arbitrary length. The model can operate directly with any file in PDB format, including those with low data quality, poor resolution, and sparse protein sequences. The only limitation of the model is the amount of graphics processing unit (GPU) memory. In contrast, most current models focus on recognizing only one type of structure and work only with a few prepared datasets, resulting in relatively low numbers of recognized structures, in the range of thousands to tens of thousands.

The architecture of the proposed method is powered by a comprehensive combination of CGNs, CNNs, Bi-GRUs, self-attention, and multi-head attention mechanisms, which encourage network flexibility and easy adaptation to solve problems in structural biology and bioinformatics. Primary examinations have shown that a model with minimal modifications can predict the structural alphabet based on geometric characteristics for differentiable molecular modeling problems. A subsequent investigation will target this and other issues.

#### **3. Materials and Methods**

#### *3.1. Data Preparation*

Training, test, and validation datasets downloaded from the Protein Data Bank were represented by the following types of SSSs: a βαβ-motif (beta-alpha-beta motif), a β-hairpin, an α-hairpin, and an αα-corner (70°–90°).

The datasets were generated using STRIDE, which takes a PDB file as input and returns secondary structure assignments. Thereafter, data were manually curated by a team of experts to ensure compliance with the declared types of SSS. Eventually, the sets of positive and negative examples included almost 2000 and 4000 elements of SSS of each type. Training and test model datasets are available at https://doi.org/10.6084/m9.figshare.21529812.v1 (open access, accessed on 28 October 2022) [38].

The balance between positive and negative examples in packets supplied to the network input was regulated by the software implementation. Before entering the network, the coordinates of atoms *x*, *y*, and *z* were augmented (rotation around the *x*, *y*, and *z* axes at a random angle and the *y*-axis with random jitter for each point using Gaussian noise with zero-mean value and standard deviation of 0.08). Data augmentation was executed dynamically during the training time for 40% of input structures.
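The augmentation step can be illustrated with a minimal NumPy sketch. It is not the PSSNet implementation: the function names are hypothetical, the rotation angles are assumed to be drawn uniformly from [0, 2π), and the jitter is assumed to be applied to all three coordinates of every point.

```python
import numpy as np

def random_rotation_matrix(rng):
    """Compose rotations about the x, y and z axes by independent random angles."""
    ax, ay, az = rng.uniform(0.0, 2.0 * np.pi, size=3)  # assumed angle distribution
    rx = np.array([[1.0, 0.0, 0.0],
                   [0.0, np.cos(ax), -np.sin(ax)],
                   [0.0, np.sin(ax), np.cos(ax)]])
    ry = np.array([[np.cos(ay), 0.0, np.sin(ay)],
                   [0.0, 1.0, 0.0],
                   [-np.sin(ay), 0.0, np.cos(ay)]])
    rz = np.array([[np.cos(az), -np.sin(az), 0.0],
                   [np.sin(az), np.cos(az), 0.0],
                   [0.0, 0.0, 1.0]])
    return rz @ ry @ rx

def augment(coords, rng, p=0.4, sigma=0.08):
    """With probability p, rotate the (N, 3) coordinates and add Gaussian jitter."""
    if rng.random() >= p:
        return coords.copy()                      # structure left unchanged
    rotated = coords @ random_rotation_matrix(rng).T
    return rotated + rng.normal(0.0, sigma, size=coords.shape)
```

Calling `augment` once per structure in each batch reproduces the "dynamic" behaviour described above, since a new rotation and new noise are drawn every time.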

Before entering the network, elements of amino acid sequence (AA codes and 3D coordinates of the corresponding group of atoms (N, Cα, C, and O)) were extracted from PDB files.

Ultimately, an array of 3D coordinates was generated to describe the 3D structure of the protein. The final array of coordinates was applied to generate a graph, with each vertex representing a Cα atom in the main protein chain, connected by edges to the 32 nearest Cα atoms (KNN-graph, k = 32). Each edge and vertex of the graph contained scalar and vector features describing the 3D geometry of the protein structure. The method for determining the optimal value of k is described in the Supplementary Materials (determining the optimal value of k (nearest Cα atoms)).
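As an illustration of the graph construction, the following sketch builds a KNN edge list from an array of Cα coordinates. The function name `knn_edges` is hypothetical, and the brute-force distance matrix is chosen for clarity rather than efficiency.

```python
import numpy as np

def knn_edges(ca_coords, k=32):
    """Return (source, target) index arrays linking each node to its k nearest nodes."""
    n = ca_coords.shape[0]
    k = min(k, n - 1)                        # short chains have fewer than k neighbours
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)     # (n, n) pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)           # exclude self-loops
    nbrs = np.argsort(dist, axis=1)[:, :k]   # indices of the k closest nodes per row
    src = np.repeat(np.arange(n), k)
    dst = nbrs.reshape(-1)
    return src, dst
```

The resulting graph is directed: node *i* points to its *k* nearest neighbours, but the relation is not necessarily symmetric.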

#### *3.2. Feature Extraction and Input Encoding*

3.2.1. Node-Level Features

The features of a graph node are described by the following elements (Figure 2):

**Figure 2.** Feature extractions from protein sequences. The graph shows the protein structure.


$$
\stackrel{\rightarrow}{Vo} = \sqrt{\frac{1}{3}} \frac{(a \times b)}{||a \times b||} - \sqrt{\frac{2}{3}} \frac{(a+b)}{||a+b||} 
$$

where vectors *a* and *b* are defined as *a* = *N<sub>i</sub>* − *Cα<sub>i</sub>* and *b* = *C<sub>i</sub>* − *Cα<sub>i</sub>*. This vector, together with the forward and reverse vectors (*V*<sub>f1</sub>, *V*<sub>r1</sub>), determines the orientation of the amino acid residue in 3D Euclidean space.

• The amino acid sequence is encoded as a sequence of numbers (0–21).
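The orientation vector defined by the equation above can be computed directly from the backbone atom coordinates. The sketch below is illustrative only; the function name is hypothetical. Because *a* × *b* is orthogonal to *a* + *b*, the result has unit length whenever *a* and *b* are not collinear.

```python
import numpy as np

def orientation_vector(n_xyz, ca_xyz, c_xyz):
    """Unit orientation vector of a residue from its N, C-alpha and C coordinates."""
    a = n_xyz - ca_xyz                 # a = N_i - CA_i
    b = c_xyz - ca_xyz                 # b = C_i - CA_i
    cross = np.cross(a, b)
    bisector = a + b
    return (np.sqrt(1.0 / 3.0) * cross / np.linalg.norm(cross)
            - np.sqrt(2.0 / 3.0) * bisector / np.linalg.norm(bisector))
```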

#### 3.2.2. Edge-Level Features

Graph edge features are described by the following elements:


$$\varphi(r) = e^{-(\varepsilon r)^2}, \quad r = \|x - x_i\|.$$

For each edge, the distance was encoded with 32 Gaussian functions, with centers uniformly spaced in the range of 0–24 Å. The edge position code (*i*, *j*) was obtained using a sinusoidal encoder, which is widely used in transformer models. This approach to the positional encoding of sequences has been previously described in detail [39].
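A minimal sketch of both edge encodings is given below. The number of centers (32) and the range (0–24 Å) follow the text; the kernel width (one center spacing), the encoding dimension of 16, and the base of 10,000 for the sinusoidal frequencies are assumptions borrowed from common practice, not values stated here.

```python
import numpy as np

def rbf_encode(dist, d_min=0.0, d_max=24.0, n_centers=32):
    """Encode distances (in angstroms) with Gaussian radial basis functions."""
    centers = np.linspace(d_min, d_max, n_centers)
    sigma = (d_max - d_min) / n_centers          # assumed width of each Gaussian
    eps = 1.0 / sigma                            # shape parameter in exp(-(eps * r)^2)
    r = np.asarray(dist, dtype=float)[..., None] - centers
    return np.exp(-(eps * r) ** 2)

def sinusoidal_encode(offset, dim=16):
    """Transformer-style sinusoidal encoding of the sequence offset j - i."""
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    angles = np.asarray(offset, dtype=float)[..., None] * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

For an edge (*i*, *j*), `rbf_encode` would be applied to the Cα–Cα distance and `sinusoidal_encode` to the index difference *j* − *i*.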

#### *3.3. Network Architecture*

The architecture of our model is based on a combination of the geometric vector perceptron (GVP), graph neural network (GNN), and multi-layer gated recurrent unit (GRU) methods (Figure 3). The network follows the encoder–decoder principle, which is widely used in classification and segmentation problems. The encoder generates a feature map based on the input data (node position in the graph, local topology, vector, and scalar attributes of the node itself and its neighbors). The decoder extracts information from the feature map and generates classification labels for the graph nodes. The model was implemented as a binary classifier, and a separate model was trained for each SSS in the training set. As more data accumulate, a multiclass model that handles a variety of structures should be developed in the future.

The GVP elements of the model architecture extract invariant and equivariant features from a combination of scalar and vector representations of geometric features. In addition, the GVP can approximate any continuous rotation- and reflection-invariant scalar function. The architecture (GVP–GNN) uses GVP modules for feature extraction and the graph convolutional network (GCN) mechanism for message passing between graph nodes, aggregation of the features of neighboring nodes and edges, and updating of node embeddings during a propagation operation [40]. The GVP architecture has been previously described in detail [41].

**Figure 3.** The architecture of the proposed PSSNet.

GVP-based neural networks have been used to predict amino acid sequences based on the geometric characteristics of a protein and to predict protein–protein interactions (PPIs). Because proteins have a sequential structure, we supplemented the model with bidirectional gated recurrent units (Bi-GRUs) to highlight relationships between geometric characteristics [41]. Adding GRU layers to the model significantly increased the predictive accuracy and reduced the amount of time required to train the model. Table 3 shows the architecture of the model and a brief description of the functions of the blocks.


**Table 3.** Model architecture and implementation details.

The Adam optimizer was applied, and the learning rate was reduced whenever the accuracy metric stopped improving (initial learning rate of 1 × 10<sup>−3</sup> and reduction factor of 0.5). The Dice BCE loss was selected as the loss function, as it combines the Dice loss with the standard binary cross-entropy (BCE) loss, which is generally the default for segmentation models. Combining the two methods allowed for moderate diversity in the loss while improving the stability of the BCE.
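The combined loss can be sketched as follows. This is a NumPy illustration of the commonly used Dice + BCE formulation, not the training code of PSSNet; the equal weighting of the two terms and the smoothing constant are assumptions.

```python
import numpy as np

def dice_bce_loss(probs, targets, smooth=1.0, eps=1e-7):
    """Sum of binary cross-entropy and soft Dice loss for per-node probabilities."""
    p = np.clip(np.asarray(probs, dtype=float).ravel(), eps, 1.0 - eps)
    t = np.asarray(targets, dtype=float).ravel()
    # Binary cross-entropy averaged over nodes
    bce = -np.mean(t * np.log(p) + (1.0 - t) * np.log(1.0 - p))
    # Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|), with additive smoothing (assumed)
    intersection = np.sum(p * t)
    dice = 1.0 - (2.0 * intersection + smooth) / (np.sum(p) + np.sum(t) + smooth)
    return bce + dice
```

The BCE term penalizes each node independently, whereas the Dice term depends on the overlap of the whole predicted segment with the reference, which makes it less sensitive to the imbalance between SSS and non-SSS residues.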

#### *3.4. Training and Performance Evaluation*

The model was trained for 24 epochs for each SSS, and the models were assessed on the validation datasets. During training, the learning rate was reduced on plateaus from 1 × 10<sup>−3</sup> to less than 1 × 10<sup>−4</sup>. We used the intersection over union (IoU; also known as the Jaccard index) as the main metric for assessing the quality of model predictions. Its values range from 0 to 1 and show the extent to which the positions of two objects (the reference [ground truth] and the one predicted by the model) coincide, according to the following equation:

$$\text{IoU} = \frac{\text{Area of Overlap}}{\text{Area of Union}}$$

We considered the position of SSS in the reference and predicted structures and evaluated the coincidence of their positions. The harmonic mean of the recall and precision metrics (F1) was also evaluated (Table 4).
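For per-residue binary labels, both metrics reduce to counts of true positives, false positives, and false negatives. The sketch below assumes that the reference and the prediction are given as equal-length sequences of 0/1 labels; the function name and the handling of empty masks are illustrative choices.

```python
def iou_and_f1(reference, predicted):
    """IoU (Jaccard index) and F1 score for two equal-length binary label sequences."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r and p)          # residue in both segments
    fp = sum(1 for r, p in pairs if not r and p)      # predicted only
    fn = sum(1 for r, p in pairs if r and not p)      # reference only
    union = tp + fp + fn
    iou = tp / union if union else 1.0                # assumed convention: two empty masks agree
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return iou, f1
```

For example, a reference segment covering residues 2–5 and a prediction covering residues 3–6 share three residues out of a union of five, giving IoU = 0.6 and F1 = 0.75.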

**Table 4.** Performance metrics: IoU and F1.


#### **4. Conclusions**

Super-secondary structures are blocks of protein molecules with unique and compact spatial arrangements. Such structures are stable outside the protein globule due to pronounced hydrophobic cores. Structural biology considers SSSs as the nuclei of protein folding and as starting structures when looking for the possible folding pattern of polypeptide chains while modeling protein structures. Our model combines GNN, CNN, and RNN methods and offers the following advantages:


Our model can classify more than 2.3 million SSSs for all protein structures available in the PDB and AlphaFold databases. The reliability and accuracy of the model were demonstrated on four types of SSSs taken from the public Structural Elements Database (PSSKB, https://psskb.org/, accessed on 28 October 2022); however, the model is generic and can be applied to a wider set of SSS types. The assembled set of SSS structures opens up new options for studying the uniqueness and compactness of protein spatial packing and folding nuclei, and can also act as starting structures for searching for possible polypeptide chain folding while modeling protein structures.

Future efforts will target a greater diversity of SSS types (Greek key, Rossmann fold, etc.) in the segmentation model and the expansion of the database. We will also focus on improving annotations and ensuring the quality of SSS presentations. Furthermore, we will provide sufficient information both for users with extensive experience in structural biology and for newcomers to the field. We will also tailor the database to meet the needs of the research community and provide accurate SSS information in future updates.

**Supplementary Materials:** The supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms232314813/s1. Reference [44] is cited in the supplementary materials.

**Author Contributions:** D.V.P., V.R.R., A.L.K. and L.I.K. conceived the project; D.V.P. conducted experiments; L.I.K., K.S.N. and V.R.R. generated sets of positive and negative examples; D.V.P., K.M.M., A.L.K. and A.T.K. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study proceeded within the framework of the Russian Federation Fundamental Research Program for the long-term period of 2021–2030 (№ 122092200056-9).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The source code for super-secondary structure classification (PSSNet) has been deposited in an open-access GitHub repository and is available at the following link: https://github.com/Denis21800/PSSNet.

**Acknowledgments:** The authors are grateful to A.V. Efimov for the helpful discussions. Equipment at the shared research facilities of HPC computing resources at Lomonosov Moscow State University and the Joint Supercomputer Center of the Russian Academy of Sciences was used for simulations.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Review* **Computational Tactics for Precision Cancer Network Biology**

**Heewon Park 1,\* and Satoru Miyano 1,2**


**Abstract:** Network biology has garnered tremendous attention in understanding complex systems of cancer, because the mechanisms underlying cancer involve perturbations in the specific function of molecular networks, rather than a disorder of a single gene. In this article, we review various computational tactics for gene regulatory network analysis, focusing especially on personalized anticancer therapy. This paper covers three major topics: (1) estimation of gene regulatory networks specific to the cancer characteristics of a cell line (or patient), which enables us to reveal molecular interplays under varying conditions of cancer characteristics of cell lines (or patients); (2) computational approaches to interpret the multitudinous and massive networks; (3) network-based applications to uncover molecular mechanisms of cancer and to identify related markers. We expect that this review will help readers understand personalized computational network biology, which plays a significant role in precision cancer medicine.

**Keywords:** gene regulatory network; computational cancer biology; precision medicine; oxaliplatin and capecitabine (XELOX)

#### **1. Introduction**

A gene regulatory network describes functional interactions between genes, where the network is represented by a graph whose nodes represent the genes and whose edges represent the regulatory interactions between genes [1,2]. The heterogeneous gene regulatory system is a useful tool to analyse and visualize biological activities and is crucial to understanding complex biological processes of cancer, because the molecular mechanisms underlying diseases reflect perturbations in a specific function of molecules in the complex cellular network, rather than a consequence of an abnormality in a single gene [3].

The molecular interplays between genes involved in cellular processes and pathways can be represented by statistical and mathematical models. Computational strategies to estimate large-scale gene networks from gene expression levels have drawn a large amount of attention. The Gaussian graphical model (GGM), a probabilistic model, has often been used to infer the conditional dependence structure of a set of genes. The GGM represents which genes (variables) predict one another and allows for sparse modeling of covariance structures, highlighting potential causal relationships between genes [4]. The Bayesian network (BN) is also a probabilistic graphical model, described by a directed acyclic graph. BNs have been used to uncover cancer mechanisms, e.g., unique cancer molecular mechanisms of clone cancer [5], causal networks of breast metastasis to bone, brain, or lung [6], assessing the risk of breast cancer [7], etc. Boolean networks are discrete models and one of the most widely used techniques to estimate the gene regulatory system. In the model, gene expression levels are discretized, and each gene takes one of two values, i.e., 1 if the gene expression is above a threshold value and 0 otherwise, and the interactions between genes are described by standard logic (Boolean) functions [8]. Various cancer studies have been based on Boolean networks, e.g., for cancer drug discovery [9], identifying lung cancer diagnostic and prognostic biomarkers [10], uncovering the mechanisms of tumorigenesis and possible treatment responses of prostate cancer [11], etc. Additionally, various computational models and strategies (e.g., differential equation-based models, artificial neural network (ANN) approaches, correlation networks, information theory, etc.) have been developed and applied to cancer research. Furthermore, the effectiveness of network-based analysis has been proven in various fields of research, e.g., cancer prediction, drug combination identification, and protein–protein interaction [12–14].

**Citation:** Park, H.; Miyano, S. Computational Tactics for Precision Cancer Network Biology. *Int. J. Mol. Sci.* **2022**, *23*, 14398. https://doi.org/10.3390/ijms232214398

Academic Editor: Wajid Zaman

Received: 16 October 2022 Accepted: 17 November 2022 Published: 19 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Although many computational tactics for gene regulatory network estimation have been developed and numerous studies have been conducted to uncover cancer mechanisms based on the estimated gene networks, the existing studies were based on an averaged gene network across all cell lines. Thus, they cannot effectively identify crucial information for precision cancer medicine.

In this article, we review the computational strategies for gene network analysis specific to the cancer characteristics of a cell line (or patient). In particular, we review machine learning approaches for varying coefficient models, where the varying coefficients describe the strength of the interaction between genes for a specific characteristic of each cell line. That is, the model enables us to construct a gene regulatory network for a specific cancer-related status of the cell line. Cell line characteristic-specific gene network estimation provides hundreds of networks for hundreds of cell lines, where each network is given in matrix form with about 20,000 columns for target genes and 2000 rows for regulator genes, and the elements of the matrix indicate the strength of interaction between the regulator and target genes. The analysis and interpretation of the multiple and massive networks are quite difficult tasks and have remained a serious challenge in computational biology. In this article, we also review some computational tactics for comprehensive analysis and interpretation of the large-scale networks.

The remainder of this paper is organized as follows. In the gene regulatory network estimation section, the regression framework for gene regulatory network estimation is presented. The computational tactics for estimating cell line characteristic-specific gene regulatory networks are described in the sample-specific gene network estimation section. In the section on gene network analysis in multi-dimensional cell line space, the machine learning and artificial intelligence (AI) approaches to comprehensive analysis of the estimated multiple and massive gene regulatory networks are presented. In the Applications section, the results of applying the reviewed computational strategies to network-based anti-cancer drug prediction and related marker identification are introduced. Conclusions are provided in the Discussion section.

#### **2. Gene Regulatory Network Estimation**

Suppose *X* = (*x*<sub>1</sub>, ..., *x*<sub>*n*</sub>)<sup>*T*</sup> ∈ ℝ<sup>*n*×*p*</sup> is an *n* × *p* data matrix describing the expression of *p* regulators that may control the transcription of the *ℓ*th target gene *y*<sub>*ℓ*</sub> ∈ ℝ<sup>*n*</sup>, *ℓ* = 1, ..., *q*. Consider the linear regression model,

$$\mathbf{y}_{\ell} = \sum_{j=1}^{p} \beta_{j\ell} \mathbf{x}_{j} + \boldsymbol{\varepsilon}_{\ell}, \quad \ell = 1, \ldots, q, \tag{1}$$

where *β*<sub>*jℓ*</sub> describes the effect of the *j*th regulator gene on the *ℓ*th target gene, and *ε*<sub>*ℓ*</sub> is a random error vector *ε*<sub>*ℓ*</sub> = (*ε*<sub>1*ℓ*</sub>, ..., *ε*<sub>*nℓ*</sub>)<sup>*T*</sup> that is assumed to be independently and identically distributed with mean 0 and variance *σ*<sup>2</sup>. To estimate the gene regulatory network, the following *L*<sub>1</sub>-type regularization methods were used successfully,

$$L(\boldsymbol{\beta}_{\ell}) = \underset{\boldsymbol{\beta}_{\ell}}{\arg\min} \left\{ \frac{1}{2} \sum_{i=1}^{n} \Big(y_{i\ell} - \sum_{j=1}^{p} \beta_{j\ell} x_{ij}\Big)^2 + P(\boldsymbol{\beta}_{\ell}) \right\}, \tag{2}$$

where


• etc.

and *λ*, *γ* > 0 are the regularization parameters, where *λ* controls model complexity, and *γ* is a mixing parameter between the lasso and ridge penalties. The *L*1-type regularization methods enable us to simultaneously identify crucial regulators and estimate their effect on a target gene. In particular, the methods effectively perform analysis of the high dimensional genomic alterations dataset.
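To make the regression framework in (1) and (2) concrete, the following sketch solves the problem for one target gene with the lasso penalty, assumed here to be P(β) = λ Σ<sub>j</sub> |β<sub>j</sub>|, by cyclic coordinate descent. The function names are hypothetical, and the solver is a textbook illustration rather than the implementation used in the reviewed studies.

```python
import numpy as np

def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam) = sign(z) * max(|z| - lam, 0)."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)            # x_j^T x_j for every regulator
    residual = y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            residual += X[:, j] * beta[j]    # partial residual without regulator j
            rho = X[:, j] @ residual
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
            residual -= X[:, j] * beta[j]
    return beta
```

Running `lasso_cd` once per target gene and collecting the coefficient vectors column by column would yield a *p* × *q* matrix whose non-zero entries are the estimated edges of the network.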

Although the methods successfully perform edge selection and network estimation, the approaches provide an averaged network for all *n* cell lines. Thus, we cannot estimate cell line (or patient) characteristic-specific models (i.e., molecular interplay). In other words, the methods are insufficient to extract useful information for precision medicine.

#### **3. Sample-Specific Gene Network Estimation**

To effectively extract crucial information for precision medicine, cell line (or patient) characteristic-specific identification is a crucial issue. We reviewed computational approaches for cell line characteristic-specific modelling, especially cell line characteristic-specific gene regulatory network estimation. The following varying coefficient model was used for cell line characteristic-specific modelling [18],

$$\mathbf{y}_{\ell} = \sum_{j=1}^{p} \beta_{j\ell}(m_{\alpha}) \cdot \mathbf{x}_{j} + \boldsymbol{\varepsilon}_{\ell}, \quad \ell = 1, \ldots, q, \quad \alpha = 1, \ldots, n, \tag{3}$$

where *β*<sub>*jℓ*</sub>(*m*<sub>*α*</sub>) describes the effect of the *j*th regulator gene on the *ℓ*th target gene in the *α*th target cell line. *m*<sub>*α*</sub> is a cancer-related characteristic of the *α*th cell line, such as drug sensitivity or survival risk. The model enables us to describe cell line characteristic- (*M* = *m*<sub>*α*</sub>) specific molecular interplay between genes, i.e., *β*<sub>*jℓ*</sub>(*m*<sub>*α*</sub>).

#### *3.1. NetworkProfiler*

The varying coefficient *β*<sub>*jℓ*</sub>(*m*<sub>*α*</sub>), describing the cell line characteristic-specific strength of the relationship between the *j*th regulator and the *ℓ*th target gene in the *α*th cell line, can be estimated by the following kernel-based *L*<sub>1</sub>-type regularization method, called NetworkProfiler [19],

$$L(\boldsymbol{\beta}_{\ell\alpha}|b_{\ell}) = \frac{1}{2} \sum_{i=1}^{n} \left\{ y_{i\ell} - \sum_{j=1}^{p} \beta_{j\ell}(m_{\alpha}) x_{ij} \right\}^2 G(m_i - m_{\alpha}|b_{\ell}) + P(\boldsymbol{\beta}_{\ell\alpha}), \tag{4}$$

where

$$G(m_i - m_{\alpha} | b_{\ell}) = \exp\left\{\frac{-(m_i - m_{\alpha})^2}{b_{\ell}}\right\}, \tag{5}$$

is a Gaussian kernel function that controls the weight of cell lines when modelling the *α*th target cell line. The NetworkProfiler groups cell lines according to the similarity of their specific characteristics (i.e., modulator *m*<sub>*i*</sub> for *i* = 1, ..., *n*) and performs modelling for the *α*th cell line based only on the cell lines in the neighbourhood around the *α*th cell line. That is, the modelling for the *α*th cell line is based only on the cell lines whose modulator values are similar to the target sample's modulator value *m*<sub>*α*</sub>. This implies that the NetworkProfiler can estimate cell line characteristic-specific gene regulatory networks.
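The role of the kernel weights in (4) can be illustrated with a short sketch. The function names are hypothetical. The second helper uses the identity Σ<sub>i</sub> w<sub>i</sub>(y<sub>i</sub> − x<sub>i</sub>β)² = Σ<sub>i</sub>(√w<sub>i</sub> y<sub>i</sub> − √w<sub>i</sub> x<sub>i</sub>β)², which means that any ordinary *L*<sub>1</sub>-regularized solver can be reused on the rescaled data.

```python
import numpy as np

def gaussian_kernel_weights(m, m_alpha, b):
    """Weight of each cell line i for target modulator value m_alpha, as in Eq. (5)."""
    m = np.asarray(m, dtype=float)
    return np.exp(-((m - m_alpha) ** 2) / b)

def weighted_design(X, y, w):
    """Rescale rows by sqrt(w) so that weighted least squares becomes ordinary least squares."""
    s = np.sqrt(w)
    return X * s[:, None], y * s
```

Cell lines whose modulator values are far from *m*<sub>*α*</sub> receive weights close to zero and therefore contribute almost nothing to the network estimated for the *α*th cell line.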

The cancer-related characteristics of cell lines are not usually uniformly distributed. Figure 1 shows the anti-cancer drug sensitivity of cell lines, where the eight drugs are randomly selected from the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Dependency Map (DepMap) projects. As shown in Figure 1, sensitivities of some anti-cancer drugs (characteristics of cell lines: modulator) are non-uniformly distributed, i.e., there are cell lines having rare cancer characteristics.

**Figure 1.** Drug sensitivities in the GDSC and DepMap databases: four drugs were randomly selected from each of the GDSC and DepMap datasets.

**Limitation:** The NetworkProfiler cannot perform well when the modulator is not uniformly distributed, especially when modelling the target cell line with a rare characteristic located in a sparse region of its distribution, because the method is based on the constant bandwidth (*b*) of Gaussian kernel function. In the NetworkProfiler, the bandwidth specifies the length-scale of the kernel function and controls the weights of cell lines. It implies that the NetworkProfiler based on the constant bandwidth performs cell line characteristic-specific modelling without consideration of the distribution of the modulator and location of the modulator value *m<sup>α</sup>* of the target sample in the distribution. Thus, the NetworkProfiler imposes a small amount of weight to almost all samples for modelling a target sample in a sparse region. Figure 2 shows the values of the Gaussian kernel function (i.e., weight for cell lines) with a constant bandwidth *b* for a target sample in both sparse and dense regions, where *y*-axis and *x*-axis indicate weights and modulator values of cell-lines, respectively. As shown in Figure 2, the Gaussian kernel function based on the constant bandwidth imposes the non-zero weight on only a few samples for the modelling of the target sample in a sparse region. It leads to extremely high dimensional data situations; thus, gene regulatory network estimation (i.e., edges selection and edge size estimation) cannot be appropriately performed.

**Figure 2.** Gaussian kernel function to impose weight on cell lines, where *y*-axis and *x*-axis indicate weight and modulator values of cell lines.

#### *3.2. Adaptive NetworkProfiler*

To address this issue, Park et al. [20] developed a novel strategy, called the adaptive NetworkProfiler, based on an adaptive bandwidth of the Gaussian function. The adaptive NetworkProfiler computes the weight of cell lines by using an adaptive Gaussian kernel function, where the bandwidth is based on the *k*-nearest neighbour (KNN) rule and is called an adaptive bandwidth [21]. The adaptive bandwidth for the *α*th target cell line is based on the Euclidean distance between the modulator value of the *α*th cell line (*m*<sub>*α*</sub>) and its *k*th nearest neighbour. By using the adaptive bandwidth instead of a constant one, the KNN-Gaussian kernel function has a relatively wide kernel when modelling a target sample with a rare modulator value located in a sparse region, because the *k*th nearest neighbour of the *α*th target cell line is also far from *m*<sub>*α*</sub>. Thus, the KNN-Gaussian kernel function can overcome the drawback of the ordinary NetworkProfiler for modelling a target sample in the sparse region.

The adaptive NetworkProfiler was developed based on the adaptive kernel function, with an additional parameter incorporating dispersion of modulators (i.e., range of a modulator: *r*(*M*)) as follows,

$$L(\boldsymbol{\beta}_{\ell\alpha}|b_{\ell}^{\text{KNN}}, r(M)) = \frac{1}{2} \sum_{i=1}^{n} \left\{ y_{i\ell} - \sum_{j=1}^{p} \beta_{j\ell}(m_{\alpha}) x_{ij} \right\}^2 K(m_i - m_{\alpha}|b_{\ell\alpha}^{\text{KNN}}, r(M)) + P(\boldsymbol{\beta}_{\ell\alpha}), \tag{6}$$

where

$$K(m_i - m_{\alpha} | b_{\ell\alpha}^{\text{KNN}}, r(M)) = \exp\left(\frac{-(m_i - m_{\alpha})^2}{b_{\ell\alpha}^{\text{KNN}} \cdot r(M)}\right), \tag{7}$$

and *b*<sup>KNN</sup><sub>*ℓα*</sub> is the Euclidean distance between *m*<sub>*α*</sub> and its *k*th nearest neighbour *m*<sub>*α*</sub><sup>*k*th</sup>,

$$b_{\ell\alpha}^{\text{KNN}} = \sqrt{(m_{\alpha} - m_{\alpha}^{k\text{th}})^2} \quad \text{for} \quad \alpha = 1, 2, \ldots, n, \tag{8}$$

and *r*(*M*) is the hyperparameter, incorporating dispersion of the modulator *M*.

The adaptive bandwidth in the Gaussian kernel function, together with the additional parameter, incorporates the distribution of cell line characteristics (*M*) and the location of the characteristic value (*m*<sub>*α*</sub>) in that distribution. Thus, the adaptive NetworkProfiler can overcome the problem that only a small number of samples receive non-zero weight and can effectively perform cell line characteristic-specific gene network estimation for target samples in both dense and sparse regions.
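The effect of the KNN-based bandwidth in (7) and (8) can be seen in a small numerical sketch. The function names, the toy modulator values, and the choice of *r*(*M*) as the range of the modulator are illustrative assumptions; the sketch also assumes that the *k*th nearest neighbour is at a non-zero distance.

```python
import numpy as np

def knn_bandwidth(m, alpha, k):
    """Distance between m[alpha] and its k-th nearest neighbour, as in Eq. (8)."""
    d = np.sort(np.abs(m - m[alpha]))
    return d[k]                     # d[0] = 0 is the target sample itself

def adaptive_weights(m, alpha, k, r):
    """KNN-Gaussian kernel weights of all samples for target sample alpha, as in Eq. (7)."""
    b = knn_bandwidth(m, alpha, k)  # assumed to be strictly positive
    return np.exp(-((m - m[alpha]) ** 2) / (b * r))

# Five samples in a dense region and one rare sample in a sparse region
m = np.array([0.0, 0.1, 0.2, 0.3, 0.4, 5.0])
r = m.max() - m.min()               # range of the modulator, used here as r(M)
constant = np.exp(-((m - m[5]) ** 2) / 0.1)   # constant bandwidth b = 0.1
adaptive = adaptive_weights(m, alpha=5, k=2, r=r)
```

With the constant bandwidth, only the rare sample itself receives a non-negligible weight, whereas the adaptive bandwidth widens the kernel so that the samples from the dense region still contribute to the fit.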

**Limitation:** The NetworkProfiler and Adaptive NetworkProfiler construct the cancer characteristic-specific gene network based on a single cancer characteristic. That is, the methods consider one characteristic and measure the similarity of cell lines in a one-dimensional cell line characteristic space. Thus, the cancer characteristic-specific gene networks estimated by the methods cannot describe the gene regulatory system under varying conditions of multiple cancer characteristics.

#### *3.3. Gene Network Analysis in Multi-Dimensional Cell Line Space*

In order to incorporate various cancer-related characteristics of cell lines and extract more precise cell-line specific molecular interplays, the cell line characteristic specific gene network estimation is extended to the multi-dimensional cell line space [22].

For *h* characteristics of cell lines *M* = (*m*<sub>1</sub>, ..., *m*<sub>*h*</sub>) ∈ ℝ<sup>*n*×*h*</sup>, the varying coefficient model in (3) is given as follows,

$$\mathbf{y}_{\ell} = \sum_{j=1}^{p} \beta_{j\ell}(\boldsymbol{m}_{\alpha}) \cdot \mathbf{x}_{j} + \boldsymbol{\varepsilon}_{\ell}, \quad \ell = 1, \ldots, q, \quad \alpha = 1, \ldots, n, \tag{9}$$

where ***m***<sub>*α*</sub> = (*m*<sub>*α*1</sub>, ..., *m*<sub>*αh*</sub>). In the *h*-dimensional cell line space, the similarity between cell lines is measured by the following multivariate Gaussian kernel function,

$$K(\boldsymbol{m}_{i} - \boldsymbol{m}_{\alpha}|\boldsymbol{H}_{\ell}) = |\boldsymbol{H}_{\ell}|^{-1/2} \exp\left\{ -\frac{1}{2} (\boldsymbol{m}_{i} - \boldsymbol{m}_{\alpha})^{T} \boldsymbol{H}_{\ell}^{-1} (\boldsymbol{m}_{i} - \boldsymbol{m}_{\alpha}) \right\}, \tag{10}$$

where ***H***<sub>*ℓ*</sub> is the bandwidth matrix (e.g., covariance matrix). Then, the multi-dimensional cell line characteristic-specific gene network is estimated by the following multivariate kernel-based *L*<sub>1</sub>-type regularization method,

$$L(\boldsymbol{\beta}_{\ell\alpha}|\boldsymbol{H}_{\ell}) = \frac{1}{2} \sum_{i=1}^{n} \left\{ y_{i\ell} - \sum_{j=1}^{p} \beta_{j\ell}(\boldsymbol{m}_{\alpha}) x_{ij} \right\}^2 K(\boldsymbol{m}_i - \boldsymbol{m}_{\alpha}|\boldsymbol{H}_{\ell}) + P(\boldsymbol{\beta}_{\ell\alpha}). \tag{11}$$

The multi-dimensional cell line characteristic specific analysis enables us to extract more precise characterization of cell-lines, and thus we can effectively estimate precision cancer gene regulatory networks.
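The multivariate kernel in (10) can be evaluated for all cell lines at once, as in the following sketch. The function name is hypothetical, and the bandwidth matrix is assumed to be symmetric positive definite.

```python
import numpy as np

def mv_gaussian_kernel(M, alpha, H):
    """Kernel weight of every cell line (rows of M) relative to target cell line alpha."""
    diff = M - M[alpha]                            # (n, h) differences m_i - m_alpha
    solved = np.linalg.solve(H, diff.T).T          # rows hold H^{-1} (m_i - m_alpha)
    quad = np.einsum("ij,ij->i", diff, solved)     # quadratic form for each cell line
    return np.linalg.det(H) ** -0.5 * np.exp(-0.5 * quad)
```

A diagonal *H* treats the characteristics independently, whereas off-diagonal entries (for instance, those of a sample covariance matrix of *M*) account for correlated characteristics when measuring the similarity between cell lines.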

**Limitation:** Precision cancer gene network estimation provides hundreds of matrices with more than 2000 rows for regulator genes and more than 10,000 columns for target genes. Although various computational tactics have been developed and successfully applied to gene regulatory network estimation, the interpretation of the large-scale gene networks remains a challenge. The existing studies on cell line characteristic-specific gene networks focused only on the known markers and then interpreted the massive networks based on the neighbourhoods of the known markers, i.e., only a narrow interpretation was performed. However, comprehensive analysis of the multiple massive networks is essential to understand the complex mechanism of cancer. The interpretation of the multi-layer massive network was the bottleneck of the existing studies on precision cancer gene network analysis.

#### **4. Interpretation of the Multi-Layer Massive Networks**

In this section, we review computational strategies to interpret the multiple and massive gene regulatory networks.

#### *4.1. Network Constrained Sparse Common Component Analysis (NetSCCA)*

Park et al. [22] considered common structure identification of the multiple matrix datasets to interpret multilayer massive networks. The cell line-specific gene regulatory system can be described by the following regulatory effect of the *j*th regulator gene on the *ℓ*th target gene in the *α*th cell line [19,22],

$$r_{\alpha\ell j} = \hat{\beta}_{j\ell}(\boldsymbol{m}_{\alpha}) \cdot x_{\alpha j} \quad \text{for} \quad j = 1, \ldots, p, \tag{12}$$

where *x*<sub>*αj*</sub> is the expression level of the *j*th gene in the *α*th cell line. For the *ℓ*th target gene, a matrix for the regulatory effect of *p* regulators is given as *R*<sub>*ℓ*</sub> = (*r*<sub>1*ℓ*</sub>, ..., *r*<sub>*nℓ*</sub>)<sup>*T*</sup> ∈ ℝ<sup>*n*×*p*</sup>, where *r*<sub>*αℓ*</sub> = (*r*<sub>*αℓ*1</sub>, ..., *r*<sub>*αℓp*</sub>)<sup>*T*</sup>.

To interpret the large-scale gene regulatory networks and identify crucial biomarkers that play a key role in the cancer-related mechanism of interest, the network-constrained sparse common component analysis (NetSCCA) was developed. The crucial common component of multiple datasets (*R*<sub>*ℓ*</sub>, *ℓ* = 1, ..., *q*) can be estimated by [23],

$$\begin{aligned} \underset{A}{\arg\min} & \left\{ \sum_{\ell=1}^{q} \|\mathbf{R}_{\ell} - \mathbf{R}_{\ell} \mathbf{A} \mathbf{A}^{T}\|_{F}^{2} \right\}, \\ & \text{subject to} \quad \mathbf{A}^{T} \mathbf{A} = \mathbf{I}_{K}. \end{aligned} \tag{13}$$

As shown in (13), the common component analysis of *q* datasets can be considered as a principal component analysis (PCA) of *q* datasets. That is, if there is only one dataset *R*<sub>1</sub>, then the model becomes a standard PCA. Wang et al. [23] showed that the common loading matrix *A* in (13) can be optimized as the solution to the following problem,

$$\begin{aligned} \underset{A}{\arg\max} & \ \text{Tr}(A^T G A), \\ & \text{subject to} \quad A^T A = I_K, \end{aligned} \tag{14}$$

where *G* = ∑<sup>*q*</sup><sub>*ℓ*=1</sub> *R*<sub>*ℓ*</sub><sup>*T*</sup>*R*<sub>*ℓ*</sub>. This implies that the common loading matrix *A* can be estimated by solving the standard PCA problem,

$$\begin{aligned} \underset{A}{\arg\min} & \left\{\|\mathbf{Q} - \mathbf{Q}AA^T\|_F^2\right\}, \\ & \text{subject to} \quad A^T A = I_K, \end{aligned} \tag{15}$$

where *Q* is the square root of *G*, i.e., *Q*<sup>*T*</sup>*Q* = *G*. The common component analysis enables us to estimate the common subspace of the multiple massive networks (i.e., *R*<sub>*ℓ*</sub>, *ℓ* = 1, ..., *q*) and extract the crucial common component of the datasets.
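The equivalence between (13) and (14) suggests a direct numerical recipe: accumulate *G* and take its leading eigenvectors. The sketch below illustrates this for the dense (non-sparse) case only; the function names are hypothetical, and the list `R_list` stands for the matrices *R*<sub>*ℓ*</sub>.

```python
import numpy as np

def common_components(R_list, K):
    """Common loading matrix A: the K leading eigenvectors of G = sum_l R_l^T R_l."""
    G = sum(R.T @ R for R in R_list)
    eigvals, eigvecs = np.linalg.eigh(G)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:K]       # indices of the K largest eigenvalues
    return eigvecs[:, order]

def reconstruction_error(R_list, A):
    """Objective of Eq. (13): summed squared Frobenius error over all datasets."""
    return sum(np.linalg.norm(R - R @ A @ A.T, "fro") ** 2 for R in R_list)
```

Because ‖*R* − *RAA*<sup>*T*</sup>‖² = ‖*R*‖² − Tr(*A*<sup>*T*</sup>*R*<sup>*T*</sup>*RA*) for any *A* with orthonormal columns, minimizing (13) is the same as maximizing the trace in (14), which the leading eigenvectors of *G* achieve.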

The common component estimation in (13) yields a fully dense projection matrix *A*; that is, the common component is estimated as a linear combination of all features. This makes the estimated common components difficult to interpret and can also produce erroneous estimates, because the analysis uses noisy features alongside crucial ones. To address this issue, a sparse learning-based strategy called NetSCCA was developed to achieve better biological interpretability [22]. NetSCCA uses sparse learning to estimate the projection matrix *A* from crucial features only, without disturbance from noisy features, and incorporates into the common component estimation the network biology knowledge that genes with similar molecular interactions may have similar biological functions.

NetSCCA measures the similarity between genes on networks by using the following Jaccard similarity [24]:

$$\mathcal{W}\_{\mathbf{j},\mathbf{s}} = \frac{|\mathcal{N}\_{\mathbf{j}} \cap \mathcal{N}\_{\mathbf{s}}|}{|\mathcal{N}\_{\mathbf{j}} \cup \mathcal{N}\_{\mathbf{s}}|} \tag{16}$$

where *N<sub>j</sub>* is the set of nodes that are directly connected to the *j*th gene via an edge in at least one cell line. The similarity between genes, *W*<sub>*j*,*s*</sub>, is then incorporated into the estimation of the sparse common loading matrix *A* as follows,

$$\underset{\boldsymbol{\Theta},A}{\arg\min} \{ \sum\_{\ell=1}^{q} \|\mathbf{Q} - \mathbf{Q} \boldsymbol{\Theta} A^{T} \|\_{F}^{2} \} + \lambda\_{1} \sum\_{k=1}^{K} \|\boldsymbol{\theta}\_{k} \|\_{1} + \lambda\_{2} \sum\_{k=1}^{K} \sum\_{j<s} \cdots \tag{17}$$

where *θ<sub>k</sub>* ∝ *A<sub>k</sub>* is a *p*-dimensional vector and **Θ** = (*θ*<sub>1</sub>, ..., *θ<sub>K</sub>*). The last term (the network penalty) of (17) locally smooths the coefficients and encourages the simultaneous selection of related genes. In other words, a large weight is imposed on the coefficients of two genes with many common interactions, which encourages their coefficients to be similar in the common structure estimation. Thus, NetSCCA can provide biologically interpretable results of the common component analysis of the multiple networks.
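The Jaccard weights in (16) can be computed directly from the estimated edge lists. The sketch below is illustrative only: the gene labels and edges are toy data, the function names are hypothetical, edges are treated as undirected when collecting neighbors, and the similarity of two genes without any neighbors is set to 0 by assumption.

```python
def neighbor_sets(edge_lists):
    """Collect, for every gene, its direct neighbors over all cell lines.

    edge_lists holds one list of (gene, gene) edges per cell line; a neighbor
    counts if the edge is present in at least one cell line.
    """
    neighbors = {}
    for edges in edge_lists:
        for u, v in edges:
            neighbors.setdefault(u, set()).add(v)
            neighbors.setdefault(v, set()).add(u)
    return neighbors

def jaccard_similarity(neighbors, j, s):
    """W_{j,s} = |N_j ∩ N_s| / |N_j ∪ N_s|."""
    n_j, n_s = neighbors.get(j, set()), neighbors.get(s, set())
    union = n_j | n_s
    if not union:              # both genes isolated: defined as 0 here (assumption)
        return 0.0
    return len(n_j & n_s) / len(union)

# Toy networks for two cell lines (hypothetical gene labels).
cell_line_1 = [("A", "C"), ("B", "C")]
cell_line_2 = [("A", "D"), ("B", "D"), ("B", "E")]
nbrs = neighbor_sets([cell_line_1, cell_line_2])
w_ab = jaccard_similarity(nbrs, "A", "B")   # N_A = {C, D}, N_B = {C, D, E}
```

Genes that share most of their neighbors receive weights close to 1, so the network penalty acts most strongly on exactly those pairs.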

The NetSCCA algorithm is given in Algorithm 1.

#### **Algorithm 1** NetSCCA: Network constrained sparse common component analysis.


3.1: Start *A* at *V* = [*V*1, *V*2, ... , *VK*], which is the loading matrix from ordinary PCA of *Q*.

3.2: Given a fixed *A* = [*a*<sub>1</sub>, *a*<sub>2</sub>, ..., *a<sub>K</sub>*], solve the following problem,

$$\hat{\theta}\_k = \underset{\theta\_k}{\arg\min} \{ \|\mathbf{z}\_k - \mathbf{Q}\theta\_k\|^2 \} + \lambda\_1 \|\theta\_k\|\_1 + \lambda\_2 \sum\_{j<s} \cdots$$

where *z<sub>k</sub>* = *Qa<sub>k</sub>*. Update **Θ̂** = [*θ̂*<sub>1</sub>, *θ̂*<sub>2</sub>, ..., *θ̂<sub>K</sub>*].

3.3: For a fixed **Θ̂**, perform the singular value decomposition *Q<sup>T</sup>Q***Θ̂** = *U***Γ***V<sup>T</sup>* and update *Â* = *UV<sup>T</sup>* (see Zou et al. [25]).

3.4: Repeat Steps 3.2–3.3, until convergence.


#### *4.2. Explainable AI for Gene Network-Based Prediction (Xprediction)*

In this section, we review an explainable AI approach for the network-based prediction, called Xprediction [28]. Although the machine learning-based AI approaches provide effective prediction results, most of the existing approaches were developed focusing only on mathematical/statistical accuracy. Thus, the existing AI methodologies cannot explain their decision rules (i.e., the existing AI cannot explain how and why a decision has been made, causing the black-box problem). However, the interpretability and explainability are essential for use of AI strategies in various fields of research, especially medical science.

Xprediction achieves not only prediction accuracy but also interpretability of deep learning-based AI. The method builds on widely used machine learning and deep learning approaches (e.g., the kernel support vector machine, random forest, and deep neural network) as prediction models, and describes the cruciality of an input to the output by comparison with the results of a model without that input. That is, Xprediction constructs a model with one feature removed at a time (i.e., with the molecular interaction between the *ℓ*th target gene and the *j*th regulator gene removed) and performs a prediction, and the prediction is iterated over randomly constructed cross-validation datasets. The cruciality of each molecular interaction is measured by comparing the resulting prediction accuracy with the prediction accuracy based on all molecular interactions, Acc(*ŷ*).

The significance of each molecular interaction is computed by a *t*-test comparing the prediction accuracies of the models with and without the edge (i.e., interaction). Let *N* be the number of iterations used to compute the prediction accuracy from the randomly constructed cross-validation datasets; the mean and standard deviation of the prediction accuracies over the *N* iterations are then denoted Acc(*ŷ*) (written with an overbar in (18)) and *s<sub>ŷ</sub>*, respectively. For the model without the (*l*, *j*) interaction, the corresponding quantities are denoted *N*<sup>(*l*,*j*)</sup>, Acc(*ŷ*<sup>(*l*,*j*)</sup>), and *s*<sub>*ŷ*<sup>(*l*,*j*)</sup></sub>, respectively. We performed the following *t*-test,

$$T\_{\ell j} = \frac{\overline{\text{Acc}(\hat{y})} - \overline{\text{Acc}(\hat{y}^{(l,j)})}}{s\_p \sqrt{\frac{1}{N} + \frac{1}{N^{(l,j)}}}} \tag{18}$$

where *s<sub>p</sub>* is the pooled standard deviation,

$$s\_p = \sqrt{\frac{s\_{\hat{y}}^{2}(N-1) + s\_{\hat{y}^{(l,j)}}^{2}(N^{(l,j)}-1)}{N + N^{(l,j)} - 2}}.$$

The cruciality of the (*l*, *j*)th interaction, *I*<sub>*ℓj*</sub>, on the prediction result is then measured by the *p* value of the *t*-test. The algorithm of Xprediction is given in Algorithm 2.
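The test statistic in (18) can be sketched in a few lines of Python. This is an illustration under stated assumptions: the accuracy values are toy numbers, the function name is hypothetical, and the pooled standard deviation is taken in its standard pooled-variance form.

```python
import math
from statistics import mean, stdev

def pooled_t_statistic(acc_full, acc_reduced):
    """Two-sample pooled t statistic comparing accuracies with and without one edge.

    acc_full:    accuracies of the model using all interactions (N values)
    acc_reduced: accuracies of the model with one interaction removed
    """
    n1, n2 = len(acc_full), len(acc_reduced)
    s1, s2 = stdev(acc_full), stdev(acc_reduced)      # sample standard deviations
    # Pooled standard deviation (standard pooled-variance form, assumed here).
    sp = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (mean(acc_full) - mean(acc_reduced)) / (sp * math.sqrt(1 / n1 + 1 / n2))

# Toy accuracies (hypothetical): removing the edge lowers accuracy by 0.10.
acc_all_edges = [0.80, 0.82, 0.84]
acc_without_edge = [0.70, 0.72, 0.74]
t_stat = pooled_t_statistic(acc_all_edges, acc_without_edge)
```

The *p* value then follows from a *t* distribution with *N* + *N*<sup>(*l*,*j*)</sup> − 2 degrees of freedom, for example through a statistics library; that step is left out here to keep the sketch dependency-free.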

#### **Algorithm 2** Xprediction: explainable prediction.


#### **5. Applications**

In this section, we apply the reviewed computational strategy to precision cancer network analysis. Specifically, we use the explainable AI method Xprediction to identify anti-cancer drug markers. The drug sensitivity data (i.e., primary-screen-replicate-collapsed-logfold-change) and RNA expression levels of genes were obtained from the CCLE dataset (https://depmap.org/portal/, accessed on 4 August 2022). For the expression levels, we extracted the 1922 genes with the highest 10% of variances across cell lines. We focused on the anti-cancer drugs capecitabine and oxaliplatin, which are used in a chemotherapy combination known as XELOX or CAPEOX. XELOX is used to treat colorectal and gastric cancer [29–31].

We first estimated capecitabine sensitivity-specific gene networks by using NetworkProfiler. We then defined oxaliplatin-sensitive and -resistant cell lines based on the fifth (5P) and ninety-fifth (95P) percentiles of the drug sensitivity (DS) values, i.e., sensitive cells: DS < 5P and resistant cells: DS > 95P. We constructed a prediction model based on deep learning (i.e., a deep neural network) to predict oxaliplatin sensitivity. In our analysis, a fully connected feed-forward neural network with two hidden layers was used. The *ReLU* activation function was used in the hidden layers, and the sigmoid function was used in the output layer. We randomly split the dataset into 10 folds and evaluated the prediction accuracy by 10-fold cross-validation, i.e., the prediction accuracy was given as the average of the prediction accuracies over the 10 test sets. By using Xprediction, crucial molecular interactions that explain oxaliplatin sensitivity were identified based on *p* value < 0.05. Table 1 shows the identified crucial interactions and the corresponding *p* values.
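The percentile-based labelling of cell lines described above can be sketched as follows. The cell line names and DS values are toy data, the function names are hypothetical, and linear interpolation between order statistics is only one of several common percentile definitions (the source does not state which one was used).

```python
def percentile(values, q):
    """q-th percentile (0 to 100) using linear interpolation between order statistics."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def label_cell_lines(ds):
    """Label cell lines as sensitive (DS < 5P) or resistant (DS > 95P).

    ds maps cell line name -> drug sensitivity value; cell lines between the
    two cut-offs are left out of the returned dictionary.
    """
    p5 = percentile(ds.values(), 5)
    p95 = percentile(ds.values(), 95)
    labels = {}
    for name, value in ds.items():
        if value < p5:
            labels[name] = "sensitive"
        elif value > p95:
            labels[name] = "resistant"
    return labels

# Toy drug sensitivity values (hypothetical) for 101 cell lines.
toy_ds = {f"cell_{i}": float(i) for i in range(101)}
toy_labels = label_cell_lines(toy_ds)
```

Only the two tails enter the classifier in this scheme, so the training set is much smaller than the full panel but the two classes are well separated.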


**Table 1.** Crucial molecular interplays to explain oxaliplatin sensitivity, where *X* → *Y* indicates interaction from regulator gene *X* to target gene *Y*.

Figure 3 shows the gene regulatory networks consisting of the crucial molecular interplays identified for oxaliplatin sensitivity prediction, where the top and bottom panels show the networks in drug-sensitive and drug-resistant cell lines, respectively. The edge widths represent the median strength of the interactions between genes in drug-sensitive and drug-resistant cell lines, respectively.

As shown in Figure 3, drug-sensitive and drug-resistant cell lines show different gene regulatory systems for the identified markers. The interplay SYNE1 → IFITM1 can be considered an oxaliplatin-resistant specific gene regulatory system. The interplays SPRY2 → ETV1 and SLPI → PTK1B become weaker from sensitive to resistant cell lines. High expression levels of the identified drug-resistance markers SYNE1 and IFITM1 have been reported to be associated with poorer chemotherapy efficacy in gastric cancer and with resistance to endocrine therapy and chemotherapy [32,33]. These studies support our result that high activities of SYNE1 and IFITM1 are characteristic of capecitabine-resistant cell lines. On the other hand, it was demonstrated that high expression levels of SPRY2 are associated with chemotherapy-sensitive cell line MEK inhibitors, BRAF inhibitor-resistant cells, and ovarian cancer cells [34–36]. These reports are consistent with our result that the activity of SPRY2 is a signature of capecitabine-sensitive cell lines. This implies that our gene network analysis results are strongly supported by the existing literature.

**Figure 3.** Gene regulatory networks of the crucial molecular interplays for oxaliplatin sensitivity prediction. Panels: gene network in capecitabine-sensitive cell lines and gene network in capecitabine-resistant cell lines. Edge color indicates negative (red) and positive (blue) effects of regulator genes on their target genes.

Table 2 lists the genes involved in the crucial interplays together with the related anti-cancer drugs and cancers, where the column "Resistant" indicates that the gene was identified as a drug-resistance marker in existing studies. It can be seen from Table 2 that more than half of the identified genes have been confirmed as therapeutic targets not only for XELOX (i.e., oxaliplatin and capecitabine) but also for various other anti-cancer drugs (e.g., 5-FU, cisplatin, and paclitaxel). Furthermore, the cancer-related mechanisms of the genes have been verified in the literature. Although the mechanisms of some genes have not yet been uncovered, our results and the literature suggest that not just single genes but the identified molecular interplays may be crucial to understanding the mechanism of anti-cancer drug resistance in cell lines.

Based on the results of the precision cancer network analysis and the literature, we suggest that the molecular interplay between "SYNE1 and IFITM1" may lead to capecitabine resistance in cancer cell lines, while weakening of the molecular regulatory interactions between "SPRY2 and ETV1" and between "SLPI and PTK1B" may induce capecitabine resistance in cell lines.


**Table 2.** Identified markers and their evidence.

AML: Acute myelogenous leukaemia; BC: breast cancer; CC: colon cancer; CRC: colorectal cancer; EAC: oesophageal adenocarcinoma; GBC: gallbladder cancer; GS: gastric cancer; HCC: hepatocellular carcinoma; LC: lung cancer; OC: ovarian cancer; PC: pancreatic cancer; PDC: pancreatic ductal carcinomas; cDDP: cisdiamminedichloroplatinum. The \* indicates that the gene was identified as a drug resistant marker in the existing studies.

#### **6. Discussion**

In this article, we reviewed computational tactics for precision cancer network analysis. Although many studies have developed computational approaches to gene regulatory network analysis and gene network-based analysis has been applied to cancer research, the existing studies focused on an averaged gene network for all cell lines, from which crucial information for precision cancer research cannot be extracted. In this article, we have focused on cancer characteristic-specific gene networks and reviewed the computational strategies for cell line-specific modelling to identify cancer characteristic-specific molecular interplays. We also reviewed the studies on analysis and interpretation of the estimated multiple and massive gene regulatory networks. Finally, we presented the results of applying the reviewed computational tactics to anti-cancer drug sensitivity-specific gene network analysis. The application section described gene regulatory network analysis specific to a characteristic of cell lines (drug sensitivity). Our analysis can be easily extended to patient characteristic-specific gene network analysis by using expression levels and drug sensitivities summarized for each patient. We expect that the results of cancer patient characteristic-specific gene network analysis will provide crucial evidence for precision medicine.

Although we have reviewed some computational tactics for interpretation of multiple and massive gene networks, from cell line characteristic-specific gene network estimation to computational network biology, interpretation and analysis of the large-scale gene networks is still in its infancy. Thus, researchers in various fields of research are faced with a challenge to interpret the estimated large gene networks. Explainable machine learning and, more specifically, interpretable artificial intelligence will be a key tool to overcome this bottleneck in the near future.

**Author Contributions:** H.P. performed data analysis for the Applications section and drafted the manuscript. S.M. supervised the works. All authors have read and approved the final version of the manuscript.

**Funding:** This work was supported by MEXT, as a "Program for Promoting Researches on the Supercomputer Fugaku" (Unravelling origin of cancer and diversity by large-scale data analysis and artificial intelligence technology, Project ID: JPMXP1020200102, hp200138, hp210167, hp220163), and by JSPS KAKENHI (JP19K20402, JP22K12259).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The datasets used in the Application section are from the Dependency Map (DepMap) projects (https://depmap.org/portal/, accessed on 4 August 2022).

**Acknowledgments:** This research used the computational resources of supercomputer Fugaku, provided by the RIKEN Center for Computational Science and the Super Computer System, Human Genome Center, Institute of Medical Science, University of Tokyo.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Integrative Analysis of Transcriptome-Wide Association Study and Gene-Based Association Analysis Identifies In Silico Candidate Genes Associated with Juvenile Idiopathic Arthritis**

**Shuai Liu 1,2,†, Weiming Gong 1,2,†, Lu Liu 1,2, Ran Yan 1,2, Shukang Wang 1,2,\* and Zhongshang Yuan 1,2,\***


**\*** Correspondence: wsk2001@sdu.edu.cn (S.W.); yuanzhongshang@sdu.edu.cn (Z.Y.)

† These authors contributed equally to this work.

**Abstract:** Genome-wide association studies (GWAS) of juvenile idiopathic arthritis (JIA) suffer from low power due to limited sample sizes and from interpretation challenges because most signals are located in non-coding regions. Gene-level analysis could alleviate these issues. Using GWAS summary statistics, we performed two typical gene-level analyses of JIA, a transcriptome-wide association study (TWAS) using FUnctional Summary-based ImputatiON (FUSION) and a gene-based analysis using eQTL Multi-marker Analysis of GenoMic Annotation (eMAGMA), followed by comprehensive enrichment analysis. Among the 33 overlapping significant genes from these two methods, 11 were previously reported, including TYK2 (*P*<sub>FUSION</sub> = 5.12 × 10<sup>−6</sup>, *P*<sub>eMAGMA</sub> = 1.94 × 10<sup>−7</sup> for whole blood), IL-6R (*P*<sub>FUSION</sub> = 8.63 × 10<sup>−7</sup>, *P*<sub>eMAGMA</sub> = 2.74 × 10<sup>−6</sup> for cells EBV-transformed lymphocytes), and Fas (*P*<sub>FUSION</sub> = 5.21 × 10<sup>−5</sup>, *P*<sub>eMAGMA</sub> = 1.08 × 10<sup>−6</sup> for muscle skeletal). Some newly identified plausible JIA-associated genes are also reported, including IL-27 (*P*<sub>FUSION</sub> = 2.10 × 10<sup>−7</sup>, *P*<sub>eMAGMA</sub> = 3.93 × 10<sup>−8</sup> for liver), LAT (*P*<sub>FUSION</sub> = 1.53 × 10<sup>−4</sup>, *P*<sub>eMAGMA</sub> = 4.62 × 10<sup>−7</sup> for artery aorta), and MAGI3 (*P*<sub>FUSION</sub> = 1.30 × 10<sup>−5</sup>, *P*<sub>eMAGMA</sub> = 1.73 × 10<sup>−7</sup> for muscle skeletal). Enrichment analysis further highlighted 4 Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways and 10 Gene Ontology (GO) terms. Our findings can benefit the understanding of genetic determinants and potential therapeutic targets for JIA.

**Keywords:** juvenile idiopathic arthritis; transcriptome-wide association study; gene-based association analysis; enrichment analysis

#### **1. Introduction**

Juvenile idiopathic arthritis (JIA) is one of the most common rheumatic diseases characterized by arthritis in childhood [1]. It can damage multiple organs and tissues, such as the joints, heart, lungs, skin, and eyes, and is an important cause of disability in children under the age of 16 [2]. Previous studies have shown that the concordance rate of JIA in monozygotic twins and the relative risk of the disease in the siblings of JIA patients are elevated [3,4], highlighting the important role of genetic factors in the development of JIA [5,6]. Therefore, it is of great significance to probe the complex genetic associations of JIA to better understand the genetic mechanisms and investigate potential intervention targets.

Genome-wide association studies (GWAS) have successfully identified 22 risk loci associated with JIA [7–9]. However, these GWAS suffer from either small sample sizes or low case proportions, presumably owing to the difficulty of accurate JIA diagnosis and the lack of pathognomonic features [10]. So far, the JIA GWAS with a relatively large sample size and the largest case proportion includes only 3305 JIA cases and 9196 controls [9]. In addition, most genetic variants identified by GWAS of JIA are located in non-coding regions [11], which makes the association signals difficult to interpret.

**Citation:** Liu, S.; Gong, W.; Liu, L.; Yan, R.; Wang, S.; Yuan, Z. Integrative Analysis of Transcriptome-Wide Association Study and Gene-Based Association Analysis Identifies In Silico Candidate Genes Associated with Juvenile Idiopathic Arthritis. *Int. J. Mol. Sci.* **2022**, *23*, 13555. https://doi.org/10.3390/ijms232113555

Academic Editors: Giuseppe Novelli and Wajid Zaman

Received: 21 August 2022 Accepted: 2 November 2022 Published: 4 November 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

On the other hand, from a statistical perspective, many JIA-associated single nucleotide polymorphisms (SNPs) have effects that are too weak to be detected by GWAS. Gene-level analysis can not only aggregate many SNPs with small effects to improve power but also provide good biological interpretation, which makes the results more straightforward to translate into clinical practice.

With the increase in publicly available GWAS summary data and well-developed efficient tools, it is feasible to conduct gene-level analysis for JIA [12]. There are two typical gene-level association analysis methods with different model assumptions. One is the transcriptome-wide association study (TWAS), which has shown great promise in interpreting GWAS signals and is powerful in detecting associations between gene expression levels and complex diseases [13,14]. Recently, one TWAS of JIA has been conducted; however, it involved only two tissues and a lower JIA case proportion [15], which may lead to power loss and insufficient capture of JIA-related tissue information. The other is multi-marker analysis, which assigns SNPs to genes based on physical proximity and then conducts gene-based association analysis. Both methods, although based on different statistical principles, produce gene-level *p* values. We would like to emphasize that taking the genes common to gene-level methods with different statistical principles can reduce the risk of false discoveries from using a single method. Indeed, the results from the different analyses can complement each other [16,17].

In the present study, using the GWAS summary data of JIA with relatively larger sample size and case proportions (3305 cases and 9196 controls), we performed TWAS analysis and the gene-based association analysis to identify the tissue-specific JIA-associated genes. We used the false discovery rate (FDR) correction on each tissue to declare the significant genes. Finally, we overlapped the genes significantly detected from these two gene-level methods, and performed the enrichment analysis for these overlapped genes on the Metascape website to identify the significant Gene Ontology (GO) terms as well as the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways.

#### **2. Methods**

#### *2.1. Study Design, Data Source, and Quality Control*

The analysis flowchart of this study is shown in Figure 1. We used the largest GWAS summary statistics of JIA, from a large-scale meta-analysis [9], obtained from the NHGRI-EBI GWAS Catalog (Study Accession Code GCST90010715), where the JIA cases were diagnosed according to the International League of Associations for Rheumatology (ILAR) criteria. The data were restricted to European ancestry with stringent quality control as described previously [9]. Briefly, the GWAS of JIA initially recruited 4520 UK JIA patients and 9965 healthy individuals. Individuals with a call rate less than 0.98 or a discrepancy between genetically predicted sex and the database record were removed. In addition, SNPs that were non-autosomal or had a call rate <0.98 or a minor allele frequency (MAF) <0.01 were excluded. A total of 12,501 individuals (3305 cases and 9196 healthy controls) remained. For the summary data, we further excluded the major histocompatibility complex (MHC) region (chromosome 6: 25–35 Mb) due to its complex structure, restricted the data to biallelic SNPs, and removed SNPs with duplicated or missing rs IDs for subsequent analyses. In total, 7,415,262 SNPs were finally included. Bearing in mind that taking the genes shared by methods with different model assumptions can reduce the risk of false discoveries from using a single method, we applied two gene-level approaches with distinct principles, TWAS analysis and gene-based association analysis, as parallel analyses to obtain the common JIA-associated genes for enrichment analysis.
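The summary-statistics filters described above (MHC exclusion, biallelic SNPs, and removal of duplicated or missing rs IDs) can be sketched as a simple record filter. The field names, the toy records, and the function name are hypothetical; dropping every copy of a duplicated rs ID and approximating "biallelic SNP" by two single-base alleles are assumptions made for this illustration, not details given in the text.

```python
def filter_summary_stats(records):
    """Apply the three summary-data filters to a list of SNP records.

    Each record is a dict with the (hypothetical) keys
    'rsid', 'chr', 'pos', 'a1', and 'a2'.
    """
    bases = {"A", "C", "G", "T"}
    counts = {}
    for r in records:
        counts[r["rsid"]] = counts.get(r["rsid"], 0) + 1
    kept = []
    for r in records:
        if not r["rsid"] or counts[r["rsid"]] > 1:
            continue                      # missing or duplicated rs ID
        if r["chr"] == 6 and 25_000_000 <= r["pos"] <= 35_000_000:
            continue                      # MHC region, chromosome 6: 25-35 Mb
        if r["a1"] not in bases or r["a2"] not in bases:
            continue                      # keep single-base alleles only
        kept.append(r)
    return kept

# Toy records (hypothetical values).
toy_snps = [
    {"rsid": "rs1", "chr": 1, "pos": 1_000, "a1": "A", "a2": "G"},
    {"rsid": "rs2", "chr": 6, "pos": 30_000_000, "a1": "C", "a2": "T"},
    {"rsid": "rs3", "chr": 6, "pos": 40_000_000, "a1": "C", "a2": "T"},
    {"rsid": "rs4", "chr": 2, "pos": 500, "a1": "A", "a2": "C"},
    {"rsid": "rs4", "chr": 2, "pos": 900, "a1": "G", "a2": "T"},
    {"rsid": None, "chr": 3, "pos": 700, "a1": "A", "a2": "T"},
    {"rsid": "rs6", "chr": 2, "pos": 800, "a1": "AT", "a2": "A"},
]
kept_snps = filter_summary_stats(toy_snps)
```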

**Figure 1.** The flowchart of integrative analysis of FUSION and eMAGMA. ABBR: JIA, Juvenile idiopathic arthritis; GWAS, genome-wide association study; MHC, major histocompatibility complex; GTEx, Genotype-Tissue Expression Project; LD, linkage disequilibrium; EUR, European; FUSION, functional summary-based imputation; eQTL, expression quantitative trait loci; eMAGMA, eQTL Multi-marker Analysis of GenoMic Annotation; FDR, false discovery rate; KEGG, Kyoto Encyclopedia of Genes and Genomes; GO, Gene Ontology.

#### *2.2. TWAS Analysis*

TWAS aims to integrate GWAS and expression quantitative trait loci (eQTL) studies to identify tissue-specific gene-trait associations [18], and has shown great promise both in interpreting GWAS findings and in elucidating the underlying disease mechanisms. We adopted the FUnctional Summary-based ImputatiON (FUSION) method (http://gusevlab.org/projects/fusion/, accessed on 8 April 2022) to conduct the TWAS analysis of JIA. FUSION is the most commonly used TWAS method and has shown great promise in large-scale integrative omics data analysis [14]. Given the GWAS summary data and expression weights as input, FUSION imputes the gene expression in the GWAS and then performs an association analysis between the predicted gene expression and JIA. We selected the Genotype-Tissue Expression Project (GTEx) v8 with pre-computed gene expression weights from a total of 49 tissues. Because using all the tissues may introduce nuisance information and increase the computational burden, previous studies recommend using an expression panel from only plausible disease-related tissues in TWAS [13]. Here, we determined the analyzed tissues based not only on previous studies [2,19] but also on the clinical symptoms and involved organs of JIA, such as hepatomegaly, splenomegaly, anemia, disseminated intravascular coagulation, arthritis, rash, mesenteric lymphadenopathy, pericarditis, and pneumonia. We finally selected 18 tissues for analysis: artery aorta, artery coronary, artery tibial, cells EBV-transformed lymphocytes, colon sigmoid, colon transverse, heart atrial appendage, heart left ventricle, liver, lung, muscle skeletal, nerve tibial, skin not sun exposed suprapubic, skin sun exposed lower leg, small intestine terminal ileum, spleen, stomach, and whole blood. We used the 1000 Genomes European panel as the linkage disequilibrium (LD) reference data and obtained a tissue-specific *p* value for each gene across the different tissues. We finally performed FDR correction within each tissue, and genes with FDR less than 0.05 were declared significant.
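The per-tissue FDR step can be illustrated with a short Benjamini-Hochberg routine. The text does not name the FDR procedure used for the gene-level results, so Benjamini-Hochberg is an assumption here, and the tissue names, gene labels, and *p* values are toy data.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # from the largest p value to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

def significant_genes_by_tissue(results, alpha=0.05):
    """Apply the correction within each tissue separately.

    results maps tissue -> {gene: p value}; returns tissue -> set of genes
    whose adjusted p value is below alpha.
    """
    significant = {}
    for tissue, gene_p in results.items():
        genes = list(gene_p)
        adj = benjamini_hochberg([gene_p[g] for g in genes])
        significant[tissue] = {g for g, q in zip(genes, adj) if q < alpha}
    return significant

# Toy gene-level p values (hypothetical) for two tissues.
toy_results = {
    "liver": {"G1": 0.01, "G2": 0.04, "G3": 0.03, "G4": 0.5},
    "spleen": {"G1": 0.2, "G2": 0.9},
}
toy_significant = significant_genes_by_tissue(toy_results)
```

Correcting within each tissue means that the number of tests is the number of genes with expression weights in that tissue, not the total across all 18 tissues.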

#### *2.3. Gene-Based Association Analysis*

We used the eQTL Multi-marker Analysis of GenoMic Annotation (eMAGMA) method (https://github.com/eskederks/eMAGMA-tutorial, accessed on 28 March 2022) to conduct gene-based association analysis for JIA. eMAGMA follows the same statistical framework as MAGMA, which is based on a multiple linear principal component regression model and can provide better statistical performance in gene-based association analysis [16]. In addition, eMAGMA utilizes tissue-specific cis-eQTL information to assign SNPs to putative genes, providing more biologically meaningful and interpretable results [17]. Gene-based analysis using eMAGMA typically involves two stages, annotation and gene analysis [16]. In the annotation stage, we used the same tissues as in the TWAS analysis above and directly used the GTEx v8-based annotation files provided on eMAGMA's website. In the gene analysis stage, we used the 1000 Genomes European panel as the reference panel and tested the association between the annotated genes and JIA. We further performed FDR correction within each tissue and selected significant genes with FDR less than 0.05.

#### *2.4. Gene Set Enrichment Analysis*

We conducted gene set enrichment analysis for the overlapping genes that were significantly identified by both the TWAS and eMAGMA analyses. Specifically, these overlapping significant genes were subjected to GO term and KEGG pathway enrichment analysis on the Metascape website (https://metascape.org/gp/index.html#/main/step1, accessed on 26 April 2022) to better understand the biological mechanisms. Metascape essentially uses the hypergeometric test and the Benjamini-Hochberg *p* value correction algorithm to identify all enriched ontology terms. Because a large number of terms would make the results redundant and complicate the interpretation, a Kappa consistency test was performed: terms with Kappa > 0.3 were grouped into a cluster, and the most statistically significant term in each cluster was selected to represent the cluster [20]. The parameters Min Overlap, *p* Value Cutoff, and Min Enrichment were set to their default values. In addition, we constructed a protein–protein interaction (PPI) network for the overlapping significant genes on the STRING website (https://cn.string-db.org/, accessed on 15 September 2022).
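The hypergeometric test underlying this kind of enrichment analysis can be written out explicitly. The sketch below is a generic over-representation test, not Metascape's implementation; the function name, the argument names, and the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_term, n_list, n_overlap):
    """Upper-tail hypergeometric p value P(X >= n_overlap).

    n_background: number of genes in the background set
    n_term:       background genes annotated to the term or pathway
    n_list:       genes in the submitted list
    n_overlap:    submitted genes annotated to the term
    """
    upper = min(n_term, n_list)
    total = comb(n_background, n_list)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total

# Toy counts (hypothetical): 10 background genes, 4 annotated to the term,
# 3 genes submitted, 2 of which carry the annotation.
p_toy = hypergeom_enrichment_p(10, 4, 3, 2)
```

The resulting *p* values across all tested terms would then be adjusted for multiple testing, in line with the Benjamini-Hochberg correction mentioned above.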

#### **3. Results**

#### *3.1. TWAS Analysis*

We analyzed all genes involved in the 18 tissues, among which 275 tissue-specific genes were significantly detected by FUSION with FDR less than 0.05. Note that LAT was detected at the borderline of significance (*p* = 1.53 × 10<sup>−4</sup> and FDR = 5.40 × 10<sup>−2</sup> for Artery Aorta). These TWAS significant genes included some established JIA-associated genes that have been reported previously, such as CCDC101 (*p* = 5.82 × 10<sup>−8</sup> and FDR = 1.64 × 10<sup>−4</sup> for Muscle Skeletal), CLN3 (*p* = 5.82 × 10<sup>−8</sup> and FDR = 2.53 × 10<sup>−4</sup> for Whole Blood), ERAP2 (*p* = 5.49 × 10<sup>−6</sup> and FDR = 2.16 × 10<sup>−3</sup> for Cells EBV-transformed lymphocytes), and LNPEP (*p* = 3.53 × 10<sup>−6</sup> and FDR = 2.36 × 10<sup>−3</sup> for Artery Tibial). The consistency of these results with previous GWAS partly supports the validity of the FUSION analysis. All significant genes from the FUSION analysis are listed in Supplementary Table S1.

#### *3.2. Gene-Based Association Analysis Results*

Similarly, we analyzed all genes involved in the 18 tissues, among which 380 tissue-specific genes were significantly detected by eMAGMA with FDR less than 0.05. These included some well-known JIA-associated genes, such as SGF29 (*p* = 2.22 × 10<sup>−8</sup> and FDR = 3.38 × 10<sup>−5</sup> for Whole Blood), ANKRD55 (*p* = 3.09 × 10<sup>−8</sup> and FDR = 1.08 × 10<sup>−4</sup> for Spleen), ATP8B2 (*p* = 7.53 × 10<sup>−7</sup> and FDR = 6.59 × 10<sup>−4</sup> for Stomach), and PTPN2 (*p* = 3.96 × 10<sup>−6</sup> and FDR = 2.75 × 10<sup>−3</sup> for Spleen). Again, all these genes lie within the 22 risk loci identified by previous GWAS, which supports the reliability of the eMAGMA results. All significant genes from the eMAGMA analysis are summarized in Supplementary Table S2.

#### *3.3. Gene Set Enrichment Analysis*

We intersected the significant genes from the TWAS analysis and the eMAGMA gene-based association analysis within each tissue and found a total of 132 tissue-specific genes. After removing duplicates, 33 unique genes remained, among which 11 have been reported in previous studies [7–9], such as TYK2 (*P*<sub>FUSION</sub> = 5.12 × 10<sup>−6</sup>, *P*<sub>eMAGMA</sub> = 1.94 × 10<sup>−7</sup> for Whole Blood), interleukin (IL)-6R (*P*<sub>FUSION</sub> = 8.63 × 10<sup>−7</sup>, *P*<sub>eMAGMA</sub> = 2.74 × 10<sup>−6</sup> for Cells EBV-transformed lymphocytes), and Fas (*P*<sub>FUSION</sub> = 5.21 × 10<sup>−5</sup>, *P*<sub>eMAGMA</sub> = 1.08 × 10<sup>−6</sup> for Muscle Skeletal). The remaining 22 newly discovered genes are APOBR, ATXN2L, GSDMB, IKZF3, IL27, KIAA1109, LAT, LMAN2L, MAGI3, NFATC2IP, NUPR1, ORMDL3, PHTF1, PSMB7, RGS14, SBK1, SH2B1, STEAP1B, TNFSF15, TUFM, UXS1, and ZNF197. Among them, IL-27 (*P*<sub>FUSION</sub> = 2.10 × 10<sup>−7</sup>, *P*<sub>eMAGMA</sub> = 3.93 × 10<sup>−8</sup> for Liver), LAT (*P*<sub>FUSION</sub> = 1.53 × 10<sup>−4</sup>, *P*<sub>eMAGMA</sub> = 4.62 × 10<sup>−7</sup> for Artery Aorta), and MAGI3 (*P*<sub>FUSION</sub> = 1.30 × 10<sup>−5</sup>, *P*<sub>eMAGMA</sub> = 1.73 × 10<sup>−7</sup> for Muscle Skeletal) are the novel genes that are more likely to be associated with JIA. Detailed information for the 33 genes is summarized in Table 1.
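The two counting steps in this paragraph (tissue-specific overlaps first, unique genes second) correspond to a per-tissue set intersection followed by a union. The sketch below uses placeholder tissue and gene labels, not the study's results, and the function names are hypothetical.

```python
def overlap_by_tissue(twas_sig, emagma_sig):
    """Per-tissue intersection of two {tissue: set of genes} dictionaries.

    Tissues without any shared gene are left out of the result.
    """
    overlaps = {}
    for tissue in twas_sig.keys() & emagma_sig.keys():
        shared = twas_sig[tissue] & emagma_sig[tissue]
        if shared:
            overlaps[tissue] = shared
    return overlaps

def unique_genes(overlaps):
    """De-duplicated gene set over all tissues."""
    return set().union(*overlaps.values())

# Placeholder inputs: significant genes per tissue from each method.
toy_twas = {"tissue_1": {"G1", "G2"}, "tissue_2": {"G1", "G3"}, "tissue_3": {"G4"}}
toy_emagma = {"tissue_1": {"G1"}, "tissue_2": {"G1", "G3", "G5"}, "tissue_4": {"G6"}}

toy_overlaps = overlap_by_tissue(toy_twas, toy_emagma)
n_tissue_gene_pairs = sum(len(genes) for genes in toy_overlaps.values())
toy_unique = unique_genes(toy_overlaps)
```

In this toy example gene G1 overlaps in two tissues, so it contributes two tissue-gene pairs but only one unique gene, which is the same distinction as between the 132 tissue-specific genes and the 33 unique genes.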

**Table 1.** Overlapping genes identified by FUSION and eMAGMA.




The 22 novel genes are shown in bold. Gene positions (start and end) are based on GRCh37/hg19 from Ensembl.

We further performed KEGG and GO enrichment analyses on the 33 overlapping significant genes on the Metascape website. For the KEGG enrichment analysis, we found four significant KEGG pathways in total (Figure 2): Th17 cell differentiation (*p* = 5.83 × 10<sup>−6</sup>), cytokine–cytokine receptor interaction (*p* = 2.92 × 10<sup>−4</sup>), spinocerebellar ataxia (*p* = 5.11 × 10<sup>−4</sup>), and Rap1 signaling pathway (*p* = 1.55 × 10<sup>−3</sup>). For the GO enrichment analysis, we found ten significant GO terms in total (Figure 2): signaling receptor complex adaptor activity (*p* = 2.25 × 10<sup>−5</sup>), apoptotic signaling pathway (*p* = 2.09 × 10<sup>−4</sup>), cytokine receptor binding (*p* = 2.14 × 10<sup>−4</sup>), regulation of blood pressure (*p* = 1.16 × 10<sup>−3</sup>), positive regulation of protein phosphorylation (*p* = 1.18 × 10<sup>−3</sup>), inflammatory response (*p* = 2.44 × 10<sup>−3</sup>), steroid metabolic process (*p* = 2.66 × 10<sup>−3</sup>), regulation of autophagy (*p* = 5.96 × 10<sup>−3</sup>), protein homodimerization activity (*p* = 6.40 × 10<sup>−3</sup>), and regulation of MAPK cascade (*p* = 6.47 × 10<sup>−3</sup>). In addition, chord graphs (Figure 3) were drawn to show the most significant KEGG and GO enrichment results and to visualize the relationships between significant genes and significant pathways, i.e., which pathways each gene is enriched in and which genes are enriched in each pathway. The PPI network (Figure 4) shows the interactions between proteins.

**Figure 3.** Chord graphs for four significant KEGG pathways and ten significant GO terms. (**a**) Chord graph of four significant KEGG pathways. (**b**) Chord graph of ten significant GO terms. In each panel, the right semicircle represents the significant pathways or terms, and the left semicircle represents the genes enriched in these pathways or terms.

**Figure 4.** Protein–protein interaction (PPI) network for the 33 overlapping genes, generated with STRING. Each circle represents a protein, a line between proteins indicates a PPI, and line thickness indicates the strength of data support.

#### **4. Discussion**

In the present study, we performed a comprehensive large-scale gene-level analysis using complementary methods and detected 33 common genes, including 11 previously reported genes such as TYK2, IL-6R, and Fas [7–9], and 22 novel candidate genes such as IL-27, LAT, and MAGI3. Enrichment analysis suggested an important role for the Th17 cell differentiation and Rap1 signaling pathways, and the PPI network illustrated the protein–protein interactions. All these findings provide novel insights into the potential molecular mechanisms underlying the development of JIA.

TYK2, IL-6R, and Fas appear most frequently in the significant KEGG pathways and GO terms, and the products of these three genes have been shown to play important roles in the inflammatory and immune responses of JIA [21–23]. Tyrosine kinase 2, encoded by TYK2, is a member of the Janus kinase (JAK) family, which mediates the activation of signal transducer and activator of transcription (STAT) proteins. That is, abnormal expression of TYK2 in the JAK-STAT pathway may contribute to autoimmunity and inflammation, thus leading to JIA [21,24,25]. IL-6R encodes part of the interleukin-6 receptor. IL-6, a pro-inflammatory cytokine, is significantly elevated in the serum of JIA patients. Inhibition of IL-6R expression reduces binding of IL-6 to its receptor, thereby reducing inflammation and immune responses in JIA patients [22,26]. Fas can induce T-cell apoptosis by binding to the Fas ligand (FasL), so decreased Fas expression may lead to the accumulation of activated T cells and cause autoimmune diseases [27].

The IL-27 gene encodes the cytokine IL-27, which plays a role in innate immunity and whose primary function is to promote pro-inflammatory Th1 differentiation and inhibit anti-inflammatory Th2 responses [28–32]. IL-27 promotes Th1 cell differentiation, and Th1 cells in turn produce large amounts of the pro-inflammatory cytokine interferon-γ (IFN-γ), which plays a pathogenic role in JIA [32,33]. The promotion of Th1 differentiation and the inhibition of Th2 differentiation by IL-27 also depend on STAT1 [28,31–33]; that is, IL-27 also acts through the JAK-STAT signaling pathway. Therefore, if the expression of IL-27 is inhibited, the inflammatory response of JIA patients may be correspondingly reduced.

LAT encodes a protein called T-cell activation adaptor [34]. T-cell receptor (TCR) signaling is an important process in T-cell development and in T-cell activation in the periphery [35]. LAT is a key signaling hub that connects TCRs to downstream T-cell responses. If LAT expression is reduced, peripheral T-cell development and T-cell numbers are inhibited. Decreased numbers of T cells tend to lead to immunodeficiency and autoimmune diseases [36], and patients with JIA are likely to have impaired immune function due to a lack of LAT.

MAGI3 encodes membrane-associated guanylate kinase, WW and PDZ domain-containing protein 3. Abnormal expression of MAGI3 may affect Notch signaling and thus bone and joint development in children [37]. In addition, MAGI3 is a risk gene for rheumatoid arthritis (RA), Graves' disease, and other autoimmune diseases, suggesting that it may also contribute to JIA by affecting the immune system [38,39].

Enrichment analysis suggested an important role for the Th17 cell differentiation and Rap1 signaling pathways. The Th17 cell differentiation pathway is involved in inflammation and bone destruction, and IL-27, IL-6R, LAT, and TYK2 all lie on this pathway. Actively differentiating Th17 cells mainly secrete IL-17, which not only promotes the production of inflammatory cytokines in the JAK-STAT signaling pathway but also promotes the maturation of osteoclasts, leading to osteopenia and joint damage [40,41]. The activation of inflammatory cytokines and osteopenia together lead to arthritis, so inhibiting the differentiation of Th17 cells may help to suppress and treat JIA to a certain extent [42–44].

The Rap1 signaling pathway plays a central role in the functional outcome of T-cell stimulation [45]. Rap1 is a T-cell receptor proximal signaling protein, and its abnormal expression may lead to abnormal T cells [46]. T cells activate macrophages and synovial stromal cells pleiotropically through cell-to-cell contact and interleukin production, leading to synovitis and joint destruction in RA [47,48]. The pathogenic behavior of the above T cells is caused by Rap1 inactivation, and sustained Rap1 signaling in T cells can effectively reduce the incidence and severity of arthritis [49,50]. Therefore, activation and enhancement of Rap1 signaling also contribute to the prevention and treatment of JIA.

Our study is not without limitations. First, we focused only on individuals of European ancestry because large-scale GWAS data for JIA are currently available only for European populations; the findings therefore cannot be directly generalized to other ethnic populations. Second, results from data analysis are often less reliable than those from rigorous, costly experiments, which are generally regarded as the gold standard in biomedical studies. However, data analysis is still valuable. For example, it is often hard to pre-specify the experimental target under a hypothesis-free approach, and data analysis can help to narrow down the list of candidate experimental targets and provide evidence for prioritizing them. In addition, the current analysis pipeline can be easily extended; one alternative is to search for restriction endonuclease (RE) sites in non-coding regions and gain insights through RE digestion patterns [51,52]. Third, our findings were obtained from a joint analysis of all subtypes of JIA, and there is no guarantee that the conclusions are valid for any particular subtype. Finally, the results must be confirmed by laboratory experiments, given that all the analyses are essentially in silico.

#### **5. Conclusions**

In summary, we performed gene-level analysis as well as enrichment analysis on the largest GWAS summary data of JIA. We identified novel JIA-associated genes, including IL-27, LAT, and MAGI3, and highlighted the important roles of Th17 cell differentiation and the Rap1 signaling pathway in the development of JIA. Our results provide new insights into the pathogenic mechanisms and potential therapeutic targets of JIA; however, further studies are still required to validate these findings.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms232113555/s1.

**Author Contributions:** Conceptualization, Z.Y. and S.W.; methodology, W.G. and S.L.; software, W.G. and S.L.; formal analysis, W.G. and S.L.; writing—original draft preparation, S.L.; writing—review and editing, Z.Y., S.W., W.G., S.L., L.L., and R.Y.; supervision, Z.Y. and S.W.; funding acquisition, Z.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the National Natural Science Foundation of China (No. 81872712 and No. 82173624), the Natural Science Foundation of Shandong Province (No. ZR2019ZD02), the Cheeloo Young Talent Program of Shandong University, and the Taishan Scholar Project of Shandong Province, all awarded to Z.Y.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The GWAS summary data analyzed during the current study are available in the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/downloads/summarystatistics, accessed on 9 November 2021) (Study Accession Code GCST90010715). The FUSION software and GTEx v8 gene expression datasets used during the current study are available on the FUSION website (http://gusevlab.org/projects/fusion/, accessed on 8 April 2022). The eMAGMA software and GTEx v8 annotation datasets used during the current study are available on the eMAGMA website (https://github.com/eskederks/eMAGMA-tutorial, accessed on 28 March 2022).

**Acknowledgments:** We would like to thank GWAS Catalog for providing us with the JIA GWAS summary data.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Review* **Developing Genetic Engineering Techniques for Control of Seed Size and Yield**

**Intikhab Alam 1,2,3,4, Khadija Batool 4, Yuanyuan Huang 1,3, Junjie Liu 1,3 and Liangfa Ge 1,3,\***


**Abstract:** Many signaling pathways regulate seed size through the development of endosperm and maternal tissues, which ultimately results in a range of variations in seed size or weight. Seed size is determined by the development of zygotic tissues (endosperm and embryo) and maternal ovules. In addition, in some species such as rice, seed size is largely determined by husk growth. Transcriptional regulators enhance cell growth in the maternal ovule, resulting in seed growth. Phytohormones affect all aspects of plant growth and development and also regulate seed size. Moreover, the vegetative parts are the major source of nutrients, including most of the carbon- and nitrogen-containing molecules that the reproductive parts need to control seed size. To improve grain yield, there is a need to increase seed size without reducing seed number through conventional breeding programs. In the past decades, many important genetic factors affecting seed size and yield have been identified and studied. These factors constitute dynamic regulatory networks that govern seed size in response to environmental stimuli. In this review, we summarize recent advances regarding the molecular factors regulating seed size in Arabidopsis and other crops, followed by a discussion of strategies for understanding the genetic and molecular basis of balancing seed size and yield in crops.

**Keywords:** seed-specific transcription factors; signaling pathways; seed size; seed development

#### **1. Introduction**

Seed-producing crops are essential sources of food, fodder, and fuel worldwide. The number and size of seeds are critical to the evolution and persistence of plant species [1]. In addition, seeds can be used to produce biofuels, which are becoming more popular as an alternative to fossil fuels [2,3]. Seeds meet almost 70% of the world's food demand. Environmental factors significantly affect plant development and can consequently reduce seed production. Yield depends on the number of primary and secondary branches, pod size and pod number per plant, flowering time, seed number per pod, seed weight, and sometimes plant height [4,5]. Seed development in angiosperms begins when one sperm cell fuses with the egg cell and another fuses with the two polar nuclei, producing a diploid embryo and a triploid endosperm, respectively [6]. In the mature Arabidopsis seed, a single layer of endosperm envelops the embryo, and the endosperm is in turn enclosed by the seed coat, which consists of several layers of specialized maternal tissue that provide protection and regulate dormancy and germination [7]. The development of zygotic tissues (embryo and endosperm) and of maternal components (e.g., ovules) determines seed size. Thus, the development of the embryo, endosperm, and seed coat together determines seed size and weight in crops [8].

The seed development process is divided into two phases: the morphogenesis phase, which includes cell division, endosperm and embryo development, and cotyledon differentiation; and the maturation phase, which includes embryo growth at the expense of the endosperm, seed dehydration, and accumulation of storage reserves [9]. Arabidopsis seed development programs are similar to those of dicotyledonous seed crops such as canola and soybean. Moreover, the early phases of seed development are identical in monocots and dicots but vary at later stages [10]. To colonize diverse habitats, seed plants have evolved an enormous range of growth forms and seed sizes, along with diverse seed dispersal mechanisms. Over a long evolutionary history, plants have come to exhibit a wide range of seed sizes, from the dust-like seeds of orchids to the 20 kg seeds of the double coconut [11]. Seed size in flowering plants is very important from an evolutionary point of view and is essential for crops [12], but the molecular mechanisms that determine seed size are not fully understood. Seed size is a major component of plant fitness and seed yield [11–13]. In plants, variation in seed size is an interesting phenomenon in developmental biology. Seed size is controlled by intrinsic factors acting in maternal and zygotic tissues [11], while seed development is also affected by environmental and climatic conditions. Given the significance of seeds as the primary food source, many efforts have been made to understand seed development, size regulation, and total yield in crops. Many crop genomes have been sequenced, and genetic factors have been identified to dissect the network controlling seed growth and development [14,15].

**Citation:** Alam, I.; Batool, K.; Huang, Y.; Liu, J.; Ge, L. Developing Genetic Engineering Techniques for Control of Seed Size and Yield. *Int. J. Mol. Sci.* **2022**, *23*, 13256. https://doi.org/10.3390/ijms232113256

Academic Editor: Wajid Zaman

Received: 29 August 2022 Accepted: 15 October 2022 Published: 31 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Genetic screening is a powerful strategy widely used to identify regulators of seed size. Genome-wide association studies (GWAS) and RNA sequencing have been extensively applied to investigate the basis of seed size variation [16]. Plants harbor extensive natural genetic variation, which has been studied in many species [17]. Seed size is among the most complex traits and is controlled by multiple positive and negative factors that constitute a well-balanced network regulating seed size by modulating biological processes. For example, using a larger population and higher-density markers, several quantitative trait loci (QTLs), including a novel QTL for grain length, qGL11, were identified for seven panicle- and grain-related traits [18]. In the present review, we summarize recent advances in dissecting the regulatory network that controls seed size and, to some extent, yield, including the genetic factors and signaling pathways that regulate seed size in Arabidopsis and various other crops.

#### **2. Genetic Factors Controlling Seed Size**

Seed size and weight are determined by the genetic composition of zygotic and maternal tissues. In recent years, many studies have identified genetic factors that control seed size and weight in plants (Table 1).


**Table 1.** List of reported genes that control seed size and yield in different plants.



In Arabidopsis, pollinating tetraploids with diploid pollen leads to early endosperm cellularization and ultimately smaller seeds. By contrast, late or failed endosperm cellularization results in a high rate of seed abortion [68,69]. The timing of endosperm cellularization thus regulates seed size, although the growing embryo later fills the seed cavity by replacing the endosperm. Mutants of the IKU pathway, including haiku1 (iku1), iku2, and miniseed3 (mini3), develop smaller seeds due to precocious cellularization of the endosperm [19,20,70,71], while SHORT HYPOCOTYL UNDER BLUE1 (SHB1) gain-of-function mutants show delayed endosperm cellularization and ultimately produce larger seeds [19,22,23]. The transcriptional co-activator SHB1 associates with the IKU2 and MINI3 promoters and stimulates their expression in Arabidopsis [22,23]. Furthermore, adjacent tissues influence endosperm development; for instance, a maternal sporophytic mutation in TRANSPARENT TESTA GLABRA2 (TTG2) limits integument elongation and causes precocious endosperm cellularization [24]. The IKU pathway interacts genetically with TTG2: ttg2 iku2 double mutants show stronger seed phenotypes than either single mutant [24]. Furthermore, a DNA topoisomerase, TOPOISOMERASE Iα (TOP1α), and an ATP-dependent RNA helicase, UP-FRAMESHIFT SUPPRESSOR 1 (UPF1), biparentally regulate seed size by regulating TTG2. Loss of function of UPF1 or TOP1α induces ectopic expression of TTG2 in antipodal cells. Genetic analysis has consistently shown that TOP1α and UPF1 function upstream of TTG2. TTG2 is directly repressed by TOP1α and UPF1 through chromatin disruption [25]. Further, it has been suggested that the relative amount of TTG2 in antipodal cells versus gametes regulates seed size, indicating a maternal and paternal dose-dependent molecular framework. Members of the late embryogenesis abundant (LEA) protein family designated LuLEA are expressed at later stages of seed development [26]. LuLEA1 negatively regulates seed size and yield, as its higher expression reduces seed size and fatty acid content in Arabidopsis. Recently, it was reported that Arabidopsis TERMINAL FLOWER1 (TFL1) acts as a mobile regulator that is produced in the chalazal endosperm, moves to the syncytial peripheral endosperm to promote timely endosperm cellularization, and regulates seed size by stabilizing ABA INSENSITIVE 5 (ABI5) [21]. Most recently, in rice, mutations in the EMBRYONIC FLOWER2a (OsEMF2a) gene were shown to result in delayed cellularization and autonomous endosperm development [48].

Another research group hypothesized that seed size and hypocotyl length might be regulated by AtSOB3 (named for SUPPRESSOR OF PHYTOCHROME B). They tested this hypothesis and demonstrated that AtSOB3-D produced lighter seeds with a shorter hypocotyl compared to WT, while a dominant-negative mutation, *AtSOB3-6-OX*, produced heavier and larger seeds with a longer hypocotyl in Arabidopsis [43]. The effect of accumulated soluble sugars on seed size has been reported to be controlled by the ANGUSTIFOLIA3-YODA (AN3-YDA) gene cascade, which integrates environmental and/or metabolic signals from sugar and ethylene metabolism to regulate seed mass in Arabidopsis through ETHYLENE INSENSITIVE 3 (EIN3) [28] and AN3 [27]. ENHANCER3 OF DA1 (EOD3) is expressed in the sporophytic tissues of the mother plant, encodes the cytochrome P450 CYP78A6, and stimulates seed growth in plants. Overexpression of EOD3 has been shown to drastically increase seed size, whereas eod3-ko loss-of-function mutants produce smaller seeds. Mutants of the *BnaEOD3* gene were efficiently generated by stable CRISPR/Cas9 transformation. These mutations were stably transmitted to the T1 and T2 generations, yielding homozygous mutants carrying combined loss-of-function alleles. The T1 lines had smaller siliques and seeds but a higher number of seeds per silique, suggesting that BnaEOD3 negatively regulates seed growth and development in rapeseed [64].

Rice is one of the most essential crops, feeding a large share of the world's population. Many years of study have focused on identifying QTLs that control grain size in rice [72]. For example, GLW7 is a major QTL that controls grain size and yield [49]. However, investigating the genetic variability of Arabidopsis also has significant potential to provide insight into commercially important traits in crops. Ren et al. (2019) identified a new factor that controls seed size: CYCB1;4, a cell cycle-related gene that encodes a cyclin protein. CYCB1;4 is enriched in growing regions and may positively control the cell cycle in maternal and zygotic tissues, thereby increasing the size of seeds and other organs, implying that CYCB1;4 is an essential component of seed size regulation in Arabidopsis. Their GWAS of seed size in Arabidopsis identified 38 significantly associated loci, one of which was linked to CYCB1;4. Transgenic plants with elevated CYCB1;4 expression showed increased seed size and grain yield owing to faster cell cycle progression, whereas the CYCB1;4 mutant produced smaller seeds. In short, CYCB1;4 is a potential target for yield improvement in plants, as its temporal and spatial expression pattern suggests that it functions in both maternal and paternal tissues that are involved in determining seed size [30]. Another study reported that overexpression of OsGW2, a QTL on chromosome 2, changes grain size and reduces yield in Indica rice. Downregulation of *OsGW2* using RNAi technology resulted in larger and plumper kernels by regulating cell expansion and cell proliferation in the spikelet hull [50].

The regulation of genomic imprinting is very complicated and may involve non-coding RNAs, DNA methylation, and histone modifications. Many studies have shown that the endosperm is distinctive among plant tissues, with imprinted gene expression and a unique DNA methylation pattern [73]. The timing of endosperm development relative to other seed parts is thought to be under epigenetic regulation. Nevertheless, it was shown that the maternally expressed in embryo 1 (mee1) gene is imprinted in both the embryo and the endosperm of maize [74]. Genome-wide RNA-seq analyses of reciprocal crosses in Arabidopsis [75], rice [76], and maize [77] have revealed numerous putative imprinted genes [74] and have informed on the genetic makeup of embryos. Parental imprinting of genes is established by differential DNA methylation at specific loci, mediated by the DEMETER (DME) demethylase in the maternal gametes [78] and DNA METHYLTRANSFERASE 1 (MET1) in the paternal gametophyte, together with POLYCOMB REPRESSIVE COMPLEX 2 (PRC2)-mediated transcriptional repression [29]. Although DNA methylation is not primarily responsible for embryo-specific imprinting, allele-specific siRNAs may induce parent-of-origin-specific gene expression in the embryo. Hypomethylated regions in the endosperm generate maternal-specific siRNAs, which, after moving into the embryo, regulate ROS1-mediated DNA demethylation of the corresponding regions. Sequence-specific siRNAs that distinguish maternal alleles individually could result in allele specificity [79].

*CURLY LEAF* (*CLF*), which encodes a PRC2 histone methyltransferase that catalyzes trimethylation of histone H3 Lys 27 (H3K27me3), controls a group of genes that interact with each other during embryo development [80]. Several PcG proteins, such as FERTILIZATION INDEPENDENT SEED 2 (FIS2), FIS3, FERTILIZATION INDEPENDENT ENDOSPERM (FIE), FIS1/MEDEA (MEA), and MULTICOPY SUPPRESSOR OF IRA1 (MSI1), control the development of the endosperm [31]. By methylating chromatin histones, these PcG proteins form repressive complexes that inhibit the expression of genes controlling developmental processes. Numerous PcG components can be imprinted, and the functions of FIS complexes in plants are manifested after pollination. Moreover, *fis* mutants decrease endosperm cellularization and, ultimately, the size of seeds. The function of FIS is mediated by AGL62, which belongs to the agamous-like MADS-box proteins, whereas in *fis* mutants, *AGL62* expression is suppressed, leading to endosperm cellularization failure [32]. Regarding the FIS gene family, FLOWERING WAGENINGEN (FWA) and *MEA* are maternally expressed, whereas the type I MADS-box gene *PHERES1* (*PHE1*) and a FIS PcG complex are presumed to be paternally expressed. In *medea* (*mea*) mutants, seed growth is arrested because the endosperm overproliferates without corresponding embryo development [81]. The FIS proteins, including MEA, FIE, and FIS2, strongly regulate expression of the MADS-box gene *PHE1*. *PHE1* expression was normal after fertilization in wild-type (WT) plants, whereas *fis* mutants showed higher expression until seed abortion, indicating that FIS-class proteins downregulate *PHE1* expression after fertilization to prevent seed abortion [33]. Recently, CRISPR/Cas9-based genome editing showed that the MADS-box TF genes MADS78 and MADS79 are very important for regulating endosperm cellularization and early seed formation in rice [55]. Single MADS78 or MADS79 knockout mutants displayed early endosperm cellularization, while double mutants showed impaired seed development and produced no viable seeds [55].

DNA methylation plays a significant role in seed development, as it may help increase seed size and weight [82]. It also plays an important part in plant genome stability and developmental processes [83]. The Chinese super hybrid rice LYP9 (Liangyou Peijiu) is a successful hybrid rice variety with high heterosis [84], and DNA methylation in hybrid rice seeds has been reported to play a key role in the initiation and maintenance of heterosis [85]. Xiao et al. (2006) showed that methylation of *FIS2* and *MEA* is mediated by the DNA methyltransferase *MET1*, which methylates cytosines at CpG sites in the genome. A *met1* mutation in the maternal genome increases seed size due to maternal hypomethylation and delays endosperm cellularization [37]. By contrast, *met1* mutations in the paternal genome lead to early cellularization of the endosperm and decreased seed size, thus confirming exclusive maternal control of *MEA* expression, which provides insights into the evolution of imprinting in plants. Similar findings were also obtained in reciprocal crosses with parents carrying mutations in *ddm1*. *FIS2* [38], *FWA* [39], and *MPC* [40] are imprinted genes whose parental imprinting is regulated by DNA methylation [86]. However, the effects of paternal imprinting are minor or non-existent. Another study in Arabidopsis revealed that variation in paternal alleles affects seed growth when the maternal *mea* allele is non-functional, indicating that paternal MEA may compensate for maternal MEA deficiency during seed development [41]. During the development of rice seeds, genome-wide DNA methylome analyses showed higher frequencies of DNA methylation in embryos than in the endosperm, and non-transposable element regions were more variable than transposable ones during endosperm and embryo development [87]. Many epigenetic regulatory mechanisms participate in seed development; however, the regulatory mechanisms underlying the major genome imprinting processes are not yet clear. Thus, an integrated exploration of genome-wide DNA methylome differences, histone modifications, transcriptomes, and miRNA cleavage sites in developing seeds, using current genomic approaches such as degradome and bisulfite sequencing, is required for a better understanding of the epigenetic mechanisms underlying seed development. The turgor pressure of the endosperm drives seed expansion during early seed development [88], whereas turgor pressure later decreases during endosperm cellularization, which is controlled by POLYCOMB REPRESSIVE COMPLEX 2 (*PRC2*) [42].

MicroRNAs (miRNAs) have emerged as an important part of the molecular network underlying important agronomic seed traits in various plant species. Loss of function of the genes that control miRNA biogenesis results in defects in seed development [89]. MicroRNA160 targets auxin response factors in cotton, and suppression of *miR160* resulted in smaller seeds [62]. Another study examined the role of miR156a in linseed flax (*Linum usitatissimum* L.) and found decreased seed oil content in transgenic lines compared to WT plants [90]. In addition, miRNAs control embryogenesis by regulating transcription in maize [56]. Recently, 72 known miRNAs and 39 newly identified miRNAs were found to be expressed in the developing seeds of the common bean (*Phaseolus vulgaris* L.), and their involvement was most prominent during late embryogenesis and desiccation. Numerous TF families have been identified as targets of known miRNAs, including ARF, HD-ZIP, NF-Y, and SPL, and a majority of the targets of new miRNAs were predicted to encode functional proteins [91]. In addition, small interfering RNAs (siRNAs) can induce transcriptional gene silencing, which may be associated with epigenetic control of seed expansion. A plant-specific RNA polymerase, RNA polymerase IV (Pol IV), is involved in the production of siRNAs [92]. p4-siRNAs are produced from repetitive genomic elements, and some are associated with unique regions, including numerous characterized imprinted loci, where they may have a significant function in the initiation or maintenance of imprinted gene expression [93]. Pol IV-dependent p4-siRNAs are produced from transposable elements in the genome and are expressed exclusively in endosperm tissues [44].

Quantitative trait locus analysis provides information on the control of seed size and other traits that may be relevant to plant breeding. Recently, 39 QTLs were identified on different chromosomes in common bean, and one major QTL, SL9.1 GA, was found to control seed length [63]. Three epistatic QTLs (qSL-13-3zy, qSL-13-4zy, and qSW-13-4zy) were found to regulate soybean seed size and yield [65], and many well-characterized QTLs have been identified in different plants, including *GS3*, *qSW5*, and *qGL7* in rice [51,53], *GSI*, *qZmGW2-CHR4*, *zMgs3*, and *qZmGW2-CHR5* in maize [57–59], and *TaGW2* in wheat [61].

Moreover, many QTLs involved in seed size control have been detected in other crops, although their functions have not been characterized [94–96]. Using a larger population and higher-density markers, 43 QTLs were identified for seven panicle- and grain-related traits. For example, the new locus qGL11 increases rice grain weight and length; it was identified in a segregating population derived from a chromosome segment substitution line (CSSL) and was fine-mapped to a 25-kb region that contains the IAA-amido synthetase gene *OsGH3.13* [18]. A recent study delimited a QTL region to a 168.37-kb chromosome fragment between the SNPs Aradu\_A07\_1148327 and Aradu\_07\_1316694. Twenty-two genes were predicted in this region, including two genes of interest, Aradu.RLZ61 and Aradu.DN3DB, encoding the transcriptional regulators F-box SNEEZY (SNE) and STERILE APETALA-like (SAP), respectively. These genes play a critical role in regulating seed and fruit size [67]. Moreover, *MINI SEED 2* (*MIS2*) regulates grain size by coordinately controlling epidermal cell number and cell size; microscopic analysis showed that the spikelet epidermis of *mis2* mutant rice had smaller cells but higher cell numbers [54]. Map-based cloning showed that *MIS2* encodes the receptor-like kinase CRINKLY4 (CR4), which is most strongly expressed during panicle development [97].

In rice, the *LARGE1* gene encodes MEI2-LIKE PROTEIN4 (OML4) and negatively regulates seed size by restricting cell expansion in spikelet hulls. Overexpression of *OML4* results in smaller and lighter seeds, whereas loss of function of *OML4* leads to larger and heavier grains [52]. In eukaryotes, cyclin-dependent kinases (CDKs) and their regulatory subunits, the cyclins (CYCs), control cell cycle progression [34]. CDK inhibitors (CDKIs) inhibit the action of CDK/CYC complexes and are thus principal cell cycle controllers. In plants, two types of CDKIs have been discovered and characterized: SIAMESE/SIAMESE-RELATED (SIM/SMR) proteins and INHIBITOR OF CDC2 KINASE/KIP-RELATED PROTEIN (ICK/KRP) [98]. Central-cell *rbr* mutants have supernumerary nuclei, indicating failed cell cycle arrest [99]. Moreover, defects in *rbr1* maternal gametes decrease cell proliferation in the integument and stop further development of the seed coat [47]. Furthermore, a complex of MULTICOPY SUPPRESSOR OF IRA1 (MSI1) and RBR1, which represses the expression of *MET1*, is responsible for activating imprinted genes in germ cells throughout megagametophyte development [35]. Seed abortion was observed in the triple mutant *cycd3-1:2:3* (CYCD3 belongs to the CYCD gene family) because of delayed cell proliferation during embryo development. However, cell division stimulated by the trans-activation of *CYCD3:1* or *CYCD7:1* in the endosperm and embryo results in lethal embryo defects [100]. Targeted upregulation of *CYCD7:1* in the central cell and endosperm allowed the central cell to escape cell cycle arrest and enhanced endosperm proliferation at the syncytial stage, resulting in larger seeds. Nevertheless, a few of the larger seeds died, indicating partial seed abortion [46].

The *ENO* gene, which is strongly expressed in plants, encodes the glycolytic metalloenzyme enolase (ENO2), which catalyzes the dehydration of 2-phospho-D-glycerate (2-PGA) to phosphoenolpyruvate (PEP) [101]. Three *ENO* genes have been identified in Arabidopsis: *At1g74030* (*ENO1*), *At2g36530* (*ENO2*), and *At2g29560* (*ENO3*) [102]. Subcellular localization of the enolase isoforms showed that ENO1 and ENO3 are located in the chloroplast and cytoplasm, while ENO2 has been detected in the nucleus and cytosol [103]. In addition, mutations in AtENO2 decreased seed size and mass and reduced the cytokinin concentration. Carbohydrate analyses and RNA sequencing showed that metabolic pathways, particularly secondary metabolism, were altered in *AtENO2* T-DNA mutants [36]. Furthermore, AtENO2 cooperated with Arabidopsis basic-leucine zipper 75 (AtbZIP75) as a substitute for AtMBP-1 [36]. Therefore, AtbZIP75 may contribute to seed development. This study clarified a novel function of AtENO2 in seed growth and proposed it as a promising target gene for manipulation in plant breeding. Recently, two SWEET homologs, *GmSWEET10a* and *GmSWEET10b*, were found to simultaneously affect seed size, oil content, and protein content in soybean, suggesting that seed size and oil content can be controlled by adjusting the combination of both alleles [15]. A different study demonstrated that reproductive traits of Arabidopsis, including fertility, seed size, and yield, are controlled by C-TERMINALLY ENCODED PEPTIDE RECEPTOR1 (CEPR1) [45]. The authors showed that two *cepr1* knockdown mutants produced smaller seeds and reduced yield (by 88–98%). In Arabidopsis, *BZR1* controls organ size via the BR signaling pathway. Recent studies have shown that overexpression of *ZmBZR1* in Arabidopsis increases organ and seed size. Moreover, ZmBZR1 binds to the promoters of *GRACE* and *KRP6* to control their expression and regulate seed size [60].

#### **3. The Role of Signaling Pathways in Seed Size Regulation**

#### *3.1. The Ubiquitin–26S Proteasome Pathway*

The ubiquitin–26S proteasome pathway regulates the size and development of seeds; the genes regulating the ubiquitin–proteasome pathway are shown in Figure 1A. Ubiquitination and protein degradation by the 26S proteasome constitute an important post-translational protein turnover mechanism in plants [104–106]. The ubiquitin–proteasome system (UPS), which is tightly controlled, is involved in regulating all aspects of plant life and governs numerous developmental and stress-related activities mediated by hormone signals [107]. Ubiquitin (Ub) is a protein of 76 amino acids that is attached to target cellular proteins via a multistep reaction involving three enzymes: E1 (Ub-activating), E2 (Ub-conjugating), and E3 (Ub ligase) [108]. Ubiquitination specificity is largely determined by E3 ligases, which recognize specific substrates and catalyze the formation of the bond between Ub and the substrate; they are therefore important regulatory components of the ubiquitin-dependent pathway [108,109]. The ubiquitination pathway participates in seed size regulation: ubiquitin receptors, E3 ubiquitin ligases, ubiquitin-specific proteases, the 26S proteasome, plant-specific APC/C regulatory factors, and proteins interacting with the ubiquitination pathway all play central roles in regulating seed size and development [13,110,111]. In Arabidopsis, the ubiquitin receptors DA1 and DA1-related protein (DAR1) regulate seed size by restricting cell proliferation in the maternal integument [112]. The E3 Ub-ligases DA2 and BIG BROTHER (BB)/ENHANCER OF DA1 (EOD1) function together with DA1 to limit the size of seeds and other plant organs [110,113], indicating that these E3 ubiquitin ligases and DA1 may function in the same complex or have common downstream targets. Supporting this concept, DA1 interacts strongly with DA2 in vitro as well as in vivo [110]. Because E3 ubiquitin ligases confer substrate specificity, the DA1–DA2 interaction may help to identify the precise substrates that DA1 targets for degradation. Genetic studies have revealed that BB/EOD1 and DA2 function independently of each other [110], indicating that they may target different substrates for degradation. In rice, *GRAIN WIDTH AND WEIGHT 2* (*GW2*), which is encoded by a QTL regulating seed size, shows sequence similarity to DA2 [13]. GW2 negatively affects grain size by restricting cell proliferation in spikelet hulls. Homologs of GW2 are also associated with seed size regulation in maize and wheat [58,61], suggesting that the function of *GW2* in seed size regulation is conserved. In Arabidopsis, *SUPPRESSOR2 OF DA1* (*SOD2*), encoding UBIQUITIN-SPECIFIC PROTEASE 15 (UBP15), acts maternally to promote seed development. Plants with *sod2/ubp15* mutations have smaller seeds and other organs because of reduced cell proliferation, and the mutations are epistatic to *da1-1* [114,115]. In Arabidopsis, DA1 interacts physically with UBP15 and modulates its stability, suggesting that UBP15 is a downstream target of DA1. Further study showed that UBP15 acts independently of EOD1 and DA2 to regulate seed size, suggesting that other unidentified E3 ubiquitin ligases may ubiquitinate UBP15. Moreover, in Arabidopsis, the anaphase-promoting complex/cyclosome (APC/C) ubiquitin ligase and the 26S proteasome, along with other factors such as RPT2 and SAMBA, affect seed development [116–118]. Mutations in SAMBA, a negative regulator of APC/C, result in larger seeds [117]. The *samba* mutation markedly increases seed size in the *eod1-2 da1-1* background [119], implying that SAMBA may contribute substantially to seed size regulation, possibly acting together with EOD1 and DA1.
The important QTL qSW5/GW5, located on chromosome 5, is responsible for seed width and controls grain size by restricting cell proliferation in rice spikelet hulls [120]. GW5 interacts with ubiquitin in vitro, suggesting that it may act in the ubiquitin–proteasome pathway [121]. In rice, GW5 may function independently of GW2 to regulate grain width [122]. Histone ubiquitylation controls the transcription of *DA1/DA2*, which affect seed size. The deubiquitinase OTU1 deubiquitylates DA1/DA2 chromatin and acts as an epigenetic transcriptional repressor of the *DA1/DA2* genes. OTU1 is nucleocytoplasmic and plays roles in both the nucleus and the cytoplasm [123]. *SMALL LEAF AND BUSHY1* (*SLB1*) encodes an F-box protein, a component of the SKP1/Cullin/F-box E3 Ub-ligase complex. *SLB1* regulates the growth of plant organs and secondary branches by regulating the stability of BIG SEEDS1 (BS1), leading to increased leaf and seed size in soybean and *Medicago truncatula* [124]. A different study reported that GW2, WG1, and OsbZIP47 work together in the GW2–WG1–OsbZIP47 regulatory module to control grain width and length in rice [27]. Specifically, *WG1* encodes a glutaredoxin protein and promotes cell proliferation, leading to enhanced grain growth. WG1 interacts with the transcription factor (TF) OsbZIP47 and, acting as a co-repressor together with ASP1, represses its transcriptional activity. OsbZIP47 suppresses seed growth by reducing cell proliferation [28]. Moreover, the E3 Ub-ligase *TaPUB1* reduces seedling sensitivity to abscisic acid (ABA) by interacting with TaPYL4 and TaPY15, which are involved in ABA signal transduction, and inducing their degradation, resulting in smaller grain size and yield; this suggests that TaPUB1 is a negative regulator of seed development [125].

**Figure 1.** Signaling pathways and their correlated molecules involved in seed size regulation: (**A**) ubiquitin–26S proteasome pathway, (**B**) MAPK signaling pathway, (**C**) IKU pathway, (**D**) G-protein signaling pathway, and (**E**) transcription regulatory factors.

#### *3.2. The Mitogen-Activated Protein Kinase (MAPK) Signaling Pathway in Plants*

MAPKs participate in many signal transduction pathways, including hormone signaling and stress responses [126–128]. The genes regulating the MAPK signaling pathway are shown in Figure 1B. A MAPK cascade comprises three kinases: MAPK, MAPK kinase (MAPKK), and MAPKK kinase (MAPKKK). In response to an external stimulus, MAPKKKs phosphorylate and activate MAPKKs, which in turn phosphorylate and activate MAPKs; the activated MAPKs then phosphorylate a variety of downstream substrates, including transcription factors, chromatin remodeling factors, kinases, and other enzymes, resulting in transcriptome and proteome reprogramming throughout the cell. The successive phosphorylation of MAPK proteins and their substrates is critical for MAPK cascade-mediated interactions and signal transduction [129]. Several components of the MITOGEN-ACTIVATED PROTEIN KINASE (MAPK) cascade have previously been identified as key regulators of seed development [130]. Mutants of *small grain 1* (*smg1*), which encodes MAPKK4 (OsMKK4), produce short grains in rice because of reduced cell proliferation in spikelet hulls [131]. In addition, a previous study suggested that mutations in OsMAPK6 result in short grains resembling those produced by *smg1* mutants [132]. Furthermore, loss-of-function mutations in OsMKP1 produce larger grains, while OsMKP1 overexpression generates smaller grains. OsMKP1 regulates rice grain size by limiting cell proliferation in grain hulls, and it directly interacts with and deactivates OsMAPK6 [133]. Moreover, OsMKK4 physically interacts with OsMAPK6 [132]. Therefore, the OsMKK4–OsMAPK6 module plays an essential role in controlling seed size in rice. Identifying the OsMAPKKKs upstream of OsMKK4 and the substrates downstream of OsMAPK6 that control grain size remains a challenge. Furthermore, OsMKK4 and OsMAPK6 affect brassinosteroid (BR) responses and the expression of many BR-related genes [131,132], suggesting a link between the MAPK pathway and BRs with respect to seed development. Guo et al. (2018) demonstrated that GSN1 functions as a negative regulator of the OsMKKK10–OsMKK4–OsMPK6 cascade by specifically dephosphorylating OsMPK6, thereby coordinating the trade-off between grain size and grain number per panicle [134].
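The relay logic of a three-tier MAPK cascade, and the effect of a phosphatase acting on one tier, can be illustrated with a minimal Python sketch. It is a qualitative toy model rather than a kinetic simulation: the tier labels follow the generic MAPKKK, MAPKK, MAPK order, and the function name `run_cascade` and its `dephosphorylated` argument (standing in for a negative regulator that keeps a tier inactive) are assumptions introduced here for illustration only.

```python
# Toy Boolean model of a three-tier MAPK cascade (illustrative only).
TIERS = ("MAPKKK", "MAPKK", "MAPK")  # generic labels, upstream to downstream


def run_cascade(stimulus, dephosphorylated=frozenset()):
    """Return the activation state of each tier.

    A tier is active only if the signal arriving from upstream is on and
    the tier is not held inactive by a phosphatase (hypothetical
    `dephosphorylated` set).
    """
    states = {}
    signal = bool(stimulus)
    for tier in TIERS:
        # The signal passes on only through tiers that stay phosphorylated.
        signal = signal and tier not in dephosphorylated
        states[tier] = signal
    return states


if __name__ == "__main__":
    print(run_cascade(stimulus=True))
    print(run_cascade(stimulus=True, dephosphorylated={"MAPK"}))
```

In this sketch, dephosphorylating the terminal MAPK leaves the upstream tiers active but blocks the output of the cascade, which is the kind of behavior expected of a negative regulator acting on the last kinase.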

The leucine-rich repeat (LRR) receptor kinases ER, ERL1, and ERL2 stimulate fruit growth through signaling pathways regulated by EPFL9 in the carpel wall. EPFL2 expression controls the spacing of ovules in the carpel wall, and in inter-ovule spaces this is controlled by ERL1 and ERL2, which may facilitate equal distribution of resources [135]. This may provide an understanding of trade-offs in rice panicle expansion and constitute a basis for increasing crop yield. The novel AGC protein kinase AGC1-4, a serine–threonine kinase belonging to the AGC VIIIa subfamily, regulates seed size. Seeds with higher expression of AGC1-4 are smaller, whereas *agc1-4* mutant seeds are significantly larger [136]. A recent study reported that *OsINV3* is a positive regulator of seed size in rice [137]. The authors showed that overexpression of *OsINV3* increased grain size, while loss of its function reduced grain size, in comparison with WT plants. *OsINV2*, a homolog of *OsINV3*, has no function in seed size regulation by itself; however, both *OsINV2* and *OsINV3* play roles in sucrose metabolism in sink organs and increase seed size [137]. The function of the novel maize gene *ZmRLK7*, which encodes an LRR receptor-like protein kinase (LRR-RLK), was investigated through ectopic expression in Arabidopsis to examine its effects on plant development [138]. Most recently, the Arabidopsis lectin receptor-like kinase LecRK-VIII.2 has been found to regulate seed production by coordinating seed size, seed number, and silique number. *lecrk-VIII.2* mutants produce smaller seeds but more seeds and siliques, resulting in a higher yield. In contrast, overexpression of LecRK-VIII.2 produces larger seeds but fewer seeds and siliques, resulting in yields comparable to those of wild-type plants [139].
MPK3 and MPK6 double mutants are embryo-lethal, whereas single MPK6 mutants exhibit a variety of abnormal seed-related phenotypes, similar to those seen in mutants of the kinases MAPK KINASE 4 (MKK4) and MKK5, which function upstream of MPK3 and MPK6. Given the importance of MPKs in seed development and their function in RLK signaling, Xiao et al. (2021) investigated whether MPK3 and MPK6 could act downstream of LecRK-VIII.2 [139]. Their findings suggested that the phosphorylation of MPK6, but not MPK3, corresponds with the expression level of LecRK-VIII.2 in the seed. According to Xiao et al. (2021), LecRK-VIII.2 promotes growth throughout development. Furthermore, this receptor promotes seed growth from maternal sporophytic tissues during seed development by activating expansins, possibly through the MAPK cascade and MPK6. This discovery adds to the body of evidence supporting the importance of LecRKs in developmental processes and opens up new paths for research, such as identifying the signal that LecRK-VIII.2 perceives. Understanding the mechanism behind the greater yield in LecRK-VIII.2 knockout (KO) plants, despite an overall reduction in plant growth, will be of great interest. Is it feasible that a plant with lower vegetative biomass has more available resources for seed production before it reaches senescence? Is it possible to manipulate this signaling pathway to generate crops with larger seeds, stronger plants, and higher yields?

#### *3.3. The IKU Pathway*

Seed size is controlled through the IKU signaling pathway: loss-of-function mutations in the VQ motif protein HAIKU1 (IKU1), the leucine-rich repeat (LRR) receptor kinase IKU2, and the WRKY TF MINISEED3 (MINI3) decrease seed size owing to precocious endosperm cellularization. The genes regulating the IKU pathway are shown in Figure 1C. The genomes of the embryo and endosperm, rather than maternal genetic factors, determine the seed size phenotype in these mutants [20,22]. The transcriptional co-activator SHORT HYPOCOTYL UNDER BLUE 1 (SHB1) has been shown to associate with the IKU2 and MINI3 promoters and stimulate their expression in Arabidopsis [22,23]. Seed size is regulated by endosperm growth through the signaling pathway comprising IKU1, IKU2, SHB1, and MINI3 [140]. MINI3, which associates with the promoter of the cytokinin oxidase gene CKX2, activates CKX2 expression and is responsible for controlling endosperm growth [110]. Higher expression of CKX2 enhances seed size in *iku2* seeds, confirming that cytokinins act downstream of the IKU pathway in seed size regulation [110]. ABSCISIC ACID-INSENSITIVE 5 (ABI5) is a TF that directly binds to the upstream region of SHB1 and inhibits its expression [141]. ABA thus controls seed development through ABI5-mediated transcriptional regulation of SHB1 in the endosperm; accordingly, the ABA-deficient 2 (*aba2*) and *abi5* mutants produce larger seeds [141]. In combination with cytokinin and ABA signaling, the IKU pathway affects endosperm growth to control seed size in Arabidopsis and other plants. The roles of *WRKY10/MINI3* and *MAPK10* in endosperm development were examined, and they appeared to act in opposite ways. Furthermore, *mapk10* mutants consistently produce larger seeds, whereas seed size is positively regulated by *WRKY10/MINI3* [142].

#### *3.4. The G-Protein Signaling Pathway*

In plants and animals, G-protein signaling regulates many functions related to growth and development. Molecules regulating G-protein signaling are shown in Figure 1D. The G-protein complex connects many cell-surface G-protein-coupled receptors to intracellular effectors: ligands activate the receptors, and the effectors control different cellular responses [143]. G-protein complexes comprise Gα, Gβ, and Gγ subunits, and G-protein-coupled pathways transmit signals from membrane-bound receptors through the heterotrimeric complex to downstream effectors. Mutations in Gα (GPA1) or Gβ (AGB1) result in small leaves and flowers in Arabidopsis [144,145]. Overexpression of the Arabidopsis atypical Gγ (AGG3) stimulates the growth of seeds and organs by enhancing cell proliferation, and loss-of-function mutations in AGG3 lead to smaller seeds and organs [146,147]. AGG3 overexpression in *Camelina sativa* resulted in increased seed size, confirming the function of AGG3 in the regulation of seed size [148]. Likewise, loss of function or suppression of rice Gα (RGA1) or Gβ (RGB1) reduces rice grain size [149–151]. Similarly, knockdown of the RGB1 gene in rice delayed seed development and lowered seed weight and starch accumulation [152]. Rice GRAIN SIZE 3 (GS3) and DENSE AND ERECT PANICLE 1 (DEP1) are two QTLs affecting the size of grains and panicles, respectively [146,147]. However, loss of function of GS3 or DEP1 produces long seeds owing to increased cell proliferation, whereas GS3 or DEP1 gain-of-function alleles produce short grains [153–155]. AGG3 in Arabidopsis and GS3 and DEP1 in rice may therefore have different cofactors or effectors, which would account for their differential effects on seed size. A minor QTL, Small grain 3 (SG3), which encodes an R2R3 MYB protein, is associated with the major QTL GS3 and negatively controls rice grain length.
The Gγ subunit encoded by *GS3* competitively interacts with Gβ, leading to reduced grain length [156,157]. DEP1 also encodes a Gγ subunit, and GS3 appears to act as a cofactor of OsMADS1 in the control of grain size through the regulation of common target genes [158]. Recently, a study introduced the serine–threonine protein kinase AGC1-4, which belongs to the AGC VIIIa subfamily [136]. AGC1-4 overexpression reduces seed size, whereas *agc1-4* mutants produce significantly larger seeds than WT plants. AGC1-4 regulates seed size by regulating the number of embryonic cells [136]. Recently, heterotrimeric G-protein mutants of rice were produced using CRISPR/Cas9 gene editing. The *gs3* and *dep1* mutants showed more favorable agronomic traits than WT plants, whereas the *rga1* mutation resulted in a dwarf phenotype, leading to an extreme reduction in grain yield. The heterotrimeric G-protein β subunit, RGB1, plays a significant role in plant development, and *rgb1* mutants show suppressed growth and development of the embryo [159]. Most recently, the E3 ligase gene Chang Li Geng 1 (CLG1) has been shown to regulate grain length by targeting the Gγ protein GS3, and thus to control grain size. CLG1 overexpression increased grain length, whereas CLG1 carrying mutations in three essential amino acids decreased grain length. CLG1 directly interacts with and ubiquitinates GS3, which is then degraded via the endosome degradation pathway, resulting in increased grain size [160]. More studies are required to comprehensively elucidate the functions of G-protein signaling in seed size regulation.

#### *3.5. Transcriptional Regulation*

Particular transcription factors, e.g., TRANSPARENT TESTA GLABRA2 (TTG2) and APETALA 2 (AP2), affect seed development in Arabidopsis [24,161–164]. Genes regulating transcription factors are shown in Figure 1E. The transcriptional repressor NGATHA-like 2 (NGAL2)/SUPPRESSOR-OF-DA1 (SOD7) and its homolog NGAL3/DEVELOPMENTAL-RELATED-TARGET-IN-THE-APEX 4 (DPA4) have been reported to strongly limit Arabidopsis seed development [165], and *sod7-2 dpa4-3* mutants showed enhanced cell proliferation in the maternal integument, resulting in large seeds [165]. The Arabidopsis *AINTEGUMENTA* (*ANT*) gene is associated with the regulation of integument and organ growth [166,167], and it encodes an APETALA2-like TF. *ant* mutants exhibit smaller leaves, reduced flowering, and defects in integument initiation and ovule development. In contrast, Arabidopsis and tobacco plants with higher expression of *ANT* showed increased leaf and flower growth because of stimulated cell proliferation, and consequently, they produced larger seeds. An-1 encodes a basic helix–loop–helix protein that regulates cell division. Transgenic studies have confirmed that An-1 positively regulates awn elongation but negatively regulates grain number per panicle [168]. APETALA2 is a member of the AP2/EREBP (ethylene-responsive element binding protein) family of TFs and plays a significant role in the specification of floral organs in Arabidopsis [169]; it is also responsible for seed size regulation [161,163]. Arabidopsis *ap2* mutants produce larger seeds than WT plants, regardless of the genotype of the pollen donor, signifying that AP2 is a maternal factor controlling seed size [161]. Tissue-specific epigenetic mechanisms control gene expression and transcriptional regulators at specific stages, and epigenetic processes allow genomic imprinting, in which the expression of an allele after pollination depends on its parent of origin.
The importance of genomic imprinting can be explained by the "parental conflict theory". According to this theory, resource allocation from mothers to offspring is limited; the paternal genome acts to quickly direct maternal resources to the embryos harboring it, whereas the maternal genome attempts to distribute resources equally among offspring by downregulating the effects of the paternal genome [170]. Further, *ap2-10* (+/–) seeds produced by an *ap2* mutant pollinated with WT pollen are bigger than WT seeds but smaller than *ap2-10* (–/–) seeds from mutant flowers pollinated with *ap2* pollen [163], indicating that *AP2* functions both maternally and zygotically to regulate seed size. In support of this concept, seeds yielded by cross-pollination of *35S:AP2* antisense transgenic plants with WT flowers were bigger than WT seeds and smaller than *ap2* mutant seeds [163]. In rice, SMALL ORGAN SIZE 1 (SMOS1) encodes an auxin-regulated APETALA2-type transcription factor and is positively regulated by OsARF1 [171]. Loss of function of SMOS1 causes pleiotropic developmental phenotypes, including small seed size [171,172], suggesting a crucial role for SMOS1 in seed size regulation.

In *transparent testa glabra2* (*ttg2*) mutants, yellow seeds are produced because of insufficient proanthocyanidin production and mucilage deposition in the seed coat [24]. Compared with WT seeds, *ttg2* mutant seeds are round and small because of reduced cell length in the integument. Reciprocal crosses showed that *ttg2* affects seed size through the maternal sporophyte. The *ttg2* mutation also induced precocious endosperm cellularization, resulting in reduced endosperm size. *TTG2* encodes a WRKY family TF that regulates many steps of tannin synthesis [162], and, as several products of the tannin synthesis pathway may affect the cell wall and its capacity to expand, mutations in *TTG2* may reduce the elongation ability of the cell wall [24]. The *ALM1* gene encodes Golden 2-like (GLK), a member of the GARP subfamily of Myb TFs, and the *alm1.g* and *alm1.a* mutants showed decreases in 100-grain weight of 15.8% and 23.1%, respectively [173,174].

Several genes, including GROWTH-REGULATING FACTOR (GRF) genes, encode DNA binding transcription factors that interact with and form a functional transcriptional complex with the transcription cofactor GRF-INTERACTING FACTOR (GIF) [175]. In this functional unit, GIF operates to recruit SWI/SNF chromatin remodeling complexes to their target genes so that they can be transcriptionally activated or inhibited by GRF. GRF expression is post-transcriptionally inhibited by microRNA (miR396) [175].

The miR396–GRF/GIF module influences a wide range of essential plant growth and development traits that could have agricultural implications. The most convincing results demonstrating the agronomic value of the miR396–GRF/GIF system come from a set of studies that independently identified a rare, naturally occurring allele of OsGRF4 from *Oryza sativa* landraces that confers larger grain size in various genetic backgrounds [176–181]. Overexpression of OsGIF1 also increased grain size [179,180,182–184]. Likewise, the expression of a mimic transgene that binds and inactivates miR396 in rice, a strategy also used in Arabidopsis and other species, can increase yield. Together with the fact that similar effects were obtained artificially in Arabidopsis, these findings demonstrate the transferability of knowledge from models such as *A. thaliana* to crops, and the general importance of the miR396–GRF/GIF module in other plant species.
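The regulatory logic of the miR396–GRF/GIF module can be summarized in a small Boolean sketch. This is a deliberately qualitative simplification: real regulation is quantitative and tissue-specific, and the function `grf_gif_active` and its parameter names are hypothetical labels introduced here for illustration, not part of any cited study.

```python
def grf_gif_active(grf_transcribed, gif_present, mir396_present,
                   mimic_present=False):
    """Return True if a functional GRF-GIF complex can form (toy model)."""
    # A target mimic sequesters miR396, so the miRNA can no longer act on GRF.
    mir396_available = mir396_present and not mimic_present
    # GRF is post-transcriptionally inhibited whenever miR396 is available.
    grf_protein = grf_transcribed and not mir396_available
    # The transcriptional complex requires both GRF and its cofactor GIF.
    return grf_protein and gif_present


if __name__ == "__main__":
    print(grf_gif_active(True, True, mir396_present=True))
    print(grf_gif_active(True, True, mir396_present=True, mimic_present=True))
```

Under these assumptions, expressing the mimic restores complex formation in the presence of miR396, which mirrors the rationale for the mimic strategy described above.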

NACs are plant-specific transcription factors (TFs) involved in many developmental processes [185]. For example, the NAC-like TF EjNACL47 might be linked with the larger organ size of triploid loquat. Furthermore, ectopic expression of EjNACL47 in Arabidopsis results in larger organs, i.e., leaves, flowers, and siliques, implying a positive role in organ enlargement [186]. In rice, the *ONAC020*, *ONAC026*, and *ONAC023* genes, which encode NAC TFs, are expressed specifically during seed development, and sequence variations in their upstream regulatory regions are strongly associated with seed size or weight [187]. The KIX–PPD complex regulates maternal integument growth and, consequently, seed size by controlling cell proliferation and development. In Arabidopsis, the transcription factors MYC3/4 interact with PPD1/2 and KIX8/9 to form the KIX–PPD–MYC complex. This complex binds to the G-box region in the *GIF1* promoter, suppresses *GIF1* expression, and thereby controls seed size [188]. The B3 domain TF ZmABI19 directly binds the promoter of the grain filling-specific TF gene *Opaque2* (*O2*) for trans-activation and affects the development of the endosperm and embryo; disruption of this regulation results in smaller seeds [189].

The KIX domain-containing protein family acts as a negative regulator of cell proliferation in plants [190]. Loss of function of GmKIX8-1 significantly increases the size of above-ground organs, including leaves and seeds, with increased cell proliferation and *CYCLIN D3;1-10* expression. In addition, molecular analysis of soybean germplasms showed that the large-seed phenotype associated with the qSw17-1 QTL is caused by decreased expression of GmKIX8-1 [66]. During development, the shape of the grain is controlled by many genes that interact with one another. GW8 is a member of the SBP transcription factor family. When overexpressed, it promotes cell growth and grain filling, which leads to wider grains and higher yields. GW8 also binds to the GL7/GW7 promoter, inhibiting its transcription and regulating cell proliferation in spikelet glumes [191,192]. Furthermore, GLW7 and GW8 both positively control grain hull cell size, increasing grain length and yield. Small and round seed 5 (SRS5) carrying a single amino acid substitution (p.Arg308Leu) decreases rice grain length by inhibiting cell elongation. Although GLW7 can interact with SRS5 to regulate grain size, the exact mechanism remains unclear. Furthermore, the novel gene *OrMKK3* affects morphology and grain size, suggesting a relationship between the MAPK and BR-responsive pathways with regard to grain development [193]. Moreover, *LARGE2*, encoding the HECT E3 Ub-ligase OsUPL2, controls rice panicle size and grain number, and *large2* mutants showed increased panicle size and number of grains per panicle [194].

#### **4. Seed Size Regulation by Phytohormones**

Phytohormones control a vast range of developmental and physiological functions in plants. Phytohormone levels fluctuate across tissues and stages during seed development and play a significant role in the processes involved [195]. Multihormonal regulation, mediated via auxins, cytokinins, brassinosteroids, and jasmonic acid, plays an important role in endosperm proliferation and embryo growth. Phytohormones and related molecules are schematized in Figure 2.

**Figure 2.** Phytohormones and their correlated molecules involved in seed size regulation: (**A**) auxins, (**B**) cytokinins, (**C**) brassinosteroids, and (**D**) jasmonic acid.

#### *4.1. Auxins*

Auxin, a plant hormone, has been linked to a wide range of aspects of plant growth, including seed production [196]. The key auxin in all plants is indole-3-acetic acid (IAA). It is biosynthesized from the amino acid tryptophan (Trp) in a two-step mechanism that is highly conserved throughout plants [197,198]. In the first step, the amino group is removed from Trp by transaminases of the TRYPTOPHAN AMINOTRANSFERASE OF ARABIDOPSIS (TAA/TAR) family, resulting in indole-3-pyruvic acid (IPA). In the second step, YUCCA (YUC) flavin monooxygenases convert IPA to IAA. Eleven YUC genes have been reported in Arabidopsis and are involved in the development of the basal body during embryogenesis [199]. Throughout embryogenesis, YUC1 and YUC4 are expressed in a variety of cell types, and their expression coincides with that of YUC10 and YUC11 in developing seeds [199]. The LEAFY COTYLEDON (LEC), BABY BOOM (BBM), and PLETHORA (PLT) transcription factors are known to bind or control auxin-associated genes under normal conditions, which stimulates somatic embryo development. In addition, ectopic expression of LEC2 promotes YUC2 and YUC4 expression early in somatic embryo development in seedlings [200], whereas ectopic expression of LEC1 promotes YUC gene expression during 2,4-D-induced somatic embryogenesis from immature zygotic embryos (YUC1, YUC4, and YUC10) and in seedlings (YUC10) [201,202]. Auxin response factors (ARFs) control the expression of auxin-responsive genes in plants [203]. ARFs are B3-like transcription factors that bind directly to the auxin-responsive element (AuxRE) and activate or repress auxin-responsive genes. Among the twenty-three ARF proteins in Arabidopsis, ARF2 and ARF8 have been suggested to channel auxin signals to control seed size by reducing cell division [204,205]. ARF3/ETTIN mutations cause significant polarity abnormalities in the gynoecium, including apical tissue over-proliferation and abnormal ovary development [206]. ARF6, ARF7, and ARF8 promote cell growth and hypocotyl elongation [207]. NtARF8 is involved in NtTTG2-regulated seed development in tobacco (*Nicotiana tabacum* L.) and supports seed quantity in collaboration with NtARF17 and NtARF19 [55]. Another study identified a B3 transcription factor, Arabidopsis MATERNAL EFFECT EMBRYO ARREST 45 (MEE45), a downstream regulator of auxin biosynthesis in the ovule, which controls seed size maternally and regulates cell proliferation via the transcriptional activation of *AINTEGUMENTA* (*ANT*). In that study, increased MEE45 expression resulted in larger seeds, whereas the *mee45* mutant produced smaller seeds [208]. Auxin also regulates the growth of siliques. Overexpression of BnaA9.CYP78A9, a member of the P450 monooxygenase gene family, stimulates the elongation of *Brassica napus* siliques by causing a significant increase in auxin [209].
However, most recently the modified expression of TaCYP78A5 has been shown to increase grain weight and yield per wheat plant by accumulating auxin [210]. Hence, the aforementioned studies based on transgenic and mutant studies imply that the auxin biosynthesis pathway, signaling, and transporter genes are essential in regulating the auxin levels during seed development, controlling embryogenesis, endosperm development, and seed coat development.

#### *4.2. Cytokinins*

Cytokinins (CKs) are phytohormones involved in the modulation of cytokinesis-related enzyme activity and CK-mediated signaling. Additionally, CKs are important for initial endosperm cellularization [211]. The discovery of CK-catabolizing genes, together with their distinctive genetic and biochemical properties, has revealed the role of CK oxidase/dehydrogenase (CKX) enzymes in CK regulation [212], as well as the multi-gene families that encode them. Various studies have aimed to determine the role of CKX gene family members (GFMs) in improving grain yield in many crops, including barley [213,214], rice [159,215], and wheat [216,217]. The basic helix–loop–helix transcription factor cytokinin growth regulator (CKG) promotes CK-mediated regulation of cell expansion and cell cycle progression in Arabidopsis. CKG expression has been noted to enhance the size of cotyledons during the reproductive stage, whereas ckg mutants show the opposite effect [218]. Another study elucidated that CK activity is primarily regulated by the transcription factor ARR1 and the receptor AHK3. Similarly, increased levels of CK in the external layer of reproductive tissues and the placenta lead to larger siliques and increased ovule formation, thereby increasing the number of seeds and resulting in higher seed yield [219]. CKs have also been observed as methylthiolated derivatives (MET); however, details of the function of this group of compounds are scarce [220]. The level of CKs during seed development is controlled by the balance between the activities of isopentenyl transferase (IPT) and CKX, which play important roles in the biosynthesis and degradation of CKs, respectively. AtIPT4 and AtIPT8 are expressed at higher levels in the chalazal endosperm area during seed development and morphogenesis [221,222]. Furthermore, Song et al. (2015) suggested that siliques and growing seeds are able to synthesize CK [223].
The endosperm-specific expression of IPT increases CK levels in growing seeds and increases seed mass without causing any morphological aberrations in transgenic tobacco plants [224]. Numerous studies have reported that higher CK levels, whether arising during development, upon external application, or from ectopic expression of IPT, are linked with increased CK degradation resulting from increased CKX gene expression [225,226]. Furthermore, the expression of the CKX and IPT GFMs suggests a positive correlation between the two in wheat [227] and *Brassica napus* [223].

CKs have a role in transcriptional control, and different genes in Arabidopsis influence the activation of their receptors, the Arabidopsis histidine kinases (AHKs). In response to CKs, AHKs activate Arabidopsis histidine phosphotransferase proteins (AHPs), and AHPs stimulate Arabidopsis response regulators (ARRs), thereby inducing the expression of downstream genes. The triple mutants ahk-1,2,3 and ahp-1,2,3 have been observed to have enlarged embryos and seeds [221]; moreover, the same study concluded that cytokinin independent 1 (CKI1) plays an important role in CK signaling. CKI1 contains a histidine kinase-deficient CK perception domain of AHK. The cki mutants produce fewer but larger seeds, suggesting that CKI1 helps determine seed number and size during seed development [228]. Moreover, Li et al. (2013) showed that cytokinin oxidase 2 (CKX2), which encodes a CK-degrading enzyme, controls endosperm development [89], and that CKX2 expression is controlled by IKU2 and H3K27me3 [86]. These experimental findings indicate that CKs control the phases of seed development through members of the CK signaling pathway and through epigenetic mechanisms [229]. Another study identified 17 GmCKX GFMs in soybean and probed natural variation among cultivars with different yields. Based on the single nucleotide polymorphisms (SNPs) found in them, 5 of the 17 CKX genes were noted to be responsible for the regulation of CK content during the grain-filling stage. GmCKX7-1 was found to carry a non-synonymous mutation (H105Q) at histidine 105, one of the active-site residues that preserve the structural integrity of the enzyme during the critical grain-filling period, consequently enhancing seed yield [66].
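To make the term non-synonymous concrete, the following minimal Python sketch enumerates the single-nucleotide substitutions that convert a histidine codon into a glutamine codon under the standard genetic code. It is purely illustrative: the nucleotide change underlying H105Q in GmCKX7-1 is not given here, and `single_nt_routes` is a hypothetical helper name.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def single_nt_routes(src_aa, dst_aa):
    """Return (source codon, 0-based position, mutant codon) for every
    single-nucleotide substitution that changes src_aa into dst_aa."""
    routes = []
    for codon, aa in CODON_TABLE.items():
        if aa != src_aa:
            continue
        for pos, base in product(range(3), BASES):
            if base == codon[pos]:
                continue  # not a substitution
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == dst_aa:
                routes.append((codon, pos, mutant))
    return routes


# List every one-step route from histidine (H) to glutamine (Q).
for codon, pos, mutant in single_nt_routes("H", "Q"):
    print(f"{codon} -> {mutant} (codon position {pos + 1})")
```

Under the standard code, histidine is encoded by CAT/CAC and glutamine by CAA/CAG, so every such route is a substitution at the third codon position.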

#### *4.3. Brassinosteroids*

Brassinosteroids (BRs) are a recently documented class of phytohormones that play significant roles in plant growth and development, such as cell elongation and division [230–232]. BRs take part in regulating yield-determining agronomic traits, including seed size [233]. In Arabidopsis and rice, molecular analysis of BR-defective mutants has permitted the identification of important genes involved in the BR-mediated regulation of seed development. The DEP1 promoter has been revealed to increase rice grain size and yield by optimizing the expression of BR-like genes [234,235]. Three steps are involved in the synthesis of BR from campesterol: (1) formation of campestanol from campesterol; (2) two concurrent routes from campestanol to castasterone (CS); and (3) B-ring lactonization of CS to brassinolide (BL). OsDWARF (OsBR6ox) is a crucial gene in the biosynthesis of BR. It encodes BR-6-oxidase, which catalyzes a late stage in bioactive BR formation that links the early and late C-6 oxidation pathways and can convert 6-deoxotyphasterol (6-deoxoTY) and 6-deoxoCS to typhasterol (TY) and CS, respectively [236,237]. Loss of OsDWARF function decreases TY and CS, resulting in the BR-deficient phenotype. Further, the dwarf mutant shrink1-D (shk1-D) exhibited a lower level of BRs, which caused reduced seed length in Arabidopsis. In shk1-D mutants, CYP72C1 overexpression and the resulting hydroxylation of BRs led to low endogenous BR levels and decreased seed size [235]. Furthermore, a defective mutant of the putative brassinolide receptor brassinosteroid insensitive1 also showed lower seed production [226]. Presumably, this receptor activates the expression of genes that act as positive regulators of seed size (HAIKU2, MINI3, and SHB1) and inhibits that of genes encoding negative regulators of seed size (ARF2 and AP2) through the TF brassinazole resistant 1 (BZR1) [238].
Higher expression of the BR synthesis-related gene OsD11 or of the BR signaling factor OsBZR1 increased the sugar-accumulation rate in developing seeds and increased grain yield in rice. By contrast, knockdown of these genes resulted in defective pollen and reduced seed size and weight compared with the control [239]. In the BR signaling pathway, mutants of the key cytochrome P450 enzyme D11 [234,240], BR-deficient dwarf 1 (BRD1) [236], DWARF 2 (D2) [241], and other genes related to BR biosynthesis, including BRD2 [242], all have very similar phenotypes, such as decreased plant height, shoot length, spikelet length, and grain size. Furthermore, GW5 also interacts with and suppresses the activity of glycogen synthase kinase 2 (GSK2), a crucial kinase involved in BR signaling. The BZR1 and DLT transcription factors, downstream of GSK2, enter the nucleus in an unphosphorylated state to modulate BR signaling-responsive genes, thereby affecting grain size in rice [120,226]. Moreover, GW5 and GW5-related proteins physically bind to important components of the BR signaling pathway, GSK2 and BIN2, resulting in the accumulation of unphosphorylated DLT and OsBZR1 [243]. Additionally, GSK2 binds to and phosphorylates OVATE family protein 8 (OFP8) [244]. Furthermore, OFP8, OFP14, and the transcriptional activator GS9 interact with one another. OFP14 inhibits GS9 transcriptional activity, and GS9 overexpression causes round grain shapes with BR-defective phenotypes. As a result, GS9 appears to negatively regulate BR signaling [245]. In yeast two-hybrid screens, OFP3 binds to both GSK2 and DLT and, like OFP8 and OFP14, negatively regulates the BR response [246]. Plants with the dss1 mutation were dwarf and showed erect leaves and smaller seeds. Similar results were observed in BR-deficient mutants [247]. In addition, SERK2 has been reported as a BR signaling element for many agronomic traits, especially yield and stress-related traits. Moreover, SERK2 knockdown mutants showed enhanced rice grain size [248].

#### *4.4. Jasmonic Acid*

Jasmonic acid (JA), a phytohormone produced from free α-linolenic acid, is similar to animal prostaglandins [249,250] and serves a variety of functions in plant development and in responses to biotic and abiotic stressors [226,251,252]. During inflorescence development, JA influences pollen formation or pollen shedding [253], embryo and seed development [254,255], spikelet formation [256], and maize sex determination [257–259]. Linolenic acid is the substrate for the biosynthesis of JA, which is then metabolized to produce the bioactive isoleucine conjugate, JA-Ile [250]. The interaction of JA-Ile with the COI1 receptor (CORONATINE INSENSITIVE1) activates JA signaling, resulting in the proteasomal degradation of JAZ (JASMONATE ZIM-DOMAIN) transcriptional repressor proteins [260,261]. JAZ repressor degradation releases a number of transcription factors, especially MYCs, which interact with MED25/PFT1 to negatively regulate seed size [262]. The JA biosynthetic mutant eg1 (extra glume 1) displayed aberrant spikelet formation. The gene EG1 encodes a plastid-targeted class I lipase, which is required for the production of JA precursors. Furthermore, eg2-1D showed disrupted floral identity as well as impaired floral meristem determination [256]. Map-based cloning showed that EG2 encodes OsJAZ1, a JAZ repressor. Another comprehensive investigation revealed that the spikelet defects in eg2-1D are caused by the inhibited function of the OsMYC2-controlled E-class gene, OsMADS1. In addition, OsJAZ11 regulates the width and weight of rice seeds. Compared with wild-type (WT) plants, transgenic rice lines overexpressing OsJAZ11 showed up to a 14% increase in seed width and a 30% increase in seed weight. The constitutive expression of OsJAZ11 had a significant impact on spikelet development, resulting in additional glume-like structures, an open hull, and an abnormal number of floral organs. Transgenic lines also accumulated increased JA levels in spikelets and developing seeds. Overexpression lines exhibited altered expression of JA signaling and MADS-box genes compared with WT [263]. According to yeast two-hybrid and pull-down experiments, OsJAZ11 interacts with OsMADS29 and OsMADS68. Surprisingly, in overexpression lines, the expression of OsGW7, an important negative regulator of grain size, was dramatically decreased. Furthermore, multi-seeded 3 (msd3) sorghum mutants, which can quadruple grain number per panicle by expanding panicle size and modifying floral development such that all spikelets are viable and set grain, showed decreased JA levels [263].

#### **5. Possible Strategies Used to Maintain Seed Size**

Increasing grain yield is always an important goal in plant breeding programs, especially in the case of domesticated crops. However, increasing seed size without affecting seed number via conventional plant breeding programs has allowed only limited progress because of the trade-off between these two yield components [264]. Many researchers have indicated that modern technologies, including transgenic technology, genome editing, marker-assisted selection (MAS), and genomic selection (GS), still face major challenges in breaking grain yield barriers in different crops. The use of molecular markers for plant breeding selection has resulted in major advancements in efficiency and the successive release of many new varieties in the last 30 years [265,266]. MAS is often more effective than traditional methods, resulting in improved accuracy, cost or time savings, or the ability to screen for diseases that would otherwise be impossible to detect using standard phenotyping methods [267]. One of the major benefits of markers is that homozygosity can quickly be detected. In the private sector, molecular breeding programs have been widely adopted, with reports indicating increasing rates of genetic gain [268,269]. Numerous reports of marker-assisted variety improvement have emerged as a result of the widespread use of MAS in major crop breeding programs [270]. Backcrossing with DNA markers significantly improves selection efficiency. In essence, marker-assisted backcrossing (MABC) allows for very effective detection of the most important genes of interest or QTLs while preserving the recipient variety's original important features, allowing the original variety to be "upgraded" [265]. Whole-genome-based molecular breeding methodologies have been developed as a result of advances in rice genomics. Genomic selection (GS) is a new approach that has recently evolved [271].
Genomic selection, unlike MAS, is based on making predictions on a genomic scale from many DNA markers rather than focusing on individual genes or QTLs [272,273]. Over the last ten years, pilot research in rice, wheat, and maize has yielded promising results in shortening breeding cycles and speeding up variety creation. Genomic selection offers considerable potential for the accurate selection of complex traits such as yield and for shortening the breeding cycle to improve genetic gain [274]. Recently, it has been reported that the *GW6* gene from Baodali increases grain length and width and that it was transferred into the recurrent parents 9311 (an *indica* cultivar) and Zhonghua 11 (a *japonica* cultivar) via MAS. Near isogenic lines (NILs) of these two cultivars displayed improved grain weight and grain production [275]. In addition, most studies have suggested that pyramiding grain-size QTLs is a convenient method to boost grain production in different plant species. For example, a study found that grain size and weight were improved by pyramiding QTLs using the NILs NIL-*qGL3*, NIL-*qGW2a*, and NIL-*qGS5*. Similarly, the use of NILs in the genetic study of Zhenshan 97 using MAS and conventional backcrossing has been reported. The combination of three major QTLs for grain size showed improved effects on grain weight without any type of interaction [276]. A similar study reported that NIL-*GS3*/*qgl3* was established by crossing NIL-*GS3* with NIL-*qgl3* using the MAS approach. The combined effects of *GS3* and *qGL3* on grain length were compared among NILs. Furthermore, primary panicle transcription analysis in NILs showed that the genes regulated by *GS3* and *qGL3* did not overlap [277].
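As a worked illustration of why backcrossing preserves the recipient variety's features, the standard expectation (an assumption brought in here, not a result from the studies cited above) is that, without any marker-based background selection, the recurrent-parent share of the genome after *n* backcrosses averages 1 − (1/2)^(n+1). Marker-assisted background selection aims to exceed this average in fewer generations. A minimal Python sketch, with a hypothetical function name:

```python
def expected_recurrent_fraction(n_backcrosses: int) -> float:
    """Expected average fraction of recurrent-parent genome after
    n backcrosses, assuming no selection on the genetic background.

    n_backcrosses = 0 corresponds to the F1 (half of each parent).
    """
    if n_backcrosses < 0:
        raise ValueError("n_backcrosses must be non-negative")
    return 1.0 - 0.5 ** (n_backcrosses + 1)


# Tabulate the expectation for the F1 and the first four backcross generations.
for n in range(5):
    label = "F1" if n == 0 else f"BC{n}"
    print(f"{label}: {expected_recurrent_fraction(n):.4f}")
```

These values are population averages; individual plants vary around them, which is what background markers exploit.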

Fine-mapping and cloning of QTLs for crop production traits, particularly grain weight, have advanced significantly in the last two decades. To date, 20 QTLs have been cloned for grain size and weight, including GL7/GW7 and GS9, which have opposite allelic directions of additive effects on grain length and weight, controlling grain size but having little effect on grain weight [192,278,279]. GSA1 and GW6a affect grain length and weight in the same direction, implying that they significantly impact grain weight [13,280]. The remaining 16 QTLs have an impact on both grain size and grain weight. Six of them, GW2 [281], TGW2 [282], GS5 [283], qSW5/GW5 [120,121], GW6 [284], and GW8 [191], largely control grain width and weight, while the other ten, GS2/GL2 [177,178], OsLG3 [285], qLGY3/OsLG3b [158,286], GS3 [153], GL3.1/qGL322 [287], TGW3/GL3.3 [288,289], GL4 [290], TGW6 [291], GL6 [292] and GLW7 [49], largely control grain length and weight. The characterization of all these QTLs has greatly aided our understanding of the genetic regulation of rice grain size and weight, but further research is required to clarify the regulatory mechanisms for these important agronomic traits [111].

Numerous studies have reported that alterations in the regulatory genes of quantitative traits, such as oil accumulation in the seed, showed a significantly increasing trend in intrinsic yields [293–295]. Selecting genes that play a vital role in seed development at different developmental phases requires expression data from tissues at specific developmental stages [296]. Furthermore, in polyploid species, the selection of target genes for seed size engineering may require tissue-specific and stage-specific expression profiles of the separate GFMs [297]. New technologies in functional genomics have been introduced for the study of seed development, and genetic engineering approaches provide greater scope for controlling seed size. Furthermore, increasing evidence shows that a thorough molecular knowledge of seed formation could yield opportunities for controlling seed size in plants [14,234,298]. For example, in a field study, the downregulation of *BnDA1*, the *B. napus* ortholog of *AtDA1* (a negative regulator of seed formation), improved seed weight and seed yield by 21% and 13%, respectively [14]. Two Arabidopsis genes involved in seed development, *AtSHB1* and *AtKLUH*, were introduced into *Brassica juncea* and improved seed weight by 40% [299]. *AtSHB1* overexpression in *B. napus* increased seed size 1.6-fold [300]. A loss-of-function *ARF18* mutant in *B. napus* showed an approximately 8% increase in seed size compared to WT plants [301].

In rice, *big grain 1* (*BG1*), which encodes a membrane-associated protein, increases grain size and yield through the response to auxins (grain length and width increased by about 15.2% and 17.0%, respectively). Furthermore, RNAi suppression of *BG1* led to a decrease in rice grain size and yield relative to WT plants [301]. Downregulation of the *big seed 1* (*BS1*) and *BS2* genes in soybean and *Medicago* resulted in significant enhancement of seed size [302]. Engineering techniques suited to achieving larger seed size involve the overexpression of important regulators of seed development in Arabidopsis [14,303] and other species [298] or the silencing of orthologs that are negative regulators of seed growth [302]. Genes regulating hormone biosynthesis, metabolism, and signaling are useful targets for manipulation to alter hormonal regulation and produce larger seeds. For example, *BZR1*, a BR-responsive TF that activates positive regulators such as *SHB1* and *IKU2* while repressing the negative regulators *AP2* and *ARF2* [238], may be modified in a spatiotemporal manner. Epigenome editing can also be applied in a locus-specific manner via methylation or demethylation. Furthermore, several target genes that control the coordinated growth of maternal and zygotic tissues could be manipulated to increase seed size and crop yield. For example, KLUH and SHB1 regulate the development of the seed coat and of the endosperm and embryo, respectively, in a specific spatiotemporal manner. Moreover, control of core cell cycle gene expression is an alternative approach to increasing seed size and yield. For example, target-specific up-regulation of the core cell cycle component CYCD7;1 in the endosperm helps overcome incomplete seed termination interlinked with ectopic expression [104]. Furthermore, genetic manipulation of genes that govern seed traits may reduce the negative relationship between seed number and seed size. In Arabidopsis, FATA2, BZR1, LecRK-VIII.2, and CKX2 expression controls seed number, and expression of many other genes increases seed size and yield.

Genome editing technology, especially the CRISPR/Cas9 editing system, has been widely used in many plant species to knock out individual genes, which is the primary outcome of its cutting and repair mechanisms (Figure 3). The widely employed CRISPR/Cas9 and various other CRISPR/Cas systems often generate a double-strand break (DSB) by cutting the double-stranded DNA. The DSB is often repaired by non-homologous end joining (NHEJ). Because one or more nucleotide deletions or insertions may inhibit targeted gene expression through frame shifting in the coding region, genome editing typically results in either gene knockout or silencing. CRISPR/Cas-induced gene silencing has many advantages over T-DNA mutagenesis. The main advantage of CRISPR/Cas-based editing is its specificity; it may be used to target a particular gene without disrupting other genes. Agrobacterium-mediated gene transformation has largely been used to overexpress a specific gene in a plant cell via T-DNA insertion, which includes the foreign DNA sequences. Even so, the random insertion of a gene into a genome can have numerous adverse effects, such as the silencing of other useful genes. The CRISPR/Cas system may cleave the DNA sequence at a specified location, and homology-directed repair (HDR) will then insert a targeted gene sequence at the cleavage position. As a result, CRISPR/Cas may be used to accurately insert a gene of interest into a particular site inside a genome, avoiding interference with neighboring genes and preventing position effects. Base editing (BE) is also a new technology that can specifically and efficiently introduce point mutations at target positions without DSB formation or donor DNA templates [304]. The cytosine base editor (CBE) and adenine base editor (ABE) are common examples of base editing technology, which has been applied in various plant species to interrogate gene function for crop trait improvement [305,306].
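The frame-shifting argument for NHEJ-derived indels can be sketched in a few lines of Python. This is a minimal illustration under the assumption that the indel lies entirely within a coding exon and does not affect splice sites; `classify_indel` is a hypothetical helper and not part of any CRISPR analysis tool.

```python
def classify_indel(net_length_change: int) -> str:
    """Classify an indel within a coding sequence by its effect on the
    reading frame.

    net_length_change: number of inserted bases minus deleted bases.
    """
    if net_length_change == 0:
        return "no net length change"
    # Codons are three bases long, so only multiples of three keep the frame.
    if net_length_change % 3 == 0:
        return "in-frame"
    return "frameshift"


# Typical small NHEJ outcomes: 1-bp deletions and insertions shift the frame.
for change in (-1, 1, -3, 6, -7):
    print(change, classify_indel(change))
```

In-frame indels can still impair a protein by removing or adding residues, so this classification is only a first-pass indicator of a likely knockout.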

**Figure 3.** Application diversity of the CRISPR/Cas system for functional genomic research and crop improvement. CRISPR may directly generate gene knockout (silencing) via the deletion or addition of a number of bases when a DNA double-strand break is repaired through NHEJ. Alternatively, genome editing may replace an undesirable gene and/or overexpress (knock-in) a single gene when homology-directed repair occurs with a DNA donor. In addition, CRISPR/Cas may also be employed for base editing and epigenome editing by deactivating the Cas9 enzyme while using transcription effectors or other enzymes coupled with the dCas9.

In gene replacement via HDR, any sequence may be inserted to replace the existing genomic sequence surrounding the DSB location. Thus, if the nucleotide sequences carry point mutations or other undesired sequences, we may replace them with suitable nucleotide sequences that have homology arms matching the existing sequences. It was recently reported that transcript-templated HDR was employed to completely replace the rice acetolactate synthase gene (ALS) by delivery of a DNA-free ribonucleoprotein complex [307]. Prime editing (PE) is a new DSB-independent precise genome editing technology that can introduce any base conversion, small indels, or a combination thereof, at target positions [304]. The prime editing approach was initially established in animal systems but was rapidly adopted in plants for precise gene editing. Prime editing is now being used to effectively replace genes in numerous key plant species, such as Arabidopsis, tobacco [308], rice [309–312], wheat [313], maize [314], tomato [315], and potato [316].

A CRISPR/dCas9 interference (CRISPRi) system was developed based on the application of the CRISPR/dCas9 system to disrupt transcriptional activities and to be used as a gene knockdown strategy for controlling gene expression [317,318]. CRISPRi can inhibit the expression of genes by restricting transcription initiation and/or elongation, depending on where the CRISPR/dCas system binds. When the dCas system binds to the upstream region, it impedes TF and RNA polymerase (RNAP) binding and restricts transcription initiation. When it binds to the coding region, however, it impedes RNAP and restricts transcription elongation [318]. Regulating a gene's methylation and demethylation can be an effective strategy to influence expression and, ultimately, trait regulation. As a result, since the CRISPR/Cas system was established as an effective genome editing system, research efforts have focused on adapting it to alter DNA methylation.

CRISPR/Cas9 has great potential for specifically targeting desired genes [319]; thus, it could be used to develop crops with the desired seed size by knocking out specific genes without altering other traits [320]. However, in some cases, seed size may be linked with other important traits. For example, CRISPR-Cas9 genome editing of the TaIPK1 gene in wheat enhanced nutritional value and increased seed size simultaneously [321]. Recently, using CRISPR/Cas9-based genome editing, it was found that the MADS-box TF genes MADS78 and MADS79 are very important for regulating endosperm cellularization and early seed formation in rice [76]. A single MADS78 or MADS79 knockout mutant displayed early endosperm cellularization, while double mutants hindered seed development and produced no viable seeds [76]. Most recently, genome editing of cis-regulatory elements (CREs) has been found to result in both gain- and loss-of-function alleles, which is proving very useful for broadening the phenotypic range of traits related to yield and architecture [322,323].

#### **6. Conclusions and Future Recommendations**

Seed size and seed weight, which are regulated during seed development, are critical determinants of crop yields. Seed size is determined by the development of the zygotic embryo and endosperm as well as maternal tissues. Many signaling pathways that determine seed size through the development of endosperm and maternal tissues, such as the IKU pathway, MAPK signaling, G-protein signaling, and ubiquitin–proteasome signaling, also have significant effects on overall plant growth and development. Transcription factors are responsible for enhancing cell growth in the maternal ovule and affect seed size. There remains a need to increase seed size without affecting seed number through conventional breeding programs to improve crop yield.

Many researchers have indicated that modern molecular approaches, i.e., MAS, GS, CRISPR/Cas system-based editing, and transgenic technology, have limitations in overcoming seed yield barriers in different plant species. In this review, we summarized many factors regulating seed size, such as genetic factors, signaling pathways, and transcription factor regulators in Arabidopsis and other crops, followed by the engineering of seed size using recently developed transgenic and breeding techniques, and ended with a brief discussion of recent studies, and those conducted over the last decade, that aimed to comprehend the genetic and molecular aspects controlling seed size. However, more research is required to understand the molecular and genetic factors underlying seed development pathways. Moreover, there is a need to introduce more strategies for genetic modification to improve grain size and yield.

**Author Contributions:** Writing—original draft preparation, I.A.; writing—review and editing, I.A., K.B., Y.H., J.L. and L.G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Key-Areas Research and Development Program of Guangdong Province (2020B020220008), and the start-up fund from South China Agricultural University (to L.G.).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Ferritin Heavy Chain Binds Peroxiredoxin 6 and Inhibits Cell Proliferation and Migration**

**Maddalena Di Sanzo <sup>1,†</sup>, Flora Cozzolino <sup>2,3,†</sup>, Anna Martina Battaglia <sup>1</sup>, Ilenia Aversa <sup>1</sup>, Vittoria Monaco <sup>2,3</sup>, Alessandro Sacco <sup>1</sup>, Flavia Biamonte <sup>1,4</sup>, Camillo Palmieri <sup>1</sup>, Francesca Procopio <sup>5</sup>, Gianluca Santamaria <sup>1</sup>, Francesco Ortuso <sup>5</sup>, Piero Pucci <sup>3</sup>, Maria Monti <sup>2,3</sup> and Maria Concetta Faniello <sup>1,\*</sup>**


**Abstract:** The H Ferritin subunit (FTH1), as well as regulating the homeostasis of intracellular iron, is involved in complex pathways that might promote or inhibit carcinogenesis. This function may be mediated by its ability to interact with different molecules. To gain insight into the FTH1 interacting molecules, we analyzed its interactome in HEK293T cells. Fifty-one proteins were identified, and among them, we focused our attention on a member of the peroxiredoxin family (PRDX6), an antioxidant enzyme that plays an important role in cell proliferation and in malignancy development. The FTH1/PRDX6 interaction was further supported by co-immunoprecipitation in HEK293T and H460 cell lines and by computational methods. Next, we demonstrated that FTH1 could inhibit PRDX6-mediated proliferation and migration. Taken together, the results obtained so far suggest that the FTH1/PRDX6 interaction in cancer cells might alter cell proliferation and migration, leading to a less invasive phenotype.

**Keywords:** H Ferritin subunit; PRDX6; protein-protein interaction

#### **1. Introduction**

Ferritin, the major intracellular iron-storage protein, binds iron in a soluble, non-toxic form and makes it available for many cellular processes. It is localized in the cytoplasm [1], nucleus [2] and mitochondria [3] and is composed of 24 subunits of two different types, a light chain (L; FTL) of 19 kDa and a heavy chain (H; FTH1) of 21 kDa. The two subunits, which share extensive homology, are functionally distinct: FTH1 is responsible for the ferroxidase activity of the ferritin molecule, while FTL is mainly involved in iron storage [4,5]. FTH1 and FTL are encoded by two different genes [6,7] whose activity is regulated at the transcriptional [8–10] and post-transcriptional levels. The post-transcriptional control of the ferritin genes is iron-dependent [11] and acts mainly on the translational efficiency of the FTH1 and FTL mRNAs [12]. FTH1 controls the cellular pool of chelatable and redox-active iron, named the labile iron pool (LIP) [13]. It has been shown that FTH1 repression evokes an increase in the cellular LIP level and activity, while its overexpression decreases the LIP content [14]. A high level of LIP results in oxidative damage by catalyzing ROS generation in mitochondria [15]. Besides its role in iron metabolism, FTH1 is involved in signaling pathways related to physiologic and pathologic processes by interacting with different protein partners. FTH1 regulates angiogenesis during inflammation and malignancy by antagonizing the cleavage of HKa (high molecular weight kininogen), an endogenous inhibitor of angiogenesis [16], which is also involved in the onset of HIV-mediated neuropathology in patients with a history of drug abuse [16]. FTH1 interacts with the Alacrima-Achalasia-Adrenal Insufficiency Neurological Disorder (ALADIN) protein [17], whose mutations cause the triple A syndrome, characterized by impaired FTH1 nuclear uptake [17]. FTH1 may also physically interact with the CXCR4 chemokine receptor, and in response to CXCL12 stimulation, FTH1 is phosphorylated and translocated into nuclei [18]. The FTH1/CXCR4 interaction also has functional feedback because CXCR4-mediated ERK1/2 activation is inhibited by FTH1 overexpression [18]. Under oxidative stress, FTH1 interacts with p53, thus increasing its transcriptional activity [19]. Moreover, it also binds Nuclear Receptor Coactivator-4 (NCOA4), which targets it for degradation, allowing iron recycling [20]. Furthermore, FTH1 may form a complex with the death domain-associated protein (Daxx), and together they can participate in the apoptosis pathway [21]. FTH1 might function as a tumor suppressor or as an oncogene; indeed, FTH1 knockdown reduces melanoma cell proliferation both in vivo and in vitro [22] and may modulate the expression of MHC class I molecules, leading to NK cell activation [23], while in the erythroleukemia K562 cell line it determines an increased expression of a specific set of onco-miRNAs [24], activation of the H19/miR-675 axis [25] and severe protein misfolding [26]. In MCF7 and NCI-H460 cells, FTH1 knockdown induces epithelial-to-mesenchymal transition (EMT) [27], while in ovarian cancer cells, FTH1 is involved in the inhibition of cancer cell proliferation and in cancer stem cell propagation [28]. The evidence that FTH1 may physically interact with molecules expressed in various human malignancies raises the question of the functional role of FTH1 in critical cellular pathways. This function may be mediated by its ability to interact with several signaling molecules.

**Citation:** Di Sanzo, M.; Cozzolino, F.; Battaglia, A.M.; Aversa, I.; Monaco, V.; Sacco, A.; Biamonte, F.; Palmieri, C.; Procopio, F.; Santamaria, G.; et al. Ferritin Heavy Chain Binds Peroxiredoxin 6 and Inhibits Cell Proliferation and Migration. *Int. J. Mol. Sci.* **2022**, *23*, 12987. https://doi.org/10.3390/ijms232112987

Academic Editor: Wajid Zaman

Received: 16 September 2022 Accepted: 23 October 2022 Published: 26 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Therefore, studies focused on FTH1 "interactome" analysis are expected to improve the understanding of both the physiological and the pathological processes in which it is involved. New potential FTH1-interacting proteins were investigated using a functional proteomic approach, relying on the immunoprecipitation of 3xFlag-FTH1-containing complexes from HEK-293T protein extract and their identification by nanoLC-MS/MS methodologies. In this study, we identified PRDX6 as a novel FTH1-interacting protein and found that both were involved in the cellular response to oxidative stress. We demonstrated that the FTH1/PRDX6 interaction also occurred in NCI-H460 cells and investigated the functional role of this binding in cell migration and proliferation.

#### **2. Results**

#### *2.1. Identification of FTH1 Interacting Proteins*

To uncover novel functional roles of the ferritin heavy chain (FTH1), we investigated its interactome, identifying new putative protein partners by immunoprecipitation of tagged 3xFlag-FTH1 and its interacting proteins from HEK-293T protein extracts. HEK-293T cells transfected with an empty vector were used as a control. Immunoprecipitated proteins were fractionated by SDS-PAGE, and the 39 slices cut from both sample and control lanes (Figure S1) were digested in situ with trypsin. Peptide mixtures were then analyzed by LC-MS/MS, and proteins were identified with Mascot software.

The final list of FTH1's putative partners, obtained by removing the proteins common to the sample and the control, is reported in Table S1.
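The control subtraction described above amounts to a set difference. The following is a minimal sketch, assuming the identifications are available as lists of gene symbols; the function name `subtract_control` and the control-lane entries are hypothetical and are not taken from Table S1.

```python
def subtract_control(sample_hits, control_hits):
    """Return sample identifications that are absent from the control lane."""
    control = set(control_hits)
    # dict.fromkeys drops duplicate identifications while keeping first-seen order.
    return [protein for protein in dict.fromkeys(sample_hits) if protein not in control]


# Hypothetical input: PRDX6, NCOA4, HSPB1 and CDKAL1 are named in the text;
# the control-lane entries (KRT1, ALB) are invented for illustration.
sample_lane = ["PRDX6", "NCOA4", "KRT1", "HSPB1", "ALB", "CDKAL1"]
control_lane = ["KRT1", "ALB"]

putative_partners = subtract_control(sample_lane, control_lane)
print(putative_partners)  # ['PRDX6', 'NCOA4', 'HSPB1', 'CDKAL1']
```

In practice the comparison would more likely be made on database accession numbers than on gene symbols, which is a further assumption not stated in the text.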

Subsequently, a functional analysis of the identified putative FTH1 interactors was conducted using the STRING database, allowing up to a maximum of five co-interactors.

The graphical representation of the inferred network (Figure 1), weighted by protein interactions, provided a detailed view of the putative functional linkage among the 51 proteins used as input and FTH1, facilitating the comprehension of the modularity analysis in biological processes. The analysis generates a putative protein–protein interaction score. We filtered the interactions by applying a score threshold of 0.3 and retaining only those involving FTH1.
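The filtering step can be illustrated with a short sketch. It assumes that each interaction is available as a pair of protein names with a score on a 0 to 1 scale and that the threshold is inclusive; the `Edge` type, the function name and all score values below are hypothetical and do not reproduce the actual STRING output.

```python
from typing import NamedTuple


class Edge(NamedTuple):
    protein_a: str
    protein_b: str
    score: float  # interaction score on a 0-1 scale (assumed)


def bait_interactions(edges, bait="FTH1", min_score=0.3):
    """Keep edges that involve the bait and reach the threshold, highest score first."""
    kept = [e for e in edges if bait in (e.protein_a, e.protein_b) and e.score >= min_score]
    return sorted(kept, key=lambda e: e.score, reverse=True)


# Hypothetical scores, invented for illustration only.
edges = [
    Edge("FTH1", "NCOA4", 0.95),
    Edge("FTH1", "PRDX6", 0.45),
    Edge("HSPB1", "VIM", 0.80),    # does not involve the bait, so it is dropped
    Edge("FTH1", "CDKAL1", 0.20),  # below the threshold, so it is dropped
]

for edge in bait_interactions(edges):
    print(edge.protein_a, edge.protein_b, edge.score)
```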

**Figure 1.** STRING interaction analysis: in silico interaction analysis by STRING (https://string-db.org/ (accessed on 15 September 2022)) suggests a putative interaction between FTH1 and PRDX6.

The majority of FTH1-interacting proteins belong to the cytoskeleton, cell proliferation and traffic, signal transduction, spindle organization and ferroptosis processes. Among the proteins already known to interact with FTH1, NCOA4 was identified, and the interaction was also confirmed by immunoprecipitation (data not shown). Among the novel putative FTH1-interacting proteins, we identified HSPB1, a protein involved in iron metabolism. HSPB1 inhibits transferrin receptor 1 (TFR1) expression by modulating intracellular iron accumulation. CDKAL1 belongs to the methyltransferase family, whose function is impaired by cellular iron deficiency. Another interactor is vimentin, whose function is also modulated by iron metabolism.

Additionally, a novel partner, Peroxiredoxin 6 (PRDX6), captured our attention, being co-expressed with FTH1 in different tissues (see Table S2) and involved in several biological processes, including focal adhesion (FDR 1.35 × 10<sup>−5</sup>), cell junction (FDR 0.006), signaling by ROBO receptors (FDR 0.01) and activation of ATR in response to replication stress (FDR 0.02), as predicted by the STRING functional association analysis (Table S3). PRDX6 is an enzyme involved in the containment of oxidative stress, preventing the cell peroxidation that causes membrane lesions [29]. Moreover, it has been demonstrated that its catalytic activity favors cancer cell proliferation and motility, promoting invasion and metastasis [30,31]. In light of these important roles in promoting pathological states, the interaction between FTH1 and PRDX6 was further investigated.

#### *2.2. FTH1/PRDX6 Interaction in HEK-293T and NCI-H460 Cells*

To assess the physical interaction between PRDX6 and FTH1, HEK-293T cells transiently transfected with 3xFlag-FTH1 or 3xFlag control vector were used for the co-immunoprecipitation experiment. The cell lysates were incubated with the ANTI-FLAG M2 affinity gel, and the immunoprecipitated complexes were separated by SDS-PAGE and analyzed by Western blot with anti-PRDX6 and anti-Flag antibodies, respectively. As shown in Figure 2A, the interaction between the two proteins was confirmed. The same result was obtained by investigating the interaction between the endogenous proteins expressed at basal levels in wild-type HEK-293T cells (Figure 2B). Confocal microscopy analysis demonstrated the cytoplasmic colocalization of FTH1 and PRDX6 (Figure 2C), as expected from their localization. Furthermore, to verify whether the interaction between FTH1 and PRDX6 is independent of the cell line, we replicated the same experiment in non-small cell lung cancer (NSCLC) NCI-H460 cells transiently transfected with 3xFlag-FTH1 or 3xFlag control vector. In NCI-H460 cells, as in HEK-293T cells, FTH1 interacted with PRDX6 (Figure 3A) and colocalized with it in the cytoplasm (Figure 3B).

**Figure 2.** Co-IP and colocalization analysis of the interaction between FTH1 and PRDX6. (**A**) HEK-293T cells were transfected with Flag-FTH1 or a control vector (3x-Flag). Cell lysates were immunoprecipitated with anti-Flag M2 resin and then analyzed by immunoblotting with anti-PRDX6 and anti-Flag antibodies. (**B**) HEK-293T cell lysates were immunoprecipitated with FTH1 antibody or anti-mouse IgG, respectively. Eluates were analyzed by immunoblotting with anti-FTH1 or anti-PRDX6 antibodies. (**C**) HEK-293T cells were grown on a coverslip, fixed with 4% paraformaldehyde and processed for double-label immunofluorescence with anti-FTH1 and anti-PRDX6 antibodies. Images were collected using a confocal microscopy system (40×). Representative data from one of three experiments.


**Figure 3.** Co-IP and colocalization analysis of the interaction between FTH1 and PRDX6. (**A**) NCI-H460 cells were transfected with Flag-FTH1 or a control vector (3x-Flag). Cell lysates were immunoprecipitated with anti-Flag M2 resin and then analyzed by immunoblotting with anti-PRDX6 and anti-Flag antibodies. (**B**) NCI-H460 cells were grown on a coverslip, fixed with 4% paraformaldehyde and processed for double-label immunofluorescence with anti-FTH1 and anti-PRDX6 antibodies. Images were collected using a confocal microscopy system (40×). Representative data from one of three experiments.

#### *2.3. Molecular Recognition in FTH1/PRDX6 Complex*

The interaction between FTH1 and PRDX6 was investigated by means of computational methods. Theoretical models of FTH1 and PRDX6, the latter analyzed in both the reduced (PRDX6\_r) and sulphinic acid (PRDX6\_s) states, were built and submitted to docking simulation (as reported in Materials and Methods). The results indicated productive recognition between FTH1 and both PRDX6 models. Globally, 50 and 55 possible configurations of FTH1 were predicted against the PRDX6\_r and PRDX6\_s targets, respectively. The average interaction energies were very similar and were estimated at −15.44 kcal/mol for FTH1•PRDX6\_r and at −15.59 kcal/mol for FTH1•PRDX6\_s. Even considering the most stable configuration only, the difference was lower than 1 kcal/mol. All docking-generated complexes were graphically inspected, revealing that FTH1 explored a wide region around both targets (Figure 4).

On the theoretical complexes, the in-house GBPM method was applied to highlight and classify, by quartiles, the most relevant interacting residues (Table 1).


**Figure 4.** Superimposition of all theoretical complexes between FTH1 and (**a**) PRDX6\_r and (**b**) PRDX6\_s. Ferritin protein is depicted as wireframe CPK colored, and PRDX6 chains are shown as green and magenta cartoon.


**Table 1.** FTH1 (left) and PRDX6 (right) most relevant, quartile 1, interacting residues.

<sup>a</sup> FTH1 residues interacting to PRDX6 phospholipase site <sup>b</sup> FTH1 residues interacting to PRDX6 peroxidase site.

Interestingly, in both PRDX6 models, FTH1 was able to directly recognize several residues close to His39, Cys47 and Arg132, corresponding to the PRDX6 peroxidase site, and to His26, Ser32 and Asp140, which are involved in the PRDX6 phospholipase activity (Table 1). These interactions could rationalize the experimentally observed FTH1 modulation of PRDX6. To investigate this further, all theoretical complexes were searched to identify, among the residues contributing most to complex stabilization, the FTH1 residues recognizing the PRDX6 phospholipase or peroxidase sites. As expected, the phospholipase site, being more solvent-exposed than the peroxidase one, allowed more extensive interaction with FTH1 (Figure 5).

**Figure 5.** Most stable theoretical complexes of FTH1 with (**a**) PRDX6\_r and (**b**) PRDX6\_s. Ferritin protein is depicted as wireframe CPK colored, and PRDX6 chains are shown as green and magenta cartoon. PRDX6 residues at peroxidase and phospholipase sites are reported as space-filling yellow carbon or cyan carbon colored, respectively. Ferritin-bound Fe<sup>2+</sup> ions are shown as orange space-filling.

#### *2.4. FTH1 Inhibits Proliferation and Migration in PRDX6 Overexpressed NCI-H460 Cells*

Accumulated evidence suggests that PRDX6 exerts specific functions in cancer progression, affecting cell growth, survival, migration, differentiation, invasiveness, and metastasis. In NSCLC, PRDX6 overexpression promotes cancer cell proliferation and an invasive phenotype [30]. To investigate whether the interaction between FTH1 and PRDX6 might affect cell migration and proliferation, NCI-H460 cells were transiently transfected with 3xFlag-PRDX6, 3xFlag-PRDX6/3xFlag-FTH1, 3xFlag-PRDX6/siFTH1 or 3xFlag control vector. The confluent NCI-H460 cells were scraped to create a wound, and cell migration was assessed 24, 48 and 72 h later. The results of a triplicate set of independent assays (Figure 6A,B) demonstrate that at 72 h after the scratch, the wound area of 3xFlag-PRDX6 cells was significantly narrower than that of the control. Interestingly, when FTH1 and PRDX6 were simultaneously overexpressed, the wound area was significantly larger than in the presence of PRDX6 alone. Conversely, FTH1 silencing by siRNA transfection combined with PRDX6 overexpression significantly increased the migratory ability compared to cells overexpressing both PRDX6 and FTH1. Next, we compared the proliferation ability of NCI-H460 FTH1/PRDX6 vs. NCI-H460 PRDX6 and vs. NCI-H460 siFTH1/PRDX6 cells. The results of the MTT test, reported in Figure 6C, showed that, at 72 h, the simultaneous overexpression of FTH1 and PRDX6 significantly reduced cell proliferation with respect to PRDX6 overexpression alone, whereas FTH1 knockdown combined with PRDX6 overexpression increased the proliferation rate compared to FTH1 and PRDX6 overexpression. Western blot analysis (Figure 6D) demonstrated that FTH1 levels are markedly reduced in 3xFlag-PRDX6 cells compared to both control and 3xFlag-FTH1/3xFlag-PRDX6 overexpression. These data indicate that the presence of FTH1 might counteract the cell proliferation and migration processes induced by PRDX6 overexpression.

**Figure 6.** Scratch wound healing and MTT proliferation assays. (**A**) After transfection with control vector (3xFlag), Flag-PRDX6, Flag-PRDX6/Flag-FTH1 or Flag-PRDX6/siFTH1, confluent NCI-H460 cell monolayers were scratched to induce horizontal migration and imaged at 0, 24, 48 and 72 h (magnification 10×). (**B**) The wound areas were measured using ImageJ 1.42q software. The histogram indicates wound area as % of initial area. Values represent mean ± SD; n = 3. At 24 h, the differences for Flag-PRDX6 vs. 3xFlag, Flag-PRDX6 vs. Flag-PRDX6/Flag-FTH1 and Flag-PRDX6 vs. Flag-PRDX6/siFTH1 are not statistically significant. At 48 and 72 h, the differences for Flag-PRDX6 vs. 3xFlag and Flag-PRDX6 vs. Flag-PRDX6/Flag-FTH1 are statistically significant (\* *p* < 0.005), whereas the difference for Flag-PRDX6 vs. Flag-PRDX6/siFTH1 is not. (**C**) NCI-H460 cells transfected with control vector, Flag-PRDX6, Flag-PRDX6/Flag-FTH1 or Flag-PRDX6/siFTH1 were seeded in 24 wells and cultured for 0, 24, 48 and 72 h, and their proliferation was determined by absorbance at 490 nm. Final results represent mean ± SD of three independent experiments, each performed in triplicate. At 24 and 48 h, the differences for Flag-PRDX6 vs. 3xFlag, Flag-PRDX6 vs. Flag-PRDX6/Flag-FTH1 and Flag-PRDX6 vs. Flag-PRDX6/siFTH1 are not statistically significant. At 72 h, the differences for Flag-PRDX6 vs. 3xFlag (*p* < 0.05) and Flag-PRDX6 vs. Flag-PRDX6/Flag-FTH1 (\* *p* < 0.005) are statistically significant, whereas the difference for Flag-PRDX6 vs. Flag-PRDX6/siFTH1 is not. (**D**) Western blot analysis was performed to confirm PRDX6 and PRDX6/FTH1 overexpression and siFTH1 silencing. γ-tubulin was used as a loading control.

#### **3. Discussion**

In recent years, our work has mainly focused on the analysis of the iron-independent roles of FTH1 in human transformed cell lines [22,27,28,32]. Taken together, the data from our group and others strongly suggest that FTH1 is involved in complex pathways that might promote or inhibit carcinogenesis [33]. Although considerable experimental evidence suggests that altered expression of FTH1 is a common event in cancer, most of the available studies do not clarify the specific role of FTH1 in this process. This is mainly due to incomplete knowledge of the overall scenario in which FTH1 and its partners interact to promote tumor progression. Knowledge of the molecules interacting with FTH1 can therefore help clarify this issue. Identification and characterization of each component of protein networks is a critical step toward fully understanding these processes at the molecular level.

The aim of this paper was to identify FTH1 protein partners in order to shed further light on its functions in cancer cells. To this end, we pursued a proteomics-based approach relying on IP-MS (immunoprecipitation-mass spectrometry) of the 3xFLAG-FTH1 protein overexpressed in HEK-293T cells.

To further investigate the role of FTH1 in tumorigenesis, we focused our attention on PRDX6, a member of the peroxiredoxin family. PRDX6 is a bifunctional enzyme that possesses both peroxidase and calcium-independent phospholipase A2 (iPLA2) activity, protects against oxidative stress and prevents the cell peroxidation that causes membrane lesions [29]. Elevated levels of PRDX6 expression have been shown to be associated with a variety of human cancers, including lung and breast cancer [30,34]. In lung cancer cells, it has been demonstrated that the upregulation of PRDX6 results in the activation of Akt via phosphoinositide 3-kinase (PI3K) and p38 kinase [35]. It has also been demonstrated that PRDX6 promotes cell proliferation and inhibits cancer cell apoptosis; its peroxidase activity promotes the growth of cancer cells, whereas its PLA2 activity promotes their invasion and metastasis [30]. The physical interaction between FTH1 and PRDX6 was confirmed by co-immunoprecipitation both in FTH1-overexpressing cells and under basal conditions. Confocal images further supported the interaction between FTH1 and PRDX6 and showed their predominantly cytoplasmic colocalization. Our data, in agreement with the literature, showed that PRDX6 overexpression induced NCI-H460 cell proliferation and migration; in contrast, NCI-H460 cells showed a strong reduction in their proliferative and migratory capability when simultaneously transfected with FTH1 and PRDX6. These results demonstrate that the FTH1/PRDX6 interaction alters cell proliferation and migration, leading to a less invasive phenotype, and suggest that FTH1 has an inhibitory effect on the oncogenic role of PRDX6.

It has been demonstrated that PRDX6 is highly expressed in human cancer cells and plays a critical role in cancer development [36,37]. Indeed, high levels of PRDX6 contribute to the proliferation of cancer cells [30]. Our findings therefore demonstrate the inhibitory effects of FTH1 on PRDX6 and suggest a new potential target for the development of novel therapeutic strategies for the treatment of cancers that overexpress PRDX6.

#### **4. Materials and Methods**

#### *4.1. Cell Cultures*

HEK-293T human embryonic kidney cells (Sigma-Aldrich, St. Louis, MO, USA) were cultured in adherent conditions in DMEM (Sigma-Aldrich) medium with 10% FBS and 1% Penicillin-Streptomycin (Sigma-Aldrich). NCI-H460 human non-small cell lung cancer cells (ATCC, Manassas, VA, USA) were cultured in RPMI 1640 (Sigma-Aldrich) medium supplemented with 10% fetal bovine serum (FBS, Belize City, Belize) and 1% Penicillin-Streptomycin (Sigma-Aldrich). The two cell lines were maintained at 37 °C in a humidified 5% CO<sub>2</sub> atmosphere.

#### *4.2. Transient Transfection and Cell Lysis*

HEK-293T cells were transiently transfected with an expression vector containing the full-length human FTH1 cDNA (3xFlag-FTH1) (N-Flag tag, Sino Biological, Beijing, China) using Lipofectamine 3000 transfection reagent (Thermo Fisher Scientific, Waltham, MA, USA) following the manufacturer's instructions [38]. Cells transfected with the empty 3xFlag vector were used as a negative control. Cells were incubated in the medium containing the transfection mixture for up to 72 h. Experiments were performed at least three times. Cell lysis was conducted using a buffer composed of the following components: 50 mM Tris/HCl pH 7.4, 150 mM NaCl, 10% glycerol, 1 mM Na<sub>3</sub>VO<sub>4</sub>, 1 mM NaF, 0.5 mM PMSF, 1% Triton and 1 mini tablet of EDTA-free protease inhibitor (Roche). Cell pellets were resuspended, left on ice for 10 min, and then placed on a tube rotator at 4 °C for 45 min. Finally, the lysates were centrifuged for 30 min at 13,000 rpm and 4 °C. Protein extracts were quantified with the Bradford assay (BIORAD, Hercules, CA, USA). Bovine serum albumin was used as the standard protein for the calibration curve.

#### *4.3. Isolation of Protein Complexes by Immunoprecipitation*

Lysates from 3xFlag-FTH1 transfected cells and from the 3xFlag negative control (cells transfected with empty vector) were simultaneously pre-cleared on Dynabeads™ Protein G (Invitrogen, Waltham, MA, USA) to remove the background of non-specific proteins from the cellular extracts. The pre-cleared extracts were then subjected to an immunoprecipitation protocol using anti-FLAG M2 (Sigma Aldrich) magnetic beads to isolate the bait and its putative interactors, as previously reported [39]. Supernatants containing the unbound proteins were removed, and the beads were washed with two different NaCl concentrations (150 and 300 mM) in lysis buffer. Elution was performed using free 3xFLAG peptide at a concentration of 200 μg/mL for 5 h at 4 °C. Protein samples were precipitated in a mixture of methanol and chloroform and then dried in a Speed-Vac system (Thermo Fisher, Waltham, MA, USA).

#### *4.4. Mass Spectrometry Analysis and Protein Identification*

Immunoprecipitated protein complexes were fractionated by SDS-PAGE on a 16 × 16 cm, 8–15% gradient acrylamide/bis-acrylamide gel. The gel was stained with Colloidal Blue Coomassie (PIERCE, Hercules, CA, USA), and excess dye was removed by washing with deionized water. Thirty-nine bands were excised from both sample and control lanes and subjected to an in situ hydrolysis protocol, as previously reported [40]. Each peptide mixture was dried, resuspended in 10 μL of 0.2% HCOOH (J.T. Baker, Waltham, MA, USA) and analyzed by LC-MS/MS on an LTQ Orbitrap XL system (Thermo Fisher) equipped with a nano-LC Proxeon nanoEasy-II system. Peptides were fractionated on a C18 reverse-phase capillary column (5 μm biosphere, 75 μm ID, 200 mm length) working at a flow rate of 250 nL/min and adopting a linear gradient from 10% to 60% of eluent B (0.2% formic acid, 95% acetonitrile LC-MS Grade) over 69 min. Mass spectrometric analyses were carried out in data-dependent acquisition (DDA) mode: from each MS scan, spanning from 300 to 1800 *m/z*, the five most abundant ions were selected and fragmented.
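The precursor selection rule of the DDA mode can be sketched as follows. This is only an illustration of the "five most abundant ions per scan" criterion, assuming a survey scan given as (*m/z*, intensity) pairs; the function name and the peak values are hypothetical, and real instrument methods typically apply further criteria (such as dynamic exclusion) that are not described in the text.

```python
def select_precursors(peaks, top_n=5, mz_min=300.0, mz_max=1800.0):
    """Return the top_n most intense peaks that fall inside the scan window."""
    in_window = [(mz, intensity) for mz, intensity in peaks if mz_min <= mz <= mz_max]
    return sorted(in_window, key=lambda peak: peak[1], reverse=True)[:top_n]


# Hypothetical survey scan as (m/z, intensity) pairs, invented for illustration.
survey_scan = [
    (250.1, 9.0e6),   # below the 300-1800 m/z window, ignored
    (402.2, 1.0e5),
    (515.3, 8.0e5),
    (633.8, 3.0e5),
    (744.4, 6.0e5),
    (856.9, 2.0e5),
    (1201.6, 4.0e5),
    (1900.0, 7.0e6),  # above the window, ignored
]

precursors = select_precursors(survey_scan)
print([mz for mz, _ in precursors])  # [515.3, 744.4, 1201.6, 633.8, 856.9]
```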

Output data were processed to generate .mgf files, which were used for protein identification against the NCBI database with licensed Mascot software (Matrix Science, Boston, MA, USA). Protein identification was carried out using 10 ppm as the peptide mass tolerance for MS and 0.6 Da for the MS/MS search; *Homo sapiens* as taxonomy; carbamidomethyl (C) as fixed modification; and Gln→pyro-Glu (N-term Q), Oxidation (M) and Pyro-carbamidomethyl (N-term C) as variable modifications. Proteins were considered identified when at least two significant peptides exceeded the Mascot score threshold.
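To make the 10 ppm precursor tolerance concrete, the short sketch below computes the relative mass error between an observed and a theoretical peptide mass. The function names and the example masses are hypothetical; this is the generic definition of a ppm error and not a description of how Mascot applies it internally.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Relative mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def within_tolerance(observed_mass, theoretical_mass, tol_ppm=10.0):
    """True if the observed mass lies within +/- tol_ppm of the theoretical mass."""
    return abs(ppm_error(observed_mass, theoretical_mass)) <= tol_ppm


# Hypothetical masses in Da: at 1000 Da, 10 ppm corresponds to 0.010 Da.
print(within_tolerance(1000.008, 1000.0))  # True  (about 8 ppm)
print(within_tolerance(1000.012, 1000.0))  # False (about 12 ppm)
```

The fragment tolerance in the text is instead given as an absolute value (0.6 Da), so it would be compared directly with the absolute mass difference.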

#### *4.5. Co-Immunoprecipitation*

HEK-293T cells were lysed in RIPA (radioimmunoprecipitation assay) buffer [41–43], incubated overnight at 4 °C with 1 μg of anti-FTH1 antibody or with nonimmune serum, and then incubated for 1 h at 4 °C with Protein A/G plus-agarose. The beads were collected by centrifugation, washed four times with lysis buffer and then loaded onto a 12% (*w*/*v*) SDS-polyacrylamide gel [44].

#### *4.6. Western Blotting Analysis*

A total of 60 μg of protein extract was resolved on 15% SDS-PAGE and then transferred to a nitrocellulose membrane by electroblotting [32]. Non-specific reactivity was blocked in nonfat dry milk in TTBS 1X [5% (*w*/*v*) milk in TBS 1X (pH 7.4) and 0.1% Tween 20] for 2 h at room temperature. The membrane was incubated with specific primary antibodies, anti-FTH1 (sc-376594 Santa Cruz, Santa Cruz, CA, USA), anti-Flag M2 (F-1804 Sigma Aldrich) and anti-PRDX6 (ab59543 Abcam, Cambridge, UK), overnight at 4 °C. After incubation, the membranes were washed three times with TTBS 1X for 10 min and incubated with an appropriate horseradish peroxidase-conjugated secondary antibody at room temperature for 1 h, according to the manufacturer's instructions. The membranes were then washed three times with TTBS 1X, and signals were visualized with ECL Western blot detection reagents (Santa Cruz Biotechnology, Dallas, TX, USA) [45].

#### *4.7. Molecular Modeling*

Protein Data Bank (PDB) [46] entries 3AJO [47], 5B6M and 5B6N [48] were selected for building our theoretical models of FTH1 and of PRDX6 in the reduced (PRDX6\_r) and sulphinic acid (PRDX6\_s) states, respectively. The original PDB 3AJO structure was modified by removing co-crystallized water molecules and Mg<sup>2+</sup> ions. To identify the most favorable FTH1 sites for iron recognition, the molecular interaction field (MIF) of the FE+2 probe, as implemented in GRID ver. 22d [49], was computed. To obtain a more accurate position of the iron ions, the NPLA GRID keyword was set to 3, while the other parameters were left at their default values. The MIF was filtered by means of the GRID MINIM utility using an energy cutoff of 10 kcal/mol above the global minimum energy point (−85.27 kcal/mol). This approach allowed us to identify two favorable positions for the iron ion with respect to the 3AJO PDB entry. Therefore, two iron ions were included in our preliminary FTH1 model.
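The energy filter can be pictured with a simple threshold on the MIF points. The sketch below is not the GRID MINIM utility, which also locates local minima; it only illustrates the "within 10 kcal/mol of the global minimum" criterion. The function name, the coordinates and all energies except the −85.27 kcal/mol global minimum are hypothetical.

```python
def favorable_points(mif_points, cutoff=10.0):
    """Keep MIF points whose energy lies within `cutoff` kcal/mol of the global minimum."""
    global_min = min(energy for _, energy in mif_points)
    return [(xyz, energy) for xyz, energy in mif_points if energy <= global_min + cutoff]


# Hypothetical grid points as ((x, y, z), energy in kcal/mol).
# Only the global minimum value (-85.27) is taken from the text.
mif = [
    ((12.0, 4.5, -3.0), -85.27),
    ((13.5, 6.0, -1.5), -80.00),
    ((9.0, 2.0, 0.5), -76.00),
    ((20.0, 8.0, 7.0), -70.00),  # more than 10 kcal/mol above the minimum, discarded
]

for xyz, energy in favorable_points(mif):
    print(xyz, energy)
```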

Both PRDX6\_r and PRDX6\_s PDB entries contained, reported as chains A, B and C, three conformationally different structures of the protein. These models were divided by the chain, and each of them was considered a conformer of PRDX6.

To prepare the FTH1 and PRDX6 structures for further simulation, they were optimized using MacroModel ver. 11.9 (Schrödinger Release 2018-1: MacroModel, Schrödinger, LLC, New York, NY, USA, 2021). In detail, hydrogen atoms were added, and 10,000 steps of the Polak–Ribière conjugate gradient energy minimization method, coupled to the AMBER\* force field, were applied. Aqueous environment effects were mimicked by means of the implicit solvation model GB/SA water.

The aim was to investigate the recognition between FTH1 and PRDX6. The resulting optimized structures were submitted to molecular docking with AutoDock Vina v. 1.2.1 [50]. Using MGLTools ver. 1.5.6, the Kollman charge distribution was computed on all protein models. PRDX6\_r and \_s optimized conformations were considered as receptor models and FTH1 as ligand. Therefore, for each receptor model, an AutoDock Vina simulation was performed, considering a regular box equal to 4,000,752 Å<sup>3</sup> entirely surrounding the targets. For each FTH1·PRDX6 theoretical complex, a maximum of 20 configurations were allowed (num\_modes = 30), and the other docking parameters were set to default values. Docking results highlighted 57 and 56 FTH1 poses with respect to PRDX6\_r and PRDX6\_s, respectively. The affinity of FTH1 against both PRDX6\_r and PRDX6\_s was estimated by computing the AutoDock Vina average interaction energy values.
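The affinity estimate described above reduces to averaging the per-pose interaction energies for each receptor state. A minimal sketch follows, assuming the pose energies have already been collected into a list; the function name and the energy values are hypothetical and are not the docking results of this study.

```python
from statistics import mean


def summarize_poses(energies):
    """Summarize docking poses: count, average energy and most stable pose (kcal/mol)."""
    return {
        "n_poses": len(energies),
        "mean_energy": mean(energies),
        "best_energy": min(energies),  # the most negative value is the most stable pose
    }


# Hypothetical pose energies in kcal/mol, invented for illustration.
poses_reduced = [-16.0, -15.5, -15.0, -14.5]
poses_sulphinic = [-16.5, -15.5, -15.0, -14.0]

for label, energies in (("PRDX6_r", poses_reduced), ("PRDX6_s", poses_sulphinic)):
    summary = summarize_poses(energies)
    print(label, summary["n_poses"], summary["mean_energy"], summary["best_energy"])
```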

To consider the induced fit phenomena, all theoretical complexes were submitted to the same energy minimization protocol previously reported for the protein models preparation. The most relevant interacting residues were highlighted by analyzing DRY, N1 and O GRID MIFs using a modified GBPM method against all optimized structures [51,52].

#### *4.8. Immunofluorescence Assay*

For immunofluorescence analysis, HEK-293T and NCI-H460 cells were treated as previously described by Aversa et al. [27]. Briefly, HEK-293T and NCI-H460 cells were incubated with the primary antibodies anti-FTH1 (sc-376594 Santa Cruz) and anti-PRDX6 (ab59543 Abcam) overnight at 4 °C. Appropriate secondary antibodies (anti-mouse IgG Alexa Fluor 488 and anti-rabbit IgG Alexa Fluor 555, Thermo Fisher Scientific) diluted in PBS 1X were applied for 1 h at room temperature. After three washes, the nuclear stain DAPI (1:1000, Invitrogen, Carlsbad, CA, USA) was added for 20 min. The coverslips were mounted on microscope slides using ProLong Gold antifade mounting reagent (Thermo Fisher Scientific). Images were collected using a Leica TCS SP2 confocal microscopy system (Leica Microsystems, Wetzlar, Germany) [53].

#### *4.9. Proliferation Assays*

For the proliferation assay, 3 × 10<sup>3</sup> NCI-H460 cells transfected with 3xFlag control vector, 3xFlag-PRDX6, 3xFlag-FTH1/PRDX6 or 3xFlag-PRDX6/siFTH1 were plated in a 96-well flat-bottom tissue culture plate. At 24, 48 and 72 h of culture, 10 μL of 3-(4,5-dimethylthiazol-2-yl)-2,5-diphenyltetrazolium bromide (MTT) (Sigma-Aldrich) solution (2 mg/mL) was added per well. After 4 h of incubation, the culture medium was discarded and replaced with 200 μL of isopropanol. Optical density (OD) was read on a multi-well scanning spectrophotometer (ELISA reader) (BIORAD) at 595 nm. The proliferation assay was performed in triplicate.

#### *4.10. Wound Healing Assay*

NCI-H460 cells transfected with 3xFlag control vector, 3xFlag-PRDX6, 3xFlag-FTH1/PRDX6 or 3xFlag-PRDX6/siFTH1 were seeded in 60 mm dishes at a density of 4 × 10<sup>5</sup> cells. After 24 h, a yellow pipette tip was used to make a scratch. Scratch closure was monitored, and images were captured at 0, 24, 48 and 72 h using a light microscope (Leica DFC420 C and Leica Application Suite X Software 3.7.4.23463; Leica Microsystems). Wound closure was measured from the density of the pixels in the scratched area and expressed as a percentage of wound closure in that area. The percentage of wound closure was calculated with ImageJ64 software.
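The two ways of reporting the scratch assay (remaining wound area as a percentage of the initial area, as plotted in Figure 6B, and percentage of wound closure) are complementary. The sketch below shows the arithmetic, assuming wound areas have already been measured in pixels; the function names and the area values are hypothetical.

```python
def wound_area_percent(area_t, area_0):
    """Remaining wound area at time t as a percentage of the area at 0 h."""
    if area_0 <= 0:
        raise ValueError("initial wound area must be positive")
    return 100.0 * area_t / area_0


def wound_closure_percent(area_t, area_0):
    """Percentage of the initial wound that has closed at time t."""
    return 100.0 - wound_area_percent(area_t, area_0)


# Hypothetical pixel areas at 0, 24, 48 and 72 h, invented for illustration.
areas = {0: 1000.0, 24: 800.0, 48: 500.0, 72: 250.0}
for hours, area in areas.items():
    print(hours, wound_area_percent(area, areas[0]), wound_closure_percent(area, areas[0]))
```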

#### *4.11. Statistical Analyses*

Data are presented as the mean ± standard deviation (SD) of three independent experiments. Statistical and data analyses were carried out using GraphPad Prism 9 software. Statistical differences between samples were assessed by Student's *t*-test.
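For readers who wish to reproduce this kind of comparison outside GraphPad Prism, the sketch below computes the mean ± SD and the two-sample Student's *t* statistic using only the Python standard library. It assumes an unpaired test with equal variances, which the text does not specify; the function name and the replicate values are hypothetical. The *p*-value would then be obtained from the *t* distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev, variance


def students_t(group_a, group_b):
    """Unpaired two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Equal variances are assumed.
    """
    n_a, n_b = len(group_a), len(group_b)
    dof = n_a + n_b - 2
    pooled = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / dof
    t = (mean(group_a) - mean(group_b)) / sqrt(pooled * (1 / n_a + 1 / n_b))
    return t, dof


# Hypothetical triplicate measurements, invented for illustration.
condition_a = [1.0, 2.0, 3.0]
condition_b = [4.0, 5.0, 6.0]

print(f"A: {mean(condition_a):.2f} ± {stdev(condition_a):.2f}")
print(f"B: {mean(condition_b):.2f} ± {stdev(condition_b):.2f}")
t_value, dof = students_t(condition_a, condition_b)
print(f"t = {t_value:.3f}, df = {dof}")
```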

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms232112987/s1.

**Author Contributions:** M.C.F., M.M. and P.P. writing—original draft, writing—review and editing, validation, visualization, project administration, and supervision. M.D.S., F.C. and V.M. methodology, formal analysis, investigation, validation, and visualization. G.S., F.P. and F.O. software. C.P., A.M.B., I.A., A.S. and F.B. formal analysis, and data curation. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data is contained within the article or Supplementary Materials.

**Acknowledgments:** We thank Caterina Alessi for providing technical support. We thank the Centre of Interdepartmental Services (CIS), 'Magna Graecia' University of Catanzaro, Italy, for supporting part of this research.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **3β-Corner Stability by Comparative Molecular Dynamics Simulations**

**Vladimir R. Rudnev <sup>1,2</sup>, Kirill S. Nikolsky <sup>1</sup>, Denis V. Petrovsky <sup>1</sup>, Liudmila I. Kulikova <sup>1,2</sup>, Anton M. Kargatov <sup>3</sup>, Kristina A. Malsagova <sup>1,</sup>\*, Alexander A. Stepanov <sup>1</sup>, Arthur T. Kopylov <sup>1</sup>, Anna L. Kaysheva <sup>1</sup> and Alexander V. Efimov <sup>3</sup>**


**Abstract:** This study explored the mechanisms that maintain the stability of super-secondary structures of the 3β-corner type when they exist autonomously, outside the protein globule, in an aqueous environment. A molecular dynamics (MD) study determined the behavioral diversity of a large set of non-homologous 3β-corner structures of various origins. We focused on geometric parameters such as the change in gyration radius, solvent-accessible area, major conformer lifetime, torsion angles, and the number of hydrogen bonds. Ultimately, a set of 3β-corners from 330 structures was characterized by a root mean square deviation (RMSD) of less than 5 Å, a change in the gyration radius of no more than 5%, and the preservation of amino acid residues within the allowed regions on the Ramachandran map. The studied structures retained their topologies throughout the MD experiments. Thus, the 3β-corner structure was found to be rather stable per se in a water environment, i.e., without the rest of a protein molecule, and can act as a nucleus or "ready-made" building block in protein folding. The 3β-corner can also be considered an independent object of study in the field of structural biology.

**Keywords:** super-secondary structure; 3β-corner; folding nuclei; structure stability

#### **1. Introduction**

Structural motifs (super-secondary structures, SSSs) of globular proteins are defined as commonly occurring folding units composed of two or more elements of secondary structure that are adjacent along the polypeptide chain and are in close contact in three-dimensional space. While many different structural motifs have been observed to recur within proteins, only some motifs exhibit a definite handedness and a unique overall fold, irrespective of whether they occur in homologous or non-homologous proteins [1–4]. The high incidence of structural motifs in unrelated proteins and the fact that many small proteins are composed solely of such motifs indicate their stability and ability to fold into unique structures per se. Such structural motifs are of particular interest since they can act as nuclei, or "ready-made" building blocks, in protein folding, or can be utilized as starting structures in protein modeling [4,5]. A structural motif with a unique overall fold that occurs in all proteins of a structural group can be taken as the starting structure in modeling or as the root structure of a structural tree. Larger protein structures are obtained by stepwise addition of α-helices and/or β-strands to the root motif in accordance with a restricted set of rules inferred from known principles of protein structure. Several structural trees for protein superfamilies, including several thousand known proteins (their 3D structures were taken from the PDB), have been constructed and are available at the following source: http://strees.protres.ru/ (accessed on 15 August 2022).

**Citation:** Rudnev, V.R.; Nikolsky, K.S.; Petrovsky, D.V.; Kulikova, L.I.; Kargatov, A.M.; Malsagova, K.A.; Stepanov, A.A.; Kopylov, A.T.; Kaysheva, A.L.; Efimov, A.V. 3β-Corner Stability by Comparative Molecular Dynamics Simulations. *Int. J. Mol. Sci.* **2022**, *23*, 11674. https://doi.org/10.3390/ijms231911674

Academic Editor: Wajid Zaman

Received: 25 August 2022 Accepted: 27 September 2022 Published: 2 October 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

There is growing interest in structural motifs because they can act as embryos in the process of protein folding. Over the past 30 years, significant research efforts have been made to identify and characterize the folding nuclei of known protein structures [6,7]. In this context, structural motifs have been broadly discussed as promising candidates [7–9]. Prominent representatives of structural motifs are the α-α-corner, the β-hairpin, and the Greek key motif [10]. The characterization of these structures led researchers to develop a catalog of autonomously folded protein motifs and archetypes.

This study is devoted to a comparative analysis of molecular dynamics simulations of the stability of the 3β-corner structural motif (a super-secondary structure).

The 3β-corner can be represented as a triple-stranded β-sheet folded onto itself so that two ββ-hairpins are packed roughly orthogonally in different layers and the central strand bends by ≈90° in a right-handed direction when passing from one β-layer to the other [11] (Figure 1). All the 3β-corners observed in proteins can be considered as Z-like β-sheets when viewed from their concave surfaces.

**Figure 1.** A schematic representation of the 3β-corner (**a**); ribbon and wireframe models of the 3β-corner (**b**,**c**) of beta-adrenergic receptor kinase 1 (*Bos taurus*), PDB ID 1OMW, region: A586–A625. Putative hydrophobic contacts are highlighted by dashed red lines, and hydrogen bonds by dashed blue lines.

The molecular dynamics experiment provides a range of tools for studying the properties of SSSs and their stability, and, consequently, it is crucial for the selection of candidate protein structures as folding nuclei [12].

#### **2. Results**

#### *2.1. The 3β-Corner Motif as an Autonomous Structure*

In the first stage, we examined whether the 3β-corner type of SSS can be studied as an autonomously stable structure in an aqueous medium, that is, outside the protein globule. In this study, autonomous stability refers to the preservation of the structural topology of the 3β-corner in the MD experiment, both within the whole protein and separately from the protein environment in an aqueous medium. To do this, we excised the parts of the structures corresponding to the 3β-corners from the standard files in the ".pdb" format obtained from the PDB.

We carried out a comparative analysis of changes in the geometric characteristics of the 3β-corners in the composition of whole proteins and autonomously in an aqueous medium on the basis of the results of a 300 ns MD experiment. We performed this comparative analysis on two proteins, PDB ID 2E3H and PDB ID 2E3I.

Analyzing the results of the MD experiment revealed the strict preservation of the 3β-corner geometry in the protein and outside the protein (Table 1). The geometric characteristics were obtained by clustering the MD trajectory corresponding to the values for the conformer in the major cluster. By major cluster, we mean the conformer of the structure under study, which inhabits the trajectory for at least 80% of the total duration of the analysis. Table 1 presents the calculated values of the characteristics of the 3β-corner in the course of the MD experiments carried out with an individual SSS and the whole protein, where this 3β-corner is its constituent block. As an example, the table lists the characteristics of the two 3β-corners recognized in two proteins: 2E3H, chain A, region: 216–246 and PDB ID 2E3I, chain A, region: A62–A93 (Table S1).
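The major-cluster criterion described above can be illustrated with a short sketch. It assumes that each trajectory frame has already been assigned a cluster label by a clustering tool; the function name, the input list, and the example labels are hypothetical and are not taken from the original workflow.

```python
from collections import Counter


def major_cluster(frame_clusters, min_fraction=0.8):
    """Return (cluster_id, occupancy, is_major) for the most populated cluster.

    frame_clusters: one cluster label per trajectory frame (assumed input).
    min_fraction:   occupancy threshold; 0.8 mirrors the 80% criterion in the text.
    """
    counts = Counter(frame_clusters)
    cluster_id, n_frames = counts.most_common(1)[0]
    occupancy = n_frames / len(frame_clusters)
    return cluster_id, occupancy, occupancy >= min_fraction


# Hypothetical labels: 85 frames in cluster 1, 10 in cluster 2, 5 in cluster 3.
labels = [1] * 85 + [2] * 10 + [3] * 5
print(major_cluster(labels))
```

Under this reading, a trajectory whose most populated cluster covers less than 80% of the frames has no major cluster.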


**Table 1.** Comparative analysis of the geometric characteristics of the 3β-corner in a protein environment and autonomously from protein in an aqueous medium (three technical repetitions).

Designations: SASA is the solvent-accessible surface area, Å<sup>2</sup>; Rg is the gyration radius; BH is the hydrogen bond; and PDB structure \* is the experimental structure extracted from the PDB. Values are given for the conformer in the major cluster of the MD trajectory. Values are given for the structure of the 3β-corner in the protein; MD: 3β-corner \*\* denotes the results of the MD experiment for the autonomous 3β-corner outside the protein globule; MD: whole protein \*\*\* denotes the MD experiment of a whole protein containing a 3β-corner; the results are shown for the whole protein (three technical repetitions).

The values of the solvent-accessible area, gyration radius, and number of hydrogen bonds of the 3β-corner structure were determined before the start of the experiment (experimental structure from the PDB bank), and their changes throughout the course of the experiment were analyzed. Table 1 illustrates the values of the studied parameters for the major cluster of the 3β-corner that participates in the experiment autonomously (outside the protein molecule) and as a part of the protein. The table lists the number of hydrogen bonds. For example, for 2E3I in the initial structure, 14 were found. Table 1 lists the RMSD values calculated for these structures during the MD experiment. For example, the following were observed for 2E3H:


These RMSD values were relatively small and support the stability of the 3β-corner outside the protein molecule. The values of the studied characteristics for 2E3I, chain A, region: A62–A93, also speak in favor of the stability of the 3β-corner outside the protein molecule.

To check the stability of the studied 3β-corners in the course of a computational experiment, we recorded the torsion angles of each amino acid residue of the structure to analyze changes in conformation. A conformational description of all residues of the studied structures was performed according to the distribution of the limiting values of the angles ϕ and ψ on the Ramachandran map for all possible regions. Figure 2 shows the distributions of the angles ϕ and ψ on the Ramachandran map for the amino acid residues of the 3β-corners of 2E3H and 2E3I during the MD experiment. The figure contains two maps: a Ramachandran plot for all amino acid residues and one for glycine. Green dots are the calculated angles for the initial structure (PDB), yellow dots are the calculated angles for the major cluster of the 3β-corner in the protein, and black dots are the angles for the major cluster of the 3β-corner that participated in the autonomous experiment. Since the amino acid sequence of the motif under study does not contain proline, the analysis of the allowed proline regions was excluded. The figure shows that the investigated amino acid residues in the 3β-corner did not leave the allowed regions. The angles of the studied amino acids for the original structure, the major cluster of the 3β-corner within the protein, and the major cluster of the 3β-corner that participated in the experiment autonomously were always located close to each other. All these observations (as well as the other parameters analyzed in the MD experiment) also favored the stability of this structure.
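For readers who wish to reproduce the torsion-angle analysis, the following sketch shows one common way to compute a dihedral angle from four atomic positions. It is an illustration only: the function name is hypothetical, the coordinates in the example are artificial, and the original study may have used a dedicated analysis tool instead. The angle ϕ of residue i is obtained from the atoms C(i−1), N(i), Cα(i), C(i), and ψ from N(i), Cα(i), C(i), N(i+1).

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points p0-p1-p2-p3."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Components of b0 and b2 perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Artificial coordinates: a planar trans arrangement and a perpendicular one.
print(dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, -1, 0)))
print(dihedral_deg((0, 1, 0), (0, 0, 0), (1, 0, 0), (1, 0, 1)))
```

Each (ϕ, ψ) pair computed in this way corresponds to one point on the Ramachandran map.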

**Figure 2.** Distribution of angles ϕ and ψ on the Ramachandran map of the studied amino acid residues of the 3β-corners of 2E3H, chain A, region: 216–246 and 2E3I, chain A, region: A62–A93 during the MD experiment. The calculated angles are highlighted: for the initial structure, in yellow; the angles for the major cluster of the 3β-corner in the protein composition, in green; and the angles for the major cluster of the 3β-corner that participated autonomously in the experiment, in black. (**a**) Ramachandran map for all amino acid residues (except for amino acids highlighted on a separate map) and glycine GLY for 2E3H. (**b**) Schematic representation of the 2E3I of the original motif (green), the major cluster of the 3β-corner within the protein (red), and the major cluster of the 3β-corner that participated autonomously in the experiment (grey). (**c**) Ramachandran map for all amino acid residues. GLY, amino acid before proline; Pre-PRO, proline PRO for 2E3H. (**d**) Schematic representation of the 2E3I of the motif (green), the major cluster of the 3β-corner within the protein (red), and the major cluster of the 3β-corner that participated autonomously in the experiment (grey).

Figure 2b,d illustrates the overlay of the original motif, the major cluster of the 3β-corner within the protein, and the major cluster of the 3β-corner that participated in the experiment outside the protein (autonomously). The elements of the secondary structure, the β-strands, coincide completely. A minimal deviation of the irregular regions (constrictions) was observed between the three analyzed images of the studied 3β-corner. However, the magnitude of the deviation, that is, the distance between the irregular sections of the structures, was insignificant. This observation indicates that this super-secondary structure is a stable autonomous structure in an aqueous environment. The Ramachandran maps showed the distribution of the angles ϕ and ψ for the studied amino acid residues of the two 3β-corners during the MD experiment. For both studied structures, most of the amino acid residues had similar ϕ and ψ values in the major clusters of the 3β-corners within the protein and in the major clusters of the autonomous 3β-corners. This behavior of the 3β-corner structures is typical and not limited to the cases described.

In the second stage, we examined the possibility of reducing the duration of the MD experiment for autonomous 3β-corners in an aqueous medium. A comparative analysis was conducted on the behavior of the 3β-corner structures in a "long" trajectory (300 ns) and a "short" trajectory (20 ns). Several arguments supported reducing the duration of MD. Throughout most of our experiments, we observed that clustering the MD trajectory allowed us to clearly identify the conformer that inhabits the trajectory at least 80% of the time. In the MD experiment, we did not observe any significant changes in the values of the geometric characteristics of the 3β-corners between the "long" and "short" trajectories. A justified reduction in MD time makes it possible to significantly improve computational performance.

Results of the "long" and "short" MD for several 3β-corner structures excised from the 1OMV, 2CX8, 1V6Z, and 2BJQ proteins (PDB IDs; Figure 3) showed that the lower RMSD values in both the "long" and "short" trajectories correspond to sections of the amino acid sequences identified as β-strands (in Figure 3, black lines at the bottom). In contrast, unstructured regions (Figure 3, blank regions at the bottom) were subject to greater variability in MD experiments, which was most prominent for 2BJQ, region A32–A62 (Figure 3d). Similar RMSD values were observed in the "long" and "short" MD experiments (Table S2).


**Figure 3.** Comparison of changes in the RMSD values along the "long" (red) and "short" (blue) MD trajectories. Unstructured sections of the polypeptide chain are marked by gaps, whereas β-strands correspond to black lines at the bottom of each panel. Average RMSD values and the associated standard deviations are shown. Typical cases are presented for structures of the 3β-corner type: 1OMV: region A605–A634 (**a**), 2CX8: region A33–A60 (**b**), 1V6Z: region A33–A64 (**c**), and 2BJQ: region A32–A62 (**d**). All measurements were made in three technical repetitions.

On the basis of the obtained results of the "long" and "short" MD, we assert that the 3β-corners in the composition of the studied set retained their stability in the aqueous environment. The optimal time for the MD experiments was 20 ns. Further reduction in the duration of MD was deemed inappropriate. The next section describes the MD results for the entire dataset.

#### *2.2. Control Experiments*

Control experiments were performed with several types of protein structures that were close to the 3β-corner structures in the length of the amino acid sequence but differed in spatial arrangement (Figure 4). The negative controls (Figure 4) were structures that contained neither β-strands nor α-helices. The selected test structures were characterized by a radius of gyration and solvent-accessible areas comparable to those of the study set of 3β-corner structures (Table S3).

**Figure 4.** A set of overlapped protein structures used as a negative control. (**a**) Brain tumor protein (*Drosophila melanogaster*), PDB ID 1Q7F, region: A761–A789; (**b**) exosome complex component Rrp4 (*Archaeoglobus fulgidus*), PDB ID 2BA0, region: A3–A21; (**c**) histone-lysine N-methyltransferase EHMT1 (*Homo sapiens*), PDB ID 3HNA, region: A993–A1016; (**d**) putative pectate lyase L (*Bacteroides thetaiotaomicron*), PDB ID 5OLQ, region: A58–A80. Color-scale legend: green indicates the structure within the protein (PDB data); blue, pink, and yellow indicate technical repetitions; \*\* denotes chain A, region (PDB).

The MD experiments revealed changes in the geometry of the negative control structures and of the structures in the 3β-corner dataset. The results of MD experiments in three technical repetitions (20 ns) revealed a high variability of RMSD values and a low number of hydrogen bonds among the selected negative control structures, which characterizes them as unstructured tangles (Figure 4). Nevertheless, visualization before and after the MD experiment showed that the structures 1Q7F and 2BA0 formed a small β-hairpin and one turn of an α-helix, respectively (according to STRIDE), in only one of the technical replicates. Meanwhile, the gyration radius and solvent-accessible area were significantly smaller compared to those for the 3β-corner set (Table S3).

#### *2.3. Results of the MD Experiment for 3β-Corners*

Comparative analysis was performed for the 330 experimental PDB structures under study; following the MD run, the number of hydrogen bonds, transitions between permitted and prohibited areas of the Ramachandran map (Figure 5), and solvent-accessible areas (the number of water molecules in contact, or residual water, in Å<sup>2</sup>) were calculated.

**Figure 5.** Comparative analysis of MD geometric characteristics for 330 3β-corner structures: experimental structures (exp) versus structures after MD (data are shown in three technical replicates). Distribution of (**a**) solvent-accessible surface area (SASA, Å<sup>2</sup>); (**b**) gyration radius (Rg); (**c**) the number of hydrogen bonds (HB); (**d**) RMSD (nm). All the structures involved in the experiment were placed along the OX axis. The order of structures along the OX axis was generated separately for each characteristic and arranged in descending order (from maximum to minimum). Red lines indicate measurements for the structures of the control set (negative control); green lines indicate measurements of selected motifs within a whole protein molecule; black lines designate measurements of characteristics for 3β-corners of the target dataset.

Results of MD experiments for all structures examined in three technical repetitions are summarized in Figure 5. Variation of solvent-accessible area (Figure 5a) of the experimental 3β-corner structures during molecular dynamics was estimated as

$$\Delta \text{SASA} = \text{SASA(EXP)} - \text{SASA(MD)} \tag{1}$$

A value of zero on the OY axis (marked with a black line) indicates that the solvent-accessible area of the structure was unaltered during the MD experiment, i.e., the indicator remained equal to its value before the MD experiment (Equation (1)). Most of the structures were localized close to y ≈ 0, meaning only a slight change in the measured parameter during MD modeling. Even where the measured value was not equal to zero, most of the structures lay along the selected axis (y = 0). The SASA values were calculated for three technical replicates and are represented by value spreads; the largest spreads were found among structures of the control set (negative control; Figure 5a, red lines), suggesting instability of these structures. At the same time, SASA showed minimal variation across the three replicates for the selected motifs that participated in the MD experiment as part of a protein molecule, which implies the stability of these structures. Values of ΔSASA were also characterized by a small spread across the three technical repetitions among 3β-corner structures of the target dataset.

Similarly, most of the tested structures demonstrated a slight change of the radius of gyration (Equation (2)) through three replications of the MD experiment (Figure 5b).

$$
\Delta \text{Rg} = \text{Rg(EXP)} - \text{Rg(MD)} \tag{2}
$$

The ΔRg values of the control set structures were also characterized by a small spread, and the behavior of these structures for this indicator fell within the general scenario.

The number of hydrogen bonds (Equation (3)) is yet another important characteristic of the studied structures. The number of hydrogen bonds and their quantitative change were determined for each structure through the MD experiment (Figure 5c).

$$
\Delta \text{HB} = \text{HB(EXP)} - \text{HB(MD)} \tag{3}
$$

The zero value on the OY axis (y = 0) means the absence of quantitative changes in hydrogen bonds during the experiment. Most of the structures lost only a minimal number of hydrogen bonds during the MD experiment, suggesting the stability of these structures. Moreover, the spread of ΔHB across the technical replicates of MD was minimal in the majority of the tested structures, suggesting high stability of the 3β-corners (Figure 5c). The number of hydrogen bonds broken among the studied structures reached 10 and ranged from 0% to 50% of the total number of bonds present before the experiment. The ΔHB spread was insignificant in the negative control set, probably because structures of the control set did not have a large number of hydrogen bonds initially; hence, such protein blocks did not have a hydrophobic core stabilizing the structure.
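Equations (1)–(3) share the same form: the value after MD is subtracted from the value in the experimental structure, and the result is examined across the three technical replicates. A minimal sketch of this bookkeeping is given below. The function name is hypothetical, the spread is taken as the difference between the largest and smallest replicate value (an assumption, since the text does not define it), and the replicate values in the example are invented for illustration.

```python
from statistics import mean


def delta_metric(exp_value, md_values):
    """Return (mean delta, spread) with delta = EXP - MD for each replicate.

    exp_value: characteristic measured in the experimental (PDB) structure.
    md_values: the same characteristic after MD, one value per replicate.
    """
    deltas = [exp_value - v for v in md_values]
    return mean(deltas), max(deltas) - min(deltas)


# Invented example: 14 hydrogen bonds initially; 13, 12 and 14 after three MD runs.
mean_delta, spread = delta_metric(14, [13, 12, 14])
print(mean_delta, spread)
```

A mean Δ close to zero with a small spread corresponds to the points lying near y = 0 in Figure 5.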

The mean RMSD value over three replicates of the MD experiments was less than approximately 5 Å for the structures of the target dataset in nearly every case. This supports the hypothesis that the 3β-corner is a stable structure. In contrast, structures of the negative control set showed a wider spread of RMSD, significantly exceeding 5 Å and reaching up to 10 Å, suggesting low stability of the control set structures.

More detailed information on the results of the MD testing of the studied structures can be found in Tables S2 and S4. Specifically, the radius of gyration, the solvent-accessible area, the number of hydrogen bonds, and the retention of amino acid residue positions within a certain zone of the Ramachandran map were calculated and collected for each of the examined 330 structural motifs (Table S4). These data were associated with the coordinates of the motifs (3β-corners) in the tested proteins and with the average RMSD value for each motif. Only 18 of the 330 3β-corner structures had average RMSD values over three MD replicates greater than 5 Å, whereas 50 structures were characterized by RMSD values of less than 2 Å, and the rest of the structures fell in the range between 2 and 5 Å (Table S4). The narrow range of calculated values attests to the stability of the 3β-corner, prompts a fresh look at this structure, and allows it to be considered an independent object of research in the field of structural biology. In contrast, structures of the control set were characterized by greater variability of the calculated parameters.
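The grouping by average RMSD described above (below 2 Å, between 2 and 5 Å, above 5 Å; by subtraction, 330 − 18 − 50 = 262 structures fall in the middle group) can be sketched as follows. The function name, the treatment of the boundary values, and the example RMSD values are assumptions made for illustration; the actual per-structure values are those of Table S4.

```python
from collections import Counter


def rmsd_class(mean_rmsd_angstrom):
    """Assign an average RMSD (in Å) to one of the three ranges used in the text."""
    if mean_rmsd_angstrom < 2.0:
        return "<2"
    if mean_rmsd_angstrom <= 5.0:
        return "2-5"
    return ">5"


# Invented average RMSD values (Å) for five hypothetical structures.
example = [1.4, 2.7, 3.9, 5.6, 1.9]
print(Counter(rmsd_class(v) for v in example))
```

Note that gmx rms reports RMSD in nm, so values taken directly from its output would need to be multiplied by 10 before applying these thresholds.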

An interesting result emerged from the analysis of probable contacts between amino acid residues involved in the organization and stabilization of the hydrophobic core of 3β-corners. We observed that hydrophobic amino acid residues account for about 40% (Table S5) of the total number of residues in the selected set of structures. Analysis of probable contacts between amino acid residues revealed a bimodal distribution of distances between interacting hydrophobic amino acids (Figure 6).

**Figure 6.** Distribution of distances between interacting hydrophobic amino acid residues in 3β-corners within one β-strand ("inside-strand") and between different β-strands ("cross-strand"): red indicates contacts within one strand, blue indicates contacts of hydrophobic amino acid residues between β-strands A and B, yellow designates contacts between β-strands B and C, and green specifies contacts between β-strands A and C.

The number of contacts arranged within one β-strand (inside-strand) was only a part of the likely contacts (Figure 6, red color), while most of the contacts were attributed to interactions of amino acid residues localized in different β-strands ("cross-strand").

Two maxima can be identified in the distribution of distances between interacting hydrophobic amino acid residues in the 3β-corners, arising from (1) the "inside-strand" type interactions and (2) contacts between β-strands (A → B, B → C, and A → C). The first maximum of the distribution of distances between contacting amino acids was at 6.8–7 Å and the second maximum was at 9.4–9.6 Å. It should be noted that the distance of 9.4–9.6 Å was characteristic of contacts between the first and the third β-strands (A → C) of 3β-corners since these elements of SSSs (by definition) were localized on different orthogonal planes, so the distances between them were greater. Thus, the 3β-corner structures were distinguished by a high proportion of hydrophobic amino acid residues and "cross-strand"-type contacts, unique compact stacking, and autonomous stability in a molecular dynamics experiment.

#### **3. Discussion**

Research on SSSs is important due to unique and compact spatial packing of the polypeptide chain, which makes it possible to consider SSSs as possible nuclei for protein folding. Studies on the modeling of protein structures and stability of the obtained proteins have been conducted using molecular dynamics simulation experiments [13,14]. The present study analyzed the autonomous stability of the 3β-corner-type structural motif in an aqueous medium outside the protein environment. The MD study of the structural motif, which is small (in terms of the number of atoms) compared to the entire protein molecule, is characterized by high performance, relatively low computational requirements, and low experimental costs.

This study aimed to substantiate the possibility of studying the structure of the 3β-corner type outside the protein globule. To do so, we selected a set of 330 3β-corner structures extracted from the PDB. The generated structure database facilitated a comprehensive study of the characteristics of the motif when examining a larger set.

We conducted molecular dynamics experiments to establish the autonomous stability of a super-secondary structure of the 3β-corner type outside the protein environment. The stability of conformational templates was analyzed on the basis of the distribution of the angles ϕ and ψ on the Ramachandran map during the experiment. A conformational description of the amino acid residues of the studied structures was performed according to the distribution of the limiting values of the angles ϕ and ψ on the Ramachandran map for all possible regions. The change in the conformation of the motif under study was analyzed in the molecular dynamics experiment by recording how frequently the torsion angles of the amino acid residues fell within particular regions of the map.

In the study, we determined that the structures of 3β-corners behave in the same way in the MD experiment. Thus, analysis of the changes in geometric characteristics did not reveal significant differences between the experimental structures extracted from the PDB bank and the autonomous structures of the 3β-corners after MD operation. On the basis of the obtained results, we recommend an MD duration of 20 ns.

#### **4. Materials and Methods**

#### *4.1. Dataset of 3β-Corner Structures*

In this study, we focused on a simple type of SSS, the 3β-corner. The aim of this study was to analyze the stability of the 3β-corner as an autonomous unit, i.e., outside the protein globule in an aqueous environment. A set of 330 structures of the 3β-corner type was selected from the Protein Data Bank (PDB) (https://www.rcsb.org/, accessed on 5 August 2022) to perform MD experiments, followed by analysis of the root mean square deviation, change in gyration radius, solvent-accessible area, lifetime of the major conformer, torsion angles, and the number of hydrogen bonds, which are reported to characterize the stability of proteins. Super-secondary 3β-corner structures are widely distributed, and there are a few small proteins in nature consisting of only the 3β-corner and short irregular regions [15,16].

The collected dataset was assembled from β-proteins, most of which contained small β-barrels according to the SCOP [17] classification (https://scop.berkeley.edu/, accessed on 5 August 2022): folds b.34 (SH3-like; 21 superfamilies), b.43 (common domain of reductase/isomerase/elongation factor; three superfamilies), b.47 (trypsin-like serine proteases; one superfamily), b.55 (PH domain-like; one superfamily), etc. (see Table S6) [17–19]. Small β-barrels are closed structures in which the first and last β-strands are stabilized by hydrogen bonds [20]. The β-barrel structures can also be represented as two orthogonally arranged β-sheets [21]. Such β-barrel structures are characterized by a small number of β-strands (n = 3–5) and an elliptical cross-section with high shear values (S~6–10), which provide a tight fit (a "flattened" ellipse). The 3β-corner structural motifs ordered as "β-Strand → Coil → β-Strand → Coil → β-Strand" were extracted from the β-barrel folds. The assignment of amino acid residues within the selected structures to elements of the secondary structure was performed using three algorithms: STRIDE [22], DSSP [23], and iCn3D [24]. The agreement between the three algorithms used to identify elements of the secondary structure is presented in Table S7. The match rate between STRIDE/DSSP, STRIDE/iCn3D, and DSSP/iCn3D was approximately 50% (Table S7). β-Strands in the 3β-corner structure are usually short, typically consisting of 4–6 and up to 10 amino acid residues, and are connected by loops. Short loops (from three to seven amino acid residues) form a turn, while long loops (8–12 amino acid residues) are characterized by an unstructured shape. The studied dataset was represented by motifs extracted from proteins of various origins and containing constituent elements of different lengths (Figure 7 and Table S4).
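How the match rate between two secondary-structure assignments was computed is not specified here. One simple possibility, shown below purely as an illustration, is the fraction of residues that receive the same label from both algorithms, after the labels of each tool have been mapped to a shared alphabet. The function name and the example strings (E for strand, C for coil) are hypothetical.

```python
def match_rate(assign_a, assign_b):
    """Fraction of residues with identical labels in two per-residue assignments.

    Both inputs are strings of equal length with one label per residue,
    already expressed in the same alphabet (an assumption of this sketch).
    """
    if len(assign_a) != len(assign_b):
        raise ValueError("assignments must cover the same residues")
    if not assign_a:
        raise ValueError("assignments must not be empty")
    same = sum(a == b for a, b in zip(assign_a, assign_b))
    return same / len(assign_a)


# Invented 12-residue example: the two assignments disagree at two positions.
print(match_rate("CEEEECCEEEEC", "CCEEECCEEECC"))
```

Other definitions, for example agreement on whole β-strand boundaries rather than on single residues, would give different numbers for the same data.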

**Figure 7.** Origin of proteins in which 3β-corners were identified (**a**); length distribution of 3β-corners in the examined dataset (**b**).

SSSs were selected for homologous and non-homologous proteins of various origins (Figure 7a). Most of these proteins belong to humans, animals, and bacteria. There are also small groups of plant and viral proteins (Figure 7a). This observation is consistent with the variety of annotated protein structures in the PDB for different entities.

The analysis of lengths amongst selected 3β-corners showed that the majority of studied structures were 25–40 amino acids in length. Moreover, depending on the length, the maximum distribution of 3β-corners fell on 30–35 amino acid residues (Figure 7b), and only a few of the 3β-corners consisted of 45–55 amino acid residues.

The gathered 3β-corners were extracted from the PDB and SCOP and are presented in Table S6.

#### *4.2. Molecular Dynamics Simulation*

Simulations were performed using the GROMACS software package (version 2021.4). The simulation process was identical for every simulation except for the simulation time (20 or 300 ns) and some parameters that depended on the size and charge of each specific molecule and were calculated automatically. The configuration and procedure had already been used in our previous studies [25–27] but were modified for the current research. Molecules were prepared with the pdb2gmx tool (GROMACS, version 2021.4), which adds missing hydrogen atoms and builds the correct system topology using the CHARMM36 force field converted for GROMACS. Molecules were placed into rectangular boxes for simulation. The box size was automatically configured (with the gmx editconf tool) for each simulation depending on the protein size, leaving at least 1.0 nm between the molecule and the box borders.

Boxes were then filled with water (gmx solvate tool, GROMACS, version 2021.4). Water was represented with the SPC model, selected as a compromise between simulation performance and realistic behavior under normal conditions (see [28] for a comparison of water models). The simulation system was also neutralized with Na+ or Cl− ions. GROMACS automatically calculated the charge of the system and replaced some of the water molecules in the box with ions to make the system neutral (using the gmx genion tool). If the charge of the system was negative, positive (sodium) ions were added; if the charge was positive, negative (chloride) ions were added.
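The preparation steps described above map onto a short sequence of GROMACS command-line calls. The sketch below only assembles the argument lists and prints them; nothing is executed. All file names, the force-field directory name `charmm36`, the ion names `NA` and `CL`, and the helper `build_prep_commands` are assumptions for illustration and may differ from the authors' configuration, which is available in their repository.

```python
def build_prep_commands(pdb="corner.pdb", force_field="charmm36", margin_nm=1.0):
    """Return the preparation steps as argument lists (nothing is executed)."""
    return [
        # Add missing hydrogens and write the topology.
        ["gmx", "pdb2gmx", "-f", pdb, "-o", "prot.gro", "-p", "topol.top",
         "-ff", force_field, "-water", "spc"],
        # Define the box, keeping at least margin_nm between solute and box edge.
        ["gmx", "editconf", "-f", "prot.gro", "-o", "box.gro",
         "-bt", "triclinic", "-d", str(margin_nm)],
        # Fill the box with SPC water.
        ["gmx", "solvate", "-cp", "box.gro", "-cs", "spc216.gro",
         "-o", "solv.gro", "-p", "topol.top"],
        # genion reads a run input file, so preprocess first.
        ["gmx", "grompp", "-f", "ions.mdp", "-c", "solv.gro",
         "-p", "topol.top", "-o", "ions.tpr"],
        # Replace water molecules with counter-ions until the net charge is zero.
        ["gmx", "genion", "-s", "ions.tpr", "-o", "neutral.gro",
         "-p", "topol.top", "-pname", "NA", "-nname", "CL", "-neutral"],
    ]

for cmd in build_prep_commands():
    print(" ".join(cmd))
```

Keeping the commands as lists makes it straightforward to pass each one to `subprocess.run` later, once the file names have been adapted to a real system.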

The next preparation step was energy minimization (EM), a short dedicated simulation performed using the steepest descent algorithm. Minimization ran until the maximum force fell below 1000.0 kJ/mol/nm or until 1000 steps had been completed (0.002 ps per step [29]).

The final preparation step was a heating pre-run: the simulation system was heated from 5 to 311 K over 200 ps with a 0.001 ps step using the v-rescale temperature coupling algorithm. The molecular dynamics simulation itself ran for 20 or 300 ns (the simulation time of each run is given in the corresponding sections of the study). The temperature during the simulation was kept at 311 K. Newton's equations of motion were integrated with the GROMACS default leap-frog algorithm. The simulation step was 0.002 ps.
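The run lengths quoted above translate directly into integration step counts, which is what a GROMACS parameter file ultimately needs. The following sketch does that arithmetic and also evaluates a target temperature during heating. It assumes that the heating step of 0.001 is in picoseconds and that the ramp from 5 to 311 K is linear; both are assumptions, and the function names are illustrative.

```python
def n_steps(duration_ps, dt_ps):
    """Number of integration steps needed to cover duration_ps."""
    return round(duration_ps / dt_ps)

def ramp_temperature(t_ps, t_start=5.0, t_end=311.0, ramp_ps=200.0):
    """Target temperature (K) at time t_ps, assuming a linear ramp."""
    frac = min(max(t_ps / ramp_ps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)

print(n_steps(200, 0.001))       # heating pre-run
print(n_steps(20_000, 0.002))    # 20 ns production run
print(n_steps(300_000, 0.002))   # 300 ns production run
print(ramp_temperature(100.0))   # halfway through the heating pre-run
```

Under these assumptions the 300 ns runs require fifteen times as many steps as the 20 ns runs, which is the main driver of their computational cost.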

Resulting trajectories were processed using GROMACS analysis utilities such as gmx rms, gmx gyrate, and gmx sasa. Clustering was performed using the gmx cluster utility. The cut-off value for clustering varied depending on the result of a first clustering run with a cut-off of 0.3 nm: if the result contained > 12 clusters, clustering was redone with a larger cut-off value until the cluster amount reached 20. Moreover, if there were < 8 clusters, clustering was redone with a smaller cut-off. The step for raising/lowering the cut-off was set to 0.01 nm, and the limit on the number of clustering attempts was set to 30.
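The adaptive choice of cut-off can be sketched as a small loop. This is a minimal illustration, assuming that the aim is to end up with between 8 and 12 clusters; that acceptance window is an assumption read from the thresholds quoted above. `count_clusters` is a hypothetical callable standing in for one clustering run, and `tune_cutoff` and `toy_count` are illustrative names, not part of GROMACS or of the authors' scripts.

```python
def tune_cutoff(count_clusters, start=0.30, step=0.01,
                low=8, high=12, max_attempts=30):
    """Adjust the cut-off (nm) until low <= cluster count <= high.

    Returns (cutoff, n_clusters, attempts_used). The search stops after
    max_attempts clustering runs even if the window was not reached.
    """
    cutoff = start
    n = count_clusters(cutoff)
    attempts = 1
    while not (low <= n <= high) and attempts < max_attempts:
        # A larger cut-off merges clusters; a smaller one splits them.
        cutoff = round(cutoff + step if n > high else cutoff - step, 2)
        n = count_clusters(cutoff)
        attempts += 1
    return cutoff, n, attempts

# Toy stand-in for a clustering run: the count falls as the cut-off grows.
def toy_count(cutoff):
    return int(6.0 / cutoff)

print(tune_cutoff(toy_count))
```

The attempt limit matters because a trajectory may jump from too many clusters to too few within a single 0.01 nm step, in which case the loop would otherwise oscillate indefinitely.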

RMSD changes were calculated for the backbone for all trajectories after the heating step. These data, in XVG format, were aggregated with our own purpose-written scripts. Intact values were taken from the initial PDB file or from the first step of the trajectory. MD result values were the averages over the trajectory parts in which the structure conformation matched the major cluster in the final clustering result.
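The aggregation step can be illustrated with a short parser for two-column XVG text followed by an average restricted to frames of the most populated cluster. This is a sketch under stated assumptions: the per-frame list of cluster identifiers is taken as given, and the names `read_xvg` and `major_cluster_mean` are hypothetical rather than taken from the authors' scripts.

```python
from collections import Counter

def read_xvg(text):
    """Parse two-column XVG text into (time, value) pairs.

    Lines starting with '#' (comments) or '@' (plot directives) are skipped.
    """
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line[0] in "#@":
            continue
        t, v = line.split()[:2]
        rows.append((float(t), float(v)))
    return rows

def major_cluster_mean(rows, cluster_ids):
    """Average the values of frames assigned to the most populated cluster."""
    major = Counter(cluster_ids).most_common(1)[0][0]
    picked = [v for (_, v), c in zip(rows, cluster_ids) if c == major]
    return major, sum(picked) / len(picked)

demo = """# toy RMSD series
@ title "RMSD"
0 0.10
10 0.20
20 0.40
30 0.30
"""
print(major_cluster_mean(read_xvg(demo), [1, 1, 2, 1]))
```

Averaging only over frames of the major cluster keeps short excursions into minor conformations from skewing the reported value.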

All the GROMACS configuration files are available at the following link: https://github.com/protdb/beta-corner-stability/tree/master/GROMACS%20configuration (accessed on 5 August 2022).

A total of 330 structures were used in the MD experiments. In this study, the following MD experiments were carried out for SSS of the 3β-corner type:


According to the MD results, the changes in the following geometric characteristics of the SSS were analyzed:


Data on the conditions of MD experiments are available at the following link: https://github.com/protdb/beta-corner-stability (accessed on 5 August 2022).

#### **5. Conclusions**

It can be concluded that in an aqueous environment, the 3β-corner is rather stable per se, i.e., without the rest of the protein molecule, and that it can act as a nucleus or "ready-made" building block in protein folding. The 3β-corner can also be considered an independent object of study in the field of structural biology.

Understanding protein architecture, and recognizing and studying the individual structural blocks of proteins with unique and compact polypeptide chain folds, provides an opportunity to gain insight into the structure, geometry, internal contacts, and patterns of organization of the structural motifs of protein molecules. This knowledge provides a strong basis for addressing fundamental problems, such as prediction and modeling of three-dimensional protein structures, protein folding, and structural classification of proteins, as well as applied problems such as developing novel approaches for disease diagnosis, improving our understanding of pathogenesis, identifying drug targets, and designing proteins with desired properties (mimetics).

We annotated a set of 3β-corner SSSs in 330 structures. As a result of the MD experiments, we showed the autonomous stability of this type of super-secondary structure and demonstrated the possibility of analyzing them as independent objects. We observed that key geometric characteristics were preserved after MD of autonomous 3β-corner structures in comparison with the experimental structures extracted from the PDB. We present an annotated set of 3β-corners for research in the fields of structural biology and biomedicine.

**Supplementary Materials:** The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/ijms231911674/s1.

**Author Contributions:** Conceptualization, A.L.K., L.I.K. and V.R.R.; methodology, K.S.N., D.V.P. and V.R.R.; software, K.S.N., A.A.S. and D.V.P.; validation, K.S.N., D.V.P. and A.V.E.; investigation, V.R.R., D.V.P. and A.T.K.; data curation, A.M.K. and L.I.K.; writing—original draft preparation, A.L.K., L.I.K. and K.S.N.; writing—review and editing, V.R.R., A.V.E. and K.A.M.; visualization, K.S.N., K.A.M. and D.V.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** The work was conducted in the framework of the Russian Federation Fundamental Research Program for the long-term period for 2021–2030 (No. 122092200056-9).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The authors are grateful to A. V. Veselovsky for helpful discussions and Amir Taldaev for organizing calculations at the computational facilities of the Tyumen State University (Tyumen, Russia). Simulations were performed using the equipment of the shared research facilities of the HPC computing resources at Lomonosov Moscow State University and the Joint Supercomputer Center of the Russian Academy of Sciences.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Deep-Learning to Predict BRCA Mutation and Survival from Digital H&E Slides of Epithelial Ovarian Cancer**

**Camilla Nero <sup>1,</sup>\*, Luca Boldrini <sup>2</sup>, Jacopo Lenkowicz <sup>2</sup>, Maria Teresa Giudice <sup>1</sup>, Alessia Piermattei <sup>3</sup>, Frediano Inzani <sup>3</sup>, Tina Pasciuto <sup>4</sup>, Angelo Minucci <sup>5</sup>, Anna Fagotti <sup>1</sup>, Gianfranco Zannoni <sup>3</sup>, Vincenzo Valentini <sup>6</sup> and Giovanni Scambia <sup>1</sup>**


**Abstract:** BRCA 1/2 mutation status already determines the therapeutic algorithm for patients with high-grade serous ovarian cancer. Nevertheless, its assessment is not sufficient to identify all patients with genomic instability, since BRCA 1/2 mutations are only the best-known mechanism of homologous recombination deficiency (HR-d), and patients displaying HR-d behave similarly to BRCA-mutated patients. HR-d assessment can be challenging and is progressively superseding BRCA testing, not only for prognostic information but, more importantly, for drug prescription. However, HR testing is not yet integrated into clinical practice, is quite expensive, and is not reimbursed in many countries. Selecting patients who are more likely to benefit from this assessment (BRCA 1/2 WT patients) at an early stage of the diagnostic process would allow genomic profiling resources to be optimized. In this study, we sought to explore whether somatic BRCA 1/2 status can be predicted using computational pathology from standard hematoxylin and eosin histology. In detail, we adopted a publicly available, deep-learning-based weakly supervised method that uses attention-based learning to automatically identify subregions of high diagnostic value to accurately classify the whole slide (CLAM). The same model was also tested for progression-free survival (PFS) prediction. The model was tested on a cohort of 664 (training set: *n* = 464, testing set: *n* = 132) ovarian cancer patients, of whom 233 (35.1%) had a somatic BRCA 1/2 mutation. An area under the curve of 0.7 and 0.55 was achieved in the training and testing sets, respectively. The model was then further refined by manually identifying areas of interest in half of the cases: 198 images were used for training (126/72) and 87 images for validation (55/32). The model reached a zero classification error on the training set, but the performance was 0.59 in terms of validation ROC AUC, with a 0.57 validation accuracy. Finally, when applied to predict PFS, the model achieved an AUC of 0.71, with a negative predictive value of 0.69 and a positive predictive value of 0.75. Based on these analyses, we have planned further steps of development, such as providing a reference classification performance, exploring the hyperparameter space for training optimization, and possibly tweaking the learning algorithms and the neural network architecture to better suit this specific task. These actions may allow the model to improve its performance for all the considered outcomes.

**Keywords:** ovarian cancer; somatic BRCA mutational status; digital pathology; machine learning; artificial intelligence

**Citation:** Nero, C.; Boldrini, L.; Lenkowicz, J.; Giudice, M.T.; Piermattei, A.; Inzani, F.; Pasciuto, T.; Minucci, A.; Fagotti, A.; Zannoni, G.; et al. Deep-Learning to Predict BRCA Mutation and Survival from Digital H&E Slides of Epithelial Ovarian Cancer. *Int. J. Mol. Sci.* **2022**, *23*, 11326. https://doi.org/10.3390/ijms231911326

Academic Editor: Alvaro Galli

Received: 30 August 2022 Accepted: 22 September 2022 Published: 26 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Epithelial ovarian cancer (EOC) is strongly dominated by copy number changes without a focal gene driving mutation. Approximately half of cases exhibit defective DNA repair via homologous recombination (HR), frequently caused by inactivation of the breast cancer susceptibility (BRCA) genes (overall up to 30%) [1–4].

Defective HR (HR-d) reflects underlying genomic instability, which has significant therapeutic implications in EOC. It is associated with striking platinum sensitivity and can be targeted by poly-ADP ribose polymerase inhibitors (PARPi).

Although HR assessment is progressively superseding BRCA testing, it is not yet integrated into clinical practice, can be challenging and expensive, and is still not reimbursed in many countries. Selecting patients who are more likely to benefit from it at an early stage of the diagnostic process (somatic BRCA 1/2 wild type [WT] patients) could allow genomic profiling resources to be optimized.

Histologic phenotypes have been recognized to reflect, to some extent, genetic alterations in cancer tissues [5–10]. BRCA-mutated (BRCA-mut) EOCs were found to show Solid, pseudo-Endometrioid, and Transitional cell carcinoma-like morphology (SET features) more frequently and to have higher mitotic indexes than BRCA WT EOCs [11].

Since hematoxylin and eosin (H&E)-stained tissue slides are ubiquitously available for cancer patients, predicting mutations from tissue slides could be a time- and cost-effective method to characterize patients and refer them for genomic profiling.

Advances in digital pathology and artificial intelligence have made it possible to analyze gigapixel whole-slide images (WSI), providing information on the tissue microenvironment, integrative image-omics, resistance to treatments, and also somatic genomic status directly [12–33].

WSI is a complex domain with several unique challenges that requires different deep-learning-based computational pathology approaches, such as manual annotation of gigapixel WSIs in fully supervised settings or large datasets with slide-level labels in weakly supervised ones.

The delineation of pixel-, patch- or region-of-interest (ROI)-level annotations has produced promising results, although it cannot be directly generalized and may suffer from noisy training labels and reduced reproducibility on data from different sources and imaging devices [34–37].

On the other hand, weakly supervised approaches demonstrated an exceptional clinical-grade performance [38] but require thousands of WSIs to achieve performance comparable to fully supervised and ROI-level classifiers and may not be suitable for multiclass subtyping problems.

A recent paper proposed clustering-constrained-attention multiple-instance learning (CLAM), a high-throughput deep-learning framework that aims to address the key challenges outlined above [39].

The authors demonstrated that this approach can be used to localize well-known morphological features on WSIs without the need for spatial labels, outperforming standard weakly supervised classification algorithms and proving adaptable to independent test cohorts, smartphone microscopy and different tissue content.

In this study, we aimed to identify somatic BRCA 1/2 mutational status directly from H&E slides using a CLAM-based approach. In particular, we planned to evaluate the negative and positive predictive values of the CLAM-based model compared to genomic sequencing as the reference standard. As a secondary endpoint, we aimed to evaluate the accuracy of the CLAM-based model in predicting prognosis, measured as progression-free survival (PFS).

#### **2. Results**

From November 2016 to November 2020, 1265 consecutive patients underwent BRCA 1/2 testing in our institution.

A total of 664 patients were finally analyzed in the current study (see Figure 1). Regarding secondary endpoints, 8 patients were lost to follow-up, and 656 patients were therefore included in that analysis.

**Figure 1.** Flowchart of the study.

Table 1 shows the main clinicopathological characteristics. Overall, the median age of the included patients was 61 years. More than half of the patients had a positive family history of cancer (mainly breast). The vast majority of the population had serous histotype (95.9%), grade 3 (97.1%) and FIGO stage III or IV (92.4%) disease.

**Table 1.** Clinical, pathological and surgical characteristics of the study population.





Results are presented as *n* (%) except where indicated. BMI: Body Mass Index. FIGO: International Federation of Gynecology and Obstetrics. BRCA: BReast CAncer gene. VUS: Variants of Uncertain Significance. PDS: Primary Debulking Surgery. IDS: Interval Debulking Surgery. LPS: LaParoScopy. PIV: Peak Integral Value. \* Information available for 642/664 patients. § Information available for 649/664 patients.

Regarding BRCA 1/2 status, 431 (64.9%) patients were WT while 233 (35.1%) were mutated, 51.5% of whom were BRCA 1 mutated. The specific mutations analyzed are reported in Table 1; frameshift mutations were the most frequent (40.4%).

All mutated patients were referred for genetic counselling and 86.6% were tested for germline BRCA 1/2 pathogenic variants. Of these, 38.2% had a germline alteration.

Table 2 shows treatment and oncological outcome data. Regarding therapeutic choices, 54.5% of these patients underwent neoadjuvant chemotherapy with a median number of cycles prior to interval debulking surgery of 4, after a laparoscopic assessment. Three hundred and forty out of 644 (52.8%) recurred, but only 27.7% died.

**Table 2.** Treatment and oncological outcome of the study population.



Results are presented as *n* (%) except where indicated. IDS: Interval Debulking Surgery. GCIG: Gynecologic Cancer InterGroup. PFI: Platinum-Free Interval. \* Information available for 127/664 patients. § Information available for 243 out of 278 patients. † Information available for 175/278 patients. ‡ Information available for 398/664 patients. II Information available for 494/664 patients. ¥ Information available for 184/278 patients. Ω Information available for 359/664 patients. Ψ Information available for 427/664 patients.

The whole process is represented in Figure 2. The outcome was BRCA 1/2 mutated yes/no, in which VUS were considered as mutated patients.

#### *2.1. Phase 0: Reference Standard for BRCA Status Prediction*

A reference classification performance was established according to the classification accuracy and AUC ROC of an expert pathologist, based on available criteria [11]. On the whole 664 slides of the dataset, the reference performance was as follows: accuracy 0.629, specificity 0.879, sensitivity 0.167, negative predictive value 0.661, positive predictive value 0.428 (TN: 379, TP: 39, FN: 194, FP: 52).
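The reference metrics follow from the four confusion-matrix counts given above. The sketch below recomputes them with the standard definitions; the function name `binary_metrics` is illustrative, and "positive" is taken to mean BRCA 1/2 mutated.

```python
def binary_metrics(tn, tp, fn, fp):
    """Standard diagnostic metrics from a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall on mutated cases
        "specificity": tn / (tn + fp),   # recall on wild-type cases
        "npv": tn / (tn + fn),
        "ppv": tp / (tp + fp),
    }

reference = binary_metrics(tn=379, tp=39, fn=194, fp=52)
for name, value in reference.items():
    print(f"{name}: {value:.3f}")
```

The recomputed values agree with the reported ones to within 0.001; the small differences in the third decimal are consistent with the reported figures having been truncated rather than rounded.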

**Figure 2.** Pipeline of analysis.

#### *2.2. Phase 1: WSI for Somatic BRCA Status Prediction*

The dataset was randomly split into a development set, consisting of a training set and an internal validation set, and a hold-out testing set. The proportion between the development and testing sets was 80% to 20%. Thus, we used 464 images for training (244/220) and 132 images for validation (86/46), and 68 images were held out for testing (44/24).

The performance on the training set was 0.7 in terms of AUC, but on the testing set the AUC was 0.55. In detail, for training set class zero (BRCA wild type) the model correctly identified 153 out of 244 images, while for class one (BRCA mutated) 139 out of 220. In the testing set class zero, the model correctly identified 49 out of 86, while in class one 24 out of 46.

#### *2.3. Phase 2: ROI on WSI for Somatic BRCA Status Prediction*

For this analysis, a subset of images was used because manual ROI delineation is time-consuming. The process of dataset splitting was the same as before, only with slightly different proportions: we used 198 images for training (126/72) and 87 images for validation (55/32), and only three images were held out for testing (2/1), merely to create heatmaps of activation regions on unseen images.

The outcome was BRCA 1/2 mutated yes or no, and VUS were considered as mutated patients.

At the end of training, the model reached a zero classification error on the training set, but the performance of the predictive model on the validation set was 0.59 AUC ROC, with 0.57 validation accuracy (see Figures 3 and 4). The model correctly classified 39 out of 55 class zero (BRCA wild type) images, and 11 out of 32 class one (BRCA mutated) images on the validation set. All of the three held out images were correctly classified.

In particular, the model assigned the held-out BRCA 1/2 mutated image a 98% probability of mutation, while probabilities of 38% and <1% were assigned to the other two held-out images (both WT), respectively.

**Figure 3.** Phase 1: Performance of the model during training (300 epochs). Top: validation AUC ROC. Bottom left: negative predictive value (NPV). Bottom right: positive predictive value (PPV).

**Figure 4.** Phase 2: Performance of the model during training (300 epochs). Top: validation AUC ROC. Bottom left: negative predictive value (NPV). Bottom right: positive predictive value (PPV).

#### *2.4. Phase 3: Exploration of the Hyperparameters Space for Training Optimization*

Given the results obtained in the previous two phases, we sought to improve performance by exploring the CLAM model hyperparameter space through a grid search. To do so, we varied the following hyperparameters: patch level, between zero and 2; attention branch, single or multiple; bag loss function and clustering loss function, each either support vector machine or cross entropy; the relative weight of the two losses in the overall loss function, between 0.2 and 0.8 in steps of 0.1; and the number of highest- and lowest-attention patches fed to the clustering algorithm, chosen from the set 4, 8, 16, 32, 64, 100, 500, 1000. The dataset splitting was the same as in phase 1. The grid was explored by random search, with an early stopping criterion based on the validation loss. For each patch level, the best five experiments in terms of validation AUC were retained for performance assessment on the testing set. None of these hyperparameter combinations led to a significant or even relevant improvement in BRCA classification performance in either the validation or the testing set.
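The size of this search space, and the way a random search draws from it, can be made concrete with a short sketch. The dictionary keys and the helper names `full_grid` and `sample_configs` are hypothetical and are not CLAM option names; patch level is assumed to take the integer values 0, 1 and 2.

```python
import itertools
import random

# Hypothetical key names; the value ranges follow the description above.
SEARCH_SPACE = {
    "patch_level": [0, 1, 2],
    "attention_branch": ["single", "multiple"],
    "bag_loss": ["svm", "cross_entropy"],
    "cluster_loss": ["svm", "cross_entropy"],
    "loss_weight": [round(0.2 + 0.1 * i, 1) for i in range(7)],  # 0.2 ... 0.8
    "n_patches": [4, 8, 16, 32, 64, 100, 500, 1000],
}

def full_grid(space):
    """Every combination of hyperparameter values, as a list of dicts."""
    keys = list(space)
    return [dict(zip(keys, combo)) for combo in itertools.product(*space.values())]

def sample_configs(space, n, seed=0):
    """Draw n distinct configurations at random from the grid."""
    return random.Random(seed).sample(full_grid(space), n)

grid = full_grid(SEARCH_SPACE)
print(len(grid))
print(sample_configs(SEARCH_SPACE, 3))
```

Under these assumptions the full grid contains well over a thousand combinations, which is why sampling a subset at random is a practical alternative to training every one.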

#### *2.5. Phase 4: WSI for Predicting Relapse*

For this analysis, slide resolution was taken to be fixed at the highest available value on the image (called "patch level zero" in CLAM framework). As in the previous steps, the dataset was split into a development set (training and validation) and a testing set. Grid search on hyperparameters was performed on the development set to select the five highest performing models, which were later assessed on the hold-out testing set for predictive performance.

The combination of parameters led to a grid search over 64 different models for each outcome, each trained for up to 200 epochs with an early stopping criterion of 20 epochs without a decrease in loss.
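The stopping rule can be illustrated on a recorded loss curve. This is a minimal sketch, assuming that the monitored quantity is the validation loss and that "non-decreasing" means no new minimum for 20 consecutive epochs; the function name and the toy curve are illustrative.

```python
def train_with_early_stopping(val_losses, max_epochs=200, patience=20):
    """Return (stop_epoch, best_epoch, best_loss) for a sequence of losses."""
    best_loss, best_epoch = float("inf"), 0
    epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # no new minimum for `patience` consecutive epochs
    return epoch, best_epoch, best_loss

# Toy curve: improves for 30 epochs, then plateaus at a higher value.
losses = [1.0 - 0.01 * e for e in range(30)] + [0.8] * 170
print(train_with_early_stopping(losses))
```

With a patience of 20, a model whose loss stops improving early uses only a fraction of the 200-epoch budget, which keeps a 64-model search affordable.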

Of a total of 656 images (229 class 0; 427 class 1), 394 images were assigned to the training set and 131 images to the validation set, and 131 images were held out for testing (46/85). The AUC on the testing set was 0.71 (see Figure 5).

**Figure 5.** Phase 4: classification on testing set with the highest test AUC model. Top: validation AUC ROC. Bottom left: ROC curve on BRCA mut cases. Bottom right: ROC curves on BRCA WT cases.

#### **3. Discussion**

Molecular profiling in cancer patients has become increasingly important for determining the optimal therapeutic strategy. Combining digitized pathology WSIs with deep learning to predict somatic mutations could be a promising approach to achieving a time- and cost-effective complementary method for personalized treatment.

When applied on our dataset, the available morphological criteria (SET features) showed disappointing results (accuracy 0.629, negative predictive value 0.661). This suggests that phenotype and genotype may not be strongly related, as previously suggested [11].

In our phase 1 (high testing and validation errors), we focused on the specific features/patterns within the tissue images that the model used to make its predictions. Looking at the activation maps (heatmaps) of the highest and lowest predictions on the testing set (see Figure 6a–c), we found that the model identified tumor cells in the mutated cases and stroma in the wild-type case, which could reflect the morphological differences previously described, namely solid phenotype and higher mitotic index [11].

However, given the performances, we hypothesize that the highest attention should be focused on tumor areas, where the reported differences might be more evident and thus more useful for outcome prediction. It was necessary to tweak parameters, starting from optimal tissue identification in the segmentation process.

Unfortunately, the process of manually delineating ROI by a dedicated pathologist did not improve the overall performances and neither did the exploration of the hyperparameters space for training optimization, even tweaking the learning algorithms and the neural networks architecture for better suiting the task of BRCA 1/2 status identification.

Several issues and limitations have been encountered during the analysis.

First, the retrospective nature of the study represents an unavoidable source of selection bias and imaging data inhomogeneity.

Although we collected only H&E slides of peritoneal tissue and checked for a minimum percentage of tumor cells in all specimens, patients were not divided into subgroups according to the type of surgery performed; thus, small biopsies obtained from exploratory laparoscopy might have provided less representative specimens and images of lower quality compared to peritonectomies.

Second, the absence of an external validation does not allow us to draw any definitive conclusion on the replicability of the model, though H&E slides are used for cancer diagnosis all over the world. Moreover, we are well aware of concerns about between-center variation in imaging results, which might significantly affect the reproducibility of the model results.

Third, our BRCA 1/2 testing was mainly performed on fresh frozen ovarian cancer tissue. We assumed that all other areas of the disease within the same patient shared the same mutational status, but this assumption may not be entirely correct, given the significant EOC heterogeneity.

Fourth, our patients were only screened for BRCA 1/2: no other genes involved in the HRD pathways were included in the analysis. Therefore, we cannot exclude the presence of mutations in the other genes whose mutational status could correlate with imaging features typical of mutated patients. Moreover, any type of BRCA 1/2 pathogenic variant was labeled as "mutated". There are not enough data to establish whether different mutations produce different downstream effects but we cannot rule out differences in phenotype. This might have affected our analysis.

Fifth, the analysis was carried out using open source pipelines such as the CLAM model which are not customized for the purpose of the study. Entirely in-house designed pipelines, tailored on genomic status identification, might improve final results.

Overall, the model could provide critical information at the very beginning of the diagnostic process and, if proven effective, help tailor further genomic testing (e.g., only BRCA testing or HRD testing) and optimize genomic testing resources.


**Figure 6.** (**a**) Heatmap of the held-out BRCA mutated image. Red areas represent regions with higher model activation, i.e., regions where the model recognizes patterns associated with the somatic BRCA mutation. (**b**) Patch-level top 5 highest attention patterns (WSI). (**c**) Patch-level top 5 highest attention patterns (ROI on WSI).

#### *Our Results in the Context of Other Observations*

Preliminary but encouraging results have been published in the last 3 years on computational pathology.

In 2020, Jang and colleagues showed that APC, KRAS, PIK3CA, SMAD4, and TP53 mutations can be predicted from H&E pathology images using deep learning-based classifiers [40]. The AUCs for ROC curves ranged from 0.693 to 0.809 for frozen WSIs and from 0.645 to 0.783 for the FFPE WSIs.

Xu et al. developed a deep learning model to accurately classify chromosomal instability status in a cohort of 1010 patients with breast cancer (training set: *n* = 858, test set: *n* = 152) from The Cancer Genome Atlas, achieving an area under the curve of 0.822 with 81.2% sensitivity and 68.7% specificity in the test set. Patch-level predictions of chromosomal instability status suggested intra-tumor heterogeneity within slides [41].

Fu et al. in 2020 quantified histopathological patterns across 17,396 H&E-stained histopathology slide images from 28 cancer types and successfully correlated these with matched genomic, transcriptomic and survival data [5].

Moreover, computational histopathology highlighted prognostically relevant areas, such as necrosis or lymphocytic aggregates. The authors underlined the remarkable potential of computer vision in characterizing the molecular basis of tumor histopathology.

Kiehl et al. developed a deep learning model from routine histological slides and/or clinical data to predict lymph node metastasis in colorectal cancer [42]. The deep learning model achieved an AUROC of 71.0%, the clinical classifier achieved an AUROC of 67.0% and a combination of the two classifiers yielded an improvement to 74.1%.

Finally, Wang et al. trained a ResNet deep convolutional neural network on WSIs to predict the gBRCA mutation in breast cancer [43]. One hundred and sixty-six images were combined from two different datasets. In the external validation dataset, the model reached an AUC of 0.766 (0.763–0.769) at 40× magnification. The authors reported the role of histological grade in the accuracy of the prediction.

It must also be acknowledged that new data are emerging regarding the relevance of BRCA status in upfront surgical treatment. Not only do BRCA WT OC patients seem to benefit more than BRCA-mutated ones from hyperthermic intraperitoneal chemotherapy performed at primary debulking surgery, but a neoadjuvant chemotherapy approach has also been suggested to be less detrimental in patients harboring a BRCA mutation [44–46]. If these data are confirmed, the turnaround time of BRCA status acquisition will become crucial. Artificial intelligence applied to digital pathology holds much promise in bringing innovative solutions to this possible future unmet clinical need.

#### **4. Materials and Methods**

#### *4.1. Patients and Study Design*

This is an observational study with patients retrospectively enrolled at the Fondazione Policlinico Universitario "Agostino Gemelli" IRCCS of Rome, Italy, from November 2016 to November 2020.

A weakly supervised deep learning-based model on H&E in EOC patients was set up for BRCA1/2 status prediction.

The retrospective data on BRCA1/2 testing, performed with the NGS technique, were considered the reference standard for the computational pathology analysis. H&E slides were prepared by a technician and evaluated by a dedicated pathologist, according to current international indications.

In a second step, clinical and follow-up data including therapeutic regimens, progression free survival (PFS) and overall survival (OS), were considered as outcomes to be predicted.

This study was conducted in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of Fondazione Policlinico Universitario Agostino Gemelli IRCCS (Prot.; 001134 3/21; ID: 3894, 25 March 2021), with the requirement for informed consent. The research was funded by the Italian Ministry of Health providing Institutional Financial Support 5 × 1000 (2020).

#### *4.2. Study Population*

Eligible population includes: (i) women affected by EOC, with known somatic BRCA 1/2 mutational profile and (ii) available Formalin-Fixed Paraffin-Embedded peritoneal tissue sample, collected at the time of first diagnosis of EOC with at least 30% of cancer cells.

For the second step only those patients for whom we had complete follow-up information (minimum follow up 18 months) were included.

The exclusion criteria were: (i) patients affected by recurrent ovarian cancer; (ii) samples collected after chemotherapy; (iii) patients with extra-ovarian tumors with metastases to ovaries; (iv) patients with history of other malignancies in the past 5 years; (v) patients who received any type of target therapy prior to EOC diagnosis.

All the enrolled women were required to sign written informed consent.

Standardized procedures according to previously published workflows were followed to determine somatic BRCA 1/2 mutational status [47–49].

#### *4.3. Deep Learning Approach (CLAM)*

CLAM is a deep-learning-based weakly supervised method that uses attention-based learning to automatically identify sub regions of high diagnostic value to accurately classify the whole slide, while also enabling the use of instance-level clustering over the representative regions identified to constrain and refine the feature space.

CLAM is publicly available as a Python package on GitHub (https://github.com/mahmoodlab/CLAM, accessed on 29 August 2022) [50].

For whole-slide-level learning without annotation, CLAM uses an attention-based pooling function [51] to aggregate patch-level features into slide-level representations for classification. At a high level, during both training and inference, the model examines and ranks all patches in the tissue regions of a WSI, assigning an attention score to each patch, which informs its contribution or importance to the collective slide-level representation for a specific class.

This interpretation of the attention score is reflected in the slide-level aggregation rule of attention-based pooling, which computes the slide-level representation as the average of all patches in the slide weighted by their respective attention scores. Unlike the standard MIL algorithm [45,46], which was designed and widely used for weakly supervised positive/negative binary classification (for example, cancer versus normal), CLAM is designed to solve generic multi-class classification problems. A CLAM model has N parallel attention branches that together calculate N unique slide-level representations, where each representation is determined from a different set of highly attended regions in the image viewed by the network as strong positive evidence for one of the N classes in a multi-class diagnostic task. Each class-specific slide representation is then examined by a classification layer to obtain the final probability score predictions for the whole slide.
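The aggregation rule described above can be illustrated with a minimal sketch. It assumes that the raw attention scores have already been produced by an attention network (not shown) and that they are normalized with a softmax; the function name `attention_pool` and the toy inputs are hypothetical and are not part of the CLAM package.

```python
import math

def attention_pool(features, scores):
    """Return the attention-weighted average of patch feature vectors.

    features: list of patch feature vectors (all of equal length)
    scores:   list of unnormalized attention scores, one per patch
    """
    # Softmax over patches, shifted by the max score for numerical stability
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted average of the patch features, one dimension at a time
    dim = len(features[0])
    slide = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return slide, weights

# Toy example: three patches with two features each and equal attention
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]
slide_repr, weights = attention_pool(features, scores)
```

In a multi-class setting, one such pooled vector would be computed per class, each from its own set of attention scores.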

The slide-level ground-truth label and the attention scores predicted by the network can be used to generate pseudo labels for both highly and weakly attended patches as a technique to increase the supervisory signals for learning a separable patch-level feature space. During training, the network learns from an additional supervised learning task of separating the most- and least-attended patches of each class into distinct clusters. In addition, it is possible to incorporate domain knowledge into the instance-level clustering to add further supervision.
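As a rough illustration of how pseudo labels might be derived from attention scores, the sketch below selects the k most-attended and k least-attended patches of a slide. The function name, the parameter `k` and the 1/0 label encoding are assumptions made for this example only.

```python
def pseudo_label_patches(attention_scores, k):
    """Pick the k most- and k least-attended patches and assign pseudo labels.

    Returns a list of (patch_index, label) pairs, with label 1 for highly
    attended patches and label 0 for weakly attended ones.
    """
    if 2 * k > len(attention_scores):
        raise ValueError("k is too large for the number of patches")
    # Patch indices ordered from lowest to highest attention
    order = sorted(range(len(attention_scores)), key=lambda i: attention_scores[i])
    weakest = order[:k]
    strongest = order[-k:]
    return [(i, 1) for i in strongest] + [(i, 0) for i in weakest]

scores = [0.05, 0.40, 0.10, 0.30, 0.15]   # hypothetical attention scores
labels = pseudo_label_patches(scores, k=2)
```

The resulting pairs could then serve as targets for an auxiliary patch-level classifier during training.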

The pipeline provided by Lu et al. first automatically segments the tissue region of each slide and divides it into many smaller patches, so that they can serve as direct inputs to a CNN. Next, using a CNN for feature extraction, the tool converts all tissue patches into sets of low-dimensional feature embeddings. Following this feature extraction, both training and inference can occur in the low-dimensional feature space instead of the high-dimensional pixel space. The volume of the data space is decreased nearly 200-fold, drastically reducing the subsequent computation required to train supervised deep-learning models.
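The order of magnitude of this reduction can be checked with simple arithmetic. The sketch below assumes 256 × 256 RGB patches and an embedding length of 1024 values per patch; the embedding length is an assumption, as it is not stated in this section.

```python
PATCH_SIDE = 256      # patch edge length in pixels
CHANNELS = 3          # RGB
FEATURE_DIM = 1024    # ASSUMPTION: embedding length per patch, not stated in the text

pixel_values = PATCH_SIDE * PATCH_SIDE * CHANNELS   # values per patch in pixel space
reduction = pixel_values / FEATURE_DIM              # compression factor per patch
print(f"{pixel_values} pixel values -> {FEATURE_DIM} features: {reduction:.0f}-fold")
```

Under these assumptions, each patch shrinks from 196,608 values to 1024, a 192-fold reduction, which is consistent with the "nearly 200-fold" figure.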

#### *4.4. WSI Processing*

#### 4.4.1. Scanning

Clinical slides were reviewed under the supervision of a dedicated pathologist, and selected hematoxylin and eosin-stained slides containing tumor were scanned using the C13220-31 NanoZoomer S360 (Hamamatsu, Japan).

Each slide was scanned using the 40× objective lens of the NanoZoomer (scanning resolution 0.23 μm/pixel). The slide code details, the scanning area and the number of focus points for each slide were determined by the user; approximately 15 focus points were used per slide.

After the glass slides were placed in the cassettes and loaded into the holder of the scanner, each slide was handled and scanned automatically. Scanning the whole slide took up to 1 min per case.

Scanned images in the proprietary NDP Image (NDPI) file format were inspected, both as a whole and at the level of tissue detail, using the NDP.view2 image viewing software for NanoZoomer on a desktop computer with a high-definition screen (1920 × 1080 pixels). NDPI stores an image pyramid in TIFF directory entries.

Images and data were stored and exported to an external storage device.

#### 4.4.2. Segmentation

The first step is an automated segmentation of the tissue regions, which isolates the tissue and excludes any holes. The segmentation of specific slides can be adjusted by tuning the individual parameters. The pipeline input is digitized whole slide image data in well-known standard formats (.svs, .ndpi, .tiff, etc.). The WSI is read into memory at a downsampled resolution and converted from the RGB to the HSV color space. A binary mask for the tissue regions (foreground) is computed by thresholding the saturation channel of the image after median blurring to smooth the edges, followed by morphological closing to fill small gaps and holes [52]. The approximate contours of the detected foreground objects are then filtered based on an area threshold and stored for downstream processing, while the segmentation mask for each slide is made available for optional visual inspection. A human-readable text file is also automatically generated, which includes the list of files processed along with editable fields containing the set of key segmentation parameters used.
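The core of this step, thresholding the saturation channel, can be sketched with the Python standard library alone. The sketch deliberately omits the median blurring, morphological closing and contour filtering described above; the function name and the threshold value are hypothetical and would need tuning per slide.

```python
import colorsys

def tissue_mask(rgb_image, sat_threshold=0.08):
    """Binary foreground mask from the HSV saturation channel.

    rgb_image:     2-D list of (R, G, B) tuples with values in 0..255
    sat_threshold: hypothetical cut-off in 0..1
    """
    mask = []
    for row in rgb_image:
        mask_row = []
        for r, g, b in row:
            # colorsys expects channel values scaled to 0..1
            _, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            mask_row.append(1 if s > sat_threshold else 0)
        mask.append(mask_row)
    return mask

white = (245, 245, 245)   # glass background: essentially no saturation
pink = (220, 150, 190)    # eosin-like stained tissue
image = [[white, pink], [pink, white]]
mask = tissue_mask(image)
```

Saturation separates stained tissue from the near-white glass background more reliably than brightness, which is why the threshold is applied in HSV rather than RGB space.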

#### 4.4.3. Patching

After segmentation, the background is removed from the image of each slide, and the remaining pixels within the segmented foreground contours are tiled into a grid of smaller images (256 × 256 pixel patches) at the user-specified magnification. The stacks of image patches are stored together with their coordinates and the slide metadata using the Hierarchical Data Format version 5 (HDF5).

This process is not computationally intensive, and its cost depends on the resolution level.

The number of patches extracted from each slide can range from hundreds (biopsy slide patched at ×20 magnification) to hundreds of thousands (large resection slide patched at ×40 magnification). Output is a representation of images through patches in a high dimensional feature (HDF) space generated by a pre-trained convolutional neural network.
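The tiling step can be sketched as a regular grid walk over a binary tissue mask. The rule used here, keeping only patches that lie entirely within tissue, is a simplifying assumption for illustration; the function name and the toy mask are hypothetical.

```python
def patch_coordinates(mask, patch_size=256):
    """Top-left (x, y) coordinates of grid patches that lie fully in tissue.

    mask: 2-D list of 0/1 values, one entry per pixel
    """
    height, width = len(mask), len(mask[0])
    coords = []
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            rows = mask[y:y + patch_size]
            # Keep the patch only if every pixel in it is foreground
            if all(all(row[x:x + patch_size]) for row in rows):
                coords.append((x, y))
    return coords

# Toy 4 x 6 mask tiled with 2 x 2 patches; only the left four columns are tissue
toy_mask = [[1, 1, 1, 1, 0, 0] for _ in range(4)]
coords = patch_coordinates(toy_mask, patch_size=2)
```

Storing coordinates alongside the patches allows attention scores to be mapped back onto the slide later.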

#### 4.4.4. Feature Extraction

Following patching, we used the pre-trained ResNet50 model [53] already embedded in CLAM to compute a low-dimensional feature representation for each image patch of each slide. Feature extraction from patches is a computationally intensive step, requiring about one minute per whole slide image on an NVIDIA Quadro RTX 5000.

#### 4.4.5. Attention Branch

Each patch feature vector then enters an attention network, which is trained to recognize patterns often associated with a particular slide-level label (this description is a simplification).

At the end of training, the overall model should be able to identify characteristic regions of activation and to classify at the whole-slide-image level.

#### *4.5. Statistical Analysis*

The sample size available for the analysis consisted of 664 histological images. For computational reasons, the pipeline was first applied to about 30% of the available data (171 images), and the sample was then increased to 298 images in order to measure the performance gain due to the increased sample size.

The dataset was split into a training set and a validation set, comprising 80% and 20% of the samples considered, respectively.
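A random 80/20 split of this kind can be sketched as follows. The slide identifiers, the fixed seed and the absence of stratification by label are assumptions made for illustration; the text does not specify how the split was drawn.

```python
import random

def split_train_val(slide_ids, val_fraction=0.2, seed=0):
    """Shuffle slide identifiers and split them into training and validation sets."""
    ids = list(slide_ids)
    random.Random(seed).shuffle(ids)           # seeded for reproducibility
    n_val = round(len(ids) * val_fraction)
    return ids[n_val:], ids[:n_val]

slide_ids = [f"slide_{i:03d}" for i in range(298)]   # hypothetical identifiers
train_ids, val_ids = split_train_val(slide_ids)
```

With 298 images, this yields 238 training and 60 validation images.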

Classification performance of the predictive model was monitored during training using the ROC AUC and the negative and positive predictive values on the validation set. The final model was then applied to held-out images to generate activation maps on unseen BRCA-mutated images. The model was trained for 300 epochs on an NVIDIA Quadro RTX 5000. All statistical analyses were performed in Python version 3.7.4.
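For reference, the three monitored metrics can be computed as in the sketch below. It uses the pairwise (rank-based) definition of the AUC and an assumed probability threshold of 0.5 for the predictive values; the labels and scores are invented toy values, not study data.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def predictive_values(labels, scores, threshold=0.5):
    """Return (PPV, NPV) after thresholding the predicted probabilities."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv

labels = [1, 1, 0, 0, 1, 0]                 # toy ground truth (1 = mutated)
scores = [0.9, 0.6, 0.7, 0.2, 0.4, 0.1]     # toy predicted probabilities
auc = roc_auc(labels, scores)
ppv, npv = predictive_values(labels, scores)
```

The AUC is threshold-independent, whereas the predictive values depend both on the chosen threshold and on the prevalence of mutated cases in the validation set.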

#### **5. Conclusions**

Our results confirm that models applied to H&E slides cannot yet match the performance level of the gold standards; thus, their use in current clinical practice cannot be advocated. Nevertheless, their potential as a screening tool for the personalization and optimization of genomic testing warrants further investigation.

From a clinical point of view, the information obtained directly from frozen section H&E slides may give clinicians crucial information at an early stage of the therapeutic decision-making process, which could be integrated with already validated clinical scores or multiomics translational approaches [54–56].

For all these reasons, we believe that further developments are well worth carrying out. We have already planned to enlarge the dataset by collecting cases from December 2020 to August 2022. Moreover, we are working on improving data input for CLAM analysis by using a recently released segmentation model code (https://github.com/MSKCC-Computational-Pathology/DMMN-ovary, accessed on 29 August 2022) [57]. In parallel, in collaboration with other groups, we are exploring new approaches such as the development of a persistent homology-based model on the same dataset [58].

Future studies must include data from multiple centers that were not used in model training to demonstrate high accuracy and reproducibility. A deeper involvement of pathologists should be pursued in order to achieve the finest tuning possible according to well-recognized features.

**Author Contributions:** C.N., L.B., J.L., M.T.G., A.P., A.F., G.Z., V.V. and G.S. were responsible for the conceptualization of the study design. C.N., M.T.G. and A.P. were responsible for sample collection and C.N., M.T.G., A.P., F.I., T.P. and A.M. for data collection. C.N., J.L. and M.T.G. were responsible for drafting of the manuscript. C.N. and J.L. were responsible for the formal data analysis. The underlying data reported in the manuscript has been accessed and verified by multiple authors (C.N., L.B., J.L., M.T.G., F.I., A.P. and A.M.). All authors have read and agreed to the published version of the manuscript.

**Funding:** The research was funded by the Italian Ministry of Health through Institutional Financial Support 5 × 1000 (2020).

**Institutional Review Board Statement:** Prot.; 001134 3/21; ID: 3894, 25 March 2021.

**Informed Consent Statement:** Informed consent was obtained from all subjects involved in the study.

**Acknowledgments:** We would like to acknowledge the contribution of Nikon for the free trial of the NanoZoomer S360, and of the G-STeP Immunohistochemistry, Data collection and Radiomics core facilities (https://gstep.policlinicogemelli.it/#/, accessed on 29 August 2022).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Target of Rapamycin Regulates Photosynthesis and Cell Growth in** *Auxenochlorella pyrenoidosa*

**Tingting Zhu 1,2,†, Linxuan Li 1,2,3,† , Huimin Chang 3, Jiasui Zhan 4,\* and Maozhi Ren 1,2,3,\***


**Abstract:** *Auxenochlorella pyrenoidosa* is an efficient photosynthetic microalga with autotrophic growth and reproduction, which has the advantages of rich nutrition and high protein content. Target of rapamycin (TOR) is a protein kinase that is structurally and functionally conserved in eukaryotes, but little is known about TOR signaling in *Auxenochlorella pyrenoidosa*. Here, we found a conserved TOR protein (ApTOR) in *Auxenochlorella pyrenoidosa*; the key components of TOR complex 1 (TORC1) were present, while the TORC2 components RICTOR and SIN1 were absent. Drug sensitivity experiments showed that AZD8055 could effectively inhibit the growth of *Auxenochlorella pyrenoidosa*, whereas rapamycin, Torin1 and KU0063794 had no obvious effect. Transcriptome data indicated that ApTOR regulates various intracellular metabolic and signaling pathways in *Auxenochlorella pyrenoidosa*. Most genes related to chloroplast development and photosynthesis were significantly down-regulated under ApTOR inhibition by AZD8055. In addition, ApTOR was involved in regulating protein synthesis and catabolism through multiple metabolic pathways. Importantly, the inhibition of ApTOR by AZD8055 disrupted the normal carbon and nitrogen metabolism, protein and fatty acid metabolism, and TCA cycle of *Auxenochlorella pyrenoidosa* cells, thus inhibiting their growth. These RNA-seq results indicated that ApTOR plays important roles in photosynthesis, intracellular metabolism and cell growth, and provided some insights into the function of ApTOR in *Auxenochlorella pyrenoidosa*.

**Keywords:** TOR; photosynthesis; cell growth; AZD8055; *Auxenochlorella pyrenoidosa*

**Citation:** Zhu, T.; Li, L.; Chang, H.; Zhan, J.; Ren, M. Target of Rapamycin Regulates Photosynthesis and Cell Growth in *Auxenochlorella pyrenoidosa*. *Int. J. Mol. Sci.* **2022**, *23*, 11309. https://doi.org/10.3390/ijms231911309

Academic Editor: Wajid Zaman

Received: 2 August 2022 Accepted: 21 September 2022 Published: 25 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

*Chlorella* is a unicellular eukaryotic green alga that emerged 2 billion years ago and is a high-efficiency primary producer in ecosystems [1]. *Chlorella* can be cultured in a natural environment or in controllable closed systems, with higher productivity than most plants. For a long time, *Chlorella* has been regarded as a source of protein and fat, and it is used as a raw material for human food and animal feed [2,3]. Like land plants, *Chlorella* performs photosynthesis in its chloroplast, converting solar energy into the chemical energy that is vital to its development and generating oxygen. *Chlorella* contains many high-value substances such as protein, pigment, antioxidants, vitamins, minerals and cell growth factor, and has been referred to as "the best genetic food in the 21st century" by the World Health Organization [4]. At present, over 10 species of *Chlorella* have been identified in the world [5,6], among which *Auxenochlorella pyrenoidosa* (*A. pyrenoidosa*, formerly *Chlorella pyrenoidosa*) has attracted much attention because it is edible and its protein content can account for more than 50% of dry weight [7]. The genome of *A. pyrenoidosa* FACHB-9 is 56.6 Mbp in size and contains 10,284 genes [8]. An analysis of the genome structure provides a foundation for improving *A. pyrenoidosa* production as food and fuel. Furthermore, *A. pyrenoidosa* has been widely used in wastewater treatment, especially for high-concentration inorganic industrial wastewater [9]. When grown on the high ammonium salt content of industrial wastewater, *A. pyrenoidosa* can accumulate up to 56.7% protein (dry weight) while consuming 95% of the ammonium salt [10]. Additionally, *A. pyrenoidosa* can use organic carbon and nitrogen sources for high-density heterotrophic growth, with a production efficiency more than ten times that of autotrophic growth [8]. However, in heterotrophic *A. pyrenoidosa*, chloroplasts were degraded, lipid content was increased, and protein synthesis was inhibited [8,11,12]. Genomic and transcriptomic sequencing results showed that the transition from heterotrophic to photoautotrophic growth in *A. pyrenoidosa* resulted in global metabolic reprogramming [8].

Target of rapamycin (TOR) is a core regulatory factor for eukaryotic growth and development, which coordinates cell proliferation, growth and metabolism [13,14]. The TOR protein has highly conserved structures, including N-terminal HEAT repeats and FAT, FRB, catalytic and C-terminal FATC domains [15]. In yeast and mammals, TOR and other proteins form TOR complex 1 (TORC1) and TORC2 [16,17]. In plants, however, only TORC1 is conserved and functional; it is composed of TOR, regulatory-associated protein of TOR (RAPTOR) and lethal with SEC13 protein 8 (LST8) [18,19]. The TORC2 core proteins RICTOR and SIN1 seem to be missing in photosynthetic eukaryotes, including plants and green algae [20,21].

Rapamycin is a macrolide immunosuppressant from the bacterium *Streptomyces hygroscopicus*. It binds to the 12 kDa FK506 binding protein (FKBP12) and the FRB domain of TOR, thereby inhibiting the activity of the TOR protein [22]. Loss of FKBP12 function prevents rapamycin from inhibiting the TOR protein in most plants [23–26]. Fortunately, TOR kinase inhibitors such as Torin1, AZD8055 and KU0063794, originally developed for mammalian systems, have been applied in plants and shown to inhibit TOR kinase activity specifically and efficiently [15,27,28]. With the help of TOR inhibitors and various omics methods, TOR signaling pathways that are conserved between animals and plants, as well as plant-specific ones, have been revealed [18,29]. In plants, TOR regulates cell division and elongation, protein synthesis, nutrient metabolism, and stress responses by integrating multiple exogenous environmental signals and endogenous physiological signals [15,30–32]. TOR affects plant growth and development from embryogenesis to photomorphogenesis, root and leaf development, flowering, and senescence [13,18,33].

Genomic analysis of some algal species revealed that TORC1 components are widely conserved in algae [34]. Unlike in other microalgae, the functions of TOR have been comprehensively studied in the model green alga *Chlamydomonas reinhardtii* (*C. reinhardtii*) [35]. Previous studies have shown that *C. reinhardtii* is sensitive to rapamycin, and that rapamycin-sensitive TORC1 signaling regulates cell growth, protein synthesis, autophagy, and key metabolic processes in *C. reinhardtii* [35–37]. In addition, a recent study has shown that TOR controls carotenoid production by phosphorylating lycopene beta/epsilon cyclase in *C. reinhardtii* [38]. This is the first evidence that TOR directly regulates the biosynthesis of the secondary metabolite carotenoid in algae. As an industrial production alga, *A. pyrenoidosa* has a fast growth rate, rich nutrition and a high protein content. However, the TOR signaling pathway of *A. pyrenoidosa* has not been reported, and whether TOR signaling regulates the cell growth and protein synthesis of *A. pyrenoidosa* remains unknown. In this study, homologous sequence alignment revealed that only the TORC1 signaling pathway is conserved in *A. pyrenoidosa*. Drug sensitivity experiments showed that AZD8055 could effectively inhibit the growth of *A. pyrenoidosa*, while rapamycin, Torin1 and KU0063794 had no effect on the growth of *A. pyrenoidosa*. RNA-seq results showed that most genes involved in photosynthesis were significantly down-regulated in *A. pyrenoidosa* treated with AZD8055, indicating that ApTOR had an important effect on photosynthesis of *A. pyrenoidosa*. In addition, DEGs involved in the regulation of autophagy and ubiquitin-mediated proteolysis were almost all up-regulated, suggesting that ApTOR was also involved in regulating autophagy and protein catabolic processes in *A. pyrenoidosa*. These results suggested that ApTOR plays major roles in regulating photosynthesis and cellular metabolism in *A. pyrenoidosa*.

#### **2. Results**

#### *2.1. Conserved TOR Signaling Pathway in Auxenochlorella pyrenoidosa*

Conserved TORC1 signaling regulates intracellular metabolism, nutrients, and energy in *C. reinhardtii* [35–37]. To investigate the conserved TOR signaling pathway in *A. pyrenoidosa*, BLASTp analysis was performed on the public transcriptome data of *A. pyrenoidosa* (NCBI accession number: PRJNA730327). Only one conserved TOR protein (ApTOR) was found in *A. pyrenoidosa*, with a maximum similarity of 69% to the CrTOR protein. ApTOR contains N-terminal HEAT repeats and FAT, FRB, catalytic and C-terminal FATC domains (Figure 1A). Homologous sequence alignment revealed that the catalytic domain of ApTOR was the most conserved, with the highest similarity among species, while the FAT domain had the lowest similarity among species (Figure 1A,C). The phylogenetic tree showed that ApTOR was most closely related to CrTOR and most distantly related to TpTOR and PtTOR (Figure 1B). Meanwhile, sequence alignment found that RAPTOR and LST8, the key proteins of TORC1, were also present in *A. pyrenoidosa* (Table 1), whereas RICTOR and SIN1, the key proteins of TORC2, were not found. In addition, downstream components of TORC1 signaling were also present in *A. pyrenoidosa* (Table 1). These results showed that there is a conserved TORC1 signaling pathway in *A. pyrenoidosa*.

**Figure 1.** Structure and sequence analysis of ApTOR. (**A**) An analysis of the conserved domains of ApTOR protein and homologs from other species. The number denotes the identity (%) of each ApTOR domain with homologs from other species. *Chlamydomonas reinhardtii* (Cr) (Chlorophyta), *Arabidopsis thaliana* (At) (Plantae, Magnoliophyta), *Homo sapiens* (Hs) (Animalia, Chordata), *Saccharomyces cerevisiae* (Sc) (Fungi, Ascomycota), *Phaeodactylum tricornutum* (Pt) (Bacillariophyta), *Thalassiosira pseudonana* (Tp) (Bacillariophyta). (**B**) The phylogenetic tree of ApTOR protein and homologs from other species. The phylogenetic tree was constructed with MEGA 4 software using the Neighbor-Joining method. Numbers represent bootstrap percentages (1000 bootstrap replicates). (**C**) Sequence alignment of the catalytic domains of ApTOR protein and homologs from other species. Red represents identical amino acid sequences, and blue represents more than 75% identical amino acid sequences.


**Table 1.** The putative components of TOR signaling pathway in *Auxenochlorella pyrenoidosa*.

#### *2.2. Effects of TOR Inhibitors on the Growth of Auxenochlorella pyrenoidosa*

To elucidate the function of TOR signaling in *A. pyrenoidosa*, *A. pyrenoidosa* was treated with rapamycin, a specific inhibitor of the TOR protein. The results showed that rapamycin had no obvious effect on the growth of *A. pyrenoidosa*, even at higher concentrations (Figure 2A), indicating that *A. pyrenoidosa* is insensitive to rapamycin. Previous studies have shown that rapamycin inhibits TOR activity by forming a ternary complex with FKBP12 and the FRB domain of TOR [39]. An ApFKBP12 sequence with 43% similarity to CrFKBP12, encoding 159 amino acids, was found in the transcriptome data of *A. pyrenoidosa* (Table 1). Interestingly, the ApFKBP12 protein is evolutionarily closer to those of rapamycin-sensitive species (Figure 2B). Some amino acids involved in the formation of the FKBP12-rapamycin complex are required for inhibiting TOR activity [40]. We found that the ApFKBP12 protein sequence contains the conserved amino acids required for FKBP12 binding to rapamycin, including Tyr26, Asp38 and Gly89 (numbered according to human HsFKBP12) (Figure 2C). However, compared with other species, there was an additional sequence of 52 amino acids at the N-terminus of the ApFKBP12 protein (Figure 2C), which may alter the function of the ApFKBP12 protein and prevent it from binding to rapamycin. Rapamycin also interacts with the FRB domain of TOR by binding to aromatic residues [40], and sequence alignment revealed that these key amino acids were highly conserved in *A. pyrenoidosa* and other species (Figure 2D). As the FRB domain of ApTOR is highly conserved, the resistance of *A. pyrenoidosa* to rapamycin may be due to the loss of ApFKBP12 function.

Furthermore, TOR kinase inhibitors AZD8055, Torin1 and KU0063794 were used to treat *A. pyrenoidosa*. The results showed that AZD8055 could effectively inhibit the growth of *A. pyrenoidosa*, while Torin1 and KU0063794 had no effect on the growth of *A. pyrenoidosa* even at higher concentrations (Figure 3). The 50% inhibitory concentration (IC50) of AZD8055 on the growth of *A. pyrenoidosa* was about 1 μM. When the concentration of AZD8055 reached 5 μM, the growth of *A. pyrenoidosa* was completely inhibited, indicating that the lethal concentration of AZD8055 may be 5 μM (Figure 3A,B). However, when AZD8055 was removed from the medium, the inhibited cells resumed growth (Supplementary Figure S1), indicating that inhibition of ApTOR kinase activity by AZD8055 prevents cell division without killing cells. These results suggested that AZD8055 can be applied to elucidate the function of ApTOR in *A. pyrenoidosa*.

**Figure 2.** *Auxenochlorella pyrenoidosa* is resistant to rapamycin. (**A**) The phenotype of *A. pyrenoidosa* treated with different concentrations of rapamycin for 0, 2, 4, and 6 days. The numbers denote the corresponding OD680nm values. (**B**) The phylogenetic tree of ApFKBP12 protein and homologs from other species. The phylogenetic tree was constructed with MEGA 4 software using the Neighbor-Joining method. Numbers represent bootstrap percentages (1000 bootstrap replicates). *Schizosaccharomyces pombe* (Sp) (Fungi, Ascomycota), *Oryza sativa* (Os) (Plantae, Tracheophyta), *Solanum tuberosum* (St) (Plantae, Tracheophyta). (**C**) Sequence alignment of the ApFKBP12 protein and homologs from other species. The red rectangle denotes the amino acid required for FKBP12 binding to rapamycin. (**D**) Sequence alignment of the FRB domains of the ApTOR protein and homologs from other species. The red rectangle denotes the amino acid required for the FRB domain binding to rapamycin.

**Figure 3.** The effects of TOR protein inhibitors on the growth of *Auxenochlorella pyrenoidosa.* (**A**) AZD8055 inhibits the growth of *A. pyrenoidosa* in a dose-dependent manner. (**B**) Change curves of OD680nm values of *A. pyrenoidosa* treated with 1, 5 and 10 μM AZD8055 for 0, 2, 4 and 6 days. (**C**) Phenotype of *A. pyrenoidosa* treated with 1, 5 and 10 μM KU0063794 for 0, 2, 4, and 6 days. (**D**) Change curves of OD680nm values as described in (**C**). (**E**) Phenotype of *A. pyrenoidosa* treated with 1, 10, 20 μM Torin1 for 0, 2, 4, and 6 days. (**F**) Change curves of OD680nm values as described in (**E**).

#### *2.3. Analysis of Transcriptome Sequencing under ApTOR Inhibition*

To further clarify the roles of the ApTOR signaling pathway in *A. pyrenoidosa*, transcriptome sequencing was performed on *A. pyrenoidosa* treated with AZD8055. The growth curve showed that *A. pyrenoidosa* was in the logarithmic phase after incubation for 4 days (Figure 3); we therefore cultured the algal cells for 4 days before AZD8055 treatment. In addition, we found that the OD680nm value, chlorophyll content, protein content, and starch content of *A. pyrenoidosa* changed significantly when *A. pyrenoidosa* was treated with 5 μM AZD8055 for 24 h (Figure 4); thus, *A. pyrenoidosa* treated with 5 μM AZD8055 for 24 h was used for transcriptome sequencing.

**Figure 4.** TOR regulates the biosynthesis of major intracellular substances in *Auxenochlorella pyrenoidosa*. (**A**) Phenotypes of *A. pyrenoidosa* treated with AZD8055 for 0, 12, 24, 36, and 48 h. *A. pyrenoidosa* was cultured in 50 mL BG11 liquid medium for 4 days. Then, AZD8055 (final concentration 5 μM) or an equivalent volume of DMSO was added to the algal culture for 0, 12, 24, 36, and 48 h. (**B**) Change curves of OD680nm values of *A. pyrenoidosa* treated with 5 μM AZD8055 for 0, 12, 24, 36, and 48 h. (**C**) Total chlorophyll content of *A. pyrenoidosa* treated with 5 μM AZD8055 for 0, 12, 24, 36, and 48 h. (**D**) Protein content of *A. pyrenoidosa* treated with 5 μM AZD8055 for 0, 12, 24, 36, and 48 h. (**E**) Starch content of *A. pyrenoidosa* treated with 5 μM AZD8055 for 0, 12, 24, 36, and 48 h. The fresh weight of *A. pyrenoidosa* was used to measure chlorophyll, protein and starch contents. The data represent the mean ± SD of n = 3 independent experiments. Asterisks denote a significant difference compared with DMSO by Student's *t*-test (\* *p* < 0.05; \*\* *p* < 0.01).

*A. pyrenoidosa* was cultured for 4 days, and then AZD8055 (final concentration 5 μM) or an equivalent volume of DMSO was added to the algal culture for 24 h. Subsequently, the AZD8055-treated algal cells were used for transcriptome sequencing. After filtering the raw data, clean reads for subsequent analysis were obtained; the data summary is shown in Supplementary Table S1. A total of 2823 differentially expressed genes (DEGs) were found between AZD8055 treatment and the DMSO control, of which 1205 DEGs were up-regulated and 1618 were down-regulated (Figure 5A). To verify the reliability of the transcriptome data, 10 DEGs were randomly selected for qRT-PCR. The qRT-PCR results showed the same trend as the transcriptome data (Supplementary Figure S2), indicating that the transcriptome data were valid and reliable.
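To make the up/down classification concrete, the following sketch labels genes from a fold change and an adjusted *p*-value. The cut-offs used (at least 2-fold change, adjusted *p* < 0.05) are common defaults assumed here for illustration; they are not taken from this section, and the gene names and values are invented.

```python
import math

def classify_deg(fold_change, padj, min_fold=2.0, alpha=0.05):
    """Label a gene 'up', 'down' or 'ns' from fold change and adjusted p-value.

    fold_change:     treated / control expression ratio (must be > 0)
    min_fold, alpha: ASSUMED cut-offs, not taken from the text
    """
    log2_fc = math.log2(fold_change)
    if padj >= alpha or abs(log2_fc) < math.log2(min_fold):
        return "ns"   # not significant
    return "up" if log2_fc > 0 else "down"

# Hypothetical genes: (fold change, adjusted p-value)
genes = {"gene_a": (4.0, 0.001), "gene_b": (0.1, 0.01), "gene_c": (1.3, 0.2)}
calls = {name: classify_deg(fc, p) for name, (fc, p) in genes.items()}

# Consistency check on the reported totals: up + down = all DEGs
total_degs = 1205 + 1618
```

Working on the log2 scale treats a 4-fold increase and a 4-fold decrease symmetrically, which is why the threshold is applied to the absolute log2 fold change.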

**Figure 5.** The transcriptome data analysis of AZD8055-treated *Auxenochlorella pyrenoidosa*. (**A**) Number of up- and down-regulated DEGs between AZD8055 and DMSO treatment. (**B**) Cluster analysis of DEGs between AZD8055 and DMSO treatment. The color represents the FPKM value of the gene by Z-score. Red denotes high gene expression and green denotes low gene expression. (**C**) The top 20 enriched gene ontology in down-regulated DEGs. (**D**) The top 20 enriched gene ontology in up-regulated DEGs. (**E**) The top 20 enriched KEGG pathways in down-regulated DEGs. (**F**) The top 20 enriched KEGG pathways in up-regulated DEGs.

A cluster analysis of DEGs showed that the transcription levels of many genes were changed in *A. pyrenoidosa* treated with AZD8055 compared to the DMSO control (Figure 5B). To clarify the functions of the DEGs, GO functional enrichment analysis was conducted, and a total of 121 down-regulated GO terms and 124 up-regulated GO terms were enriched in the transcriptome data. Among the down-regulated GO terms, thylakoid (GO:0009579) and photosynthesis (GO:0015979) were the most significantly enriched (Figure 5C). Among the up-regulated GO terms, cellular protein modification process (GO:0006464) and autophagy (GO:0006914) were the most significantly enriched (Figure 5D). These results suggested that ApTOR regulates multiple biological processes in *A. pyrenoidosa*. A KEGG pathway enrichment analysis showed that photosynthesis, carbon metabolism and carbon fixation in photosynthetic organisms were the most significantly enriched pathways among the down-regulated DEGs (Figure 5E). Among the up-regulated KEGG pathways, ubiquitin-mediated proteolysis and circadian rhythm-plant were the most significantly enriched (Figure 5F). These results suggested that ApTOR controls various intracellular metabolic and signaling pathways in *A. pyrenoidosa*.

#### *2.4. DEGs Involved in Regulating Chloroplast Development and Photosynthesis of Auxenochlorella pyrenoidosa*

Chloroplasts containing chlorophyll are necessary for photosynthesis in plants [41]. Previous studies showed that TOR has the function of regulating chloroplast development and photosynthesis in *Arabidopsis* [42]. Down-regulated GO terms related to photosynthesis and chloroplast development were enriched in the transcriptome data (Figure 5C). Meanwhile, metabolic pathways related to plant photosynthesis were also found in the KEGG pathways, such as photosynthesis, carbon fixation in photosynthetic organisms, and porphyrin and chlorophyll metabolism (Figure 5E). These results indicated that ApTOR has important effects on chloroplast development and photosynthesis in *A. pyrenoidosa*.

A total of 62 DEGs were associated with photosynthesis under ApTOR inhibition, among which 29, 9 and 24 DEGs were enriched in the "Photosynthesis", "Photosynthesis-antenna proteins" and "Carbon fixation in photosynthetic organisms" KEGG pathways, respectively (Table 2). Most DEGs related to photosynthesis were significantly down-regulated, and all DEGs of the photosystem I, photosystem II and chlorophyll a/b binding protein pathways were down-regulated under ApTOR inhibition (Table 2 and Supplementary Figure S3). The most strongly down-regulated gene was *Chlorophyll a-b binding protein 4* (*Cluster-495.6691*), with a 131.60-fold decrease. In the dark reaction of photosynthesis, 22 down-regulated DEGs and 2 up-regulated DEGs were involved in carbon fixation. In addition, all 16 chlorophyll synthesis genes involved in the "Porphyrin and chlorophyll biosynthesis" pathway were down-regulated by 2.27- to 27.67-fold, and 44 down-regulated DEGs and 6 up-regulated DEGs were involved in the "Thylakoid" pathway (Supplementary Table S2). These results suggested that ApTOR positively regulates chloroplast development and photosynthesis in *A. pyrenoidosa*.

**Table 2.** Differentially expressed genes in the photosynthetic process.




#### *2.5. DEGs Involved in Regulating Protein Synthesis and Catabolism of Auxenochlorella Pyrenoidosa*

*A. pyrenoidosa* has a high protein content, but whether ApTOR regulates protein synthesis in *A. pyrenoidosa* remains unknown. Previous studies have shown that ribosomes are responsible for protein synthesis in all organisms, and TOR plays an essential role in regulating ribosome synthesis [43–45]. In this study, the expression of genes involved in the "Ribosome biogenesis" pathway, including ribosomal proteins and U3 small nucleolar ribonucleoprotein proteins, was significantly changed in *A. pyrenoidosa*. A total of 40 DEGs were enriched in the KEGG "Ribosome biogenesis" pathway, including 26 down-regulated DEGs and 14 up-regulated DEGs, and most of the DEGs encoded ribosomal proteins (Supplementary Table S3). Importantly, most of the DEGs associated with ribosomal proteins were significantly down-regulated, and the most strongly down-regulated gene was *50S ribosomal protein L2* (*Cluster-829.0*), with a 26.17-fold decrease (Supplementary Table S3). These results indicated that ApTOR inhibition leads to dysfunction of ribosomes, especially changes in ribosomal protein-related genes, further indicating that ApTOR controls protein synthesis by ribosomes.

Autophagy plays a central role in protein degradation, and previous studies showed that TOR negatively regulates autophagy [46–48]. In this study, transcriptome analysis showed that autophagy-related DEGs were significantly enriched in GO terms and KEGG pathways (Figure 5D,F). A total of 8 DEGs were assigned to the "Regulation of autophagy" pathway, of which 7 genes, including *SnRK1α* and *ATG* genes, were up-regulated and 1 gene was down-regulated in the RNA-seq data (Table 3). These results suggested that ApTOR negatively regulates autophagy in *A. pyrenoidosa*. The ubiquitin (Ub)/26S proteasome system (UPS) is the main pathway of protein degradation in cells. Ub is sequentially and covalently linked to the target protein by ubiquitin-activating enzyme (E1), ubiquitin-conjugating enzyme (E2), and ubiquitin protein ligase (E3), and the target protein is then degraded by the proteasome [49,50]. The "Ubiquitin mediated proteolysis" KEGG pathway was influenced by AZD8055 (Figure 5F). A total of 19 DEGs were assigned to the "Ubiquitin mediated proteolysis" pathway, including 15 up-regulated genes and 4 down-regulated genes (Table 3). Four genes encoding E1 activating enzyme were up-regulated by 2.28- to 7.84-fold under ApTOR inhibition. In addition, some important E3 ubiquitin ligase genes, including *Cullin 1*, *Cullin 3*, and *Cullin 4*, were significantly up-regulated (Table 3). These results showed that ApTOR inhibition activated protein catabolism in *A. pyrenoidosa*.


**Table 3.** Differentially expressed genes in protein catabolism.

#### *2.6. DEGs Involved in Regulating the Cell Growth of Auxenochlorella Pyrenoidosa*

Carbon and nitrogen metabolism, together with protein and fat synthesis, are important limiting factors of cell growth and proliferation [51,52]. In this study, the genes associated with carbon metabolism, amino acid metabolism and fatty acid metabolism were significantly changed under ApTOR inhibition (Supplementary Table S4). DEGs of carbon metabolism and of the amino acid and fatty acid biosynthesis pathways were significantly enriched in the down-regulated KEGG pathways (Figure 5E). A total of 65 DEGs were assigned to the "carbon metabolism" pathway, including 56 down-regulated genes and 9 up-regulated genes. Some rate-limiting enzyme genes in the "carbon metabolism" pathway, such as fructose bisphosphate aldolase and pyruvate kinase, were significantly down-regulated. A total of 45 DEGs were assigned to the "biosynthesis of amino acids" pathway, including 41 down-regulated genes and 4 up-regulated genes. In addition, all 10 DEGs assigned to the "fatty acid biosynthesis" pathway were down-regulated by 2.53- to 28.44-fold (Supplementary Table S4), indicating that AZD8055 inhibited the biosynthesis of fatty acids in *A. pyrenoidosa*. These results suggested that ApTOR inhibition affects a variety of intracellular metabolic processes, especially carbon and nitrogen metabolism and fatty acid metabolism. The disruption of metabolic homeostasis by AZD8055 may contribute to the inhibition of *A. pyrenoidosa* cell growth. Consistent with the growth phenotype of *A. pyrenoidosa* treated with AZD8055, all 14 DEGs related to the tricarboxylic acid (TCA) cycle were down-regulated by 2.18- to 11.98-fold in the transcriptome data, including the rate-limiting enzymes isocitrate dehydrogenase, α-oxoglutarate dehydrogenase and pyruvate dehydrogenase (Table 4), implying that AZD8055 inhibited cell growth of *A. pyrenoidosa* by inhibiting the TCA cycle and reducing energy supply.


**Table 4.** Differentially expressed genes in the TCA cycle.

#### **3. Discussion**

TOR regulates protein synthesis, intracellular metabolism and cell proliferation by integrating nutrient, energy and environmental signals [13,14,53]. In this study, we provide some new insights into how ApTOR controls multiple cellular processes to regulate cell growth of *A. pyrenoidosa*. Only TORC1, which contains the key proteins TOR, RAPTOR and LST8, is found in higher plants and the green alga *C. reinhardtii*. TORC1 activity is regulated by nutrients and environmental stresses and responds to different environmental conditions by controlling intracellular metabolic processes [21,35]. Consistent with the results from higher plants and *C. reinhardtii*, only one conserved ApTOR protein was found in *A. pyrenoidosa* (Figure 1 and Table 1). The key TORC1 components RAPTOR and LST8 were present, while the TORC2 components RICTOR and SIN1 were absent in *A. pyrenoidosa*, implying that the conserved TORC1 pathway exists in *A. pyrenoidosa*.

Studies have shown that *C. reinhardtii* is sensitive to rapamycin [54]. Unexpectedly, we found that rapamycin had no obvious effect on the growth of *A. pyrenoidosa*, even at a higher concentration of rapamycin (20 μM) (Figure 2), showing that *A. pyrenoidosa* is insensitive to rapamycin. Phylogenetic tree analysis and amino acid sequence alignment showed that the resistance of *A. pyrenoidosa* to rapamycin may be caused by the loss of ApFKBP12 function. In addition, we found that AZD8055 could effectively inhibit the growth of *A. pyrenoidosa*, while Torin1 and KU0063794 had no effect on the growth of *A. pyrenoidosa* even at higher concentrations, implying that Torin1 and KU0063794 could not act on the kinase domain of ApTOR protein due to amino acid variation.

Photosynthesis is a plant-specific physiological activity that provides energy and sugars for the autotrophic growth of plants, which is their biggest difference from animals [55,56]. Previous studies have shown that TOR signaling is closely related to chloroplast development and photosynthesis in plants [33,57,58]. Photosynthetic absorption of CO2 increased TOR activity, and the enhanced TOR activity in turn further promoted photosynthesis in *C. reinhardtii* [58]. Most DEGs involved in chloroplast development and photosynthesis, such as those in the thylakoid, porphyrin and chlorophyll biosynthesis, and photosynthesis pathways, were down-regulated under ApTOR inhibition by AZD8055 (Table 2), showing that ApTOR had important effects on chloroplast development and photosynthesis of *A. pyrenoidosa*.

Protein degradation is mainly mediated by the ubiquitin/26S proteasome pathway and autophagy [59,60]. In this study, we found that ApTOR inhibition activates autophagy and the ubiquitin-mediated proteolysis pathway in *A. pyrenoidosa* (Table 3), promoting protein catabolism. In contrast, genes related to ribosome synthesis were significantly down-regulated in the RNA-seq data, which would inhibit protein synthesis. These results indicated that ApTOR is involved in regulating protein synthesis and catabolism through multiple metabolic pathways in *A. pyrenoidosa*. Furthermore, the transcriptome data showed that ApTOR controls various intracellular metabolic and signaling pathways in *A. pyrenoidosa*. Inhibition of ApTOR activity resulted in disorders of carbon and nitrogen metabolism, protein and fatty acid metabolism and the TCA cycle, which further inhibited the cell growth of *A. pyrenoidosa*.

#### **4. Materials and Methods**

#### *4.1. Algae and Growth Condition*

The strain of *A. pyrenoidosa* (FACHB-9) used in this study was purchased from the Institute of Hydrobiology, Chinese Academy of Sciences (Wuhan, China). *A. pyrenoidosa* was cultured in BG11 liquid medium supplemented with 20 g·L<sup>−1</sup> glucose at 28 °C, 2000 lux continuous light, and 180 rpm.

#### *4.2. Treatment of Auxenochlorella Pyrenoidosa by TOR Inhibitors*

*A. pyrenoidosa* cells were inoculated into 50 mL of BG11 liquid medium supplemented with different concentrations of TOR inhibitors (rapamycin, AZD8055, KU0063794, Torin1) and incubated at 28 °C, 2000 lux continuous light, and 180 rpm. Cell density was measured as the optical density at 680 nm (OD680) with a Microplate Reader (BioTek Epoch™ 2, Winooski, VT, USA) at 0, 2, 4 and 6 days.

To test whether *A. pyrenoidosa* cells were killed by high concentrations of AZD8055, cells were treated with 1, 5 and 10 μM AZD8055 for 4 days. AZD8055 was then removed from the medium, and the pellet was resuspended in BG11 and adjusted to the same OD value. Meanwhile, the removed supernatant containing different concentrations of AZD8055 was added to fresh *A. pyrenoidosa* cells. The phenotype was observed after culturing with or without AZD8055 for 4 days.

#### *4.3. Phylogenetic Tree Analysis*

Homologous sequences from different species were aligned with ClustalX software. The phylogenetic tree was generated by the Neighbor-Joining method in MEGA 4 software, and the Poisson correction model was used to compute genetic distances. TpTOR (XP\_002293107.1), CrTOR (XP\_042921379.1), PtTOR (XP\_002181617.1), AtTOR (NP\_175425.2), HsTOR (NP\_001373429.1), ScTOR1 (NP\_012600.1), ScTOR2 (NP\_012719.2), CrFKBP12 (XP\_001693615.1), AtFKBP12 (NP\_201240.1), StFKBP12 (XP\_006351741.1), OsFKBP12 (XP\_015625368.1), SpFKBP12 (NP\_595257.1), HsFKBP12 (NP\_000792.1) and ScFKBP12 (NP\_014264.1) protein sequences were downloaded from the NCBI database.
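As a rough illustration of the Poisson correction mentioned above, the distance between two aligned protein sequences can be written as d = −ln(1 − p), where p is the proportion of aligned sites that differ. The sketch below is not the MEGA implementation; the function names, the toy sequences and the choice to skip gap-containing columns pairwise are assumptions made for illustration only.

```python
import math


def p_distance(seq_a, seq_b, gap="-"):
    """Proportion of differing residues over gap-free aligned columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # assumption: gap columns are ignored for this pair
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return differing / compared


def poisson_distance(seq_a, seq_b):
    """Poisson-corrected distance d = -ln(1 - p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 1.0:
        raise ValueError("distance is undefined when all sites differ")
    return -math.log(1.0 - p)


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from the study.
    print(poisson_distance("MKTAYI", "MKSAYI"))
```

The correction matters mainly for divergent pairs: for small p the value is close to p itself, while it grows faster than p as the sequences become more different.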

#### *4.4. Construction of the RNA-seq Library and Transcriptome Sequencing*

*A. pyrenoidosa* was cultured in 50 mL BG11 liquid medium supplemented with 20 g·L<sup>−1</sup> glucose at 28 °C, 2000 lux continuous light, and 180 rpm for 4 days. Then, AZD8055 (final concentration 5 μM) or an equivalent volume of DMSO was added to the algal culture for 24 h, and the algal cells were collected by centrifugation. Three independent biological replicates were performed for each treatment. Total RNA of *A. pyrenoidosa* treated with AZD8055 or DMSO was extracted with a Plant RNA extraction kit (TIANGEN, Beijing, China). The RNA library was constructed using the NEBNext® Ultra™ RNA Library Prep Kit (NEB, Boston, MA, USA) by Tianjin Novogene Bioinformatics Technology Co., Ltd. The qualified library was sequenced on an Illumina NovaSeq 6000 platform, and 150 bp paired-end reads were generated. Clean reads were obtained by filtering the raw data.

#### *4.5. Transcriptome Assembly, Annotation and Differential Expression Analysis*

After obtaining clean reads, the Trinity software (V2.6.6, Marlborough, MA, USA) [61] was used to assemble the clean reads into reference sequences for subsequent analysis. Diamond software (V0.9.13.114, Tübingen, Germany) [62] was used to match the gene sequences against the protein databases for functional annotation. The assembled genes were annotated using gene function information from the major databases, including the NR, GO, KEGG, Pfam, KOG/COG, and Swiss-Prot databases. The DESeq2 R package (1.20.0) [63] was used to analyze the differentially expressed genes (DEGs) between the AZD8055 treatment and the DMSO control. *P*-adj < 0.05 and |log2(fold change)| > 1 were set as the threshold values for differential gene expression. GO and KEGG plant databases were used to predict the function of genes and describe the gene products, and the annotation information related to plants was selected for GO and KEGG pathway enrichment. Goseq (V1.10.0, Parkville, Australia) and KOBAS (V2.0.12, Beijing, China) software were used for GO and KEGG pathway enrichment analysis of DEGs, respectively [64,65].
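The DEG thresholds described above can be sketched as a simple filter. This is a minimal illustration, assuming a table of per-gene log2 fold changes and adjusted *P*-values has already been exported from DESeq2; the function names and the example values are hypothetical and are not taken from the study's data.

```python
import math


def classify_deg(log2_fold_change, p_adj, lfc_cutoff=1.0, padj_cutoff=0.05):
    """Label a gene 'up', 'down' or 'ns' using the thresholds in the text.

    A gene is called differentially expressed when p_adj < 0.05 and
    |log2(fold change)| > 1. Missing adjusted P-values are treated as
    not significant (an assumption of this sketch).
    """
    if p_adj is None or math.isnan(p_adj):
        return "ns"
    if p_adj < padj_cutoff and abs(log2_fold_change) > lfc_cutoff:
        return "up" if log2_fold_change > 0 else "down"
    return "ns"


def fold_change_magnitude(log2_fold_change):
    """Convert a log2 fold change into a fold increase or decrease."""
    return 2.0 ** abs(log2_fold_change)


if __name__ == "__main__":
    # Hypothetical example rows: (gene id, log2 fold change, adjusted P-value).
    rows = [("geneA", -3.0, 1e-6), ("geneB", 0.4, 0.01), ("geneC", 2.0, 0.2)]
    for gene, lfc, padj in rows:
        print(gene, classify_deg(lfc, padj), fold_change_magnitude(lfc))
```

Under these thresholds a gene must change by more than 2-fold in either direction to be counted, so a reported "8-fold decrease" corresponds to a log2 fold change of −3.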

#### *4.6. Quantitative Real-Time PCR (qRT-PCR) Validation*

To verify the reliability of the transcriptome data, qRT-PCR was used to quantify the expression levels of 10 randomly selected genes. CDS sequences of the genes were derived from the transcriptome sequencing data, and the corresponding specific primers are listed in Supplementary Table S5. *ApActin* (*Cluster-495.7101*) was used as a reference gene. RNA from *A. pyrenoidosa* that was processed in the same batch as that used for transcriptome sequencing was selected for qRT-PCR. Relative expression levels of genes were assayed by two-step RT-PCR analysis using the Bio-Rad CFX96 Manager software (BIO-RAD, Hercules, CA, USA). Each reaction was performed in a final volume of 20 μL containing 10 μL of 2 × SYBR Green PCR Mastermix (Solarbio, Beijing, China). The relative RNA levels of the genes were calculated using the formula 2<sup>−ΔΔCT</sup>.
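The 2<sup>−ΔΔCT</sup> calculation can be illustrated as follows. This is a minimal sketch, assuming one CT value per gene and condition and roughly 100% amplification efficiency for both the target and the reference gene; the function name and the CT values are hypothetical.

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Relative expression by the 2^(-ddCT) method.

    dCT  = CT(target) - CT(reference), computed per condition
    ddCT = dCT(treated) - dCT(control)
    """
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


if __name__ == "__main__":
    # Hypothetical CT values: the target gene crosses the threshold three
    # cycles later after treatment, while the reference gene is unchanged.
    print(relative_expression(25.0, 18.0, 22.0, 18.0))  # 0.125
```

A value of 0.125 corresponds to an 8-fold decrease relative to the control; values above 1 indicate up-regulation.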

#### **5. Conclusions**

In conclusion, this study revealed the conserved ApTOR signaling in *A. pyrenoidosa* and elucidated the effects of TOR inhibitors on the growth of *A. pyrenoidosa*. Transcriptome data results showed that ApTOR is involved in regulating chloroplast development, photosynthesis and intracellular metabolism in *A. pyrenoidosa*, and ApTOR promotes the cell growth of *A. pyrenoidosa* by regulating various signaling pathways and intracellular metabolic processes. This study provides some insights into the function of ApTOR in *A. pyrenoidosa*.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms231911309/s1.

**Author Contributions:** M.R. and T.Z. designed the experiments. T.Z. and L.L. performed the experiments. M.R., T.Z. and L.L. analyzed the data. T.Z., L.L., J.Z. and H.C. wrote the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by grants from the Agricultural Science and Technology Innovation Program of Chinese Academy of Agricultural Sciences (34-IUA-02), the "open competition mechanism to select the best candidates project" of Hainan Yazhou Bay Seed Laboratory (B21HJ0203) and the special project of Nanfan Research Institute of Chinese Academy of Agricultural Sciences (YBXM12).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** The transcriptome data have been deposited in the NCBI Sequence Read Archive under accession number PRJNA841794.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Atomic Simulation of the Binding of JAK1 and JAK2 with the Selective Inhibitor Ruxolitinib**

**Maxim Kondratyev <sup>1</sup>, Vladimir R. Rudnev <sup>2,3</sup>, Kirill S. Nikolsky <sup>2</sup>, Alexander A. Stepanov <sup>2</sup>, Denis V. Petrovsky <sup>2</sup>, Liudmila I. Kulikova <sup>2,3,4</sup>, Arthur T. Kopylov <sup>2</sup>, Kristina A. Malsagova <sup>2,\*</sup> and Anna L. Kaysheva <sup>2</sup>**


**Abstract:** Rheumatoid arthritis belongs to the group of chronic systemic autoimmune diseases characterized by the development of destructive synovitis and extra-articular manifestations. Cytokines regulate a wide range of inflammatory processes involved in the pathogenesis of rheumatoid arthritis and contribute to the induction of autoimmunity and chronic inflammation. Janus-associated kinase (JAK) and signal transducer and activator of transcription (STAT) proteins mediate cell signaling from cytokine receptors, and are involved in the pathogenesis of autoimmune and inflammatory diseases. Targeted small-molecule drugs that inhibit the functional activity of JAK proteins are used in clinical practice for the treatment of rheumatoid arthritis. In our study, we modeled the interactions of the small-molecule drug ruxolitinib with JAK1 and JAK2 isoforms and determined the binding selectivity using molecular docking. Molecular modeling data show that ruxolitinib selectively binds the JAK1 and JAK2 isoforms with a binding affinity of −8.3 and −8.0 kcal/mol, respectively. The stabilization of ligands in the cavity of kinases occurs primarily through hydrophobic interactions. The amino acid residues of the protein globules of kinases that are responsible for the correct positioning of the drug ruxolitinib and its retention have been determined.

**Keywords:** ruxolitinib; JAK inhibitor; rheumatoid arthritis; molecular modeling

#### **1. Introduction**

Rheumatoid arthritis (RA) can lead to severe disability; therefore, timely and effective treatment is important to reduce the negative effects of functional damage (deformation and destruction) to the joints, which occurs at an early stage of RA development [1]. There are no specific RA serological markers that allow a high-precision diagnosis of RA at an early stage before the onset of clinical symptoms in current clinical practice [2]. A deep study of the molecular basis of the disease's development seems to be relevant to the development of new approaches to RA prevention and cure. The most attractive approaches are non-therapeutic ones (physical activity, diet) that, on the one hand, improve the general state of human health, and, on the other hand, reduce the negative symptoms of RA development (inflammation, pain, joint damage). The literature presents the results of studies that confirm the positive effect of long-term, high-intensity exercise on the general health of patients with rheumatoid arthritis and knee osteoarthritis [3–6]. The authors of the studies agree that the deterioration in health status in RA is exacerbated by the avoidance of exercise [5,6]. The authors of a number of studies note cases of combined progression in patients with RA and intolerance to certain food allergens [2,7–9]. Studies show a positive effect of short-term fasting, a vegan diet, and an elimination diet on improving the health of RA patients, which is likely due to a reduction in immune reactivity to certain food antigens in the gastrointestinal tract that are eliminated by changing the diet [10,11].

**Citation:** Kondratyev, M.; Rudnev, V.R.; Nikolsky, K.S.; Stepanov, A.A.; Petrovsky, D.V.; Kulikova, L.I.; Kopylov, A.T.; Malsagova, K.A.; Kaysheva, A.L. Atomic Simulation of the Binding of JAK1 and JAK2 with the Selective Inhibitor Ruxolitinib. *Int. J. Mol. Sci.* **2022**, *23*, 10466. https://doi.org/10.3390/ijms231810466

Academic Editor: Wajid Zaman

Received: 12 August 2022 Accepted: 6 September 2022 Published: 9 September 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

The activation of lymphocytes and synthesis of pro-inflammatory cytokines by macrophages, primarily interleukin 1 beta (IL1β) and tumor necrosis factor alpha (TNFα), play a key role in the inflammation of the synovial membrane in RA. These cytokines contribute to the persistence of the inflammatory process in the synovial membrane and the destruction of cartilage and bone tissue through their direct effect on synovial fibroblasts, chondrocytes, and osteoclasts. Along with the inflammatory process, cyclooxygenase-2 (COX-2) is activated, leading to an increase in the synthesis of prostaglandins. Disease-modifying antirheumatic drugs (DMARDs), which are immunosuppressive and immunomodulatory agents, are currently widely used in the clinical treatment of RA [12]. Janus-associated kinase (JAK) inhibitors are a new class of targeted synthetic DMARDs (tsDMARDs) used to treat RA [13]. tsDMARDs provide targeted inhibition of JAKs, which play a key role in the pathogenesis of RA [1]. JAK family enzymes are represented by four isoforms: JAK1, JAK2, JAK3, and tyrosine kinase 2 (TYK2), which are important signaling molecules involved in cytokine and growth factor signaling [14,15]. The cytokine receptors signal through the JAK–STAT pathway [15]. JAK molecules act together in cytokine signaling (the majority of cytokine receptors use three JAK combinations), but under certain conditions, they exhibit selectivity for one isoform (Figure 1).

**Figure 1.** Cytokines that signal through the JAK family proteins. IL-6 signaling, which plays an important role in the pathogenesis of RA, leads to the activation of JAK1, JAK2, and TYK2. Erythropoietin receptor (EpoR) signaling is mediated by JAK2 and is important in the development and deployment of reticulocytes and erythrocytes. JAK3 in combination with JAK1 is an important signaling component for cytokine receptors that share a common gamma chain (γc), such as IL-2, IL-4, IL-7, IL-9, IL-15, and IL-21.

Isoforms of the JAK family exhibit cross-activity. JAK1 plays an important role in the signaling of several pro-inflammatory cytokines, often in collaboration with other members of the JAK family. JAK2 is used mainly by receptors for hematopoietic growth factors, such as erythropoietin and thrombopoietin. JAK3 is thought to play a major role in the mediated activation of immune function, while TYK2 functions in association with JAK2 or JAK3 to transduce signals from cytokines such as interleukin-12 (IL-12) [16].

Timely and effective treatment of RA at an early stage is essential to control joint damage. Small molecule JAK inhibitors represent a new class of drugs for the treatment of rheumatoid arthritis.

Ruxolitinib (KEGG ID D09960) is a tsDMARD with the International Union of Pure and Applied Chemistry (IUPAC) name (3R)-3-cyclopentyl-3-[4-(7H-pyrrolo[2,3-d]pyrimidin-4-yl)pyrazol-1-yl]propanenitrile and a molecular weight of 306.4 Da [17]. Ruxolitinib is a pyrazole, substituted at position 1 by a 2-cyano-1-cyclopentylethyl group, and at position 3 by a pyrrolo[2,3-d]pyrimidin-4-yl group. Ruxolitinib (INCB018424) is the first potent and selective JAK1/2 inhibitor to be approved for medical use, with a half maximal inhibitory concentration (IC50) of 3.3 nM/2.8 nM in cell-free assays and >130-fold selectivity for JAK1/2 versus JAK3 [18].

Ruxolitinib is a class I molecule of the Biopharmaceutical Classification System (BCS) with high permeability, high solubility, and rapid disintegration. Ruxolitinib is primarily metabolized by the cytochrome P450 enzyme CYP3A4 (>50%), with an additional contribution from CYP2C9. At clinically significant concentrations, the drug does not inhibit CYP1A2, CYP2B6, CYP2C8, CYP2C9, CYP2C19, CYP2D6, or CYP3A4, and is not a potent inducer of CYP1A2, CYP2B6, or CYP3A4. The drug is used in myeloproliferative neoplasms and autoimmune diseases, which are known to be associated with the dysregulation of JAK1 and JAK2. This dysregulation is believed to be due to the high levels of circulating cytokines, which are associated with JAK–STAT pathway activation, JAK2V617F mutations, and the down-regulation of negative regulatory mechanisms. Ruxolitinib inhibits JAK–STAT signaling and cell proliferation in cytokine-dependent cell models of hematologic malignancies and in Ba/F3 cells rendered cytokine-independent by the expression of the JAK2V617F mutant protein, with IC50 values ranging from 80 to 320 nM [19]. Ruxolitinib has two chiral forms: S-ruxolitinib (S2902) is the S form and ruxolitinib (S1378) is the R form. One of the carbons in this molecule is asymmetric, rendering the two molecules mirror images of each other. It is probable that the biological activities of these two molecules differ. Numerous studies show that ruxolitinib, a JAK2 inhibitor, suppresses proliferation and induces apoptosis of mutated JAK2V617F cell lines [19–22].

This study was performed to model the interactions of ruxolitinib with JAK1 and JAK2 isoforms, and to determine the binding selectivity using molecular docking.

#### **2. Results**

#### *2.1. Characterization of the "Grotto" of Ligand Binding for JAK Isoforms*

Calculations show that the studied proteins and ligands bind very specifically. The van der Waals surface of the globule of each protein has a niche, or "grotto", into which each of the studied ligands fits completely. Viewed from the outside, the topology of the "grotto" and of the bound ligands appears very similar for the two proteins (Figure 2).

The "grotto" for the JAK1 isoform is formed by a fragment of the amino acid sequence 865–1154 of the JH1 domain, which exhibits tyrosine kinase activity [20]. The secondary structure of the "grotto" is formed by 13 β-strands and 17 α-helices (according to structural identification (STRIDE)), which form the fold of Scope ID d1t46a. The protein belongs to structural class d: alpha and beta proteins (a + b). The "grotto" for the JAK2 isoform is formed by a fragment of the amino acid sequence 842–1130, also of the JH1 domain. The secondary structure of the "grotto" is rich in β-strands (12) and contains 17 α-helices that form the fold of Scope ID d1u46a. This protein also belongs to structural class d: alpha and beta proteins (a + b).

The physicochemical and structural parameters of the "grotto" are similar for both isoforms. For example, the solvent-accessible surface area of the "grotto" of JAK1 and JAK2 is 29,255.34 Å<sup>2</sup> and 31,252.36 Å<sup>2</sup>, respectively (Table 1).

**Figure 2.** Topology of the "grotto" involved in the binding of low molecular weight ligands (view in two projections) in the globule of the JAK1 protein (locus 865-1154 a.a.) (**a**), JAK2 (locus 842-1130 a.a.) (**b**).

**Table 1.** Structural and physicochemical properties of "grottoes" for JAK1 and JAK2.


HB \*—number of hydrogen bonds; distance \*\*—between nonpolar a.a.

Interesting results were obtained from the analysis of possible contacts between amino acid residues involved in the formation and stabilization of the "grotto" structures. Indeed, we observe that in selected regions of amino acid sequences, approximately 40% of the sequence is made up of hydrophobic amino acid residues. An analysis of possible contacts between amino acid residues reveals a bimodal distribution of distances between the interacting hydrophobic amino acids. The number of contacts formed within one element of the secondary structure (inside-elements) is only a small fraction of probable contacts (Table 1, column "Nonpolar "inside-elements" contacts"), most of which can be attributed to the interactions of amino acid residues localized in different elements of the secondary structures (Table 1, column "Nonpolar "cross-elements" contacts"). Of note is the compactness of the structural motifs that form the "grottoes", which is due to the presence of two maxima in the distribution of distances between the interacting hydrophobic amino acid residues. The first maximum characterizing contacts between amino acid residues is observed within 7.7–7.8 Å, and the second maximum for contacts between elements of secondary structures, is observed for distances of 8.3 Å.

#### *2.2. Results of Molecular Docking*

Calculations of flexible docking make it possible to estimate the binding affinity, or "affinity". This value is the total energy indicated in kcal/mol with a negative sign, as these complexes are stable. According to our data, the affinity energy of decernotinib (known as ligand-467 from the experimentally solved 4YTH complex in PDB) for the JAK1 protein is −8.7 kcal/mol, and the same value is obtained for the KEV ligand. Ruxolitinib shows an identical binding site with a comparable affinity value of −8.3 kcal/mol (Figure 3a–c). Characterizing the binding of ligands and the JAK2 protein, we found that decernotinib demonstrates the highest affinity (−8.7 kcal/mol), while the KEV ligand binds with an affinity of −8.67 kcal/mol, and ruxolitinib with an energy of −8.0 kcal/mol. These structures are shown in Figure 3d–f.
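To give a feel for the scale of these numbers, a docking score can be read, very approximately, as a binding free energy and converted to an apparent dissociation constant through ΔG = RT ln K<sub>d</sub>. The sketch below is illustrative only: it assumes a temperature of 298.15 K and a 1 M standard state, the function name is hypothetical, and docking scores are not calibrated free energies, so the resulting values should be taken as rough orders of magnitude rather than predictions of measured potency.

```python
import math

GAS_CONSTANT_KCAL = 1.987204e-3  # gas constant R in kcal/(mol*K)


def apparent_kd(delta_g_kcal_per_mol, temperature_k=298.15):
    """Apparent dissociation constant (in M) from dG = RT ln(Kd)."""
    return math.exp(delta_g_kcal_per_mol / (GAS_CONSTANT_KCAL * temperature_k))


if __name__ == "__main__":
    # Docking scores reported in the text for ruxolitinib.
    for target, score in (("JAK1", -8.3), ("JAK2", -8.0)):
        print(target, score, f"{apparent_kd(score):.2e} M")
```

At this temperature a difference of 0.3 kcal/mol between two scores corresponds to less than a 2-fold difference in apparent K<sub>d</sub>, which is why the two isoforms are described as binding ruxolitinib with comparable affinity.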

**Figure 3.** Ligand binding site and JAK1: (**a**) decernotinib, (**b**) KEV, and (**c**) ruxolitinib; ligands and JAK2: (**d**) decernotinib, (**e**) KEV, and (**f**) ruxolitinib (Supplementary Materials, Figures S1–S6).

Amino acid residues that are important for binding to the studied ligands are detected using the LIGPLOT+ package, and the results are given below. Initially, during visual analysis of the calculated structures, we assumed that ligands with a pronounced heterocyclic nature, as well as halogen substituents, would be bound by hydrophobic interactions (residues LEU881, GLY884, and GLY962 for JAK1) or hydrogen bonds (ARG879 for JAK1) (Supplementary materials, Table S1). This would achieve a reproducible position of the ligands in the "grotto" of the protein globule. For the JAK2 protein, a similar analysis suggests that the LEU855, GLY858, and LEU983 residues are the source of hydrophobic interactions, whereas the ARG980 residue probably acts as a hydrogen bonding agent. Calculations using the LIGPLOT+ package make it possible to confirm that the main type of interactions that stabilizes the studied complexes are hydrophobic interactions (Figure 4a–c,d–f for JAK1 and JAK2, respectively).

The analysis of the presented data makes it possible to supplement the list of residues responsible for positioning the ligand in the hydrophobic pocket of the JAK1 protein. The hydrogen bond with decernotinib is formed by the GLU966 residue, while the amino acid residue LEU959 interacts with the KEV ligand and ruxolitinib. In JAK2 kinase, the LEU855, LEU983, and ASP994 residues are involved in the interactions; with the decernotinib and ruxolitinib ligands, they form only hydrophobic interactions, whereas with the KEV ligand, the aspartic acid residue forms a hydrogen bond. When analyzing the structure of these homologous proteins, a question arises about the distribution of amino acid residues over identical sites of the globule, especially over the inner surface of the niche, the "grotto", where the binding of the studied set of ligands occurs. Therefore, we calculated the alignment of the amino acid sequences of these two kinases (JAK1 and JAK2) using Clustal Omega version 1.2.4 (Conway Institute UCD, Dublin, Ireland) [23]. The results are shown in Figure 5.

It is worth noting that the globule in the PDB file of the JAK1 protein begins with the VAL865 residue and ends with LYS1154 (290 amino acids in total), while in the JAK2 protein PDB file it starts with the THR842 residue and ends with MET1130 (289 residues in total). The alignment results show a moderate homology (approximately 54%) of the amino acid sequences forming the "grotto" of the two isoforms. Despite this, the identified amino acid substitutions are predominantly conservative; for example, the replacement of a hydrophobic amino acid with another hydrophobic one (such as Val–Ile, Ile–Ala, and Leu–Ile), or of a negatively or positively charged amino acid with an amino acid of the same charge character (Asp–Glu and Arg–Lys). This determines the high conservatism of the binding of the studied ligands in the hydrophobic pocket of the globule of each of the studied kinases.
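The kind of bookkeeping behind these statements can be sketched as follows: given two aligned sequences, count identical positions and substitutions that stay within the same physicochemical group. The residue grouping, the function names and the toy sequences below are assumptions chosen to mirror the examples in the text; they do not reproduce the scoring used by the alignment software or the actual JAK1/JAK2 alignment.

```python
# Simplified residue groups (an assumption of this sketch).
RESIDUE_GROUPS = {
    "hydrophobic": set("AVILMFW"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def residue_group(residue):
    """Return the group name for a residue, or None if it is ungrouped."""
    for name, members in RESIDUE_GROUPS.items():
        if residue in members:
            return name
    return None


def summarize_alignment(aligned_a, aligned_b, gap="-"):
    """Count identical, conservative and other positions in an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    counts = {"identical": 0, "conservative": 0, "other": 0}
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # gap columns are excluded from the comparison
        if a == b:
            counts["identical"] += 1
        elif residue_group(a) is not None and residue_group(a) == residue_group(b):
            counts["conservative"] += 1
        else:
            counts["other"] += 1
    compared = sum(counts.values())
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    counts["percent_identity"] = 100.0 * counts["identical"] / compared
    return counts


if __name__ == "__main__":
    # Hypothetical toy alignment, not the JAK1/JAK2 sequences.
    print(summarize_alignment("VLDR-G", "ILEKAS"))
```

Note that the percentage depends on the denominator convention (gap-free columns here; other tools may use the alignment length or the shorter sequence), so values from different programs are not always directly comparable.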

**Figure 4.** Interaction diagrams of JAK1 kinase with decernotinib (**a**), with KEV ligand (**b**), and with ruxolitinib (**c**); and of JAK2 kinase with decernotinib (**d**), with KEV ligand (**e**), and with ruxolitinib (**f**) (Supplementary materials, Figures S7–S12).


**Figure 5.** Amino acid sequence alignment (FASTA) results for JAK1 and JAK2 kinases. The symbols under the sequences indicate identity ("\*"), conservative substitutions (":") and semi-conservative substitutions (".").

#### **3. Discussion**

The study of the molecular basis of pathological processes in the human body has been a promising area of biomedical research for several decades. Although clinical practice, diagnosis, and treatment planning often do not require knowledge of the etiology or pathogenesis of a disease, awareness of events at the molecular and cellular level is nevertheless needed. This allows us to propose new approaches for determining risk groups among conditionally healthy people and to search for attractive therapeutic targets. The present study contributes to understanding the architecture of ligand binding to JAK family proteins and the molecular basis of RA development.

The high conservation of JAK1 and JAK2 amino acid sequences in the kinase activity domain (JH1) determines the structural and physicochemical similarities of most of the targeted kinase inhibitors (ligands). Understanding the molecular basis of ligand and target specificity is important for identifying new drugs and inhibitors of various types of kinases [24]. In the present study, a structural analysis of JAK1 and JAK2 isoform motifs involved in the selective binding of ruxolitinib was performed.

There are few studies focusing on the analysis of the structural features of ruxolitinib binding to kinases. Duan (2014) reported, for the first time, the structure of the active domain of chicken c-Src kinase (residues 251–533) in complex with ruxolitinib at a resolution of 2.26 Å; ruxolitinib has a lower selectivity for c-Src compared to JAK1 [24]. The identity between the amino acid sequences of the kinase domains of JAK1 and c-Src is 34%. The authors observed that the pyrrolopyrimidine rings of ruxolitinib are oriented towards the hinge region. The ligand in the JAK1 cavity is stabilized by two hydrogen bonds with Glu957 and Leu959. Simultaneously, the cyclopentane ring is oriented towards the N-terminus of JAK1, while the propanenitrile group of ruxolitinib interacts with JAK1 [23]. In a recent study by Babu (2022), JAK1 binding selectivity was screened for a set of 52 C-2 methyl/hydroxyethylimidazopyrrolopyridine derivatives, including ruxolitinib [25]. Ruxolitinib forms hydrogen bonds with Leu959 and Glu957 in the JAK1 isoform.

The results obtained in the present study show that the pyrazole derivative ruxolitinib binds with a high affinity to the JAK1 and JAK2 isoforms (Table 2).


**Table 2.** Steric and energetic parameters of kinase binding to ligands.

\*—number of hydrogen bonds between the target and the ligand; \*\*—number of hydrophobic bonds with distances between the contacting amino acid residues of the target and the ligand less than 10 Å.

The drug ruxolitinib shows similar selectivity for both the JAK isoforms: −8.3 and −8.0 kcal/mol for JAK1 and JAK2, respectively. This result is indirectly confirmed by the literature data (IC50 3.3 ± 1.2 nM for JAK1 and 2.8 ± 1.2 nM for JAK2). The stabilization of the ligands studied in this work, characterized by similar physicochemical properties, was performed primarily by hydrophobic interactions in the JAK "grotto". In the case of decernotinib and KEV ligands, one hydrogen bond was involved in stabilizing the ligand–target complex.
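To give a sense of scale for the 0.3 kcal/mol difference between the two docking scores, the following sketch converts a difference in binding free energy into a ratio of association constants through the standard relation ΔG = −RT ln K. The temperature of 298.15 K is an assumption made for illustration, and docking scores are only approximate estimates of binding free energy, so the result should be read as an order-of-magnitude guide.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def affinity_ratio(dg_a, dg_b, temperature=298.15):
    """Ratio K_a / K_b of association constants implied by two binding
    free energies (kcal/mol), using dG = -RT ln K."""
    return math.exp(-(dg_a - dg_b) / (R_KCAL * temperature))


if __name__ == "__main__":
    # Docking scores quoted in the text for JAK1 and JAK2.
    ratio = affinity_ratio(-8.3, -8.0)
    print(f"Affinity ratio JAK1/JAK2 implied by the scores: {ratio:.2f}")
```

Under these assumptions, the difference corresponds to less than a twofold change in affinity, which is in line with the similar IC50 values cited for the two isoforms.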

In our study, we refined the localization of ruxolitinib in the JAK1 and JAK2 "grottoes" with a structural resolution of 1.33 Å and 2.04 Å, respectively. In both the isoforms, ruxolitinib is completely located in the "grotto". For the JAK1 isoform, the pyrrolopyrimidine and pyrazole rings form a single plane, which is oriented towards the N-terminus of JAK1 and is parallel to the plane of the Leu875–Gly884 β-hairpin in the "grotto" β-barrel. The plane of the cyclopentane ring is approximately perpendicular to the plane of the pyrrolopyrimidine and pyrazole rings, and probably forms a hydrophobic interaction with the Gly884 amino acid residue in the same β-hairpin. The propanenitrile group forms the third plane, which passes through the line of intersection of the first two planes, the plane of the pyrrolopyrimidine ring and the plane of the cyclopentane ring, and is oriented towards the Asn1008–Ile1019 β-hairpin.

In the "grotto" of the JAK2 isoform, the pyrrolopyrimidine and pyrazole rings are located in intersecting planes. The plane of the pyrrolopyrimidine ring, similar to the JAK1 isoform, is oriented towards the N-terminus of JAK2 and is parallel to the plane of the Asn859–Tyr868 β-hairpin in the "grotto" β-barrel. The pyrrolopyrimidine ring forms a single plane with the cyclopentane ring, which is oriented towards the Tyr913–Glu930 β-hairpin. The propanenitrile group forms a third plane oriented towards the Ile982–Ile992 β-hairpin.

For both the JAK isoforms, we observe that the studied complexes are stabilized by hydrophobic interactions. Hydrogen interactions between the ligand and target are not revealed in our study.

#### **4. Materials and Methods**

#### *4.1. Objects of Study*

Flexible molecular docking was performed for the ligands: (a) decernotinib (PDB ID 4YTH), (b) KEV with the chemical name N-[3-(5-chloro-2-methoxyphenyl)-1-methyl-1H-pyrazol-4-yl]-2-methyl-2H-pyrazolo[4,3-c]pyridine-7-carboxamide (PDB ID 6N7A) [26], and (c) ruxolitinib, a selective antitumor inhibitor of JAK kinase (Figure 6).

**Figure 6.** Structural formula of the ligand of (**a**) decernotinib (ligand-467), (**b**) KEV (ligand is drug 39 from), (**c**) ruxolitinib.

Decernotinib (VX-509) is a JAK3 inhibitor (Table 3). This experimental drug is characterized by high selectivity for the JAK3 isoform.

The main objective of our study was to identify the ruxolitinib ligand, an antitumor agent and a selective inhibitor of the JAK1 and JAK2 isoforms. These kinases facilitate the signaling of numerous cytokines and growth factors that play an important role in hematopoiesis and immune system function (Table 3).


**Table 3.** Physicochemical properties of the ligands used in the work.

\*—hydrogen bond; \*\*—topological polar surface area; 3\*—stereocenter; 4\*—cell-free assay.

As shown in Table 3, the physicochemical properties of the ligands are similar. Two isoforms were chosen as targets for modeling: JAK1 (PDB ID 6N7A) and JAK2 (PDB ID 4YTH) (Table 4).

**Table 4.** Structures of JAK1 and JAK2 proteins used for redocking with ruxolitinib.


#### *4.2. Molecular Docking*

Data preparation for docking calculations was carried out according to standard protocols [27–29]. Ligand–target binding and binding energy rankings were performed using the AutoDock VINA 1.1.2 software package [30]. The results were visualized using AutoDockTools 1.5.6. RC3 [18]. The projection schemes of the ligand–receptor interactions were drawn using LIGPLOT 1.3.6 [28].

The spatial structures of the target protein models obtained from PDB were purified from the solvent, ligand, and buffer molecules. In the MGLTools package (CCSB, San Diego, CA, USA), charges were placed on the proteins, which made them a receptor, a target for searching for optimal binding sites of the studied ligands [31]. The structures of the ligands were created using the HyperChem molecular constructor (http://www.hypercubeusa.com/) and sequentially optimized, first by using the Assisted Model Building with Energy Refinement (AMBER) force field [32], and then quantum-chemically optimized using parametric method 3 (PM3) [33]. The arrangement of charges on the ligand molecule and its protonation/deprotonation were carried out automatically using the MGLTools 1.5.6 package [31]. The calculations were performed on a server using IntelXEON processors (40 cores).
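As an illustration of how a docking run of this kind is typically parameterized, the sketch below renders the text of an AutoDock Vina configuration file. The option names (`receptor`, `ligand`, `center_*`, `size_*`, `exhaustiveness`, `num_modes`) are standard Vina settings; the file names, grid-box centre, and box size are hypothetical placeholders, since the source does not report the search-space parameters used.

```python
def vina_config(receptor, ligand, center, size,
                exhaustiveness=8, num_modes=9):
    """Return the text of an AutoDock Vina configuration file.

    center and size are (x, y, z) tuples in angstroms describing the
    search box. All values passed in the example below are placeholders.
    """
    cx, cy, cz = center
    sx, sy, sz = size
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx}",
        f"center_y = {cy}",
        f"center_z = {cz}",
        f"size_x = {sx}",
        f"size_y = {sy}",
        f"size_z = {sz}",
        f"exhaustiveness = {exhaustiveness}",
        f"num_modes = {num_modes}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical file names and box; not taken from the study.
    print(vina_config("jak1_receptor.pdbqt", "ruxolitinib.pdbqt",
                      center=(0.0, 0.0, 0.0), size=(20.0, 20.0, 20.0)))
```

The receptor and ligand files would be the PDBQT files produced by the preparation steps described above.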

The task of analyzing the docking results was significant. The LIGPLOT+ package was used to obtain "sweeps" of interactions [34]. From the three-dimensional coordinates of the atoms of the protein–ligand complex, interaction diagrams were generated that depict the schemes of hydrogen bonds, as well as hydrophobic contacts between the ligand and elements of the main or side chain of the protein.

#### **5. Conclusions**

Molecular modeling data show that the drug ruxolitinib selectively binds the JAK1 and JAK2 isoforms with a binding affinity of −8.3 and −8.0 kcal/mol, respectively. In both the JAK isoforms, ruxolitinib resides entirely in the "grotto" or binding cavity. The stabilization of the ligand in the "grotto" of kinases is performed primarily by hydrophobic interactions, including 18 JAK1–ligand contacts and 11 JAK2–ligand contacts. For the first time, we detail the amino acid residues of the JH1 domains of protein globules, which are responsible for the correct positioning of the ligand and its retention. Arg879, Leu881, Gly884, and Gly962 are involved in the formation of the JAK1–ligand complex; hydrophobic amino acid residues Leu855, Gly858, Arg980, and Leu983 are also involved in the formation of the JAK2–ligand complex.

**Supplementary Materials:** The supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms231810466/s1.

**Author Contributions:** Conceptualization, A.L.K.; methodology, M.K. and D.V.P.; software, A.A.S.; validation, K.S.N. and A.T.K.; formal analysis, V.R.R. and K.A.M.; investigation, V.R.R. and L.I.K.; data curation, A.L.K.; writing—original draft preparation, A.L.K. and L.I.K.; writing—K.A.M.; project administration, A.L.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by the Russian Science Foundation, grant number 21-14-00381.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Article* **Aurisin A Complexed with 2,6-Di-***O***-methyl-**β**-cyclodextrin Enhances Aqueous Solubility, Thermal Stability, and Antiproliferative Activity against Lung Cancer Cells**

**Thanapon Charoenwongpaiboon <sup>1</sup>, Amy Oo <sup>2</sup>, Sutita Nasoontorn <sup>3</sup>, Thanyada Rungrotmongkol <sup>2,4</sup>, Somdej Kanokmedhakul <sup>5</sup> and Panupong Mahalapbutr <sup>3,</sup>\***


**Abstract:** Aurisin A (AA), an aristolane dimer sesquiterpene isolated from the luminescent mushroom *Neonothopanus nambi*, exhibits various biological and pharmacological effects. However, its poor solubility limits its use for further medicinal applications. This study aimed to improve the water solubility of AA via complexation with β-cyclodextrin (βCD) and its derivatives (2,6-di-*O*-methyl-βCD (DMβCD) and 2-hydroxypropyl-βCD (HPβCD)). A phase solubility analysis demonstrated that the solubility of AA increased linearly with increasing concentrations of βCDs (ranked in the order of AA/DMβCD > AA/HPβCD > AA/βCD). Notably, βCDs, especially DMβCD, increased the thermal stability of the inclusion complexes. The thermodynamic study indicated that the complexation between AA and βCD(s) was a spontaneous endothermic reaction, and AA/DMβCD possessed the highest binding strength. The complex formation between AA and DMβCD was confirmed by means of FT-IR, DSC, and SEM. Molecular dynamics simulations revealed that the stability and compactness of the AA/DMβCD complex were higher than those of the DMβCD alone. The encapsulation of AA led to increased intramolecular H-bond formations on the wider rim of DMβCD, enhancing the complex stability. The antiproliferative activity of AA against A549 and H1975 lung cancer cells was significantly improved by complexation with DMβCD. Altogether, the satisfactory water solubility, high thermal stability, and enhanced antitumor potential of the AA/DMβCD inclusion complex would be useful for its application as healthcare products or herbal medicines.

**Keywords:** Aurisin A; beta-cyclodextrins; inclusion complex; lung cancer

#### **1. Introduction**

Aurisin A (AA, Figure 1A) is an aristolane dimer sesquiterpene isolated from the luminescent mushroom *Neonothopanus nambi* Speg. (Marasmiaceae), which is normally found on logs or dead wood in broad-leaved forests in the northeast of Thailand [1]. This compound exhibits various biological and pharmacological activities, including antimycobacterial activity toward *Mycobacterium tuberculosis*, antimalarial property against *Plasmodium falciparum* [1], and anticancer potential toward lung cancer cells (NCI-H187 and A549) [1,2], breast cancer cells (BC1), epidermoid carcinoma cells (KB), cholangiocarcinoma cells (KKU-100, KKU-139, KKU-156, KKU-213, and KKU-214) [1], and cervical cancer cells (HeLa, CaSki, and SiHa), with no cytotoxic effect on normal white blood cells [3]. AA exerts its anticancer effects by (i) inhibiting cancer cell growth and migration and (ii) inducing cell cycle arrest and apoptosis through activating caspase-3/9 as well as decreasing the expression of cyclin D1, cyclin-dependent kinase 2/4 (Cdk-2/4), B-cell lymphoma 2 (Bcl-2), epidermal growth factor receptor (EGFR), phosphorylated p38 (pp38), and vascular endothelial growth factor (VEGF) [2,3]. Although AA possesses promising biological and pharmacological activities, its low water solubility limits its use for further applications as herbal medicines or healthcare products.

**Citation:** Charoenwongpaiboon, T.; Oo, A.; Nasoontorn, S.; Rungrotmongkol, T.; Kanokmedhakul, S.; Mahalapbutr, P. Aurisin A Complexed with 2,6-Di-*O*-methyl-β-cyclodextrin Enhances Aqueous Solubility, Thermal Stability, and Antiproliferative Activity against Lung Cancer Cells. *Int. J. Mol. Sci.* **2022**, *23*, 9776. https://doi.org/10.3390/ijms23179776

Academic Editor: Wajid Zaman

Received: 21 July 2022 Accepted: 22 August 2022 Published: 29 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Figure 1.** Chemical structures of (**A**) AA and (**B**) βCD and its derivatives (DMβCD and HPβCD), where the functional substitutions used in this study are shown below.

Cyclodextrin (CD) is a cyclic oligosaccharide composed of glucose units linked by α-1,4 glycosidic bonds. Natural CDs consist of six, seven, or eight glucose units, namely alpha-cyclodextrin (αCD), beta-cyclodextrin (βCD), and gamma-cyclodextrin (γCD), respectively [4]. CD adopts a truncated cone shape with a hydrophilic outer surface and a hydrophobic inner cavity. Hosting lipophilic guest molecules in the central cavity of CD tremendously improves their physicochemical properties [5–7], making CD the most frequently used excipient in pharmaceutical applications [8]. Among the three natural CDs, βCD (Figure 1B) has been widely used because of its suitable cavity size, commercial availability, desirable drug loading capacity, biocompatibility, and low price [9,10]. However, due to the limited solubility of βCD, derivatives of βCD, such as 2-hydroxypropyl-βCD (HPβCD) and 2,6-di-*O*-methyl-βCD (DMβCD), have been developed to improve water solubility and reduce the limitations of the parent βCD [11,12]. Many lines of evidence have shown that the water solubility, chemical stability, and biological activity of poorly soluble compounds are significantly enhanced by complexation with βCD derivatives [13–17]. However, inclusion complexation between AA and βCD(s) has not previously been reported.

In the present study, we aimed to enhance the water solubility, stability, and anticancer activity of AA by inclusion complexation with βCD and its derivatives (DMβCD and HPβCD). The obtained inclusion complex was then confirmed experimentally and theoretically using physical and chemical characterization techniques as well as molecular modeling studies. In addition, the anticancer potential of the inclusion complex was evaluated. We hope that the improved physical and biological properties of AA/βCD(s) inclusion complex could pave the way for the further development of AA as herbal medicines or healthcare products.

#### **2. Results and Discussion**

#### *2.1. Phase Solubility, Thermodynamic Parameters, and UV-Vis Spectra Analyses*

The phase solubility diagrams of AA in aqueous solutions of βCD, DMβCD, and HPβCD at 20, 30, 40, and 50 °C are shown in Figure 2. The results revealed that the solubility of AA increased linearly with increasing concentrations of βCDs (ranked in the order of AA/DMβCD > AA/HPβCD > AA/βCD). This linear relationship is characteristic of A<sub>L</sub>-type solubility, indicating a 1:1 host–guest complexation [18–20]. Next, the stability constant (*Kc*) was calculated from the phase solubility diagrams to estimate the binding strength of all the studied inclusion complexes. As shown in Table 1, the highest *Kc* value was found for AA/DMβCD (209–237 M<sup>−1</sup>), followed by AA/HPβCD (88–148 M<sup>−1</sup>) and AA/βCD (42–80 M<sup>−1</sup>). These findings are consistent with many lines of evidence demonstrating that βCD derivatives, especially DMβCD, could significantly improve the stability and solubility of several poorly soluble compounds [15,21–23]. Interestingly, increasing the temperature remarkably enhanced the stability (the *Kc* index) of all the investigated complexes.

**Figure 2.** Phase solubility diagrams of AA in complex with βCD, DMβCD, and HPβCD at 20 °C, 30 °C, 40 °C, and 50 °C. Data are expressed as mean ± SEM of three independent experiments.


**Table 1.** Stability constant (*Kc*) of the AA/βCDs inclusion complexes at different temperatures.

To obtain the thermodynamic parameters (i.e., Δ*H*, Δ*S*, and Δ*G*) for the AA/βCD(s) inclusion complexation process, the Van't Hoff plot based on Equation (3) was then employed (Figure S1). As depicted in Table 2, the Δ*H* values were positive for all systems, indicating that the inclusion complex formation was an endothermic process [14]. As expected, the inclusion complex formation between AA and βCD(s) was spontaneous, as evidenced by the negative sign of Δ*G*. The lowest Δ*G* value was detected in AA/DMβCD (−3.24 kcal/mol), followed by AA/HPβCD (−2.82 kcal/mol) and AA/βCD (−2.42 kcal/mol), which is consistent with the aforementioned *Kc* values (Table 1).
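The consistency between the reported Δ*G* values and the *Kc* ranges can be checked with a short calculation. The sketch below assumes the standard relation Δ*G* = −*RT* ln *Kc* and uses the gas constant and temperature quoted in the footnote of Table 2; it is an illustrative cross-check, not the authors' fitting procedure.

```python
import math

R = 1.985e-3  # kcal/(mol*K), value quoted in the Table 2 footnote
T = 303.0     # K, temperature quoted in the Table 2 footnote


def kc_from_delta_g(delta_g):
    """Stability constant (M^-1) implied by a binding free energy in
    kcal/mol, assuming dG = -RT ln Kc."""
    return math.exp(-delta_g / (R * T))


# Reported dG values and the Kc ranges reported over 20-50 degrees C.
REPORTED = {
    "AA/DMbCD": (-3.24, (209, 237)),
    "AA/HPbCD": (-2.82, (88, 148)),
    "AA/bCD": (-2.42, (42, 80)),
}

if __name__ == "__main__":
    for name, (dg, (low, high)) in REPORTED.items():
        kc = kc_from_delta_g(dg)
        print(f"{name}: implied Kc = {kc:.0f} M^-1 "
              f"(reported range {low}-{high} M^-1)")
```

For each of the three complexes, the implied stability constant falls within the range reported in the text.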


**Table 2.** Thermodynamic values for the inclusion complexation between AA and βCD(s).

<sup>a</sup> Data were derived from Van't Hoff plots using R = 1.985 × 10<sup>−3</sup> kcal·mol<sup>−1</sup>·K<sup>−1</sup> and T = 303 K.

According to the UV-Vis spectra analysis (Figure 3), the maximum absorption of AA (325 nm) bathochromically shifted to 328–332 nm in all the studied complexes, indicating a possible interaction between AA and βCD(s). Similar bathochromic shifts of ligands after complexation were also found for luteolin and *trans*-ferulic acid in complex with βCD derivatives [24,25]. Notably, the highest absorbance was detected in the AA/DMβCD complex (0.91), followed by AA/HPβCD (0.53), AA/βCD (0.43), and AA alone (0.27), which is consistent with the phase solubility study mentioned above (Figure 2).

**Figure 3.** UV-Vis spectra of AA, inclusion complexes, and the free form of βCDs.

Taken together, only the AA/DMβCD complex possessing the highest stability and solubility was selected for further structural characterizations in comparison with the free form of AA and DMβCD.

#### *2.2. Inclusion Complex Characterization*

#### 2.2.1. FT-IR

FT-IR spectroscopy was used to determine the inclusion complex formation between AA and DMβCD. The obtained FT-IR spectra of the AA, DMβCD, and AA/DMβCD complex are shown in Figure 4. The spectrogram of AA showed characteristic stretching vibration peaks at (i) 1669 and 1634 cm<sup>−1</sup> for C=O, (ii) 1561 cm<sup>−1</sup> for C=C, (iii) 2954 cm<sup>−1</sup> for C−H, (iv) 1272 and 1206 cm<sup>−1</sup> for C−O, and (v) 3554 cm<sup>−1</sup> for O−H [1]. The FT-IR spectrum of DMβCD demonstrated a large band at 3399 cm<sup>−1</sup> (O−H stretching), 2923 cm<sup>−1</sup> (C−H stretching), and 1156, 1085, and 1045 cm<sup>−1</sup> (C−O and C−H stretching) [26]. After the inclusion complexation, the FT-IR spectrum of AA/DMβCD was distinctly different from that of the pure AA and DMβCD. The characteristic stretching vibration peaks of AA at 1272, 1206, 1561, 1634, and 1669 cm<sup>−1</sup> totally disappeared in the FT-IR spectrum of the inclusion complex, which is similar to other reported hydrophobic compounds in complex with the βCD analogs [14,25,27]. This might be due to a restriction of the AA's C=O and C=C stretching vibrations, as well as a modification of the hydrophobic microenvironment inside the DMβCD cavity [28]. Moreover, the changes in the shape and position of the absorption bands of AA and DMβCD were observed in the AA/DMβCD complex. The vibration peaks of the C−H and O−H stretching of AA (2954 and 3554 cm<sup>−1</sup>) and DMβCD (2923 and 3399 cm<sup>−1</sup>) were shifted to 2925 and 3402 cm<sup>−1</sup> in the solid complex. Similarly, the vibration peaks of the C−O and C−H stretching of DMβCD (1156, 1085, and 1045 cm<sup>−1</sup>) were redshifted to 1155, 1084, and 1044 cm<sup>−1</sup> after the complex formation. Altogether, these FT-IR results indicated that AA was completely embedded in the hydrophobic cavity of DMβCD, which is supported by the differential scanning calorimetry (DSC), scanning electron microscope (SEM), and molecular dynamics (MD) simulation results, as discussed later.

**Figure 4.** FT-IR spectra of the AA, DMβCD, and AA/DMβCD complex.

#### 2.2.2. Thermal Analysis

The thermal properties of the AA, DMβCD, and AA/DMβCD complex were characterized in a solid state using DSC analysis. As shown in Figure 5, the characteristic endothermic/exothermic peaks of the free compounds were as follows: (i) AA at 93.0, 213.9, 229.7, and 284.1 °C and (ii) DMβCD at 51.8 °C. The endothermic peaks found at 213.9 and 229.7 °C corresponded to the melting point of AA, as previously reported [1], whereas the broad endothermic peak of DMβCD detected at 51.8 °C indicates the release of water molecules from the DMβCD's hydrophobic inner cavity [29]. In the thermogram of the freeze-dried AA/DMβCD inclusion complex, the characteristic thermal peaks of AA and DMβCD totally disappeared, coinciding with the appearance of a new endothermic peak at 72.0 °C and an exothermic peak at 198.7 °C, similar to other reports [15,16,23,26,30]. These findings indicated that the freeze-drying method successfully yielded the new solid phase between AA and DMβCD.

**Figure 5.** DSC thermograms of the AA, DMβCD, and AA/DMβCD complex.

#### 2.2.3. SEM

Many lines of evidence have shown that the inclusion complexation process significantly changes the surface textures of the resulting products [27,31–33]. To visualize the surface morphology of all the studied compounds, the SEM technique was employed. The SEM photographs of the AA, DMβCD, and AA/DMβCD complex are given in Figure 6. Both AA and DMβCD presented a rod-like structure [23,30]; however, the AA particles were larger and more spherical than those of DMβCD. Upon molecular complexation, the surface morphology of the obtained freeze-dried AA/DMβCD inclusion complex, appearing as a plate-like structure, was different from that of the pure forms. These findings confirmed the successful complex formation between AA and DMβCD. Taken together, all of the structural characterization results (Figures 4–6) revealed that the inclusion complex between AA and DMβCD was successfully formed.

**Figure 6.** SEM photographs of the AA, DMβCD, and AA/DMβCD complex.

#### *2.3. Molecular Modeling Studies*

To further verify the aforementioned experimental results and to investigate the dynamic behavior of the AA/DMβCD inclusion complex at the atomic level, all-atom MD simulations in an aqueous solution and free energy calculations based on the molecular mechanics/Poisson–Boltzmann surface area (MM/PBSA) were performed.

#### 2.3.1. System Stability

The stability of DMβCD and its inclusion complex, AA/DMβCD, along the simulation time was determined using the calculated time evolution of root-mean-square displacement (RMSD), radius of gyration (Rg), and number of atomic contacts (# Atom contacts). As shown in Figure 7A, the RMSD values of DMβCD (~2–3 Å) were higher than those of the AA/DMβCD complex (~1 Å), suggesting that the AA/DMβCD is more stable than the uncomplexed DMβCD. Similarly, the Rg values of DMβCD (~6.6–7.0 Å) were larger than those of the AA/DMβCD complex (~6.4–6.6 Å), indicating higher compactness of the AA/DMβCD structure, as evidenced by the final MD snapshots (Figure 7B). Although the # Atom contacts (native + nonnative) was highly stable over the simulation time (~100–120) for the three independent simulations, high RMSD fluctuations were detected in the first 150 ns of the simulations. Therefore, in this work, the MD trajectories from 200–300 ns were extracted for further structural and energetic analyses. From Figure 7B, we found that the AA molecule was completely embedded in the hydrophobic cavity of DMβCD, where the C=O groups of AA at C1 and C1′ (Figure 1A) were located at the center of DMβCD's cavity, while those at C8 and C8′ were positioned near the secondary and primary rims of DMβCD. This complete formation of the AA/DMβCD complex is consistent with the results of the inclusion complex characterization, as mentioned above.
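For readers unfamiliar with the two structural metrics used above, the following minimal sketch shows how RMSD and Rg are defined for a set of atomic coordinates. It makes two simplifying assumptions that a production trajectory analysis would not: the structures are taken as already superimposed on the reference, and all atoms are given equal mass. The coordinates in the example are toy values, not simulation data.

```python
import math


def rmsd(coords, reference):
    """Root-mean-square deviation between two equally sized lists of
    (x, y, z) tuples; no superposition is performed (assumption)."""
    n = len(coords)
    squared = sum((a - b) ** 2
                  for p, q in zip(coords, reference)
                  for a, b in zip(p, q))
    return math.sqrt(squared / n)


def radius_of_gyration(coords):
    """Radius of gyration with equal atomic masses (assumption)."""
    n = len(coords)
    center = [sum(p[i] for p in coords) / n for i in range(3)]
    squared = sum((p[i] - center[i]) ** 2
                  for p in coords for i in range(3))
    return math.sqrt(squared / n)


if __name__ == "__main__":
    # Toy coordinates in angstroms, for illustration only.
    frame = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    shifted = [(1.0, 0.0, 0.0), (2.0, 1.0, 1.0)]
    print("RMSD:", rmsd(frame, shifted))
    print("Rg:", radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))
```

A lower RMSD indicates smaller structural drift from the reference, and a smaller Rg indicates a more compact structure, which is how the values in Figure 7A are interpreted.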

**Figure 7.** (**A**) Time evolution of RMSD, Rg, and # Atom contacts of DMβCD and AA/DMβCD for three independent simulations (MD1–3). (**B**) Final MD snapshot of DMβCD and AA/DMβCD.

#### 2.3.2. DMβCD Conformation upon AA Binding

The conformational changes of DMβCD upon AA encapsulation were investigated by calculating (i) the distance of the oxygen atoms on the wider rim of DMβCD (O3(n)–O2(n+1), dO3-2), corresponding to a possibility of an intramolecular hydrogen bond (H-bond) formation (dO3-2 ≤ 3.5 Å), and (ii) the distance of glycosidic oxygen atoms (O4(n)–O4(n+1), dO4-4). Afterward, these two parameters were converted to the free energy value, F(*x*,*y*), using Equation (1):

$$F(x, y) = -k_{\mathrm{B}} T \log[P(x, y)] \tag{1}$$

where *k*<sup>B</sup> is the Boltzmann constant, *T* is the temperature (303 K), and *P*(*x*,*y*) is the probability of dO3-2 (*x*) and dO4-4 (*y*). The obtained 2D free energy landscape is shown in Figure 8. When compared to the unbound form of DMβCD, the molecular encapsulation of AA in the DMβCD could enhance the formation of intramolecular H-bonds on the wider rim of DMβCD, as evidenced by the increased population of dO3-2 at ~3.0–3.5 Å. The H-bond-operated conformational changes of DMβCD upon the AA encapsulation are similar to other reported mansonones, pinostrobin, luteolin, pinocembrin, and neral in complex with various βCD derivatives [23,34–37]. In addition, the populations of dO3-2 at ~3.0–5.0 Å and dO4-4 at ~3.0–4.0 Å of the free form of DMβCD totally disappeared in the AA/DMβCD complex due to the adaptation of the DMβCD structure upon the insertion of AA to the hydrophobic cavity.
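The conversion from sampled distance pairs to a free energy landscape via Equation (1) can be sketched as follows. The sketch bins the (dO3-2, dO4-4) pairs on a regular grid, estimates *P*(*x*,*y*) as the fraction of samples in each bin, and applies the natural logarithm. The bin width, the molar value of the Boltzmann constant in kcal/(mol·K), and the sample distances are assumptions made for illustration; the source does not specify the binning used.

```python
import math
from collections import Counter

K_B = 1.987e-3  # Boltzmann constant per mole, kcal/(mol*K) (assumed units)


def free_energy_landscape(xs, ys, bin_width=0.5, temperature=303.0):
    """Map each occupied (x, y) bin to F = -kB*T*ln(P), where P is the
    fraction of samples falling into that bin."""
    counts = Counter(
        (math.floor(x / bin_width), math.floor(y / bin_width))
        for x, y in zip(xs, ys)
    )
    total = sum(counts.values())
    kt = K_B * temperature
    return {bin_index: -kt * math.log(count / total)
            for bin_index, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical distances in angstroms, not simulation output.
    d_o3_o2 = [3.1, 3.1, 3.1, 4.6]
    d_o4_o4 = [4.4, 4.4, 4.4, 4.4]
    for bin_index, energy in sorted(
            free_energy_landscape(d_o3_o2, d_o4_o4).items()):
        print(bin_index, round(energy, 3), "kcal/mol")
```

Bins visited more often receive a lower free energy, so a growing population at dO3-2 of ~3.0–3.5 Å appears as a deeper basin in the landscape.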

**Figure 8.** 2D free energy (F(*x*,*y*)) landscape of the intramolecular hydrogen bond distances, O3(n)–O2(n+1), against the adjacent glycosidic oxygen distances, O4(n)–O4(n+1), of the DMβCD (**top**) and AA/DMβCD complex (**bottom**) for three independent simulations (MD1–3).

#### 2.3.3. Water Accessibility toward the Inclusion Complex

The water distribution within a spherical radius *r* of the oxygen atoms of AA (Figure 9A) was analyzed using the radial distribution function (RDF, *g*(*r*)), and the obtained results are given in Figure 9B. In addition, the integration number (*n*(*r*)) values at the first minima, corresponding to the number of water molecules approaching the targeted oxygens, are given in Table 3. From the RDF plots of all the systems, no dominant peak was detected within ~3 Å of the O, O1, O2, and O2′ of AA (Figure 9B), indicating that these oxygen atoms were deeply embedded in the hydrophobic inner cavity of DMβCD (Figure 9A). This phenomenon is in good agreement with the previously reported flavonoids, demonstrating that their oxygen atom on the chromone ring, embedded at the center of the βCD cavity, displayed no sharp RDF peak at the first solvation shell [36,38]. In contrast, the other oxygen atoms (O1′, O8, O8′, O9, and O9′) displayed the first sharp peak at ~2.5 Å, corresponding to the water distribution around these oxygens. The O8 and O8′ of AA exhibited higher water accessibility than the other oxygens, suggesting that these oxygen atoms were positioned near either the secondary or primary rim and were readily accessible to the water molecules (Figure 9A).
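The integration number follows from the RDF through the standard relation *n*(*r*) = 4πρ ∫ *g*(*r*′) *r*′² d*r*′, integrated from zero to *r*. The sketch below evaluates it with the trapezoidal rule. The bulk water number density of 0.0334 molecules/Å³ is an assumed textbook value, and the uniform *g*(*r*) in the example is a synthetic test profile rather than data from the simulations.

```python
import math

WATER_DENSITY = 0.0334  # molecules per cubic angstrom (assumed bulk value)


def integration_number(r, g, density=WATER_DENSITY):
    """Trapezoidal estimate of n(r) = 4*pi*rho * integral of g(r')*r'^2 dr'
    over the tabulated range; truncate r and g at the first minimum
    before calling to obtain the first-shell value."""
    total = 0.0
    for i in range(1, len(r)):
        left = g[i - 1] * r[i - 1] ** 2
        right = g[i] * r[i] ** 2
        total += 0.5 * (left + right) * (r[i] - r[i - 1])
    return 4.0 * math.pi * density * total


if __name__ == "__main__":
    # Synthetic check: a uniform fluid (g = 1) out to 3 angstroms should
    # contain (4/3)*pi*rho*r^3 molecules.
    r_values = [0.01 * i for i in range(301)]
    g_values = [1.0] * len(r_values)
    n_uniform = integration_number(r_values, g_values)
    expected = 4.0 / 3.0 * math.pi * WATER_DENSITY * r_values[-1] ** 3
    print(f"numerical: {n_uniform:.4f}, analytical: {expected:.4f}")
```

An oxygen atom buried in the cavity has *g*(*r*) close to zero at short distances, so its integration number up to the first minimum stays small.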

**Figure 9.** (**A**) Final MD snapshot of the AA/DMβCD complex showing the oxygen atoms (O, O1, O1′, O2, O2′, O8, O8′, O9, and O9′) of AA and the surrounding water molecules (cyan dot) within 5 Å of AA. (**B**) RDF of water oxygen atoms around the oxygen atoms of AA in complex with DMβCD for three independent simulations (MD1–3).

**Table 3.** *n*(*r*) up to the first minimum for the oxygen atoms of AA inside the DMβCD hydrophobic cavity.


#### 2.3.4. Binding Affinity of the Inclusion Complex

To estimate the binding affinity of the AA/DMβCD inclusion complex, the MM/PBSA calculation was performed using 100 snapshots taken from the last 100 ns of the MD simulations. As expected, due to the poor solubility of AA, the inclusion complexation in the gas phase was driven mainly by van der Waals (vdW) interactions (Δ*E*vdW = −35.12 ± 1.06 kcal/mol) rather than electrostatic attraction (Δ*E*ele = −17.05 ± 0.19 kcal/mol). Similarly, the sum of the Δ*G*solv,non-polar + Δ*E*vdW energies (−40.64 ± 1.12 kcal/mol) was negative, whereas that of the Δ*G*solv,polar + Δ*E*ele energies (12.64 ± 0.28 kcal/mol) was positive, indicating that the vdW forces play an important role in the complex formation between AA and DMβCD in an aqueous environment. This vdW-driven host–guest complexation process is consistent with other lipophilic ligands in complex with βCDs [39–42]. Notably, the predicted Δ*G*bind,MM/PBSA value (−3.87 ± 0.68 kcal/mol) was in good agreement with the experimental Δ*G* (Δ*G*exp, −3.24 kcal/mol) obtained from the Van't Hoff plot (Table 4), supporting the reliability of the calculation for the AA/DMβCD complex.
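The energy bookkeeping behind these numbers can be reproduced with simple arithmetic. The sketch below assumes the usual MM/PBSA decomposition, Δ*G*bind = Δ*E*vdW + Δ*E*ele + Δ*G*solv,polar + Δ*G*solv,non-polar − *T*Δ*S*, and back-calculates the individual solvation terms and the entropic term from the mean values quoted in the text. These derived numbers are inferred here for illustration and are not read from Table 4.

```python
# Mean values quoted in the text (kcal/mol).
E_VDW = -35.12
E_ELE = -17.05
NONPOLAR_PLUS_VDW = -40.64   # dG_solv,non-polar + dE_vdW
POLAR_PLUS_ELE = 12.64       # dG_solv,polar + dE_ele
DG_BIND = -3.87


def decompose():
    """Back-calculate terms implied by the quoted sums, assuming
    dG_bind = dE_vdW + dE_ele + dG_polar + dG_nonpolar - T*dS."""
    g_nonpolar = NONPOLAR_PLUS_VDW - E_VDW
    g_polar = POLAR_PLUS_ELE - E_ELE
    enthalpic_total = NONPOLAR_PLUS_VDW + POLAR_PLUS_ELE
    minus_t_ds = DG_BIND - enthalpic_total
    return {
        "dG_solv_nonpolar": g_nonpolar,
        "dG_solv_polar": g_polar,
        "enthalpic_total": enthalpic_total,
        "minus_T_dS": minus_t_ds,
    }


if __name__ == "__main__":
    for name, value in decompose().items():
        print(f"{name}: {value:.2f} kcal/mol")
```

Under this assumption, the favorable vdW and non-polar solvation terms are partly offset by an unfavorable polar solvation term and an unfavorable entropic contribution, leaving a modest net binding free energy.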

#### *2.4. DMβCD Enhances Cytotoxicity of AA against Lung Cancer Cells*

The cytotoxic activity of AA and the AA/DMβCD inclusion complex against A549 and H1975 human lung cancer cells was evaluated using the MTT assay. The obtained results are depicted in Figure 10. We found that both AA and AA/DMβCD decreased cell viability in a dose-dependent manner, with the AA/DMβCD complex exhibiting significantly lower cell viability than the uncomplexed AA at concentrations of 1, 3, and 10 μM for both A549 and H1975 cell lines (Figure 10A,B). The half-maximal inhibitory concentration (IC50) values against the A549 and H1975 cells of the AA/DMβCD inclusion complex (17.14 ± 2.34 and 15.67 ± 1.33 μM) were significantly lower than those of the AA alone (32.77 ± 2.94 and 27.38 ± 3.17 μM) (Figure 10C). These findings are in good agreement with previous reports demonstrating that βCDs can distinctly enhance the anticancer activity of several hydrophobic compounds, such as camptothecin, luotonin A, resveratrol, mansonone G, curcumin, and scutellarein [23,43–46]. Thus, it was assumed that the enhanced antitumor effect of the AA/DMβCD inclusion complex was due to the improved water solubility and complex stability (Figure 2 and Table 1). In addition, DMβCD could infiltrate into the drug permeation barrier, called the unstirred water layer (UWL) [47], better than the uncomplexed AA, enhancing the flux of AA through the UWL [48].


**Table 4.** Δ*G*bind, MM/PBSA and its energy components (kcal/mol) of the AA/DMβCD complex.

<sup>a</sup> Data are shown as mean ± SEM (*n* = 3). Δ*E*MM, molecular mechanics energy; Δ*G*solv, solvation free energy comprising polar (Δ*G*solv,polar) and non-polar (Δ*G*solv,non-polar) terms; Δ*S*, entropy. <sup>b</sup> Data from Table 2.

**Figure 10.** Cell viability of (**A**) A549 and (**B**) H1975 human lung cancer cells after being treated with various concentrations of AA and AA/DMβCD for 48 h. The viable cells in the vehicle control (0.2% DMSO) were calculated as 100%. (**C**) IC50 of the AA and AA/DMβCD complex against the A549 and H1975 cells. \* *p* < 0.05, \*\* *p* < 0.01 vs. AA. Data are shown as mean ± SEM (*n* = 3).

#### **3. Materials and Methods**

#### *3.1. Materials*

AA was extracted from the culture liquid of the luminescent mushroom *Neonothopanus nambi* PW1 (Marasmiaceae), as previously described [1,3]. βCD and HPβCD were purchased from TCI (Nihonbashi-honcho, CK, Tokyo), whereas DMβCD was purchased from Sigma-Aldrich (St. Louis, MO, USA).

#### *3.2. Phase Solubility Study*

The phase solubility study was performed according to the method described by Higuchi and Connors [49]. An excess amount of AA was added to aqueous solutions containing increasing concentrations of βCD(s) (0–10 mM). The mixtures were incubated in a shaking incubator at 20, 30, 40, and 50 °C and 250 rpm for 40 h. The insoluble AA was then separated from the suspension by centrifugation at 10,000 rpm for 5 min, and the supernatant was filtered through a 0.45-micron syringe filter [13,23,44,50–52]. Two volumes of ethanol were added to each inclusion complex solution before measuring the absorbance at 331 nm [3]. The apparent stability constant (*Kc*) was determined by Equation (2), where S0 is the y-intercept and the slope is that of the phase solubility diagram.

$$K_c = \frac{\text{slope}}{S_0\,(1 - \text{slope})} \tag{2}$$
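As a minimal sketch of how Equation (2) might be applied, the Python snippet below fits a straight line to a phase solubility diagram by ordinary least squares and derives *Kc* from the fitted slope and y-intercept. The function name and the data points are hypothetical illustrations and are not values from this study.

```python
def stability_constant(cd_conc, aa_conc):
    """Fit dissolved AA vs. cyclodextrin concentration (both in M) and
    return (slope, S0, Kc), with Kc = slope / (S0 * (1 - slope))."""
    n = len(cd_conc)
    mean_x = sum(cd_conc) / n
    mean_y = sum(aa_conc) / n
    sxx = sum((x - mean_x) ** 2 for x in cd_conc)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cd_conc, aa_conc))
    slope = sxy / sxx
    s0 = mean_y - slope * mean_x  # y-intercept: intrinsic solubility of the guest
    kc = slope / (s0 * (1.0 - slope))
    return slope, s0, kc


# Hypothetical, perfectly linear diagram: S0 = 0.1 mM, slope = 0.2
cd = [0.0, 0.002, 0.004, 0.006, 0.008, 0.010]
aa = [0.0001 + 0.2 * c for c in cd]

slope, s0, kc = stability_constant(cd, aa)
print(f"slope = {slope:.3f}, S0 = {s0:.2e} M, Kc = {kc:.0f} M^-1")
```

Because Equation (2) divides by (1 − slope), it is only meaningful for slopes below 1, which is the case for diagrams indicating 1:1 stoichiometry.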

The van't Hoff equation (Equation (3)) was used to calculate the change in the enthalpy (Δ*H*) and entropy (Δ*S*) of the inclusion complexation, whereas the Gibbs free energy (Δ*G*) was determined by Equation (4).

$$
\ln K_c = -\frac{\Delta H}{RT} + \frac{\Delta S}{R} \tag{3}
$$

$$
\Delta G = \Delta H - T\Delta S \tag{4}
$$
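The thermodynamic analysis can be illustrated with a short Python sketch. It assumes the standard van't Hoff form, in which a plot of ln *Kc* against 1/*T* has slope −Δ*H*/*R* and intercept Δ*S*/*R*, and then evaluates Δ*G* = Δ*H* − *T*Δ*S*. The *Kc* values are synthetic, generated from assumed Δ*H* and Δ*S* values at the four incubation temperatures, and do not reproduce data from this study.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1


def vant_hoff_fit(temps_k, kc_values):
    """Linear fit of ln(Kc) vs 1/T; returns (delta_H in J/mol, delta_S in J/mol/K)."""
    xs = [1.0 / t for t in temps_k]
    ys = [math.log(k) for k in kc_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope * R, intercept * R


def gibbs_energy(delta_h, delta_s, temp_k):
    """Delta G = Delta H - T * Delta S."""
    return delta_h - temp_k * delta_s


# Synthetic Kc values from assumed Delta H = +20 kJ/mol and Delta S = +130 J/mol/K
temps = [293.15, 303.15, 313.15, 323.15]  # 20, 30, 40, and 50 degrees C
kcs = [math.exp(-20000.0 / (R * t) + 130.0 / R) for t in temps]

dh, ds = vant_hoff_fit(temps, kcs)
dg_303 = gibbs_energy(dh, ds, 303.15)
print(f"dH = {dh:.1f} J/mol, dS = {ds:.2f} J/mol/K, dG(303.15 K) = {dg_303:.1f} J/mol")
```

In this synthetic case Δ*H* is positive while Δ*G* is negative, which is the pattern of a spontaneous, endothermic (entropy-driven) process.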

#### *3.3. Inclusion Complex Preparation*

An excess amount of AA was added to a 10 mM DMβCD solution and incubated in a shaking incubator at 30 °C and 250 rpm. The suspension was then centrifuged (12,000 rpm for 15 min), filtered through a 0.45-micron syringe filter, and lyophilized. The resulting freeze-dried powder was kept in a desiccator for further analysis.

#### *3.4. Inclusion Complex Characterization*

#### 3.4.1. Ultraviolet-Visible (UV-Vis) Spectroscopy

AA and its inclusion complexes were suspended in DI water at 30 °C for 48 h. The suspension was then filtered through a 0.45-micron syringe filter. The UV-Vis spectra of the solutions were recorded using an Eppendorf BioSpectrometer™ (Eppendorf, Hamburg, Germany).

#### 3.4.2. Fourier Transform Infrared (FT-IR) Spectroscopy

The FT-IR spectra of the AA, DMβCD, and AA/DMβCD complex were recorded on a Nicolet 6700 FT-IR spectrometer (Thermo Fisher Scientific, Waltham, MA, USA) over a scanning range of 500–4000 cm<sup>−1</sup> in the attenuated total reflectance (ATR) mode.

#### 3.4.3. Differential Scanning Calorimetry (DSC)

The thermal behavior of the AA, DMβCD, and AA/DMβCD complex was characterized using a NETZSCH DSC 204F1 Phoenix (Selb, Germany). Each solid sample (~1–2 mg) was heated from 25 °C to 300 °C in aluminum pans at a rate of 10 °C/min.

#### 3.4.4. Scanning Electron Microscope (SEM)

The surface morphology of the AA, DMβCD, and AA/DMβCD complex was analyzed using a Scanning Electron Microscope (JEOL JSM-IT500HR, Tokyo, Japan). Samples were coated with a thin layer of gold in a vacuum before viewing under 300 times magnification. Observations were performed using an accelerating voltage of 10 kV.

#### *3.5. Computational Details*

#### 3.5.1. System Preparation and Molecular Docking

The 3D structure of DMβCD was taken from a previous study [23], whereas that of AA was downloaded from the PubChem database (PubChem CID: 71491081) and then optimized by the Gaussian09 program (Wallingford, CT, USA) [53] using the HF/6-31G\* level of theory. The protonation state of AA was checked at a pH of 7.0 using MarvinSketch software (Budapest, Hungary). The inclusion complex model between the optimized AA and the DMβCD was generated using the CDOCKER module implemented in Accelrys Discovery Studio 2.5 (Accelrys Software Inc., San Diego, CA, USA). Among the resulting 100 docking poses, the AA/DMβCD inclusion complex with the lowest CDOCKER interaction energy was selected for further studies.

#### 3.5.2. Molecular Dynamics (MD) Simulations

The MD simulations of each system were performed in the isothermal-isobaric (*NPT*) ensemble with a time step of 2 fs using the AMBER16 software package [54]. According to the standard procedures [36,55,56], the electrostatic potential (ESP) charges of AA were calculated at the HF/6-31G(d) level of theory using the antechamber module, whereas the restrained ESP (RESP) charges of AA were computed using the parmchk module in AMBER16. The SHAKE algorithm [57] was applied to constrain all chemical bonds involving hydrogen atoms, while the Particle Mesh Ewald method [58] was used to treat long-range electrostatic interactions. The cutoff value for non-bonded interactions was set to 12 Å. The general AMBER force field (GAFF) [59] and the Glycam-06 force field [60] were applied to AA and DMβCD, respectively. TIP3P water molecules [61] were added to solvate the inclusion complex with a spacing distance of 15 Å. Subsequently, the water molecules were minimized using the steepest descent (1500 steps) and conjugate gradient (3000 steps) methods, followed by minimization of the whole system. Each system was heated from 10 K to 298 K over 100 ps and then equilibrated for 1000 ps. After that, all-atom MD simulations were performed under periodic boundary conditions at 1 atm and 298 K until reaching 300 ns. The MD simulations were performed in three replicates (*n* = 3) for each model.

#### 3.5.3. Structural and Energetic Analyses

The CPPTRAJ module of AMBER16 was used to calculate the structural information, including the RMSD, Rg, number of atom contacts, free energy landscape, and RDF. For the energetic analysis, the binding affinity between the host and guest was calculated by the MM/PBSA method [62] using 100 snapshots extracted from the last 100 ns of the MD simulations.
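For orientation, the energy terms reported alongside Table 4 can be related through the standard MM/PBSA decomposition. The expressions below are the textbook form, assumed here rather than quoted from the authors' protocol, where each Δ denotes the value for the complex minus the sum of the values for the isolated host and guest, averaged over the extracted snapshots:

$$
\Delta G_{\text{bind}} = \Delta E_{\text{MM}} + \Delta G_{\text{solv}} - T\Delta S
$$

$$
\Delta G_{\text{solv}} = \Delta G_{\text{solv,polar}} + \Delta G_{\text{solv,non-polar}}
$$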

#### *3.6. Cell Lines and Culture*

A549 and H1975 human lung cancer cells were purchased from the American Type Culture Collection (ATCC, Manassas, VA, USA). Both cell lines were cultured in Dulbecco's Modified Eagle's Medium (Gibco, NY, USA) supplemented with 10% heat-inactivated fetal bovine serum (Gibco, NY, USA), 100 U/mL penicillin, and 100 μg/mL streptomycin (Gibco, NY, USA) and were maintained at 37 °C in a humidified 5% CO2 atmosphere.

#### *3.7. Cell Viability Assay*

The A549 and H1975 cells were seeded into 96-well plates at a density of 1000 cells/well. After overnight incubation, the cells were treated with logarithmically spaced concentrations (1, 3, 10, 30, and 100 μM) of AA and AA/DMβCD for 48 h. Note that the amount of AA in the free form and in the complex form was equivalent. The MTT reagent was then added to the wells and incubated for 3 h. Subsequently, the culture medium was withdrawn, and 100 μL of DMSO was added to dissolve the formazan crystals. Finally, the absorbance was measured at 540 nm.
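The conversion from absorbance to viability and IC50 can be sketched in Python as follows. Viability is expressed relative to the vehicle control set to 100%, as in Figure 10. The IC50 step uses log-linear interpolation between the two concentrations that bracket 50% viability; this is an assumption made for illustration, since the text does not state how the IC50 values were fitted, and the absorbance readings are hypothetical.

```python
import math


def percent_viability(abs_treated, abs_vehicle):
    """Viability relative to the vehicle control (defined as 100%)."""
    return 100.0 * abs_treated / abs_vehicle


def ic50_log_interpolation(concs, viabilities):
    """Estimate IC50 by interpolating on a log10 concentration axis between
    the two adjacent doses that bracket 50% viability (concs ascending).
    Returns None if viability never crosses 50%."""
    points = list(zip(concs, viabilities))
    for (c_lo, v_lo), (c_hi, v_hi) in zip(points, points[1:]):
        if v_lo >= 50.0 >= v_hi and v_lo != v_hi:
            frac = (v_lo - 50.0) / (v_lo - v_hi)
            log_ic50 = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10.0 ** log_ic50
    return None


# Hypothetical absorbance readings at 540 nm
vehicle_abs = 0.80
doses_um = [1, 3, 10, 30, 100]
treated_abs = [0.76, 0.68, 0.56, 0.24, 0.08]

viability = [percent_viability(a, vehicle_abs) for a in treated_abs]
ic50 = ic50_log_interpolation(doses_um, viability)
print([round(v, 1) for v in viability], f"IC50 = {ic50:.2f} uM")
```

A four-parameter logistic fit is a common alternative when more dose points are available; interpolation is used here only to keep the sketch dependency-free.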

#### *3.8. Statistical Analysis*

Data are shown as mean ± standard error of the mean (SEM) of three independent experiments. Differences between AA and AA/DMβCD were determined using the *t* test. A *p* value of <0.05 was considered statistically significant.
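A minimal Python sketch of these statistics is given below. It assumes an unpaired, two-tailed Student's *t* test with pooled variance, because the text does not specify the variant used, and it compares the statistic against the tabulated critical value for 4 degrees of freedom (2.776 at α = 0.05) instead of computing an exact *p* value. The triplicate values are hypothetical.

```python
import math


def mean_sem(values):
    """Return (mean, standard error of the mean) using the sample SD."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)


def student_t(a, b):
    """Unpaired Student's t statistic with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss_a = sum((x - ma) ** 2 for x in a)
    ss_b = sum((x - mb) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (ma - mb) / math.sqrt(pooled_var * (1.0 / na + 1.0 / nb))
    return t, df


T_CRIT_DF4 = 2.776  # two-tailed critical value, alpha = 0.05, df = 4

# Hypothetical IC50 triplicates (uM) for the free and complexed compound
free_form = [30.0, 33.0, 36.0]
complexed = [15.0, 17.0, 19.0]

mean_free, sem_free = mean_sem(free_form)
t_stat, df = student_t(free_form, complexed)
significant = df == 4 and abs(t_stat) > T_CRIT_DF4
print(f"{mean_free:.2f} +/- {sem_free:.2f}, t = {t_stat:.3f}, df = {df}, p < 0.05: {significant}")
```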

#### **4. Conclusions**

This study aimed to improve the water solubility and biological activity of AA by complexation with βCD and its derivatives (DMβCD and HPβCD). The phase solubility diagrams indicated 1:1 AA/βCD(s) binding stoichiometry, and the highest *Kc* was detected in the AA/DMβCD complex. Notably, βCDs, especially DMβCD, increased the thermal stability of the complexes. The thermodynamic study indicated that the inclusion complexation between AA and βCD(s) was a spontaneous endothermic reaction. The complex formation of the AA/DMβCD was confirmed by UV-Vis, FT-IR, DSC, and SEM techniques. MD simulations and MM/PBSA-based free energy calculations affirmed the vdW-driven formation of the AA/DMβCD complex in an aqueous environment. The anticancer effect of AA on A549 and H1975 lung cancer cells was significantly improved by complexation with DMβCD. Taken together, the satisfactory water solubility, high thermal stability, and enhanced antitumor potential of the AA/DMβCD complex would be potentially useful for its application as herbal medicines or healthcare products.

**Supplementary Materials:** The supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms23179776/s1.

**Author Contributions:** T.C.: methodology, formal analysis, investigation, validation, and visualization. A.O.: formal analysis, investigation, validation, and visualization. S.N.: investigation, validation, and visualization. T.R.: resources, validation, and visualization. S.K.: purification, validation, and visualization. P.M.: conceptualization, methodology, formal analysis, investigation, data curation, writing—original draft, writing—review & editing, validation, visualization, project administration, and supervision. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was funded by the "Research and Graduate studies" of Khon Kaen University (RP64-Mushroom extract-001). P.M. would like to thank the Thailand Toray Science Foundation for financial support.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Data are contained within the article or Supplementary Material.

**Conflicts of Interest:** The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

#### **References**


## *Article* **Exogenously Applied Sodium Nitroprusside Mitigates Lead Toxicity in Rice by Regulating Antioxidants and Metal Stress-Related Transcripts**

**Waqas Rahim <sup>1,†</sup>, Murtaza Khan <sup>2,†</sup>, Tiba Nazar Ibrahim Al Azzawi <sup>1</sup>, Anjali Pande <sup>1</sup>, Nusrat Jahan Methela <sup>1</sup>, Sajid Ali <sup>2</sup>, Muhammad Imran <sup>1</sup>, Da-Sol Lee <sup>1</sup>, Geun-Mo Lee <sup>1</sup>, Bong-Gyu Mun <sup>1</sup>, Yong-Sun Moon <sup>2</sup>, In-Jung Lee <sup>1</sup> and Byung-Wook Yun <sup>1,</sup>\***

	- **\*** Correspondence: bwyun@knu.ac.kr
	- † These authors contributed equally to this work.

**Abstract:** Sustainable agriculture is increasingly being put in danger by environmental contamination with dangerous heavy metals (HMs), especially lead (Pb). Plants have developed a sophisticated mechanism for nitric oxide (NO) production and signaling to regulate the hazardous effects of abiotic factors, including HMs. In the current study, we investigated the role of exogenously applied sodium nitroprusside (SNP, an NO donor) in ameliorating the toxic effects of Pb on rice. For this purpose, plants were subjected to 1.2 mM Pb alone and in combination with 100 μM SNP. We found that under 1.2 mM Pb stress conditions, the accumulation of oxidative stress markers, including hydrogen peroxide (H2O2) (37%), superoxide anion (O2<sup>−</sup>) (28%), malondialdehyde (MDA) (33%), and electrolyte leakage (EL) (34%), was significantly reduced by the application of 100 μM SNP. On the other hand, under the same Pb stress, the activity of reactive oxygen species (ROS) scavengers such as polyphenol oxidase (PPO) (60%), peroxidase (POD) (28%), catalase (CAT) (26%), superoxide dismutase (SOD) (42%), and ascorbate peroxidase (APX) (58%) was significantly increased by the application of 100 μM SNP. In addition, the application of 100 μM SNP rescued agronomic traits such as plant height (24%), number of tillers per plant (40%), and visible green pigments (44%) when the plants were exposed to 1.2 mM Pb stress. Furthermore, after exposure to 1.2 mM Pb stress, the expression of the heavy-metal stress-related genes *OsPCS1* (44%), *OsPCS2* (74%), *OsMTP1* (83%), *OsMTP5* (53%), *OsMT-I-1a* (31%), and *OsMT-I-1b* (24%) was significantly enhanced by the application of 100 μM SNP. Overall, our research indicates that exogenously applied 100 μM SNP protects rice plants from the oxidative damage brought on by 1.2 mM Pb stress by lowering oxidative stress markers and enhancing the antioxidant system and the transcript accumulation of HM stress-related genes.

**Keywords:** nitric oxide; antioxidants; metal-stress related transcripts; rice; Pb-stress

#### **1. Introduction**

Heavy metals (HMs), including lead (Pb), are released due to rapid industrialization, mining, economic growth, anthropogenic activities, and the excessive use of inorganic fertilizers, along with other agrochemicals [1,2]. Pollutants containing heavy metals enter the soil via a wide range of routes and pose a threat to sustainable agroecosystems, farming, and livelihoods [3,4]. Pb remains stable in soil for a long period of time, does not easily dissociate, and hence accumulates in human and animal bodies through the consumption of contaminated plants [5], thus threatening their health. For example, it is estimated that a Pb concentration of more than 40 μg dL<sup>−1</sup> in an infant's blood can block hemoglobin synthesis, resulting in anemia [6]. Pb is ranked the second most toxic HM after arsenic (As) [7] and plays a prominent role in impairing plant growth and development by adversely affecting the plant's metabolism [8], seed germination, seedling growth, cell division, the permeability of the plasma membrane, and various ultrastructural modifications [9,10]. A high concentration of lead (1.2 mM) induced a significant reduction in the plant height, number of tillers, number of panicles per plant, and the number of spikelets per panicle in different rice cultivars [11]. However, different rice cultivars responded differently to the 1.2 mM Pb stress, and some species showed only a small decline in agronomic attributes under different stresses [12].

**Citation:** Rahim, W.; Khan, M.; Al Azzawi, T.N.I.; Pande, A.; Methela, N.J.; Ali, S.; Imran, M.; Lee, D.-S.; Lee, G.-M.; Mun, B.-G.; et al. Exogenously Applied Sodium Nitroprusside Mitigates Lead Toxicity in Rice by Regulating Antioxidants and Metal Stress-Related Transcripts. *Int. J. Mol. Sci.* **2022**, *23*, 9729. https://doi.org/10.3390/ijms23179729

Academic Editor: Wajid Zaman

Received: 16 August 2022 Accepted: 24 August 2022 Published: 27 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

An elevated level of Pb stimulates ROS generation, which induces oxidative stress, damages the plasma membrane, and alters metabolism and physiological reactions [13]. To cope with oxidative stress, plants have developed a sophisticated antioxidant defense mechanism comprising SOD and POD [14–16], as well as APX, PPO, and CAT [17,18]. Furthermore, when plants are exposed to HMs, they use their inherent complex mechanisms and strategies for metal uptake, storage, transportation, detoxification, elimination, and compartmentalization [19,20]. However, variation exists among different species, or among varieties of a species, in the uptake, translocation, and accumulation of Pb [21]. In addition, phytochelatins (PCs) are the prime inducers of plant responses to various heavy metals [22]. They are considered to bind metals via thiolate coordination, which is involved in HM homeostasis and detoxification. Recently, the *OsPCS1* and *OsPCS2* genes have been characterized in rice [23]. In addition, members of the cation diffusion facilitator (CDF) gene family, also called metal tolerance proteins (MTPs), transport metal ions out of the cytosol either into extracellular spaces or into the vacuoles [24]. Similarly, metallothioneins (MTs) play an important role in the detoxification of heavy metals [25], maintaining the balance of intracellular metallic ions in plants [26], the scavenging of ROS [27], and the regulation of developmental processes [28]. Furthermore, NO was named "molecule of the year" in 1992 and is crucial in modulating various physiological and biochemical activities in plants [29]. It is highly diffusible and participates in a wide range of abiotic stress tolerance mechanisms in plants [30]. ROS metabolism also involves the participation of NO [31]. It is noteworthy that the ROS/NO interaction can be cytotoxic or protective, depending on the relative concentrations of ROS and NO [32]. The alleviative effect of NO on abiotic stresses in plants has been documented [33].

Rice is a staple food, and it has significantly shaped the economies, diets, and cultures of millions of people [34]. It is often exposed to environmental hazards, including heavy metals such as lead, the main causes of which are the overuse of agrochemicals and the repeated use of waste and sewage water during rice cultivation [35]. The European Chemicals Agency (ECHA) has placed Pb in the group of chemicals of great concern for the environment [36]. Therefore, it is essential to identify techniques that help improve the defense system of rice against Pb stress. For example, NO applications decrease the uptake of Pb in *Arabidopsis thaliana* [37] and affect gene expression in *Zea mays* [38]. Therefore, in the current study, we investigated whether applying sodium nitroprusside (SNP) alleviates the toxic effects of lead on rice. The novelty of the present work lies in revealing the NO-mediated antioxidant machinery that controls Pb-induced oxidative damage in rice.

#### **2. Results**

#### *2.1. SNP Improves Morphological Parameters of Rice under Pb Stress*

Pb stress adversely affects the growth attributes of different crops. Our results showed a significant improvement (16%) in shoot length under SNP treatment in Pb-untreated plants compared to Pb-untreated control plants, as shown in Figure 1A,B. Plants under Pb stress showed a significant reduction in shoot length (29%) compared to the control (only water). However, SNP treatment under Pb stress significantly enhanced (24%) the overall shoot length compared to sole Pb-treated rice plants (Figure 1A,B). Furthermore, a considerable increase in the number of tillers (14%) was observed in NO-treated, Pb-untreated plants compared to their respective control plants (Figure 1A,C). Pb treatment significantly reduced the number of tillers (29%) compared to control Pb-untreated plants. However, SNP treatment in Pb-treated plants significantly improved the number of tillers (40%) compared to sole Pb-treated plants, as shown in Figure 1A,C. In addition, our results revealed that SNP considerably enhanced (13%) the green pigment content of Pb-untreated plants compared to the control Pb-untreated plants (Figure 1D). Pb stress caused a significant reduction (27%) in the visible green pigment compared to control Pb-untreated plants (Figure 1D). However, SNP treatment of plants under Pb stress produced a significant increase (44%) in visible green pigment content compared to sole Pb-treated plants (Figure 1D).

**Figure 1.** Effects of exogenously applied SNP on rice with or without Pb-induced stress. (**A**) Phenotypic characteristics; (**B**) shoot length; (**C**) number of tillers per plant; (**D**) SPAD value for photosynthetic green pigment contents. Each data point indicates the mean ± standard deviation (*n* = 3). Bars with different letters indicate significant differences, according to Duncan's multiple range test. The results are compared to the respective controls (Pb-untreated and Pb-treated plants).

#### *2.2. SNP Enhances the Chlorophyll a, b, and Protein Contents*

Plants under stress face inhibited photosynthesis and changes in chlorophyll contents. The results revealed that SNP applications considerably increased chlorophyll a and b levels (16 and 15%, respectively) in Pb-untreated plants compared to control Pb-untreated plants, as shown in Figure 2A,B. Plants under Pb stress showed a significant decrease (36 and 34%, respectively) in chlorophyll a and b contents compared to control Pb-untreated plants (Figure 2A,B). The data also showed a significant increase in chlorophyll a and b contents (38 and 36%, respectively) in SNP-treated plants under Pb stress compared to plants under sole Pb stress (Figure 2A,B). Furthermore, the data showed a significant increase in protein content (15%) in SNP-supplied Pb-untreated plants compared to control Pb-untreated plants, as shown in Figure 2C. Plants under Pb treatment showed a highly significant decrease (49%) in protein content compared to control Pb-untreated plants (Figure 2C). However, SNP applications significantly increased the total protein content (35%) of Pb-treated plants compared to sole Pb-treated plants (Figure 2C).

**Figure 2.** Effect of exogenously applied SNP on rice with or without Pb-induced stress. (**A**) Chlorophyll a content; (**B**) chlorophyll b content; (**C**) protein content. Each data point indicates the mean ± standard deviation (*n* = 3). Bars with different letters indicate significant differences, according to Duncan's multiple range test. The results are compared to the respective controls (Pb-untreated and Pb-treated plants).

#### *2.3. Exogenously Applied SNP Mitigates Membrane Injury and Enhances Protection of Rice against Pb-Toxicity*

In the current study, we assessed the impact of SNP on oxidative stress markers, including H2O2, O2<sup>−</sup>, MDA, and electrolyte leakage. Pb treatment significantly enhanced (305%) MDA levels in plants compared to control Pb-untreated plants (Figure 3A). However, the SNP supply significantly reduced (33%) MDA levels in Pb-treated plants compared to sole Pb-treated plants (Figure 3A). A significant increase in ROS (H2O2 and O2<sup>−</sup>) levels (311 and 81%, respectively) was observed in Pb-treated plants compared to control Pb-untreated plants (Figure 3B,C). However, SNP applications considerably lowered H2O2 and O2<sup>−</sup> levels (37 and 28%, respectively) in plants under Pb stress compared to plants under sole Pb stress (Figure 3B,C). Furthermore, Pb treatments significantly enhanced (65%) ion leakage compared to control Pb-untreated plants (Figure 3D), which was significantly mitigated (34%) by SNP applications in Pb-treated plants, as shown in Figure 3D.
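The percentages in this section are changes relative to a reference group, and the reference differs between comparisons: Pb-treated plants are compared with the Pb-untreated control, whereas Pb + SNP plants are compared with sole Pb-treated plants. The short Python sketch below illustrates the arithmetic under the conventional definition of percent change, which the text does not state explicitly. The MDA values are hypothetical and were chosen only to reproduce a 305% increase and a 33% decrease.

```python
def percent_change(value, reference):
    """Change of value relative to reference, in percent."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (value - reference) / reference * 100.0


# Hypothetical MDA levels (arbitrary units)
control = 2.0
pb_only = 8.1
pb_plus_snp = 5.427

increase_vs_control = percent_change(pb_only, control)   # reference: untreated control
decrease_vs_pb = percent_change(pb_plus_snp, pb_only)    # reference: sole Pb treatment
fold_of_control = pb_only / control

print(f"{increase_vs_control:.0f}% vs control ({fold_of_control:.2f}-fold), "
      f"{decrease_vs_pb:.0f}% vs Pb alone")
```

Under this definition, an increase of 305% means that the level is about 4.05 times that of the control, and the 33% reduction is measured from the Pb-only level and not from the control.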

**Figure 3.** Effect of exogenously applied SNP on rice with or without Pb-induced stress. (**A**) MDA level; (**B**) H2O2 content; (**C**) superoxide anion level; (**D**) electrolyte leakage. Each data point indicates the mean ± standard deviation (*n* = 3). Bars with different letters indicate significant differences, according to Duncan's multiple range test. The results are compared to the respective controls (Pb-untreated and Pb-treated plants).

#### *2.4. SNP Regulates the Antioxidant Enzymes Machinery*

Overall, the results revealed that NO treatments increased the level and activity of antioxidant enzymes under Pb stress. The activities of PPO, POD, CAT, SOD, and APX were enhanced (35, 18, 15, 17, and 44%, respectively) in SNP-supplemented Pb-untreated plants compared to control Pb-untreated plants, as shown in Figure 4A–E. Pb treatment considerably increased the activities of PPO, POD, CAT, SOD, and APX (7, 24, 218, 52, and 68%, respectively) compared to control Pb-untreated plants (Figure 4A–E). Moreover, the supply of SNP further enhanced the activity of the above antioxidants by 60, 28, 26, 42, and 58%, respectively, compared to sole Pb-supplied plants.

**Figure 4.** Effect of exogenously applied SNP on rice with or without Pb-induced stress. (**A**) Polyphenol oxidase; (**B**) peroxidase; (**C**) catalase; (**D**) superoxide dismutase; (**E**) ascorbate peroxidase. Each data point indicates the mean ± standard deviation (*n* = 3). Bars with different letters indicate significant differences, according to Duncan's multiple range test. The results are compared to the respective controls (Pb-untreated and Pb-treated plants).

#### *2.5. SNP Modulates the Expression of the Genes Related to Metal Stress*

The results revealed that the relative expression of *OsPCS1* and *OsPCS2* was differentially and highly significantly enhanced (387 and 134%, respectively) in Pb-treated plants compared to the Pb-untreated control. Moreover, SNP supplementation in Pb-treated plants further increased the relative expression of *OsPCS1* (44%) and *OsPCS2* (74%) compared to sole Pb-supplied plants (Figure 5A,B).

Furthermore, our results showed that plants under Pb stress have significantly higher relative *OsMTP1* and *OsMTP5* expression (460 and 177%, respectively) when compared to control Pb-untreated plants. However, the SNP supply further improved the relative expression of *OsMTP1* and *OsMTP5* (83 and 53%, respectively) compared to sole Pb-treated plants (Figure 5C,D).

Pb treatments triggered a significant enhancement in the relative expression of *OsMT-I-1a* and *OsMT-I-1b* (151 and 1065%, respectively) compared to Pb-untreated control plants (Figure 5E,F). However, SNP supplies further improved the relative expression of *OsMT-I-1a* and *OsMT-I-1b* (31 and 24%, respectively) compared to plants solely treated with Pb.

**Figure 5.** Effect of exogenously applied SNP on the relative expression of rice phytochelatin, metal transporter, and metallothionein protein candidate genes with or without Pb-induced stress. (**A**) *OsPCS1*; (**B**) *OsPCS2*; (**C**) *OsMTP1*; (**D**) *OsMTP5*; (**E**) *OsMT-I-1a*; (**F**) *OsMT-I-1b*. Each data point indicates the mean ± standard deviation (*n* = 3). Bars with different letters indicate significant differences, according to Duncan's multiple range test. The results are compared to the respective controls (Pb-untreated and Pb-treated plants).

#### **3. Discussion**

Nitric oxide acts as a "guardee" molecule to alleviate heavy-metal toxicity via stress perception, signaling, and the acclimatization of plants under heavy-metal stress [39]. In the current study, the inhibitory effects of Pb on key morphological characteristics of rice, i.e., plant height, the number of tillers, and visible green pigment content (Figure 1A–D), were significantly reduced by the exogenous application of a 100 μM SNP solution. Similarly, the toxic effects of Pb on wheat seedling growth were significantly reduced by the application of 100 μM SNP [40].

Many studies have reported changes at the physiological, biochemical, and molecular levels in plants under abiotic stress, including changes in chlorophyll a and b contents [41–46]. SNP significantly ameliorated the Pb-induced negative impacts on chlorophyll a and b contents, as shown in Figure 2A,B. Similarly, under heavy-metal (Pb and Cd) stress, adding SNP increased chlorophyll and carotenoid concentrations in bamboo plants [47].

Plants activate a plethora of adaptive responses to withstand abiotic stresses, including a high accumulation of proteins [48]. The current study showed that the significant reduction in protein content caused by Pb-induced damage was significantly mitigated by the exogenous application of SNP, as shown in Figure 2C. This result is in accordance with a previously reported result [47].

Heavy metals damage plants at the cellular and molecular levels, both directly and indirectly, through the overproduction and hyperaccumulation of ROS [49,50]. However, NO protects plants against oxidative damage by scavenging ROS [49]. In the present study, Pb stress significantly enhanced the production of ROS and electrolyte leakage; however, this increase was significantly mitigated by exogenously applied SNP (Figure 3A–D). These results are in accordance with previously published literature [51,52]. Exogenous applications of NO in rice and perennial ryegrass under Cd- and Pb-induced stress, respectively, decreased the production of ROS and MDA, resulting in increased activities of antioxidant enzymes [53–55].

NO regulates the antioxidant enzyme machinery at the cellular level, affecting the cellular redox level [49,56]. Our results indicated that the exogenous application of NO in the form of SNP mitigates the adverse effects of Pb on rice plants by increasing the production of antioxidant enzymes such as PPO, POD, CAT, SOD (Figure 4A–D), and APX (Figure 4E), which is consistent with previously published literature on various heavy metals [57].

Increased tolerance to heavy metals, for example, under cadmium stress, is linked with phytochelatin synthesis [58,59]. Several published reports suggest that PC-deficient mutants show increased heavy-metal sensitivity [60]. Therefore, the current study was designed to check the effect of the exogenous application of SNP on the expression of two candidate phytochelatin genes, i.e., *OsPCS1* and *OsPCS2*. Our results showed that the relative expression of both genes was significantly enhanced under lead stress compared to control plants (Figure 5A,B), suggesting transcript accumulation for metal binding and vacuolar compartmentalization. Regarding the effect of NO on the expression of phytochelatin genes, the data showed that transcript levels were considerably enhanced in SNP-supplied Pb-treated plants compared to plants under sole Pb stress.

Plants under heavy-metal stress undergo a transcriptional regulation of the CDF protein family members *OsMTP1* and *OsMTP5*. They play an important role in cation homeostasis, chelation, sequestration, or the expulsion of excess heavy metals [61]. The expression of *OsMTP1* is enhanced during Cd-stress exposure; overexpression and gene silencing confirmed its role in the transportation of Cd [62]. The expression analyses of *OsMTP1* and *OsMTP5* under Pb stress in the current study agree with the results in the cited literature. The current study revealed enhanced expression of *OsMTP1* and *OsMTP5* in plants treated with Pb compared to control Pb-untreated plants, as shown in Figure 5C,D. Regarding the effect of SNP on the relative expression of *OsMTP1* and *OsMTP5*, the results showed that the expression level was significantly enhanced in NO-treated rice under Pb stress compared to rice only under Pb stress (Figure 5C,D).

The overexpression of metallothionein genes in tobacco decreased arsenic accumulation in roots and increased it in shoots [63]. In the current study, we found that the relative expression of two candidate genes, i.e., *OsMT-I-1a* and *OsMT-I-1b*, was upregulated by Pb toxicity, as shown in Figure 5E,F, and was further enhanced by the application of SNP in Pb-treated plants. Although the mechanism of heavy-metal detoxification in plants by metallothionein is still elusive, our results are supported by previously published literature [64].

#### **4. Materials and Methods**

#### *4.1. Plant Material, Husbandry Preparation, and Growth Conditions*

The experiment was performed in soil under greenhouse conditions at Kyungpook National University, Daegu, Republic of Korea. For the experiment, the Jinbu rice cultivar (*Oryza sativa* L. ssp. *japonica*) was selected as genetic material. Seed sterilization, germination, and sowing were conducted as previously described [65]. Lead (II) nitrate (Pb(NO3)2) was applied at 1.2 mM, as described earlier [11]. The pots were divided into four treatments with three replicates each, as listed in Table 1. Three weeks after transplantation, the plants were supplied with 100 μM SNP [40,54].


**Table 1.** Treatments and concentrations of the chemicals used in this study.

Pb(NO3)2: Lead (II) nitrate as a Pb source; SNP-R: sodium nitroprusside as a NO-donor applied through roots via sub-irrigation.

#### *4.2. Measurement of Electrolyte Leakage and Visible Green Pigment Quantification*

The electrolyte leakage assay was performed, as previously described [48], to estimate any ion leakage resulting from Pb-induced oxidative damage. Electrolyte leakage-1 (EL1) was measured using a portable conductivity meter (HORIBA Twin Cond B-173, Fukuoka, Japan). For electrolyte leakage-2 (EL2), the samples were autoclaved and cooled to room temperature before measurement. Electrolyte leakage (EL), expressed as a percentage (%), was calculated using the following formula:

EL (%) = (EL1/EL2) × 100
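The formula can be written as a small helper. This is a minimal sketch: the function name and the example readings are hypothetical, and both conductivity values are assumed to be in the same units.

```python
def electrolyte_leakage_percent(el1: float, el2: float) -> float:
    """Return EL (%) = EL1 / EL2 * 100.

    el1: conductivity of the sample before autoclaving (EL1).
    el2: conductivity of the same sample after autoclaving (EL2).
    """
    if el2 <= 0:
        raise ValueError("EL2 must be positive")
    return el1 / el2 * 100.0


# Placeholder readings for illustration only, not data from the study.
print(electrolyte_leakage_percent(30.0, 120.0))
```

A value closer to 100% indicates that most of the releasable electrolytes had already leaked before autoclaving, that is, more severe membrane damage.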

Visible green pigment content in leaf samples was measured using a SPAD meter (SPAD-502; Minolta Co. Ltd., Osaka, Japan), as previously described [11].

#### *4.3. Quantification of Chlorophyll a and b Contents*

To quantify and calculate chlorophyll a and b contents, a detailed method was followed as described previously [66,67].

#### *4.4. Quantification of Lipid Peroxidation*

The determination of lipid peroxidation was performed by calculating the amount of a byproduct of membrane bilayer oxidation, malondialdehyde (MDA), by using a published method [68].

#### *4.5. Hydrogen Peroxide (H2O2) and Superoxide Anion (O2<sup>−</sup>) Content*

H2O2 content in rice leaf tissue was quantified as previously described [69] and expressed as μmol g<sup>−1</sup> FW. Similarly, O2<sup>−</sup> content in rice leaf tissue was quantified as previously described [70] and expressed as μmol g<sup>−1</sup> FW.

#### *4.6. Estimation of Antioxidant Activities*

CAT, POD, PPO, and SOD activities were assayed as previously described [48]. In brief, 400 mg of leaf sample was ground to a powder using a chilled mortar and pestle. The ground samples were homogenized in 0.1 M phosphate buffer (pH 6.8) and centrifuged at 5000 rpm for 15 min at 4 °C. The supernatant was used as the crude enzyme source for the CAT, POD, and PPO assays.

PPO and POD activities were estimated as described in [48], and CAT activity was measured as described in [11]. SOD activity was analyzed by following the photoreduction of nitro blue tetrazolium (NBT) [66], and APX activity was determined as described in [71].

#### *4.7. RNA Extraction and Quantitative Real-Time PCR*

RNA was extracted using the standardized procedure described in [67]. Complementary DNA (cDNA) synthesis and RT-PCR were carried out according to the previous literature [72]. The synthesized cDNA was used as a template for assessing transcript accumulation by qRT-PCR (Eco™, Illumina, San Diego, CA, USA), as previously described [29]. The genes and corresponding primers, with names and sequences, are listed in Table 2.


**Table 2.** List of primers used in this study.

#### *4.8. Statistical Analysis*

Independent experimental analyses were performed in triplicate using a completely randomized design (CRD). To assess the statistical significance of differences among control, Pb-treated, and SNP-supplied Pb-treated plants, Statistical Analysis Software (SAS) version 9.1 (SAS Institute Inc., Cary, NC, USA) was used, and all data were evaluated with Duncan's multiple range test. The significance threshold was set at *p* < 0.05. GraphPad Prism software version 6.0 (San Diego, CA, USA) was used to present the results graphically.
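For readers unfamiliar with this kind of design, the sketch below computes the one-way ANOVA F statistic that multiple-comparison procedures such as Duncan's test are commonly built on. It is not the SAS procedure used in the study and does not implement Duncan's test itself; the function name and all numbers are placeholders, chosen only to mirror a layout of four treatments with three replicates each.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = fmean([x for g in groups for x in g])
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Placeholder values only (four treatments, three replicates each);
# these are NOT measurements from the study.
treatments = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0],
    [3.0, 4.0, 5.0],
]
f_stat, df_b, df_w = one_way_anova_f(treatments)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The F statistic would then be compared against the critical value at the chosen threshold (here *p* < 0.05) before any pairwise ranking of treatment means.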

#### **5. Conclusions**

Pb stress altered the activities of SOD, POD, and PPO, as well as CAT levels and gene expression, thereby inducing the overproduction of ROS and disrupting the H2O2 and MDA scavenging system. Exogenous application of the NO donor SNP had a protective effect, alleviating lead stress in rice. Our results revealed that exogenously applied SNP improves Pb stress tolerance in rice by activating the antioxidant system, lowering electrolyte leakage, reducing the production of H2O2 and MDA, and increasing the expression of heavy-metal stress-related genes. Based on the present results, we suggest that the application of SNP will enhance the growth and productivity of rice under lead stress conditions.

**Author Contributions:** Conceptualization, methodology, and validation, B.-W.Y., I.-J.L. and M.K.; formal analysis, investigation, and data curation, W.R., M.K., T.N.I.A.A., M.I., N.J.M., D.-S.L. and G.-M.L.; writing—original draft preparation, W.R. and M.K.; writing—review and editing M.K., W.R., B.-W.Y., B.-G.M., A.P., Y.-S.M. and S.A.; equally contributed to this work and have the rights to claim first author titles in their CVs, W.R. and M.K.; funding acquisition, B.-W.Y. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (Grant number 2020R1l1A3073247), Republic of Korea, and a project to train professional personnel in biological materials by the Ministry of Environment.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


### *Review* **Vaccine for Diabetes—Where Do We Stand?**

**Dinesh Kumar Chellappan <sup>1,\*</sup>, Richie R. Bhandare <sup>2,3,\*</sup>, Afzal B. Shaik <sup>4</sup>, Krishna Prasad <sup>5</sup>, Nurfatihah Azlyna Ahmad Suhaimi <sup>6</sup>, Wei Sheng Yap <sup>6</sup>, Arpita Das <sup>7</sup>, Pradipta Banerjee <sup>8</sup>, Nandini Ghosh <sup>8</sup>, Tanner Guith <sup>8</sup>, Amitava Das <sup>8</sup>, Sarannya Balakrishnan <sup>9</sup>, Mayuren Candasamy <sup>1</sup>, Jayashree Mayuren <sup>10</sup>, Kishneth Palaniveloo <sup>11</sup>, Gaurav Gupta <sup>12,13,14</sup>, Sachin Kumar Singh <sup>15,16</sup> and Kamal Dua <sup>16,17</sup>**


**Abstract:** Diabetes is an endocrinological disorder with a rapidly increasing number of patients globally. Over the last few years, diabetes has become a pivotal contributor to morbidity and mortality among young as well as middle-aged people. Recent developments in our understanding of the autoimmune responses leading to diabetes have drawn attention to the prospective use of immunomodulatory agents to prevent the disease. The mechanisms of action of vaccines vary greatly and include removing autoreactive T cells and inhibiting the interactions between immune cells. Currently, most diabetes vaccines under development have been tested in animal models, while only a few human trials have been completed with positive outcomes. In this review, we examine ongoing clinical trials aimed at developing a prototype diabetes vaccine.

**Keywords:** diabetes; vaccines; clinical trials; insulin; GLP

**Citation:** Chellappan, D.K.; Bhandare, R.R.; Shaik, A.B.; Prasad, K.; Suhaimi, N.A.A.; Yap, W.S.; Das, A.; Banerjee, P.; Ghosh, N.; Guith, T.; et al. Vaccine for Diabetes—Where Do We Stand? *Int. J. Mol. Sci.* **2022**, *23*, 9470. https://doi.org/10.3390/ijms23169470

Academic Editor: Wajid Zaman

Received: 31 July 2022 Accepted: 19 August 2022 Published: 22 August 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Persisting as a major global health threat, diabetes mellitus (DM) affects individuals of all ages, ethnicities, and backgrounds, especially those with a prominent family history of diabetes and exposure to a multitude of environmental factors [1–4]. As reported by the World Health Organization (WHO), 422 million people globally suffer from diabetes, and the disease is spreading rapidly in middle- and low-income nations [5]. Approximately 1.5 million deaths annually are caused by diabetes worldwide [5]. China reportedly had the highest number of people with diabetes in 2021, with 149.1 million of its population between the ages of 20 and 79 affected by this chronic disease, and it is forecast to have approximately 174 million diabetic patients by 2045. A 2014 survey by Kaveeshwar and Cornwall, which reported a rise approaching 8.3% in the incidence of diabetes, further supports this observation [6–8]. Complications stemming from poorly managed DM are a crucial cause of concern because they contribute to mortality and indirectly affect the economic status of a country [9]. The development of secondary complications worsens the mortality and morbidity caused by diabetes [10]. Initially, diabetes was classified according to its etiology and clinical course; it was ultimately categorized into Type 1 (T1DM) and Type 2 diabetes (T2DM), as the previous definition excluded many patients who exhibited an atypical presentation and progression of the disease [11,12]. According to the American Diabetes Association, T1DM occurs due to defects in insulin production, whereas T2DM arises primarily from insulin resistance, followed by a problematic reduction in insulin secretion, giving rise to hyperglycemia [13,14].

Orban et al. (2001) described three phases at which researchers may interrupt the pathologic mechanisms underlying T1DM: the development of autoimmunity, the development of autoantibodies, and the emergence of clinical manifestations while residual β-cell function remains to be conserved [15]. The aim of intervening in these phases is to avert autoimmunity in its initial stages and to inhibit clinical disease onset in high-risk individuals, since this phase is the root cause of the disease for the vast majority of patients [16]. For the prevention of T2DM progression, by contrast, the International Diabetes Federation (IDF) proposed in 2006 a method based on reducing modifiable risk factors [17,18]. In terms of monitoring parameters, autoantibodies against insulin, insulinoma-associated protein 2 (IA-2), glutamic acid decarboxylase (GAD), or zinc transporter isoform 8 (ZnT8) act as biomarkers for detecting the preliminary onset of diabetes, as individuals who test positive for more than 50% of these autoantibodies are at greater risk of developing T1DM than those positive for a single β-cell antigen [19,20]. In addition, β-cell destruction is mediated by different types of cytokines or by the direct activity of T or B lymphocytes.

Pancreatic β-cell damage may be initiated directly by environmental toxins, by a virus, or by a primary immune attack against pancreatic β-cell antigens such as the 65-kD isoform of glutamic acid decarboxylase (GAD65). T-helper (CD4+) lymphocytes are activated by β-cell antigens and antigen-presenting cells, including dendritic cells (DC) and macrophages. Interleukin (IL)-12 secreted by macrophages then stimulates the secretion of IL-2 and interferon (IFN)-γ by the CD4+ T cells. IFN-γ in turn stimulates resting macrophages to secrete other cytokines, such as tumor necrosis factor (TNF)-α and IL-1β, as well as free radicals, which are lethal to pancreatic β-cells. Additionally, activated T-helper cells produce cytokines that attract T and B lymphocytes and trigger their proliferation in the islets of Langerhans, precipitating insulitis. With time, B lymphocytes attack and harm the β-cells by producing antibodies against secreted pancreatic β-cell antigens, whereas cytotoxic T lymphocytes (CD8+) directly attack β-cells that carry the target autoantigens [21–23].

As diabetes is a progressive disease, diabetic patients require effective, long-term treatment and regular monitoring to achieve the recommended HbA1c levels. This management strategy may involve oral medicines, injectables such as insulin or GLP-1 analogs, or a combination of both dosage forms. Combination and injectable therapies carry a high chance of side effects, such as diarrhea or vomiting with GLP-1 analogs and weight gain or hypoglycemia following insulin treatment. Certain treatment regimens fail to lower a patient's HbA1c to the desired level, and the undesirable side effects of the medications themselves cause patients to skip treatment, especially at higher doses, rendering the therapy ineffective. Presently, patients' lack of adherence to their treatment plans remains a persistent clinical challenge, with over 50% of diabetic patients failing to strictly follow their medication schedule. In addition, although adherence to insulin treatment has improved in the past few years owing to the use of pre-mixed formulations and smaller needles, it remains sub-optimal at 63–65% [24,25].

Hence, to avert medication adherence problems in chronic diseases such as diabetes, there is a growing need for better prevention measures. Recent approaches to halting the onset of DM early, other than intervening with environmental triggers, have led to the discovery of effective vaccines. In this review article, we discuss advanced methods of diabetes prevention and the role of adjuvants in relation to vaccines. Some common practices for the prevention of diabetes at early stages are depicted in Figure 1. We also examine ongoing debates and differing opinions on vaccine products made from proteins, antigens, and live pathogens. In addition, we review vaccines developed for other diseases that may help pave the way for diabetes vaccine development.

**Figure 1.** Different modes of prevention of diabetes.

#### **2. Vaccination**

Vaccines elicit their responses in several ways: shifting the destructive Th1 immune response toward a benign Th2 response, inducing antigen-specific T-reg cells, eradicating autoreactive T cells, or arresting immune cell interaction [26]. A classification of different vaccines for treating diabetes is shown in Figure 2.

#### *2.1. Early Diabetes Prevention*

Various approaches are used to prevent DM onset and progression, mainly treating targeted individuals who have a family history of diabetes or persistent autoantibodies, intensive lifestyle interventions, the consumption of dietary fiber, and the intake of vitamin D supplements [20,27–31]. However, immunization as a method of prevention remains under-researched. At present, vaccines are employed as prophylactic measures against infectious diseases, using variants derived mainly from targeted live-attenuated pathogens. However, concerns about the safety profile of these vaccines have led to the investigation of more advanced bases known as adjuvants, which play a key role in skewing immune responses and in vaccine fabrication [32–34].

**Figure 2.** A classification of different vaccines for treating diabetes.

#### *2.2. Rationale behind Vaccine Adjuvant Action*

A new era of vaccine development is emerging through novel combined therapies comprising adjuvants, which specifically activate and drive immune responses [35–38]. Traditionally, incorporating an adjuvant into a vaccine offers certain benefits, such as a reduced dose, leading to altered immune responses of greater quality with minimal side effects [39–41]. A prime example of a frequently used adjuvant is alum, which is readily available on the market today and helps promote humoral immunity [42,43].

#### **3. Newly Designed Vaccine Products**

Currently, several vaccine products, such as autoantigen-based and non-autoantigen-specific therapies, are being developed into vaccines for diabetes. A few of these have almost reached the final stage of human testing. In this article, we therefore discuss various approaches that combine prevention strategies with vaccines for diabetes mellitus.

#### *3.1. Protein-Based Approach in Vaccine Production*

#### 3.1.1. IL-1β-Targeted Epitope Peptide (1βEPP) as a New Vaccine Product for T2DM

Inflammation of the pancreatic islets in T2DM, which leads to β-cell apoptosis and disruption of insulin production, is mainly caused by the cytokine IL-1β, a key mediator that induces insulin resistance in peripheral tissue [44–48]. Interestingly, several studies have shown that IL-1β has potential as a target for future T2DM therapy. A newly developed IL-1β-targeted epitope peptide vaccine adjuvanted with polylactic acid microparticles (1βEPP) not only improved glucose tolerance and protected against hyperglycemia when tested in the diabetic KK-Ay mouse model but also reduced the lipid profile and β-cell apoptosis [49]. However, a few alterations were made to create a safely modified anti-IL-1β to address issues arising from the phase I and II trial studies [50]. Another noteworthy study, which used a combination therapy (CT) of anti-IL-1β and a GAD65 DNA vaccine, demonstrated immense potential for reversing diabetes in its early stages [51]. In addition, another vaccine, hIL1bQb, was initially developed and tested in a preclinical simian phase before being tested in T2DM patients, where it was well documented that the hIL1bQb immunization caused harm to human subjects involved in the trial [52].

#### 3.1.2. Dipeptidyl Peptidase-4 (DPP4) as Novel Vaccine Product for T2DM

Incretins are hormones secreted from enteroendocrine cells into the blood within minutes of food intake to regulate the amount of insulin secreted. There are two incretins: glucose-dependent insulinotropic peptide (GIP) and glucagon-like peptide-1 (GLP-1). Although these hormones share numerous actions in the pancreas, they have very distinct actions outside it. Both incretins are quickly deactivated by the dipeptidyl peptidase-4 (DPP4) enzyme; DPP4 inhibitors therefore raise the concentrations of these hormones, resulting in enhanced β-cell responsiveness to raised glucose concentrations as well as the suppression of glucagon secretion [53,54]. In addition, activation of glucagon-like peptide (GLP) receptors by agonists results in insulin secretion and the inhibition of caspase-mediated cell death in pancreatic β-cells through the action of DPP4. Earlier researchers recognized GLP and DPP4 as efficacious and long-lasting agents for future approaches to treating T2DM [55–62], as illustrated by a 2014 study that succeeded in synthesizing a DPP4 vaccine. In trials in the C57BL/6J mouse model, the vaccine raised GLP-1 levels and enhanced insulin sensitivity without prompting adverse autoimmune responses. This effect was mainly observed with B- and T-cell epitopes, where a significant increase in anti-DPP4 antibody titre was detected along with a Th2 humoral response [63]. Recently, a study was conducted using D41-IP, a new combinatorial peptide vaccine synthesized from the B-cell epitope D41 within DPP4 and a B-cell epitope of insulinoma antigen-2 (IA-2), in an effort to enhance therapeutic responses [64,65]. However, this multi-epitope study is regarded more as an alternative diabetes therapy than the previous study, which focused on preventing diabetes onset [65].

#### 3.1.3. CTB-InsB Vaccination Product to Treat T1DM

*Bombyx mori*, a classic host that secretes recombinant proteins from the lumen of its silk glands during the fifth instar stage, has long been used in silkworm biotechnology [66–68]. This approach, which originated with Dr. Maeda's 1985 work on silkworm larvae, has drawn much attention owing to its high level of recombinant protein expression [69]. Cholera toxin subunit B (CTB), which is known for its toxic characteristics, is commonly used as a strong adjuvant; together with its antigen-coupling action, it eventually leads to a downregulation response in the onset of T1DM [70]. The B chain of insulin (InsB) is a 30-amino-acid chain with an immunogenic epitope. Used in combination, CTB and InsB form an edible vaccine that induces immune tolerance in diabetic patients and strongly suppressed insulitis when tested in NOD mice. This specific tolerance increases the proportion of Foxp3+ regulatory T cells in peripheral lymph tissues and suppresses the biological functions of spleen lymphocytes in mice. This important research demonstrated the effectiveness of the CTB-InsB oral protein vaccine against diabetes development [71–75].

#### *3.2. Specific Self-Antigens Approach in Vaccine Production*

Upon administration and absorption of the vaccine-adjuvant through the skin of T1DM patients, Toll-like receptors (TLRs) act on depot-containing antigens through the activation of pattern recognition receptors (PRRs), which mainly leads to the maturation of antigen-presenting cells (APCs), primarily DC. Activated APCs then interact, via surface major histocompatibility complexes (MHC), with antigen-specific T cells and secrete the cytokine IL-10 to suppress Th1 through Th0 stimulation. Two pathways, the Th2 anti-inflammatory process and the induction of Treg cells that suppress Th cell development, eventually stimulate insulin secretion, as shown diagrammatically in Figure 3 [36,76–78].

**Figure 3.** T1DM and vaccine adjuvant's mechanisms of action.

#### 3.2.1. IA-2 as New Vaccine Product for T1DM

The pharmacological action of the D41-IP vaccine, which uses the IA-2 protein as an islet autoantigen, has been tested in a few different studies [79]. IA-2, a tyrosine phosphatase protein, is commonly known as a major islet antigen in T1DM [80]. A 2012 study by Guan et al. claimed that IA-2 vaccination can delay the onset and the late stages of autoimmune diabetes, either on its own or when co-administered with plasmid IL-4/MCP-1, suggesting a promising future for T1DM patients [81,82]. Later, a novel peptide vaccine, IA-2-P2, was introduced, based on the previous findings of Guan et al. A drastic drop in the blood glucose levels of normoglycemic mice was observed with the IA-2-P2 vaccine in comparison with P277 and control mice in the study. Hence, it was concluded that IA-2-P2 was a suitable ameliorative vaccine to combat T1DM. The P277 peptide, derived from human 60 kDa heat shock protein (hsp60), is a causative factor in the onset of diabetes in non-obese diabetic (NOD) mice, which are genetically prone to developing spontaneous autoimmune diabetes [83]. In addition, Lu et al. reported that the His-Hsp65-6IA2P2 fusion protein vaccine, delivered by nasal inoculation, is believed to serve well in regulating the T1DM response [84]. Apart from IA-2, the zinc transporter (ZnT8) has also been studied as a major autoantigen target in T1DM immunotherapy. However, no investigations have yet tested ZnT8 as a diabetes vaccine in trials [85–88].

#### 3.2.2. Glutamic Acid Decarboxylase 65-kD (GAD65): A New Vaccine Product for T1DM

GAD65 is an isoform of GAD targeted by self-reactive T cells, and GAD65 antibodies are detected as a susceptibility marker in T1DM more frequently than antibodies to the IA-2 autoantigen [89,90]. Aluminum hydroxide, meanwhile, is the most widely used adjuvant [91,92]. In 2011, the GAD-alum vaccine (Diamyd), comprising GAD and aluminum hydroxide, was introduced, and its efficacy and safety were tested in phase II trials followed by four years of close pharmacovigilance [93–99]. However, no desirable effects were observed in T1DM subjects, even though two to three doses of the injected vaccine were given to each subject across the three trial stages, rendering it ineffective [100–102]. In fact, neither HbA1c nor insulin was altered by GAD-alum treatment [100]. A review article by Cook et al. also highlighted the insufficiency of the GAD-alum vaccine when tested in clinical trials with larger sample sizes [103]. However, a newly proposed multi-component vaccine combining CTB-insulin and CTB-GAD with IL-10 has been shown to suppress β-cell autoreactivity in T1DM [104]. Other studies also revealed that GAD65 antibodies elicit activity against glial fibrillary acidic protein (GFAP), a predictive biomarker expressed in peri-islet Schwann cells at the onset of T1DM, making it a suitable molecule for incorporation into an immune-tolerizing vaccine [105,106]. In NOD mice, this GFAP vaccine significantly suppressed hyperglycemia and enhanced C-peptide secretion by acting on T-cell entry into pancreatic islets, which subsequently shifted T-cell differentiation from a cytotoxic Th1 response to a Th2-biased humoral response [107]. Considering the complex mechanism of action of self-antigens on the immune system, one study instead used synthetic materials to co-deliver immunomodulatory signals, in which fabricated microparticle (MP) vaccines were recently established using in vivo and in vitro methods. Accordingly, the first biomaterial-based vaccine product, hydrogel/microparticle, was introduced using dual subcutaneous immunization, and a revised version involving the delivery of three shots of the vaccine into NOD mouse models was subsequently tested [108,109]. Separately, Phillips et al. created the first antisense oligonucleotide-formulated microsphere vaccine capable of suppressing diabetes and boosting Foxp3+ T-reg cells without inducing any unfavorable responses [110,111]. As immune tolerance therapies, these autoantigen-specific interventions showed multiple benefits over nonspecific immune suppression, including a reduction in the pathogenic peptide epitope response and in related side effects [112]. Even so, it cannot be predicted whether similarly favorable outcomes will be obtained in the phase III clinical trial [113]. In a recent double-blind, randomized, placebo-controlled phase IIb clinical trial, the intra-lymphatic administration of GAD-alum with vitamin D supplementation appeared to preserve C-peptide in patients with recent-onset T1DM carrying HLA DR3-DQ2 [114].

#### 3.2.3. Insulin as a Target in New Vaccine Product for T1DM

The BHT-3021 vaccine, which is based on proinsulin, the prohormone precursor of insulin, proved effective in enhancing insulin production in its early developmental stage [115]. Longer-term research on this vaccine, involving a larger group of 200 participants, was recently announced by NHS Choices. In another study, the reaction between insulin-like growth factor 2 (IGF-2), an insulin-dominant self-antigen, and an insulin autoantigen initiated a response resembling negative/tolerogenic self-vaccination, indicating a possible cure for T1DM, as reported by Chentoufi [116]. In support of this finding, other studies have indicated that self-antigen vaccination is one of the most secure strategies against autoimmune diabetes [117].

#### *3.3. Non-Antigen Specific Approach in Vaccine Production*

Certain immunologically active microbes and their products have been reported to prevent autoimmune diabetes in different animal models [118]. These agents may confer a protective effect in humans by stimulating the immune system especially during childhood development [118]. As per "the hygiene hypothesis," the growing cases of autoimmune diseases may be caused by insufficient microbial exposure due to the improved hygienic conditions of the developed world [118]. Some of these microbial approaches are discussed below.

#### 3.3.1. Live Pathogen *Salmonella* as Vector Vaccine

Live recombinant attenuated *Salmonella*-vectored vaccines show great potential for improving human health by achieving long-lasting mucosal, humoral, and cellular immunity against a variety of non-*Salmonella* pathogens at low cost. Since live attenuated *Salmonella* oral vaccines were first tested in phase I clinical trials, the use of recombinant DNA has been a major breakthrough in mucosal antigen delivery [119]. The downregulation of autoreactive T cells by the *Salmonella* vaccine conferred greater therapeutic effectiveness than antigen-specific approaches, in addition to its simplicity and relative safety, making it a potential future therapy for T1DM despite its uncertain mechanism of action [120]. A recent study combined *Salmonella typhimurium* with cytokines (small regulatory proteins) and a low dose of an immunosuppressive drug, anti-CD3. The results revealed that the vaccine restores balance to the immune system and prevents the attack on insulin-producing cells.

#### 3.3.2. Inactivated Microbial Vaccines

Apart from alum and CTB adjuvants, complete Freund's adjuvant (CFA) has also been studied as a vital T-regulatory cell inducer that successfully reverses autoimmune diabetes through multicomponent immunization. However, although Oikawa et al. (2010) showed that an anti-CXCL10 vaccine combined with CFA successfully reversed T1DM, this combination is not advisable for human use because of its high toxicity [121]. Within a few years, further studies involving CFA were performed using a CTB:GAD fusion protein, which induced protective effects in NOD mice [122]. More recently, a multivalent islet lysate-negative vaccine formulated with incomplete Freund's adjuvant (IFA) elicited a positive immunogenic response in the pathophysiology of diabetes [123].

#### 3.3.3. BCG as Vector Vaccine in Clinical Trial Studies

In 2001, tumor necrosis factor (TNF)-α activation following BCG vaccine administration to restore endogenous β-cell function led to the discovery of a potential T1DM reversal mechanism [124]. In addition, Faustman (2012) developed another BCG vaccine, which was tested in a phase I randomized controlled trial and demonstrated the ability of BCG to trigger TNF and thereby induce apoptosis in autoreactive T cells. Furthermore, an increased rate of restoration of pancreatic β-cell function was observed in a rodent model [125]. A minimum of two doses of the BCG vaccine is recommended, with the first dose administered in neonates. However, further studies on this claim are required, as the mechanism of action in human subjects remains unclear [126,127]. Moreover, randomized clinical studies conducted between 1999 and 2005 found no definite link between prophylactic BCG administration and the preservation of β-cell function in T1DM [128]. Nevertheless, approval by the Food and Drug Administration (FDA) led to the commencement of a phase II clinical trial of the BCG vaccine for T1DM reversal. The study design involves 130 participants, of whom over 100 have received at least one dose of the BCG vaccine, and their progress and response are being actively monitored for five years from the date of administration.

#### **4. Potential of Other Disease-Vaccines in Treating Diabetes**

Interestingly, previous research has provided evidence that enteroviruses (EV) are a causative factor in the onset of T1DM. Hence, research and development efforts toward an EV vaccine are ongoing, aided by technological advances. Several issues remain under discussion before this novel vaccine can be tested in clinical trials, including which EV serotypes are diabetogenic, the associated safety concerns, and the relevance and accuracy of the accumulated literature [129,130]. Nevertheless, a noteworthy accomplishment was achieved in a recent preclinical study with the invention of the first multivalent formalin-inactivated CVB1 vaccine, which proved effective with no adverse effects [131]. Beyond vaccines indicated for viral infections, the tuberculosis DNA vaccine DNA-HSP65 shows strong potential to become a new immunotherapeutic agent for the management of diabetes [132]. A list of new vaccine products in the pipeline that could be employed for diabetes can be found in Table 1.

*Int. J. Mol. Sci.* **2022**, *23*, 9470


**Table 1.** List of vaccine products for diabetes.

#### *Future Perspectives*

Despite the various prospects and the effectiveness of each approach discussed previously, the limitations that arise should be thoroughly considered and evaluated from various perspectives, so that modifications may be made for future immunotherapies and later implemented in phase III and IV trials. To date, most diabetes vaccine development has been carried out using animal models, with relatively few human trials [26]. A major limitation of using animal models, such as NOD mice, is their profound sensitivity to diabetes protection. As a result, several successful animal studies failed in human trials, including the Diabetes Prevention Trial-1 [26]. Recent studies have suggested that the administration of non-depleting anti-CD3 antibodies or a peptide from heat shock protein 60 is beneficial against recent-onset T1DM [26]. The revolutionary Diamyd vaccine for the prevention of diabetes, for example, could possibly be improved via the exploration of other antigen delivery pathways, such as self-antigen DNA vaccine administration, with consideration of divergent dose administration or replacement with a different type of autoantigen [133,134]. Besides that, understanding various other routes of administration could potentially be an area of further research. As another pivotal part of vaccine development, feasible study designs not only contribute toward a better understanding of the Alum-GAD system but could also pave the way for newer combination therapy studies for diabetes.

#### **5. Conclusions**

The pressing concern pertaining to the rising number of patients falling victim to diabetes has garnered the interest of numerous scientists and researchers worldwide in the search for the most effective management strategies, as demonstrated with the abundance of literature available to date. The urgent need for more effective prophylaxis in addition to conventional dietary modification advice to patients has prompted various attempts in developing vaccines to delay or prevent the onset of this chronic disease, as seen in recent approaches where protein-based, self-antigen and non-antigen specific interventions have exhibited promising potential for use as vaccines against diabetes in the near future. However, a deeper understanding is still required to ameliorate and invent potent therapies with minimal side effects regardless of the cause-related factors, especially for chronic diseases, such as T2DM.

**Author Contributions:** Conceptualization, D.K.C.; resources, A.B.S., K.P. (Krishna Prasad), N.A.A.S., W.S.Y., A.D. (Arpita Das), P.B., N.G., T.G., A.D. (Amitava Das), S.B., M.C., J.M., K.P. (Kishneth Palaniveloo), G.G., S.K.S. and K.D.; data curation, A.B.S., K.P. (Krishna Prasad), N.A.A.S., W.S.Y., A.D. (Arpita Das), P.B., N.G., T.G., A.D. (Amitava Das), S.B., M.C., J.M., K.P., G.G., S.K.S. and K.D.; writing—original draft preparation, A.B.S., K.P. (Krishna Prasad), N.A.A.S., W.S.Y., A.D. (Arpita Das), P.B., N.G., T.G., A.D. (Amitava Das), S.B., M.C., J.M., K.P. (Krishna Prasad), G.G., S.K.S. and K.D.; writing—review and editing, D.K.C., R.R.B., A.B.S., K.P. (Krishna Prasad), N.A.A.S., W.S.Y., A.D. (Arpita Das), P.B., N.G., T.G., A.D. (Amitava Das), S.B., M.C., J.M., K.P. (Kishneth Palaniveloo), G.G., S.K.S. and K.D.; supervision, D.K.C.; project administration, D.K.C.; funding acquisition, R.R.B. All authors have read and agreed to the published version of the manuscript.

**Funding:** The Article Processing Charges (APC) of this review are funded by the Deanship of Graduate Studies and Research, Ajman University.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** R.R.B. is thankful to the Deanship of Graduate Studies and Research, Ajman University, for providing funding towards the Article Processing Charges (APC).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **Abbreviations**

The following abbreviations are used in this manuscript:


#### **References**


## *Article* **Toxicity of** *Bacillus thuringiensis* **Strains Derived from the Novel Crystal Protein Cry31Aa with High Nematicidal Activity against Rice Parasitic Nematode** *Aphelenchoides besseyi*

**Zhao Liang 1,2,†, Qurban Ali 1,2,†, Yujie Wang 1,2, Guangyuan Mu 3, Xuefei Kan 3, Yajun Ren 1,2, Hakim Manghwar 4, Qin Gu 1,2, Huijun Wu 1,2 and Xuewen Gao 1,2,\***


**Abstract:** The plant parasitic nematode *Aphelenchoides besseyi* is a serious pest causing severe damage to various crop plants and vegetables. The *Bacillus thuringiensis* (Bt) strains GBAC46 and NMTD81 and the biological strain FZB42 showed high nematicidal activity against *A. besseyi*, of up to 88.80, 82.65, and 75.87%, respectively, in a 96-well plate experiment. We screened the whole genomes of the selected strains by protein-nucleic acid alignment. The Bt strain GBAC46 was found to carry three novel crystal proteins, namely, Cry31Aa, Cry73Aa, and Cry40ORF, which likely provide for the safe control of nematodes. The Cry31Aa protein was composed of 802 amino acids with a molecular weight of 90.257 kDa and contained a conserved delta-endotoxin insecticidal domain. Cry31Aa exhibited significant nematicidal activity against *A. besseyi*, with a lethal concentration (LC50) value of 131.80 μg/mL. Furthermore, the results of in vitro experiments (i.e., rhodamine and propidium iodide (PI) experiments) revealed that the Cry31Aa protein was taken up by *A. besseyi* and damaged the nematode's intestinal cell membrane, indicating that Cry31Aa is a pore-forming toxin. In pot experiments, the selected strains GBAC46, NMTD81, and FZB42 significantly reduced the lesions on leaves by up to 33.56, 45.66, and 30.34%, respectively, and also enhanced physiological growth parameters such as root length (65.10, 50.65, and 55.60%), shoot length (68.10, 55.60, and 59.45%), and plant fresh weight (60.71, 56.45, and 55.65%). The number of nematodes recovered from plants treated with the selected strains (i.e., GBAC46, NMTD81, and FZB42) was significantly reduced, at 0.56, 0.83, and 1.11 mL<sup>−1</sup> seedling<sup>−1</sup>, respectively, compared with 5.04 mL<sup>−1</sup> seedling<sup>−1</sup> for plants treated with *A. besseyi* alone. Moreover, the qRT-PCR analysis showed that the defense-related genes were upregulated, and the hydrogen peroxide (H2O2) content increased while malondialdehyde (MDA) decreased in rice leaves compared to the control. Therefore, it was concluded that the Bt strains GBAC46 and NMTD81 can promote rice growth, induce high expression of rice defense-related genes, and activate systemic resistance in rice. More importantly, the application of the novel Cry31Aa protein has high potential for the efficient and safe prevention and green control of plant parasitic nematodes.

**Keywords:** *Bacillus thuringiensis*; Cry toxin; nematicidal activity; pore-formation

**Citation:** Liang, Z.; Ali, Q.; Wang, Y.; Mu, G.; Kan, X.; Ren, Y.; Manghwar, H.; Gu, Q.; Wu, H.; Gao, X. Toxicity of *Bacillus thuringiensis* Strains Derived from the Novel Crystal Protein Cry31Aa with High Nematicidal Activity against Rice Parasitic Nematode *Aphelenchoides besseyi*. *Int. J. Mol. Sci.* **2022**, *23*, 8189. https://doi.org/10.3390/ijms23158189

Academic Editor: Wajid Zaman

Received: 20 June 2022 Accepted: 19 July 2022 Published: 25 July 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

#### **1. Introduction**

Plant pathogens, such as fungi, bacteria, viruses, and nematodes, cause different plant diseases, resulting in yield loss in crops around the globe [1,2]. Plant parasitic nematodes are currently serious pests, contributing to USD 157 billion in annual agricultural losses worldwide [3]. *Aphelenchoides besseyi* is one of the most damaging plant parasitic nematodes and can cause severe harm to a variety of significant crops and vegetables. The significant rice seed-borne pest, *A. besseyi*, spreads predominantly through infected seeds. This is the cause of rice white tip disease and is widely distributed throughout rice-growing areas [4]. In general, *A. besseyi* causes yield losses ranging from 10% to 30%, and in strongly infested areas, yield losses might reach 50% [5,6].

Rhizobacteria have been widely reported as biological control agents for plant parasitic nematodes. Numerous *Bacillus* species have been studied for their role in reducing nematodes and insect pests [7,8]. *Bacillus subtilis* and *Bacillus thuringiensis* (Bt) are two of the *Bacillus* species that have been investigated against various pathogens, including nematodes [9,10]. Bt strains can produce a number of parasporal crystal proteins (Cry or δ-endotoxins), which are poisonous to a variety of insects and pests [11,12]. Bt strains can also produce various crystal protein toxins (Cry5, Cry6, Cry12, Cry13, Cry14, Cry21, and Cry55) during the growth process. These Cry toxins exhibit substantial biological activity against various insects, nematodes, and other pathogenic pests [13,14].

Currently, the high efficiency and specificity of these toxic proteins enable Bt strains to be safely used in controlling various insect pests of crops. Bt strains are considered green and safe microbial pesticides, making them one of the most successful microbial pesticides [15,16]. They are widely used in global agricultural production, effectively reducing the use of chemical pesticides [17]. In addition, Bt toxin proteins have also been used in the field of plant genetic engineering to control crop pests [18]. For example, transgenic rapeseed containing the Cry1Ac gene can control diamondback moth, hairy bug, and cotton bollworm [19], and transgenic pigeon bean containing the Cry2Aa gene is resistant to pod borers [20]. The Cry3A gene in transgenic spruce showed toxicity to spruce bark beetles, indicating that Bt and its toxin proteins add significant value to biological control [21].

In addition to insects, plant parasitic nematodes also affect the agricultural economy [22,23]. Many studies have demonstrated that Cry proteins produced by Bt strains have nematicidal activity [16,23]. Cry proteins with nematicidal activity mainly include Cry5, Cry6, Cry14, Cry21, and Cry55 [14,24]; among these, the 3D-Cry protein Cry5 and non-3D-Cry protein Cry6 have been reported to kill nematodes [25,26]. In order to find and identify novel Cry toxins with broad-spectrum activity against insects and nematodes, we compared Bt strains GBAC46 and NMTD81 based on genome-wide screening and found three unknown Cry proteins in the GBAC46 strain. Among the identified Cry proteins, Cry31Aa showed the highest nematicidal activity, killing nematodes by damaging the intestinal cell membrane.

Furthermore, the beneficial properties of Bt strains are not limited to their biological control activity; a variety of endophytes also possess the ability to indirectly or directly promote plant growth and development [27,28]. Bt is one of the most important phosphate-dissolving bacteria [29]; through enzymatic activity, it hydrolyzes insoluble phosphate in the soil into plant-absorbable phosphorus ions [30]. Bt produces siderophores through a nonribosomal synthetic peptide pathway to help plants absorb iron [31,32]. Additionally, it can synthesize plant hormones (such as indole acetic acid (IAA)), deaminase, and volatile organic compounds (VOCs) to promote plant growth [33]. Thus, Bt strains can be used as a biofertilizer agent to promote the uptake and transport of plant mineral nutrients.

Therefore, in the current study, we used different *Bacillus thuringiensis* (Bt) strains, GBAC46 and NMTD81, which were previously isolated in the Qinghai–Tibetan Plateau, China, and a well-known biological strain, *B. velezensis* FZB42, against the plant parasitic nematode *A. besseyi*. In this study, we identified a novel plasmid-encoded Cry protoxin, Cry31Aa, produced by Bt GBAC46, which acts synergistically to kill free-living *A. besseyi* nematodes via intestinal damage. We also determined the lethal concentrations at LC50 (131.80 μg/mL) for the selected Cry proteins Cry31Aa, Cry73Aa, and Cry40ORF and found that only Cry31Aa had strong nematicidal toxicity against *A. besseyi*. The present work sheds new light on the mechanism by which Bt strains and their Cry31Aa proteins regulate the genes involved in the defense mechanism that *A. besseyi*-infested rice plants use to efficiently control white tip disease in rice. Thus, the toxins produced by the Bt strain could be employed for effective *A. besseyi* control, benefiting both the rice crop and the environment and offering a more sustainable management technique than the overuse of nonspecific chemical nematicides.

#### **2. Results**

#### *2.1. Sequence Analysis of Cry31Aa and Familiar Cry Protein*

Bt is a major pathogenic bacterium, the pathogenicity of which is largely dependent on the parasporal crystal protein (Cry) which it produces during biological control activity [13]. In order to find new Cry toxins, we used SeqHunter 2 software to screen the whole genome of the GBAC46 and NMTD81 strains, and the results predicted three unknown Cry protein genes in GBAC46, namely, Cry31Aa, Cry73Aa, and Cry40ORF, while no new Cry proteins were observed in the NMTD81 strain (Figure 1A). The Cry31Aa, Cry73Aa, and Cry40ORF proteins contained 802, 395, and 568 amino acids and were 90.257, 44.647, and 65.1902 kDa in size, respectively. In Cry31Aa and Cry73Aa, amino acids 110–338 and 24–227 at the N-terminal, respectively, were in the delta-endotoxin insecticidal domain. Furthermore, the Cry73Aa C-terminal amino acids 524–659 were at the carbohydrate-binding module 6 (CBM6) functional domain, which determines the nematicidal specificity by binding with carbohydrates. In addition, the analysis of the amino acid sequences of the three Cry proteins in the GBAC46 strain found that the Cry31Aa protein was 65.48% similar to the Cry70Ba1 protein (Figure 1B). The Cry40ORF protein was most similar to the Cry48Ab1 protein, but the amino acid sequence similarity was only 14.26%. The highest amino acid sequence similarity for Cry73Aa was with the MPP46Aa1 protein, i.e., 11.62%. These three Cry proteins of the GBAC46 strain are novel and unreported Cry proteins. Furthermore, SWISS-MODEL was used to predict and analyze the three-dimensional structures of the three Cry proteins of the GBAC46 strain. The results showed that the 3D models of the three Cry proteins were rich in α-helix and β-sheet, predicting that the respective proteins may be perforating toxins (Figure 1C).
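As an illustration of how a pairwise similarity percentage such as those above can be derived from an alignment, the following is a minimal sketch that computes percent identity over the aligned, gap-free columns of two sequences. The source does not state which tool or which definition of similarity was used, so the scoring rule here is an assumption, and the toy sequences are hypothetical placeholders rather than Cry protein sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap.

    Assumption: gaps are written as '-' and gap columns are excluded from
    the denominator. Other conventions (e.g., dividing by the full alignment
    length) give different values.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Hypothetical toy alignment, for illustration only.
toy_a = "MKT-AYLLG"
toy_b = "MRTQAY-LG"
toy_identity = round(percent_identity(toy_a, toy_b), 2)
print(toy_identity)
```

Values such as 14.26% or 11.62% are low under any of these conventions, which is in line with the description of these proteins as novel.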

#### *2.2. Expression, Immunoblot, and Purification of the Three Cry Proteins of GBAC46*

RT-PCR analysis was used to detect transcription of the three Cry protein genes from Bt strain GBAC46. The RT-PCR results showed that the three Cry protein genes of the GBAC46 strain were normally transcribed: when reverse-transcribed cDNA was used as a template, strong bands were detected in the gel. When RNA was used as a negative-control template, no bands were obtained. The band sizes agreed with the calculated sizes for Cry31Aa, Cry40ORF, and Cry73Aa, which were 100, 150, and 150 bp, respectively (Figure 2A). In addition, the three Cry proteins of the GBAC46 strain were heterologously expressed in BL21 *E. coli* using the pOPT *E. coli* expression vector. The sizes of the three proteins, Cry31Aa, Cry73Aa, and Cry40ORF, were 104.39, 58.78, and 79.32 kDa, respectively. The results of the Western blot and SDS-PAGE showed that the three Cry protein target bands were correct, as shown in Figure 2B.
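The expressed protein sizes reported here are larger than the predicted sizes given in Section 2.1. A quick arithmetic check, sketched below using only the values reported in the text, shows that the difference is nearly constant across the three proteins. Reading this constant offset as a vector-derived fusion tag is an assumption on our part; the source does not state the origin of the difference.

```python
# Predicted sizes from Section 2.1 and expressed sizes from Section 2.2 (kDa).
predicted_kda = {"Cry31Aa": 90.257, "Cry73Aa": 44.647, "Cry40ORF": 65.1902}
expressed_kda = {"Cry31Aa": 104.39, "Cry73Aa": 58.78, "Cry40ORF": 79.32}


def size_offsets(predicted: dict, expressed: dict) -> dict:
    """Return expressed minus predicted size (kDa) for each protein."""
    return {name: expressed[name] - predicted[name] for name in predicted}


offsets = size_offsets(predicted_kda, expressed_kda)
for name, delta in offsets.items():
    print(f"{name}: +{delta:.2f} kDa")
```

All three differences come to roughly 14.13 kDa, so the expressed sizes are at least internally consistent with the predicted ones.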

**Figure 1.** Phylogenetic tree construction was based on the sequence alignment of full-length Cry31Aa and familiar Cry protein (**A**); the amino acid sequences alignment of the Cry31Aa, Cry73Aa, and Cry40ORF proteins (**B**); three-dimensional structure prediction of the three Cry proteins in the GBAC46 strain (**C**).

**Figure 2.** Detection of Cry protein genes and the identification of BL21 *E. coli* using the pOPT *E. coli* expression vector. (**A**) Detection of the transcription of *Cry31Aa* through RT-PCR. Total RNA was extracted from GBAC46 and BL21/pOPT. cDNA was used as a positive control, and RNA was used as a negative control. (**B**) SDS-PAGE expression analysis of Cry proteins (i.e., Cry31Aa, Cry40ORF, and Cry73Aa) after IPTG induction at 16 °C for 16 h. Levels of the three Cry proteins encoded by BL21/pOPT detected via Western blotting. The experiments were repeated independently three times.

#### *2.3. Virulence of Bacillus Thuringiensis Strains to A. besseyi*

The nematicidal activity of the selected Bt strains against *A. besseyi* was evaluated in vitro in a 96-well plate assay. The plate counting method was used to determine the concentrations of the strains applied to the nematodes across a concentration gradient. After 24 h of incubation at 20 °C, the GBAC46, NMTD81, and FZB42 strains showed strong nematicidal activity against *A. besseyi*, with mortality of up to 88.80, 82.65, and 75.87%, respectively, compared to the control, as shown in Figure 3A. Nematodes were considered dead if they were unable to restore movement after being touched with a needle under a microscope. In addition, various concentrations (CFU/mL) of the selected strains were used. The results showed that at a high concentration, the mortality rate of *A. besseyi* was 100% compared to the control, ddH2O, as shown in Figure 3B.

**Figure 3.** Nematicidal activity of selected *Bacillus* strains against *A. besseyi* compared with the control: (**A**) microscopic figures under different objectives: 10×, 40×, and 100×; (**B**) graphical details of the nematicidal activity of selected *Bacillus* strains. Error bars show the standard deviation of the mean for each treatment, repeated three times in triplicate.

#### *2.4. Morphological Observation and Pore-Formation of Bt Strains Via Staining*

*Bacillus* is a typical rod-shaped bacterium. In the late stage of the stable culture period and the dying period, *Bacillus* forms spores in the cell to adapt to the external environment. We stained GBAC46 and NMTD81 strains cultured in SM medium for 48 h with biological staining reagents. The results showed that the vegetative cells of the GBAC46 and NMTD81 strains had a typical short rod shape, while the spores were nearly elliptical (Figure 4A). Furthermore, a scanning electron microscope (SEM) was used to observe the spores and parasporal crystal proteins of GBAC46 and NMTD81 after culturing in ICPM medium for 80 h. The results showed that both selected strains formed more spores and Cry proteins, as shown in Figure 4B.

**Figure 4.** Detection of spore formation in GBAC46 and NMTD81 strains: (**A**) different colors represent different cells: blue—spores, red—living cells under the microscopic study; (**B**) observation of spore and crystal proteins in selected strains under a SEM. White arrow—spores; black arrow—circular crystal proteins; yellow arrow—bipyramidal crystal proteins; red arrow—irregular crystal proteins.

#### *2.5. Toxicity of Cry Protein to A. besseyi*

In this study, we tested the selected Cry proteins, Cry31Aa, Cry40ORF, and Cry73Aa, from GBAC46 to determine their nematicidal activity against *A. besseyi*. The results showed that the Cry31Aa protein had the highest nematicidal activity against *A. besseyi*, and its LC50 value was 131.80 μg/mL. In addition, we used rhodamine to label the three Cry proteins and detect whether they were ingested by the nematodes, with PBS used as the negative control. The results showed that red fluorescence could be observed in the gut of the nematodes after 4 h of treatment with rhodamine-labeled Cry31Aa protein, whereas no red fluorescence was observed in the negative control treated with PBS (Figure 5A). To verify that the Cry31Aa protein was a perforating toxin, we fed *A. besseyi* with the Cry31Aa protein and stained the nematodes with propidium iodide dye. The results showed that, after treatment with the Cry31Aa protein, propidium iodide diffused into the intestine of *A. besseyi* (Figure 5B).

**Figure 5.** Observation of Cry proteins absorbed by *A. besseyi*: (**A**) the nematode *A. besseyi* after being treated with PBS as a control and with rhodamine-labeled Cry31Aa, observed in bright and red fields; (**B**) *A. besseyi* exposed to the pore-forming Cry31Aa protein and stained with propidium iodide (PI); fluorescence microscopy was used to monitor the PI signal. The bright and red fields clearly indicate the changes in the intestinal cells. Different letters indicate significant differences at *p* < 0.005.

#### *2.6. Plant Growth Promotion Traits of Bt Strains in a Pot Experiment*

In the present study, the selected strains were used as treatments, and water was used as the control (CK). The rice seeds were soaked in a suspension of *A. besseyi* for 36 h, and then the seeds were dipped into bacterial solutions of the selected strains at OD600 = 1.0 and planted for 1 month in a greenhouse under controlled conditions. The results showed that the disease intensity was small and the "white tip" lesions were white to yellow–brown in the control seedlings, whereas, after treatment with the bacterial solution of the selected strains, the disease symptoms on leaves were significantly reduced (Figure 6A). The length of the leaf lesions was calculated, and the disease index of the different treatment groups was found to be significantly reduced, by up to 33.56, 45.66, and 30.34% by GBAC46, NMTD81, and FZB42, respectively.

Further, the growth of rice seedlings and other physiological parameters were found to be enhanced by the *Bacillus* strains after treatment with *A. besseyi*. The selected strains, GBAC46, NMTD81, and FZB42, significantly increased root length by 65.10, 50.65, and 55.60%, shoot length by 68.10, 55.60, and 59.45%, and plant fresh weight by 60.71, 56.45, and 55.65%, respectively, compared to the control water treatment (Figure 6B). In addition, whole rice plants were cut into small pieces, the nematodes were isolated from the plants, and the concentration of nematodes isolated from the rice seedlings was calculated. The results showed that the number of nematodes obtained from rice plants treated with the selected strains, GBAC46, NMTD81, and FZB42, was significantly lower (0.56, 0.83, and 1.11 mL<sup>−1</sup> seedling<sup>−1</sup>, respectively) than that obtained from plants treated with the nematode *A. besseyi* alone (5.04 mL<sup>−1</sup> seedling<sup>−1</sup>). In conclusion, these results indicated that the selected Bt strains could significantly suppress rice white tip nematode disease.
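To put the nematode counts above on a relative scale, the short sketch below converts them to percent reductions. It assumes that the fourth reported value (5.04) is the count for plants treated with *A. besseyi* alone and can serve as the baseline; the percentages it prints are derived here and are not reported in the source.

```python
# Nematode counts (mL^-1 seedling^-1) as reported in the text.
treated_counts = {"GBAC46": 0.56, "NMTD81": 0.83, "FZB42": 1.11}
# Assumption: 5.04 is the A. besseyi-only baseline.
baseline_count = 5.04


def percent_reduction(treated: float, baseline: float) -> float:
    """Percent reduction of a treated count relative to a positive baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - treated) / baseline


reductions = {
    strain: round(percent_reduction(count, baseline_count), 2)
    for strain, count in treated_counts.items()
}
for strain, value in reductions.items():
    print(f"{strain}: {value}% fewer nematodes than the baseline")
```

Under this assumption, the ranking of the strains (GBAC46, then NMTD81, then FZB42) matches the ranking of their in vitro mortality rates in Section 2.3.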

**Figure 6.** The effect of various *Bacillus* strains against *A. besseyi*-infested rice plants in a greenhouse experiment. The photographic representation of the effect of the selected strains on the infested rice plants (**A**). The rice growth promotion parameters, such as root and shoot length, plant fresh weight, and the number of *A. besseyi* nematodes detected in rice seedlings, after being treated with the selected strains (**B**). The error bars indicate the mean standard deviation of each treatment when replicated three times. Significant differences are indicated by the letters above the columns. Fisher's LSD test was used to determine significant differences between treatments at *p* ≤ 0.05.

#### *2.7. Hydrogen Peroxide (H2O2) and Malondialdehyde (MDA) Analysis in Rice*

To further examine the effects of the GBAC46 and NMTD81 strains on the defense response of rice, we measured the contents of hydrogen peroxide (H2O2) and malondialdehyde (MDA), a product of lipid peroxidation, in rice leaves after treatment with the selected strains for 24 h. The results showed that both the GBAC46 and NMTD81 strains significantly induced the accumulation of H2O2 and reduced the MDA content in rice leaves, revealing that the selected strains could induce the defense response of rice compared to the FZB42 positive control and the CK negative control (Figure 7).

**Figure 7.** Determination of H2O2 and MDA in the leaves of 30 day old seedlings following different treatments. The error bars indicate the mean standard deviation of each treatment when replicated three times. Significant differences are indicated by the letters above the columns. Fisher's LSD test was used to determine significant differences between treatments at *p* ≤ 0.05.

#### *2.8. Relative Expression of Defense-Related Genes*

The effects of the selected strains on the relative expression of rice plant defense-related (*PR*) genes were studied. The results revealed that the GBAC46 strain significantly enhanced the relative expression levels of all defense-related genes, such as *PBZ1, PR1a, PR4*, and *LOX1*, with *PR1a* being more highly upregulated than the other genes. Similar results were obtained after treatment with the NMTD81 strain, and the relative expression levels of the *PR*-related genes *PBZ1*, *PR1a,* and *PR4* were significantly upregulated, indicating that the selected strains could induce high relative expression of rice defense response-related genes (Figure 8).

**Figure 8.** Relative expression analysis of defense genes *PR1a, PR4, LOX1, PBZ1,* and *PAL1* in rice plants infested with *A. besseyi* and treated with Bt strains. The error bars indicate the mean standard deviation of each treatment when replicated three times. Significant differences are indicated by the letters above the columns. Fisher's LSD test was used to determine significant differences between treatments at *p* ≤ 0.05.

#### **3. Discussion**

Plant parasitic nematodes threaten agricultural production and human health [34]. *Bacillus thuringiensis* is a pathogenic bacterium with a pathogenicity that is largely dependent on the parasporal crystal protein which it produces against nematodes [13]. For the discovery and identification of Bt toxins, whole-genome sequencing has proven to be a helpful and efficient method [35,36]. The Cry proteins Cry21Ha and Cry1Ba, together with Vip1/Vip2 and β-exotoxin, were identified as possible nematicidal factors by genome analysis of Bt 4A4, suggesting that this strain possesses many toxins with potential activity against nematodes [35]. The discovery of new highly toxic Cry toxins will help in developing new pesticides to effectively control diseases related to different insects and nematodes in agriculture.

In this study, we used SeqHunter 2 software to screen the whole genome of the GBAC46 strain, where three unknown Cry protein genes were found, namely, Cry31Aa, Cry73Aa, and Cry40ORF (Figure 1A). The analysis of the amino acid sequences of the three Cry proteins in the GBAC46 strain revealed that the Cry31Aa protein was 65.48% similar to the Cry70Ba1 protein [37]. Recent studies reported that the Cry70Ba1 and Cry21Ha1 proteins had good insecticidal activity against the lepidopteran beet armyworm and *C. elegans* [38]. The Cry40ORF protein was most similar to the Cry48Ab1 protein, but the amino acid sequence similarity was only 14.26%, within the delta-endotoxin Cry1Ac domain V [39]. Furthermore, SWISS-MODEL was used to predict and analyze the three-dimensional structures of the three Cry proteins of the GBAC46 strain. The results show that the 3D models of the three Cry proteins were rich in α-helix and β-sheet, predicting that the respective proteins may be perforating toxins, in agreement with a previous report [40].

Various Bt strains can effectively control different plant parasitic and animal nematodes [11,41]. Bt has been found to have a key function in the control of plant nematodes; it not only prevents eggs from hatching but also has nematicidal activity against juveniles (J2) [42]. Various scientists have reported the role of several Bt species in killing nematodes such as *Meloidogyne incognita, Rotylenchulus reniformis, Heterodera glycines,* and *Caenorhabditis elegans* [11,36]. In this study, we explored the nematicidal activity of the Bt strains against *A. besseyi* in a 96-well plate in vitro. The selected strains GBAC46, NMTD81, and FZB42 revealed significant differences in their ability to kill *A. besseyi* after 24 h of incubation at 20 °C compared to the control. These findings are consistent with prior research that found *P. pacificus* to be more disease resistant than *C. elegans* [43,44]. Another study also found that the secondary metabolite of the FZB42 strain, plantazolicin, kills *C. elegans* and has insecticidal activity against *D. elegans* [45].

Many studies have shown that Cry proteins are highly toxic to nematodes [46,47]. Bt Cry toxins are almost entirely encoded by plasmids and can be used as nematicidal candidates in Bt DB27 [36]. Interestingly, crystals have previously been reported to show a close association with spores, a trait known as spore–crystal association (SCA) [48]. To explore further, we tested whether Bt GBAC46 produced Cry toxins. After culturing for 80 h in ICPM medium, the results demonstrated that the GBAC46 strain produced more spores, and the parasporal crystal proteins were detected using SEM. These results reveal strong similarities to the SCA of a rare filamentous Bt strain [49].

Cry5, Cry6, Cry14, Cry21, and Cry55 are among the Cry proteins with reported nematicidal activity [24]. One of the prerequisites for a Cry protein to exert its toxic activity is that the toxin protein can be ingested into the intestine by nematodes. It was found that plant parasitic nematodes can only ingest food through narrow, needle-like stylets, limiting the size of their food [50]. It has also been shown that the ability of nematodes to take up proteins is not only determined by the protein molecular weight but may also be affected by the overall size, shape, and electrostatic charge of the protein [51]. Here, we investigated the Cry31Aa protein from the GBAC46 strain. The Cry31Aa protein was found to show strong nematicidal activity against *A. besseyi*. In addition, rhodamine was employed to label the three Cry proteins, since carboxytetramethylrhodamine is a biological stain that fluoresces strongly when dissolved in DMSO and is commonly used as a biological dye [52]. After 4 h of treatment with the rhodamine-labeled Cry31Aa protein, red fluorescence was visible in the guts of the nematodes, which agrees with the results in [53]. Propidium iodide (PI) is a red fluorescent dye that is excluded from cells with intact membranes but can enter nematode cells through the holes formed by a Cry protein, staining the nucleus red; it is frequently used to determine whether a Cry protein is a perforating toxin protein [9]. Red fluorescence can therefore be seen in the nuclei of the nematode intestinal cells if the Cry protein is a perforating toxin.

The biocontrol potential of the selected Bt isolates was evaluated for the effective control of the plant parasitic nematode *A. besseyi* under greenhouse conditions. The results showed that the disease severity was small and the white tip lesions were white to yellow–brown in the control seedlings, whereas after treatment with a bacterial solution of the selected strains at OD600 = 1.0, the disease symptoms on leaves were significantly reduced. A similar result was found against root-knot disease caused by *M. incognita* [54]. In addition, the growth of rice seedlings and the other physiological parameters were also determined. The results revealed that the root length and plant fresh weight significantly increased compared to the control water treatment, which is in agreement with [53].

Furthermore, *Bacillus* spp. and their products have been shown to stimulate the plant immune response, induce systemic resistance (ISR), and upregulate pathogenesis-related genes in plants [55,56]. The rapid buildup of H2O2 is considered a significant defense signal against phytopathogens in plant cells [57]. The accumulation of malondialdehyde (MDA) is an important indicator of how plants cope with various environmental stresses [10,58], whereas a decreased level of MDA indicates less membrane damage in plants inoculated with *Bacillus* strains under stress conditions [59]. In the current study, H2O2 levels were likewise upregulated and MDA levels were downregulated in rice plants treated with the selected strains.

Bt can also directly promote plant growth by synthesizing phosphate-solubilizing enzymes [60], siderophores [61], and plant hormones [62,63]. When the growth-promoting effects of the GBAC46 and NMTD81 strains were analyzed in rice, the selected strains significantly improved rice physiological parameters such as root and shoot length [10,28]. Moreover, qRT-PCR experiments showed that the GBAC46 and NMTD81 strains significantly enhanced the relative expression of defense-related (PR) genes in rice plants; our results agree with those in [10,64].

#### **4. Materials and Methods**

#### *4.1. Plasmids, Bacterial Strains, and Cultural Conditions*

In the current study, we used *Bacillus thuringiensis* GBAC46 and NMTD81 strains, which were previously isolated at the Laboratory of Biocontrol and Bacterial Molecular Biology, Nanjing Agricultural University, Nanjing, China, along with the biological control strain *Bacillus velezensis* FZB42, which was used as a positive control, and double-distilled water (ddH2O), which was used as a negative control (Supplementary Materials Table S1) [65]. All selected strains were incubated in 1/2 Luria-Bertani (LB) liquid medium (0.5% tryptone, 0.25% yeast extract, and 0.5% NaCl, pH 7.0) or on agar plates at 30 °C. The *Escherichia coli* (*E. coli*) strains DH5α and BL21 (DE3) were cultivated at 37 °C on LB agar plates or in LB broth. When required, the antibiotic ampicillin (Amp) was added at a final concentration of 100 μg/mL, and cultures were shaken at 200 rpm for 24 h, according to [66], with some modifications.

#### *4.2. Nematode Inoculum*

The original culture of the plant parasitic nematode, *Aphelenchoides besseyi*, was provided by Xuan Wang (Laboratory of Plant Nematology, Nanjing Agricultural University, Jiangsu, China). *A. besseyi* was further cultured on *Botrytis cinerea* grown on potato dextrose agar (PDA) medium for three weeks at 20 °C [67]. The mixture of various nematode stages was collected from the Petri dishes with a sufficient amount of ddH2O. The harvested *A. besseyi* was stored at 5 °C in the refrigerator until it was used in the experiment. In addition, in vivo studies were conducted using eggs collected on a 25 μm sieve after being rinsed twice with ddH2O. The Baermann funnel method was used to gather second-stage juveniles (J2) after incubating the eggs for 5 days at 20 °C [68]. These J2 were used to test the effect of the Bt strains and their Cry31Aa toxin on *A. besseyi*.

#### *4.3. Bioinformatics Analysis*

The three *cry* genes were identified by protein–nucleic acid alignment against the whole genome of the Bt strains using SeqHunter 2.0, software developed in 2010 by the laboratory of Professor Daolong Dou (Nanjing Agricultural University, China) in Microsoft Visual Basic 6.0. A phylogenetic tree was constructed from the multiple sequence alignment of the full-length sequences of the three Cry proteins from the GBAC46 strain with known Cry proteins from the BPPRC database (available online: https://camtech-bpp.ifas.ufl.edu/, accessed on 25 May 2021) using MEGA X 10.0.2. The Cry protein sequences were aligned using DNAMAN. The three-dimensional structures of the three Cry proteins were predicted by homology modeling using SWISS-MODEL (available online: https://swissmodel.expasy.org/, accessed on 25 May 2021).

#### *4.4. Expression of Cry31Aa Protein and Purification*

The full-length PCR product of the Cry31Aa gene, amplified from the chromosomal DNA of the GBAC46 strain using the primers Cry31Aa-F and Cry31Aa-R, was cloned into the expression plasmid pOPTHis between the *Bam*HI and *Hin*dIII sites with an N-terminal 6×His-maltose-binding protein tag and a C-terminal Flag tag (for primers see Supplementary Materials Table S3). Then, the recombinant vector was transformed into BL21 (DE3), and the Cry31Aa protein was purified as described previously in [69], with some modifications. Briefly, BL21 (DE3) cells carrying the expression plasmid were grown at 37 °C with shaking at 200 rpm in LB medium containing 100 μg/mL of Amp until an OD600 of 0.5–0.7 was reached. Isopropyl β-D-1-thiogalactopyranoside (IPTG) was then added to the cultures at a final concentration of 1 mM. After 16 h of incubation at 16 °C with shaking at 200 rpm, the cells were harvested by centrifugation at 10,000× *g* for 20 min.

The bacterial pellets were resuspended in buffer A (50 mM Hepes, 300 mM NaCl, 5% glycerol, and 30 mM imidazole, pH 8.0) and lysed using a high-pressure homogenizer, and the cell debris was removed by centrifugation at 10,000× *g* for 20 min. The recombinant proteins were purified in a single chromatographic step using a 5 mL HisTrap HP column (GE Healthcare, Milwaukee, WI, USA) at 5 mL/min on an AKTA avant 150 system (GE Healthcare) [70]. The bacterial cell lysate was loaded onto the column, and unbound proteins were removed by washing the column with 20 column volumes of buffer A. The proteins were eluted with a 30–250 mM imidazole gradient using buffer B (50 mM Hepes, 300 mM NaCl, 5% glycerol, and 250 mM imidazole, pH 8.0). After desalting with Millipore Amicon Ultra filters, the purified proteins were stored at −80 °C. The proteins were detected and identified using SDS-PAGE. The protein concentration was determined using the Bradford Protein Assay Kit (P0006, Beyotime Institute of Biotechnology, Shanghai, China). For the Western blot, we used an anti-Flag antibody (F1804, Sigma-Aldrich, St. Louis, MO, USA) following the manufacturer's instructions.

#### *4.5. Total RNA Extraction and cDNA Synthesis for RT-PCR*

Total RNA was isolated from the GBAC46 strain cultivated in liquid LB medium until the OD600 reached 1.0, following the procedure of the Bacterial RNA extraction kit (OMEGA Bio-tek, Inc., Norcross, GA, USA). After treatment with DNase I (Takara Shuzo, Takara, Japan), total RNA was reverse transcribed into cDNA using the HiScript II Q RT SuperMix Kit (Vazyme, Nanjing, China) according to the manufacturer's instructions, and the concentration was determined using the NanoDrop 1000 Spectrophotometer. RT-PCR was performed on an ABI 7500 Fast Real-Time PCR Detection System using SYBR qPCR Master Mix (Vazyme, Nanjing, China).

#### *4.6. Nematode Bioassays*

The selected strains GBAC46, NMTD81, and FZB42 were cultured overnight in liquid LB medium, washed 2–3 times with ddH2O, and diluted to various concentrations of colony-forming units (CFU/mL), namely 6.41 × 10<sup>7</sup>, 3.98 × 10<sup>6</sup>, 8.59 × 10<sup>5</sup>, and 7.21 × 10<sup>4</sup>, which were applied against *A. besseyi.* For the nematode bioassays, *A. besseyi* was treated with these concentrations in a 96-well plate at 20 °C, with ddH2O as a negative control. Live and dead J2 were counted under a microscope to calculate the mortality rate (%).
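The mortality rate follows directly from the live and dead J2 counts. As a minimal sketch, assuming mortality is simply the dead fraction of all J2 counted in a well (the text does not say whether a control correction was applied), the calculation for replicate wells could look like this. The counts and the function name `mortality_percent` are hypothetical.

```python
from statistics import mean, stdev


def mortality_percent(dead: int, live: int) -> float:
    """Return the percentage of dead J2 among all J2 counted in a well."""
    total = dead + live
    if total == 0:
        raise ValueError("no nematodes counted")
    return 100.0 * dead / total


# Hypothetical (dead, live) counts for three replicate wells of one treatment.
replicate_counts = [(42, 8), (45, 5), (40, 10)]
rates = [mortality_percent(dead, live) for dead, live in replicate_counts]

# Report as mean ± standard deviation, the format used in the statistical analysis.
print(f"mortality = {mean(rates):.1f} ± {stdev(rates):.1f} %")
```

If control mortality is not negligible, a correction against the ddH2O control would be a reasonable extension, but that is not described in the text.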

#### *4.7. Determination of Spore Formation*

Spore production was studied using sporulation medium (SM). Bt cells (1%) from an overnight culture were transferred to 20 mL of SM and cultured for 40 h at 37 °C with shaking at 200 rpm. Spore development was observed in one microliter of the culture. Spores were stained with malachite green and safranine O from a spore staining kit (HB8300-2, Haibo Biotechnology Co., Ltd., Suzhou, China), in accordance with the manufacturer's instructions. Malachite green was used to color the spores blue, and safranine O was used to counterstain the living cells red [69]. The differentially stained spores and living cells were observed using an Olympus BX43 microscope with cellSens Standard software (Olympus, Tokyo, Japan).

#### *4.8. Phenotypic Observation by Scanning Electron Microscopy*

The selected Bt strains were cultivated for 72 h at 30 °C with shaking at 200 rpm in liquid LB medium. The spore–crystal mixture was recovered from 1 mL of liquid culture by centrifugation, washed three times with 1 M NaCl and ice-cold ddH2O, and then rinsed 2–3 times with ddH2O. The parasporal crystal protein was observed using scanning electron microscopy (SEM). Following the manufacturer's protocols, the samples were fixed in 2.5% glutaraldehyde and 1% osmic acid, coated with gold particles, and then observed using a Hitachi S-3000N SEM (Hitachi, Tokyo, Japan) at a voltage of 5 kV.

#### *4.9. Endocytosis Assays with Rhodamine-Labeled Cry31Aa*

The endocytosis assays were carried out as previously described in [52], with some modifications. 5(6)-Carboxytetramethylrhodamine is a biological stain with strong fluorescence when dissolved in DMSO and is often used as a biological dye [52]. Rhodamine at 1 mg/mL was used to label the three Cry proteins to detect whether the protein was ingested by nematodes, and PBS was used as the negative control. After 2 h, the nematode samples were washed three times with sterilized water and observed with an Olympus fluorescence microscope using the rhodamine filter. Images were captured with a ×40 objective lens.

#### *4.10. Propidium Iodide Uptake Assays*

Propidium iodide (Sigma) uptake tests were used for pore-formation experiments, according to the methodology reported in [71], with some modifications. Nematodes were observed using a fluorescent microscope after being incubated with Cry31Aa for 4 h and 20 mM propidium iodide for 3 h. The wavelengths of excitation and emission were 555 and 580 nm, respectively. ImageJ was used to process the images.

#### *4.11. In Vivo Experiments with Rice Plants*

The rice cultivar (Daohuahua no. 4) was grown in a greenhouse under controlled conditions. Seeds were surface sterilized with 70% ethanol for 30 s and 5% (*w*/*v*) sodium hypochlorite solution for 20 min, and finally washed four times with ddH2O. The rice seeds were soaked in ddH2O for 36 h to accelerate germination. After 36 h, the rice seeds were dipped into an *A. besseyi* suspension (approximately 500 juveniles/mL), with ddH2O used as a control (CK), and kept at room temperature for 36 h. The soil was autoclaved at 140 °C for 40 min. Pots of similar size were filled with the sterilized soil and used for sowing the rice seeds. Then, 15 mL of an overnight culture of each selected strain (i.e., GBAC46 and NMTD81) or the positive control, FZB42, at OD600 = 1.0 was applied to the rice seedlings grown in the sterilized soil. The rice seedlings were then grown for one month in a greenhouse under controlled conditions.

#### *4.12. Estimation of MDA Level (Lipid Peroxidation) in Rice Plants*

Lipid peroxidation was assessed by measuring the malondialdehyde (MDA) level in plant leaves under stress conditions according to [72], with some changes. Briefly, fresh leaves (0.1 g) of rice plants treated with the selected strains, as well as of control plants, were homogenized in 500 μL of 0.1% (*w*/*v*) trichloroacetic acid (TCA) solution. The mixture was then centrifuged at 13,000 rpm and 4 °C. The supernatant of each treatment was mixed with 1.5 mL of 0.5% thiobarbituric acid (TBA) solution and heated at 95 °C in a water bath for 25 min. The mixture was placed on ice for 5 min to stop the reaction. The absorbance was measured at 532 and 600 nm using a microplate reader (SpectraMax Plus; Molecular Devices, Sunnyvale, CA, USA).
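The text does not give the formula used to convert the two absorbance readings into an MDA content. A common approach, shown here only as a sketch, subtracts the non-specific absorbance at 600 nm from the reading at 532 nm and applies the Beer-Lambert law with an extinction coefficient of 155 mM<sup>−1</sup> cm<sup>−1</sup> for the MDA-TBA adduct. The coefficient, the 1 cm path length, the 2 mL reaction volume (0.5 mL extract plus 1.5 mL TBA), and the function name are assumptions, not values confirmed by the authors.

```python
# Assumed extinction coefficient of the MDA-TBA adduct at 532 nm (M^-1 cm^-1).
MDA_TBA_EXTINCTION = 155_000.0


def mda_nmol_per_g(a532: float, a600: float,
                   reaction_volume_ml: float = 2.0,
                   fresh_weight_g: float = 0.1,
                   path_cm: float = 1.0) -> float:
    """Estimate MDA content (nmol per g fresh weight) from TBA assay absorbances."""
    corrected = a532 - a600  # remove non-specific turbidity
    # Beer-Lambert: concentration in mol/L, converted to nmol/mL (x 1e6).
    conc_nmol_per_ml = corrected / (MDA_TBA_EXTINCTION * path_cm) * 1e6
    return conc_nmol_per_ml * reaction_volume_ml / fresh_weight_g


# Hypothetical readings for one leaf sample.
print(f"{mda_nmol_per_g(0.36, 0.05):.1f} nmol MDA per g fresh weight")
```

In a microplate reader the optical path is usually shorter than 1 cm, so `path_cm` would need to match the actual well geometry.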

#### *4.13. Determination of Hydrogen Peroxide*

Rice leaf extracts were tested for hydrogen peroxide (H2O2) concentration. Fresh rice leaves (0.1 g) were homogenized at 4 °C in phosphate buffer (50 mM, pH 6.0) at a ratio of 1:9 (*w*/*v*). A hydrogen peroxide assay kit (Beyotime, Shanghai, China) was used to determine the amount of H2O2 in the leaf samples. The test tubes containing 50 mL test solutions were left at room temperature for 30 min before the absorbance was measured at 560 nm using a spectrometer. The absorbance measurements were calibrated using a standard curve with known H2O2 concentrations [72].
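Calibration against a standard curve amounts to fitting a line through the absorbances of known H2O2 standards and inverting it for each sample. The sketch below assumes a linear response over the range of the standards and uses an ordinary least-squares fit; the standard concentrations, absorbances, and function names are hypothetical.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical H2O2 standards (µM) and their absorbances at 560 nm.
standards_um = [0.0, 10.0, 20.0, 40.0, 80.0]
absorbances = [0.02, 0.12, 0.22, 0.42, 0.82]
slope, intercept = fit_line(standards_um, absorbances)


def h2o2_concentration(a560: float) -> float:
    """Convert a sample absorbance into an H2O2 concentration (µM)."""
    return (a560 - intercept) / slope


print(f"sample at A560 = 0.32 -> {h2o2_concentration(0.32):.1f} µM H2O2")
```

Samples whose absorbance falls outside the range of the standards should be diluted and measured again rather than extrapolated.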

#### *4.14. Defense-Related Genes' Expression in Rice Plants*

The expression of the defense genes (i.e., *PR1a, PR4, LOX1, PBZ1*, and *PAL1*) was profiled by RT-qPCR in rice plants grown in the greenhouse experiment after treatment with the selected strains. The selected gene sequences were retrieved from NCBI, and primers were designed with the PrimerQuest tool; the primers are listed in Supplementary Materials Table S2. The housekeeping gene elongation factor 1-alpha (*ef1*) was used as the internal reference. Briefly, RNA was extracted by the TRIzol method from fresh leaves of rice plants inoculated with the selected strains or treated with ddH2O (control), grown with or without *A. besseyi* infestation under greenhouse conditions, at 4 days post-inoculation (dpi). The Vazyme HiScript II Q RT SuperMix Kit (Vazyme, Nanjing, China) was used for cDNA synthesis. RT-qPCR was performed on an ABI 7500 Fast Real-Time PCR Detection System (Thermo Fisher Scientific, San Jose, CA, USA). The PCR program was as follows: initial denaturation at 95 °C for 30 s, followed by 40 cycles of 95 °C for 5 s and 60 °C for 34 s. Finally, relative quantification was performed according to the comparative CT method (2<sup>−ΔΔCT</sup>) as described in [73].
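The comparative CT calculation can be illustrated with a short sketch. It assumes that amplification efficiency is close to 100% for both the target and the reference gene, which is the standard assumption behind this method. The CT values and the function name are hypothetical.

```python
def relative_expression(ct_target_treated: float, ct_ref_treated: float,
                        ct_target_control: float, ct_ref_control: float) -> float:
    """Fold change of a target gene by the comparative CT method."""
    # Normalize each sample to the reference (housekeeping) gene.
    delta_ct_treated = ct_target_treated - ct_ref_treated
    delta_ct_control = ct_target_control - ct_ref_control
    # Compare the treated sample with the control sample.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)


# Hypothetical CT values: a defense gene in treated vs. control leaves,
# each normalized to the reference gene measured in the same sample.
fold = relative_expression(22.0, 18.0, 25.0, 18.0)
print(f"relative expression = {fold:.1f}-fold")
```

A fold change above 1 indicates higher expression in the treated plants than in the control, and a value below 1 indicates lower expression.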

#### *4.15. Statistical Analysis*

The data were analyzed using one-way analysis of variance (ANOVA) in SPSS version 26 software and reported as the mean ± standard deviation from three biological replicates. For group comparisons, Fisher's LSD test was utilized. Outcomes were considered significant at *p* < 0.05.
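The analysis itself was run in SPSS. Purely as an illustration of what a one-way ANOVA computes, the sketch below derives the F statistic from the between-group and within-group sums of squares in plain Python. The data and the function name `one_way_anova` are hypothetical, and the p-value and Fisher's LSD comparisons, which need the F and t distributions, are deliberately left out.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F statistic, df between groups, df within groups)."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical measurements: three treatments with three biological replicates each.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The resulting F value would then be compared with the critical value at *p* < 0.05 for the stated degrees of freedom.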

#### **5. Conclusions**

In conclusion, the selected Bt strains from the Tibet region, China, showed high nematicidal activity against *A. besseyi*. Three proteins (i.e., Cry31Aa, Cry73Aa, and Cry40ORF) were also characterized and functionally assessed for food safety. The novel Cry protein, Cry31Aa, possessed highly efficient and selective nematicidal activity. Furthermore, the current study adds new insights into the mechanisms by which nematicidal Bt regulates defense-related genes and improves plant growth promotion parameters in rice as well as contributes to the control of plant parasitic nematode diseases.

**Supplementary Materials:** The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/ijms23158189/s1.

**Author Contributions:** Z.L., Q.A. and X.G. planned and designed the research; Z.L., Q.A. and Y.W. performed the research and contributed to the methodology, writing, and editing; G.M., Y.R., X.K. and H.M. assisted in the analysis and compiled the data and results; Q.G., H.W. and X.G. critically revised and improved the manuscript. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Guidance Foundation of the Sanya Institute of Nanjing Agricultural University (NAUSY-MS18), Fundamental Research Funds for the Central Universities (KYZZ2022001), and the Key Project of NSFC Regional Innovation and Development Joint Fund (U20A2039).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


## *Review* **Degradation Mechanism of Autophagy-Related Proteins and Research Progress**

**Yanhui Zhou <sup>1,2</sup>, Hakim Manghwar <sup>1</sup>, Weiming Hu <sup>1,2,</sup>\* and Fen Liu <sup>1,2,</sup>\***


**Abstract:** In all eukaryotes, autophagy is the main pathway for nutrient recycling, which encapsulates parts of the cytoplasm and organelles in double-membrane vesicles, and then fuses with lysosomes/vacuoles to degrade them. Autophagy is a highly dynamic and relatively complex process influenced by multiple factors. Under normal growth conditions, it is maintained at basal levels. However, when plants are subjected to biotic and abiotic stresses, such as pathogens, drought, waterlogging, nutrient deficiencies, etc., autophagy is activated to help cells to survive under stress conditions. At present, the regulation of autophagy is mainly reflected in hormones, second messengers, post-transcriptional regulation, and protein post-translational modification. In recent years, the degradation mechanism of autophagy-related proteins has attracted much attention. In this review, we have summarized how autophagy-related proteins are degraded in yeast, animals, and plants, which will help us to have a more comprehensive and systematic understanding of the regulation mechanisms of autophagy. Moreover, research progress on the degradation of autophagy-related proteins in plants has been discussed.

**Keywords:** autophagy-related protein; degradation; ubiquitin; proteasome; autophagy

#### **1. Introduction**

Autophagy is an evolutionarily conserved process that exists widely in eukaryotic cells. It is responsible for the transport of certain cytoplasmic proteins and subcellular organelles into lysosomes/vacuoles for degradation, thereby contributing to the recycling of intracellular nutrients [1,2]. In yeast and mammals, autophagy consists of three types: microautophagy, macroautophagy, and chaperone-mediated autophagy (CMA). However, CMA is absent in plants and has been replaced by mega-autophagy [3]. Among them, the term autophagy usually refers to macroautophagy, which comprises both selective and non-selective intracellular degradation processes [4] (Figure 1). Selective autophagy is distinguished from bulk autophagy by the use of selective autophagy receptors (SARs) [5]. Macroautophagy begins with a series of autophagy-related (ATG) proteins forming phagophore structures at the phagophore assembly site; these structures then recruit substrate molecules and, through the expansion and closure of vesicles, form an autophagosome with a double-membrane structure. Subsequently, the outer membrane of the autophagosome fuses with the lysosomal or tonoplast membrane to release the contents into the lysosomal lumen or vacuole in the form of an autophagic body with only a single membrane, which is degraded by acid hydrolases into small molecules for recycling [6–8]. Microautophagy is a process that isolates and takes up cell components by direct enclosure with the lysosomal/vacuolar membrane [9]. Autophagy plays an important role in maintaining cellular homeostasis. Plants are exposed to many biotic and abiotic stresses, and these stresses affect plants [10–16]. When cells face stress conditions, such as nutrient deprivation, oxygen deficiency, and endoplasmic reticulum (ER) damage, autophagy is highly induced to maintain metabolic and energy balance [16–19]. During starvation, lysosomal activity is also enhanced to support autophagic flux [20].

**Citation:** Zhou, Y.; Manghwar, H.; Hu, W.; Liu, F. Degradation Mechanism of Autophagy-Related Proteins and Research Progress. *Int. J. Mol. Sci.* **2022**, *23*, 7301. https://doi.org/10.3390/ijms23137301

Academic Editor: Wajid Zaman

Received: 2 June 2022 Accepted: 29 June 2022 Published: 30 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**Figure 1.** Morphological steps of microautophagy and macroautophagy in eukaryotes. Macroautophagy begins with the formation of a phagophore that encapsulates damaged organelles and discarded proteins. Then, the vesicle extends to form a closed double-membrane structure, which is called an autophagosome. Subsequently, the outer membrane of the autophagosome fuses with the lysosomal membrane (animals) or the tonoplast (yeast and plants) to release the autophagic body with only a single membrane. Finally, the cargoes are digested by acid hydrolases into small molecules for recycling. Microautophagy is a process in which the lysosome/vacuole directly packages target substrates by membrane invagination to create the autophagic body.

Autophagy is a complex pathway that is tightly regulated. The expression of autophagy-related genes is spatiotemporally specific and dynamic. Autophagy-related proteins are regulated at different levels, including transcription, post-transcription, translation, and post-translation, which helps to inhibit or activate autophagy [21–23]. At the transcriptional level, when plants face drought stress, overexpression of heat shock factor A1a (HsfA1a) increases the mRNA levels of ATG10 and ATG18f, and the number of autophagosomes increases accordingly [24]. In mammals, microRNAs (miRNAs) regulate autophagy at the post-transcriptional level [25]. Post-translational modifications, such as phosphorylation, lipidation, and ubiquitination of autophagy-related proteins, are critical in regulating autophagic activity and duration [23]. For example, AMP-activated protein kinase (AMPK) phosphorylates BECN1 (beclin 1; vacuolar protein sorting (Vps) 30/Atg6 in yeast and ATG6 in plants) at Thr388 to induce autophagy [26]. ATG4a is modified by hydrogen sulfide (H2S)-mediated persulfidation, which inhibits its protease activity and thereby negatively regulates autophagy [27]. During ubiquitination, autophagy-related proteins are tagged with ubiquitin and are then recognized and degraded by the 26S proteasome [28], ultimately inhibiting the production of autophagic vacuoles. In recent years, the regulatory pathways of autophagy-related protein degradation have attracted much attention. In this review, we focus mainly on recent advances in the degradation mechanisms of core autophagy-related proteins.

#### **2. ATG Complexes in Plants and Animals**

ATG proteins play an indispensable role in the process of autophagy in plant and animal cells. According to their diverse functions in different stages of autophagy, core ATG proteins can be divided into four major complexes (Figure 2).

**Figure 2.** Core protein complexes of autophagy in plants and animals. (**A**) The ATG1/ATG13 protein kinase complex. When plants are under nutrient-rich conditions, TOR kinase hyperphosphorylates ATG13. When plants are placed under nutrient-starved conditions, the TOR kinase is inactivated, resulting in the dephosphorylation of ATG13, which binds tightly to ATG1. Then, the ATG1 kinase is activated and autophosphorylates, forming the ATG1, ATG11, ATG13, and ATG101 complex, which upregulates autophagy. (**B**) The ATG9/2/18 transmembrane complex and PI3K complex. ATG9 delivers the membrane source and mediates the extension of the phagophore membrane. ATG2 and ATG18 play a synergistic role in this process. The PI3K protein complex, which includes ATG6, ATG14, VPS34, VPS15, and PI3P, promotes the nucleation of vesicles. (**C**) The ATG5-ATG12 and ATG8-PE ubiquitin-like conjugating systems. ATG12 is transferred to the target protein ATG5 with the help of ATG7 and ATG10. Subsequently, the ATG5-ATG12 conjugate combines with ATG16 to form an oligomeric complex, which participates in the lipidation of ATG8. During the covalent binding stage of ATG8-PE, the cysteine protease ATG4 cleaves the C-terminus of ATG8. Subsequently, ATG8 is activated by ATG7 and transferred to ATG3 through a thioester bond. Finally, with the help of the ATG5-ATG12-ATG16 complex, ATG8 forms an ATG8-PE adduct with phosphatidylethanolamine. (**D**) Core ATG proteins of plants and mammals in the four complexes.

#### *2.1. ATG1/ATG13 Protein Kinase Complex*

The ATG1/ATG13 protein kinase complex is located at the most upstream position in the ATG protein recruitment hierarchy and initiates the formation of phagophores in response to the demand for nutrients [29]. When cells are under normal growth conditions, the target of rapamycin (TOR) kinase complex hyperphosphorylates ATG13, reducing its affinity for ATG1 (Unc-51-like autophagy activating kinase 1, ULK1 in mammals), thereby inhibiting autophagy. When plants are placed under stress conditions, TOR kinase is inactivated and ATG13 is dephosphorylated, promoting its tight binding to ATG1. ATG1 kinase is then activated and autophosphorylates, resulting in ATG1, ATG11, ATG13, and ATG101 forming a complex that connects to the phagophore membrane and promotes several downstream autophagy steps [30]. In mammalian cells, the ULK complex is made up of ULK1/2, ATG13, ATG101, and FIP200/RB1CC1 (Atg11 and Atg17 in yeast) [8]. FIP200/RB1CC1 is the scaffold protein for ULK1/2 and ATG13 and is essential for the phosphorylation and stability of ULK1/2 [31]. In addition to TOR complexes, several pathways, including cAMP-dependent protein kinase A and AMP-activated protein kinase, can also regulate the ATG1/ULK complex [32].

#### *2.2. ATG9/2/18 Transmembrane Complex*

The ATG9/2/18 transmembrane complex participates in phagophore formation and membrane fusion. ATG9 (ATG9A in mammals) is the only transmembrane protein in the autophagy process of eukaryotic cells; it provides a membrane source for forming the pre-autophagosomal structure (PAS) and drives the extension of the isolation membrane. It has been found that ATG9 loses contact with the phagophore membrane in the process of autophagy, thus regulating the formation of phagophores from the plant ER [33]. ATG2 (ATG2A in mammals) interacts with ATG18 (WD-repeat protein interacting with phosphoinositides, WIPI in mammals); both act in the later stages of autophagosome formation, anchoring the PAS/isolation membrane to the ER and transferring lipids, thereby effectively completing the closing process [34].

#### *2.3. Phosphatidylinositol-3-Kinase (PI3K) Complex*

The PI3K complex mediates vesicle nucleation and can be further divided into complexes I and II. In plants, complex I consists of ATG6, ATG14 (ATG14L in mammals), VPS15, and VPS34, and complex II consists of ATG6, VPS15, VPS34, and VPS38. The PI3K complex and phosphatidylinositol-3-phosphate (PI3P) are involved in regulating the de novo formation of the phagophore, and PI3P recruits the ATG2-ATG18 complex to the phagophore membrane and participates in the extension of the phagophore [35].

#### *2.4. ATG5/ATG12 and ATG8-Phosphatidylethanolamine (PE) Conjugation Systems*

The ATG5-ATG12 and ATG8-PE ubiquitin-like conjugating systems not only regulate the initiation of phagophore formation but also function at downstream steps [36]. Among them, ATG4 is a key Cys-protease with important functions in ATG8 (microtubule-associated protein light chain 3, LC3 in mammals) lipidation and free ATG8 turnover [37]. During autophagy, the C-terminus of ATG8 is first recognized and cleaved by the ATG4 protease. Subsequently, ATG7, which has ubiquitin-activating enzyme (E1)-like activity, binds to ATG8 and ATG12 and activates them. The activated ATG8 and ATG12 are transferred to ATG3 and ATG10, respectively, which have ubiquitin-conjugating enzyme (E2)-like activity, and are finally linked to their substrates: the ubiquitin-fold protein ATG8 is conjugated to PE, and ATG12 is coupled to ATG5. In addition, the ATG12-ATG5 conjugate combines with ATG16 (ATG16L1 in mammals) to form oligomeric complexes, which promote the formation of ATG8-PE adducts [38].

#### **3. Protein Degradation Pathways in Eukaryotic Cells**

To ensure that the autophagy machinery can respond in time to each stimulus, rapid inactivation/degradation of autophagy-related proteins through negative feedback would be an ideal regulatory mechanism. In eukaryotes, autophagy and the ubiquitin-proteasome system (UPS) maintain proteome homeostasis by degrading redundant or misfolded proteins [39]. Autophagy mainly degrades long-lived proteins and damaged or redundant organelles, whereas the UPS mainly degrades short-lived proteins [40,41]. Ubiquitination of proteins involves multiple steps. First, ubiquitin-activating enzyme E1 activates the C-terminus of a ubiquitin molecule by consuming adenosine triphosphate (ATP) to form an acyl-adenylate complex. Second, the ubiquitin-conjugating enzyme E2 transfers the activated ubiquitin molecule via a cysteine residue. The ubiquitin ligase E3 then links the ubiquitin molecule to the amino group of a lysine side chain of the target protein by catalyzing the formation of an isopeptide bond. These steps are repeated to build oligomeric ubiquitin chains on the target protein, which serve as ubiquitination tags. Finally, the proteasome cap recognizes the ubiquitination tag and uses the energy of ATP hydrolysis to drive ubiquitin removal and target protein unfolding. The unfolded target protein is then transferred to the proteasome core for degradation [42].

Ubiquitinated proteins can be deubiquitinated by the family of deubiquitinases (DUBs), which hydrolyze the isopeptide bonds formed between lysine residues of the target protein and ubiquitin molecules, thus counteracting protein degradation [43]. Thus far, six structurally distinct DUB families have been described, namely, the ubiquitin-specific proteases (USPs), the Josephin family, the ubiquitin C-terminal hydrolases (UCHs), ovarian tumor-related proteases (OTUs), a family of Zn-dependent JAB1/MPN/MOV34 metalloprotease DUBs (JAMMs), and the motif interacting with ubiquitin (MIU)-containing novel DUB family (MINDYs) [44]. Therefore, ubiquitination is a reversible and controllable process. Autophagy and the UPS work together to control intracellular protein quality at the stage of individual development. Plant autophagy-related proteins can be degraded by the UPS pathway [45], while components of the UPS, such as the 26S proteasome, can also be degraded by autophagy [46].

#### **4. Degradation of Yeast Autophagy-Related Proteins**

Among yeast autophagy-related proteins, the degradation mechanisms of the transmembrane proteins Atg9 and Atg32 have been reported. The cellular content of Atg9 is known to be a key factor in determining the number of autophagosomes [47]. Under normal growth conditions, Atg9 is ubiquitinated at Lys113, Lys121, and Lys138 by Met30, one of the substrate recognition components of the SCF (Skp, Cullin, F-box containing) ubiquitin ligase complex, and is subsequently degraded by the proteasome at peripheral sites away from the PAS (Table 1) [48,49]. When cells are under nutrient-deficient conditions, Atg9 is stabilized by Atg1-mediated phosphorylation, which inhibits its degradation and activates autophagy [50].

In yeast mitophagy, the mitochondrial outer membrane protein Atg32 is the only specific receptor [51], and it determines mitochondrial turnover. Overexpression of Atg32 can effectively activate mitophagy, while the deletion of Atg32 leads to a defect in mitophagy [52,53]. Interestingly, two published papers report different findings on the degradation mechanism of Atg32. Levchenko et al. [54] reported that there are two forms of Atg32 in *Saccharomyces cerevisiae*, namely, unmodified Atg32 and post-translationally modified Atg32, and that their degradation pathways differ. Post-translationally modified Atg32 is activated when mitophagy is induced to target mitochondria for degradation. Experiments show that this modification is neither phosphorylation nor ubiquitination. In contrast, the turnover of unmodified Atg32 is mediated by an unknown protease independent of the proteasome and vacuolar proteases [54]. However, studies by Camougrand et al. [55] demonstrated that the degradation of Atg32 is associated with the UPS. Ubiquitination of Atg32 occurs at least on the Lys282 residue and is mediated by several E3 ligases, including Rsp5, in a relatively complicated process [55]. In summary, apart from the different strains used, these two contradictory observations may be attributed to differences in growth conditions or mitophagy induction, which suggests that Atg32 levels are coordinately regulated by multiple pathways. Therefore, further research is needed to determine the details of the different degradation pathways of Atg32.


**Table 1.** E3 ligases and ubiquitin chain types involved in the degradation of ATG proteins in yeast, mammals, and plants.

#### **5. Degradation of Autophagy-Related Proteins in Metazoa**

*5.1. Degradation of ULK Complex Components*

In mammals, the ULK complex comprises four components: ULK1/2, ATG13, RB1CC1/FIP200, and ATG101. Among them, ULK1 plays the major role in autophagy [74]. The level of ULK1 is regulated by E3 ubiquitin ligases and USPs. Under prolonged starvation, autophosphorylation of ULK1 at Ser1042/Thr1046 facilitates its recruitment to Kelch-like (KLHL) 20, so that ULK1 is ubiquitinated and degraded as a substrate of the Cullin3 (Cul3)-KLHL20 ubiquitin ligase, thereby preventing unrestrained activation of autophagy. In addition, KLHL20 also coordinates the degradation of ATG13 through an indirect mechanism [56]. Cancer can hijack the protective function of autophagy to promote tumorigenesis [75]. In breast cancer cells, mitogen-activated protein kinase (MAPK) 1/3 kinases phosphorylate ULK1 at multiple sites, promoting its binding to the E3 ligase BTRC and triggering K48-linked ubiquitination and proteasomal degradation of ULK1, which ultimately attenuates mitophagy and promotes breast cancer bone metastasis [57]. Tumor necrosis factor receptor-associated factor 3 (TRAF3) is a member of the TRAF family with E3 ligase activity. In macrophages, TRAF3 mediates K48-linked ubiquitination and proteasomal degradation of ULK1, and the TRK-fused gene (TFG)-TRAF3 complex is able to interfere with the TRAF3-ULK1 interaction to stabilize ULK1 in response to lipopolysaccharide (LPS)-induced pyroptosis [58]. Neural precursor cell expressed developmentally downregulated 4-like (NEDD4L) is an E3 ubiquitin ligase containing a HECT domain. When cells need to endure starvation, NEDD4L ubiquitinates ULK1 at Lys925 and Lys933 and induces its degradation by the proteasome, so that autophagy is calibrated to an optimal level that ensures cell survival. Interestingly, NEDD4L-mediated ubiquitination of ULK1 does not use the canonical K48 linkage, but K27 and K29 linkages [59].

The degradation of ULK1 is also regulated by deubiquitination. USP20 mediates ULK1 deubiquitination in the basal state, thus interfering with lysosome-dependent degradation of ULK1 and playing a key role in autophagy initiation [76]. ATG101 is a scaffold protein that maintains the stability of ATG13 within the ULK1/ATG1 complex [77,78]. The HECT, UBA, and WWE domain containing E3 ubiquitin-protein ligase 1 (HUWE1) regulates cell proliferation and cell death and is an important factor affecting tumorigenesis [79]. In cancer cells, HUWE1 targets the C-terminal region of ATG101 for ubiquitination and degradation, inhibiting autophagy activity and thereby reducing cancer cell survival [60].
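The ULK1 regulators described in this subsection can be collected into a small lookup structure. The Python sketch below is only an illustrative summary of the statements above. The class `Ulk1Regulator`, its field names, and the helper `regulators_by_linkage` are hypothetical names introduced here, and a linkage type that the text does not state is recorded as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Ulk1Regulator:
    """One regulator of ULK1 stability as described in Section 5.1 (hypothetical record type)."""
    name: str
    role: str                               # "E3 ligase" or "deubiquitinase"
    linkages: Optional[Tuple[str, ...]]     # ubiquitin linkage types; None if not stated in the text
    context: str                            # condition or cell type named in the text
    reference: int                          # citation number used in this review


# Entries transcribed from the prose of Section 5.1; nothing is added beyond the text.
ULK1_REGULATORS = (
    Ulk1Regulator("CUL3-KLHL20", "E3 ligase", None, "prolonged starvation", 56),
    Ulk1Regulator("BTRC", "E3 ligase", ("K48",), "breast cancer cells", 57),
    Ulk1Regulator("TRAF3", "E3 ligase", ("K48",), "macrophages", 58),
    Ulk1Regulator("NEDD4L", "E3 ligase", ("K27", "K29"), "starvation", 59),
    Ulk1Regulator("USP20", "deubiquitinase", None, "basal state", 76),
)


def regulators_by_linkage(linkage: str) -> List[str]:
    """Return the names of regulators whose stated linkage types include `linkage`."""
    return [r.name for r in ULK1_REGULATORS
            if r.linkages is not None and linkage in r.linkages]
```

Calling `regulators_by_linkage("K48")` returns the two ligases for which the text explicitly names a K48 linkage, which makes the non-canonical K27/K29 linkage of NEDD4L easy to contrast.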

#### *5.2. Degradation of ATG9 Complex Components*

The mammalian ATG9 complex contains four ATG18 homologs, namely, WIPI1-4. Studies have shown that WIPI2 can associate the phagophore with the ER [80] and plays a role in cellular antibacterial autophagy [81]. Under basal cellular conditions, mTORC1 mediates the phosphorylation of WIPI2 at Ser395, thereby enhancing the specific interaction of WIPI2 with HUWE1, promoting the ubiquitination and proteasomal degradation of WIPI2, and restraining autophagy [61].

#### *5.3. Degradation of PI3K Complex Components*

In mammals, the autophagy-specific PI3K complex I consists of four proteins: VPS34, VPS15, BECN1, and ATG14L [82]. Ubiquitination and proteasomal degradation of ATG14L are regulated by the E3 ubiquitin ligase complex zinc finger and BTB domain containing 16 (ZBTB16)-CUL3-Regulator of cullins 1 (Roc1). Under normal nutritional conditions, inhibition of G-protein-coupled receptors (GPCRs) activates glycogen synthase kinase-3β (GSK-3β), which phosphorylates ZBTB16 to promote its autoubiquitination; this in turn upregulates ATG14L levels and activates autophagy [62]. In host cells, *Streptococcus pneumoniae* (Sp) releases choline binding protein C (CbpC) to form the intracellular ATG14L-CbpC-sequestosome 1 (SQSTM1)/p62 complex, promoting autophagy-dependent degradation of ATG14L. This inhibits autophagosome-lysosome fusion and bactericidal autophagic degradation, thereby increasing bacterial survival [83].

BECN1, the first tumor-associated ATG protein discovered in mammals, plays a role in the formation, elongation, and maturation of autophagosomes [84,85]. Different E3 ligases regulate BECN1 activity in different ways by attaching specific types of ubiquitin chains. NEDD4 modifies BECN1 with K11- and K63-linked polyubiquitin chains and Ring finger protein 216 (RNF216) modifies it with K48-linked chains; both mediate BECN1 degradation in a proteasome-dependent manner and negatively regulate autophagy [63,64]. Similarly, CUL3-KLHL20-mediated ubiquitination of BECN1 is primarily responsible for the termination of autophagy during prolonged starvation, whereas CUL3-KLHL38-mediated K48-linked ubiquitination of BECN1 prevents autophagy under normal conditions [65]. TRAF6 acts as an E3 ligase to trigger the K63-linked ubiquitination of BECN1, while the deubiquitinating enzyme A20 inhibits the ubiquitination of BECN1 [66]. Furthermore, tripartite motif 59 (TRIM59) mediates K48-linked ubiquitination and proteasomal degradation of TRAF6, which affects the ability of TRAF6 to ubiquitinate BECN1 and thus regulates the initiation of autophagy [86]. In contrast, three ubiquitin-specific peptidases, USP10, USP13, and USP19, act as positive regulators of autophagy and maintain BECN1 stability by deubiquitinating BECN1 in the VPS34 complex [87,88]. By binding to BECN1, solute carrier family 9 isoform 3 regulator 1 (SLC9A3R1) inhibits BECN1 ubiquitination and proteasomal degradation, which stimulates the formation of the autophagic core lipid kinase complex [89].

VPS34 is the only mammalian phosphatidylinositol 3-kinase class III (PI3KC3) that is activated on phagophore and endosomal membranes [90,91]. The level of VPS34 in cells is controlled by a variety of regulatory factors. For example, the ubiquitin-protein ligase E3C (UBE3C) and the deubiquitinating enzyme TRABID act antagonistically in regulating VPS34 levels to balance autophagic activity. UBE3C assembles K29/K48-branched ubiquitin chains on VPS34, enhancing VPS34 binding to the proteasome for degradation. TRABID stabilizes VPS34 by reducing its K29/K48 ubiquitination, which in turn promotes phagophore formation [67]. Similarly, auto-ubiquitination of NEDD4-1 enables it to act as a scaffolding protein that recruits USP13 to form the NEDD4-1-USP13 deubiquitination complex. This complex then promotes autophagy by removing K48-linked polyubiquitin chains from VPS34 [92]. In *Caenorhabditis elegans*, the K63-linked polyubiquitination of VPS34 mediated by the E2 enzyme UBC-13/UEV-1 and the E3 enzyme CHN-1 stabilizes the level of VPS34, promoting phagophore maturation [93]. In addition, phosphorylation of substrates can enhance their recognition by the SCF complex [94]. Phosphorylation of VPS34 at Thr159 promotes its recognition by F-Box and leucine-rich repeat protein 20 (FBXL20), which acts as an adaptor for the SCF complex and regulates the ubiquitination and proteasomal degradation of VPS34 [68]. Notably, Tang et al. highlighted that the ubiquitination of VPS34 is not directly linked to its degradation [95].

#### *5.4. Degradation of the ATG12-Conjugation System Components*

The ATG12-conjugation system includes ATG5, ATG7, ATG10, ATG12, and ATG16L1. ATG12 is a member of the ubiquitin-like protein (UBL) family, which consists of about 20 members [96]. Free ATG12 is highly labile and can be targeted for proteasomal degradation through ubiquitin-dependent or ubiquitin-independent mechanisms [97]. ATG5 is a ubiquitin-like ligase that is conjugated to ATG12 [98]. In Lepidoptera, ATG1 interacts with ATG5, and *Spodoptera litura* ATG1 (SlATG1) promotes the degradation of ATG5 [99]. In cardiomyocytes, the immunoproteasome β5i subunit interacts with ATG5 to promote its ubiquitination and degradation, thereby inhibiting autophagy and leading to cardiac hypertrophy [100]. Moreover, the saturated fatty acid palmitate induces ER stress and promotes the degradation of ATG5 protein through the ER-associated protein degradation (ERAD) pathway, consequently inhibiting autophagy and inducing apoptosis [101].

ATG16L1 interacts with ATG5, which stimulates the ATG8-PE coupling reaction [31]. Gigaxonin (GAN)-E3 ligase ubiquitinates ATG16L1 through K48-type ubiquitin chain polymerization, driving its degradation, thereby controlling the steady-state level of ATG16L1 to ensure fine-tuning of autophagy activation. Loss of GAN leads to the formation of ATG16L1 aggregates that impair phagophore elongation, inhibiting autophagic flux [69].

#### *5.5. Degradation of the LC3-Conjugation System Components*

In mammalian cells, LC3, GABA type A receptor-associated proteins (GABARAPs), GATE16, and ATG8L are homologs of yeast Atg8 [102,103]. All of them can be used as autophagosome markers. Under normal conditions, BRUCE acts as an autophagy inhibitor by promoting proteasomal degradation of LC3-I, thereby reducing LC3-II levels and autophagy [104]. Similarly, UBA6 and BIRC6 act synergistically as E1 and E2/E3 enzymes, respectively, for the monoubiquitination and proteasomal degradation of LC3B, which protects cells from death caused by excessive autophagy [70,105]. In addition, circumsporozoite protein (CSP) downregulates LC3B through the UPS pathway, but the mechanism involved is unclear [106].

ATG3 acts as an E2-like enzyme in LC3 lipidation. It can also autocatalyze its own conjugation to ATG12, forming a complex that promotes mitochondrial homeostasis [107,108]. Under DNA damage conditions, protein tyrosine kinase 2 (PTK2) phosphorylates ATG3 at Tyr203 and promotes the degradation of ATG3 through the ubiquitin-dependent proteasome pathway, which positively regulates the activity of cancer cells [71].

#### **6. Degradation of Plant Autophagy-Related Proteins**

To date, more than 40 ATG proteins and related regulatory proteins have been identified in plants [109], mainly from the analysis of yeast autophagy-deficient mutants, but their degradation mechanisms are poorly understood. In *Arabidopsis thaliana*, the turnover of ATG1 and ATG13 is strongly and specifically regulated by nutrition, which closely links autophagy with plant nutritional conditions. During fixed carbon or nitrogen limitation, the levels of ATG1 and ATG13 decrease sharply, but this decrease is reversed during refeeding. Studies on *A. thaliana* mutants impaired in autophagy or the 26S proteasome show that both degradation pathways are involved in their turnover, but that autophagy is more directly involved. Under nutrient limitation, ATG1 and ATG13 associate with autophagic-like cytolytic structures and are finally delivered to vacuoles together with autophagosomes for degradation [110].

In addition, Liu et al. [35] found a similar phenomenon in their study of ATG14 and its associated PI3K complex. When GFP-ATG14b was expressed in *atg7* and *atg14a atg14b* mutants, free GFP was still detected in the *atg14a atg14b* mutants, indicating that ATG14 was degraded, whereas the GFP-ATG14b fusion remained intact in the autophagy-defective *atg7* mutants. Furthermore, after nitrogen starvation, co-localization analysis in *atg14a atg14b* mutant roots expressing both mCherry-ATG8a and GFP-ATG14b revealed that ATG14 was bound to autophagic membranes [35]. Thus, similar to ATG1 and ATG13, ATG14 is degraded by the autophagic pathway through its association with autophagic bodies.

The effect of ubiquitination of autophagy-related proteins on autophagy has been widely studied in mammals, but less so in plant cells. In plants, ubiquitin can form a polyubiquitin chain at a single site in a target protein (polyubiquitination) or be attached as single molecules to multiple lysine residues (multi-monoubiquitination). In *A. thaliana*, K48 is commonly used to form polyubiquitin chains [111]. Qi et al. [45,72] characterized the ubiquitination-dependent turnover of ATG6 and ATG13 in *A. thaliana* (Figure 3). Under different nutritional conditions, *A. thaliana* TRAF1a and TRAF1b act as molecular adaptors that interact with the RING finger E3 ligases seven in absentia of *Arabidopsis thaliana* 1 (SINAT1)/SINAT2 and SINAT6 to regulate the turnover of ATG6 and ATG13. Among them, SINAT6 contains only a short truncated RING finger domain compared to SINAT1 and SINAT2 [112]. Under nutrient-rich conditions, *A. thaliana* TRAF1a and TRAF1b interact with SINAT1 and SINAT2 to mediate ubiquitination and degradation of ATG6. However, under nutrient deprivation, starvation-induced accumulation of SINAT6 reduces the binding of SINAT1 and SINAT2 to ATG6, stabilizes ATG6 levels, and activates autophagy [45].

Similarly, within the TRAF1-SINAT1/SINAT2-ATG13 TRAFasome, ATG13 undergoes K48-linked ubiquitination at Lys607 and Lys609 and is degraded, resulting in the dissociation of the ATG1-ATG13 complex and maintaining a relatively low autophagy level. In contrast, the TRAF1-SINAT6-ATG13 TRAFasome promotes the stability of ATG13, which induces autophagy in response to nutritional deficiency. In addition, under starvation conditions, ATG1 kinase phosphorylates TRAF1a and promotes its protein stability in vivo, indicating feedback regulation of autophagy [72]. In short, SINAT1/SINAT2 and SINAT6 play negative and positive roles, respectively, in regulating the stability of ATG6 and ATG13, and thus play opposite roles in autophagy. Furthermore, at the regulatory level, the turnover of ATG8, which interacts with various adaptor/receptor proteins to recruit specific cargos for degradation, is affected by acyl-CoA-binding protein 3 (ACBP3) [113]. In *A. thaliana*, ACBP3, a phospholipid-binding protein, is involved in the regulation of leaf senescence by regulating membrane phospholipid metabolism and the stability of ATG8. Overexpression of ACBP3 promotes the degradation of ATG8 and disrupts the formation of autophagic vesicles, thereby inhibiting autophagy [114].

In addition to maintaining cell homeostasis, autophagy also plays a vital role in plant immunity against pathogens. However, bacteria have evolved mechanisms for evading autophagic clearance in order to better parasitize the host [115]. Type III effector proteins (T3E) of plant pathogens are present in host cells, and these effectors are capable of manipulating host defense responses [116]. Src homology 3 (SH3) domain-containing protein-2 (SH3P2) is a novel membrane-associated protein involved in the formation of autophagosomes [117]. When *Xanthomonas campestris* pv. *vesicatoria* (*Xcv*) invades plant cells, it uses the bacterial effector E3 ligase XopL to mediate the ubiquitination and proteasome-dependent degradation of SH3P2 and to inhibit host autophagy, thus enhancing the virulence of *Xanthomonas*. Intriguingly, in host cells, XopL is recognized and degraded by defense-related, NBR1/Joka2-mediated selective autophagy. Hence, the mutual targeting of the pathogen effector XopL and the plant protein SH3P2 reveals the complex antagonism between the pathogen and the plant autophagy machinery [73].

**Figure 3.** Degradation mechanisms of ATG6 and ATG13 in plants. Under nutrient-rich conditions, the TRAF1s-SINAT1/SINAT2-ATG6 and TRAF1s-SINAT1/SINAT2-ATG13 TRAFasomes regulate the ubiquitination and proteasomal degradation of ATG6 and ATG13, thereby inhibiting autophagy. Under nutrient starvation, SINAT6 accumulates to form the TRAF1s-SINAT6-ATG6 and TRAF1s-SINAT6-ATG13 TRAFasomes, which maintain the stability of ATG6 and ATG13. Furthermore, ATG1 kinase phosphorylates TRAF1s to increase their stability.
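The nutrient-dependent switch summarized in Figure 3 can be written as a toy decision rule. The Python sketch below is purely qualitative and only restates the caption. The function name `trafasome_state` and its dictionary keys are hypothetical, and the model deliberately ignores protein levels, kinetics, and the ATG1-TRAF1 feedback loop.

```python
from typing import Dict, Tuple, Union


def trafasome_state(nutrient_rich: bool) -> Dict[str, Union[str, Tuple[str, ...]]]:
    """Qualitative fate of ATG6/ATG13 as summarized in Figure 3 (illustrative only)."""
    if nutrient_rich:
        # TRAF1s recruit SINAT1/SINAT2, which ubiquitinate ATG6 and ATG13.
        return {
            "sinat_in_trafasome": ("SINAT1", "SINAT2"),
            "atg6_atg13": "ubiquitinated and degraded by the proteasome",
            "autophagy": "inhibited",
        }
    # Under starvation, SINAT6 accumulates and replaces SINAT1/SINAT2 in the TRAFasome.
    return {
        "sinat_in_trafasome": ("SINAT6",),
        "atg6_atg13": "stabilized",
        "autophagy": "activated",
    }
```

A boolean input is enough here because the caption describes only two states; a graded nutrient signal would require quantitative data that this review does not provide.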

#### **7. Concluding Remarks and Future Perspectives**

Autophagy is an important regulatory factor for eukaryotic cells to cope with various stresses. In plants, it is precisely regulated by environmental changes and developmental stages [118]. In this review, the degradation mechanism of autophagy-related proteins in eukaryotes has been discussed. There are many studies on the degradation mechanism of animal autophagy-related proteins, probably because the regulation of autophagy can be used as an effective intervention for the treatment of metabolic and neurodegenerative diseases [119,120]. However, there is little information about how autophagy-related proteins are degraded in plant cells, and only the degradation mechanisms of ATG1, ATG6, ATG13, ATG14, and SH3P2 have been discovered thus far [45,72,110].

Among the known degradation mechanisms of animal autophagy-related proteins, the ubiquitin-proteasome system plays the main role. In addition, some regulatory factors that affect the degradation of autophagy-related proteins have been identified. Both autophagy and the UPS are involved in the degradation of the plant ATG1-ATG13 complex, and how plants coordinate the effects of these two pathways on autophagy still needs to be explored. When plants face conditions that are unfavorable for growth, such as drought, salt stress, and virus invasion, autophagy is activated, thereby improving plant resistance [121–123]. Under suboptimal growth conditions, autophagy is beneficial for improving crop yield [124], and the turnover of ATG proteins plays a decisive role in the level of autophagy. Therefore, the degradation mechanisms of plant autophagy-related proteins need to be studied more extensively so that this knowledge can contribute to agricultural production.

**Author Contributions:** The authors confirm their contributions to this work: Y.Z. in conceptualization, original draft preparation, and writing; and H.M., W.H., and F.L. in revising and review. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by a grant from the National Natural Science Foundation of China (No. 32100297 to Fen Liu).

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** We apologize to the authors whose works are not cited because of space limitations.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


MDPI St. Alban-Anlage 66 4052 Basel Switzerland Tel. +41 61 683 77 34 Fax +41 61 302 89 18 www.mdpi.com

*International Journal of Molecular Sciences* Editorial Office E-mail: ijms@mdpi.com www.mdpi.com/journal/ijms
