Article

13C NMR Spectral Prediction by Means of Generalized Atom Center Fragment Method

BIO-RAD Laboratories, Sadtler Division, 3316 Spring Garden Street, Philadelphia, PA 19104, USA
Molecules 1997, 2(8), 114-128; https://doi.org/10.3390/20800114
Submission received: 21 October 1996 / Accepted: 11 April 1997 / Published: 20 August 1997

Abstract

Knowledge-based NMR spectral prediction relies on the correlations between substructures and sub-spectra. To extract these correlations, a systematic substructure measurement has been developed to classify substructures according to their chemical shift values. Historically, the atom center fragment (ACF) concept has been used to measure substructures systematically for NMR spectral prediction. The assumption behind this concept is that the chemical shift value of an atom is influenced by its chemical environment. Based upon a study of ACF-type approaches, a generalized atom center fragment (GACF) approach is proposed in this paper. In the GACF approach, a substructure consists of a center atom, a core layer, and external layers. The center atom and the core layer together are identified as the super center atom. The external layers are the chemical environment. A number of algorithms have been developed to measure GACF substructures from a structure database and to create the knowledge base for NMR spectral prediction.

Introduction

Empirical NMR spectral prediction approaches correlate substructures and sub-spectra by means of substructural encoding. The oldest encoding method is the additivity model, which consists of a set of frame structures, substituents, and calculation rules [1,2]. A more general additivity model was reported by Small and Jurs [3] and later enhanced by Schweitzer and Small [4]. Recently, a number of neural network approaches have been employed for the encoding [5,6,7,8]. Bremser proposed the HOSE code to systematically encode substructures for 13C NMR knowledge extraction [9]. Robien adopted the HOSE method [10] and built a direct knowledge base retrieval method for 13C NMR spectral prediction [11]. His newer publications can be found in Anal. Chim. Acta, 1990, 229, 17 and J. Chem. Inf. Comput. Sci., 1992, 32, 291.
The critical part of NMR knowledge generation is the systematic substructure measurement. The chemical shift value of a carbon atom is influenced by the chemical environment of the atom. The HOSE code uses layers (or levels) to define the chemical environment (Figure 1). The first layer is defined as all the atoms one bond away from the central atom (or focus atom); the second layer contains the atoms two bonds away, and so on. This idea can be expressed in the atom center fragment (ACF) concept, which has been addressed by many authors in different ways [12,13,14].
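The layer definition can be sketched as a breadth-first traversal of the molecular graph. The adjacency-dictionary representation and the function name `atom_layers` are illustrative assumptions; they are not part of the HOSE code itself, which also encodes atom and bond types.

```python
from collections import deque

def atom_layers(adjacency, center, max_layers):
    """Group atoms by bond distance (layer) from the center atom.

    adjacency: dict mapping each atom index to a list of bonded atom indices
    (a hypothetical, minimal molecular-graph representation).
    Returns a list whose entry k-1 holds the sorted atoms of layer k.
    """
    distance = {center: 0}
    queue = deque([center])
    layers = [[] for _ in range(max_layers)]
    while queue:
        atom = queue.popleft()
        if distance[atom] == max_layers:
            continue  # do not expand beyond the outermost requested layer
        for neighbor in adjacency[atom]:
            if neighbor not in distance:
                distance[neighbor] = distance[atom] + 1
                layers[distance[neighbor] - 1].append(neighbor)
                queue.append(neighbor)
    return [sorted(layer) for layer in layers]

# n-Pentane carbon skeleton C0-C1-C2-C3-C4, centered on the terminal carbon C0.
pentane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(atom_layers(pentane, center=0, max_layers=3))  # [[1], [2], [3]]
```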
The effects of the environmental atoms on the central atom are not completely determined by the topological distance between the center atom and the environmental atoms. They are influenced by both the topological distance and the bonding types. These effective contributions of environmental atoms or groups to the central atom are shown in Figure 2 and Figure 3.
The data for these Figures are adopted from reference 1; the plots use the absolute values of the chemical shifts to emphasize the contributions. Figure 2 and Figure 3 show only the tendency; the continuous curves do not mean that there is any effective contribution between the 1-bond-away and 2-bond-away positions, etc.
From these Figures, it can be seen that if a center atom is in an acyclic aliphatic system, the effect of an environmental substituent on the center atom decreases as the number of atom layers between them increases. If the substituent is four bonds away from the center atom, it makes almost no effective contribution to the center atom’s chemical shift. In Figure 3, however, even when the R group is at position 4 (four bonds away), its effect on the center atom’s chemical shift is still significant. This is because the center atom is in an aromatic system.
The conclusion from these Figures is that the simple ACF measurement works only for acyclic and non-conjugated systems. Another problem in the simple ACF approach is deciding how many atom layers should be included in an ACF substructure. For example, if an ACF includes four atom layers, the outermost layer may be too far from the center atom in an aliphatic system, but not far enough for a center atom in a ring or conjugated system (see Figure 4).
To obtain an objective substructure measurement, it is necessary to extend the simple ACF concept to a generalized atom center fragment concept.

Generalized atom center fragment (GACF)

In the additivity model, the chemical environment of a center atom has two parts (see Figure 5), i.e., frame (core) structure and substituents (environmental substructures).
The GACF approach takes this into account in the substructure measurement. A GACF substructure consists of a center atom, a core structure, and environmental layers (substituents). The core structure characterizes the bonding system, which is also responsible for the special chemical behavior. According to topological and chemical properties, core structures are classified into the following nine classes:
  • Independent non-aromatic single ring system
  • Fused non-aromatic ring system
  • Bridged ring system
  • Spiro ring system
  • Independent aromatic single ring system
  • Fused aromatic ring system
  • Conjugated system
  • Cumulene system
  • Acyclic system
The structural features of these systems are listed in Table 1.
A core structure can belong to more than one ring class simultaneously. In this case, the assigned class is chosen by applying the following priority:
spiro ring > bridged ring > fused ring
The examples are shown in Figure 6.
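The priority rule can be written as a short lookup. This is a minimal sketch: the string labels and the function name `assign_ring_class` are hypothetical placeholders, and the ring perception that produces the candidate classes is assumed to have run already.

```python
# Priority order taken from the text: spiro ring > bridged ring > fused ring.
RING_CLASS_PRIORITY = ["spiro", "bridged", "fused"]

def assign_ring_class(candidate_classes):
    """Return the highest-priority ring class among the candidates.

    candidate_classes: iterable of class names perceived for one center atom
    (the string labels are hypothetical stand-ins for the paper's classes).
    """
    candidates = set(candidate_classes)
    for ring_class in RING_CLASS_PRIORITY:
        if ring_class in candidates:
            return ring_class
    raise ValueError("no prioritized ring class among the candidates")

print(assign_ring_class({"bridged", "fused"}))           # bridged (case A in Figure 6)
print(assign_ring_class({"fused", "bridged", "spiro"}))  # spiro (case B in Figure 6)
```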
Therefore, a GACF substructure is measured in the following steps:
  • select a center atom
  • get a class for the center atom
  • capture a core structure for the center atom (the core structure measurement is shown in blue in Table 1 in the right column)
  • capture the environmental substituents of the GACF, layer by layer, up to the chosen number of layers
A GACF example is illustrated in Figure 7.
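The last measurement step can be sketched as a layer-by-layer expansion that starts from the whole core rather than from a single atom. This is one reading of the text under stated assumptions: the adjacency dictionary, the function name `gacf_environment`, and the dictionary layout of the result are hypothetical, and core perception (ring and conjugation detection) is taken as given.

```python
def gacf_environment(adjacency, core_atoms, n_layers):
    """Collect environmental layers around a core structure.

    The core (center atom plus core structure) is treated as one "super
    center atom"; layer k holds atoms k bonds away from the nearest core atom.
    adjacency and core_atoms are hypothetical inputs: core perception itself
    is not shown here.
    """
    seen = set(core_atoms)
    frontier = set(core_atoms)
    layers = []
    for _ in range(n_layers):
        next_frontier = set()
        for atom in frontier:
            for neighbor in adjacency[atom]:
                if neighbor not in seen:
                    next_frontier.add(neighbor)
        seen |= next_frontier
        layers.append(sorted(next_frontier))
        frontier = next_frontier
    return layers

# Ethylcyclopropane skeleton: ring atoms 0-1-2, ethyl chain 3-4 attached to atom 0.
ethylcyclopropane = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 4], 4: [3]}
gacf = {
    "center": 0,
    "class": 1,  # independent non-aromatic single ring (class code from Table 1)
    "core": [0, 1, 2],
    "layers": gacf_environment(ethylcyclopropane, [0, 1, 2], n_layers=2),
}
print(gacf["layers"])  # [[3], [4]]
```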

13C NMR knowledge extraction and chemical shift prediction

A total of 50,000 structures with 13C chemical shift assignments were selected as the input data set for 13C NMR knowledge extraction. A number of graph theory algorithms have been developed to measure GACF substructures. The main ones are the perception algorithms for independent rings, fused rings, bridged rings, spiro rings, and conjugated systems. These algorithms are all based upon the GMA algorithm reported in our previous work [15].
The molecular diversity of the input data has been analyzed by means of our in-house algorithms and is shown in Figure 8.
In total, 64,307 GACF substructures (up to 2 layers) were extracted from 565,513 assigned chemical shifts. Here 1-GACF denotes a GACF with one environmental layer, 2-GACF one with two layers, and so forth. The GACF class distribution is shown in Figure 9. Each distinct GACF reflects a carbon atom with a different core structure and chemical environment; therefore, Figure 9 shows the atomic diversity.
It is known that a larger number of atom layers increases the size of a substructure, the size of the knowledge body, and the accuracy of the NMR spectral prediction. However, too large a knowledge body reduces search performance. A larger GACF also has less chance of being matched; that is, its knowledge will not be used very often. Figure 10 shows that the 2-GACF knowledge body (each GACF substructure has a 2-atom-layer chemical environment) has significantly more knowledge entries in classes 9, 5 and 6, and fewer entries in classes 1, 3, 4, and 7. We generate 0- to 2-layer GACF substructures for the 13C NMR knowledge base, where 0-GACF substructures have only core atoms and no environmental atoms.
The 13C NMR knowledge body is the correlation table of GACF substructures and 13C chemical shifts. Its format is shown in Table 2 with an example. The "shift" is the average chemical shift value, "median" is the middle value in this range, and "maximum" and "minimum" define the variable range (band). "σ" is the standard deviation, and "fn" is the number of sample chemical shifts used to produce the average chemical shift value, also called the "frequency number".
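One knowledge-body entry can be computed from the sample shifts of a GACF as sketched below. The function name `knowledge_entry`, the use of the statistical median, and the choice of the population standard deviation are assumptions; the paper does not state which estimators were used, and the sample values are made up.

```python
import statistics

def knowledge_entry(shifts):
    """Summarize sample chemical shifts (ppm) for one GACF substructure."""
    return {
        "shift": statistics.mean(shifts),     # average chemical shift
        "median": statistics.median(shifts),  # assumption: statistical median
        "maximum": max(shifts),
        "minimum": min(shifts),
        "sigma": statistics.pstdev(shifts),   # assumption: population std. dev.
        "fn": len(shifts),                    # frequency number
    }

samples = [20.0, 22.0, 24.0, 26.0]  # made-up shifts, for illustration only
entry = knowledge_entry(samples)
print(entry["shift"], entry["median"], entry["fn"])  # 23.0 23.0 4
```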
The accuracy and generality of the 13C NMR spectral prediction can be analyzed by studying the distributions of frequency numbers (fn). As shown in Table 2, fn should be larger than 3 to make statistical sense; however, if fn is too large, such as >1000, the accuracy will decrease. On the other hand, a 13C NMR KB rule with a larger fn may cover a larger structural diversity and therefore have better generality. Accuracy and generality conflict and have to be balanced. Figure 11 and Figure 12 show the relationship between the fn distribution, accuracy, and generality, where "the Number of Samples" is fn.
In Figure 11, about 47% of the 1-GACF rules have ~10 sample chemical shifts (fn). However, as shown in Figure 12, about 60% of the 2-GACF rules have 3~5 sample chemical shifts. Hence, 2-GACF rules will give more accurate predictions but cover less structural diversity. In fact, the structural diversity space is extremely large. It is nearly impossible to build a knowledge base that covers the whole structural diversity with reasonable prediction accuracy. Figure 13 shows a way to estimate the structural diversity space.
The 13C NMR spectral prediction for a given structure proceeds in the following steps:
  • input a structure (draw a structure through a graphical user interface)
  • for each carbon atom, extract a 2-GACF (2-level GACF) substructure from the structure
  • search for the 2-GACF in the 2-GACF 13C NMR KB
  • if this 2-GACF is found in the 2-GACF KB, the carbon’s chemical shift is predicted
  • if not:
  •  extract a 1-GACF (1-level GACF) substructure from the structure
  •  search for the 1-GACF in the 1-GACF 13C NMR KB
  •  if this 1-GACF is found in the 1-GACF KB, the carbon’s chemical shift is predicted with less accuracy
  •  if not, report: “cannot predict chemical shift for this type of carbon atom”
  • repeat from the second step for the next carbon atom
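The fallback from the 2-GACF to the 1-GACF knowledge base can be sketched as follows. The string keys, the toy knowledge bases, and the function name `predict_shift` are invented for illustration; in the actual program the keys would be derived from the measured GACF substructures.

```python
def predict_shift(gacf_by_layer, knowledge_bases):
    """Look up a carbon's shift, trying the 2-GACF first and then the 1-GACF.

    gacf_by_layer: dict {2: key, 1: key}, hypothetical lookup keys for the
    carbon's 2-layer and 1-layer GACF substructures.
    knowledge_bases: dict {2: {key: shift}, 1: {key: shift}}.
    Returns (shift, layer_used), or (None, None) when no rule matches.
    """
    for layer in (2, 1):  # most specific substructure first
        shift = knowledge_bases[layer].get(gacf_by_layer[layer])
        if shift is not None:
            return shift, layer
    return None, None

# Toy knowledge bases with invented keys and shift values.
kbs = {2: {"ring6-arom|C|H": 128.4}, 1: {"ring6-arom|C": 129.0, "acyclic|C": 25.0}}
carbons = [
    {2: "ring6-arom|C|H", 1: "ring6-arom|C"},  # found at 2 layers
    {2: "acyclic|C|O", 1: "acyclic|C"},        # falls back to 1 layer
    {2: "cumulene|C|C", 1: "cumulene|C"},      # not covered
]
for carbon in carbons:
    shift, layer = predict_shift(carbon, kbs)
    if shift is None:
        print("cannot predict chemical shift for this type of carbon atom")
    else:
        print(f"{shift} ppm (from {layer}-GACF rule)")
```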
Searching for a GACF among 64,307 GACF substructures by means of an atom-by-atom search would be time-consuming. Hence, all GACF substructures have been converted to hash codes and sorted, which makes the GACF structure search very fast.
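A sorted table of hash codes supports binary search, which is one plausible way to realize the fast lookup described here. The canonical strings, the choice of BLAKE2 as hash function, and all function names below are assumptions; the paper does not specify how the hash codes are computed, and collision handling is omitted.

```python
import bisect
import hashlib

def gacf_hash(canonical_string):
    """Map a canonical GACF string to a 64-bit integer code.

    The canonical string and the use of BLAKE2 are assumptions for
    illustration; the paper does not specify its hash function.
    """
    digest = hashlib.blake2b(canonical_string.encode("utf-8"), digest_size=8)
    return int.from_bytes(digest.digest(), "big")

def build_index(entries):
    """Return parallel lists (sorted codes, shifts) from {string: shift}."""
    pairs = sorted((gacf_hash(key), shift) for key, shift in entries.items())
    return [code for code, _ in pairs], [shift for _, shift in pairs]

def lookup(codes, shifts, canonical_string):
    """Binary-search the sorted codes; return the shift or None."""
    code = gacf_hash(canonical_string)
    position = bisect.bisect_left(codes, code)
    if position < len(codes) and codes[position] == code:
        return shifts[position]
    return None

# Invented keys and shift values.
codes, shifts = build_index({"acyclic|C|C": 25.0, "ring6-arom|C|H": 128.4})
print(lookup(codes, shifts, "ring6-arom|C|H"))
print(lookup(codes, shifts, "spiro|C|C"))
```

With n entries, each lookup costs O(log n) integer comparisons instead of a graph-matching pass over every stored substructure.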
Accuracy and generality conflict. Discriminating between similar substructures enables more accurate predictions, but fewer substructures can then match a GACF in the knowledge base. In Figure 14, C1 and C2 (colored in red) should have very different chemical shifts. If the pentagon atoms (colored in blue) are taken as the core layer (super-atom center), C1 and C2 cannot be distinguished at either one or two bonds away from the pentagon ring. In order to distinguish C1 and C2, the fused hexagon and pentagon atoms are taken as the super-atom center. With this GACF measure, C1 and C2 are distinguished in the 1-GACF.
More discrimination, however, reduces generality. No matter how large the knowledge body is, many substructures are still not included. The mission of a prediction algorithm is to output the best estimate based upon the existing knowledge collection.
In order to balance accuracy and generality, a degeneracy technique (a way to loosen the restrictions on structural pattern matching) is introduced. If a GACF substructure is not found in a higher-level GACF knowledge base, it can be degenerated into a simplified GACF that has more chance of matching. Figure 15 shows a way to degenerate a GACF substructure.
The degeneracy technique gains more generality for a knowledge base. Figure 16 A lists a set of structures used to estimate the chemical shift of the carbon atom (colored in red) of the 1-Degenerated GACF in Figure 15. Figure 16 B shows the structures not used for this estimation.
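The idea of loosening a pattern can be illustrated with wildcard matching. This is only a sketch under strong assumptions: the environment is reduced to lists of element symbols per layer in a fixed canonical order, the degeneration simply wildcards the whole outermost layer, and both function names are hypothetical. The actual degeneration scheme of Figure 15, which also drops bonds, is not reproduced here.

```python
def degenerate(layers):
    """Replace every atom in the outermost layer with the wildcard '*'."""
    return layers[:-1] + [["*"] * len(layers[-1])]

def matches(pattern_layers, candidate_layers):
    """True if each layer agrees atom by atom, with '*' matching any atom."""
    if len(pattern_layers) != len(candidate_layers):
        return False
    for pattern, candidate in zip(pattern_layers, candidate_layers):
        if len(pattern) != len(candidate):
            return False
        if any(p != "*" and p != c for p, c in zip(pattern, candidate)):
            return False
    return True

query = [["C", "O"], ["C", "N"]]    # element symbols per layer (invented)
stored = [["C", "O"], ["C", "Cl"]]  # a knowledge-base entry (invented)
print(matches(query, stored))              # False: the exact match fails
print(matches(degenerate(query), stored))  # True: the outer layer is wildcarded
```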

Results and conclusions

A GACF-based 13C NMR spectral prediction program has been implemented in C and Visual C++ on both UNIX (SUN or SGI) and NT/Windows platforms. The knowledge base occupies ~4 MB. The average standard deviation of the prediction is 4.56 ppm. The average prediction time for a small structure (<255 carbon atoms) is less than a second. The program has been tested on a number of structures selected from other data sources by a third-party chemist. Some of the test results are listed in Table 3 for comparison.
Generally speaking, KB-based NMR spectral prediction is influenced by the following factors:
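A per-structure comparison of the two predictors can be made with a mean absolute deviation. The sketch below uses the values of the structure in Table 3 that lists three carbons (labels 1, 3 and 5); the metric and the function name are illustrative choices, not the evaluation used by the author.

```python
def mean_absolute_deviation(observed, predicted):
    """Average |predicted - observed| over paired chemical shifts (ppm)."""
    return sum(abs(p - o) for o, p in zip(observed, predicted)) / len(observed)

# Three-carbon structure from Table 3 (labels 1, 3 and 5).
observed = [27.85, 82.36, 55.12]
acd = [27.25, 82.40, 55.34]
gacf = [31.03, 80.36, 55.95]
print(round(mean_absolute_deviation(observed, acd), 2))   # 0.29
print(round(mean_absolute_deviation(observed, gacf), 2))  # 2.0
```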
  • algorithms to correctly classify the center atoms and their chemical environment
  • the quality of the structures-spectra database
The atomic chemical environment classification includes: (1) aromatic and non-aromatic, (2) cyclic and acyclic, (3) hetero-ring and homo-ring, (4) ring size, (5) ring types, such as single, fused, bridged, and spiro, (6) saturated and unsaturated, (7) conjugated and non-conjugated, and (8) atom layers. These classifications require a set of structural perception algorithms, which have been developed in this work.
The quality of the structures-spectra database consists of three aspects: (1) the number of assignments for an atom center fragment; (2) the structural diversity of the database; and (3) the correctness of the database. In order to improve the prediction, we have developed tools to review the generalized atom center fragments, to analyze the atomic diversity of the structure database, and to correct wrong assignments. These will be discussed in a separate paper later.

References

  1. Pretsch, E.; Simon, W.; Seibl, J. Tables of Spectral Data for Structure Determination of Organic Compounds, 2nd ed.; Springer Verlag: Berlin Heidelberg, 1989.
  2. Clerc, J. T.; Sommerauer, H. Anal. Chim. Acta 1977, 95, 33.
  3. Small, G. W.; Jurs, P. C. Anal. Chem. 1984, 56, 1314.
  4. Schweitzer, R. C.; Small, G. W. J. Chem. Inf. Comput. Sci. 1996, 36, 310.
  5. Mitchell, B. E.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 1996, 36, 58.
  6. West, G. M. J. J. Chem. Inf. Comput. Sci. 1993, 33, 577.
  7. Kvasnicka, V.; Skelenak, S.; Pospichal, J. J. Chem. Inf. Comput. Sci. 1992, 32, 742.
  8. Anker, L. S.; Jurs, P. C. Anal. Chem. 1992, 64, 217.
  9. Bremser, W. Anal. Chim. Acta 1978, 103, 355.
  10. Robien, W. Mikrochim. Acta 1986, 2, 271.
  11. Chen, L.; Robien, W. Chemom. Intell. Lab. Syst. 1993, 19, 217.
  12. Dubois, J. E.; Carabedian, M.; Dagane, I. Anal. Chim. Acta 1984, 158, 217.
  13. Panaye, A.; Doucet, J.-P.; Fan, B. T. J. Chem. Inf. Comput. Sci. 1993, 33, 258.
  14. Munk, M. E.; Lind, R. J.; Clay, M. E. Anal. Chim. Acta 1986, 184, 1.
  15. Xu, J. J. Chem. Inf. Comput. Sci. 1996, 36, 25.
  • Sample Availability: not available.
Figure 1. The center atom (focus atom) and atom layers in HOSE code approach.
Figure 2. Chemical environment effective contribution to a central carbon atom in an aliphatic structure.
Figure 3. Chemical environment and its effect on central carbon atoms in aromatic system.
Figure 4. Atom layers in different systems. A: Layer 4 is too far for this center atom (in red color). B: Layer 4 is still not far enough for this center atom (in red color). The numbers represent atom layers.
Figure 5. Center atom, core structure and environmental substructures.
Figure 6. Choosing a ring class for an ambiguous center atom. A: the center atom is in bridged ring and fused ring systems; it is assigned to bridged ring system. B: the center atom is in fused ring, bridged ring and spiro ring systems; it is assigned to spiro ring system.
Figure 7. Example of a GACF.
Figure 8. The result shows that our 13C NMR database has well-diversified substructural information, which is good enough for general 13C NMR prediction. The cumulene system will not have very good 13C NMR prediction because of insufficient information in the database.
Figure 9. The GACF class distribution. The number over the percentage figure is the code of a GACF class (refer to Table 1.). The largest portion of this knowledge body is regarding fused non-aromatic carbon atom chemical shifts. 0% actually means <1% (due to the poor resolution of the graphic display).
Figure 10. Comparison of the sizes of 1-GACF knowledge body and 2-GACF knowledge body.
Figure 11. Distribution of 1-GACF fn.
Figure 12. Distribution of 2-GACF fn.
Figure 13. If R is defined as a 10-substitute group, the number of different types of carbon atoms is 220 (1-GACF substructures), for mono-substituted benzene, it is 40 due to symmetry, 900 for disubstituted benzene, etc. This Figure shows that the structural diversity space is a “combinatorial explosion” even if the scaffolds are simple and small.
Figure 14. C1 and C2 have very different chemical shifts (C1: 40 ppm, but C2: 24 ppm). In order to predict correctly, the substructure measurement should distinguish them. Topologically, if the pentagon is considered as core layer, C1 and C2 cannot be distinguished up to the second layer. But, if the fused hexagon-pentagon is measured as the core layer, C1 and C2 can be distinguished. The hexagon is encoded as remote fused ring in GACF substructure measurement scheme.
Figure 15. If 1-GACF and 2-GACF are not found in a knowledge base, the 1-Degenerated GACF may have more chance to match with a GACF in the knowledge base to get the closest estimation. “*” represents any atom; dashed bonds represent the bonds not in the Degenerated GACF.
Figure 16. A: structures used to estimate the carbon atom (colored in red) chemical shift of 1-Degenerated GACF in Figure 15. B: structures which cannot be used for this estimation.
Table 1. Core structure classification*

| Chemical Environment Classification | Example |
|---|---|
| Class 1: Independent non-aromatic single ring system | [structure image i001] |
| Class 2: Fused non-aromatic ring system | [structure image i002] |
| Class 3: Bridged ring system | [structure image i003] |
| Class 4: Spiro ring system | [structure image i004] |
| Class 5: Independent aromatic single ring system | [structure image i005] |
| Class 6: Fused aromatic ring system | [structure image i006] |
| Class 7: Conjugated system | [structure image i007] |
| Class 8: Cumulene system | [structure image i008] |
| Class 9: Acyclic system | [structure image i009] |

*Red: center atom. Blue: core structure. Black: chemical environment.
Table 2. The Format of GACF-Chemical Shift Correlation Table

| GACF | class | layer | shift | median | maximum | minimum | σ | fn |
|---|---|---|---|---|---|---|---|---|
| [structure image i010] | 9 | 1 | 116.87 | 117.80 | 122.81 | 108.1 | 3.40 | 263 |
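Table 3. Comparisons of observed chemical shift values, ACD/CNMR predictions and GACF predictions.

| Structure | Label | Observed | ACD a, b | GACF a |
|---|---|---|---|---|
| [structure image i011] | 1 | 127.24 | 120.12 | 126.32 |
| | 2 | 139.43 | 136.95 | 139.12 |
| | 3 | 142.26 | 134.19 | 143.34 |
| | 4 | 126.46 | 120.10 | 121.05 |
| | 5 | 135.82 | 136.11 | 130.06 |
| | 6 | 120.68 | 123.68 | 123.73 |
| | 11 | 111.25 | 111.20 | 111.47 |
| | 15 | 168.79 | 161.24 | 168.68 |
| | 16 | 23.41 | 24.24 | 23.26 |
| [structure image i012] | 1 | 127.11 | 129.03 | 124.61 |
| | 2 | 127.43 | 125.21 | 128.51 |
| | 3 | 134.97 | 158.47 | 126.83 |
| | 4 | 134.71 | 126.58 | 132.20 |
| | 5 | 115.21 | 122.91 | 128.85 |
| | 6 | 144.73 | 149.68 | 145.91 |
| [structure image i013] | 1 | 130.80 | 131.89 | 128.44 |
| | 2 | 150.90 | 155.29 | 151.13 |
| | 4 | 150.20 | 148.88 | 147.80 |
| | 5 | 144.70 | 145.22 | 144.70 |
| | 6 | 120.90 | 119.74 | 116.95 |
| | 7 | 24.20 | 23.45 | 33.90 |
| | 8 | 66.10 | 54.15 | 66.41 |
| | 10 | 163.40 | 161.10 | 165.10 |
| | 12 | 129.50 | 130.10 | 135.04 |
| | 13 | 120.30 | 119.20 | 116.00 |
| [structure image i014] | 1 | 130.36 | 129.45 | 128.55 |
| | 2 | 134.27 | 129.19 | 135.48 |
| | 3 | 129.36 | 134.10 | 129.82 |
| | 4 | 134.62 | 134.34 | 131.18 |
| | 5 | 138.37 | 152.63 | 143.34 |
| | 6 | 123.39 | 111.78 | 120.12 |
| | 7 | 19.36 | 20.24 | 19.08 |
| [structure image i015] | 1 | 13.10 | 12.04 | 12.70 |
| | 2 | 30.10 | 28.83 | 27.63 |
| | 3 | 34.60 | 40.63 | 39.95 |
| | 4 | 28.80 | 32.49 | 30.11 |
| | 5 | 24.80 | 30.82 | 25.28 |
| | 6 | 33.90 | 35.51 | 32.51 |
| | 7 | 32.90 | 33.56 | 32.78 |
| | 8 | 20.40 | 20.22 | 19.35 |
| | 10 | 19.80 | 18.54 | 15.85 |
| [structure image i016] | 2 | 203.48 | 200.93 | 202.47 |
| | 4 | 173.90 | 173.73 | 172.40 |
| | 5 | 36.92 | 35.89 | 35.47 |
| | 8 | 129.11 | 126.45 | 128.40 |
| | 9 | 129.11 | 128.03 | 128.40 |
| | 10 | 128.57 | 127.49 | 126.82 |
| | 11 | 135.47 | 133.01 | 136.23 |
| [structure image i017] | 1 | 143.40 | 141.61 | 143.98 |
| | 2 | 126.60 | 130.88 | 125.90 |
| | 3 | 111.40 | 114.76 | 127.94 |
| | 4 | 142.80 | 142.56 | 135.22 |
| | 5 | 114.40 | 113.75 | 117.42 |
| | 6 | 151.30 | 148.02 | 148.96 |
| [structure image i018] | 1 | 131.51 | 128.76 | 131.18 |
| | 2 | 138.66 | 133.45 | 138.06 |
| | 3 | 128.12 | 115.72 | 125.44 |
| | 4 | 139.00 | 151.67 | 144.40 |
| | 5 | 126.10 | 135.76 | 129.82 |
| | 6 | 129.02 | 134.69 | 130.31 |
| | 7 | 19.84 | 20.22 | 19.26 |
| [structure image i019] | 1 | 131.33 | 126.43 | 128.32 |
| | 2 | 125.28 | 121.68 | 128.87 |
| | 3 | 128.89 | 121.81 | 125.44 |
| | 4 | 138.40 | 138.03 | 133.41 |
| | 5 | 130.40 | 130.88 | 136.32 |
| | 6 | 121.48 | 120.44 | 120.12 |
| | 8 | 163.17 | 160.98 | 162.65 |
| | 9 | 43.70 | 42.14 | 43.80 |
| | 10 | 166.90 | 163.51 | 167.00 |
| | 14 | 28.09 | 19.04 | 25.49 |
| [structure image i020] | 1 | 129.80 | 123.86 | 127.52 |
| | 2 | 126.70 | 129.85 | 128.40 |
| | 3 | 134.5 | 127.83 | 127.43 |
| | 4 | 131.90 | 131.73 | 132.20 |
| | 5 | 127.30 | 132.60 | 126.33 |
| | 6 | 129.30 | 127.92 | 130.05 |
| | 7 | 127.50 | 128.10 | 135.47 |
| | 8 | 141.30 | 142.78 | 140.42 |
| | 9 | 128.90 | 126.40 | 124.28 |
| | 10 | 185.00 | 178.13 | 180.52 |
| | 11 | 131.60 | 128.38 | 127.58 |
| | 12 | 126.40 | 128.84 | 128.40 |
| | 13 | 131.60 | 131.10 | 127.52 |
| [structure image i021] | 2 | 100.50 | 123.15 | 107.14 |
| | 3 | 123.60 | 116.50 | 122.00 |
| | 4 | 116.50 | 108.07 | 110.23 |
| | 5 | 105.60 | 125.14 | 105.60 |
| | 7 | 112.90 | 112.90 | 112.89 |
| [structure image i022] | 1 | 27.85 | 27.25 | 31.03 |
| | 3 | 82.36 | 82.40 | 80.36 |
| | 5 | 55.12 | 55.34 | 55.95 |
| [structure image i023] | 1 | 81.10 | 79.51 | 80.17 |
| | 2 | 106.00 | 107.09 | 105.38 |
| | 4 | 75.70 | 72.93 | 76.86 |
| | 5 | 77.50 | 78.00 | 76.16 |
| | 6 | 15.80 | 13.00 | 18.81 |
| | 9 | 63.40 | 63.19 | 62.57 |

a Results come from the 13C NMR prediction product of Advanced Chemical Development (ACD), Inc.
b The results having absolute deviation larger than 5 ppm from the observation are in bold type.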

Xu, J. 13C NMR Spectral Prediction by Means of Generalized Atom Center Fragment Method. Molecules 1997, 2, 114-128. https://doi.org/10.3390/20800114
