*Article* **Exploratory Analysis of Provenance Data Using R and the Provenance Package**

#### **Pieter Vermeesch**

University College London, Gower Street, London WC1E 6BT, UK; p.vermeesch@ucl.ac.uk Received: 17 January 2019; Accepted: 15 March 2019; Published: 22 March 2019

**Abstract:** The provenance of siliclastic sediment may be traced using a wide variety of chemical, mineralogical and isotopic proxies. These define three distinct data types: (1) compositional data such as chemical concentrations; (2) point-counting data such as heavy mineral compositions; and (3) distributional data such as zircon U-Pb age spectra. Each of these three data types requires separate statistical treatment. Central to any such treatment is the ability to quantify the 'dissimilarity' between two samples. For compositional data, this is best done using a logratio distance. Point-counting data may be compared using the chi-square distance, which deals better with missing components (zero values) than the logratio distance does. Finally, distributional data can be compared using the Kolmogorov–Smirnov and related statistics. For small datasets using a single provenance proxy, data interpretation can sometimes be done by visual inspection of ternary diagrams or age spectra. However, this no longer works for larger and more complex datasets. This paper reviews a number of multivariate ordination techniques to aid the interpretation of such studies. Multidimensional Scaling (MDS) is a generally applicable method that displays the salient dissimilarities and differences between multiple samples as a configuration of points in which similar samples plot close together and dissimilar samples plot far apart. For compositional data, classical MDS analysis of logratio data is shown to be equivalent to Principal Component Analysis (PCA). The resulting MDS configurations can be augmented with compositional information as biplots. For point-counting data, classical MDS analysis of chi-square distances is shown to be equivalent to Correspondence Analysis (CA). This technique also produces biplots. Thus, MDS provides a common platform to visualise and interpret all types of provenance data. Generalising the method to three-way dissimilarity tables provides an opportunity to combine several datasets together and thereby facilitate the interpretation of 'Big Data'. This paper presents a set of tutorials using the statistical programming language R. It illustrates the theoretical underpinnings of compositional data analysis, PCA, MDS and other concepts using toy examples, before applying these methods to real datasets with the provenance package.

**Keywords:** sediment; provenance; statistics; zircon; heavy minerals; point counting; petrography

#### **1. Introduction**

At its most basic level, sedimentary provenance analysis identifies the mineralogical, chemical or isotopic composition of individual grains, or assemblages of multiple grains in siliclastic sediment. These properties can then be used to group samples of similar affinity, and thereby trace the flow of sediment through a sediment routing system, e.g., [1–5]. Different levels of statistical complexity arise when multiple samples are compared to each other, or when multiple provenance proxies are applied to multiple samples.

Using a number of short tutorials, this paper will introduce several simple but effective exploratory data analysis techniques that can help to make geological sense of 'Big Data' in a sedimentary provenance context. The term 'exploratory' means that these techniques allow the user to explore the data independent of any prior knowledge about the geological setting [6–9]. It groups a number

of graphical methods to visualise the data and reveal patterns of similarity and differences between samples and variables. This paper will not introduce methods such as discriminant analysis that formally assign samples to pre-defined provenance areas or petrotectonic settings [10,11].

These notes do by no means claim to give a comprehensive overview of exploratory data analysis. The selection of methods presented herein is heavily biased towards techniques that are implemented in a software package for (sedimentary) geology that was created by Vermeesch et al. [12].

provenance is available free of charge at the Comprehensive R Archive Network (CRAN, https: //cran.r-project.org/package=provenance), on GitHub (http://github.com/pvermees/provenance), or via http://provenance.london-geochron.com. The package is written in the statistical programming language R, which is available for Windows, Mac OS-X and Linux/Unix. The easiest way to install the latest stable version of the package is to first install R from http://r-project.org and then type the following code at the command prompt (i.e., the '>'):

> install.packages("provenance")

Once installed, the package can be loaded by typing:

> library(provenance)

There are two ways to use provenance. The first of these is through a query-based user interface. To access this interface, type:

> provenance()

The main advantage of the query-based user interface is that it does not require any knowledge of R. Its main disadvantage is the relative lack of flexibility and the difficulty to automate complex and/or repetitive tasks. The second way to use provenance is via the R language itself. This is the quicker and more flexible option, whose only downside is a steeper learning curve compared to the query-based interface. This tutorial will help the reader to climb this learning curve whilst explaining the theoretical underpinnings of the methods that are implemented in the package.

This text assumes that the reader has a basic understanding of the R programming language, although a short tutorial is provided in the Appendix A for readers who lack such prior knowledge. The paper also assumes that the reader has some basic statistical knowledge. More specifically, (s)he is expected to be familiar with the normal distribution, and understand the meaning of the arithmetic mean, standard deviation and confidence intervals. The normal distribution underpins much of 'conventional' statistics, but we will see that it rarely applies to provenance data. This, in fact, is the main take-home message of this paper.

There exist three fundamental types of provenance data:


3. Detrital age spectra form a third class of data that will be referred to as *distributional* data. Sections 5 and 7 introduce kernel density estimation, the Kolmogorov–Smirnov statistic, and multidimensional scaling as ways to visualise, compare, and interpret distributional data.

Finally, Section 11 will consider the case where multiple compositional, point-counting and/or distributional datasets are combined. Procrustes analysis and 3-way multidimensional scaling are statistical techniques that aim to extract geologically meaningful trends from such 'Big Data' [13].

#### **2. Ratio Data**

*Summary: This tutorial investigates the ratios of two sets of random numbers. It shows that the arithmetic mean and confidence intervals of these synthetic data yield nonsensical results. These problems are solved by a logarithmic transformation. This simple example has important implications because ratio data are common in sedimentary provenance analysis, and are closely related to compositional data, which are introduced in Section 3.*

Many statistical operations assume normality. This includes averaging, the construction of confidence intervals, regression, etc. Although Gaussian distributions are common, it would be unwise to assume normality for all datasets. This paper makes the point that, more often than not, the normality assumption is invalid in the context of sedimentary provenance analysis. Ignoring this non-normality can lead to counter-intuitive and plainly wrong results.

To illustrate this point, we will now consider the simple case of *ratio data*, which are quite common in the Earth Sciences. Take, for example, the ratio of apatite to tourmaline in heavy mineral analysis, which has been used to indicate the duration of transport and storage prior to deposition [14]. In this part of the tutorial, we will investigate the statistics of ratio data using a synthetic example.

1. Create two vectors *A* and *B*, each containing 100 random numbers between 0 and 1:

```
ns <- 100
A <- runif(ns)
B <- runif(ns)
```
Intuitively, given that *A*/*B* = 1/(*B*/*A*) and *B*/*A* = 1/(*A*/*B*), we would expect the same to be true for their means (*A*/*B*) and (*B*/*A*). However, when we define two new variables for the (inverse) of the (reciprocal) mean ratios:

AB.mean <- mean(A/B) inv.BA.mean <- 1/mean(B/A)

then we find that AB.mean=inv.BA.mean. So (*A*/*B*) = 1/(*B*/*A*) and (*B*/*A*) = 1/(*A*/*B*)! This is a counterintuitive and clearly wrong result.

2. Calculate the standard deviation of *A*/*B* and multiply this by two to obtain a '2-sigma' confidence interval for the data:

```
AB.sd <- sd(A/B)
LL <- AB.mean - 2*AB.sd
UL <- AB.mean + 2*AB.sd
```
then we find that *LL* < 0, which is nonsensical since *A* and *B* are both strictly positive numbers and their ratio is therefore not allowed to take negative values either. Herein lies the root of the problem. The sampling distribution of A/B is positively skewed, whereas the normal distribution is symmetric with tails ranging from −∞ to +∞. Geologists frequently encounter strictly positive numbers. *Time*, for example, is a strictly positive quantity, expressed by geochronologists as 'years before present', where 'present' is equivalent to zero.

3. The problems caused by applying normal theory to strictly positive data can often be solved by simply taking logarithms [15]. The transformed data are then free to take on any value, including negative values, and this often allows normal theory to be applied with no problems. For example, when we calculate the (geometric) mean after taking the logarithm of the ratio data:

```
logAB <- log(A/B)
logBA <- log(B/A)
AB.gmean <- exp(mean(logAB))
inv.BA.gmean <- 1/exp(mean(logBA))
```
then we find that AB.gmean = inv.BA.gmean, which is a far more sensible result. 4. Calculating the 2-sigma interval for the log-transformed data:

```
LL <- exp( mean(logAB) - 2*sd(logAB) )
UL <- exp( mean(logAB) + 2*sd(logAB) )
```
also produces strictly positive values, as expected.

#### **3. Compositional Data**

*Summary: Compositional data such as chemical concentrations suffer from the same problems as the ratio data of Section 2. The tutorial uses a geochemical dataset of Al2O3 – (CaO+Na2O) – K2O data to demonstrate that the 'conventional' arithmetic mean and confidence intervals are inappropriate for data that can be constrained to a constant sum. A logratio transformation solves these problems.*

Like the ratios of the previous Section, the chemical compositions of rocks and minerals are also expressed as strictly positive numbers. They, however, do not span the entire range of positive values, but are restricted to a narrow subset of that space, ranging from 0 to 1 (if fractions are used) or from 0 to 100% (using percentage notation). The compositions are further restricted by a constant sum constraint: *<sup>n</sup>*

$$\sum\_{i=1}^{n} \mathbf{C}\_{i} = 1$$

for an *n*-component system. Consider, for example, a three-component system {*x*, *y*, *z*}, where *x* + *y* + *z* = 1. Such compositions can be plotted on ternary diagrams, which are very popular in geology. Well-known examples are the Q-F-L diagram of sedimentary petrography [16], the A-CN-K diagram in weathering studies [17], and the A-F-M, Q-A-P and Q-P-F diagrams of igneous petrology [18]. The very fact that it is possible to plot a ternary diagram on a two-dimensional sheet of paper already tells us that it really displays only two and not three dimensions worth of information. Treating the ternary data space as a regular Euclidean space with Gaussian statistics leads to incorrect results, as illustrated by the following example.

1. Read a compositional dataset containing the Al2O3 – (CaO+Na2O) – K2O composition of a number of synthetic samples:

ACNK <- read.csv("ACNK.csv",row.names=1,header=TRUE,check.names=FALSE)

where row.names=1 indicates that the sample names are contained in the first column; and the header=TRUE and check.names=FALSE arguments indicate that the first column of the input table contains the column headers, one of which contains a special character ('+').

2. Calculate the arithmetic mean composition and 95% confidence limits for each column of the dataset:

```
mu <- colMeans(ACNK)
sig <- apply(ACNK,MARGIN=2,FUN="sd")
```
and construct the 2-sigma confidence confidence bounds:

LL <- mu - 2\*sig UL <- mu + 2\*sig

3. In order to plot the compositional data on a ternary diagram, we will need to first load the provenance package into memory:

library(provenance)

Now plot the Al2O3, (CaO + Na2O) and K2O compositions on a ternary diagram alongside the arithmetic mean composition:

plot(ternary(ACNK),pch=20,labels=NA) points(ternary(mu),pch=22,bg="blue")

where ternary(x) creates a ternary data 'object' from a variable x, and pch = 20 and pch = 22 produce filled circles and squares, respectively. Notice how the arithmetic mean plots outside the data cloud, and therefore fails to represent the compositional dataset (Figure 1).

4. Add a 2-sigma confidence polygon to this figure using the ternary.polygon() function that is provided in the auxiliary helper.R script (see Online Supplement):

source("helper.R") ternary.polygon(LL,UL,col="blue")

Note that the polygon partly plots outside the ternary diagram, into physically impossible negative data space. This nonsensical result is diagnostic of the dangers of applying 'normal' statistics to compositional data. It is similar to the negative limits for the ratio data in Section 2.

A comprehensive solution to the compositional data conundrum was only found in the 1980s, by Scottish statistician John Aitchison [19]. It is closely related to the solution of the ratio averaging problem discussed in the previous section. The trick is to map the n-dimensional composition to an (n-1)-dimensional Euclidean space by means of a logratio transformation. For example, in the ternary case, we can map the compositional variables *x*, *y* and *z* to two transformed variables *v* and *w*:

$$w = \ln\left(\frac{x}{z}\right), w = \ln\left(\frac{y}{z}\right) \tag{1}$$

After performing the statistical analysis of interest (e.g., calculating the mean or constructing a 95% confidence region) on the transformed data, the results can then be mapped back to compositional space with the inverse logratio transformation. For the ternary case:

$$\text{l.x} = \frac{e^v}{\text{e}^v + \text{e}^w + 1},\\ y = \frac{e^w}{\text{e}^v + \text{e}^w + 1},\\ z = \frac{1}{\text{e}^v + \text{e}^w + 1} \tag{2}$$

This transformation is implemented in the provenance package. Let us use this feature to revisit the K-CN-A dataset, and add the geometric mean and 95% confidence region to the ternary diagram for comparison with the arithmetic mean and confidence polygon obtained before.

5. Compute the geometric mean composition and add it to the existing ternary diagram as a red square:

```
mug <- exp(colMeans(log(ACNK)))
points(ternary(mug),pch=22,bg="red")
```
This red square falls right inside the data cloud, an altogether more satisfying result than the arithmetic mean shown in blue (Figure 1).

6. To add a compositional confidence contour, we must re-read ACNK.csv into memory using the read.compositional() function. This will tell the provenance package to treat the resulting variable as compositional data in subsequent operations:

```
ACNK2 <- read.compositional("ACNK.csv",check.names=FALSE)
```
Adding the 95% confidence contour using provenance's ternary.ellipse() function:

```
ternary.ellipse(ACNK,alpha=0.05)
```
creates a 95% confidence ellipse in logratio space, and maps this back to the ternary diagram. This results in a 'boomerang'-shaped contour that tightly hugs the compositional data whilst staying inside the boundaries of the ternary diagram (Figure 1).

**Figure 1.** Graphical output of Section 3. Black circles mark 20 synthetic Al2O3, (CaO + Na2O) and K2O compositions, drawn from a logistic normal distribution. The blue square marks the arithmetic mean, which falls outside the data cloud. The blue polygon marks a 2-*σ* confidence polygon, which plots outside the ternary diagram, in physically impossible negative space. The red square represents the geometric mean, which firmly plots inside the data cloud. The red confidence envelope marks a 95% confidence region calculated using Aitchison's logratio approach. This confidence envelope neatly fits inside the ternary diagram and tightly hugs the data.

This Section (and Section 8) only touched the bare essentials of compositional data analysis. Further information about this active field of research can be found in Pawlowsky-Glahn et al. [20]. For additional R-recipes for compositional data analysis using the compositions package, the reader is referred to Van den Boogaart and Tolosana-Delgado [21,22].

#### **4. Point-Counting Data**

*Summary: Point-counting data such as heavy mineral counts are underlain by compositional distributions. However, they are not amenable to the logratio transformations introduced in Section 3 because they commonly contain zero values. Averages and confidence intervals for this type of data require hybrid statistical models combining compositional and multinomial aspects.*

The mineralogical composition of silicilastic sediment can be determined by tallying the occurrence of various minerals in a representative sample of (200–400, say) grains [23,24]. Such *point-counting* data are closely related to the compositional data that were discussed in the previous section. However, there are some crucial differences between these two data classes [25].

Point-counting data are associated with significant (counting) uncertainties, which are ignored by classical compositional data analysis. As a consequence, point-counting data often contain zero values, which are incompatible with the log-ratio transformation defined in Equation (1). Although 'rounding zeros' also occur in compositional data, where they can be removed by 'imputation' methods [26,27], these methods are ill-suited for point-counting datasets in which zeros are the rule rather than the exception.

1. Download the auxiliary data file HM.csv from the Online Supplement. This file contains a heavy mineral dataset from the Namib Sand Sea [13]. It consists of 16 rows (one for each sample) and 15 columns (one for each mineral). Read these data into memory and tell provenance to treat it as point-counting data in all future operations:

HM <- read.counts("HM.csv")

Galbraith [28]'s *radial plot* is an effective way to visually assess the degree to which the random counting uncertainties account for the observed scatter of binary point-counting data. Applying this to the epidote/garnet-ratio of the heavy mineral data (Figure 2):

```
radialplot(HM,num="ep",den="gt")
```
Each circle on the resulting scatter plot represents a single sample in the HM dataset. Its epidote/garnet-ratio can be obtained by projecting the circle onto the circular scale. Thus, low and high ratios are found at negative and positive angles to the origin, respectively. The horizontal distance of each point from the origin is proportional to the total number of counts in each sample and, hence, to its precision. An (asymmetric) 95% confidence interval for the ep/gt-ratio of each sample can be obtained by projecting both ends of a 2-sigma confidence bar onto the circular scale.

Suppose that the data are underlain by a single true population and random counting uncertainties are the sole source of scatter. Let *θ* be the true but unknown proportion of the binary subpopulation that consists of the first mineral (epidote, say). Then (1 − *θ*) is the fraction of grains that belong to the second mineral (garnet). Further suppose that we have counted a representative sample of *N* grains from this population. Then the probability that this sample contains *n* grains of the first mineral and *m* = *N* − *n* grains of the second mineral follows a binomial distribution:

$$p(n) = \binom{n+m}{n} \theta^n (1-\theta)^m \tag{3}$$

If multiple samples in a dataset are indeed underlain by the same fraction *θ*, then approximately 95% of the samples should fit within a horizontal band of two standard errors drawn on either side of the origin. In this case, *θ* can be estimated by *pooling* all the counts together and computing the proportion of the first mineral as a fraction of the total number of grains counted [25].

However, the ep/gt-ratios in HM scatter significantly beyond the 2-sigma band (Figure 2i). The data are therefore said to be *overdispersed* with respect to the counting uncertainties. This indicates the presence of true geological dispersion in the compositions that underly the point-counting data. The dispersion can be estimated by a *random effects model* with two parameters:

$$\beta \equiv \ln \left( \frac{\theta}{1 - \theta} \right) \approx \mathcal{N}(\mu, \sigma^2) \tag{4}$$

where *β* is a new variable that follows a normal distribution with mean *μ* and standard deviation *σ*, both of which have geological significance.

The 'central ratio' is given by exp[*μ*ˆ] where *μ*ˆ is the maximum likelihood estimate for *μ*. This estimates the geometric mean (ep/gt-) ratio of the true underlying composition. The 'dispersion' (*σ*ˆ) estimates the geological scatter [25,29]. In the case of our heavy mineral dataset, the epidote-garnet subcomposition is 75% dispersed. This means that the coefficient of variation (geometric standard deviation divided by geometric mean) of the true epidote/garnet-ratios is approximately 0.75.

2. The continuous mixtures from the previous section can be generalised from two to three or more dimensions. The following code snippet uses it to construct a 95% confidence contour for the ternary subcomposition of garnet, epidote and zircon (Figure 2ii). Note that this dataset contains four zero values, which would have rendered the logratio approach of Figure 1 unusable.

```
tern <- ternary(HM,x="gt",y="ep",z="zr")
plot(tern,pch=1,labels=NA)
ternary.ellipse(tern,alpha=0.05)
```
3. For datasets comprising more than three variables, the central composition can be simply obtained as follows:

> central(HM)

This produces a matrix with the proportions of each component; its standard error; the dispersion of the binary subcomposition formed by the component and the amalgamation of all remaining components; and the outcome of a chi-square test for homogeneity.

**Figure 2.** (**i**) Radial plot of the epidote/garnet-ratios of 16 samples of Namibian desert sand. These data are overdispersed with respect to the point-counting uncertainties, indicating 74% of geological scatter in the underlying compositional data. (**ii**) Ternary diagram of garnet, epidote and zircon, with a 95% confidence envelope for the underlying population, using a ternary generalisation of the random effects model. Note that four of the samples contain zero zircon counts. However, this does not pose a problem for the random effects model, unlike the logratio-procedure used for Figure 1.

#### **5. Distributional Data**

*Summary: This Section investigates a 16-sample, 1547-grain dataset of detrital zircon U-Pb ages from Namibia. It uses Kernel Density Estimation and Cumulative Age Distributions to visualise this dataset, and introduces the Kolmogorov–Smirnov statistic as a means of quantifying the dissimilarity between samples.*

Compositional data such as the chemical concentrations of Sections 3 and 8 are characterised by the relative proportions of a number of *discrete categories*. A second class of provenance proxies is based on the sampling distribution of *continuous* variables such as zircon U-Pb ages [30,31]. These *distributional* data do not fit in the statistical framework of the (logistic) normal distribution.

1. Download auxiliary data file DZ.csv from the Online Supplement. This file contains a detrital zircon U-Pb dataset from Namibia. It consists of 16 columns—one for each sample—each containing the single grain U-Pb ages of their respective sample. Let us load this file into memory using provenance's read.distributional() function:

```
DZ <- read.distributional("DZ.csv")
```
DZ now contains an object of class distributional containing the zircon U-Pb ages of 16 Namibian sand samples. To view the names of these samples:

> names(DZ)

2. One way to visualise the U-Pb age distributions is as Kernel Density Estimates. A KDE is defined as:

$$KDE\_x(t) = \frac{1}{n} \sum\_{i=1}^{n} \mathcal{K}(t | x\_i, bw) \tag{5}$$

where K is the 'kernel' and *bw* is the 'bandwidth' [32,33]. The kernel can be any unimodal and symmetric shape (such as a box or a triangle), but is most often taken to be Gaussian (where *xi* is the mean and *bw* the standard deviation). The bandwidth can either be set manually, or selected automatically based on the number of data points and the distance between them. provenance implements the automatic bandwidth selection algorithm of Botev et al. [34] but a plethora of alternatives are available in the statistics literature. To plot all the samples as KDEs:

kdes <- KDEs(DZ) plot(kdes,ncol=2)

where ncol specifies the number of columns over which the KDEs are divided.

3. Alternatively, the Cumulative Age Distribution (CAD) is a second way to show the data [35]. A CAD is a step function that sets out the rank order of the dates against their numerical value:

$$\text{CAD}(t) = \sum\_{i=1}^{n} \mathbf{1}(t < t\_i) / n \tag{6}$$

where 1(∗) = 1 if ∗ is true and 1(∗) = 0 if ∗ is false. The main advantages of CADs over KDEs are that (i) they do not require any smoothing (i.e., there is no 'bandwidth' to choose), and (ii) they can superimpose multiple samples on the same plot. Plotting samples N1, N2 and N4 of the Namib dataset:

```
plot(DZ,snames=c("N1","N2","N4"))
```
we can see that (1) the CADs of samples N1 and N2 plot close together with steepest sections at 500 Ma and 1000 Ma, reflecting the prominence of those age components; (2) sample N4 is quite different from N1 and N2.

4. We can quantify this difference using the *Kolmogorov–Smirnov* (KS) statistic [36–38], which represents the maximum vertical difference between two CADs:

```
> N124 <- subset(DZ,select=c("N1","N2","N4"))
> diss(N124)
```
This shows that the KS-statistic between N1 and N2 is KS(N1,N2) = 0.18, whereas KS(N1,N4) = 0.44, and KS(N2,N4) = 0.35 (Figure 3). The KS statistic is a *non-negative* value that takes on values between zero (perfect overlap between two distributions) and one (no overlap between two distributions). It is *symmetric* because the KS statistic between any sample *x* and another sample *y* equals that between *y* and *x*. For example, KS(N1,N2) = 0.18 = KS(N2,N1). Finally, the KS-statistic obeys the *triangle equality*, which means that the dissimilarity between any two samples is always smaller than or equal to the sum of the dissimilarities between those two samples and a third. For example, KS(N1,N2) = 0.18 < KS(N1,N4) + KS(N2,N4) = 0.44 + 0.35 = 0.79. These three characteristics qualify the KS statistics as a *metric*, which makes it particularly suitable for Multidimensional Scaling (MDS) analysis (see Section 7). The KS statistic is just one of many dissimilarity measures for distributional data. However, not all these alternatives to the KS statistic fulfil the triangle inequality [38].

**Figure 3.** Cumulative Age Distributions (CADs) of Namib desert sand samples N1, N2 and N4 with indication of the Kolmogorov–Smirnov distances between them.

#### **6. Principal Component Analysis (PCA)**

*Summary: Principal Component Analysis is an exploratory data analysis method that takes a high dimensional dataset as input and produces a lower (typically two-) dimensional 'projection' as output. PCA is closely related to Multidimensional Scaling (MDS), compositional PCA, and Correspondence Analysis (CA), which are introduced in Sections 7–9. This tutorial introduces PCA using the simplest working example of three two-dimensional points. Nearly identical examples will be used in Sections 7–9.*

1. Consider the following bivariate (*a* and *b*) dataset of three (1, 2 and 3) samples:

$$X = \begin{bmatrix} a & b \\ -1 & 7 \\ 3 & 2 \\ 4 & 3 \end{bmatrix} \tag{7}$$

Generating and plotting *X* in R:

```
X <- matrix(c(-1,3,4,7,2,3),nrow=3,ncol=2)
colnames(X) <- c("a","b")
plot(X)
```
yields a diagram in which two of the three data points plot close together while the third one plots further away.

2. Imagine that you live in a one-dimensional world and cannot see the spatial distribution of the three points represented by *X*. Principal Component Analysis (PCA) is a statistical technique invented by Pearson [39] to represent multi- (e.g., two-) dimensional data in a lower- (e.g., one-) dimensional space whilst preserving the maximum amount of information (i.e., variance). This can be achieved by decomposing *X* into four matrices (*C*, *S*, *V* and *D*):

$$\begin{aligned} X &= 1\_{3,1} \ \mathbb{C} + S \ V \ D\\ &= \begin{bmatrix} 1\\1\\1 \end{bmatrix} \begin{bmatrix} 2 & 4 \ \end{bmatrix} + \begin{bmatrix} -1.15 & 0\\0.58 & -1\\0.58 & 1 \end{bmatrix} \begin{bmatrix} 3.67 & 0\\0 & 0.71 \end{bmatrix} \begin{bmatrix} 0.71 & -0.71\\0.71 & 0.71 \end{bmatrix} \end{aligned} \tag{8}$$

where *C* is the centre (arithmetic mean) of the two data columns; *S* are the *normalised scores*; the diagonals of *V* correspond to the standard deviations of the two principal components; and *D* is a rotation matrix (the *principal directions*). *S*, *V* and *D* can be recombined to define two more matrices:

$$P = S\,V = \begin{bmatrix} -4.24 & 0\\ 2.12 & -0.71\\ 2.12 & 0.71 \end{bmatrix},\tag{9}$$

$$\text{and } L = V \, D = \begin{bmatrix} 2.6 & -2.6 \\ 0.5 & 0.5 \end{bmatrix} \tag{10}$$

where *P* is a matrix of transformed coordinates (the *principal components* or *scores*) and *L* contains the scaled eigenvectors or *loadings*. Figure 4i shows *X* as numbers on a scatterplot, *C* as a yellow square, and 12,1*C* ± *L* as a cross. Thus, the first principal direction (running from the upper left to the lower right) has been stretched by a factor of (3.67/0.71) = 5.2 with respect to the second principal component, which runs perpendicular to it. The two principal components are shown separately as Figure 4ii, and their relative contribution to the total variance of the data as Figure 4iv. Figure 4 can be reproduced with the following R-code:

```
source("helper.R")
PCA2D(X)
```
3. Although the two-dimensional example is useful for illustrative purposes, the true value of PCA obviously lies in higher dimensional situations. As a second example, let us consider one of R's built-in datasets. USArrests contains statistics (in arrests per 100,000 residents) for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percentage of the population living in urban areas. Thus, USArrests is a four-column table that cannot readily be visualised on a two-dimensional surface. Applying PCA yields four principal components, the first two of which represent 62% and 25% of the total variance, respectively. Because the four columns of the input data are expressed in different units (arrests per 100,000 or percentage), it is necessary to scale the data to have unit variance before the analysis takes place:

```
pc <- prcomp(USArrests, scale=TRUE)
biplot(pc)
```
The resulting biplot shows that the loading vectors for Murder, Assault and Rape are all pointing in approximately the same direction (dominating the first principal component), perpendicular to UrbanPop (which dominates the second principal component). This tells us that crime and degree of urbanisation are not correlated in the United States.

**Figure 4.** (**i**)—Three samples (1, 2 and 3) of bivariate (*a* and *b*) data (*X* in Equation (7)). The yellow square marks the arithmetic mean (*C* in Equation (8)), the cross marks the two principal directions (*D* in Equation (8)) stretched by the diagonal elements (i.e., the standard deviations) of *V* (Equation (8)); (**ii**)—The projection of the data points on these two directions yields two principal components (*P* in Equation (9)), representing a one-dimensional representation of the two-dimensional data; (**iii**)—A biplot of both principal components along with the loadings of the two variables shown as arrows; (**iv**)—The squared diagonal values of *V* (Equation (8)) indicate the relative amounts of variance encoded by the two principal components.

#### **7. Multidimensional Scaling**

*Summary: Multidimensional Scaling (MDS) is a less restrictive superset of PCA. This tutorial uses a geographical example to demonstrate how MDS re-creates a map of Europe from a table of pairwise distances between European cities. Applying the same algorithm to the synthetic toy-example of Section 6 yields exactly the same output as PCA.*

1. Multidimensional Scaling (MDS [40–43]) is a dimension-reducing technique that aims to extract two- (or higher) dimensional 'maps' from tables of pairwise distances between objects. This method is most easily illustrated with a geographical example. Consider, for example, the eurodist dataset that is built into R, and which gives the road distances (in km) between 21 cities in Europe (see ?eurodist for further details):

2. The MDS configuration can be obtained by R's built-in cmdscale() function

<sup>&</sup>gt; eurodist

```
conf <- cmdscale(eurodist)
```
Set up an empty plot with a 1:1 aspect ratio, and then label the MDS configuration with the city names:

```
plot(conf,type="n",asp=1)
text(conf,labels=labels(eurodist))
```
Note that the map may be turned 'upside down'. This reflects the rotation invariance of MDS configurations.

3. R's cmdscale() function implements so-called 'classical' MDS, which aims to fit the actual distances [42,43]. If these distances are Euclidean, then it can be shown that MDS is equivalent to PCA [44–46]. To demonstrate this equivalence, let us apply MDS to the data in Equation (7). First, run the first two lines of code from part 1 in Section 6. Calculating the Euclidean distances between the three samples produces a dissimilarity matrix *d*. For example, the distance between points 1 and 2 is (−<sup>1</sup> − <sup>3</sup>)<sup>2</sup> + (<sup>7</sup> − <sup>2</sup>)<sup>2</sup> = 6.4. This value is stored in d[1,2]. In <sup>R</sup>:

d <- dist(X)

which produces:

$$d = \begin{array}{cccc} 1 & 2 & 3 \\ 1 & 0 & 6.4 & 6.4 \\ 2 & 6.4 & 0 & 1.4 \\ 3 & 6.4 & 1.4 & 0 \end{array} \tag{11}$$

4. Next, calculate the MDS configuration:

conf2 <- cmdscale(d)

Finally, plot the MDS configuration as a scatterplot of text labels:

```
plot(conf2,type="n")
text(conf2,labels=1:3)
```
which is identical to the PCA configuration of Figure 4iii apart from an arbitrary rotation or reflection.

5. An alternative implementation of MDS loosens the Euclidean distance assumption by fitting the *relative* distances between objects [40,41]. Let us apply this to the dataset of European city distances using the isoMDS function of the 'Modern Applied Statistics with S' (MASS) package [47]:

library(MASS)

To compute and plot the non-metric MDS configuration:

```
conf3 <- isoMDS(eurodist)$points
plot(conf3,type="n",asp=1)
text(conf3,labels=labels(eurodist))
```
where conf3 is a list with two items: stress, which expresses the goodness-of-fit of the MDS configuration; and points, which contains the configuration. The '\$' operator is used to access any of these items.

Non-metric MDS is a less-restrictive superset of classical MDS and, hence, PCA, which opens this methodology up to non-Euclidean dissimilarity measures, such as the KS-distance introduced in Section 5 [48].

#### **8. PCA of Compositional Data**

*Summary: PCA can be applied to compositional data after logratio transformation. This tutorial first applies such compositional PCA to a three sample, three variable dataset that is mathematically equivalent to the three sample two variable toy example of Section 6. Then, it applies the same method to a real dataset of major element compositions from Namibia. This is first done using basic* R *and then again (and more succinctly) using the* provenance *package.*

Consider the following trivariate (*a*, *b* and *c*) dataset of three (1, 2 and 3) compositions:

$$\begin{array}{cccc} a & b & c\\ \hline \end{array} \begin{bmatrix} a & b & c\\ 0.03 & 99.88 & 0.09\\ 70.54 & 25.95 & 3.51\\ 72.14 & 26.54 & 1.32 \end{bmatrix} \tag{12}$$

It would be wrong to apply conventional PCA to this dataset, because this would ignore the constant sum constraint. As was discussed in Section 6, PCA begins by 'centering' the data via the arithmetic mean. Section 3 showed that this yields incorrect results for compositional data. Subjecting the data to a logratio transformation produces:

$$X\_{\mathbf{a}} = \begin{array}{c} \ln(a/c) & \ln(b/c) \\ 1 \\ 3 & 2 \\ 4 & 3 \end{array} \tag{13}$$

which, the observant reader will note, is identical to the example of Equation (7) (Figure 5i). Applying conventional PCA to the log-transformed data of Equation (13) will yield two principal components that are expressed in terms of the logratios ln(*a*/*c*) and ln(*b*/*c*) (Figure 5ii–iii).

Alternatively, the data of Equation (12) can also be subjected to a different type of logratio transformation [44]. The so-called *centred* logratio transformation (as opposed to the *additive* logratio transformation of Equation (1)) maps any set of (for example, two) compositional data vectors *x* = {*x*1, ... , *xi*, ... , *xn*}, *y* = {*y*1, ... , *yi*, ... , *yn*} to the same number of (centred) logratios *u* = {*u*1,..., *ui*,..., *un*}, *v* = {*v*1,..., *vi*,..., *vn*}, where:

$$u\_i = \ln(\mathbf{x}\_i) - \left[\ln(\mathbf{x}\_i) + \ln(y\_i)\right]/2 \text{ and } v\_i = \ln(y\_i) - \left[\ln(\mathbf{x}\_i) + \ln(y\_i)\right]/2\tag{14}$$

Applying this transformation to the data of Equation (12) yields a new trivariate dataset:

$$X\_{\mathbb{C}} = \begin{bmatrix} \ln(a/g) & \ln(b/g) & \ln(c/g) \\ 1 & -3 & 5 & -2 \\ 1.33 & 0.33 & -1.67 \\ 1.67 & 0.67 & -2.33 \end{bmatrix} \tag{15}$$

where *g* stands for the geometric mean of each row. Subjecting Equation (15) to the same matrix decomposition as Equation (8) yields:

$$\begin{aligned} X\_t &= 1\_{3,1}\mathcal{C} + \mathcal{S} \; V \; D = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 0 & 2 & -2 \end{bmatrix} + \\ &\quad \begin{bmatrix} -1.15 & 0 & 0.82 \\ 0.58 & -1 & 0.82 \\ 0.58 & 1 & 0.82 \end{bmatrix} \begin{bmatrix} 3.67 & 0 & 0 \\ 0 & 0.41 & 0 \\ 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 0.71 & -0.71 & 0 \\ 0.41 & 0.41 & -0.82 \\ 0.58 & 0.58 & 0.58 \end{bmatrix} \end{aligned} \tag{16}$$

Note that, even though this yields three principal components instead of two, the standard deviation of the third component is zero. Therefore, all the information is contained in the first two components. The PCA map using the centred logratio transformation looks identical to that using the additive logratio transformation. The only difference is that the loadings are expressed in terms of the three centred logratio variables, rather than the two additive logratio variables (Figure 5iv). The centred logratios are easier to interpret than the additive logratios, which is why the centred logratio transformation is preferred in this context.

1. The following script applies compositional PCA to a dataset of major element compositions from Namibia (see Online Supplement) using base R:

```