generate the plot:

```
summaryplot(Major,Trace,QFL,HM,KDEs(DZ),ncol=2)
```
where Major, Trace, QFL and HM are shown as pie charts (the latter two with a different colour map than the former two), and DZ as KDEs. Passing DZ instead of KDEs(DZ) would plot the U-Pb age distributions as histograms.

2. The entire Namib dataset comprises 16,125 measurements spanning five dimensions' worth of compositional, distributional and point-counting information. This complex dataset, which may rightfully be described by the internet-era term 'Big Data', is extremely difficult to interpret by mere visual inspection of the pie charts and KDEs. Applying MDS/PCA to each of the five individual datasets helps, but presents the analyst with a multi-plot comparison problem. provenance implements two methods to address this issue [13]. The first of these is called 'Procrustes Analysis' [54]. Given a number of MDS configurations, this technique uses a combination of transformations (translation, rotation, scaling and reflection) to extract a 'consensus view' for all the data considered together:

```
proc <- procrustes(Major,Trace,QFL,HM,DZ)
plot(proc)
```
3. Alternatively, '3-way MDS' is an extension of 'ordinary' (2-way) MDS that accepts 3-dimensional dissimilarity matrices as input. provenance includes the most common implementation of this class of algorithms, which is known as 'INdividual Difference SCALing' or INDSCAL [55,56]:

```
scal <- indscal(Major,Trace,QFL,HM,DZ)
plot(scal)
```
This code produces two pieces of graphical output (Figure 6). The 'group configuration' represents the consensus view of all provenance proxies considered together. This looks very similar to the Procrustes configuration created by the previous code snippet. The second piece of graphical information displays not the samples but the provenance proxies. It shows the weights that each of the proxies attaches to the horizontal and vertical axes of the group configuration.

For example, the heavy mineral compositions of the Namib desert sands can be (approximately) described by stretching the group configuration vertically by a factor of 1.9, whilst shrinking it horizontally by a factor of 0.4. In contrast, the configurations of the major and trace element compositions for the same samples are obtained by shrinking the group configuration vertically by a factor of 0.8, and stretching it horizontally by a factor of 1.3. Combining these weights with the group configuration thus yields five 'private spaces' that aim to fit each of the individual datasets. In contrast with the 2-way MDS configurations of Section 7, INDSCAL group configurations are not rotation-invariant. This gives geological meaning to the horizontal and vertical axes of the plot. For example, samples N1 and N10 plot along a vertical line on the group configuration, indicating that they have different heavy mineral compositions but similar major and trace element compositions. On the other hand, samples N4 and N8 plot along a horizontal line, indicating that they have similar heavy mineral compositions but contrasting major and trace element compositions.

Closer inspection of the weights reveals that the datasets obtained from fractions of specific densities (HM, PT and DZ) attach stronger weights to the vertical axis, whereas those that are determined on bulk sediment (Major and Trace) dominate the horizontal direction. Provenance proxies that use bulk sediment are more sensitive to winnowing effects than those that are based on density separates. This leads to the interpretation that the horizontal axis separates samples that have been affected by different degrees of hydraulic sorting, whereas the vertical direction separates samples that have different provenance.
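The rescaling described above can be made explicit with a minimal base-R sketch. The axis weights are the values quoted in the text; the group configuration coordinates are made-up numbers for illustration only, not the actual Namib results:

```r
# Hypothetical 2-D group configuration for four samples (rows).
# These coordinates are invented for illustration:
X <- matrix(c(-1.0,  0.5,
               0.8, -0.3,
               0.2,  1.1,
              -0.4, -0.9),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("N1","N4","N8","N10"), c("x","y")))

# INDSCAL weights quoted in the text: heavy minerals shrink the
# horizontal axis and stretch the vertical one; major elements
# do the opposite.
w.HM    <- c(0.4, 1.9)
w.Major <- c(1.3, 0.8)

# A 'private space' is the group configuration with each axis
# multiplied by the corresponding weight:
private <- function(X, w) sweep(X, 2, w, "*")

HM.space    <- private(X, w.HM)
Major.space <- private(X, w.Major)
```

In the heavy mineral private space, vertically separated samples move further apart while horizontally separated ones move closer together, which is why vertical separation on the group configuration signals contrasting heavy mineral compositions.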

**Figure 6.** Output of the 3-way MDS analysis of Namib desert sand. Left: the group configuration shows the salient similarities and differences between samples as a 'map' in which similar samples plot close together and dissimilar samples plot far apart. Right: the weights for each of the five data sources show that provenance proxies that are performed on the bulk sediment (e.g., the major and trace element compositions) attach a stronger weight to the X-axis than to the Y-axis. In contrast, proxies that are determined on specific density fractions (e.g., zircons, heavy minerals, or quartz, feldspars and lithics) attach a stronger weight to the Y-axis. One geological interpretation of these dimensions is that samples that are horizontally separated from each other on the group configuration (e.g., N4 and N8) have experienced hydraulic sorting, whereas samples that are vertically separated (e.g., N1 and N10) have a different provenance.

#### **12. Summary, Conclusions and Outlook**

The statistical toolbox implemented by the provenance package is neither comprehensive nor at the cutting edge of exploratory data analysis. PCA, MDS, CA, and KDEs are tried and tested methods that have been around for many decades. Nothing new is presented here and that is intentional. This paper makes the point that even the most basic statistical parameters like the arithmetic mean and standard deviation cannot be blindly applied to geological data [24,57,58]. Great care must be taken when applying established techniques to sedimentary provenance data such as chemical compositions, point-counts or U-Pb age distributions. Given the difficulty of using even the simplest of methods correctly, geologists may want to think twice before exploring more complicated methods, or inventing entirely new ones.

The set of tutorials presented in this paper did not cover all aspects of statistical provenance analysis. Doing so would fill a book rather than a paper. Some additional topics for such a book could include (1) supervised and unsupervised learning algorithms such as cluster analysis and discriminant analysis, which can group samples into formal groups [10,11,59,60]; (2) the physical and chemical processes that affect the composition of sediment from 'source to sink' [5,61–63]; and (3) quality checks and corrections that must be made to ensure that the data reveal meaningful provenance trends rather than sampling effects [51,52,64–66].

The paper introduced three distinct classes of provenance data. Compositional, point-counting and distributional data each require different statistical treatment. Multi-sample collections of these data can be visualised by Multidimensional Scaling, using different dissimilarity measures (Table 1). Distributional data can be compared using the Kolmogorov–Smirnov statistic or related dissimilarity measures, and plugged straight into an MDS algorithm for further inspection. Compositional data such as chemical concentrations can be visualised by conventional 'normal' statistics after logratio transformation. The Euclidean distance in logratio space is called the Aitchison distance in compositional data space. Classical MDS using this distance is equivalent to Principal Component Analysis. Finally, point-counting data combine aspects of compositional data analysis with multinomial sampling statistics. The Chi-square distance is the natural way to quantify the dissimilarity between multiple point-counting samples. MDS analysis using the Chi-square distance is equivalent to Correspondence Analysis, which is akin to PCA for categorical data.
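The logratio workflow for compositional data summarised above can be illustrated with a minimal base-R sketch. The compositions are made-up numbers, and the clr and distance functions are written out here for illustration rather than taken from the provenance package:

```r
# Two hypothetical 3-component compositions (e.g., wt% oxides):
A <- c(SiO2 = 70, Al2O3 = 20, CaO = 10)
B <- c(SiO2 = 50, Al2O3 = 30, CaO = 20)

# Centred logratio (clr) transformation: the log of each component
# minus the log of the geometric mean of the composition.
clr <- function(x) log(x) - mean(log(x))

# The Aitchison distance is the Euclidean distance in clr space:
aitchison <- function(x, y) sqrt(sum((clr(x) - clr(y))^2))

d <- aitchison(A, B)
```

Because the clr transformation removes the effect of closure, rescaling either composition (e.g., from percentages to fractions of one) leaves the Aitchison distance unchanged, which is what makes it a sensible input for classical MDS (and hence PCA).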

However, there are some provenance proxies that do not easily fit into these three categories. *Varietal studies* using the chemical composition of single grains of heavy minerals combine aspects of compositional and distributional data [3,60]. Similarly, paired U-Pb ages and Hf-isotope compositions in zircon [1] do not easily fit inside the distributional data class described above. Using the tools provided by the provenance package, such data can be processed by Procrustes analysis or 3-way MDS (Section 11). Thus, the U-Pb and Hf distributions, say, could be entered into the indscal function as separate entities. However, by doing so, the single-grain link between the two datasets would be lost. Alternative approaches may be pursued to address this issue, and new dissimilarity measures could be developed for this hybrid data type. Novel approaches to matrix decomposition may be a way forward in this direction [8,67,68].

**Table 1.** A summary of the three types of provenance data introduced in this paper along with a suitable dissimilarity measure and its corresponding ordination technique.


**Supplementary Materials:** The following are available online at http://www.mdpi.com/2075-163X/9/3/193/s1, (a) Major element concentrations (Major.csv, compositional data). (b) Trace element concentrations (Trace.csv, compositional data). (c) Bulk petrography (PT.csv, point-counting data). (d) Heavy mineral compositions (HM.csv, point-counting data). (e) Detrital zircon U-Pb data (DZ.csv, distributional data). (f) ACNK.csv. (g) helper.R.

**Funding:** This research received no external funding.

**Acknowledgments:** This paper evolved from a set of lecture notes for an iCRAG workshop in sedimentary provenance analysis at NUI Galway. The author would like to thank Sergio Andò for inviting him to contribute to this Special Issue. The manuscript greatly benefited from four critical but constructive reviews. The ratio averaging example of Section 2 was first suggested by Noah McLean.

**Conflicts of Interest:** The author declares no conflict of interest.

#### **Appendix A. An Introduction to R**

R is an increasingly popular programming language for scientific data processing. It is similar in scope and purpose to Matlab, but is available free of charge on any operating system at http://r-project.org. A number of different graphical user interfaces (GUIs) are available for R, the most popular of which are RGui, RStudio, RCommander and Tinn-R. For this tutorial, however, the simple command line console suffices.

1. First, do some arithmetic:

```
> 1+1
> sqrt(2)
> exp(log(10))
> 13%%5
```
where the '>' symbol marks the command prompt.

2. You can use the arrow to assign a value to a variable. Note that the arrow can point both ways:

```
> foo <- 2
> 4 -> bar
> foo <- foo*bar
```
3. Create a sequence of values:

```
> myvec <- c(2,4,6,8)
> myvec*2
```
Query the third value of the vector:

```
> myvec[3]
```

Change the third value of the vector:

```
> myvec[3] <- 100
```

Change the second and the third value of the vector:

```
> myvec[c(2,3)] <- c(100,101)
```

Create a vector of 1, 2, 3, ..., 10:

```
> seq(from=1,to=10,by=1)
```

Equivalently:

```
> seq(1,10,1)
> seq(1,10)
> seq(to=10,by=1,from=1)
> seq(to=10)
> 1:10
```
Create a 10-element vector of twos:

```
> rep(2,10)
```

4. Create a 2 × 4 matrix of ones:

```
> mymat <- matrix(1,nrow=2,ncol=4)
```
Change the third value in the first row of mymat to 3:

```
> mymat[1,3] <- 3
```

Change the entire second column of mymat to 2:

```
> mymat[,2] <- 2
```

The transpose of mymat:

```
> t(mymat)
```

Element-wise multiplication (\*) vs. matrix multiplication (%\*%):

```
> mymat * mymat
> mymat %*% t(mymat)
```
5. Lists are used to store more complex data objects:

```
> mylist <- list(v=myvec, m=mymat, nine=9)
> mylist$v
```
6. Plot the first against the second row of mymat:

```
> plot(mymat[1,],mymat[2,],type="p")
```
Draw lines between the points shown on the existing plot:

```
> lines(mymat[1,],mymat[2,])
```

Create a new plot with red lines but no points:

```
> plot(mymat[1,],mymat[2,],type="l",col="red")
```
Use a 1:1 aspect ratio for the X- and Y-axis:

```
> plot(mymat[1,],mymat[2,],type="l",col="red",asp=1)
```

7. Save the currently active plot as a vector-editable .pdf file:

```
> dev.copy2pdf(file="trigonometry.pdf")
```
8. To learn more about a function, type 'help' or '?':

```
> help(c)
> ?plot
```
9. It is also possible to define one's own functions:

```
> cube <- function(n){
+ return(n^3)
+ }
```
Using the newly created function:

```
> cube(2)
> result <- cube(3)
```

10. Create some random (uniform) numbers:

```
> rand.num <- runif(100)
> hist(rand.num)
```
11. List all the variables in the current workspace:

```
> ls()
```

Remove all the variables in the current workspace:

```
> rm(list=ls())
```
To get and set the working directory:

```
> getwd()
> setwd("/path/to/a/valid/directory")
```
12. Collect the following commands in a file called 'myscript.R'. Note that this text does not contain any '>'-symbols because it is not entered at the command prompt but in a separate text editor:

```