Analysis of Scaffold DIA experiment with edgeR

Gas-phase fractionation experimental library approach

Deep learning neural network prediction method improves proteome profiling of vascular sap of grapevines during Pierce’s disease development

Notebook analysis performed by Phil Wilmarth, OHSU PSR Core, August 2020

Overview and objectives

This is a study of grapevine vascular sap samples with and without Pierce's disease (a bacterial infection) using data independent acquisition (DIA). Two new approaches for identifying the peptides to be quantified were compared: using a gas phase fractionation approach to generate an experimental spectrum library for feature identification versus a theoretical library generated from the sequenced genomes using Prosit (one of several deep learning prediction tools).

The data were acquired on a Thermo Lumos Tribrid instrument with high-resolution parent and fragment ions. Both identification approaches were used with Scaffold DIA software. The goal is to explore the differences in the data generated with the gas phase libraries versus the theoretical Prosit libraries. The data in this notebook is from the experimental gas phase library approach.

The healthy and diseased samples were done at three time points (9 week, 12 week, and 15 week) in triplicate. The "W" samples are healthy and the "Y" samples are infected. One healthy sample at 12 weeks was lost during processing. Cluster analysis indicated that one of the 9 week samples (W3) was similar to the two 12 week samples (W5 and W6). This notebook uses W3, W5, and W6 in the "W" group to compare to the Y4, Y5, and Y6 samples.

More biological and experimental details can be found in the preprint.

Differential expression (DE) testing

We will use the Bioconductor R package edgeR for the statistical testing. This widely used genomics tool has moderated test statistics and a robust trimmed mean of M-values normalization method.

Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.

Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.


Load R libraries

In [17]:
# load libraries
library("tidyverse")
library("psych")
library("gridExtra")
library("scales")
library("limma") 
library("edgeR")

Define some common functions for the notebook

In [18]:
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
    # computes the tmm normalized data from the DGEList object
        # y - DGEList object
        # returns a dataframe with normalized intensities
    
    # compute and print "Sample loading" normalization factors
    lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
    cat("\nLibrary size factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
    
    # compute and print TMM normalization factors
    tmm_facs <- 1/y$samples$norm.factors
    cat("\nTrimmed mean of M-values (TMM) factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
    
    # compute and print the final correction factors
    norm_facs <- lib_facs * tmm_facs
    cat("\nCombined (lib size and TMM) normalization factors:\n", 
        sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))

    # compute the normalized data as a new data frame
    tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
    colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
    
    # visualize results and return data frame
    if(plot == TRUE) {
        boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
    }
    tmt_tmm
}

# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
    # Computes new columns and extracts some columns to make results frame
        # df - data in data.frame
        # tt - top tags table from edgeR test
        # x - columns for first condition
        # xlab - label for x
        # y - columns for second condition
        # ylab - label for y
        # returns a new dataframe
    
    # condition average vectors
    ave_x <- rowMeans(df[x])
    ave_y <- rowMeans(df[y])
    
    # FC, direction, candidates
    fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
    direction <- ifelse(ave_y > ave_x, "up", "down")
    candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0), 
                     labels = c("high", "med", "low", "no"))
    
    # make data frame
    temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc, 
                                          PValue = tt$PValue, FDR = tt$FDR, 
                                          ave_x = ave_x, ave_y = ave_y, 
                                          direction = direction, candidate = candidate, 
                                          Acc = tt$genes)) 
    
    # fix column headers for averages
    names(temp)[names(temp) %in% c("ave_x", "ave_y")]  <- str_c("ave_", c(xlab, ylab))    
    
    temp # return the data frame
}

# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
    # Makes faceted log2FC plots by candidate
        # results - results data frame
        # range - plus/minus log2 x-axis limits
        # title - plot title
    ggplot(results, aes(x = logFC, fill = candidate)) +
        geom_histogram(binwidth=0.1, color = "black") +
        facet_wrap(~candidate) +
        ggtitle(title) + 
        coord_cartesian(xlim = c(-range, range))
}

# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
    # Make data frame with some transformed columns
        # results - results data frame
        # x - columns for x condition
        # y - columns for y condition
        # return new data frame
    df <- data.frame(log10((results[x] + results[y])/2), 
                     log2(results[y] / results[x]), 
                     results$candidate,
                     -log10(results$FDR))
    colnames(df) <- c("A", "M", "candidate", "P")
    
    df # return the data frame
}

# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
    # makes MA-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots 
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # 2-fold change lines
    ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
                     geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
                     geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))

    # make main MA plot
    ma <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("logFC (", y, "/", x, ")")) +
        scale_x_continuous("Ave_intensity") +
        ggtitle(title) + 
        ma_lines
    
    # make separate MA plots
    ma_facet <- ggplot(temp, aes(x = A, y = M)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
        scale_x_continuous("log10 Ave_intensity") +
        ma_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)"))

    # make the plots visible
    print(ma)
    print(ma_facet)
}    

# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
    # makes scatter-plot DE candidate ggplots
        # results - data frame with edgeR results and some condition average columns
        # x - string for x-axis column
        # y - string for y-axis column
        # title - title string to use in plots
        # returns a list of plots
    
    # 2-fold change lines
    scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
                          geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
                          scale_y_log10(),
                          scale_x_log10())

    # make main scatter plot
    scatter <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        ggtitle(title) + 
        scatter_lines

    # make separate scatter plots
    scatter_facet <- ggplot(results, aes_string(x, y)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        scatter_lines +
        facet_wrap(~ candidate) +
        ggtitle(str_c(title, " (separated)")) 

    # make the plots visible
    print(scatter)
    print(scatter_facet)
}

# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
    # makes a volcano plot
        # results - a data frame with edgeR results
        # x - string for the x-axis column
        # y - string for y-axis column
        # title - plot title string
    
    # uses transformed data
    temp <- transform(results, x, y)
    
    # build the plot
    ggplot(temp, aes(x = M, y = P)) +
        geom_point(aes(color = candidate, shape = candidate)) +
        xlab("log2 FC") +
        ylab("-log10 FDR") +
        ggtitle(str_c(title, " Volcano Plot"))
}

# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accesssion
get_identifier <- function(accession) {
#    identifier <- str_split(accession, "\\|", simplify = TRUE)
#    identifier[,3]
    identifier <- accession
}

set_plot_dimensions <- function(width_choice, height_choice) {
    options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}

plot_top_tags <- function(results, nleft, nright, top_tags) {
    # results should have data first, then test results (two condition summary table)
    # nleft, nright are number of data points in each condition
    # top_tags is number of up and number of down top DE candidates to plot
    # get top ipregulated
    up <- results %>% 
        filter(logFC >= 0) %>%
        arrange(FDR)
    up <- up[1:top_tags, ]
    
    # get top down regulated
    down <- results %>% 
        filter(logFC < 0) %>%
        arrange(FDR)
    down <- down[1:top_tags, ]
    
    # pack them
    proteins <- rbind(up, down)
        
    color = c(rep("red", nleft), rep("blue", nright))
    for (row_num in 1:nrow(proteins)) {
        row <- proteins[row_num, ]
        vec <- as.vector(unlist(row[1:(nleft + nright)]))
        names(vec) <- colnames(row[1:(nleft + nright)])
        title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2), 
                       ", FDR: ", scientific(row$FDR, digits = 3), 
                       ", FC: ", round(row$FC, digits = 1),
                       ", ", row$candidate)
        barplot(vec, col = color, main = title,
                cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
    }    
}

# ============== p-value distribution =========================================
pvalue_plot <- function(results, title) {
    # Makes p-value distribution plots
        # results - results data frame
        # title - plot title
    ggplot(results, aes(PValue)) + 
        geom_histogram(bins = 100, fill = "white", color = "black") +
        geom_hline(yintercept = mean(hist(results$PValue, breaks = 100, 
                                     plot = FALSE)$counts[26:100]), na.rm = TRUE) +
        ggtitle(str_c(title, " p-value distribution"))
}

Read in the prepped data

The original data contained a small proportion of missing values. For each biological sample, the smallest non-missing values were found. Based on the median value of the smallest values (about 50,000), missing values were replaced by a value of 10,000.

In [19]:
# read the protein-level quantitative values
dia_start <- read_tsv("PW-filtered_Scaffold-DIA.txt")
Parsed with column specification:
cols(
  Accession = col_character(),
  W1 = col_double(),
  W2 = col_double(),
  W3 = col_double(),
  W6 = col_double(),
  W5 = col_double(),
  W8 = col_double(),
  W7 = col_double(),
  Y1 = col_double(),
  Y3 = col_double(),
  Y2 = col_double(),
  Y5 = col_double(),
  Y4 = col_double(),
  Y6 = col_double()
)

In [20]:
# extract protein accession column and the actual data
# separate accessions from the data
accessions <- dia_start$Accession
dia_data <- dia_start %>% select(-Accession)

head(dia_data)
length(accessions)
A tibble: 6 × 13
W1W2W3W6W5W8W7Y1Y3Y2Y5Y4Y6
<dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl><dbl>
4466835922089296135128613846309573441096478196549540874426579519194984460229086765288403150257039578263026799169824365
1174897555128824955467735141467735141 338844156158489319117489755144543977154881662 75857758 23442288 23442288 3548134
223872114467735141109647820177827941 89125094 64565423102329299316227766 81283052251188643 93325430 89125094186208714
7244360 38018940 17782794 15135612 16218101 12882496 9332543 15135612 14454398 26915348 87096359 64565423 93325430
6025596562341325 6760830 14454398 20417379 74131024 27542287123026877102329299120226443630957344416869383851138038
93325430 14125375 63095734 58884366 50118723 14791084 10000000 35481339 21877616 23988329 10471285 7762471 1819701
145

Load the data into an DGEList edgeR object

In [21]:
# define indices for conditions of interest
W_start <- c(3, 5, 4) # W3, W5, W6 - healthy group
Y_start <- c(12, 11, 13) # Y4, Y5, Y6 - infected group

# load data into DGEList object
group <- c(rep("W", 3), rep("Y", 3))
y <- DGEList(counts = dia_data[c(W_start, Y_start)], group = group, genes = accessions)
y$samples
A data.frame: 6 × 3
grouplib.sizenorm.factors
<fct><dbl><dbl>
W3W256843416341
W5W475485254611
W6W311846140251
Y4Y444169293091
Y5Y834978137511
Y6Y346368310281

Normalize the TMT data

EdgeR normalization is actually done in two steps. The first, called a library size adjustment, is like a sample loading normalization. This gets rid of the big differences between samples so that the TMM algorithm has better starting data. We will need to compute the normalized intensities from the TMM factors (edgeR internally uses the factors).

In [22]:
# run the TMM normalization
y <- calcNormFactors(y)

# set colors for plotting
colors <- c(rep("red", 3), rep("blue", 3))

# redefine the indices for the subsetted data
W <- 1:3
Y <- 4:6

# get the normalized data values
dia_tmm <- apply_tmm_factors(y, colors)

# check the clustering
plotMDS(y, col = colors, main = "Samples after TMM")
Library size factors:
 W3    -> 1.732372
 W5    -> 0.935778
 W6    -> 1.426820
 Y4    -> 1.001754
 Y5    -> 0.532886
 Y6    -> 1.284611

Trimmed mean of M-values (TMM) factors:
 W3    -> 0.621227
 W5    -> 1.219252
 W6    -> 0.554390
 Y4    -> 1.108440
 Y5    -> 1.911340
 Y6    -> 1.124064

Combined (lib size and TMM) normalization factors:
 W3    -> 1.076196
 W5    -> 1.140948
 W6    -> 0.791015
 Y4    -> 1.110384
 Y5    -> 1.018527
 Y6    -> 1.443985

Compute dispersion for edgeR modeling

edgeR uses trended dispersion to moderate the testing statistics to make the modeling more robust for studies with small replicate numbers.

In [23]:
# we need to get dispersion estimates
y <- estimateDisp(y)
plotBCV(y, main = "Dispersion trends")
Design matrix not provided. Switch to the classic mode.

edgeR exact test

We will use the exact test in edgeR for this simple two-state comparison. We will also simplify/reformat the test results and save them in a data frame.

In [24]:
# compute the exact test models, p-values, FC, etc.
et <- exactTest(y, pair = c("W", "Y"))

# check some top tags
topTags(et)$table

# this counts up, down, and unchanged genes (proteins) at 5% FDR
summary(decideTestsDGE(et, p.value = 0.05))

# make the results table 
tt <- topTags(et, n = Inf, sort.by = "none")$table
exact <- collect_results(dia_tmm, tt, W, "W", Y, "Y")

# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black")
A data.frame: 10 × 5
geneslogFClogCPMPValueFDR
<chr><dbl><dbl><dbl><dbl>
34VIT_08s0007g06060.t01 8.92441214.3828351.149482e-189.729546e-17
7VIT_12s0059g02420.t01-4.95462611.0788451.342006e-189.729546e-17
56VIT_02s0025g04330.t01 5.28078814.2388982.197760e-171.062251e-15
48VIT_03s0091g00160.t01 6.73631416.3112028.289114e-173.004804e-15
5VIT_04s0008g00120.t01 5.79645413.1199191.065996e-143.091387e-13
9VIT_03s0088g00810.t01 6.851833 9.7544443.878623e-139.373338e-12
10VIT_14s0081g00030.t01 4.95827010.1340781.954754e-124.049133e-11
12VIT_06s0061g00120.t01 4.79474313.9668265.958931e-121.080056e-10
113VIT_04s0008g00140.t01 4.45947110.3682452.397198e-113.862152e-10
104VIT_07s0005g06000.t01 4.69105110.3285744.649393e-116.741619e-10
       Y-W
Down    47
NotSig  50
Up      48

Check if testing looks okay

It is important to see if the modeling looks reasonable. Our general assumptions are that we have a large fraction of the proteins that are not differentially expressed. Those will have a uniform (flat) p-value distribution from 0.0 to 1.0. We also expect (hopefully) some true differential expression candidates. Those should have very small p-values and have a sharper distribution at low p-values.

In [25]:
# check the p-value distrubution
pvalue_plot(exact, "W vs Y, Scaffold")

We have a rather sparse distribution

Despite the sparse data, we still observe the two distributions of p-values, so the testing seems reasonable.

We can categorize candidates by ranges of adjusted p-values

We can define three cuts on the FDR: 10% to 5% are "low" significance, 5% to 1% are medium significance, and less than 1% are more "highly" significant. Cut values can be adjusted depending on the experimental situation. We can look at expression ratio distributions as a function of candidate category. If variance is not too variable protein-to-protein, then we would expect larger mean differences to be associated with lower FDR values. Faceted plotting in ggplot2 is another way to see patterns in data.

In [26]:
# see how many candidates are in each category
exact %>% count(candidate)

# can look at log2FC distributions as a check
log2FC_plots(exact, 3, "LogFC by candidate for W vs Y")
A tibble: 4 × 2
candidaten
<fct><int>
high74
med 21
low 10
no 40

Visualize the edgeR DE candidates

  • MA plots
  • scatter plots
  • volcano plot

We need some transformed axes for MA plots and for volcano plots. We will make a function for that and also some functions for the plotting. MA plots first. The dotted lines indicate 2-fold changes.

In [27]:
# MA plots of DE candidates
MA_plots(exact, "ave_W", "ave_Y", "W versus Y")

Scatter plots

The solid diagonal line is 1:1, the dotted lines are 2-fold changes. The axes are in log scale.

In [28]:
# scatter plots
scatter_plots(exact, "ave_W", "ave_Y", "W versus Y")

Volcano plot

Volcano plots are another common way to visualize DE candidates.

In [29]:
# finally, a volcano plot
volcano_plot(exact, "ave_W", "ave_Y", "W versus Y")

Plot the intensities for some of the top DE candidates by FDR

We can see how the intensities of the individual samples compare for the top 10 up- and down-regulated DE candidate proteins.

In [30]:
# plot the top 10 up and 10 down proteins
set_plot_dimensions(7, 4)
plot_top_tags(exact, 3, 3, 10)
set_plot_dimensions(7, 7)

Summary

Although we had a smaller number of proteins to work with, there are clearly many significant expression differences. The normalizations, testing results, and individual protein expression levels all fit together and support large-scale changes between healthy and diseased grape sap.

We should always end notebooks with information about what packages and versions were used in the analysis.

In [31]:
# save the testing results
write.table(exact, file = "DIA-Scaffold_results_W3.txt", sep = "\t",
            row.names = FALSE, na = " ")

# log the session
sessionInfo()
R version 3.5.3 (2019-03-11)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS  10.15.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] edgeR_3.24.3    limma_3.38.3    scales_1.1.0    gridExtra_2.3  
 [5] psych_1.9.12.31 forcats_0.5.0   stringr_1.4.0   dplyr_0.8.5    
 [9] purrr_0.3.3     readr_1.3.1     tidyr_1.0.2     tibble_2.1.3   
[13] ggplot2_3.3.0   tidyverse_1.3.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.4       locfit_1.5-9.1   lubridate_1.7.4  lattice_0.20-40 
 [5] assertthat_0.2.1 digest_0.6.25    IRdisplay_0.7.0  R6_2.4.1        
 [9] cellranger_1.1.0 repr_1.1.0       backports_1.1.5  reprex_0.3.0    
[13] evaluate_0.14    httr_1.4.1       pillar_1.4.3     rlang_0.4.5     
[17] uuid_0.1-4       readxl_1.3.1     rstudioapi_0.11  labeling_0.3    
[21] splines_3.5.3    munsell_0.5.0    broom_0.5.2      compiler_3.5.3  
[25] modelr_0.1.6     pkgconfig_2.0.3  base64enc_0.1-3  mnormt_1.5-6    
[29] htmltools_0.4.0  tidyselect_1.0.0 fansi_0.4.1      crayon_1.3.4    
[33] dbplyr_1.4.2     withr_2.2.0      grid_3.5.3       nlme_3.1-145    
[37] jsonlite_1.6.1   gtable_0.3.0     lifecycle_0.2.0  DBI_1.1.0       
[41] magrittr_1.5     cli_2.0.2        stringi_1.4.6    farver_2.0.3    
[45] fs_1.3.2         xml2_1.2.5       generics_0.0.2   vctrs_0.2.4     
[49] IRkernel_1.1     tools_3.5.3      glue_1.3.2       hms_0.5.3       
[53] parallel_3.5.3   colorspace_1.4-1 rvest_0.3.5      pbdZMQ_0.3-3    
[57] haven_2.2.0     
In [ ]: