This is a study of grapevine vascular sap samples with and without Pierce's disease (a bacterial infection) using data-independent acquisition (DIA). Two new approaches for identifying the peptides to be quantified were compared: an experimental spectrum library generated by gas phase fractionation, and a theoretical library predicted from the sequenced genomes with Prosit (one of several deep learning prediction tools).
The data were acquired on a Thermo Lumos Tribrid instrument with high-resolution parent and fragment ions. Both identification approaches were used with Scaffold DIA software. The goal is to explore the differences in the data generated with the gas phase libraries versus the theoretical Prosit libraries. The data in this notebook are from the Prosit theoretical library approach.
The healthy and diseased samples were collected at three time points (9, 12, and 15 weeks) in triplicate. The "W" samples are healthy and the "Y" samples are infected. One healthy sample at 12 weeks was lost during processing. Cluster analysis indicated that one of the 9 week samples (W3) was similar to the two remaining 12 week samples (W5 and W6). This notebook therefore uses W3, W5, and W6 as the "W" group and compares them to the Y4, Y5, and Y6 samples.
More biological and experimental details can be found in the preprint.
We will use the Bioconductor R package edgeR for the statistical testing. This widely used genomics tool provides moderated test statistics and a robust trimmed mean of M-values (TMM) normalization method.
Robinson, M.D., McCarthy, D.J. and Smyth, G.K., 2010. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), pp.139-140.
Robinson, M.D. and Oshlack, A., 2010. A scaling normalization method for differential expression analysis of RNA-seq data. Genome biology, 11(3), p.R25.
# load libraries
library("tidyverse")
library("psych")
library("gridExtra")
library("scales")
library("limma")
library("edgeR")
# ================== TMM normalization from DGEList object =====================
apply_tmm_factors <- function(y, color = NULL, plot = TRUE) {
# computes the tmm normalized data from the DGEList object
# y - DGEList object
# returns a dataframe with normalized intensities
# compute and print "Sample loading" normalization factors
lib_facs <- mean(y$samples$lib.size) / y$samples$lib.size
cat("\nLibrary size factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), lib_facs))
# compute and print TMM normalization factors
tmm_facs <- 1/y$samples$norm.factors
cat("\nTrimmed mean of M-values (TMM) factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), tmm_facs))
# compute and print the final correction factors
norm_facs <- lib_facs * tmm_facs
cat("\nCombined (lib size and TMM) normalization factors:\n",
sprintf("%-5s -> %f\n", colnames(y$counts), norm_facs))
# compute the normalized data as a new data frame
tmt_tmm <- as.data.frame(sweep(y$counts, 2, norm_facs, FUN = "*"))
colnames(tmt_tmm) <- str_c(colnames(y$counts), "_tmm")
# visualize results and return data frame
if(plot == TRUE) {
boxplot(log10(tmt_tmm), col = color, notch = TRUE, main = "TMM Normalized data")
}
tmt_tmm
}
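To make the arithmetic in `apply_tmm_factors` concrete, here is a minimal base R sketch on a made-up intensity matrix. The matrix `toy` and the vector `toy_norm` (standing in for `y$samples$norm.factors`) are hypothetical values chosen for illustration; in the real analysis the TMM factors come from `calcNormFactors`.

```r
# hypothetical intensities: 3 proteins (rows) x 3 samples (columns)
toy <- matrix(c(100, 200, 700,     # sample S1, column sum 1000
                200, 400, 1400,    # sample S2, column sum 2000
                 50, 100, 350),    # sample S3, column sum 500
              nrow = 3,
              dimnames = list(c("p1", "p2", "p3"), c("S1", "S2", "S3")))

# step 1: library size factors scale every column sum to the average column sum
lib_size <- colSums(toy)
lib_facs <- mean(lib_size) / lib_size

# step 2: hypothetical TMM factors (assumed values, not computed by edgeR here)
toy_norm <- c(0.8, 1.0, 1.25)
tmm_facs <- 1 / toy_norm

# combine both factors and scale each column by its factor
norm_facs <- lib_facs * tmm_facs
toy_tmm <- sweep(toy, 2, norm_facs, FUN = "*")

colSums(toy_tmm)   # column sums now differ only by the TMM factors
```

After the library size step alone, every column would sum to the same value. The TMM factors then shift each column slightly away from that common total.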
# ================= reformat edgeR test results ================================
collect_results <- function(df, tt, x, xlab, y, ylab) {
# Computes new columns and extracts some columns to make results frame
# df - data in data.frame
# tt - top tags table from edgeR test
# x - columns for first condition
# xlab - label for x
# y - columns for second condition
# ylab - label for y
# returns a new dataframe
# condition average vectors
ave_x <- rowMeans(df[x])
ave_y <- rowMeans(df[y])
# FC, direction, candidates
fc <- ifelse(ave_y > ave_x, (ave_y / ave_x), (-1 * ave_x / ave_y))
direction <- ifelse(ave_y > ave_x, "up", "down")
candidate <- cut(tt$FDR, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
labels = c("high", "med", "low", "no"))
# make data frame
temp <- cbind(df[c(x, y)], data.frame(logFC = tt$logFC, FC = fc,
PValue = tt$PValue, FDR = tt$FDR,
ave_x = ave_x, ave_y = ave_y,
direction = direction, candidate = candidate,
Acc = tt$genes))
# fix column headers for averages
names(temp)[names(temp) %in% c("ave_x", "ave_y")] <- str_c("ave_", c(xlab, ylab))
temp # return the data frame
}
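The signed fold-change convention in `collect_results` is easy to misread, so here is a small sketch using hypothetical condition averages. It uses the same `ifelse` expressions as the function above; only the input numbers are made up.

```r
# hypothetical condition averages for four proteins
ave_x <- c(100, 100, 400, 250)
ave_y <- c(200,  50, 100, 250)

# ratios above 1 are reported as-is; ratios below 1 are inverted and negated
fc <- ifelse(ave_y > ave_x, ave_y / ave_x, -1 * ave_x / ave_y)
direction <- ifelse(ave_y > ave_x, "up", "down")

data.frame(ave_x, ave_y, fc, direction)
```

With this convention the fold change never falls strictly between -1 and 1, and a protein with identical averages is reported as -1 and "down".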
# ============= log2 fold-change distributions =================================
log2FC_plots <- function(results, range, title) {
# Makes faceted log2FC plots by candidate
# results - results data frame
# range - plus/minus log2 x-axis limits
# title - plot title
ggplot(results, aes(x = logFC, fill = candidate)) +
geom_histogram(binwidth=0.1, color = "black") +
facet_wrap(~candidate) +
ggtitle(title) +
coord_cartesian(xlim = c(-range, range))
}
# ========== Setup for MA and volcano plots ====================================
transform <- function(results, x, y) {
# Make data frame with some transformed columns
# results - results data frame
# x - columns for x condition
# y - columns for y condition
# return new data frame
df <- data.frame(log10((results[x] + results[y])/2),
log2(results[y] / results[x]),
results$candidate,
-log10(results$FDR))
colnames(df) <- c("A", "M", "candidate", "P")
df # return the data frame
}
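As a quick illustration of the A and M columns built above, the sketch below computes them for three hypothetical proteins. The data frame `toy_ave` and its values are assumptions for illustration only.

```r
# hypothetical condition averages
toy_ave <- data.frame(ave_W = c(1e5, 2e6, 5e4),
                      ave_Y = c(4e5, 2e6, 2.5e4))

# A: log10 of the mean of the two condition averages (x-axis of an MA plot)
A <- log10((toy_ave$ave_W + toy_ave$ave_Y) / 2)

# M: log2 of the Y/W ratio (y-axis of an MA plot, x-axis of a volcano plot)
M <- log2(toy_ave$ave_Y / toy_ave$ave_W)

data.frame(A = round(A, 2), M = M)
```

A 4-fold increase gives M = 2, no change gives M = 0, and a 2-fold decrease gives M = -1, which is why the dotted lines in the MA plots sit at plus and minus 1.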
# ========== MA plots using ggplot =============================================
MA_plots <- function(results, x, y, title) {
# makes MA-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# prints the combined plot and the faceted plot
# uses transformed data
temp <- transform(results, x, y)
# 2-fold change lines
ma_lines <- list(geom_hline(yintercept = 0.0, color = "black"),
geom_hline(yintercept = 1.0, color = "black", linetype = "dotted"),
geom_hline(yintercept = -1.0, color = "black", linetype = "dotted"))
# make main MA plot
ma <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
scale_x_continuous("log10 Ave_intensity") +
ggtitle(title) +
ma_lines
# make separate MA plots
ma_facet <- ggplot(temp, aes(x = A, y = M)) +
geom_point(aes(color = candidate, shape = candidate)) +
scale_y_continuous(paste0("log2 FC (", y, "/", x, ")")) +
scale_x_continuous("log10 Ave_intensity") +
ma_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(ma)
print(ma_facet)
}
# ========== Scatter plots using ggplot ========================================
scatter_plots <- function(results, x, y, title) {
# makes scatter-plot DE candidate ggplots
# results - data frame with edgeR results and some condition average columns
# x - string for x-axis column
# y - string for y-axis column
# title - title string to use in plots
# prints the combined plot and the faceted plot
# 2-fold change lines
scatter_lines <- list(geom_abline(intercept = 0.0, slope = 1.0, color = "black"),
geom_abline(intercept = 0.301, slope = 1.0, color = "black", linetype = "dotted"),
geom_abline(intercept = -0.301, slope = 1.0, color = "black", linetype = "dotted"),
scale_y_log10(),
scale_x_log10())
# make main scatter plot
scatter <- ggplot(results, aes(x = .data[[x]], y = .data[[y]])) +
geom_point(aes(color = candidate, shape = candidate)) +
ggtitle(title) +
scatter_lines
# make separate scatter plots
scatter_facet <- ggplot(results, aes(x = .data[[x]], y = .data[[y]])) +
geom_point(aes(color = candidate, shape = candidate)) +
scatter_lines +
facet_wrap(~ candidate) +
ggtitle(str_c(title, " (separated)"))
# make the plots visible
print(scatter)
print(scatter_facet)
}
# ========== Volcano plots using ggplot ========================================
volcano_plot <- function(results, x, y, title) {
# makes a volcano plot
# results - a data frame with edgeR results
# x - string for the x-axis column
# y - string for y-axis column
# title - plot title string
# uses transformed data
temp <- transform(results, x, y)
# build the plot
ggplot(temp, aes(x = M, y = P)) +
geom_point(aes(color = candidate, shape = candidate)) +
xlab("log2 FC") +
ylab("-log10 FDR") +
ggtitle(str_c(title, " Volcano Plot"))
}
# ============== individual protein expression plots ===========================
# function to extract the identifier part of the accession
get_identifier <- function(accession) {
# identifier <- str_split(accession, "\\|", simplify = TRUE)
# identifier[,3]
identifier <- accession
}
set_plot_dimensions <- function(width_choice, height_choice) {
options(repr.plot.width=width_choice, repr.plot.height=height_choice)
}
plot_top_tags <- function(results, nleft, nright, top_tags) {
# results should have data first, then test results (two condition summary table)
# nleft, nright are number of data points in each condition
# top_tags is number of up and number of down top DE candidates to plot
# get top upregulated
up <- results %>%
filter(logFC >= 0) %>%
arrange(FDR)
up <- head(up, top_tags) # head() avoids NA rows when fewer than top_tags exist
# get top down regulated
down <- results %>%
filter(logFC < 0) %>%
arrange(FDR)
down <- head(down, top_tags)
# pack them
proteins <- rbind(up, down)
color = c(rep("red", nleft), rep("blue", nright))
for (row_num in seq_len(nrow(proteins))) {
row <- proteins[row_num, ]
vec <- as.vector(unlist(row[1:(nleft + nright)]))
names(vec) <- colnames(row[1:(nleft + nright)])
title <- str_c(get_identifier(row$Acc), ", int: ", scientific(mean(vec), 2),
", FDR: ", scientific(row$FDR, digits = 3),
", FC: ", round(row$FC, digits = 1),
", ", row$candidate)
barplot(vec, col = color, main = title,
cex.main = 1.0, cex.names = 0.7, cex.lab = 0.7)
}
}
# ============== p-value distribution =========================================
pvalue_plot <- function(results, title) {
# Makes p-value distribution plots
# results - results data frame
# title - plot title
ggplot(results, aes(PValue)) +
geom_histogram(bins = 100, fill = "white", color = "black") +
geom_hline(yintercept = mean(hist(results$PValue, breaks = 100,
plot = FALSE)$counts[26:100]), na.rm = TRUE) +
ggtitle(str_c(title, " p-value distribution"))
}
The original data contained a small proportion of missing values. The average intensity per protein across all samples was computed and proteins ranked by decreasing average intensity. Seven proteins had an average of less than 150,000 and were excluded. Those proteins had about 1/3 of all missing values.
For each biological sample, the smallest non-missing values were found. Based on the median value of the smallest values (about 9,000), missing values were replaced by a value of 1,500.
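The replacement step described above can be sketched in base R as follows. The data frame `toy_dia` is hypothetical, and the fill value of 1,500 is simply copied from the text; how that value was derived from the median of the minima is not stated, so it is hard-coded here as an assumption.

```r
# hypothetical intensities with missing values (NA)
toy_dia <- data.frame(S1 = c(9500, NA, 250000, 1.2e6),
                      S2 = c(NA, 8000, 310000, 9.0e5),
                      S3 = c(12000, 40000, NA, 1.1e6))

# smallest non-missing value in each sample, then the median of those minima
sample_min <- sapply(toy_dia, min, na.rm = TRUE)
median(sample_min)

# replace every missing value with one fixed value well below that median
fill_value <- 1500   # value taken from the text
filled <- toy_dia
filled[is.na(filled)] <- fill_value

filled
```

Using a single small constant keeps the replaced values below anything actually measured, so a protein missing in one condition still shows up as a large difference rather than being dropped.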
# read the protein-level quantitative values
dia_start <- read_tsv("PW-filtered_Prosit-DIA.txt")
# separate the protein accession column from the quantitative data
accessions <- dia_start$Accession
dia_data <- dia_start %>% select(-Accession)
head(dia_data)
length(accessions)
DGEList edgeR object
# define indices for conditions of interest
W_start <- c(3, 5, 4) # W3, W5, W6 - healthy group
Y_start <- c(12, 11, 13) # Y4, Y5, Y6 - infected group
# load data into DGEList object
group <- c(rep("W", 3), rep("Y", 3))
y <- DGEList(counts = dia_data[c(W_start, Y_start)], group = group, genes = accessions)
y$samples
edgeR normalization is done in two steps. The first, called a library size adjustment, is like a sample loading normalization: it removes the large differences between samples so that the TMM algorithm has better starting data. The second is the TMM normalization itself. edgeR uses the resulting factors internally, so we need to compute the normalized intensities from the factors ourselves.
# run the TMM normalization
y <- calcNormFactors(y)
# set colors for plotting
colors <- c(rep("red", 3), rep("blue", 3))
# redefine the indices for the subsetted data
W <- 1:3
Y <- 4:6
# get the normalized data values
dia_tmm <- apply_tmm_factors(y, colors)
# check the clustering
plotMDS(y, col = colors, main = "Samples after TMM")
edgeR uses trended dispersion to moderate the testing statistics to make the modeling more robust for studies with small replicate numbers.
# we need to get dispersion estimates
y <- estimateDisp(y)
plotBCV(y, main = "Dispersion trends")
We will use the exact test in edgeR for this simple two-state comparison. We will also simplify/reformat the test results and save them in a data frame.
# compute the exact test models, p-values, FC, etc.
et <- exactTest(y, pair = c("W", "Y"))
# check some top tags
topTags(et)$table
# this counts up, down, and unchanged genes (proteins) at 5% FDR
summary(decideTestsDGE(et, p.value = 0.05))
# make the results table
tt <- topTags(et, n = Inf, sort.by = "none")$table
exact <- collect_results(dia_tmm, tt, W, "W", Y, "Y")
# make an MD plot (like MA plot)
plotMD(et, p.value = 0.05)
abline(h = c(-1, 1), col = "black")
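The FDR values reported by `topTags` are, by default, Benjamini-Hochberg adjusted p-values (`adjust.method = "BH"`). As a minimal sketch of what that adjustment does, the base R function `p.adjust` can be applied to a few hypothetical p-values; the vector `p_toy` is made up for illustration.

```r
# hypothetical raw p-values for five proteins
p_toy <- c(0.001, 0.008, 0.039, 0.041, 0.6)

# Benjamini-Hochberg adjustment: p * n / rank, made monotone from the largest down
fdr_toy <- p.adjust(p_toy, method = "BH")

data.frame(p = p_toy, FDR = fdr_toy, passes_5pct = fdr_toy <= 0.05)
```

Note that a raw p-value below 0.05 (for example 0.039) does not necessarily pass a 5% FDR cut once the number of tests is taken into account.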
It is important to see if the modeling looks reasonable. Our general assumptions are that we have a large fraction of the proteins that are not differentially expressed. Those will have a uniform (flat) p-value distribution from 0.0 to 1.0. We also expect (hopefully) some true differential expression candidates. Those should have very small p-values and have a sharper distribution at low p-values.
# check the p-value distribution
pvalue_plot(exact, "W vs Y, Scaffold")
Despite the sparse data, we still observe the two distributions of p-values, so the testing seems reasonable.
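To illustrate the two expected p-value populations, here is a small simulation sketch. The mixture proportions and the beta distribution used for the "true" candidates are arbitrary assumptions, not estimates from the grape sap data. The baseline is computed the same way as in `pvalue_plot`, from bins 26 to 100.

```r
set.seed(1)

# hypothetical mixture: 900 unchanged proteins and 100 true candidates
p_null <- runif(900)             # flat between 0 and 1
p_alt  <- rbeta(100, 0.2, 5)     # piled up near zero (assumed shape)
p_sim  <- c(p_null, p_alt)

# 100 equal-width bins, as in the histogram above
counts <- hist(p_sim, breaks = seq(0, 1, by = 0.01), plot = FALSE)$counts

# average height of the flat region (p > 0.25), used as the baseline
baseline <- mean(counts[26:100])
baseline     # expected to be near 900 / 100 = 9 per bin
counts[1]    # the first bin should sit far above the baseline
```

If the flat region were missing or the histogram rose towards 1.0, that would suggest the statistical model is a poor fit for the data.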
We can define three cuts on the FDR: 5% to 10% is "low" significance, 1% to 5% is "medium" significance, and less than 1% is "high" significance. Cut values can be adjusted depending on the experimental situation. We can look at expression ratio distributions as a function of candidate category. If the variance does not differ too much from protein to protein, then we would expect larger mean differences to be associated with lower FDR values. Faceted plotting in ggplot2 is another way to see patterns in data.
# see how many candidates are in each category
exact %>% count(candidate)
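The candidate categories come from the `cut` call in `collect_results`. The sketch below applies the same breaks and labels to a few hypothetical FDR values (the vector `fdr_toy` is made up) to show where the boundary values land.

```r
# hypothetical FDR values, including the exact cut points
fdr_toy <- c(0.0005, 0.01, 0.03, 0.05, 0.08, 0.10, 0.5)

# same breaks and labels as in collect_results; intervals are closed on the right
candidate_toy <- cut(fdr_toy, breaks = c(-Inf, 0.01, 0.05, 0.10, 1.0),
                     labels = c("high", "med", "low", "no"))

data.frame(FDR = fdr_toy, candidate = candidate_toy)
table(candidate_toy)
```

Because `cut` uses right-closed intervals by default, a value exactly on a cut point (0.01, 0.05, or 0.10) is assigned to the more significant category.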
# can look at log2FC distributions as a check
log2FC_plots(exact, 4, "LogFC by candidate for W vs Y")
# MA plots of DE candidates
MA_plots(exact, "ave_W", "ave_Y", "W versus Y")
The solid diagonal line is 1:1 and the dotted lines are 2-fold changes. Both axes are on a log scale.
# scatter plots
scatter_plots(exact, "ave_W", "ave_Y", "W versus Y")
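The dotted lines in `scatter_plots` use intercepts of plus and minus 0.301 because, on log10 axes, the line y = 2x becomes log10(y) = log10(x) + log10(2). A minimal check of that arithmetic, using hypothetical intensities in `x_toy`:

```r
# offset of a 2-fold change on a log10 scale
offset <- log10(2)
round(offset, 3)

# hypothetical intensities spanning several orders of magnitude
x_toy <- c(1e4, 1e6, 1e8)

# points on the upper and lower dotted lines, back-transformed to raw intensities
upper <- 10^(log10(x_toy) + offset)
lower <- 10^(log10(x_toy) - offset)

upper / x_toy   # constant ratio of 2 at every intensity
lower / x_toy   # constant ratio of 0.5 at every intensity
```

A slope of 1 with a constant offset in log space therefore corresponds to a constant fold change across the whole intensity range.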
Volcano plots are another common way to visualize DE candidates.
# finally, a volcano plot
volcano_plot(exact, "ave_W", "ave_Y", "W versus Y")
We can see how the intensities of the individual samples compare for the top 10 up- and down-regulated DE candidate proteins.
# plot the top 10 up and 10 down proteins
set_plot_dimensions(7, 4)
plot_top_tags(exact, 3, 3, 10)
set_plot_dimensions(7, 7)
We had many more quantified proteins with the Prosit approach and there are clearly many significant expression differences. The normalizations, testing results, and individual protein expression levels all fit together and support large-scale changes between healthy and diseased grape sap.
We save the test results and, as always, end the notebook with information about which packages and versions were used in the analysis.
# save the testing results
write.table(exact, file = "DIA-Prosit_results_W3.txt", sep = "\t",
row.names = FALSE, na = " ")
# log the session
sessionInfo()