Pathways analysis is a family of powerful methods that associate experimentally-derived lists of genes with known gene signatures. Such experimental gene lists can be generated by differential expression analysis, GWAS, epigenetic studies such as ATAC-Seq, high-throughput genetic screens and other methods. The gene signatures (often referred to as “gene sets” or “modules”), are likewise derived through a variety of approaches. Some gene signatures represent curated gene sets obtained over decades of careful experimental delineation of metabolic and signaling pathways. Others are derived by high throughput methods such as chemical or genetic perturbation, chromatin-IP, GWAS, genetic screens, or associations with disease states or chromosomal locations. Still others are derived computationally by such methods as in silico identification and prediction of cis-elements. Numerous compendia of gene sets exist (henceforth referred as gene set collections (GSC)), including GeneOntology, Reactome, KEGG Pathways, as well as larger collections that incorporate many such gene set collections and others in larger searchable databases (e.g. MSigDB).
Pathways analysis methods such as GSEA([1],[2]), CERNO([3]), DAVID([4],[5]) and others work by identifying statistical associations of ordered or unordered gene lists with genetic signatures, and returning lists of gene signatures with adjusted estimated p-values. A common pitfall of these methods is that the resulting lists of gene signatures may be complex and difficult to interpret. The reason for this complexity is multifold. Firstly, within GSCs which may include many thousands of gene signatures, there are frequently numerous signatures that are nested or otherwise overlap, sharing many of the same genes. This is easily understood; many of the perturbations used to generate the GSCs can activate the same or overlapping pathways. In addition, multiple separate pathways may be perturbed in a given experimental dataset. A significant challenge in pathways analysis is to determine which of numerous genetic signature “hits” from a given experimental dataset represent overlapping or distinct pathways.
Strategies exist to address this problem. TopGO ([6]) maps hits against nested gene ontology gene sets to the predefined hierarchies of GO terms, and scores gene sets within the hierarchy. However it’s utility is limited to gene ontology gene sets. DAVID ([4],[5]) uses kappa statisics to group gene sets together, but this feature is only usable within the DAVID web application and is not presently applicable to other methods or gene sets beyond those present in DAVID’s online database. The Enrichment Map plugin for Cytoscape ([7]) uses Jaccard or Szymkiewicz–Simpson coefficients as distance metrics to group gene sets together, but it does not actually divide the gene sets into clusters, and is meant primarily for GSEA data. Here, we present a general method for associating gene sets that (1) is modular and independent of pathways-analysis method*, (2) works with arbitrary gene sets/combinations of gene sets, (3) can generate tabular gene set associations and graphical visualizations from pathways data and (4) is accessible via an application programming interface (API). We call this method Gene Set Network Analysis (GSNA).
For best results with this vignette, we suggest running the code blocks listed herein in an R markdown environment in Rstudio. Some commands in this document are specific to R markdown, particularly knitr::include_graphics().
The main inputs into the standard GSNA workflow are:
Gene set network analysis (GSNA) is meant to be done after pathways analysis has been performed, as a way of simplifying interpretation of a pathways analysis results set. As such, a pathways data set is one of the main inputs for the method.
The GSNA package is designed to seamlessly import data sets from multiple common pathways analysis software packages and functions, including but not limited to GSEA([1],[2]), DAVID([4],[5]), and the tmodCERNOtest() function from the tmod package([3]). Pathways data sets should be in the form of a data.frame. At the very minimum, the results data sets should include a column containing gene set identifiers for the pathway “hits”, a column containing a significance statistic (frequently but not always a p-value or q-value), and optimally also a descriptive title or gene set name. Sometimes the ID and the Title will be the same.
Pathways data sets are normally filtered to include only the gene sets with significant hits, for example p-values below a particular value of α (typically \(\alpha <= 0.05\)). In the case of dual-channel GSNA analysis, the pathways dataset may be the result of a merger of two separate pathways analyses, with two sets of statistical values for comparison.
Gene set collections (GSCs) should contain only those gene sets that are to be analyzed by GSNA, and only the gene sets that are present in the pathways data set. Generally, this includes gene set hits with significant p- or q-values. Gene set collections may be supplied in any of the three following formats:
Typically, the GSCs used for GSNA are not the same as the GSCs used for the initial pathways analysis, but rather they are subsetted from the original gene set collection to contain only the gene sets with significant hits or which are otherwise of interest.
It is important that gene set identifiers (IDs) be the same in the pathways data sets as in the gene set collections, or GSNA will not work. In some cases, it may be necessary to harmonize the gene set IDs from the pathways data sets with those used in the GSC.
The set of background genes is supplied as a character vector containing the the symbols of the genes that were observed in the original analysis performed prior to pathways analysis. If pathways analysis was performed with a differential expression gene set, then the background should be a vector of all the symbols for genes observed in the differential expression data set, regardless of whether they were differentially expressed or not.
Gene identifiers and or symbols in the ref.background vector should match those in the GSC in form and capitalization. In most gene set collections, such as MSigDB, symbols are capitalized. If ENSEMBL gene IDs are used in the gene set collection, the background should be the same.
A standard GSNA analysis consists of the following steps:
Start by loading the library, as follows:
The GSNA package supports pathways datasets generated by GSEA, DAVID, and the CERNO method (from the tmod package). In addition, the GSNA package includes a function to perform a Fisher exact test of overrepresentation (similar to DAVID) and supports the output of that function. Other tabular pathways data types are supported by specifying the id_col, stat_col, sig_order arguments, and for 2-channel datasets, the stat_col_2, sig_order_2 fields, or otherwise using the gsnImportGenericPathway() function.
In this vignette, we will use differential expression and pathways data derived from the publicly-available dataset of Bai et al.([8]). In their study, transcriptomic data were derived from the treatment of mouse fibroblast type 2 cells with a chemical cocktail to induce differentiation into multiple cell types, including hepatocyte-like cells (CiHep) and keratinocyte-like cells (CiKrt).
In the sample data provided, gene expression in CiHep cells and expression in CiKrt cells are both compared to the fibroblast type 2 cell type, generating two sets of differential expression data to profile both cell type transitions. These two transitions, fibroblast 2 to CiHep and fibroblast 2 to CiKrt cells were further analyzed by performing pathways analysis using a gene set collection containing sets of genes activated or repressed dependent on particular transcription factors derived from MSigDB signatures as well as from the dorothea data set([9]). The main signatures observed in this dataset are repressed gene signatures, genes whose expression went down upon treatment with the chemical cocktail.
The Bai results set contains the following data objects:
When the GSNA library is loaded, these objects are lazy-loaded into the environment, and are then ready for use without calling data().
For the first example analysis, we are performing a single-channel GSNA analysis with gene signatures in CERNO data that were down in CiHep cells relative to fibroblast 2 cells. We will use the following data objects for this first analysis:
Let’s begin by inspecting the structure of this CERNO data set using the head function.
head( Bai_CiHep_DN.cerno )
#> ID Title cerno N1 AUC cES
#> M30131 M30131 Psmb5 target genes 646.75388 183 0.6176900 1.767087
#> M29968 M29968 Foxe1 target genes 1183.78772 439 0.5851964 1.348278
#> M30171 M30171 Snrnp70 target genes 1329.53093 515 0.5532870 1.290807
#> M40742 M40742 Gtf2a2 target genes 815.02548 294 0.5674120 1.386098
#> M40770 M40770 Atxn7l3 target genes 568.41196 196 0.5942726 1.450031
#> M30055 M30055 Map2k1 target genes 63.07777 7 0.8660123 4.505555
#> P.Value adj.P.Val
#> M30131 7.261354e-18 2.102162e-14
#> M29968 1.992487e-11 2.884125e-08
#> M30171 6.613673e-10 6.382194e-07
#> M40742 1.433739e-09 1.037669e-06
#> M40770 1.285708e-08 7.444249e-06
#> M30055 3.361872e-08 1.622103e-05
This CERNO result set includes a gene set ID as well as several statistical values including the cerno, AUC, cES, and both standard (nominal) and adjusted p-values. We will use the adjusted p-values in our analysis.
The gene set collection (GSC) object included in with the GSNA package is a tmod class object. However, as of version 0.50.11, the tmod package used in this vignette uses a new version of its gene set collection object, a tmodGS object. If you’re using version 0.50.11 of the tmod package, convert the Bai_gsc.tmod into a tmodGS object. The following code snippet includes a check whether your tmod version is “0.50.11” or later, and if so, runs tmod2tmodGS() to update the format of Bai_gsc.tmod to class tmodGS.
if( utils::packageVersion( pkg = "tmod" ) >= "0.50.11" ){
Bai_gsc.tmod <- tmod::tmod2tmodGS( Bai_gsc.tmod )
}
Now, we can inspect the Bai_gsc.tmod object to see that it is indeed a tmod or tmodGS class object, and also see how many gene sets (referred to as modules) and genes it contains:
Now we can filter the Bai_CiHep_DN.cerno data set for gene signatures that have significant adjusted p-values. We will use \(\alpha = 0.05\) for our significance cutoff. This filtered CERNO data set will be used
The ID column in a cerno dataset contains the the ID of its respective gene set, which can then be used for subsetting a gene set collection list, or a tmod or tmodGS object.:
Bai_CiHep_DN.sigid <- Bai_CiHep_DN.cerno.sig$ID
Bai_CiHep_DN.sigid
#> [1] "M30131" "M29968" "M30171" "M40742" "M40770" "M30055" "M30117" "M30190"
#> [9] "M30154" "M30186" "M29922" "M29984" "M40804" "M40775" "M40776" "M40846"
Now, we subset the Bai_gsc.tmod gene set collection to obtain a collection with only significant hits using the gene set IDs as keys for the tmod object. Subsetting tmod or tmodGS objects works using the list subset operator [], where the IDs are treated as indices.
Bai_CiHep_DN.sig.tmod <- Bai_gsc.tmod[Bai_CiHep_DN.sigid]
Bai_CiHep_DN.sig.tmod
#> An object of class "tmod"
#> 16 modules, 6413 genes
Variation:
Now that we have a filtered CERNO data set and a GSC containing only significant gene sets, we need the background of observable genes in our expression set. Typically, this is obtained from the differential expression data. In the case of the Bai dataset, we have expression data in the form of a Seurat data set obtained using Seurat’s FindMarkers() function, with gene symbols as the row names. In this data, gene symbols are in “title case”, with only the first letter capitalized, but gene symbols in the gene set collection are all upper case, so we convert the genes in the Bai data set to upper case too:
Use the buildGeneSetNetworkSTLF() function to create a GSNData object and calculate a distance matrix. This first step requires just the filtered GSC and the reference background.
Bai_CiHep_DN.sig.GSN <- buildGeneSetNetworkSTLF( ref.background = Bai_CiHep_genes,
geneSetCollection = Bai_CiHep_DN.sig.tmod )
Bai_CiHep_DN.sig.GSN
#> GSNData object version: 0.1.4.2
#> Contains data for:
#> 8769 genes.
#> 16 gene sets.
#> Contains the following distance(s):
#> stlf
Variations:
Once we have a GSNData object containing a distance matrix, we import the pathways results data set as a data.frame. This should include an ID column and a statistical column for ordering the data by significance. Such a statistical column is generally an adjusted p-value. When pathways data are imported, the id_col, stat_col, and sig_order fields must also be set, but for known data set types (CERNO, GSEA, DAVID), defaults for these can be set automatically. NOTE: the id_col is the ID used by the gene set collection whether that is in the form of a tmod or tmodGS object, or when the gene set collection is supplied as a named list of gene set character vectors.
Bai_CiHep_DN.sig.GSN <- gsnAddPathwaysData( object = Bai_CiHep_DN.sig.GSN,
pathways_data = Bai_CiHep_DN.cerno.sig )
#> Using CERNO import.
Bai_CiHep_DN.sig.GSN
#> GSNData object version: 0.1.4.2
#> Contains data for:
#> 8769 genes.
#> 16 gene sets.
#> Contains the following distance(s):
#> stlf
#> Contains pathways data of type: cerno
#> id_col = ID
#> stat_col = adj.P.Val
#> sig_order = loToHi
#> n_col = N1
The gsnAddPathwaysData() function automatically identifies the data set as a CERNO data set, and selects ‘ID’ as the identifier column (id_col), ‘adj.P.Val’ as the statistical column (stat_col) to be used, and the ordering of that column (sig_order) as ‘loToHi’ meaning that low values are treated as most significant. This value ‘loToHi’ is appropriate when p-values or adjusted p-values are used as the statistical column.
NOTE: In the case of most datasets generated by GSEA, CERNO, DAVID or GSNORA, the gsnAddPathwaysData() function can correctly identify the data-type and select appropriate values of stat_col and sig_order. However, in the case of data with merged and/or renamed columns, the appropriate values of stat_col and sig_order may need to be provided.
Now, pare the network using hierarchical clustering:
Variations
Once we have a pared distance matrix, we can assign subnets.
Once subnets have been assigned, we can output a set of them in tabular format. We use gsnMergePathways to create a table of the pathways results organized into subnets. This table includes, in addition to the pathways results, subnet and subnetRank columns.
Bai_CiHep_DN.sig.GSN.subnets <- gsnMergePathways(object = Bai_CiHep_DN.sig.GSN )
Bai_CiHep_DN.sig.GSN.subnets
#> subnet subnetRank ID Title cerno N1 AUC
#> 1 1 1 M30131 Psmb5 target genes 646.75388 183 0.6176900
#> 2 1 2 M29968 Foxe1 target genes 1183.78772 439 0.5851964
#> 3 1 3 M30171 Snrnp70 target genes 1329.53093 515 0.5532870
#> 4 1 4 M40742 Gtf2a2 target genes 815.02548 294 0.5674120
#> 5 1 5 M40770 Atxn7l3 target genes 568.41196 196 0.5942726
#> 6 1 6 M30190 Taf9b target genes 907.13061 356 0.5603921
#> 7 1 7 M30154 Setd7 target genes 1305.71154 537 0.5496645
#> 8 1 8 M30186 Supt16h target genes 2028.89035 875 0.5533285
#> 9 1 9 M29922 Chaf1b target genes 1189.28971 490 0.5546465
#> 10 1 10 M29984 Gtf2e2 target genes 574.78755 217 0.5759383
#> 11 2 1 M30055 Map2k1 target genes 63.07777 7 0.8660123
#> 12 3 1 M30117 Per1 target genes 101.39983 20 0.7569837
#> 13 4 1 M40804 Zzz3 target genes 236.79730 75 0.6266697
#> 14 4 2 M40775 Znf318 target genes 668.08996 261 0.5566422
#> 15 5 1 M40776 Hmbox1 target genes 1493.07674 652 0.5412185
#> 16 6 1 M40846 Prdm4 target genes 555.05334 222 0.5789229
#> cES P.Value adj.P.Val
#> 1 1.767087 7.261354e-18 2.102162e-14
#> 2 1.348278 1.992487e-11 2.884125e-08
#> 3 1.290807 6.613673e-10 6.382194e-07
#> 4 1.386098 1.433739e-09 1.037669e-06
#> 5 1.450031 1.285708e-08 7.444249e-06
#> 6 1.274060 8.788686e-07 3.180406e-04
#> 7 1.215746 1.342491e-06 4.318345e-04
#> 8 1.159366 3.465626e-06 1.003299e-03
#> 9 1.213561 4.451882e-06 1.171654e-03
#> 10 1.324395 6.384149e-06 1.540176e-03
#> 11 4.505555 3.361872e-08 1.622103e-05
#> 12 2.534996 3.075072e-07 1.271762e-04
#> 13 1.578649 7.938417e-06 1.767824e-03
#> 14 1.279866 1.473262e-05 3.046495e-03
#> 15 1.144998 1.925234e-04 3.715701e-02
#> 16 1.250120 2.538713e-04 4.593483e-02
Within R markdown documents, data frames can be reformatted for output using the yassifyPathways() function. In addition to providing a better output format, the values in the ID column can be nested within links to information on the corresponding gene sets using the url_map_list argument. The argument takes a list containing named vectors. The vectors are named according to the ID column to be substituted with a link, and the vectors contain name-value pairs in which the names are the IDs to be substituted and the values are the corresponding link URL.
In the following code block, the structure function is used to generate the ID/URL pairs from values in the MODULES or gv field of the Bai_gsc.tmod object, depending on your installed version of tmod.
if( utils::packageVersion( pkg = "tmod" ) >= "0.50.11" ){
#url_map_l <- list( ID = with( Bai_gsc.tmod$gs, structure( URL, names = ID ) ))
# For tmod version 0.50.11 and later.
url_from_ID <- with( Bai_gsc.tmod$gs, structure( URL, names = ID ) )
} else {
#url_map_l <- list( ID = with( Bai_gsc.tmod$MODULES, structure( URL, names = ID ) ))
# For earlier versions:
url_from_ID <- with( Bai_gsc.tmod$MODULES, structure( URL, names = ID ) )
}
We can inspect what this vector looks like, by looking at some of the first name/“URL” pairs:
head( url_from_ID )
#> M29994
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M29994"
#> M40825
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M40825"
#> M30131
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M30131"
#> M29968
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M29968"
#> M30171
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M30171"
#> M40742
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M40742"
Now, calling the yassifyPathways() function thereby yields a table with improved format and links to the corresponding URLs describing the gene sets in question.
Note the url_map_list argument results in links for the gene set IDs.
Another, more summarized report of subnets can be obtained using gsnSubnetSummary(). This report groups subnets together and gives some additional statistics relating to the subnets.
Here we use both the argument url_map_list = list( Seed.ID = url_from_ID ), to map names to URLs in the url_from_ID vector using the entire Seed.ID field, and url_map_by_words_list = list( IDs = url_from_ID ) argument to map using individual words within the IDs field. Use url_map_by_words_list when a data.frame column may contain more than one single mappable ID per row.
NOTE: In addition to its own arguments, yassifyPathways can pass arguments (like options) to DT::datatable() to allow additional control of table characteristics and formatting. In the example below, the options argument is used to provide horizontal scrolling.
Bai_CiHep_DN.sig.GSN.subnetSummary <- gsnSubnetSummary( object = Bai_CiHep_DN.sig.GSN )
yassifyPathways( Bai_CiHep_DN.sig.GSN.subnetSummary,
url_map_list = list( Seed.ID = url_from_ID ),
url_map_by_words_list = list( IDs = url_from_ID ),
options =list( autoWidth = FALSE,
scrollX = TRUE
)
)
Try horizontally scrolling this table to see all columns.
Once the gene set network has been pared and subnets assigned, the plot.GSNData() method (called simply as plot()) can be used to generate plots of GSNData objects as a circular network plot. Generating a plot directly within an R graphical environment can work, but for R markdown, output to an file and reading the image in through knitr::include_graphics() often produces better results. In this vignette, because of package space limitations, we’re generating PNG bitmap formats, but for high quality vector images, SVG or PDF outputs may be preferable. NOTE: The , , and functions can select an output format based on the presence of a supported file name suffix (.pdf, .png, .svg), so to output an SVG format to a file, you could simply give the file a name ending in . (Work is currently underway to improve automatic setting of graphical parameters by GSNA, but currently, tweaking graphical parameters can improve rendering.)
gsnPlotNetwork( object = Bai_CiHep_DN.sig.GSN, filename = "Bai_CiHep_DN.sig.GSN.PL.png", width = 9, height = 7, n_col = 'N1' ) -> out
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
knitr::include_graphics( "Bai_CiHep_DN.sig.GSN.PL.png" )
Note that the n_col = ‘N1’ argument tells the plot function to scale the size of vertices according to the values of the ‘N1’ column of the pathways data, which indicates the number of genes in a given gene signature.
To generate a hierarchical dendrogram of distances between gene sets, the gsnHierarchicalDendrogram() function can be used.
gsnHierarchicalDendrogram( object = Bai_CiHep_DN.sig.GSN,
width = 8,
height = 5,
filename = "Bai_CiHep_DN.sig.GSN.HD.png",
show.leaves = TRUE,
show.legend = TRUE,
n_col = 'N1',
leaf_cex_range = c(0.3,1.6) )
knitr::include_graphics( path = "Bai_CiHep_DN.sig.GSN.HD.png" )
On the basis of this output, it could be concluded that there are 6 subnets or clusters of gene signatures that are downregulated upon chemical induction of CiHep cells from fibroblast2 cells, with subnet 1 being the most significant, and within that, the M30131 Psmb5 signature being particularly strong.
Variation: A hierarchical dendrogram with a circular geometry can be generated with gsnHierarchicalDendrogram() using the geometry = ‘circular’ argument.
Two channel gene set networks are useful for comparing differential pathway activation or repression between conditions, treatments or transitions between cell types. This mainly applies to cases where two distinct cellular states are derived from a single reference state or cell type. In the Bai data set, the fibroblast 2 cellular subset differentiates into hepatocyte-like (CiHep), keratinocyte-like (CiKrt) and several other cell types upon chemical induction. In an analysis of this data set, comparisons can be made between the transition from the fibroblast 2 cell type to both the CiHep and CiKrt induced cell types.
A 2-channel Gene Set Network workflow is very similar to a single channel workflow. The main difference is that two parallel significance statistics are used in the pathways dataset instead of a single significance statistic. These two significance statistics are derived from the two pathways datasets being compared. Subsequent steps for the analysis are nearly the same.
For the 2-channel analysis, we will use the following data objects for this first analysis:
The 2-channel workflow begins with the merging of the two unfiltered parallel pathways data sets. Importantly, by “unfiltered”, we mean that the two pathways data sets should contain scores and p-values for all the gene sets in the gene set collection. For CERNO, such complete unfiltered result sets may be obtained by including the argument qval = 1.1 when calling the tmod package’s tmodCERNOtest() function. Using unfiltered pathways results sets allows us to generate a merged data set that includes gene signatures that are significant in either one or both of the two pathways results sets.
We start with two, complete unfiltered data sets containing all the CERNO results for a given GSC.
For columns common to both CERNO datasets, the names are suffixed with .x and .y depending on which argument was used. To clean up this data, we can delete columns not of interest and give the columns with .x and .y suffixes better names, using the rename function from the dplyr package:
Bai_CiHep_v_CiKrt_DN.merge$Title.y <- NULL
library( 'dplyr' )
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
Bai_CiHep_v_CiKrt_DN.merge <- dplyr::rename( .data = Bai_CiHep_v_CiKrt_DN.merge,
Title = Title.x,
cerno.CiHep = cerno.x,
N1.CiHep = N1.x,
AUC.CiHep = AUC.x,
cES.CiHep = cES.x,
P.Value.CiHep = P.Value.x,
adj.P.Val.CiHep = adj.P.Val.x,
cerno.CiKrt = cerno.y,
N1.CiKrt = N1.y,
AUC.CiKrt = AUC.y,
cES.CiKrt = cES.y,
P.Value.CiKrt = P.Value.y,
adj.P.Val.CiKrt = adj.P.Val.y
)
head( Bai_CiHep_v_CiKrt_DN.merge )
#> ID Title cerno.CiHep N1.CiHep AUC.CiHep cES.CiHep
#> 1 M29881 Adcyap1 target genes 1.193995 1 0.4495894 0.5969976
#> 2 M29885 Alkbh3 target genes 490.585215 265 0.4912303 0.9256325
#> 3 M29886 Alx4 target genes 63.730636 21 0.5723213 1.5173961
#> 4 M29887 Arhgap35 target genes 154.090957 64 0.5634298 1.2038356
#> 5 M29889 Arid5b target genes 1045.345896 510 0.5237974 1.0248489
#> 6 M29890 Arnt2 target genes 859.285215 426 0.5053423 1.0085507
#> P.Value.CiHep adj.P.Val.CiHep cerno.CiKrt N1.CiKrt AUC.CiKrt cES.CiKrt
#> 1 0.55046185 0.8558470 2.739698 2 0.4374694 0.6849245
#> 2 0.88903092 0.9598890 441.926264 245 0.5080481 0.9018903
#> 3 0.01685196 0.5672840 35.959098 17 0.4853284 1.0576205
#> 4 0.05792193 0.6766513 144.836259 64 0.5530328 1.1315333
#> 5 0.28389263 0.8385911 909.733432 481 0.5099639 0.9456688
#> 6 0.42382258 0.8385911 727.207958 404 0.4912521 0.9000098
#> P.Value.CiKrt adj.P.Val.CiKrt
#> 1 0.6022861 0.9594682
#> 2 0.9414997 1.0000000
#> 3 0.3768602 0.9454987
#> 4 0.1467167 0.9062450
#> 5 0.8846576 0.9965306
#> 6 0.9805024 1.0000000
The merged dataset must now be filtered. Typically, this is done using the R subset() function and keeping results that have a significant adjusted p-value in one or the other source datasets.
.alpha <- 0.05
Bai_CiHep_v_CiKrt_DN.filt <- subset( Bai_CiHep_v_CiKrt_DN.merge, adj.P.Val.CiHep <= .alpha | adj.P.Val.CiKrt <= .alpha )
yassifyPathways( Bai_CiHep_v_CiKrt_DN.filt,
url_map_list = list( ID = url_from_ID ),
options =list( autoWidth = FALSE,
scrollX = TRUE
)
)
NOTE: In the merged data, you will observe that N1.CiHep and N1.CiKrt are highly correlated, but not identical. This is because the N1 statistic in the CERNO results set is the number of genes for that module that are directly observable in the differential expression data set. In this case, background gene lists for the two data sets, Bai_CiHep_v_Fib2.de and Bai_CiKrt_v_Fib2.de are slightly different, so the N1 values from their respective pathways analyses differ slightly.
First, subset the Bai_gsc.tmod object to include only the gene signatures in the filtered, merged data set.
Bai_CiHep_v_CiKrt_DN.sig.tmod <- Bai_gsc.tmod[ Bai_CiHep_v_CiKrt_DN.filt$ID ]
Bai_CiHep_v_CiKrt_DN.sig.tmod
#> An object of class "tmod"
#> 18 modules, 7399 genes
The two differential expression data sets have largely overlapping but slightly different sets of observable genes. These two background gene lists should give similar, but not identical distance values when generating a gene set network. We’ll use the Bai_CiHep_genes gene list. Using the venn function from the gplots package, we can visualize the overlap of gene sets in the CiHep vs fibroblast 2 and CiKrt vs fibroblast 2 data sets.
Bai_CiHep_genes <- toupper( rownames( Bai_CiHep_v_Fib2.de ) )
Bai_CiKrt_genes <- toupper( rownames( Bai_CiKrt_v_Fib2.de ) )
library( 'gplots' )
#>
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#>
#> lowess
gplots::venn( list(CiHep = Bai_CiHep_genes, CiKrt = Bai_CiKrt_genes) )
We now calculate gene set distances and generate a GSNData object using the buildGeneSetNetworkSTLF() function:
Bai_CiHep_v_CiKrt_DN.GSN <- GSNA::buildGeneSetNetworkSTLF( geneSetCollection = Bai_CiHep_v_CiKrt_DN.sig.tmod, ref.background = Bai_CiHep_genes )
Adding pathways data is a little different from before, since we will need to explicitly specify stat_col, stat_col_2, sig_order and sig_order_2.
Bai_CiHep_v_CiKrt_DN.GSN <- gsnAddPathwaysData( Bai_CiHep_v_CiKrt_DN.GSN,
pathways_data = Bai_CiHep_v_CiKrt_DN.filt,
stat_col = "adj.P.Val.CiHep",
sig_order = "loToHi",
stat_col_2 = "adj.P.Val.CiKrt",
sig_order_2 = "loToHi" )
#> Using generic pathways import.
#> Using:
#> id_col = ID
Lets take a quick look at the GSNData object:
Bai_CiHep_v_CiKrt_DN.GSN
#> GSNData object version: 0.1.4.2
#> Contains data for:
#> 8769 genes.
#> 18 gene sets.
#> Contains the following distance(s):
#> stlf
#> Contains pathways data of type: NULL
#> id_col = ID
#> stat_col = adj.P.Val.CiHep
#> sig_order = loToHi
#> stat_col_2 = adj.P.Val.CiKrt
#> sig_order_2 = loToHi
#> n_col = N1.CiHep
Now, lets pare the network and assign subnets:
Bai_CiHep_v_CiKrt_DN.GSN <- gsnPareNetGenericHierarchic( Bai_CiHep_v_CiKrt_DN.GSN )
Bai_CiHep_v_CiKrt_DN.GSN <- gsnAssignSubnets( Bai_CiHep_v_CiKrt_DN.GSN )
yassifyPathways( gsnMergePathways( Bai_CiHep_v_CiKrt_DN.GSN ),
url_map_list = list( ID = url_from_ID ),
options =list( autoWidth = FALSE,
scrollX = TRUE
) )
Subnets can be output as a table with the gsnSubnetSummary() function.
In 2-Channel network plots, gene set nodes can be color-scaled in two dimensions, based on stat_col and stat_col_2. In the plots below, signatures significant in the CiHep vs fibroblast 2 comparison are strongly red, whereas those significant in CiKrt vs fibroblast 2 are strongly blue. Any signatures significant for both comparisons would be purple. Note the square color legend that indicates significance for both channels.
gsnPlotNetwork( Bai_CiHep_v_CiKrt_DN.GSN, n_col = "N1.CiHep", filename = "Bai_CiHep_v_CiKrt_DN.GSN.NP.png", width = 7, height = 5 )
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
knitr::include_graphics( "Bai_CiHep_v_CiKrt_DN.GSN.NP.png" )
From this analysis, it may be concluded that about 7 subnets or clusters of related gene sets are downregulated in CiHep and CiKrt cells upon chemical induction of fibroblast 2 cells. In particular, the Psmb5 downregulation signature is strong in CiHeps (strong red), whereas the Hes2 signature (strong blue) is strongest in CiKrt cells.
Variation: When available in the pathways data, an alternative statistic can be used to color gene sets in a gene set network. Here, we’re specifying stat_col = ‘AUC.CiHep’, suppressing the previously set value of the second statistical column with stat_col_2 = NA and specifying a custom set of vertex colors used to represent the AUC values, such that the color scale goes from dark blue to white.
gsnPlotNetwork( object = Bai_CiHep_v_CiKrt_DN.GSN,
width = 8,
height = 5,
file = "Bai_CiHep_v_CiKrt_DN.GSN.NP_AUC.png",
show.legend = TRUE,
n_col = "N1.CiHep",
stat_col = 'AUC.CiHep',
sig_order = 'hiToLo',
stat_col_2 = NA,
vertex_colors = c('darkblue', 'blue', '#9999FF', '#DDDDFF', 'white' )
)
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
knitr::include_graphics( path = "Bai_CiHep_v_CiKrt_DN.GSN.NP_AUC.png" )
A hierarchical dendrogram is also revealing. This time, we’ll generate a circular hierarchical dendrogram by including the argument geometry = ‘circular’:
gsnHierarchicalDendrogram( Bai_CiHep_v_CiKrt_DN.GSN,
n_col = "N1.CiHep",
filename = "Bai_CiHep_v_CiKrt_DN.GSN.HD.png",
width = 7,
height = 5,
show.leaves = TRUE,
geometry = 'circular' )
#> Warning in renderCircularDendrogram(dendro = GSN.dend, leaf.names =
#> vertex.names, : Font cex is too large. Scaling label text x 0.396760877446335.
knitr::include_graphics( "Bai_CiHep_v_CiKrt_DN.GSN.HD.png" )
Hierarchical dendrograms display relationships between all the gene sets, whereas the circular network plot show just gene sets connected to their nearest neighbors within subnet, but not connections between the subnets. Otherwise, the data shown are similar, and the gsnHierarchicalDendrogram() function possesses many of the same capabilities as the network plot function, including the ability to specify an alternative statistical columns for leaf coloration and the ability to vary the size of gene set leaves by gene set size.
GSEA is a popular pathways analysis program available from https://www.gsea-msigdb.org/gsea/index.jsp([1],[2]). Users supply a GSC and either a normalized gene expression matrix and a list of sample phenotypes or a “pre-ranked” ordered list of genes; GSEA then uses those data to calculate an enrichment score (ES) statistic and a normalized enrichment score (NES). Positive ES and NES statistics indicate a positive association of gene sets with the data, e.g. gene sets that showed activation or increased expression in a data set or tended toward the beginning of the gene list, whereas negative ES and NES scores indicate a negative association. Using a permutation test, GSEA then estimates p-values (NOM p-val), false discovery rates (FDR q-val), and familywise error rates (FWER q-val) for each gene set association. Because GSEA uses permutation tests to determine p-values, p-values cannot reliably be measured below \(p = 1/ [# permutations]\). As a consequence, many highly significant associations will have p-values estimated to be 0.
When running GSEA, users may specify an output directory; otherwise the data are saved to a subdirectory named “~/gsea_home/output/[date]” within the user’s home directory. Results files are stored in these results directories with names along the lines of “gsea_report_for_Pheno1_1698209272006.tsv”, with “gsea_report_for_” as a prefix, the phenotype being compared, and lastly, numbers + “.tsv” as the suffix. Results files are generally present in pairs appropriate for a given comparison, For a pheno1_versus_pheno2 comparison, there could be a file “gsea_report_for_Pheno1_1698209272006.tsv” with positive ES and NES values, and “gsea_report_for_Pheno2_1698209272006.tsv” file with negative ES and NES values. Such output files can be suitable as inputs for GSNA. Since p-values (and q-values) may not be accurately estimated below \(p = 1/[# permutations]\), NES or ES statistics may be more suitable for use as the primary statistic used for GSEA. Since GSEA separates UP gene sets and DOWN gene sets into two different output files, these may be combined for a comprehensive analysis of activated and repressed gene sets.
The following figure illustrates the relationship between GSEA’s NES and nominal p-value statistics for a concatenated data set derived from a comparison of CiHep versus fibroblast 2 cells. Non-significant and significant results assessed on the basis of FDR q-values are indicated by crosses and solid circles, respectively.
Note that NES values are discontinuous, but have an approximately symmetrical relationship with the corresponding p-values. Note that only values at the extremes of NES have significant FDR q-values.
The primary statistic, stat_col determines how subnets are assigned, with the most significant results being assigned first. In the case of p-values and q-values, the lowest numerical values are most significant, and hence we specify sig_order = ‘loToHi’. In the case of NES and ES, the most significant values are in the extremes. (To do: add support for abs(‘hiToLo’) for this case.)
In this current example, we will use the FDR p-value as the primary statistic for assigning subnet membership, but use NES as the statistic for graphical outputs.
Using GSEA data with GSNA is straightforward and similar to using it with CERNO data. For this example, we will use the following data objects:
Since these two data sets are saved as data objects within the GSNA package, it is unnecessary to load them, but for your own data, you could load them into memory with the read.table() command as shown here:
Bai_CiHep_dorothea_UP.Gsea <- read.table( file = file.path( gsea_outputs_dir,
"gsea_report_for_CiHep_1698197511726.tsv" ),
header = TRUE,
check.names = FALSE,
sep = "\t" )
Bai_CiHep_dorothea_DN.Gsea <- read.table( file = file.path( gsea_outputs_dir,
"gsea_report_for_Fibroblast2_1698197511726.tsv" ),
header = TRUE,
check.names = FALSE,
sep = "\t" )
These two data sets can be considered together. To do so, we’ll use the rbind function to combine the two data sets into one.
Bai_CiHep_dorothea_UD.Gsea <- rbind( Bai_CiHep_dorothea_UP.Gsea, Bai_CiHep_dorothea_DN.Gsea )
yassifyPathways( Bai_CiHep_dorothea_UD.Gsea,
options = list( autoWidth = FALSE,
scrollX = TRUE ),
n = 6 # <= this argument tells the function to just show the first 6 results
)
Note the empty 12th column present in the imported GSEA tsv output
table, plus the GS DETAILS
without useful
information and the
GS<br> follow link to MSigDB
that
duplicates the NAME column. We can clean up this data
somewhat by removing these:
Bai_CiHep_dorothea_UD.Gsea[[12]] <- NULL
Bai_CiHep_dorothea_UD.Gsea$`GS DETAILS` <- NULL
Bai_CiHep_dorothea_UD.Gsea$`GS<br> follow link to MSigDB` <- NULL
We will filter the data sets using the subset command with \(`FDR q-val` <= 0.05\):
Using this filtered GSEA dataset, we will now generate a GSC that contains only the gene sets of interest, i.e. those with significant estimated FDR values. Here we use the gene set names from the GSEA result set as indices for subsetting the Bai_gsc.tmod object.
Bai_gsc.tmod.gsea <- Bai_gsc.tmod[ Bai_CiHep_dorothea_UD.FILT$NAME ]
Bai_gsc.tmod.gsea
#> An object of class "tmod"
#> 68 modules, 2723 genes
NOTE: For this step to work properly, the gene set
names in the GSEA data set must match those in the GSC. However, there
are cases when the name used in the GSEA data set does not match those
used as indices in the GSC. This happens particularly when using a GSEA
dataset is used with a tmod object generated by
importing an MSigDB XML database using the
tmod::tmodImportMSigDB() function. In this case, the
tmod object generated uses
systematic names
as the key, e.g. ‘M22163’ whereas the GSEA
data set contains gene set names
(e.g. ‘GO_MITOCHONDRIAL_RNA_CATABOLIC_PROCESS’). In such a situation, it
is advisable to add a column in the GSEA data set
containing the systematic name
or alternatively, convert
the names in tmod object to appropriate gene set names
(which can be more difficult than the first option).
Once the GSC has been subsetted, we can obtain a background gene list from the gene expression matrix. Here we extract the background gene set from Bai_empty_expr_mat, making sure to convert the gene symbols into upper case.
Now we can calculate a Jaccard distance matrix and create a GSNData object:
Bai_CiHep_dorothea_UD.GSN <- buildGeneSetNetworkJaccard( ref.background = Bai_background_genes,
geneSetCollection = Bai_gsc.tmod.gsea )
Next, we import the filtered pathways data set. Since we are opting to use ‘FDR q-val’ as our primary statistic with sig_order = ‘loToHi’, we can specify that in this command.
The ‘sig_order’ parameter specifies the order in which subnet membership is assigned, and therefore the order that they are reported. It also specifies how they will be colored for output plots, though users can override the automatic plot coloring behavior. In our case, since the NES values are mostly negative in our dataset, we will assign sig_order = ‘loToHi’.
Bai_CiHep_dorothea_UC.GSN <- gsnAddPathwaysData( object = Bai_CiHep_dorothea_UD.GSN,
pathways_data = Bai_CiHep_dorothea_UD.FILT,
stat_col = 'FDR q-val',
sig_order = 'loToHi' )
#> Using GSEA import.
#> Warning in gsnImportGSEA(object = object, pathways_data = pathways_data, : Pathways data are missing the following GSEA fields:GS DETAILS
Bai_CiHep_dorothea_UC.GSN
#> GSNData object version: 0.1.4.2
#> Contains data for:
#> 20035 genes.
#> 68 gene sets.
#> Contains the following distance(s):
#> jaccard
#> Contains pathways data of type: gsea
#> id_col = NAME
#> stat_col = FDR q-val
#> sig_order = loToHi
#> n_col = SIZE
Please ignore any warning about the missing GS DETAILS field, which we removed prior.
NOTE: For p-values and q-values, use sig_order = ‘loToHi’. For data in which high or extreme values are most significant, use ‘hiToLo’. (We may in the future add an option to specifically handle zero-centered extreme values properly in the future as ‘absHiToLo’.)
Next, we pare the network and assign subnets. We’ll use nearest single neighbor pairing for this example, with the gsnPareNetGenericToNearestNNeighbors() function. Among the factors that determine the granularity of the clustering into subnets are the cutoff and extreme parameters. The cutoff parameter specifies the maximal or minimal value of calculated distance for which gene-set groupings can be joined into subnets. More distant connections are removed. For distance metrics like stlf (single tailed log Fisher), lf, etc., for which lower values are considered closer, extreme = ‘min’, hence cutoff is treated as the maximal distance for which gene sets may be joined in a subnet. For distance metrics such as Jaccard index or the Szymkiewicz–Simpson overlap coefficient, for which greater values are closer, extreme = ‘max’, and cutoff is treated as the minimal value of distance for which gene sets may be joined into a subnet.
We have had good results when the cutoff is selected to retain \(1 \over 4\) to \(1 \over 2\) of distances in the distance matrix. We use the gsnDistanceHistogram() function with the argument stat = ‘cumulative’ specified to examine the distribution of distances in the raw distance matrix.
We see that roughly half of the Jaccard values in the data set are 0, so we select a value just above that, 0.02, which retains approximately \(1 \over 4\) of the distances.
Finally we can generate reports for our subnets. With gsnMergePathways(), the results will be ordered on the basis of how subnets were assigned. Here we will do it in two steps. The Bai_CiHep_dorothea_UC.subnets data.frame can be manipulated or reordered as desired, for example with dplyr::arrange, or saved to a file for later processing.
Bai_CiHep_dorothea_UC.subnets <- gsnMergePathways( object = Bai_CiHep_dorothea_UC.GSN )
yassifyPathways( Bai_CiHep_dorothea_UC.subnets,
options =list( autoWidth = FALSE,
scrollX = TRUE
) )
An interpretation of this result set would be that the GSEA pathways results break down into an estimated 16 subnet groupings. These groupings are approximate, and dependent on parameters used for gsnPareNetGenericToNearestNNeighbors(), particularly the cutoff parameter which specifies the minimal or maximal value of the distance parameter for which gene sets can be grouped in a network, depending on the distance metric being used. Within each subnet, the higher significance gene sets may indicate biologically significant associations. In this example GSEA analysis, we used a database with gene sets directed and repressed by particular transcription factors, so in the Bai data, these transcription factors may be considered candidates for mediating the differentiation of fibroblast 2 cells into CiHeps, or hepatocyte like cells. In particular the gene signatures of AR, ELF3, FOXK2, KLF5, KDM5B are downregulated in CiHeps, whereas those of ETV4 & SPI1 are elevated.
We can now use the gsnSubnetSummary() function to generate a concise table of the subnets with summary statistics. By default, the summary statistics calculated are harmonic means and min or max values depending on the value of sig_order. These are generally appropriate for combining and summarizing p-values and q-values, but may not be appropriate for a value such as NES. Here we will specifically specify that the mean and maximal (max) NES values should be calculated instead.
yassifyPathways( gsnSubnetSummary( object = Bai_CiHep_dorothea_UC.GSN, summary_statistics = c("mean", "max") ),
options =list( autoWidth = FALSE,
scrollX = TRUE
) )
To calculate instead the default statistics using
FDR q-val
, we can simply specify that as an argument to
gsnSubnetSummary:
We now plot the network using the default stat_col parameter,
FDR q-val
:
gsnPlotNetwork( object = Bai_CiHep_dorothea_UC.GSN,
n_col = "SIZE",
height = 7,
width = 9,
filename = "Bai_CiHep_dorothea_UC.GSN.DEF.NP.png" )
#> Warning in transform_function(pathways_dat[[stat_col]]): Warning: raw statistic contains zeros. Adding a pseudocount of 0.00011111111
knitr::include_graphics( "Bai_CiHep_dorothea_UC.GSN.DEF.NP.png" )
Now, we can plot the network, coloring nodes on the basis of NES values, with a custom blue, white, red color scale to color the extreme values. We also use transform_function = ‘identity’ to suppress the default log-transformation of the significance statistic.
gsnPlotNetwork( object = Bai_CiHep_dorothea_UC.GSN,
n_col = "SIZE",
stat_col = 'NES',
sig_order = 'hiToLo',
transform_function = identity,
vertex_colors = c("blue", "white", "red"),
height = 7,
width = 9,
filename = "Bai_CiHep_dorothea_UC.GSN.NES.NP.png" )
knitr::include_graphics( "Bai_CiHep_dorothea_UC.GSN.NES.NP.png" )
To obtain a hierarchical dendrogram, the network must be pared using the gsnPareNetGenericHierarchic() function. However, this network was pared using nearest neighbor paring. To generate a hierarchical dendrogram, the gsnPareNetGenericHierarchic() function may be called on the fly, as a nested argument to gsnHierarchicalDendrogram(). This is useful if you wish to see how similar the gene set associations are between the two methods.
gsnHierarchicalDendrogram( object = gsnAssignSubnets(gsnPareNetGenericHierarchic( Bai_CiHep_dorothea_UC.GSN )),
stat_col = 'NES',
sig_order = 'hiToLo',
n_col = "SIZE",
transform_function = identity,
show.leaves = TRUE,
show.legend = TRUE,
leaf_colors = c("blue", "white", "red"),
width = 7,
height = 11,
filename = "Bai_CiHep_dorothea_UC.GSN.HD.png",
lab.cex = 0.8 )
#> Warning in makeLeafSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
knitr::include_graphics( "Bai_CiHep_dorothea_UC.GSN.HD.png" )
One interpretation of these results are that changes in expression associated with the differentiation of fibroblast 2 cells into CiHep cells upon chemical induction are characterized mainly by repression of numerous transcriptional programs such as those directed by the Myc, Ar, KLF5 and TP53 transcription factors, and activation of Spi1 and Etv4.
To cite package ‘GSNA’ in publications use:
Collins, R. D, Urbach, M. J, Racenet, J. Z, Arshad, Umar, Power, A. K, Newman, M. R, Mylvaganam, H. G, Ly, L. N, Lian, Xiaodong, Rull, Anna, Rassadkina, Yelizaveta, Yanez, G. A, Peluso, J. M, Deeks, G. S, Vidal, Francesc, Lichterfeld, Mathias, Yu, G. X, Gaiha, D. G, Allen, M. T, Walker, D. B (2021). “Functional impairment of HIV-specific CD8+ T cells precedes aborted spontaneous control of viremia.” Immunity, S107476132100337X. doi:10.1016/j.immuni.2021.08.007 https://doi.org/10.1016/j.immuni.2021.08.007, https://linkinghub.elsevier.com/retrieve/pii/S107476132100337X.
Collins, R. D, Hitschfel, Julia, Urbach, M. J, Mylvaganam, H. G, Ly, L. N, Arshad, Umar, Racenet, J. Z, Yanez, G. A, Diefenbach, J. T, Walker, D. B (2023). “Cytolytic CD8+ T cells infiltrate germinal centers to limit ongoing HIV replication in spontaneous controller lymph nodes.” Science Immunology, 8(83), eade5872. doi:10.1126/sciimmunol.ade5872 https://doi.org/10.1126/sciimmunol.ade5872, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10231436.