Using the GSNA Package

Jonathan M. Urbach

2023-10-11

GSNA version: 0.1.4.2

Introduction

Pathways analysis is a family of powerful methods that associate experimentally-derived lists of genes with known gene signatures. Such experimental gene lists can be generated by differential expression analysis, GWAS, epigenetic studies such as ATAC-Seq, high-throughput genetic screens and other methods. The gene signatures (often referred to as “gene sets” or “modules”), are likewise derived through a variety of approaches. Some gene signatures represent curated gene sets obtained over decades of careful experimental delineation of metabolic and signaling pathways. Others are derived by high throughput methods such as chemical or genetic perturbation, chromatin-IP, GWAS, genetic screens, or associations with disease states or chromosomal locations. Still others are derived computationally by such methods as in silico identification and prediction of cis-elements. Numerous compendia of gene sets exist (henceforth referred as gene set collections (GSC)), including GeneOntology, Reactome, KEGG Pathways, as well as larger collections that incorporate many such gene set collections and others in larger searchable databases (e.g. MSigDB).

Pathways analysis methods such as GSEA([1],[2]), CERNO([3]), DAVID([4],[5]) and others work by identifying statistical associations of ordered or unordered gene lists with genetic signatures, and returning lists of gene signatures with adjusted estimated p-values. A common pitfall of these methods is that the resulting lists of gene signatures may be complex and difficult to interpret. The reason for this complexity is multifold. Firstly, within GSCs which may include many thousands of gene signatures, there are frequently numerous signatures that are nested or otherwise overlap, sharing many of the same genes. This is easily understood; many of the perturbations used to generate the GSCs can activate the same or overlapping pathways. In addition, multiple separate pathways may be perturbed in a given experimental dataset. A significant challenge in pathways analysis is to determine which of numerous genetic signature “hits” from a given experimental dataset represent overlapping or distinct pathways.

Strategies exist to address this problem. TopGO ([6]) maps hits against nested gene ontology gene sets to the predefined hierarchies of GO terms, and scores gene sets within the hierarchy. However it’s utility is limited to gene ontology gene sets. DAVID ([4],[5]) uses kappa statisics to group gene sets together, but this feature is only usable within the DAVID web application and is not presently applicable to other methods or gene sets beyond those present in DAVID’s online database. The Enrichment Map plugin for Cytoscape ([7]) uses Jaccard or Szymkiewicz–Simpson coefficients as distance metrics to group gene sets together, but it does not actually divide the gene sets into clusters, and is meant primarily for GSEA data. Here, we present a general method for associating gene sets that (1) is modular and independent of pathways-analysis method*, (2) works with arbitrary gene sets/combinations of gene sets, (3) can generate tabular gene set associations and graphical visualizations from pathways data and (4) is accessible via an application programming interface (API). We call this method Gene Set Network Analysis (GSNA).

For best results with this vignette, we suggest running the code blocks listed herein in an R markdown environment in Rstudio. Some commands in this document are specific to R markdown, particularly knitr::include_graphics().

Getting Started with GSNA

Inputs for GSNA

The main inputs into the standard GSNA workflow are:

  1. a filtered pathways result set
  2. a gene set collection containing (only) the gene sets to be analyzed and
  3. a background vector of observed genes

Pathways Data Sets:

Gene set network analysis (GSNA) is meant to be done after pathways analysis has been performed, as a way of simplifying interpretation of a pathways analysis results set. As such, a pathways data set is one of the main inputs for the method.

The GSNA package is designed to seamlessly import data sets from multiple common pathways analysis software packages and functions, including but not limited to GSEA([1],[2]), DAVID([4],[5]), and the tmodCERNOtest() function from the tmod package([3]). Pathways data sets should be in the form of a data.frame. At the very minimum, the results data sets should include a column containing gene set identifiers for the pathway “hits”, a column containing a significance statistic (frequently but not always a p-value or q-value), and optimally also a descriptive title or gene set name. Sometimes the ID and the Title will be the same.

Pathways data sets are normally filtered to include only the gene sets with significant hits, for example p-values below a particular value of α (typically \(\alpha <= 0.05\)). In the case of dual-channel GSNA analysis, the pathways dataset may be the result of a merger of two separate pathways analyses, with two sets of statistical values for comparison.

Gene Set Collections:

Gene set collections (GSCs) should contain only those gene sets that are to be analyzed by GSNA, and only the gene sets that are present in the pathways data set. Generally, this includes gene set hits with significant p- or q-values. Gene set collections may be supplied in any of the three following formats:

  • A named list of character vectors in which the names are the gene set names or IDs and the contents of the vectors are gene symbols.
  • A tmod object as used by the tmod package versions up to version ‘0.46.2’.
  • A tmodGS object as used by the tmod package versions since version ‘0.50.11’.

Typically, the GSCs used for GSNA are not the same as the GSCs used for the initial pathways analysis, but rather they are subsetted from the original gene set collection to contain only the gene sets with significant hits or which are otherwise of interest.

It is important that gene set identifiers (IDs) be the same in the pathways data sets as in the gene set collections, or GSNA will not work. In some cases, it may be necessary to harmonize the gene set IDs from the pathways data sets with those used in the GSC.

Background genes:

The set of background genes is supplied as a character vector containing the the symbols of the genes that were observed in the original analysis performed prior to pathways analysis. If pathways analysis was performed with a differential expression gene set, then the background should be a vector of all the symbols for genes observed in the differential expression data set, regardless of whether they were differentially expressed or not.

Gene identifiers and or symbols in the ref.background vector should match those in the GSC in form and capitalization. In most gene set collections, such as MSigDB, symbols are capitalized. If ENSEMBL gene IDs are used in the gene set collection, the background should be the same.

GSNA Standard Workflow

A standard GSNA analysis consists of the following steps:

  1. Create a GSNA object and calculate a distance matrix based on a desired distance metric using one of the following functions: buildGeneSetNetworkSTLF() (calculates single tail log-Fisher values), buildGeneSetNetworkLF() (calculates partial log Fisher values), buildGeneSetNetworkJaccard() (calculates Jaccard indices), buildGeneSetNetworkOC() (calculates Szymkiewicz–Simpson or “overlap” coefficient), or buildGeneSetNetworkGeneric().
  2. Import a pathways dataset. This is typically done using the gsnAddPathwaysData().
  3. Pare the network using gsnPareNetGenericHierarchic() or gsnPareNetGenericToNearestNNeighbors(). This removes lower value connections between less-closely related gene sets, retaining the connections between the most closely related gene sets.
  4. Assign subnets using gsnAssignSubnets(). In this step, gene sets that are connected after pairing are formally assigned and tallied (this function may be incorporated into the paring step in the future).
  5. Output tabular data on the subnets using gsnMergePathways() or gsnSubnetSummary().
  6. Create graphic outputs for the output data sets using plot.GSNData(), gsnPlotNetwork(), or gsnHierarchicalDendrogram().

A real-life example: Generate a single-channel Gene Set Network with CERNO data

Start by loading the library, as follows:

library(GSNA)
#> Loading GSNA....
#> For a vignette, run:
#>   vignette( "using_the_gsna_package" )

The GSNA package supports pathways datasets generated by GSEA, DAVID, and the CERNO method (from the tmod package). In addition, the GSNA package includes a function to perform a Fisher exact test of overrepresentation (similar to DAVID) and supports the output of that function. Other tabular pathways data types are supported by specifying the id_col, stat_col, sig_order arguments, and for 2-channel datasets, the stat_col_2, sig_order_2 fields, or otherwise using the gsnImportGenericPathway() function.

In this vignette, we will use differential expression and pathways data derived from the publicly-available dataset of Bai et al.([8]). In their study, transcriptomic data were derived from the treatment of mouse fibroblast type 2 cells with a chemical cocktail to induce differentiation into multiple cell types, including hepatocyte-like cells (CiHep) and keratinocyte-like cells (CiKrt).

In the sample data provided, gene expression in CiHep cells and expression in CiKrt cells are both compared to the fibroblast type 2 cell type, generating two sets of differential expression data to profile both cell type transitions. These two transitions, fibroblast 2 to CiHep and fibroblast 2 to CiKrt cells were further analyzed by performing pathways analysis using a gene set collection containing sets of genes activated or repressed dependent on particular transcription factors derived from MSigDB signatures as well as from the dorothea data set([9]). The main signatures observed in this dataset are repressed gene signatures, genes whose expression went down upon treatment with the chemical cocktail.

The Bai results set contains the following data objects:

When the GSNA library is loaded, these objects are lazy-loaded into the environment, and are then ready for use without calling data().

For the first example analysis, we are performing a single-channel GSNA analysis with gene signatures in CERNO data that were down in CiHep cells relative to fibroblast 2 cells. We will use the following data objects for this first analysis:

Preparing and Filtering Data

Let’s begin by inspecting the structure of this CERNO data set using the head function.

head( Bai_CiHep_DN.cerno )
#>            ID                Title      cerno  N1       AUC      cES
#> M30131 M30131   Psmb5 target genes  646.75388 183 0.6176900 1.767087
#> M29968 M29968   Foxe1 target genes 1183.78772 439 0.5851964 1.348278
#> M30171 M30171 Snrnp70 target genes 1329.53093 515 0.5532870 1.290807
#> M40742 M40742  Gtf2a2 target genes  815.02548 294 0.5674120 1.386098
#> M40770 M40770 Atxn7l3 target genes  568.41196 196 0.5942726 1.450031
#> M30055 M30055  Map2k1 target genes   63.07777   7 0.8660123 4.505555
#>             P.Value    adj.P.Val
#> M30131 7.261354e-18 2.102162e-14
#> M29968 1.992487e-11 2.884125e-08
#> M30171 6.613673e-10 6.382194e-07
#> M40742 1.433739e-09 1.037669e-06
#> M40770 1.285708e-08 7.444249e-06
#> M30055 3.361872e-08 1.622103e-05

This CERNO result set includes a gene set ID as well as several statistical values including the cerno, AUC, cES, and both standard (nominal) and adjusted p-values. We will use the adjusted p-values in our analysis.

The gene set collection (GSC) object included in with the GSNA package is a tmod class object. However, as of version 0.50.11, the tmod package used in this vignette uses a new version of its gene set collection object, a tmodGS object. If you’re using version 0.50.11 of the tmod package, convert the Bai_gsc.tmod into a tmodGS object. The following code snippet includes a check whether your tmod version is “0.50.11” or later, and if so, runs tmod2tmodGS() to update the format of Bai_gsc.tmod to class tmodGS.

if( utils::packageVersion( pkg = "tmod" ) >= "0.50.11" ){
  Bai_gsc.tmod <- tmod::tmod2tmodGS( Bai_gsc.tmod )
}

Now, we can inspect the Bai_gsc.tmod object to see that it is indeed a tmod or tmodGS class object, and also see how many gene sets (referred to as modules) and genes it contains:

Bai_gsc.tmod
#> An object of class "tmod"
#>  94 modules, 9002 genes

Now we can filter the Bai_CiHep_DN.cerno data set for gene signatures that have significant adjusted p-values. We will use \(\alpha = 0.05\) for our significance cutoff. This filtered CERNO data set will be used

.alpha <- 0.05

Bai_CiHep_DN.cerno.sig <- subset( Bai_CiHep_DN.cerno, adj.P.Val <= .alpha )

The ID column in a cerno dataset contains the the ID of its respective gene set, which can then be used for subsetting a gene set collection list, or a tmod or tmodGS object.:

Bai_CiHep_DN.sigid <- Bai_CiHep_DN.cerno.sig$ID

Bai_CiHep_DN.sigid
#>  [1] "M30131" "M29968" "M30171" "M40742" "M40770" "M30055" "M30117" "M30190"
#>  [9] "M30154" "M30186" "M29922" "M29984" "M40804" "M40775" "M40776" "M40846"

Now, we subset the Bai_gsc.tmod gene set collection to obtain a collection with only significant hits using the gene set IDs as keys for the tmod object. Subsetting tmod or tmodGS objects works using the list subset operator [], where the IDs are treated as indices.

Bai_CiHep_DN.sig.tmod <- Bai_gsc.tmod[Bai_CiHep_DN.sigid]
Bai_CiHep_DN.sig.tmod
#> An object of class "tmod"
#>  16 modules, 6413 genes

Variation:

Now that we have a filtered CERNO data set and a GSC containing only significant gene sets, we need the background of observable genes in our expression set. Typically, this is obtained from the differential expression data. In the case of the Bai dataset, we have expression data in the form of a Seurat data set obtained using Seurat’s FindMarkers() function, with gene symbols as the row names. In this data, gene symbols are in “title case”, with only the first letter capitalized, but gene symbols in the gene set collection are all upper case, so we convert the genes in the Bai data set to upper case too:

Bai_CiHep_genes <- toupper( rownames( Bai_CiHep_v_Fib2.de ) )

Creating a Gene Set Network

Use the buildGeneSetNetworkSTLF() function to create a GSNData object and calculate a distance matrix. This first step requires just the filtered GSC and the reference background.

Bai_CiHep_DN.sig.GSN <- buildGeneSetNetworkSTLF( ref.background = Bai_CiHep_genes,
                                                 geneSetCollection = Bai_CiHep_DN.sig.tmod )

Bai_CiHep_DN.sig.GSN
#> GSNData object version: 0.1.4.2 
#>   Contains data for:
#>      8769 genes.
#>      16 gene sets.
#>   Contains the following distance(s):
#>      stlf

Variations:

Once we have a GSNData object containing a distance matrix, we import the pathways results data set as a data.frame. This should include an ID column and a statistical column for ordering the data by significance. Such a statistical column is generally an adjusted p-value. When pathways data are imported, the id_col, stat_col, and sig_order fields must also be set, but for known data set types (CERNO, GSEA, DAVID), defaults for these can be set automatically. NOTE: the id_col is the ID used by the gene set collection whether that is in the form of a tmod or tmodGS object, or when the gene set collection is supplied as a named list of gene set character vectors.

Bai_CiHep_DN.sig.GSN <- gsnAddPathwaysData( object = Bai_CiHep_DN.sig.GSN,
                                            pathways_data = Bai_CiHep_DN.cerno.sig )
#> Using CERNO import.

Bai_CiHep_DN.sig.GSN
#> GSNData object version: 0.1.4.2 
#>   Contains data for:
#>      8769 genes.
#>      16 gene sets.
#>   Contains the following distance(s):
#>      stlf
#>   Contains pathways data of type:  cerno 
#>      id_col = ID 
#>      stat_col = adj.P.Val 
#>      sig_order = loToHi 
#>      n_col = N1

The gsnAddPathwaysData() function automatically identifies the data set as a CERNO data set, and selects ‘ID’ as the identifier column (id_col), ‘adj.P.Val’ as the statistical column (stat_col) to be used, and the ordering of that column (sig_order) as ‘loToHi’ meaning that low values are treated as most significant. This value ‘loToHi’ is appropriate when p-values or adjusted p-values are used as the statistical column.

NOTE: In the case of most datasets generated by GSEA, CERNO, DAVID or GSNORA, the gsnAddPathwaysData() function can correctly identify the data-type and select appropriate values of stat_col and sig_order. However, in the case of data with merged and/or renamed columns, the appropriate values of stat_col and sig_order may need to be provided.

Now, pare the network using hierarchical clustering:

Bai_CiHep_DN.sig.GSN <- gsnPareNetGenericHierarchic( object = Bai_CiHep_DN.sig.GSN )

Variations

Once we have a pared distance matrix, we can assign subnets.

Bai_CiHep_DN.sig.GSN <- gsnAssignSubnets( object = Bai_CiHep_DN.sig.GSN )

Outputting Subnet Data

Once subnets have been assigned, we can output a set of them in tabular format. We use gsnMergePathways to create a table of the pathways results organized into subnets. This table includes, in addition to the pathways results, subnet and subnetRank columns.

Bai_CiHep_DN.sig.GSN.subnets <- gsnMergePathways(object = Bai_CiHep_DN.sig.GSN )

Bai_CiHep_DN.sig.GSN.subnets
#>    subnet subnetRank     ID                Title      cerno  N1       AUC
#> 1       1          1 M30131   Psmb5 target genes  646.75388 183 0.6176900
#> 2       1          2 M29968   Foxe1 target genes 1183.78772 439 0.5851964
#> 3       1          3 M30171 Snrnp70 target genes 1329.53093 515 0.5532870
#> 4       1          4 M40742  Gtf2a2 target genes  815.02548 294 0.5674120
#> 5       1          5 M40770 Atxn7l3 target genes  568.41196 196 0.5942726
#> 6       1          6 M30190   Taf9b target genes  907.13061 356 0.5603921
#> 7       1          7 M30154   Setd7 target genes 1305.71154 537 0.5496645
#> 8       1          8 M30186 Supt16h target genes 2028.89035 875 0.5533285
#> 9       1          9 M29922  Chaf1b target genes 1189.28971 490 0.5546465
#> 10      1         10 M29984  Gtf2e2 target genes  574.78755 217 0.5759383
#> 11      2          1 M30055  Map2k1 target genes   63.07777   7 0.8660123
#> 12      3          1 M30117    Per1 target genes  101.39983  20 0.7569837
#> 13      4          1 M40804    Zzz3 target genes  236.79730  75 0.6266697
#> 14      4          2 M40775  Znf318 target genes  668.08996 261 0.5566422
#> 15      5          1 M40776  Hmbox1 target genes 1493.07674 652 0.5412185
#> 16      6          1 M40846   Prdm4 target genes  555.05334 222 0.5789229
#>         cES      P.Value    adj.P.Val
#> 1  1.767087 7.261354e-18 2.102162e-14
#> 2  1.348278 1.992487e-11 2.884125e-08
#> 3  1.290807 6.613673e-10 6.382194e-07
#> 4  1.386098 1.433739e-09 1.037669e-06
#> 5  1.450031 1.285708e-08 7.444249e-06
#> 6  1.274060 8.788686e-07 3.180406e-04
#> 7  1.215746 1.342491e-06 4.318345e-04
#> 8  1.159366 3.465626e-06 1.003299e-03
#> 9  1.213561 4.451882e-06 1.171654e-03
#> 10 1.324395 6.384149e-06 1.540176e-03
#> 11 4.505555 3.361872e-08 1.622103e-05
#> 12 2.534996 3.075072e-07 1.271762e-04
#> 13 1.578649 7.938417e-06 1.767824e-03
#> 14 1.279866 1.473262e-05 3.046495e-03
#> 15 1.144998 1.925234e-04 3.715701e-02
#> 16 1.250120 2.538713e-04 4.593483e-02

Within R markdown documents, data frames can be reformatted for output using the yassifyPathways() function. In addition to providing a better output format, the values in the ID column can be nested within links to information on the corresponding gene sets using the url_map_list argument. The argument takes a list containing named vectors. The vectors are named according to the ID column to be substituted with a link, and the vectors contain name-value pairs in which the names are the IDs to be substituted and the values are the corresponding link URL.

In the following code block, the structure function is used to generate the ID/URL pairs from values in the MODULES or gv field of the Bai_gsc.tmod object, depending on your installed version of tmod.

if( utils::packageVersion( pkg = "tmod" ) >= "0.50.11" ){
  #url_map_l <- list( ID = with( Bai_gsc.tmod$gs, structure( URL, names = ID ) ))
  # For tmod version 0.50.11 and later.
  url_from_ID <- with( Bai_gsc.tmod$gs, structure( URL, names = ID ) )
} else {
  #url_map_l <- list( ID = with( Bai_gsc.tmod$MODULES, structure( URL, names = ID ) ))
  # For earlier versions:
  url_from_ID <- with( Bai_gsc.tmod$MODULES, structure( URL, names = ID ) )
}

We can inspect what this vector looks like, by looking at some of the first name/“URL” pairs:

head( url_from_ID  )
#>                                                                          M29994 
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M29994" 
#>                                                                          M40825 
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M40825" 
#>                                                                          M30131 
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M30131" 
#>                                                                          M29968 
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M29968" 
#>                                                                          M30171 
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M30171" 
#>                                                                          M40742 
#> "http://www.gsea-msigdb.org/gsea/msigdb/geneset_page.jsp?systematicName=M40742"

Now, calling the yassifyPathways() function thereby yields a table with improved format and links to the corresponding URLs describing the gene sets in question.

yassifyPathways( Bai_CiHep_DN.sig.GSN.subnets,
                 url_map_list = list( ID = url_from_ID ) )

Note the url_map_list argument results in links for the gene set IDs.

Another, more summarized report of subnets can be obtained using gsnSubnetSummary(). This report groups subnets together and gives some additional statistics relating to the subnets.

Here we use both the argument url_map_list = list( Seed.ID = url_from_ID ), to map names to URLs in the url_from_ID vector using the entire Seed.ID field, and url_map_by_words_list = list( IDs = url_from_ID ) argument to map using individual words within the IDs field. Use url_map_by_words_list when a data.frame column may contain more than one single mappable ID per row.

NOTE: In addition to its own arguments, yassifyPathways can pass arguments (like options) to DT::datatable() to allow additional control of table characteristics and formatting. In the example below, the options argument is used to provide horizontal scrolling.

Bai_CiHep_DN.sig.GSN.subnetSummary <- gsnSubnetSummary( object = Bai_CiHep_DN.sig.GSN )

yassifyPathways( Bai_CiHep_DN.sig.GSN.subnetSummary,
                 url_map_list = list( Seed.ID = url_from_ID ),
                 url_map_by_words_list = list( IDs = url_from_ID ),
                 options =list( autoWidth = FALSE,
                                scrollX = TRUE
                                )
                 )

Try horizontally scrolling this table to see all columns.

Plotting the Network

Once the gene set network has been pared and subnets assigned, the plot.GSNData() method (called simply as plot()) can be used to generate plots of GSNData objects as a circular network plot. Generating a plot directly within an R graphical environment can work, but for R markdown, output to an file and reading the image in through knitr::include_graphics() often produces better results. In this vignette, because of package space limitations, we’re generating PNG bitmap formats, but for high quality vector images, SVG or PDF outputs may be preferable. NOTE: The , , and functions can select an output format based on the presence of a supported file name suffix (.pdf, .png, .svg), so to output an SVG format to a file, you could simply give the file a name ending in . (Work is currently underway to improve automatic setting of graphical parameters by GSNA, but currently, tweaking graphical parameters can improve rendering.)

gsnPlotNetwork( object = Bai_CiHep_DN.sig.GSN, filename = "Bai_CiHep_DN.sig.GSN.PL.png", width = 9, height = 7, n_col = 'N1' ) -> out
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.

knitr::include_graphics( "Bai_CiHep_DN.sig.GSN.PL.png" )

Note that the n_col = ‘N1’ argument tells the plot function to scale the size of vertices according to the values of the ‘N1’ column of the pathways data, which indicates the number of genes in a given gene signature.

To generate a hierarchical dendrogram of distances between gene sets, the gsnHierarchicalDendrogram() function can be used.


gsnHierarchicalDendrogram( object = Bai_CiHep_DN.sig.GSN,
                           width = 8,
                           height = 5,
                           filename = "Bai_CiHep_DN.sig.GSN.HD.png",
                           show.leaves = TRUE,
                           show.legend = TRUE,
                           n_col = 'N1',
                           leaf_cex_range = c(0.3,1.6) )

knitr::include_graphics( path = "Bai_CiHep_DN.sig.GSN.HD.png" )

On the basis of this output, it could be concluded that there are 6 subnets or clusters of gene signatures that are downregulated upon chemical induction of CiHep cells from fibroblast2 cells, with subnet 1 being the most significant, and within that, the M30131 Psmb5 signature being particularly strong.

Variation: A hierarchical dendrogram with a circular geometry can be generated with gsnHierarchicalDendrogram() using the geometry = ‘circular’ argument.

Generating a 2-channel Gene Set Network with CERNO data

Two channel gene set networks are useful for comparing differential pathway activation or repression between conditions, treatments or transitions between cell types. This mainly applies to cases where two distinct cellular states are derived from a single reference state or cell type. In the Bai data set, the fibroblast 2 cellular subset differentiates into hepatocyte-like (CiHep), keratinocyte-like (CiKrt) and several other cell types upon chemical induction. In an analysis of this data set, comparisons can be made between the transition from the fibroblast 2 cell type to both the CiHep and CiKrt induced cell types.

A 2-channel Gene Set Network workflow is very similar to a single channel workflow. The main difference is that two parallel significance statistics are used in the pathways dataset instead of a single significance statistic. These two significance statistics are derived from the two pathways datasets being compared. Subsequent steps for the analysis are nearly the same.

For the 2-channel analysis, we will use the following data objects for this first analysis:

Merging Bai_CiHep_DN.cerno and Bai_CiKrt_DN.cerno

The 2-channel workflow begins with the merging of the two unfiltered parallel pathways data sets. Importantly, by “unfiltered”, we mean that the two pathways data sets should contain scores and p-values for all the gene sets in the gene set collection. For CERNO, such complete unfiltered result sets may be obtained by including the argument qval = 1.1 when calling the tmod package’s tmodCERNOtest() function. Using unfiltered pathways results sets allows us to generate a merged data set that includes gene signatures that are significant in either one or both of the two pathways results sets.

We start with two, complete unfiltered data sets containing all the CERNO results for a given GSC.

Bai_CiHep_v_CiKrt_DN.merge <- merge( x = Bai_CiHep_DN.cerno,
                                     y = Bai_CiKrt_DN.cerno,
                                     by = 'ID' )

For columns common to both CERNO datasets, the names are suffixed with .x and .y depending on which argument was used. To clean up this data, we can delete columns not of interest and give the columns with .x and .y suffixes better names, using the rename function from the dplyr package:

Bai_CiHep_v_CiKrt_DN.merge$Title.y <- NULL

library( 'dplyr' )
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
Bai_CiHep_v_CiKrt_DN.merge <- dplyr::rename( .data = Bai_CiHep_v_CiKrt_DN.merge,
                                             Title = Title.x,
                                             cerno.CiHep = cerno.x,
                                             N1.CiHep = N1.x,
                                             AUC.CiHep = AUC.x,
                                             cES.CiHep = cES.x,
                                             P.Value.CiHep = P.Value.x,
                                             adj.P.Val.CiHep = adj.P.Val.x,
                                             cerno.CiKrt = cerno.y,
                                             N1.CiKrt = N1.y,
                                             AUC.CiKrt = AUC.y,
                                             cES.CiKrt = cES.y,
                                             P.Value.CiKrt = P.Value.y,
                                             adj.P.Val.CiKrt = adj.P.Val.y
                                             )

head( Bai_CiHep_v_CiKrt_DN.merge )
#>       ID                 Title cerno.CiHep N1.CiHep AUC.CiHep cES.CiHep
#> 1 M29881  Adcyap1 target genes    1.193995        1 0.4495894 0.5969976
#> 2 M29885   Alkbh3 target genes  490.585215      265 0.4912303 0.9256325
#> 3 M29886     Alx4 target genes   63.730636       21 0.5723213 1.5173961
#> 4 M29887 Arhgap35 target genes  154.090957       64 0.5634298 1.2038356
#> 5 M29889   Arid5b target genes 1045.345896      510 0.5237974 1.0248489
#> 6 M29890    Arnt2 target genes  859.285215      426 0.5053423 1.0085507
#>   P.Value.CiHep adj.P.Val.CiHep cerno.CiKrt N1.CiKrt AUC.CiKrt cES.CiKrt
#> 1    0.55046185       0.8558470    2.739698        2 0.4374694 0.6849245
#> 2    0.88903092       0.9598890  441.926264      245 0.5080481 0.9018903
#> 3    0.01685196       0.5672840   35.959098       17 0.4853284 1.0576205
#> 4    0.05792193       0.6766513  144.836259       64 0.5530328 1.1315333
#> 5    0.28389263       0.8385911  909.733432      481 0.5099639 0.9456688
#> 6    0.42382258       0.8385911  727.207958      404 0.4912521 0.9000098
#>   P.Value.CiKrt adj.P.Val.CiKrt
#> 1     0.6022861       0.9594682
#> 2     0.9414997       1.0000000
#> 3     0.3768602       0.9454987
#> 4     0.1467167       0.9062450
#> 5     0.8846576       0.9965306
#> 6     0.9805024       1.0000000

Filtering the merged dataset

The merged dataset must now be filtered. Typically, this is done using the R subset() function and keeping results that have a significant adjusted p-value in one or the other source datasets.

.alpha <- 0.05

Bai_CiHep_v_CiKrt_DN.filt <- subset( Bai_CiHep_v_CiKrt_DN.merge, adj.P.Val.CiHep <= .alpha | adj.P.Val.CiKrt <= .alpha )

yassifyPathways( Bai_CiHep_v_CiKrt_DN.filt,
                 url_map_list = list( ID = url_from_ID ),
                 options =list( autoWidth = FALSE,
                                scrollX = TRUE
                                )
                 )

NOTE: In the merged data, you will observe that N1.CiHep and N1.CiKrt are highly correlated, but not identical. This is because the N1 statistic in the CERNO results set is the number of genes for that module that are directly observable in the differential expression data set. In this case, background gene lists for the two data sets, Bai_CiHep_v_Fib2.de and Bai_CiKrt_v_Fib2.de are slightly different, so the N1 values from their respective pathways analyses differ slightly.

Creating a GSNData Network

First, subset the Bai_gsc.tmod object to include only the gene signatures in the filtered, merged data set.

Bai_CiHep_v_CiKrt_DN.sig.tmod <- Bai_gsc.tmod[ Bai_CiHep_v_CiKrt_DN.filt$ID ]

Bai_CiHep_v_CiKrt_DN.sig.tmod
#> An object of class "tmod"
#>  18 modules, 7399 genes

The two differential expression data sets have largely overlapping but slightly different sets of observable genes. These two background gene lists should give similar, but not identical distance values when generating a gene set network. We’ll use the Bai_CiHep_genes gene list. Using the venn function from the gplots package, we can visualize the overlap of gene sets in the CiHep vs fibroblast 2 and CiKrt vs fibroblast 2 data sets.

Bai_CiHep_genes <- toupper( rownames( Bai_CiHep_v_Fib2.de ) )
Bai_CiKrt_genes <- toupper( rownames( Bai_CiKrt_v_Fib2.de ) )

library( 'gplots' )
#> 
#> Attaching package: 'gplots'
#> The following object is masked from 'package:stats':
#> 
#>     lowess
gplots::venn( list(CiHep = Bai_CiHep_genes, CiKrt = Bai_CiKrt_genes) )

We now calculate gene set distances and generate a GSNData object using the buildGeneSetNetworkSTLF() function:

Bai_CiHep_v_CiKrt_DN.GSN <- GSNA::buildGeneSetNetworkSTLF( geneSetCollection = Bai_CiHep_v_CiKrt_DN.sig.tmod, ref.background = Bai_CiHep_genes )

Adding pathways data is a little different from before, since we will need to explicitly specify stat_col, stat_col_2, sig_order and sig_order_2.

Bai_CiHep_v_CiKrt_DN.GSN <- gsnAddPathwaysData( Bai_CiHep_v_CiKrt_DN.GSN,
                                                pathways_data = Bai_CiHep_v_CiKrt_DN.filt,
                                                stat_col = "adj.P.Val.CiHep",
                                                sig_order = "loToHi",
                                                stat_col_2 = "adj.P.Val.CiKrt",
                                                sig_order_2 = "loToHi" )
#> Using generic pathways import.
#> Using:
#>  id_col = ID

Lets take a quick look at the GSNData object:

Bai_CiHep_v_CiKrt_DN.GSN
#> GSNData object version: 0.1.4.2 
#>   Contains data for:
#>      8769 genes.
#>      18 gene sets.
#>   Contains the following distance(s):
#>      stlf
#>   Contains pathways data of type:  NULL 
#>      id_col = ID 
#>      stat_col = adj.P.Val.CiHep 
#>      sig_order = loToHi 
#>      stat_col_2 = adj.P.Val.CiKrt 
#>      sig_order_2 = loToHi 
#>      n_col = N1.CiHep

Now, lets pare the network and assign subnets:

Bai_CiHep_v_CiKrt_DN.GSN <- gsnPareNetGenericHierarchic( Bai_CiHep_v_CiKrt_DN.GSN )

Bai_CiHep_v_CiKrt_DN.GSN <- gsnAssignSubnets( Bai_CiHep_v_CiKrt_DN.GSN )

yassifyPathways( gsnMergePathways( Bai_CiHep_v_CiKrt_DN.GSN  ),
                 url_map_list = list( ID = url_from_ID ),
                 options =list( autoWidth = FALSE,
                                scrollX = TRUE
                                ) )

Subnets can be output as a table with the gsnSubnetSummary() function.

yassifyPathways( gsnSubnetSummary( Bai_CiHep_v_CiKrt_DN.GSN  ), 
                 url_map_list = list( Seed.ID = url_from_ID ),
                 url_map_by_words_list = list( IDs = url_from_ID ),
                 options =list( autoWidth = FALSE,
                                scrollX = TRUE
                                ) )

Plotting the 2-Channel Network

In 2-Channel network plots, gene set nodes can be color-scaled in two dimensions, based on stat_col and stat_col_2. In the plots below, signatures significant in the CiHep vs fibroblast 2 comparison are strongly red, whereas those significant in CiKrt vs fibroblast 2 are strongly blue. Any signatures significant for both comparisons would be purple. Note the square color legend that indicates significance for both channels.

gsnPlotNetwork( Bai_CiHep_v_CiKrt_DN.GSN, n_col = "N1.CiHep", filename = "Bai_CiHep_v_CiKrt_DN.GSN.NP.png", width = 7, height = 5 )
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.

knitr::include_graphics( "Bai_CiHep_v_CiKrt_DN.GSN.NP.png" )

From this analysis, it may be concluded that about 7 subnets or clusters of related gene sets are downregulated in CiHep and CiKrt cells upon chemical induction of fibroblast 2 cells. In particular, the Psmb5 downregulation signature is strong in CiHeps (strong red), whereas the Hes2 signature (strong blue) is strongest in CiKrt cells.

Variation: When available in the pathways data, an alternative statistic can be used to color gene sets in a gene set network. Here, we’re specifying stat_col = ‘AUC.CiHep’, suppressing the previously set value of the second statistical column with stat_col_2 = NA and specifying a custom set of vertex colors used to represent the AUC values, such that the color scale goes from dark blue to white.


gsnPlotNetwork( object = Bai_CiHep_v_CiKrt_DN.GSN,
                width = 8,
                height = 5,
                file = "Bai_CiHep_v_CiKrt_DN.GSN.NP_AUC.png",
                show.legend = TRUE,
                n_col = "N1.CiHep",
                stat_col = 'AUC.CiHep',
                sig_order = 'hiToLo',
                stat_col_2 = NA,
                vertex_colors = c('darkblue', 'blue', '#9999FF',  '#DDDDFF', 'white' )
)
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.
#> Warning in makeNodeSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.

knitr::include_graphics( path = "Bai_CiHep_v_CiKrt_DN.GSN.NP_AUC.png" )

A hierarchical dendrogram is also revealing. This time, we’ll generate a circular hierarchical dendrogram by including the argument geometry = ‘circular’:

gsnHierarchicalDendrogram( Bai_CiHep_v_CiKrt_DN.GSN,
                           n_col = "N1.CiHep",
                           filename = "Bai_CiHep_v_CiKrt_DN.GSN.HD.png",
                           width = 7,
                           height = 5,
                           show.leaves = TRUE,
                           geometry = 'circular' )
#> Warning in renderCircularDendrogram(dendro = GSN.dend, leaf.names =
#> vertex.names, : Font cex is too large. Scaling label text x 0.396760877446335.

knitr::include_graphics( "Bai_CiHep_v_CiKrt_DN.GSN.HD.png" )

Hierarchical dendrograms display relationships between all the gene sets, whereas the circular network plot show just gene sets connected to their nearest neighbors within subnet, but not connections between the subnets. Otherwise, the data shown are similar, and the gsnHierarchicalDendrogram() function possesses many of the same capabilities as the network plot function, including the ability to specify an alternative statistical columns for leaf coloration and the ability to vary the size of gene set leaves by gene set size.

Using GSNA with GSEA Data

GSEA is a popular pathways analysis program available from https://www.gsea-msigdb.org/gsea/index.jsp([1],[2]). Users supply a GSC and either a normalized gene expression matrix and a list of sample phenotypes or a “pre-ranked” ordered list of genes; GSEA then uses those data to calculate an enrichment score (ES) statistic and a normalized enrichment score (NES). Positive ES and NES statistics indicate a positive association of gene sets with the data, e.g. gene sets that showed activation or increased expression in a data set or tended toward the beginning of the gene list, whereas negative ES and NES scores indicate a negative association. Using a permutation test, GSEA then estimates p-values (NOM p-val), false discovery rates (FDR q-val), and familywise error rates (FWER q-val) for each gene set association. Because GSEA uses permutation tests to determine p-values, p-values cannot reliably be measured below \(p = 1/ [# permutations]\). As a consequence, many highly significant associations will have p-values estimated to be 0.

When running GSEA, users may specify an output directory; otherwise the data are saved to a subdirectory named “~/gsea_home/output/[date]” within the user’s home directory. Results files are stored in these results directories with names along the lines of “gsea_report_for_Pheno1_1698209272006.tsv”, with “gsea_report_for_” as a prefix, the phenotype being compared, and lastly, numbers + “.tsv” as the suffix. Results files are generally present in pairs appropriate for a given comparison, For a pheno1_versus_pheno2 comparison, there could be a file “gsea_report_for_Pheno1_1698209272006.tsv” with positive ES and NES values, and “gsea_report_for_Pheno2_1698209272006.tsv” file with negative ES and NES values. Such output files can be suitable as inputs for GSNA. Since p-values (and q-values) may not be accurately estimated below \(p = 1/[# permutations]\), NES or ES statistics may be more suitable for use as the primary statistic used for GSEA. Since GSEA separates UP gene sets and DOWN gene sets into two different output files, these may be combined for a comprehensive analysis of activated and repressed gene sets.

The following figure illustrates the relationship between GSEA’s NES and nominal p-value statistics for a concatenated data set derived from a comparison of CiHep versus fibroblast 2 cells. Non-significant and significant results assessed on the basis of FDR q-values are indicated by crosses and solid circles, respectively.

Note that NES values are discontinuous, but have an approximately symmetrical relationship with the corresponding p-values. Note that only values at the extremes of NES have significant FDR q-values.

The primary statistic, stat_col determines how subnets are assigned, with the most significant results being assigned first. In the case of p-values and q-values, the lowest numerical values are most significant, and hence we specify sig_order = ‘loToHi’. In the case of NES and ES, the most significant values are in the extremes. (To do: add support for abs(‘hiToLo’) for this case.)

In this current example, we will use the FDR p-value as the primary statistic for assigning subnet membership, but use NES as the statistic for graphical outputs.

Generating a single-channel Gene Set Network with GSEA data

Using GSEA data with GSNA is straightforward and similar to using it with CERNO data. For this example, we will use the following data objects:

Since these two data sets are saved as data objects within the GSNA package, it is unnecessary to load them, but for your own data, you could load them into memory with the read.table() command as shown here:

Bai_CiHep_dorothea_UP.Gsea <- read.table( file = file.path( gsea_outputs_dir,
                                                            "gsea_report_for_CiHep_1698197511726.tsv" ),
                                          header = TRUE,
                                          check.names = FALSE,
                                          sep = "\t" )

Bai_CiHep_dorothea_DN.Gsea <- read.table( file = file.path( gsea_outputs_dir,
                                                            "gsea_report_for_Fibroblast2_1698197511726.tsv" ),
                                          header = TRUE,
                                          check.names = FALSE,
                                          sep = "\t" )

These two data sets can be considered together. To do so, we’ll use the rbind function to combine the two data sets into one.

Bai_CiHep_dorothea_UD.Gsea <- rbind( Bai_CiHep_dorothea_UP.Gsea, Bai_CiHep_dorothea_DN.Gsea )

yassifyPathways( Bai_CiHep_dorothea_UD.Gsea,
                 options = list( autoWidth = FALSE,
                                 scrollX = TRUE ),
                 n = 6   # <= this argument tells the function to just show the first 6 results
                 )

Note the empty 12th column present in the imported GSEA tsv output table, plus the GS DETAILS without useful information and the GS<br> follow link to MSigDB that duplicates the NAME column. We can clean up this data somewhat by removing these:

Bai_CiHep_dorothea_UD.Gsea[[12]] <- NULL
Bai_CiHep_dorothea_UD.Gsea$`GS DETAILS` <- NULL
Bai_CiHep_dorothea_UD.Gsea$`GS<br> follow link to MSigDB` <- NULL

We will filter the data sets using the subset command with \(`FDR q-val` <= 0.05\):

Bai_CiHep_dorothea_UD.FILT <- subset( Bai_CiHep_dorothea_UD.Gsea, `FDR q-val` <= 0.05 )

Using this filtered GSEA dataset, we will now generate a GSC that contains only the gene sets of interest, i.e. those with significant estimated FDR values. Here we use the gene set names from the GSEA result set as indices for subsetting the Bai_gsc.tmod object.

Bai_gsc.tmod.gsea <- Bai_gsc.tmod[ Bai_CiHep_dorothea_UD.FILT$NAME ]

Bai_gsc.tmod.gsea
#> An object of class "tmod"
#>  68 modules, 2723 genes

NOTE: For this step to work properly, the gene set names in the GSEA data set must match those in the GSC. However, there are cases when the name used in the GSEA data set does not match those used as indices in the GSC. This happens particularly when using a GSEA dataset is used with a tmod object generated by importing an MSigDB XML database using the tmod::tmodImportMSigDB() function. In this case, the tmod object generated uses systematic names as the key, e.g. ‘M22163’ whereas the GSEA data set contains gene set names (e.g. ‘GO_MITOCHONDRIAL_RNA_CATABOLIC_PROCESS’). In such a situation, it is advisable to add a column in the GSEA data set containing the systematic name or alternatively, convert the names in tmod object to appropriate gene set names (which can be more difficult than the first option).

Once the GSC has been subsetted, we can obtain a background gene list from the gene expression matrix. Here we extract the background gene set from Bai_empty_expr_mat, making sure to convert the gene symbols into upper case.

Bai_background_genes <- toupper( rownames( Bai_empty_expr_mat ) )

Now we can calculate a Jaccard distance matrix and create a GSNData object:

Bai_CiHep_dorothea_UD.GSN <- buildGeneSetNetworkJaccard( ref.background = Bai_background_genes,
                                                         geneSetCollection = Bai_gsc.tmod.gsea )

Next, we import the filtered pathways data set. Since we are opting to use ‘FDR q-val’ as our primary statistic with sig_order = ‘loToHi’, we can specify that in this command.

The ‘sig_order’ parameter specifies the order in which subnet membership is assigned, and therefore the order that they are reported. It also specifies how they will be colored for output plots, though users can override the automatic plot coloring behavior. In our case, since the NES values are mostly negative in our dataset, we will assign sig_order = ‘loToHi’.

Bai_CiHep_dorothea_UC.GSN <- gsnAddPathwaysData( object = Bai_CiHep_dorothea_UD.GSN,
                                                 pathways_data = Bai_CiHep_dorothea_UD.FILT,
                                                 stat_col = 'FDR q-val',
                                                 sig_order = 'loToHi' )
#> Using GSEA import.
#> Warning in gsnImportGSEA(object = object, pathways_data = pathways_data, : Pathways data are missing the following GSEA fields:GS DETAILS

Bai_CiHep_dorothea_UC.GSN
#> GSNData object version: 0.1.4.2 
#>   Contains data for:
#>      20035 genes.
#>      68 gene sets.
#>   Contains the following distance(s):
#>      jaccard
#>   Contains pathways data of type:  gsea 
#>      id_col = NAME 
#>      stat_col = FDR q-val 
#>      sig_order = loToHi 
#>      n_col = SIZE

Please ignore any warning about the missing GS DETAILS field, which we removed prior.

NOTE: For p-values and q-values, use sig_order = ‘loToHi’. For data in which high or extreme values are most significant, use ‘hiToLo’. (We may in the future add an option to specifically handle zero-centered extreme values properly in the future as ‘absHiToLo’.)

Next, we pare the network and assign subnets. We’ll use nearest single neighbor pairing for this example, with the gsnPareNetGenericToNearestNNeighbors() function. Among the factors that determine the granularity of the clustering into subnets are the cutoff and extreme parameters. The cutoff parameter specifies the maximal or minimal value of calculated distance for which gene-set groupings can be joined into subnets. More distant connections are removed. For distance metrics like stlf (single tailed log Fisher), lf, etc., for which lower values are considered closer, extreme = ‘min’, hence cutoff is treated as the maximal distance for which gene sets may be joined in a subnet. For distance metrics such as Jaccard index or the Szymkiewicz–Simpson overlap coefficient, for which greater values are closer, extreme = ‘max’, and cutoff is treated as the minimal value of distance for which gene sets may be joined into a subnet.

We have had good results when the cutoff is selected to retain \(1 \over 4\) to \(1 \over 2\) of distances in the distance matrix. We use the gsnDistanceHistogram() function with the argument stat = ‘cumulative’ specified to examine the distribution of distances in the raw distance matrix.

gsnDistanceHistogram( Bai_CiHep_dorothea_UC.GSN, stat = "cumulative" )

We see that roughly half of the Jaccard values in the data set are 0, so we select a value just above that, 0.02, which retains approximately \(1 \over 4\) of the distances.

Bai_CiHep_dorothea_UC.GSN <- gsnPareNetGenericToNearestNNeighbors( object = Bai_CiHep_dorothea_UC.GSN,
                                                                   N = 1,
                                                                   keepOrphans = TRUE, cutoff = 0.02 )

Bai_CiHep_dorothea_UC.GSN <- gsnAssignSubnets( object = Bai_CiHep_dorothea_UC.GSN )

Outputting Results Sets

Finally we can generate reports for our subnets. With gsnMergePathways(), the results will be ordered on the basis of how subnets were assigned. Here we will do it in two steps. The Bai_CiHep_dorothea_UC.subnets data.frame can be manipulated or reordered as desired, for example with dplyr::arrange, or saved to a file for later processing.

Bai_CiHep_dorothea_UC.subnets <- gsnMergePathways( object = Bai_CiHep_dorothea_UC.GSN )

yassifyPathways( Bai_CiHep_dorothea_UC.subnets,
                 options =list( autoWidth = FALSE,
                                scrollX = TRUE
                                ) )

An interpretation of this result set would be that the GSEA pathways results break down into an estimated 16 subnet groupings. These groupings are approximate, and dependent on parameters used for gsnPareNetGenericToNearestNNeighbors(), particularly the cutoff parameter which specifies the minimal or maximal value of the distance parameter for which gene sets can be grouped in a network, depending on the distance metric being used. Within each subnet, the higher significance gene sets may indicate biologically significant associations. In this example GSEA analysis, we used a database with gene sets directed and repressed by particular transcription factors, so in the Bai data, these transcription factors may be considered candidates for mediating the differentiation of fibroblast 2 cells into CiHeps, or hepatocyte like cells. In particular the gene signatures of AR, ELF3, FOXK2, KLF5, KDM5B are downregulated in CiHeps, whereas those of ETV4 & SPI1 are elevated.

We can now use the gsnSubnetSummary() function to generate a concise table of the subnets with summary statistics. By default, the summary statistics calculated are harmonic means and min or max values depending on the value of sig_order. These are generally appropriate for combining and summarizing p-values and q-values, but may not be appropriate for a value such as NES. Here we will specifically specify that the mean and maximal (max) NES values should be calculated instead.

yassifyPathways( gsnSubnetSummary( object = Bai_CiHep_dorothea_UC.GSN, summary_statistics = c("mean", "max")  ),
                 options =list( autoWidth = FALSE,
                                scrollX = TRUE
                 ) )

To calculate instead the default statistics using FDR q-val, we can simply specify that as an argument to gsnSubnetSummary:

yassifyPathways( gsnSubnetSummary( object = Bai_CiHep_dorothea_UC.GSN,
                                   stat_col = 'FDR q-val',
                                   sig_order = 'loToHi'  ),
                 options =list( autoWidth = FALSE,
                                scrollX = TRUE
                 ) )

Plotting the Network

We now plot the network using the default stat_col parameter, FDR q-val:

gsnPlotNetwork( object = Bai_CiHep_dorothea_UC.GSN,
                n_col = "SIZE",
                height = 7,
                width = 9,
                filename = "Bai_CiHep_dorothea_UC.GSN.DEF.NP.png" )
#> Warning in transform_function(pathways_dat[[stat_col]]): Warning: raw statistic contains zeros. Adding a pseudocount of 0.00011111111

knitr::include_graphics( "Bai_CiHep_dorothea_UC.GSN.DEF.NP.png" )

Now, we can plot the network, coloring nodes on the basis of NES values, with a custom blue, white, red color scale to color the extreme values. We also use transform_function = ‘identity’ to suppress the default log-transformation of the significance statistic.

gsnPlotNetwork( object = Bai_CiHep_dorothea_UC.GSN,
                n_col = "SIZE",
                stat_col = 'NES',
                sig_order = 'hiToLo',
                transform_function = identity,
                vertex_colors = c("blue", "white", "red"), 
                height = 7,
                width = 9,
                filename = "Bai_CiHep_dorothea_UC.GSN.NES.NP.png" )

knitr::include_graphics( "Bai_CiHep_dorothea_UC.GSN.NES.NP.png" )

To obtain a hierarchical dendrogram, the network must be pared using the gsnPareNetGenericHierarchic() function. However, this network was pared using nearest neighbor paring. To generate a hierarchical dendrogram, the gsnPareNetGenericHierarchic() function may be called on the fly, as a nested argument to gsnHierarchicalDendrogram(). This is useful if you wish to see how similar the gene set associations are between the two methods.

gsnHierarchicalDendrogram( object = gsnAssignSubnets(gsnPareNetGenericHierarchic( Bai_CiHep_dorothea_UC.GSN )),
                           stat_col = 'NES',
                           sig_order = 'hiToLo',
                           n_col = "SIZE",
                           transform_function = identity,
                           show.leaves = TRUE,
                           show.legend = TRUE,
                           leaf_colors =  c("blue", "white", "red"),
                           width = 7,
                           height = 11,
                           filename = "Bai_CiHep_dorothea_UC.GSN.HD.png",
                           lab.cex = 0.8 )
#> Warning in makeLeafSizeLegend(numbers = gs_numbers, sizeEncode.fun =
#> sizeEncode.fun, : Node size legend may have insufficient height to plot.

knitr::include_graphics( "Bai_CiHep_dorothea_UC.GSN.HD.png" )

One interpretation of these results are that changes in expression associated with the differentiation of fibroblast 2 cells into CiHep cells upon chemical induction are characterized mainly by repression of numerous transcriptional programs such as those directed by the Myc, Ar, KLF5 and TP53 transcription factors, and activation of Spi1 and Etv4.

Citing GSNA

To cite package ‘GSNA’ in publications use:

Collins, R. D, Urbach, M. J, Racenet, J. Z, Arshad, Umar, Power, A. K, Newman, M. R, Mylvaganam, H. G, Ly, L. N, Lian, Xiaodong, Rull, Anna, Rassadkina, Yelizaveta, Yanez, G. A, Peluso, J. M, Deeks, G. S, Vidal, Francesc, Lichterfeld, Mathias, Yu, G. X, Gaiha, D. G, Allen, M. T, Walker, D. B (2021). “Functional impairment of HIV-specific CD8+ T cells precedes aborted spontaneous control of viremia.” Immunity, S107476132100337X. doi:10.1016/j.immuni.2021.08.007 https://doi.org/10.1016/j.immuni.2021.08.007, https://linkinghub.elsevier.com/retrieve/pii/S107476132100337X.

Collins, R. D, Hitschfel, Julia, Urbach, M. J, Mylvaganam, H. G, Ly, L. N, Arshad, Umar, Racenet, J. Z, Yanez, G. A, Diefenbach, J. T, Walker, D. B (2023). “Cytolytic CD8+ T cells infiltrate germinal centers to limit ongoing HIV replication in spontaneous controller lymph nodes.” Science Immunology, 8(83), eade5872. doi:10.1126/sciimmunol.ade5872 https://doi.org/10.1126/sciimmunol.ade5872, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10231436.

References

1.
Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences. 2005;102: 15545–15550. doi:10.1073/pnas.0506580102
2.
Mootha VK, Lindgren CM, Eriksson K-F, Subramanian A, Sihag S, Lehar J, et al. PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genetics. 2003;34: 267–273. doi:10.1038/ng1180
3.
Zyla J, Marczyk M, Domaszewska T, Kaufmann SHE, Polanska J, Weiner J 3rd. Gene set enrichment for reproducible science: Comparison of CERNO and eight other algorithms. Bioinformatics. 2019;35: 5146–5154. doi:10.1093/bioinformatics/btz447
4.
Sherman BT, Hao M, Qiu J, Jiao X, Baseler MW, Lane HC, et al. DAVID: A web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Research. 2022;50: W216–W221. doi:10.1093/nar/gkac194
5.
Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nature Protocols. 2009;4: 44–57. doi:10.1038/nprot.2008.211
6.
Alexa A, Rahnenführer J, Lengauer T. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics. 2006;22: 1600–1607. doi:10.1093/bioinformatics/btl140
7.
Merico D, Isserlin R, Stueker O, Emili A, Bader GD. Enrichment Map: A Network-Based Method for Gene-Set Enrichment Visualization and Interpretation. PLOS ONE. 2010;5: e13984. doi:10.1371/journal.pone.0013984
8.
Bai Y, Yang Z, Xu X, Ding W, Qi J, Liu F, et al. Direct chemical induction of hepatocyte‐like cells with capacity for liver repopulation. Hepatology. 2023;77: 1550–1565. doi:10.1002/hep.32686
9.
Garcia-Alonso L, Holland CH, Ibrahim MM, Turei D, Saez-Rodriguez J. Benchmark and integration of resources for the estimation of human transcription factor activities. Genome Research. 2019;29: 1363–1375. doi:10.1101/gr.240663.118