Generating gene sets • Single Cell Pathway Analysis (SCPA)

Using gene sets in SCPA

There are a few different formats for gene sets that you can use in SCPA. However you decide to generate them though, the gene names need to be formatted in the same way across your expression file and gene sets i.e. if you have human gene symbols in your expression data, you need human gene symbols in your gene set.

Generate gene set lists in R (possibly the best option)
Use a classical gene set gmt file
Use a csv file

Here we’ll outline a basic workflow to get gene sets using either of the methods. First let’s load in a couple of packages we’ll need

library(SCPA)
library(msigdbr)
library(magrittr)
library(dplyr)

1. Generating gene sets within R using msigdbr

Generic example

In a simple example, you can generate formatted gene sets using two simple lines of code. We’ll just extract all the Hallmark pathways from msigdbr

pathways <- msigdbr(species = "Homo sapiens", category = "H") %>%
      format_pathways()

More detailed example

msigdbr is a handy R package that contains a huge number of pathways that are derived from the Molecular Signatures Database (MSigDB) v7.5.1. Using msigdbr is probably the easiest and most efficient way to generate all the gene sets that you’re interested in. You can see a detailed explanation of how to use different aspects of the package in the msigdbr vignette. We’re going to use it to generate lists of gene sets that we want to analyse in our data using SCPA.

For example, we can generate a data frame containing all of the Homo sapiens Hallmark gene sets using the “H” call within msigdbr. This data is in tidy format with all the pathway names in the gs_name column, and gene names in the gene_symbol column

hallmark <- msigdbr("Homo sapiens", "H")
head(hallmark)
#> # A tibble: 6 × 15
#>   gs_cat gs_subcat gs_name               gene_symbol entrez_gene ensembl_gene   
#>   <chr>  <chr>     <chr>                 <chr>             <int> <chr>          
#> 1 H      ""        HALLMARK_ADIPOGENESIS ABCA1                19 ENSG00000165029
#> 2 H      ""        HALLMARK_ADIPOGENESIS ABCB8             11194 ENSG00000197150
#> 3 H      ""        HALLMARK_ADIPOGENESIS ACAA2             10449 ENSG00000167315
#> 4 H      ""        HALLMARK_ADIPOGENESIS ACADL                33 ENSG00000115361
#> 5 H      ""        HALLMARK_ADIPOGENESIS ACADM                34 ENSG00000117054
#> 6 H      ""        HALLMARK_ADIPOGENESIS ACADS                35 ENSG00000122971
#> # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
#> #   human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
#> #   gs_exact_source <chr>, gs_url <chr>, gs_description <chr>

We can have a look at a few of these pathways

hallmark$gs_name %>% 
  unique() %>%
  head(10)
#>  [1] "HALLMARK_ADIPOGENESIS"            "HALLMARK_ALLOGRAFT_REJECTION"    
#>  [3] "HALLMARK_ANDROGEN_RESPONSE"       "HALLMARK_ANGIOGENESIS"           
#>  [5] "HALLMARK_APICAL_JUNCTION"         "HALLMARK_APICAL_SURFACE"         
#>  [7] "HALLMARK_APOPTOSIS"               "HALLMARK_BILE_ACID_METABOLISM"   
#>  [9] "HALLMARK_CHOLESTEROL_HOMEOSTASIS" "HALLMARK_COAGULATION"

We only need the pathway name and gene symbol though. For SCPA, we also need it to be formatted into separate lists for each pathway. We can do this easily using the format_pathways() function within SCPA

hallmark <- format_pathways(hallmark)

This will result in a separate list for each pathway that contains the pathway name and all the genes of that pathway. We can see this if we just look at the top of the first pathway list

head(hallmark[[1]])
#> # A tibble: 6 × 2
#>   Pathway               Genes
#>   <chr>                 <chr>
#> 1 HALLMARK_ADIPOGENESIS ABCA1
#> 2 HALLMARK_ADIPOGENESIS ABCB8
#> 3 HALLMARK_ADIPOGENESIS ACAA2
#> 4 HALLMARK_ADIPOGENESIS ACADL
#> 5 HALLMARK_ADIPOGENESIS ACADM
#> 6 HALLMARK_ADIPOGENESIS ACADS

Now we’ve generated pathways that can be analysed directly in the compare_pathways() function of SCPA

Versatility in using msigdbr

An advantage of using something like msigdbr is its versatility in subsetting specific pathways, or collection of gene sets. For example, if we wanted all pathways related to interferon, we could use grepl to filter the gs_names column:

ifn_pathways <- msigdbr("Homo sapiens") %>%
  filter(grepl("interferon", gs_name, ignore.case = T)) %>%
  format_pathways()

Or we could specify a combination of collections so we get all of the Hallmark, KEGG, and Reactome gene sets:

pathways <- c("hallmark", "kegg", "reactome")
hkr_sets <- msigdbr("Homo sapiens") %>%
  filter(grepl(paste(pathways, collapse = "|"), gs_name, ignore.case = T)) %>%
  format_pathways()

Using something like this makes it easy to pull large and/or relevant gene set lists for direct use within SCPA.

2. Using a gmt file

You can download gmt files for both human and mouse from the MSigDB, selecting the “gene symbols” version. SCPA has a built in function to read gmt files and format them properly, so all you need to do is specify the filepath(s); this can be a single file or multiple gmt files e.g. creating a list of all the gmt files in a particular directory:

gmt_files <- list.files(path = "gene_sets", pattern = "gmt", full.names = T)

And then you can just use this in the compare_pathways() function

scpa_out <- compare_pathways(samples = samples, pathways = gmt_files)

3. Using a csv file

In this example, we need to create the csv file, but once that’s done, it’s a simple line of code. As with gmt files, SCPA can directly read properly formatted csv gene set files – you just need to supply the filepath

pathways <- "path/to/geneset/file.csv"
scpa_out <- compare_pathways(samples = samples, pathways = pathways)

How to create the csv file

The csv file needs to be formatted in the style of a classical gene set gmt file. This being the pathway name in column 1, and genes of that pathway in subsequent columns. To generate something like this, all you need to do is download a gmt file, open it in something like excel, delete the second column containing a url, and save it as a csv file. Using this method means you can manually create a list of gene sets you’re interested in, where you’re not limited to existing databases. The file should look something like this: