There are a few different formats for gene sets that you can use in SCPA. However you decide to generate them though, the gene names need to be formatted in the same way across your expression file and gene sets i.e. if you have human gene symbols in your expression data, you need human gene symbols in your gene set.
Here we’ll outline a basic workflow to get gene sets using either of the methods. First let’s load in a couple of packages we’ll need
In a simple example, you can generate formatted gene sets using two simple lines of code. We’ll just extract all the Hallmark pathways from msigdbr
pathways <- msigdbr(species = "Homo sapiens", category = "H") %>%
format_pathways()
msigdbr is a handy R package that contains a huge number of pathways that are derived from the Molecular Signatures Database (MSigDB) v7.5.1. Using msigdbr is probably the easiest and most efficient way to generate all the gene sets that you’re interested in. You can see a detailed explanation of how to use different aspects of the package in the msigdbr vignette. We’re going to use it to generate lists of gene sets that we want to analyse in our data using SCPA.
For example, we can generate a data frame containing all of the Homo
sapiens Hallmark gene sets using the “H” call within
msigdbr
. This data is in tidy format with all the pathway
names in the gs_name column, and gene names in the gene_symbol
column
hallmark <- msigdbr("Homo sapiens", "H")
head(hallmark)
#> # A tibble: 6 × 15
#> gs_cat gs_subcat gs_name gene_symbol entrez_gene ensembl_gene
#> <chr> <chr> <chr> <chr> <int> <chr>
#> 1 H "" HALLMARK_ADIPOGENESIS ABCA1 19 ENSG00000165029
#> 2 H "" HALLMARK_ADIPOGENESIS ABCB8 11194 ENSG00000197150
#> 3 H "" HALLMARK_ADIPOGENESIS ACAA2 10449 ENSG00000167315
#> 4 H "" HALLMARK_ADIPOGENESIS ACADL 33 ENSG00000115361
#> 5 H "" HALLMARK_ADIPOGENESIS ACADM 34 ENSG00000117054
#> 6 H "" HALLMARK_ADIPOGENESIS ACADS 35 ENSG00000122971
#> # ℹ 9 more variables: human_gene_symbol <chr>, human_entrez_gene <int>,
#> # human_ensembl_gene <chr>, gs_id <chr>, gs_pmid <chr>, gs_geoid <chr>,
#> # gs_exact_source <chr>, gs_url <chr>, gs_description <chr>
We can have a look at a few of these pathways
hallmark$gs_name %>%
unique() %>%
head(10)
#> [1] "HALLMARK_ADIPOGENESIS" "HALLMARK_ALLOGRAFT_REJECTION"
#> [3] "HALLMARK_ANDROGEN_RESPONSE" "HALLMARK_ANGIOGENESIS"
#> [5] "HALLMARK_APICAL_JUNCTION" "HALLMARK_APICAL_SURFACE"
#> [7] "HALLMARK_APOPTOSIS" "HALLMARK_BILE_ACID_METABOLISM"
#> [9] "HALLMARK_CHOLESTEROL_HOMEOSTASIS" "HALLMARK_COAGULATION"
We only need the pathway name and gene symbol though. For SCPA, we
also need it to be formatted into separate lists for each pathway. We
can do this easily using the format_pathways()
function
within SCPA
hallmark <- format_pathways(hallmark)
This will result in a separate list for each pathway that contains the pathway name and all the genes of that pathway. We can see this if we just look at the top of the first pathway list
head(hallmark[[1]])
#> # A tibble: 6 × 2
#> Pathway Genes
#> <chr> <chr>
#> 1 HALLMARK_ADIPOGENESIS ABCA1
#> 2 HALLMARK_ADIPOGENESIS ABCB8
#> 3 HALLMARK_ADIPOGENESIS ACAA2
#> 4 HALLMARK_ADIPOGENESIS ACADL
#> 5 HALLMARK_ADIPOGENESIS ACADM
#> 6 HALLMARK_ADIPOGENESIS ACADS
Now we’ve generated pathways that can be analysed directly in the
compare_pathways()
function of SCPA
An advantage of using something like msigdbr is its versatility in subsetting specific pathways, or collection of gene sets. For example, if we wanted all pathways related to interferon, we could use grepl to filter the gs_names column:
ifn_pathways <- msigdbr("Homo sapiens") %>%
filter(grepl("interferon", gs_name, ignore.case = T)) %>%
format_pathways()
Or we could specify a combination of collections so we get all of the Hallmark, KEGG, and Reactome gene sets:
pathways <- c("hallmark", "kegg", "reactome")
hkr_sets <- msigdbr("Homo sapiens") %>%
filter(grepl(paste(pathways, collapse = "|"), gs_name, ignore.case = T)) %>%
format_pathways()
Using something like this makes it easy to pull large and/or relevant gene set lists for direct use within SCPA.
You can download gmt files for both human and mouse from the MSigDB, selecting the “gene symbols” version. SCPA has a built in function to read gmt files and format them properly, so all you need to do is specify the filepath(s); this can be a single file or multiple gmt files e.g. creating a list of all the gmt files in a particular directory:
gmt_files <- list.files(path = "gene_sets", pattern = "gmt", full.names = T)
And then you can just use this in the compare_pathways()
function
scpa_out <- compare_pathways(samples = samples, pathways = gmt_files)
In this example, we need to create the csv file, but once that’s done, it’s a simple line of code. As with gmt files, SCPA can directly read properly formatted csv gene set files – you just need to supply the filepath
pathways <- "path/to/geneset/file.csv"
scpa_out <- compare_pathways(samples = samples, pathways = pathways)
The csv file needs to be formatted in the style of a classical gene set gmt file. This being the pathway name in column 1, and genes of that pathway in subsequent columns. To generate something like this, all you need to do is download a gmt file, open it in something like excel, delete the second column containing a url, and save it as a csv file. Using this method means you can manually create a list of gene sets you’re interested in, where you’re not limited to existing databases. The file should look something like this: