Bootstrap cell type enrichment test — bootstrap_enrichment

bootstrap_enrichment_test takes a genelist and a single cell type transcriptome dataset and determines the probability of enrichment and fold changes for each cell type.

bootstrap_enrichment_test(
  sct_data = NULL,
  hits = NULL,
  bg = NULL,
  genelistSpecies = NULL,
  sctSpecies = NULL,
  output_species = "human",
  reps = 100,
  annotLevel = 1,
  geneSizeControl = FALSE,
  controlledCT = NULL,
  mtc_method = "BH",
  sort_results = TRUE,
  verbose = TRUE
)

Arguments

sct_data	List generated using generate_celltype_data.
hits	List of gene symbols containing the target gene list. Will automatically be converted to human gene symbols if `geneSizeControl=TRUE`.
bg	List of gene symbols containing the background gene list (including hit genes). If `bg=NULL`, an appropriate gene background will be created automatically. if `geneSizeControl=TRUE`.
genelistSpecies	Species that `hits` genes came from (no longer limited to just "mouse" and "human").
sctSpecies	Species that `sct_data` came from (no longer limited to just "mouse" and "human").
output_species	Species to convert `sct_data` and `hits` to (Default: "human").
reps	Number of random gene lists to generate (Default: 100, but should be >=10,000 for publication-quality results).
annotLevel	An integer indicating which level of `sct_data` to analyse (Default: 1).
geneSizeControl	Whether you want to control for GC content and transcript length. Recommended if the gene list originates from genetic studies (Default: FALSE). If set to `TRUE`, then `hits` must be from humans. should be used rather than mouse.
controlledCT	[Optional] If not NULL, and instead is the name of a cell type, then the bootstrapping controls for expression within that cell type.
mtc_method	Multiple-testing correction method (passed to p.adjust).
sort_results	Sort enrichment results from smallest to largest p-values.
verbose	Print messages.

Value

A list containing three data frames:

results: dataframe in which each row gives the statistics (p-value, fold change and number of standard deviations from the mean) associated with the enrichment of the stated cell type in the gene list
hit.cells: vector containing the summed proportion of expression in each cell type for the target list
bootstrap_data: matrix in which each row represents the summed proportion of expression in each cell type for one of the random lists

Examples

# Load the single cell data
ctd <- ewceData::ctd()
#> see ?ewceData and browseVignettes('ewceData') for documentation
#> loading from cache
# Set the parameters for the analysis
# Use 3 bootstrap lists for speed, for publishable analysis use >=10,000
reps <- 3
# Load gene list from Alzheimer's disease GWAS
example_genelist <- ewceData::example_genelist()
#> see ?ewceData and browseVignettes('ewceData') for documentation
#> loading from cache

# Bootstrap significance test, no control for transcript length or GC content
full_results <- EWCE::bootstrap_enrichment_test(
    sct_data = ctd,
    hits = example_genelist,
    reps = reps,
    annotLevel = 1,
    sctSpecies = "mouse",
    genelistSpecies = "human"
)
#> Generating gene background for mouse x human ==> human
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: mouse
#> Common name mapping found for mouse
#> 1 organism identified from search: mmusculus
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: hsapiens
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: mmusculus
#> 1 organism identified from search: 10090
#> Gene table with 21,207 rows retrieved.
#> Returning all 21,207 genes from mmusculus.
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: hsapiens
#> 1 organism identified from search: 9606
#> Gene table with 19,129 rows retrieved.
#> Returning all 19,129 genes from hsapiens.
#> Preparing gene_df.
#> data.frame format detected.
#> Extracting genes from Gene.Symbol.
#> 21,207 genes extracted.
#> Converting mmusculus ==> hsapiens orthologs using: homologene
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: mmusculus
#> 1 organism identified from search: 10090
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: hsapiens
#> 1 organism identified from search: 9606
#> Checking for genes without orthologs in hsapiens.
#> Extracting genes from input_gene.
#> 17,355 genes extracted.
#> Extracting genes from ortholog_gene.
#> 17,355 genes extracted.
#> Checking for genes without 1:1 orthologs.
#> Dropping 131 genes that have multiple input_gene per ortholog_gene.
#> Dropping 498 genes that have multiple ortholog_gene per input_gene.
#> Filtering gene_df with gene_map
#> Adding input_gene col to gene_df.
#> Adding ortholog_gene col to gene_df.
#> 
#> =========== REPORT SUMMARY ===========
#> Total genes dropped after convert_orthologs :
#>    4,725 / 21,207 (22%)
#> Total genes remaining after convert_orthologs :
#>    16,482 / 21,207 (78%)
#> 
#> =========== REPORT SUMMARY ===========
#> 16,482 / 21,207 (77.72%) target_species genes remain after ortholog conversion.
#> 16,482 / 19,129 (86.16%) reference_species genes remain after ortholog conversion.
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: hsapiens
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: hsapiens
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: hsapiens
#> 1 organism identified from search: 9606
#> Gene table with 19,129 rows retrieved.
#> Returning all 19,129 genes from hsapiens.
#> 
#> =========== REPORT SUMMARY ===========
#> 19,129 / 19,129 (100%) target_species genes remain after ortholog conversion.
#> 19,129 / 19,129 (100%) reference_species genes remain after ortholog conversion.
#> 16,482 intersect background genes used.
#> Standardising CellTypeDataset
#> Converting to sparse matrix.
#> Converting to sparse matrix.
#> Checking gene list inputs.
#> Retrieving all genes using: homologene.
#> Retrieving all organisms available in gprofiler.
#> Using stored `gprofiler_orgs`.
#> Mapping species name: human
#> Common name mapping found for human
#> 1 organism identified from search: 9606
#> Gene table with 19,129 rows retrieved.
#> Returning all 19,129 genes from human.
#> Standardising sct_data.
#> Converting gene list input to standardised human genes.
#> Running without gene size control.
#> 17 hit genes remain after filtering.
#> Computing summed proportions.
#> Testing for enrichment in 7 cell types...
#> Sorting results by p-value.
#> Computing BH-corrected q-values.
#> 2 significant cell type enrichment results @ q<0.05 : 
#>               CellType annotLevel p fold_change sd_from_mean q
#> 1            microglia          1 0    2.041600     1.660623 0
#> 2 astrocytes_ependymal          1 0    1.309126     1.229830 0