Getting Started with CASCADE

Introduction

CASCADE classifies the cell type specificity of genes, peaks, and variants from multi-cell-type QTL analyses, and assigns each variant a regulatory mechanism (the chromatin-to-expression cascade). It distinguishes genuine cell type-specific effects from technical artifacts caused by power differences across cell types using a power-aware LFSR-based assessment, with a hierarchy-derived 6-category specificity system and an 8-category mechanism taxonomy.

Configuration

Use create_config() to build a configuration object that tells CASCADE where your data lives and how to run the analysis. Path templates may include {CELL_TYPE} and {CHR} placeholders that the loader substitutes per cell type and per chromosome.

library(cascade)

config <- create_config(
  cell_types = c("Mono", "DC", "NK", "B", "CD4_T", "CD8_T", "other_T"),
  chromosomes = paste0("chr", 1:22),
  file_patterns = list(
    eqtl_acat        = "data/eqtl/{CELL_TYPE}.acat.tsv.gz",
    caqtl_acat       = "data/caqtl/{CELL_TYPE}.acat.tsv.gz",
    eqtl_susie       = "data/eqtl/{CELL_TYPE}.{CHR}.susie.tsv.gz",
    caqtl_susie      = "data/caqtl/{CELL_TYPE}.{CHR}.susie.tsv.gz",
    eqtl_lfsr        = "data/eqtl.lfsr.tsv.gz",
    caqtl_lfsr       = "data/caqtl.lfsr.tsv.gz",
    eqtl_meta        = "data/eqtl.meta.tsv.gz",
    caqtl_meta       = "data/caqtl.meta.tsv.gz",
    peak_bed         = "data/{CELL_TYPE}.peaks.bed",
    peak_gene_links  = "data/{CELL_TYPE}.peak_gene_links.tsv.gz"
  ),
  parameters = list(
    pip_threshold       = 0.5,
    min_pip_threshold   = 0.1,
    acat_fdr_threshold  = 0.05,
    lfsr_sig_threshold  = 0.05,
    lfsr_null_threshold = 0.5,
    n_cores             = 4
  )
)

Alternatively, store the same structure as JSON and pass the file path directly to run_cascade():

results <- run_cascade("config.json", output_dir = "results/")

File patterns

Key Description
eqtl_acat / caqtl_acat ACAT q-values per cell type
eqtl_susie / caqtl_susie SuSiE credible sets per cell type and chromosome
eqtl_lfsr / caqtl_lfsr Mash LFSR matrices (rows = feature×variant, columns = cell types)
eqtl_meta / caqtl_meta Meta-analysis results with Cochran’s Q heterogeneity p-values
peak_bed BED of peak coordinates per cell type
peak_gene_links Per-cell-type peak-to-gene link table
eqtl_mashr / caqtl_mashr (Optional) mashr model RDS files
cs_clusters / cs_cluster_variants (Optional) credible-set cluster annotations

Cell Type Hierarchy

CASCADE uses a hierarchy to decide what counts as “shared” versus “specific”. The package ships with DEFAULT_CELL_HIERARCHY for immune cells:

DEFAULT_CELL_HIERARCHY
#> CellTypeHierarchy
#>   Lineages (2): myeloid, lymphoid
#>   L1 cell types (7): Mono, DC, NK, B, CD4_T, CD8_T, other_T
#>   Subgroup levels: 1
#>     Level 1: T-cell
#>   Bulk: PBMC
#>   Other: other
#>   Mapping to L1: 25 entries
#>   Categories (6): Cross-lineage shared | Likely shared but underpowered |
#>     Lineage-specific | T-cell-specific | Single cell-type | No significance

The six categories are auto-generated from the hierarchy structure: 4 fixed categories plus one per grouping level (1 lineage + N subgroup levels).

Running the Analysis

results <- run_cascade(config, output_dir = "results/")

CASCADE always processes all three feature types (gene, peak, variant) and writes the following outputs:

File Contents
gene_categorization.tsv.gz Per-gene specificity + heterogeneity columns
gene_categorization.summary.tsv Gene category counts
gene_categorization.top_variants.tsv.gz Top variants driving each gene
peak_categorization.tsv.gz Per-peak specificity + heterogeneity columns
peak_categorization.summary.tsv Peak category counts
peak_categorization.top_variants.tsv.gz Top variants driving each peak
variant_categorization.tsv.gz Per-variant Stage 2 cross-cell-type aggregate
variant_categorization.qtl_mechanism.summary.tsv Variant counts per QTL mechanism category
variant_categorization.cell_type_specificity.summary.tsv Variant counts per cell type specificity category
variant_categorization_per_celltype_{CELL_TYPE}.tsv.gz Per-variant Stage 1 output (one file per cell type)

Read results back in R:

genes <- data.table::fread("results/gene_categorization.tsv.gz")
peaks <- data.table::fread("results/peak_categorization.tsv.gz")
variants <- data.table::fread("results/variant_categorization.tsv.gz")

Understanding Results

Gene and peak results

The key column is cell_type_specificity, which takes one of the six hierarchy-derived values:

Category Meaning
Cross-lineage shared Significant in 2+ lineages (e.g., both myeloid and lymphoid)
Likely shared but underpowered LFSR gray-zone evidence suggests hidden sharing
Lineage-specific Significant in exactly one lineage
T-cell-specific Significant within the T-cell subgroup only
Single cell-type Significant in exactly one L1 cell type
No significance Not significant in any cell type

Companion columns include significant_cts / n_significant_cts (which L1 cell types reached the ACAT q-value threshold), tested_cts / n_tested_cts, and — when variant analysis runs — top_variant, max_pip, max_chisq, and variant_heterogeneity.

Variant results

Variant categorization is two-level. The cross-cell-type table contains:

  • qtl_mechanism_category: One of 8 QTL mechanism categories — Local Cascade, Positional Cascade, Distal Cascade, caQTL + eQTL (No Link), Only caQTL (With Link), Only caQTL (No Link), Only eQTL, or No molQTL. The three cascade tiers are distinguished by where the chromatin accessibility bridge sits relative to the variant.
  • cell_type_specificity / gene_cell_type_specificity / peak_cell_type_specificity: three specificity assessments. The combined cell_type_specificity is mechanism-based (no LFSR demotion); the gene- and peak-specific columns apply LFSR.
  • qtl_pattern_number / qtl_pattern: fine-grained pattern (1–25) with a detailed mechanistic interpretation.

Variant heterogeneity

When a feature is active in multiple cell types, CASCADE assesses effect heterogeneity using Cochran’s Q test. The VARIANT_HETEROGENEITY codes are:

Code Label Meaning
a shared_consistent Same variant, consistent effect sizes
b shared_heterogeneous Same variant, different magnitudes
c shared_opposite Same variant, opposite effect directions
d distinct_variants Different causal variants across cell types

Column Mapping

CASCADE expects specific column names in input files. If your data uses different names, override them via column_mapping in create_config(). Defaults are defined in DEFAULT_COLUMN_MAPPING.

config <- create_config(
  cell_types = c("Mono", "DC", "NK", "B", "CD4_T", "CD8_T", "other_T"),
  chromosomes = paste0("chr", 1:22),
  file_patterns = list(...),
  column_mapping = list(
    eqtl_susie = list(
      variant_id = "snp_id", # your column for variant IDs
      feature_id = "gene", # your column for gene IDs
      pip        = "posterior_prob" # your column for PIP values
    ),
    meta = list(
      variant_id = "SNP",
      feature_id = "gene_id",
      cochran_q_nlog10p = "het_pval_nlog10"
    )
  )
)

The column_mapping slot uses unprefixed keys for shared schemas (peak_gene, lfsr, meta); only file_patterns distinguishes eQTL vs caQTL with eqtl_/caqtl_ prefixes. Only the file types and columns you need to override must be specified; all others keep their defaults.

Advanced: Custom Hierarchy

To use CASCADE with a different tissue, define your own hierarchy with create_cell_hierarchy(). Here is an example for brain cell types:

brain <- create_cell_hierarchy(
  lineages = list(
    neuronal = c("ExN_L23", "ExN_L4", "ExN_L56", "InN_PV", "InN_SST", "InN_VIP"),
    glial    = c("Astro", "Oligo", "OPC", "Micro"),
    vascular = c("Endo", "Pericyte")
  ),
  subgroups = list(
    list(
      excitatory = c("ExN_L23", "ExN_L4", "ExN_L56"),
      inhibitory = c("InN_PV", "InN_SST", "InN_VIP")
    )
  ),
  bulk = NULL,
  other = NULL,
  mapping_to_l1 = list(
    ExN_L23_CUX2 = "ExN_L23",
    ExN_L4_RORB  = "ExN_L4",
    InN_PV_PVALB = "InN_PV"
  ),
  column_prefix = "cell_type"
)

brain
#> CellTypeHierarchy
#>   Lineages (3): neuronal, glial, vascular
#>   L1 cell types (12): ExN_L23, ExN_L4, ExN_L56, InN_PV, InN_SST, InN_VIP,
#>     Astro, Oligo, OPC, Micro, Endo, Pericyte
#>   Subgroup levels: 1
#>     Level 1: excitatory, inhibitory
#>   Categories (6): Cross-lineage shared | Likely shared but underpowered |
#>     Lineage-specific | Subgroup-specific | Single cell-type | No significance

Key points:

  • lineages: Must have 2+ groups. These define the top-level split.
  • subgroups: Optional list of grouping levels, ordered broadest to narrowest. Each group must be a subset of a single lineage. Levels with multiple subgroups produce a generic Subgroup-specific category; levels with a single subgroup use that subgroup’s name (e.g., T-cell-specific).
  • mapping_to_l1: Maps finer-resolution cell types (L2, L3, …) to their L1 parent. Targets must appear in lineages or other.
  • column_prefix: Controls how cell type column names are constructed in data files (e.g., cell_type.l1.Astro).

Pass the custom hierarchy to run_cascade() by including it in your config:

config$hierarchy <- brain
results <- run_cascade(config, output_dir = "brain_results/")