Summary Statistics Pipeline
xiaoxiandadada/Summary-Statistics-Pipeline
Traditional Knockoff requires individual-level genotype data, but GWAS typically only release summary statistics (Z-scores, p-values, etc.). This pipeline uses GhostKnockoff and LAVA-Knock to perform Knockoff analyses based solely on summary statistics:
- GhostKnockoff: Reconstructs knockoff statistics from summary data using LD structure from a reference panel.
- LAVA-Knock: Extends to local genetic correlation analyses and supports joint multi-trait analysis.
- No individual data required: Runs with only Z-scores and a reference panel (e.g., 1000 Genomes).
Main Features
- Variable selection (
--mode select): GhostKnockoff + LAVAKnock FDR control, outputs significant variants. - Local genetic correlation (
--mode correl): LAVA-Knock bivariate window analysis, outputs window-level local correlations and knockoff-filtered results.
Pipeline Versions: Standard vs Fast
| Version | Trigger | Main Difference |
|---|---|---|
| Standard (default) | omit --py_accel | All computations performed in R; suitable for general environments. |
| Fast | append --py_accel to command | Uses NumPy/Numba in accelerator.py to accelerate correlation and LD redundancy removal. |
If the Python environment cannot be loaded, the pipeline will automatically fall back to the standard implementation.
Quick Start
The following commands can be run from the repository root. The default reference panel is the 1000 Genomes g1000_eur. Users may provide their own panel or genotype data. Provide a zscore file zscore.txt, and the pipeline will chunk based on the specified genome coordinate system GRCh37 or GRCh38.
1) Install dependencies
Rscript install_packages.R2) Variable selection (reference panel)
Rscript run_pipeline.R --mode select \
--zscore zscore.txt \
--panel g1000_eur \
--ld_coord GRCh37 \
--py_accel -v -o results- Automatically parses
chr:pos:ref:altorCHR/POScolumns inzscore.txt, matching to the reference panel byCHR+POS. --panelcan be omitted (defaultg1000_eur/g1000_eur).--ld_coordmust explicitly specify LD block coordinates (GRCh37orGRCh38).
3) Variable selection (real genotypes)
Rscript run_pipeline.R --mode select \
--zscore zscore.txt \
--geno_plink genotype \
--ld_coord GRCh37 \
-o results -v4) Local genetic correlation (example)
Rscript run_pipeline.R --mode correl \
--zscore zscore.txt \
--panel g1000_eur/g1000_eur \
--py_accel -v -o resultsTo scale to the whole genome, iterate over chromosomes 1–22 (or generate specific windows using
scripts/build_info_from_plink.R).
Input Formats and Auto-detection
- Variant info: CSV with columns
rsid, chr, pos_bp. Can be generated from a PLINK panel usingscripts/build_info_from_plink.R. - GWAS summary:
- Standard:
CHR, POS, Z. rsid + valueonly: automatically converted vianormalize_gwas_table(); usescripts/convert_zscore_to_gwas.Rfor batch preprocessing.- Multi-trait:
--multi_gwas <file> --zcols col1,col2.
- Standard:
- Genotypes / reference panel:
--geno_rds: RDS matrix (column namesrsidorchr:pos).--geno_csv: CSV with--geno_format samples_by_snps|snps_by_samples.--geno_plink: PLINK prefix, read viasnpStats::read.plink.- If real genotypes are not provided,
--ref_plinkis used automatically (defaultg1000_eur/g1000_eur).
Output Description
Selection mode
Output file: results/selection/<info>_<pheno>_selection.csv
Columns include:
id, rsid, chr, pospval.orginal, pval.knockoff*W, Qvalue, selectedgeno_source
If chunking by LD blocks (--ld_coord), an additional <prefix>_chunk_summary.csv is produced, recording for each chunk:
- range
- source (ld_block/fallback)
- number selected
Correlation mode
Output files:
results/correlation/<info>_<pheno1>__<pheno2>_bivariate.csvresults/correlation/<info>_<pheno1>__<pheno2>_significant_windows.csv
Directory Overview
- g1000_eur.bed
- g1000_eur.bim
- g1000_eur.fam
- g1000_eur.synonyms