Individual Statistics Pipeline

This repository provides a unified interface for SNP-level, window-based, and gene-centric association inference using knockoff-based or SCANG-based methods. It supports both uncorrelated and correlated samples, integrates SAIGE for mixed-model GWAS, and applies automated FDR control.

Supported Methods

| Method | Input Data | Sample Type | Description |
|---|---|---|---|
| KnockoffScreen | SNP genotypes | Uncorrelated | SNP-level and sliding-window inference |
| GeneScan3DKnock | SNP genotypes | Uncorrelated | Gene-centric inference via multiscale aggregation |
| BIGKnock | SNP genotypes (& GRM) | Correlated | Gene-centric inference with GLMM models |

Installation

1. Install SAIGE (with conda environment)

conda env create -f environment-RSAIGE.yml
conda activate RSAIGE
FLAGPATH=$(which python | sed 's|/bin/python$||')
export LDFLAGS="-L${FLAGPATH}/lib"
export CPPFLAGS="-I${FLAGPATH}/include"

2. Install Dependencies

Rscript install_packages.R

Input Requirements

Required Inputs

| Argument | Description |
|---|---|
| --pheno_file | Phenotype file (CSV/TSV) containing ID column |
| --geno_file | Genotype file prefix in PLINK bed/bim/fam format |
| --phenotype | Column name of phenotype |
| --genome_build | "hg19" or "hg38" |
| --sample_uncorrelated | TRUE/FALSE |
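
Because --geno_file is a prefix rather than a single file, a quick pre-flight check can catch an incomplete PLINK fileset before a long run starts. The sketch below is not part of the pipeline; the function name check_plink_prefix is hypothetical, and it only tests that the three files of a binary fileset (.bed, .bim, .fam) exist for the given prefix.

```shell
# Hypothetical helper (not part of pipeline.R): confirm that a PLINK prefix
# resolves to all three files of a binary fileset.
check_plink_prefix() {
  local prefix="$1" ext missing=0
  for ext in bed bim fam; do
    if [ ! -f "${prefix}.${ext}" ]; then
      echo "missing: ${prefix}.${ext}" >&2
      missing=1
    fi
  done
  # Exit status 0 when all three files exist, 1 otherwise.
  return "$missing"
}

# Example use before launching the pipeline:
# check_plink_prefix "$geno_file" || exit 1
```

The check looks only at file presence; it does not validate the contents of the files.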

Optional Inputs

| Argument | Description | Default |
|---|---|---|
| --sliding_window_length | Comma-separated list of window sizes (in bp) for window-based inference. Only used when --test_type is "Single_Window". | "1000,5000,10000" |
| --M | Number of knockoff copies. | 5 |
| --geno_missing_imputation | Missing genotype imputation method (fixed or mean). | "fixed" |
| --plink_path | Path to PLINK executable. | "plink" |
| --genome_build | Genome build (hg19, hg38) used for annotation and window mapping. | "hg19" |
| --sample_uncorrelated | Whether samples are uncorrelated (TRUE/FALSE). | TRUE |
| --fdr | Target false discovery rate. | 0.1 |
| --grm_file | GRM matrix (.grm + .grm.id) required only for correlated sample methods, e.g., BIGKnock. Ignored otherwise. | NULL |
| --pheno_id | Column name of sample ID in phenotype file. Required when phenotype table contains ID-like columns. | NULL |
| --covariates | Comma-separated covariate names. Optional for all methods; used only when covariates are included in association models. | NULL |
| --user_cores | Number of CPU threads used. | 1 |
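
Since --sliding_window_length is passed as a single comma-separated string, a malformed value is easy to miss on the command line. As a minimal sketch, the hypothetical helper below (valid_window_list is not part of the pipeline) accepts only a comma-separated list of positive integers. How pipeline.R itself parses or validates this argument is not documented here, so treat this as an optional wrapper-side check.

```shell
# Hypothetical helper (not part of pipeline.R): succeed only if the argument
# is a comma-separated list of positive integers, e.g. "1000,5000,10000".
valid_window_list() {
  printf '%s\n' "$1" | grep -Eq '^[1-9][0-9]*(,[1-9][0-9]*)*$'
}

# Example use in a wrapper script:
# valid_window_list "$windows" || { echo "bad window list: $windows" >&2; exit 1; }
```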

Pipeline Usage

SNP-level and window-based inference (Uncorrelated Samples)

Rscript pipeline.R \
  --outdir "result/" \
  --test_type "Single_Window" \
  --pheno_file "$pheno_file" \
  --geno_file "$geno_file" \
  --phenotype "Y" \
  --pheno_id "id" \
  --covariates "X1" \
  --user_cores 4 \
  --sliding_window_length "1000,5000,10000" \
  --geno_missing_imputation "fixed" \
  --plink_path "plink" \
  --M 5 \
  --genome_build "hg19" \
  --sample_uncorrelated TRUE \
  --fdr 0.1
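
Before launching a run like the one above, it can help to confirm that the phenotype IDs actually overlap the genotype samples. The sketch below counts phenotype rows whose ID appears in the IID column (second field) of the PLINK .fam file. It rests on assumptions not stated in this README: that the pipeline matches samples on IID, and that the phenotype file is a plain comma-separated file with a header row and no quoted fields. The function name count_matched_ids is hypothetical.

```shell
# Hypothetical helper (not part of pipeline.R).
# Usage: count_matched_ids FAM_FILE PHENO_CSV ID_COLUMN
# Prints the number of phenotype rows whose ID is an IID in the .fam file.
count_matched_ids() {
  awk -v col="$3" '
    NR == FNR { iid[$2] = 1; next }      # first file (.fam): IID is field 2
    FNR == 1 {                           # phenotype header: locate the ID column
      n = split($0, h, ",")
      for (i = 1; i <= n; i++) if (h[i] == col) idx = i
      if (!idx) { print "ID column not found: " col > "/dev/stderr"; bad = 1; exit 2 }
      next
    }
    { split($0, f, ","); if (f[idx] in iid) matched++ }
    END { if (bad) exit 2; print matched + 0 }
  ' "$1" "$2"
}

# Example use (assumes the ID column is named "id", as in the command above):
# count_matched_ids "${geno_file}.fam" "$pheno_file" "id"
```

A count far below the number of phenotype rows usually points to an ID-format mismatch between the two files rather than to missing samples.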

Gene-Centric Inference (Uncorrelated Samples)

Rscript pipeline.R \
  --outdir "result_gene/" \
  --test_type "Gene_Centric_Coding" \
  --pheno_file "$pheno_file" \
  --geno_file "$geno_file" \
  --phenotype "Y" \
  --pheno_id "id" \
  --covariates "X1,X2" \
  --sliding_window_length "20000,50000" \
  --genome_build "hg19" \
  --sample_uncorrelated TRUE \
  --M 10 \
  --fdr 0.1

Gene-Centric Inference (Correlated Samples)

If --grm_file is not provided, the pipeline generates a GRM automatically using SAIGE.

Rscript pipeline.R \
  --outdir "result_bigknock/" \
  --test_type "Gene_Centric_Coding" \
  --pheno_file "$pheno_file" \
  --grm_file "$grm_file" \
  --grm_id_file "$grm_id_file" \
  --geno_file "$geno_file" \
  --phenotype "Y" \
  --pheno_id "id" \
  --covariates "age,sex,PC1,PC2" \
  --genome_build "hg38" \
  --sample_uncorrelated FALSE \
  --M 5 \
  --fdr 0.1

Output Files

Each pipeline run outputs:

For SNP-level and Window-level inference

  • Single_Window_results_chr*.csv
  • Manhattan/Q–Q plots

For gene-centric inference

  • Gene_results_chr*.csv
  • Manhattan/Q–Q plots
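
Results are written one CSV per chromosome, so downstream analysis often starts by combining them. The following is a minimal sketch, assuming every per-chromosome file begins with the same single header line (the column layout itself is not documented here). The function name merge_results and the output filename are hypothetical.

```shell
# Hypothetical helper (not part of pipeline.R).
# Usage: merge_results RESULT_DIR 'GLOB_PATTERN' OUTPUT_FILE
# Concatenates matching CSVs, keeping the header from the first file only.
merge_results() {
  local dir="$1" pattern="$2" out="$3" first=1 f
  : > "$out"                               # create or truncate the output file
  for f in "$dir"/$pattern; do             # $pattern is left unquoted so the glob expands
    [ -f "$f" ] || continue                # skip the literal pattern when nothing matches
    if [ "$first" -eq 1 ]; then
      cat "$f" >> "$out"                   # first file: header plus data rows
      first=0
    else
      tail -n +2 "$f" >> "$out"            # later files: data rows only
    fi
  done
}

# Example use; the output name must not match the glob pattern:
# merge_results "result_gene" 'Gene_results_chr*.csv' "result_gene/Gene_results_all.csv"
```

Files are appended in shell glob order, which is lexicographic (chr1, chr10, chr11, ..., chr2), so sort the merged table by chromosome afterwards if order matters.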

Example Directory Structure

    • genotype.bed
    • genotype.bim
    • genotype.fam
    • phenotype.txt
      • pipeline.R
      • install_packages.R
      • Single_results_chr1.csv
      • Window_results_chr1.csv
      • Gene_results_chr1.csv