Munge GWAS Summary Statistics to Align with GDS File
munge_sumstats.RdAligns GWAS summary statistics with a GDS file using chromosome-position or variant ID matching. Performs quality control filtering, allele flipping for discordant variants, and outputs filtered results.
Usage
munge_sumstats(
gds_path,
sumstats_path,
match_by = "CHR-POS",
chr_col = "CHR",
pos_col = "POS",
id_col = "ID",
ref_col = "REF",
alt_col = "ALT",
beta_col = "BETA",
se_col = "SE",
af_col = "AF",
remove_ambiguous = TRUE,
output_path = NULL
)Arguments
- gds_path
Character; path to the input GDS file.
- sumstats_path
Character; path to the GWAS summary statistics file.
- match_by
Character;
"CHR-POS"(default) or"ID"."CHR-POS": match by chromosome and position (autosomes 1-22 only)."ID": match by variant ID.- chr_col
Character; chromosome column name in the GWAS summary statistics file (default
"CHR").- pos_col
Character; position column name in the GWAS summary statistics file (default
"POS").- id_col
Character; variant ID column name in the GWAS summary statistics file (default
"ID").- ref_col
Character; reference allele column name in the GWAS summary statistics file (default
"REF").- alt_col
Character; alternative allele column name in the GWAS summary statistics file (default
"ALT").- beta_col
Character; effect size column name in the GWAS summary statistics file (default
"BETA").- se_col
Character; standard error column name in the GWAS summary statistics file (default
"SE").- af_col
Character; allele frequency column name in the GWAS summary statistics file (default
"AF"). Optional; omitted from output if absent in input GWAS summary statistics.- remove_ambiguous
Logical; if
TRUE(default), exclude ambiguous SNPs.- output_path
Character or
NULL; output file path. IfNULL(default), auto-generated as<sumstats_prefix>_munged.txtin the same directory assumstats_path, where<sumstats_prefix>is the filename prefix ofsumstats_pathwithout extension.
Value
Invisibly returns NULL. Filtered results written to output_path
as tab-delimited text with columns: CHR, POS, ID, REF, ALT, BETA, SE, AF (optional), GDS_ID.
GDS_ID is the 1-based variant index in the input GDS file to match variants
in the summary statistics with variants in the GDS file.
Details
Required columns:
The input summary statistics must contain columns for REF, ALT, BETA, and SE,
along with either (CHR and POS) when match_by = "CHR-POS" or ID when match_by = "ID".
The AF column is optional. Additional columns are allowed but ignored.
When match_by = "CHR-POS", accepted chromosome formats are autosomes only:
1-22, 01-22, chr1-chr22, or chr01-chr22 (case-insensitive).
The BETA column should represent the effect size estimate for the alternative allele
(log odds ratio for logistic regression or linear coefficient for linear regression),
and the SE column should contain the corresponding standard error.
Quality control filters for summary statistics (applied in order):
REF/ALT: single A/C/G/T nucleotides, biallelic
Ambiguous SNPs: optionally remove A/T, T/A, C/G, G/C
Autosomes: restrict to chromosomes 1-22 (when
match_by = "CHR-POS")POS:
POS > 0(whenmatch_by = "CHR-POS")ID: non-missing and non-empty (when
match_by = "ID")Duplicates: retain variants with unique
(CHR, POS)pairs (whenmatch_by = "CHR-POS") or uniqueID(whenmatch_by = "ID")Effect/SE:
BETA != NA,SE > 0AF range:
0 < AF < 1(if present)