Extract Local-Ancestry Tracts from Phased VCF and MSP (RFMix2/Gnomix)
Parses a phased VCF together with an MSP local-ancestry file and writes
ancestry-specific outputs in one or more formats: gds, txt, txt.gz,
vcf, vcf.gz.
Usage
extract_tracts(
  vcf_path,
  msp_path,
  num_ancs,
  output_dir = NULL,
  output_formats = "gds",
  chunk_size = 1024L
)
Arguments
- vcf_path
Character; input phased VCF path (.vcf or .vcf.gz).
- msp_path
Character; input MSP path.
- num_ancs
Integer; number of ancestral populations (must be > 1).
- output_dir
Character or NULL; output directory (must exist). If NULL (default), uses the directory of vcf_path.
- output_formats
Character vector; output format specification (for example "gds" or c("gds", "txt.gz")). Default is "gds". Supported values: gds, txt, txt.gz, vcf, vcf.gz. If both compressed and uncompressed versions are requested for the same type, compressed output is written: txt.gz over txt, and vcf.gz over vcf.
- chunk_size
Integer; number of variants processed per chunk. Default is 1024; higher values can improve speed but increase memory usage.
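The precedence rule for output_formats can be illustrated with a small sketch. The helper resolve_formats() below is hypothetical and is not part of the package; it only mirrors the documented behaviour (compressed wins over uncompressed for the same type) and assumes that unsupported values should raise an error.

```r
# Hypothetical helper (not part of the package): collapse a requested
# format vector following the documented rule that compressed output
# takes precedence over uncompressed output of the same type.
resolve_formats <- function(output_formats = "gds") {
  supported <- c("gds", "txt", "txt.gz", "vcf", "vcf.gz")
  unknown <- setdiff(output_formats, supported)
  if (length(unknown) > 0) {
    # Assumption: unsupported values are treated as an error.
    stop("Unsupported output format(s): ", paste(unknown, collapse = ", "))
  }
  formats <- unique(output_formats)
  # Drop "txt" when "txt.gz" is also requested, and likewise for "vcf".
  for (type in c("txt", "vcf")) {
    if (all(c(type, paste0(type, ".gz")) %in% formats)) {
      formats <- setdiff(formats, type)
    }
  }
  formats
}

resolve_formats(c("gds", "txt", "txt.gz"))
```

Under this sketch, requesting both txt and txt.gz leaves only gds and txt.gz in the resolved set.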
Value
Invisibly returns NULL. Output files are written to output_dir with a
filename prefix derived from vcf_path.
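A call might look like the following sketch. The file paths, the choice of three ancestries, and the chunk size are hypothetical placeholders. The call is guarded so the snippet does nothing unless extract_tracts() is available in the session and both input files exist.

```r
# Hypothetical input paths; replace with real files.
vcf_path <- "data/chr22.phased.vcf.gz"
msp_path <- "data/chr22.msp.tsv"

# Guarded so the snippet is a no-op unless extract_tracts() is available
# and both inputs exist.
if (exists("extract_tracts") && all(file.exists(vcf_path, msp_path))) {
  extract_tracts(
    vcf_path = vcf_path,
    msp_path = msp_path,
    num_ancs = 3L,                        # hypothetical: three ancestries
    output_dir = NULL,                    # write next to vcf_path
    output_formats = c("gds", "txt.gz"),
    chunk_size = 4096L                    # larger chunks: faster, more memory
  )
}
```

Because the function returns NULL invisibly, the call is made for its side effect of writing files.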
Details
Output files (prefix = basename of vcf_path without .vcf or .vcf.gz):
gds: one file <prefix>.gds containing sample.id, variant metadata (snp.chromosome, snp.position, snp.id, snp.ref, snp.alt), and ancestry-specific nodes dosage/anc0..ancK and hapcount/anc0..ancK.
txt / txt.gz: for each ancestry k, two files <prefix>.anc{k}.dosage.txt(.gz) and <prefix>.anc{k}.hapcount.txt(.gz). Columns are CHROM POS ID REF ALT followed by one column per sample.
vcf / vcf.gz: for each ancestry k, one file <prefix>.anc{k}.vcf(.gz) with FORMAT=GT; haplotypes not assigned to ancestry k are written as a dot (.).
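To make the naming scheme concrete, the sketch below lists the file names one would expect for a given input. expected_outputs() is a hypothetical helper, not a package function, and it assumes that ancestries are numbered 0 through num_ancs - 1 (so K = num_ancs - 1), which the text above suggests but does not state outright.

```r
# Hypothetical helper (not part of the package): list the output file
# names implied by the documented naming scheme.
expected_outputs <- function(vcf_path, num_ancs, output_formats = "gds") {
  # Prefix = basename of vcf_path without a trailing .vcf or .vcf.gz.
  prefix <- sub("\\.vcf(\\.gz)?$", "", basename(vcf_path))
  # Assumption: ancestries are numbered 0 .. num_ancs - 1.
  ancs <- seq_len(num_ancs) - 1L
  files <- character(0)
  for (fmt in output_formats) {
    if (fmt == "gds") {
      # A single GDS file holds all ancestries.
      files <- c(files, paste0(prefix, ".gds"))
    } else if (fmt %in% c("txt", "txt.gz")) {
      # Two text files (dosage, hapcount) per ancestry.
      for (k in ancs) {
        files <- c(files,
                   paste0(prefix, ".anc", k, c(".dosage.", ".hapcount."), fmt))
      }
    } else if (fmt %in% c("vcf", "vcf.gz")) {
      # One VCF per ancestry.
      files <- c(files, paste0(prefix, ".anc", ancs, ".", fmt))
    }
  }
  files
}

expected_outputs("/data/chr22.phased.vcf.gz", 2, c("gds", "txt.gz"))
```

The helper returns base names only; the actual files would sit in output_dir.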
Assumptions for inputs:
Biallelic variants.
No missing genotype/local-ancestry fields.
No duplicate variants.
Autosomal chromosomes only (1-22, 01-22, chr1-chr22, chr01-chr22; case-insensitive).
Alleles are uppercase A/C/G/T.
GT is the first sample subfield (GT:...).
GT is phased (0|0, 0|1, 1|0, 1|1) with single-digit allele codes.
One chromosome per VCF and per MSP file.
Sample order is consistent between VCF and MSP files.
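Some of these assumptions can be checked before running the function. The sketch below shows two hypothetical pre-flight helpers, is_autosome() and is_phased_gt(), written from the rules listed above. They are not part of the package and do not necessarily match how it validates input.

```r
# Hypothetical pre-flight checks (not part of the package).

# TRUE for 1-22, 01-22, chr1-chr22, chr01-chr22 (case-insensitive).
is_autosome <- function(chrom) {
  # Optional "chr" prefix, then 1-22 with an optional leading zero for 1-9.
  grepl("^(chr)?(0?[1-9]|1[0-9]|2[0-2])$", chrom, ignore.case = TRUE)
}

# TRUE when the first sample subfield is a phased biallelic genotype:
# one of 0|0, 0|1, 1|0, 1|1.
is_phased_gt <- function(sample_field) {
  # GT is assumed to be the first colon-separated subfield.
  gt <- sub(":.*$", "", sample_field)
  grepl("^[01]\\|[01]$", gt)
}

is_autosome(c("1", "01", "chr22", "CHR07", "chrX", "23"))
is_phased_gt(c("0|1:0.98", "1|1", "0/1", ".|1"))
```

Both helpers are vectorised, so they can be applied to a whole CHROM column or to all sample fields of a VCF record at once.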