Step 2

Step 2: performing single-variant association tests

For binary traits, saddle point approximation is used to account for case-control imbalance.
File formats for dosages/genotypes of genetic variants to be tested can be used: PLINK, VCF, BGEN, SAV
Conditional analysis based summary stats can be performed (–condition) can be performed in Step 2
To query and test a subset of markers
- both variant IDs (chr:pos:ref:alt) and range of chromosome positions (chr start end) (–idstoIncludeFile=./input/includeID.list_1, –rangestoIncludeFile=./input/includerange.txt)
- for VCF and SAV input, when query with chromosome range, only one range of chromosome positions can be specified (–rangestoIncludeFile=./input/includerange_oneline.txt)
–markers_per_chunk can be used to specify the number of markers to test and output as one chunk. default=10000. Note that a small number may slow down the job. It is required that this number is >= 1000.
If LOCO=TRUE (by default), –chrom MUST be specified, so genotype/dosage file should only contain one chromosome
For VCF/BCF/SAV input, –vcfField=DS to test dosages and –vcfField=GT to test genotypes. Use –vcfFile to specify VCF, BCF or SAV files. For VCF and BCF input, a .csi index file is required in the same directory. For SAV input, a .s1r index file created with savvy is required in the same directory. Index files need to have the same name as the VCF, BCF, or SAV files with suffix: csi or s1r. Note: –vcfFileIndex is used to specify the index file. This option will be removed in the next version (v1.1.4).
By default, missing genotypes/dosages will be imputed as the best guessed gentoypes/dosages (as round(2*freq) with –impute_method=best_guess). Note that currently dropping samples with missing genotypes/dosages is not supported
–sampleFile is used specify a file with sample IDs for bgen file.
When –sparseGRMFile and –sparseGRMSampleIDFile are specified in Step 2, Please use version >= 1.1.8 for association tests

#check the help info for step 2
Rscript step2_SPAtests.R --help

For binary traits, use –is_output_moreDetails=TRUE to output heterozygous and homozygous counts as well as allele frequencies in cases and controls
For binary traits,, effect sizes can be estimated more accurately through the Firth’s Bias-Reduced Logistic Regression by setting
–is_Firth_beta=TRUE and –pCutoffforFirth=0.05. The effect sizes of markers with p-value <= pCutoffforFirth will be estimated through the Firth’s Bias-Reduced Logistic Regression.

full GRM was used for fitting the model in Step 1. Variance ratio MUST be estimated.

–LOCO=TRUE is highly recommended to avoid proximal contamination
Using –bgenFile, –bgenFileIndex, –AlleleOrder, –sampleFile for bgen input

Rscript step2_SPAtests.R        \
        --bgenFile=./input/genotype_100markers.bgen    \
        --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
        --sampleFile=./input/samplelist.txt \
        --AlleleOrder=ref-first \
        --SAIGEOutputFile=./output/genotype_100markers_marker_bgen_fullGRMforNull_with_vr.txt    \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.05 \
        --is_output_moreDetails=TRUE    \
        --LOCO=TRUE

Sparse GRM was used for fitting the model in Step 1. Variance ratio is estimated

For Step 1 generated with version < 1.0.6, DO NOT specify –sparseGRMFile and –sparseGRMSampleIDFile
For Step 1 generated with version >= 1.0.6. Please specify –sparseGRMFile and –sparseGRMSampleIDFile. Specify –is_fastTest=TRUE for fast run

Rscript step2_SPAtests.R        \
        --bgenFile=./input/genotype_100markers.bgen    \
        --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
        --sampleFile=./input/samplelist.txt \
        --AlleleOrder=ref-first \
        --SAIGEOutputFile=./output/genotype_100markers_marker_bgen_Firth.txt    \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./output/example_binary_sparseGRM_vr.rda \
        --varianceRatioFile=./output/example_binary_sparseGRM_vr.varianceRatio.txt  \
        --LOCO=FALSE \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.1 \
        --is_output_moreDetails=TRUE \
        --sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
        --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt \
        --is_fastTest=TRUE

Sparse GRM was used for fitting the model in Step 1. Variance ratio was not estiamted in Step 1. Please specify –sparseGRMFile and –sparseGRMSampleIDFile.

Rscript step2_SPAtests.R        \
        --bgenFile=./input/genotype_100markers.bgen    \
        --bgenFileIndex=./input/genotype_100markers.bgen.bgi \
        --sampleFile=./input/samplelist.txt \
        --AlleleOrder=ref-first \
        --SAIGEOutputFile=./output/genotype_100markers_marker_bgen_sparseGRMforNull_no_vr.txt    \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./output/example_binary_sparseGRMforNull_no_vr.rda \
        --LOCO=FALSE \
        --is_Firth_beta=TRUE    \
        --pCutoffforFirth=0.1 \
        --is_output_moreDetails=TRUE	\
	--sparseGRMFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx   \
        --sparseGRMSampleIDFile=output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

More input file formats are supported: PLINK, VCF, and BCF

Using –bedFile, –bimFile, –famFile, –AlleleOrder for PLINK input

Rscript step2_SPAtests.R        \
        --bedFile=./input/genotype_100markers.bed       \
        --bimFile=./input/genotype_100markers.bim       \
        --famFile=./input/genotype_100markers.fam       \
        --AlleleOrder=alt-first \
        --SAIGEOutputFile=./output/genotype_100markers_marker_plink.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt   \
        --LOCO=TRUE \
        --is_output_moreDetails=TRUE

Using –vcfFile， –vcfFileIndex， –vcfField, –chrom for VCF, BCF, and SAV input
Pleas refer to the Set-based test step 2 (example #8) for more example codes when using VCF files as input

Rscript step2_SPAtests.R        \
        --vcfFile=./input/genotype_100markers.vcf.gz    \
        --vcfFileIndex=./input/genotype_100markers.vcf.gz.csi     \
        --vcfField=GT   \
        --SAIGEOutputFile=./output/genotype_100markers_marker_vcf.txt \
        --chrom=1       \
        --minMAF=0 \
        --minMAC=20 \
        --LOCO=FALSE \
        --GMMATmodelFile=./output/example_binary.rda \
        --varianceRatioFile=./output/example_binary.varianceRatio.txt   \
        --is_output_moreDetails=TRUE

Conditional analysis

–condition = Genetic marker ids (chr:pos:ref:alt) separated by comma. e.g.chr3:101651171:C:T,chr3:101651186:G:A

conditioning markers MUST be specified in the same order as stored in the dosage file (e.g. in the VCF file)

Rscript step2_SPAtests.R        \
      --vcfFile=./input/genotype_100markers.vcf.gz    \
      --vcfFileIndex=./input/genotype_100markers.vcf.gz.csi     \
      --vcfField=GT   \
      --SAIGEOutputFile=./output/genotype_100markers_marker_vcf_cond.txt \
      --chrom=1       \
      --minMAF=0 \
      --minMAC=20 \
      --GMMATmodelFile=./output/example_binary.rda \
      --varianceRatioFile=./output/example_binary.varianceRatio.txt   \
      --is_output_moreDetails=TRUE    \
      --condition=1:13:A:C,1:79:A:C

Input files

(Required) Dosage file SAIGE supports different formats for dosages: PLINK, VCF, BCF, BGEN and SAV.

PLINK

 ./input/genotype_100markers.bed
 ./input/genotype_100markers.bim
 ./input/genotype_100markers.fam

BGEN

 ./input/genotype_100markers.bgen
 ./input/genotype_100markers.bgen.bgi 

VCF containing genotypes

 #go to input
 cd ./input
 #index file can be generated using tabix
 tabix --csi -p vcf genotype_10markers.vcf.gz
 zcat genotype_10markers.vcf.gz | less -S
 genotype_10markers.vcf.gz.csi

VCF containing dosages

 #index file can be generated using tabix
 tabix --csi -p vcf dosage_10markers.vcf.gz    
 zcat dosage_10markers.vcf.gz | less -S
 dosage_10markers.vcf.gz.csi

 dosage_10markers.sav
 dosage_10markers.sav.s1r

(Optional. Only for BGEN file not containing sample IDs) Sample file <br> The option was originally for BGEN file that does not contain sample information.

The file can be in two different formats: contains one column for sample IDs corresponding to the sample order in the dosage file. No header is included. less -S ./input/samplelist.txt or

in the default bgen .sample format

```
less -S ./input/genotype_100markers_2chr.sample
```

(Required. Output in Step 1) Model file from step 1
```
 ./output/example_binary.rda
```
(Optional. Output in Step 1)Variance ratio file from step 1
```
 ./output/example_binary.varianceRatio.txt
```

(Optional. Same as the input in Step 1) sparse GRM file and the sample ID file for the sparse GRM. Please see Step 0 for more details.

 ./output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx
 ./output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx.sampleIDs.txt

 ##the sparse matrix can be read and viewed using the R library Matrix
 library(Matrix)
 sparseGRM = readMM("./output/sparseGRM_relatednessCutoff_0.125_1000_randomMarkersUsed.sparseGRM.mtx") 

Output file

A file with association test results

 less -S ./output/genotype_100markers_marker_vcf.txt

An index for the output file. This file contains the progress of the association test. If specifying –is_overwrite_output=FALSE in Step 2, the program will check this index and continue the unfinished analysis instead of starting over from the beginining
```
 less -S ./output/genotype_100markers_marker_vcf.txt.index 
```

NOTE:

Association results are with regard to Allele2.

#check the header
head -n 1 output/genotype_100markers_marker_vcf_cond.txt

CHR: chromosome
POS: genome position 
SNPID: variant ID
Allele1: allele 1
Allele2: allele 2
AC_Allele2: allele count of allele 2
AF_Allele2: allele frequency of allele 2
MissingRate: missing rate (If the markers in the dosage/genotype input are imputed with is_imputed_data=TRUE, imputationInfo will be output instead of MissingRate)
imputationInfo: imputation info (output with is_imputed_data=TRUE). If not in dosage/genotype input file, will output 1
BETA: effect size of allele 2
SE: standard error of BETA
Tstat: score statistic of allele 2
var: estimated variance of score statistic
p.value: p value (with SPA applied for binary traits)
p.value.NA: p value when SPA is not applied (only for binary traits)
Is.SPA.converge: whether SPA is converged or not (only for binary traits)

#if --condition= is used for conditioning analysis, the conditional analysis results will be output  
BETA_c, SE_c, Tstat_c, var_c, p.value_c, p.value.NA_c 


AF_case: allele frequency of allele 2 in cases (only for binary traits)
AF_ctrl: allele frequency of allele 2 in controls (only for binary traits)
N_case: sample size in cases (only for binary traits)
N_ctrl: sample size in controls (only for binary traits)

if --is_output_moreDetails=TRUE
homN_Allele2_cases, hetN_Allele2_cases, homN_Allele2_ctrls, hetN_Allele2_ctrls will be output for sample sizes with different genotypes in cases and controls (only for binary traits)