Tools

The Estonian Genome Centre has created several tools for genome research. Links to these tools along with additional information are displayed below.

GRANVIL

Genome-wide association studies have been very successful in identifying moderate genetic effects of common alleles. However, a large portion of the genetic effect remains unidentified, and alternative approaches are needed to gain information about uncommon and rare variants as well. In genome-wide association analysis, the power to detect true genotype-phenotype associations decreases as the minor-allele frequency falls. Methods that combine information from several markers within a genomic region are therefore preferable to single-marker tests.

GRANVIL (Gene- or Region-based ANalysis of Variants of Intermediate and Low frequency) is an implementation of a method described by Morris and Zeggini [1] for rare-variant analysis of binary or quantitative phenotypes. The method is based on the accumulation of minor alleles at rare or uncommon markers discovered through dense genotyping or resequencing. Association analyses are performed on genes or other pre-defined regions chosen by the analyst.

  [1] An evaluation of statistical approaches to rare variant analysis in genetic association studies (2010). Morris, A. P.; Zeggini, E. Genetic Epidemiology 34:188-193.  

Citation

Magi, Kumar, Morris: Assessing the impact of missing genotype data in rare variant association analysis. BMC Proceedings 2011

GRANVIL ver. 2.1.1 (changelog)

Gene list (positions dbsnp37)

Coding region marker extraction list (positions dbsnp37)

Non-synonymous marker extraction list (positions dbsnp37)

Copy the GRANVILv*.zip file to your computer and unzip it:

unzip GRANVILv*.zip

To compile GRANVIL, use the command:

make

in the folder where files have been unpacked. The program can be run by typing:

./GRANVIL

Input files

To run GRANVIL, you need input files in SNPTEST v.2 format and a GENELIST file. The SNPTEST file formats are described in the SNPTEST documentation. For a case-control analysis, use a single gen file and a single sample file, with the phenotype coded 0=control; 1=case.

GENOTYPE file:

1 rs1 11 A T 1 0 0 1 0 0 1 0 0
1 rs2 210 A T 0 1 0 1 0 0 1 0 0
1 rs3 300 A T 1 0 0 1 0 0 1 0 0
1 rs4 4637 A T 1 0 0 1 0 0 1 0 0
1 rs5 5555 A T 1 0 0 1 0 0 1 0 0

(Genotype file can be gzipped, if it has *.gz extension)  

SAMPLE file:

Sample_id Subject_id Missing Gender Phenotype PhenotypeQ
0 0 0 D B P
1 1 0 1 1 4.1
2 2 0 1 1 4.2
3 3 0 1 0 4.3

This file contains one covariate (Gender) and two phenotypes: the first is a case-control phenotype (for logistic regression) and the second is a continuous phenotype (for linear regression). In the current version only continuous covariates are enabled. Discrete covariates can be used if they have two classes (e.g. males=1, females=0). To adjust for a centre effect or another multi-category covariate, create N-1 dummy variables coded 0 or 1 for N centres and code these variables as continuous (number 3 in the second row of the sample file).
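The N-1 dummy-variable coding described above can be sketched in Python as follows. This is an illustrative helper, not part of GRANVIL; the function name dummy_code, the centre labels and the choice of the alphabetically first centre as the reference level are assumptions made for the example.

```python
def dummy_code(centres):
    """Return (column_names, rows) with N-1 indicator columns for N centres.

    Assumption: the first centre in sorted order is the reference level
    and therefore gets no column of its own.
    """
    levels = sorted(set(centres))
    others = levels[1:]  # every level except the reference
    names = [f"centre_{level}" for level in others]
    rows = [[1 if c == level else 0 for level in others] for c in centres]
    return names, rows


# Hypothetical centre labels for four samples
names, rows = dummy_code(["Tartu", "Oxford", "Tartu", "Helsinki"])
print(names)
for row in rows:
    print(row)
```

Each resulting column would then be added to the sample file as a separate covariate.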

GENELIST file:  

A1 1 11 111
A5 1 2500 27300
A13 1 14 3000
A15 1 9 24780

The genelist file contains four columns:

1. gene ID
2. chromosome
3. start position in bp
4. end position in bp

Make sure that the positions in the genelist file are from the same dbSNP version as those in the GEN file.
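As an illustration of how the genelist columns and the flanking region size (-f, in kb) define a gene region, the following Python sketch selects the markers from the example GENOTYPE file that fall inside each example gene. The helper names and the inclusive treatment of region boundaries are assumptions; GRANVIL's own handling may differ.

```python
GENELIST = """A1 1 11 111
A5 1 2500 27300
A13 1 14 3000
A15 1 9 24780"""

# (chromosome, marker name, position) taken from the example GENOTYPE file
MARKERS = [("1", "rs1", 11), ("1", "rs2", 210), ("1", "rs3", 300),
           ("1", "rs4", 4637), ("1", "rs5", 5555)]


def parse_genelist(text):
    """Parse lines of: gene ID, chromosome, start (bp), end (bp)."""
    genes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        gene_id, chrom, start, end = fields
        genes.append((gene_id, chrom, int(start), int(end)))
    return genes


def markers_in_gene(gene, markers, flanking_kb=0):
    """Names of markers on the gene's chromosome within start..end,
    widened by flanking_kb on both sides (boundaries assumed inclusive)."""
    _, chrom, start, end = gene
    pad = flanking_kb * 1000
    return [name for m_chrom, name, pos in markers
            if m_chrom == chrom and start - pad <= pos <= end + pad]


for gene in parse_genelist(GENELIST):
    print(gene[0], markers_in_gene(gene, MARKERS))
```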

Command line options:

./GRANVIL [-f ] -g -s -x [-o ] -m [--print_gene ] ... [--sex ] -p [--debug] [--cov_all] [--cond_marker ] ... [--cond_gene ] ... [--cov_name ] ... [-r ] [--call_thresh ] [--imp_thresh ] [--missing_code ] [--chr ] [--extract_markers ] [--extract_samples ] [--exclude_markers ] [--exclude_samples ] [--] [--version] [-h]

Where:

-f , --flanking  This specifies the flanking region size in kb (default 0 kb)

-g , --gen  (required) This specifies the genotype file. It can be gzipped, but must then have a *.gz extension.

-s , --sample  (required) This specifies the sample file

-x , --genmap  (required) This specifies the gene map file

-o , --out  This specifies the output file root

-m , --method  (required) This option controls how genotype uncertainty is taken into account: (a) threshold - genotypes with probability >= 0.95 will be analysed; (b) expected - genotype dosage will be constructed based on the sum of probabilities of genotypes containing one or two copies of the minor allele.

-p , --pheno  (required) This specifies the phenotype to test

--lowmem  (REMOVED FROM GRANVILv2.1) In case of limited memory, the lowmem options can be used. If the genelist and genotype files are sorted in the same manner, the sorted option can be used (faster). If the markers are in random order, the unsorted option must be used

--debug  Debug mode enabled

--cov_all  All covariates are used in the analysis

--cov_name (accepted multiple times)  Name of covariate to use (in case of several covariates, use this option multiple times, e.g. --cov_name SEX --cov_name AGE)

-r , --rare_thresh  Minor allele frequency cutoff for defining rare variants (default 0.05)

--call_thresh  Call-rate threshold for the best-guess genotypes (default 0.9)

--imp_thresh  Imputation score threshold for including markers (default 0.4)

--missing_code  This specifies the coding for missing data (default NA)

--chr  All markers in the genotype file are forced to be from this chromosome; the first column of the genotype file is ignored

--extract_samples  This specifies a file listing the samples to extract

--exclude_samples  This specifies a file listing the samples to exclude

--extract_markers  This specifies a file listing the markers to extract. The marker extraction list must be in LINKAGE map format (4 columns: chromosome, markername, genetic_distance, position). Markers are matched by chromosome name and position to prevent problems with differing names of 1000 Genomes imputed markers.

--exclude_markers  This specifies a file listing the markers to exclude. The marker exclusion list must be in LINKAGE map format (4 columns: chromosome, markername, genetic_distance, position). Markers are matched by chromosome name and position to prevent problems with differing names of 1000 Genomes imputed markers.

--print_gene (accepted multiple times)  Name of gene model to be printed (in case of several genes, use this option multiple times). Option available in GRANVILv2.1

--sex  This specifies the gender column name in the sample file for sex-stratified analysis (men=0, women=1). Option available in GRANVILv2.1

--cond_marker (accepted multiple times)  Name of marker to use as a covariate in conditional analysis (in case of several markers, use this option multiple times). Option available in GRANVILv2.1

--cond_gene (accepted multiple times)  Name of gene to use as a covariate in conditional analysis (in case of several genes, use this option multiple times). Option available in GRANVILv2.1

--, --ignore_rest  Ignores the rest of the labeled arguments following this flag.

--version  Displays version information and exits.

-h, --help  Displays usage information and exits.

QUANTITATIVE TEST:  ./GRANVIL -g data.impute.txt -s data.sample.txt --pheno PhenotypeQ --genmap ucsc_genes_b37.txt -m expected

CASE-CONTROL TEST:  ./GRANVIL -g data.impute.txt -s data.sample.txt --pheno Phenotype --genmap ucsc_genes_b37.txt -m expected
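
The two settings of the -m option can be illustrated for a single genotype probability triplet from the GENOTYPE file. The Python sketch below is one literal reading of the option description, not GRANVIL's code: best_guess applies the 0.95 probability threshold, and carrier_probability sums the probabilities of the genotypes carrying one or two copies of the minor allele. Which allele is the minor one is passed in explicitly as an assumption.

```python
def best_guess(probs, threshold=0.95):
    """Return the index (0, 1 or 2) of the most probable genotype,
    or None (treated as missing) if its probability is below threshold."""
    top = max(range(3), key=lambda g: probs[g])
    return top if probs[top] >= threshold else None


def carrier_probability(probs, minor_is_second=True):
    """Sum the probabilities of genotypes with one or two minor alleles.

    probs is (P(AA), P(AB), P(BB)); minor_is_second states whether B is
    the minor allele (an assumption supplied by the caller).
    """
    p_aa, p_ab, p_bb = probs
    return p_ab + (p_bb if minor_is_second else p_aa)


print(best_guess((1, 0, 0)))                 # confidently called genotype
print(best_guess((0.6, 0.3, 0.1)))           # uncertain genotype
print(carrier_probability((0.6, 0.3, 0.1)))  # used instead of a hard call
```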

Output file format

Results file contains following columns:

1. gene - gene ID
2. marker_count - number of rare markers in gene region
3. sample_count - number of samples in analysis
4. rare_variant_sum - count of rare alleles found in individuals
5. total_maf - sum of MAF of all used markers in given gene region
6. average_maf - average MAF of used markers in given gene region (total_maf / marker_count)
7. beta - effect size
8. se - std. error of effect
9. z - z-statistic
10. p - p-value
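
The relationships between some of these columns can be checked by hand. The Python sketch below computes average_maf as total_maf / marker_count, as stated above, and assumes the usual Wald construction for the test statistic (z = beta / se with a two-sided normal p-value); that GRANVIL derives z and p in exactly this way is an assumption.

```python
import math


def summarise(total_maf, marker_count, beta, se):
    """Return (average_maf, z, p) for one gene region."""
    average_maf = total_maf / marker_count
    z = beta / se                          # assumed Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal p-value
    return average_maf, z, p


# Made-up values for a region with four rare markers
average_maf, z, p = summarise(total_maf=0.06, marker_count=4, beta=0.5, se=0.25)
print(average_maf, z, p)
```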

GWAMA – Software tool for meta analysis of whole genome association data

Background

Genome-wide association (GWA) studies have proved to be extremely successful in identifying moderate genetic effects contributing to complex human phenotypes. However, to gain insights into increasingly more modest signals of association, samples of many thousands of individuals are required. One approach to overcome this problem is to combine the results of GWA studies from closely related populations via meta-analysis, without direct exchange of genotype and phenotype data.

We have developed the GWAMA (Genome-Wide Association Meta Analysis) software to perform meta-analysis of the results of GWA studies of binary or quantitative phenotypes. Fixed- and random-effect meta-analyses are performed for both directly genotyped and imputed SNPs using estimates of the allelic odds ratio and 95% confidence interval for binary traits, and estimates of the allelic effect size and standard error for quantitative phenotypes. GWAMA can be used for analysing the results of all different genetic models (multiplicative, additive, dominant, recessive). The software incorporates error trapping facilities to identify strand alignment errors and allele flipping, and performs tests of heterogeneity of effects between studies.

Citations

GWAMA PROGRAM:
Magi R, Morris AP: GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics 2010, 11:288.

SEX-SPECIFIC ANALYSIS METHOD:
Magi R, Lindgren CM, Morris AP: Meta-analysis of sex-specific genome-wide association studies. Genetic Epidemiology 2010, 34(8):846-853.

Download the latest version: GWAMA ver. 2.2.2 (for unix)(changelog)

Download sample files: samples.zip

Download marker location maps: dbsnp37.txt.gz (SNPs from dbSNP build 37)

Download R scripts for Manhattan and QQ Plots: MANH.R, QQ.R

Download perl scripts for reformatting SNPTEST and PLINK output to GWAMA input format:

SNPTEST2GWAMA.pl (for SNPTEST v.1)

SNPTEST2_2_GWAMA.pl (for SNPTEST v.2.4 and older)

SNPTEST2.5_2_GWAMA.pl (for SNPTEST v.2.5)

PLINK2GWAMA.pl

Introduction

GWAMA (Genome-Wide Association Meta Analysis) software has been developed to perform meta-analysis of the results of GWA studies of binary or quantitative phenotypes. The software incorporates error trapping facilities to identify strand alignment errors and allele flipping, and performs tests of heterogeneity of effects between studies.

Installation (UNIX)

Copy the gwama.zip file to your computer and unzip it:

unzip gwama.zip

To compile GWAMA, use the command:

make

in the folder where files have been unpacked. The program can be run by typing:

GWAMA

Installation (WINDOWS)

Copy the gwama.zip file to your computer and unpack the *.msi file.

Double-click on the *.msi file and follow the installation instructions.

The program can be run by typing:

c:\Program Files\WTCHG\gwama\gwama (if installed into default folder)

In case of any questions or comments, please don't hesitate to contact the authors.

Input files

To run GWAMA you have to create an input file (default name “gwama.in”) that lists all study files, with each results file on a separate row. If the gender-specific analysis option (--sex) is used, a second column should indicate whether the cohort contains male (M) or female (F) data.

Sample “gwama.in” file:

Pop1.txt M
Pop2.txt M
Pop3.txt F

Each GWA study file has mandatory column headers:

1) MARKERNAME – snp name

2) EA – effect allele

3) NEA – non effect allele

4) OR - odds ratio

5) OR_95L - lower confidence interval of OR

6) OR_95U - upper confidence interval of OR

In case of quantitative trait:

4) BETA – beta

5) SE – std. error

Study files might also contain columns:

7) N - number of samples

8) EAF – effect allele frequency

9) STRAND – marker strand (if the column is missing then program expects all markers being on positive strand)

10) IMPUTED – if marker is imputed or not (if the column is missing then all markers are counted as directly genotyped ones)

Sample study file (NB! This file is a quantitative trait one and GWAMA has to be run with -qt command line option):

MARKERNAME STRAND CHR POS IMP EA NEA BETA SE
rs12565286 + 1 761153 0 G C -0.02 0.0403
rs2977670 + 1 763754 0 C G -0.01 0.40612
rs12138618 + 1 790098 0 G A -0.07 0.37
rs3094315 + 1 792429 0 G A 0.0258 0.1012
rs3131968 + 1 794055 0 G A -0.373 0.0101
rs2519016 + 1 805811 0 T C 0.26 0.3472
rs12562034 + 1 808311 0 G A 0.0092 0.2

Input files must be either tab or space delimited. Files must not have empty columns as multiple separators are treated as one. Files may contain additional columns, which are not used by GWAMA.
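
Binary-trait files carry OR, OR_95L and OR_95U, while quantitative-trait files carry BETA and SE. The two representations are related on the log scale; the Python sketch below shows the standard conversion, assuming a symmetric 95% confidence interval on the log-odds scale (normal quantile 1.959964). The function name is hypothetical and the code is not taken from GWAMA.

```python
import math

Z_95 = 1.959964  # normal quantile for a two-sided 95% interval


def or_to_beta_se(odds_ratio, or_95l, or_95u):
    """Convert an odds ratio and its 95% CI to a log-odds effect and SE."""
    beta = math.log(odds_ratio)
    se = (math.log(or_95u) - math.log(or_95l)) / (2 * Z_95)
    return beta, se


# Made-up example: OR 1.5 with 95% CI 1.2 to 1.875
beta, se = or_to_beta_se(1.5, 1.2, 1.875)
print(beta, se)
```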

Running GWAMA

Command line options:

GWAMA

--filelist {filename} or -i {filename} Specify studies' result files. Default = gwama.in

--output {fileroot} or -o {fileroot} Specify file root for output of analysis. Default = gwama (gwama.out, gwama.gc.out)

--random or -r Use random effect correction. Default = disabled

--genomic_control or -gc Use genomic control for adjusting studies' result files. Default = disabled

--genomic_control_output or -gco Use genomic control on meta-analysis summary (i.e. results of meta-analysis are corrected for gc). Default = disabled

--quantitative or -qt Select quantitative trait version (BETA and SE columns). Default = binary trait

--map {filename} or -m {filename} Select file name for marker map.

--threshold {0-1} or -t {0-1} The p-value threshold for showing direction in summary effect directions. Default = 1

--no_alleles No allele information has been given. Expecting always the same EA.

--indel_alleles Allele labels might contain more than a single letter. No strand checks.

--sex Run gender-differentiated and gender-heterogeneity analysis (method described in Magi, Lindgren & Morris 2010). Gender info must be provided in the filelist file (the second column after the file names is either M or F).

--name_marker alternative header to marker name column

--name_strand alternative header to strand column

--name_n alternative header to sample size col

--name_ea alternative header to effect allele column

--name_nea alternative header to non-effect allele column

--name_eaf alternative header to effect allele frequency column

--name_beta alternative header to beta column

--name_se alternative header to std. err. col

--name_or alternative header to OR column

--name_or_95l alternative header to OR 95L column

--name_or_95u alternative header to OR 95U column

--help or -h Print this help

--version or -v Print GWAMA version number

Output files

GWAMA generates the following output files:

gwama.out (or 'fileroot'.out if --output option is used)
This file contains the results of the meta-analysis. The output file has the following columns:

chromosome - Marker chromosome
position - Marker position (bp)
rs_number - Marker ID
reference_allele - Effect allele
other_allele - Non effect allele

OR - Overall odds ratio for meta-analysis
OR_95L - Lower 95% CI for OR
OR_95U - Upper 95% CI for OR

IN CASE OF QUANTITATIVE TRAIT (-qt)
beta - Overall beta value for meta-analysis
beta_95L - Lower 95% CI for BETA
beta_95U - Upper 95% CI for BETA

z - Z-score
p-value - Meta-analysis p-value
-log10_p-value - Absolute value of the base-10 logarithm of the meta-analysis p-value
q_statistic - Cochran's heterogeneity statistic
q_p-value - Cochran's heterogeneity statistic's p-value
i2 - Heterogeneity index I2 by Higgins et al 2003
n_studies - Number of studies with marker present
n_samples - Number of samples with marker present (will be NA if marker is present in any input file where N column is not present)
effects - Summary of effect directions ('+' - positive effect of reference allele, '-' - negative effect of reference allele, '0' - no effect (or non-significant effect) of reference allele, '?' - missing data)

gwama.gc.out (or 'fileroot'.gc.out if --output option is used)
This file contains lambda values for GC correction. The file is only generated if the -gc option is used.

gwama.log.out
This file contains all log information about the current GWAMA run. Each error and warning has a unique error code. More information about them can be found in the gwama.err.out file.

gwama.err.out
This file contains all errors and warnings generated during the GWAMA run. Information about any error can be searched for by its error code. For example in UNIX:
grep E000000001 gwama.err.out
gives information about error E000000001
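
To show how columns such as beta, z, q_statistic and i2 relate to the per-study inputs, here is a minimal Python sketch of a textbook inverse-variance fixed-effect meta-analysis with Cochran's Q and the I2 index. It illustrates the general technique under standard formulas and is not GWAMA's implementation; the function name and the example numbers are made up.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance fixed-effect meta-analysis of per-study effects."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    # Cochran's Q: weighted squared deviations from the pooled effect
    q = sum(w * (b - beta) ** 2 for w, b in zip(weights, betas))
    df = len(betas) - 1
    # I2 (in percent): share of variation beyond what chance would explain
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return {"beta": beta, "se": se, "z": beta / se, "q": q, "i2": i2}


# Two hypothetical studies with equal standard errors
print(fixed_effect_meta([0.1, 0.3], [0.1, 0.1]))
```

For binary traits the same calculation would be applied to log odds ratios rather than to the odds ratios themselves.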

Gender Specific Analysis

If the gender-specific analysis option is used, additional columns are added to the output file. All male_ and female_ columns are calculated using cohorts with defined gender.

male_eaf
male_OR (or beta if quantitative trait is analysed)
male_OR_se
male_OR_95L
male_OR_95U
male_z
male_p-value
male_n_studies
male_n_samples
female_eaf
female_OR
female_OR_se
female_OR_95L
female_OR_95U
female_z
female_p-value
female_n_studies
female_n_samples
gender_differentiated_p-value - combined p-value of males and females assuming different effect sizes between genders (2 degrees of freedom)
gender_heterogeneity_p-value - heterogeneity between genders (1 degree of freedom)
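
The degrees of freedom quoted for the last two columns can be illustrated with a small Python sketch. It assumes, following the general idea of the sex-specific method cited above, that the gender-differentiated statistic is the sum of the male and female chi-square statistics (2 df) and that the heterogeneity statistic is this sum minus the gender-combined chi-square (1 df). The function name and inputs are hypothetical and GWAMA's exact computation may differ.

```python
import math


def sex_tests(beta_m, se_m, beta_f, se_f):
    """Return (gender-differentiated p, gender-heterogeneity p)."""
    chi2_m = (beta_m / se_m) ** 2
    chi2_f = (beta_f / se_f) ** 2
    chi2_diff = chi2_m + chi2_f                      # 2 degrees of freedom

    # Gender-combined fixed-effect estimate (inverse-variance weights)
    w_m, w_f = 1 / se_m ** 2, 1 / se_f ** 2
    beta_c = (w_m * beta_m + w_f * beta_f) / (w_m + w_f)
    chi2_comb = beta_c ** 2 * (w_m + w_f)            # 1 degree of freedom

    chi2_het = max(0.0, chi2_diff - chi2_comb)       # 1 degree of freedom

    p_diff = math.exp(-chi2_diff / 2)                # chi-square tail, 2 df
    p_het = math.erfc(math.sqrt(chi2_het / 2))       # chi-square tail, 1 df
    return p_diff, p_het


# Made-up male and female meta-analysis results
print(sex_tests(beta_m=0.2, se_m=0.1, beta_f=0.0, se_f=0.1))
```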

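The gender-specific analysis framework is described in Magi, Lindgren & Morris: Meta-analysis of sex-specific genome-wide association studies. Genetic Epidemiology 2010, 34(8):846-853.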

Creating plots

Manhattan and QQ plots can be created with the accompanying R scripts.

R --slave --vanilla < MANH.R

R --slave --vanilla < QQ.R

By default they expect input file name "gwama.out" and they create output files: "gwama.out.qq.png" and "gwama.out.manh.png". Different names can be used as:

R --slave --vanilla --args input=inputfilename out=outputfilename < QQ.R

A Manhattan plot can be drawn if chromosomal positions have been added to the file (for example with the command line option --map hapmap35.map)

R version 2.9.0 or later must be used with png support

Citation

Magi R, Morris AP: GWAMA: software for genome-wide association meta-analysis. BMC Bioinformatics 2010, 11:288.

Magi R, Lindgren CM, Morris AP: Meta-analysis of sex-specific genome-wide association studies. Genetic Epidemiology 2010, 34(8):846-853.

Reedik Magi
Estonian Genome Center
University of Tartu
Tartu 51010
Estonia

or

Andrew P Morris
Wellcome Trust Centre for Human Genetics
Roosevelt Drive
Oxford
Oxfordshire
OX3 7BN
United Kingdom

Email addresses are of the form firstname.lastname@ut.ee and firstname.lastname@well.ox.ac.uk

MIXFIT

MixFit is a multi-dimensional best-fit script that assigns individual ancestry components to individuals of unknown ancestry by comparing them with reference populations using genome-wide data.

The main features of MixFit are:

Outcome. The outcome is three numerical ancestry components chosen from among the reference populations. Assignment is performed via multi-dimensional best fit.
Assignment reliability. Assignment reliability is judged based on several statistical parameters calculated by the script and explained in the user manual (see below).
Input data. Input data are in the form of similarity matrices (chunkcount matrices) prepared with a pipeline consisting of SHAPEIT and ChromoPainter (see below). MixFit, however, is not restricted to using the output of this particular pipeline; it can be used in a wide variety of situations.
Flexibility. Several parameters can be controlled by the user (see user manual). This makes it easy to modify the assignment method.

MixFit is typically used to study genetic ancestry of an individual provided that a suitable data set is available for the relevant reference populations.

Citation

A scientific article has been written to demonstrate MixFit but the article has not yet been published:

Toomas Haller, Liis Leitsalu, Krista Fischer, Marja-Liisa Nuotio, Tõnu Esko, Dorothea Irene Boomsma, Kirsten Ohm Kyvik, Tim D Spector, Markus Perola, Andres Metspalu. MixFit: methodology for computing ancestry-related genetic scores at the individual level and its application to the Estonian and Finnish population studies.

Downloads (user manual is included)

Detailed instructions can be downloaded here. Short instructions are found below.
(Attn: Reading the detailed instructions is essential for being able to fully use the MixFit script!)

MixFit analysis is carried out like this:

Minimal example for running MixFit:
./MIXFIT -file unknowns.txt -ref references.txt -refpops 22 -out results.txt

Full example for running MixFit:
./MIXFIT -file unknowns.txt -ref references.txt -refpops 22 -out results.txt -delim space -header yes -refheader yes -plimit 0.1 -step 0.05 -a1 0.3 -a2 0.2 -a3 0.01

-file: name of the input file (here the array of 22)
-ref: name of the reference file (here the matrix of 22 x 22)
-out: output file name
-delim: matrix/array delimiter; options: “tab” (default), “space”, “colon”, “semicolon”, “comma”, or any freely selected text
-header: whether the input array has a vertical header; options: “no” (default), “yes”
-refheader: whether the input matrix has a vertical header; options: “no” (default), “yes”
-refpops: the number of reference populations (here 22)
-plimit: ancestry fraction (value) below which a component is considered irrelevant and is removed from consideration (in which case fewer than 3 ancestry components are reported); this can be any number between 0 and 1 (default is 0.1)
-step: fraction by which each reference population weight is incremented during the process of best fitting; default = 0.05
-choosebest: allows the identities of some ancestry components to be fixed before the best-fit search, according to the overall similarity between the references and the unknown. For example, “-choosebest 1” immediately selects the overall most similar reference population and uses it as one of the components by including it in every best-fit simulation. The default value is 0, which generally makes the most sense.
-missing: how a missing value is denoted, default = “NA”
-a1: when best fit is carried out by systematically varying the ancestry components, fluctuations occur between the best and worst fits. The best fits are expressed as minima in the fluctuations. These minima are recorded for candidate selection later in the algorithm. This flag allows one to change the fraction of best fits stored for later candidate selection. Default = 0.1 (meaning that 10% of the minima are considered for compiling the ancestry candidate list).
-a2: each ancestry assignment as it comes out of the -a1 filter is associated with a GOF (goodness of fit) score. These potential assignments are sorted according to the GOF score and only the lowest scores are allowed to pass. This flag determines how many (what fraction) of the best assignments pass to the next round, where they are averaged to find the 3 top-scoring ancestry components. Default = 0.1 (meaning that 10% of the assignments with the best GOF scores pass).
-a3: in the final simulation the ancestry components are selected but their relative ratios are unknown, so there is one more simulation where the component amounts are systematically varied. The best answer that this step gives is a function of input uncertainty. Therefore MixFit allows the user to average a certain number of the best solutions. Default = 0.1 (meaning that 10% of the best solutions will be averaged for the very final ancestry component ratios). Note that this number should generally be small.
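
To give a feel for what the -step parameter controls, the following Python sketch performs a plain grid search over mixture weights that sum to 1, incrementing each weight by step and keeping the mix with the smallest squared error. It is a simplified illustration only: it does not reproduce MixFit's candidate selection, GOF scoring or the -a1/-a2/-a3 filters, and the function name and toy data are assumptions.

```python
from itertools import product


def best_mix(unknown, references, step=0.05):
    """Grid-search weights (summing to 1) over reference profiles.

    Returns (squared_error, weights) for the best-fitting mixture.
    """
    n_steps = round(1 / step)  # weights are multiples of 1/n_steps
    best = None
    for counts in product(range(n_steps + 1), repeat=len(references) - 1):
        last = n_steps - sum(counts)
        if last < 0:
            continue  # weights would exceed 1
        weights = [c / n_steps for c in counts] + [last / n_steps]
        mix = [sum(w * ref[j] for w, ref in zip(weights, references))
               for j in range(len(unknown))]
        error = sum((u - m) ** 2 for u, m in zip(unknown, mix))
        if best is None or error < best[0]:
            best = (error, weights)
    return best


# Toy similarity profiles for three hypothetical reference populations
references = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
error, weights = best_mix([0.5, 0.3, 0.2], references, step=0.05)
print(error, weights)
```

A smaller step gives a finer grid at the cost of more combinations to evaluate.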

Please contact us if you have any questions or suggestions:
toomas.haller@ut.ee
tom@toomashaller.com

MR-MEGA

MR-MEGA (Meta-Regression of Multi-AncEstry Genetic Association) is a tool to detect and fine-map complex trait association signals via multi-ancestry meta-regression. This approach uses genome-wide metrics of diversity between populations to derive axes of genetic variation via multi-dimensional scaling [Purcell 2007]. Allelic effects of a variant across GWAS, weighted by their corresponding standard errors, can then be modelled in a linear regression framework, including the axes of genetic variation as covariates. The flexibility of this model enables partitioning of the heterogeneity into components due to ancestry and residual variation, which would be expected to improve fine-mapping resolution.

Questions and suggestions concerning the method should be sent to: apmorris@liverpool.ac.uk

Citation

Please cite the paper:

Mägi R, Horikoshi M, Sofer T, Mahajan A, Kitajima H, Franceschini N, McCarthy MI; COGENT-Kidney Consortium, T2D-GENES Consortium, Morris AP. Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Hum Mol Genet. 2017 Sep 15;26(18):3639-3650

MR-MEGA ver. 0.2 (zip) (changelog) (txt)

Please note that the current version is still a beta version and may contain errors. In case of any problems, please write to reedik.magi@ut.ee

If you have problems downloading the zipped file, please use an alternative web browser or download it directly to your server with the command:
wget https://tools.gi.ut.ee/tools/MR-MEGA_v0.2.zip

older version: MR-MEGA ver. 0.1.6

Some additional tools:

fixP.r

manh.r

qq.r

Because the C++ library currently used can only calculate p-values greater than 1e-14, you can use the fixP.r script to recalculate p-values in R from the chisq and ndf values, down to p-values of about 1e-325. This script is not needed for creating Manhattan and QQ plots, as the plotting scripts perform the same calculation independently.

To run the script, use following command:

R --slave --vanilla < fixP.r

If your result file is not mrmega.result, then you can also change input and output files of the script by:

R --slave --vanilla --args input=inputfilename out=outputfilename < fixP.r

Manhattan and QQ plots can be created with the accompanying R scripts.

R --slave --vanilla < manh.r

R --slave --vanilla < qq.r

By default they expect input file name "mrmega.result" and they create output files: "mrmega.result.qq_assoc.png", "mrmega.result.qq_ancest.png", "mrmega.result.qq_resid.png" and "mrmega.result.manh.png". Different names can be used as:

R --slave --vanilla --args input=inputfilename out=outputfilename < manh.r

R version 2.9.0 or later must be used with png support

Copy the MR-MEGA_v*.zip file to your computer and unzip it:

unzip MR-MEGA_v*.zip

To compile MR-MEGA, use the command:

make

in the folder where files have been unpacked. The program can be run by typing:

./MR-MEGA

To run MR-MEGA you have to create an input file (default name “mrmega.in”) that lists all study files, with each results file on a separate row.

Sample “mrmega.in” file:

Pop1.txt.gz
Pop2.txt.gz
Pop3.txt.gz
Pop4.txt.gz
Pop5.txt.gz
Pop6.txt.gz
Pop7.txt.gz
Pop8.txt.gz

Each GWA study file has mandatory column headers:

1) MARKERNAME – snp name

2) EA – effect allele

3) NEA – non effect allele

4) OR - odds ratio

5) OR_95L - lower confidence interval of OR

6) OR_95U - upper confidence interval of OR

7) EAF – effect allele frequency

8) N - sample size

9) CHROMOSOME - chromosome of marker

10) POSITION - position of marker

In case of quantitative trait:

4) BETA – beta

5) SE – std. error

Study files might also contain column:

11) STRAND – marker strand (if the column is missing then program expects all markers being on positive strand)

Sample study file (NB! This file is a quantitative trait one and MR-MEGA has to be run with --qt command line option):

MARKERNAME STRAND CHROMOSOME POSITION IMP EA NEA EAF N BETA SE
rs12565286 + 1 761153 0 G C 0.3 1200 -0.02 0.0403
rs2977670 + 1 763754 0 C G 0.23 1200 -0.01 0.40612
rs12138618 + 1 790098 0 G A 0.97 1200 -0.07 0.37
rs3094315 + 1 792429 0 G A 0.01 1199 0.0258 0.1012
rs3131968 + 1 794055 0 G A 0.27 1200 -0.373 0.0101
rs2519016 + 1 805811 0 T C 0.04 1200 0.26 0.3472
rs12562034 + 1 808311 0 G A 0.65 1200 0.0092 0.2

Input files must be either tab or space delimited. Files must not have empty columns as multiple separators are treated as one. Files may contain additional columns, which are not used by MR-MEGA.

Command line options:

./MR-MEGA [--name_pos <string>] ... [--name_chr <string>] ...
[--name_n <string>] ... [--name_strand <string>] ...
[--name_or_95u <string>] ... [--name_or_95l <string>] ...
[--name_or <string>] ... [--name_se <string>] ...
[--name_beta <string>] ... [--name_eaf <string>] ...
[--name_nea <string>] ... [--name_ea <string>] ...
[--name_marker <string>] ... [-f <string>] ... [--pc <int>]
[-t <double>] [--no_std_names] [--debug] [--qt] [--gco]
[--gc] [--no_alleles] [-m <string>] [-o <string>] [-i <string>]
[--] [--version] [-h]

Where:

--name_pos <string> (accepted multiple times)

Alternative header to position column. Default POSITION

--name_chr <string> (accepted multiple times)

Alternative header to chromosome column. Default CHROMOSOME

--name_n <string> (accepted multiple times)

Alternative header to sample size column. Default N

--name_strand <string> (accepted multiple times)

Alternative header to strand column. Default STRAND

--name_or_95u <string> (accepted multiple times)

Alternative header to upper 95% CI of odds ratio column. Default OR_95U

--name_or_95l <string> (accepted multiple times)

Alternative header to lower 95% CI of odds ratio column. Default OR_95L

--name_or <string> (accepted multiple times)

Alternative header to odds ratio column. Default OR

--name_se <string> (accepted multiple times)

Alternative header to standard error column. Default SE

--name_beta <string> (accepted multiple times)

Alternative header to effect column. Default BETA

--name_eaf <string> (accepted multiple times)

Alternative header to effect allele frequency column. Default EAF

--name_nea <string> (accepted multiple times)

Alternative header to other allele column. Default NEA

--name_ea <string> (accepted multiple times)

Alternative header to effect allele column. Default EA

--name_marker <string> (accepted multiple times)

Alternative header to marker name column. Default MARKERNAME

-f <string>, --filter <string> (accepted multiple times)

Set a filter based on column name. It needs 3 arguments: column name, comparison operator [>,<,>=,<=,==,!=], numeric filter value. Multiple filters can be set. Please note that UNIX may require using '\' before '<' and '>' signs. Column names are not case sensitive. (Example: INFO\>0.4)

--pc <int>

This specifies the number of PCs to use in the regression. Default = 4. Please note that the PC count must be < cohort count - 2. Therefore, if five cohorts are used in the analysis, the maximum number of PCs is two.

-t <double>, --threshold <double>

The p-value threshold for showing direction. Default = 1

--no_std_names

Default column names are not used. All columns must be defined by the user

--debug

Debug mode on (default OFF)

--qt

Use this option if the trait is quantitative (columns BETA & SE). Default is binary trait (columns OR, OR_95U, OR_95L)

--gco

Use second genomic control correction on output file

--gc

Use genomic control correction on input files

--no_alleles

No allele information has been given. Expecting always the same EA

-m <string>, --map <string>

This specifies map file

-o <string>, --out <string>

This specifies output root. By default mrmega

-i <string>, --filelist <string>

Specify studies' result files. Default = mrmega.in

--, --ignore_rest

Ignores the rest of the labeled arguments following this flag.

--version

Displays version information and exits.

-h, --help

Displays usage information and exits.
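
The constraint on --pc stated above (PC count < cohort count - 2) translates into a simple check before starting a run. The helper names in this Python sketch are hypothetical.

```python
def max_pc_count(n_cohorts):
    """Largest PC count satisfying: PC count < cohort count - 2."""
    return max(0, n_cohorts - 3)


def pc_count_is_valid(n_pcs, n_cohorts):
    """True if the requested number of PCs is allowed for this many cohorts."""
    return n_pcs < n_cohorts - 2


print(max_pc_count(5))          # five cohorts, as in the example above
print(pc_count_is_valid(4, 8))  # default --pc with eight study files
```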

MR-MEGA generates two output files: *.result and *.log

The results file contains the following columns:

MarkerName - unique marker identification across input files

Chromosome - chromosome of marker

Position - physical position in chromosome of marker

EA - allele whose effect was measured across input files

NEA - other allele

EAF - average effect allele frequency (weighted by the sample size of each input file)

Nsample - total number of samples

Ncohort - total number of cohorts, where the marker was present

Effects - effect direction across cohorts (+ if the effect allele effect was positive, - if negative, 0 if the effect was zero, ? if marker was not available in cohort)

beta_0 - effect of first PC of meta-regression

se_0 - stderr of the effect of first PC of meta-regression

(beta_1)

(se_1)

(...)

chisq_association - chisq value of the association

ndf_association - number of degrees of freedom of the association

P-value_association - p-value of the association

chisq_ancestry_het - chisq value of the heterogeneity due to different ancestry

ndf_ancestry_het - ndf of the heterogeneity due to different ancestry

P-value_ancestry_het - p-value of the heterogeneity due to different ancestry

chisq_residual_het - chisq value of the residual heterogeneity

ndf_residual_het - ndf of the residual heterogeneity

P-value_residual_het - p-value of the residual heterogeneity

lnBF - log of Bayes factor

Comments - reason why marker was not analysed
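
The Effects column described above encodes one character per cohort (+, -, 0 or ?). As a small illustration of how it can be read, the Python sketch below counts the cohorts in each category for one marker. The function name summarise_effects and the returned keys are hypothetical and not part of MR-MEGA.

```python
from collections import Counter


def summarise_effects(effects):
    """Count effect directions in an Effects string such as '++-?+0'."""
    counts = Counter(effects)
    unexpected = set(counts) - set("+-0?")
    if unexpected:
        raise ValueError(f"unexpected symbols in Effects column: {sorted(unexpected)}")
    return {
        "positive": counts["+"],   # effect allele effect was positive
        "negative": counts["-"],   # effect allele effect was negative
        "zero": counts["0"],       # effect was zero
        "missing": counts["?"],    # marker not available in cohort
    }


print(summarise_effects("++-?+0"))
```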

RegScan (CURRENT VERSION is v. 0.5 (April 18, 2017))

RegScan is a command line tool for performing fast association analysis between allele frequencies and continuous traits. It uses linear regression to estimate marker effects on continuous traits.
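
To make the statistics concrete, the following Python sketch fits a simple linear regression of a trait on genotype dosage and returns the slope (effect size), its standard error, the t-value and R2. It is a textbook least-squares calculation written for illustration, not RegScan's optimised implementation; the function name and the example data are made up.

```python
import math


def simple_linear_regression(dosage, trait):
    """Least-squares fit of trait on dosage; returns (slope, se, t, r2)."""
    n = len(dosage)
    if n != len(trait) or n < 3:
        raise ValueError("need at least three paired observations")
    mean_x = sum(dosage) / n
    mean_y = sum(trait) / n
    sxx = sum((x - mean_x) ** 2 for x in dosage)
    if sxx == 0:
        raise ValueError("no variation in genotype dosage")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosage, trait))
    syy = sum((y - mean_y) ** 2 for y in trait)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares with n - 2 degrees of freedom.
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(dosage, trait))
    se = math.sqrt(rss / (n - 2) / sxx)
    t_value = slope / se if se > 0 else math.inf
    r2 = 1 - rss / syy if syy > 0 else 0.0
    return slope, se, t_value, r2


dosage = [0, 1, 2, 1, 0, 2]
trait = [1.0, 2.1, 2.9, 1.9, 1.2, 3.1]
slope, se, t_value, r2 = simple_linear_regression(dosage, trait)
print(slope, se, t_value, r2)
```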

The main features of RegScan are:

Speed. Currently it is about an order of magnitude faster than the leading GWAS methods (that compute p-value, effect size and standard error such as SNPTEST or QuickTest) with one trait, and hundreds of times faster with a large number of traits and use of restrictive filters. RegScan achieves its speed by efficient implementation and performing only a critical number of statistical tests.
Handling of combinatorial traits. RegScan can automatically create and analyze combinatorial traits such as trait ratios, products, sums, and differences.
Automatically analyzes any number of traits. It can automatically analyze any number of traits without the user having to specify what traits to consider. This saves time during runtime but also makes input data preparation easier.
Runtime filtering. In order to save computational time and reduce the output size the user can set restrictive filters during runtime (but also after runtime). Filtering of hits is done using a) the slope (effect size), b) standard error of slope, c) R2, t-value, or p-value, and d) minor allele count (MAC).
Introduction of Reliability Score (RS). RegScan introduces the Reliability Score (RS) - a simple metric to help to isolate the biologically potentially most interesting associations using combinatorial traits.
Additional functions. RegScan comes with several supporting functions required for data preparation and conversion as well as for filtering and analyzing the results.
Optional summary file. All analysis results are placed in one file. Upon request RegScan will produce an additional summary output file which lists for each marker the best association with a trait based on the statistical parameter (p value) and also the effect size. This enables the user to quickly isolate the most interesting findings in the output data.
Availability. RegScan is an open source project; the code can be compiled for all major computational platforms and both the 32- and 64-bit architectures.
User support. RegScan comes with user instructions and test datasets for practicing and better understanding the functions. The authors also provide technical support and take requests for future updates.
File formats. RegScan uses the following genotype file formats as input: gen, gen.gz, bgen. Bgen support was incorporated in version 0.2 thanks to help from Dr. Gavin Band (Gavin Band & Jonathan Marchini, the BGEN format, http://www.well.ox.ac.uk/~gav/bgen_format/).
RegScan's main goal is to achieve maximal computational speed in order to be applicable for the initial testing of very large data sets.
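
The combinatorial traits mentioned in the feature list (sums, differences, products and ratios of trait pairs) can be pictured with the short Python sketch below. It only illustrates the idea; the naming of the derived traits and the handling of zero denominators are assumptions, and RegScan's own conventions are described in its user manual.

```python
from itertools import combinations


def combine_traits(traits):
    """Build pairwise sums, differences, products and ratios.

    traits maps a trait name to a list of per-sample values.
    Ratios with a zero denominator are returned as None (an assumption).
    """
    derived = {}
    for (name_a, a), (name_b, b) in combinations(traits.items(), 2):
        derived[f"{name_a}+{name_b}"] = [x + y for x, y in zip(a, b)]
        derived[f"{name_a}-{name_b}"] = [x - y for x, y in zip(a, b)]
        derived[f"{name_a}*{name_b}"] = [x * y for x, y in zip(a, b)]
        derived[f"{name_a}/{name_b}"] = [
            x / y if y != 0 else None for x, y in zip(a, b)
        ]
    return derived


derived = combine_traits({"waist": [2.0, 4.0], "hip": [1.0, 2.0]})
print(sorted(derived))
```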

Citation

The article has been published in Briefings in Bioinformatics:

T. Haller, M. Kals, T. Esko, R. Mägi, K. Fischer. RegScan: a GWAS tool for quick estimation of allele effects on continuous traits and their combinations. Briefings in Bioinformatics. 2015 Jan;16(1):39-44. doi: 10.1093/bib/bbt066. Epub 2013 Sep 5.

Download: Abstract Full text Pdf

Downloads (user manual is included)

Detailed instructions can be downloaded here. Illustrative instructions are found below.
(Attn: Reading the detailed instructions is essential for being able to fully use the RegScan program!)

RegScan includes functions for linear regression analysis and for preparing files for it, as well as functions for post-runtime analysis.
Regression analysis is carried out like this:

./REGSCAN -M gwas -gfile -pfile -missing -slope -statistic -statlimit -maclimit -selimit -out -summary -buffer

-gfile (required) = genotype file format
-pfile (required) = phenotype file in RegScan format (easily derived from .sample format by RegScan)
-missing = missing phenotype data identifier
-slope = effect size lower limit for screening
-statistic = main statistic used for screening; options: R2, T value, P value
-statlimit = screening limit for the statistical analysis (upper limit for P value, lower limit for R2 and T value)
-maclimit = minimal allowed minor allele count limit for screening (details in user guide)
-selimit = standard error of slope (SE) limit for screening
-out = output file name
-summary = additional summary file; options: yes, no
-buffer = memory allocation for maximal computational speed

Example:
./REGSCAN -M gwas -gfile TEST.gen -pfile TEST.regscan -missing na -slope 0.01 -statistic p -statlimit 5e-8 -maclimit 5 -selimit 1 -out results.txt -summary no -buffer 500

Please contact us if you have any questions or suggestions:
toomas.haller@ut.ee / tom@toomashaller.com

SCOPA – Software for COrrelated Phenotype Analysis

Genome-wide study-level multiple phenotype analysis, including dissection of association signals, has been implemented in SCOPA. The software requires specification of input genotype and sample files, and a list of phenotypes to be included in the analysis. SCOPA includes options to enable filtering on the basis of imputation quality, to output the variance-covariance matrix and phenotype effects (with standard errors) for each SNP, and to investigate association with all possible subsets of phenotypes using BIC.
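
For k phenotypes there are 2^k - 1 non-empty subsets, so the number of candidate models grows quickly. The Python sketch below enumerates them for a short phenotype list and labels each with a binary mask (1 = used, 0 = unused). It illustrates the combinatorics only; the mask layout as a string of 0/1 characters in input order is an assumption, not a statement about SCOPA's internals.

```python
from itertools import product


def enumerate_models(phenotypes):
    """Return (mask, used_phenotypes) for every non-empty subset."""
    models = []
    for flags in product("01", repeat=len(phenotypes)):
        if "1" not in flags:
            continue  # skip the empty model
        mask = "".join(flags)
        used = [name for name, flag in zip(phenotypes, flags) if flag == "1"]
        models.append((mask, used))
    return models


models = enumerate_models(["height", "weight", "hip"])
print(len(models))  # 7
for mask, used in models:
    print(mask, used)
```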

Genome-wide meta-analysis has then been implemented in META-SCOPA. The software requires specification of a list of SCOPA output files representing studies to be included in the meta-analysis. META-SCOPA includes options to enable genomic control correction (at the study level and/or after meta-analysis), and filtering of SNPs on the basis of minor allele frequency (MAF) and imputation quality.

Download latest versions:

SCOPA v.1.0.14 (zip) (changelog) (txt)

METASCOPA v.1. (zip)2 (changelog) (txt)

Copy SCOPAv*.zip file into your computer, unzip the file:

unzip SCOPAv*.zip

To compile SCOPA program, use command:

make

in the folder where files have been unpacked. The program can be run by typing:

./SCOPA

Copy METASCOPAv*.zip file into your computer, unzip the file:

unzip METASCOPAv*.zip

To compile METASCOPA program, use command:

make

in the folder where files have been unpacked. The program can be run by typing:

./METASCOPA

For running SCOPA, you need input files in SNPTESTv.2 format. SNPTEST file formats are described here. In case of case-control type of analysis, you should have a single gen and sample file, where the phenotype is coded 0=control; 1=case. Please note that SCOPA cannot currently use covariates - therefore please adjust all phenotypes for the covariates and use the residuals of the phenotypes in the sample file (case and control values will be floating numbers around 0 and 1).

GENOTYPE file:

1 rs1 11 A T 1 0 0 1 0 0 1 0 0

1 rs2 210 A T 0 1 0 1 0 0 1 0 0

1 rs3 300 A T 1 0 0 1 0 0 1 0 0

1 rs4 4637 A T 1 0 0 1 0 0 1 0 0

1 rs5 5555 A T 1 0 0 1 0 0 1 0 0

(Genotype file can be gzipped, if it has *.gz extension)  
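
Each genotype line above has five leading columns followed by three genotype probabilities per sample. As an illustration of this layout, the Python sketch below parses one line and converts each probability triple into an expected dosage of the second allele. It assumes the usual gen convention that each triple is ordered AA, AB, BB; the function name and returned keys are hypothetical, and the sketch makes no claim about which allele SCOPA treats as the effect allele.

```python
def parse_gen_line(line):
    """Parse one gen-format line into marker info and per-sample dosages."""
    fields = line.split()
    if len(fields) < 8 or (len(fields) - 5) % 3 != 0:
        raise ValueError("expected 5 leading columns plus probability triples")
    _first, marker, position, allele_a, allele_b = fields[:5]
    probs = [float(value) for value in fields[5:]]
    dosages = []
    for i in range(0, len(probs), 3):
        _p_aa, p_ab, p_bb = probs[i:i + 3]
        # Expected count of the second allele: 1 * P(AB) + 2 * P(BB).
        dosages.append(p_ab + 2 * p_bb)
    return {
        "marker": marker,
        "position": int(position),
        "alleles": (allele_a, allele_b),
        "dosage_b": dosages,
    }


record = parse_gen_line("1 rs2 210 A T 0 1 0 1 0 0 1 0 0")
print(record["marker"], record["dosage_b"])  # rs2 [1.0, 0.0, 0.0]
```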

SAMPLE file:

Sample_id Subject_id Missing Phenotype1 Phenotype2 Phenotype3

0 0 0 P P P

1 1 0 1.24 0.331 0.41

2 2 0 1.23 -0.3 0.42

3 3 0 1.22 -.47 0.43

This file contains data for three phenotypes. As the program cannot use covariates, please adjust your phenotypes for all the covariates and use the residuals of the phenotypes.
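
Adjusting a phenotype for a covariate means regressing the phenotype on the covariate and keeping what is left over. The Python sketch below shows this for a single covariate; it is a minimal illustration with made-up data and a hypothetical function name. With several covariates a standard regression routine from a statistics package would normally be used instead.

```python
def residualise(phenotype, covariate):
    """Return residuals of phenotype after least-squares adjustment for one covariate."""
    n = len(phenotype)
    if n != len(covariate) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_p = sum(phenotype) / n
    mean_c = sum(covariate) / n
    scc = sum((c - mean_c) ** 2 for c in covariate)
    if scc == 0:
        raise ValueError("covariate has no variation")
    slope = sum(
        (c - mean_c) * (p - mean_p) for c, p in zip(covariate, phenotype)
    ) / scc
    intercept = mean_p - slope * mean_c
    # Residual = observed value minus the value predicted from the covariate.
    return [p - (intercept + slope * c) for p, c in zip(phenotype, covariate)]


age = [30.0, 40.0, 50.0, 60.0]
bmi = [22.0, 25.5, 26.0, 29.5]
residuals = residualise(bmi, age)
print(residuals)  # these values would go into the sample file
```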

Command line options:

./SCOPA [--debug] [--print_covariance] [--print_complex] [--betas]

[--print_all] [--remove_missing] --pheno_name <string> ...

[--imp_threshold <double>] [--missing_phenotype <string>] [-e <string>]

-o <string> -g <string> [--chr <int>] -s <string> [--] [--version] [-h]

Where:

--debug

Debug mode on (default OFF)

--print_covariance

Print covariance matrix data for the model with all phenotypes. This is necessary for METASCOPA and can only be used with "--print_complex" option

(default OFF)

--print_complex

Print only the model with all phenotypes. These full models can be meta-analysed with METASCOPA (default OFF)

--betas

Print each phenotype's effect size and stderr info of all selected models into separate output file (default OFF)

--print_all

Print out all models (default OFF)

--remove_missing

Remove sample if any of the phenotype values is missing. This is necessary if you want to compare models based on BIC scores (default OFF)

--pheno_name <string> (accepted multiple times)

(required) Name of phenotype to use (use this command multiple times i.e. --pheno_name BMI --pheno_name HEIGHT etc.)

--imp_threshold <double>

Imputation quality threshold (default 0)

--missing_phenotype <string>

This specifies missing data value (default NA)

-e <string>, --exclusion <string>

This specifies marker exclusion list

-o <string>, --out <string>

(required) This specifies output root

-g <string>, --gen <string>

(required) This specifies genotype file.

--chr <int>

This specifies chromosome to be printed into chromosome column

-s <string>, --sample <string>

(required) This specifies sample file

--, --ignore_rest

Ignores the rest of the labeled arguments following this flag

--version

Displays version information and exits

-h, --help

Displays usage information and exits

The SCOPA output file contains the following columns (example with three phenotypes):

1 Chromosome - chromosome of variant if set with --chr option. Otherwise 0

2 Position - position of variant

3 MarkerName - variant name

4 EffectAllele - effect allele (necessary for meta-analysis)

5 OtherAllele - non-effect allele (necessary for meta-analysis)

6 InfoScore - Imputation quality measurement calculated similarly to IMPUTE2

7 HWE - p-value for HWE

8 MAF - minor allele frequency

9 N - samplesize

10 AA - genotype counts from imputed data

11 AB - genotype counts from imputed data

12 BB - genotype counts from imputed data

13 PhenotypeCount - number of phenotypes in model

14 Mask - binary mask showing the phenotypes used in current model (1-used, 0-unused)

15 LogLikelihood - model likelihood

16 nullLogLikelihood - null model likelihood

17 LikelihoodRatio - likelihood ratio

18 P-value - model p-value

19 BIC - Bayesian information score

20 BICnull - Bayesian information score for null model

21 Model - phenotypes in the order they were used in model (important for selecting covariance matrix for meta-analysis)

22 sortedModel - phenotypes in model in alphabetical order

23 beta_1 - effect size for phenotype 1

24 se_1 - stderr of effect for phenotype 1

25 beta_2 - effect size for phenotype 2

26 se_2 - stderr of effect for phenotype 2

27 beta_3 - effect size for phenotype 3

28 se_3 - stderr of effect for phenotype 3

29 cov_1_1 - inverted covariance matrix values

30 cov_1_2 - inverted covariance matrix values

31 cov_1_3 - inverted covariance matrix values

32 cov_2_2 - inverted covariance matrix values

33 cov_2_3 - inverted covariance matrix values

34 cov_3_3 - inverted covariance matrix values
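
The cov_i_j columns listed above cover only the pairs with i <= j, which suggests that they hold the upper triangle of a symmetric matrix. Under that assumption, the Python sketch below rebuilds the full matrix from one output row. The function name and the example values are made up for illustration.

```python
def build_symmetric_matrix(row, n_phenotypes):
    """Rebuild a full symmetric matrix from cov_i_j columns with i <= j."""
    matrix = [[0.0] * n_phenotypes for _ in range(n_phenotypes)]
    for i in range(1, n_phenotypes + 1):
        for j in range(i, n_phenotypes + 1):
            value = float(row[f"cov_{i}_{j}"])
            matrix[i - 1][j - 1] = value
            matrix[j - 1][i - 1] = value  # mirror into the lower triangle
    return matrix


# Made-up values for a three-phenotype model.
row = {
    "cov_1_1": "2.0", "cov_1_2": "0.3", "cov_1_3": "-0.1",
    "cov_2_2": "1.5", "cov_2_3": "0.2", "cov_3_3": "1.1",
}
matrix = build_symmetric_matrix(row, 3)
for line in matrix:
    print(line)
```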

METASCOPA is the script for meta-analysing output files from the SCOPA program. As the only input file, you will need a file listing all SCOPA *.results files that you want to meta-analyse. Please note that the listed files must be gzipped. The input files must contain only a single model (e.g. using option --print_complex in SCOPA) and you must have the covariance matrix between phenotypes (using option --print_covariance in SCOPA).

List file metascopa.in can contain rows:

cohort1.result.gz

cohort2.result.gz

cohort3.result.gz

Command line options:

./METASCOPA [--debug] [--ogc] [--gc] [-n <int>] [--mac <double>] [--maf <double>] [--info <double>] [--hwe <double>] -o <string> -i <string> [--] [--version] [-h]

Where:

--debug

Debug mode enabled

--gc

Use genomic control to adjust each contributing file for population stratification (default OFF)

--ogc

Use genomic control to adjust meta-analysis results for population

stratification (default OFF)

-n <int>, --samplesize <int>

This specifies minimum samplesize filter (default 0)

--mac <double>

This specifies minimum minor allele count filter (default 0)

--maf <double>

This specifies minimal minor allele frequency filter (default 0)

--info <double>

This specifies infoscore filter (default 0)

--hwe <double>

This specifies HWE p-value filter (default 1)

-o <string>, --out <string>

(required) This specifies output file

-i <string>, --input <string>

(required) This specifies input list file

--, --ignore_rest

Ignores the rest of the labeled arguments following this flag.

--version

Displays version information and exits.

-h, --help

Displays usage information and exits.

The METASCOPA output file contains the following columns (example with three phenotypes):

1 MarkerName - variant name

2 EA - effect allele

3 NEA - other allele

4 CohortCount - number of cohorts having data of given variant

5 N - total sample size

6 beta_0 - meta-analysed effect size of phenotype 1

7 se_0 - meta-analysed stderr of effect for phenotype 1

8 beta_1 - meta-analysed effect size of phenotype 2

9 se_1 - meta-analysed stderr of effect for phenotype 2

10 beta_2 - meta-analysed effect size of phenotype 3

11 se_2 - meta-analysed stderr of effect for phenotype 3

12 ChiSq - chi-squared statistic for entire model

13 Pvalue - p-value for entire model (please note that the script can only calculate p-values down to 1e-20. Lower p-values are given as 0). It is possible to get exact p-values down to ~1e-200 in R using the formula:

p<-pchisq(ChiSq,PhenotypeCount, lower.tail=F)

In the first GWAS step we would recommend testing for the full model (print_complex) as it reflects all associations with each separate phenotype and their combinations (e.g. ratios). If several datasets are available, it is recommended to also print the covariance matrix (print_covariance) and use that info for meta-analysing cohorts using METASCOPA tool. An example command line for analysing anthropometric traits would be:

./SCOPA --remove_missing -g cohort1_chr1.gen --chr 1 --print_complex --betas --print_covariance --out cohort1_chr1.result -s one.sample --pheno_name height --pheno_name weight --pheno_name hip --pheno_name waist

By merging all chromosomes a single output file for each cohort can be created:

awk '{if(NR==1 || $1!="Chromosome"){print;}}' cohort1_chr*.result | gzip -c > cohort1.result.gz
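
The same merge can also be written in Python, which may be convenient where awk is not available. The sketch below concatenates per-chromosome result files into one gzipped file and keeps only the first header line, since METASCOPA expects gzipped input. It assumes that every input file begins with a header line; the function name merge_results is hypothetical.

```python
import gzip


def merge_results(chromosome_files, merged_path):
    """Concatenate result files into one gzipped file with a single header."""
    header_written = False
    with gzip.open(merged_path, "wt") as merged:
        for path in chromosome_files:
            with open(path) as handle:
                header = handle.readline()
                if not header:
                    continue  # skip empty files
                if not header_written:
                    merged.write(header)
                    header_written = True
                # Copy all data lines that follow the header.
                for line in handle:
                    merged.write(line)


# Example call (file names are placeholders):
# merge_results(["cohort1_chr1.result", "cohort1_chr2.result"], "cohort1.result.gz")
```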

The result files from the contributing cohorts must then be listed in a file, one file name per line, after which all contributing files can be meta-analysed using METASCOPA with double GC correction:

ls cohort1.result.gz cohort2.result.gz cohort3.result.gz > cohorts.in

./METASCOPA --gc --ogc --mac 10 --info 0.4 --hwe 1e-4 -i cohorts.in -o meta.results

As the next step, the top signals from the METASCOPA output can be selected and the genotype files filtered down to these variants. SCOPA can then be run with “print_all” (or without any print option, to get only the best model based on BIC scores) to find the optimal set of phenotypes associated with a particular variant.

./SCOPA --remove_missing -g cohort1_hits.gen --out cohort1_modelselection.result -s one.sample --pheno_name height --pheno_name weight --pheno_name hip --pheno_name waist

Reedik Magi

Estonian Genome Center
University of Tartu
Tartu
Estonia

or

Andrew P Morris

University of Liverpool
Liverpool
UK.

Email addresses are of the form firstname.lastname@ut.ee and firstname.lastname@well.ox.ac.uk

STEROID

STEROID (a fast tool for genetic risk score calculation using biobank data) is a tool for calculating genetic risk scores for samples in vcf format files. Genetic risk models should be in LDpred format and these can be calculated for VCF files using imputation probabilities (vcf format fields GP, DS) or called genotypes (GT). Several models can be calculated simultaneously.

STEROIDv0.1.1 (zip) changelog (txt)

Copy STEROIDv*.zip file into your computer, unzip the file:

unzip STEROIDv*.zip

To compile STEROID program, use command:

make

in the folder where files have been unpacked. The program can be run by typing:

./STEROID

Genetic models in LDpred format (https://github.com/bvilhjal/ldpred).

chrom pos sid nt1 nt2 raw_beta ldpred_beta
chrom_1 752566 rs3094315 G A 6.0000e-03 -4.2511e-05
chrom_1 768448 rs12562034 A G 2.7000e-03 -1.3897e-05
chrom_1 779322 rs4040617 G A 6.2000e-03 -4.6578e-05
chrom_1 785989 rs2980300 T C 5.5000e-03 -3.4656e-05

Samples should be in standard VCF v4.2 format, with the chromosome in the first column given without "chr" in front of the number.
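
A genetic risk score is a weighted sum of allele dosages. The Python sketch below shows the arithmetic for one sample, deriving dosages either from a GP triple or from a called GT genotype and weighting them with ldpred_beta values from the example above. It is an illustration only: it assumes biallelic variants, that GP lists the probabilities of 0/0, 0/1 and 1/1 in that order, and that each weight already refers to the VCF alternate allele. How STEROID aligns nt1 and nt2 with the VCF alleles is not covered here, and all function names are hypothetical.

```python
def dosage_from_gp(gp):
    """Expected alternate-allele dosage from a GP string such as '0.1,0.8,0.1'."""
    _p_hom_ref, p_het, p_hom_alt = (float(value) for value in gp.split(","))
    return p_het + 2 * p_hom_alt


def dosage_from_gt(gt):
    """Alternate-allele count from a GT string such as '0/1' or '1|1'."""
    alleles = gt.replace("|", "/").split("/")
    if "." in alleles:
        return None  # missing genotype
    return float(sum(1 for allele in alleles if allele == "1"))


def risk_score(weights, dosages):
    """Sum of weight * dosage over variants; variants without a dosage are skipped."""
    score = 0.0
    for sid, weight in weights.items():
        dosage = dosages.get(sid)
        if dosage is None:
            continue
        score += weight * dosage
    return score


weights = {"rs3094315": -4.2511e-05, "rs12562034": -1.3897e-05}
dosages = {
    "rs3094315": dosage_from_gp("0,1,0"),
    "rs12562034": dosage_from_gt("1|1"),
}
print(risk_score(weights, dosages))
```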

Reading command line options:

USAGE:

./STEROID [--field <string>] -o <string> [--chr <int>] --vcf <string>
--gwas <string> ... [--] [--version] [-h]

Where:

--field <string>
This specifies VCF format field to use. Default is GP (options: GP, DS,
GT)

-o <string>, --out <string>
(required) This specifies output file

--chr <int>
This specifies vcf file chromosome. Don't use this option if vcf
contains several chromosomes

--vcf <string>
(required) This specifies VCF file

--gwas <string> (accepted multiple times)
(required) Name of LDpred output file (in case of several files, use
this command multiple times)

--, --ignore_rest
Ignores the rest of the labeled arguments following this flag.

--version
Displays version information and exits.

-h, --help
Displays usage information and exits.

EXAMPLE COMMAND LINE:
./STEROID --vcf DATA_chr22.vcf.gz --gwas xxx_LDpred-inf --gwas xxx_LDpred_p1.0000e+00 --gwas xxx_LDpred_p1.0000e-01 --gwas xxx_LDpred_p1.0000e-02 --out xxx_grs_chr22.txt

Cropper

Cropper is a GUI application for viewing and handling Manhattan Plots. The user can zoom, select and crop Manhattan Plots and generate output both in the graphical and numerical format.

Summary of main features

1. Input files need to have chromosome, position and p-value fields.

2. Zoom, crop, region color, point or line representation, region save (both graphics and numerical files), all chromosome view, single chromosome view, informative graphical interface.

3. Can be compiled for all major platforms.
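
The numerical side of cropping a Manhattan plot amounts to selecting the records inside a chromosome region and transforming each p-value to -log10(p). The Python sketch below shows that step for records held as (chromosome, position, p-value) tuples. It is not Cropper's code; the function name and data layout are assumptions for illustration.

```python
import math


def crop_region(records, chromosome, start, end):
    """Return sorted (position, -log10 p) pairs for one chromosome region."""
    cropped = []
    for chrom, position, p_value in records:
        # p-values of 0 cannot be log-transformed and are skipped here.
        if chrom == chromosome and start <= position <= end and p_value > 0:
            cropped.append((position, -math.log10(p_value)))
    return sorted(cropped)


records = [("1", 500, 0.5), ("1", 100, 1e-8), ("2", 150, 1e-3)]
print(crop_region(records, "1", 50, 200))
```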

Citation

A scientific article has been written for publication in BMC Bioinformatics to demonstrate the utility of Cropper together with Manhattan Harvester.

CentOS 64 bit version: download

CentOS 64 bit static version: download

MacOS version: download

Ubuntu 32 bit version: download

Windows static version: download

* You can compile Cropper yourself from the source code:

Source code is available HERE

Cropper was designed to be intuitive to use. Detailed instructions are downloaded together with the application.

Please contact us if you have any questions or suggestions:

toomas.haller@ut.ee / toomashaller@gmail.com

Manhattan Harvester

Manhattan Harvester (MH) is a command line tool for automatically detecting and characterizing peaks from GWAS output files (Manhattan Plots). It outputs a list of parameters for each peak, including a general quality score value, to let the user rank all findings. Use MH when you have too many GWAS outcomes to screen them by eye.

Summary of main features:

1. GWAS output files that contain chromosome number, position and p-value are the input.

2. A list of parameters and a summary score for each peak is presented in a simple tabular form.

3. Ability to batch-process any number of files.

4. Defaults are set to optimal values. The user can change many parameters by using the flags.

5. Fast and adaptable algorithms.

6. Harvester comes in handy when GWAS result files are too many for human screening.

Citation

A scientific article has been written for publication in BMC Bioinformatics to demonstrate the utility of MH.

CentOS 64 bit version: download

CentOS 64 bit static version: download

MacOS version: download

Ubuntu 32 bit version: download

Windows static version: download

* You can compile MH yourself from the source code:

Source code is available HERE

Instructions

Detailed instructions can be downloaded together with the applications above.

Please contact us if you have any questions or suggestions:

toomas.haller@ut.ee

toomashaller@gmail.com
