MIXFIT
INTRODUCTION
MixFit is a multi-dimensional best-fit script that assigns individual ancestry components to unknown individuals by comparing their genome-wide data with reference populations.
The main features of MixFit are:
- Outcome. The outcome is three assigned numerical ancestry components chosen from among the reference populations. Assignment is performed via multi-dimensional best fit.
- Assignment reliability. Assignment reliability is judged based on several statistical parameters calculated by the script and explained in the user manual (see below).
- Input data. Input data are in the form of similarity matrices (chunkcount matrices) prepared with a pipeline consisting of SHAPEIT and ChromoPainter (see below). MixFit, however, is not restricted to the output of this particular pipeline; it can be used in a wide variety of situations.
- Flexibility. Several parameters can be controlled by the user (see user manual). This makes it easy to modify the assignment method.
MixFit is typically used to study the genetic ancestry of an individual, provided that a suitable data set is available for the relevant reference populations.
REFERENCE
A scientific article has been written to demonstrate MixFit but the article has not yet been published:
Toomas Haller, Liis Leitsalu, Krista Fischer, Marja-Liisa Nuotio, Tõnu Esko, Dorothea Irene Boomsma, Kirsten Ohm Kyvik, Tim D Spector, Markus Perola, Andres Metspalu. MixFit: methodology for computing ancestry-related genetic scores at the individual level and its application to the Estonian and Finnish population studies.
DOWNLOAD MixFit:
- 64-bit Scientific Linux (Red Hat family): download static compilation
Upon request, a MixFit executable is also available for these systems:
- Debian Linux (Debian, Ubuntu, Mint)
- Mac OS X (Snow Leopard 10.6.8)
- Windows (XP, Vista, 7, 8)
- Source code: not yet available
- User manual: download technical details, user guide, examples
INSTRUCTIONS
Detailed instructions can be downloaded here. Short instructions are found below.
(Attn: Reading the detailed instructions is essential for being able to fully use the MixFit script!)
MixFit analysis is carried out as follows:
Minimal example for running MixFit:
./MIXFIT -file unknowns.txt -ref references.txt -refpops 22 -out results.txt
Full example for running MixFit:
./MIXFIT -file unknowns.txt -ref references.txt -refpops 22 -out results.txt -delim space -header yes -refheader yes -plimit 0.1 -step 0.05 -a1 0.3 -a2 0.2 -a3 0.01
-file: name of the input file (here the array of 22)
-ref: name of the reference file (here the matrix of 22 x 22)
-out: output file name
-delim: matrix/array delimiter; options: “tab” (default), “space”, “colon”, “semicolon”, “comma”, or any freely selected text
-header: whether the input array has a vertical header; options: “no” (default), “yes”
-refheader: whether the input matrix has a vertical header; options: “no” (default), “yes”
-refpops: the number of reference populations (here 22)
-plimit: ancestry fraction (value) below which a component is considered irrelevant and is removed from consideration (in which case fewer than 3 ancestry components are reported); this can be any number between 0 and 1 (default is 0.1)
-step: fraction by which each reference population weight is incremented during the process of best fitting; default = 0.05
-choosebest: fixes the identities of some ancestry components before the best mix is computed, according to the overall similarity between the references and the unknown. For example, “-choosebest 1” immediately selects the overall most similar reference population and uses it as one of the components by including it in every best fit simulation. The default value is 0, which generally makes the most sense.
-missing: how missing values are denoted; default = “NA”
-a1: when best fit is carried out by systematically varying the ancestry components, fluctuations occur between the best and worst fits. The best fits are expressed as minima in the fluctuations. These minima are recorded for candidate selection later in the algorithm. This flag changes the fraction of best fits stored for later candidate selection. Default = 0.1 (meaning that 10% of the minima are considered for compiling the ancestry candidate list).
-a2: each ancestry assignment that comes out of the -a1 filter is associated with a GOF (goodness of fit) score. These potential assignments are sorted by GOF score and only those with the lowest scores are allowed to pass. This flag determines what fraction of the best assignments pass to the next round, where they are averaged to find the 3 top-scoring ancestry components. Default = 0.1 (meaning that 10% of the assignments with the best GOF scores pass).
-a3: in the final simulation the ancestry components have been selected but their relative ratios are unknown, so there is one more simulation in which the component amounts are systematically varied. The best answer that this step gives is a function of input uncertainty. Therefore MixFit allows the user to average a certain number of the best solutions. Default = 0.1 (meaning that 10% of the best solutions are averaged for the final ancestry component ratios). Note that this number should generally be small.
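The roles of -step, -a3 and -plimit in the final simulation can be illustrated with a short Python sketch. This is not the MixFit source code (which is not yet available), only an illustration under stated assumptions: the function names weight_grid and fit_mixture are hypothetical, the goodness-of-fit score is assumed to be a sum of squared differences, and renormalising the components that survive the -plimit cut-off is also an assumption. The candidate selection controlled by -a1, -a2 and -choosebest is not modelled; the sketch starts from components that have already been chosen.

```python
from itertools import product


def weight_grid(n_components, step):
    """Yield every weight vector whose entries are multiples of `step` and sum to 1."""
    units = round(1 / step)  # number of step-sized units to distribute
    for combo in product(range(units + 1), repeat=n_components - 1):
        rest = units - sum(combo)  # units left over for the last component
        if rest >= 0:
            yield tuple(u / units for u in combo + (rest,))


def fit_mixture(unknown, references, step=0.05, a3=0.1, plimit=0.1):
    """Grid-search mixture weights, average the best fraction, drop small components."""
    scored = []
    for weights in weight_grid(len(references), step):
        # Weighted mixture of the reference profiles, one value per column.
        mixed = [sum(w * ref[j] for w, ref in zip(weights, references))
                 for j in range(len(unknown))]
        # ASSUMPTION: GOF is the sum of squared differences (lower is better).
        gof = sum((m - u) ** 2 for m, u in zip(mixed, unknown))
        scored.append((gof, weights))
    scored.sort(key=lambda item: item[0])
    keep = max(1, int(len(scored) * a3))  # number of best solutions to average
    best = [weights for _, weights in scored[:keep]]
    averaged = [sum(column) / keep for column in zip(*best)]
    # Drop components below plimit.
    kept = {i: w for i, w in enumerate(averaged) if w >= plimit}
    if not kept:
        return {}
    # ASSUMPTION: the remaining components are renormalised to sum to 1.
    total = sum(kept.values())
    return {i: w / total for i, w in kept.items()}


# Toy data: three reference profiles and an unknown that is an even mix of the first two.
references = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unknown = [0.5, 0.5, 0.0]
print(fit_mixture(unknown, references, step=0.25, a3=0.05, plimit=0.1))
# prints {0: 0.5, 1: 0.5}
```

In this toy run the third component receives a weight of 0, falls below the plimit threshold and is not reported, so only two ancestry components remain. A smaller step gives a finer grid at the cost of more combinations to evaluate, and a larger a3 averages over more near-optimal solutions.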
Please contact us if you have any questions or suggestions:
toomas.haller [ät] ut.ee
tom [ät] toomashaller.com