SMARTPOP 2.0 Manual
2 Requirements
3 Installation
4 SMARTPOP features
4.1 Input
4.1.1 Via command lines
4.1.2 Via input files
4.1.3 Windows executable
4.2 Simulation parameters
4.2.1 Verbose
4.2.2 Random seed
4.2.3 Population size
4.2.4 Sample size
4.2.5 Number of simulations
4.2.6 Number of generations to run
4.2.7 Mating system and number of offspring
4.2.8 Sibling matings
4.2.9 Demography parameters
4.2.10 Mutation rates
4.2.11 Burn-In phase
4.2.12 DNA sequences
4.2.13 Choice of outputs
4.2.14 Save/Load
4.3 Outputs
4.3.1 Diversity tables
4.3.2 Fasta
4.3.3 Arlequin
4.4 Running SMARTPOP in parallel mode
4.5 Setting up a complex scheme (change of parameters through time)
4.6 Starting conditions: burn-in and pre-run
5 Examples
A Default parameters
B References
1 Introduction to SMARTPOP
SMARTPOP is a fast and flexible forward-in-time simulator for population genetics. Designed for speed, it is available in both serial and parallel versions. Developed for anthropological inference on human populations, SMARTPOP simulates individuals with sequences of sex-linked DNA (mitochondrial, X and Y chromosomes) and autosomes. Its flexible demographic models and social rules of mating make it possible to study social dynamics.
For any use of SMARTPOP or reuse of its source code, please cite:
Guillot and Cox, 2014. SMARTPOP: inferring the impact of social dynamics on genetic diversity through high speed simulations. BMC Bioinformatics 2014 15:175
2 Requirements
SMARTPOP has been developed in C++ using the Boost C++ library. To build the software from source, you need a C++ compiler such as g++ or Visual Studio installed on your computer. To compile the parallel version of SMARTPOP, we recommend using mpic++.
3 Installation
You can directly download a binary (executable) version of SMARTPOP compatible with your OS at http://smartpop.sourceforge.net/smartpop2/download.html
Alternatively you can build SMARTPOP from the source code following these instructions:
To build SMARTPOP from source on a UNIX machine:
- Download the source code from http://smartpop.sourceforge.net/download
wget http://smartpop.sourceforge.net/download/s/smartpop2/SMARTPOP.zip
- Move the archive to the folder where you want SMARTPOP to be installed, and go to that folder
mv SMARTPOP.zip /home/login/foo
cd /home/login/foo
- Uncompress the archive
unzip SMARTPOP.zip
- Move to the source directory
cd SMARTPOP/src-serial/
- Build SMARTPOP
make
- Optional: Build the test package
make test
- Optional: Run the test package
./test
SMARTPOP is then ready to launch via the command line:
./smartpop
For building the parallel version, go to the src-parallel directory:
- cd SMARTPOP/src-parallel/
- make
- Launch the parallel version (on UNIX system) on four cores:
- mpiexec -n 4 ./smartpopMPI
We have run tests on the main operating systems available today. If you encounter any trouble, please contact us.
4 SMARTPOP features
SMARTPOP must be called from the command line (./smartpop), either from the directory where it is installed or after adding that directory to your PATH. One call launches a set of simulations with one set of parameters.
4.1 Input
A set of simulations relies on several parameters, all of which have default values except the random seed. The default values are given in Appendix A (Default parameters). You can set these parameters either with flags on the command line or via an input file.
4.1.1 Via the command line
You can define all the parameters needed to start a set of simulations using flags on the command line. Any parameter that is not set by a flag receives its default value.
The order of parameters does not matter. If you give a wrong sequence of arguments on the command line, the program will raise an error and not start (in most cases).
To run a set of 1000 simulations with population size 200 evolving for 500 generations, enter:
./smartpop -p 200 -t 500 --nsimu 1000
Flag | Argument | Meaning |
-v | | Verbose |
-h | | Help |
-i | | Input file with parameters |
-o | string | Name for the output files |
-s | integer | Random seed |
-p | integer | Population size |
--popsizeN | integer integer | Population size of population X |
--npop | integer | Number of populations |
--sample | integer | Sample size |
--nsimu | integer | Number of simulations |
-t | integer | Number of generations between two sampling events |
--nstep | integer | Number of sampling events through time for one simulation run |
--mat | integer | Mating system (X = 1 random, 2 MBD/FZS, 3 MZD/MZS, 4 FZD/MBS, FBD/FBS) |
--inbreeding | integer | Prevent mating between half siblings (1) or full siblings (2); sibling mating is allowed if 0 |
--polygamy | integer | Set the maximum number of mates X per individual (X=1 gives monogamy) |
--polyandry | integer | Set the maximum number of husbands X per woman (X=1 gives monogamy) |
--polygyny | integer | Set the maximum number of wives X per man (X=1 gives monogamy) |
--polygamyN | integer integer | Set the maximum number of mates X per individual (X=1 gives monogamy) for a given population |
--polygynyN | integer integer | Set the maximum number of wives X per man (X=1 gives monogamy) for a given population |
--mu | double (x4 or x8) | Mutation rates (simple rates) |
--nbLociA | integer | Number of unlinked autosomal loci |
--sizeA | integer | Size, in number of sites, per autosomal locus |
--nbLociX | integer | Number of unlinked loci on the X chromosome |
--sizeX | integer | Size, in number of sites, per locus on the X chromosome |
--sizeY | integer | Size, in number of sites, on the Y chromosome |
--sizeMt | integer | Size, in number of sites, of the mitochondrial sequence |
--burnin | int | Set the number of generations for the burn-in phase with normal mutation rate |
--burninH | int | Set the number of generations for the burn-in phase with high mutation rate |
--demog | double double double | Set the demographic function |
--demogN | double double double int | Set the demographic function for population X |
--fasta | | Save each simulation in a fasta format file at the end of the run |
--arl | | Save each simulation in an arlequin format file at the end of the run |
--header | | Output the header at the beginning of diversity files |
--child2 | | Force all couples to have 2 children (no variance in reproductive success) |
--pooled | | Pooled sample (X individuals from Y demes) |
--scattered | | Scattered sample (X individuals from all demes) |
--local | | Local sample (X individuals from 1 deme) |
--blind | | Random sample without knowing the structure |
--nbsamplepop | integer | Number of populations to sample for the pooled scheme and for inter-population diversity comparison |
4.1.2 Via input files
Instead of defining the parameters on the command line, you can use an input file. This is useful, for example, if you change many of the default values.
You must respect the format of this file for it to work. It is an HTML file. Respect the order of the flags, or it may not work. Each line describes the value of one parameter. All parameters that are not defined in this input file take the default values given in Appendix A (Default parameters).
By default, each run of SMARTPOP creates a parameter file called SMARTPOP_parameters.txt. It has exactly the same format as the input parameter file, so running ./smartpop -i SMARTPOP_parameters.txt reruns exactly the same set of simulations as before. This is possible because the random seed is one of the input parameters.
4.1.3 Windows executable
The available Windows executable only handles input files. SMARTPOP will read the file
4.2 Simulation parameters
4.2.1 Verbose
It is recommended to begin using SMARTPOP with the verbose option on. SMARTPOP then prints relevant information on the command line while the simulations are running. It is a good way to check that the parameters are set correctly, as well as to follow the progress of the program.
4.2.2 Random seed
Simulations rely heavily on random processes. These processes are simulated with sequences of random numbers, which are a large part of the program. Each time the program starts, it takes a random seed that uniquely determines this sequence. By default the seed is itself chosen at random, but you can set it explicitly to repeat earlier runs.
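The role of the seed can be illustrated with a minimal Python sketch. It uses Python's own random module, not SMARTPOP's generator, and the function name draw_sequence is hypothetical; the only point is that a fixed seed reproduces the same sequence of random numbers.

```python
import random


def draw_sequence(seed, n=5):
    """Return n pseudo-random numbers produced from the given seed."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.random() for _ in range(n)]


first_run = draw_sequence(13145105)
repeat_run = draw_sequence(13145105)  # same seed, so the same sequence
other_run = draw_sequence(42)         # different seed, different sequence

print(first_run == repeat_run)
print(first_run == other_run)
```

In the same spirit, passing the seed of an earlier run to SMARTPOP with the flag -s should repeat that run, provided all other parameters are unchanged.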
4.2.3 Population size
The population size set in SMARTPOP corresponds to the census population size at the beginning of the simulation. If the population size is set to be non constant, this will change through time.
4.2.4 Sample size
Instead of analyzing the entire population, you can sample a certain number of individuals. This matches a “real life” situation where you do not have access to the DNA of your whole population. If you define a sample size, all the outputs (diversity, but also fasta and arlequin files) are generated from a random sample of that size drawn from your population.
4.2.5 Number of simulations
You can define the number of simulations that will be run with this set of parameters. If you are loading a set of simulations from a directory, this number must be smaller than or equal to the number of saved simulations.
4.2.6 Number of generations to run
During each simulation, the population evolves for a number of generations t between two sampling events. By default, there is only one sampling event, at the end of those t generations, where output files are produced and diversity is measured. If you define multiple sampling events through time (NSampling, flag nstep), then t is the number of generations run between two consecutive sampling events.
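As a rough illustration, the sampling schedule can be written out in a few lines of Python. This sketch assumes that the first sampling event happens after t generations and that burn-in generations are not counted; the function name sampling_generations is hypothetical and not part of SMARTPOP.

```python
def sampling_generations(t, nstep=1):
    """Generations at which a sample is taken, one event every t generations."""
    return [t * k for k in range(1, nstep + 1)]


schedule = sampling_generations(t=100, nstep=10)
print(schedule)      # one entry per sampling event
print(schedule[-1])  # generation of the last sampling event
```

Under these assumptions, a run with -t 100 --nstep 10 would be sampled ten times and would cover 1000 generations in total.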
4.2.7 Mating system and number of offspring
The mating system is described by two parameters: polygamy and cousin alliance. You can set a maximum number of mates per individual, either for both sexes (flag polygamy), or for males (polygyny) or females (polyandry).
For the cousin alliance, you can set a preferential alliance with one or more cousins using the flag mat (X = 1 random, 2 MBD/FZS, 3 MZD/MZS, 4 FZD/MBS, FBD/FBS).
4.2.8 Sibling matings
For any mating system, you can forbid mating between full or half siblings using the flag inbreeding.
4.2.9 Demography parameters
The demography is defined by a Markovian function:
popsize(t+1) = a + b × popsize(t) + c × popsize(t)^2
Based on three parameters a, b and c, this definition permits a large range of demographic scenarios. The table below presents the simplest examples:
Scenario | a | b | c |
A constant population size | 0 | 1 | 0 |
A constant growth of x individuals per generation | x | 1 | 0 |
A constant decrease of x individuals per generation | -x | 1 | 0 |
An exponential growth of rate x | 0 | x | 0 |
An exponential decrease of rate x | 0 | 1/x | 0 |
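The scenarios in this table can be checked by iterating the demographic function directly. The following Python sketch is only an illustration of the recurrence; the function names are hypothetical, and it does not reproduce how SMARTPOP rounds population sizes to whole individuals.

```python
def next_popsize(popsize, a, b, c):
    """One step of popsize(t+1) = a + b * popsize(t) + c * popsize(t)^2."""
    return a + b * popsize + c * popsize ** 2


def trajectory(popsize0, a, b, c, generations):
    """Population sizes from generation 0 to the given number of generations."""
    sizes = [popsize0]
    for _ in range(generations):
        sizes.append(next_popsize(sizes[-1], a, b, c))
    return sizes


print(trajectory(100, 0, 1, 0, 3))    # constant population size
print(trajectory(100, 10, 1, 0, 3))   # constant growth of 10 individuals
print(trajectory(100, -10, 1, 0, 3))  # constant decrease of 10 individuals
print(trajectory(100, 0, 2, 0, 3))    # exponential growth of rate 2
```

The three values a, b and c used here are the same three numbers passed to the flag demog.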
4.2.10 Mutation rates
Each type of DNA (mitochondrial, X, Y and autosome) has two mutation rates: a transition rate and a transversion rate. These rates must be given in mutations/site/generation. Overall mutation rates can be defined instead; if you do not know the ratio of transitions to transversions, you can set the transition and transversion rates each equal to half the total mutation rate.
The rationale behind having eight mutation rates is that the mutation rates measured for mtDNA, Y, X and autosomes can differ by more than an order of magnitude. Similarly, transition rates are usually an order of magnitude higher than transversion rates.
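The arithmetic of splitting a total mutation rate into a transition and a transversion rate can be sketched as follows. This is plain Python, not SMARTPOP code: the function name and the ts_tv_ratio parameter are hypothetical, and the sketch says nothing about the order in which SMARTPOP expects the eight values after the flag mu.

```python
def split_mutation_rate(total_rate, ts_tv_ratio=1.0):
    """Split a total rate into (transition, transversion) rates.

    ts_tv_ratio is the transition rate divided by the transversion rate.
    With the default ratio of 1, each rate is half of the total.
    """
    transversion = total_rate / (1.0 + ts_tv_ratio)
    transition = total_rate - transversion
    return transition, transversion


print(split_mutation_rate(1e-4))        # unknown ratio: half and half
print(split_mutation_rate(1e-4, 10.0))  # transitions ten times more frequent
```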
4.2.11 Burn-In phase
By default, simulations start with the entire population having the same DNA. Alternatively, a burn-in phase lets you start with an accelerated evolutionary process that brings your population to a high diversity, from which your simulation then starts. In this process, a higher mutation rate is applied to the population, which quickly increases diversity. The accelerated evolution stops after a set number of generations (flag burninH). A second burn-in phase with the normal mutation rate can follow, by setting the flag burnin to the chosen number of generations.
When dealing with multiple populations, both burn-in phases simulate a single metapopulation with no division. The split between populations happens at the end of the burn-in, leaving all populations with a similar diversity.
4.2.12 DNA sequences
Each individual has a set of sex-linked DNA sequences:
- A mitochondrial sequence
- A sequence from the non-recombining Y chromosome
- A set of unlinked loci on the X chromosome
- A set of unlinked loci on the autosomes
It is possible to choose the length of the loci for each type of DNA. Note: all autosomal loci must share the same length; all X loci must share the same length. It is also possible to choose the number of loci on X and autosomes.
For computational reasons, the length of sequences must be a multiple of 32; otherwise it is automatically raised to the next higher multiple of 32. For instance, if you enter a length of 33, your simulated locus will have 64 sites. The size of any DNA type that is unnecessary for a study can be set to 0, which makes the simulations faster.
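The rounding rule can be reproduced with integer arithmetic. The Python sketch below only illustrates the rule as described here; effective_length is a hypothetical helper, not a SMARTPOP function.

```python
def effective_length(requested_sites, word=32):
    """Round a requested sequence length up to a multiple of `word` sites."""
    return (requested_sites + word - 1) // word * word


for requested in (0, 32, 33, 3200):
    print(requested, effective_length(requested))
```

A length that is already a multiple of 32 (including 0) is left unchanged, while 33 becomes 64 as in the example above.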
4.2.13 Choice of outputs
SMARTPOP can produce different outputs for each simulation:
- A file with the simulated DNA (at the end of the run) in fasta format
- A file with the simulated DNA (at the end of the run) in arlequin format [2]
SMARTPOP also produces output for the whole set of simulations:
- A diversity file with the genetic diversity of a set sample
- A file with interpopulation diversity.
If the output name is set to 'foo' in the parameters, then the different files produced would be foo.fasta, foo.arl, fooall.txt and foobetween.txt.
4.3 Outputs
4.3.1 Diversity tables
A file is created for the diversity of the given sample. If a name is provided and the file already exists, the results are appended to the end of the existing file.
This file contains summary statistics of the population for each sample. Each line represents a sample at a specific time. For each sample and each DNA type, SMARTPOP writes to the file:
- Population size (not the sample size)
- Number of sites in the sequence evaluated
- Number of polymorphic sites
- Proportion of polymorphic sites
- Mean pairwise difference
- Number of haplotypes
- Allele heterozygosity
- Nei’s heterozygosity (i.e. heterozygosity averaged per site) [4]
- Theta Watterson θw
- Theta homozygosity θH
- Theta Pi θπ
- Tajima’s D
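For readers who want to cross-check some of these columns on a fasta output, the following Python sketch computes a few of the statistics from a small set of aligned sequences using their textbook definitions. It is an independent illustration under stated assumptions: the function name and the example sequences are made up, values are given per sequence rather than per site, and SMARTPOP's own normalization may differ.

```python
from itertools import combinations
from math import sqrt


def summary_statistics(seqs):
    """Textbook diversity statistics for at least two aligned sequences."""
    n = len(seqs)
    # Number of polymorphic sites: columns showing more than one allele
    S = sum(1 for site in zip(*seqs) if len(set(site)) > 1)
    # Mean pairwise difference over all pairs of sequences
    pairs = list(combinations(seqs, 2))
    pi = sum(sum(x != y for x, y in zip(s1, s2)) for s1, s2 in pairs) / len(pairs)
    # Watterson's theta
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    theta_w = S / a1
    # Constants of Tajima's D
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    variance = (c1 / a1) * S + (c2 / (a1 ** 2 + a2)) * S * (S - 1)
    tajima_d = (pi - theta_w) / sqrt(variance) if variance > 0 else float("nan")
    return {
        "polymorphic_sites": S,
        "mean_pairwise_difference": pi,
        "haplotypes": len(set(seqs)),
        "theta_watterson": theta_w,
        "tajima_d": tajima_d,
    }


stats = summary_statistics(["AAAA", "AAAT", "AATT", "AAAA"])
for name, value in stats.items():
    print(name, value)
```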
4.3.2 Fasta
You can output the population, or a sample of it, in a fasta format file. Such a file can easily be piped through other population genetics software, such as COMPUTE from the libSequence package [5].
4.3.3 Arlequin
You can output the population, or a sample of it, in a file in the format of the ARLEQUIN software. This allows you to make inferences on your simulations using ARLEQUIN [2].
4.4 Running SMARTPOP in parallel mode
It is possible to run SMARTPOP in parallel. You must build a specific version of SMARTPOP using a C++ compiler that includes the Message Passing Interface (MPI), such as mpic++.
When running in parallel, the simulations are distributed across several cores at the beginning of the run. Each process writes to different files to avoid conflicts.
4.6 Starting conditions: burn-in and pre-run
The starting conditions are a fundamental problem in forward-in-time simulations. How does one determine the complete genetic set of the population from which the evolution starts? This is equivalent to asking what the DNA state of an ancient population is, which is totally unknown in most studies. In studies based on modern DNA, you have (at most) knowledge about past individuals who left descendants, but nothing about the other past members of the population.
The question remains what to start the simulation with. A fully random DNA set is in no way meaningful. Instead, the genetic patterns of real populations are the result of a long evolutionary process that creates haplotypes: individuals still share a large part of their DNA sequence.
The start must therefore be a sequence shared among individuals. From this null diversity, one must produce an artificial diversity that corresponds to a real population. There are three options:
- Assumed equilibrium
It is possible that the population is in a state of equilibrium. The definition of "equilibrium" in population genetics is quite vague, but it is agreed that such an equilibrium will be reached by running simulations for a very long time. How long is long enough? It must be at least the TMRCA, to be independent of starting conditions.
As an indicator, the mean and variance of the time to the most recent common ancestor (TMRCA), assuming a constant population size, are given in [3, 1, 6].
- Burn-In
Alternatively, we offer the user the possibility to apply an accelerated mutation scheme to the population. It assigns a mutation rate one order of magnitude higher than the real rate, runs some generations and stops once a set number of generations is reached. The real simulation can then start from this point.
This approach is novel to SMARTPOP.
- Make your own recipe
With the flexibility of SMARTPOP, you can choose your own starting scheme. By controlling the mutation rate, you may reach a higher diversity than at equilibrium. You can also use the demography feature to create your own burn-in phase.
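For the assumed-equilibrium option, a rough order of magnitude for the TMRCA can be obtained from the standard neutral coalescent with constant population size. The Python sketch below uses the textbook result, not necessarily the exact expression intended by this manual: time is measured in coalescent units, in which each pair of lineages coalesces at rate 1, and the function name is hypothetical.

```python
def tmrca_mean_variance(n):
    """Mean and variance of the TMRCA of n lineages, in coalescent units.

    While k lineages remain, the waiting time to the next coalescence is
    exponential with rate k(k-1)/2, hence mean 2/(k(k-1)) and variance
    4/(k(k-1))^2; the waiting times are independent and add up.
    """
    mean = sum(2 / (k * (k - 1)) for k in range(2, n + 1))
    variance = sum(4 / (k * (k - 1)) ** 2 for k in range(2, n + 1))
    return mean, variance


for n in (2, 10, 100):
    mean, variance = tmrca_mean_variance(n)
    print(n, round(mean, 4), round(variance, 4))
```

To convert to generations, the mean is multiplied by the effective number of gene copies for the DNA type considered and the variance by its square. This number differs between mitochondrial, Y, X and autosomal DNA, so the run length needed to approach equilibrium differs as well.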
A Default parameters
Flag + Argument | Input File Keyword + Argument | Meaning | Parameter File Example | Default Value |
--header | output/header integer (0 or 1) | Output the header at the beginning of diversity files | headerOutput 1 | 0 |
--fasta | output/fasta integer (0 or 1) | Save each simulation in a fasta format file at the end of the run | output/fasta 1 | 0 |
--nexus | output/nexus integer (0 or 1) | Save each simulation in a nexus format file at the end of the run | output/nexus 1 | 0 |
--arl | output/arlequin integer (0 or 1) | Save each simulation in an arlequin format file at the end of the run | output/arlequin 1 | 0 |
-v | output/verbose int | Verbose | Verbose 1 | 0 |
-i myfile.txt | NA | Input file with parameters | SMARTPOP_parameters.txt | NULL |
-o filename | output/filename filename | Name of the outputs | fileOutput smartpop_13145105_2 | smartpop_seed_matingSystem |
-s integer | simulation/seed integer | Random seed | seed 13145105 | random |
-t integer | simulation/step integer | Number of generations between two sampling events | step 100 | 100 |
--nstep integer | simulation/nstep integer | Number of sampling events through time for one simulation | nstep 10 | 1 |
--nsimu integer | simulation/nsimu integer | Number of simulations | nSimu 500 | 500 |
--burnin double | simulation/burninT double | Number of generations for the burn-in phase with normal mutation rate | burninT 0.5 | 0 |
--burninH int | simulation/burninH int | Number of generations for the burn-in phase with high mutation rate | burninType 1 | 1 |
--sample integer | simulation/sample integer | Sample size for estimation of diversity, and fasta/arlequin outputs | sample 100 | 100 |
--nbpopsample integer | simulation/nbpopsample integer | Number of populations to sample for estimation of diversity, and fasta/arlequin outputs | nbpopsample 3 | 1 |
--between | simulation/betweenDiv boolean | Output between-population diversity (X=0 No, X=1 Yes) | betweenDiv false | false |
-p integer | model/popsize integer | Total census population size | popsize 150 | 100 |
--nbpop integer | model/nbPop integer | Number of communities/populations/patches | nbPop 10 | 1 |
--nbSources integer | model/nbSources integer | Number of source populations for APA type population scheme | nbSources 3 | 0 |
--pmate double | model/pmate integer | Relaxation parameter on the mating rule | pmate 0.5 | 0 |
--pmig double | model/pmig integer | Relaxation parameter on the migration rule | pmig 0.5 | 0 |
--mu double (x4 or x8) | | Mutation rates (simple rates, respectively mtDNA, X, Y and autosome) | --mu 0.0001 0.0001 0.0001 0.0001 | |
--sizeA integer | model/dna/sizeA integer | Size, in number of sites, per autosomal locus | sizeA 3200 | 3200 |
--sizeX integer | model/dna/sizeX integer | Size, in number of sites, per locus on the X chromosome | sizeX 3200 | 3200 |
--sizeY integer | model/dna/sizeY integer | Size, in number of sites, of the Y chromosome | sizeY 3200 | 3200 |
--sizeMt integer | model/dna/sizeMt integer | Size, in number of sites, of the mitochondrial sequence | sizeMt 3200 | 3200 |
--nbLociX integer | model/dna/nbLociX integer | Number of unlinked loci on the X chromosome | nbLociX 1 | 1 |
--nbLociA integer | nbLociA integer | Number of unlinked autosomal loci | nbLociA 1 | 1 |
--popsizeN integer integer | model/population/popsize integer | Census population size for a given population | popsize 150 | 100 |
--inbreedingN integer integer | model/population/inbreeding integer | Rules of inbreeding for a given population | inbreeding 1 | 0 |
--child2 | model/population/varianceNbChildren integer | Set the variance in number of children to 0 (all couples have 2 children) or 1 (Poisson law) | inbreeding 1 | 1 |
--polygynyN integer integer | model/population/polygyny integer | Set the maximum number of wives per man for a given population | polygyny 1 | 1 |
--polygyny integer | NA | Set the maximum number of wives per man for all populations | 1 | |
--polyandryN integer integer | model/population/polyandry integer | Set the maximum number of husbands per woman for a given population | polyandry 1 | 1 |
--polyandry integer | NA | Set the maximum number of husbands per woman for all populations | 1 | |
--matN integer integer | model/population/MS integer | Set the cousin alliance for a given population | MS 1 | 1 |
--mat integer | NA | Set the cousin alliance for all populations | 1 | |
--demogN double double double integer | model/population/demog /a double /b double /c double | Set the demography function for a given population | a 0 b 1 c 0 | a 0 b 1 c 0 |
--demog double double double | NA | Set the demography function for all populations | 0 1 0 | |
B References
[1] P Donnelly and S Tavaré. Coalescents and genealogical structure under neutrality. Annual Review of Genetics, 29(1):401–421, 1995.
[2] Laurent Excoffier and Heidi E L Lischer. Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3):564–567, May 2010.
[3] Richard R Hudson. Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7(1):44, 1990.
[4] Masatoshi Nei. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89(3):583–590, 1978.
[5] K Thornton. LibSequence: A C++ class library for evolutionary genetic analysis. Bioinformatics, 19(17):2325–2327, 2003.
[6] John Wakeley. Coalescent Theory: An Introduction. Roberts & Company Publishers, first edition, 2009.