SMARTPOP Examples

SMARTPOP 2.0 Manual

July, 2015

1 Introduction to SMARTPOP
2 Requirements
3 Installation
4 SMARTPOP features
4.1 Input
   4.1.1 Via the command line
   4.1.2 Via input files
   4.1.3 Windows executable
4.2 Simulation parameters
   4.2.1 Verbose
   4.2.2 Random seed
   4.2.3 Population size
   4.2.4 Sample size
   4.2.5 Number of simulations
   4.2.6 Number of generations to run
   4.2.7 Mating system and number of offspring
   4.2.8 Sibling matings
   4.2.9 Demography parameters
   4.2.10 Mutation rates
   4.2.11 Burn-In phase
   4.2.12 DNA sequences
   4.2.13 Choice of outputs
   4.2.14 Save/Load
4.3 Outputs
   4.3.1 Diversity tables
   4.3.2 Fasta
   4.3.3 Arlequin
4.4 Running SMARTPOP in parallel mode
4.5 Setting up a complex scheme (change of parameters through time)
4.6 Starting conditions: burn-in and pre-run
5 Examples
A Default parameters
B References

1 Introduction to SMARTPOP

SMARTPOP is a fast and flexible forward-in-time simulator for population genetics. Designed for speed, it is available in both serial and parallel versions. Developed for anthropological inferences on human populations, SMARTPOP simulates individuals with sequences of sex-linked DNA (mitochondrial, X and Y chromosomes) and autosomes. Its flexible demographic models and social mating rules make it possible to study social dynamics.
For any use of SMARTPOP or re-use of its source code, please cite:
Guillot and Cox, 2014. SMARTPOP: inferring the impact of social dynamics on genetic diversity through high speed simulations. BMC Bioinformatics 2014 15:175

2 Requirements

SMARTPOP has been developed in C++ using the Boost C++ library. To build the software from source you need a C++ compiler such as g++ or Visual Studio installed on your computer. To compile the parallel version of SMARTPOP, we recommend using mpic++.

3 Installation

You can directly download a binary (executable) version of SMARTPOP compatible with your OS at http://smartpop.sourceforge.net/smartpop2/download.html
Alternatively you can build SMARTPOP from the source code following these instructions:

To build SMARTPOP from source on a UNIX machine:

Download the source code from http://smartpop.sourceforge.net/download
wget http://smartpop.sourceforge.net/download/s/smartpop2/SMARTPOP.zip
Move the archive to the folder where you want SMARTPOP to be installed
mv SMARTPOP.zip /home/login/foo
Uncompress the archive
unzip SMARTPOP.zip
Move to the serial source directory
cd SMARTPOP/src-serial/
Build SMARTPOP
make
Optional: Build the test package
make test
Optional: Run the test package
./test

SMARTPOP is then ready to launch via the command line:
./smartpop

For building the parallel version, go to the src-parallel directory:

cd SMARTPOP/src-parallel/
make
Launch the parallel version (on UNIX system) on four cores:
mpiexec -n 4 ./smartpopMPI

We have run tests on the main operating systems available today. If you encounter any trouble, please contact us.

4 SMARTPOP features

SMARTPOP must be called from the command line (./smartpop) from the directory where it is installed (or add that directory to your PATH). One call launches a set of simulations with one set of parameters.

4.1 Input

A set of simulations relies on several parameters, all of which have default values except the random seed. The default values are listed in Appendix A (Default parameters). You can set these parameters either with flags on the command line or via an input file.

4.1.1 Via the command line

You can define all the parameters of a set of simulations using flags on the command line. Any parameter that is not set by a flag receives its default value.
The order of parameters does not matter. If the arguments are malformed, the program will, in most cases, raise an error and not start.
To run a set of 1000 simulations with population size 200 evolving for 500 generations, enter:
./smartpop -p 200 -t 500 --nsimu 1000


Flag	Argument	Meaning
-v		Verbose
-h		Help
-i		Input file with parameters
-o	string	Name for the output files
-s	integer	Random seed
-p	integer	Population size
--popsizeN	integer integer	Population size of population X
--npop	integer	Number of populations
--sample	integer	Sample size
--nsimu	integer	Number of simulations
-t	integer	Number of generations between two sampling events
--nstep	integer	Number of sampling events through time for one simulation run
--mat	integer	Mating system (X: 1 random, 2 MBD/FZS, 3 MZD/MZS, 4 FZD/MBS, FBD/FBS)
--inbreeding	integer	Prevent half-sibling (1) or full-sibling (2) mating; sibling mating is allowed if 0
--polygamy	integer	Set the maximum number of mates X per individual (if X=1 it becomes monogamy)
--polyandry	integer	Set the maximum number of husbands X per woman (if X=1 it becomes monogamy)
--polygyny	integer	Set the maximum number of wives X per man (if X=1 it becomes monogamy)
--polygamyN	integer integer	Set the maximum number of mates X per individual (if X=1 it becomes monogamy) for a given population
--polygynyN	integer integer	Set the maximum number of wives X per man (if X=1 it becomes monogamy) for a given population
--mu	double (x4 or x8)	Mutation rates (simple rates)
--nbLociA	integer	Number of unlinked autosomal loci
--sizeA	integer	Size, in number of sites, per autosomal locus
--nbLociX	integer	Number of unlinked loci on the X chromosome
--sizeX	integer	Size, in number of sites, per locus on the X chromosome
--sizeY	integer	Size, in number of sites, on the Y chromosome
--sizeMt	integer	Size, in number of sites, of the mitochondrial sequence
--burnin	int	Set the number of generations for the burn-in phase with normal mutation rate
--burninH	int	Set the number of generations for the burn-in phase with high mutation rate
--demog	double double double	Set the demographic function
--demogN	double double double int	Set the demographic function for population X
--fasta		Save each simulation in a fasta format file at the end of the run
--arl		Save each simulation in an Arlequin format file at the end of the run
--header		Output the header at the beginning of diversity files
--child2		Forces all couples to have 2 children (no variance in reproductive success)
--pooled		pooled sample (X indiv from Y demes)
--scattered		scattered sample (X indiv from all demes)
--local		local sample (X indiv from 1 deme)
--blind		random sample without knowing the structure
--nbsamplepop	integer	number of population to sample for pooled scheme and for inter-population diversity comparison

Table 1: Flags for command line call of SMARTPOP
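
When many parameter sets need to be run, the call can be assembled from a script rather than typed by hand. The sketch below is a hypothetical Python helper (build_smartpop_command is not part of SMARTPOP) that turns a dictionary of flags from Table 1 into an argument list. It only builds the command; it does not check that the flags are valid and does not run anything.

```python
import shlex


def build_smartpop_command(params, executable="./smartpop"):
    """Build an argument list for a SMARTPOP call from a dict of flag -> value.

    Flags are used verbatim as given by the caller (see Table 1).
    A value may be a scalar, a list or tuple (for flags taking several
    arguments, e.g. --demog), or None for flags without argument (e.g. --fasta).
    """
    argv = [executable]
    for flag, value in params.items():
        argv.append(flag)
        if value is None:
            continue  # flag without argument
        values = value if isinstance(value, (list, tuple)) else [value]
        argv.extend(str(v) for v in values)
    return argv


# Population size 200, 500 generations, 1000 simulations.
cmd = build_smartpop_command({"-p": 200, "-t": 500, "--nsimu": 1000})
print(shlex.join(cmd))  # ./smartpop -p 200 -t 500 --nsimu 1000
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the directory that contains the executable.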

4.1.2 Via input files

Instead of defining the parameters on the command line, you can also use an input file. This is useful, for example, when you change many parameters from their default values.
You must respect the format of this file for it to work. It is an HTML file. Respect the order of flags or it may not work. Each line describes the value of one parameter. All parameters that are not defined in this input file take the default values given in Appendix A (Default parameters).

By default, each run of SMARTPOP creates a parameter file called SMARTPOP_parameters.txt. It has exactly the same format as the input parameter file, so running ./smartpop -i SMARTPOP_parameters.txt repeats exactly the same set of simulations. This is possible because the random seed is one of the input parameters.

4.1.3 Windows executable

The available Windows executable only handles input files. SMARTPOP will read the file SMARTPOP_parameters.txt, which must be in the same directory. For each new launch of SMARTPOP you must set the parameters by modifying this file. If you want the seed to be picked randomly, remember to erase the line with the seed keyword from the parameter file.

4.2 Simulation parameters

4.2.1 Verbose

It is recommended to begin using SMARTPOP with the verbose option on. This makes SMARTPOP print relevant information on the command line while the simulations are running. It is a good way to check that the parameters are set correctly, as well as to follow the progress of the program.

4.2.2 Random seed

Simulations rely heavily on random processes. These processes are simulated with sequences of random numbers, which are central to the program. Each time the program starts, it uses a random seed that uniquely determines this sequence. By default the seed is picked at random, but you can set it explicitly to repeat an earlier run.

4.2.3 Population size

The population size set in SMARTPOP corresponds to the census population size at the beginning of the simulation. If the population size is set to be non-constant, the size will change through time.

4.2.4 Sample size

Instead of analyzing the entire population, you can sample a certain number of individuals. This matches a “real life” situation where you do not have access to the DNA of your whole population. If you define a sample size, all the outputs (diversity, but also fasta and arlequin files) are generated from a random sample of that size drawn from your population.

4.2.5 Number of simulations

You can define the number of simulations that will be run with this set of parameters. If you are loading a set of simulations from a directory, this number must be less than or equal to the number of saved simulations.

4.2.6 Number of generations to run

During each simulation, the population evolves for a number of generations t between two sampling events. By default, there is only one sampling event at the end of those t generations, where output files are produced and diversity is measured. If you define multiple sampling events ($N_{Sampling}$) through time, then t is the number of generations between two consecutive sampling events, and the total number of generations is:

$G_{Total} = N_{Sampling} \times t$
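
As a small worked example of this relation, the following Python sketch computes the total run length and the generations at which samples would be taken. The function names are hypothetical, and the schedule assumes that sampling happens at the end of each block of t generations, as described above.

```python
def total_generations(t, nstep=1):
    """Total number of generations simulated: G_Total = N_Sampling * t."""
    return nstep * t


def sampling_schedule(t, nstep=1):
    """Generations at which a sample is taken, assuming one sample
    at the end of every block of t generations."""
    return [t * k for k in range(1, nstep + 1)]


# Hypothetical run with -t 100 --nstep 3
print(total_generations(100, 3))   # 300
print(sampling_schedule(100, 3))   # [100, 200, 300]
```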

4.2.7 Mating system and number of offspring

The mating system is described by two parameters: polygamy and cousin alliance. You can set a maximum number of mates per individual, either for both sexes (flag polygamy), or for males (polygyny) or females (polyandry).
For the cousin alliance, you can set a preferential alliance with one or more types of cousin using the flag mat (X: 1 random, 2 MBD/FZS, 3 MZD/MZS, 4 FZD/MBS, FBD/FBS).

4.2.8 Sibling matings

For any mating system, you can forbid mating between full or half siblings using the flag inbreeding.

4.2.9 Demography parameters

The demography is defined by a Markovian function:
popsize(t+1) = a + b × popsize(t) + c × popsize(t)²
Based on three parameters a, b and c, this definition permits a large range of demographic scenarios. The table below presents the simplest examples:

Scenario	a	b	c
A constant population size	0	1	0
A constant growth of x individuals per generation	x	1	0
A constant decrease of x individuals per generation	-x	1	0
An exponential growth of rate x	0	x	0
An exponential decrease of rate x	0	1/x	0
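
To see how a given choice of a, b and c plays out over time, the recurrence can be iterated directly. The Python sketch below is only an illustration of the formula (the function names are hypothetical); how SMARTPOP itself rounds non-integer population sizes is not covered here.

```python
def next_popsize(size, a, b, c):
    """One step of the demographic function:
    popsize(t+1) = a + b * popsize(t) + c * popsize(t)^2."""
    return a + b * size + c * size ** 2


def trajectory(initial_size, a, b, c, generations):
    """Population sizes from generation 0 to the given generation."""
    sizes = [initial_size]
    for _ in range(generations):
        sizes.append(next_popsize(sizes[-1], a, b, c))
    return sizes


print(trajectory(100, 0, 1, 0, 3))    # constant size: [100, 100, 100, 100]
print(trajectory(100, 10, 1, 0, 3))   # +10 per generation: [100, 110, 120, 130]
print(trajectory(100, -10, 1, 0, 3))  # -10 per generation: [100, 90, 80, 70]
print(trajectory(100, 0, 2, 0, 3))    # exponential growth, rate 2: [100, 200, 400, 800]
```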

4.2.10 Mutation rates

Each type of DNA (mitochondrial, X, Y and autosome) has two mutation rates: a transition rate and a transversion rate. These rates must be given in mutations/site/generation. You can also define overall mutation rates; if you do not know the ratio of transitions to transversions, set the transition rate and the transversion rate each to half the total mutation rate.
The rationale behind having only eight mutation rates is that mutation rates measured for mtDNA, Y, X and autosomes can differ by more than an order of magnitude. Similarly, transition rates are usually an order of magnitude higher than transversion rates.
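
When only a total rate is known for each DNA type, the equal split described above amounts to the following. This is a minimal Python sketch with placeholder rates; the helper name is hypothetical, and the order in which the resulting eight values should be passed to SMARTPOP is not specified here.

```python
def split_total_rate(total_rate):
    """Split a total mutation rate equally into (transition, transversion)."""
    half = total_rate / 2.0
    return half, half


# Placeholder total rates, in mutations/site/generation.
total_rates = {"mtDNA": 1e-4, "X": 1e-4, "Y": 1e-4, "autosome": 1e-4}

rates = {dna: split_total_rate(rate) for dna, rate in total_rates.items()}
for dna, (transition, transversion) in rates.items():
    print(f"{dna}: transition={transition:g}, transversion={transversion:g}")
```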

4.2.11 Burn-in phase

By default, simulations start with the entire population sharing the same DNA. Alternatively, a burn-in phase lets you start with an accelerated evolutionary process that brings your population to a high diversity, from which your simulation then starts. During this process, a higher mutation rate is applied to the population, which quickly increases diversity. The accelerated evolution stops after a set number of generations. A second burn-in phase with the normal mutation rate can follow; set the flag burnin to the chosen number of generations.
When dealing with multiple populations, both burn-in phases simulate a single metapopulation with no division. The split between populations happens at the end of the burn-in, leaving all populations with similar diversity.

4.2.12 DNA sequences

Each individual carries the following DNA sequences:

A mitochondrial sequence
A sequence from the non-recombining Y chromosome
A set of unlinked loci on the X chromosome
A set of unlinked loci on the autosomes

It is possible to choose the length of each locus by type. Note: all autosomal loci must share the same length; all X loci must share the same length. It is also possible to choose the number of loci on X and autosomes.
For computational reasons, the length of sequences must be a multiple of 32, or it will be automatically forced to the next higher multiple of 32. For instance, if you enter a length of 33, your simulated locus will have 64 sites. The size of any DNA type that is not needed for a study can be set to 0, which makes the simulations faster.
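
The rounding rule can be checked before launching a run. The Python sketch below reproduces the rule as stated (round up to the next multiple of 32, with 0 left as 0); the function name is hypothetical and this is not SMARTPOP's own code.

```python
def effective_length(requested_sites, block=32):
    """Length actually simulated for a requested locus length:
    the smallest multiple of `block` that is >= requested_sites."""
    if requested_sites <= 0:
        return 0  # a size of 0 disables this DNA type
    return ((requested_sites + block - 1) // block) * block


for requested in (0, 32, 33, 3200):
    print(requested, "->", effective_length(requested))
# 0 -> 0
# 32 -> 32
# 33 -> 64
# 3200 -> 3200
```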

4.2.13 Choice of outputs

SMARTPOP can produce different outputs for each simulation:

A file with the simulated DNA (at the end of the run) in fasta format
A file with the simulated DNA (at the end of the run) in arlequin format [2]

SMARTPOP also produces output for the whole set of simulations:

A diversity file with the genetic diversity of a given sample
A file with interpopulation diversity.

If the output name is set to 'foo' in the parameters, then the different files produced would be foo.fasta, foo.arl, fooall.txt and foobetween.txt.

4.3 Outputs

4.3.1 Diversity tables

A file is created for the diversity of the given sample. If a name is provided and the file already exists, the results are appended to the end of the existing file.
This file contains summary statistics for the population for each sample. Each line represents a sample at a specific time. For each sample and each DNA type, SMARTPOP writes to the file:

Population size (not the sample size)
Number of sites in the sequence evaluated
Number of polymorphic sites
Proportion of polymorphic sites
Mean pairwise difference
Number of haplotypes
Allele heterozygosity $H_A = \frac{N}{N-1}\left(1 - \sum_{i=1}^{h} f_i^2\right)$
Nei’s heterozygosity (i.e. heterozygosity averaged per site) [4] $H_N = \frac{1}{S}\,\frac{N}{N-1}\sum_{i=1}^{S}\left(1 - \sum_{j=1}^{4} f_{ij}^2\right)$
Theta Watterson θ_w $\theta_w = \frac{S}{\sum_{i=1}^{N-1} \frac{1}{i}}$
Theta homozygosity θ_H $\theta_H = \frac{1}{1 - H} - 1$
Theta Pi θ_π $\theta_\pi = \frac{N}{N-1}\sum_{i=1}^{h}\sum_{j=1}^{h} dist(i,j)$
Tajima’s D $D = \frac{\theta_\pi - \theta_w}{\sqrt{\left(b_1 - \frac{1}{a_1}\right)\frac{1}{a_1} S + \left(b_2 - \frac{n+2}{a_1 n} + \frac{a_2}{a_1^2}\right)\frac{1}{a_1^2 + a_2} S (S-1)}}$ with $a_1 = \sum_{i=1}^{n-1} \frac{1}{i}$, $a_2 = \sum_{i=1}^{n-1} \frac{1}{i^2}$, $b_1 = \frac{n+1}{3(n-1)}$, $b_2 = \frac{2(n^2 + n + 3)}{9n(n-1)}$
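
To make some of these statistics concrete, the Python sketch below computes allele heterozygosity, theta homozygosity, Watterson's theta and Tajima's D from their standard textbook definitions. It is an illustration only: the function names are hypothetical, the inputs (haplotype counts, number of polymorphic sites s, sample size n, and a precomputed theta pi) are assumed to be already available, and it is not SMARTPOP's own implementation.

```python
import math


def allele_heterozygosity(haplotype_counts):
    """H_A = N/(N-1) * (1 - sum of squared haplotype frequencies)."""
    n = sum(haplotype_counts)
    sum_sq = sum((count / n) ** 2 for count in haplotype_counts)
    return n / (n - 1) * (1 - sum_sq)


def theta_homozygosity(h):
    """theta_H = 1/(1 - H) - 1."""
    return 1 / (1 - h) - 1


def tajima_constants(n):
    """Constants a1, a2, b1, b2 for a sample of n sequences."""
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n ** 2 + n + 3) / (9 * n * (n - 1))
    return a1, a2, b1, b2


def theta_watterson(s, n):
    """theta_w = S / a1, with S the number of polymorphic sites."""
    a1, _, _, _ = tajima_constants(n)
    return s / a1


def tajimas_d(theta_pi, s, n):
    """Tajima's D; returns NaN when there are no polymorphic sites."""
    if n < 4:
        raise ValueError("Tajima's D needs at least 4 sequences")
    if s == 0:
        return float("nan")
    a1, a2, b1, b2 = tajima_constants(n)
    e1 = (b1 - 1 / a1) / a1
    e2 = (b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2) / (a1 ** 2 + a2)
    return (theta_pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Illustrative values: 10 sequences, 16 polymorphic sites.
print(theta_watterson(16, 10))
print(tajimas_d(3.888889, 16, 10))
```

A negative D, as in this illustrative call, means that theta pi is smaller than theta Watterson; a value near zero means the two estimates agree.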

4.3.2 Fasta

You can output the population, or a sample of it, in a fasta format file. Such a file can easily be passed to other population genetics software, such as COMPUTE from the libSequence package [5].

4.3.3 Arlequin

You can output the population, or a sample of it, in the file format of the ARLEQUIN software. This allows you to make inferences on your simulations using ARLEQUIN [2].

4.4 Running SMARTPOP in parallel mode

It is possible to run SMARTPOP in parallel. You must build a specific version of SMARTPOP using a C++ compiler with Message Passing Interface (MPI) support, such as mpic++.
When running in parallel, the simulations are distributed across several cores at the beginning of the run. Each process writes to its own files to avoid conflicts.

4.6 Starting conditions: burn-in and pre-run

The starting conditions are a fundamental problem in forward-in-time simulations. How should one determine the complete genetic makeup of the population from which evolution starts? This is equivalent to asking what the DNA state of an ancient population was, which is unknown in most studies. In studies based on modern DNA, you have (at most) knowledge about past individuals who left descendants, but nothing about the other past members of the population.
The question remains what to start the simulation with. A fully random DNA set is in no way meaningful. Instead, the genetic patterns of real populations are the result of a long evolutionary process that creates haplotypes: individuals still share a large part of their DNA sequence.
The start must therefore be a single sequence shared among all individuals. From this null diversity, one must produce an artificial diversity that corresponds to a real population. There are three options:

Assumed equilibrium
It is possible that the population is in a state of equilibrium. The definition of "equilibrium" in population genetics is quite vague, but it is agreed that such an equilibrium will be reached by running simulations for a very long time. How long is long enough? It must be at least the TMRCA, to be independent of starting conditions.
As an indicator, the mean and variance of the time to the most recent common ancestor (TMRCA), assuming a constant population size [3, 1, 6], are: $E(TMRCA) = 2n\left(1 - \frac{1}{n}\right)$ $var(TMRCA) = n\left(8\sum_{i=2}^{n}\frac{1}{i^2} - 4\left(1 - \frac{1}{n}\right)^2\right)$
Burn-In
Alternatively, we offer the user the possibility to apply an accelerated mutation scheme to the population. It assigns a mutation rate one order of magnitude higher than the real rate, runs for a set number of generations, and then stops. The real simulation can then start from this point.
This approach is novel to SMARTPOP.
Make your own recipe
With the flexibility of SMARTPOP, you can choose your own starting scheme. By controlling the mutation rate, you may reach a higher diversity than at equilibrium. You can also use the demography feature to create your own burn-in phase.

A Default parameters


Flag + Argument	Input File Keyword + Argument	Meaning	Parameter File Example	Default Value

--header	output/header integer (0 or 1)	Output the header at the beginning of diversity files	headerOutput 1	0

--fasta	output/fasta integer (0 or 1)	Save each simulation in a fasta format file at the end of the run	output/fasta 1	0

--nexus	output/nexus integer (0 or 1)	Save each simulation in a nexus format file at the end of the run	output/nexus 1	0

--arl	output/arlequin integer (0 or 1)	Save each simulation in an Arlequin format file at the end of the run	output/arlequin 1	0

-v	output/verbose int	Verbose	Verbose 1	0

-i myfile.txt	NA	Input file with parameters	SMARTPOP_parameters.txt	NULL

-o filename	output/filename filename	Name of the outputs	fileOutput smartpop_13145105_2	smartpop_seed_matingSystem

-s integer	simulation/seed integer	Random seed	seed 13145105	random

-t integer	simulation/step integer	Number of generations between two sampling events	step 100	100

--nstep integer	simulation/nstep integer	Number of sampling events through time for one simulation	nstep 10	1

--nsimu integer	simulation/nsimu integer	Number of simulations	nSimu 500	500

--burnin double	simulation/burninT double	Set the number of generations for the burn-in phase with normal mutation rate	burninT 0.5	0

--burninH int	simulation/burninH int	Set the number of generations for burnin with high mutation rate	burninType 1	1

--sample integer	simulation/sample integer	Sample size for estimation of diversity, and fasta/arlequin outputs	sample 100	100

--nbpopsample integer	simulation/nbpopsample integer	Number of populations to sample for estimation of diversity, and fasta/arlequin outputs	nbpopsample 3	1

--between	simulation/betweenDiv boolean	Output between population diversity (X=0 No, X=1 Yes)	beetweenDiv false	false

-p integer	model/popsize integer	Total census population size	popsize 150	100

--nbpop integer	model/nbPop integer	Number of communities/population/patches	nbPop 10	1

--nbSources integer	model/nbSources integer	Number of source populations for APA type population scheme	nbSources 3	0

--pmate double	model/pmate integer	Relaxation parameter on the mating rule	pmate 0.5	0

--pmig double	model/pmig integer	Relaxation parameter on the migration rule	pmig 0.5	0

--mu double (x4 or x8)		Mutation rates (simple rates, respectively mtDNA, X, Y and autosome)	--mu 0.0001 0.0001 0.0001 0.0001

--sizeA integer	model/dna/sizeA integer	Size, in number of sites, per autosome locus	sizeA 3200	3200

--sizeX integer	model/dna/sizeX integer	Size, in number of sites, per locus on the X chromosome	sizeX 3200	3200

--sizeY integer	model/dna/sizeY integer	Size, in number of sites, of the Y chromosome	sizeY 3200	3200

--sizeMt integer	model/dna/sizeMt integer	Size, in number of sites, of the mitochondrial sequence	sizeMt 3200	3200

--nbLociX integer	model/dna/nbLociX integer	Number of unlinked loci on the X chromosome	nbLociX 1	1

--nbLociA integer	nbLociA integer	Number of unlinked autosomal loci	nbLociA 1	1

--popsizeN integer integer	model/population/popsize integer	Census population size for a given population	popsize 150	100

--inbreedingN integer integer	model/population/inbreeding integer	Rules of inbreeding for a given population	inbreeding 1	0

--child2	model/population/varianceNbChildren integer	Set the variance in number of children to 0 (all couples have 2 children) or 1 (Poisson law)	inbreeding 1	1

--polygynyN integer integer	model/population/polygyny integer	Set the maximum number of wives per man for a given population	polygyny 1	1

--polygyny integer	NA	Set the maximum number of wives per man for all populations		1

--polyandryN integer integer	model/population/polyandry integer	Set the maximum number of husbands per woman for a given population	polyandry 1	1

--polyandry integer	NA	Set the maximum number of husbands per woman for all populations		1

--matN integer integer	model/population/MS integer	Set the cousin alliance for a given population	MS 1	1

--mat integer	NA	Set the cousin alliance for all populations		1

--demogN double double double integer	model/population/demog /a double /b double /c double	Set the demography function for a given population	a 0 b 1 c 0	a 0 b 1 c 0

--demog double double double	NA	Set the demography function for all populations		0 1 0

B References

[1] P Donnelly and S Tavaré. Coalescents and genealogical structure under neutrality. Annual Review of Genetics, 29(1):401–421, 1995.

[2] Laurent Excoffier and Heidi E L Lischer. Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3):564–567, May 2010.

[3] Richard R Hudson. Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7(1):44, 1990.

[4] Masatoshi Nei. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89(3):583–590, 1978.

[5] K Thornton. LibSequence: A C++ class library for evolutionary genetic analysis. Bioinformatics, 19(17):2325–2327, 2003.

[6] John Wakeley. Coalescent Theory: An Introduction. Roberts & Company Publishers, first edition, 2009.
