SMARTPOP

User Manual

SMARTPOP 2.0 Manual

July, 2015
1 Introduction to SMARTPOP
2 Requirements
3 Installation
4 SMARTPOP features
  4.1 Input
   4.1.1 Via the command line
   4.1.2 Via input files
   4.1.3 Windows executable
  4.2 Simulation parameters
   4.2.1 Verbose
   4.2.2 Random seed
   4.2.3 Population size
   4.2.4 Sample size
   4.2.5 Number of simulations
   4.2.6 Number of generations to run
   4.2.7 Mating system and number of offspring
   4.2.8 Sibling matings
   4.2.9 Demography parameters
   4.2.10 Mutation rates
   4.2.11 Burn-In phase
   4.2.12 DNA sequences
   4.2.13 Choice of outputs
   4.2.14 Save/Load
  4.3 Outputs
   4.3.1 Diversity tables
   4.3.2 Fasta
   4.3.3 Arlequin
  4.4 Running SMARTPOP in parallel mode
  4.5 Setting up a complex scheme (change of parameters through time)
  4.6 Starting conditions: burn-in and pre-run
5 Examples
A Default parameters
B References

1 Introduction to SMARTPOP

SMARTPOP is a fast and flexible forward-in-time simulator for population genetics. Developed specifically for speed, it is available in both serial and parallel versions. Designed for anthropological inference on human populations, SMARTPOP simulates individuals with sequences of sex-linked DNA (mitochondrial, X and Y chromosomes) and autosomes. Its flexible demographic models and social rules of mating make it possible to study social dynamics.
For any use of SMARTPOP or re-use of its source code, please cite:
Guillot and Cox, 2014. SMARTPOP: inferring the impact of social dynamics on genetic diversity through high speed simulations. BMC Bioinformatics 2014 15:175

2 Requirements

SMARTPOP has been developed in C++ using the Boost C++ library. To build the software from sources you need a C++ compiler such as g++ or Visual Studio installed on your computer. To compile the parallel version of SMARTPOP, we recommend using mpic++.

3 Installation

You can directly download a binary (executable) version of SMARTPOP compatible with your OS at http://smartpop.sourceforge.net/smartpop2/download.html
Alternatively you can build SMARTPOP from the source code following these instructions:

To build SMARTPOP from source on a UNIX machine:

SMARTPOP is then ready to launch via the command line:
./smartpop

For building the parallel version, go to the src-parallel directory:

We have run tests on the main operating systems available today. If you encounter any trouble, please contact us.

4 SMARTPOP features

SMARTPOP must be called from the command line (./smartpop) from the directory where it is installed (or after adding that directory to your PATH). One call launches a set of simulations with one set of parameters.

4.1 Input

A set of simulations relies on several parameters, all of which have default values except the random seed. The default values are given in Appendix A (Default parameters). You can set these parameters either using flags on the command line or via an input file.

4.1.1 Via the command line

You can define all the parameters needed to start a set of simulations using flags on the command line. All parameters that are not set by a flag receive their default values.
The order of the parameters does not matter. If you give an invalid sequence of arguments on the command line, the program will raise an error and not start (in most cases).
To run a set of 1000 simulations with population size 200 evolving for 500 generations, enter:
./smartpop -p 200 -t 500 --nsimu 1000

Flag and argument | Meaning
-v | Verbose
-h | Help
-i | Input file with parameters
-o string | Name for the output files
-s integer | Random seed
-p integer | Population size
--popsizeN integer integer | Population size of population X
--npop integer | Number of populations
--sample integer | Sample size
--nsimu integer | Number of simulations
-t integer | Number of generations between two sampling events
--nstep integer | Number of sampling events through time for one simulation run
--mat integer | Mating system (X = 1 random, 2 MBD/FZS, 3 MZD/MZS, 4 FZD/MBS, FBD/FBS)
--inbreeding integer | Prevent mating between half siblings (1) or full siblings (2); allowed if 0
--polygamy integer | Set the maximum number of mates X per individual (if X=1, this is monogamy)
--polyandry integer | Set the maximum number of husbands X per woman (if X=1, this is monogamy)
--polygyny integer | Set the maximum number of wives X per man (if X=1, this is monogamy)
--polygamyN integer integer | Set the maximum number of mates X per individual (if X=1, this is monogamy) for a given population
--polygynyN integer integer | Set the maximum number of wives X per man (if X=1, this is monogamy) for a given population
--mu double (x4 or x8) | Mutation rates (simple rates)
--nbLociA integer | Number of unlinked autosomal loci
--sizeA integer | Size, in number of sites, per autosomal locus
--nbLociX integer | Number of unlinked loci on the X chromosome
--sizeX integer | Size, in number of sites, per locus on the X chromosome
--sizeY integer | Size, in number of sites, on the Y chromosome
--sizeMt integer | Size, in number of sites, of the mitochondrial sequence
--burnin int | Set the number of generations for the burn-in phase with normal mutation rate
--burninH int | Set the number of generations for the burn-in phase with high mutation rate
--demog double double double | Set the demographic function
--demogN double double double int | Set the demographic function for population X
--fasta | Save each simulation in a fasta format file at the end of the run
--arl | Save each simulation in an Arlequin format file at the end of the run
--header | Output the header at the beginning of diversity files
--child2 | Force all couples to have 2 children (no variance in reproductive success)
--pooled | Pooled sample (X individuals from Y demes)
--scattered | Scattered sample (X individuals from all demes)
--local | Local sample (X individuals from 1 deme)
--blind | Random sample without knowing the structure
--nbsamplepop integer | Number of populations to sample for the pooled scheme and for inter-population diversity comparison

Table 1: Flags for command line call of SMARTPOP

4.1.2 Via input files

Instead of defining the parameters on the command line, you can use an input file. This is useful, for example, if you change many parameters from their default values.
You must respect the format of this file for it to work. It is an HTML file. Respect the order of flags, or it may not work. Each line describes the value of one parameter. All parameters that are not defined in this input file take the default values given in Appendix A (Default parameters).


By default, each run of SMARTPOP creates a parameter file called SMARTPOP_parameters.txt. It has exactly the same format as the input parameter file, so running ./smartpop -i SMARTPOP_parameters.txt will run the exact same set of simulations as your previous run. This is possible because the random seed is one of the input parameters.

4.1.3 Windows executable

The available Windows executable only handles input files. SMARTPOP will read the file SMARTPOP_parameters.txt, which must be in the same directory. For each new launch of SMARTPOP, you must set the parameters by modifying this file. If you want the seed to be picked randomly, remember to erase the line with the seed keyword from the parameter file.

4.2 Simulation parameters

4.2.1 Verbose

We recommend starting to use SMARTPOP with the verbose option on. SMARTPOP will then print relevant information on the command line while the simulations are running. This is a good way to check that the parameters are set correctly, and to follow the progress of the program.

4.2.2 Random seed

Simulations rely heavily on random processes, which are simulated using sequences of random numbers. Each time the program starts, it uses a random seed that uniquely determines this sequence. By default, the seed is chosen at random, but you can set it explicitly to repeat an earlier run.
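The role of the seed can be illustrated with a short Python sketch. It uses Python's own random module, not SMARTPOP's generator, and the function name draw_sequence is hypothetical. It only shows that a fixed seed reproduces the same sequence of random numbers.

```python
import random


def draw_sequence(seed, n=5):
    """Return n pseudo-random integers produced from the given seed."""
    rng = random.Random(seed)  # independent generator initialised with the seed
    return [rng.randrange(1000) for _ in range(n)]


# Same seed: identical sequence, so a run can be repeated exactly.
first = draw_sequence(13145105)
second = draw_sequence(13145105)
print(first == second)
```
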

4.2.3 Population size

The population size set in SMARTPOP corresponds to the census population size at the beginning of the simulation. If the population size is set to be non-constant, it will change through time.

4.2.4 Sample size

Instead of analyzing the entire population, you can sample a certain number of individuals. This situation would match a “real life” situation where you do not have access to the DNA of your whole population. If you define a sample size, all the outputs (diversity, but also fasta and arlequin files) will be generated on a random sample in your population of the defined size.
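As a rough illustration of what sampling means here, the following Python sketch draws a random sample from a population. It assumes sampling without replacement, which this manual does not state explicitly, and the name sample_individuals is hypothetical rather than part of SMARTPOP.

```python
import random


def sample_individuals(population, sample_size, seed=None):
    """Pick sample_size distinct individuals at random from the population."""
    if sample_size > len(population):
        raise ValueError("sample size cannot exceed population size")
    rng = random.Random(seed)
    # Assumption: individuals are drawn without replacement.
    return rng.sample(population, sample_size)


population = list(range(200))   # 200 individuals, identified by index
sample = sample_individuals(population, 100, seed=1)
print(len(sample), len(set(sample)))
```
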

4.2.5 Number of simulations

You can define the number of simulations that will be run with this set of parameters. If you are loading a set of simulations from a directory, this number must be smaller than or equal to the number of saved simulations.

4.2.6 Number of generations to run

During each simulation, the population evolves for a number of generations t between two sampling events. By default, there is a single sampling event at the end of those t generations, at which output files are produced and diversity is measured. If you define multiple sampling events (NSampling) through time, t is the number of generations run between two consecutive sampling events, so the total number of generations is:

$G_{\mathrm{Total}} = N_{\mathrm{Sampling}} \times t$

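The relation above can be checked with a small Python sketch. The function names are hypothetical, and the sketch assumes that sampling events are evenly spaced every t generations, as described in this section.

```python
def sampling_generations(t, n_sampling=1):
    """Generations at which samples are taken, assuming evenly spaced events."""
    return [t * k for k in range(1, n_sampling + 1)]


def total_generations(t, n_sampling=1):
    """Total number of generations simulated: G_total = N_sampling * t."""
    return n_sampling * t


print(total_generations(100, 10))        # 1000
print(sampling_generations(100, 3))      # [100, 200, 300]
```

For instance, -t 100 together with --nstep 10 corresponds to 1000 generations in total.
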
4.2.7 Mating system and number of offspring

The mating system is described by two parameters: polygamy and cousin alliance. You can set a maximum number of mates per individual, either for both sexes (flag polygamy), or for males (polygyny) or females (polyandry).
For the cousin alliance, you can set a preferential alliance with one or more types of cousin using the flag mat (X = 1 random, 2 MBD/FZS, 3 MZD/MZS, 4 FZD/MBS, FBD/FBS).

4.2.8 Sibling matings

For any mating system, you can forbid mating between full or half siblings using the flag inbreeding.

4.2.9 Demography parameters

The demography is defined by a Markovian function:
$\mathrm{popsize}(t+1) = a + b \times \mathrm{popsize}(t) + c \times \mathrm{popsize}(t)^2$
Based on three parameters a, b and c, this definition permits a large range of demographic scenarios. This table presents the simplest examples:

Scenario | a | b | c
A constant population size | 0 | 1 | 0
A constant growth of x individuals per generation | x | 1 | 0
A constant decrease of x individuals per generation | -x | 1 | 0
An exponential growth of rate x | 0 | x | 0
An exponential decrease of rate x | 0 | 1/x | 0

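The demographic function can be explored with a minimal Python sketch. The function names are hypothetical, and the sketch does not address how SMARTPOP rounds non-integer population sizes. It simply iterates the recurrence for a few of the scenarios listed above.

```python
def next_popsize(n, a, b, c):
    """One step of popsize(t+1) = a + b*popsize(t) + c*popsize(t)**2."""
    return a + b * n + c * n ** 2


def trajectory(n0, a, b, c, generations):
    """Population sizes from generation 0 to `generations` (inclusive)."""
    sizes = [n0]
    for _ in range(generations):
        sizes.append(next_popsize(sizes[-1], a, b, c))
    return sizes


print(trajectory(100, 0, 1, 0, 3))    # constant size
print(trajectory(100, 10, 1, 0, 3))   # +10 individuals per generation
print(trajectory(100, 0, 2, 0, 3))    # doubling each generation
```
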
4.2.10 Mutation rates

Each type of DNA (mitochondrial, X, Y and autosome) has two mutation rates: a transition rate and a transversion rate. These rates must be given in mutations/site/generation. Overall mutation rates can be defined instead; if you do not know the ratio of transitions to transversions, you can set the transition and transversion rates each to half the total mutation rate.
The rationale behind having eight mutation rates is that mutation rates measured for mtDNA, Y, X and autosomes can differ by more than an order of magnitude. Similarly, transition rates are usually an order of magnitude higher than transversion rates.
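The "half the total rate" rule can be sketched in Python as follows. The numeric values are illustrative only, the names are hypothetical, and the order in which the eight rates must be passed to SMARTPOP is not specified here, so the sketch only computes the rates per DNA type.

```python
def split_rate(total_rate):
    """Split an overall rate into equal transition and transversion rates."""
    half = total_rate / 2
    return {"transition": half, "transversion": half}


# Overall rates in mutations/site/generation (illustrative values only).
overall = {"mtDNA": 1e-4, "X": 1e-4, "Y": 1e-4, "autosome": 1e-4}
rates = {dna: split_rate(rate) for dna, rate in overall.items()}

print(len(rates) * 2)   # eight rates in total
```
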

4.2.11 Burn-in phase

By default, simulations start with the entire population having the same DNA. Alternatively, a burn-in phase lets you start with an accelerated evolutionary process that brings your population to a high diversity, from which your simulation then starts. In this process, a higher mutation rate is applied to the population, which quickly increases diversity. The accelerated evolution stops after a set number of generations. A second burn-in phase with a normal mutation rate can follow; set the flag burnin to the chosen number of generations.
When dealing with multiple populations, both burn-in phases simulate a single metapopulation with no division. The split between populations happens at the end of the burn-in, leaving all populations with similar diversity.

4.2.12 DNA sequences

Each individual has a set of sex-linked DNA sequences:

It is possible to choose the length of each locus by type. Note: all autosomal loci must share the same length; all X loci must share the same length. It is also possible to choose the number of loci on the X chromosome and on autosomes.
For computational reasons, the length of sequences must be a multiple of 32; otherwise it is automatically rounded up to the next multiple of 32. For instance, if you enter a length of 33, your simulated locus will have 64 sites. The size of any DNA type that is unnecessary for a study can be set to 0, making the simulations faster.
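The rounding rule can be written as a one-line computation. The following Python sketch (with a hypothetical function name) shows which length is actually simulated for a requested length.

```python
def effective_length(requested, word=32):
    """Round a sequence length up to the next multiple of `word` sites."""
    if requested < 0:
        raise ValueError("length must be non-negative")
    return -(-requested // word) * word   # ceiling division, then scale back


print(effective_length(33))     # 64
print(effective_length(3200))   # 3200
print(effective_length(0))      # 0
```
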

4.2.13 Choice of outputs

SMARTPOP can produce different outputs for each simulation:

SMARTPOP also produces output for the whole set of simulations:

If the output name is set to 'foo' in the parameters, then the different files produced would be foo.fasta, foo.arl, fooall.txt and foobetween.txt.

4.3 Outputs

4.3.1 Diversity tables

A file is created for the diversity of the given sample. If a name is provided and the file already exists, the results are appended to the end of the existing file.
This file contains different summary statistics for the population for each sample. Each line represents a sample at a specific time. For each sample and each DNA type, SMARTPOP writes to the file:

4.3.2 Fasta

You can output the population, or a sample of it, in a fasta format file. Such a file can easily be passed to other population genetics software, such as COMPUTE from the libSequence package [5].

4.3.3 Arlequin

You can output the population, or a sample of it, in a file of the format of ARLEQUIN software. This allows you to make inferences on your simulations using ARLEQUIN [2].

4.4 Running SMARTPOP in parallel mode

It is possible to run SMARTPOP in parallel. You must build a specific version of SMARTPOP using a C++ compiler with Message Passing Interface (MPI) support, such as mpic++.
When running in parallel, the simulations are distributed across several cores at the beginning of the run. Each process writes to different files to avoid conflicts.

4.6 Starting conditions: burn-in and pre-run

The starting conditions are a fundamental problem in forward-in-time simulations. How do you determine the complete genetic make-up of the population from which evolution starts? This is equivalent to asking what the DNA state of an ancient population was, which is unknown in most studies. In studies based on modern DNA, you have (at most) knowledge about past individuals who left descendants, but nothing about the other past members of the population.
The question remains what to start the simulation with. A fully random DNA set is in no way meaningful. Instead, the genetic patterns of real populations are the result of a long evolutionary process that creates haplotypes: individuals still share a large part of their DNA sequence.
The start must therefore be a sequence shared among individuals. From this null diversity, one must produce an artificial diversity that corresponds to a real population. There are three options:

A Default parameters

Flag + argument | Input file keyword + argument | Meaning | Parameter file example | Default value
--header | output/header integer (0 or 1) | Output the header at the beginning of diversity files | headerOutput 1 | 0
--fasta | output/fasta integer (0 or 1) | Save each simulation in a fasta format file at the end of the run | output/fasta 1 | 0
--nexus | output/nexus integer (0 or 1) | Save each simulation in a nexus format file at the end of the run | output/nexus 1 | 0
--arl | output/arlequin integer (0 or 1) | Save each simulation in an Arlequin format file at the end of the run | output/arlequin 1 | 0
-v | output/verbose int | Verbose | Verbose 1 | 0
-i myfile.txt | NA | Input file with parameters | SMARTPOP_parameters.txt | NULL
-o filename | output/filename filename | Name of the outputs | fileOutput smartpop_13145105_2 | smartpop_seed_matingSystem
-s integer | simulation/seed integer | Random seed | seed 13145105 | random
-t integer | simulation/step integer | Number of generations between two sampling events | step 100 | 100
--nstep integer | simulation/nstep integer | Number of sampling events through time for one simulation | nstep 10 | 1
--nsimu integer | simulation/nsimu integer | Number of simulations | nSimu 500 | 500
--burnin double | simulation/burninT double | Set the number of generations for the burn-in phase with normal mutation rate | burninT 0.5 | 0
--burninH int | simulation/burninH int | Set the number of generations for the burn-in phase with high mutation rate | burninType 1 | 1
--sample integer | simulation/sample integer | Sample size for estimation of diversity, and fasta/arlequin outputs | sample 100 | 100
--nbpopsample integer | simulation/nbpopsample integer | Number of populations to sample for estimation of diversity, and fasta/arlequin outputs | nbpopsample 3 | 1
--between | simulation/betweenDiv boolean | Output between-population diversity (X=0 No, X=1 Yes) | beetweenDiv false | false
-p integer | model/popsize integer | Total census population size | popsize 150 | 100
--nbpop integer | model/nbPop integer | Number of communities/populations/patches | nbPop 10 | 1
--nbSources integer | model/nbSources integer | Number of source populations for the APA type population scheme | nbSources 3 | 0
--pmate double | model/pmate integer | Relaxation parameter on the mating rule | pmate 0.5 | 0
--pmig double | model/pmig integer | Relaxation parameter on the migration rule | pmig 0.5 | 0
--mu double (x4 or x8) |  | Mutation rates (simple rates, respectively mtDNA, X, Y and autosome) | --mu 0.0001 0.0001 0.0001 0.0001 | 
--sizeA integer | model/dna/sizeA integer | Size, in number of sites, per autosomal locus | sizeA 3200 | 3200
--sizeX integer | model/dna/sizeX integer | Size, in number of sites, per locus on the X chromosome | sizeX 3200 | 3200
--sizeY integer | model/dna/sizeY integer | Size, in number of sites, of the Y chromosome | sizeY 3200 | 3200
--sizeMt integer | model/dna/sizeMt integer | Size, in number of sites, of the mitochondrial sequence | sizeMt 3200 | 3200
--nbLociX integer | model/dna/nbLociX integer | Number of unlinked loci on the X chromosome | nbLociX 1 | 1
--nbLociA integer | nbLociA integer | Number of unlinked autosomal loci | nbLociA 1 | 1
--popsizeN integer integer | model/population/popsize integer | Census population size for a given population | popsize 150 | 100
--inbreedingN integer integer | model/population/inbreeding integer | Rules of inbreeding for a given population | inbreeding 1 | 0
--child2 | model/population/varianceNbChildren integer | Set the variance in the number of children to 0 (all couples have 2 children) or 1 (Poisson law) | inbreeding 1 | 1
--polygynyN integer integer | model/population/polygyny integer | Set the maximum number of wives per man for a given population | polygyny 1 | 1
--polygyny integer | NA | Set the maximum number of wives per man for all populations |  | 1
--polyandryN integer integer | model/population/polyandry integer | Set the maximum number of husbands per woman for a given population | polyandry 1 | 1
--polyandry integer | NA | Set the maximum number of husbands per woman for all populations |  | 1
--matN integer integer | model/population/MS integer | Set the cousin alliance for a given population | MS 1 | 1
--mat integer | NA | Set the cousin alliance for all populations |  | 1
--demogN double double double integer | model/population/demog /a double /b double /c double | Set the demography function for a given population | a 0 b 1 c 0 | a 0 b 1 c 0
--demog double double double | NA | Set the demography function for all populations |  | 0 1 0

B References

[1] P Donnelly and S Tavaré. Coalescents and genealogical structure under neutrality. Annual Review of Genetics, 29(1):401–421, 1995.

[2] Laurent Excoffier and Heidi E L Lischer. Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3):564–567, May 2010.

[3] Richard R Hudson. Gene genealogies and the coalescent process. Oxford Surveys in Evolutionary Biology, 7(1):44, 1990.

[4] Masatoshi Nei. Estimation of average heterozygosity and genetic distance from a small number of individuals. Genetics, 89(3):583–590, 1978.

[5] K Thornton. LibSequence: A C++ class library for evolutionary genetic analysis. Bioinformatics, 19(17):2325–2327, 2003.

[6] John Wakeley. Coalescent Theory: An Introduction. Roberts & Company Publishers, first edition, 2009.
