SMARTPOP

Frequently Asked Questions

Why did you develop a new simulation program?

Is SMARTPOP free?

When should I use SMARTPOP?

Can I use SMARTPOP for non-human species?

Why does speed matter so much?

Why is SMARTPOP so fast?

How can I improve the speed of simulations?

What is the difference between forward-in-time and backward-in-time simulations?

How do you know that the software works correctly?

Why did you choose a two-parameter mutation model?

Which social processes are included in SMARTPOP?

How is the number of children per individual defined?

Something interesting happened in the last set of simulations. Can I reproduce it exactly?

What happens to the parameters that I do not set on the command line?

References

Why did you develop a new simulation program?

A good range of programs already exists to simulate population genetic data. (An exhaustive list of them can be found on the Genetic Simulation Resources website.)

None of the available programs lets you study the interaction between social rules and genetic patterns without reprogramming. Moreover, most of these programs lack sufficient speed for statistical inference (see "Why does speed matter so much?"). Most programs offer a compromise between speed and flexibility. The most flexible ones, such as simuPOP (Peng and Kimmel 2005), pay for that flexibility in speed, which makes them too slow to simulate data in an inferential setting.

Therefore, we developed SMARTPOP to enable the exploration of interactions between social rules and genetic patterns. This new, fast and efficient forward-in-time framework aims to strengthen the study of population genetics and social systems through simulation.

Is SMARTPOP free?

Yes. SMARTPOP is free and open source. It is distributed under the GPL 3.0 license. We only ask you to cite our paper if you use SMARTPOP or reuse its code in your own work: Guillot and Cox 2014, SMARTPOP: inferring the impact of social dynamics on genetic diversity through high speed simulations. BMC Bioinformatics 15:175.

When should I use SMARTPOP?

SMARTPOP covers a wide range of population genetic scenarios:

  • Inferences on the evolution of DNA in a population. With SMARTPOP, you can explicitly simulate the evolution of DNA sequences, particularly sex-linked DNA.
  • Inferences relating to changing population size through time. With SMARTPOP, you have fine control over the demography, which permits complex demographic scenarios.
  • Inferences linked to mating systems. SMARTPOP models mating at the level of individuals and permits constraints imposed by social rules.

You should use SMARTPOP only if you are working with a dioecious population (reproduction with two sexes).

SMARTPOP has been developed particularly to work on sex-linked and autosomal sequences. Each individual has:

  • One sequence for mitochondrial DNA
  • One sequence for the non-recombining portion of the Y chromosome
  • Sequences of unlinked loci on the X chromosome
  • Sequences of unlinked loci on the autosomes

SMARTPOP can be used as long as your study system fits within the implemented scheme. Any DNA type that is not relevant to your study can be ignored by setting its length to zero. For instance, you can use SMARTPOP to look only at mitochondrial DNA.

Can I use SMARTPOP for non-human species?

Absolutely! SMARTPOP was originally implemented by a team working on anthropological questions. However, its design permits the simulation of many other species, as long as they reproduce sexually (they must have males and females). See "When should I use SMARTPOP?" to check whether SMARTPOP is suitable for your study.

Why does speed matter so much?

Simulating the evolution of a population through time involves many random processes (e.g. mutation, reproduction). The result of a single simulation under a set of parameters is never sufficient in itself. Instead, a large number of simulations must be run under the same set of parameters to assess what range of outcomes can be expected. At the very least, 500 simulations are needed under the same set of parameters for such studies.

Parameter inference with simulation compares real datasets to simulated ones. Methods such as Approximate Bayesian Computation need to evaluate the outcome of simulations across a range of values for each parameter, which forces the number of required simulations to grow exponentially with the number of parameters.

Unlike backward-in-time methods, forward-in-time simulations also have a particular need for speed. The fundamental problem is that the initial state is unknown. Most studies (cf. Carvajal-Rodríguez 2010 for a good review on the subject) assume that the population is in an equilibrium state. The main reason for this premise is that it is the simplest approximation we can make about the genetic patterns within a population. To attain this state of equilibrium, you must run simulations for "long enough". "Long enough" is not very well defined, but its lower bound is the Time to the Most Recent Common Ancestor (TMRCA), the point at which the simulation has lost track of its initial state. The TMRCA grows linearly with the population size (Wakeley 2009). For studies using this equilibrium, each simulation will therefore have to start with a burn-in phase of thousands of generations, creating a demand for high speed in the simulations.

Why is SMARTPOP so fast?

SMARTPOP has been highly optimized for speed through efficient use of C++ programming features. Widely used, C++ is one of the fastest programming languages available and is continuously supported by a strong worldwide community. At the core of SMARTPOP's development is a library called Boost, which provides the highly optimized random number generation algorithm used in this simulator.

A major contributor to the program's speed is the representation of individual DNA bases. DNA is stored and manipulated as 64-bit integers. A nucleotide takes one of four values (A, C, T, G), so it can be coded in a base 4 system on 2 bits. A single 64-bit integer can therefore store 32 DNA bases. This allows all DNA manipulations, such as mutations, comparisons and copies, to be performed as bit-wise operations, which is much faster than handling them site by site as characters.
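
The 2-bit packing described above can be sketched as follows. This is a minimal illustration, not SMARTPOP's actual code: the mapping A=0, C=1, G=2, T=3 and the function names encode_base, pack and count_differences are assumptions made for the example.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical 2-bit code (assumption): A=0, C=1, G=2, T=3.
inline std::uint64_t encode_base(char b) {
    switch (b) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
    }
    assert(false && "unexpected base");
    return 0;
}

// Pack up to 32 bases into one 64-bit word; base i occupies bits 2i and 2i+1.
inline std::uint64_t pack(const std::string& seq) {
    assert(seq.size() <= 32);
    std::uint64_t word = 0;
    for (std::size_t i = 0; i < seq.size(); ++i)
        word |= encode_base(seq[i]) << (2 * i);
    return word;
}

// Count the sites at which two packed words differ, without unpacking them.
inline int count_differences(std::uint64_t a, std::uint64_t b) {
    const std::uint64_t x = a ^ b;                        // non-zero pair = differing site
    const std::uint64_t low_mask = 0x5555555555555555ULL; // low bit of every 2-bit pair
    const std::uint64_t diff = (x | (x >> 1)) & low_mask; // one set bit per differing site
    return static_cast<int>(std::bitset<64>(diff).count());
}
```

Comparing two 32-base blocks then costs one XOR, one shift, one OR, one AND and a bit count, instead of 32 character comparisons.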

Speed is also achieved by computing diversity indices within SMARTPOP itself. Simulators usually produce files of simulated DNA that must be piped through other software for further analysis. Transferring data from one program to another can be quite time consuming, especially if a change of format is required. From the user's point of view, it also demands the installation and handling of two different programs. To avoid this trouble, SMARTPOP computes the main diversity estimators. It remains possible to output files compatible with other software (Arlequin and FASTA formats) for other uses.

How can I improve the speed of simulations?

There are ways to ensure that your simulations are running as fast as they can.

  • Only output what you need.
  • Sample your population to measure diversity.
  • Do not use the verbose option.
  • Use a buffering phase.

The most CPU-intensive task within SMARTPOP is computing the diversity files, which are generated by default. These should therefore be created only if needed. For instance, if only the mitochondrial DNA and Y chromosome are being investigated, the options -mtdiv -ydiv should be used.

Using a small random sample also greatly improves speed. Some estimators, such as the mean pairwise difference, have a runtime proportional to the square of the sample size.
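
To see where the quadratic cost comes from, here is a minimal sketch of the mean pairwise difference computed over plain character sequences. The function names hamming and mean_pairwise_difference are hypothetical, and this is not how SMARTPOP implements the estimator; it only shows that every pair in the sample must be visited.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Number of sites at which two equal-length sequences differ.
inline std::size_t hamming(const std::string& a, const std::string& b) {
    assert(a.size() == b.size());
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (a[i] != b[i]) ++d;
    return d;
}

// Mean number of differences over all n(n-1)/2 pairs of sampled sequences.
inline double mean_pairwise_difference(const std::vector<std::string>& sample) {
    const std::size_t n = sample.size();
    if (n < 2) return 0.0;
    std::size_t total = 0;
    for (std::size_t i = 0; i + 1 < n; ++i)       // the nested loops visit every pair once,
        for (std::size_t j = i + 1; j < n; ++j)   // hence the quadratic runtime
            total += hamming(sample[i], sample[j]);
    return static_cast<double>(total) / (static_cast<double>(n) * (n - 1) / 2.0);
}
```

Halving the sample size roughly divides the number of pairs, and so the work, by four.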

The verbose option is quite useful to check what the program is doing. However, it slows down the program by sending extra output to the terminal.

Finally, careful use of a buffering phase that models accelerated evolution lets you reach a given level of diversity faster.

What is the difference between forward-in-time and backward-in-time simulations?

Evolution is a process embedded in time. When simulating evolution, one can look at time in two ways. It is possible to start from time zero and advance forward-in-time, generation by generation. This is a Markovian process: each new generation is a function of the previous generation only. The other approach is to start the simulation from the last generation (i.e. modern DNA) and reconstruct the history of evolution looking backward-in-time, typically by reconstructing a genealogy.

Theories of both approaches have been developed. Coalescent theory (Kingman 1981) is a backward-in-time method. It is particularly useful for real data, when one tries to reconstruct the past of a population or species from a modern sample.

Backward-in-time simulations are faster, as they only need to reconstruct the ancestry of the modern sample (i.e. parents, grandparents, great-grandparents, etc.). Conversely, a forward-in-time simulation needs to compute the history of all individuals, without knowing in advance which individuals are going to be relevant.

Despite being praised for their speed, coalescent simulators are unfortunately very limited by their assumptions. They often assume a Wright-Fisher or Cannings population model. Their flexibility is highly limited and, for instance, they cannot integrate social processes such as mating systems.

How do you know that the software works correctly?

Validation is important to assess the correctness of a simulator. As SMARTPOP simulates a complex system including many random processes, there is no straightforward way to measure the validity of the simulations. To address this problem, we developed a multi-pronged validation approach, described in detail in Guillot and Cox (Forthcoming).

Why did you choose a two-parameter mutation model?

SMARTPOP defines eight mutation rates with two per DNA type (mtDNA, X, Y chromosome and autosome): a transition rate and a transversion rate.

The rationale behind this two-parameter model is that mutation rates measured on mtDNA, the Y chromosome, the X chromosome and the autosomes can differ by more than an order of magnitude. Similarly, transition rates are usually an order of magnitude higher than transversion rates.
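
A two-rate mutation step for one DNA type can be sketched as follows. This is an illustration under stated assumptions, not SMARTPOP's code: it assumes the 2-bit code A=0, C=1, G=2, T=3, under which a transition (A<->G or C<->T) flips the high bit and a transversion flips the low bit, with or without the high bit. It uses std::mt19937 from the standard library for self-containment, and the names transition, transversion and mutate are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>

// Assumed 2-bit code: A=0, C=1, G=2, T=3 (purines A, G; pyrimidines C, T).
const std::string kBases = "ACGT";

// Transition: purine <-> purine or pyrimidine <-> pyrimidine (flip the high bit).
inline unsigned transition(unsigned code) { return code ^ 2u; }

// Transversion: purine <-> pyrimidine; the flag selects one of the two possible targets.
inline unsigned transversion(unsigned code, bool also_flip_high) {
    return code ^ (also_flip_high ? 3u : 1u);
}

// Apply one generation of mutation to a sequence, with per-site rates
// for transitions and transversions.
inline std::string mutate(std::string seq, double transition_rate,
                          double transversion_rate, std::mt19937& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    std::bernoulli_distribution coin(0.5);
    for (char& b : seq) {
        const std::size_t pos = kBases.find(b);
        assert(pos != std::string::npos);
        const unsigned code = static_cast<unsigned>(pos);
        const double u = unif(rng);
        if (u < transition_rate)
            b = kBases[transition(code)];
        else if (u < transition_rate + transversion_rate)
            b = kBases[transversion(code, coin(rng))];
    }
    return seq;
}
```

With eight rates in total, a function like this would simply be called once per DNA type with that type's own pair of rates.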

Which social processes are included in SMARTPOP?

To study interactions between genetics and social processes, SMARTPOP allows users to control the mating system as well as the demography.

The user can currently choose between four mating systems: monogamy, polygamy, polygyny and polyandry. For each system, the user can add the option of half or full sibling avoidance.

How is the number of children per individual defined?

To introduce variability in reproductive success, the number of children per individual follows a Poisson distribution. In the case of monogamy, polygamy and polygyny, the Poisson distribution defines the number of children per female, while in the case of polyandry and monogamy (by symmetry) it defines the number of children per male. The mean of this Poisson distribution is unknown, but the distribution is conditional on the population size, which SMARTPOP controls via a demographic function. This allows the model to be parameter-free regarding the number of children per household while still following a Poisson distribution.
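
One standard way to draw Poisson family sizes that add up to a fixed total relies on a known property: independent Poisson counts with a common mean, conditioned on their sum, follow a multinomial distribution with equal probabilities, whatever that mean is. The sketch below uses this property by assigning each child to a uniformly chosen parent. It is an illustration only; whether SMARTPOP proceeds this way is an assumption, and the name draw_offspring_counts is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Draw the number of children of each parent so that the counts sum exactly
// to n_children (the next generation size set by the demographic function).
inline std::vector<int> draw_offspring_counts(std::size_t n_parents,
                                              std::size_t n_children,
                                              std::mt19937& rng) {
    assert(n_parents > 0);
    std::vector<int> counts(n_parents, 0);
    std::uniform_int_distribution<std::size_t> pick(0, n_parents - 1);
    for (std::size_t c = 0; c < n_children; ++c)
        ++counts[pick(rng)];   // each child is assigned to a uniformly chosen parent
    return counts;
}
```

No Poisson mean appears anywhere in the code, which is why such a scheme needs no extra parameter.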

Something interesting happened in the last set of simulations. Can I reproduce it exactly?

Yes, you can! All simulations rely on a pseudo-random sequence of numbers, which is created from a seed. By using the same seed again, you produce the exact same sequence of numbers (i.e. the same set of simulations). If you did not input a specific seed number, you can find it in the SMARTPOP_parameter.txt file, which is automatically written at the start of each run. It includes all SMARTPOP parameters, including the seed, for the last simulation set launched.
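
The principle can be illustrated with the standard library's Mersenne Twister. This is a generic sketch, not SMARTPOP's code (which relies on Boost for random numbers); the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Return the first n numbers produced by a generator initialised with `seed`.
inline std::vector<std::uint32_t> draw_sequence(std::uint32_t seed, std::size_t n) {
    std::mt19937 rng(seed);
    std::vector<std::uint32_t> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i)
        out.push_back(static_cast<std::uint32_t>(rng()));
    return out;
}

// Usage: re-using a seed reproduces the sequence exactly.
inline void check_same_seed_reproduces() {
    assert(draw_sequence(12345, 5) == draw_sequence(12345, 5));
}
```

Since every random decision in a simulation is derived from this sequence, replaying the seed with the same parameters replays the whole simulation set.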

What happens to the parameters that I do not set on the command line?

Many parameters can be set in SMARTPOP, but you may choose to set only a few of them. The parameters that you do not specify are given default values, which can be found in the manual. You can also check the values taken by all parameters in the last simulation set launched by looking at the automatically created parameter file: SMARTPOP_parameters.txt.

References

  • Cannings, C. (1974). The latent roots of certain Markov chains arising in genetics: A new approach, I. Haploid models. Advances in Applied Probability, 6(2), 260–290.
  • Chadeau-Hyam, M., Hoggart, C. J., O’Reilly, P. F., Whittaker, J. C., De Iorio, M., and Balding, D. J. (2008). Fregene: Simulation of realistic sequence-level data in populations and ascertained samples. BMC Bioinformatics, 9(1), 364.
  • Excoffier, L. and Lischer, H. E. L. (2010). Arlequin suite ver 3.5: A new series of programs to perform population genetics analyses under Linux and Windows. Molecular Ecology Resources, 10(3), 564–567.
  • Guillaume, F. and Rougemont, J. (2006). Nemo: An evolutionary and population genetics programming framework. Bioinformatics, 22(20), 2556–2557.
  • Messer, P. W. (2013). SLiM: Simulating evolution with selection and linkage. Genetics, 194(4), 1037–1039.
  • Neuenschwander, S., Hospital, F., Guillaume, F., and Goudet, J. (2008). quantiNemo: An individual-based program to simulate quantitative traits with explicit genetic architecture in a dynamic metapopulation. Bioinformatics, 24(13), 1552–1553.
  • Peng, B. and Kimmel, M. (2005). simuPOP: A forward-time population genetics simulation environment. Bioinformatics, 21(18), 3686–3687.
  • Carvajal-Rodríguez, A. (2008). GenomePop: A program to simulate genomes in populations. BMC Bioinformatics, 9(1), 223.
  • Carvajal-Rodríguez, A. (2010). Simulation of genes and genomes forward in time. Current Genomics, 11(1), 58–61.
  • Wakeley, J. (2009). Coalescent Theory: An Introduction. Roberts & Company Publishers, first edition.
