---
title: R/simcross User Guide
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{R/simcross User Guide}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8](inputenc)
---
```{r knitr_options, include=FALSE}
library(knitr)
opts_chunk$set(fig.width=7, fig.height=4.5,
dev.args=list(pointsize=16))
```
[R/simcross](https://kbroman.org/simcross/) is an
[R](https://www.r-project.org) package to simulate genotypes from an
experimental cross. The aim is flexibility rather than speed.
Meiosis is simulated following the Stahl model (Copenhaver et
al. 2002) with the interference parameter being an integer. In this
model, chiasma locations are the superposition of two processes: a
proportion _p_ come from a process exhibiting no interference (that
is, a Poisson process) and the remainder (proportion 1 – _p_)
come from a process following the chi-square model (Foss et al. 1993;
Zhao et al. 1995). Thus, with _p_=0, the model reduces to the
chi-square model. The chi-square model has a single parameter, _m_,
which is a non-negative integer and controls the strength of
interference. _m_=0 corresponds to no interference.
[Broman et al. (2002)](https://pubmed.ncbi.nlm.nih.gov/11901128/) estimated _m_=10 for the level of interference in the mouse
(derived assuming _p_=0). The chi-square model is a special case of
the gamma model (McPeek and Speed 1995) which has a positive parameter
_ν_; the chi-square model corresponds to the case that _m_ = _ν_
– 1 is an integer.
In many organisms, there is always at least one chiasma for each pair
of homologous chromosomes at meiosis. Simulations with
[R/simcross](https://kbroman.org/simcross/) may be performed
with this assumption; the same model is used, but with rejection
sampling to get results conditional on there being at least one
chiasma. Note that this isn't an assumption of an obligate
_crossover_; if there is exactly one chiasma, then a random meiotic
product will have 0 or 1 crossovers, with probability 1/2 each. Also
note that, with the obligate chiasma assumption, chromosomes must be
greater than 50 cM.
## Basic functions
There are two basic functions for simulating a cross:
`create_parent` and `cross`.
`create_parent` is for generating a parent object, either an
inbred individual or the F1 offspring of two inbred
individuals. It takes two arguments: the length of the chromosome in
cM, and the allele or pair of alleles.
Here's how to generate two inbred individuals, one from each of strain
1 and strain 2, and an F1 individual, with a 100 cM
chromosome.
```{r set.seed, include=FALSE}
set.seed(80607574)
```
```{r create_parent}
library(simcross)
p1 <- create_parent(L=100, allele=1)
p2 <- create_parent(L=100, allele=2)
f1 <- create_parent(L=100, allele=1:2)
```
You don't need to specify the arguments by name (e.g., in the last
line above, you could just write `create_parent(100, 1:2)`); I'm doing so
here just to better document the names of the arguments.
The `cross` function is used to generate one offspring from the cross
between two individuals. The input is a pair of individuals (e.g., as
produced by `create_parent`), the interference parameter `m` (default
`m=10`), the parameter `p` for the Stahl model (default `p=0`), `xchr`
to indicate whether the chromosome being simulated is the X chromosome
(`FALSE`, the default, to simulate an autosome), and `male` to
indicate whether the offspring is to be male (which only matters if
`xchr=TRUE`). Further, one may use `obligate_chiasma=TRUE` (default
`FALSE`) to require at least one chiasma on the 4-strand bundle).
Here's an example, to generate an F2 individual. (The
outer parentheses cause the result to be printed.)
```{r sim_f2}
(f2 <- cross(f1, f1))
```
The output is a list with two components: the maternal and paternal
chromosomes. Each chromosome is a list with the allele in a set
of intervals, and the locations of the right endpoints of the
intervals.
In the example above, the maternal chromosome is a non-recombinant `2`
chromosome. The paternal chromosome has two crossovers, with the
the `1` allele up to `r round(f2$pat$locations[1], 1)` cM, the `2` allele
in the interval `r round(f2$pat$locations[1], 1)` –
`r round(f2$pat$locations[2], 1)`, and the `1` allele
for the remainder of the chromosome.
By default, we use `m=10`, `p=0`, and `obligate_chiasma=FALSE`. The
chromosome length is taken from the input parent objects (`f1` in this case).
If we wanted to do the simulation with no crossover interference, but
with an obligate chiasma, we'd use:
```{r sim_f2_nointerference}
f2 <- cross(f1, f1, m=0, obligate_chiasma=TRUE)
```
Behind the scenes, there are two additional
functions, `sim_crossovers`, for simulating crossover locations on a
chromosome, and `sim_meiosis`, for simulating a meiotic product from
an individual, but in general the user need not bother with these.
The `cross` function calls `sim_meiosis` twice
(once for each parent) and then combines the results into a single
individual. `sim_meiosis` calls `sim_crossovers` to generate the
meiotic product.
## Generating pedigrees
While one could simulate any experimental cross from a series of calls
to `create_parent` and `cross`, it is generally more efficient to first develop
a table that describes pedigree for the cross, and then simulate from
the pedigree with the function `sim_from_pedigree` (see the next
section).
### Example AIL pedigree matrix
To define a pedigree, we use a numeric matrix (or data frame) with four columns:
individual ID, mom, dad, and sex (coded as 0=female, 1=male). The
[R/simcross package](https://kbroman.org/simcross/) includes a
sample pedigree for advanced intercross lines, `AILped`, taken from
the [QTLRel package](https://cran.r-project.org/package=QTLRel).
Here is the top of that dataset:
```{r AILped_head}
head(AILped, n=10)
```
### Checking a pedigree matrix
The function `check_pedigree` can be used to check that a pedigree
matrix conforms to R/simcross's requirements: founders have
`mom == dad == 0`, all other individuals have both parents
present in the pedigree, and parents always precede any of their
children. The `check_pedigree` function returns `TRUE` if the pedigree
matrix is okay; otherwise, it throws an error.
```{r check_AILped}
check_pedigree(AILped)
```
R/simcross includes a set of functions for generating pedigree
matrices for different cross designs: `sim_ril_pedigree`,
`sim_4way_pedigree`, `sim_ail_pedigree`, and `sim_do_pedigree`.
### RIL pedigree
`sim_ril_pedigree` generates a pedigree matrix for a single
recombinant inbred line derived from 2^k^ founder
lines for some k>0. (Examples include the Collaborative Cross (Threadgill and
Churchill 2012) and MAGIC lines (Kover et al. 2009).) The arguments
are `ngen` (number of generations of inbreeding), `selfing` (default
`FALSE`, for sibling mating), `parents` (a vector of integers for the
parents; the length must be a power of 2 (i.e., 2, 4, 8, 16, etc.) and corresponds to the number
of founder lines), and `firstind`, the ID number to attach to the
first individual following the parents (so that the pedigree matrices
for multiple RIL may be `rbind`-ed together). Note that there's no real
_simulation_ here; the result is entirely deterministic.)
```{r sim_ril_ped}
(ril <- sim_ril_pedigree(ngen=4, parents=1:4))
```
The `gen` column in the output is the generation number, with 0
corresponding to the founders. The generations are simply sequential
and so don't correspond to the numbering scheme used for the
Collaborative Cross (see, for example, [Broman 2012](https://pubmed.ncbi.nlm.nih.gov/22345609/)).
### 4-way cross pedigree
`sim_4way_pedigree` generates a pedigree matrix for an intercross
among four inbred lines. The arguments are `ngen` (which must be 1 or
2) and `nsibs`. We start with four inbred individuals, and cross them
in two pairs to generate a pair of heterozygous individuals. If
`ngen==1`, we then just generate a set of `sum(nsibs)` F1
offspring. If `ngen==2`, we generate `length(nsibs)` pairs of
F1s and intercross them to generate a set of F2
sibships; in this case, the input vector `nsibs` determines the sizes
of the sibships.
The following generates two F2 sibships with 3
offspring in each.
```{r sim_4way_ped}
(fourway <- sim_4way_pedigree(ngen=2, nsibs=c(3, 3)))
```
### AIL pedigree
Advanced intercross lines (AIL) are generated by crossing two inbred
lines to form the F1 hybrid, intercrossing to form the
F2 generation, and then performing repeated intercrosses
with some large set of breeding pairs. At each generation, the mating
pairs are chosen at random, often with an effort to avoid matings
between sibling pairs. I would prefer the term “advanced
intercross _populations_,” as it's a set of heterozygous,
genetically distinct individuals; they aren't really _lines_.
`sim_ail_pedigree` generates a pedigree matrix for 2-way advanced
intercross lines. Unlike `sim_ril_pedigree` and `sim_4way_pedigree`,
this is actually a simulation, as the mating pairs at each generation
are chosen at random. The arguments are `ngen` (number of
generations), `npairs` (number of mating pairs at each generation),
`nkids_per` (for number of kids per sibship in the last generation),
and `design` (`"nosib"` to avoid matings between siblings, or
`"random"` to form the matings completely at random). At each
generation, each mating pair gives two offspring (one male and one
female) for the next generation. At the last generation, there are
`nkids_per` offspring per mating pair, to give a total of
`npairs*nkids_per` at the last generation.
Here's an example of the use of `sim_ail_pedigree`:
```{r sim_ail_ped}
ailped <- sim_ail_pedigree(ngen=12, npairs=100, nkids_per=5)
nrow(ailped)
table(ailped$gen)
```
With 100 breeding pairs and 5 kids per pair in last generation, we
have 200 individuals for most generations and 500 at the end.
### Diversity Outbred pedigree
The Diversity Outbred population (DO; Svenson et al. 2013) is like AIL, but
starting with eight inbred lines rather than two. Actually, the mouse
DO started with partially-inbred individuals from Collaborative Cross
lines (the so-called preCC; intermediate generations in the
development of eight-way RIL). Heterogeneous stock (HS; Mott and Flint
2002) can be viewed as a special case, but starting with the eight
founder lines.
`sim_do_pedigree` generates a pedigree matrix for a DO population.
The arguments are just like those of `sim_ail_pedigree`, but with one
addition: `ccgen`, which is a vector with the numbers of generations
of inbreeding to form the initial preCC lines that are then used to
initiate the DO. (The default for `ccgen` is taken from Figure 1 of
Svenson et al. 2013.) Use `ccgen=0` to simulate a pedigree for HS.
The length `ccgen` should be `npairs`, the number of breeding pairs;
we take two individuals (one female and one male) from each preCC line
to begin the outcrossing generations.
```{r sim_do_ped}
doped <- sim_do_pedigree(ngen=12)
nrow(doped)
table(do=doped$do, gen=doped$gen)
```
The output contains the usual `id`, `mom`, `dad`, and `sex`, plus
`gen` (for generation number) and `do` (1 indicates part of the DO
population, 0 indicates part of the earlier generations). As you can
see from the table above, the generation (`gen`) has a different
meaning, according to whether `do` is 0 or 1. When `do==0`, it is the
number of generations following the initial founder lines; when
`do==1`, it is the generation number of the outbreeding DO population.
Also note that we start with _sixteen_ lines rather than 8: to
properly handle the X chromosome, we need to consider a male and
female from each of the eight founder lines. These are numbered 1–8
for the females, and 9‐16 for the males. Also note that the preCC
lines are formed from a cross among the eight founders, with the order
of the crosses chosen at random (four females and four males, one from
each of the eight founder lines).
## Simulating from a pedigree
The point of the construction of a pedigree matrix for a cross design,
as in the previous section, was in order to simulate genotype data for
the cross design. The R/simcross function for this is
`sim_from_pedigree`. Its arguments are `pedigree` (the pedigree
matrix), `L` (length of chromosome), `xchr` (`TRUE` or `FALSE`, according to
whether to simulate the X chromosome or an autosome; default `FALSE`),
and then the parameters governing the crossing over process: `m`, `p`,
and `obligate_chiasma`, with defaults `m=10`, `p=0`, and
`obligate_chiasma=FALSE`.
(`L` can also be a vector of chromosome lengths, for simulating
multiple chromosomes at once. In this case `xchr` should be either a
logical vector, of the same length as `L`, or a character string with
the name of the chromosome in `L` that correspond to the X
chromosome. An example of simulating multiple chromosomes is in the
final section in this vignette.)
The `sim_from_pedigree` function calls `create_parent` for all
founding individuals and `cross` for any offspring; the output is a
list with each component corresponding to one individual and having
the form output by `cross`: a list with `mat` and `pat`, for the
maternal and paternal chromosomes, respectively, each of which has
`alleles` (allele present in each interval) and `locations` (location
of right endpoint of each interval).
Here's an example, for simulating an AIL to generation 8.
```{r sim_ail_fully}
ailped <- sim_ail_pedigree(ngen=8, npairs=30, nkids_per=5)
xodat <- sim_from_pedigree(ailped, L=100)
```
Here's the result for the last individual:
```{r last_ail_individual}
xodat[[length(xodat)]]
```
We can plot the average number of breakpoints across the
individuals' two chromosomes, by generation, as follows.
```{r plot_ave_breakpoints}
n_breakpoints <- sapply(xodat, function(a) sum(sapply(a, function(b) length(b$alleles)-1)))
ave_breakpoints <- tapply(n_breakpoints, ailped$gen, mean)
gen <- as.numeric(names(ave_breakpoints))
plot(gen, ave_breakpoints,
xlab="Generation", ylab="Average no. breakpoints", las=1,
pch=21, bg="Orchid", main="AIL with 30 breeding pairs")
```
The function `where_het` will show the regions where an individual is
heterozygous.
```{r where_het}
where_het(xodat[[length(xodat)]])
```
We can plot the average proportion of the chromosome that is
heterozygous, by generation, as follows.
```{r plot_prop_het}
prop_het <- sapply(lapply(xodat, where_het), function(a) sum(a[,2]-a[,1])/100)
ave_prop_het <- tapply(prop_het, ailped$gen, mean)
plot(gen, ave_prop_het,
xlab="Generation", ylab="Average proportion heterozygous", las=1,
pch=21, bg="Orchid", main="AIL with 30 breeding pairs")
abline(h=0.5, lty=2)
```
Note: if we'd greatly restricted the number of breeding pairs per
generation, we'd see evidence of inbreeding, as a reduced proportion
of heterozygosity. There's considerably more noise, though, since
we've got just 6 individuals per generation.
```{r reset_seed, include=FALSE}
set.seed(28998542)
```
```{r ail_few_pairs}
ailped2 <- sim_ail_pedigree(ngen=8, npairs=3, nkids_per=50)
xodat2 <- sim_from_pedigree(ailped2, L=100)
prop_het2 <- sapply(lapply(xodat2, where_het), function(a) sum(a[,2]-a[,1])/100)
ave_prop_het2 <- tapply(prop_het2, ailped2$gen, mean)
gen2 <- as.numeric(names(ave_prop_het2))
plot(gen2, ave_prop_het2,
xlab="Generation", ylab="Average proportion heterozygous", las=1,
pch=21, bg="Orchid", main="AIL with 3 breeding pairs")
abline(h=0.5, lty=2)
```
## Generating marker data
R/simcross simulates the locations of crossovers as continuous values
in the interval (0,L). This is precise and compact, and it allows detailed study
of the breakpoint process, but it can be cumbersome to work with and
is often not what you want from simulations. In most cases, one is
interested in individuals' genotypes at a set of markers.
There are two functions for getting marker genotypes on the basis of
the the detailed crossover location data: `get_geno` and
`convert2geno`.
The function `get_geno` will grab the genotype at a specified location
on the chromosome, returning a matrix with two columns: the maternal
and paternal alleles for each individual. Continuing with the
simulation in the previous section (an AIL with 30 breeding pairs),
here's how to grab the genotype at 30 cM:
```{r get_geno}
g30 <- get_geno(xodat, 30)
tail(g30)
```
The `get_geno` function could be useful, for example, for grabbing QTL
genotypes for use in constructing a phenotype.
The `convert2geno` function takes the detailed crossover information
plus a vector of marker locations and returns a matrix with marker
genotypes.
First, construct a vector with the marker locations.
```{r construct_map}
map <- seq(0, 100, by=10)
names(map) <- paste0("m", map)
map
```
Then, pass the crossover information and map to `convert2geno`. I
print the data for the last five individuals.
```{r convert2geno}
geno <- convert2geno(xodat, map)
geno[nrow(geno)-(4:0), ]
```
Here the genotypes get recoded as `1` / `2` / `3` for `11` / `12` / `22`.
That is, genotypes `1` and `3` are the homozygotes for the allele from
founders 1 and 2, respectively, and genotype `2` is the heterozygote.
For crosses with more than two founders, the output of `convert2geno`
is a three-dimensional array, individuals × markers ×
alleles (maternal and paternal). For example, here are genotypes
for the sixth generations of inbreeding of an eight-way RIL.
```{r convert2geno_8wayril}
rilped <- sim_ril_pedigree(ngen=6, parents=1:8)
dat <- sim_from_pedigree(rilped, L=100)
geno <- convert2geno(dat, map)
geno[nrow(geno)-(1:0),,]
```
More commonly, one may be interested in individuals' SNP
genotypes. This can also be obtained with `convert2geno`; one just
needs to provide a matrix of SNP alleles for the founder lines, with
the argument `founder_geno`. This should be a matrix of `1`s and `2`s,
of dimension `n_founders` × `n_markers`.
Let's simulate SNP alleles for eight founder lines.
```{r sim_founder_alleles}
fg <- matrix(sample(1:2, 8*length(map), replace=TRUE), nrow=8)
```
And then here are the SNP genotypes for the sixth generation of
inbreeding an eight-way RIL.
```{r convert2geno_8wayril_snps}
snpgeno <- convert2geno(dat, map, fg)
snpgeno[nrow(snpgeno)-(1:0),]
```
The genotypes are again coded as `1` / `2` / `3` for `11` / `12` /
`22`.
## Simulating multiple chromosomes
One can use `sim_from_pedigree` and `convert2geno` to simulate
multiple chromosomes at once.
To simulate multiple chromosomes with `sim_from_pedigree`,
provide a vector of chromosome lengths. In this case `xchr` should be either a
logical vector, of the same length as `L`, or a character string with
the name of the chromosome in `L` that correspond to the X
chromosome.
```{r ail_mult_chr}
ailped3 <- sim_ail_pedigree(ngen=12, npairs=30, nkids_per=3)
xodat3 <- sim_from_pedigree(ailped3, c("1"=100, "2"=75, "X"=100), "X")
xodat3alt <- sim_from_pedigree(ailped3, c("1"=100, "2"=75, "X"=100),
c(FALSE, FALSE, TRUE))
```
To indicate that all chromosomes are autosomes, you can use `xchr=""`,
`xchr=FALSE`, or `xchr=NULL`. For example:
```{r ail_mult_chr_no_X}
xodat3_noX <- sim_from_pedigree(ailped3, c("1"=100, "2"=75, "3"=100), "")
```
Having simulated multiple chromosomes with `sim_from_pedigree`, you
may wish to use `convert2geno` to convert those results to marker
genotypes. This is done by providing the output of `sim_from_pedigree`
as well as a genetic marker map that is a _list_ of vectors of marker
locations.
Let's first construct the marker map, assuming three chromosomes with
lengths 100, 75, and 100.
```{r construct_marker_map}
map <- list("1"=seq(0, 100, by=5),
"2"=seq(0, 75, by=5),
"X"=seq(0, 100, by=5))
for(i in seq(along=map))
names(map[[i]]) <- paste0("m", names(map)[i], "_", map[[i]])
```
We then use `convert2geno`, ensuring that the inputs are both lists
with the same length.
```{r convert2geno_multchr}
geno <- convert2geno(xodat3, map)
```
For crosses like the DO, in which founder genotypes are needed, the
input `founder_geno` must be a list of matrices (one matrix per
chromosome).
Here are some simulated founder genotypes:
```{r do_founder_geno_mult_chr}
fg <- vector("list", length(map))
for(i in seq(along=map))
fg[[i]] <- matrix(sample(1:2, 8*length(map[[i]]), replace=TRUE), nrow=8)
```
And now here is the simulation of a small DO population for multiple
chromosomes. When simulating the pedigree, we need to have 16 founders
(the 8 female founders and then the 8 male founders). After simulating
the genotypes with `sim_from_pedigree()`, we use
`collapse_do_alleles()` to collapse the alleles 9-16 (for the male
founders) into 1-8.
```{r sim_do_mult_chr}
doped <- sim_do_pedigree(ngen=4, nkids_per=1)
xodat <- sim_from_pedigree(doped, L=c("1"=100, "2"=75, "X"=100), "X")
xodat <- collapse_do_alleles(xodat)
geno <- convert2geno(xodat, map, fg)
```
## References
Broman KW (2012)
[Genotype probabilities at intermediate generations in the construction of recombinant inbred lines](https://doi.org/10.1534/genetics.111.132647).
Genetics 190:403-412
[doi: 10.1534/genetics.111.132647](https://doi.org/10.1534/genetics.111.132647)
Broman KW, Rowe LB, Churchill GA, Paigen K (2002)
[Crossover interference in the mouse](https://doi.org/10.1093/genetics/160.3.1123).
Genetics 160:1123-1131
[doi: 10.1093/genetics/160.3.1123](https://doi.org/10.1093/genetics/160.3.1123)
Copenhaver GP, Housworth EA, Stahl FW (2002)
[Crossover interference in arabidopsis](https://doi.org/10.1093/genetics/160.4.1631).
Genetics 160:1631-1639
[doi: 10.1093/genetics/160.4.1631](https://doi.org/10.1093/genetics/160.4.1631)
Foss E, Lande R, Stahl FW, Steinberg CM (1993)
[Chiasma interference as a function of genetic distance](https://doi.org/10.1093/genetics/92.2.573).
Genetics 133:681-691
[doi: 10.1093/genetics/92.2.573](https://doi.org/10.1093/genetics/92.2.573)
Kover PX, Valdar W, Trakalo J, Scarcelli N, Ehrenreich IM, Purugganan
MD, Durrant C, Mott R (2009)
[A Multiparent Advanced Generation Inter-Cross to fine-map quantitative traits in _Arabidopsis thaliana_](https://doi.org/10.1371/journal.pgen.1000551).
PLoS Genetics 5:e1000551
[doi: 10.1371/journal.pgen.1000551](https://doi.org/10.1371/journal.pgen.1000551)
McPeek MS, Speed TP (1995)
[Modeling interference in genetic recombination](https://doi.org/10.1093/genetics/139.2.1031).
Genetics 139:1031-1044
[doi: 10.1093/genetics/139.2.1031](https://doi.org/10.1093/genetics/139.2.1031)
Mott R, Flint J (2002)
[Simultaneous detection and fine mapping of quantitative trait loci in mice using heterogeneous stocks](https://doi.org/10.1093/genetics/160.4.1609).
Genetics 160:1609-1618
[doi: 10.1093/genetics/160.4.1609](https://doi.org/10.1093/genetics/160.4.1609)
Svenson KL, Gatti DM, Valdar W, Welsh CE, Cheng R, Chesler EJ, Palmer
AA, McMillan L, Churchill GA (2012)
[High-resolution genetic mapping using the mouse Diversity Outbred population.](https://doi.org/10.1534/genetics.111.132597).
Genetics 190:437-447
[doi: 10.1534/genetics.111.132597](https://doi.org/10.1534/genetics.111.132597)
Threadgill DW, Churchill GA (2012)
[Ten years of the Collaborative Cross](https://doi.org/10.1534/genetics.111.138032).
Genetics 190:291-294
[doi: 10.1534/genetics.111.138032](https://doi.org/10.1534/genetics.111.138032)
Zhao H, Speed TP, McPeek MS (1995)
[Statistical analysis of crossover interference using the chi-square model](https://doi.org/10.1534/genetics.111.138032).
Genetics 139:1045-1056
[doi: 10.1534/genetics.111.138032](https://doi.org/10.1534/genetics.111.138032)