| Title: | Identifying Clusters of Related Individuals |
|---|---|
| Description: | Identifying clusters of related individuals. |
| Authors: | Karl W Broman [aut, cre] |
| Maintainer: | Karl W Broman <[email protected]> |
| License: | GPL-3 |
| Version: | 0.60-6 |
| Built: | 2026-05-09 20:41:57 UTC |
| Source: | https://github.com/kbroman/fingers |
This is RAPD data for 40 loci typed on a set of 10 full-sibling families, with 15 individuals in each family.
data(aedes)
The data is a matrix of 150 rows (the individuals) by 40 columns (the RAPD loci). Each entry is a RAPD phenotype, indicating the presence (1) or absence (0) of a band.
Karl W Broman [email protected]
FINGERS software, WC Black IV, Colorado State University
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
data(aedes)
Calculate the simple distance matrix, by the proportion of mismatches, for a RAPD data set.
calc.dist(dat)
dat |
A matrix of size (n.ind x n.mar) containing RAPD phenotypes, with 1 indicating the presence of a band and 0 indicating absence. |
For each pair of individuals, we calculate the proportion of RAPD markers (among those where both individuals have complete data) at which one individual shows a band and the other doesn't.
A symmetric matrix of dimension (n.ind x n.ind), containing the distances between individuals.
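A minimal base-R sketch of the mismatch-proportion calculation described above. The helper name `mismatch_dist` and the toy matrix are hypothetical; this is not the package's `calc.dist` source, only an illustration of the definition, assuming missing phenotypes are coded as `NA`.

```r
# Hypothetical helper: proportion of mismatching markers for each pair,
# using only markers where both individuals have complete data.
mismatch_dist <- function(dat) {
  n <- nrow(dat)
  d <- matrix(0, n, n, dimnames = list(rownames(dat), rownames(dat)))
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      ok <- !is.na(dat[i, ]) & !is.na(dat[j, ])   # markers typed in both
      d[i, j] <- d[j, i] <- mean(dat[i, ok] != dat[j, ok])
    }
  }
  d
}

# Toy data: 3 individuals, 4 markers, with some missing values
toy <- rbind(a = c(1, 0, 1, NA),
             b = c(1, 1, 1, 0),
             c = c(0, 1, NA, 0))
d_toy <- mismatch_dist(toy)
```

In this toy example, individuals `a` and `b` share three typed markers and differ at one of them, so their distance is 1/3.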
Karl W Broman [email protected]
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
data(aedes)
d <- calc.dist(aedes)
Calculate a score indicating how well two sets of clusters conform.
cluster.stat(fam1,fam2,method=c("all","rand","adj","fm","kb"))
fam1 |
A list of clusters; each component in the list is one family, containing the indices of the individuals in that family. |
fam2 |
A list, just like |
method |
A character string indicating whether to calculate the
Rand index, the adjusted Rand index, the Fowlkes and Mallows B index,
or Karl Broman's index. If |
In the Rand index (Rand 1971), one considers all pairs of individuals
and assigns a 1 to a pair if the two individuals are in the same
cluster in both fam1 and fam2, or in different clusters in both
fam1 and fam2, and a 0 otherwise. The index is the sum of these
values divided by the number of pairs of individuals.
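The pair-counting definition of the Rand index can be sketched directly in base R. The helpers `membership` and `rand_index` are hypothetical names introduced for illustration (not the package's `cluster.stat` code), and the sketch assumes each list of clusters partitions the indices 1, ..., n.

```r
# Hypothetical helper: convert a list of clusters into a membership vector
membership <- function(fam, n) {
  m <- integer(n)
  for (k in seq_along(fam)) m[fam[[k]]] <- k
  m
}

# Hypothetical helper: proportion of pairs on which two partitions agree
rand_index <- function(fam1, fam2) {
  n <- length(unlist(fam1))
  m1 <- membership(fam1, n)
  m2 <- membership(fam2, n)
  same1 <- outer(m1, m1, "==")   # TRUE if the pair shares a cluster in fam1
  same2 <- outer(m2, m2, "==")   # TRUE if the pair shares a cluster in fam2
  pairs <- upper.tri(same1)      # each unordered pair counted once
  mean(same1[pairs] == same2[pairs])
}

famA <- list(1:3, 4:6)
famB <- list(1:2, 3:6)
ri <- rand_index(famA, famB)
```

For these two toy partitions of six individuals, 10 of the 15 pairs agree, so the index is 2/3.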
Karl Broman's index (which we don't recommend, but we implement
here in order to allow comparisons to be made) is just like the Rand
index, but fam2 is assumed to be the true partition, and
the set of all pairs in the same group (by fam2) and the set of
all pairs in different groups (by fam2), are given equal weight.
Let $n_{ij}$ be the number of individuals in group i by
partition 1 and group j by partition 2. Let $n_{i\cdot} = \sum_j n_{ij}$ and define
$n_{\cdot j}$ similarly.
In the adjusted-Rand index (Hubert and Arabie 1985), ...
In the Fowlkes and Mallows B index (Fowlkes and Mallows 1983), ...
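For orientation, the sketch below computes the adjusted Rand index and the Fowlkes and Mallows B index from the contingency table of two partitions, using the forms in which these indices are commonly stated in the literature. Whether the package uses exactly these expressions is an assumption; `pair_indices` is a hypothetical helper that takes two membership vectors.

```r
# Hypothetical helper: pair-counting indices from two membership vectors,
# using the commonly stated textbook formulas (an assumption about the package).
pair_indices <- function(m1, m2) {
  tab <- unclass(table(m1, m2))          # contingency table of counts
  n <- sum(tab)
  s_ij <- sum(choose(tab, 2))            # pairs together in both partitions
  s_i <- sum(choose(rowSums(tab), 2))    # pairs together in partition 1
  s_j <- sum(choose(colSums(tab), 2))    # pairs together in partition 2
  expected <- s_i * s_j / choose(n, 2)   # chance-expected value of s_ij
  c(adj = (s_ij - expected) / ((s_i + s_j) / 2 - expected),
    fm = s_ij / sqrt(s_i * s_j))
}

m1 <- c(1, 1, 1, 2, 2, 2)
m2 <- c(1, 1, 2, 2, 2, 2)
idx <- pair_indices(m1, m2)
```

Both indices equal 1 when the two partitions are identical; the adjusted Rand index is additionally corrected so that its expected value under random labelling is 0.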
The value of a score for comparing two sets of clusters.
Karl W Broman [email protected]
WM Rand (1971) Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association 66:846-850.
L Hubert and P Arabie (1985) Comparing partitions. Journal of Classification. 2:193-218.
EB Fowlkes and CL Mallows (1983) A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78:553-584.
BS Everitt, S Landau and M Leese (2001) Cluster analysis, 4th edition. Arnold, London, pp. 181-3.
data(aedes)
f <- freq(aedes)
co <- cutoff(f)
d <- calc.dist(aedes)
fam <- fingers(d,co,make.plot=TRUE)
tf <- true.fams(aedes)
cluster.stat(fam,tf)
cluster.stat(fam,tf,method="fm")
Give diagnostic information indicating how well two sets of clusters conform.
comp.fams(fam1,fam2)
fam1 |
A list of clusters; each component in the list is one family, containing the indices of the individuals in that family. |
fam2 |
A list, just like |
A list with two components. The first component is a contingency
table whose (i,j)th element is the number of individuals in cluster i
in fam1 and cluster j in fam2. The second component is
a list indicating, for each cluster from fam1, the cluster
assignment in fam2.
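A rough base-R illustration of the two components described above. The helper `membership` and the toy cluster lists are hypothetical, and the exact structure returned by `comp.fams` may differ from this sketch.

```r
# Hypothetical helper: convert a list of clusters into a membership vector
membership <- function(fam) {
  m <- integer(length(unlist(fam)))
  for (k in seq_along(fam)) m[fam[[k]]] <- k
  m
}

fam1 <- list(c(1, 2, 3), c(4, 5, 6))
fam2 <- list(c(1, 2), c(3, 4, 5, 6))

# (i,j)th entry: number of individuals in cluster i of fam1 and cluster j of fam2
tab <- table(fam1 = membership(fam1), fam2 = membership(fam2))

# For each cluster in fam1, the fam2 cluster of each of its members
assigned <- lapply(fam1, function(ind) membership(fam2)[ind])
```

Here the first cluster of `fam1` is split across both clusters of `fam2`, which shows up as two non-zero entries in the first row of `tab`.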
Karl W Broman [email protected]
cluster.stat,
fingers,
true.fams
data(aedes)
f <- freq(aedes)
co <- cutoff(f)
d <- calc.dist(aedes)
fam <- fingers(d,co,make.plot=TRUE)
tf <- true.fams(aedes)
comp.fams(fam,tf)
Calculate the cutoff for hierarchical cluster analysis to infer groups of related individuals with RAPD data.
cutoff(f,method=c("qu","meansib","qs","lr"),value=0.2)
f |
A vector of band allele frequencies for a set of RAPD markers. |
method |
The method to use to form the cutoff: a quantile of the
distribution of distances among unrelated ( |
value |
For |
The cutoff (a single value).
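To give a feel for a quantile-based cutoff, the sketch below approximates, by simulation, a quantile of the mismatch distance between unrelated individuals. It assumes Hardy-Weinberg equilibrium, the proportion-of-mismatches distance, and that the quantile level plays the role of `value`; the toy frequencies and all object names are hypothetical. How the package actually computes its cutoffs is not stated here, so this is an illustration of the idea only.

```r
set.seed(1)
f <- c(0.2, 0.3, 0.4, 0.5)     # toy band allele frequencies (hypothetical)
q <- 1 - (1 - f)^2             # probability an individual shows a band, under HWE
n_sim <- 10000                 # number of simulated unrelated pairs

# One simulated individual per row; column j uses band probability q[j]
draw <- function() {
  matrix(rbinom(n_sim * length(f), 1, q), ncol = length(f), byrow = TRUE)
}

# Mismatch proportion for each simulated pair of unrelated individuals
d_unrel <- rowMeans(draw() != draw())

# 20% quantile of the simulated distances, used here as a cutoff
co_qu <- unname(quantile(d_unrel, 0.2))
```

Pairs with a distance below such a cutoff are less distant than most unrelated pairs, which is the motivation for cutting the dendrogram at that height.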
Karl W Broman [email protected]
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
cutoff.llr,
freq,
pull.markers,
fingers
data(aedes)
f <- freq(aedes)
co1 <- cutoff(f,method="meansib")
co2 <- cutoff(f,method="qu",value=0.2)
co3 <- cutoff(f,method="qs",value=0.9)
co4 <- cutoff(f,method="lr",value=4.0)
Calculate a cutoff (for the LLR distance measure) for hierarchical cluster analysis to infer groups of related individuals with RAPD data.
cutoff.llr(f,method=c("qu","meansib","qs","lr"),value=0.2)
f |
A vector of band allele frequencies for a set of RAPD markers. |
method |
The method to use to form the cutoff: a quantile of the
distribution of distances among unrelated ( |
value |
For |
The cutoff (a single value).
Karl W Broman [email protected]
cutoff,
llrdist,
freq,
pull.markers,
fingers
data(aedes)
f <- freq(aedes)
co1 <- cutoff.llr(f,method="meansib")
co2 <- cutoff.llr(f,method="qu",value=0.2)
co3 <- cutoff.llr(f,method="qs",value=0.9)
co4 <- cutoff.llr(f,method="lr",value=4.0)
Plot the distance matrix for a RAPD data set, with (optionally) lines drawn separating clusters of individuals.
dist.image(dist,fams=NULL,col=topo.colors(1+ncol(dist)),...)
dist |
A matrix of size (n.ind x n.ind), containing the distances between pairs of individuals. |
fams |
A list of clusters; each component in the list is one inferred family, containing the indices of individuals placed in that family. |
col |
Colors to use in the plot; see |
... |
Other arguments to pass to |
The function calls image in order to create an
image of the distance matrix.
Karl W Broman [email protected]
data(aedes)
f <- freq(aedes)
co <- cutoff(f)
d <- calc.dist(aedes)
fam <- fingers(d,co,make.plot=TRUE)
dist.image(d,fam)
Perform hierarchical clustering to infer groups of related individuals with RAPD data.
fingers(dist,cutoff=NULL,method=c("average","complete",
        "mcquitty","single","ward"),truefam=NULL,
        make.plot=FALSE,just.plot=FALSE)
dist |
A matrix of size (n.ind x n.ind) containing the distances between individuals. |
cutoff |
A value to use to cut off the dendrogram formed by
hierarchical clustering in order to define a set of
clusters. (Optional, but if NULL, the argument |
method |
A hierarchical clustering method. See
|
truefam |
The true family structure; used only if |
make.plot |
If TRUE, make a plot of the dendrogram formed by hierarchical clustering. |
just.plot |
If TRUE, just make the plot; don't return the inferred
families. (In this case, the |
We use the function hclust
to do the cluster analysis.
A list of clusters; each component in the list is one inferred family,
containing the indices of individuals placed in that family. The
cutoff used is included as an attribute. Use
attr(result,"cutoff") to obtain this value.
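The general pattern of clustering a distance matrix with `hclust` and cutting the tree at a fixed height can be sketched with base R as follows. The toy distance matrix and the object names are hypothetical, and the use of `cutree` to apply the cutoff is an assumption about how such a step might be done, not a description of the package's internals.

```r
# Toy symmetric distance matrix for 4 individuals (hypothetical values)
d <- matrix(c(0,    0.1,  0.9,  0.8,
              0.1,  0,    0.85, 0.9,
              0.9,  0.85, 0,    0.2,
              0.8,  0.9,  0.2,  0), nrow = 4, ncol = 4)

co <- 0.5                                    # cutoff height (hypothetical)
hc <- hclust(as.dist(d), method = "average") # average-linkage clustering
grp <- cutree(hc, h = co)                    # cluster label for each individual

# List of clusters, each holding the indices of its members
fam <- unname(split(seq_along(grp), grp))
attr(fam, "cutoff") <- co                    # keep the cutoff as an attribute
```

With average linkage, individuals 1 and 2 merge at height 0.1 and individuals 3 and 4 at 0.2, while the two groups join only at about 0.86, so a cutoff of 0.5 yields two clusters.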
Karl W Broman [email protected]
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
cutoff,
cutoff.llr,
calc.dist,
llrdist,
cluster.stat,
true.fams,
freq,
pull.markers
data(aedes)
f <- freq(aedes)
co <- cutoff(f)
d <- calc.dist(aedes)
fam <- fingers(d,co,make.plot=TRUE)
tf <- true.fams(aedes)
cluster.stat(fam,tf)
Estimate the frequency of the band allele for a set of RAPD markers.
freq(dat)
dat |
A matrix of size (n.ind x n.mar) containing RAPD phenotypes, with 1 indicating the presence of a band and 0 indicating absence. |
The RAPDs are assumed to be in Hardy-Weinberg equilibrium, and so the
frequency of the band allele is estimated as $1 - \sqrt{1 - \hat{p}}$, where
$\hat{p}$ is the proportion of individuals showing a band.
A vector of length n.mar, containing the estimated frequencies of the band allele for each RAPD marker.
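As a sketch of the reasoning: with a dominant marker under Hardy-Weinberg equilibrium, an individual lacks a band only if it carries two copies of the no-band allele, so the proportion without a band estimates (1 - f)^2, where f is the band allele frequency. Solving for f gives the estimate coded below. The helper `band_freq` and the toy data are hypothetical, and dropping missing values with `na.rm = TRUE` is an assumption.

```r
# Hypothetical helper: estimate band allele frequencies under HWE
band_freq <- function(dat) {
  p_band <- colMeans(dat, na.rm = TRUE)  # proportion of individuals showing a band
  1 - sqrt(1 - p_band)                   # solve (1 - f)^2 = 1 - p_band for f
}

toy <- cbind(m1 = c(1, 1, 1, 0),
             m2 = c(0, 0, 0, 0),
             m3 = c(1, 1, 1, 1))
f_toy <- band_freq(toy)
```

For marker `m1`, three of four individuals show a band, so the estimate is 1 - sqrt(0.25) = 0.5.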
Karl W Broman [email protected]
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
data(aedes)
f <- freq(aedes)
Calculate a distance matrix, based on the log likelihood ratio comparing the hypotheses of full sibling versus unrelated, for a RAPD data set.
llrdist(dat,p=freq(dat))
dat |
A matrix of size (n.ind x n.mar) containing RAPD phenotypes, with 1 indicating the presence of a band and 0 indicating absence. |
p |
A vector of band allele frequencies. |
For each pair of individuals, at each locus, we calculate the log
likelihood ratio (LLR) comparing the hypotheses unrelated with
siblings, with the data being B (both have band), N (neither
have band) or D (one has band, the other doesn't). These LLRs are
averaged across individuals. Note: at each
locus, we re-center the LLRs so that the minimum of the LLRs among
B/N/D is 0; this makes the resulting distances non-negative.
Calculations are performed in a C program.
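The per-locus quantities can be illustrated in R under stated assumptions: Hardy-Weinberg equilibrium, a dominant marker, full siblings sharing 0, 1 or 2 alleles identical by descent with probabilities 1/4, 1/2 and 1/4, and the ratio taken as unrelated over siblings. The helper `llr_locus` is hypothetical and is not a transcription of the package's C code.

```r
# Hypothetical helper: re-centered per-locus LLRs for the outcomes B, N, D,
# given band allele frequency p.
llr_locus <- function(p) {
  a <- 1 - p                                  # no-band allele frequency
  # Unrelated pair: the two phenotypes are independent
  unrel <- c(B = (1 - a^2)^2, N = a^4, D = 2 * a^2 * (1 - a^2))
  # Full siblings: P(neither has band) averaged over IBD states 2, 1, 0
  sib_N <- a^2 * (1 + a)^2 / 4                # a^2/4 + a^3/2 + a^4/4
  sib_D <- 2 * (a^2 - sib_N)                  # exactly one sibling lacks a band
  sib <- c(B = 1 - sib_N - sib_D, N = sib_N, D = sib_D)
  llr <- log(unrel / sib)
  llr - min(llr)                              # re-center so the minimum is 0
}

x <- llr_locus(0.3)
```

Under these assumptions the discordant outcome D always receives the largest value, since siblings are less likely than unrelated individuals to differ at a locus.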
A symmetric matrix of dimension (n.ind x n.ind), containing the distances between individuals.
Karl W Broman [email protected]
data(aedes)
f <- freq(aedes)
dis <- llrdist(aedes,f)
Extract markers from a RAPD data set that have allele frequencies within a specified range.
pull.markers(dat,lo=0.1,hi=0.6,f=freq(dat))
dat |
A matrix of size (n.ind x n.mar) containing RAPD phenotypes, with 1 indicating the presence of a band and 0 indicating absence. |
lo |
Lower bound for band allele frequency. |
hi |
Upper bound for band allele frequency. |
f |
Vector of band allele frequencies (included in order to avoid recalculating it, if possible). |
A matrix, like the argument dat, but containing only those
markers with band allele frequency between lo and hi.
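The column selection can be sketched in a line of base R. The helper `keep_markers` and the toy inputs are hypothetical, and treating both bounds as inclusive is an assumption.

```r
# Hypothetical helper: keep markers whose band allele frequency lies in [lo, hi]
keep_markers <- function(dat, lo = 0.1, hi = 0.6, f) {
  dat[, f >= lo & f <= hi, drop = FALSE]   # drop = FALSE keeps a matrix
}

toy <- cbind(a = c(1, 0), b = c(1, 1), c = c(0, 0))
f_toy <- c(0.3, 0.9, 0.05)                 # toy frequencies, one per column
sub_toy <- keep_markers(toy, 0.1, 0.6, f_toy)
```

Using `drop = FALSE` ensures the result stays a matrix even when only one marker passes the filter.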
Karl W Broman [email protected]
data(shiff1)
f <- freq(shiff1)
subset <- pull.markers(shiff1, 0.1, 0.6, f)
This is RAPD data for 35 loci typed on a set of 135 individuals.
data(shiff1)
The data is a matrix of 135 rows (the individuals) by 35 columns (the RAPD loci). Each entry is a RAPD phenotype, indicating the presence (1) or absence (0) of a band.
Karl W Broman [email protected]
Clive Shiff, Molecular Microbiology and Immunology, Bloomberg School of Public Health, The Johns Hopkins University
shiff2, shiff3,
aedes, simrapd
data(shiff1)
This is RAPD data for 10 loci typed on a set of 135 individuals. Markers with estimated band allele frequencies outside of the range 0.1-0.6 have been removed.
data(shiff2)
The data is a matrix of 135 rows (the individuals) by 10 columns (the RAPD loci). Each entry is a RAPD phenotype, indicating the presence (1) or absence (0) of a band.
Karl W Broman [email protected]
Clive Shiff, Molecular Microbiology and Immunology, Bloomberg School of Public Health, The Johns Hopkins University
shiff1, shiff3,
aedes, simrapd
data(shiff2)
This is RAPD data for 10 loci typed on a set of 125 individuals. Markers with estimated band allele frequencies outside of the range 0.1-0.6 have been removed. Individuals with one or more missing values have been removed.
data(shiff3)
The data is a matrix of 125 rows (the individuals) by 10 columns (the RAPD loci). Each entry is a RAPD phenotype, indicating the presence (1) or absence (0) of a band.
Karl W Broman [email protected]
Clive Shiff, Molecular Microbiology and Immunology, Bloomberg School of Public Health, The Johns Hopkins University
shiff1, shiff2,
aedes, simrapd
data(shiff3)
Simulates RAPD data for a set of sibling families.
simrapd(n.sib = rep(15,10),
        p = c(rep(0.125,8),rep(0.175,5),rep(0.225,5),
              rep(0.275,8),rep(0.325,3),rep(0.375,4),
              rep(0.475,4),rep(0.575,3)))
n.sib |
A vector giving the number of siblings per family (length is the number of families). |
p |
A vector of frequencies of the band allele at each marker (length is the number of markers). |
The RAPDs are assumed to be in Hardy-Weinberg equilibrium.
A matrix of dimension (n.ind x n.mar), giving the RAPD phenotypes for each individual at each marker, with 1 indicating a band and 0 indicating no band.
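One way such a simulation could be organised is sketched below: parental alleles are drawn independently under Hardy-Weinberg equilibrium, each sibling inherits one randomly chosen allele from each parent at each marker, and a band is shown if at least one inherited allele is the band allele. The helper `sim_sibs` is hypothetical and the package's own procedure may differ; markers are also assumed to be unlinked.

```r
# Hypothetical helper: simulate RAPD phenotypes for full-sibling families
sim_sibs <- function(n.sib, p) {
  n.mar <- length(p)
  fams <- lapply(seq_along(n.sib), function(k) {
    # Two alleles per parent per marker (rows = alleles); 1 = band allele
    mom <- matrix(rbinom(2 * n.mar, 1, rep(p, each = 2)), nrow = 2)
    dad <- matrix(rbinom(2 * n.mar, 1, rep(p, each = 2)), nrow = 2)
    kids <- matrix(0L, n.sib[k], n.mar)
    for (i in seq_len(n.sib[k])) {
      # Pick one of the two alleles from each parent at every marker
      from_mom <- mom[cbind(sample(1:2, n.mar, replace = TRUE), seq_len(n.mar))]
      from_dad <- dad[cbind(sample(1:2, n.mar, replace = TRUE), seq_len(n.mar))]
      kids[i, ] <- as.integer(from_mom + from_dad > 0)  # band is dominant
    }
    rownames(kids) <- paste(k, seq_len(n.sib[k]), sep = "-")
    kids
  })
  do.call(rbind, fams)
}

set.seed(42)
sim <- sim_sibs(n.sib = c(3, 2), p = c(0.2, 0.4, 0.6))
```

The row names follow a "family-individual" pattern so that the family structure can be recovered from the simulated matrix.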
Karl W Broman [email protected]
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
data <- simrapd(rep(20,5), p=runif(40, 0.1, 0.6))
Simulates RAPD data for a set of sibling families.
simulfams(n.sib=sample(5:20,size=sample(5:20,size=1),replace=TRUE),
          p=runif(sample(5:15,size=1),min=0.1,max=0.6))
n.sib |
A vector giving the number of siblings per family (length is the number of families). |
p |
A vector of frequencies of the band allele at each marker (length is the number of markers). |
The RAPDs are assumed to be in Hardy-Weinberg equilibrium.
A matrix of dimension (n.ind x n.mar), giving the RAPD phenotypes for each individual at each marker, with 1 indicating a band and 0 indicating no band.
Laura Plantinga and Karl Broman [email protected]
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
data <- simulfams(rep(20,5), p=runif(40, 0.1, 0.6))
Use the row names of a RAPD data set to identify the true sets of families.
true.fams(dat)
dat |
A matrix of size (n.ind x n.mar) containing RAPD phenotypes, with 1 indicating the presence of a band and 0 indicating absence. The row names (identifying individuals) are assumed to be of the form "family-individual" |
A list of clusters; each component in the list is one true family, containing the indices of the individuals in that family.
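Recovering families from row names of the form "family-individual" can be sketched as follows. The helper `fams_from_rownames` is hypothetical, and it assumes the family label is everything before the last hyphen.

```r
# Hypothetical helper: group row indices by the family part of the row name
fams_from_rownames <- function(dat) {
  fam_id <- sub("-[^-]*$", "", rownames(dat))   # strip "-individual" suffix
  # Keep families in order of first appearance
  unname(split(seq_along(fam_id), factor(fam_id, levels = unique(fam_id))))
}

toy <- matrix(0, nrow = 5, ncol = 2,
              dimnames = list(c("1-1", "1-2", "2-1", "2-2", "2-3"), NULL))
tf_toy <- fams_from_rownames(toy)
```

In this toy example the first two rows form one family and the remaining three rows form another.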
Karl W Broman [email protected]
BL Apostol, WC Black IV, BR Miller, P Reiter, BJ Beaty (1993) Estimation of the number of full sibling families at an oviposition site using RAPD-PCR markers: applications to the mosquito Aedes aegypti. Theor Appl Genet 86:991-1000.
aedes, simrapd,
fingers,
cluster.stat
data(aedes)
tf <- true.fams(aedes)