A function to compute the biweight mean vector and covariance matrix
Source: R/biwt_dist_matrix.R
Compute a matrix of pairwise distances between observations, derived from a multivariate location and scale estimate based on Tukey's biweight weight function.
Arguments
- x
an n x g matrix or data frame (n is the number of measurements, g is the number of observations (genes))
- r
breakdown (k/n, where k is the largest number of observations that can be replaced with arbitrarily large values while keeping the estimates bounded)
- median
a logical value that determines whether the initialization is done using the coordinate-wise median and MAD (TRUE) or using the minimum covariance determinant (MCD) (FALSE). Using the MCD is substantially slower.
- full_init
a logical value that determines whether the initialization is done for each pair separately (FALSE) or only once at the beginning using the entire data matrix (TRUE). Initializing for each pair separately is substantially slower.
- absval
a logical value that determines whether the distance is measured as 1 minus the absolute value of the correlation (TRUE) or simply 1 minus the correlation (FALSE).
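The two distance definitions selected by absval can be illustrated with a minimal base R sketch. The ordinary Pearson correlation from cor() stands in for the biweight correlation here (an assumption made only to keep the example free of package dependencies), and the toy data and variable names are hypothetical.

```r
# Toy data: columns a and b move together, column c moves against them
set.seed(1)
z <- rnorm(50)
x <- cbind(a = z + rnorm(50, sd = 0.3),
           b = z + rnorm(50, sd = 0.3),
           c = -z + rnorm(50, sd = 0.3))

# Pearson correlation used as a stand-in for the biweight correlation
rho <- cor(x)

# Analogue of absval = TRUE: strongly negative correlation counts as "close"
dist_abs <- 1 - abs(rho)

# Analogue of absval = FALSE: strongly negative correlation counts as "far"
dist_signed <- 1 - rho

dist_abs["a", "c"]
dist_signed["a", "c"]
```

With the absolute value, distances lie between 0 and 1 and negatively related items end up close together; without it, distances lie between 0 and 2 and negatively related items end up far apart.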
Value
Using biwt_est to estimate the robust covariance matrix, a robust measure of correlation is computed using Tukey's biweight M-estimator. The biweight correlation is essentially a weighted correlation where the weights are calculated based on the distance of each measurement to the data center with respect to the shape of the data. The correlations are computed pair-by-pair because the weights should depend only on the pairwise relationship at hand and not the relationship between all the observations globally. The biwt functions compute many pairwise correlations and create distance matrices for use in other algorithms (e.g., clustering).
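The weighted-correlation idea described above can be sketched in base R. This is not the estimator implemented by biwt_est: it applies a single weighting step instead of iterating to convergence, and the cutoff c_cut is an arbitrary illustrative value, not one derived from the breakdown r. All variable names are hypothetical.

```r
set.seed(42)
n <- 40
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.4)
# Two gross outliers that pull against the positive trend
y[1:2] <- -8 * sign(x[1:2])
xy <- cbind(x, y)

# Starting center and (diagonal) shape: coordinate-wise median and MAD
center0 <- apply(xy, 2, median)
shape0 <- diag(apply(xy, 2, mad)^2)

# Distance of each measurement from the center, relative to the shape
# (mahalanobis() returns squared distances)
d <- sqrt(mahalanobis(xy, center0, shape0))

# Tukey biweight weights: smooth decay to zero at the cutoff
c_cut <- 3  # assumed cutoff, for illustration only
w <- ifelse(d < c_cut, (1 - (d / c_cut)^2)^2, 0)

# Weighted correlation versus the ordinary correlation
weighted_cor <- cov.wt(xy, wt = w, cor = TRUE)$cor[1, 2]
plain_cor <- cor(x, y)
```

Measurements far from the center receive weight zero, so the two planted outliers do not influence weighted_cor, whereas they do distort plain_cor.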
In order for the biweight estimates to converge, a reasonable initialization must be given. Typically, setting both the median and full_init arguments to TRUE provides an acceptable initialization. With particularly irregular data, the MCD should be used (median=FALSE) to give the initial estimate of center and shape. With data sets in which the observations differ by orders of magnitude, full_init=FALSE should be specified.
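To make the two initialization options more concrete, the sketch below computes both kinds of starting values for a single pair of observations. It assumes MASS::cov.mcd as a stand-in for the MCD step (the routine the package actually uses may differ), and the diagonal shape built from the MADs is likewise an assumption of this sketch; the object names are hypothetical.

```r
set.seed(7)
# Toy n x 2 data for one pair of observations
pair <- MASS::mvrnorm(30, mu = c(0, 0),
                      Sigma = matrix(c(1, 0.75, 0.75, 1), ncol = 2))

# Analogue of median = TRUE: coordinate-wise median and MAD
init_med <- list(center = apply(pair, 2, median),
                 cov = diag(apply(pair, 2, mad)^2))

# Analogue of median = FALSE: minimum covariance determinant
mcd <- MASS::cov.mcd(pair)
init_mcd <- list(center = mcd$center, cov = mcd$cov)

init_med$cov
init_mcd$cov
```

In this sketch the median/MAD start ignores the relationship between the two columns (zero off-diagonal), while the MCD start estimates a full shape matrix at a higher computational cost.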
Returns a list with components:
- biwt_dist_mat
a matrix of the biweight distances (default is 1 minus the absolute value of the biweight correlation).
- biwt_NAid_mat
a matrix indicating whether each biweight correlation could be computed: 0 if it was computed, 1 if the result is NA (which happens when too much data is missing or the initialization is not accurate).
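One way to use the indicator matrix is to list the pairs for which no distance could be computed. The sketch below builds a hypothetical result by hand, using the component names printed in the Examples output ($biwt_dist_mat and $biwt_NAid_mat); the values, and the assumption that the failed entry of the distance matrix is NA, are made up for illustration.

```r
# Hand-made stand-in for the returned list (values are hypothetical)
res <- list(
  biwt_dist_mat = matrix(c(0,    0.28, NA,
                           0.28, 0,    0.34,
                           NA,   0.34, 0), ncol = 3, byrow = TRUE),
  biwt_NAid_mat = matrix(c(0, 0, 1,
                           0, 0, 0,
                           1, 0, 0), ncol = 3, byrow = TRUE)
)

# Row/column indices of failed pairs, counting each pair once
failed <- which(res$biwt_NAid_mat == 1 & upper.tri(res$biwt_NAid_mat),
                arr.ind = TRUE)
n_failed <- nrow(failed)

failed
```

A non-empty result suggests revisiting the initialization settings (median, full_init) or the amount of missing data for the listed pairs.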
References
Hardin, J., Mitani, A., Hicks, L., VanKoten, B.; A Robust Measure of Correlation Between Two Genes on a Microarray, BMC Bioinformatics, 8:220; 2007.
Examples
# note that biwt_dist_matrix() takes data that is nxg where the
# goal is to find distances between each of the g items
samp_data <- MASS::mvrnorm(30, mu = c(0, 0, 0),
  Sigma = matrix(c(1, .75, -.75, .75, 1, -.75, -.75, -.75, 1), ncol = 3))
r <- 0.2 # breakdown
# To compute the 3 pairwise distances in matrix form:
samp_bw_dist_mat <- biwt_dist_matrix(samp_data, r)
samp_bw_dist_mat
#> $biwt_dist_mat
#>           [,1]      [,2]      [,3]
#> [1,] 0.0000000 0.2767782 0.2801223
#> [2,] 0.2767782 0.0000000 0.3413419
#> [3,] 0.2801223 0.3413419 0.0000000
#>
#> $biwt_NAid_mat
#>      [,1] [,2] [,3]
#> [1,]    0    0    0
#> [2,]    0    0    0
#> [3,]    0    0    0
#>
# To convert the distances into an element of class 'dist'
as.dist(samp_bw_dist_mat$biwt_dist_mat)
#>           1         2
#> 2 0.2767782
#> 3 0.2801223 0.3413419
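
Because the distances are intended for use in other algorithms such as clustering, a short sketch of that step may help. It re-enters the distance values printed above by hand so that it runs without the package; the item labels g1 to g3, the average linkage, and the choice of two clusters are assumptions made for illustration.

```r
# Distances copied from the printed output above
d_mat <- matrix(c(0,         0.2767782, 0.2801223,
                  0.2767782, 0,         0.3413419,
                  0.2801223, 0.3413419, 0),
                ncol = 3,
                dimnames = list(paste0("g", 1:3), paste0("g", 1:3)))

# Hierarchical clustering on the 'dist' object
hc <- hclust(as.dist(d_mat), method = "average")

# Cut the tree into two groups
clusters <- cutree(hc, k = 2)
clusters
```

The two items with the smallest distance (g1 and g2) are merged first, so they share a cluster when the tree is cut into two groups.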