Title: Measuring Multivariate Dependence Using Distance Multivariance
Description: Distance multivariance is a measure of dependence which can be used to detect and quantify dependence of arbitrarily many random vectors. The necessary functions are implemented in this package, and examples are given. It includes: distance multivariance, distance multicorrelation, dependence structure detection, tests of independence and copula versions of distance multivariance based on the Monte Carlo empirical transform. Detailed references are given in the package description; as a starting point for the theoretic background we refer to: B. Böttcher, Dependence and Dependence Structures: Estimation and Visualization Using the Unifying Concept of Distance Multivariance. Open Statistics, Vol. 1, No. 1 (2020), <doi:10.1515/stat-2020-0001>.
Authors: Björn Böttcher [aut, cre], Martin Keller-Ressel [ctb]
Maintainer: Björn Böttcher <[email protected]>
License: GPL-3
Version: 2.4.1
Built: 2025-01-23 04:24:40 UTC
Source: https://github.com/cran/multivariance
The multivariance package provides basic functions to calculate distance multivariance and related quantities. To test independence use multivariance.test, which provides an interface (via its arguments) to all tests based on distance (m-/total-)multivariance. The package also offers several other functions related to distance multivariance, e.g. the detection and visualization of dependence structures via dependence.structure. See below for details on the full content of the package.
Distance multivariance is a multivariate dependence measure, which can be used to detect dependencies between an arbitrary number of random vectors each of which can have a distinct dimension. The necessary functions are implemented in this package, and examples are given. For the theoretic background we refer to the papers [1,2,3,4,5,6]. Paper [3] includes a summary of the first two. It is the recommended starting point for users with an applied interest. Paper [4] is concerned with new (faster) p-value estimates for the independence tests, [5] introduces the copula versions of distance multivariance, [6] discusses the quantification of dependence using distance multicorrelations.
The current code is speed-improved in comparison to former releases. Certainly there is still room for improvement and development. Questions, comments and remarks are welcome: [email protected]
For information on the latest changes and/or updates to the package use news(package = "multivariance").
To cite this package use the standard citation for R packages, i.e., the output of citation("multivariance").
multivariance computes the distance multivariance.
total.multivariance computes the total distance multivariance.
m.multivariance computes the m-multivariance (introduced in [3]).
It might be convenient to compute these simultaneously using multivariances.all.
copula.multivariance computes the copula versions of the above (introduced in [5]).
multicorrelation computes the multicorrelations (discussed specifically in [6]).
rejection.level computes a (conservative) rejection level for a given significance level. This can be used for a conservative interpretation of distance multivariance. The counterpart is multivariance.pvalue, which computes a conservative p-value for a given distance multivariance. Both methods are distribution-free.
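As a minimal sketch of this distribution-free workflow (using only functions of this package; the variable names are illustrative):

x = matrix(rnorm(100*3),ncol = 3) # independent sample
m = multivariance(x) # normalized test statistic (Nscale = TRUE is the default)
m > rejection.level(0.05) # TRUE would reject independence at the conservative 5% level
multivariance.pvalue(m) # the corresponding conservative p-value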
resample.rejection.level and resample.pvalue are the distribution-dependent versions of the above. They are approximately sharp, but computationally more expensive. Any resampling is done by resample.multivariance.
Using the methods developed in [4], approximate p-value estimates are provided by pearson.pvalue. This method is much faster than the resampling method.
multivariance.test provides the corresponding tests of independence. It returns its output in the format common for tests in R.
cdm and cdms compute the doubly centered distance matrix and matrices, respectively. These can be used to speed up repeated computations of distance multivariance.
In [4] various methods to estimate the moments of the test statistic under H0 were developed; these are (implicitly) implemented in this package only for the moments used in pearson.pvalue. Further and explicit functions can be added upon request. Please feel free to contact the author.
emp.transf computes the Monte Carlo empirical transform of the data, which yields the copula version of distance multivariance. Note that the values become randomized due to the Monte Carlo empirical transform, i.e., in a finite sample setting the copula versions do not yield identical values for repeated runs.
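A small sketch of this randomization (two identical calls on the same data typically return slightly different values):

x = matrix(rnorm(100*2),ncol = 2)
copula.multivariance(x) # based on the Monte Carlo empirical transform
copula.multivariance(x) # a repeated run typically yields a slightly different value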
For planning of large projects or studies it might be convenient to estimate the computation time of multivariance via multivariance.timing.
dependence.structure performs the dependence structure detection algorithm as described in [3].
find.cluster is the basic building block of dependence.structure. It is recommended to use dependence.structure.
coins and tetrahedron generate samples of pairwise independent random variables with dependence of higher order.
dep_struct_iterated_13_100, dep_struct_ring_15_100, dep_struct_several_26_100 and dep_struct_star_9_100 are example datasets for the dependence structure detection. These might also serve as benchmark examples.
anscombe.extended provides an extension of Anscombe's quartet. It illustrates that a large value of Pearson's correlation can occur for very different dependencies and that this is not a small-sample problem. These dependencies are at least partly differentiated by values of distance multicorrelation.
[1] B. Böttcher, M. Keller-Ressel, R.L. Schilling, Detecting independence of random vectors: generalized distance covariance and Gaussian covariance. Modern Stochastics: Theory and Applications, Vol. 5, No. 3(2018) 353-383. https://www.vmsta.org/journal/VMSTA/article/127/info
[2] B. Böttcher, M. Keller-Ressel, R.L. Schilling, Distance multivariance: New dependence measures for random vectors. The Annals of Statistics, Vol. 47, No. 5 (2019) 2757-2789. doi:10.1214/18-AOS1764
[3] B. Böttcher, Dependence and Dependence Structures: Estimation and Visualization using the Unifying Concept of Distance Multivariance. Open Statistics, Vol. 1, No. 1 (2020) 1-46. doi:10.1515/stat-2020-0001
[4] G. Berschneider, B. Böttcher, On complex Gaussian random fields, Gaussian quadratic forms and sample distance multivariance. Preprint. https://arxiv.org/abs/1808.07280
[5] B. Böttcher, Copula versions of distance multivariance and dHSIC via the distributional transform – a general approach to construct invariant dependence measures. Statistics, (2020) 1-18. doi:10.1080/02331888.2020.1748029
[6] B. Böttcher, Notes on the interpretation of dependence measures – Pearson's correlation, distance correlation, distance multicorrelations and their copula versions. Preprint. https://arxiv.org/abs/2004.07649
The dataset extends 'anscombe' provided in the standard R package 'datasets'. All examples feature the same correlation of 0.82, but different types of dependencies. The main aim was to extend the classical examples, which have sample size 11, to larger sample sizes. This illustrates that the implied problems of Pearson's correlation are not small-sample problems! Distance multicorrelation (which coincides in this case with distance correlation) yields different values for the datasets.
anscombe.extended is a list with elements:
anscombe.extended$N11: matrix with 11 samples for 5 examples; the first 4 are the classical Anscombe quartet, the fifth is a monotone relation which also features the same correlation.
anscombe.extended$N100: same as above but with 100 samples.
anscombe.extended$N1000: same as above but with 1000 samples.
Note: Anscombe's quartet features further identical parameters besides Pearson's correlation. The extended set is only concerned with correlation.
This example was introduced in the reference [6] given on the main help page of this package: multivariance-package.
# Code which generates plots of all included data:
op = par(mfrow = c(3,5),mar = c(0.5,0.5,3,0.5))
for (name in c("N11","N100","N1000")) {
  for (i in 1:5) {
    x = anscombe.extended[[name]][,2*i-1]
    y = anscombe.extended[[name]][,2*i]
    plot(x,y,main = paste0("cor = ",round(cor(x,y),2),
      "\n Mcor = ",round(multicorrelation(cbind(x,y),type = "pairwise",squared = FALSE),2),
      "\n CMcor = ",round(copula.multicorrelation(cbind(x,y),type = "pairwise",squared = FALSE),2)),
      axes = FALSE,xlab = "",ylab = "",cex.main = 1)
    # for two variables 'pairwise' coincides with
    # both values of 'total.upper.lower'.
    box()
  }
}
par(op)
computes the doubly centered distance matrix
cdm( x, normalize = TRUE, psi = NULL, p = NULL, isotropic = FALSE, external.dm.fun = NULL )
x: matrix, each row of the matrix is treated as one sample
normalize: logical, indicates if the matrix should be normalized
psi: if it is NULL, the Euclidean distance will be used; otherwise a function defining the distance (see the example below and the isotropic option)
p: numeric, if it is a value between 1 and 2 then the Minkowski distance with parameter p is used
isotropic: logical, indicates if psi of the Euclidean distance matrix should be computed, i.e., if an isotropic distance should be used
external.dm.fun: here one can supply an external function which computes the distance matrix given the matrix of row samples
The doubly centered distance matrices are required for the computation of (total / m-) multivariance.
If normalize = TRUE then the value of multivariance is comparable and meaningful. It can be compared to rejection.level, or its p-value can be computed via multivariance.pvalue.
More details: if normalize = TRUE the matrix is scaled such that the multivariance based on it, times the sample size, has in the limit (in the case of independence) the distribution of an L^2 norm of a Gaussian process with known expectation.
As default the Euclidean distance is used. The parameters psi, p, isotropic and external.dm.fun can be used to select a different distance. In particular, external.dm.fun can be used to provide any function which calculates a distance matrix for the rows of a given matrix.
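As a sketch (assuming, as described above, that the supplied function takes the matrix of row samples and returns the corresponding distance matrix, as fastdist does):

x = coins(100)
A = cdm(x) # default: Euclidean distance
B = cdm(x,external.dm.fun = fastdist) # the same distance via an external function
all.equal(A,B) # expected to be TRUE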
For the theoretic background see the references given on the main help page of this package: multivariance-package.
x = coins(100)
cdm(x) # fast euclidean distances
cdm(x,psi = function(x,y) sqrt(sum((x-y)^2))) # identical to the previous (but slower)

# the function cdm does the following three lines in a faster way
N = nrow(x)
C = diag(N) - matrix(1/N,nrow = N,ncol = N)
A = - C %*% as.matrix(stats::dist(x,method = "euclidean")) %*% C
all(abs(A - cdm(x,normalize = FALSE)) < 10^(-12))
computes the doubly centered distance matrices
cdms(x, vec = 1:ncol(x), membership = NULL, ...)
x: matrix, each row is a sample
vec: vector which indicates which columns are treated as one sample
membership: deprecated, use vec instead
...: these are passed to cdm
It returns a list of doubly centered distance matrices.
Given a dependence structure graph, vertices representing the multivariance of only two vertices can be turned into an edge labeled with the label of the vertex. Moreover, subsets of the graph can be selected.
clean.graph( g, only.level = NULL, simplify.pairs = TRUE, drop.label.pairs = FALSE )
g: graph created by dependence.structure
only.level: integer vector; if provided, all edges and dependency nodes corresponding to dependence orders not given in 'only.level' are removed
simplify.pairs: boolean, if TRUE dependency nodes which are only connected to two variables are turned into edges
drop.label.pairs: boolean, if TRUE the labels for edges indicating pairwise dependence are removed
Note: the option 'only.level' works properly only for a full dependence structure graph; in the case of a clustered dependence structure graph, dependency nodes representing a cluster might be removed.
It returns the cleaned graph.
N = 200
y = coins(N,2)
x = cbind(y,y,y)

ds = dependence.structure(x,structure.type = "clustered")
plot(clean.graph(ds$graph))
plot(clean.graph(ds$graph,only.level = 2))
plot(clean.graph(ds$graph,only.level = 3))
# of limited use for a clustered graph, i.e., here the three-dependence
# node without edges indicates that all edges were connected to clusters

ds = dependence.structure(x,structure.type = "full")
plot(clean.graph(ds$graph))
plot(clean.graph(ds$graph,drop.label.pairs = TRUE))
plot(clean.graph(ds$graph,only.level = 2))
plot(clean.graph(ds$graph,only.level = 2,drop.label.pairs = TRUE))
plot(clean.graph(ds$graph,only.level = 3))
This function creates samples which are dependent but k-independent.
coins(N = 1000, k = 2, type = "even")
N: number of samples
k: each k-tuple will be independent
type: one of "even" or "odd"
Throw k independent fair coins. Now consider the k+1 events: the first shows heads, the second shows heads, ..., the k-th shows heads, and there is an even (or odd, as selected via type) number of heads. Each row contains the state of these k+1 events.
It returns the samples as rows of an N by k+1 matrix. The columns are dependent but k-independent.
For the theoretic background see the reference [3] given on the main help page of this package: multivariance-package.
coins(200,4)
Formally it is nothing but distance multicorrelation applied to the Monte Carlo empirical transform of the data. Hence its values vary for repeated runs.
copula.multicorrelation(x, vec = 1:ncol(x), ...)
CMcor(x, vec = 1:ncol(x), ...)
x: either a data matrix or a list of doubly centered distance matrices
vec: if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
...: are passed to multicorrelation
For the theoretic background see the reference [5] given on the main help page of this package: multivariance-package.
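A minimal usage sketch (the copula value is randomized by the Monte Carlo empirical transform; "pairwise" is one of the types of multicorrelation):

y = rnorm(100)
x = cbind(y,exp(y)) # monotonely related variables
multicorrelation(x,type = "pairwise") # depends on the marginal distributions
copula.multicorrelation(x,type = "pairwise") # margin-free copula version (randomized)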
Formally it is nothing but the tests of distance multivariance applied to the Monte Carlo empirical transform of the data. Hence its results vary for repeated runs.
copula.multicorrelation.test(x, vec = 1:ncol(x), ...)
x: matrix, each row is a sample
vec: vector which indicates which columns are treated as one sample
...: these are passed to multivariance.test
For the theoretic background see the reference [5] given on the main help page of this package: multivariance-package.
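A minimal usage sketch (the p-values vary slightly between runs due to the randomized transform):

x = coins(100) # pairwise independent, but 3-dependent
copula.multicorrelation.test(x)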
Formally it is nothing but distance multivariance applied to the Monte Carlo empirical transform of the data. Hence its values vary for repeated runs.
copula.multivariance(x, vec = 1:ncol(x), type = "total", ...)
x: either a data matrix or a list of doubly centered distance matrices
vec: if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
type: default: "total"; for details and other options see the multivariance functions
...: these are passed on (e.g. to select a different distance, see cdm)
For the theoretic background see the reference [5] given on the main help page of this package: multivariance-package.
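A minimal usage sketch (assuming type accepts the same options as the multivariance functions, with "total" as the stated default):

x = matrix(rnorm(100*3),ncol = 3) # independent sample
copula.multivariance(x) # total multivariance of the transformed data
copula.multivariance(x,type = "multi") # plain multivariance of the transformed data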
Example dataset for dependence.structure. It was generated by

set.seed(532333356)
N = 100
x = matrix(sample.int(2,10*N,replace = TRUE)-1,ncol = 10)
# each added column is the parity of the preceding columns
for (i in c(2,5,9)) x = cbind(x,(rowSums(as.matrix(x[,1:(i-1)])) %% 2))
dep_struct_iterated_13_100 = x
save(dep_struct_iterated_13_100,file ="dep_struct_iterated_13_100.rda")

dep_struct_iterated_13_100: a matrix with 13 variables (columns) and 100 independent samples (rows).
To avoid irritation, note that the seed is just a simple integer hash value of the variable name.
Example dataset for dependence.structure. It was generated by

set.seed(436646700)
N = 100
n = 15
x = matrix(sample.int(2,N*n,replace = TRUE)-1,nrow = N)
# each constructed column is the parity of the indicated columns
x[,4] = rowSums(x[,1:3]) %% 2
x[,7] = rowSums(x[,4:6]) %% 2
x[,10] = rowSums(x[,7:9]) %% 2
x[,13] = rowSums(x[,10:12]) %% 2
x[,15] = rowSums(x[,c(13,14,1)]) %% 2
dep_struct_ring_15_100 = x
save(dep_struct_ring_15_100,file ="dep_struct_ring_15_100.rda")

dep_struct_ring_15_100: a matrix with 15 variables (columns) and 100 independent samples (rows).
To avoid irritation, note that the seed is just a simple integer hash value of the variable name.
Example dataset for dependence.structure. It was generated by

set.seed(1348879148)
N = 100
dep_struct_several_26_100 = cbind(coins(N,2),tetrahedron(N),coins(N,4),
  tetrahedron(N),tetrahedron(N),coins(N,3),coins(N,3),rnorm(N))
save(dep_struct_several_26_100,file ="dep_struct_several_26_100.rda")

dep_struct_several_26_100: a matrix with 26 variables (columns) and 100 independent samples (rows).
To avoid irritation, note that the seed is just a simple integer hash value of the variable name.
Example dataset for dependence.structure. It was generated by

set.seed(222454572)
N = 100
y = coins(N,2)
dep_struct_star_9_100 = cbind(y,y,y)
save(dep_struct_star_9_100,file ="dep_struct_star_9_100.rda")

dep_struct_star_9_100: a matrix with 9 variables (columns) and 100 independent samples (rows).
To avoid irritation, note that the seed is just a simple integer hash value of the variable name.
Determines the dependence structure as described in [3].
dependence.structure( x, vec = 1:ncol(x), verbose = TRUE, detection.aim = NULL, type = "conservative", structure.type = "clustered", c.factor = 2, list.cdm = NULL, alpha = 0.05, p.adjust.method = "holm", stop.too.many = NULL, ... )
x: matrix, each row of the matrix is treated as one sample
vec: vector, it indicates which columns are initially treated together as one sample
verbose: boolean, if TRUE details are printed during the detection
detection.aim: NULL or a list of vectors, see the 'Advanced' part of the details below
type: the method used for the detection, one of 'conservative', 'resample', 'pearson_approx' or 'consistent'
structure.type: either the 'clustered' or the 'full' dependence structure
c.factor: numeric, larger than 0, a constant factor used in the case of 'consistent' estimation
list.cdm: not required; the list of doubly centered distance matrices corresponding to x, which speeds up repeated calls
alpha: numeric between 0 and 1, the significance level used for the tests
p.adjust.method: a string indicating the p-value adjustment for multiple testing, see p.adjust
stop.too.many: numeric, upper limit for the number of tested tuples; a warning is issued if it is used. Use NULL for no limit.
...: these are passed to find.cluster
Performs the detection of the dependence structure as described in [3]. In the clustered structure, variables are clustered and treated as one variable as soon as a dependence is detected; the full structure always treats each variable separately. The detection is either based on tests with significance level alpha, or a consistent estimator is used. The latter yields (in the limit for increasing sample size) under very mild conditions always the correct dependence structure (but the convergence might be very slow).
If fixed.rejection.level is not provided, the significance level alpha is used to determine which multivariances are significant using the distribution-free rejection level. As default the Holm method is used for p-value correction corresponding to multiple testing.
The resulting graph can be simplified (pairwise dependence can be represented by edges instead of vertices) using clean.graph.
Advanced: the argument detection.aim is currently only implemented for structure.type = "clustered". It can be used to check if an expected dependence structure was detected. This might be useful for simulation studies to determine the empirical power of the detection algorithm. Hereto detection.aim is set to a list of vectors which indicate the expected detected dependence structures (one for each run of find.cluster). The first element of each vector is the k for which k-tuples are detected (for this aim the detection stops without success if no k-tuple is found); the other elements indicate to which clusters all present vertices belong after the detection, e.g. c(3,2,2,1,2,1,1,2,1) expects that 3-tuples are detected and that the graph contains 8 vertices (including those representing the detected 3-dependencies); the order of the 2's and 1's indicates which vertices belong to which cluster. If detection.aim is provided, the vector representing the actual detection is printed; thus one can use the output with copy-paste to successively fix the expected detection aims.
Note that a failed detection might invoke the warning:
run$mem == detection.aim[[k]][-1] : longer object length is not a multiple of shorter object length
Returns a list with elements:
multivariances: calculated multivariances,
cdms: calculated doubly centered distance matrices,
graph: graph representing the dependence structure,
detected: boolean, this is only included if a detection.aim is given,
number.of.dep.tuples: vector with the number of dependent tuples for each tested order. For the full dependence structure a value of -1 indicates that all tuples of this order are already lower-order dependent, a value of -2 indicates that there were more than stop.too.many tuples,
structure.type: either clustered or full,
type: the type of p-value estimation or consistent estimation used,
total.number.of.tests: numeric vector with the number of tests for each group of tests,
typeI.error.prob: estimated probability of a type I error,
alpha: significance level used if a p-value estimation procedure is used,
c.factor: factor used if a consistent estimation procedure is used,
parameter.range: significance levels (or 'c.factor' values) which yield the same detection result.
For the theoretic background see the reference [3] given on the main help page of this package: multivariance-package.
# structures for the datasets included in the package
dependence.structure(dep_struct_several_26_100)
dependence.structure(dep_struct_star_9_100)
dependence.structure(dep_struct_iterated_13_100)
dependence.structure(dep_struct_ring_15_100)

# basic examples:
x = coins(100) # 3-dependent
dependence.structure(x)
colnames(x) = c("A","B","C")
dependence.structure(x) # names of variables are used as labels
dependence.structure(coins(100),vec = c(1,1,2))
# 3-dependent rv of which the first two rv are used together as one rv, thus 2-dependence.
dependence.structure(x,vec = c(1,1,2)) # names of variables are used as labels

dependence.structure(cbind(coins(200),coins(200,k = 5)),verbose = TRUE)
# 1,2,3 are 3-dependent, 4,..,9 are 6-dependent

# similar to the previous example, but
# the pair 1,3 is treated as one sample,
# analogously the pair 2,4. In the resulting structure one does not
# see anymore that the dependence of 1,2,3,4 with the rest is due
# to 4.
dependence.structure(cbind(coins(200),coins(200,k = 5)),
  vec = c(1,2,1,2,3,4,5,6,7),verbose = TRUE)

### Advanced:
# How to check the empirical power of the detection algorithm?
# Use a dataset for which the structure is detected, e.g. dep_struct_several_26_100.
# run:
dependence.structure(dep_struct_several_26_100,
  detection.aim = list(c(ncol(dep_struct_several_26_100))))
# The output provides the first detection aim. Now we run the same line with the added
# detection aim
dependence.structure(dep_struct_several_26_100,detection.aim = list(c(3,1, 1, 1, 2,
  2, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1, 2, 8, 9),
  c(ncol(dep_struct_several_26_100))))
# and get the next detection aim ... thus we finally obtain all detection aims.

# now we can run the code with new sample data ....
N = 100
dependence.structure(cbind(coins(N,2),tetrahedron(N),coins(N,4),tetrahedron(N),
  tetrahedron(N),coins(N,3),coins(N,3),rnorm(N)),
  detection.aim = list(c(3,1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 10,
    11, 12, 13, 14, 15, 16, 17, 18, 1, 2, 8, 9),
    c(4,1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11,
      11, 11, 12, 1, 2, 8, 9, 10, 11),
    c(5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7,
      7, 8, 1, 2, 4, 5, 6, 7, 3),
    c(5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7,
      7, 8, 1, 2, 4, 5, 6, 7, 3)))$detected
# ... and one could start to store the results and compute the rate of successes.

# ... or one could try to check how many samples are necessary for the detection:
re = numeric(100)
for (i in 2:100) {
  re[i] = dependence.structure(dep_struct_several_26_100[1:i,],verbose = FALSE,
    detection.aim = list(c(3,1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 10,
      11, 12, 13, 14, 15, 16, 17, 18, 1, 2, 8, 9),
      c(4,1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 7, 8, 8, 8, 9, 9, 9, 10, 10, 10, 10, 11, 11,
        11, 11, 12, 1, 2, 8, 9, 10, 11),
      c(5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7,
        7, 8, 1, 2, 4, 5, 6, 7, 3),
      c(5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7,
        7, 8, 1, 2, 4, 5, 6, 7, 3)))$detected
  print(paste("First", i,"samples. Detected?", re[i]==1))
}
cat(paste("Given the 1 to k'th row the structure is not detected for k =",
  which(re == FALSE),"\n"))
Transforms a matrix (rows: samples, columns: variables) into a matrix of uniform samples with the same dependence structure via the Monte Carlo empirical transform.
emp.transf(x, continuous = FALSE)
x: data matrix (rows: samples, columns: variables)
continuous: boolean; if TRUE the classical (non-Monte-Carlo) transformation by the empirical distribution function is used, which is a reasonable choice for data from continuous distributions
For the theoretic background see the reference [5] given on the main help page of this package: multivariance-package.
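A small sketch of the effect of the transform (the margins become approximately uniform on [0,1]):

x = cbind(rnorm(100),rexp(100)) # two very different marginal distributions
u = emp.transf(x)
apply(u,2,range) # both columns now range within [0,1]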
fast Euclidean distance matrix
fastdist(x)
x: matrix with sample rows for which the distance matrix is computed (to use with vectors, use as.matrix(x))
# require(microbenchmark)
# x = rnorm(100)
# microbenchmark(fastdist(as.matrix(x)), as.matrix(dist(x)))
fast centered Euclidean distance matrix
fastEuclideanCdm(x, normalize)
x: matrix with sample rows for which the distance matrix is computed (to use with vectors, use as.matrix(x))
normalize: boolean; if TRUE the doubly centered distance matrix is normalized (as required for the computation of the normalized multivariance)
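A small consistency sketch (assuming that this helper matches cdm with the default Euclidean distance, as its name and arguments suggest):

x = matrix(rnorm(50*2),ncol = 2)
all.equal(fastEuclideanCdm(x,normalize = TRUE),cdm(x)) # expected to be TRUE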
Performs the detection of dependence structures algorithm until a cluster is found. This function is the basic building block of dependence.structure. Advanced users might use it directly.
find.cluster(
  x,
  vec = 1:ncol(x),
  list.cdm = cdms(x, vec = vec),
  mem = as.numeric(1:max(vec)),
  cluster.to.vertex = 1:max(mem),
  vertex.to.cdm = 1:max(mem),
  previous.n.o.cdms = rep(0, max(mem)),
  all.multivariances = numeric(0),
  g = igraph::add.vertices(igraph::graph.empty(, directed = FALSE), max(mem),
    label = sapply(1:max(mem), function(r) paste(colnames(x, do.NULL = FALSE,
      prefix = "")[vec == r], collapse = ",")), shape = "circle"),
  fixed.rejection.level = NA,
  alpha = 0.05,
  p.adjust.method = "holm",
  verbose = TRUE,
  kvec = 2:max(mem),
  parameter.range = NULL,
  type = "conservative",
  stop.too.many = NULL,
  ...
)
x: matrix with the samples
vec: vector, it indicates which columns are initially treated together as one sample
list.cdm: list of doubly centered distance matrices
mem: numeric vector, its length is the number of vertices, its content is the number of the corresponding cluster for the current iteration, i.e., vertex i belongs to cluster mem[i]
cluster.to.vertex: vector, contains the cluster to vertex relations, i.e., cluster.to.vertex[i] is the index of the vertex which represents cluster i
vertex.to.cdm: vector, contains the vertex to doubly centered distance matrix relations, i.e., vertex.to.cdm[i] is the index of the doubly centered distance matrix in list.cdm corresponding to vertex i
previous.n.o.cdms: vector, number of the doubly centered distance matrices in the previous iteration (it is used to ensure that previously checked tuples are not checked again)
all.multivariances: vector which contains all distance multivariances that have been calculated so far; only used to finally return all distance multivariances which have been calculated
g: dependence structure graph
fixed.rejection.level: vector; if not NA, then fixed.rejection.level[k] is used as rejection level for the tests of k-tuples (instead of the level derived from alpha)
alpha: numeric, significance level used for the (distribution-free) tests
p.adjust.method: name of the method used to adjust the p-values for multiple testing, see p.adjust
verbose: boolean, if TRUE details are printed during the detection
kvec: vector, k-tuples are only checked for each k in kvec
parameter.range: numeric matrix which hosts the range of significance levels or 'c.factor' values which yield the same detection result
type: the method for the detection, one of 'conservative', 'resample', 'pearson_approx' or 'consistent'
stop.too.many: numeric, upper limit for the number of tested tuples; a warning is issued if it is used. Use NULL for no limit.
...: are passed on to the internally invoked functions
For further details see dependence.structure.
Deprecated. Use multivariance.test instead; it provides all options and returns the test result in a standard R format.
independence.test( x, vec = 1:ncol(x), alpha = 0.05, type = "distribution_free", verbose = TRUE, ... )
x: either a data matrix or a list of doubly centered distance matrices
vec: if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
alpha: significance level
type: one of "distribution_free", "resample" or "pearson_approx"
verbose: logical, if TRUE meaningful text output is generated
...: these are passed on (e.g. to select a different distance, see cdm for the accepted arguments)
This computes a test of independence for the columns of a sample matrix (required for the resampling test) or for given doubly centered distance matrices (only possible for the distribution-free test).
The "pearson_approx" and "resample" tests are approximately sharp. The latter is based on a resampling approach and thus much slower. The "distribution_free" test might be very conservative.
The doubly centered distance matrices can be prepared by cdms. But note that for the test based on Pearson's approximation and for the resampling test the data matrix has to be given.
Returns TRUE if the hypothesis of independence is NOT rejected, otherwise FALSE.
For the theoretic background see the references given on the main help page of this package: multivariance-package.
independence.test(coins(100)) # dependent sample which is 2-independent
independence.test(coins(100),type = "resample") # dependent sample which is 2-independent
independence.test(coins(100)[,2:3]) # independent sample
independence.test(coins(100)[,2:3],type = "resample") # independent sample
independence.test(coins(10),type = "resample") # dependent sample which is 2-independent
independence.test(coins(10)[,2:3],type = "resample") # independent sample
It places the variable nodes on an outer circle and the dependency nodes on an inner circle.
layout_on_circles(g, n = sum(is.na(igraph::V(g)$level)))
g: graph
n: number of vertices on the outer circle
This is the standard layout for the full dependence structure, since in this case there are often too many nodes, which makes the other (usual) layout incomprehensible.
N = 200
y = coins(N,2)
x = cbind(y,y,y)
g = dependence.structure(x,structure.type = "clustered",verbose = FALSE)$graph
plot(g)
plot(g,layout = layout_on_circles(g))
Computes m distance multivariance.
m.multivariance( x, vec = NA, m = 2, Nscale = TRUE, Escale = TRUE, squared = TRUE, ... )
x: either a data matrix or a list of doubly centered distance matrices
vec: if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
m: the m of m-multivariance; currently only the values 2 and 3 are implemented
Nscale: if TRUE the multivariance is scaled up by the sample size (and thus it is exactly as required for the test of independence)
Escale: if TRUE it is scaled by the number of multivariances which are theoretically summed up (in the case of independence this yields an expectation of 1)
squared: if FALSE it returns the actual multivariance, otherwise the squared multivariance (less computation)
...: these are passed to cdms (which is only invoked if x is a matrix)
m-distance multivariance is by definition the scaled sum of certain distance multivariances, and it characterizes m-dependence.
As a rough guide to interpret the value of m-multivariance note:
Large values indicate dependence.
If the random variables are (m-1)-independent and Nscale = TRUE, values close to 1 and smaller indicate m-independence, larger values indicate dependence. In fact, in the case of independence the test statistic is a Gaussian quadratic form with expectation 1, and samples of it can be generated by resample.multivariance.
If the random variables are (m-1)-independent and Nscale = FALSE, small values (close to 0) indicate m-independence, larger values indicate dependence.
Since random variables are always 1-independent, the case m=2 characterizes pairwise independence.
Finally note that due to numerical (in)precision the value of m-multivariance might become negative. In these cases it is set to 0. A warning is issued if the value is negative and further than the usual tolerance (as used by all.equal) away from 0.
For the theoretic background see the reference [3] given on the main help page of this package: multivariance-package.
x = matrix(rnorm(3*30),ncol = 3)

# the following values are identical
m.multivariance(x,m = 2)
1/choose(3,2)*(multivariance(x[,c(1,2)]) +
               multivariance(x[,c(1,3)]) +
               multivariance(x[,c(2,3)]))

# the following values are identical
m.multivariance(x,m = 3)
multivariance(x)

# the following values are identical
1/4*(3*(m.multivariance(x,m = 2)) + m.multivariance(x,m = 3))
total.multivariance(x, Nscale = TRUE)
1/4*(multivariance(x[,c(1,2)], Nscale = TRUE) +
     multivariance(x[,c(1,3)], Nscale = TRUE) +
     multivariance(x[,c(2,3)], Nscale = TRUE) +
     multivariance(x, Nscale = TRUE))
Computes various types of sample distance multicorrelation as defined and discussed in [3,4,6].
multicorrelation(x, vec = 1:ncol(x), type = "total.upper.lower", multicorrelation.type = "normalized", estimator.type = "bias.corrected", squared = TRUE, ...)
Mcor(x, vec = 1:ncol(x), type = "total.upper.lower", multicorrelation.type = "normalized", estimator.type = "bias.corrected", squared = TRUE, ...)
x: either a data matrix or a list of doubly centered distance matrices
vec: if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
type: default: "total.upper.lower", for details and other options see below
multicorrelation.type: one of "normalized" or "unnormalized"
estimator.type: one of "biased" or "bias.corrected"
squared: if FALSE it returns the actual multicorrelation, otherwise the squared multicorrelation
...: these are passed on (e.g. to select a different distance, see cdm)
There exist many variants of distance multicorrelation as discussed in [6], and only in specific cases a direct comparison of the values is meaningful.
The implemented options are:
total.upper.lower normalized bias.corrected: default; bounded by 1; fast; population limit characterizes independence by 0.
pairwise normalized bias.corrected: bounded by 1; fast; population limit characterizes pairwise independence by 0.
total.upper normalized biased, total.lower normalized biased, pairwise normalized biased: biased versions of the above.
multi normalized biased: population limit characterizes, only in the case of lower independence, the independence of all variables by 0.
m.multi.3 normalized biased: population limit characterizes, only in the case of pairwise independence, the 3-independence of all variables by 0.
pairwise unnormalized biased: population limit characterizes pairwise independence by 0 and relation by similarity transforms by 1.
multi unnormalized biased: population limit characterizes, only in the case of lower independence, the independence of all variables by 0 and relation by similarity transforms by 1.
m.multi.3 unnormalized biased: population limit characterizes, only in the case of pairwise independence, the 3-independence of all variables by 0 and relation by similarity transforms by 1.
Further details:
The "bias.corrected" versions require a data matrix, since they compute bias corrected centered distance matrices.
For "multi" the unnormalized and the normalized version coincide if an even number of variables is considered; they usually differ if an odd number of variables is considered. If all variables are related by similarity transforms the "unnormalized" multicorrelations are 1.
For "pairwise" an alias is "m.multi.2".
For total multicorrelation there is currently only a feasible empirical estimator for a lower or upper bound. These are upper and lower bounds for the total multicorrelation in the population setting. When using bias corrected estimators these are in general no proper bounds, but their range can be used as values for comparisons.
Value of the multicorrelation(s).
For the theoretic background see the references [2,3,6] given on the main help page of this package: multivariance-package.
y = rnorm(100)
x = cbind(y,y*2,(y-2)/3,y+1,y*5) # all variables are related by similarity transforms

# compute all types of correlations for x:
for (ty in c("total.lower","total.upper","pairwise","m.multi.3","multi"))
  for (mty in c("normalized"))
    print(paste(format(multicorrelation(
      x,type = ty,multicorrelation.type = mty,estimator.type = "biased"),
      digits = 3,nsmall = 3,width = 7),mty,ty,"correlation - biased estimate"))

for (ty in c("total.upper.lower","pairwise"))
  for (mty in c("normalized"))
    print(paste(format(multicorrelation(
      x,type = ty,multicorrelation.type = mty,estimator.type = "bias.corrected"),
      digits = 3,nsmall = 3,width = 7),mty,ty,"correlation - bias corrected estimate"))

for (ty in c("m.multi.2","m.multi.3","multi"))
  for (mty in c("unnormalized"))
    print(paste(format(multicorrelation(
      x,type = ty,multicorrelation.type = mty,estimator.type = "biased"),
      digits = 3,nsmall = 3,width = 7),mty,ty,"correlation - biased estimate"))
Computes the distance multivariance, either for given data or a given list of doubly centered distance matrices.
multivariance( x, vec = NA, Nscale = TRUE, correlation = FALSE, squared = TRUE, ... )
x: either a data matrix or a list of doubly centered distance matrices
vec: if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
Nscale: if TRUE the multivariance is scaled up by the sample size (and thus it is exactly as required for the test of independence)
correlation: deprecated, please use the function multicorrelation instead
squared: if FALSE it returns the actual multivariance, otherwise the squared multivariance (less computation)
...: these are passed to cdms (which is only invoked if x is a matrix)
If x is a matrix and vec is not given, then each column is treated as a separate sample. Otherwise vec has to have as many elements as x has columns, with values starting from 1 up to the number of 'variables'; e.g. if x is an N by 5 matrix and vec = c(1,2,1,3,1), then the multivariance of the 1-dimensional variables represented by columns 2 and 4 and the 3-dimensional variable represented by columns 1, 3 and 5 is computed.
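For example, directly implementing the case described above:

x = matrix(rnorm(100*5),ncol = 5) # N = 100 samples with 5 columns
# columns 1,3,5 are treated as one 3-dimensional variable,
# columns 2 and 4 as separate 1-dimensional variables:
multivariance(x,vec = c(1,2,1,3,1))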
As default it computes the normalized Nscaled squared multivariance; for a multivariance without normalization the argument normalize = FALSE has to be passed to cdms.
correlation = TRUE yields values between 0 and 1. These can be interpreted similarly to classical correlations, see also multicorrelation.
As a rough guide to interpret the value of distance multivariance note:
If the random variables are not (n-1)-independent, large values indicate dependence, but small values are meaningless. Thus in this case use total.multivariance.
If the random variables are (n-1)-independent and Nscale = TRUE, values close to 1 and smaller indicate independence, larger values indicate dependence. In fact, in the case of independence the test statistic is a Gaussian quadratic form with expectation 1, and samples of it can be generated by resample.multivariance.
If the random variables are (n-1)-independent and Nscale = FALSE, small values (close to 0) indicate independence, larger values indicate dependence.
Finally note that due to numerical (in)precision the value of multivariance might become negative. In these cases it is set to 0. A warning is issued if the value is negative and further than the usual tolerance (as used by all.equal) away from 0.
For the theoretic background see the references given on the main help page of this package: multivariance-package.
multivariance(matrix(rnorm(100*3),ncol = 3)) # independent sample
multivariance(coins(100)) # dependent sample which is 2-independent

x = matrix(rnorm(100*2),ncol = 2)
x = cbind(x,x[,2])
multivariance(x) # dependent sample which is not 2-independent
                 # (thus small values are meaningless!)
multivariance(x[,1:2]) # these are independent
multivariance(x[,2:3]) # these are dependent
multivariance(x[,2:3],correlation = TRUE)
Computes a conservative p-value for the hypothesis of independence for a given multivariance / m-multivariance / total multivariance.
multivariance.pvalue(x)
x: value of a normalized (and Nscaled) multivariance, total multivariance or m-multivariance
This is based on a distribution-free approach. The p-value is conservative, i.e. the actual p-value might be much smaller. This is the counterpart to rejection.level. For a less conservative approach see resample.pvalue or pearson.pvalue.
p-values larger than 0.215 might be incorrect, since the distribution-free estimate on which the computation is based only holds up to 0.215.
For the theoretic background see the references given on the main help page of this package: multivariance-package.
This performs the independence test (as specified by type and p.value.type) for the columns of a sample matrix.
multivariance.test( x, vec = 1:ncol(x), type = "total", p.value.type = "pearson_approx", verbose = TRUE, ... )
x: matrix, each row is a sample
vec: vector which indicates which columns are treated as one sample
type: one of "independence", "total", "pairwise independence", "m.multi.2", "m.multi.3" or "multi" (see details below)
p.value.type: one of "pearson_approx", "distribution_free", "resample" or "pearson_unif" (see details below)
verbose: logical, if TRUE meaningful text output is generated
...: these are passed to cdm (e.g. to select a different distance, see below)
For the use of vec see the examples below and the more detailed explanation of this argument for multivariance.
The types "independence" and "total" are identical: an independence test is performed.
Also the types "pairwise independence" and "m.multi.2" are identical: a test of pairwise independence is performed.
The type "m.multi.3" performs a test for 3-independence, assuming pairwise independence. The type "multi" performs a test for n-independence, assuming (n-1)-independence.
There are several ways (determined by p.value.type) to estimate the p-value: the "pearson_approx" and "resample" methods are approximately sharp. The latter is based on a resampling approach and thus much slower. The "distribution_free" test might be very conservative; its p-value estimates are only valid for p-values lower than 0.215, values above should be interpreted as "values larger than 0.215". Finally, "pearson_unif" uses fixed parameters in Pearson's estimate; it is only applicable for univariate uniformly distributed marginals.
All tests are performed using the standard Euclidean distance. Other distances can be supplied via the ..., see cdm for the accepted arguments.
A list with class "htest" containing the following components:
statistic: the value of the test statistic,
p.value: the p-value of the test statistic,
method: a character string indicating the type of test performed,
data.name: a character string giving the name(s) of the data.
For the theoretic background see the references given on the main help page of this package: multivariance-package.
# an independence test
multivariance.test(dep_struct_several_26_100,p.value.type = "distribution_free") # conservative
multivariance.test(dep_struct_several_26_100,p.value.type = "resample") # sharp but slow
multivariance.test(dep_struct_several_26_100,p.value.type = "pearson_approx")

# as an example, all tests for one data set:
coins100 = coins(100)
for (ty in c("total","m.multi.2","m.multi.3","multi"))
  for (pvt in c("distribution_free","resample","pearson_approx"))
    print(multivariance.test(coins100,type = ty,p.value.type = pvt))

# using the vec argument:
x = matrix(rnorm(50*6),ncol = 6) # a 50x6 data matrix
vec = c(1,2,3,4,5,6) # each column is treated as one variable
multivariance.test(x,vec,p.value.type = "distribution_free") # is the same as the default
vec = c(1,2,2,1,3,1)
# column 1,4,6 are treated as one variable
# column 2,3 are treated as one variable
# column 5 is treated as one variable
multivariance.test(x,vec,p.value.type = "distribution_free")
Estimates the computation time. This is relatively rough. First run with determine.parameters = TRUE (which takes a while). Then use the computed parameters to determine the computation time and/or the sample size.
multivariance.timing( N = NULL, n, sectime = NULL, coef.cdm = 15.2, coef.prod = 2.1, coef.sum = 1.05, determine.parameters = FALSE )
N: number of samples. If NULL it is computed for the given sectime.
n: number of variables
sectime: desired computation time in seconds. If NULL the computation time for the given N is estimated.
coef.cdm: computation time parameter for the doubly centered distance matrices
coef.prod: computation time parameter for matrix products
coef.sum: computation time parameter for matrix sums
determine.parameters: if TRUE the parameters for the current machine are determined
When detecting the parameters, the median of the computation times is used.
Ns = (1:100)*10
ns = 1:100
fulltime = outer(Ns,ns,FUN = function(N,n) multivariance.timing(N,n))
contour(Ns,ns,fulltime,xlab = "N",ylab = "n",
  main = "computation time of multivariance in secs",
  sub = "using default parameters - use 'determine.parameters = TRUE' to compute machine specific values")

# Run to determine the parameters of your system:
# multivariance.timing(determine.parameters = TRUE)
Computes simultaneously multivariance, total multivariance, 2-multivariance and 3-multivariance.
multivariances.all(x, vec = NA, Nscale = TRUE, squared = TRUE, ...)
x: either a data matrix or a list of doubly centered distance matrices
vec: if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
Nscale: if TRUE the multivariances are scaled up by the sample size (and thus they are exactly as required for the tests of independence)
squared: if FALSE it returns the actual multivariances, otherwise the squared multivariances (less computation)
...: these are passed to cdms (which is only invoked if x is a matrix)
The computation is faster than the separate computations.
Returns a vector with multivariance, total.multivariance, 2-multivariance and 3-multivariance.
See also multivariance, total.multivariance and m.multivariance.
x = coins(100,k = 3)
multivariances.all(x)
# yields the same as:
multivariance(x)
total.multivariance(x)
m.multivariance(x,m = 2)
m.multivariance(x,m = 3)
Computes the p-value of a sample using Pearson's approximation of Gaussian quadratic forms with the estimators developed by Berschneider and Böttcher in [4].
pearson.pvalue(x, vec = NA, type = "multi", ...)
x | matrix, the rows should be iid samples
vec | vector, which indicates which columns of x are treated together as one sample
type | one of "multi", "total", "m.multi.2", "m.multi.3"
... | these are passed to cdms (which is only invoked if x is a matrix)
This is the method recommended in [4], i.e., using Pearson's quadratic form estimate with the unbiased finite sample estimators for the mean and variance of normalized multivariance together with the unbiased estimator for the limit skewness.
For the theoretic background see the reference [4] given on the main help page of this package: multivariance-package.
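This entry ships without an example; the following minimal sketch (data and settings are illustrative assumptions) applies the approximation to an independent normal sample, for which large p-values are expected:

x = matrix(rnorm(100*3), ncol = 3)  # an independent sample
pearson.pvalue(x)                   # default: type = "multi"
pearson.pvalue(x, type = "total")   # based on total multivariance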
Approximation of the value of the distribution function of a Gaussian quadratic form, based on its first three moments.
pearson.qf(x, moment, lower.tail = TRUE, verbose = FALSE)
x | value at which the distribution function is to be evaluated
moment | vector with the mean, variance and skewness of the quadratic form
lower.tail | logical, indicating whether the lower or the upper tail of the distribution function should be calculated
verbose | logical, if TRUE additional details of the computation are printed
This is Pearson's approximation for Gaussian quadratic forms as stated in [4] (equation (4.65) in arXiv:1808.07280v2).
For the theoretic background see the reference [4] given on the main help page of this package: multivariance-package.
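As a quick plausibility sketch (this check is an editorial assumption, not part of the original docs): the chi-squared distribution with one degree of freedom is itself a Gaussian quadratic form with mean 1, variance 2 and skewness sqrt(8), so pearson.qf should closely reproduce pchisq:

# chi^2_1 is a Gaussian quadratic form with mean 1, variance 2, skewness sqrt(8)
pearson.qf(2.5, moment = c(1, 2, sqrt(8)))  # approximate P(chi^2_1 <= 2.5)
pchisq(2.5, df = 1)                         # exact value, for comparison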
Under independence, the probability for the normalized and Nscaled (squared) multivariance to be above this level is less than alpha. The same holds for the normalized, Nscaled and Escaled (squared) total multivariance and m-multivariance.
rejection.level(alpha)
alpha | level of significance
This is based on a distribution-free approach. The value might be very conservative. It is the counterpart to multivariance.pvalue; for a less conservative approach see resample.rejection.level. The estimate is only valid for alpha smaller than 0.215.
rejection.level(0.05) # the rejection level, for comparison with the following values
total.multivariance(matrix(rnorm(100*3), ncol = 3)) # independent sample
total.multivariance(coins(100)) # dependent sample which is 2-independent

# and the p values are (to compare with alpha)
multivariance.pvalue(total.multivariance(matrix(rnorm(100*3), ncol = 3))) # independent sample
multivariance.pvalue(total.multivariance(coins(100))) # dependent sample which is 2-independent

## Not run:
# visualization of the rejection level
curve(rejection.level(x), xlim = c(0.001, 0.215), xlab = "alpha")
## End(Not run)
The distribution of the test statistic under the hypothesis of independence is required for the independence tests. This function generates approximate samples of this distribution either by sampling without replacement (permutations) or by sampling with replacement (bootstrap).
resample.multivariance( x, vec = 1:ncol(x), times = 300, type = "multi", resample.type = "permutation", ... )
x | matrix, the rows should be iid samples
vec | vector, which indicates which columns of x are treated together as one sample
times | integer, number of samples to generate
type | one of "multi", "total", "m.multi.2", "m.multi.3", "all"
resample.type | one of "permutation", "bootstrap"; samples are generated without replacement (permutation) or with replacement (bootstrap)
... | is passed to multivariance, total.multivariance or m.multivariance, respectively
The resampling is done by sampling from the original data either without replacement ("permutation") or with replacement ("bootstrap"). Resampling without replacement is (much) faster, due to special identities which only hold in this case.
For convenience, the actual (total/m-)multivariance of the data and its p-value are also computed.
A list with elements:
resampled | the (total/m-)multivariances of the resampled data
original | the (total/m-)multivariance of the original data
p.value | the p-value of the original data, computed using the resampled data
For the theoretic background see the reference [3] given on the main help page of this package: multivariance-package.
re.m = resample.multivariance(matrix(rnorm(30*2), nrow = 30),
                              type = "multi", times = 300)$resampled
curve(ecdf(re.m)(x), xlim = c(0, 4),
      main = "empirical distribution of the test statistic under H_0")
Uses a resampling method to generate samples of the test statistic under the hypothesis of independence. Based on these, the p-value of a given value of the test statistic is computed.
resample.pvalue(value, ...)
value | numeric, the value of (total-/m-)multivariance for which the p-value shall be computed
... | passed to resample.multivariance
This function is useful if a p-value of a test statistic shall be computed based on the resampled test-statistic values of a different sample. For the p-value based on the same sample, resample.multivariance(...)$p.value is sufficient.
It returns 1 minus the value of the empirical distribution function of the resampling samples evaluated at the given value.
For the theoretic background see the reference [3] given on the main help page of this package: multivariance-package.
x = coins(100)
resample.pvalue(multivariance(x), x = x, times = 300)
resample.pvalue(multivariances.all(x), x = x, times = 300, type = "all")
Uses the resample method to sample from the test statistic under the hypothesis of independence. The empirical upper alpha quantile of these samples is returned.
resample.rejection.level(alpha = 0.05, ...)
alpha | numeric, the significance level
... | passed to resample.multivariance
For the theoretic background see the reference [3] given on the main help page of this package: multivariance-package.
resample.rejection.level(0.05, matrix(rnorm(30*2), nrow = 30))
resample.rejection.level(0.05, matrix(rnorm(30*3), nrow = 30), vec = c(1,1,2))
Resamples doubly centered distance matrices.
sample.cdms(list.cdm, replace = FALSE, incl.first = FALSE)
list.cdm | a list of doubly centered distance matrices
replace | boolean, sampling with replacement (TRUE) or without replacement (FALSE)
incl.first | boolean, if TRUE the first matrix of the list is also resampled
Returns a list of doubly centered distance matrices, each matrix corresponds to the resampled columns of the corresponding sample, using resampling with replacement (bootstrap) or without replacement (permutations).
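No example accompanies this entry; the sketch below is an illustrative assumption. It builds the list of doubly centered distance matrices with the package's cdms helper (the function the other entries on this page pass their ... arguments to) and then permutes it:

x = matrix(rnorm(10*3), ncol = 3)
list.cdm = cdms(x)  # doubly centered distance matrices, one per column
# permutation resampling (without replacement); by default the first
# matrix is kept as-is (incl.first = FALSE)
resampled = sample.cdms(list.cdm, replace = FALSE)
length(resampled)   # still one matrix per variable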
Resamples the columns of a matrix.
sample.cols(x, vec = 1:ncol(x), replace = TRUE, incl.first = TRUE)
x | matrix
vec | vector, indicates which columns belong together
replace | boolean, sampling with replacement (TRUE) or without replacement (FALSE)
incl.first | boolean, if TRUE the first component is also resampled
Returns a matrix with the same dimensions as x. The columns are resampled from the original columns, with replacement (replace = TRUE) or without (replace = FALSE). Columns which belong together (indicated by vec) are resampled identically, i.e., the rows of these columns are kept together.
sample.cols(matrix(1:15,nrow = 5),vec = c(1,1,2))
This function creates samples of a tetrahedron die whose faces are colored r, g, b and rgb. Each sample indicates whether the colors r, g and b appear on the bottom face of the thrown die.
tetrahedron(N = 1000)
N | number of samples
It returns the samples of the events r, g and b as rows of an N by 3 matrix (the first column corresponds to r, the second to g, the third to b). TRUE indicates that the color is on the bottom face of the die. The columns are dependent but 2-independent.
For the theoretic background see the reference [3] given on the main help page of this package: multivariance-package.
tetrahedron(10)
Computes the total distance multivariance.
total.multivariance( x, vec = NA, lambda = 1, Nscale = TRUE, Escale = TRUE, squared = TRUE, ... )
x | either a data matrix or a list of doubly centered distance matrices
vec | if x is a matrix, then this indicates which columns are treated together as one sample; if x is a list, these are the indexes for which the multivariance is calculated. The default is all columns and all indexes, respectively.
lambda | a scaling parameter > 0; each k-tuple multivariance gets weight lambda^(n-k)
Nscale | if TRUE the multivariance is scaled up by the sample size (as required for the independence tests)
Escale | if TRUE the sum is scaled by the number of multivariances which are summed up
squared | if FALSE the non-squared total multivariance is returned
... | these are passed to cdms (which is only invoked if x is a matrix)
Total distance multivariance is, by definition, the scaled sum of certain distance multivariances, and it characterizes dependence.
As a rough guide to interpreting the value of total distance multivariance note:

- Large values indicate dependence.
- For Nscale = TRUE, values close to 1 or smaller indicate independence, larger values indicate dependence. In fact, in the case of independence the test statistic is a Gaussian quadratic form with expectation 1, and samples of it can be generated by resample.multivariance.
- For Nscale = FALSE, small values (close to 0) indicate independence, larger values indicate dependence.
Finally, note that due to numerical (in)precision the value of total multivariance might become negative. In these cases it is set to 0. A warning is issued if the value is negative and further from 0 than the usual tolerance (as used by all.equal).
For the theoretic background see the references given on the main help page of this package: multivariance-package.
x = matrix(rnorm(100*3), ncol = 3)
total.multivariance(x) # for an independent sample
# the value coincides with
(multivariance(x[,c(1,2)], Nscale = TRUE) + multivariance(x[,c(1,3)], Nscale = TRUE) +
  multivariance(x[,c(2,3)], Nscale = TRUE) + multivariance(x, Nscale = TRUE))/4
total.multivariance(coins(100)) # value for a dependent sample which is 2-independent