The previous report (April 4, 2013) devised a method for generating a cost matrix with a given correlation coefficient between rows and columns. This method has however some limitations. The objective is thus to adapt it by replacing its irrelevant arguments with the distributions of each row and of each column.
The core idea is to provide a vector of speeds (a row) and a vector of costs (a column) along with the corresponding distribution generators. Each column will be a weighted sum of a common vector (the costs) and a newly generated vector. The same applies to the rows (the common vector will be the speeds). The difficulty is to determine the weights and the order of construction so as to satisfy all the requirements.
Let's start by building the matrix Z by enforcing the correlation rhoC between any pair of columns:
Zj <- alpha*C+beta*Cj
Then, we can derive the correlation:
rhoC = alpha^2*sigmaC^2/(alpha^2*sigmaC^2+beta^2*sigmaCj^2)
We assume that the standard deviations of C and Cj are identical. Additionally, setting the mean of each column to rj requires alpha+beta=rj/muC. Therefore,
alpha <- rj/(muC*(1+sqrt(1/rhoC-1)))
beta <- rj*sqrt(1/rhoC-1)/(muC*(1+sqrt(1/rhoC-1)))
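As a quick sanity check, these coefficients can be evaluated in R; the distributions, rj, and rhoC below are arbitrary test values, not taken from the report:

```r
# Sanity check of the column construction (arbitrary test values)
set.seed(42)
n <- 1e5
rhoC <- 0.6                                  # target correlation between columns
rj <- 30                                     # target mean of column j
C <- runif(n, 50, 100)                       # common cost vector
muC <- mean(C)
s <- sqrt(1/rhoC - 1)
alpha <- rj/(muC * (1 + s))
beta <- rj * s/(muC * (1 + s))
Zj <- alpha * C + beta * runif(n, 50, 100)   # column j
Zk <- alpha * C + beta * runif(n, 50, 100)   # another column
mean(Zj)                                     # close to rj = 30
cor(Zj, Zk)                                  # close to rhoC = 0.6
```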
Now, let's transform the matrix to enforce the correlation between any pair of rows rhoR:
Zi' <- alpha'*R+beta'*Zi
The values alpha' and beta' must be constant across rows to preserve the previous correlations. Let's first determine the initial mean of Zi:
muZi = alpha*ci+beta*muC
muZi = muR*(ci/muC+sqrt(1/rhoC-1))/(1+sqrt(1/rhoC-1))
Moreover:
sigmaZi = beta*sigmaC
sigmaZi = rj*sqrt(1/rhoC-1)/(muC*(1+sqrt(1/rhoC-1)))*sigmaC
Now, we want the mean of Zi' to be muR*ci. Let's start with a few extreme cases to see whether this is feasible. When rhoC=1:
alpha <- rj/muC
beta <- 0
Zj <- rj*C/muC
muZi = muR*ci/muC
Then, if rhoR=1:
alpha' <- muC
beta' <- 0
Zi' <- muC*R
In this case, the mean of Zj' is correctly rj*muC, but rhoC=0 (which is acceptable as a limit case). If rhoR=0, then:
alpha' <- 0
beta' <- muC
Zi' <- muC*Zi
In this case, the mean of Zj' is correctly rj*muC. Let's start again with rhoC=0:
alpha <- 0
beta <- rj/muC
Zj <- rj*Cj/muC
muZi = muR
Then, if rhoR=1:
alpha' <- muC
beta' <- 0
Zi' <- muC*R
In this case, the mean of Zj' is correctly rj*muC. If rhoR=0, then:
alpha' <- 0
beta' <- muC
Zi' <- muC*Zi
The mean of Zj' is muC*muR, which is incorrect. Testing these extreme cases is inconclusive.
Let's compute the mean of Zi':
muZi' = alpha'*muR+beta'*muR*(ci/muC+sqrt(1/rhoC-1))/(1+sqrt(1/rhoC-1))
Now let's compute the correlation coefficient:
rhoR = alpha'^2*sigmaR^2/(alpha'^2*sigmaR^2+beta'^2*sigmaZi^2)
Let's assume we scale R such that its variance is the same as that of Zi. We want:
alpha'*muR + beta'*muR*(ci/muC+sqrt(1/rhoC-1))/(1+sqrt(1/rhoC-1)) = muR*ci
alpha'^2/(alpha'^2+beta'^2) = rhoR
alpha' = -muR*sqrt(1/rhoC-1)/(1+sqrt(1/rhoC-1))
beta' = muR^2
The mean of Zj' will then be (alpha'+beta')*rj.
This approach suffers from an insufficient number of parameters relative to the number of constraints. Let's nevertheless implement the first approach to get a numerical intuition of the problem.
generate_matrix_intuition <- function(n = 1, R, C, rdistr, rdistc, rhor, rhoc) {
    if (length(R) == 1)
        R <- rdistr(R)
    if (length(C) == 1)
        C <- rdistc(C)
    Z <- sapply(1:length(R), function(j)
        return(R[j]/(mean(C) * (1 + sqrt(1/rhoc - 1))) * C +
               R[j] * sqrt(1/rhoc - 1)/(mean(C) * (1 + sqrt(1/rhoc - 1))) *
               rdistc(length(C))))
    return(Z)
}
rdistr <- function(n) {
    return(runif(n, 20, 50))
}
rdistc <- function(n) {
    return(runif(n, 50, 100))
}
Z <- generate_matrix_intuition(1, 100, 100, rdistr, rdistc, 0.5, 0.99)
cor.test(Z[1, ], Z[2, ])$estimate
## cor
## 0.9947
Multiplying by rj introduces a correlation between the rows that must be avoided. Specifying the mean of each row and each column seems incompatible with simultaneously specifying the correlations between rows and between columns.
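This diagnosis can be confirmed with a small experiment: rebuilding the matrix with a constant rj removes the spurious row correlation. This is a self-contained sketch with arbitrary values, not part of the method itself:

```r
# Row correlation with varying vs constant rj (arbitrary test values)
set.seed(1)
nc <- 1e4                          # many columns, to measure the row correlation
rhoc <- 0.99
C <- runif(10, 50, 100)            # 10 rows
muC <- mean(C)
s <- sqrt(1/rhoc - 1)
gen_Z <- function(Rvals) sapply(Rvals, function(rj)
    rj/(muC * (1 + s)) * C + rj * s/(muC * (1 + s)) * runif(length(C), 50, 100))
Zvar <- gen_Z(runif(nc, 20, 50))   # distinct rj per column
Zcst <- gen_Z(rep(35, nc))         # constant rj
cor(Zvar[1, ], Zvar[2, ])          # close to 1: induced by the varying rj
cor(Zcst[1, ], Zcst[2, ])          # close to 0: the coupling disappears
```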
Let's reconsider the same method, except that we no longer specify the mean of each row and each column.
Zj <- alpha*C+beta*Cj
We assume again that the standard deviations of C and Cj are identical. This time, we set the mean of each column to muR, i.e., alpha+beta=muR/muC. The system is therefore:
rhoC = alpha^2/(alpha^2+beta^2)
alpha+beta = muR/muC
alpha <- muR/(muC*(1+sqrt(1/rhoC-1)))
beta <- muR*sqrt(1/rhoC-1)/(muC*(1+sqrt(1/rhoC-1)))
This seems to lead to the same problems. Let's explore another generation method.
Let's keep the first part of the method and study the operation we would like in the extreme cases. If rhoC=rhoR=1, then we want Zj <- C and then Zi' <- R * Zi. If rhoC=rhoR=0, then Zj <- Cj and then Zi' <- Ri * Zi. Let's add some parameters:
Zj <- alpha*C+beta*Cj
Zi' <- R*Zi
We see that the second transformation is only multiplicative with distinct values for each column. It may become necessary to insert an additive term.
From the previous analysis, we know that:
muZi = (alpha+beta)*muC
sigmaZi = beta*sigmaC
Now let's study the correlation between any pair of rows:
rhoR = sigmaR^2*muZi^2/(sigmaR^2*sigmaZi^2+sigmaR^2*muZi^2+muR^2*sigmaZi^2)
which simplifies as:
rhoR = 1/(1+(1+1/covR^2)*covZi^2)
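This simplification can be verified by simulation; the moments below are arbitrary and cov denotes the coefficient of variation (sd/mean):

```r
# Monte Carlo check of rhoR = 1/(1+(1+1/covR^2)*covZi^2) (arbitrary moments)
set.seed(7)
nc <- 1e5
muZ <- 75; sdZ <- 10               # arbitrary moments for two rows of Z
R <- runif(nc, 20, 50)             # one multiplicative factor per column
Zi <- rnorm(nc, muZ, sdZ)          # row i
Zk <- rnorm(nc, muZ, sdZ)          # row k, independent of row i
covR <- sd(R)/mean(R)
covZ <- sdZ/muZ
empirical <- cor(R * Zi, R * Zk)
predicted <- 1/(1 + (1 + 1/covR^2) * covZ^2)
c(empirical, predicted)            # both should be close to 0.76
```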
Now, let's determine the previous values of alpha and beta. We have the following system:
rhoC = alpha^2/(alpha^2+beta^2)
rhoR = 1/(1+(1+1/covR^2)*covZi^2)
covZi = beta/(alpha+beta)*covC
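Before solving the system, the covZi relation can be checked numerically; for a representative row we take ci = muC, and the coefficients are arbitrary:

```r
# Check covZi = beta/(alpha+beta)*covC for a representative row (ci = muC)
set.seed(5)
nc <- 1e5
alpha <- 0.3; beta <- 0.9          # arbitrary coefficients
muC <- 75; sigmaC <- 50/sqrt(12)   # moments of U(50, 100)
covC <- sigmaC/muC
row_i <- alpha * muC + beta * runif(nc, 50, 100)  # row i across the columns
c(sd(row_i)/mean(row_i), beta/(alpha + beta) * covC)  # should nearly match
```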
Using Maxima, the solution is:
assume(rhoR >= 0, rhoC >= 0, rhoR <= 1, rhoC <= 1, covR >= 0, covC >= 0);
declare(rhoR, constant);
declare(rhoC, constant);
declare(covR, constant);
declare(covC, constant);
eqC : rhoC = alpha^2/(alpha^2+beta^2);
eqR : rhoR = 1/(1+(1+1/covR^2)*(beta/(alpha+beta)*covC)^2);
solve([eqC, eqR], [alpha, beta]);
[[alpha = 0, beta = 0]]
There are not enough parameters: both correlations depend only on the ratio beta/alpha, so two constraints face a single effective degree of freedom (alpha and beta cancel in the derivations, and Maxima only returns the trivial solution). Adding a simple scaling term to the second transformation would thus be useless.
Let's redesign the procedure:
Zj <- C+alpha*Cj
Zi' <- R*Zi+beta*R
We still want the extreme case rhoR=rhoC=1 to lead to alpha=beta=0. In any other case, the final matrix Z' will be scaled afterwards to obtain the correct mean and standard deviation.
Let's build the system:
rhoC = 1/(1+alpha^2)
covZi = alpha/(1+alpha)*covC
rhoR = 1/(1+(1+1/covR^2)*covZi^2)/(1+beta^2)
And the solution is:
alpha = sqrt(1/rhoC-1)
beta = sqrt(1/(1+(1+1/covR^2)*(sqrt(1/rhoC-1)/(1+sqrt(1/rhoC-1))*covC)^2)/rhoR-1)
For alpha to be feasible, rhoC must be strictly positive (when rhoC is 0, the fallback is to generate each column independently at random). For beta to be feasible:
rhoR <= 1/(1+(1+1/covR^2)*(sqrt(1/rhoC-1)/(1+sqrt(1/rhoC-1))*covC)^2)
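With the uniform distributions used in the experiments below (R on U(20,50) and C on U(50,100)), this bound can be evaluated directly; rhoR_max is a hypothetical helper name of ours:

```r
# Hypothetical helper giving the largest feasible rhoR (names are ours)
rhoR_max <- function(rhoC, covR, covC) {
    s <- sqrt(1/rhoC - 1)
    1/(1 + (1 + 1/covR^2) * (s/(1 + s) * covC)^2)
}
covR <- (30/sqrt(12))/35           # sd/mean of U(20, 50)
covC <- (50/sqrt(12))/75           # sd/mean of U(50, 100)
rhoR_max(0.5, covR, covC)          # about 0.86: larger rhoR requests are infeasible
rhoR_max(0.99, covR, covC)         # about 0.99: almost any rhoR is feasible
```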
Fortunately, it is always possible to swap R and C when this second assumption does not hold.
generate_matrix_simple <- function(n = 1, R, C, rdistr, rdistc, rhor, rhoc) {
    if (length(R) == 1)
        R <- rdistr(R)
    if (length(C) == 1)
        C <- rdistc(C)
    covR <- sqrt(var(R))/mean(R)
    covC <- sqrt(var(C))/mean(C)
    alpha <- sqrt(1/rhoc - 1)
    beta <- sqrt(1/(1 + (1 + 1/covR^2) * (alpha/(1 + alpha) * covC)^2)/rhor - 1)
    Z <- sapply(1:length(R), function(j) return(C + alpha * rdistc(length(C))))
    for (i in 1:nrow(Z)) Z[i, ] <- R * Z[i, ] + beta * R
    return(Z)
}
rdistr <- function(n) {
    return(runif(n, 20, 50))
}
rdistc <- function(n) {
    return(runif(n, 50, 100))
}
Z <- generate_matrix_simple(1, 100, 100, rdistr, rdistc, 0.5, 1)
cor.test(Z[1, ], Z[2, ])$estimate
## cor
## 1
Unfortunately, this does not work: with rhoc = 1, alpha is zero, so every column equals C and each row of the final matrix is proportional to R, which forces the correlation between rows to be 1 instead of the requested 0.5.