Generation of cost matrix with specific correlations and distributions (March 14, 2014)

Motivation

The previous report (April 4, 2013) devises a method for generating a cost matrix with given correlation coefficients between rows and between columns. This method has, however, the following limitations:

The objective is thus to adapt this method by replacing the irrelevant arguments with the distributions of each row and of each column.

Principle

The core idea is to provide a vector of speeds (a row) and a vector of costs (a column) along with the corresponding distribution generators. Each column will be a weighted sum of a common vector (the costs) and a newly generated vector. The same applies to the rows (the common vector will be the speeds). The issue is to determine the weights and the order of construction so as to respect all the requirements.
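As a minimal illustration of this weighted-sum construction (the weights and distributions below are arbitrary placeholders, not the values derived later):

C <- runif(10, 50, 100)      # common column vector (the costs)
Cj <- runif(10, 50, 100)     # newly generated vector for column j
alpha <- 0.7                 # arbitrary weight on the common vector
beta <- 0.3                  # arbitrary weight on the new vector
Zj <- alpha * C + beta * Cj  # column j of the cost matrix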

Notation

Mathematical results for correlation computation

Method

Let's start by building the matrix Z by enforcing the correlation rhoC between any pair of columns:

Zj <- alpha*C+beta*Cj

Then, we can derive the correlation:

rhoC = alpha^2*sigmaC^2/(alpha^2*sigmaC^2+beta^2*sigmaCj^2)

We assume that the standard deviations of C and Cj are identical. Additionally, in order to set the mean of each column to rj, we impose alpha+beta = rj/muC. Therefore,

alpha <- rj/(muC*(1+sqrt(1/rhoC-1)))
beta <- rj*sqrt(1/rhoC-1)/(muC*(1+sqrt(1/rhoC-1)))
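A quick numerical check of these weights (a sketch; the target values and the uniform distribution are arbitrary):

rhoC <- 0.6                  # target correlation between columns
rj <- 30                     # target mean of column j
C <- runif(10000, 50, 100)   # common column vector
muC <- mean(C)
alpha <- rj/(muC * (1 + sqrt(1/rhoC - 1)))
beta <- rj * sqrt(1/rhoC - 1)/(muC * (1 + sqrt(1/rhoC - 1)))
Zj <- alpha * C + beta * runif(10000, 50, 100)
Zk <- alpha * C + beta * runif(10000, 50, 100)
cor(Zj, Zk)                  # should be close to rhoC
mean(Zj)                     # should be close to rj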

Now, let's transform the matrix to enforce the correlation rhoR between any pair of rows:

Zi' <- alpha'*R+beta'*Zi

The values alpha' and beta' must be constant (the same for every row) to preserve the previous correlations between columns. Let's first determine the initial mean of Zi (on average over the columns, rj is replaced by its mean muR):

muZi = alpha*ci+beta*muC
muZi = muR*(ci/muC+sqrt(1/rhoC-1))/(1+sqrt(1/rhoC-1))

Moreover:

sigmaZi = beta*sigmaC
sigmaZi = rj*sqrt(1/rhoC-1)/(muC*(1+sqrt(1/rhoC-1)))*sigmaC
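The mean formula above can be checked numerically (a sketch checking only the row mean, with a fresh rj drawn for each column; the two printed values should be approximately equal):

rhoC <- 0.6
s <- sqrt(1/rhoC - 1)
C <- runif(50, 50, 100)          # the common column vector
R <- runif(5000, 20, 50)         # one rj per column
muC <- mean(C)
muR <- mean(R)
Z <- sapply(R, function(rj) rj/(muC * (1 + s)) * C +
    rj * s/(muC * (1 + s)) * runif(length(C), 50, 100))
i <- 1
mean(Z[i, ])                     # empirical mean of row i
muR * (C[i]/muC + s)/(1 + s)     # muZi given by the formula above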

Now, we want the mean of Zi' to be muR*ci. Let's start with a few extreme cases to see whether it is feasible. When rhoC=1:

alpha <- rj/muC
beta <- 0
Zj <- rj*C/muC
muZi = muR*ci/muC

Then, if rhoR=1:

alpha' <- muC
beta' <- 0
Zi' <- muC*R

In this case, the mean of Zj' is correctly rj*muC but rhoC=0 (it is OK because it is a limit case). If rhoR=0, then:

alpha' <- 0
beta' <- muC
Zi' <- muC*Zi

In this case, the mean of Zj' is correctly rj*muC. Let's start again with rhoC=0:

alpha <- 0
beta <- rj/muC
Zj <- rj*Cj/muC
muZi = muR

Then, if rhoR=1:

alpha' <- muC
beta' <- 0
Zi' <- muC*R

In this case, the mean of Zj' is correctly rj*muC. If rhoR=0, then:

alpha' <- 0
beta' <- muC
Zi' <- muC*Zi

The mean of Zj' is muC*muR, which is incorrect. Testing these extreme cases is inconclusive.

Let's compute the mean of Zi':

muZi' = alpha'*muR+beta'*muR*(ci/muC+sqrt(1/rhoC-1))/(1+sqrt(1/rhoC-1))

Now let's compute the correlation coefficient:

rhoR = alpha'^2*sigmaR^2/(alpha'^2*sigmaR^2+beta'^2*sigmaZi^2)

Let's assume we scale R such that its variance is the same as that of Zi. We want:

alpha'*muR + beta'*muR*(ci/muC+sqrt(1/rhoC-1))/(1+sqrt(1/rhoC-1)) = muR*ci
alpha'^2/(alpha'^2+beta'^2) = rhoR

alpha' = -muR*sqrt(1/rhoC-1)/(1+sqrt(1/rhoC-1))
beta' = muR^2

The mean of Zj' will then be (alpha'+beta')*rj.

This whole approach suffers from having fewer parameters than constraints. Let's implement the first approach to get a numerical intuition of the problem.

Numerical intuition

generate_matrix_intuition <- function(n = 1, R, C, rdistr, rdistc, rhor, rhoc) {
    # If R or C is a single number, interpret it as a size and draw the
    # vector from the corresponding distribution generator.
    if (length(R) == 1) 
        R <- rdistr(R)
    if (length(C) == 1) 
        C <- rdistc(C)
    # Each column j is Zj = alpha_j*C + beta_j*Cj with the weights of the
    # first method (alpha_j and beta_j depend on R[j]).
    # Note: n and rhor are not used in this intuition experiment.
    Z <- sapply(1:length(R), function(j) {
        alpha <- R[j]/(mean(C) * (1 + sqrt(1/rhoc - 1)))
        beta <- R[j] * sqrt(1/rhoc - 1)/(mean(C) * (1 + sqrt(1/rhoc - 1)))
        alpha * C + beta * rdistc(length(C))
    })
    return(Z)
}

rdistr <- function(n) {
    return(runif(n, 20, 50))
}
rdistc <- function(n) {
    return(runif(n, 50, 100))
}

Z <- generate_matrix_intuition(1, 100, 100, rdistr, rdistc, 0.5, 0.99)
cor.test(Z[1, ], Z[2, ])$estimate
##    cor 
## 0.9947

Multiplying by rj introduces a correlation between rows that must be avoided. It seems contradictory to specify the mean of each row and of each column while also specifying the correlations between rows and between columns.

Simpler method

Let's reconsider the same method, except that we no longer specify the mean of each row and each column.

Zj <- alpha*C+beta*Cj

We assume again that the standard deviations of C and Cj are identical. This time, we set the mean of each column to muR, which gives alpha+beta = muR/muC. Therefore,

rhoC = alpha^2/(alpha^2+beta^2)
alpha+beta = muR/muC

alpha <- muR/(muC*(1+sqrt(1/rhoC-1)))
beta <- muR*sqrt(1/rhoC-1)/(muC*(1+sqrt(1/rhoC-1)))
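The same numerical check as before applies, with rj replaced by muR (a sketch; distributions are arbitrary):

rhoC <- 0.6
muR <- 35                    # target mean of every column
C <- runif(10000, 50, 100)
muC <- mean(C)
alpha <- muR/(muC * (1 + sqrt(1/rhoC - 1)))
beta <- muR * sqrt(1/rhoC - 1)/(muC * (1 + sqrt(1/rhoC - 1)))
Zj <- alpha * C + beta * runif(10000, 50, 100)
cor(Zj, alpha * C + beta * runif(10000, 50, 100))  # close to rhoC
mean(Zj)                                           # close to muR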

This seems to lead to the same problems. Let's explore another generation method.

Generation with multiplication

Let's keep the first part of the method and study the operation we would like in the extreme cases. If rhoC=rhoR=1, then we want Zj <- C and then Zi' <- R * Zi. If rhoC=rhoR=0, then we want Zj <- Cj and then Zi' <- Ri * Zi (where Ri denotes a newly generated row, by analogy with Cj). Let's add some parameters:

Zj <- alpha*C+beta*Cj
Zi' <- R*Zi

We see that the second transformation is only multiplicative with distinct values for each column. It may become necessary to insert an additive term.

From the previous analysis, we know that:

muZi = (alpha+beta)*muC
sigmaZi = beta*sigmaC

Now let's study the correlation between any pair of rows:

rhoR = sigmaR^2*muZi^2/(sigmaR^2*sigmaZi^2+sigmaR^2*muZi^2+muR^2*sigmaZi^2)

which simplifies, writing covR = sigmaR/muR and covZi = sigmaZi/muZi for the coefficients of variation, as:

rhoR = 1/(1+(1+1/covR^2)*covZi^2)
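This expression can be checked by simulation with two independent rows multiplied by the same random vector (a sketch; the distributions are arbitrary):

n <- 100000
R <- runif(n, 20, 50)              # common multiplicative vector
Z1 <- runif(n, 80, 120)            # two independent rows with the same law
Z2 <- runif(n, 80, 120)
covR <- sd(R)/mean(R)
covZ <- sd(Z1)/mean(Z1)
cor(R * Z1, R * Z2)                # empirical correlation between the rows
1/(1 + (1 + 1/covR^2) * covZ^2)    # rhoR predicted by the formula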

Now, let's determine the previous values of alpha and beta. We have the following system:

rhoC = alpha^2/(alpha^2+beta^2)
rhoR = 1/(1+(1+1/covR^2)*covZi^2)
covZi = beta/(alpha+beta)*covC

Using Maxima, the solution is:

assume(rhoR >= 0, rhoC >= 0, rhoR <= 1, rhoC <= 1, covR >= 0, covC >= 0);
declare(rhoR, constant);
declare(rhoC, constant);
declare(covR, constant);
declare(covC, constant);
eqC : rhoC = alpha^2/(alpha^2+beta^2);
eqR : rhoR = 1/(1+(1+1/covR^2)*(beta/(alpha+beta)*covC)^2);
solve([eqC, eqR], [alpha, beta]);
  [[alpha = 0, beta = 0]]

There are not enough parameters (alpha and beta cancel out in the derivation). Adding a simple scaling term to the second transformation would be useless.

New design by removing useless terms

Let's redesign the procedure:

Zj <- C+alpha*Cj
Zi' <- R*Zi+beta*R

We still want to ensure that the extreme case rhoR=rhoC=1 leads to alpha=beta=0. In any other case, the final matrix Z' will be scaled to have the correct mean and standard deviation.

Let's build the system:

rhoC = 1/(1+alpha^2)
covZi = alpha/(1+alpha)*covC
rhoR = 1/(1+(1+1/covR^2)*covZi^2)/(1+beta^2)

And the solution is:

alpha = sqrt(1/rhoC-1)
beta = sqrt(1/(1+(1+1/covR^2)*(sqrt(1/rhoC-1)/(1+sqrt(1/rhoC-1))*covC)^2)/rhoR-1)

For feasible alpha, rhoC must be strictly greater than 0 (if it is 0, the fallback method is the random generation). For feasible beta:

rhoR <= 1/(1+(1+1/covR^2)*(sqrt(1/rhoC-1)/(1+sqrt(1/rhoC-1))*covC)^2)

Fortunately, it is always possible to swap R and C when this second condition does not hold.
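A small helper (hypothetical, a direct transcription of the bound above) can be used to check whether a requested rhoR is attainable for given rhoC, covR and covC, and hence whether R and C should be swapped:

max_rhor <- function(rhoc, covR, covC) {
    # Upper bound on the achievable rhoR, as derived above
    a <- sqrt(1/rhoc - 1)
    1/(1 + (1 + 1/covR^2) * (a/(1 + a) * covC)^2)
}
max_rhor(0.5, 0.25, 0.2)  # example bound on rhoR when rhoC = 0.5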

generate_matrix_simple <- function(n = 1, R, C, rdistr, rdistc, rhor, rhoc) {
    # If R or C is a single number, interpret it as a size and draw the
    # vector from the corresponding distribution generator.
    if (length(R) == 1) 
        R <- rdistr(R)
    if (length(C) == 1) 
        C <- rdistc(C)

    # Coefficients of variation of R and C
    covR <- sqrt(var(R))/mean(R)
    covC <- sqrt(var(C))/mean(C)

    # Weights derived above from rhoc and rhor
    alpha <- sqrt(1/rhoc - 1)
    beta <- sqrt(1/(1 + (1 + 1/covR^2) * (alpha/(1 + alpha) * covC)^2)/rhor - 
        1)

    # First pass: each column is Zj = C + alpha*Cj
    Z <- sapply(1:length(R), function(j) C + alpha * rdistc(length(C)))
    # Second pass: each row is Zi' = R*Zi + beta*R
    for (i in 1:nrow(Z)) Z[i, ] <- R * Z[i, ] + beta * R

    return(Z)
}

rdistr <- function(n) {
    return(runif(n, 20, 50))
}
rdistc <- function(n) {
    return(runif(n, 50, 100))
}

Z <- generate_matrix_simple(1, 100, 100, rdistr, rdistc, 0.5, 1)
cor.test(Z[1, ], Z[2, ])$estimate
## cor 
##   1

Unfortunately, this does not work.