Large measure study (July 15 and July 16, 2015)

The TMA is not robust to the CV of a matrix and probably not to its dimension. Previous reports showed that the CV of the singular values is not a very good predictor of the performance of HLPT. One suggestion was to use the ratio between the first singular value and the second to get something closer to the TMA. A list of other potentially interesting measures was also proposed.

We want to investigate the limits of the TMA and explore alternative measures. The objective is to classify possible measures and to select the most interesting ones.

Summary of possible measures

Let's derive a list of possible measures from the singular values (SV) of a matrix, sorted in decreasing order:

- the CV of the SV (CV), their arithmetic mean (arith_mean), geometric mean (geom_mean) and maximum (max);
- the ratio between the first and second SV (first_ratio), and the maximum, mean and CV of the ratios between consecutive SV (max_ratio, mean_ratio, CV_ratio);
- the difference between the first and second SV (first_diff), and the maximum, mean and CV of the differences between consecutive SV (max_diff, mean_diff, CV_diff).

We may also compute the same 12 measures on the SV of the element-wise inverted matrix 1/Z (suffix _inv) and on the inverse of the SV (prefix inv), for a total of 36 measures.
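For reference, here is a minimal sketch of these computations. CV is not defined in this section; it is assumed to be the usual coefficient of variation (standard deviation over mean). base_measures is a hypothetical helper that mirrors the inline computations in the loops below.

CV <- function(x) sd(x)/mean(x)  # assumed definition of the coefficient of variation

base_measures <- function(SV) {
    # SV is assumed sorted in decreasing order, as returned by svd()$d
    ratios <- SV[-length(SV)]/SV[-1]  # ratios of consecutive singular values
    diffs <- SV[-length(SV)] - SV[-1]  # differences of consecutive singular values
    data.frame(CV = CV(SV), arith_mean = mean(SV), geom_mean = exp(mean(log(SV))), 
        max = max(SV), first_ratio = SV[1]/SV[2], max_ratio = max(ratios), 
        mean_ratio = mean(ratios), CV_ratio = CV(ratios), first_diff = SV[1] - SV[2], 
        max_diff = max(diffs), mean_diff = mean(diffs), CV_diff = CV(diffs))
}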

Let's study those measures with the noise-based and the combination-based methods by trying to predict the performance of HLPT.

Relation between measures with the noise-based method

n <- 100
m <- 30
measures_noise <- NULL
for (i in 1:1000) {
    rhoR <- runif(1)
    rhoC <- runif(1)
    Vmax <- runif(1)
    Z <- generate_heterogeneous_matrix_noise_corr(n, m, rgamma_cost, rhoR, rhoC, 
        Vmax)
    SV <- svd(Z)$d
    if (any(!is.finite(1/Z))) 
        next
    SV_inv <- svd(1/Z)$d
    invSV <- 1/SV
    if (!is.finite(var(SV_inv))) 
        next
    measures_noise <- rbind(measures_noise, data.frame(CV = CV(SV), arith_mean = mean(SV), 
        geom_mean = exp(mean(log(SV))), max = max(SV), first_ratio = SV[1]/SV[2], 
        max_ratio = max(SV[-length(SV)]/SV[-1]), mean_ratio = mean(SV[-length(SV)]/SV[-1]), 
        CV_ratio = CV(SV[-length(SV)]/SV[-1]), first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - 
            SV[-1]), mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] - 
            SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))), 
        max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] - 
            SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]), 
        CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV), 
        invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV), 
        invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]), 
        invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]), 
        invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] - 
            invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]), 
        invCV_diff = CV(invSV[-length(invSV)] - invSV[-1])))
}

Let's see if some of those measures are equivalent.

distance_noise <- matrix(NA, ncol(measures_noise), ncol(measures_noise))
for (i in 1:ncol(measures_noise)) for (j in 1:ncol(measures_noise)) distance_noise[i, 
    j] <- 1 - abs(cor(measures_noise[, i], measures_noise[, j], method = "spearman"))
rownames(distance_noise) <- colnames(measures_noise)
colnames(distance_noise) <- colnames(measures_noise)
plot(hclust(as.dist(distance_noise)))

plot of chunk large_measuse_noise_dis
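As an aside, the double loop is not required: cor accepts a whole data frame and returns the matrix of pairwise correlations, with row and column names already set. An equivalent sketch:

distance_noise <- 1 - abs(cor(measures_noise, method = "spearman"))
plot(hclust(as.dist(distance_noise)))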

Based on the dendrogram, we can group them into eight categories.

Let's see the values of the first groups on a scatter-plot.

pairs(~first_diff + max_diff + invmax_ratio + max + mean_diff + invCV + invmean_ratio + 
    max_inv + mean_diff_inv + first_diff_inv + max_diff_inv + invCV_diff, measures_noise, 
    cex = 0.1)

plot of chunk large_measuse_noise_plot1

Several pairs of measures are equivalent: first_diff/max_diff and max/mean_diff. invCV and invmean_ratio are quite close too. Let's see some specific relations in log-scale.

pairs(~max_inv + mean_diff_inv + first_diff_inv + max_diff_inv, measures_noise, 
    cex = 0.1, log = "xy")

plot of chunk large_measuse_noise_plot2

Two other equivalences: max_inv/mean_diff_inv and first_diff_inv/max_diff_inv. All four are also quite close to each other.

Let's continue with the next groups:

pairs(~invCV_ratio + invmax + invmean_diff + arith_mean_inv + arith_mean + geom_mean_inv + 
    max_ratio + CV_ratio + invfirst_diff + CV_diff + first_ratio + invfirst_ratio, 
    measures_noise, cex = 0.1)

plot of chunk large_measuse_noise_plot3

Let's focus on some relations in log-scale:

pairs(~invmax + I(-invmean_diff) + max_ratio + CV_ratio + I(-invfirst_diff) + 
    CV_diff + first_ratio + invfirst_ratio, measures_noise, cex = 0.1, log = "xy")

plot of chunk large_measuse_noise_plot4

Again, we see that invmax and invmean_diff are equivalent, and we see another equivalence: first_ratio/invfirst_ratio. Also, there are two strong associations: max_ratio/CV_ratio and first_ratio/max_ratio.

Finally, the last groups:

pairs(~invmax_diff + mean_ratio_inv + max_ratio_inv + CV_ratio_inv + first_ratio_inv + 
    CV_diff_inv + CV_inv + invarith_mean + mean_ratio + CV + geom_mean + invgeom_mean, 
    measures_noise, cex = 0.1)

plot of chunk large_measuse_noise_plot5

Let's look at some relations in more detail:

pairs(~mean_ratio_inv + max_ratio_inv + CV_ratio_inv + first_ratio_inv + CV_diff_inv + 
    invarith_mean + mean_ratio + CV + geom_mean + invgeom_mean, measures_noise, 
    cex = 0.1, log = "xy")

plot of chunk large_measuse_noise_plot6

Some equivalences: mean_ratio_inv/max_ratio_inv/CV_ratio_inv and geom_mean/invgeom_mean.

Strong associations: max_ratio_inv/first_ratio_inv, first_ratio_inv/CV_diff_inv and mean_ratio/CV.

Let's summarize these equivalences.

It is indeed intuitive that the max and the mean_diff of a quantity are equivalent: the consecutive differences telescope, so their mean equals (SV[1] - SV[n])/(n - 1), which is dominated by the maximum singular value when the smallest one is negligible. Let's keep the max only.
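A quick numerical check of the telescoping argument, on arbitrary decreasing values:

SV <- sort(runif(30), decreasing = TRUE)
# The consecutive differences telescope: their mean only depends on the
# extreme values.
all.equal(mean(SV[-length(SV)] - SV[-1]), (SV[1] - SV[length(SV)])/(length(SV) - 1))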

We also see that the first difference is almost always the maximum difference. Let's keep only the first difference.

Due to the nature of the measures, first_ratio and invfirst_ratio are exact reciprocals (hence anti-correlated), and so are geom_mean and invgeom_mean (hence equivalent in rank). Let's keep first_ratio and geom_mean.
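Indeed, with invSV = 1/SV, we have invSV[1]/invSV[2] = SV[2]/SV[1] and exp(mean(log(invSV))) = 1/exp(mean(log(SV))), which a quick check confirms:

SV <- sort(runif(30), decreasing = TRUE)
invSV <- 1/SV
all.equal((SV[1]/SV[2]) * (invSV[1]/invSV[2]), 1)  # reciprocal first ratios
all.equal(exp(mean(log(SV))) * exp(mean(log(invSV))), 1)  # reciprocal geometric means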

Other equivalences remain unexplained. Let's keep the remaining measures and see how they behave with the combination-based method.

Relation between measures with the combination-based method

n <- 100
m <- 30
measures_combi <- NULL
for (i in 1:1000) {
    rhoR <- runif(1)
    rhoC <- runif(1)
    CVparam <- runif(1)
    Z <- generate_matrix_corr_positive(n, m, rgamma_cost, 1, CVparam, rhoR, 
        rhoC)
    SV <- svd(Z)$d
    if (any(!is.finite(1/Z))) 
        next
    SV_inv <- svd(1/Z)$d
    invSV <- 1/SV
    if (!is.finite(var(SV_inv))) 
        next
    measures_combi <- rbind(measures_combi, data.frame(CV = CV(SV), arith_mean = mean(SV), 
        geom_mean = exp(mean(log(SV))), max = max(SV), first_ratio = SV[1]/SV[2], 
        max_ratio = max(SV[-length(SV)]/SV[-1]), mean_ratio = mean(SV[-length(SV)]/SV[-1]), 
        CV_ratio = CV(SV[-length(SV)]/SV[-1]), first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - 
            SV[-1]), mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] - 
            SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))), 
        max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] - 
            SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]), 
        CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV), 
        invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV), 
        invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]), 
        invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]), 
        invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] - 
            invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]), 
        invCV_diff = CV(invSV[-length(invSV)] - invSV[-1])))
}

Let's see if some of those measures are equivalent.

distance_combi <- matrix(NA, ncol(measures_combi), ncol(measures_combi))
for (i in 1:ncol(measures_combi)) for (j in 1:ncol(measures_combi)) distance_combi[i, 
    j] <- 1 - abs(cor(measures_combi[, i], measures_combi[, j], method = "spearman"))
rownames(distance_combi) <- colnames(measures_combi)
colnames(distance_combi) <- colnames(measures_combi)
plot(hclust(as.dist(distance_combi)))

plot of chunk large_measuse_combi_dist

The dendrogram looks quite different, but the two distance matrices are quite close (around 80% of the values differ by less than 0.2):

plot.ecdf(as.vector(abs(distance_noise - distance_combi)))

plot of chunk unnamed-chunk-1
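This proportion can also be recomputed directly (a one-line check):

mean(abs(distance_noise - distance_combi) < 0.2)  # fraction of entries differing by less than 0.2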

Now that we have explored the measures a bit, let's see which one is the best predictor.

Prediction power with the noise-based method

Let's try to predict the performance of HLPT relative to some other good heuristics.

best_cmax <- function(costs) {
    # Best makespan achieved by three reference heuristics
    min(LPT_unrelated(costs, min), balance_sufferage(costs), balance_EFT(costs))
}
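The performance recorded below is the ratio of the makespan of HLPT to this best makespan; since HLPT (LPT_unrelated with func = min) is among the three heuristics, the ratio is at least 1 and max(1, .) is only a safeguard. A hypothetical one-liner equivalent to the ratio column computed in the loop:

ratio_HLPT <- function(Z) max(1, LPT_unrelated(Z, func = min)/best_cmax(Z))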

n <- 100
m <- 30
LPT_unrelated_perf_noise <- NULL
for (i in 1:100) {
    rhoR <- runif(1)
    rhoC <- runif(1)
    Vmax <- runif(1)
    Z <- generate_heterogeneous_matrix_noise_corr(n, m, rgamma_cost, rhoR, rhoC, 
        Vmax)
    SV <- svd(Z)$d
    if (any(!is.finite(1/Z))) 
        next
    SV_inv <- svd(1/Z)$d
    invSV <- 1/SV
    if (!is.finite(var(SV_inv))) 
        next
    cmax <- LPT_unrelated(Z, func = min)
    LPT_unrelated_perf_noise <- rbind(LPT_unrelated_perf_noise, data.frame(CV = CV(SV), 
        arith_mean = mean(SV), geom_mean = exp(mean(log(SV))), max = max(SV), 
        first_ratio = SV[1]/SV[2], max_ratio = max(SV[-length(SV)]/SV[-1]), 
        mean_ratio = mean(SV[-length(SV)]/SV[-1]), CV_ratio = CV(SV[-length(SV)]/SV[-1]), 
        first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - SV[-1]), 
        mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] - 
            SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))), 
        max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] - 
            SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]), 
        CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV), 
        invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV), 
        invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]), 
        invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]), 
        invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] - 
            invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]), 
        invCV_diff = CV(invSV[-length(invSV)] - invSV[-1]), mean_cor_row = mean_cor_row(Z), 
        mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z), ratio = max(1, 
            cmax/best_cmax(Z))))
}
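In addition to the 36 SV measures, the loop records two correlation measures and the two TMA variants. The definitions of mean_cor_row and mean_cor_col are not shown in this section; a plausible sketch (an assumption, the actual implementations may differ) is the average pairwise correlation between rows, respectively columns, of the cost matrix:

mean_cor_row <- function(Z) {
    # Hypothetical: mean correlation between pairs of rows
    # (cor works column-wise, hence the transpose)
    C <- cor(t(Z))
    mean(C[upper.tri(C)])
}
mean_cor_col <- function(Z) {
    # Hypothetical: mean correlation between pairs of columns
    C <- cor(Z)
    mean(C[upper.tri(C)])
}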

Let's visualize the relation for each measure.

measure_index <- 1:(ncol(LPT_unrelated_perf_noise) - 1)
LPT_unrelated_perf_noise_long <- reshape(LPT_unrelated_perf_noise, varying = measure_index, 
    direction = "long", idvar = "matrix", v.names = "measure", timevar = "type", 
    times = colnames(LPT_unrelated_perf_noise)[measure_index])
library(ggplot2)
p <- ggplot(LPT_unrelated_perf_noise_long, aes(x = abs(measure), y = ratio))
p <- p + geom_point()
p <- p + facet_wrap(~type, nrow = 8, scales = "free")
print(p)

plot of chunk large_measuse_relation_noise_linear

We see some relations for: CV, CV_inv, CV_ratio, CV_ratio_inv and mean_cor_col. And now in log-scale:

print(p + scale_x_log10())

plot of chunk large_measuse_relation_noise_log

And also for: first_ratio, first_ratio_inv, geom_mean, geom_mean_inv, invarith_mean, invfirst_diff, invfirst_ratio, mean_ratio, mean_ratio_inv, TMA1 and TMA2.

Let's fit the performance for each predictor.

fit <- NULL
for (meas in colnames(LPT_unrelated_perf_noise)[-ncol(LPT_unrelated_perf_noise)]) {
    # Adjusted R-squared of a linear fit on the raw measure
    f <- lm(paste("ratio ~", meas), LPT_unrelated_perf_noise)
    fit <- rbind(fit, data.frame(meas = meas, summary(f)$adj.r.squared))
    # Adjusted R-squared of a linear fit on the log of the measure
    f <- lm(paste("ratio ~ log(abs(", meas, "))", sep = ""), LPT_unrelated_perf_noise)
    fit <- rbind(fit, data.frame(meas = paste("log", meas), summary(f)$adj.r.squared))
}
head(fit[order(fit[, 2], decreasing = TRUE), ])
##                meas summary.f..adj.r.squared
## 75     mean_cor_col                   0.2977
## 25           CV_inv                   0.2946
## 26       log CV_inv                   0.2630
## 76 log mean_cor_col                   0.2504
## 78         log TMA1                   0.2024
## 1                CV                   0.1840

The best predictors here are CV_inv and mean_cor_col.

Prediction power with the combination-based method

Now, let's see if the conclusions stand with another generation method.

n <- 100
m <- 30
LPT_unrelated_perf_combi <- NULL
for (i in 1:100) {
    rhoR <- runif(1)
    rhoC <- runif(1)
    CVparam <- runif(1)
    Z <- generate_matrix_corr_positive(n, m, rgamma_cost, 1, CVparam, rhoR, 
        rhoC)
    SV <- svd(Z)$d
    if (any(!is.finite(1/Z))) 
        next
    SV_inv <- svd(1/Z)$d
    invSV <- 1/SV
    if (!is.finite(var(SV_inv))) 
        next
    cmax <- LPT_unrelated(Z, func = min)
    LPT_unrelated_perf_combi <- rbind(LPT_unrelated_perf_combi, data.frame(CV = CV(SV), 
        arith_mean = mean(SV), geom_mean = exp(mean(log(SV))), max = max(SV), 
        first_ratio = SV[1]/SV[2], max_ratio = max(SV[-length(SV)]/SV[-1]), 
        mean_ratio = mean(SV[-length(SV)]/SV[-1]), CV_ratio = CV(SV[-length(SV)]/SV[-1]), 
        first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - SV[-1]), 
        mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] - 
            SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))), 
        max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] - 
            SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]), 
        CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV), 
        invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV), 
        invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]), 
        invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]), 
        invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] - 
            invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]), 
        invCV_diff = CV(invSV[-length(invSV)] - invSV[-1]), mean_cor_row = mean_cor_row(Z), 
        mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z), ratio = max(1, 
            cmax/best_cmax(Z))))
}

Let's look at the relations again.

measure_index <- 1:(ncol(LPT_unrelated_perf_combi) - 1)
LPT_unrelated_perf_combi_long <- reshape(LPT_unrelated_perf_combi, varying = measure_index, 
    direction = "long", idvar = "matrix", v.names = "measure", timevar = "type", 
    times = colnames(LPT_unrelated_perf_combi)[measure_index])
library(ggplot2)
p <- ggplot(LPT_unrelated_perf_combi_long, aes(x = abs(measure), y = ratio))
p <- p + geom_point()
p <- p + facet_wrap(~type, nrow = 8, scales = "free")
p <- p + scale_x_log10()
print(p)

plot of chunk large_measuse_relation_combi_log

And we fit the performance again with each measure.

fit <- NULL
for (meas in colnames(LPT_unrelated_perf_combi)[-ncol(LPT_unrelated_perf_combi)]) {
    f <- lm(paste("ratio ~", meas), LPT_unrelated_perf_combi)
    fit <- rbind(fit, data.frame(meas = meas, summary(f)$adj.r.squared))
    f <- lm(paste("ratio ~ log(abs(", meas, "))", sep = ""), LPT_unrelated_perf_combi)
    fit <- rbind(fit, data.frame(meas = paste("log", meas), summary(f)$adj.r.squared))
}
head(fit[order(fit[, 2], decreasing = TRUE), ])
##                meas summary.f..adj.r.squared
## 76 log mean_cor_col                   0.2611
## 75     mean_cor_col                   0.2572
## 25           CV_inv                   0.1214
## 50        log invCV                   0.1169
## 49            invCV                   0.1159
## 61    invmean_ratio                   0.1059

The column correlation still plays an important role.

Prediction power with the TMA-based method

This is surprising because a previous analysis showed that the TMA had significant predictive power. Let's check this with an earlier generation method:

n <- 100
m <- 30
LPT_unrelated_perf_TMA <- NULL
for (i in 1:100) {
    cvsv <- 10^runif(1, -2, 1)
    Z <- generate_matrix_cvsv(n, m, rgamma_cost, cvsv, 1)
    SV <- svd(Z)$d
    if (any(!is.finite(1/Z))) 
        next
    SV_inv <- svd(1/Z)$d
    invSV <- 1/SV
    if (!is.finite(var(SV_inv))) 
        next
    cmax <- LPT_unrelated(Z, func = min)
    LPT_unrelated_perf_TMA <- rbind(LPT_unrelated_perf_TMA, data.frame(CV = CV(SV), 
        arith_mean = mean(SV), geom_mean = exp(mean(log(SV))), max = max(SV), 
        first_ratio = SV[1]/SV[2], max_ratio = max(SV[-length(SV)]/SV[-1]), 
        mean_ratio = mean(SV[-length(SV)]/SV[-1]), CV_ratio = CV(SV[-length(SV)]/SV[-1]), 
        first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - SV[-1]), 
        mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] - 
            SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))), 
        max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]), 
        first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] - 
            SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]), 
        CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV), 
        invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV), 
        invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]), 
        invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]), 
        invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] - 
            invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]), 
        invCV_diff = CV(invSV[-length(invSV)] - invSV[-1]), mean_cor_row = mean_cor_row(Z), 
        mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z), ratio = max(1, 
            cmax/best_cmax(Z))))
}

Let's look at the relations again.

measure_index <- 1:(ncol(LPT_unrelated_perf_TMA) - 1)
LPT_unrelated_perf_TMA_long <- reshape(LPT_unrelated_perf_TMA, varying = measure_index, 
    direction = "long", idvar = "matrix", v.names = "measure", timevar = "type", 
    times = colnames(LPT_unrelated_perf_TMA)[measure_index])
library(ggplot2)
p <- ggplot(LPT_unrelated_perf_TMA_long, aes(x = abs(measure), y = ratio))
p <- p + geom_point()
p <- p + facet_wrap(~type, nrow = 8, scales = "free")
p <- p + scale_x_log10()
print(p)

plot of chunk large_measuse_relation_TMA_log

And we fit the performance again with each measure.

# Replace zero values by NA so that log(abs(.)) remains finite
LPT_unrelated_perf_TMA_NA <- do.call(data.frame, lapply(LPT_unrelated_perf_TMA, 
    function(x) replace(x, x == 0, NA)))
fit <- NULL
for (meas in colnames(LPT_unrelated_perf_TMA_NA)[-ncol(LPT_unrelated_perf_TMA_NA)]) {
    f <- lm(paste("ratio ~", meas), LPT_unrelated_perf_TMA_NA)
    fit <- rbind(fit, data.frame(meas = meas, summary(f)$adj.r.squared))
    f <- lm(paste("ratio ~ log(abs(", meas, "))", sep = ""), LPT_unrelated_perf_TMA_NA, 
        na.action = na.omit)
    fit <- rbind(fit, data.frame(meas = paste("log", meas), summary(f)$adj.r.squared))
}
head(fit[order(fit[, 2], decreasing = TRUE), ])
##              meas summary.f..adj.r.squared
## 77           TMA1                   0.6288
## 79           TMA2                   0.6263
## 4  log arith_mean                   0.6122
## 2          log CV                   0.5885
## 50      log invCV                   0.5859
## 26     log CV_inv                   0.5752

This is consistent with a previous study. We see that the TMA is a good predictor only for some specific instances. Overall, the best predictors are TMA1 and TMA2, followed closely by log arith_mean, log CV, log invCV and log CV_inv.

Let's see if the result is the same with the Spearman correlation:

correlation <- NULL
for (meas in colnames(LPT_unrelated_perf_TMA)[-ncol(LPT_unrelated_perf_TMA)]) {
    correlation <- rbind(correlation, data.frame(meas = meas, cor = cor(LPT_unrelated_perf_TMA$ratio, 
        LPT_unrelated_perf_TMA[, meas], method = "spearman")))
    correlation <- rbind(correlation, data.frame(meas = paste("log", meas), 
        cor = cor(LPT_unrelated_perf_TMA$ratio, log(abs(LPT_unrelated_perf_TMA[, 
            meas])), method = "spearman")))
}
head(correlation[order(abs(correlation[, 2]), decreasing = TRUE), ], n = 10)
##                  meas     cor
## 37     mean_ratio_inv -0.8363
## 38 log mean_ratio_inv -0.8363
## 29      geom_mean_inv  0.8276
## 30  log geom_mean_inv  0.8276
## 77               TMA1  0.8259
## 78           log TMA1  0.8259
## 25             CV_inv -0.8250
## 26         log CV_inv -0.8250
## 79               TMA2  0.8240
## 80           log TMA2  0.8240

There are a lot of large correlations. Let's investigate:

plot.ecdf(abs(correlation$cor))

plot of chunk unnamed-chunk-6

Most of the measures have strong correlations. This is encouraging: it means that there is indeed some association with the performance of HLPT.
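For instance, the proportion of measures with an absolute correlation above 0.5 (an arbitrary threshold) can be checked directly:

mean(abs(correlation$cor) > 0.5, na.rm = TRUE)  # share of measures with |cor| > 0.5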

Limits of TMA

We started this analysis by stating that the TMA has some limits that alternative measures do not have. Let's confirm this. First, we will vary the CV of similar matrices and see the effect on the TMA. Then, we will change the mean and the dimension.

n <- 100
m <- 30
mu <- 1
rhoR <- 0.5
rhoC <- 0.5
measures_TMA_CV_combi <- NULL
for (i in 1:100) {
    CVparam <- runif(1)
    Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR, 
        rhoC)
    SV <- svd(Z)$d
    measures_TMA_CV_combi <- rbind(measures_TMA_CV_combi, data.frame(CV_param = CVparam, 
        CV = CV(svd(Z)$d), CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z), 
        mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~CV_param + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_CV_combi)

plot of chunk TMA_limits_CV

We see a clear correlation between the CV parameter and both TMA variants. In fact, all the SV-based measures are sensitive to the CV of the costs (whereas the correlations are not). We should either find another measure (but the correlations might just be what we need here) or find another limitation of the TMA. Let's see whether the mean has any effect:

n <- 100
m <- 30
CVparam <- 0.3
rhoR <- 0.5
rhoC <- 0.5
measures_TMA_mu_combi <- NULL
for (i in 1:100) {
    mu <- runif(1)
    Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR, 
        rhoC)
    SV <- svd(Z)$d
    measures_TMA_mu_combi <- rbind(measures_TMA_mu_combi, data.frame(mu = mu, 
        CV = CV(svd(Z)$d), CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z), 
        mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~mu + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_mu_combi)

plot of chunk TMA_limits_mu

All measures are insensitive to the mean. This time, both TMA variants are correlated with the CV of the SV of the inverse of the costs.

And now for the dimension:

n <- 100
mu <- 1
CVparam <- 0.3
rhoR <- 0.5
rhoC <- 0.5
measures_TMA_m_combi <- NULL
for (i in 1:100) {
    m <- runif(1, 30, 300)
    Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR, 
        rhoC)
    SV <- svd(Z)$d
    measures_TMA_m_combi <- rbind(measures_TMA_m_combi, data.frame(m = m, CV = CV(svd(Z)$d), 
        CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z), mean_cor_col = mean_cor_col(Z), 
        TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~m + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_m_combi, 
    log = "x")

plot of chunk TMA_limits_m

Although the CV is sensitive to the dimension, both TMA variants show even more variation. However, this is not enough: a better measure than the TMA should show much less variation.

At this point, we have two choices. Either we keep the correlation with the following story: its value has an effect on the performance of several heuristics, and it is not sensitive to unrelated variations in the instances, unlike the TMA. This is good, but we could probably do better by acknowledging that the TMA captures something else about the instances (through the singular values) and by proposing another measure that is at least not sensitive to the dimension, and maybe not to the CV either.

Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 0.3, 0.5, 0.5)
Z.svd <- svd(Z)$d
plot.ecdf(Z.svd, log = "x", xlim = c(min(Z.svd), max(Z.svd)))
plot.ecdf(svd(Z[, 1:30])$d, col = 2, add = TRUE)
plot.ecdf(svd(Z[, 1:10])$d, col = 3, add = TRUE)

plot of chunk unnamed-chunk-7

Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 1, 0.5, 0.5)
Z.svd <- svd(Z)$d
plot.ecdf(Z.svd, log = "x", xlim = c(min(Z.svd), max(Z.svd)))
Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 0.3, 0.5, 0.5)
plot.ecdf(svd(Z)$d, col = 2, add = TRUE)
Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 0.1, 0.5, 0.5)
plot.ecdf(svd(Z)$d, col = 3, add = TRUE)

plot of chunk unnamed-chunk-8

It seems difficult to find a better SV-based measure because the SV are heavily influenced by the CV of the costs and by the dimension of the matrix. We could argue that the CV is at least simpler to compute and equivalent. For this, it would be necessary to clearly show that the normalization of the TMA is useless:

n <- 100
m <- 30
CVparam <- 0.3
measures_TMA_V_combi <- NULL
for (i in 1:100) {
    V <- runif(1)
    task <- rgamma_cost(n, 1, V)
    proc <- rgamma_cost(m, 1, V)
    Z <- generate_heterogeneous_matrix_noise(task, proc, CVparam)
    SV <- svd(Z)$d
    measures_TMA_V_combi <- rbind(measures_TMA_V_combi, data.frame(V = V, CV = CV(svd(Z)$d), 
        CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z), mean_cor_col = mean_cor_col(Z), 
        TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~V + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_V_combi)

plot of chunk TMA_limits_norm

The TMA is indeed independent from the heterogeneity. This is at least one of its qualities. In an experiment where everything varies, it is not clear that the CV would be better.

Conclusion

The first conclusion is that the correlation and the TMA are not measuring the same thing. The TMA is sometimes quite relevant (actually, the TMA was assumed to be extremely good only because it performs well on one specific set of instances), and this is due to the singular values on which it is based. Some variants are at least as good predictors (like the CV for instance). However, in contrast to the correlation, the TMA and related measures are sensitive to the CV of the costs and to the dimension of the matrix. Thanks to its normalization mechanism, the TMA is independent from heterogeneity properties (contrarily to the correlation). As such, the TMA has several shortcomings:

- it is sensitive to the CV of the costs;
- it is sensitive to the dimension of the matrix.

But it also has several advantages:

- it is independent from the heterogeneity, thanks to its normalization;
- it has good predictive power on some specific sets of instances.

As future work, we could devise a measure that is simpler than the TMA but with the same strength (independence from heterogeneity), or even better (independence from the matrix dimension and the cost CV as well).

For now, we can include the CV of the SV of the costs and the CV of the SV of the inverse of the costs in future analyses because they seem to be better predictors than the TMA. It will also be necessary to assess the difference between the TMA and the correlations (for instance, by computing the correlation between those two on all instances of each set and by drawing tile-plots with the TMA).

This study really helps me get a better understanding of SV-related measures, even though dealing with such a huge set of measures is unsatisfactory. What is clear is that improving on the TMA with a better SV-related measure is challenging.