The TMA is not robust to the CV of a matrix and probably not to its dimensions. Previous reports showed that the CV of the singular values is not a very good predictor of the performance of HLPT. One suggestion was to use the ratio between the first singular value and the second to get something closer to the TMA. A list of other potentially interesting measures was also proposed.
We want to investigate the limits of the TMA and explore alternative measures. The objective is to classify possible measures and to select the most interesting ones.
Let's derive a list of possible measures based on the singular values (SV) of the cost matrix: their CV, arithmetic mean, geometric mean and maximum; the ratio and the difference between the first and the second SV; and the maximum, mean and CV of both the consecutive ratios and the consecutive differences. That makes 12 measures. We may also compute the same statistics on the SV of the element-wise inverted matrix and on the inverse of the SV, for a total of 36 measures.
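All of these statistics share the same structure, so a small helper could factor them out (a sketch only; sv_stats is a hypothetical name, and CV() is the coefficient-of-variation helper, presumably sd over mean, used in the code below):
sv_stats <- function(SV, suffix = "") {
    # 12 base statistics for one vector of values (the SV of Z, the SV of 1/Z,
    # or 1/SV). The optional suffix mimics the "_inv" column names below; the
    # "inv"-prefixed columns apply the same statistics to 1/SV.
    ratios <- SV[-length(SV)]/SV[-1]
    diffs <- SV[-length(SV)] - SV[-1]
    stats <- data.frame(CV = CV(SV), arith_mean = mean(SV), geom_mean = exp(mean(log(SV))),
        max = max(SV), first_ratio = SV[1]/SV[2], max_ratio = max(ratios),
        mean_ratio = mean(ratios), CV_ratio = CV(ratios), first_diff = SV[1] - SV[2],
        max_diff = max(diffs), mean_diff = mean(diffs), CV_diff = CV(diffs))
    names(stats) <- paste(names(stats), suffix, sep = "")
    stats
}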
Let's study those measures with the noise-based and the combination-based methods by trying to predict the performance of HLPT.
n <- 100
m <- 30
measures_noise <- NULL
for (i in 1:1000) {
rhoR <- runif(1)
rhoC <- runif(1)
Vmax <- runif(1)
Z <- generate_heterogeneous_matrix_noise_corr(n, m, rgamma_cost, rhoR, rhoC,
Vmax)
SV <- svd(Z)$d
if (any(!is.finite(1/Z)))
next
SV_inv <- svd(1/Z)$d
invSV <- 1/SV
if (!is.finite(var(SV_inv)))
next
measures_noise <- rbind(measures_noise, data.frame(CV = CV(SV), arith_mean = mean(SV),
geom_mean = exp(mean(log(SV))), max = max(SV), first_ratio = SV[1]/SV[2],
max_ratio = max(SV[-length(SV)]/SV[-1]), mean_ratio = mean(SV[-length(SV)]/SV[-1]),
CV_ratio = CV(SV[-length(SV)]/SV[-1]), first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] -
SV[-1]), mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] -
SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))),
max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]),
mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]),
first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] -
SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]),
CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV),
invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV),
invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]),
invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]),
invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] -
invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]),
invCV_diff = CV(invSV[-length(invSV)] - invSV[-1])))
}
Let's see if some of those measures are equivalent.
distance_noise <- 1 - abs(cor(measures_noise, method = "spearman"))
plot(hclust(as.dist(distance_noise)))
We can group them in eight categories:
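We can list the members of each group by cutting the dendrogram into eight clusters (a quick sketch with cutree; groups_noise is just a throwaway name):
# Cut the dendrogram into eight clusters and list the measures in each one.
groups_noise <- cutree(hclust(as.dist(distance_noise)), k = 8)
split(names(groups_noise), groups_noise)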
Let's see the values of the first groups on a scatter-plot.
pairs(~first_diff + max_diff + invmax_ratio + max + mean_diff + invCV + invmean_ratio +
max_inv + mean_diff_inv + first_diff_inv + max_diff_inv + invCV_diff, measures_noise,
cex = 0.1)
Several pairs of measures are equivalent: first_diff/max_diff and max/mean_diff. invCV and invmean_ratio are quite close too. Let's see some specific relations in log-scale.
pairs(~max_inv + mean_diff_inv + first_diff_inv + max_diff_inv, measures_noise,
cex = 0.1, log = "xy")
Two other equivalences: max_inv/mean_diff_inv and first_diff_inv/max_diff_inv. The four of them are also quite close to each other.
Let's continue with the next groups:
pairs(~invCV_ratio + invmax + invmean_diff + arith_mean_inv + arith_mean + geom_mean_inv +
max_ratio + CV_ratio + invfirst_diff + CV_diff + first_ratio + invfirst_ratio,
measures_noise, cex = 0.1)
Let's focus on some relations in log-scale:
pairs(~invmax + I(-invmean_diff) + max_ratio + CV_ratio + I(-invfirst_diff) +
CV_diff + first_ratio + invfirst_ratio, measures_noise, cex = 0.1, log = "xy")
Again, we see that invmax and invmean_diff are equivalent, and there is another equivalence: first_ratio/invfirst_ratio. Also, there are two strong associations: max_ratio/CV_ratio and first_ratio/max_ratio.
Finally, the last groups:
pairs(~invmax_diff + mean_ratio_inv + max_ratio_inv + CV_ratio_inv + first_ratio_inv +
CV_diff_inv + CV_inv + invarith_mean + mean_ratio + CV + geom_mean + invgeom_mean,
measures_noise, cex = 0.1)
Let's see some relations in more detail:
pairs(~mean_ratio_inv + max_ratio_inv + CV_ratio_inv + first_ratio_inv + CV_diff_inv +
invarith_mean + mean_ratio + CV + geom_mean + invgeom_mean, measures_noise,
cex = 0.1, log = "xy")
Some equivalences: mean_ratio_inv/max_ratio_inv/CV_ratio_inv and geom_mean/invgeom_mean.
Strong associations: max_ratio_inv/first_ratio_inv, first_ratio_inv/CV_diff_inv and mean_ratio/CV.
Let's summarize them:
It is indeed intuitive that the max and the mean_diff of a quantity are equivalent: the mean of the consecutive differences is proportional to their sum, which telescopes to the difference between the largest and the smallest values, and that difference is dominated by the largest one. Let's keep only the max.
We also see that the first difference is almost always the maximum difference. Let's keep only the first difference.
By construction, first_ratio and invfirst_ratio are reciprocals of each other (hence perfectly anti-correlated in rank), as are geom_mean and invgeom_mean (a quick check follows after this summary). Let's keep the former of each pair.
Other equivalences remain unexplained. Let's keep the remaining measures and see how they behave with the combination-based method.
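Two of the relations above are exact identities and can be verified directly on the table built earlier:
# invfirst_ratio is exactly 1/first_ratio, and invgeom_mean is exactly
# 1/geom_mean, hence their perfect (absolute) rank correlations.
all.equal(measures_noise$invfirst_ratio, 1/measures_noise$first_ratio)
all.equal(measures_noise$invgeom_mean, 1/measures_noise$geom_mean)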
n <- 100
m <- 30
measures_combi <- NULL
for (i in 1:1000) {
rhoR <- runif(1)
rhoC <- runif(1)
CVparam <- runif(1)
Z <- generate_matrix_corr_positive(n, m, rgamma_cost, 1, CVparam, rhoR,
rhoC)
SV <- svd(Z)$d
if (any(!is.finite(1/Z)))
next
SV_inv <- svd(1/Z)$d
invSV <- 1/SV
if (!is.finite(var(SV_inv)))
next
measures_combi <- rbind(measures_combi, data.frame(CV = CV(SV), arith_mean = mean(SV),
geom_mean = exp(mean(log(SV))), max = max(SV), first_ratio = SV[1]/SV[2],
max_ratio = max(SV[-length(SV)]/SV[-1]), mean_ratio = mean(SV[-length(SV)]/SV[-1]),
CV_ratio = CV(SV[-length(SV)]/SV[-1]), first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] -
SV[-1]), mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] -
SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))),
max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]),
mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]),
first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] -
SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]),
CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV),
invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV),
invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]),
invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]),
invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] -
invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]),
invCV_diff = CV(invSV[-length(invSV)] - invSV[-1])))
}
Let's see if some of those measures are equivalent.
distance_combi <- 1 - abs(cor(measures_combi, method = "spearman"))
plot(hclust(as.dist(distance_combi)))
The dendrogram looks quite different, but the two distance matrices are quite close (around 80% of the values differ by less than 0.2):
plot.ecdf(as.vector(abs(distance_noise - distance_combi)))
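This fraction can be checked directly:
# Proportion of entries that differ by less than 0.2 between the two
# distance matrices (the diagonal is included, which slightly inflates it).
mean(abs(distance_noise - distance_combi) < 0.2)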
Now that we explored the measures a bit, let's see which is the best predictor.
Let's try to predict the performance of HLPT relative to some other good heuristics.
# Reference makespan: the best value achieved by LPT and two other good heuristics.
best_cmax <- function(costs) {
min(LPT_unrelated(costs, min), balance_sufferage(costs), balance_EFT(costs))
}
n <- 100
m <- 30
LPT_unrelated_perf_noise <- NULL
for (i in 1:100) {
rhoR <- runif(1)
rhoC <- runif(1)
Vmax <- runif(1)
Z <- generate_heterogeneous_matrix_noise_corr(n, m, rgamma_cost, rhoR, rhoC,
Vmax)
SV <- svd(Z)$d
if (any(!is.finite(1/Z)))
next
SV_inv <- svd(1/Z)$d
invSV <- 1/SV
if (!is.finite(var(SV_inv)))
next
cmax <- LPT_unrelated(Z, func = min)
LPT_unrelated_perf_noise <- rbind(LPT_unrelated_perf_noise, data.frame(CV = CV(SV),
arith_mean = mean(SV), geom_mean = exp(mean(log(SV))), max = max(SV),
first_ratio = SV[1]/SV[2], max_ratio = max(SV[-length(SV)]/SV[-1]),
mean_ratio = mean(SV[-length(SV)]/SV[-1]), CV_ratio = CV(SV[-length(SV)]/SV[-1]),
first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - SV[-1]),
mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] -
SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))),
max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]),
mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]),
first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] -
SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]),
CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV),
invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV),
invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]),
invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]),
invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] -
invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]),
invCV_diff = CV(invSV[-length(invSV)] - invSV[-1]), mean_cor_row = mean_cor_row(Z),
mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z), ratio = max(1,
cmax/best_cmax(Z))))
}
Let's try to see the relation for each measure.
measure_index <- 1:(ncol(LPT_unrelated_perf_noise) - 1)
LPT_unrelated_perf_noise_long <- reshape(LPT_unrelated_perf_noise, varying = measure_index,
direction = "long", idvar = "matrix", v.names = "measure", timevar = "type",
times = colnames(LPT_unrelated_perf_noise)[measure_index])
library(ggplot2)
p <- ggplot(LPT_unrelated_perf_noise_long, aes(x = abs(measure), y = ratio))
p <- p + geom_point()
p <- p + facet_wrap(~type, nrow = 8, scales = "free")
print(p)
We see some relations for: CV, CV_inv, CV_ratio, CV_ratio_inv and mean_cor_col. And now in log-scale:
print(p + scale_x_log10())
And also for: first_ratio, first_ratio_inv, geom_mean, geom_mean_inv, invarith_mean, invfirst_diff, invfirst_ratio, mean_ratio, mean_ratio_inv, TMA1 and TMA2.
Let's fit the performance for each predictor.
fit <- NULL
for (meas in colnames(LPT_unrelated_perf_noise)[-ncol(LPT_unrelated_perf_noise)]) {
f <- lm(paste("ratio ~", meas), LPT_unrelated_perf_noise)
fit <- rbind(fit, data.frame(meas = meas, summary(f)$adj.r.squared))
f <- lm(paste("ratio ~ log(abs(", meas, "))", sep = ""), LPT_unrelated_perf_noise)
fit <- rbind(fit, data.frame(meas = paste("log", meas), summary(f)$adj.r.squared))
}
head(fit[order(fit[, 2], decreasing = TRUE), ])
## meas summary.f..adj.r.squared
## 75 mean_cor_col 0.2977
## 25 CV_inv 0.2946
## 26 log CV_inv 0.2630
## 76 log mean_cor_col 0.2504
## 78 log TMA1 0.2024
## 1 CV 0.1840
The best predictors here are CV_inv and mean_cor_col.
Now, let's see if the conclusions stand with another generation method.
n <- 100
m <- 30
LPT_unrelated_perf_combi <- NULL
for (i in 1:100) {
rhoR <- runif(1)
rhoC <- runif(1)
CVparam <- runif(1)
Z <- generate_matrix_corr_positive(n, m, rgamma_cost, 1, CVparam, rhoR,
rhoC)
SV <- svd(Z)$d
if (any(!is.finite(1/Z)))
next
SV_inv <- svd(1/Z)$d
invSV <- 1/SV
if (!is.finite(var(SV_inv)))
next
cmax <- LPT_unrelated(Z, func = min)
LPT_unrelated_perf_combi <- rbind(LPT_unrelated_perf_combi, data.frame(CV = CV(SV),
arith_mean = mean(SV), geom_mean = exp(mean(log(SV))), max = max(SV),
first_ratio = SV[1]/SV[2], max_ratio = max(SV[-length(SV)]/SV[-1]),
mean_ratio = mean(SV[-length(SV)]/SV[-1]), CV_ratio = CV(SV[-length(SV)]/SV[-1]),
first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - SV[-1]),
mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] -
SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))),
max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]),
mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]),
first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] -
SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]),
CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV),
invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV),
invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]),
invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]),
invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] -
invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]),
invCV_diff = CV(invSV[-length(invSV)] - invSV[-1]), mean_cor_row = mean_cor_row(Z),
mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z), ratio = max(1,
cmax/best_cmax(Z))))
}
Let's look at the relations again.
measure_index <- 1:(ncol(LPT_unrelated_perf_combi) - 1)
LPT_unrelated_perf_combi_long <- reshape(LPT_unrelated_perf_combi, varying = measure_index,
direction = "long", idvar = "matrix", v.names = "measure", timevar = "type",
times = colnames(LPT_unrelated_perf_combi)[measure_index])
library(ggplot2)
p <- ggplot(LPT_unrelated_perf_combi_long, aes(x = abs(measure), y = ratio))
p <- p + geom_point()
p <- p + facet_wrap(~type, nrow = 8, scales = "free")
p <- p + scale_x_log10()
print(p)
And we fit again the performance with each measure.
fit <- NULL
for (meas in colnames(LPT_unrelated_perf_combi)[-ncol(LPT_unrelated_perf_combi)]) {
f <- lm(paste("ratio ~", meas), LPT_unrelated_perf_combi)
fit <- rbind(fit, data.frame(meas = meas, summary(f)$adj.r.squared))
f <- lm(paste("ratio ~ log(abs(", meas, "))", sep = ""), LPT_unrelated_perf_combi)
fit <- rbind(fit, data.frame(meas = paste("log", meas), summary(f)$adj.r.squared))
}
head(fit[order(fit[, 2], decreasing = TRUE), ])
## meas summary.f..adj.r.squared
## 76 log mean_cor_col 0.2611
## 75 mean_cor_col 0.2572
## 25 CV_inv 0.1214
## 50 log invCV 0.1169
## 49 invCV 0.1159
## 61 invmean_ratio 0.1059
The column correlation still plays an important role.
This is strange because a previous analysis showed that the TMA had significant predictive power. Let's check this with the generation method that was used previously:
n <- 100
m <- 30
LPT_unrelated_perf_TMA <- NULL
for (i in 1:100) {
cvsv <- 10^runif(1, -2, 1)
Z <- generate_matrix_cvsv(n, m, rgamma_cost, cvsv, 1)
SV <- svd(Z)$d
if (any(!is.finite(1/Z)))
next
SV_inv <- svd(1/Z)$d
invSV <- 1/SV
if (!is.finite(var(SV_inv)))
next
cmax <- LPT_unrelated(Z, func = min)
LPT_unrelated_perf_TMA <- rbind(LPT_unrelated_perf_TMA, data.frame(CV = CV(SV),
arith_mean = mean(SV), geom_mean = exp(mean(log(SV))), max = max(SV),
first_ratio = SV[1]/SV[2], max_ratio = max(SV[-length(SV)]/SV[-1]),
mean_ratio = mean(SV[-length(SV)]/SV[-1]), CV_ratio = CV(SV[-length(SV)]/SV[-1]),
first_diff = SV[1] - SV[2], max_diff = max(SV[-length(SV)] - SV[-1]),
mean_diff = mean(SV[-length(SV)] - SV[-1]), CV_diff = CV(SV[-length(SV)] -
SV[-1]), CV_inv = CV(SV_inv), arith_mean_inv = mean(SV_inv), geom_mean_inv = exp(mean(log(SV_inv))),
max_inv = max(SV_inv), first_ratio_inv = SV_inv[1]/SV_inv[2], max_ratio_inv = max(SV_inv[-length(SV_inv)]/SV_inv[-1]),
mean_ratio_inv = mean(SV_inv[-length(SV_inv)]/SV_inv[-1]), CV_ratio_inv = CV(SV_inv[-length(SV_inv)]/SV_inv[-1]),
first_diff_inv = SV_inv[1] - SV_inv[2], max_diff_inv = max(SV_inv[-length(SV_inv)] -
SV_inv[-1]), mean_diff_inv = mean(SV_inv[-length(SV_inv)] - SV_inv[-1]),
CV_diff_inv = CV(SV_inv[-length(SV_inv)] - SV_inv[-1]), invCV = CV(invSV),
invarith_mean = mean(invSV), invgeom_mean = exp(mean(log(invSV))), invmax = max(invSV),
invfirst_ratio = invSV[1]/invSV[2], invmax_ratio = max(invSV[-length(invSV)]/invSV[-1]),
invmean_ratio = mean(invSV[-length(invSV)]/invSV[-1]), invCV_ratio = CV(invSV[-length(invSV)]/invSV[-1]),
invfirst_diff = invSV[1] - invSV[2], invmax_diff = max(invSV[-length(invSV)] -
invSV[-1]), invmean_diff = mean(invSV[-length(invSV)] - invSV[-1]),
invCV_diff = CV(invSV[-length(invSV)] - invSV[-1]), mean_cor_row = mean_cor_row(Z),
mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z), ratio = max(1,
cmax/best_cmax(Z))))
}
Let's look at the relations again.
measure_index <- 1:(ncol(LPT_unrelated_perf_TMA) - 1)
LPT_unrelated_perf_TMA_long <- reshape(LPT_unrelated_perf_TMA, varying = measure_index,
direction = "long", idvar = "matrix", v.names = "measure", timevar = "type",
times = colnames(LPT_unrelated_perf_TMA)[measure_index])
library(ggplot2)
p <- ggplot(LPT_unrelated_perf_TMA_long, aes(x = abs(measure), y = ratio))
p <- p + geom_point()
p <- p + facet_wrap(~type, nrow = 8, scales = "free")
p <- p + scale_x_log10()
print(p)
And we fit again the performance with each measure.
# Replace zeros with NA so that log(abs(measure)) in the fits below does not
# produce -Inf.
LPT_unrelated_perf_TMA_NA <- do.call(data.frame, lapply(LPT_unrelated_perf_TMA,
function(x) replace(x, x == 0, NA)))
fit <- NULL
for (meas in colnames(LPT_unrelated_perf_TMA_NA)[-ncol(LPT_unrelated_perf_TMA_NA)]) {
f <- lm(paste("ratio ~", meas), LPT_unrelated_perf_TMA_NA)
fit <- rbind(fit, data.frame(meas = meas, summary(f)$adj.r.squared))
f <- lm(paste("ratio ~ log(abs(", meas, "))", sep = ""), LPT_unrelated_perf_TMA_NA,
na.action = na.omit)
fit <- rbind(fit, data.frame(meas = paste("log", meas), summary(f)$adj.r.squared))
}
head(fit[order(fit[, 2], decreasing = TRUE), ])
## meas summary.f..adj.r.squared
## 77 TMA1 0.6288
## 79 TMA2 0.6263
## 4 log arith_mean 0.6122
## 2 log CV 0.5885
## 50 log invCV 0.5859
## 26 log CV_inv 0.5752
This is consistent with a previous study. We see that the TMA is a good predictor only for some specific instances. Overall, the best predictors are:
Let's see whether the result is the same with the Spearman correlation:
correlation <- NULL
for (meas in colnames(LPT_unrelated_perf_TMA)[-ncol(LPT_unrelated_perf_TMA)]) {
correlation <- rbind(correlation, data.frame(meas = meas, cor = cor(LPT_unrelated_perf_TMA$ratio,
LPT_unrelated_perf_TMA[, meas], method = "spearman")))
correlation <- rbind(correlation, data.frame(meas = paste("log", meas),
cor = cor(LPT_unrelated_perf_TMA$ratio, log(abs(LPT_unrelated_perf_TMA[,
meas])), method = "spearman")))
}
head(correlation[order(abs(correlation[, 2]), decreasing = TRUE), ], n = 10)
## meas cor
## 37 mean_ratio_inv -0.8363
## 38 log mean_ratio_inv -0.8363
## 29 geom_mean_inv 0.8276
## 30 log geom_mean_inv 0.8276
## 77 TMA1 0.8259
## 78 log TMA1 0.8259
## 25 CV_inv -0.8250
## 26 log CV_inv -0.8250
## 79 TMA2 0.8240
## 80 log TMA2 0.8240
There are a lot of large correlations. Let's investigate:
plot.ecdf(abs(correlation$cor))
Most of the measures have strong correlations. This is encouraging: it means that there is indeed some association with the performance of HLPT.
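This can be quantified on the correlation table built above, taking for instance 0.5 as an (arbitrary) threshold for a strong correlation:
# Share of measures whose absolute Spearman correlation with the ratio
# exceeds 0.5 (threshold chosen for illustration only).
mean(abs(correlation$cor) > 0.5, na.rm = TRUE)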
We started this analysis by stating that the TMA has some limits that alternative measures do not have. Let's confirm this. First, we will vary the CV of similar matrices and see the effect on the TMA. Then, we will change the mean and the dimensions.
n <- 100
m <- 30
mu <- 1
rhoR <- 0.5
rhoC <- 0.5
measures_TMA_CV_combi <- NULL
for (i in 1:100) {
CVparam <- runif(1)
Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR,
rhoC)
SV <- svd(Z)$d
measures_TMA_CV_combi <- rbind(measures_TMA_CV_combi, data.frame(CV_param = CVparam,
CV = CV(svd(Z)$d), CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z),
mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~CV_param + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_CV_combi)
We see a clear correlation between the CV and both TMA. However, they are all sensitive to the CV of the costs (even though the correlations are not). We should either find another measure (but the correlations might just be what we need here) or find other limitations of the TMA. Let's see if the mean has any effect:
n <- 100
m <- 30
CVparam <- 0.3
rhoR <- 0.5
rhoC <- 0.5
measures_TMA_mu_combi <- NULL
for (i in 1:100) {
mu <- runif(1)
Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR,
rhoC)
SV <- svd(Z)$d
measures_TMA_mu_combi <- rbind(measures_TMA_mu_combi, data.frame(mu = mu,
CV = CV(svd(Z)$d), CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z),
mean_cor_col = mean_cor_col(Z), TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~mu + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_mu_combi)
All measures are insensitive to the mean. This time, both TMA are correlated with the CV of the SV of the inverse of the costs.
And now for the dimension:
n <- 100
mu <- 1
CVparam <- 0.3
rhoR <- 0.5
rhoC <- 0.5
measures_TMA_m_combi <- NULL
for (i in 1:100) {
m <- runif(1, 30, 300)
Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR,
rhoC)
SV <- svd(Z)$d
measures_TMA_m_combi <- rbind(measures_TMA_m_combi, data.frame(m = m, CV = CV(svd(Z)$d),
CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z), mean_cor_col = mean_cor_col(Z),
TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~m + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_m_combi,
log = "x")
Although the CV is sensitive to the dimension, both TMA show even more variation. However, this is not enough: a better measure than the TMA should show much less variation.
At this point, we have two choices. The first is to keep the correlation and to tell the following story: its value has an effect on the performance of several heuristics and, unlike the TMA, it is not sensitive to unrelated variations in the instances. That is good, but we could probably do better. The second choice is to acknowledge that the TMA captures, through the singular values, something else about the instances, and to propose another measure that is at least not sensitive to the dimension and maybe also not to the CV.
Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 0.3, 0.5, 0.5)
Z.svd <- svd(Z)$d
plot.ecdf(Z.svd, log = "x", xlim = c(min(Z.svd), max(Z.svd)))
plot.ecdf(svd(Z[, 1:30])$d, col = 2, add = TRUE)
plot.ecdf(svd(Z[, 1:10])$d, col = 3, add = TRUE)
Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 1, 0.5, 0.5)
Z.svd <- svd(Z)$d
plot.ecdf(Z.svd, log = "x", xlim = c(min(Z.svd), max(Z.svd)))
Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 0.3, 0.5, 0.5)
plot.ecdf(svd(Z)$d, col = 2, add = TRUE)
Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, 0.1, 0.5, 0.5)
plot.ecdf(svd(Z)$d, col = 3, add = TRUE)
It seems difficult to find a better SV-based measure because the SV are heavily influenced by the CV of the costs and by the dimension of the matrix. We could argue that the CV is at least simpler to compute and equivalent. For that, it would be necessary to clearly show that the normalization in the TMA is useless:
n <- 100
m <- 30
CVparam <- 0.3
measures_TMA_V_combi <- NULL
for (i in 1:100) {
V <- runif(1)
task <- rgamma_cost(n, 1, V)
proc <- rgamma_cost(m, 1, V)
Z <- generate_heterogeneous_matrix_noise(task, proc, CVparam)
SV <- svd(Z)$d
measures_TMA_V_combi <- rbind(measures_TMA_V_combi, data.frame(V = V, CV = CV(svd(Z)$d),
CV_inv = CV(svd(1/Z)$d), mean_cor_row = mean_cor_row(Z), mean_cor_col = mean_cor_col(Z),
TMA1 = TMA1(Z), TMA2 = TMA2(Z)))
}
pairs(~V + CV + CV_inv + mean_cor_row + mean_cor_col + TMA1 + TMA2, measures_TMA_V_combi)
The TMA is indeed independent of the heterogeneity. This is at least one of its qualities. In an experiment where everything varies, it is not clear that the CV would be better.
The first conclusion is that the correlation and the TMA do not measure the same thing. The TMA is sometimes quite relevant (actually, the TMA was assumed to be extremely good only because it performs well on one specific set of instances) and this is due to the singular values on which it is based. Other SV-based variants are at least as good predictors (the CV, for instance). However, in contrast to the correlation, the TMA and related measures are sensitive to the CV of the costs and to the dimension of the matrix. Thanks to its normalization mechanism, the TMA is independent of the heterogeneity properties (contrary to the correlation). As such, the TMA has several drawbacks:
But it also has several advantages:
As future work, we could devise a measure that is simpler than the TMA but with the same strength (independence from heterogeneity) or even better than it (independence from matrix dimension and cost CV).
For now, we can include the CV of the SV of the costs and the CV of the SV of the inverse of the costs in future analyses because they seem to be better predictors than the TMA. It will also be necessary to assess the difference between the TMA and the correlations (for instance, by computing the correlation between the two on all instances of each set, or with tile plots of the TMA).
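As a first step in that direction, the association between the TMA and the column correlation can already be computed on one of the instance sets generated above:
# Spearman correlation between TMA1 and the mean column correlation on the
# combination-based instances generated earlier.
cor(LPT_unrelated_perf_combi$TMA1, LPT_unrelated_perf_combi$mean_cor_col,
    method = "spearman")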
This study really helped me get a better understanding of SV-related measures, even though dealing with such a huge set of measures is unsatisfactory. What is clear is that improving on the TMA with a better SV-based measure is challenging.