
In the previous analysis and the recent discussions, it appeared that we have at least some investigation to perform on the TMA. Let’s write down these ideas and see what analysis we can already perform.

First, the TMA consists of three parts: it inverts the costs; it normalizes the matrices; and it computes the ratio of the first SV (singular value) over the mean of all the other SV (edit: it is actually the inverse, i.e., the mean of the other SV over the first). This raises several questions that we investigate below.

We also remarked that the slopes of the SV are the same when the CV changes. A good measure could be related to the average difference between consecutive SV.
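
As a reference for later sections, here is a minimal sketch of what such a measure could look like (the function name and the normalization by the mean SV are illustrative assumptions, not taken from the existing code):

# Illustrative sketch: average gap between consecutive SV, normalized by the
# mean SV so that the measure does not depend on the overall scale
SV_gap_meas <- function(ETC) {
  sv <- sort(svd(ETC)$d, decreasing = TRUE)
  mean(abs(diff(sv))) / mean(sv)
}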

Finally, we have several ideas to investigate, which we address in the following sections.

Effect of TMA cost inversion

Our hypothesis is that the cost inversion has no effect. Let’s design a quick experiment to validate this.

We will compare the relation between the original TMA and HLPT performance with the relation between a TMA using non-inverted costs and HLPT performance. Let’s consider a method that is favorable to the original TMA:

# Best (lowest) makespan among several heuristics, used as the reference
# for the performance ratio
best_cmax <- function(costs) {
  min(LPT_unrelated(costs, min), balance_sufferage(costs), balance_EFT(costs))
}

n <- 100
m <- 30
TMA_inversion <- NULL
for (i in 1:100) {
  cvsv <- 10^runif(1, -2, 1)
  Z <- generate_matrix_cvsv(n, m, rgamma_cost, cvsv, 1)
  cmax <- LPT_unrelated(Z, func=min)
  TMA_inversion <- rbind(TMA_inversion,
                         data.frame(cvsv=cvsv, CVSV_meas=CV_meas(svd(Z)$d),
                                    ratio=max(1, cmax/best_cmax(Z)),
                                    TMA1=TMA1(Z), TMA2=TMA2(Z),
                                    TMA1_inv=TMA1(1/Z), TMA2_inv=TMA2(1/Z)))
}
TMA_inversion_long <- gather(TMA_inversion, measure, value, contains("TMA"))
p <- ggplot(TMA_inversion_long, aes(value, ratio))
p <- p + geom_point()
p <- p + geom_smooth()
p <- p + facet_wrap(~ measure, scales = "free")
print(p)

The original TMA has a more uniform distribution, but the relation between the measure and the HLPT performance seems similar in both cases. Let’s confirm this with some correlations:

tbl_df(TMA_inversion_long) %>%
  group_by(measure) %>%
  summarise(cor(ratio, value, method = "spearman"))
## Source: local data frame [4 x 2]
## 
##    measure cor(ratio, value, method = "spearman")
## 1     TMA1                              0.7760158
## 2     TMA2                              0.7784113
## 3 TMA1_inv                              0.7437886
## 4 TMA2_inv                              0.7468565

The relation when inverting the costs is a bit stronger in this setting.

It must be noted that this is only with a specific kind of matrix generation. Let’s try this with another method:

n <- 100
m <- 30
mu <- 1
TMA_inversion_combi <- NULL
for (i in 1:100) {
  rhoR <- runif(1)
  rhoC <- runif(1)
  CVparam <- runif(1)
  Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR, rhoC)
  cmax <- LPT_unrelated(Z, func=min)
  TMA_inversion_combi <- rbind(TMA_inversion_combi,
                               data.frame(CV=CVparam, CVSV_meas=CV_meas(svd(Z)$d),
                                          ratio=max(1, cmax/best_cmax(Z)),
                                          TMA1=TMA1(Z), TMA2=TMA2(Z),
                                          TMA1_inv=TMA1(1/Z), TMA2_inv=TMA2(1/Z)))
}
TMA_inversion_combi_long <- gather(TMA_inversion_combi, measure, value, contains("TMA"))
tbl_df(TMA_inversion_combi_long) %>%
  group_by(measure) %>%
  summarise(cor(ratio, value, method = "spearman"))
## Source: local data frame [4 x 2]
## 
##    measure cor(ratio, value, method = "spearman")
## 1     TMA1                           -0.100099672
## 2     TMA2                           -0.102261052
## 3 TMA1_inv                           -0.004081265
## 4 TMA2_inv                           -0.149920088

We can conclude that the inversion is not useful and that TMA1 with the non-inverted costs is the worst.

Effect of TMA normalization

Let’s test whether the normalization plays an important role. We will only use TMA2, which has the most complete normalization mechanism.

# TMA2 variant that keeps the cost inversion but skips the normalization step
TMA2_without_norm <- function(ETC) {
  ECS <- 1/ETC
  sv <- sort(svd(ECS)$d, decreasing=TRUE)
  mean(sv[-1])/sv[1]
}

Let’s perform the test on a favorable setting:

n <- 100
m <- 30
TMA_norm <- NULL
for (i in 1:100) {
  cvsv <- 10^runif(1, -2, 1)
  Z <- generate_matrix_cvsv(n, m, rgamma_cost, cvsv, 1)
  cmax <- LPT_unrelated(Z, func=min)
  TMA_norm <- rbind(TMA_norm,
                    data.frame(cvsv=cvsv, CVSV_meas=CV_meas(svd(Z)$d),
                               ratio=max(1, cmax/best_cmax(Z)),
                               TMA2=TMA2(Z), TMA2_inv=TMA2(1/Z),
                               TMA2_without_norm=TMA2_without_norm(Z),
                               TMA2_without_norm_inv=TMA2_without_norm(1/Z)))
}
TMA_norm_long <- gather(TMA_norm, measure, value, contains("TMA"))
p <- ggplot(TMA_norm_long, aes(value, ratio))
p <- p + geom_point()
p <- p + geom_smooth()
p <- p + facet_wrap(~ measure, scales = "free")
print(p)

tbl_df(TMA_norm_long) %>%
  group_by(measure) %>%
  summarise(cor(ratio, value, method = "spearman"))
## Source: local data frame [4 x 2]
## 
##                 measure cor(ratio, value, method = "spearman")
## 1                  TMA2                              0.8219547
## 2              TMA2_inv                              0.7699373
## 3     TMA2_without_norm                              0.8260374
## 4 TMA2_without_norm_inv                              0.7602986

The normalization has no visible effect. Here, considering the non-inverted costs is still slightly hurtful.

A conclusion is that the inversion and the normalization have little impact on the predictive power. However, the normalization clearly makes the measure independent of the heterogeneity, as shown by the previous study.

Let’s again check that this result is stable with another generation method.

n <- 100
m <- 30
mu <- 1
TMA_norm_combi <- NULL
for (i in 1:100) {
  rhoR <- runif(1)
  rhoC <- runif(1)
  CVparam <- runif(1)
  Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CVparam, rhoR, rhoC)
  cmax <- LPT_unrelated(Z, func=min)
  TMA_norm_combi <- rbind(TMA_norm_combi,
                          data.frame(CV=CVparam, CVSV_meas=CV_meas(svd(Z)$d),
                                     ratio=max(1, cmax/best_cmax(Z)),
                                     TMA2=TMA2(Z), TMA2_inv=TMA2(1/Z),
                                     TMA2_without_norm=TMA2_without_norm(Z),
                                     TMA2_without_norm_inv=TMA2_without_norm(1/Z)))
}
TMA_norm_combi_long <- gather(TMA_norm_combi, measure, value, contains("TMA"))
tbl_df(TMA_norm_combi_long) %>%
  group_by(measure) %>%
  summarise(cor(ratio, value, method = "spearman"))
## Source: local data frame [4 x 2]
## 
##                 measure cor(ratio, value, method = "spearman")
## 1                  TMA2                             -0.3510468
## 2              TMA2_inv                             -0.3794621
## 3     TMA2_without_norm                             -0.1067884
## 4 TMA2_without_norm_inv                             -0.2506509

This time, we could say that the relation is improved by the normalization. The correlations are however so low that the difference may be insignificant.

Misc observations

First, the ratio between the CVs must be discarded, because it tends to 1 as the dimension tends to infinity.

Also, computing the proportion of the SV (the ratio between each SV and the sum of the SV) would not be interesting because the dimension must remain constant (otherwise, the proportions will be much smaller for large matrices). But, for matrices where only the global CV varies, it could be interesting.
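
For concreteness, a minimal sketch of this proportion (the helper name is illustrative, not part of the existing code):

# Illustrative sketch: proportion of each SV relative to the sum of all SV
prop_SV <- function(ETC) {
  sv <- svd(ETC)$d
  sv / sum(sv)
}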

Measure insensitive to the dimension with square matrices

Before analyzing the last step of the TMA, let’s find alternative measures that are more relevant.

Let’s start by plotting the distribution of the SV for square matrices generated with two methods when the size changes.

rhoR <- 0.5
rhoC <- 0.5
mu <- 1
CV <- 0.3
Vmax <- 0.3
SV_square <- NULL
for (n in as.integer(10^seq(log10(10), log10(1000), length.out = 100))) {
  Z <- generate_matrix_corr_positive(n, n, rgamma_cost, mu, CV, rhoR, rhoC)
  SV_square <- rbind(SV_square, data.frame(n = n, SV = svd(Z)$d, method = "combi"))
  Z <- generate_heterogeneous_matrix_noise_corr(n, n, rgamma_cost, rhoR, rhoC, Vmax)
  SV_square <- rbind(SV_square, data.frame(n = n, SV = svd(Z)$d, method = "noise"))
}
p <- ggplot(SV_square, aes(x = SV, group = n, color = factor(n)))
p <- p + stat_ecdf()
p <- p + coord_cartesian(xlim = c(0, 10))
p <- p + facet_wrap(~ method, nrow = 2)
p <- p + theme(legend.position = "none")
print(p)

The shapes are quite similar for each method but the slope changes.

Let’s see if there is some relation between the median SV. First, for each size, we plot the ratio between the larger and the smaller of the two methods’ median SV:

SV_square_med <- tbl_df(SV_square) %>%
  group_by(n, method) %>%
  summarise(med = median(SV)) %>%
  group_by(n) %>%
  mutate(ratio_med = max(med)/min(med))
p <- ggplot(SV_square_med, aes(x = n, y = ratio_med))
p <- p + geom_point()
p <- p + geom_line()
p <- p + scale_x_log10()
print(p)

This ratio is constant, which suggests that when we change the generation method, the SV are similar up to a constant factor, independently of the dimension. Let’s plot the medians:

SV_square_med <- tbl_df(SV_square) %>%
  group_by(n, method) %>%
  summarise(med = median(SV))
p <- ggplot(SV_square_med, aes(x = n, y = med, group = method, color = method))
p <- p + geom_point()
p <- p + geom_line()
p <- p + scale_x_log10()
print(p)

This plot suggests that the noise-based method has more noise/uncertainty; the CV may be too high. Let’s perform a linear regression assuming that the median follows a power law in n (a linear fit in log-log scale):

SV_square_med <- tbl_df(SV_square) %>%
  group_by(n, method) %>%
  summarise(med = median(SV)) %>%
  group_by(method) %>%
  do(fit = lm(log(med) ~ log(n), data = .))
tidy(SV_square_med, fit)
## Source: local data frame [4 x 6]
## Groups: method
## 
##   method        term   estimate   std.error  statistic       p.value
## 1  combi (Intercept) -1.9614590 0.015798800 -124.15240 3.917065e-105
## 2  combi      log(n)  0.5016823 0.003236254  155.01942 4.654767e-114
## 3  noise (Intercept) -1.5651181 0.029370059  -53.28958  1.695680e-71
## 4  noise      log(n)  0.4979417 0.006016215   82.76660  6.622278e-89
glance(SV_square_med, fit)
## Source: local data frame [2 x 12]
## Groups: method
## 
##   method r.squared adj.r.squared      sigma statistic       p.value df
## 1  combi 0.9961449     0.9961035 0.04081875 24031.021 4.654767e-114  2
## 2  noise 0.9866058     0.9864618 0.07588229  6850.311  6.622278e-89  2
## Variables not shown: logLik (dbl), AIC (dbl), BIC (dbl), deviance (dbl),
##   df.residual (int)
print(p + stat_function(fun = function(x) {
  exp(tidy(SV_square_med, fit)$estimate[1]) *
    (10^x)^(tidy(SV_square_med, fit)$estimate[2]) }))

The fit is excellent: the median SV is a simple constant (exp(-1.96) ≈ 0.1406531) multiplied by the square root of the dimension (exponent ≈ 0.5). This suggests that any measure on the singular values should be divided by the square root of the dimension.

Let’s re-plot the previous SV, but divided by the square root of the dimension this time:

p <- ggplot(SV_square, aes(x = SV/sqrt(n), group = n, color = factor(n)))
p <- p + stat_ecdf(alpha = 0.3)
p <- p + coord_cartesian(xlim = c(0, 0.5))
p <- p + facet_wrap(~ method, nrow = 2)
p <- p + theme(legend.position = "none")
print(p)

All the ECDFs are nearly superimposed and the SV distributions are all close.

Let’s study how the TMA varies with square matrices:

rhoR <- 0.5
rhoC <- 0.5
mu <- 1
CV <- 0.3
SV_square_TMA <- NULL
for (n in as.integer(10^seq(log10(10), log10(1000), length.out = 10))) {
  Z <- generate_matrix_corr_positive(n, n, rgamma_cost, mu, CV, rhoR, rhoC)
  SV_square_TMA <- rbind(SV_square_TMA, data.frame(n = n, TMA = TMA2(Z), method = "combi"))
  Z <- generate_heterogeneous_matrix_noise_corr(n, n, rgamma_cost, rhoR, rhoC, Vmax)
  SV_square_TMA <- rbind(SV_square_TMA, data.frame(n = n, TMA = TMA2(Z), method = "noise"))
}
p <- ggplot(SV_square_TMA, aes(x = n, y = TMA, group = method, color = method))
p <- p + geom_point()
p <- p + geom_line()
p <- p + scale_x_log10()
print(p)

The TMA is sensitive to the dimension of a square matrix. This was expected because it was already shown to be sensitive to the variation of a single dimension.

Measure insensitive to the dimension with non-square matrices

Let’s perform the same analysis but with only one dimension that changes:

m <- 100
rhoR <- 0.5
rhoC <- 0.5
mu <- 1
CV <- 0.3
Vmax <- 0.3
SV_non_square <- NULL
for (n in as.integer(10^seq(log10(10), log10(1000), length.out = 100))) {
  Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CV, rhoR, rhoC)
  SV_non_square <- rbind(SV_non_square, data.frame(n = n, SV = svd(Z)$d, method = "combi"))
  Z <- generate_heterogeneous_matrix_noise_corr(n, n, rgamma_cost, rhoR, rhoC, Vmax)
  SV_non_square <- rbind(SV_non_square, data.frame(n = n, SV = svd(Z)$d, method = "noise"))
}
p <- ggplot(SV_non_square, aes(x = SV, group = n, color = factor(n)))
p <- p + stat_ecdf()
p <- p + coord_cartesian(xlim = c(0, 10))
p <- p + facet_wrap(~ method, nrow = 2)
p <- p + theme(legend.position = "none")
print(p)

Something is strange for the combination-based method: there is an inflection point, which does not appear for the noise-based method. Let’s check directly whether the noise-based method just needs the median to be divided by the square root of the dimension that varies to remain constant.

SV_square_med <- tbl_df(SV_non_square) %>%
  group_by(n, method) %>%
  summarise(med = median(SV))
p <- ggplot(SV_square_med, aes(x = n, y = med/sqrt(n), group = method, color = method))
p <- p + geom_point()
p <- p + geom_line()
p <- p + scale_x_log10()
print(p)

This is the case, but it does not work as expected with the combination-based method. Let’s see what happens when both dimensions vary.

rhoR <- 0.5
rhoC <- 0.5
mu <- 1
CV <- 0.3
Vmax <- 0.3
SV_non_square_tile <- NULL
for (n in as.integer(10^seq(log10(10), log10(1000), length.out = 10)))
  for (m in as.integer(10^seq(log10(10), log10(1000), length.out = 10))) {
    Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CV, rhoR, rhoC)
    SV_non_square_tile <- rbind(SV_non_square_tile,
                                data.frame(n = n, m = m, SV = svd(Z)$d, method = "combi"))
    Z <- generate_heterogeneous_matrix_noise_corr(n, m, rgamma_cost, rhoR, rhoC, Vmax)
    SV_non_square_tile <- rbind(SV_non_square_tile,
                                data.frame(n = n, m = m, SV = svd(Z)$d, method = "noise"))
  }

We will directly look for the normalization that varies the least by dividing the median SV by combinations of n and m:

SV_non_square_tile_med <- tbl_df(SV_non_square_tile) %>%
  group_by(n, m, method) %>%
  summarise(med = median(SV), med_norm1 = median(SV / sqrt(min(n, m))),
            med_norm2 = median(SV / sqrt(max(n, m))))
SV_non_square_tile_med_long <- gather(SV_non_square_tile_med, type, median, contains("med"))
p <- ggplot(SV_non_square_tile_med_long, aes(x = median, group = type, color = type))
p <- p + stat_ecdf()
p <- p + scale_x_log10()
p <- p + annotation_logticks()
p <- p + facet_wrap(~ method)
print(p)

The best solution is to divide the SV by the square root of the maximum dimension. Let’s see if there is a pattern:

p <- ggplot(SV_non_square_tile_med, aes(x = n, y = m, fill = med_norm2))
p <- p + geom_tile()
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p <- p + scale_fill_gradientn(colours = c(rainbow(100)), trans = "log",
                              breaks = 10^seq(log10(0.1), log10(3), length.out = 10))
p <- p + facet_wrap(~ method, nrow = 2, scales = "free")
print(p)

With this normalization, square and non-square matrices still behave differently, but it is the best solution we have.

Finally, let’s see if we could complete a non-square matrix to make it square.

n <- 1000
rhoR <- 0.5
rhoC <- 0.5
mu <- 1
CV <- 0.3
Z <- generate_matrix_corr_positive(n, n, rgamma_cost, mu, CV, rhoR, rhoC)
Z_partial <- Z[,1:100]
Z_reconstruct <- NULL
for (i in 1:10)
  Z_reconstruct <- cbind(Z_reconstruct, Z_partial)
Z_reconstruct.svd <- data.frame(sv = svd(Z_reconstruct)$d)
p <- ggplot(Z_reconstruct.svd, aes(x = sv))
p <- p + stat_ecdf()
p <- p + scale_x_log10()
print(p)

This idea is bad because the duplicated columns add no new dimension (the rank does not increase). It would be necessary to add some noise, but that is probably beyond the purpose of any SV-related measure.

Measure insensitive to the variation

In the previous report, we remarked that the slopes of the SV were similar in log scale when the CV changes. This suggests that the average difference between consecutive SV divided by the mean of the SV should remain constant. But this also suggests that the CV of the SV should remain constant, which was not the case. This could be due to the effect of extreme values. Let’s try a robust measure, the quartile coefficient of dispersion:

qcd <- NULL
for (i in 1:100)
  for (CV in c(0.01, 0.03, 0.1, 0.3, 1)) {
    Z <- generate_matrix_corr_positive(100, 100, rgamma_cost, 1, CV, 0.5, 0.5)
    qcd <- rbind(qcd, data.frame(CV = CV, qcd = qcd_meas(svd(Z)$d)))
  }
tbl_df(qcd) %>%
  group_by(CV) %>%
  summarise(mean(qcd))
## Source: local data frame [5 x 2]
## 
##     CV mean(qcd)
## 1 0.01  1.079712
## 2 0.03  1.077865
## 3 0.10  1.080770
## 4 0.30  1.092916
## 5 1.00  1.171316

This is quite close and the slight difference when the CV is 1 could be related to too much noise.

Last step of TMA

Now, we will consider the following simplified (and improved) TMA.

# Simplified TMA: ratio of the mean of the remaining SV over the first SV,
# without cost inversion or normalization
TMA_simple <- function(ETC) {
  sv <- sort(svd(ETC)$d, decreasing = TRUE)
  mean(sv[-1])/sv[1]
}

The objective is to compare it to other alternatives. The first alternative discards the special treatment of the first SV and takes the mean over all the SV. In the same spirit, we also add the median to make sure that we are not impacted by extreme values.

Let’s start by comparing the predictive power of the simplified TMA, the mean of the SV, and the median of the SV. We will use several generation methods and focus on the correlation between each measure and HLPT performance. Let’s also add a measure that computes the sum of the SV that lie between the first and third quartiles.

n <- 100
m <- 30
mu <- 1
CV <- 0.3
Vmax <- 0.3
TMA_alternative <- NULL
for (i in 1:100) {
  rhoR <- runif(1)
  rhoC <- runif(1)
  cvsv <- 10^runif(1, -2, 1)
  TMA <- runif(1)
  for (j in 1:4) {
    if (j == 1)
      Z <- generate_matrix_cvsv(n, m, rgamma_cost, cvsv, mu)
    else if (j == 2)
      Z <- generate_matrix_corr_positive(n, m, rgamma_cost, mu, CV, rhoR, rhoC)
    else if (j == 3)
      Z <- generate_heterogeneous_matrix_noise_corr(n, m, rgamma_cost, rhoR, rhoC, Vmax)
    else
      Z <- generate_matrix_TMA(n, m, TMA, mu)
    Z.sv <- svd(Z)$d
    sumQuartileSV <- sum(Z.sv[Z.sv > quantile(Z.sv, 0.25) & Z.sv < quantile(Z.sv, 0.75)])
    cmax <- LPT_unrelated(Z, func = min)
    TMA_alternative <- rbind(TMA_alternative,
                             data.frame(type = j,
                                        ratio = max(1, cmax/best_cmax(Z)),
                                        TMA_simple = TMA_simple(Z),
                                        meanSV = mean(Z.sv),
                                        medSV = median(Z.sv),
                                        sumQuartileSV = sumQuartileSV,
                                        CVSV = CV_meas(Z.sv),
                                        QCDSV = qcd_meas(Z.sv)))
  }
}
# Label the instance types; the vector is recycled over the rows, which were
# added in order j = 1..4 at each iteration
TMA_alternative$type <- c("cvsv", "combi", "noise", "TMA")
TMA_alternative <- tbl_df(TMA_alternative)

The following correlations show that all the proposed measures behave similarly (the correlations are weaker on the TMA-generated instances for every measure):

TMA_alternative_long <- gather(TMA_alternative, measure, value, 3:(ncol(TMA_alternative) - 2))
TMA_alternative_long %>%
  group_by(type, measure) %>%
  summarise(corr = abs(cor(value, ratio, method = "spearman"))) %>%
  group_by(type) %>%
  arrange(corr)
## Source: local data frame [16 x 3]
## Groups: type
## 
##     type       measure      corr
## 1    TMA         medSV 0.2089914
## 2    TMA sumQuartileSV 0.2131079
## 3    TMA    TMA_simple 0.2361985
## 4    TMA        meanSV 0.2470717
## 5  combi        meanSV 0.7415062
## 6  combi sumQuartileSV 0.7508431
## 7  combi         medSV 0.7514791
## 8  combi    TMA_simple 0.7532913
## 9   cvsv    TMA_simple 0.8051201
## 10  cvsv        meanSV 0.8062069
## 11  cvsv sumQuartileSV 0.8085977
## 12  cvsv         medSV 0.8096240
## 13 noise        meanSV 0.7914911
## 14 noise    TMA_simple 0.8040564
## 15 noise sumQuartileSV 0.8081128
## 16 noise         medSV 0.8123492

Let’s plot the details just in case:

p <- ggplot(TMA_alternative_long, aes(x = value, y = ratio))
p <- p + geom_point()
p <- p + stat_smooth()
p <- p + facet_wrap(~ type + measure, scales = "free")
print(p)

We see that the measures are all close. Let’s compute the correlations between each alternative and the simplified TMA:

gather(TMA_alternative, measure, value, 4:(ncol(TMA_alternative) - 2)) %>%
  group_by(type, measure) %>%
  summarise(corr = abs(cor(value, TMA_simple, method = "spearman"))) %>%
  ungroup() %>%
  arrange(corr)
## Source: local data frame [12 x 3]
## 
##     type       measure      corr
## 1  noise        meanSV 0.9627723
## 2  noise sumQuartileSV 0.9816742
## 3  combi        meanSV 0.9823702
## 4  noise         medSV 0.9831743
## 5    TMA         medSV 0.9881548
## 6    TMA sumQuartileSV 0.9927273
## 7   cvsv         medSV 0.9939154
## 8   cvsv sumQuartileSV 0.9947675
## 9  combi         medSV 0.9951515
## 10 combi sumQuartileSV 0.9956556
## 11   TMA        meanSV 0.9980198
## 12  cvsv        meanSV 0.9999640

The correlations are all high. Depending on the generation method, the mean of the SV is both the closest measure to the simplified TMA (cvsv) and the furthest from it (noise). It is indeed almost the same measure, except that the first SV may also be the most extreme value.

Then, we will see whether the CV of the SV (and its robust variant, the quartile coefficient of dispersion) is a good predictor compared to the simplified TMA.

TMA_alternative_long <- gather(TMA_alternative, measure, value, c(3, 7, 8))
TMA_alternative_long %>%
  group_by(type, measure) %>%
  summarise(corr = abs(cor(value, ratio, method = "spearman"))) %>%
  group_by(type) %>%
  arrange(corr)
## Source: local data frame [12 x 3]
## Groups: type
## 
##     type    measure       corr
## 1    TMA       CVSV 0.23448226
## 2    TMA TMA_simple 0.23619845
## 3    TMA      QCDSV 0.23809467
## 4  combi      QCDSV 0.22989499
## 5  combi TMA_simple 0.75329133
## 6  combi       CVSV 0.75329133
## 7   cvsv      QCDSV 0.80043513
## 8   cvsv       CVSV 0.80455262
## 9   cvsv TMA_simple 0.80512014
## 10 noise      QCDSV 0.04769277
## 11 noise       CVSV 0.80340834
## 12 noise TMA_simple 0.80405641

The CV of the SV is almost equivalent to the simplified TMA, but the quartile coefficient of dispersion is actually very bad. This may be because it does not change much (it was insensitive to the CV, but it may also be insensitive to everything else).

Conclusion

When dealing with SV, a small precaution is necessary: divide the SV by the square root of the dimension (taking the maximum of the two dimensions) to make the measure insensitive to the dimension.
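
A minimal sketch of this precaution (the helper name is illustrative):

# Illustrative sketch: SV divided by the square root of the maximum dimension
normalized_SV <- function(ETC) {
  svd(ETC)$d / sqrt(max(dim(ETC)))
}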

We did not find how to make any measure insensitive to the CV.

For the TMA, a good replacement could be to compute the CV of the SV. It is however sensitive to the dimension, but a quick experiment seems to indicate that it may be enough to simply divide it by the log of the dimension.
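
Here is a sketch of this candidate replacement, assuming the CV of the SV is the ratio of their standard deviation to their mean; the division by the log of the maximum dimension is only the tentative correction mentioned above:

# Illustrative sketch: CV of the SV, tentatively divided by the log of the
# maximum dimension to reduce the sensitivity to the dimension
CVSV_measure <- function(ETC) {
  sv <- svd(ETC)$d
  (sd(sv) / mean(sv)) / log(max(dim(ETC)))
}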