Let's statistically analyse the data.
We estimate the median of a set of values. For this, a tool with the following properties is required:
There are several solutions.
http://docs.ggplot2.org/current/stat_smooth.html
Using rq is possible but without CI and with a linear model only.
http://astrostatistics.psu.edu/datasets/R/html/quantreg/html/rq.html
http://docs.ggplot2.org/current/stat_quantile.html
http://latticeextra.r-forge.r-project.org/man/panel.quantile.html
More information in “Quantile Regression: Theory and Applications”, 2013.
Let's load the data.
V_gamma <- read.csv("CCPE-data/cost-cons.csv")
Let's start by doing a classic smooth.
library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(0,
50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Let's see what we get when we use log scale for the y axis.
library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(5,
50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
The inversion in the slope in this last figure for large CV is strange. Conclusions are however consistent with previous figure. Let's proceed with an estimation of some quantiles (5%, 50% and 95%).
library(ggplot2)
p <- ggplot(data = V_gamma[1:1e+05, ], aes(x = cov, y = perf))
p <- p + facet_grid(. ~ method)
p <- p + geom_point(alpha = I(0.1), size = 0.1)
p <- p + stat_smooth()
p <- p + stat_quantile(method = "rqss", quantiles = c(0.05, 0.5, 0.95))
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p <- p + coord_cartesian(ylim = c(5, 50))
p
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
Using stat_quantile with rqss does not work with a large number of points (1e5) for lambda of 1, 10, 100 and 1000: all quantiles are superposed.
library(latticeExtra)
library(splines)
panel.custo <- function(...) {
panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T, ...)
}
xyplot(log(perf) ~ log(cov) | method, V_gamma[1:1e+05, ], alpha = 0.1, cex = 0.1,
ylim = c(1, 5)) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.05), ci = T,
col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T,
col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.95), ci = T,
col = "black"))
## Warning: 212 non-positive fis
## Warning: 4 non-positive fis
## Warning: 568 non-positive fis
## Warning: 133 non-positive fis
## Warning: 446 non-positive fis
The time to analyse 1e5 points is excessively too long for all points and the tool is hard to use:
This last version shows the statistical dispersion of the values. The method seems to be sound (3 splines are used, but their effect of the CI computation is unknown).
Let's redo this analysis with more randomized data to assess the impact.
V_gamma <- read.csv("CCPE-data/cost-agg.csv")
Let's start by the smooth.
library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(0,
50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
It is the same except that the space is more filled with this second method.
library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(5,
50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
Idem.
library(ggplot2)
p <- ggplot(data = V_gamma[1:1e+05, ], aes(x = cov, y = perf))
p <- p + facet_grid(. ~ method)
p <- p + geom_point(alpha = I(0.1), size = 0.1)
p <- p + stat_smooth()
p <- p + stat_quantile(method = "rqss", quantiles = c(0.05, 0.5, 0.95))
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p <- p + coord_cartesian(ylim = c(5, 50))
p
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
No sensible change.
library(latticeExtra)
library(splines)
panel.custo <- function(...) {
panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T, ...)
}
xyplot(log(perf) ~ log(cov) | method, V_gamma[1:1e+05, ], alpha = 0.1, cex = 0.1,
ylim = c(1, 5)) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.05), ci = T,
col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T,
col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.95), ci = T,
col = "black"))
## Warning: 212 non-positive fis
## Warning: 4 non-positive fis
## Warning: 121 non-positive fis
## Warning: 7 non-positive fis
## Warning: 150 non-positive fis
The medians may be slightly higher, but it is negligible.
Let's transform the data such that their scales are linear. This will show how
stat_smooth
and stat_quantile
behaves.
V_linear <- data.frame(cov = log(V_gamma$cov), perf = log(V_gamma$perf), method = V_gamma$method)
library(ggplot2)
p <- ggplot(data = V_linear[1:1e+05, ], aes(x = cov, y = perf, group = method,
color = method))
p + geom_point() + stat_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
p + stat_quantile(method = "rqss", quantiles = c(0.05, 0.5, 0.95))
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
For stat_smooth, the explanation is actually that when the CV is greater than some threshold, some performance values are so close to zero that their importance in the average of the performance logarithmic values is strengthened. It is unclear why these values are so small, but the inversion in the slope is not due to a bug in stat_smooth. We can conclude that given the data, using the log may lead to different results where the importance is given to the orders of magnitude (this is the same difference between the arithmetic average and the geometric average).
Let's produce a simple test case that illustrates the bug of stat_quantile:
library(ggplot2)
X <- runif(10000)
V <- data.frame(x = X, y = log(X + runif(length(X))))
p <- ggplot(data = V, aes(x = x, y = y))
p <- p + geom_point()
p + stat_quantile(quantiles = c(0.05, 0.95), method = "rqss")
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
## Warning: tiny diagonals replaced with Inf when calling blkfct
##
## Warning: tiny diagonals replaced with Inf when calling blkfct
This method is thus buggy.
If only the mean is required, ggplot2 smooth_stat seems relevant.
If a quantile is required (median because of outliers, or other quartiles for assessing the statistical dispersion), then the lattice gives a good representation. However, the impact of using splines is unknown.
The only methods that were not tested are rqss and nlrq from the quantreg
package. The examples seem non-trivial and the methods do not provide CI.
Additionally, the qrss had problems when used with the stat_quantile
method.
As a last remark, instead of plotting the raw performance of each method, plotting only the ratio of each method relatively to the best could have been more informative (or at least, would have been a nice addition). In this case, only the conservative scheme would be useful (even though the aggressive one has less chance of bias).