Data Analysis for CCPE (April 29, 2014 to July 21, 2014)

Let's statistically analyse the data.

Representation for median estimation

We estimate the median of a set of values. For this, a tool with the following properties is required:

plot figure with ggplot2
robust to outliers
statistically sound
provides CI (confidence intervals)
non-linear (smooth or arbitrary model like x + x² + x³⁾
(weighting of the errors)

There are several solutions.

Quantile regression in SAS

http://www.statsblogs.com/2013/04/17/quantile-regression-better-than-connecting-the-sample-quantiles-of-binned-data/

robust, sound, CI, arbitrary model
not R

stat_smooth

http://docs.ggplot2.org/current/stat_smooth.html

ggplot2, sound, CI, non-linear
not robust

Using rq is possible but without CI and with a linear model only.

rq (quantreg)

http://astrostatistics.psu.edu/datasets/R/html/quantreg/html/rq.html

robust, sound, CI, arbitrary linear model, (weighting)
CI is not plotted, no smoothing

stat_quantile

http://docs.ggplot2.org/current/stat_quantile.html

ggplot2, robust, sound, non-linear with rqss
no CI

rqss and nlrq (quantreg)

robust, sound, non-linear
no CI

panel.quantile (latticeExtra)

http://latticeextra.r-forge.r-project.org/man/panel.quantile.html

robust, CI, non-linear
incompatible with ggplot2, using spline may not be sound

More information in “Quantile Regression: Theory and Applications”, 2013.

stat_smooth of average of each method (conservative)

Let's load the data.

V_gamma <- read.csv("CCPE-data/cost-cons.csv")

Let's start by doing a classic smooth.

library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(0, 
    50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk CCPE-ana-clean

Let's see what we get when we use log scale for the y axis.

library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(5, 
    50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

The inversion in the slope in this last figure for large CV is strange. Conclusions are however consistent with previous figure. Let's proceed with an estimation of some quantiles (5%, 50% and 95%).

library(ggplot2)
p <- ggplot(data = V_gamma[1:1e+05, ], aes(x = cov, y = perf))
p <- p + facet_grid(. ~ method)
p <- p + geom_point(alpha = I(0.1), size = 0.1)
p <- p + stat_smooth()
p <- p + stat_quantile(method = "rqss", quantiles = c(0.05, 0.5, 0.95))
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p <- p + coord_cartesian(ylim = c(5, 50))
p

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

plot of chunk CCPE-ana-clean-quantile-rqss

Using stat_quantile with rqss does not work with a large number of points (1e5) for lambda of 1, 10, 100 and 1000: all quantiles are superposed.

library(latticeExtra)
library(splines)
panel.custo <- function(...) {
    panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T, ...)
}
xyplot(log(perf) ~ log(cov) | method, V_gamma[1:1e+05, ], alpha = 0.1, cex = 0.1, 
    ylim = c(1, 5)) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.05), ci = T, 
    col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T, 
    col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.95), ci = T, 
    col = "black"))

## Warning: 212 non-positive fis
## Warning: 4 non-positive fis
## Warning: 568 non-positive fis
## Warning: 133 non-positive fis
## Warning: 446 non-positive fis

plot of chunk CCPE-ana-clean-quantile-lattice

The time to analyse 1e5 points is excessively too long for all points and the tool is hard to use:

it was necessary to plot 3 layers instead of one with three quantiles, otherwise the confidence intervals cannot be shown
controlling whether points are shown or not is (lack of proficiency with lattice)
setting a log scale messes the labels
there are bugs with the confidence intervals

This last version shows the statistical dispersion of the values. The method seems to be sound (3 splines are used, but their effect of the CI computation is unknown).

stat_smooth of average of each method (aggressive)

Let's redo this analysis with more randomized data to assess the impact.

V_gamma <- read.csv("CCPE-data/cost-agg.csv")

Let's start by the smooth.

library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(0, 
    50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk CCPE-ana-clean-agg

It is the same except that the space is more filled with this second method.

library(ggplot2)
p <- ggplot(data = V_gamma, aes(x = cov, y = perf, group = method, color = method))
p <- p + geom_point(alpha = I(1/30), size = 0.1) + coord_cartesian(ylim = c(5, 
    50))
p <- p + stat_smooth()
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk CCPE-ana-clean-log-agg

Idem.

library(ggplot2)
p <- ggplot(data = V_gamma[1:1e+05, ], aes(x = cov, y = perf))
p <- p + facet_grid(. ~ method)
p <- p + geom_point(alpha = I(0.1), size = 0.1)
p <- p + stat_smooth()
p <- p + stat_quantile(method = "rqss", quantiles = c(0.05, 0.5, 0.95))
p <- p + scale_x_log10()
p <- p + scale_y_log10()
p <- p + coord_cartesian(ylim = c(5, 50))
p

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: singularity problem
## Warning: tiny diagonals replaced with Inf when calling blkfct

plot of chunk CCPE-ana-clean-quantile-rqss-agg

No sensible change.

library(latticeExtra)
library(splines)
panel.custo <- function(...) {
    panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T, ...)
}
xyplot(log(perf) ~ log(cov) | method, V_gamma[1:1e+05, ], alpha = 0.1, cex = 0.1, 
    ylim = c(1, 5)) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.05), ci = T, 
    col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.5), ci = T, 
    col = "black")) + layer(panel.quantile(y ~ ns(x, 3), tau = c(0.95), ci = T, 
    col = "black"))

## Warning: 212 non-positive fis
## Warning: 4 non-positive fis
## Warning: 121 non-positive fis
## Warning: 7 non-positive fis
## Warning: 150 non-positive fis

plot of chunk CCPE-ana-clean-quantile-lattice-agg

The medians may be slightly higher, but it is negligible.

Test without log scales

Let's transform the data such that their scales are linear. This will show how stat_smooth and stat_quantile behaves.

V_linear <- data.frame(cov = log(V_gamma$cov), perf = log(V_gamma$perf), method = V_gamma$method)
library(ggplot2)
p <- ggplot(data = V_linear[1:1e+05, ], aes(x = cov, y = perf, group = method, 
    color = method))
p + geom_point() + stat_smooth()

## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

plot of chunk CCPE-ana-clean-test

p + stat_quantile(method = "rqss", quantiles = c(0.05, 0.5, 0.95))

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct

plot of chunk CCPE-ana-clean-test

For stat_smooth, the explanation is actually that when the CV is greater than some threshold, some performance values are so close to zero that their importance in the average of the performance logarithmic values is strengthened. It is unclear why these values are so small, but the inversion in the slope is not due to a bug in stat_smooth. We can conclude that given the data, using the log may lead to different results where the importance is given to the orders of magnitude (this is the same difference between the arithmetic average and the geometric average).

Let's produce a simple test case that illustrates the bug of stat_quantile:

library(ggplot2)
X <- runif(10000)
V <- data.frame(x = X, y = log(X + runif(length(X))))
p <- ggplot(data = V, aes(x = x, y = y))
p <- p + geom_point()
p + stat_quantile(quantiles = c(0.05, 0.95), method = "rqss")

## Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

## Warning: tiny diagonals replaced with Inf when calling blkfct
## 
## Warning: tiny diagonals replaced with Inf when calling blkfct

plot of chunk CCPE-ana-clean-test-bug

This method is thus buggy.

Conclusion

If only the mean is required, ggplot2 smooth_stat seems relevant.

If a quantile is required (median because of outliers, or other quartiles for assessing the statistical dispersion), then the lattice gives a good representation. However, the impact of using splines is unknown.

The only methods that were not tested are rqss and nlrq from the quantreg package. The examples seem non-trivial and the methods do not provide CI. Additionally, the qrss had problems when used with the stat_quantile method.

As a last remark, instead of plotting the raw performance of each method, plotting only the ratio of each method relatively to the best could have been more informative (or at least, would have been a nice addition). In this case, only the conservative scheme would be useful (even though the aggressive one has less chance of bias).