Saturday, March 2, 2013

Comparing two (or more) classifiers

The problem of comparing two (or more) classifiers on a set of problems is not trivial. For that mater I want to share some fundamental definitions and procedures about it. This post is mainly based on the work of Demsar (2006). I will use a more informal way of describing this issue for the sake of clarity.

First, let's start with some (not-so-) complex and essential definitions.

1) The null hypothesis (or H_0). The hypothesis we want to prove false using, in our case, a statistical test; typically that all classifiers perform the same on average.

2) The alternative hypothesis (or H_1). The opposite hypothesis to the null hypothesis. In our particular case that not all the classifiers perform the same on average.

3) Type I Error or a false positive. Rejecting the null hypothesis (incorrectly) when is actually true.

4) Type II Error or a false negative. Conversely to type I error, not rejecting the null hypothesis when is actually false.

5) The level of significance. This is an important concept. It tells us whether we found a true pattern in data or not (i.e., that it was just chance). For example that classifier X is better, on average, than classifier Y for the given set of problems, with a certain probability of avoiding both type I error and type II error. This probability is coined as alpha.

6) Alpha. The probability value which identifies the level of significance. The larger this value, the more chance of committing type II error, and also we have less statistical power. Conversely, the lower this value, the more chance of committing type I error. Typical values are 0.05 and 0.1.

7) The computed p-value (or simply p). The p-value is the smallest level of significance that results in the rejection of the null hypothesis. This is a key concept, because if a test of significance gives a computed p-value lower than or equal to the significance level alpha, the null hypothesis is rejected.

Knowing this, we can proceed with the statistical test itself. For this purpose, I will use the Friedman's test. It is, probably, the most well known non-parametric test (that is: this test does not assume any particular probability distribution, as opposed to parametric tests like ANOVA). It ranks the algorithms based on their performance, for instance the accuracy or the F-Measure, for each data set separately. The reader is referred to (Demsar, 2006) for further details on this.
Notice that there are several software packages that perform the Friedman's ranking.

After the ranking, the Friendman statistic is computed. This results in a value that is used for rejecting (or not) the null hypothesis, using a chi-squared distribution with k - 1 degrees of freedom, in our case the number of algorithms minus one.

To further improve the precision of this test, a correction is performed. This is the Iman-Davenport correction (also referred to as F_F), which follows an F distribution instead, with k - 1 and (k - 1)(N - 1) degrees of freedom. In our case it refers to (1) the number of algorithms minus one and (2) the multiplication of the number of algorithms minus one and the number of data sets minus one. Again, the reader is referred to (Demsar, 2006) for more details.

In this point we can compute the p-value and check whether the null hypothesis is rejected or not. The problem is that computing the p-value (by hand) is not easy (I will not enter in further details), so we typically make use of p-value calculators for this issue (or other related software). In my case, I use R, an excellent and free software. It is very straightforward, assuming a confidence level of 95% (i.e., alpha = 0.05), seven algorithms and 30 distinct data sets, and a computed F_F value of 7.268, simply type 1 - pf( 7.268, 7 - 1, (7-1) * (30 - 1) ) in the console. The result I get is much less than 0.05 (in fact the actual value is 0.000000610959), hence we reject the null hypothesis that the classifiers perform the same on average. Therefore, we can use a post-hoc test (or more than just one) to check whether there are statistically significant differences, on average, among algorithms. Examples of these tests are the Nemenyi's test, the Bonferroni-Dunn's test or the Holm's step-down procedure---see (Demsar, 2006; García and Herrera, 2008) for further details on post-hoc tests.


References

Demsar J. (2006) Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7:1–30

García S, Herrera F (2008) An extension on ”Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons. Journal of Machine Learning Research 9:2677–2694

Sunday, October 28, 2012

Optimization: Simplex Algorithm


Imagine we have a set of computers running simulations. Those computers can be described by a group of features, for example its power consumption per hour and the total number of processes it can run per hour. From our R+D personnel, a series of programs are send to these computers so we have to perform some schedule to run the applications, but for our profit we wish to minimise the total consumption of the set of machines. Let's show some more detail:

  • Computer 1 has a power consumption of 100 watts per hour and can run 10  high cpu-demanding processes per hour and 7 low cpu-demanding processes per hour at the same time.
  • Computer 2 has a power consumption of 350 watts per hour and can run 60 high cpu-demanding processes per hour and 20 low cpu-demanding processes per hour at the same time.
  • From the R+D area there are a set of at least 3125 high cpu-demanding processes and at least 710 low cpu-demanding processes to be processed.

How many hours should our set of computers run to minimise the power consumption but meeting the restrictions of our R+D personnel? 

This is a typical example of an optimisation problem, that is, a problem consisting in the selection of a best element with regard to some criteria from some set of variable alternatives. Firstly, we have to rewrite our problem in a more convenient way. We have to introduce two variables, the time (in hours) of the first computer and the time (also in hours) of the second computer, represented respectively by the variables x and y:

minimise  F = 100 x + 350 y  (we call this the objective function)

subject to 10 x + 60 y >= 3125, (we call these the restrictions)
                 7 x + 20 y  >= 710,
                 x >= 0, y >= 0.

So, from our Computer Science point of view, what can we do to solve this kind problems? The answer is 'many things' so the purpose of this post is to introduce a clever algorithm called Simplex.  

Simplex solves Linear Programming problems (i.e., those in which the objective function to minimise---or maximise---is linear) and does it by generating a tableau and cleverly operating with it until some exit criteria is met. This algorithm is summarised as follows: the restriction equations are set in the tableau in the first place and then the objective function in the last row of this tableau. Next, because we want a minimisation, we transpose this tableau and we augment it by adding slack variables (i.e., new variables that allow us to get rid of the inequalities), one per restriction.  After this, a series of operations are performed in order to obtain the desired solution. In what follows, the main loop of the algorithm is briefly described:  

 (1) Check for the pivot column by looking for the greatest negative value in all the rows of the tableau. 

 (2) Select the pivot row by dividing the elements of the pivot column by the last column of the tableau (the different   values of the equation). The selected row will be the one containing the column value which has the lowest non-negative. 

 (3) Divide the entire pivot row by the value selected in the second step.

 (4) Form zeroes in the values of the non-slack variables by combining multiples of the pivot row and the rest of rows (not including the pivot one).

 (5) If the values of all the cells of the objective function row in the tableau are less than zero, jump to step 1. Otherwise stop.

Finally, the values of the minimisation are collected. The minimum value is in the last column of the last row of the tableau. The values of the variables (i.e., x and y in the example) for this minimum are in the objective function row cells, more precisely in the slack variables.

The optimal solution is a minimum power consumption of (F) 18229.167 watts, with a usage of the first computer of (x) 0 hours and a usage of the second computer of (y) 52.083 hours.

That's a very pretty clever and straightforward algorithm to solve linear problems. But we can go beyond with other techniques, for example using metaheuristics, but this will be a future post. 

Monday, June 11, 2012

Doing a PhD...


Doing a PhD involves building walls and then demolish them and build them back learning something during the process.