Tuesday, August 02, 2016

Symbolic Regression and Multi-Gene Genetic Programming

I'm back, and this time the blog is about something useful. Sorry for the long wait, I was really busy. Today's topic is a class of techniques for speeding up Genetic Programming algorithms in Symbolic Regression scenarios.

Regression vs. Classification

First of all, let me do a quick introduction/clarification regarding regression and classification. These are two machine learning tasks, similar in part, but different in the end. In both tasks we have a bunch of objects, each described by a set of numbers (not necessarily numbers, but for the sake of simplicity numbers will do).

In classification the goal is to assign a class to an object based on the numbers it is described by. The class can be any discrete value. Examples are: true/false, red/green/blue, high/middle/low, a 1/2/3/4/5/6/7/8/9/10 rating, or any other label. The special case with just two classes also has a special name: binary classification. A typical classification problem is spam detection - is an e-mail spam or ham?

In regression the goal is to assign a numerical value to each object, again based on the numbers describing that object. Generally, these assigned numbers can have any value. Typical regression problems are various prediction and estimation problems. For example, suppose there is some complicated machine and you want to predict its fuel consumption based on some operational data.

Symbolic Regression

Symbolic Regression is a class of tasks for GP systems. Actually, it is the very first example I mentioned in Evolution in a Computer, pt. 3 - we have a bunch of data points and we want to find a function that fits them accurately (enough). It doesn't matter where the data points come from - anything from a demonstration problem like the one in that blog post to data from medical scans used to assess a patient's condition. Whenever we want to find a function fitting some data, it is symbolic regression.

The word symbolic is quite important. It means that the output model is a mathematical formula that could, in principle, be read and interpreted by a human, or, for example, be given to a computer algebra system (CAS) to compute derivatives, find roots, etc. This is in contrast with, e.g., a neural network, which is characterized by its topology (how many neurons there are and how they are connected) and the values of the weights (the strengths of the connections between pairs of neurons). That is not something one could look at and see a pattern in, nor could it be given to some CAS, even though it still does regression.

So, from now on, let's talk about Symbolic Regression, or SR in short.

Problem of SR

If you played around with the SR example from the already mentioned previous blog post, you might have noticed that it does not work very well. In the better case it just took quite a long time to arrive at some sensible function. In the worst case it failed completely and got stuck on some obscure function.

There are several reasons why this does not work that well. I would like to write about one of them, and that is fine-tuning the parameters. Vanilla GP works with a set of terminals and non-terminals and evolves functions by juggling around with pieces of mathematical expressions to hopefully get meaningful results. However, this juggling is often also the only mechanism for getting specific numbers into the final function. Imagine that the true function (i.e. what we would like the evolution to find) is \( f(x) = \sin(3.18x - 6.57) \). However, the GP algorithm is very unlikely to have the numbers 3.18 and 6.57 in its terminal set. It must manipulate what it has to get (close to) these numbers.

Multi-Gene Genetic Programming

The idea of Multi-Gene Genetic Programming, or MGGP for short, is to decompose the problem into multiple, presumably simpler, subproblems. However, the decomposition is not exact, as (almost) nothing around GP is. MGGP addresses part of the problem with fine-tuning parameters.

How MGGP works

In classical GP, each individual is a single tree, which is also the solution to the problem; in the case of SR it is the expression that is supposed to describe the data, and its output is the regression estimate. In MGGP, in contrast to GP, each individual is made of multiple trees which together form the solution. These trees are called genes.

But how do we get a single number out of multiple expressions? The answer is a linear combination. That means that the output of every gene is multiplied by some constant number and then they are all summed together. It is basically a weighted sum of the genes. Moreover, in MGGP there is always an implicit gene which is always equal to the number 1. The reason for such a gene is that we get an additive factor for free.

Example: let's have these three genes and (for simplicity) there is only one variable to work with, \(x\): $$ g_1(x) = x $$ $$ g_2(x) = \sin(x) $$ $$ g_3(x) = \cos(x) $$ and of course there is always the implicit gene of 1, let's give it the index of 0: \(g_0(x) = 1\). Now, to get the final expression, there are some constants, let's call them \( w_i \) which multiply the individual genes. The final expression would be, in this case: $$ f(x) = w_0 g_0(x) + w_1 g_1(x) + w_2 g_2(x) + w_3 g_3(x) $$ or to make it simpler (because \( g_0(x) \) is always going to be 1): $$ f(x) = w_0 + w_1 g_1(x) + w_2 g_2(x) + w_3 g_3(x) $$
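To make the example concrete, here is a minimal Python sketch of how such an individual could be evaluated, assuming the weights \( w_i \) are already known (where they come from is explained below). The gene functions and weight values here are made up for illustration, not taken from any particular library:

```python
import numpy as np

# The three genes from the example above; the implicit gene g0 is the constant 1.
genes = [lambda x: x,          # g1(x) = x
         lambda x: np.sin(x),  # g2(x) = sin(x)
         lambda x: np.cos(x)]  # g3(x) = cos(x)

# Hypothetical weights w0..w3; in MGGP these come from linear regression (see below).
w = np.array([0.5, 2.0, -1.0, 3.0])

def f(x):
    # f(x) = w0 * 1 + w1 * g1(x) + w2 * g2(x) + w3 * g3(x)
    outputs = np.array([1.0] + [g(x) for g in genes])
    return w @ outputs

print(f(1.5))
```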

By now you should have an idea, but you are probably wondering: where do we get the constants (\( w_i \)) from? Yes, that is the key to MGGP. And the trick is really simple. We use the power of linear regression. Linear regression is a kind of regression analysis (see, SR is another kind of regression analysis) which produces models in the form of a linear combination of the input features. So, for example, if our problem has two input features, linear regression produces models like \( f(x_1, x_2) = w_0 + w_1 x_1 + w_2 x_2 \).

There are several ways of getting the constants \( w_i \), but probably the most widely used is the ordinary least-squares solution. Let the input features be arranged in a matrix \( \mathbf{X} \) so that each datapoint (its input features) is one row of the matrix, and let the target values be arranged in a column vector \( \mathbf{y} \) so that the \(i\)-th element corresponds to the \(i\)-th row of the matrix \( \mathbf{X} \). If we have these, the vector of the constants, also called weights, can be computed as:

$$ \mathbf{w} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$
This way we get the weights that multiply the input features, but we don't get the additive factor \( w_0 \). This is fixed easily by prepending a column of ones to the matrix \( \mathbf{X} \).
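A minimal numpy sketch of this computation (the data here is made up; note that np.linalg.lstsq computes the same ordinary least-squares solution without explicitly forming the matrix inverse, which is numerically preferable):

```python
import numpy as np

# Example data: 100 points, 2 input features, noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.1, size=100)

# Prepend a column of ones so that w[0] becomes the additive factor w0.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Ordinary least-squares solution, equivalent to w = (X^T X)^-1 X^T y.
w, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(w)  # roughly [3.0, 2.0, -1.5]
```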

An identical procedure is applied in MGGP. Instead of using the vanilla input variables, we first compute “new” features by computing the outputs of the genes - each gene produces one new feature - and of course we create the implicit feature of 1. An important fact is that the computation of the weights is done at evaluation time, i.e. they are always computed optimally with respect to the genes that are available in the individual.
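Putting the two pieces together, evaluating one MGGP individual could look roughly like the following sketch (my own toy code, not GPTIPS): the gene outputs form the columns of the feature matrix, the weights are fitted by least squares, and the fitness can then be, e.g., the resulting mean squared error:

```python
import numpy as np

def evaluate_individual(genes, x, y):
    """Fit the linear weights for one individual and return (weights, mse).

    genes : list of callables, each mapping the input x to one gene output
    x, y  : training inputs and target values (1-D arrays of equal length)
    """
    # One column per gene, plus the implicit column of ones for w0.
    G = np.column_stack([np.ones_like(x)] + [g(x) for g in genes])
    # Weights computed optimally with respect to exactly these genes.
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    mse = np.mean((G @ w - y) ** 2)
    return w, mse

# Example: data from sin(3.18*x - 6.57), approximated by the genes from above.
x = np.linspace(-3, 3, 200)
y = np.sin(3.18 * x - 6.57)
genes = [lambda x: x, lambda x: np.sin(x), lambda x: np.cos(x)]
print(evaluate_individual(genes, x, y)[1])  # MSE of the best linear combination
```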

Additional operators

Since there are multiple genes, the crossover operator is split into two types:

  • Low-level crossover - this is a classical crossover known from GP: a random node is chosen in each individual and the subtrees rooted in these nodes are swapped.
  • High-level crossover - a new type of crossover arising from the fact that there are multiple genes: whole genes (one or more) are swapped between the individuals (see the sketch after this list).
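For illustration, a high-level crossover could look roughly like this sketch (an assumed simple variant that exchanges one randomly chosen gene between the parents; real implementations such as GPTIPS can exchange several genes and typically enforce a maximum number of genes per individual):

```python
import random

def high_level_crossover(parent_a, parent_b):
    """Swap one randomly chosen gene between two individuals.

    Each individual is represented as a list of genes (expression trees);
    the offspring are copies of the parents with the chosen genes exchanged.
    """
    child_a, child_b = list(parent_a), list(parent_b)
    i = random.randrange(len(child_a))
    j = random.randrange(len(child_b))
    child_a[i], child_b[j] = child_b[j], child_a[i]
    return child_a, child_b
```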

What is it good for

The major benefit of this approach is that it lifts the duty of finding the linear parts off the shoulders of the underlying GP, which then needs to focus on the non-linear parts only. Also, it performs a kind of decomposition of the problem - multiple, maybe smaller, genes work together to create a reasonably good solution. It addresses the problem of fine-tuning the parameters to a certain degree - the top-level parameters, i.e. the linear coefficients, do not need to be found by evolution. However, the constants hidden inside the non-linear parts are not addressed by this mechanism and must be sought traditionally (in GP terms).

References

  • GPTIPS - the leading implementation (and maybe the only one really used) of (not just) MGGP by Dominic Searson.
  • Hinchliffe MP, Willis MJ, Hiden H, Tham MT, McKay B & Barton, GW. Modelling chemical process systems using a multi-gene genetic programming algorithm. In Genetic Programming: Proceedings of the First Annual Conference (late breaking papers), 56-65. The MIT Press, USA, 1996.
  • Searson DP, Willis MJ & Montague GA. Co-evolution of non-linear PLS model components, Journal of Chemometrics, 2, 592-603, 2007.