Genetic Algorithms TBC


Evolution and Optimisation

Optimisation is the process by which we trial solutions to a problem, evaluate them, and finally choose the best of those we have tried. For example, when seasoning a stew we add some salt, mix, and taste the broth; if it is too salty we add a little water, and if it is not yet salty enough we add some more. We can repeat this until the saltiness is just right. In industry and science optimisation is ubiquitous. When making a product, a successful business will weigh the cost of the "inputs" used to create it against the profitable output; for example, a canning factory will want to minimise the amount of metal used to create a can of soup (or find a cheaper metal source!). In science we often observe an experiment and then "fit" a model to the data, for example a line of best fit through a set of data points.

Mathematically, optimisation can be thought of as finding the minimum value of some function, \(f(x)\), subject to some conditions, \(c_{i}(x)\), i.e $$ \text{minimise} \enspace f(x)$$ $$ \text{subject to} \enspace c_{i}(x) < b_{i} $$ If we want to maximise a function we can always minimise its negation \(-f(x)\). For example, we may want to find the minimum of \(f(x)=(x-1)^2\) given the condition \(c_{1}(x) = x \leq 0\). The unconstrained minimum of \(f\) is \(0\) at \(x=1\), but since the condition requires \(x\) to be negative or zero, the minimum subject to the condition is \(1\) at \(x=0\). There are many ways to solve problems of this kind, and differentiation is often part of the solution, but many problems exist where the function being optimised cannot be differentiated. Genetic algorithms are one option in these cases.
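
The constrained example above can be checked numerically; the sketch below (plain NumPy, with a simple grid search I've chosen purely for illustration) confirms the constrained minimum of \(1\) at \(x=0\).

```python
import numpy as np

# Minimise f(x) = (x - 1)^2 subject to x <= 0 by searching
# a grid of candidate points inside the feasible region.
f = lambda x: (x - 1) ** 2

xs = np.linspace(-10.0, 0.0, 10001)  # feasible region: x <= 0
best = xs[np.argmin(f(xs))]

print(best, f(best))  # minimum at x = 0 with f(0) = 1
```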

A genetic optimisation algorithm is inspired by the process of evolution. We have a "fitness score", the function to be minimised or maximised, and we maintain a "population" of solutions \(\mathcal{X}_{t} = \{x_{1},x_{2},\ldots,x_{N}\}\); over time we select, breed, and mutate the solutions in the population. Then, we hope, just like in evolution, the population will in time converge to only highly scoring solutions.

The Basic Structure of Genetic Code

Let's consider the example of minimising \(f(x)=x^2\) with no restrictions. For the score we want values of \(x\) close to \(0\) (the minimum) to be highly rated, so if we take our score to be \(f_{score}(x)=-x^2\), all non-zero values of \(x\), whether positive or negative, receive a negative score, while \(0\) receives the highest score of \(0\).
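
This score is a one-liner in code; a minimal sketch:

```python
# Fitness score for minimising f(x) = x^2: negate f so that the
# minimiser x = 0 receives the highest (least negative) score.
f_score = lambda x: -x ** 2

print(f_score(0), f_score(3), f_score(-3))  # 0 -9 -9
```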

Each iteration, \(t\), starts with a population of proposal solutions \(\mathcal{X}_{t}=\{x_{1},x_{2},\ldots,x_{N}\}\). When \(t=0\) (the first iteration) we set the population to some randomly chosen values, say \(\mathcal{X}_{0}=\{10,-5,2,-56\}\).
Selection
In the selection phase we first compute the score of each proposal solution in the population; for our example we have $$\text{Score } = \{-100,-25,-4,-3136\}$$ Based on the score of each solution relative to the others in the population, we select the members that survive into the next stage. In practice a selection probability is defined beforehand, and we will cover that later; for now let's say we take the top \(50\%\), i.e $$\text{Selected Population} = \{2,-5\} $$
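
A sketch of this selection step in NumPy, with the top-50% rule hard-coded for illustration:

```python
import numpy as np

# Score each member with f_score(x) = -x^2 and keep the top 50%.
population = np.array([10.0, -5.0, 2.0, -56.0])
scores = -population ** 2                    # {-100, -25, -4, -3136}

keep = len(population) // 2                  # top 50% of the population
selected = population[np.argsort(scores)[::-1][:keep]]

print(selected)  # [ 2. -5.]
```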
Breeding
After selecting the solutions to carry forward we next "breed" them. The way in which solutions breed depends on the form they take; here we have real numbers, so a reasonable breeding strategy is to average their values. Doing that, our population grows by one solution to $$\{2,-5,-3/2\}$$
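
The averaging strategy, continuing the example:

```python
import numpy as np

# Breed the two survivors by averaging them, and append the child.
selected = np.array([2.0, -5.0])
child = selected.mean()                      # (2 + -5) / 2 = -1.5
population = np.append(selected, child)

print(population)  # [ 2.  -5.  -1.5]
```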
Mutating
Finally, just as in nature, we can add some "mutations", i.e. random noise, to the new population of solutions. Any choice of noise strength can be made, but for simplicity we choose one member at random and add to it a number drawn from a standard normal distribution, for example $$\{2,-2.78,-3/2\}$$ where we randomly chose the second member of the population and added to it the normally distributed random number \(2.22\).
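
A sketch of this mutation step; the seed is my own choice, so the particular noise value will differ from the \(2.22\) in the text:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Pick one member at random and add standard normal noise to it.
population = np.array([2.0, -5.0, -1.5])
i = rng.integers(len(population))
population[i] += rng.standard_normal()

print(population)
```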

Repeating this process, we will eventually converge to the highest-scoring solution, \(x=0\).
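
The whole loop can be sketched as follows; the population size, iteration count, and refill rule are my own choices for illustration, not prescribed above.

```python
import numpy as np

rng = np.random.default_rng(42)

def genetic_minimise(f_score, population, iterations=200):
    """Select the top 50%, breed by averaging random pairs, mutate one member."""
    population = np.array(population, dtype=float)
    for _ in range(iterations):
        # Selection: keep the highest-scoring half (at least two members).
        order = np.argsort(f_score(population))[::-1]
        population = population[order[: max(2, len(population) // 2)]]
        # Breeding: average random pairs until the population is refilled.
        while len(population) < 4:
            i, j = rng.integers(len(population), size=2)
            population = np.append(population, (population[i] + population[j]) / 2)
        # Mutation: perturb one randomly chosen member with standard normal noise.
        population[rng.integers(len(population))] += rng.standard_normal()
    return population[np.argmax(f_score(population))]

best = genetic_minimise(lambda x: -x ** 2, [10.0, -5.0, 2.0, -56.0])
print(best)  # close to the minimiser x = 0
```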

A note: the separate Selection and Breeding steps here are often combined into a single process, also called Selection, in which breeding pairs are selected directly from the current population and the offspring are then mutated.

Boltzmann Selection and Boltzmann Breeding

For the selection step the most basic choice would be to remove from the population all those whose fitness score falls below some pre-determined threshold \(f_{crit}\); this is essentially what was done in the basic example above, with the threshold set relative to the population (the top \(50\%\)). This works to some extent but is not robust to terrible starting conditions, and if chosen poorly it could lead to an exploding population.

A better, certainly more general, choice is to define a probability \(p(\boldsymbol{x},t)\) of selecting or rejecting \(\boldsymbol{x}\) at time \(t\). If we say \(p(\boldsymbol{x},t)\) is the probability of removing \(\boldsymbol{x}\) from the population, it makes sense to ask that this probability is large when the fitness is small and small when the fitness is high. One popular choice is the Boltzmann distribution, defined as $$ p(x) = \frac{e^{-\beta x}}{\sum_{x}e^{-\beta x}} $$ where the parameter \(\beta\) is the inverse temperature, which controls how steeply the probability decays (or increases, if it is negative). This distribution arises naturally in thermodynamics when considering how likely a system at temperature \(\frac{1}{\beta}\) is to be found in a particular (energy) state.
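
A minimal sketch of the Boltzmann distribution over a finite set of values:

```python
import numpy as np

# Boltzmann distribution: p(x) proportional to exp(-beta * x);
# larger beta concentrates the probability mass on small x.
def boltzmann(x, beta):
    w = np.exp(-beta * np.array(x, dtype=float))
    return w / w.sum()

p = boltzmann([0.0, 1.0, 2.0], beta=1.0)
print(p, p.sum())  # probabilities decay with x and sum to 1
```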

For selection we want the probability to be high when the score is high, which we can achieve by using a negative temperature, or equivalently by defining a \(\beta_{s}\geq 0\) and dropping the minus signs in the exponential terms like so $$ p_{selection}(\boldsymbol{x},t) = \frac{e^{\beta_{s} f(\boldsymbol{x})}}{\sum_{\boldsymbol{y}\in \mathcal{X}_{t}}e^{\beta_{s} f(\boldsymbol{y})}},\enspace \beta_{s} \geq 0 $$ The sum is now taken over \(\mathcal{X}_{t}\), the population at time \(t\), which means we are considering how "fit" a given \(\boldsymbol{x}\) is relative to the rest of the population.
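
A sketch of these selection probabilities for the earlier example population; subtracting the maximum score before exponentiating is a standard numerical-stability trick, not part of the formula, and the value of \(\beta_{s}\) here is my own choice:

```python
import numpy as np

# Boltzmann selection: p_selection proportional to exp(beta_s * f(x)),
# so high-scoring members are selected more often.
beta_s = 0.1
population = np.array([10.0, -5.0, 2.0, -56.0])
scores = -population ** 2                      # f_score(x) = -x^2

w = np.exp(beta_s * (scores - scores.max()))   # shift for numerical stability
p_selection = w / w.sum()

survivor = np.random.default_rng(1).choice(population, p=p_selection)
print(p_selection, survivor)
```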

Breeding is more interesting because we need a joint probability of selecting a pair of values, \(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\) from the current population to breed. We can use a modified Boltzmann distribution with two exponential terms $$ p_{breed}(\boldsymbol{x},\boldsymbol{y},t) = \frac{e^{\beta_{s} f(\boldsymbol{x})}+e^{\beta_{s} f(\boldsymbol{y})}}{\sum_{\boldsymbol{x'}\in \mathcal{X}_{t}}\sum_{\boldsymbol{y'}\in \mathcal{X}_{t}}\biggl(e^{\beta_{s} f(\boldsymbol{x'})}+e^{\beta_{s} f(\boldsymbol{y'})}\biggr)},\enspace \beta_{s} \geq 0 $$
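
This joint distribution can be computed as an \(N \times N\) matrix of pair weights; a sketch, reusing the example population and my earlier choice of \(\beta_{s}\):

```python
import numpy as np

# Joint breeding probability: p_breed(x_i, x_j) proportional to
# exp(beta_s f(x_i)) + exp(beta_s f(x_j)), normalised over all pairs.
beta_s = 0.1
population = np.array([10.0, -5.0, 2.0, -56.0])
w = np.exp(beta_s * (-population ** 2))

pair_weights = w[:, None] + w[None, :]          # N x N matrix of pair weights
p_breed = pair_weights / pair_weights.sum()

print(p_breed.sum())  # sums to 1
```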

Note this will allow a member of the population to breed with itself! Indeed the most probable breeding pair is the top-scoring member paired with itself. This can be left as is, or countered by setting the probability to zero when \(\boldsymbol{x}=\boldsymbol{y}\); to do this the joint probability can be multiplied by \(1-\delta_{\boldsymbol{x},\boldsymbol{y}}\), where \(\delta\) is the Kronecker delta (equal to \(1\) when \(\boldsymbol{x}=\boldsymbol{y}\) and \(0\) otherwise), with the denominator adjusted to keep the distribution normalised: $$ p_{breed}(\boldsymbol{x},\boldsymbol{y},t) = \frac{\bigl(1-\delta_{\boldsymbol{x},\boldsymbol{y}}\bigr)\bigl(e^{\beta_{s} f(\boldsymbol{x})}+e^{\beta_{s} f(\boldsymbol{y})}\bigr)}{\sum_{\boldsymbol{x'}\in \mathcal{X}_{t}}\sum_{\boldsymbol{y'}\in \mathcal{X}_{t}}\bigl(1-\delta_{\boldsymbol{x'},\boldsymbol{y'}}\bigr)\bigl(e^{\beta_{s} f(\boldsymbol{x'})}+e^{\beta_{s} f(\boldsymbol{y'})}\bigr)},\enspace \beta_{s} \geq 0 $$
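
A sketch of the self-pair exclusion: zero the diagonal of the pair-weight matrix (a Kronecker-style delta, since the population is discrete) and renormalise.

```python
import numpy as np

# Exclude self-breeding by zeroing the diagonal of the pair-weight
# matrix, then renormalising so the probabilities still sum to 1.
beta_s = 0.1
population = np.array([10.0, -5.0, 2.0, -56.0])
w = np.exp(beta_s * (-population ** 2))

pair_weights = w[:, None] + w[None, :]
np.fill_diagonal(pair_weights, 0.0)             # p_breed(x, x) = 0
p_breed = pair_weights / pair_weights.sum()

print(p_breed.sum())  # still sums to 1
```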


Putting it all Together