Genetic Algorithms TBC
Evolution and Optimisation
Optimisation is the process by which we trial solutions to a problem, evaluate them, and
finally choose the optimal solution of those we have tried. For example when seasoning
a stew we add some salt then mix and taste the broth, if it's too salty we add a little
water and if it's ont yet salty enough we can add some more, we can repeat this until
the saltiness is just right. In industry and science optimsiation is ubiquitous. When
making a product a succesful business will consider the cost of "input" used to create a product
versus the profitable output, for example a canning factory will want to minimise the amount of
metal used to create a can of soup (or find a cheaper metal source!). In science we often observe
an experiment and then "fit" a model to the data, for example a line of best fit through
a set of data points.
Mathematically, optimisation can be thought of as finding the minimum value of some function, \(f(x)\), subject
to some conditions, \(c_{i}(x)\), i.e
$$ \text{minimise} \enspace f(x)$$
$$ \text{subject to} \enspace c_{i}(x) < b_{i} $$
If we want to maximise a function we can always minimise it's inverse \(-f(x)\).
For example we may want to find the minimum of \(f(x)=(x-1)^2\) given the condition \(c_{1}(x) = x \leq 0\).
In this case the minimum of \(f\) is \(0\) at \(x=1\), but since we have the condition that \(x\) must be negative
or \(0\) the minimum subject to the conditions is \(1\) at \(x=0\). There are many ways to solve
problems of this kind, differentiation is often part of the solution, but many problems exist where
the function being optimised cannot be differentiated. Genetic algorithms are one solution in these
cases.
A genetic optimisation algorithm operates inspired by the process of evolution, we have a "fitness score"
which is the function to be minimised or maximised, and so we can imagine maintaining a "population" of
solutions \(\mathcal{X}_{t} = \{x_{1},x_{2},\ldots,x_{N}\}\) and over time we can select, breed, and mutate
the solutions in the population. Then, we hope, just like evolution the population will
in time converge to only highly scoring solutions.
The Basic Structure of Genetic Code
Let's consider the example of minimising \(f(x)=x^2\) with no restrictions. For the score we want
values of \(x\) close to \(0\) (the minimum) to be highly rated, so if we take our score to be \(f_{score}(x)=-x^2\)
all values that are not \(0\) whether positive or negative will have a negative value for their score, but \(0\) will have
the highest score of \(0\).
Each iteration, \(t\), starts with a population of proposal solutions \(\mathcal{X}_{t}=\{x_{1},x_{2},\ldots,x_{N}\}\). When
\(t=0\) (the first iteration) then we set the population to be some randomly chosen values, say \(\mathcal{X}_{0}=\{10,-5,2,-56\}\)
Selection
In the selection phase we first compute the score of each proposal solution in the population, for our example we have
$$\text{Score } = \{-100,-25,-4,-3136\}$$
Based on the score of each solution relative to the others in the population we will select members in the
population that survive into the next stage. In practice a probability of selecting will be defined before hand and we
will cover that later, for now let's say we take the top \(50\%\) i.e
$$\text{Selected Population} = \{2,-5\} $$
Breeding
After selecting the solutions to carry forward we next "breed" them, the way in which solutions breed
is dependent on the form they take, here we have real numbers so an ok breeding strategy is to
average their values. Doing that our population has increased by one solution to be
$$\{2,-5,-3/2\}$$
Mutating
Finally just as in reality we can add some "mutations", or random noise to the new population of solutions,
any choice of noise strength can be made, but for simplicity perhaps we choose one value at
random and add to it a random number drawn from a standard normal distribution, for example
$$\{2,-2.78,-3/2\}$$
Where we randomly choose the second member of the population and added to it the
normal random number \(2.22\).
Repeating this process we will eventually converge to the highest scoring
solution \(0\).
A note: the separation of Selection and Breeding here is often combined into one process
called Selection where we select from the current population and then mutate.
Boltzmann Selection and Boltzmann Breeding
  For the filtering step the most basic choice would be to
  remove from the population all those who have a fitness score less
  than some pre-determined threshold \(f_{crit}\), this is what was done in the
  basic example above. This works to some extent
  but will not be robust to terrible starting conditions, and if chosen poorly
  could lead to an exploding population.
  
  A better choice, certainly a more general one, is to imagine a probability
  \(p(\boldsymbol{x},t)\) of the selection or rejection of \(\boldsymbol{x}\) at time
  \(t\). If we say \(p(\boldsymbol{x},t)\) is the probability of removing \(\boldsymbol{x}\)
  from the population, it makes sense to ask that this probability is large when
  the fittness is small and small when the fitness is high. One popular choice is to
  use a Boltzmann distribution. The Boltzmann distribution is defined as
  $$ p(x) = \frac{e^{-\beta x}}{\sum_{x}e^{-\beta x}} $$
  where the parameter \(\beta\) is the inverse temperature which controls
  how steeply the probability decays (or increases if it is negative). This distribution
  naturally arrises when considering how likely a thermodynamic system is to
  be in a particular (energy) state dependent on its temperature \(\frac{1}{\beta}\).
  
   
  
  For selection we want the probability to be high when the score is high, which we
  can achieve by using a negative temperature, or by defining a \(\beta_{s}\geq 0\) and
  dropping the \(-\)'s in the exponential terms like so
  $$ p_{selection}(\boldsymbol{x},t) = \frac{e^{\beta_{s} f(\boldsymbol{x})}}{\sum_{\boldsymbol{y}\in \mathcal{X}_{t}}e^{\beta_{s} f(\boldsymbol{y})}},\enspace \beta_{s} \geq 0 $$
  There is also a \(\mathcal{X}_{t}\) now to represent the fact we are taking the sum over the population
  at time \(t\). This means we are considering how "fit" a given \(\boldsymbol{x}\) is relative
  to the rest of the population.
  
   
  
  Breeding is more interesting because we need a joint probability of selecting a pair
  of values, \(\boldsymbol{x}_{i},\boldsymbol{x}_{j}\) from the current population to breed. We can
  use a modified Boltzmann distribution with two exponential terms
  $$ p_{breed}(\boldsymbol{x},\boldsymbol{y},t) = \frac{e^{\beta_{s} f(\boldsymbol{x})}+e^{\beta_{s} f(\boldsymbol{y})}}{\sum_{\boldsymbol{x'}\in \mathcal{X}_{t}}\sum_{\boldsymbol{y'}\in \mathcal{X}_{t}}\biggl(e^{\beta_{s} f(\boldsymbol{x'})}+e^{\beta_{s} f(\boldsymbol{y'})}\biggr)},\enspace \beta_{s} \geq 0 $$
  
   
  
  Note this will allow a member of the population to breed with itself! Indeed the most probable
  breeding pair is the top scoring member with itself, this can be left as is or countered by setting
  the probability to zero when \(\boldsymbol{x}=\boldsymbol{y}\), to do this the joint probability can
  now be expressed using the Dirac delta functional
  $$ p_{breed}(\boldsymbol{x},\boldsymbol{y},t) = \delta(\boldsymbol{x}-\boldsymbol{y})\frac{e^{\beta_{s} f(\boldsymbol{x})}+e^{\beta_{s} f(\boldsymbol{y})}}{\sum_{\boldsymbol{x'}\in \mathcal{X}_{t}}\sum_{\boldsymbol{y'}\in \mathcal{X}_{t}}\biggl(e^{\beta_{s} f(\boldsymbol{x'})}+e^{\beta_{s} f(\boldsymbol{y'})}\biggr)},\enspace \beta_{s} \geq 0 $$
  
   
  
Putting it all Together