Implementing Automatic Differentiation 27/11/2020
Differentiation
Calculus is ubiquitous in applied mathematics because it defines
a system for quantifying change. Consider the movement of a ball, the growth
of a portfolio, the spread of a disease, the growth of a bacterial colony, or the
training of an artificial neural network. All of these problems are intricately linked
to a quantity that changes, i.e. to the derivative of a function \(f(x)\).
As humans (or a very advanced AI) we can learn to compute the rate of change
of a function. For example, if
$$ f : \mathbb{R}\to\mathbb{R}, \quad f(x) = x^2 $$
then \(f\)'s rate of change with respect to a change in \(x\) is
$$ \frac{df(x)}{dx} = 2x $$
This comes from the definition of a derivative
$$ f'(x) = \frac{df(x)}{dx} = \lim_{h\rightarrow 0} \frac{f(x+h)-f(x)}{h} $$
i.e. the slope of \(f(x)\) evaluated between two infinitesimally close points.
It is possible to define a programming language capable of analytic differentiation
(Mathematica and Wolfram Alpha do this), but it is not always practical or useful,
so most applications of the derivative
resort to some approximation of the limit definition. Most simply this
can be done by taking a small \(h\), say \(h=0.01\), and plugging in the numbers.
This works, but it trades truncation error (from the finite \(h\)) against
floating-point round-off error (which grows as \(h\) shrinks).
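To see this concretely, here is a minimal sketch of the forward-difference
approximation (finite_diff and its test values are mine, not from the post):
# forward-difference approximation of f'(x) with step size h
finite_diff(f, x, h) = (f(x + h) - f(x)) / h

f(x) = x^2   # f'(3) = 6 exactly

# no single h wins: a large h truncates, a tiny h amplifies round-off
for h in (0.01, 1e-6, 1e-12)
    println(h, " => ", abs(finite_diff(f, 3.0, h) - 6.0))
end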
Dual Numbers To the Rescue
It would be nice if for any \(f\) we could know \(f'\) at \(x\) on the fly,
without having to derive it ourselves, approximate it, or ask a computer to chug away and
possibly find the analytic solution itself. This is where dual
numbers come in and reveal "Automatic Differentiation" (AD).
A dual number is a "new" type of number analogous to complex numbers, but now
instead of using \(a+bi\) where \(i=\sqrt{-1}\) we write
$$ a+b\epsilon, \enspace \text{ where } \epsilon^{2}=0, \enspace \text{ for } a,b\in\mathbb{R} $$
This seemingly trivial academic pursuit is incredibly fruitful. Consider \(f(x)=x^2\), and notice
$$f(a+b\epsilon) = a^2 + 2ab\epsilon + b^2\epsilon^2 = a^2 + 2ab\epsilon$$
again nothing too crazy, yet, but see that
$$ f(a) = a^2, \text{ and } f'(a) = 2a $$
so if we had taken \(a+\epsilon\) in place of \(a+b\epsilon\) then we would get
$$f(a+\epsilon) = f(a)+f'(a)\epsilon$$
This is no accident; in fact this result applies to any \(f\) where \(f'\) exists! Taking the Taylor expansion about \(a\),
$$f(a+b\epsilon) = \sum_{n=0}^{\infty}\frac{f^{(n)}(a)b^{n}\epsilon^{n}}{n!} = f(a) + bf'(a)\epsilon + \epsilon^{2}\sum_{n=2}^{\infty}\frac{f^{(n)}(a)b^{n}\epsilon^{n-2}}{n!} = f(a) + bf'(a)\epsilon$$
where the entire trailing sum vanishes because \(\epsilon^{2}=0\).
So in general evaluating a function of a dual number gives you the value of the function
as its real part and the exact derivative as its dual part!
Implementing Dual Numbers
Practically we can make use of these results with a simple object called Dual that
just smashes two Numbers together.
struct Dual{T1<:Number,T2<:Number}
    a::T1
    b::T2
end
Of course the real work is done in defining operators on our objects, in
this case \(+,-,/,*\) for addition, subtraction, division, and multiplication.
import Base.+, Base.*, Base.^, Base.-, Base./
# addition and subtraction are easy
+(x::Dual,y::Dual) = Dual(x.a+y.a,x.b+y.b)
-(x::Dual,y::Dual) = Dual(x.a-y.a,x.b-y.b)
# multiplication (remember that binomial theorem?)
*(x::Dual,y::Dual) = Dual(x.a*y.a,x.a*y.b+x.b*y.a)
*(x::Number,y::Dual) = Dual(x*y.a,x*y.b)
*(y::Dual,x::Number) = x*y
# division a bit more of a pain
/(x::Dual,y::Dual) = Dual(x.a/y.a,(x.b*y.a-x.a*y.b)/(y.a^2))
/(x::Number,y::Dual) = /(Dual(x,0.),y)
/(y::Dual,x::Number) = /(y,Dual(x,0.))
# positive integer powers
function ^(x::Dual,n::Int)
    y = x
    for i in 1:(n-1)
        y = y*x
    end
    return y
end
At this point the only curve-ball is division; after a bit of algebra
you'd find the expression
$$ \frac{a+b\epsilon}{c+d\epsilon} = \frac{a}{c} + \frac{bc-ad}{c^2}\epsilon $$
which is what we have in the code.
With these definitions we can make use of AD for functions built from integer
powers, multiplication, division, and summation, e.g. for powers
julia> [Dual(2^b,1)*Dual(2^b,1) for b in 1:8]
8-element Array{Dual{Int64,Int64},1}:
Dual{Int64,Int64}(4, 4)
Dual{Int64,Int64}(16, 8)
Dual{Int64,Int64}(64, 16)
Dual{Int64,Int64}(256, 32)
Dual{Int64,Int64}(1024, 64)
Dual{Int64,Int64}(4096, 128)
Dual{Int64,Int64}(16384, 256)
Dual{Int64,Int64}(65536, 512)
and for \(f(x)=\frac{1}{x}, f'(x)=-\frac{1}{x^2}\)
julia> f = (x) -> 1.0/x
julia> g = (x) -> -(1.0/x^2)
julia> [(1.0/Dual(2^b,1),(f(2^b),g(2^b))) for b in 1:8]
8-element Array{Tuple{Dual{Float64,Float64},Tuple{Float64,Float64}},1}:
(Dual{Float64,Float64}(0.5, -0.25), (0.5, -0.25))
(Dual{Float64,Float64}(0.25, -0.0625), (0.25, -0.0625))
(Dual{Float64,Float64}(0.125, -0.015625), (0.125, -0.015625))
(Dual{Float64,Float64}(0.0625, -0.00390625), (0.0625, -0.00390625))
(Dual{Float64,Float64}(0.03125, -0.0009765625), (0.03125, -0.0009765625))
(Dual{Float64,Float64}(0.015625, -0.000244140625), (0.015625, -0.000244140625))
(Dual{Float64,Float64}(0.0078125, -6.103515625e-5), (0.0078125, -6.103515625e-5))
(Dual{Float64,Float64}(0.00390625, -1.52587890625e-5), (0.00390625, -1.52587890625e-5))
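Since the pattern is always the same (seed the dual part with \(1\), then read
the dual part of the result) we could wrap it in a one-line helper. A sketch,
with derivative being a name I'm introducing rather than anything from the post:
# evaluate f at x + 1ϵ and read the derivative off the dual part
derivative(f, x::Number) = f(Dual(x, one(x))).b

julia> derivative(x -> 1.0/x, 2.0)   # ≈ -0.25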
Great. But what about other functions like \(\cos\), \(\sin\), \(\exp\), \(\log\),
and real powers (roots etc.)? Generally for any such function we would make
use of its Taylor series (or a faster-converging approximation). Let's
look at \(\exp\). The Taylor series simplifies heavily, i.e.
$$e^{a+b\epsilon} = \sum_{n=0}^{\infty}\frac{(a+b\epsilon)^{n}}{n!}=
\sum_{n=0}^{\infty}\frac{1}{n!}\sum_{k=0}^{n}\binom{n}{k}a^{n-k}(b\epsilon)^{k} \\
= \sum_{n=0}^{\infty}\frac{1}{n!}\bigl(a^{n} + na^{n-1}b\epsilon\bigr) = e^{a}(1+b\epsilon)$$
So \(\exp\) of a dual number reduces to the ordinary real \(e^{a}\) plus a couple of multiplications!
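As a sketch (Exp is a name I'm introducing, to avoid clashing with Base.exp):
# exp of a dual number via e^(a+bϵ) = e^a(1 + bϵ)
function Exp(z::Dual)
    c = exp(z.a)           # ordinary real exponential for the real part
    return Dual(c, c*z.b)  # the dual part carries the derivative e^a * b
end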
This also means exponentiation to any real power is trivial if we know log...
$$(a+b\epsilon)^{x} = e^{x\log(a+b\epsilon)}$$
$$ \text{Let } \log(a+b\epsilon) = c+d\epsilon$$
$$ (a+b\epsilon)^x = e^{xc}e^{xd\epsilon} = e^{xc}(1+xd\epsilon)$$
Thankfully we can (admittedly approximately) find the natural log just from integer powers, \(*\), and \(/\)!
$$\log(x) = 2\sum_{k=0}^{\infty}\frac{1}{2k+1}\biggl(\frac{x-1}{x+1}\biggr)^{2k+1}$$
This series in particular converges quickly for \(x\) near \(1\), which
we can arrange by using \(\log(ab) = \log(a)+\log(b)\) to reduce (or increase) the logand
until it is near \(1\).
function Log(z::Dual,k=10)
    ln2 = 0.6931471805599453094172321214581765680755001343602552541206800094933936219696947156058633269964186875420014810205706857336855202
    b = 0.0
    m = z.a
    # use log(ab) = log(a) + log(b) to reduce the magnitude of the logand
    if m > 1
        # find the smallest power of 2 that is at least m
        while 2.0^b < m
            b += 1
        end
    elseif m < 1
        # find the largest power of 2 that is at most m (b goes negative)
        while m < 2.0^b
            b -= 1
        end
    end
    # reduce the logand into (1/2, 2), where the series converges quickly
    z = z / 2.0^b
    L = Dual(0.0,0.0)
    # compute the series with the reduced logand for accuracy
    for i in 0:k
        L += (1/(2*i+1))*((z-Dual(1,0))/(z+Dual(1,0)))^(2*i+1)
    end
    # undo the reduction: log(m) = log(m/2^b) + b*log(2)
    return 2*L + Dual(b*ln2,0.0)
end
function ^(x::Dual,n::AbstractFloat)
    e = 2.7182818284590452353602874713526624977572470936999595749669676277240766303535475945713821785251664274274663919320030599218174135
    y = Log(x,10)
    # x^n = e^(n*log(x)); with log(x) = c + dϵ this is e^(n*c)(1 + n*dϵ)
    c = e^(n*y.a)
    return Dual(c,c*(n*y.b))
end
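As a quick sanity check (the expected values below come from the calculus,
not from verbatim REPL output): since \(\frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}}\),
julia> Dual(4.0,1.0)^0.5   # ≈ Dual(2.0, 0.25), i.e. (√4, 1/(2√4))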
This code gives us a quick implementation of dual numbers that we can get some use
out of, but it's far from complete. A production-quality implementation in Julia can be found
in ForwardDiff.jl,
and a great industrial use case is the very elegant machine learning library Flux.jl.
References for Further Reading
[1] Revels, J., Lubin, M., and Papamarkou, T., "Forward-Mode Automatic Differentiation in Julia", arXiv:1607.07892 [cs.MS], 2016.