The Calculus You Actually Need for Deep Learning

Oct 2, 2020   #Calculus 

DRAFT: Need to proofread expressions and derivations.

The Matrix Calculus You Need For Deep Learning (MCYNDL) is a great summary of what you need, building up from pretty much the basics. However, I’ve always felt that introducing the index notation used in physics can simplify what looks like complicated math, even relative to that simplified intro. So here is my attempt at re-presenting the material in that paper, but using the simplified index notation. I think good notations are technology for the mind, because they let you think about and work with things that you might otherwise find too cumbersome or complicated to deal with.

Warning: Physicists may scoff at this presentation, because in this domain, we aren’t dealing with real vector spaces even though we’ll be calling things vectors and tensors. This is because true vectors and tensors in physics have invariance properties associated with them, whereas no such invariance properties are available (in general) in the spaces we’re working with.

Introduction

The original paper referenced above gives a great intro. I don’t think I need to add anything to it just yet.

Scalar derivative rules

I find the scalar derivative rules more simply expressed if we use infinitesimal notation instead of the traditional \(d/dx\) notation.

| Rule | \(f(x)\) | Derivative | Example |
| --- | --- | --- | --- |
| Constant | \(c\) | \(dc = 0\) | \(d42 = 0\) |
| Sum/Difference rule | \(f \pm g\) | \(d(f \pm g) = df \pm dg\) | \(d(x^2 \pm 3x) = (2x \pm 3)dx\) |
| Product rule | \(f \cdot g\) | \(d(f \cdot g) = df \cdot g + f \cdot dg\) | \(d(x^2 \sin(x)) = (2x \sin(x) + x^2 \cos(x))dx\) |
| Chain rule | \(f(g(x))\) | \(d(f(g(x))) = f’dg = f’g’dx\) | \(d(\sin(x^2)) = \cos(x^2) \cdot 2x\,dx\) |
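These rules are easy to sanity-check numerically. A minimal sketch (my own illustration, using NumPy and a central finite difference) for the product-rule and chain-rule examples from the table:

```python
import numpy as np

# Product-rule example from the table: d(x^2 sin(x)) = (2x sin(x) + x^2 cos(x)) dx
def f(x):
    return x**2 * np.sin(x)

def df_dx(x):
    return 2 * x * np.sin(x) + x**2 * np.cos(x)

x, h = 1.3, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)   # central difference approximation
assert abs(numeric - df_dx(x)) < 1e-6

# Chain-rule example: d(sin(x^2)) = cos(x^2) * 2x dx
g = lambda x: np.sin(x**2)
dg_dx = lambda x: np.cos(x**2) * 2 * x
numeric_g = (g(x + h) - g(x - h)) / (2 * h)
assert abs(numeric_g - dg_dx(x)) < 1e-6
```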

The “multiplication by constant” rule of \(d(cf) = c df\) is a consequence of the product rule, where setting \(f = c\) gives us \(df = 0\), leaving behind only the \(f dg\) term.

The “power rule” can be seen like this -

$$ d(x^n) = (x+dx)^n - x^n = x^n + nx^{n-1}dx + (\text{ignored higher powers of } dx)- x^n = nx^{n-1}dx $$
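The "higher powers of \(dx\) are ignorable" step can be checked directly: for a small but finite \(dx\), the difference between \((x+dx)^n - x^n\) and \(nx^{n-1}dx\) is tiny relative to the increment itself. A quick sketch (my own illustration; the specific values are arbitrary):

```python
# Power rule check: d(x^n) ~ n x^(n-1) dx, with the higher powers of dx
# contributing only at order dx^2 and below.
x, dx, n = 2.0, 1e-5, 5
exact_df = n * x**(n - 1) * dx    # the n x^(n-1) dx term
full_df = (x + dx)**n - x**n      # includes the "ignored" higher powers of dx
assert abs(full_df - exact_df) < 1e-3 * abs(exact_df)
```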

MCYNDL advises against the \(f’\) notation for derivative of a function of a single variable since it doesn’t generalize. This is valid and we’ll see a general version of this below.

Introduction to vector calculus and partial derivatives

While that section title alludes to “vector calculus”, the “vector” part of the calculus is not really seen in its entirety since, as mentioned earlier, we usually aren’t really dealing with vector spaces. For example, we won’t usually encounter the notion of curl. That said, infinitesimals work pretty much like vectors because they exist in “tangent spaces” .. something like a plane that is tangential to a sphere in 3D. So we can use the notion of vectors with infinitesimals.

First, let’s consider a scalar function of an \(n\)-dimensional vector whose components are written \(x_i\), with the index \(i\) varying from \(1\) to \(n\). We write such a function as \(f(x_i)\). Conventionally, we’ll express its derivative w.r.t. the \(j\)-th axis as \(\frac{\partial f}{\partial x_j}\). As we see, this derivative also has \(n\) components at a given point \( x_i \). We introduce a simplified notation for this - \(\frac{\partial f}{\partial x_i} = f_{, i}\). (Notice the comma that precedes \(i\).)

So we write \(df = \sum_{i=1}^n{f_{, i}dx_i}\).

Summation convention: Einstein introduced a simplifying convention here, called .. for obvious reasons “the Einstein summation convention”, where a repeated index is expected to be summed over, so you can omit the summation sign for that. With this convention, the above simply reads -

$$ df = f_{, i}dx_i $$

.. which can be seen as the higher dimensional analogue of \(df(x) = f’dx\).
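To make \(df = f_{, i}dx_i\) concrete, here is a sketch (my own illustration, using NumPy; the function and displacement are arbitrary) checking that the implied sum over \(i\) predicts the actual change in \(f\) under a small displacement:

```python
import numpy as np

# For f(x) = sin(x_0) + x_1 * x_2 the gradient components are
# f_{,i} = (cos(x_0), x_2, x_1), and df = f_{,i} dx_i (implied sum).
f = lambda x: np.sin(x[0]) + x[1] * x[2]
grad = lambda x: np.array([np.cos(x[0]), x[2], x[1]])

x = np.array([0.5, 1.0, 2.0])
dx = 1e-6 * np.array([1.0, -2.0, 3.0])   # a small displacement
df_pred = np.einsum('i,i->', grad(x), dx)  # f_{,i} dx_i with implied summation
df_true = f(x + dx) - f(x)
assert abs(df_pred - df_true) < 1e-9
```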

Given two vectors \(x_i\) and \(y_j\), the “dot product” between these two vectors is now simply expressed as \(x_iy_i\). Because the index \(i\) is repeated in the expression, it is expected to be summed over, so that really means \(\sum_{i=1}^{n}{x_iy_i}\). The same notation can be extended for matrices as well by using multiple indices like \(m_{ij}\).
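As an aside for programmers: NumPy’s `einsum` implements exactly this convention, with the subscript string playing the role of the indices. A repeated index is summed over, just as above:

```python
import numpy as np

# The summation convention maps directly onto np.einsum: a repeated
# index in the subscript string is summed over and dropped from the output.
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

dot = np.einsum('i,i->', x, y)   # x_i y_i: i is repeated, so it is summed
assert dot == 32.0               # 1*4 + 2*5 + 3*6
```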

The previous rules for functions of a single vector variable then look like below -

| Rule | \(f(x)\) | Derivative |
| --- | --- | --- |
| Scalar function of vector | \(f(x_j)\) | \(df = f_{, i}dx_i\) |
| Sum/Difference rule | \(f \pm g\) | \(d(f \pm g) = (f \pm g)_{, i}dx_i = (f_{, i} \pm g_{, i})dx_i\) |
| Product rule | \(f \cdot g\) | \(d(f \cdot g) = (f \cdot g)_{, i}dx_i = (f_{, i}g + fg_{, i})dx_i\) |
| Chain rule | \(f(g(x_i))\) | \(df = f’g_{, i}dx_i\) |

The derivative of a scalar function of a vector is called its “gradient”. The idea behind the term is that you can imagine the scalar function to specify the height of ground at a given point in a 2D space, and the derivative at a point will be a vector pointing uphill - i.e. along the gradient.

A vector function of a vector is then \(f_i(x_j)\), for which the rules become -

| Rule | \(f(x)\) | Derivative |
| --- | --- | --- |
| Vector function of vector | \(f_j(x_i)\) | \(df_j = f_{j, i}dx_i\) |
| Dot product rule | \(f_j g_j\) | \(d(f_j g_j) = (f_j g_j)_{, i}dx_i = (f_{j, i}g_j + f_jg_{j, i})dx_i\) |
| Chain rule | \(f_j(g_k(x_i))\) | \(df_j = f_{j, k}dg_k = f_{j, k}g_{k, i}dx_i\) |

Note that in the chain rule above, we have double summation over \(k, i\) because both the indices repeat. Also note how the pattern of \(something_{, i}\) acts in exactly the same manner whether we’re dealing with a scalar function or a vector function.
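The chain rule \(f_{j, k}g_{k, i}\) is just a matrix product of Jacobians. A sketch (my own illustration, using NumPy and two linear maps so the Jacobians are exact):

```python
import numpy as np

# Chain rule for Jacobians: df_j = f_{j,k} g_{k,i} dx_i.
# Take g_k(x) = A_{ki} x_i and f_j(g) = B_{jk} g_k, both linear,
# so the composite Jacobian should be the matrix product B A.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))   # g_{k,i}
B = rng.normal(size=(2, 4))   # f_{j,k}

# f_{j,k} g_{k,i}: k repeats, so it is summed over.
composite = np.einsum('jk,ki->ji', B, A)
assert np.allclose(composite, B @ A)

# Check df_j = (BA)_{j,i} dx_i against a direct evaluation of f(g(x)).
x = rng.normal(size=3)
dx = 1e-6 * rng.normal(size=3)
df_pred = np.einsum('ji,i->j', composite, dx)
df_true = B @ (A @ (x + dx)) - B @ (A @ x)
assert np.allclose(df_pred, df_true)
```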

In case you didn’t notice it, we’re already done covering the concepts of -

  1. Gradient - applicable to a scalar function of a vector
  2. Jacobian - this is simply \(f_{j, i}\) for a vector function of a vector.
  3. The chain rule for Jacobians.

We’re now nearly done with the entire section in MCYNDL named “Matrix calculus” .. save for actually working out examples. I don’t undervalue the working out of examples - I actually encourage it. The purpose of this post is to show that we don’t need to invent too much notation to talk about what is essentially one idea.

It is worth introducing some “special matrices” that crop up.

The identity matrix

The “Kronecker delta” symbol \(\delta_{ij}\) is usually used for the identity matrix and is defined by -

$$ \delta_{ij} = 1 \text{ if $i=j$ and } 0 \text{ if $i\neq j$} $$

With the above definition and employing the summation convention, it should be clear that for a vector \(x_j\), \(\delta_{ij}x_j = x_i\) .. which you can think of as “delta replaces the index variable with its own”. And by the way, we already saw how \(x_iy_i\) is the dot product of two vectors. By the same token, \(m_{ij}x_j = y_i\) is the matrix multiplication rule, assuming that by now you’ve gotten used to reading an implied summation over repeated indices. This connects with \(\delta\) as the “identity matrix”.
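Both facts, the index-replacing behaviour of \(\delta\) and the matrix multiplication rule, can be seen in one short sketch (my own illustration, using NumPy):

```python
import numpy as np

# delta_{ij} x_j = x_i: contracting with the Kronecker delta just
# renames the index, i.e. it acts as the identity matrix.
x = np.array([1.0, 2.0, 3.0])
delta = np.eye(3)
assert np.allclose(np.einsum('ij,j->i', delta, x), x)

# m_{ij} x_j = y_i is ordinary matrix-vector multiplication.
m = np.arange(9.0).reshape(3, 3)
y = np.einsum('ij,j->i', m, x)
assert np.allclose(y, m @ x)
```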

The Jacobian

You also, by now, probably see that the Jacobian \(f_{j, i}\) is a matrix because it has two indices.

What about higher order derivatives? If you want to take the next derivative of the Jacobian for example, you’d get \(f_{k, ij} = \frac{\partial^2 f_k}{\partial x_i \partial x_j}\) where the assumption in the notation is that all the indices that occur after the comma represent “taking the derivative”. So the derivative of a Jacobian is a “rank 3 tensor” since it has three indices to be specified to get a number. By that token, vectors are “rank 1 tensors” and matrices are “rank 2 tensors”.
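As a quick concrete check of the rank counting (my own illustration, using NumPy): for the simple vector function \(f_k(x) = x_k^2\), the Jacobian is \(f_{k, i} = 2x_k\delta_{ki}\) and differentiating again gives \(f_{k, ij} = 2\delta_{ki}\delta_{kj}\), a three-index array:

```python
import numpy as np

# f_k(x) = x_k^2 (componentwise square). Differentiating the Jacobian
# f_{k,i} = 2 x_k delta_{ki} once more gives the constant rank-3 tensor
# f_{k,ij} = 2 delta_{ki} delta_{kj}.
n = 3
delta = np.eye(n)
f_kij = 2 * np.einsum('ki,kj->kij', delta, delta)
assert f_kij.shape == (n, n, n)   # three indices => rank 3 tensor
assert f_kij[1, 1, 1] == 2.0      # nonzero only when k = i = j
assert f_kij[0, 1, 2] == 0.0
```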

.. and yes we’re done with “single variable total derivative chain rule” and “vector chain rule” as well. Just see the table above. … and you already know how to extend this to higher rank tensors as well because the “comma notation” works the same way no matter what the rank.

Functions of matrices

We’ve so far considered functions of vectors, even matrix functions of vectors. However, it is also useful to have a way to deal with functions of matrices and such “higher order tensors”. For example, a scalar function of a matrix may be written as \(f(m_{ij})\). By this, we’re saying that \(f\) takes all the values in the matrix and produces a single number - i.e. it is a function of all the elements of the matrix. As a simple example, \(f\) could be the sum of the diagonal elements of the matrix (known as its “trace”). Using the summation convention, we can write the trace of a matrix \(m\) as \(m_{ii}\).
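The trace example in `einsum` form (my own illustration; the repeated index `ii` picks out the diagonal and sums it, just as the convention says):

```python
import numpy as np

# The trace m_{ii} is a scalar function of a matrix: the index i is
# repeated, so it is summed, yielding the sum of the diagonal.
m = np.array([[1.0, 2.0],
              [3.0, 4.0]])
trace = np.einsum('ii->', m)   # m_{ii}
assert trace == 5.0            # 1 + 4
```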

So if we want a notation for the derivative of the function w.r.t. the matrix entries, we have to write -

$$ \frac{\partial f(m_{ij})}{\partial m_{ij}} = g_{ij} $$

where \(g\) is the derivative of the scalar function and we see that it is itself a matrix .. just as the derivative of a scalar function of a vector is itself a vector - i.e. the “gradient vector”.

How do we use the “comma notation” with this though? The problem is that if we write \(f_{, ij}\) that looks like the second derivative. So we need to indicate that the indices \((ij)\) refer to taking the derivative only once. To do that, we may write \(g_{ij} = f_{, (ij)}\).
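For the trace example, the matrix-valued derivative is easy to verify: \(f(m) = m_{kk}\) gives \(f_{, (ij)} = \delta_{ij}\), since only diagonal perturbations change the trace. A finite-difference sketch (my own illustration, using NumPy):

```python
import numpy as np

# For f(m) = trace(m) = m_{kk}, the derivative f_{,(ij)} is delta_{ij}:
# perturbing an off-diagonal entry leaves the trace unchanged, while a
# diagonal perturbation changes it one-to-one.
def trace(m):
    return np.einsum('ii->', m)

m = np.arange(9.0).reshape(3, 3)
g = np.zeros((3, 3))
h = 1e-6
for i in range(3):
    for j in range(3):
        dm = np.zeros((3, 3))
        dm[i, j] = h
        g[i, j] = (trace(m + dm) - trace(m)) / h   # numeric f_{,(ij)}
assert np.allclose(g, np.eye(3))
```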

We can extend this notion to functions of higher order tensors too in an obvious manner.

The gradient of neuron activation

$$ \renewcommand{\hat}[1]{\widehat{#1}} $$

A neuron layer can be written as -

$$ \hat{y}_i = \sigma(w_{ij}x_j + b_i) $$

Again, we’re assuming that the index \(j\) is being summed over according to the summation convention. One distinction though is that \(\sigma\) is not a scalar function of a vector. It is a scalar function of a scalar that is applied to every component of a vector to produce another “vector”. Since \(\sigma\) maps one vector to another vector, it is actually a “second rank” entity and must be treated as such. To distinguish such a \(\sigma\) from an actual scalar function of a vector, we’ll write the expression like below -

$$ \hat{y}_i = \sigma_i(y_j) = \sigma_i(w_{jk}x_k + b_j) $$

When applying functions, any implied summation is expected to be carried out within the arguments first and not bleed outside, as that would mean something entirely different.

One simplification that’ll help us here is to rewrite the “affine” part of the above expression using only \(w\), by extending the indices to include \(0\), with the assumption that \(x_0\) will always be \(1\), so \(w_{i0} = b_i\). With this assumption, we may then write the above expression as -

$$ \hat{y}_i = \sigma_i(w_{jk}x_k) $$
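In code, the folded-bias trick is just a prepended \(x_0 = 1\) column. A sketch (my own illustration, using NumPy with the logistic function as \(\sigma\); the shapes and values are arbitrary):

```python
import numpy as np

# Forward pass y_hat_i = sigma_i(w_{jk} x_k), with the bias folded into
# w by prepending x_0 = 1, so that w_{j0} plays the role of b_j.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
w = rng.normal(size=(2, 4))          # 2 outputs; 3 inputs + 1 bias column
x = np.array([0.5, -1.0, 2.0])

x_ext = np.concatenate(([1.0], x))   # x_0 = 1
y_hat = sigmoid(np.einsum('jk,k->j', w, x_ext))   # sigma(w_{jk} x_k)

# Same computation with an explicit bias, for comparison.
b, w_rest = w[:, 0], w[:, 1:]
assert np.allclose(y_hat, sigmoid(w_rest @ x + b))
```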

When training a neural network, we want to propagate changes in the output back to the weights \(w_{jk}\). First, we have to define the “loss function”. Let’s take a simple one -

$$ \text{loss}(\hat{y}_i) = (\hat{y}_j - y_j)(\hat{y}_j - y_j) = \hat{y}_j\hat{y}_j - 2y_j\hat{y}_j + y_jy_j $$

Therefore -

$$ d\text{loss}(\hat{y}_i) = 2(\hat{y}_j - y_j)d{\hat{y}_j} $$

where

$$ d(\hat{y}_i) = \hat{y}_{i,(mn)}dw_{mn}
= d{\sigma_i(w_{jk}x_k)}
= \sigma_{i,m}(w_{jk}x_k)d(w_{mn}x_n)
= \sigma_{i,m}(w_{jk}x_k)x_ndw_{mn} $$

.. where we again use the summation convention to get at the “dot product”. Here, \(\text{loss}(\hat{y}_i)\) is indeed a scalar function since it depends on all the components of \(\hat{y}\).

The full derivative expression can be written as -

$$ d\text{loss} = \text{loss}_{,(gh)}dw_{gh}
= 2(\hat{y}_i - y_i) \sigma_{i,m}(w_{jk}x_k)x_ndw_{mn}
$$

giving us …

$$ \text{loss}_{,(gh)} = 2(\hat{y}_i - y_i) \sigma_{i,m}(w_{jk}x_k)x_n\delta_{(mn)(gh)} $$

where we know that \(\sigma_{i,m} = 0 \text{ for } i \neq m\) and where we use the extended form of \(\delta\) -

$$ \delta_{(ij)(kl)} = 1 \text{ if $i=k$ and $j=l$, else } 0 $$

.. and noting that \(w_{mn,(gh)} = \delta_{(mn)(gh)} \)

In full expanded form with explicit summations put in, the loss derivative is -

$$ \text{loss}_{,(gh)} = \frac{\partial \text{loss}}{\partial w_{gh}} = 2\sum_{imn}{(\hat{y}_i - y_i) \sigma_{i,m}(\sum_k{w_{jk}x_k})x_n\delta_{(mn)(gh)}} $$

Notice that the loss derivative is a matrix of the same order as \(w\) and therefore it can be used to determine the direction in which to adjust \(w\) to get lower loss. This is what gets us the weights update rule for a single neuron layer.

$$ w^{(t+1)}_{gh} = w^{(t)}_{gh} - \eta \text{loss}_{, (gh)} $$
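A sketch of this update rule in code (my own illustration, using NumPy with the logistic function as \(\sigma\), for which \(\sigma_{i,m}\) is diagonal with entries \(\hat{y}_i(1 - \hat{y}_i)\); the \(\delta\) has already collapsed \(m \to g\), \(n \to h\)):

```python
import numpy as np

# Gradient of the squared loss for one sigmoid layer:
# loss_{,(gh)} = 2 (y_hat_g - y_g) sigma'(z_g) x_h.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
w = rng.normal(size=(2, 3))
x = rng.normal(size=3)
y = np.array([0.3, 0.7])             # target

def loss(w):
    y_hat = sigmoid(w @ x)
    return np.einsum('j,j->', y_hat - y, y_hat - y)

y_hat = sigmoid(w @ x)
sigma_prime = y_hat * (1.0 - y_hat)  # diagonal entries of sigma_{i,m}
grad = np.einsum('i,i,h->ih', 2 * (y_hat - y), sigma_prime, x)

# Finite-difference check of one entry of loss_{,(gh)}.
h = 1e-6
dw = np.zeros_like(w); dw[0, 1] = h
assert abs((loss(w + dw) - loss(w)) / h - grad[0, 1]) < 1e-4

# One small gradient-descent step lowers the loss.
eta = 0.01
assert loss(w - eta * grad) < loss(w)
```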

Composing many neuron layers is a matter of applying the chain rule, since neuron layers compose according to normal function composition.

Postscript on indices

The above index-based treatment of tensors translates more or less directly into code, where tensors are indexed using an array of integer indices. It becomes pretty error-prone to manipulate these indices and keep track of what’s what. A better approach when programming is to give descriptive names to the index positions, as implemented in the Named Tensors module in PyTorch.