Why Build Your Own Deep Learning Library?

This course differs from other deep learning courses in that it builds the knowledge it conveys from first principles. I've done my best to remove as much abstraction as possible from the way concepts are developed. That's not to say there's no abstraction whatsoever. Teaching any topic beyond the rudiments relies on some shared prior knowledge between teacher and student, and this course is no exception. Desired prerequisites include a solid understanding of elementary statistics, calculus, and linear algebra fundamentals (at the very least, familiarity with matrix algebra), some knowledge of software systems and data structures, and programming experience in a high-level language (Python, R, Julia, Java, C#, etc.), though not necessarily Python itself. The difference here is that the requisite knowledge lies outside the domain in which we'll be working.

Other courses take what I'd like to call a top-to-middle approach. They introduce machine learning within the context of already well-established traditions and tools, perhaps by showing how to create a convolutional neural network in Keras to classify images, or how to predict the next word in a sentence using RNN language models written in PyTorch. From there, they may dive a little deeper, perhaps touching on a point or two of theory about how the underlying mathematics of the models works, and maybe even writing down a few equations for the student to peruse. And that's where the learning stops. All code is provided a priori, so students default to following a recipe. They copy what's written onto their own machines and maybe tweak a few hyperparameters here or there, but do they really experiment? Do they try and compare a wide variety of models on a single task, integrate multiple architectures into a single ensemble, or perform cross-validation passes and rigorous hyperparameter searches? Likely not. All they learn is a formula: ingest data, throw together a few TensorFlow layers, choose a loss function, get decent accuracy, rinse and repeat.

Is this really sufficient? Is someone who has completed one of these courses truly suited for an entry-level machine learning role at a top tech company, and can they apply what they've learned effectively on the job without understanding what's going on under the hood of the systems they use? Telling a student to call loss.backward() or sess.run() once they've constructed their model doesn't teach them the mathematical principles underlying backprop to the point at which they could comfortably derive it themselves (which is necessary when developing anything relatively unique). What's more, will they be able to debug the CUDA errors that inevitably arise when they construct an invalid model of their data? Without understanding how our tools work, we can't build effectively, and our development remains stymied by those same tools' constraints. Can you imagine going to an interview at a company like Microsoft or Google and being asked about deep learning systems without at least some rough idea of how they actually work? Can you imagine trying to implement your own custom CUDA kernels when you're unfamiliar with numerical instability, low-level programming, and other caveats?

In this course we'll fix that by building from the bottom up. We start learning about neural networks by learning how backpropagation really works. And we won't derive symbolic backpropagation rules and implement them in slow code to develop a model that would never scale in the real world. We'll implement our own reverse-mode automatic differentiation framework, the way backprop is actually implemented in industry, and we'll build further knowledge atop that foundation. We'll implement our own CUDA kernels and we'll achieve speed and performance comparable to that of the frameworks used by industry giants. We'll implement each and every model and kernel on our own, from scratch, using best practices and state-of-the-art techniques. We'll develop our own data ingestion pipelines and tie them into our library, then visualize our outputs using our own plotting tools. We'll get comfortable with every step of the deep learning process from start to finish, and we'll build complex models in computer vision, NLP, and reinforcement learning atop this solid foundation. And when something breaks, we won't be left scratching our heads, because we'll have built the framework we're running and will be intimately familiar with its implementation.
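
To give a taste of what reverse-mode automatic differentiation means in practice, here is a minimal sketch for scalars in plain Python. It is only an illustration under simplifying assumptions (scalars only, just addition and multiplication, no GPU support), not the framework this course builds; the class name `Value` and its attributes are hypothetical names chosen for this example.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow backward."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = float(data)
        self.grad = 0.0
        self._parents = parents          # the Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Build a post-order traversal of the graph: inputs appear before outputs.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)

        # Walk from the output back to the inputs, applying the chain rule.
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad


x = Value(3.0)
w = Value(-2.0)
b = Value(1.0)
y = w * x + b * x   # x is used twice, so its gradient accumulates from both paths
y.backward()

print(y.data, x.grad, w.grad, b.grad)
```

Here $y = wx + bx$, so we expect $\partial y / \partial x = w + b$, $\partial y / \partial w = x$, and $\partial y / \partial b = x$. A single backward pass fills in all three gradients at once, which is the property that makes the reverse mode attractive when a scalar loss depends on many parameters.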

A Bit About this Course and Notation

This course is designed with the machine learning practitioner in mind. That's not to say we won't cover theory. We'll do so in bucketloads, but only in ways that advance our understanding of how best to implement things in practice. To that end, I've introduced explanations for key ML concepts that I'll call proofs by intuition. These concepts often have rigorous underlying math that makes their technical definitions clear, but that can obscure why these objects or notions exist in the first place and how they can be useful to us when creating, training, and testing models. I'll introduce the math as well, because it's always important, but the proof by intuition will precede it and hopefully give a more global view of how the concept fits into our framework of understanding.

Terminology and Notation

It can sometimes seem like half the challenge in becoming acquainted with machine learning lies in understanding the differences in notation and terminology between sources. These differences exist because, historically, machine learning arose out of a confluence of different fields: mathematics, statistics, computer science, control theory, and many more. Each of these fields had (and still has) its own way of talking about concepts and writing them down. For example, in statistics the variable $\theta$ is often used to denote a model's parameters, whereas in the deep learning literature it's more common to see $W$, to make explicit that the parameters are actually a neural network's weights. Similarly, a neural network's input $X$ might be referred to as a design matrix in statistics but simply as an input in deep learning.

Because we want this course to be comprehensive, whenever we encounter or introduce a concept that can be named or notated in multiple ways, we'll make an effort to enumerate as many of these as possible, to reduce confusion if you see the same concept used elsewhere. We'll also try to stick to a consistent notation scheme, the one most commonly used in deep learning, to make our presentation as uniform as possible. However, where our usual notation would be awkward or would conflict with the majority of the existing literature, we'll defer to the standard, prevailing conventions.

Standard Notation Index

Here we'll list all the notations used in this course along with their meanings. You should reference this section whenever you get stuck on notation that seems unclear.

  • Parenthesized superscripts on a variable will be used to index the elements of a set. For example, a training set may be notated as $\mathcal{D} = \{x^{(1)}, x^{(2)}, \ldots, x^{(n)}\}$.
  • Subscripts on a variable will be used to index the dimensions of that variable. So, for example, for a vector $\mathbf{x} \in \mathbb{R}^n$, we'd have $\mathbf{x} = [x_1, x_2, \ldots, x_n]$.
  • Variables with $k$ subscripts will be used to denote the elements of a $k$-dimensional tensor. For example, $x_{i, j, l}$ would be the element at index $(i, j, l)$ of the 3-dimensional tensor $X$.
  • Boldfaced, but lowercase, variables will represent vectors, e.g. $\mathbf{x}$.
  • Uppercase variables will represent general tensors.
  • In the context of probability, $P$ will represent the probability mass function of a discrete random variable $X$.
  • In the context of probability, $p$ will represent the probability density function of a continuous random variable $x$.
  • $P(X, Y)$ will represent the joint distribution of the discrete random variables $X$ and $Y$.
  • $p(x, y)$ will represent the joint distribution of the continuous random variables $x$ and $y$.
  • $\mathcal{N}$ will represent either a neural network or the normal distribution, depending on the context in which it's used.
  • $\mathbf{W}$ will represent the entirety of a neural network's weights.
  • $W_i$ will represent the weights of the $i$th layer of a neural network.
  • $\mathbb{E}_{x \sim p(x)}[f(x)]$ will represent the expectation of $f(x)$ with $x$ being drawn from the probability distribution $p(x)$.
  • $L(f(x), y)$ $(L(f(\mathbf{x}), \mathbf{y}))$ will be the loss calculated between a neural network's output $f(x)$ $(f(\mathbf{x}))$ and the target(s) $y$ $(\mathbf{y})$.
  • $\nabla_{\mathbf{x}} f$ will be the gradient of $f$ with respect to $\mathbf{x}$.
  • $J_f$ will be the Jacobian matrix (matrix of partial derivatives) of $f$.
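
To make a few of these notations concrete, the sketch below maps them onto plain Python. It is only an illustration: the helper names `expectation`, `gradient`, and `jacobian` are hypothetical, the example functions and data are made up, and the derivatives are estimated with central finite differences rather than computed exactly.

```python
import random

# D = {x^(1), ..., x^(n)}: the parenthesized superscript selects an example from the set.
D = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x_2 = D[1]        # x^(2); Python is 0-indexed, so the second example is D[1]
x_2_1 = D[1][0]   # x^(2)_1: the first dimension of the second example


def expectation(f, sampler, num_samples=100_000):
    """Monte Carlo estimate of E_{x ~ p(x)}[f(x)], given a function that samples from p."""
    return sum(f(sampler()) for _ in range(num_samples)) / num_samples


def gradient(f, x, eps=1e-6):
    """Central-difference estimate of the gradient of a scalar-valued f at x."""
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def jacobian(f, x, eps=1e-6):
    """Central-difference estimate of J_f: row i holds the partials of output i."""
    columns = []
    for j in range(len(x)):
        hi = x[:j] + [x[j] + eps] + x[j + 1:]
        lo = x[:j] + [x[j] - eps] + x[j + 1:]
        columns.append([(a - b) / (2 * eps) for a, b in zip(f(hi), f(lo))])
    # Transpose so that rows index outputs and columns index inputs.
    return [list(row) for row in zip(*columns)]


random.seed(0)

# E_{x ~ N(0, 1)}[x^2] is the variance of a standard normal, which is 1.
second_moment = expectation(lambda x: x * x, lambda: random.gauss(0.0, 1.0))

# f(x) = x_1^2 + 3 x_2 has gradient [2 x_1, 3].
grad_f = gradient(lambda x: x[0] ** 2 + 3 * x[1], [2.0, 5.0])

# g(x) = [x_1 x_2, x_1 + x_2] has Jacobian [[x_2, x_1], [1, 1]].
jac_g = jacobian(lambda x: [x[0] * x[1], x[0] + x[1]], [2.0, 5.0])

print(second_moment, grad_f, jac_g)
```

At the point $\mathbf{x} = [2, 5]$ the printed gradient should be close to $[4, 3]$ and the Jacobian close to $[[5, 2], [1, 1]]$, while the Monte Carlo estimate should land near 1, with some sampling noise.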