28/3/2017
A non-technical introduction to machine learning
Machine learning is a field that threatens to both augment and undermine exactly what it means to be human, and it’s becoming increasingly important that you—yes, you—actually understand it.
To that end, this post aims to give you a comprehensive, high-level understanding of machine learning — without the esoteric statistical details.
By the end of this post, you’ll understand the basic logical framework of machine learning (ML) and will be able to define a handful of important terms and concepts that anyone interested in this field should know. These terms are highlighted in boldface.
So what is machine learning?
The field itself: ML is a field of study which harnesses principles of computer science and statistics to create statistical models, which I’ll explain later in this post. These models are generally used to do two things:
- Prediction: make predictions about the future based on data about the past
- Inference: discover (or infer) patterns in data
Difference between ML and AI: There is no universally agreed upon distinction between ML and artificial intelligence (AI). AI usually concentrates on programming computers to make decisions (based on ML models and sets of logical rules), whereas ML focuses more on making predictions about the future.
They are highly interconnected fields, and, for most non-technical purposes, they are the same.
What’s a statistical model?
Models: Teaching a computer to make predictions involves feeding data into machine learning models, which are simplified representations of how the world supposedly works. If I tell a statistical model that the world works a certain way (say, for example, that taller people make more money than shorter people), then this model can tell me who it thinks will make more money between Cathy, who is 5’2”, and Jill, who is 5’9”.
What does a model actually look like on paper? It’s quite simple. A model is just a mathematical function, which, as you probably already know, is a relationship between a set of inputs and a set of outputs. Here’s an example:
f(x) = x²
This is a function that takes as input a number and returns that number squared. So, f(1) = 1, f(2) = 4, f(3) = 9.
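If you prefer code to notation, here is the same squaring function as a minimal Python sketch. The name f simply mirrors the notation in the text; nothing else is assumed.

```python
def f(x):
    """Take a number as input and return that number squared."""
    return x * x

# The three examples from the text.
for number in (1, 2, 3):
    print(number, "->", f(number))
```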
Let’s briefly return to the example of the model that predicts income from height. I may believe, based on what I’ve seen in the corporate world, that a given employee’s annual income is on average equal to her height (in inches) times 1,000. So, if you’re 60 inches tall, then I’ll guess that you make around $60,000 a year. If you’re a foot taller, I think you’ll make $72,000 a year.
This model can be represented mathematically as follows:
income = $1,000 × height (in inches)
In other words, income is a function of height.
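The height-to-income guess can also be written as a tiny Python function. This is only a sketch of the rule of thumb described above; the names DOLLARS_PER_INCH and predict_income are made up for illustration.

```python
# The guess from the text: each inch of height is worth $1,000 of annual income.
DOLLARS_PER_INCH = 1_000

def predict_income(height_inches):
    """Return the guessed annual income (in dollars) for a given height."""
    return DOLLARS_PER_INCH * height_inches

print(predict_income(60))  # someone 60 inches tall
print(predict_income(72))  # someone a foot taller
```

Notice that nothing has been learned yet: the $1,000 figure was picked by hand. Machine learning is about estimating that number from data instead.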
Here’s the main point: Machine learning refers to a set of techniques for estimating functions (like the one involving income) based on datasets (pairs of heights and their associated incomes). These functions, which are called models, can then be used for predictions of future data.
Algorithms: These functions are estimated using algorithms. In this context, an algorithm is a predefined set of steps that takes as input a bunch of data and then transforms it through mathematical operations. You can think of an algorithm like a recipe — first do this, then do that, then do this. Done.
Let’s now talk about how a model is actually built.
A framework for understanding ML
Inputs: Statistical models learn from past observations, formatted as tables of data (called training data). These structured datasets (such as those you might find in Excel sheets) tend to be formatted in a very easy-to-understand way: each row in the dataset represents an individual observation, also called a datum or measurement, and each column represents a different feature, also called a predictor, of an observation.
For example, you might imagine a dataset about people, in which each row represents a different person, and each column represents a different feature about that person: age, height, income, et cetera. Here’s an example:
Image: Example of structured data. Each row is a person (a unique observation we’ve made), and each column is a different measurement (called a feature or predictor) of that person.
Because one common goal of ML is to make predictions (for example, about someone’s income), training data also includes a column containing the data you want to predict. This feature is called the response variable (or output variable, or dependent variable) and looks just like any other feature in the table. In the example above, we might choose income to be our response variable, which is why it’s highlighted in green.
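To make the table idea concrete, here is a small sketch of how such training data might be held in Python. The three people and all of their values are invented for illustration, and the column names (age, height_in, income) are assumptions rather than anything from a real dataset.

```python
# Each dictionary is one row (an observation); each key is one column (a feature).
# All values are made up for illustration only.
training_data = [
    {"age": 34, "height_in": 62, "income": 61_000},
    {"age": 45, "height_in": 69, "income": 70_500},
    {"age": 29, "height_in": 66, "income": 64_000},
]

# We choose income as the response variable, the column we want to predict.
RESPONSE = "income"

# Predictors: every column except the response.
predictors = [
    {name: value for name, value in row.items() if name != RESPONSE}
    for row in training_data
]

# Response: just the income column.
response = [row[RESPONSE] for row in training_data]

print(predictors)
print(response)
```

Separating the predictors from the response like this is the usual first step before fitting a model: the model sees the predictors and tries to reproduce the response.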
Now that we have a dataset, we can begin building a statistical model. Assuming that there’s some relationship between our predictors and our response — that income is somehow based on your age or your height, or a combination of the two — our model will then be able to predict someone’s income based on measurements of their height and age.
We first need to pick which type of model to use. There are hundreds, if not thousands, of models a data scientist can choose from — ranging from linear regression to neural networks — and the choice can depend on many different factors (such as the type of your response variable and even the speed of your computer). All in all, we’re looking for a model that best represents the patterns we’re seeing in the data.
Let’s imagine we chose a linear regression model — a simple model which describes a straight line.
Image: An example of a linear regression model (in red) on top of training data (in blue). Borrowed from Wikipedia.
Remember: a model is just a function. So when we choose our model, we’re really just choosing a function to estimate. In the linear regression case, this function might look something like:
income = $X,XXX × height
Here, $X,XXX is a value that we don’t yet know. Its exact value might be $1,000 (as in the example above), or $500, or $4,131. Whatever the case, this number changes the slope (visually, this can be thought of as the steepness) of the red line above. We’re looking for a number that causes our red line to best represent the pattern of our data.
Luckily, this is the exact problem that machine learning aims to solve! We can use the linear regression algorithm to fit our model to our data (also called training), thereby finding a value of $X,XXX that accurately characterizes the data. This process of estimation is called learning because we’re learning the value of $X,XXX, and the result is a function that we can then use to predict an employee’s income based on her height.
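Here is a rough sketch of what that learning step could look like in Python. It makes two assumptions that go beyond the text: the line is forced through zero (real linear regression usually also learns a starting offset, called an intercept), and "best represents" is taken to mean least squares, the standard criterion for linear regression. The heights and incomes are made up, and the function names are hypothetical.

```python
# Made-up (height in inches, annual income in dollars) pairs, for illustration only.
heights = [60, 65, 70, 72]
incomes = [58_000, 66_000, 71_000, 70_000]

def fit_slope(xs, ys):
    """Learn the dollars-per-inch value for the model: income = slope * height.

    This is the least-squares formula for a straight line through the origin:
    slope = sum(x * y) / sum(x * x).
    """
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def predict_income(slope, height_inches):
    """Use the fitted model to predict income for a given height."""
    return slope * height_inches

# "Training": estimate the unknown $X,XXX from the data.
slope = fit_slope(heights, incomes)
print("Learned dollars per inch:", round(slope, 2))

# "Prediction": apply the fitted function to heights we have not seen before.
new_heights = [62, 68, 74]
print([round(predict_income(slope, h)) for h in new_heights])
```

With this invented data the learned value lands a little under $1,000 per inch. Different data would give a different number, which is the point: the value comes from the data rather than from anyone's guess.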
And the rest is history. Once we have our fitted function, our computers can predict incomes for hundreds of thousands of employees in a single second, faster than any human ever could. Such is the power of machine learning, which has the ability to take much of our work and turn it into a thing of the past, for better or for worse.
For a more comprehensive overview of machine learning, including examples of two common machine learning models and discussions of ethical concerns, check out my factsheet here.
Image: Kevin Dooley.