Harness Machine Learning

Machine Learning 101

by Matthew Cloney

Machine Learning is everywhere. We use it in our daily lives to find faster routes between home and work, to meet new friends on social media apps, and sometimes, to buy things we didn’t know we needed. This article will define some basic concepts of Machine Learning and explain how engineers use it to solve problems.

Machine learning is quite different from traditional programming. Traditional programming is often thought of as a set of instructions that are carried out in response to some event (e.g., user input). For instance, when you push the “Save” button in Microsoft Word, the program writes a file to disk (or to the cloud) based on the location and name you choose. This is as it should be: a very well-defined and predictable process.

With machine learning, we want our program not to do something based on a series of rules, but rather for it to recognize and predict patterns in the data we are feeding it. Depending on the type, amount, and quality of data you have, this can mean very different things for different sets of data.

My favorite definition of machine learning comes from Tom Mitchell:

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”

How do you apply this to a Machine Learning project? Let’s consider these ten steps:

  1. State the problem you’re trying to solve
  2. Get the data
  3. Inspect, clean, and preprocess the data
  4. Split the data into multiple subsets
  5. Choose which features you will use for prediction
  6. Choose an algorithm
  7. Train your model
  8. Evaluate your model
  9. Tune your model
  10. Release your model

Step 1 — State the Problem

Machine learning, at its core, is about prediction. The first question to answer is, what are we trying to predict? Are we trying to predict the dollar amount of future sales, or are we trying to predict how best to market to a new customer? The former is an example of what’s called a regression problem, while the latter is an example of a classification problem.

If we’re trying to predict a target variable, we have what’s called a supervised learning problem. If we’re just looking for patterns in data (e.g., to define customer segments), this is called an unsupervised learning problem. (There are other types of machine learning, including semi-supervised learning and reinforcement learning, but those are beyond the scope of this article).

Step 2 — Get the Data

Next, we have to ask ourselves a few questions about our data, including:

  • What data do we have?
  • How much data do we have?
  • How hard is it to get more data?
  • How much can we trust the accuracy of the data?

In general, there’s no such thing as too much data in a machine learning project. As long as the data is of good quality and representative of the data we’d use to make predictions in the future, the sky’s the limit. Getting the data is often the most difficult part, but machine learning is all about data — especially data quality.

Step 3 — Inspect, Clean and Preprocess the Data

When we’re preparing to perform machine learning, we organize our data into cases (or records) and variables (or attributes). You can think of cases or records as rows in a spreadsheet, while variables or attributes are like columns.

Next, we should determine what types of variables we have. Categorical variables come in two flavors, nominal and ordinal. Nominal variables simply indicate a difference in an attribute (e.g., gender, marital status), while ordinal variables indicate an ordering whose quantitative differences aren’t precisely known (e.g., a variable with levels of “low”, “medium”, and “high”). Numeric variables indicate quantities, and some are more fine-grained than others: height and weight can be measured down to the half inch or better, while a count like number of children can’t be split into fractions. Finally, you might have text data. Text data is its own animal — it needs to be turned into numeric values before being used in most machine learning applications. While this may be one of the most interesting challenges in machine learning today, text data is out of scope for this article.
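To make this concrete, here is a minimal sketch of this first pass using pandas. The file name and column names (customers.csv, income_band) are invented for illustration:

```python
import pandas as pd

# Load a hypothetical dataset: one row per case, one column per variable
df = pd.read_csv("customers.csv")

# How many cases (rows) and variables (columns) do we have?
print(df.shape)

# What type did pandas infer for each variable?
print(df.dtypes)

# Mark an ordinal variable explicitly so its ordering is preserved
df["income_band"] = pd.Categorical(
    df["income_band"], categories=["low", "medium", "high"], ordered=True
)
```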

After this very general inspection, we’d look more closely at the data for each variable and at the relationships between our variables. This might include graphing the data, creating crosstabs of categorical variables to get counts, and computing averages and other summary statistics for numerical variables. It’s important to examine not only the predictor variables’ relationships to our target variable, but their relationships to each other as well.
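Continuing with the df loaded above (and the same made-up column names), this kind of exploration might look like:

```python
# Summary statistics for every numeric variable
print(df.describe())

# Counts of cases for each combination of two categorical variables
print(pd.crosstab(df["marital_status"], df["income_band"]))

# Average of a numeric variable within each category
print(df.groupby("income_band")["annual_spend"].mean())

# Pairwise correlations among the numeric variables
print(df.select_dtypes("number").corr())
```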

After you’ve assessed the quality of your dataset, you’ll almost certainly need to spend time cleaning it. This step often takes more time than any other in machine learning, and may include dealing with missing data, corrupted data, or extreme values, called outliers. For instance, in healthcare, you might see a human body temperature of 199 degrees Fahrenheit, which wouldn’t make sense for a living patient!
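A few common cleaning moves, sketched with pandas (again, the column names are hypothetical):

```python
# How many values are missing in each column?
print(df.isna().sum())

# Fill missing values in a numeric column with its median
df["age"] = df["age"].fillna(df["age"].median())

# Drop cases with clearly impossible values (a simple outlier rule)
df = df[df["age"].between(18, 110)]
```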

Step 4 — Split the data into multiple subsets

Once you’ve inspected and cleaned your data, the next step is to split it into two or three sets. Usually, the majority of your data goes into the training set. This is the set of data that’s used to train your model (more on that in a minute). A second set of data is held back as the validation set. This is a portion of the data your learning process never sees, and it’s used to make sure that the model you’ve trained didn’t just memorize the data it saw in the training phase. Finally, it’s good practice to keep a test set as an additional batch of data that is never seen by the training process. This data serves as an extra check that your model performs well against previously unseen data. The test set is often omitted, however, when you don’t have much data and getting more would be difficult or expensive.
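A minimal sketch of a three-way split with scikit-learn, continuing with the df from the earlier snippets (the 60/20/20 proportions are just one common choice, not a rule):

```python
from sklearn.model_selection import train_test_split

# First hold out 20% of the data as the test set...
train_val, test = train_test_split(df, test_size=0.20, random_state=42)

# ...then split the remainder into training (60% of the total)
# and validation (20% of the total) sets
train, val = train_test_split(train_val, test_size=0.25, random_state=42)

print(len(train), len(val), len(test))
```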

Step 5 — Choose which features you will use for prediction

In this “big data” world, you’re often presented with an abundance of features that you could use for machine learning. Some will be more useful than others, and some will be essentially useless noise. Even if your data consists of only a few features, you may find that two or more are highly correlated: as one changes, the other changes in a predictable way. In a situation like this, it’s common practice to remove one of these predictors and use only the other for modeling. In problems where you have dozens, hundreds, or even thousands of possible features, there are statistical techniques that can help you decide which features are the most important. Of course, expert knowledge of the data, the problem space, and the industry is helpful in deciding which of the features present in your data to use. This expertise may also help you engineer new features from existing ones.
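One simple, common technique is to look for pairs of highly correlated features and drop one from each pair. A hedged sketch with pandas and NumPy, reusing the df from earlier (the 0.9 threshold is an arbitrary choice, not a standard):

```python
import numpy as np

# Absolute pairwise correlations between the numeric features
corr = df.select_dtypes("number").corr().abs()

# Keep only the upper triangle so each pair is examined once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Features that are very strongly correlated with some other feature
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Candidates to drop:", to_drop)
```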

Step 6 — Choose an algorithm

Machine learning is an iterative process. Choosing an algorithm often entails experimenting with more than one, and then determining which works best with your dataset. The algorithm you ultimately choose will depend heavily on the data and the problem you’re trying to solve.

In general, there are two broad categories of machine learning algorithms: supervised and unsupervised learning. Recall that supervised learning is the process of trying to predict some target variable from a set of predictor variables. This process requires training data that was previously labeled, usually by humans, so the algorithm can learn the relationship between the predictor variables and the target variable. The target variable can be categorical, such as when you’re trying to predict which ad a website visitor is most likely to click on based on their past browsing history, or numeric, such as when you’re trying to predict temperature based on humidity, barometric pressure, latitude, longitude, season, and so on. Determining whether to use supervised or unsupervised learning is pretty straightforward once you’re familiar with the data and your problem space.
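Because choosing an algorithm is an experiment, in practice it often looks like trying a few candidates on the same data and comparing them. Here’s a small, self-contained sketch using scikit-learn, with synthetic data standing in for a real labeled dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a labeled classification dataset
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Try two very different algorithms and compare cross-validated accuracy
for model in (LogisticRegression(max_iter=1000),
              RandomForestClassifier(random_state=0)):
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```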

Step 7 — Train your model

The next step in the process is to feed your training data to a model. An untrained model is like an empty vessel; it has some structure and boundaries, but ultimately, it’s the data that allows it to truly take shape. Our goal is to train our model so it can predict the target variable reasonably well on cases that were not made available to it during the training phase.

A model has both parameters and hyperparameters. Parameters are variables internal to the model that are typically initialized to random values; their optimal values are learned automatically during training — this is the “learning” part of machine learning. After each iteration of training, the model adjusts these parameters based on how it performed on that batch of data. Hyperparameters, in contrast, are settings that affect how the model behaves and learns. We set these before training the model, but may run multiple analyses, each with a different set of hyperparameters, to get better performance in the tuning phase (step 9). An example of a hyperparameter would be the number of segments to separate your data into in a customer segmentation analysis.
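To illustrate the distinction, here’s a tiny sketch using scikit-learn’s Ridge regression on made-up numbers: the regularization strength is a hyperparameter we choose up front, while the coefficients are parameters the model learns from the data:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hyperparameter: chosen *before* training (regularization strength)
model = Ridge(alpha=1.0)

# Toy training data: two predictors and a numeric target
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([3.0, 3.5, 7.0, 7.5])

# Parameters: learned *during* training
model.fit(X, y)
print("learned coefficients:", model.coef_, "intercept:", model.intercept_)
```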

Feeding data to a machine learning algorithm in batches is a common practice. As our algorithm receives each batch of data, it gets a little better at approximating the function that maps the predictor variables to the target variable. This is why the process of learning a function to fit the data is referred to as “training a machine learning model,” and the data used to train it is called the “train(ing) set.” To conceptualize this process, think of trying to predict housing prices. If the only information available was the cost of a single house, and you were asked to predict the price of a similar house in a different zip code, your best guess would be that both houses cost the same.

Once you have several labeled data points (information on several homes, including their values), you should begin to see relationships between the predictor variables and the target variable. For instance, housing price tends to increase as square footage increases.
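To make this tangible, here’s a hedged sketch that fits a simple linear regression to a handful of made-up home sales, using square footage as the only predictor:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up labeled data: square footage -> sale price
sqft = np.array([[900], [1200], [1500], [1800], [2200]])
price = np.array([150_000, 195_000, 240_000, 280_000, 340_000])

# "Training" learns the relationship between square footage and price
model = LinearRegression().fit(sqft, price)

# Predict the price of a 1,600 sq ft home the model has never seen
print(model.predict([[1600]]))
```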

Step 8 — Evaluate your model

Once we’ve trained our model, we use the learned function to try to predict the target value for data that was not used to train the model. This is where the validation set comes in. Remember, the validation set (and test set, if we’re using one) were created from the same set of data as the training set, so we know what the target value (or “ground truth value”) actually is. We apply the model we’ve trained to the validation set of data, create our predictions, then compare the predicted values to the actual target values. There are multiple ways to do this. When trying to predict what category a target variable falls into, accuracy is the simplest metric for measuring the performance of your algorithm — it is the percentage of cases in the validation set we categorized correctly.
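As a small, self-contained illustration of accuracy with scikit-learn (the labels here are invented):

```python
from sklearn.metrics import accuracy_score

# Ground-truth labels from the validation set vs. the model's predictions
y_true = ["click", "no_click", "click", "no_click", "click"]
y_pred = ["click", "no_click", "no_click", "no_click", "click"]

# 4 of the 5 predictions match the ground truth -> accuracy of 0.8
print(accuracy_score(y_true, y_pred))
```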

Step 9 — Tune your model

The first time we run our model, it’s unlikely we will get the best possible result. Luckily, at this stage we can try various combinations of hyperparameters, get more training data, or even try a different model. Some algorithms work better with some types of data, and when you’re starting out, you may not know exactly why. That’s OK, just prepare yourself for a lot of experimentation, learning, and, above all, fun!

As an example, let’s say you wanted to perform customer segmentation on your CRM data. For this unsupervised learning problem, you decided you wanted your customers segmented into five groups. However, when you look at the results, they don’t seem to make sense. There are no common themes within the segments (or “clusters”) that your algorithm produced. Adjusting this number up or down and running the algorithm again may produce more meaningful groups (e.g., customers who purchase more expensive items may be in one group, while those who gravitate toward more modestly-priced items may be in another).
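A minimal sketch of this kind of tuning with scikit-learn’s KMeans, using synthetic data in place of real CRM features and a silhouette score as one (of several possible) ways to compare cluster counts:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for customer features (e.g., spend, visit frequency)
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Try several cluster counts; a higher silhouette score suggests
# more coherent, better-separated segments
for k in (3, 4, 5, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```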

Step 10 — Release your model

Once you’re happy with the performance of your model, you can use it on new data. In our customer segmentation example, we can use our model to predict which segment a new customer belongs to, and market to them appropriately. Or, in our housing price model, we can look at a new home that’s on the market and compare the asking price with our model’s prediction.
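Continuing the hypothetical housing sketch from step 7, using the trained model on a newly listed home might look like this:

```python
# A new listing: 2,000 sq ft with an asking price of $310,000
asking_price = 310_000

# Compare the asking price with what the trained model expects
predicted_price = model.predict([[2000]])[0]
print("model estimate:", round(predicted_price), "asking price:", asking_price)
```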

It’s important to note that once a machine learning algorithm is born, it begins to die. Less morbidly stated: A machine learning model is most useful immediately after it’s released. Why? Because data is always changing. The nature of your customers’ buying habits is morphing, some neighborhoods become more desirable than others, and the housing market is fluid. This is why it’s important to make sure you’re feeding your model with new data, and why your data scientists should always be developing models based on the most recent data possible. When a new model is created that performs better than the existing one, the existing model is retired, and the new one takes its place.

In the next article, we’ll walk through a machine learning project from start to finish, using the steps outlined above.

Matthew Cloney
Senior Data Scientist