In short, since your main task is to select a learning algorithm and train it on some
data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start
with examples of bad data.
Insufficient Quantity of Training Data
For a toddler to learn what an apple is, all it takes is for you to point to an apple and
say “apple” (possibly repeating this procedure a few times). Now the child is able to
recognize apples in all sorts of colors and shapes. Genius.
Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning
algorithms to work properly. Even for very simple problems you typically need
thousands of examples, and for complex problems such as image or speech recognition
you may need millions of examples (unless you can reuse parts of an existing model).
Nonrepresentative Training Data
In order to generalize well, it is crucial that your training data be representative of the
new cases you want to generalize to. This is true whether you use instance-based
learning or model-based learning.
For example, the set of countries we used earlier for training the linear model was not
perfectly representative; a few countries were missing. Figure 1-21 shows what the
data looks like when you add the missing countries.
If you train a linear model on this data, you get the solid line, while the old model is
represented by the dotted line. As you can see, not only does adding a few missing
countries significantly alter the model, but it makes it clear that such a simple linear
model is probably never going to work well. It seems that very rich countries are not
happier than moderately rich countries (in fact they seem unhappier), and conversely
some poor countries seem happier than many rich countries.
By using a nonrepresentative training set, we trained a model that is unlikely to make
accurate predictions, especially for very poor and very rich countries.
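The effect can be sketched with a small least-squares fit. The snippet below is a minimal illustration, not the code behind Figure 1-21: the GDP and life satisfaction values are invented, and the names `sampled`, `missing`, and `fit_line` are hypothetical helpers introduced only for this example.

```python
# Hypothetical (GDP per capita in thousands of USD, life satisfaction) pairs.
# The numbers are invented for this sketch; they are not the book's dataset.
sampled = [(20, 5.6), (25, 5.9), (30, 6.2), (35, 6.6), (40, 6.9), (45, 7.2)]
missing = [(5, 6.4), (8, 6.0), (80, 6.8), (100, 6.5)]  # very poor and very rich


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


old_slope, old_intercept = fit_line(sampled)            # nonrepresentative set
new_slope, new_intercept = fit_line(sampled + missing)  # missing countries added

rich_gdp = 100  # a very rich country, far outside the original sample
old_prediction = old_slope * rich_gdp + old_intercept
new_prediction = new_slope * rich_gdp + new_intercept
print(f"old model: slope={old_slope:.3f}, prediction at {rich_gdp}k: {old_prediction:.2f}")
print(f"new model: slope={new_slope:.3f}, prediction at {rich_gdp}k: {new_prediction:.2f}")
```

With these made-up numbers, the line fitted on the mid-income sample alone is several times steeper than the line fitted on all countries, so it predicts a much higher satisfaction for the very rich country than the fuller data supports.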
It is crucial to use a training set that is representative of the cases you want to generalize
to. This is often harder than it sounds: if the sample is too small, you will have
sampling noise (i.e., nonrepresentative data as a result of chance), but even very large
samples can be nonrepresentative if the sampling method is flawed. This is called
sampling bias.
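The difference between sampling noise and sampling bias can be simulated in a few lines. This is a rough sketch under stated assumptions: the population, the two groups, and their "yes" rates are all invented, and the biased sample simply draws from one group only.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 1 = "yes", 0 = "no". Group A is easier to reach.
group_a = [1] * 21_000 + [0] * 9_000    # 30,000 people, 70% "yes"
group_b = [1] * 21_000 + [0] * 49_000   # 70,000 people, 30% "yes"
population = group_a + group_b
true_rate = sum(population) / len(population)  # 42% "yes" overall


def estimate(sample):
    """Fraction of "yes" answers in a sample."""
    return sum(sample) / len(sample)


# Sampling noise: small random samples scatter widely around the true rate.
small_estimates = [estimate(rng.sample(population, 20)) for _ in range(200)]
noise_spread = max(small_estimates) - min(small_estimates)

# A large random sample lands close to the true rate.
large_random = estimate(rng.sample(population, 10_000))

# Sampling bias: a large sample drawn only from group A stays far off.
large_biased = estimate(rng.sample(group_a, 10_000))

print(f"true rate:            {true_rate:.3f}")
print(f"spread of small runs: {noise_spread:.3f}")
print(f"large random sample:  {large_random:.3f}")
print(f"large biased sample:  {large_biased:.3f}")
```

Making the biased sample larger does not help: its estimate converges to group A's rate, not to the rate of the whole population.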
Poor-Quality Data
Obviously, if your training data is full of errors, outliers, and noise (e.g., due to
poor-quality measurements), it will make it harder for the system to detect the underlying
patterns, so your system is less likely to perform well. It is often well worth the effort
to spend time cleaning up your training data. The truth is, most data scientists spend
a significant part of their time doing just that. For example:
• If some instances are clearly outliers, it may help to simply discard them or try to
fix the errors manually.
• If some instances are missing a few features (e.g., 5% of your customers did not
specify their age), you must decide whether you want to ignore this attribute altogether,
ignore these instances, fill in the missing values (e.g., with the median
age), or train one model with the feature and one model without it, and so on.
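The cleanup options listed above might look like this in plain Python. It is a minimal sketch: the customer records are invented, and `is_plausible_age` with its 0 to 120 range is an assumed validity rule, not a general recommendation.

```python
from statistics import median

# Hypothetical customer records; None marks a missing age, 420 is a data-entry error.
customers = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": 29},
    {"id": 4, "age": 420},
    {"id": 5, "age": 51},
    {"id": 6, "age": 42},
]


def is_plausible_age(age):
    # Assumed validity rule for this sketch; a real project needs its own.
    return age is None or 0 <= age <= 120


# Discard instances whose age is clearly an outlier.
cleaned = [c for c in customers if is_plausible_age(c["age"])]

# Option A: ignore the instances that are missing the feature.
dropped = [c for c in cleaned if c["age"] is not None]

# Option B: fill in the missing values with the median of the known ages.
median_age = median(c["age"] for c in dropped)
imputed = [
    {**c, "age": median_age if c["age"] is None else c["age"]} for c in cleaned
]

print(len(cleaned), len(dropped), median_age)
```

Option A keeps fewer instances but leaves every remaining value untouched, while option B keeps all plausible instances at the cost of introducing an estimated age. Which trade-off is better depends on how much data you can afford to lose.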
Irrelevant Features
As the saying goes: garbage in, garbage out. Your system will only be capable of learning
if the training data contains enough relevant features and not too many irrelevant
ones. A critical part of the success of a Machine Learning project is coming up with a
good set of features to train on. This process, called feature engineering, involves:
• Feature selection: selecting the most useful features to train on among existing features.
• Feature extraction: combining existing features to produce a more useful one (as
we saw earlier, dimensionality reduction algorithms can help).
• Creating new features by gathering new data.
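The first two steps can be sketched on a toy dataset. Everything here is an assumption made for illustration: the housing numbers are invented, `rooms_per_household` is a hand-made combined feature standing in for feature extraction, and ranking by absolute Pearson correlation is only one simple selection heuristic among many.

```python
from math import sqrt

# Hypothetical housing districts; every number is invented for this sketch.
features = {
    "total_rooms": [600, 900, 400, 1200, 500, 1000],
    "households": [200, 150, 200, 200, 250, 400],
    "district_id": [1, 6, 5, 2, 3, 4],  # arbitrary label, expected to be irrelevant
}
price = [150, 300, 100, 310, 95, 130]  # target, in thousands


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Feature extraction: combine two existing features into a more useful one.
features["rooms_per_household"] = [
    rooms / homes
    for rooms, homes in zip(features["total_rooms"], features["households"])
]

# Feature selection: rank features by |correlation| with the target, keep the top two.
scores = {name: abs(pearson(values, price)) for name, values in features.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
selected = ranked[:2]

for name in ranked:
    print(f"{name:22s} {scores[name]:.3f}")
```

On this toy data the combined feature ranks first and the arbitrary `district_id` ranks last. Correlation only captures linear relationships with the target, so treat a ranking like this as a starting point and not as a verdict.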
Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.
Overfitting the Training Data
Say you are visiting a foreign country and the taxi driver rips you off. You might be
tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is
something that we humans do all too often, and unfortunately machines can fall into
the same trap if we are not careful. In Machine Learning this is called overfitting: it
means that the model performs well on the training data, but it does not generalize well.
Figure 1-22 shows an example of a high-degree polynomial life satisfaction model
that strongly overfits the training data. Even though it performs much better on the
training data than the simple linear model, would you really trust its predictions?
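One way to see the problem is to compare errors on the training data with errors on data the model has never seen. The sketch below is not the model from Figure 1-22: the data points are invented, and the high-degree polynomial is built by Lagrange interpolation (a polynomial forced through every training point), which is swapped in here as an extreme stand-in for an overfitting model. The names `fit_line`, `polynomial_predict`, and `mse` are hypothetical helpers.

```python
# Hypothetical, roughly linear data with a little noise; invented for this sketch.
train = [(10, 5.7), (20, 5.8), (30, 6.7), (40, 6.8), (50, 7.7), (60, 7.8)]
held_out = [(5, 5.3), (15, 5.7), (55, 7.8), (65, 8.2)]  # never used for fitting


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def polynomial_predict(points, x):
    """Evaluate the degree len(points) - 1 polynomial through all points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        weight = 1.0
        for j, (xj, _) in enumerate(points):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def mse(predict, points):
    """Mean squared error of a prediction function on a list of (x, y) pairs."""
    return sum((predict(x) - y) ** 2 for x, y in points) / len(points)


slope, intercept = fit_line(train)


def linear(x):
    return slope * x + intercept


def poly(x):
    return polynomial_predict(train, x)


linear_train_mse, linear_held_out_mse = mse(linear, train), mse(linear, held_out)
poly_train_mse, poly_held_out_mse = mse(poly, train), mse(poly, held_out)
print(f"linear:     train={linear_train_mse:.4f}  held-out={linear_held_out_mse:.4f}")
print(f"polynomial: train={poly_train_mse:.4f}  held-out={poly_held_out_mse:.4f}")
```

The polynomial reproduces every training point, so its training error is zero, yet its error on the held-out points is far larger than the linear model's. Two of the held-out points lie just outside the training range, which is where this kind of model tends to swing most wildly.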
