Main Challenges of Machine Learning

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data.

Insufficient Quantity of Training Data

For a toddler to learn what an apple is, all it takes is for you to point to an apple and say “apple” (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of colors and shapes. Genius.

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).
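
You can get a feel for this effect by plotting a learning curve: performance on held-out data typically keeps climbing as you add training examples. Below is a minimal sketch on synthetic data (the dataset and model here are illustrative assumptions, not an example from this chapter):

    # A rough sketch: validation accuracy usually grows with training-set size.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(max_iter=1000), X, y,
        train_sizes=np.linspace(0.02, 1.0, 8), cv=5)

    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n:5d} training examples -> validation accuracy {score:.3f}")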


Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.

If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact they seem unhappier), and conversely some poor countries seem happier than many rich countries.

By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.
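
One way to see how sensitive the model is to the missing countries is to fit it twice, once on the partial set and once on the full set, and compare the learned slopes. The sketch below uses made-up GDP and life-satisfaction numbers purely for illustration (these are not the OECD figures behind Figure 1-21):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical GDP per capita (in $1,000s) and life-satisfaction scores.
    gdp = np.array([9, 20, 30, 38, 44, 51, 56, 62, 76, 101]).reshape(-1, 1)
    satisfaction = np.array([5.6, 5.8, 6.5, 7.0, 7.3, 7.2, 7.4, 7.3, 6.9, 6.9])

    # Train once on the middle countries only, once on all of them.
    partial = LinearRegression().fit(gdp[2:8], satisfaction[2:8])
    full = LinearRegression().fit(gdp, satisfaction)

    print("slope trained on partial data:", partial.coef_[0])
    print("slope trained on full data:   ", full.coef_[0])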

It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.
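
The distinction between the two failure modes is easy to demonstrate with synthetic numbers. In this illustrative sketch (my own addition, not from the chapter), small random samples scatter around the true mean by chance, while a large but badly drawn sample stays consistently off:

    import numpy as np

    rng = np.random.default_rng(42)
    population = rng.normal(loc=100, scale=15, size=1_000_000)

    # Sampling noise: tiny samples miss the true mean just by chance.
    small_means = [rng.choice(population, size=20).mean() for _ in range(5)]
    print("true mean: 100, small-sample means:", np.round(small_means, 1))

    # Sampling bias: a huge sample is still off if the method is flawed,
    # e.g., drawing only from the upper half of the population.
    biased = rng.choice(population[population > 100], size=100_000)
    print("large but biased sample mean:", round(biased.mean(), 1))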

Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example (a short cleaning sketch follows this list):

• If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.

• If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.
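
Here is a minimal cleaning sketch covering both cases with pandas; the DataFrame and its column names are hypothetical, chosen only to mirror the examples above:

    import pandas as pd

    # Hypothetical customer data with an obvious error and some missing values.
    df = pd.DataFrame({"age": [25, 32, None, 41, 230, 38],
                       "income": [48_000, 52_000, 61_000, None, 58_000, 75_000]})

    # Discard instances that are clearly outliers (an age of 230 is a data error).
    df = df[(df["age"].isna()) | (df["age"] < 120)]

    # Fill in the missing values with the median rather than dropping the rows.
    df["age"] = df["age"].fillna(df["age"].median())
    df["income"] = df["income"].fillna(df["income"].median())
    print(df)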

Irrelevant Features

As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves the following (a short code sketch follows the list):

• Feature selection: selecting the most useful features to train on among existing features.

• Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).

• Creating new features by gathering new data.
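
As a rough illustration of the first two steps, the sketch below applies scikit-learn’s SelectKBest (feature selection) and PCA (feature extraction) to synthetic data; the dataset and parameter choices are assumptions made for the example:

    from sklearn.datasets import make_classification
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = make_classification(n_samples=500, n_features=30,
                               n_informative=5, random_state=42)

    # Feature selection: keep the k features most correlated with the target.
    X_selected = SelectKBest(f_classif, k=5).fit_transform(X, y)

    # Feature extraction: combine existing features into fewer, denser ones.
    X_extracted = PCA(n_components=5).fit_transform(X)

    print(X.shape, "->", X_selected.shape, "and", X_extracted.shape)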

Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.

Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?
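
You can reproduce this behavior in a few lines. The sketch below uses synthetic data (not the chapter’s life-satisfaction dataset): a degree-12 polynomial fits 15 training points almost perfectly, yet falls apart just outside the training range:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures

    rng = np.random.default_rng(42)
    X = rng.uniform(0, 10, size=(15, 1))             # only 15 training points
    y = 0.5 * X.ravel() + rng.normal(0, 1, size=15)  # true relation is linear

    linear = LinearRegression().fit(X, y)
    poly = make_pipeline(PolynomialFeatures(degree=12),
                         LinearRegression()).fit(X, y)

    # The polynomial scores almost perfectly on the training data...
    print("train R^2: linear", round(linear.score(X, y), 3),
          "| polynomial", round(poly.score(X, y), 3))

    # ...but extrapolates wildly just outside the training range.
    X_new = np.array([[12.0]])
    print("prediction at x=12: linear", linear.predict(X_new)[0],
          "| polynomial", poly.predict(X_new)[0])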

