Main Challenges of Machine Learning

In short, since your main task is to select a learning algorithm and train it on some

data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start

with examples of bad data.

Insufficient Quantity of Training Data

For a toddler to learn what an apple is, all it takes is for you to point to an apple and

say “apple” (possibly repeating this procedure a few times). Now the child is able to

recognize apples in all sorts of colors and shapes. Genius.

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning

algorithms to work properly. Even for very simple problems you typically need

thousands of examples, and for complex problems such as image or speech recognition

you may need millions of examples (unless you can reuse parts of an existing

model).


Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the

new cases you want to generalize to. This is true whether you use instance-based

learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not

perfectly representative; a few countries were missing. Figure 1-21 shows what the

data looks like when you add the missing countries.

If you train a linear model on this data, you get the solid line, while the old model is

represented by the dotted line. As you can see, not only does adding a few missing

countries significantly alter the model, but it makes it clear that such a simple linear

model is probably never going to work well. It seems that very rich countries are not

happier than moderately rich countries (in fact they seem unhappier), and conversely

some poor countries seem happier than many rich countries.

By using a nonrepresentative training set, we trained a model that is unlikely to make

accurate predictions, especially for very poor and very rich countries.
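
To make the effect concrete, here is a minimal sketch in Python. The numbers are invented for illustration and are not the dataset behind Figure 1-21; the helper fit_line and the two point lists are assumptions of this sketch. It fits an ordinary least-squares line first on a partial set of countries and then on the set with the "missing" countries added.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical (GDP per capita in thousands, life satisfaction) pairs.
# The partial set covers only moderately rich countries on a rising trend.
partial = [(x, 5.0 + 0.05 * x) for x in range(10, 51, 5)]

# Hypothetical "missing" countries: two poor but happy, two rich but less happy.
missing = [(2, 6.5), (4, 6.8), (80, 6.0), (100, 5.8)]
full = partial + missing

intercept_partial, slope_partial = fit_line(*zip(*partial))
intercept_full, slope_full = fit_line(*zip(*full))

print(f"slope on partial data: {slope_partial:.4f}")
print(f"slope on full data:    {slope_full:.4f}")
```

With these made-up points, the slope learned from the partial set nearly vanishes once the missing countries are included, which mirrors the shift between the dotted and solid lines described above.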

It is crucial to use a training set that is representative of the cases you want to generalize

to. This is often harder than it sounds: if the sample is too small, you will have

sampling noise (i.e., nonrepresentative data as a result of chance), but even very large

samples can be nonrepresentative if the sampling method is flawed. This is called

sampling bias.
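
The difference between sampling noise and sampling bias can be sketched with a small simulation. Everything here is an assumption made for illustration: the synthetic population, the sample sizes, and the "only values of 30 or more can be sampled" rule that stands in for a flawed sampling method.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100,000 values spread evenly over 0..99.
population = [i % 100 for i in range(100_000)]
population_mean = statistics.mean(population)


def spread_of_sample_means(sample_size, repeats=200):
    """Standard deviation of the sample mean across many random samples."""
    means = [
        statistics.mean(random.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)


# Sampling noise: small random samples give means that vary a lot by chance.
noise_small = spread_of_sample_means(10)
noise_large = spread_of_sample_means(1_000)

# Sampling bias: a flawed method that can only ever reach values >= 30.
# The sample is very large, yet its mean is systematically off.
biased_sample = [value for value in population if value >= 30]
biased_mean = statistics.mean(biased_sample)

print(f"population mean:           {population_mean}")
print(f"spread of means (n=10):    {noise_small:.2f}")
print(f"spread of means (n=1000):  {noise_large:.2f}")
print(f"biased sample size / mean: {len(biased_sample)} / {biased_mean}")
```

Increasing the sample size shrinks the noise, but it does nothing for the biased sample: collecting more data with the same flawed method only reproduces the same error.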

Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-

quality measurements), it will make it harder for the system to detect the underlying

patterns, so your system is less likely to perform well. It is often well worth the effort

to spend time cleaning up your training data. The truth is, most data scientists spend

a significant part of their time doing just that. For example:

• If some instances are clearly outliers, it may help to simply discard them or try to

fix the errors manually.

• If some instances are missing a few features (e.g., 5% of your customers did not

specify their age), you must decide whether you want to ignore this attribute altogether,

ignore these instances, fill in the missing values (e.g., with the median

age), or train one model with the feature and one model without it, and so on.
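
The first three options for handling missing values can be sketched as follows. The customer records are hypothetical and None is assumed to mark a missing age; real projects would usually do this with a data-frame library, but plain Python keeps the sketch self-contained.

```python
import statistics

# Hypothetical customer records; None marks a missing age.
customers = [
    {"age": 34, "spend": 120.0},
    {"age": None, "spend": 80.0},
    {"age": 51, "spend": 200.0},
    {"age": 27, "spend": 60.0},
    {"age": None, "spend": 150.0},
    {"age": 45, "spend": 95.0},
]

# Option 1: ignore the attribute altogether.
without_age = [
    {key: value for key, value in c.items() if key != "age"} for c in customers
]

# Option 2: ignore the instances that are missing the value.
complete_only = [c for c in customers if c["age"] is not None]

# Option 3: fill in the missing values with the median of the known ages.
median_age = statistics.median(c["age"] for c in complete_only)
imputed = [
    {**c, "age": median_age if c["age"] is None else c["age"]}
    for c in customers
]

print(f"median age used for filling: {median_age}")
print(f"instances kept by option 2:  {len(complete_only)} of {len(customers)}")
```

Each option trades something away: option 1 loses a possibly useful feature, option 2 loses instances, and option 3 keeps both at the cost of inventing plausible values.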

Irrelevant Features

As the saying goes: garbage in, garbage out. Your system will only be capable of learning

if the training data contains enough relevant features and not too many irrelevant

ones. A critical part of the success of a Machine Learning project is coming up with a

good set of features to train on. This process, called feature engineering, involves:

• Feature selection: selecting the most useful features to train on among existing

features.

• Feature extraction: combining existing features to produce a more useful one (as

we saw earlier, dimensionality reduction algorithms can help).

• Creating new features by gathering new data.
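
As one possible illustration of feature selection, the sketch below ranks features by the absolute value of their Pearson correlation with the target and keeps the top one. This is only one simple heuristic among many, and it only captures linear relationships. The feature names, the synthetic values, and the pearson helper are all assumptions of this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Synthetic data: the target depends on "relevant" plus a small disturbance,
# while "irrelevant" just cycles through a few values.
n = 20
features = {
    "relevant": [float(i) for i in range(n)],
    "irrelevant": [float((2 * i) % 5) for i in range(n)],
}
target = [3.0 * i + (3 * i) % 4 for i in range(n)]

# Score each feature, then rank from most to least correlated with the target.
scores = {name: abs(pearson(values, target)) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
selected = ranking[:1]  # keep only the top-scoring feature

for name in ranking:
    print(f"{name}: |r| = {scores[name]:.3f}")
print(f"selected: {selected}")
```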

Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.

Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be

tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is

something that we humans do all too often, and unfortunately machines can fall into

the same trap if we are not careful. In Machine Learning this is called overfitting: it

means that the model performs well on the training data, but it does not generalize

well.

Figure 1-22 shows an example of a high-degree polynomial life satisfaction model

that strongly overfits the training data. Even though it performs much better on the

training data than the simple linear model, would you really trust its predictions?
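
The same behavior can be reproduced with a small synthetic example. This is a sketch under stated assumptions, not the model behind Figure 1-22: the data follows y = 2x with a fixed alternating +1/-1 disturbance (chosen so the result is deterministic), the "high-degree polynomial" is the degree-9 polynomial that passes exactly through all ten training points, and the helper names are invented here.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lagrange_predict(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        weight = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                weight *= (x - xj) / (xi - xj)
        total += yi * weight
    return total


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Training set: y = 2x plus an alternating +1 / -1 disturbance.
train_x = [float(i) for i in range(10)]
train_y = [2.0 * x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]

# Held-out set: points midway between training points, without disturbance.
test_x = [x + 0.5 for x in train_x[:-1]]
test_y = [2.0 * x for x in test_x]

intercept, slope = fit_line(train_x, train_y)


def linear(x):
    return intercept + slope * x


def poly(x):
    return lagrange_predict(train_x, train_y, x)


linear_train, linear_test = mse(linear, train_x, train_y), mse(linear, test_x, test_y)
poly_train, poly_test = mse(poly, train_x, train_y), mse(poly, test_x, test_y)

print(f"linear model: train MSE = {linear_train:.3f}, test MSE = {linear_test:.3f}")
print(f"polynomial:   train MSE = {poly_train:.3f}, test MSE = {poly_test:.3f}")
```

The polynomial fits the training points perfectly, yet its error on the held-out points is far larger than that of the straight line, because it has modeled the disturbance rather than the underlying trend.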

