
Main Challenges of Machine Learning

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data.

Insufficient Quantity of Training Data

For a toddler to learn what an apple is, all it takes is for you to point to an apple and say “apple” (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of colors and shapes. Genius.

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).
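If you are wondering whether gathering more data would actually help on a given problem, one common check is to plot a learning curve: train on growing subsets and watch how validation performance evolves. Below is a minimal sketch with scikit-learn; the synthetic dataset and the simple classifier are stand-ins chosen for the example, not taken from the text above.

# Minimal learning-curve sketch (hypothetical dataset and model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; in practice use your own X, y.
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

train_sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

# If the validation score is still rising at the largest size,
# more training data would likely help.
for n, score in zip(train_sizes, valid_scores.mean(axis=1)):
    print(f"{n:5d} training examples -> mean validation accuracy {score:.3f}")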


Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.

If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact they seem unhappier), and conversely some poor countries seem happier than many rich countries.

By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.
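To make this concrete, here is a small sketch of fitting the same kind of linear model on a partial and on a fuller set of countries. The GDP and life satisfaction numbers below are invented for illustration; they are not the actual OECD/IMF data behind Figure 1-21.

# Sketch: fitting a linear model with and without a few extreme countries.
# The numbers below are made up for illustration only.
import numpy as np
from sklearn.linear_model import LinearRegression

gdp = np.array([[9054], [9437], [12240], [27195], [37675], [50962]], dtype=float)
satisfaction = np.array([6.0, 5.8, 5.6, 6.5, 6.4, 7.3])

# "Partial" training set: keep only the mid-range countries,
# dropping the poorest and the richest ones.
partial = slice(2, 5)
model_partial = LinearRegression().fit(gdp[partial], satisfaction[partial])
model_full = LinearRegression().fit(gdp, satisfaction)

print("slope trained on partial data:", model_partial.coef_[0])
print("slope trained on full data:   ", model_full.coef_[0])
# The two slopes differ noticeably: the model trained on the nonrepresentative
# subset extrapolates poorly to very poor or very rich countries.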

It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.

Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example:

• If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.

• If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on (a short sketch of such cleanup follows this list).
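As a rough illustration of this kind of cleanup, here is a small sketch using pandas. The column names, the outlier cutoff, and the median fill are assumptions chosen for the example, not a general prescription.

# Sketch of basic data cleanup (hypothetical "age" and "income" columns).
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 31, None, 44, 29, 230],   # 230 is an obvious data-entry error
    "income": [32000, 41000, 38000, 52000, None, 61000],
})

# Option 1: discard instances that are clearly outliers.
df = df[(df["age"].isna()) | (df["age"] < 120)].copy()

# Option 2: fill in missing values, e.g. with the median of each column.
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

print(df)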

Irrelevant Features

As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves the following (a short code sketch follows the list):

• Feature selection: selecting the most useful features to train on among existing features.

• Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).

• Creating new features by gathering new data.
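As one possible illustration, the sketch below uses scikit-learn’s SelectKBest for feature selection and PCA for feature extraction on synthetic data; these are just two of many options, not the only way to engineer features.

# Sketch: feature selection (keep the k most informative features)
# and feature extraction (combine features with PCA) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=30,
                           n_informative=5, random_state=42)

# Feature selection: keep the 5 features most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Feature extraction: project the original 30 features onto 5 components.
X_extracted = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_extracted.shape)   # (500, 5) (500, 5)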

Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.

Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?
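The sketch below reproduces the same idea on made-up, roughly linear data rather than the life satisfaction figures: a degree-10 polynomial achieves a much lower training error than a straight line, yet its predictions just outside the training range are wildly off.

# Sketch: a degree-10 polynomial overfits 12 noisy points that follow a simple line.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 12)
y = 0.5 * x + 1.0 + rng.normal(scale=0.5, size=x.size)   # roughly linear data

linear_fit = np.polyfit(x, y, deg=1)
poly_fit = np.polyfit(x, y, deg=10)

# Training error: the high-degree polynomial looks far "better"...
print("linear train RMSE:", np.sqrt(np.mean((np.polyval(linear_fit, x) - y) ** 2)))
print("poly   train RMSE:", np.sqrt(np.mean((np.polyval(poly_fit, x) - y) ** 2)))

# ...but it generalizes badly: a prediction just beyond the training range explodes.
print("prediction at x=11, linear:", np.polyval(linear_fit, 11.0))
print("prediction at x=11, poly:  ", np.polyval(poly_fit, 11.0))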

