
Main Challenges of Machine Learning

In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data.

Insufficient Quantity of Training Data

For a toddler to learn what an apple is, all it takes is for you to point to an apple and say “apple” (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of colors and shapes. Genius.

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).
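The effect of training-set size can be sketched with a tiny experiment. Everything below is invented for illustration (a hypothetical 2-D classification task and a bare-bones 1-nearest-neighbor rule, not a model from this chapter), but it shows the typical pattern: test accuracy climbs as the training set grows.

```python
import random

random.seed(42)

def make_point():
    # Hypothetical 2-D task: label is 1 when x + y > 1, else 0.
    x, y = random.random(), random.random()
    return (x, y), int(x + y > 1.0)

def nn_predict(train, point):
    # 1-nearest-neighbor: copy the label of the closest training point.
    px, py = point
    _, label = min(train,
                   key=lambda t: (t[0][0] - px) ** 2 + (t[0][1] - py) ** 2)
    return label

test_set = [make_point() for _ in range(500)]

def accuracy(n_train):
    # Train on n_train fresh points, score on the fixed test set.
    train = [make_point() for _ in range(n_train)]
    hits = sum(nn_predict(train, p) == y for p, y in test_set)
    return hits / len(test_set)

for n in (5, 50, 500):
    print(n, "training examples ->", round(accuracy(n), 2))
```

With only a handful of examples the classifier is close to guessing; with hundreds it recovers the boundary almost perfectly, even though the algorithm itself never changed.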


Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true whether you use instance-based learning or model-based learning.

For example, the set of countries we used earlier for training the linear model was not perfectly representative; a few countries were missing. Figure 1-21 shows what the data looks like when you add the missing countries.

If you train a linear model on this data, you get the solid line, while the old model is represented by the dotted line. As you can see, not only does adding a few missing countries significantly alter the model, but it makes it clear that such a simple linear model is probably never going to work well. It seems that very rich countries are not happier than moderately rich countries (in fact they seem unhappier), and conversely some poor countries seem happier than many rich countries.

By using a nonrepresentative training set, we trained a model that is unlikely to make accurate predictions, especially for very poor and very rich countries.

It is crucial to use a training set that is representative of the cases you want to generalize to. This is often harder than it sounds: if the sample is too small, you will have sampling noise (i.e., nonrepresentative data as a result of chance), but even very large samples can be nonrepresentative if the sampling method is flawed. This is called sampling bias.
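Here is a minimal sketch of that effect, with made-up numbers (not the book’s country data): “satisfaction” rises with “income” and then saturates. A line fitted only on the poorer half of the range reports a much steeper slope than a line fitted on the full range.

```python
# Ordinary least squares for y = a*x + b, in plain Python.
def ols(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    a = (sum((x - mx) * (y - my) for x, y in points)
         / sum((x - mx) ** 2 for x, _ in points))
    return a, my - a * mx  # slope, intercept

# Invented data: satisfaction grows quickly, then flattens out.
full = [(x, min(x, 5) + 0.1 * x) for x in range(1, 11)]
biased = [p for p in full if p[0] <= 5]  # only the poorer "countries"

slope_biased, _ = ols(biased)
slope_full, _ = ols(full)
print("biased sample slope:", round(slope_biased, 2))  # 1.1
print("full sample slope:  ", round(slope_full, 2))    # 0.52
```

The biased sample is not wrong about the points it contains; it simply cannot reveal the flattening that only appears outside it, so the fitted model extrapolates badly.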

Poor-Quality Data

Obviously, if your training data is full of errors, outliers, and noise (e.g., due to poor-quality measurements), it will make it harder for the system to detect the underlying patterns, so your system is less likely to perform well. It is often well worth the effort to spend time cleaning up your training data. The truth is, most data scientists spend a significant part of their time doing just that. For example:

• If some instances are clearly outliers, it may help to simply discard them or try to fix the errors manually.

• If some instances are missing a few features (e.g., 5% of your customers did not specify their age), you must decide whether you want to ignore this attribute altogether, ignore these instances, fill in the missing values (e.g., with the median age), or train one model with the feature and one model without it, and so on.
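The median-imputation option mentioned above is a one-liner in practice. The customer ages below are invented, with None standing in for a missing value:

```python
from statistics import median

# Toy customer records; None marks a missing age (made-up data).
ages = [23, 45, None, 31, None, 52, 38]

known = [a for a in ages if a is not None]
med = median(known)  # median of [23, 31, 38, 45, 52] -> 38

filled = [a if a is not None else med for a in ages]
print(filled)  # [23, 45, 38, 31, 38, 52, 38]
```

The median is usually preferred over the mean here because a single extreme age (an outlier) would drag the mean but barely move the median.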

Irrelevant Features

As the saying goes: garbage in, garbage out. Your system will only be capable of learning if the training data contains enough relevant features and not too many irrelevant ones. A critical part of the success of a Machine Learning project is coming up with a good set of features to train on. This process, called feature engineering, involves:

• Feature selection: selecting the most useful features to train on among existing features.

• Feature extraction: combining existing features to produce a more useful one (as we saw earlier, dimensionality reduction algorithms can help).

• Creating new features by gathering new data.
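A crude but common version of feature selection is to rank features by how strongly each one correlates with the target and keep only the strongest. The feature names and numbers below are invented for illustration:

```python
# Pearson correlation coefficient, in plain Python.
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

target = [1.0, 2.0, 3.0, 4.0, 5.0]
features = {
    "useful":     [1.1, 2.0, 2.9, 4.2, 5.1],  # tracks the target
    "irrelevant": [3.0, 1.0, 4.0, 1.0, 5.0],  # essentially noise
}

# Rank features by absolute correlation with the target.
ranked = sorted(features,
                key=lambda f: abs(pearson(features[f], target)),
                reverse=True)
print(ranked)  # "useful" comes first
```

Correlation only captures linear relationships, which is why real projects typically combine such filters with model-based selection or simply try training with and without a feature.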

Now that we have looked at many examples of bad data, let’s look at a couple of examples of bad algorithms.

Overfitting the Training Data

Say you are visiting a foreign country and the taxi driver rips you off. You might be tempted to say that all taxi drivers in that country are thieves. Overgeneralizing is something that we humans do all too often, and unfortunately machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: it means that the model performs well on the training data, but it does not generalize well.

Figure 1-22 shows an example of a high-degree polynomial life satisfaction model that strongly overfits the training data. Even though it performs much better on the training data than the simple linear model, would you really trust its predictions?
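A rough sketch of the same effect, with invented data rather than the model from Figure 1-22: the Lagrange interpolating polynomial below passes through every noisy training point exactly (zero training error), while a plain least-squares line does not, and the typical pattern is that the line is far more trustworthy away from the training points.

```python
import random

random.seed(0)

def true_fn(x):
    return 2 * x + 1  # underlying pattern (invented)

xs = [0, 1, 2, 3, 4, 5]
ys = [true_fn(x) + random.gauss(0, 0.5) for x in xs]  # noisy observations

def poly(xq):
    # Degree-5 Lagrange polynomial through all six training points:
    # zero training error, but it chases the noise.
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (xq - xj) / (xi - xj)
        total += term
    return total

def linear(xq):
    # Ordinary least-squares line through the same points.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a * (xq - mx) + my

x_new = 6.5  # outside the training range
print("polynomial error:", abs(poly(x_new) - true_fn(x_new)))
print("linear error:    ", abs(linear(x_new) - true_fn(x_new)))
```

The polynomial “performs much better on the training data,” exactly as the passage says, yet its high-degree wiggles are fit to noise, which is why its predictions between and beyond the training points are the ones you should distrust.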

