In short, since your main task is to select a learning algorithm and train it on some data, the two things that can go wrong are “bad algorithm” and “bad data.” Let’s start with examples of bad data.

Insufficient Quantity of Training Data

For a toddler to learn what an apple is, all it takes is for you to point to an apple and say “apple” (possibly repeating this procedure a few times). Now the child is able to recognize apples in all sorts of colors and shapes. Genius.

Machine Learning is not quite there yet; it takes a lot of data for most Machine Learning algorithms to work properly. Even for very simple problems you typically need thousands of examples, and for complex problems such as image or speech recognition you may need millions of examples (unless you can reuse parts of an existing model).

Nonrepresentative Training Data

In order to generalize well, it is crucial that your training data be representative of the new cases you want to generalize to. This is true wheth...
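To make the point about representative training data concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than something from the text: the target function `target` (a simple quadratic), the helper names `fit_line` and `mse`, and the chosen ranges. It fits a straight line to two training sets, one covering the full range of cases we want to generalize to and one covering only a narrow corner of it, then compares the error of each model on the full range.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mse(model, xs, ys):
    """Mean squared error of a (slope, intercept) model on the given data."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def target(x):
    # Hypothetical "true" relationship, chosen only for illustration.
    return x * x


# New cases we want to generalize to: x spread over the whole range [0, 10].
test_x = [0.25 + 0.5 * i for i in range(20)]
test_y = [target(x) for x in test_x]

# Representative training set: also spread over [0, 10].
rep_x = [0.5 * i for i in range(21)]
rep_model = fit_line(rep_x, [target(x) for x in rep_x])

# Nonrepresentative training set: same size, but only covers [0, 1].
biased_x = [0.05 * i for i in range(21)]
biased_model = fit_line(biased_x, [target(x) for x in biased_x])

mse_representative = mse(rep_model, test_x, test_y)
mse_biased = mse(biased_model, test_x, test_y)

print("MSE, representative training data:   ", mse_representative)
print("MSE, nonrepresentative training data:", mse_biased)
```

Both training sets have the same number of examples, so the gap in test error comes from where the examples were drawn, not from how many there are.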
There are so many different types of Machine Learning systems that it is useful to classify them in broad categories based on:

• Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)
• Whether or not they can learn incrementally on the fly (online versus batch learning)
• Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

These criteria are not exclusive; you can combine them in any way you like. For example, a state-of-the-art spam filter may learn on the fly using a deep neural network model trained using examples of spam and ham; this makes it an online, model-based, supervised learning system.

Let’s look at each of these criteria a bit more closely.

Supervised/Unsupervised Learning

Machine Learning systems can be classified according ...
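The spam filter example above combines three of these criteria at once, and a small sketch may help show how they fit together. The code below is only an illustration under stated assumptions: it swaps the deep neural network mentioned in the text for a much simpler perceptron over word counts, and the class name `OnlineSpamFilter`, its methods, and the tiny message stream are all hypothetical. It is supervised because each message comes with a spam/ham label, model-based because it keeps learned word weights rather than the messages themselves, and online because it updates after every single example.

```python
class OnlineSpamFilter:
    """Toy perceptron spam filter (a stand-in, not a deep neural network)."""

    def __init__(self):
        self.weights = {}   # model parameters: one weight per word seen so far
        self.bias = 0.0

    def score(self, text):
        words = text.lower().split()
        return self.bias + sum(self.weights.get(w, 0.0) for w in words)

    def predict(self, text):
        """Return True if the message is classified as spam."""
        return self.score(text) > 0

    def learn_one(self, text, is_spam):
        """Online update: adjust the model only when it misclassifies."""
        if self.predict(text) != is_spam:
            step = 1.0 if is_spam else -1.0
            self.bias += step
            for w in text.lower().split():
                self.weights[w] = self.weights.get(w, 0.0) + step


# Labeled examples arrive one at a time (supervised, online).
stream = [
    ("win free money now", True),
    ("meeting agenda for tomorrow", False),
    ("free prize claim now", True),
    ("lunch tomorrow with the team", False),
]

spam_filter = OnlineSpamFilter()
for message, label in stream:
    spam_filter.learn_one(message, label)

print(spam_filter.predict("free money"))              # True (spam)
print(spam_filter.predict("agenda for the meeting"))  # False (ham)
```

Once an example has been used for an update it can be discarded, since everything the filter has learned lives in `weights` and `bias`. An instance-based system would instead store the messages and compare new ones against them.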