Tutorial: Machine Learning Data Set Preparation, Part 3

To see the entire Machine Learning Tutorial, go here.

Remember that boring data?

NAME              COUNTRY     OWN A CAR   LIKES ICE-CREAM
Eliza Santiago    Guatemala   Yes         No
Fred Winchester   Canada      No          Yes
Marvin Ngoma      Ghana       Yes         No
Xiong Mao         USA         Yes         ???

This example may remind some of the battered adage that “correlation does not imply causation.” That cautionary statement is often the only thing people remember from their brief exposure to statistics. While it is certainly useful advice, it is not the whole story.

Let’s remove the extraneous details, such as the name and country, replacing them with a generic, indexed tag. After all, we aren’t really interested (at this point) in whether details such as the number of vowels in a name, or which part of the world a country sits in, have any impact on car ownership or ice-cream preference. To take those into account would be to invite overfitting: the phenomenon in which a model absorbs so much incidental information that it becomes burdensome to separate the data that has a causal effect from that which we ought to consider arbitrary.

To take it a step further, let’s generalize car ownership and ice-cream preference to “A” and “B.” We obtain something very similar to a truth table in deductive logic.

NAME   COUNTRY   A       B
n1     c1        A       NOT-B
n2     c2        NOT-A   B
n3     c3        A       NOT-B
n4     c4        A       ???
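
If you want to carry out this anonymization step programmatically, here is a minimal sketch. It assumes pandas is available and uses hypothetical column names; it rebuilds the toy table from earlier, swaps in the indexed tags, and generalizes the two attributes to A and B.

```python
import pandas as pd

# The toy table from earlier; the unknown ice-cream preference is stored as a missing value.
people = pd.DataFrame(
    {
        "name": ["Eliza Santiago", "Fred Winchester", "Marvin Ngoma", "Xiong Mao"],
        "country": ["Guatemala", "Canada", "Ghana", "USA"],
        "owns_a_car": ["Yes", "No", "Yes", "Yes"],
        "likes_ice_cream": ["No", "Yes", "No", pd.NA],
    }
)

anonymized = people.copy()

# Replace identifying details with generic, indexed tags (n1, n2, ... and c1, c2, ...).
anonymized["name"] = [f"n{i}" for i in range(1, len(anonymized) + 1)]
anonymized["country"] = [f"c{i}" for i in range(1, len(anonymized) + 1)]

# Generalize the two attributes of interest to A / NOT-A and B / NOT-B, truth-table style.
anonymized["A"] = anonymized["owns_a_car"].map({"Yes": "A", "No": "NOT-A"})
anonymized["B"] = anonymized["likes_ice_cream"].map({"Yes": "B", "No": "NOT-B"}).fillna("???")

print(anonymized[["name", "country", "A", "B"]])
```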

One of the early promises of inductive and probabilistic models is that by putting data into a sophisticated enough machine, hidden rules will emerge. From this it becomes very tempting to treat these hidden rules as having deductive weight, in the same way that statements such as “all birds have wings” and “all creatures with wings can fly” allow one to deduce “all birds can fly.” But there are massive problems with this beyond mere arrogance. The biggest problem is that data obtained in the wild may not have been generated by a deductive rule (if A then B) at all. In my experience, chaos typically reigns supreme.

Another is a threshold problem. As I see it, where is the ideal threshold between underfitting (too little information to glean any decent insight) and overfitting (so much information that the model latches onto coincidences and can no longer provide accurate results in a reasonable amount of time)?
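
There is no universal answer, but a standard practical probe (not something specific to this tutorial) is to hold out part of the data and watch the gap between training accuracy and held-out accuracy as model complexity grows: poor scores on both suggest underfitting, while a widening gap suggests overfitting. A minimal sketch on synthetic data, assuming scikit-learn:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; what matters is how the two scores move apart.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in (1, 3, 5, 10, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(
        f"max_depth={depth}: "
        f"train={model.score(X_train, y_train):.2f}, "
        f"validation={model.score(X_val, y_val):.2f}"
    )
```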

Consider this model:

Animal        Feathers?   Wings?   Can Fly?
Merlin        Yes         Yes      Yes
Kiwi          Yes         No       No
Dolphin       No          No       No
Vampire Bat   No          Yes      ???
Penguin       Yes         Yes      ???

Here, we give our inductive engine (i.e. a machine learning agent) a lot of details from which to issue decisions. We could assume that this engine is intelligent enough not to take “the animal’s name ends with –in” as a criterion, but that is a bold assumption. Sure, if we are doing supervised machine learning, then we can train our machine to answer whether a given animal can fly, based on a combination of the most relevant information. But deciding which information is relevant and which should be considered a coincidence lies squarely with the humans training that machine.
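
As an illustration (a sketch assuming scikit-learn; the tutorial does not prescribe any particular algorithm), here is a tiny supervised model trained only on the feather and wing columns, with the animal names deliberately left out:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Labeled rows only: the animals whose ability to fly we already know.
train = pd.DataFrame(
    {
        "feathers": [1, 1, 0],   # Merlin, Kiwi, Dolphin
        "wings":    [1, 0, 0],
        "can_fly":  [1, 0, 0],
    }
)

# The two animals whose answer is marked ??? in the table.
unknown = pd.DataFrame(
    {
        "feathers": [0, 1],      # Vampire Bat, Penguin
        "wings":    [1, 1],
    }
)

model = DecisionTreeClassifier(random_state=0)
model.fit(train[["feathers", "wings"]], train["can_fly"])

# Whatever comes out here is an inductive guess from three labeled rows,
# not a deductive truth about bats or penguins.
print(model.predict(unknown))
```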

In unsupervised models, we can’t assume that machines won’t learn from superfluous details such as whether an animal’s name ends with –in. Substituting generic, indexed tags for the animal names, similar to the tags in the second table of this lesson, sidesteps this and lessens the risk of overfitting.

Given the data on Merlins, Kiwis, Dolphins, Vampire Bats, and Penguins, what answer should we expect regarding bats’ and penguins’ ability to fly?