The weight of the cow: Dealing with bias in datasets

One of the best science books I read this year is “Superforecasting”, by Philip Tetlock. The story of how this book came to be written is just about as fascinating as the book itself, and I strongly recommend it to both scientists and non-scientists.

Today I would like to talk about something that I am surprised wasn’t discussed in the book. As with many posts on this blog, this may or may not be an original idea: all I know is that it occurred to me, I thought it was worth sharing, and I haven’t heard of it in the forecasting world.

How to pool forecasting data to reduce bias

Early in the book, Tetlock gives an example of the “wisdom of the crowd”. At a fair, people are asked to guess the weight of a cow. Taken individually, some guesses are quite far from the real value. But the average of all the guesses turns out to be almost exactly equal to the real weight of the cow.

He uses that example to illustrate the fact that individual people can be biased, but when you average all the guesses, the biases cancel each other out. Imagine each guess as having two components: the signal and the noise. The signal represents all the valid reasons why a person might have a good estimate of the weight of the cow: they know what a cow is and roughly how much things weigh in general, and maybe this particular person grew up on a farm and knows cows really well, or maybe they’re a weightlifter and know weights really well. The noise represents all the reasons why they might be biased. Maybe they’re a butcher and often overestimate the weight of what they sell to increase the price. Maybe they raise pigs and underestimate the weight of the cow because it’s so much bigger than a pig.

By averaging all the guesses, you are making strong assumptions.
You assume that people’s biases are opposite in direction and equal in magnitude.
If that is true, then by averaging, the noise components should cancel each other out and you should be left with only the signal: the wisdom of the crowd.
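
To make that intuition concrete, here is a minimal sketch in Python of what averaging buys you when those assumptions hold. The true weight, the noise level, and the crowd size are all made up for illustration.

```python
import random

random.seed(0)

TRUE_WEIGHT = 600  # kg; a made-up cow

# Each guess = signal (the true weight) + noise (a zero-mean, symmetric error),
# which is exactly the assumption described above.
guesses = [TRUE_WEIGHT + random.gauss(0, 50) for _ in range(1000)]

crowd_average = sum(guesses) / len(guesses)
print(f"Crowd average: {crowd_average:.1f} kg (true weight: {TRUE_WEIGHT} kg)")
# Because the errors are equally likely to be high or low, they mostly cancel
# out and the average lands close to the true weight.
```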


These are reasonable assumptions that are used by default in many different fields. Noise is supposed to be random, while the signal contains information about something and is therefore not random. Most people know what a cow is, and most people know roughly how much things weigh. But in the case of human-driven forecasting, these assumptions are not perfect.
1. There is no reason why the bias should be evenly distributed. (In sciency terms: the noise might not be symmetrically distributed across your crowd.) If your crowd is made of 30 cheating butchers (overestimating weights) and 10 greedy clients (underestimating weights), your biases may be opposite but they are not evenly distributed. Even if the clients’ bias happens to be exactly opposite to the butchers’ bias, averaging the 40 guesses will not give you the right answer, because you have many more butchers in your population. It will give you an overestimated weight. Instead you should pool the data: average the clients’ guesses (pool A), average the butchers’ guesses (pool B), and then take the average of the results of pools A and B.
2. There is no reason why the biases should be exactly opposite. (The distribution of the noise might not have zero mean.)
Ideally, you would know by how much butchers tend to overestimate (say, on average +5% of the total weight) and by how much clients tend to underestimate (say, -10%). If you have this information, you can use it to weight your pooled data before putting the pools together. In this example, you would want to give less weight to the clients’ pool, because you know that their bias is usually larger than the butchers’.
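
As a rough sketch of those two fixes (pool first, then weight the pools by their assumed biases), here is one way it could look in Python. The group sizes, the +5% and -10% biases, and the inverse-bias weighting scheme are all assumptions chosen to mirror the butcher/client example, not a prescription.

```python
import random

random.seed(1)

TRUE_WEIGHT = 600  # kg; hypothetical numbers throughout

# 30 butchers overestimating by about +5%, 10 clients underestimating by about -10%.
butchers = [TRUE_WEIGHT * 1.05 + random.gauss(0, 15) for _ in range(30)]
clients = [TRUE_WEIGHT * 0.90 + random.gauss(0, 15) for _ in range(10)]

# Naive average: skewed by the fact that butchers outnumber clients 3 to 1.
naive = sum(butchers + clients) / (len(butchers) + len(clients))

# Fix 1 -- pool, then average the pool means, so headcount stops mattering.
# (The two pools still carry unequal biases, which is what fix 2 is for.)
pool_a = sum(clients) / len(clients)    # clients
pool_b = sum(butchers) / len(butchers)  # butchers
pooled = (pool_a + pool_b) / 2

# Fix 2 -- weight the pools by how much we trust them. Here the weights are
# inversely proportional to each pool's assumed bias (10% vs 5%), so the
# more-biased clients count less; this is just one possible weighting scheme.
w_a, w_b = 1 / 0.10, 1 / 0.05
weighted = (w_a * pool_a + w_b * pool_b) / (w_a + w_b)

print(f"naive average:  {naive:.1f} kg")
print(f"pooled average: {pooled:.1f} kg")
print(f"weighted pools: {weighted:.1f} kg   (true weight: {TRUE_WEIGHT} kg)")
```

With these particular numbers the inverse-bias weights happen to undo both biases almost exactly; in practice the weights would come from whatever past data you have about each group.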

So if you have some forecasting data and you want to get the best forecast out of it, there are two things you should do before taking averages.
First, identify all possible sources of bias and form pools based on this information. Repeat this step as many times as needed for different partitions. If you are doing political forecasting, people might be biased in favor of their candidate: divide your data by political party (partition A). If women tend to have different political biases than men for some reason, identify that reason and divide your pool into men and women (partition B). The more (verifiable) causes of bias you can find, the more of them you will be able to cancel out.
Second, quantify the bias so you can attribute weights to your pools. For that, you will have to rely on previous data and make guesses.
Finally, take your averages per partition. You will have as many averages as partitions. You can decide to take a final average of these, or you can go meta and give weights to your partitions. If you are quite sure that partition A captures a real cause of opposite biases, but are less sure about B, give less weight to B.
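
Putting those steps together, here is a minimal sketch of the whole recipe in Python. The forecasts, the “party” and “gender” partitions, and the 0.7/0.3 meta-weights are all invented for illustration.

```python
from collections import defaultdict

# Each forecast is tagged with the groups it belongs to (made-up data).
forecasts = [
    {"value": 0.70, "party": "A", "gender": "F"},
    {"value": 0.65, "party": "A", "gender": "M"},
    {"value": 0.30, "party": "B", "gender": "F"},
    {"value": 0.40, "party": "B", "gender": "M"},
    {"value": 0.35, "party": "B", "gender": "M"},
]

def partition_average(data, key):
    """Average within each pool of a partition, then average the pool means."""
    pools = defaultdict(list)
    for row in data:
        pools[row[key]].append(row["value"])
    pool_means = [sum(values) / len(values) for values in pools.values()]
    return sum(pool_means) / len(pool_means)

# One average per partition (per identified cause of bias).
by_party = partition_average(forecasts, "party")
by_gender = partition_average(forecasts, "gender")

# Final step: combine the partitions, giving more weight to the partition we
# trust more as a real source of opposite biases (the weights are assumptions).
partition_weights = {"party": 0.7, "gender": 0.3}
final = partition_weights["party"] * by_party + partition_weights["gender"] * by_gender

print(f"party partition:  {by_party:.3f}")
print(f"gender partition: {by_gender:.3f}")
print(f"final forecast:   {final:.3f}")
```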

Now of course, it should be noted that it isn’t always worth doing all this work. If you have data about guessing a cow’s weight, maybe just do a simple average. The result might be good enough and the extra work is not worth your time.
But if you are gathering data about whether country A and country B are going to start a war in the next 3 months, it might be worth putting a little more effort into pooling your data. It doesn’t have to be data directly produced by people: it can be governmental data, numbers from different agencies, or you trying to predict what people you know will do next… There is always room for bias anyway.
In addition, clearly identifying the sources of bias in your data allows you to notice what data may be missing (for instance, if all your pools are biased in the same direction), and it allows you to update your forecasts efficiently. When you come into possession of new data, it can be hard to decide how much it should change your original forecast. But if you can readily identify which pools the new data belongs to, updating is much easier.
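
As a small, hypothetical illustration of that last point, keeping running per-pool averages makes updating cheap: a new data point only moves the pool it belongs to, and the combined forecast is recomputed from the pool means. The pool names and guesses below are made up.

```python
class PooledForecast:
    """Toy running forecast: keep per-pool sums and counts so a new data point
    only shifts its own pool, then recombine the pool means."""

    def __init__(self, pool_names):
        self.sums = {name: 0.0 for name in pool_names}
        self.counts = {name: 0 for name in pool_names}

    def add(self, pool, value):
        self.sums[pool] += value
        self.counts[pool] += 1

    def estimate(self):
        means = [self.sums[p] / self.counts[p]
                 for p in self.sums if self.counts[p] > 0]
        return sum(means) / len(means)

# Hypothetical usage: a new butcher guess only moves the butchers' pool.
forecast = PooledForecast(["butchers", "clients"])
for guess in (620, 640, 610):
    forecast.add("butchers", guess)
for guess in (540, 560):
    forecast.add("clients", guess)
print(f"before update: {forecast.estimate():.1f} kg")

forecast.add("butchers", 650)  # new piece of data, easy to slot into its pool
print(f"after update:  {forecast.estimate():.1f} kg")
```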

Happy forecasting! The Good Judgment Open project is a good place to start (you don’t have to be a scientist at all, just give your opinion).
