Adjusting outliers

Well, it had to happen at some point: a usually reliable model requires multiple repair trips for one owner. In statistics, this is known as an “outlier.” The question for me: how to handle such cases?

Leaving outliers as they are can distort the results, especially with small sample sizes. In the most extreme case from the upcoming August results, I have 25 responses for one model. Twenty owners required no repairs. Four required one repair trip. And one required five repair trips. Simply analyzing these responses (weighted by the number of months reported on) yields a repair rate of 84 repair trips per 100 vehicles (0.84 per car). Since more vehicles require one repair trip than require two, and more require two than require three, and so forth, this is likely a distorted result. After all, none of the vehicles in the sample required two, three, or four repair trips. This suggests that a vehicle that requires five repair trips is rare, much rarer than one in 25 among the entire population.

Sometimes in statistical analysis the outliers are trimmed: they are simply eliminated from the sample, one from each end of the distribution to even things out. But, again because of the small sample sizes, especially when combined with the generally low frequency of repairs (most values are zeros), this would distort the result in the other direction.
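To see the direction of that distortion, here is a minimal sketch in Python using the 25 responses described above. The function name is hypothetical, and the rates are plain unweighted averages (I don't reproduce the months-reported weighting here), so they will not match the weighted figures quoted in this post.

```python
def trim_one_each_end(trips):
    """Drop the single lowest and single highest response (illustrative helper)."""
    ordered = sorted(trips)
    return ordered[1:-1]

# The 25 responses from the example: 20 zeros, four ones, one five.
sample = [0] * 20 + [1] * 4 + [5]
trimmed = trim_one_each_end(sample)

raw_rate = sum(sample) / len(sample)          # 9 trips / 25 cars
trimmed_rate = sum(trimmed) / len(trimmed)    # 4 trips / 23 cars

print(f"{raw_rate:.2f} {trimmed_rate:.2f}")   # prints "0.36 0.17"
```

Trimming removes the problem car entirely, along with one trouble-free car, so the unweighted rate falls by more than half. That treats the worst car as if it had never existed.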

The solution I have opted for is to adjust the highest value for each model down to the next highest value plus one whenever it exceeds that amount. This retains the vehicle in question as the least reliable in the sample, but smooths out the distribution so that it more closely resembles the distribution one would expect if surveying the entire population.

In the above case, this means that for the model in question we now have 20 zeros, four ones, and one two. This yields a repair rate of 56 repair trips per 100 vehicles (0.56 per car), which is very likely a more accurate result.
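The adjustment can be sketched in a few lines of Python. This is an illustration under assumptions rather than the exact procedure used for the published results: the function name is hypothetical, a tie for the highest value is assumed to leave the data unchanged, and the months-reported weighting is left out, so only trip totals are compared.

```python
def adjust_highest(trips):
    """Cap the single highest count at (next highest count + 1).

    Assumption: if two or more cars tie for the highest count,
    nothing changes, because the cap is then above the highest value.
    """
    if len(trips) < 2:
        return list(trips)
    ordered = sorted(trips)
    cap = ordered[-2] + 1               # next highest value plus one
    ordered[-1] = min(ordered[-1], cap)
    return ordered

# The 25 responses from the example: 20 zeros, four ones, one five.
sample = [0] * 20 + [1] * 4 + [5]
adjusted = adjust_highest(sample)

raw_total = sum(sample)         # 9 repair trips
adjusted_total = sum(adjusted)  # 6 repair trips

print(raw_total, adjusted_total, raw_total / adjusted_total)  # prints "9 6 1.5"
```

The total drops from nine trips to six, a ratio of 1.5, which matches the ratio between the weighted rates of 0.84 and 0.56 quoted above.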

In the great majority of cases, no such adjustment is required. Out of the 81 models for which there are 15 or more responses, seven are affected, and in only three of these cases is the number of repair trips reduced by more than one. But in these three cases, I am much more comfortable with the result after the adjustment.

Another way to think of this is that there are lemons, and they can happen with every car. In each of the above cases, the value for a single problematic car is adjusted, for a total of three cars. When you have data on over 2,200 cars, are three going to be unusually unreliable? Sure. But should the models in question have their reported repair rates directly affected? In effect, they got struck by lightning. It could have been any model; it just happened to be these. It would not be valid to assume that the models in question were substantially less reliable owing to such a rare, chance event.

The adjustment yields an average that much better represents one’s likely repair trip rate when buying a car.