David K. Park and Andrew Gelman
American Statistical Association, November 2008
A linear regression of y on x can be approximated by a simple difference: the average values of y corresponding to the highest quarter or third of x, minus the average values of y corresponding to the lowest quarter or third of x. A simple theoretical analysis, similar to analyses that have been done in psychometrics, shows this comparison to perform reasonably well, with 80%–90% efficiency compared to the regression if the predictor is uniformly or normally distributed. By discretizing x into three categories, we claw back about half the efficiency lost by the commonly used strategy of dichotomizing the predictor. We illustrate with the example that motivated our research: an analysis of income and voting which we had originally performed for a scholarly journal but then wanted to communicate to a general audience.
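The comparison described in the abstract can be sketched in a few lines of Python. This is only an illustration on simulated data, not code from the paper: the function name `upper_lower_difference`, the simulated slope of 2, and the sample size are hypothetical. Dividing the difference in y by the corresponding difference in mean x is an assumption added here so the result is on the same scale as a regression slope.

```python
import numpy as np

def upper_lower_difference(x, y, frac=1/3):
    """Mean of y in the top `frac` of x minus mean of y in the bottom `frac` of x.

    Returns (difference in mean y, that difference rescaled to a slope).
    The rescaling step is an illustrative assumption, not taken from the abstract.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Cut points that separate the lowest and highest `frac` of x.
    lo_cut, hi_cut = np.quantile(x, [frac, 1 - frac])
    low = x <= lo_cut
    high = x >= hi_cut
    diff_y = y[high].mean() - y[low].mean()
    diff_x = x[high].mean() - x[low].mean()
    return diff_y, diff_y / diff_x

# Simulated data (hypothetical): normally distributed predictor, true slope 2.
rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 1.0 + 2.0 * x + rng.normal(size=10_000)

diff, slope_from_split = upper_lower_difference(x, y, frac=1/3)
slope_ols = np.polyfit(x, y, 1)[0]  # ordinary least-squares slope for comparison

print(f"upper-third minus lower-third difference in y: {diff:.3f}")
print(f"slope implied by the split: {slope_from_split:.3f}")
print(f"least-squares slope: {slope_ols:.3f}")
```

Setting `frac=1/4` gives the upper-quarter versus lower-quarter version of the same comparison. Setting `frac=1/2` gives a median-split comparison of the upper and lower halves, which is one way to dichotomize the predictor.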
View the paper here: Splitting a Predictor at the Upper Quarter or Third and the Lower Quarter or Third