Nathan TeBlunthuis

Misclassification Causes Bias in Regression Models: How to Fix It Using the misclassificationmodels Package

Automated classifiers (ACs), often built via supervised machine learning, can categorize large and statistically powerful samples of data ranging from text to images and video, and have become popular measurement devices in many scientific and industrial fields. Despite this popularity, even highly accurate classifiers make errors, and these errors cause misclassification bias and misleading results in downstream analyses unless those analyses account for them.

In principle, existing statistical methods can use “gold standard” validation data, such as labels created by human annotators and often used to validate a classifier's predictive performance, to correct misclassification bias and produce consistent estimates. I will present an evaluation of such methods, including a new method implemented in the experimental R package misclassificationmodels, via Monte Carlo simulations designed to reveal each method’s limitations. The results show the new method is both versatile and efficient.
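To make the mechanism concrete, here is a minimal simulation sketch in Python. It is not the misclassificationmodels package and does not reproduce the talk's simulations. It assumes a binary covariate, a classifier whose errors are independent of the outcome (nondifferential error), and a simple moment-based correction that rescales the naive slope using a gold-standard validation subsample. The function name `simulate` and all parameter values are hypothetical choices for illustration.

```python
import random
from statistics import fmean


def simulate(n=20000, b1=2.0, accuracy=0.9, n_validation=2000, seed=1):
    """Simulate Y = 1 + b1*X + noise, where X is observed only through a noisy classifier W."""
    rng = random.Random(seed)
    # True binary covariate (e.g. a human-coded category).
    x = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    # Classifier output: correct with probability `accuracy`, independent of Y (assumption).
    w = [xi if rng.random() < accuracy else 1 - xi for xi in x]
    y = [1.0 + b1 * xi + rng.gauss(0, 1) for xi in x]

    def slope(pred):
        # For a binary regressor, the OLS slope equals the difference in group means.
        ones = [yi for yi, p in zip(y, pred) if p == 1]
        zeros = [yi for yi, p in zip(y, pred) if p == 0]
        return fmean(ones) - fmean(zeros)

    true_slope = slope(x)  # what we would estimate with perfect measurement
    naive = slope(w)       # attenuated: expected value is b1 * (P(X=1|W=1) - P(X=1|W=0))

    # Treat the first n_validation rows as the gold-standard validation sample.
    val = list(zip(x[:n_validation], w[:n_validation]))
    p1 = fmean(xi for xi, wi in val if wi == 1)  # estimate of P(X=1 | W=1)
    p0 = fmean(xi for xi, wi in val if wi == 0)  # estimate of P(X=1 | W=0)
    corrected = naive / (p1 - p0)                # moment-based correction (assumption)
    return true_slope, naive, corrected


true_slope, naive, corrected = simulate()
print(f"true-X slope: {true_slope:.2f}")
print(f"naive slope using classifier output: {naive:.2f}")
print(f"corrected slope using validation data: {corrected:.2f}")
```

With a 90% accurate classifier and balanced classes, the naive slope is expected to be about 0.8 times the true coefficient, while the rescaled estimate recovers it on average. This simple correction breaks down when classifier errors depend on the outcome, which is the kind of limitation the evaluation described above is designed to expose.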

In sum, automated classifiers, even those below common accuracy standards or making systematic misclassifications, can be useful for measurement with careful study design and appropriate error correction methods.

Pronouns: He/Him or They/Them

Seattle, WA, USA

Nathan TeBlunthuis is a computational social scientist and postdoctoral researcher at the University of Michigan School of Information and an affiliate of the Community Data Science Collective at the University of Washington. Much of Nathan's research applies innovative methods in R to study Wikipedia and other online communities. He earned his Ph.D. from the Department of Communication at the University of Washington in 2021 and has also worked for the Wikimedia Foundation and Microsoft.