Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality
How should one perform matching in observational studies when the units are text documents? The lack of randomized assignment of documents into treatment and control groups may lead to systematic differences between groups on high-dimensional and latent features of text such as topical content and sentiment. Standard balance metrics, used to measure the quality of a matching method, fail in this setting. We decompose text matching methods into two parts, (1) a text representation and (2) a distance metric, and present a framework for measuring the quality of text matches experimentally using human subjects. We consider 28 potential methods and find that representing text as term vectors and matching on cosine distance significantly outperform alternative representations and distance metrics. We apply our chosen method to a substantive debate in the study of media bias using a novel data set of front-page news articles from thirteen news sources. Media bias is composed of topic selection bias and presentation bias; using our matching method to control for topic selection, we find that both components contribute significantly to media bias, though some news sources rely on one component more than the other.
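As an illustration of the two-part decomposition described above (a text representation plus a distance metric), the following is a minimal sketch of term-vector matching on cosine distance. It is not the authors' implementation: the function names (`term_vector`, `cosine_distance`, `match_documents`), the simple regex tokenizer, the raw term-frequency weighting, the nearest-neighbor matching with replacement, and the optional `caliper` threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Represent a document as a bag-of-words term-frequency vector.

    Assumption: lowercase tokens of letters and apostrophes; a real
    pipeline might add stemming, stop-word removal, or TF-IDF weights.
    """
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        # An empty document shares no terms with anything.
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def match_documents(treated, control, caliper=None):
    """Match each treated document to its nearest control document.

    Matching is with replacement (a control may be reused). If
    `caliper` is given, matches farther apart than it are dropped.
    Returns a list of (treated_index, control_index, distance).
    """
    if not control:
        return []
    control_vecs = [term_vector(doc) for doc in control]
    matches = []
    for i, doc in enumerate(treated):
        vec = term_vector(doc)
        dists = [cosine_distance(vec, c) for c in control_vecs]
        j = min(range(len(dists)), key=dists.__getitem__)
        if caliper is None or dists[j] <= caliper:
            matches.append((i, j, dists[j]))
    return matches


if __name__ == "__main__":
    # Toy documents, invented for the example.
    treated = ["the senate passed the budget bill",
               "storm floods coastal town"]
    control = ["coastal town hit by storm floods",
               "budget bill passed by the senate",
               "local team wins final"]
    for i, j, d in match_documents(treated, control):
        print(f"treated {i} -> control {j} (distance {d:.3f})")
```

In this sketch the representation and the distance are separate functions, so either could be swapped for an alternative without changing the matching loop.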