Improving the Results of Machine-based Entity Resolution with Limited Human Effort: A Risk Perspective
Purely machine-based solutions usually struggle with challenging classification tasks such as entity resolution (ER). To alleviate this problem, a recent trend is to involve humans in the resolution process, most notably through crowdsourcing. However, it remains very challenging to find a solution that can effectively improve the quality of entity resolution with limited human effort. In this position paper, we investigate the problem of human-machine cooperation for ER from a risk perspective. We propose to select for manual verification the machine-labeled results at high risk of being mislabeled. We present a risk model for this task that takes into consideration the human-labeled results as well as the output of the machine's resolution. Finally, our experiments on real data demonstrate that the proposed risk model identifies mislabeled instances with considerably higher accuracy than existing alternatives. Given the same human cost budget, it also achieves consistently better resolution quality than the state-of-the-art approach based on active learning.
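The core selection step can be illustrated with a minimal sketch. This is not the paper's actual risk model: here we use a simple confidence-based proxy (an instance's risk is how close the machine's match probability is to the decision boundary) and pick the riskiest instances under a fixed verification budget. The function name and inputs are hypothetical, for illustration only.

```python
def select_for_verification(match_probs, budget):
    """Rank machine-labeled ER pairs by a crude mislabeling risk and
    return the indices of the `budget` riskiest ones.

    match_probs: machine-assigned probabilities that each pair is a match
                 (the machine labels a pair 'match' iff its prob > 0.5).
    Risk proxy: min(p, 1 - p), i.e. the probability mass against the
    label the machine actually assigned. A real risk model (as in the
    paper) would also exploit human-labeled results.
    """
    risks = [min(p, 1.0 - p) for p in match_probs]
    ranked = sorted(range(len(match_probs)),
                    key=lambda i: risks[i], reverse=True)
    return ranked[:budget]

# Pairs near the 0.5 boundary are flagged for manual verification first.
probs = [0.95, 0.52, 0.10, 0.48, 0.80]
print(select_for_verification(probs, 2))  # → [1, 3]
```

Under a budget of 2, the two pairs with probabilities 0.52 and 0.48 are selected, since the machine's labels for them carry the least confidence.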