Profiling of OCR'ed Historical Texts Revisited

01/19/2017
by Florian Fink, et al.

In the absence of ground truth, it is not possible to automatically determine the exact spectrum and occurrences of OCR errors in an OCR'ed text. Yet for interactive postcorrection of OCR'ed historical printings, it is extremely useful to have a statistical profile that estimates error classes with their associated frequencies and points to conjectured errors and suspicious tokens. The method introduced in Reffle (2013) computes such a profile, combining lexica, pattern sets and advanced matching techniques in a specialized Expectation Maximization (EM) procedure. Here we improve this method in three respects. First, the method in Reffle (2013) is not adaptive: user feedback obtained from actual postcorrection steps cannot be used to compute refined profiles. We introduce a variant of the method that supports adaptivity by taking the user's correction steps into account. This leads to higher precision in recognizing erroneous OCR tokens. Second, new historical patterns are often found during postcorrection. We show that adding these patterns to the linguistic background resources yields a second kind of improvement, enabling even higher precision by telling historical spellings apart from OCR errors. Third, the method in Reffle (2013) makes no active use of tokens that cannot be interpreted in the underlying channel model. We show that adding these uninterpretable tokens to the set of conjectured errors significantly improves recall for error detection while also improving precision.
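To make the third point concrete, the following is a minimal sketch of how precision and recall for error detection can be computed, and how adding uninterpretable tokens to the conjectured errors can move both. The function name `precision_recall` and all token sets are hypothetical toy values chosen for illustration; they are not the paper's implementation, data, or results.

```python
def precision_recall(flagged, true_errors):
    """Return (precision, recall) of flagged tokens against the true OCR errors."""
    hits = len(flagged & true_errors)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_errors) if true_errors else 0.0
    return precision, recall


# Hypothetical toy data (assumption, not taken from the paper).
true_errors = {"tbe", "wben", "x#q", "~~r"}   # tokens that really are OCR errors
conjectured = {"tbe", "vnd"}                  # flagged by the profile; "vnd" is a
                                              # historical spelling, so a false positive
uninterpretable = {"x#q", "~~r"}              # tokens the channel model cannot explain

# Error detection using only the conjectured errors.
base = precision_recall(conjectured, true_errors)

# Error detection after also flagging the uninterpretable tokens.
extended = precision_recall(conjectured | uninterpretable, true_errors)

print("conjectured only:       precision=%.2f recall=%.2f" % base)
print("plus uninterpretable:   precision=%.2f recall=%.2f" % extended)
```

In this toy setting both measures rise because every added token happens to be a true error; on real data the effect depends on how often uninterpretable tokens are in fact erroneous.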
