DuoSearch: A Novel Search Engine for Bulgarian Historical Documents

05/30/2023
by Angel Beshirov, et al.

Searching collections of digitised historical documents is hindered by a two-pronged problem: orthographic variety and optical character recognition (OCR) errors. We present DuoSearch, a new search engine for historical documents that uses ElasticSearch and machine learning methods based on deep neural networks to address this problem. It was tested on a collection of Bulgarian historical newspapers from the mid-19th to the mid-20th century. The system gives end users an interactive, intuitive interface in which they enter search terms in modern Bulgarian and search across historical spellings. This is the first solution facilitating the use of digitised historical documents in Bulgarian.
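
To make the idea of searching historical spellings from a modern query concrete, here is a minimal sketch. It is not the DuoSearch implementation: the abstract describes neural methods, whereas this uses a single hand-written toy rule (pre-1945 Bulgarian orthography commonly wrote a final "ъ" after a word-final consonant). The function names and the field name `text` are hypothetical. The sketch only builds an Elasticsearch query body, using the standard `bool`/`should` and `match` with `fuzziness` options, where fuzzy matching is one common way to tolerate OCR character errors.

```python
import json

# Bulgarian consonant letters after which pre-1945 spelling commonly
# added a final "ъ" (for example "град" was written "градъ").
CONSONANTS = set("бвгджзклмнпрстфхцчшщ")


def historical_variants(term: str) -> list[str]:
    """Toy rule (an assumption, NOT DuoSearch's neural model):
    return the modern term plus a variant with a final 'ъ'."""
    variants = [term]
    if term and term[-1].lower() in CONSONANTS:
        variants.append(term + "ъ")
    return variants


def build_query(term: str, field: str = "text") -> dict:
    """Build an Elasticsearch query body matching any spelling variant.

    `field` is a hypothetical index field name. "fuzziness": "AUTO" lets
    Elasticsearch accept small edit distances, which can absorb some
    OCR mistakes.
    """
    should = [
        {"match": {field: {"query": variant, "fuzziness": "AUTO"}}}
        for variant in historical_variants(term)
    ]
    return {"query": {"bool": {"should": should, "minimum_should_match": 1}}}


if __name__ == "__main__":
    # Print the query body that would be sent to a search endpoint.
    print(json.dumps(build_query("град"), ensure_ascii=False, indent=2))
```

A real system would replace `historical_variants` with a learned or much richer normalisation step; the query-building part stays the same whatever produces the variants.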


