Location Name Extraction from Targeted Text Streams using Gazetteer-based Statistical Language Models

08/10/2017
by Hussein S. Al-Olimat, et al.

Extracting location names from informal and unstructured texts requires identifying referent boundaries and partitioning compound names in the presence of variation in location referents. Instead of analyzing semantic, syntactic, and/or orthographic features, our Location Name Extraction tool (LNEx) exploits a region-specific statistical language model to evaluate whether an observed n-gram in targeted Twitter text is a legitimate location name variant. LNEx handles abbreviations, and automatically filters and augments the location names in gazetteers from OpenStreetMap, Geonames, and DBpedia. Consistent with Carroll [4], LNEx addresses two kinds of location name contractions: category ellipsis and location ellipsis, which produce alternate forms of location names (i.e., Nameheads of location names). The modified gazetteers and dictionaries of abbreviations help detect the boundaries of multi-word location names, delimiting them in texts using n-gram statistics. We evaluated the extent to which an augmented and filtered region-specific gazetteer can successfully extract location names from a targeted text stream. We used 4,500 event-specific tweets from three targeted streams of different flooding disasters to compare LNEx performance against eight state-of-the-art taggers. LNEx improved the average F-Score by 98-145%, convincingly outperforming these taggers on the three manually annotated Twitter streams. Furthermore, LNEx is capable of stream processing.
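
To make the idea of gazetteer-driven boundary detection concrete, the following is a minimal sketch and not the LNEx implementation. It assumes a tiny hand-made gazetteer, a hypothetical list of category words (CATEGORY_WORDS), and whitespace tokenization; the function names augment, build_model, and extract are illustrative only. It augments names with a category-ellipsis variant (dropping a trailing category word), counts the n-grams that occur inside gazetteer names, and then uses those counts to decide how far a candidate span in a tweet may extend. Plain n-gram counts stand in here for the statistical language model described above, which is a deliberate simplification.

```python
from collections import Counter

# Hypothetical category words; the actual lists used by LNEx are not given here.
CATEGORY_WORDS = {"river", "street", "bridge", "hospital"}


def augment(gazetteer):
    """Lowercase names and add category-ellipsis variants,
    e.g. 'Adyar River' also yields 'adyar'."""
    names = set()
    for name in gazetteer:
        tokens = name.lower().split()
        names.add(" ".join(tokens))
        if len(tokens) > 1 and tokens[-1] in CATEGORY_WORDS:
            names.add(" ".join(tokens[:-1]))
    return names


def build_model(names):
    """Count every contiguous n-gram that occurs inside a gazetteer name."""
    counts = Counter()
    for name in names:
        tokens = name.split()
        for i in range(len(tokens)):
            for j in range(i + 1, len(tokens) + 1):
                counts[tuple(tokens[i:j])] += 1
    return counts


def extract(text, names, counts):
    """Grow a span while it is still a known n-gram; emit the longest
    prefix of that span that is a complete (possibly augmented) name."""
    tokens = text.lower().split()
    found, i = [], 0
    while i < len(tokens):
        best = None
        j = i + 1
        while j <= len(tokens) and counts[tuple(tokens[i:j])] > 0:
            if " ".join(tokens[i:j]) in names:
                best = j
            j += 1
        if best is not None:
            found.append(" ".join(tokens[i:best]))
            i = best  # continue after the matched name
        else:
            i += 1
    return found


gazetteer = ["Adyar River", "Anna Salai", "Chennai"]
names = augment(gazetteer)
counts = build_model(names)
print(extract("flooding near adyar and anna salai in chennai", names, counts))
```

In this sketch, "adyar" is matched only because the augmentation step added the shortened form of "Adyar River", and "anna salai" is kept as one two-word name because the bigram occurs in the gazetteer. Punctuation, abbreviations, gazetteer filtering, and location ellipsis are not handled.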
