Learning Directly from Grammar Compressed Text

02/28/2020
by   Yoichi Sasaki, et al.
Neural networks that use large amounts of text data have been successfully applied to a variety of tasks. While massive text data is usually compressed using techniques such as grammar compression, almost all previous machine learning methods assume already decompressed sequence data as their input. In this paper, we propose a method to directly apply neural sequence models to text data compressed with grammar compression algorithms, without decompression. To encode the unique symbols that appear in compression rules, we introduce composer modules that incrementally encode the symbols into vector representations. Through experiments on real datasets, we empirically show that the proposed model can achieve both memory and computational efficiency while maintaining moderate performance.
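
The idea of incrementally encoding grammar symbols into vectors can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy grammar (`RULES`, in the style of a binary straight-line grammar), the compressed sequence `COMPRESSED`, the embedding size `DIM`, the random untrained weights, and the composer itself, which here is simply `tanh(W_left x + W_right y)`. The paper's actual composer architecture and training procedure may differ.

```python
import math
import random

# Hypothetical binary grammar: each nonterminal expands to exactly two
# symbols, which are terminals or previously defined nonterminals.
RULES = {
    "X1": ("a", "b"),
    "X2": ("X1", "X1"),
    "X3": ("X2", "a"),
}
# Hypothetical compressed sequence over terminals and nonterminals.
COMPRESSED = ["X3", "X2", "b"]

DIM = 4  # assumed embedding size
rng = random.Random(0)


def random_vector():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def random_matrix():
    return [[rng.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in range(DIM)]


# Untrained stand-ins for learned parameters.
TERMINAL_EMBEDDINGS = {"a": random_vector(), "b": random_vector()}
W_LEFT, W_RIGHT = random_matrix(), random_matrix()


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def compose(left, right):
    """Assumed composer: tanh(W_LEFT @ left + W_RIGHT @ right)."""
    mixed = zip(matvec(W_LEFT, left), matvec(W_RIGHT, right))
    return [math.tanh(l + r) for l, r in mixed]


def encode_symbol(symbol, cache):
    """Return the vector for a symbol, composing each rule at most once."""
    if symbol in TERMINAL_EMBEDDINGS:
        return TERMINAL_EMBEDDINGS[symbol]
    if symbol not in cache:
        left, right = RULES[symbol]
        cache[symbol] = compose(
            encode_symbol(left, cache), encode_symbol(right, cache)
        )
    return cache[symbol]


def encode_sequence(symbols):
    """Encode a compressed sequence without expanding it to raw text."""
    cache = {}
    return [encode_symbol(s, cache) for s in symbols], cache


def expand(symbol):
    """Decompress one symbol; used only to check the toy grammar."""
    if symbol not in RULES:
        return symbol
    left, right = RULES[symbol]
    return expand(left) + expand(right)


vectors, cache = encode_sequence(COMPRESSED)
```

In this sketch the three vectors in `vectors` could be fed to any sequence model in place of the ten characters the toy sequence expands to, and each rule is composed only once no matter how often its symbol recurs.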


