AI Data Wrangling with Associative Arrays

01/18/2020
by Jeremy Kepner, et al.

The AI revolution is data driven. AI "data wrangling" is the process by which unusable data is transformed to support AI algorithm development (training) and deployment (inference). Significant time is devoted to translating between the diverse data representations that support the many query and analysis steps found in an AI pipeline. Rigorous mathematical representations of these data enable data translation and analysis optimization within and across steps. Associative array algebra provides a mathematical foundation that naturally describes the tabular structures and set mathematics that are the basis of databases. Likewise, the matrix operations and corresponding inference/training calculations used by neural networks are also well described by associative arrays. More surprisingly, a general denormalized form of hierarchical formats, such as XML and JSON, can be readily constructed. Finally, pivot tables, which are among the most widely used data analysis tools, naturally emerge from associative array constructors. A common foundation in associative arrays provides interoperability guarantees, proving that their operations are linear systems with rigorous mathematical properties, such as associativity, commutativity, and distributivity, that are critical to reordering optimizations.
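To make the pivot-table and algebraic claims concrete, here is a minimal Python sketch. It assumes an associative array can be modeled as a sparse dictionary mapping (row key, column key) pairs to numeric values. The function names (`assoc_from_triples`, `assoc_add`, `assoc_matmul`) and the sample data are hypothetical illustrations, not the API of D4M or of any library described in the paper.

```python
from collections import defaultdict


def assoc_from_triples(triples):
    """Build a sparse associative array {(row, col): value} from
    (row, col, value) triples, summing values whose keys collide."""
    array = {}
    for row, col, val in triples:
        key = (row, col)
        array[key] = array[key] + val if key in array else val
    return array


def assoc_add(a, b):
    """Element-wise sum over the union of the keys of a and b."""
    out = dict(a)
    for key, val in b.items():
        out[key] = out.get(key, 0) + val
    return out


def assoc_matmul(a, b):
    """Array product: out[i, j] = sum over k of a[i, k] * b[k, j]."""
    b_rows = defaultdict(list)  # index b by its row key for fast lookup
    for (k, j), w in b.items():
        b_rows[k].append((j, w))
    out = {}
    for (i, k), v in a.items():
        for j, w in b_rows.get(k, ()):
            out[(i, j)] = out.get((i, j), 0) + v * w
    return out


# Hypothetical records: (region, product, units sold).
records = [
    ("east", "widget", 3),
    ("east", "widget", 4),
    ("west", "gadget", 5),
    ("east", "gadget", 2),
]

# The constructor itself acts as a pivot table: region x product, summed.
pivot = assoc_from_triples(records)

# Hypothetical price table: product x currency.
price = {("widget", "usd"): 10, ("gadget", "usd"): 20}
discount = {("widget", "usd"): -1}

# Revenue per region is an array product of the two tables.
revenue = assoc_matmul(pivot, price)

# Distributivity permits reordering: A(B + C) equals AB + AC.
lhs = assoc_matmul(pivot, assoc_add(price, discount))
rhs = assoc_add(assoc_matmul(pivot, price), assoc_matmul(pivot, discount))
```

In this sketch the pivot table falls out of the constructor because colliding keys are combined with addition, and the final two lines illustrate the kind of reordering that distributivity permits. Integer values are used so the two sides can be compared exactly.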


