Analyzing Large-Scale, Distributed and Uncertain Data

by Yaron Gonen, et al.

The exponential growth of data and the demand to extract information and knowledge from it present new challenges for database researchers. Established database systems and algorithms can no longer handle such large data sets effectively. MapReduce is a novel programming paradigm for processing distributable problems over large-scale data using a computer cluster. In this work we explore the MapReduce paradigm from three different angles. We begin by examining a well-known problem in the field of data mining: mining closed frequent itemsets over a large dataset. By harnessing the power of MapReduce, we present a novel algorithm for mining closed frequent itemsets that outperforms existing algorithms. Next, we explore one of the fundamental implications of "Big Data": the data is not known with complete certainty. A probabilistic database is a relational database in which each tuple is additionally associated with a probability of its existence. A natural development of MapReduce is a distributed relational database management system in which relational calculus is reduced to a combination of map and reduce functions. We take this development a step further by proposing a query optimizer over a distributed, probabilistic database. Finally, we analyze Hadoop, the best-known implementation of MapReduce, aiming to overcome one of its major drawbacks: it does not directly support the explicit specification of data that is processed repeatedly across different jobs. Many data-mining algorithms, such as clustering and association-rule mining, require iterative computation: the same data is processed again and again until the computation converges or a stopping condition is satisfied. We propose a modification to Hadoop that supports efficient access to the same data across different jobs.
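To make the first problem concrete, the following is a minimal single-machine sketch of what "closed frequent itemsets" means, phrased in map and reduce steps. It is not the algorithm proposed in this work: it enumerates every subset of every transaction by brute force, which is only feasible for tiny inputs. The function names (`map_phase`, `reduce_phase`, `closed_frequent_itemsets`) and the `min_support` parameter are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction):
    """Emit (itemset, 1) for every non-empty subset of one transaction."""
    items = sorted(set(transaction))
    for size in range(1, len(items) + 1):
        for subset in combinations(items, size):
            yield frozenset(subset), 1


def reduce_phase(pairs):
    """Sum the counts emitted for each itemset, giving its support."""
    support = defaultdict(int)
    for itemset, count in pairs:
        support[itemset] += count
    return dict(support)


def closed_frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for itemsets that are frequent and closed.

    Assumption for illustration: brute-force enumeration, not the
    thesis's algorithm.
    """
    pairs = (pair for t in transactions for pair in map_phase(t))
    support = reduce_phase(pairs)
    # Frequent: the itemset appears in at least min_support transactions.
    frequent = {s: c for s, c in support.items() if c >= min_support}
    # Closed: no proper superset has exactly the same support.
    return {
        s: c
        for s, c in frequent.items()
        if not any(s < other and c == oc for other, oc in frequent.items())
    }


if __name__ == "__main__":
    data = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["a"]]
    for itemset, count in closed_frequent_itemsets(data, 2).items():
        print(sorted(itemset), count)
```

In this toy input, `{b}` is frequent but not closed, because its superset `{a, b}` occurs in exactly the same transactions. Reporting only closed itemsets therefore loses no support information while shrinking the output.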




