Performance modeling of a distributed file-system

by Sandeep Kumar, et al.

Data centers have become the center of big data processing, and most programs running in them process big data. The storage requirements of such programs cannot be met by a single node, so a distributed file system is used: storage resources from multiple nodes are pooled and presented to the outside world as a unified view. Because the disk is the slowest component in the framework, the performance of these distributed file systems under a given workload is of paramount importance. Owing to this, many big data processing frameworks implement their own file system, fine-tuned for their specific workloads. However, fine-tuning a file system for a particular workload results in poor performance for workloads that do not match that profile. Hence, such file systems cannot be used for general-purpose usage, where workload characteristics show high variation. In this paper we model the performance of a general-purpose file system and analyse the impact of tuning the file system on its performance. The performance of these parallel file systems is not easy to model because it depends on many configuration parameters, such as the network, disk, underlying file system, number of servers, number of clients, and parallel file-system configuration. We present a multiple linear regression model that captures the relationship between the file-system configuration parameters, hardware configuration, and workload configuration (collectively called features) and the performance metrics. We use this model to rank the features according to their importance in determining the performance of the file system.
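As a rough illustration of ranking features with a multiple linear regression, the following minimal sketch fits ordinary least squares on standardized features and orders them by the magnitude of their coefficients. It is not the authors' implementation: the feature names (`stripe_kb`, `num_servers`, `num_clients`), the synthetic throughput formula, and the use of standardized coefficients as the importance measure are all assumptions made for this example.

```python
import random
import statistics


def standardize(column):
    """Rescale a feature column to zero mean and unit variance."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]


def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def rank_features(columns, y, names):
    """Fit y = b0 + sum(b_j * z_j) on standardized features z_j and
    return (name, b_j) pairs sorted by decreasing |b_j|."""
    z = [standardize(col) for col in columns]
    design = [[1.0] + [zc[i] for zc in z] for i in range(len(y))]
    k = len(design[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(design, y)) for a in range(k)]
    beta = solve_linear_system(xtx, xty)
    return sorted(zip(names, beta[1:]), key=lambda item: -abs(item[1]))


# Synthetic data (hypothetical): throughput driven mostly by server count.
random.seed(0)
n = 200
stripe_kb = [random.choice([64, 128, 256, 512]) for _ in range(n)]
num_servers = [random.randint(1, 8) for _ in range(n)]
num_clients = [random.randint(1, 16) for _ in range(n)]
throughput = [
    0.01 * s + 40 * v - 2 * c + random.gauss(0, 5)
    for s, v, c in zip(stripe_kb, num_servers, num_clients)
]

ranking = rank_features(
    [stripe_kb, num_servers, num_clients],
    throughput,
    ["stripe_kb", "num_servers", "num_clients"],
)
for name, weight in ranking:
    print(f"{name:12s} {weight:+.2f}")
```

Standardizing the features first puts the coefficients on a common scale, so their magnitudes can be compared directly; other importance measures could be substituted under the same regression fit.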




