Towards Personalized Preprocessing Pipeline Search

02/28/2023
by   Diego Martinez, et al.
0

Feature preprocessing, which transforms raw input features into numerical representations, is a crucial step in automated machine learning (AutoML) systems. However, the existing systems often have a very small search space for feature preprocessing with the same preprocessing pipeline applied to all the numerical features. This may result in sub-optimal performance since different datasets often have various feature characteristics, and features within a dataset may also have their own preprocessing preferences. To bridge this gap, we explore personalized preprocessing pipeline search, where the search algorithm is allowed to adopt a different preprocessing pipeline for each feature. This is a challenging task because the search space grows exponentially with more features. To tackle this challenge, we propose ClusterP3S, a novel framework for Personalized Preprocessing Pipeline Search via Clustering. The key idea is to learn feature clusters such that the search space can be significantly reduced by using the same preprocessing pipeline for the features within a cluster. To this end, we propose a hierarchical search strategy to jointly learn the clusters and search for the optimal pipelines, where the upper-level search optimizes the feature clustering to enable better pipelines built upon the clusters, and the lower-level search optimizes the pipeline given a specific cluster assignment. We instantiate this idea with a deep clustering network that is trained with reinforcement learning at the upper level, and random search at the lower level. Experiments on benchmark classification datasets demonstrate the effectiveness of enabling feature-wise preprocessing pipeline search.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/20/2023

DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning over Tabular Data

Data preprocessing is a crucial step in the machine learning process tha...
research
01/26/2021

Incremental Search Space Construction for Machine Learning Pipeline Synthesis

Automated machine learning (AutoML) aims for constructing machine learni...
research
05/23/2023

Deep Pipeline Embeddings for AutoML

Automated Machine Learning (AutoML) is a promising direction for democra...
research
06/29/2023

AutoML in Heavily Constrained Applications

Optimizing a machine learning pipeline for a task at hand requires caref...
research
08/03/2016

Improving Quality of Hierarchical Clustering for Large Data Series

Brown clustering is a hard, hierarchical, bottom-up clustering of words ...
research
05/31/2022

Learning brain MRI quality control: a multi-factorial generalization problem

Due to the growing number of MRI data, automated quality control (QC) ha...
research
06/06/2023

ColdNAS: Search to Modulate for User Cold-Start Recommendation

Making personalized recommendation for cold-start users, who only have a...

Please sign up or login with your details

Forgot password? Click here to reset