A principle feature analysis
A key task of data science is to identify relevant features linked to certain output variables that are to be modeled or predicted. To obtain a small but meaningful model, it is important to find stochastically independent variables capturing all the information necessary to model or predict the output variables sufficiently. Therefore, in this work we introduce a framework to detect linear and non-linear dependencies between different features. As we will show, features that are in fact functions of other features do not contribute further information. Consequently, a model reduction that neglects such features preserves the relevant information, reduces noise, and thus improves the quality of the model. Furthermore, a smaller model makes it easier to adapt a model to a given system. In addition, the approach structures the dependencies within all the considered features, which provides advantages both for classical modeling, from regression to differential equations, and for machine learning. To show the generality and applicability of the presented framework, 2154 features of a data center are measured and a model is set up to classify faulty and non-faulty states of the data center. The framework automatically reduces this number to 161 features. The prediction accuracy of the reduced model even improves compared to the model trained on the full set of features. A second example is the analysis of a gene expression data set, where 9 genes are extracted from 9513 genes; based on their expression levels, two cell clusters of macrophages can be distinguished.
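The abstract does not spell out the detection algorithm itself, so the following is only a minimal sketch of the core idea in Python: a feature that is (nearly) a function of already retained features carries no additional information and can be dropped. The function `reduce_features`, the use of a k-nearest-neighbors regressor as the dependency detector, and the `r2_threshold` value are illustrative assumptions, not the authors' method.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

def reduce_features(X, r2_threshold=0.95):
    """Greedily keep features that cannot be predicted from the kept ones.

    A feature is treated as redundant if a non-parametric regressor (k-NN)
    reconstructs it from the retained features with cross-validated R^2
    above `r2_threshold`; this covers linear as well as non-linear
    functional dependencies.
    """
    kept = []
    for j in range(X.shape[1]):
        if kept:
            r2 = cross_val_score(
                KNeighborsRegressor(n_neighbors=5),
                X[:, kept], X[:, j], cv=5, scoring="r2",
            ).mean()
            if r2 > r2_threshold:
                continue  # x_j is (nearly) a function of the kept features
        kept.append(j)
    return kept

# Toy check: x2 = x1**2 is a non-linear function of x1 and should be dropped.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x3 = rng.normal(size=500)
X = np.column_stack([x1, x1 ** 2, x3])
print(reduce_features(X))  # expected: [0, 2]
```

A non-parametric regressor is used here precisely because a linear check (e.g., correlation) would miss dependencies such as x2 = x1**2; any detector sensitive to non-linear functional dependence would serve the same purpose in this sketch.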