Variable Selection

Context: Statisticians and data miners are used to building predictive models and inferring dependencies between variables on the basis of observed data. However, in many emerging domains, such as bioinformatics or text mining, they face datasets characterized by a very large number of features (up to several thousand), a large amount of noise, non-linear dependencies and, often, only a few hundred samples. In this context, the detection of functional relationships as well as the design of effective classifiers is a major challenge.

Results:

DISR: We developed the double input symmetrical relevance (DISR) criterion. The rationale of this method is that a set of variables can provide more information about the output class than the sum of the information provided by each variable taken individually. This property results from variable interaction. Additionally, DISR is well suited to large datasets because of its low computational cost (LNCS 2006).

MASSIVE: We showed that a variable selection approach based on DISR can be formulated as a quadratic optimization problem: the Dispersion Sum Problem (DSP). To solve this problem, we use a strategy based on Backward Elimination and Sequential Replacement (BESR). MASSIVE, the combination of the DISR criterion with BESR, is shown to be efficient compared to state-of-the-art feature selection methods (IEEE JSTSP 2008).

mIMR: The importance of bringing causality into play when designing feature selection methods is increasingly acknowledged in the machine learning community. We proposed a variant of DISR that aims to prioritise direct causal relationships in feature selection problems where the ratio between the number of features and the number of samples is high. This approach is based on the notion of interaction, which is shown to be informative about the relevance of an input subset as well as its causal relationship with the target. The resulting filter is called mIMR (min-Interaction Max-Relevance) (ICML 2010).
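To make the DISR idea concrete, here is a minimal plug-in sketch in Python on already-discretized data. The function names and the pairwise-sum form of the score follow the description above but are illustrative only; this is not the MASSIVE implementation, which uses the BESR search strategy rather than exhaustive pairwise sums.

```python
import math
from collections import Counter

def entropy(symbols):
    # Empirical (plug-in) entropy, in nats, of a sequence of discrete symbols.
    n = len(symbols)
    return -sum((c / n) * math.log(c / n) for c in Counter(symbols).values())

def mutual_information(x, y):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), all from plug-in estimates.
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def symmetrical_relevance(xi, xj, y):
    # SR((Xi,Xj); Y): the information a pair of variables carries about the
    # class, normalised by the joint entropy H(Xi, Xj, Y).
    pair = list(zip(xi, xj))
    return mutual_information(pair, y) / entropy(list(zip(xi, xj, y)))

def disr_score(X, y):
    # DISR score of a variable subset X: the sum of symmetrical relevances
    # over all pairs of variables in the subset.
    return sum(symmetrical_relevance(X[i], X[j], y)
               for i in range(len(X)) for j in range(i + 1, len(X)))
```

The XOR pattern illustrates the complementarity property the text describes: with x1 = [0, 0, 1, 1], x2 = [0, 1, 0, 1] and y = [0, 1, 1, 0], each variable alone has zero mutual information with the class, yet disr_score([x1, x2], y) is positive (0.5 here), because the pair jointly determines the class.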
Application (Bioinformatics): We applied our methods to microarray data. Variable selection applied to microarray data makes it possible to identify a cell signature that can be used for diagnosis, i.e., differentiating malignant tumor cells from benign ones, and also for prognosis, i.e., distinguishing tumor cells that are sensitive to a treatment from those that do not respond to it.

Information Theory

Context: Variable selection and network inference are subdomains of the data-mining field. However, few methods in these fields can deal with non-linearity together with a large number of variables. We therefore needed to resort to more specific techniques. Information-theoretic methods offer an effective solution to these two issues. Our methods use mutual information, an information-theoretic measure of dependency. First, mutual information is a model-independent measure of information that has been used in data analysis to define concepts such as variable relevance, redundancy and interaction, but also to redefine theoretical machine learning concepts. Secondly, mutual information captures non-linear dependencies. Finally, mutual information is rather fast to compute, so it can be computed a large number of times in a reasonable amount of time, as required by datasets with a large number of variables.

Results: We introduced the infotheo R/C++ package to compute information-theoretic measures from a limited number of samples. This package provides a set of six information-theoretic measures (entropy, conditional entropy, mutual information, conditional mutual information, multi-information and interaction information), four different entropy estimators (empirical, Miller-Madow, Schurmann-Grassberger and shrink) and three discretization methods (equal width, global equal width and equal frequencies binning).

Robotics
Context:
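Returning to the Information Theory results above: two of the entropy estimators (empirical and Miller-Madow) and the equal-frequencies discretization provided by infotheo can be sketched as follows. This is a Python illustration under plug-in assumptions, not infotheo's R API; the function names are hypothetical.

```python
import math
from collections import Counter
import numpy as np

def discretize_equalfreq(x, nbins):
    # Equal-frequency binning: cut points at empirical quantiles, so that
    # each bin receives roughly the same number of samples.
    cuts = np.quantile(x, np.linspace(0, 1, nbins + 1)[1:-1])
    return np.searchsorted(cuts, x, side="right")

def entropy_empirical(symbols):
    # Plug-in estimator: -sum p*log(p) over observed bin frequencies (nats).
    n = len(symbols)
    return -sum((c / n) * math.log(c / n) for c in Counter(symbols).values())

def entropy_millermadow(symbols):
    # Miller-Madow estimator: the plug-in estimate plus the first-order bias
    # correction (m - 1) / (2n), where m is the number of occupied bins.
    m = len(Counter(symbols))
    n = len(symbols)
    return entropy_empirical(symbols) + (m - 1) / (2 * n)

def mutual_information(x, y, entropy=entropy_empirical):
    # I(X;Y) = H(X) + H(Y) - H(X,Y), with a pluggable entropy estimator.
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))
```

On small samples the plug-in estimate is biased downward, which is why estimators such as Miller-Madow (and the shrink estimator in infotheo) add a correction before the measure is used for feature scoring.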