Particle Physics Seminar
"Mutual Information as an Upper-Limit of Separability in Machine Learning in HEP/NP"
Presented by Dr Nicholas Carrara
Thursday, March 11, 2021, 3:00 pm — Videoconference / Virtual Event (see link below)
Abstract: I'll present results of work from several publications which utilize mutual information for machine learning tasks in HEP/NP. We developed a method for using mutual information as an upper-limit of separability for any algorithm which attempts to separate events according to a set of classes. Most common among these in HEP/NP are binary classification problems in which one attempts to determine whether an event (data point) originated from the distribution of interest (signal) or some other distribution (background). ML algorithms are often deployed to tackle these problems by training on some known samples. We show that the upper-limit,
(a) determines a priori how well any algorithm can do in principle, and hence provides a natural mechanism for when to stop training.
(b) is quickly computable from data.
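As a rough illustration of computing mutual information directly from data, the sketch below estimates I(X; Y) in bits between a single feature and a binary signal/background label using a simple histogram (plug-in) estimator. This is not the estimator used in the work presented; the function name `mutual_information_bits`, the bin count, and the toy Gaussian samples are all assumptions made for the example. It relies only on the general fact that I(X; Y) cannot exceed the label entropy H(Y), which is 1 bit for balanced classes.

```python
import numpy as np

def mutual_information_bits(x, y, bins=32):
    """Histogram (plug-in) estimate of I(X; Y) in bits.

    x: 1-D feature values; y: binary labels (0 = background, 1 = signal).
    Hypothetical helper for illustration, not the speaker's method.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    edges = np.histogram_bin_edges(x, bins=bins)
    # Joint counts: one row per class, one column per feature bin.
    joint = np.stack(
        [np.histogram(x[y == c], bins=edges)[0] for c in (0, 1)]
    ).astype(float)
    p_xy = joint / joint.sum()
    p_y = p_xy.sum(axis=1, keepdims=True)   # shape (2, 1)
    p_x = p_xy.sum(axis=0, keepdims=True)   # shape (1, bins)
    mask = p_xy > 0                          # skip empty cells (0 * log 0 = 0)
    return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_y @ p_x)[mask])))

# Toy data (assumed): roughly balanced labels.
rng = np.random.default_rng(0)
n = 20000
y = rng.integers(0, 2, size=n)
# Fully separated classes: the feature determines the label.
x_sep = rng.normal(loc=np.where(y == 1, 10.0, -10.0), scale=1.0)
# Identical distributions: the feature carries no label information.
x_same = rng.normal(size=n)

mi_sep = mutual_information_bits(x_sep, y)
mi_same = mutual_information_bits(x_same, y)
print(f"separated: {mi_sep:.3f} bits, identical: {mi_same:.3f} bits")
```

In the separated case the estimate sits near the 1-bit ceiling set by H(Y), and in the identical case it is close to zero apart from a small positive bias typical of histogram estimators.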
The second use of the upper-limit is for the task of feature/variable selection. Often in HEP/NP one is given a set of variables to use in classification/event reconstruction tasks, which can at times be large (∼10-100 variables) and difficult to work with. One way to reduce the number of variables used is to employ some type of feature/variable selection algorithm, which searches the subspaces of variable space in an attempt to find a maximally informative set of variables while keeping only a set number of variables out of the full list. The standard approach involves training on each chosen subset and then comparing the results with those of other subset choices. This can be computationally cumbersome, especially when the variable space is large.
We've also developed an algorithm called the mutual information search tree (MIST) that not only
(a) calculates the upper-limit for large dimensional data sets, but
(b) allows one to sample the subspaces of variable space and find maximally informative subspaces without having to train on each subspace.
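The idea of ranking variable subsets by their information content, with no classifier trained, can be sketched as follows. This is explicitly not MIST, whose internals are not described in this announcement; it is a generic greedy forward selection over a discretized joint mutual information estimate. The helpers `discretize`, `joint_mi_bits`, and `greedy_select`, the quantile binning, and the toy data are all assumptions for illustration, and this simple estimator degrades quickly as the subset dimension grows.

```python
import numpy as np

def discretize(X, bins=4):
    """Quantile-bin each column of X into integer codes 0..bins-1."""
    X = np.asarray(X, dtype=float)
    codes = np.empty(X.shape, dtype=np.int64)
    for j in range(X.shape[1]):
        edges = np.quantile(X[:, j], np.linspace(0, 1, bins + 1)[1:-1])
        codes[:, j] = np.searchsorted(edges, X[:, j])
    return codes

def joint_mi_bits(codes, y, subset):
    """Plug-in estimate of I(X_subset; Y) in bits for binary labels y."""
    # Collapse the chosen columns into one joint symbol per event.
    _, joint = np.unique(codes[:, subset], axis=0, return_inverse=True)
    joint = joint.ravel()
    counts = np.zeros((joint.max() + 1, 2))
    np.add.at(counts, (joint, y), 1)
    p = counts / counts.sum()
    p_x = p.sum(axis=1, keepdims=True)
    p_y = p.sum(axis=0, keepdims=True)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / (p_x @ p_y)[mask])))

def greedy_select(codes, y, k):
    """Add, one at a time, the variable that most increases the joint MI."""
    chosen, remaining = [], list(range(codes.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: joint_mi_bits(codes, y, chosen + [j]))
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy data (assumed): 6 variables, only columns 1 and 4 determine the label.
rng = np.random.default_rng(1)
X = rng.normal(size=(5000, 6))
y = (X[:, 1] + X[:, 4] > 0).astype(int)

codes = discretize(X)
selected = greedy_select(codes, y, k=2)
print("selected variables:", sorted(selected))
```

Each candidate subset costs one pass over the data to score, which is the kind of saving over per-subset training that the abstract describes; a greedy search is only one of several possible ways to explore the subspaces.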
I'll discuss results of using MIST on a widely analyzed data set, the Kaggle HiggsML Challenge data set, which concerns a binary classification problem associated with a mock Higgs search. MIST was able to find a subset of variables from the Higgs set which
(a) included 9 out of the 30 discriminating variables,
(b) contained the maximal amount of information (i.e. the upper-limit), and
(c) took only 20 minutes to find.
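To give a sense of scale for why training on every candidate subset is cumbersome, the short calculation below counts how many distinct 9-variable subsets can be drawn from 30 variables. It assumes an exhaustive search at a fixed subset size of 9, which is a hypothetical worst case for illustration and not a procedure described in the talk.

```python
from math import comb

n_variables = 30   # discriminating variables in the Higgs set
subset_size = 9    # size of the subset reported above

# Number of distinct 9-variable subsets of 30 variables.
n_subsets = comb(n_variables, subset_size)
print(f"{n_subsets:,} possible subsets")
```

Allowing subsets of every size rather than only 9 would raise the count to 2^30, roughly a billion, so any approach that trains a model per subset must sample only a small fraction of them.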
Hosted by: Hanyu Wei