Machine learning project: Detecting the Higgs boson in CERN data
This project consists of applying machine learning algorithms to datasets provided by CERN, with the aim of classifying events measured by the ATLAS experiment and distinguishing events that originate from a Higgs boson decay signature.
The Higgs boson is an elementary particle in the Standard Model, which explains why other particles have mass. When protons collide at high speed, they generate smaller particles as by-products of the collision, and at times these collisions produce a Higgs boson. Since the Higgs boson decays rapidly into other particles, scientists don't observe it directly, but rather measure its decay signature.
Many decay signatures look similar, so in this project we used different machine learning algorithms to train models that predict whether a given event's signature was produced by a Higgs boson (signal) or by some other process/particle (background). An important part of training the models is preprocessing and feature selection on the dataset. Separating entries by relevant features, dealing with missing values, and setting hyperparameters proved crucial to achieving a high accuracy level.
This article is based on my GitHub repository containing our report, Python source code, and notebooks.
II. Data preprocessing
In this section we describe how we first explored the data, then preprocessed and cleaned it. The following subsections briefly cover our observations and the actions we took to prepare the data for the different models. A more detailed account can be found in the Jupyter notebooks we used to interactively execute our code.
A. Data exploration
We were given two sets of data in CSV format, a training set and a test set. Both contain event samples described by 30 different features, mainly numerical. In the training set, an additional output column y is provided, whose value indicates whether the recorded signal corresponds to an actual event, i.e. a Higgs boson (1), or to background noise (-1).
The description and the analysis of the dataset can be found in the GitHub repository.
After closely exploring all the data, we found that only a quarter of the dataset is actually complete: 11 of the 30 features have missing values (encoded as -999). However, these missing values turn out to be closely linked to the jet property. We therefore took the approach of separating the data based on this property and applying machine learning algorithms to compute the most likely value for each missing entry, using the complete data we have.
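As a first exploration step, counting the -999 entries per column immediately reveals which features are affected and how much of the dataset is complete. A minimal sketch (the helper name and the toy array are illustrative, not from the project code):

```python
import numpy as np

def missing_value_summary(X, missing_code=-999):
    """Count missing entries per feature and fully complete rows."""
    missing = (X == missing_code)
    counts = missing.sum(axis=0)              # missing entries per column
    complete_rows = (~missing.any(axis=1)).sum()  # rows with no missing value
    return counts, complete_rows

# Toy data: 4 samples, 3 features, -999 marks a missing entry.
X = np.array([[1.0, -999.0, 3.0],
              [4.0,    5.0, 6.0],
              [-999.0, -999.0, 9.0],
              [7.0,    8.0, 2.0]])
counts, complete = missing_value_summary(X)
print(counts)    # -> [1 2 0]
print(complete)  # -> 2
```

Run on the real training set, this kind of summary is what surfaces the 11 affected features mentioned above.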
The first step was then to separate our data into 4 subsets according to the jet number. The rows with missing values were then extracted from each subset to create 4 training datasets. On each of them, we applied models that allowed us to predict the missing values using ridge regression and data augmentation. To obtain more accurate results, we carefully adapted the model to each of the 4 jet numbers. For example, we realized that a degree-2 polynomial expansion of the jet 0 features was more effective, whereas it introduced too much overfitting for the others.
This approach gave us very good results, and we obtained a good model by tuning hyperparameters. Using our models, we were then able to replace all the missing values in our 4 datasets with predicted ones. Of course, all the data wrangling performed on the training set was also applied to the test set.
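The imputation idea above can be sketched as follows: for a feature with missing entries, fit a ridge regression on the rows where it is observed, using the fully complete features as predictors, then predict the missing entries. This is a simplified sketch of the approach, not the project's actual code; the function names and the regularization value are assumptions.

```python
import numpy as np

def ridge_fit(A, y, lambda_):
    """Closed-form ridge regression: w = (A^T A + lambda I)^-1 A^T y."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lambda_ * np.eye(n), A.T @ y)

def impute_column(X, col, lambda_=1e-3, missing=-999):
    """Fill the missing entries of one feature from the complete features."""
    observed = X[:, col] != missing
    # Predictors: columns with no missing value anywhere.
    complete_cols = np.all(X != missing, axis=0)
    w = ridge_fit(X[observed][:, complete_cols], X[observed, col], lambda_)
    X_filled = X.copy()
    X_filled[~observed, col] = X[~observed][:, complete_cols] @ w
    return X_filled
```

In the project, this kind of fit would be done separately on each jet-number subset, with the polynomial degree adapted per subset as described above.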
A visual example of the notebook used for the data preprocessing can be found in the same GitHub repository.
III. Machine Learning models
A. Model selection
The following models were tested:
– Gradient and stochastic gradient descent (GD & SGD)
– Least squares
– Ridge regression, with polynomial basis
– Logistic and regularized logistic regression
| Model | Accuracy | Loss |
|---|---|---|
| Least squares GD | 62.81% | 0.7582 |
| Least squares SGD | – | 0.7582 |
The table above summarizes our results for the models. Based on these results, we selected the ridge regression model.
B. Data expansion
Before fitting the model, we performed a polynomial expansion of the features in our train and test sets to let the model learn more complex dependencies, adding:
– Powers of each feature up to a degree d, whose optimal value was determined with grid search for each subset of data
– Cross-term products (to capture correlations between pairs of features)
– Cross-term squared products
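The expansion described above can be sketched as a single function; the exact set of terms used in the project may differ, so treat this as an illustration, assuming a bias column plus powers, pairwise products, and their squares:

```python
import numpy as np
from itertools import combinations

def build_poly(X, degree):
    """Expand features: bias, powers 1..degree of each column,
    pairwise cross-term products x_i * x_j, and their squares."""
    n = X.shape[0]
    parts = [np.ones((n, 1))]                 # bias column
    for d in range(1, degree + 1):
        parts.append(X ** d)                  # per-feature powers
    for i, j in combinations(range(X.shape[1]), 2):
        cross = (X[:, i] * X[:, j]).reshape(-1, 1)
        parts.append(cross)                   # cross-term product
        parts.append(cross ** 2)              # cross-term squared product
    return np.hstack(parts)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
print(build_poly(X, 2).shape)  # -> (2, 7): 1 bias + 4 powers + 1 cross + 1 cross^2
```

The feature count grows quadratically with the number of columns, which is one reason the optimal degree has to be tuned per subset rather than fixed globally.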
C. Cross validation
We performed 10-fold cross-validation in order to improve the reliability of the model. It consists of splitting the training dataset into 10 subsets of equal size: 9 subsets are used to train the models, and the last one to test them. This process is repeated for all 10 possible combinations of training and testing subsets. It allows us to make sure a model is stable and that we did not overfit it on the training set.
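The procedure just described can be sketched generically; `fit` and `predict` stand in for whichever model is being validated (they are placeholders, not project functions):

```python
import numpy as np

def cross_validate(X, y, fit, predict, k=10, seed=1):
    """k-fold cross-validation: train on k-1 folds, test on the held-out
    fold, repeat for every fold, and average the test accuracy."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        accs.append(np.mean(predict(model, X[test_idx]) == y[test_idx]))
    return np.mean(accs)
```

Averaging over all 10 folds is what makes the accuracy estimate robust to any single lucky or unlucky split.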
D. Parameter definition
We finally used grid search to set the values of the different hyperparameters used in our model. We did this for each of the four subsets of data (one per jet number). The results are shown in the following table.
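A grid search of this kind simply evaluates every candidate combination and keeps the one with the lowest validation error. A minimal sketch for ridge regression over degree and lambda (the helper names and grids are illustrative, not the project's actual values):

```python
import numpy as np

def ridge_fit(A, y, lambda_):
    """Closed-form ridge regression solution."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + lambda_ * np.eye(n), A.T @ y)

def grid_search(X_tr, y_tr, X_val, y_val, degrees, lambdas):
    """Pick the (degree, lambda) pair minimizing validation MSE."""
    best = None
    for d in degrees:
        # Simple power expansion up to degree d for each candidate.
        A_tr = np.hstack([X_tr ** p for p in range(1, d + 1)])
        A_val = np.hstack([X_val ** p for p in range(1, d + 1)])
        for lam in lambdas:
            w = ridge_fit(A_tr, y_tr, lam)
            err = np.mean((A_val @ w - y_val) ** 2)
            if best is None or err < best[0]:
                best = (err, d, lam)
    return best  # (validation error, best degree, best lambda)
```

In practice the validation error would come from the cross-validation procedure above rather than a single held-out split; this sketch uses one split for brevity.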
The models were evaluated on the provided training dataset, split into training and test subsets, in order to measure their performance. Once we finished expanding the data and setting the hyperparameters, we were able to make the final predictions on the test data and save the labels for the Kaggle submission. The final score obtained on Kaggle was