# Machine learning project: Detecting the Higgs boson from CERN data

This project consists of applying **Machine Learning** algorithms to datasets provided by **CERN**, with the aim of classifying events measured by the **ATLAS** experiment and identifying those that match the **Higgs boson** decay signature.

### I. Introduction

The Higgs boson is an **elementary particle** in the **Standard Model**, which explains why other **particles** have **mass**. When **protons** collide at high **speed**, they generate smaller particles as by-products of the collision, and at times these collisions produce a Higgs boson. As the Higgs boson **decays** rapidly into other particles, scientists don't observe it directly, but rather **measure** its decay signature.

Many **decay signatures** look **similar**, and in this project we used **different** machine learning **algorithms** to train models for predicting whether a given event's signature was the result of a Higgs boson (signal) or of some other process/particle (**background**). An important part of training the models is the **preprocessing** and feature **selection** performed on the dataset. Separating entries by relevant features, dealing with missing values, and setting **hyperparameters** proved crucial to achieving a high level of **accuracy**.

This article is based on my GitHub repository, which contains our report, Python source code and notebooks.

https://github.com/pelletierkevin/HiggsBoson_MachineLearning

### II. Data preprocessing

In this section we will describe how we first explored the data and performed **preprocessing** and **cleaning**. The following subsections briefly describe our observations and the actions we took to prepare the data for the different models. In addition, a more detailed report can be found in the Jupyter Notebooks we used to interactively execute our code.

#### A. Data exploration

We were given two sets of data in CSV format, a **training** and a **test** set. They contain signal samples with 30 different features. In the training set, an additional **output** column `yn` is provided, whose value indicates whether the recorded signal corresponds to an actual event, i.e. a Higgs boson (1), or to background noise (-1).

The data is composed of **30 columns**, mainly containing **numerical** values.

The description and analysis of the dataset can be found in the GitHub repository in the `notebooks/visualize_pandas.ipynb` **notebook**.

#### B. Preprocessing

After **exploring** all the data closely, we found that only a **quarter** of the dataset is actually **complete**. Indeed, **11 features** out of the 30 have **missing values** (encoded as -999). However, we found that these are closely linked to the `jet` **property**. Thus, we took the approach of **separating** the data based on this property and applying machine learning algorithms to compute the most likely value for each missing entry, using the correct data we have.

The first step was then to separate our data into 4 subsets according to the `jet` number. The incorrect values were then extracted from the obtained data to create 4 **training datasets**. On each of them, we were able to apply models that allowed us to make **predictions** on missing values using ridge regression and data augmentation. In order to obtain a more accurate result, we carefully **adapted** the **model** to each of the 4 `jet` numbers. Indeed, we realized, for example, that a degree-2 polynomial expansion of the `jet0` features was more effective, whereas it caused too much **overfitting** for the others.

This approach gave us some very good results, and we managed to get a good model by tuning hyperparameters.

Using our models, we were therefore able to **replace** all the **missing values** in our 4 datasets with predicted ones. Of course, all the data wrangling performed on the **training set** was also done on the **test set**.
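A minimal sketch of this imputation idea, assuming NumPy arrays; the function names and the `lambda_` value are illustrative, not the repository's actual code:

```python
import numpy as np

def ridge_fit(X, y, lambda_):
    """Closed-form ridge regression: w = (X^T X + lambda * I)^-1 X^T y."""
    A = X.T @ X + lambda_ * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

def impute_missing(data, target_col, missing_value=-999.0, lambda_=1e-3):
    """Replace missing entries of one column with ridge predictions
    computed from the rows where that column is observed."""
    mask = data[:, target_col] == missing_value
    other_cols = [c for c in range(data.shape[1]) if c != target_col]
    # fit on the complete rows, then predict the missing entries
    w = ridge_fit(data[~mask][:, other_cols], data[~mask, target_col], lambda_)
    data[mask, target_col] = data[mask][:, other_cols] @ w
    return data
```

In this sketch the step would be repeated for each of the 11 incomplete features, within each `jet` subset.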

A visual example of the notebook used for data preprocessing can be found in the same GitHub repository: `notebooks/clean_jetnum0.ipynb`.

### III. Machine Learning models

#### A. Model selection

The following **models** were tested:

– Gradient and stochastic gradient descent (**GD** & **SGD**)

– **Least squares**

– **Ridge regression**, with a polynomial basis

– Logistic and regularized **logistic regression**
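For illustration, gradient descent on the MSE loss can be sketched as follows (a generic implementation with assumed parameter names, not necessarily the repository's code):

```python
import numpy as np

def least_squares_gd(y, tx, initial_w, max_iters, gamma):
    """Gradient descent on the MSE loss L(w) = ||y - tx @ w||^2 / (2n)."""
    w = initial_w
    n = len(y)
    for _ in range(max_iters):
        grad = -tx.T @ (y - tx @ w) / n  # gradient of the MSE loss
        w = w - gamma * grad             # step of size gamma
    return w
```

The stochastic variant (SGD) would compute the same gradient on a single sample or a mini-batch per step instead of the full dataset.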

Model | Accuracy | Loss |
---|---|---|
Least squares | 74.18% | 0.8237 |
Least squares GD | 62.81% | 0.7582 |
Least squares SGD | – | 0.7582 |
Ridge regression | 74.14% | 0.3392 |
Logistic regression | 72.23% | – |

The table above summarizes our results for all models. Based on these results, the ridge regression model was selected for submission.
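A sketch of the selected approach: closed-form ridge regression, followed by thresholding the continuous output into {-1, 1} labels. The `2 * n * lambda_` scaling of the penalty is one common convention; the actual code may normalize differently:

```python
import numpy as np

def ridge_regression(y, tx, lambda_):
    """Closed-form ridge solution w = (X^T X + 2n*lambda*I)^-1 X^T y."""
    n, d = tx.shape
    A = tx.T @ tx + 2 * n * lambda_ * np.eye(d)
    return np.linalg.solve(A, tx.T @ y)

def predict_labels(w, tx):
    """Threshold the continuous ridge output into {-1, 1} class labels."""
    return np.where(tx @ w >= 0, 1, -1)
```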

#### B. Data expansion

Before fitting the model, we performed a polynomial expansion of the features in our train and test sets to let the model learn more complex dependencies, adding:

– Powers of each feature up to degree `d`, whose optimal value was determined with **grid search** for each subset of data

– **Cross term products** (to extract **correlation** between pairs of features)

– Cross term **squared** products
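The expansion described above could look like this (an illustrative sketch; the function names and the exact set of cross terms are assumptions):

```python
import numpy as np

def build_poly(x, degree):
    """Stack a bias column and powers 1..degree of every feature."""
    cols = [np.ones((x.shape[0], 1))]
    for d in range(1, degree + 1):
        cols.append(x ** d)
    return np.hstack(cols)

def add_cross_terms(x):
    """Append pairwise products x_i * x_j (i < j) and their squares."""
    extra = []
    for i in range(x.shape[1]):
        for j in range(i + 1, x.shape[1]):
            prod = (x[:, i] * x[:, j]).reshape(-1, 1)
            extra.extend([prod, prod ** 2])
    return np.hstack([x] + extra)
```

Note that with 30 features the cross terms grow quadratically, which is one reason the optimal degree `d` differs per `jet` subset.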

#### C. Cross validation

We performed **10-fold cross validation** in order to improve the **reliability** of the model. It consists of splitting the training dataset into 10 **subsets** of equal size. 9 subsets are then used to train the models, and the last one to test them. This process is **repeated** for all 10 possible combinations of training and testing subsets. It allows us to make sure a model is stable and that we did not **overfit** it on the training set.
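A minimal sketch of how such 10-fold splits can be generated (illustrative, not the repository's code):

```python
import numpy as np

def cross_validation_splits(n_samples, k=10, seed=1):
    """Yield (train_idx, test_idx) index pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)   # shuffle once
    folds = np.array_split(indices, k)     # k disjoint folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx
```

The model is then fitted on each `train_idx` and scored on the corresponding `test_idx`, and the 10 scores are averaged.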

#### D. Parameter definition

We finally used grid search to set the values of the different hyperparameters used in our model. We did this for each of the four **subsets** of data (one per jet number). The results are shown in the following table.

Jet number | d |
---|---|
0 | 7 |
1 | 8 |
2 | 8 |
3 | 8 |

The models were tested on the provided training dataset, which was split into training and test subsets in order to measure their performance.
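The degree/lambda search described above can be sketched as follows; for brevity this sketch scores candidates on a simple 80/20 holdout split rather than the 10-fold cross validation described earlier, and all names are illustrative:

```python
import numpy as np

def grid_search(x, y, degrees, lambdas):
    """Return the (degree, lambda, mse) minimizing validation MSE."""
    n = x.shape[0]
    split = int(0.8 * n)
    best = (None, None, np.inf)
    for degree in degrees:
        # polynomial expansion: bias column plus powers 1..degree
        phi = np.hstack([np.ones((n, 1))] + [x ** d for d in range(1, degree + 1)])
        tr_x, va_x = phi[:split], phi[split:]
        tr_y, va_y = y[:split], y[split:]
        for lam in lambdas:
            # closed-form ridge fit on the training part
            A = tr_x.T @ tr_x + lam * np.eye(tr_x.shape[1])
            w = np.linalg.solve(A, tr_x.T @ tr_y)
            mse = np.mean((va_x @ w - va_y) ** 2)
            if mse < best[2]:
                best = (degree, lam, mse)
    return best
```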

### IV. Results

Once we finished expanding the data and setting the **hyperparameters**, we were able to make the final submission on the test data and save the predicted labels for the Kaggle submission. The final score obtained on Kaggle was **81.883%**.
