Second, assuming that the transverse momentum of the Z boson is small, the ℓ and τ lepton can be expected to have transverse momenta that are equal in magnitude (but opposite in direction). In other words,

$$ p_\mathrm{T}(\ell) = p_\mathrm{T}(\tau) = \alpha \, p_\mathrm{T}(\tau_\text{had-vis}) . \qquad (4.7) $$

The difference between the values of α estimated by Equations (4.6) and (4.7) gives Δα in Equation (4.4). For signal events, where the assumptions needed to estimate α are justified, Δα is expected to be close to zero. In contrast, for background events, where these assumptions generally do not hold, Δα is expected to deviate significantly from zero. This makes Δα a powerful discriminant.

Including both low-level and high-level kinematic variables as input is found to be beneficial to the performance of the NN classifiers. Since the high-level variables are derived from the low-level ones, the information they provide is redundant, and one might expect their addition to be unnecessary and the low-level variables alone to suffice. Although that might be true in theory, the finite training sample size and computing resources limit the ability of the NNs to fully explore the correlations between the low-level variables. In this regard, the addition of high-level variables helps the NNs converge faster, while they continue to exploit the residual correlations between the low-level variables.

The input variables to the NN classifiers are listed in Table 4.6. The expected background and signal distributions of all the input variables in the SR are shown in Appendix C.

4.2.3. Software, architecture and optimiser

The NN training and optimisation are implemented using the open-source software package Keras [94] with the TensorFlow [95] backend.

All of the NNs used in the analysis share the same architecture. Each NN is a feed-forward network consisting of an input layer, two hidden layers of 20 neurons each, and an output layer with a single node. Each layer is fully connected to its neighbouring layers. Low-level and high-level variables are treated equally in the input layer, with one input node per variable. The hidden-layer neurons are rectified linear units, while the activation of the output node is a standard logistic sigmoid function. The number of hidden layers and the number of neurons per layer have been chosen by maximising the area under the ROC curve. The optimisation is done semi-manually with a coarse grid search.

The NNs are trained using the Adam algorithm [96] to minimise the binary cross-entropy. All NNs are trained with a batch size of 256 for 200 epochs. Optimised together with the number of hidden layers and the number of neurons per layer, the learning rate of the optimiser is set to 5 × 10⁻⁵ for the NN_Zττ classifier in the 1P regions, 2.5 × 10⁻⁴ for the NN_Zℓℓ classifier, and 1 × 10⁻⁴ for the other classifiers. There is no indication of over-training, as the loss of the NNs on the evaluation sample sets stabilises and does not increase.
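Returning to the Δα discriminant introduced at the start of this section: under the assumption behind Equation (4.7), α = pT(ℓ)/pT(τ_had-vis), and Δα is the difference between this estimate and the one from Equation (4.6). The sketch below spells this out. Since Equation (4.6) is not reproduced on this page, its value is taken as an input; the function name, argument names and the sign convention of Equation (4.4) are illustrative assumptions rather than the analysis code.

```python
# Hypothetical helper illustrating the Delta-alpha construction. The alpha
# estimate from Eq. (4.6) (based on the Z-boson mass constraint) is passed in,
# since that equation is not reproduced on this page; the sign convention of
# Eq. (4.4) is assumed.
def delta_alpha(alpha_eq46: float, pt_lep: float, pt_tau_had_vis: float) -> float:
    # Eq. (4.7): pT(lep) = alpha * pT(tau_had-vis)  =>  alpha = pT(lep) / pT(tau_had-vis)
    alpha_eq47 = pt_lep / pt_tau_had_vis
    # Delta-alpha is the difference between the two alpha estimates (Eq. (4.4))
    return alpha_eq46 - alpha_eq47
```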
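To make the setup concrete, the following is a minimal sketch of one such classifier in Keras with the TensorFlow backend, following the architecture and training configuration described above. The number of input variables, the dummy training and evaluation arrays, and the choice of the 10⁻⁴ learning rate are illustrative assumptions; in the analysis there is one input node per variable in Table 4.6 and the learning rate depends on the classifier.

```python
# Minimal sketch of the classifier architecture and training configuration
# described above, using tf.keras (Keras with the TensorFlow backend).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_inputs = 15  # placeholder: one input node per variable in Table 4.6

# Feed-forward network: two fully connected hidden layers of 20 rectified
# linear units each, and a single sigmoid output node.
inputs = keras.Input(shape=(n_inputs,))
hidden = layers.Dense(20, activation="relu")(inputs)
hidden = layers.Dense(20, activation="relu")(hidden)
output = layers.Dense(1, activation="sigmoid")(hidden)
model = keras.Model(inputs=inputs, outputs=output)

# Adam optimiser minimising the binary cross-entropy. The learning rate is
# classifier-dependent (5e-5, 2.5e-4 or 1e-4); 1e-4 is used here as an example.
model.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-4),
    loss="binary_crossentropy",
)

# Dummy inputs and binary signal/background labels, only so that the sketch
# runs stand-alone.
rng = np.random.default_rng(seed=0)
x_train = rng.normal(size=(20000, n_inputs)).astype("float32")
y_train = rng.integers(0, 2, size=(20000, 1)).astype("float32")
x_val = rng.normal(size=(5000, n_inputs)).astype("float32")
y_val = rng.integers(0, 2, size=(5000, 1)).astype("float32")

# Batch size of 256 for 200 epochs; the loss on the evaluation sample is
# monitored for signs of over-training.
model.fit(
    x_train, y_train,
    validation_data=(x_val, y_val),
    batch_size=256,
    epochs=200,
)
```

The loss history returned by model.fit on the evaluation (validation) sample can then be inspected, in the spirit of the over-training check mentioned above.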
