### *3.4. Dropout*

Neural networks are prone to overfitting. Overfitting occurs when a model is trained so exhaustively on a particular dataset that it fails to generalize; as a result, it cannot perform effectively on new, unseen data. One remedy is to train multiple models and combine their outputs afterwards, but this is highly inefficient.

Srivastava et al. [**?** ] proposed randomly dropping units from the network during the training phase. Their results showed improved regularization across diverse datasets. Spatial dropout performs the same process over an entire axis of elements (e.g., whole feature maps), rather than over individual units in each layer. Furthermore, dropout has significant potential to reduce overfitting and provides improvements over other regularization strategies such as *L*<sub>2</sub>-regularization and soft weight sharing [**?** ].
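The two variants above can be sketched in a few lines of NumPy. This is a minimal illustration (the function names and the inverted-dropout scaling convention are our assumptions, not taken from the cited papers): standard dropout zeroes each unit independently with probability *p* and scales the survivors by 1/(1 − *p*), while spatial dropout makes one keep/drop decision per slice along a chosen axis.

```python
import numpy as np

def dropout(x, p=0.5, rng=None):
    # Inverted dropout (illustrative sketch): zero each unit with probability
    # p, scale survivors by 1/(1-p) so the expected activation is unchanged.
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= p           # keep each unit with prob. 1-p
    return x * mask / (1.0 - p)

def spatial_dropout(x, p=0.5, rng=None):
    # Spatial variant: one keep/drop decision per column (last axis), so an
    # entire slice of elements is dropped together rather than random units.
    rng = rng or np.random.default_rng(0)
    mask = rng.random((1, x.shape[-1])) >= p
    return x * mask / (1.0 - p)

acts = np.ones((4, 8))
out = dropout(acts, p=0.5)          # mix of 0.0 and 2.0 entries
sp = spatial_dropout(acts, p=0.5)   # whole columns zeroed or scaled together
```

At test time both functions would simply return `x` unchanged; only the training pass applies the mask.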

### *3.5. LSTM and Gated Recurrent Unit*

A feed-forward neural network has a unidirectional processing path: from input, through the hidden layers, to output. A recurrent network allows information to travel in both directions by using feedback loops: computations derived from earlier inputs are fed back into the network, imitating human memory. In essence, a recurrent neural network is a chain of identical neural networks that transfer the derived knowledge from one link to the next. That chaining creates learning dependencies that decay as the chained network grows, mainly because the repeated steps cause gradients to vanish.
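The chain structure described above can be sketched as a vanilla recurrent loop (a minimal sketch; the weight names `Wxh`, `Whh`, `bh` are illustrative assumptions, not from the text): the same cell is applied at every time step, and the hidden state carries knowledge from earlier inputs forward along the chain.

```python
import numpy as np

def rnn_forward(xs, h0, Wxh, Whh, bh):
    # One identical cell per chain link; the previous hidden state h is fed
    # back into the network at every step (the feedback loop).
    h = h0
    states = []
    for x in xs:
        h = np.tanh(x @ Wxh + h @ Whh + bh)
        states.append(h)
    return states

rng = np.random.default_rng(0)
d_in, d_h, steps = 4, 5, 3
xs = [rng.standard_normal(d_in) for _ in range(steps)]
states = rnn_forward(xs, np.zeros(d_h),
                     rng.standard_normal((d_in, d_h)) * 0.1,
                     rng.standard_normal((d_h, d_h)) * 0.1,
                     np.zeros(d_h))
```

Because the same `Whh` is multiplied in at every step, gradients flowing back through a long chain shrink (or blow up) geometrically, which is the decay the LSTM was designed to counter.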

Hochreiter and Schmidhuber [**?** ] proposed the Long Short-Term Memory (LSTM) network to counter that decay. The novel unit of the LSTM architecture is the memory cell, which forgets or retains the information passed from the previous chain link. The Gated Recurrent Unit (GRU) of Model 2 was introduced by Cho et al. [**?** ]. Its architecture is similar to that of the LSTM unit, but it lacks an output gate. GRU networks have been shown to perform well on Natural Language Processing tasks [**?** ].
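A single GRU step can be sketched as follows (a minimal NumPy illustration under one common gating convention; the weight names `Wz`, `Uz`, etc. are our assumptions): an update gate interpolates between the old hidden state and a candidate state, a reset gate controls how much of the old state enters that candidate, and, unlike the LSTM, there is no separate output gate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    z = sigmoid(x @ Wz + h @ Uz)              # update gate
    r = sigmoid(x @ Wr + h @ Ur)              # reset gate
    h_cand = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    # Interpolate old and candidate state; the new h is exposed directly,
    # with no separate output gate as in the LSTM.
    return (1.0 - z) * h + z * h_cand

rng = np.random.default_rng(1)
d_in, d_h = 4, 5
W = lambda a, b: rng.standard_normal((a, b)) * 0.1
params = (W(d_in, d_h), W(d_h, d_h), W(d_in, d_h), W(d_h, d_h),
          W(d_in, d_h), W(d_h, d_h))
h = np.zeros(d_h)
for _ in range(3):                            # three chain links
    h = gru_step(rng.standard_normal(d_in), h, *params)
```

The fewer gates make the GRU cheaper to train than the LSTM while retaining the ability to remember or discard information across chain links.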
