What is a Rectified Linear Unit?
A Rectified Linear Unit is an activation function commonly used in deep learning models. In essence, the function returns 0 if it receives a negative input, and if it receives a positive value,
it returns that same value unchanged. The function can be written as:
f(x)=max(0,x)
The rectified linear unit, or ReLU, allows a deep learning model to account for non-linearities and interaction effects.
The image above displays the graphical representation of the ReLU function. Note that any negative value of X results in an output of 0, and only once positive values are entered does the function begin to slope upward.
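The definition above translates directly into code. The following is a minimal sketch, assuming NumPy (any array library would work equally well); the function name relu is illustrative, not part of any particular framework.

import numpy as np

def relu(x):
    # Elementwise ReLU: negative values become 0, positive values pass through unchanged.
    return np.maximum(0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0, 7.0])))  # [0. 0. 0. 2. 7.]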
How does a Rectified Linear Unit work?
To understand how a ReLU works, it is important to understand its effect on variable interactions. An interaction effect is when one variable affects a prediction differently depending on the value of an associated variable. For example, when comparing the IQ scores of students at two different schools, the effect of which school a student attends may depend on the student's age: the difference between the schools could look quite different for high school students than for elementary school students. When the effect of one variable changes with the value of another in this way, the two variables interact, and ReLUs allow a network to capture this kind of interaction effect.

For example, suppose a node computes f(2A + 3B), where A and B are inputs with associated weights of 2 and 3, and f is the ReLU. If A = 1 and B = 2, the weighted sum is positive, and increasing A increases the output. However, if B is a large negative value, the weighted sum is negative, the output is 0, and increasing A slightly has no effect. The effect of A on the output therefore depends on the value of B.
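A small sketch can make this concrete. The weights 2 and 3 and the values of A and B follow the example above; the helper names relu and node are illustrative.

def relu(x):
    # Same definition as above, for a single value.
    return max(0.0, x)

def node(A, B):
    # One hidden node applying ReLU to the weighted sum 2A + 3B from the example above.
    return relu(2 * A + 3 * B)

# While B is positive, increasing A increases the output: 8.0, then 10.0.
print(node(1.0, 2.0), node(2.0, 2.0))

# When B is a large negative value, the weighted sum stays negative, the output stays 0.0,
# and increasing A has no effect: the effect of A depends on B.
print(node(1.0, -10.0), node(2.0, -10.0))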
One benefit of the ReLU function is that its simplicity makes it relatively cheap to compute. Because there is no complicated math involved, the model can be trained and run in a relatively short time. Models using ReLU also tend to converge faster: the function's slope does not plateau as the value of X gets larger, so ReLU avoids the vanishing gradient problem that affects alternative functions such as sigmoid or tanh.

Lastly, ReLU is sparsely activated, because the output is zero for all negative inputs. Sparsity is the principle that a given unit activates only in specific situations. This is a desirable feature in modern neural networks, because in a sparse network it is more likely that individual neurons are processing meaningful parts of a problem. For example, a model that processes images of fish may contain a neuron that is specialized to identify fish eyes. That neuron would not be activated if the model were processing images of airplanes instead. This specialized use of individual neurons is what makes the network sparse.
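The sketch below illustrates both points, again assuming NumPy; the sigmoid formula and the random test data are illustrative, not taken from this article. The ReLU slope stays at 1 for positive inputs while the sigmoid slope shrinks toward zero, and roughly half of zero-centered random inputs are mapped to exactly zero.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([1.0, 5.0, 10.0])

# ReLU's slope is 1 for every positive input, no matter how large.
print(np.where(x > 0, 1.0, 0.0))        # [1. 1. 1.]

# Sigmoid's slope shrinks toward 0 as the input grows (the vanishing gradient).
print(sigmoid(x) * (1.0 - sigmoid(x)))  # roughly [0.197, 0.0066, 0.000045]

# Sparsity: about half of zero-centered random inputs come out as exactly 0 after ReLU.
z = np.random.randn(100_000)
print(np.mean(np.maximum(0, z) == 0))   # close to 0.5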