Understanding the Role of Nonlinearity in Training Dynamics of Contrastive Learning
While the empirical success of self-supervised learning (SSL) heavily relies on the usage of deep nonlinear models, many theoretical works proposed to understand SSL still focus on linear ones. In this paper, we study the role of nonlinearity in the training dynamics of contrastive learning (CL) on one and two-layer nonlinear networks with homogeneous activation h(x) = h'(x)x. We theoretically demonstrate that (1) the presence of nonlinearity leads to many local optima even in 1-layer setting, each corresponding to certain patterns from the data distribution, while with linear activation, only one major pattern can be learned; and (2) nonlinearity leads to specialized weights into diverse patterns, a behavior that linear activation is proven not capable of. These findings suggest that models with lots of parameters can be regarded as a brute-force way to find these local optima induced by nonlinearity, a possible underlying reason why empirical observations such as the lottery ticket hypothesis hold. In addition, for 2-layer setting, we also discover global modulation: those local patterns discriminative from the perspective of global-level patterns are prioritized to learn, further characterizing the learning process. Simulation verifies our theoretical findings.
READ FULL TEXT