{"id":762478,"date":"2021-07-22T13:01:58","date_gmt":"2021-07-22T20:01:58","guid":{"rendered":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/?p=762478"},"modified":"2022-02-24T18:15:10","modified_gmt":"2022-02-25T02:15:10","slug":"on-infinitely-wide-neural-networks-that-exhibit-feature-learning","status":"publish","type":"post","link":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/blog\/on-infinitely-wide-neural-networks-that-exhibit-feature-learning\/","title":{"rendered":"On infinitely wide neural networks that exhibit feature learning"},"content":{"rendered":"\n<figure class=\"wp-block-image alignwide size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Diagram of the SGD Training Progress\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Infinite_neural_network_no_logo_animation-1.gif\"><img decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Infinite_neural_network_no_logo_animation-1.gif\" alt=\"Diagram of the SGD Training Progress\"\/><\/a><\/figure>\n\n\n\n<p>In the pursuit of learning about fundamentals of the natural world, scientists have had success with coming at discoveries from both a bottom-up and top-down approach. Neuroscience is a great example of the former. Spanish anatomist Santiago Ram\u00f3n y Cajal discovered the neuron in the late 19th century. While scientists\u2019 understanding of these building blocks of the brain has grown tremendously in the past century, much about how the brain works on the whole remains an enigma. In contrast, fluid dynamics makes use of the continuum assumption, which treats the fluid as a continuous object. 
The assumption ignores fluid\u2019s atomic makeup yet makes accurate calculations simpler in many circumstances.<\/p>\n\n\n\n<p>When it comes to neural networks (NNs), one way to build an understanding is to reason about their behaviors when every layer has infinitely many neurons, commonly known as the NN infinite-width limits. We believe taking a top-down approach, as exemplified in the fluid dynamics example, can lead to a better understanding of why practical wide NNs work and how we can improve them.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"the-journey-to-infinity\">The journey to infinity<\/h2>\n\n\n\n<p>Just like how fluid dynamics under the continuum assumption enables accurate calculations of how real fluid\u2014made of individual atoms\u2014behaves, studying the NN infinite-width limit can inform us about how wide NNs behave in practice. As larger, hence wider, NNs are trained every few months, this will only become truer going forward. The catch, however, is that we need an infinite-width limit that sufficiently captures what makes NNs so successful today. In our paper, \u201c<a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/feature-learning-in-infinite-width-neural-networks\/\">Feature Learning in Infinite-Width Neural Networks<\/a>,\u201d we carefully consider how model weights become correlated during training, which leads us to a new parametrization, the <em>Maximal Update Parametrization<\/em>, that allows all layers to learn features in the infinite-width limit for any modern neural network. 
The paper appears at the <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/event\/icml-2021\/\">Thirty-eighth International Conference on Machine Learning (ICML 2021)<\/a>.<\/p>\n\n\n\n<p>There have been two well-studied infinite-width limits for modern NNs: the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/link.springer.com\/10.1007\/978-1-4612-0745-0\">Neural Network-Gaussian Process<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (NNGP) and the <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/arxiv.org\/abs\/1806.07572\">Neural Tangent Kernel<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> (NTK). While both are illuminating to some extent, they fail to capture what makes NNs powerful, namely the ability to learn features. This is evident both theoretically and empirically. The NNGP limit explicitly considers the network at initialization and trains only a linear classifier on top of untrained features. The NTK limit allows training of the whole network\u2014but only with a small enough learning rate. This means the weights do not leave a small neighborhood of their initialization, preventing the learning of new features. Unsurprisingly, the best-performing NNGP and NTK models underperform their conventional finite-width counterparts, even when we calculate their infinite-width limits exactly.<\/p>\n\n\n\n<figure class=\"wp-block-pullquote\"><blockquote><p>&#8220;Neural Tangent Kernel doesn&#8217;t exhibit a critical element of deep learning, which is the ability to learn increasingly abstract features as we add more layers and training proceeds. 
This work takes an important step toward a theory that captures this capability in overparametrized neural networks.&#8221;<\/p><cite>Yoshua Bengio, Professor at the Universit\u00e9 de Montr\u00e9al and Scientific Director at Mila<\/cite><\/blockquote><\/figure>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"While the NNGP and NTK limits essentially only considers the neural network initialization, the feature learning limit incorporates the entire training trajectory. A Neural network is represented by a stack of vertical shapes: an inverted trapezoid, a square, and a triangle. On the left side of the shape, A blue arrow moves upward and represents the first forward pass. The NNGP limit can be thought of as the limit of this first forward pass. On the right side of the shape, a green arrow moves downward and represents the first backward pass. The NTK limit can be thought of as the limit for this first backward pass. In contrast, the feature learning limit takes into account the many cycles of forward and backward passes that take place during the entire training process. These cycles are represented by many repetitions of blue upward arrow and green downward arrows to the right of the neural network. An orange box encloses all of these cycles. On top of the box is the annotation \u201cSGD Training Progress\u201d with an arrow to the right. 
An arrow comes out from the bottom of the box pointing to a textbox that says \u201cFeature Learning Limit, This Work.\u201d\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWid_figure1.png\"><img loading=\"lazy\" decoding=\"async\" width=\"989\" height=\"533\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWid_figure1.png\" alt=\"While the NNGP and NTK limits essentially only considers the neural network initialization, the feature learning limit incorporates the entire training trajectory. A Neural network is represented by a stack of vertical shapes: an inverted trapezoid, a square, and a triangle. On the left side of the shape, A blue arrow moves upward and represents the first forward pass. The NNGP limit can be thought of as the limit of this first forward pass. On the right side of the shape, a green arrow moves downward and represents the first backward pass. The NTK limit can be thought of as the limit for this first backward pass. In contrast, the feature learning limit takes into account the many cycles of forward and backward passes that take place during the entire training process. These cycles are represented by many repetitions of blue upward arrow and green downward arrows to the right of the neural network. An orange box encloses all of these cycles. On top of the box is the annotation \u201cSGD Training Progress\u201d with an arrow to the right. 
An arrow comes out from the bottom of the box pointing to a textbox that says \u201cFeature Learning Limit, This Work.\u201d\" class=\"wp-image-762481\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWid_figure1.png 989w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWid_figure1-300x162.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWid_figure1-768x414.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWid_figure1-240x129.png 240w\" sizes=\"auto, (max-width: 989px) 100vw, 989px\" \/><\/a><figcaption>Figure 1: NNGP and NTK underperform finite-width NNs on Image Classification, Word2Vec and Omniglot, even when calculating their infinite-width limits exactly. This suggests that NNGP and NTK do not capture the learning that happens in a practical NN\u2014that is, they are not the true limit to which finite-width NNs converge. CNN result taken from <a data-bi-bhvr=\"14\"  data-bi-cn=\"While the NNGP and NTK limits essentially only considers the neural network initialization, the feature learning limit incorporates the entire training trajectory. A Neural network is represented by a stack of vertical shapes: an inverted trapezoid, a square, and a triangle. On the left side of the shape, A blue arrow moves upward and represents the first forward pass. The NNGP limit can be thought of as the limit of this first forward pass. On the right side of the shape, a green arrow moves downward and represents the first backward pass. The NTK limit can be thought of as the limit for this first backward pass. In contrast, the feature learning limit takes into account the many cycles of forward and backward passes that take place during the entire training process. These cycles are represented by many repetitions of blue upward arrow and green downward arrows to the right of the neural network. 
An orange box encloses all of these cycles. On top of the box is the annotation \u201cSGD Training Progress\u201d with an arrow to the right. An arrow comes out from the bottom of the box pointing to a textbox that says \u201cFeature Learning Limit, This Work.\u201d\" href=\"https:\/\/arxiv.org\/abs\/1904.11955\">Arora et al. (2019)<\/a>.<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"unlocking-feature-learning-by-going-beyond-model-initialization\">Unlocking feature learning by going beyond model initialization<\/h2>\n\n\n\n<p>Why do NNGP and NTK fail to learn features? Because to do so, we need to leave the \u201ccomfort zone\u201d of model initialization, where the activation coordinates are easy to analyze as they nicely follow a Gaussian law by a <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Central_limit_theorem\">central limit argument<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>\u2014that is, summing infinitely many roughly independent, zero-mean random variables should yield a Gaussian distribution with a known variance. Just as growing a plant entails not only planting a seed but also proper care throughout its lifetime, the right infinite-width limit should take into consideration both the model initialization and the gradient updates, especially far away from initialization. 
To unlock feature learning, we need to see gradient updates for what they really are: <em>matrices of a different kind from their randomly initialized counterparts<\/em>.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"Figure 2: Our new limit takes into consideration the entire training process, which makes feature learning possible.\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-2_Infinite-Wide_updated-Res.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"602\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-2_Infinite-Wide_updated-Res-1024x602.png\" alt=\"Figure 2: Our new limit takes into consideration the entire training process, which makes feature learning possible.\" class=\"wp-image-762754\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-2_Infinite-Wide_updated-Res-1024x602.png 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-2_Infinite-Wide_updated-Res-300x176.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-2_Infinite-Wide_updated-Res-768x451.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-2_Infinite-Wide_updated-Res-240x141.png 240w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-2_Infinite-Wide_updated-Res.png 1247w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 2: NNGP is essentially the limit of the first forward pass in the training process, and NTK is the limit of the first backward pass. Neither leaves the \u201ccomfort zone\u201d of model initialization, so both fail to capture feature learning. 
Our new limit takes into consideration the entire training process, which makes feature learning possible.<\/figcaption><\/figure><\/div>\n\n\n\n<p>When a matrix \\(W \\in \\mathbb{R}^{n \\times n}\\) multiplies an activation vector \\(x \\in \\mathbb{R}^n\\) to produce a pre-activation vector, we calculate each coordinate by taking a row from the matrix \\(W\\), multiplying it by \\(x\\) coordinate-wise, and summing the coordinates of the resulting vector. When \\(W\\)\u2019s entries are initialized with zero mean, this summation is across roughly independent elements with zero mean. As such, this sum is a factor of \\(\\sqrt{n}\\) smaller than it would be if the elements had nonzero mean or were strongly correlated, due to the famous square-root cancellation effect underlying phenomena like the Central Limit Theorem.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"We must go beyond the \u201ccomfort zone\u201d of initialization and venture into the \u201cunfamiliar territory\u201d of training. At initialization, the weights are independent from the incoming activations, so their product is easy to reason about (for example, by using Central Limit Theorem); hence initialization is a \u201ccomfort zone.\u201d However once training starts, the weights (more precisely, the change in weights due to the gradient updates) start to correlate with the activations, so we must exit this comfort zone. 
A Law-of-Large-Number intuition would suggest that their product is square-root-of-width larger than if there are no correlation.\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWidFig2updatedres.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"400\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWidFig2updatedres-1024x400.png\" alt=\"We must go beyond the \u201ccomfort zone\u201d of initialization and venture into the \u201cunfamiliar territory\u201d of training. At initialization, the weights are independent from the incoming activations, so their product is easy to reason about (for example, by using the Central Limit Theorem); hence initialization is a \u201ccomfort zone.\u201d However, once training starts, the weights (more precisely, the change in weights due to the gradient updates) start to correlate with the activations, so we must exit this comfort zone. A Law-of-Large-Numbers intuition would suggest that their product is square-root-of-width larger than if there were no correlation.\" class=\"wp-image-762727\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWidFig2updatedres-1024x400.png 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWidFig2updatedres-300x117.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWidFig2updatedres-768x300.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWidFig2updatedres-240x94.png 240w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/InfiniteWidFig2updatedres.png 1347w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 3: At initialization, the weights are independent from the incoming activations, so their product is easy to reason about (for example, by using 
the Central Limit Theorem); hence, initialization is a \u201ccomfort zone.\u201d However, once training starts, the weights (more precisely, the change in weights, \u0394Weights, due to the gradient updates) start to correlate with the activations, so we must exit this comfort zone. A Law-of-Large-Numbers intuition would suggest that their product is a factor of \\(\\sqrt{width}\\) larger than if there were no correlation.<\/figcaption><\/figure><\/div>\n\n\n\n<p>In fact, exactly this kind of strong correlation arises after gradient updates to \\(W\\). Let\u2019s focus on the gradient updates themselves, denoted as \\(\u0394W\\). In general, the coordinates of the vector obtained by coordinate-wise multiplying a row from \\(\u0394W\\) and the activation vector \\(x\\) will not have zero mean. This comes partly from the fact that \\(\u0394W\\) \u201cremembers\u201d the data distribution that produces the activations and partly from the model architecture (for example, the use of nonlinearity). Consequently, each entry of \\(\u0394Wx\\) will be a factor of \\(\\sqrt{n}\\) larger than one would expect by naively assuming independence and zero mean, as at initialization.<\/p>\n\n\n\n<p>The key to finding an infinite-width limit that admits feature learning is to carefully analyze when we have sufficient independence and zero mean and when we do not, just like our reasoning above. Now there is just one more step before we can derive such a limit.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"not-all-parameters-are-the-same\">Not all parameters are the same<\/h2>\n\n\n\n<p>Conventionally, say in a multi-layer perceptron (MLP), we treat all the parameters the same way by using the same initialization, like a Gaussian distribution with a variance of \\(\\frac{1}{fan\\_in}\\), and the same learning rate. 
In the infinite-width limit, however, there are two kinds of parameters with very different behaviors\u2014<em>vector-like<\/em> parameters and <em>matrix-like<\/em> parameters.<\/p>\n\n\n\n<div class=\"wp-block-image\"><figure class=\"aligncenter size-large\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"It\u2019s useful consider two kinds of parameters separately: the Vector-like and the Matrix-like parameters. On the left, heading reads Vector-like Parameters means exactly one dimension scales with width. An image of a blue horizontal rectangle has two labels. Across the long horizontal side of the rectangle, arrows pointing in both directions are labeled Width. An arrow pointing to the short vertical side is labeled Dimension independent of width e.g., input dimension. On the right, heading reads Matrix-like Parameters means exactly two dimensions scale with width. A blue square has arrows along both the left and top side of the square labeled Width.\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Fig4_infinitewide_updatedres.png\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"520\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Fig4_infinitewide_updatedres-1024x520.png\" alt=\"It\u2019s useful to consider two kinds of parameters separately: the Vector-like and the Matrix-like parameters. On the left, heading reads Vector-like Parameters means exactly one dimension scales with width. An image of a blue horizontal rectangle has two labels. Across the long horizontal side of the rectangle, arrows pointing in both directions are labeled Width. An arrow pointing to the short vertical side is labeled Dimension independent of width e.g., input dimension. On the right, heading reads Matrix-like Parameters means exactly two dimensions scale with width. 
A blue square has arrows along both the left and top side of the square labeled Width.\" class=\"wp-image-762730\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Fig4_infinitewide_updatedres-1024x520.png 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Fig4_infinitewide_updatedres-300x152.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Fig4_infinitewide_updatedres-768x390.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Fig4_infinitewide_updatedres-240x122.png 240w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Fig4_infinitewide_updatedres.png 1452w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/a><figcaption>Figure 4: When width is large, two kinds of parameters have different behaviors. Vector-like parameters have exactly 1 dimension scaling with width, while matrix-like parameters have exactly 2 such dimensions.<\/figcaption><\/figure><\/div>\n\n\n\n<p>Vector-like parameters are those with exactly one dimension that scales with width\u2014input or output layer weights and layer biases, for example. Meanwhile, matrix-like parameters have exactly two such dimensions, like hidden layer weights. The key difference is that a matrix multiplication with a vector-like parameter sometimes only sums across the finite, non-width dimension, whereas a matrix multiplication with a matrix-like parameter always sums across the width dimension, which tends to infinity. This distinction is critical in the infinite-width limit\u2014summing infinitely many elements of size \\(\u0398(1)\\) in width produces infinity, while summing finitely many elements each of size \\(\u0398(1\/{width})\\) produces zero in the limit.<\/p>\n\n\n\n<p>So far, we have introduced two kinds of weights: the random initialization and the gradient updates. 
We have also introduced two kinds of parameters: the vector-like ones and the matrix-like ones. The key is to make sure that all four combinations of these lead the activations to evolve by non-vanishing and non-exploding amounts during training. Maximal Update Parametrization \\((\u03bcP)\\) scales the initialization and parameter multipliers as a function of width to ensure this for every activation vector, thus achieving maximal feature learning. Depending on the model architecture and optimizer used, the actual parametrization could vary in complexity (see the <em>abc<\/em>-parametrization in our paper). However, the underlying principles stay the same.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"practical-impact-and-looking-forward\">Practical impact and looking forward<\/h2>\n\n\n\n<p>Maximal Update Parametrization \\((\u03bcP)\\), which follows the principles we discussed and learns features maximally in the infinite-width limit, has the potential to change the way we train neural networks. For example, we calculated the \\(\u03bcP\\) limit of <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/arxiv.org\/abs\/1301.3781\">Word2Vec<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and found it outperformed both the NTK and NNGP limits as well as finite-width networks. 
When we visualize the learned embeddings of two groups of words\u2014the names of American cities and those of states\u2014using <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis\">Principal Component Analysis<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, we see that the <em>\u03bcP<\/em> limit exhibits a clear separation between them, as in the finite neural network, while the NTK\/NNGP limits yield essentially random embeddings.<\/p>\n\n\n\n<figure class=\"wp-block-pullquote\"><blockquote><p>&#8220;The theory of wide feature learning is extremely exciting and has the potential to change the way the field thinks about large model training.&#8221;<\/p><cite>Ilya Sutskever, Co-founder and Chief Scientist at OpenAI<\/cite><\/blockquote><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><a data-bi-bhvr=\"14\"  data-bi-cn=\"The feature learning limit properly captures the representation learning behavior of finite models on Word2Vec, while the NTK limit obviously did not learn any features. Principal Component Analysis of Word2Vec embeddings of common US cities and states, for NTK, width-64, and width-\u221e (feature learning) neural networks. NTK embeddings (left plot) are essentially random\u2014there is no separation of cities and states in the embeddings. In contrast, cities and states get naturally separated in the embedding space as width increases in the feature learning regime. 
In the width-64 model (middle plot), some separation can be seen, and even more separation can be seen in the infinite-width model (right plot).\" href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid.png\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid-1024x289.png\" alt=\"The feature learning limit properly captures the representation learning behavior of finite models on Word2Vec, while the NTK limit obviously did not learn any features. Principal Component Analysis of Word2Vec embeddings of common US cities and states, for NTK, width-64, and width-\u221e (feature learning) neural networks. NTK embeddings (left plot) are essentially random\u2014there is no separation of cities and states in the embeddings. In contrast, cities and states get naturally separated in the embedding space as width increases in the feature learning regime. 
In the width-64 model (middle plot), some separation can be seen, and even more separation can be seen in the infinite-width model (right plot).\" class=\"wp-image-762769\" width=\"924\" height=\"261\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid-1024x289.png 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid-300x85.png 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid-768x217.png 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid-1536x434.png 1536w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid-2048x578.png 2048w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/Figure-5_HighRes_infinite-wid-240x68.png 240w\" sizes=\"auto, (max-width: 924px) 100vw, 924px\" \/><\/a><figcaption>Figure 5: Principal Component Analysis of Word2Vec embeddings of common US cities and states, for NTK, width-64, and width-\u221e (feature learning) neural networks. NTK embeddings (left plot) are essentially random, with no separation between cities and states. In contrast, cities and states get naturally separated in the embedding space as width increases in the feature learning regime. In the width-64 model (middle plot), some separation can be seen, and even more separation can be seen in the infinite-width model (right plot).<\/figcaption><\/figure>\n\n\n\n<p>Parametrizing a model in \\(\u03bcP\\) allows it to retain the ability to learn features when its width goes to infinity\u2014that is, the model does not become trivial (like NTK and NNGP) or run into numerical issues in the limit. We believe this new perspective opens doors to new capabilities previously unimaginable. 
Indeed, our theory enables a novel and useful paradigm for training large models, such as <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/arxiv.org\/abs\/2005.14165\">GPT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> and <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"http:\/\/arxiv.org\/abs\/1810.04805\">BERT<span class=\"sr-only\"> (opens in new tab)<\/span><\/a>, which is the topic of one of our ongoing projects. Our results also raise several questions about existing practices, for example, about uncertainty in Bayesian neural networks. &#8220;These results are also intriguing because they suggest that the infinite width-limit of feature learning leads to a deterministic training trajectory and thus precludes the use of variance due to initialization to ascertain model uncertainty,\u201d Yoshua Bengio explains. \u201cThis should inspire future works on better uncertainty estimation in the feature learning regime.\u201d<\/p>\n\n\n\n<p>Due to the dominance of Neural Tangent Kernel theory, many researchers in the community believed that large width causes neural networks to lose the ability to learn features. We decisively refute this belief in our work. However, rather than the end of a chapter, we believe this is just a new beginning with many exciting new possibilities. 
We welcome everyone to join us on this journey to unveil the mysteries of neural networks and to push deep learning to new heights.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"additional-resources\">Additional resources:<\/h3>\n\n\n\n<ul class=\"wp-block-list\"><li>Read <a href=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/publication\/feature-learning-in-infinite-width-neural-networks\/\">our paper<\/a> for a deeper dive into the technical aspects.<\/li><li>Watch <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/recorder-v3.slideslive.com\/#\/share?share=43752&s=36754550-b0c7-4af8-8839-742ad9dd3025\">our 10-min talk<span class=\"sr-only\"> (opens in new tab)<\/span><\/a> at ICML on the gist of our work.<\/li><li>Discover more about feature learning and infinite-width networks in <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.youtube.com\/watch?v=6tA7r7Y5vUM\">a presentation by Greg Yang.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li><li>Train your own infinite-width feature learning neural network with our <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/github.com\/edwardjhu\/TP4\">GitHub repository.<span class=\"sr-only\"> (opens in new tab)<\/span><\/a><\/li><li>Discover questions and comments from the machine learning community on this <a class=\"msr-external-link glyph-append glyph-append-open-in-new-tab glyph-append-xsmall\" rel=\"noopener noreferrer\" target=\"_blank\" href=\"https:\/\/www.reddit.com\/r\/MachineLearning\/comments\/k8h01q\/r_wide_neural_networks_are_feature_learners_not\/\">Reddit thread.<span class=\"sr-only\"> (opens in new 
tab)<\/span><\/a><\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In the pursuit of learning about fundamentals of the natural world, scientists have had success with coming at discoveries from both a bottom-up and top-down approach. Neuroscience is a great example of the former. Spanish anatomist Santiago Ram\u00f3n y Cajal discovered the neuron in the late 19th century. While scientists\u2019 understanding of these building blocks [&hellip;]<\/p>\n","protected":false},"author":40519,"featured_media":762820,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"msr-url-field":"","msr-podcast-episode":"","msrModifiedDate":"","msrModifiedDateEnabled":false,"ep_exclude_from_search":false,"_classifai_error":"","msr-author-ordering":null,"msr_hide_image_in_river":0,"footnotes":""},"categories":[1],"tags":[],"research-area":[13556],"msr-region":[],"msr-event-type":[],"msr-locale":[268875],"msr-post-option":[],"msr-impact-theme":[],"msr-promo-type":[],"msr-podcast-series":[],"class_list":["post-762478","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-research-blog","msr-research-area-artificial-intelligence","msr-locale-en_us"],"msr_event_details":{"start":"","end":"","location":""},"podcast_url":"","podcast_episode":"","msr_research_lab":[],"msr_impact_theme":[],"related-publications":[],"related-downloads":[],"related-videos":[],"related-academic-programs":[],"related-groups":[],"related-projects":[],"related-events":[740803],"related-researchers":[{"type":"guest","value":"edward-hu","user_id":"822325","display_name":"Edward Hu","author_link":"<a href=\"https:\/\/edwardjhu.com\/\" aria-label=\"Visit the profile page for Edward Hu\">Edward Hu<\/a>","is_active":true,"last_first":"Hu, Edward","people_section":0,"alias":"edward-hu"}],"msr_type":"Post","featured_image_thumbnail":"<img width=\"960\" height=\"540\" 
src=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-960x540.jpg\" class=\"img-object-cover\" alt=\"While the NNGP and NTK limits essentially only consider the neural network initialization, the feature learning limit incorporates the entire training trajectory. A neural network is represented by a stack of vertical shapes: an inverted trapezoid, a square, and a triangle. On the left side of the shape, a blue arrow moves upward and represents the first forward pass. The NNGP limit can be thought of as the limit of this first forward pass. On the right side of the shape, a green arrow moves downward and represents the first backward pass. The NTK limit can be thought of as the limit for this first backward pass. In contrast, the feature learning limit takes into account the many cycles of forward and backward passes that take place during the entire training process. These cycles are represented by many repetitions of blue upward arrows and green downward arrows to the right of the neural network. An orange box encloses all of these cycles. On top of the box is the annotation \u201cSGD Training Progress\u201d with an arrow to the right. 
An arrow comes out from the bottom of the box pointing to a textbox that says \u201cFeature Learning Limit, This Work.\u201d\" decoding=\"async\" loading=\"lazy\" srcset=\"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-960x540.jpg 960w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-300x169.jpg 300w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-1024x576.jpg 1024w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-768x432.jpg 768w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-1536x864.jpg 1536w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-2048x1152.jpg 2048w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-1066x600.jpg 1066w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-655x368.jpg 655w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-343x193.jpg 343w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-240x135.jpg 240w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-640x360.jpg 640w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-1280x720.jpg 1280w, https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-content\/uploads\/2021\/07\/1400x788_Neural_Network_no_logo_Still-1920x1080.jpg 1920w\" sizes=\"auto, (max-width: 960px) 100vw, 
960px\" \/>","byline":"<a href=\"https:\/\/edwardjhu.com\/\" title=\"Go to researcher profile for Edward Hu\" aria-label=\"Go to researcher profile for Edward Hu\" data-bi-type=\"byline author\" data-bi-cN=\"Edward Hu\">Edward Hu<\/a> and Greg Yang","formattedDate":"July 22, 2021","formattedExcerpt":"In the pursuit of learning about fundamentals of the natural world, scientists have had success with coming at discoveries from both a bottom-up and top-down approach. Neuroscience is a great example of the former. Spanish anatomist Santiago Ram\u00f3n y Cajal discovered the neuron in the&hellip;","locale":{"slug":"en_us","name":"English","native":"","english":"English"},"_links":{"self":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/762478","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/users\/40519"}],"replies":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/comments?post=762478"}],"version-history":[{"count":44,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/762478\/revisions"}],"predecessor-version":[{"id":822328,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/posts\/762478\/revisions\/822328"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media\/762820"}],"wp:attachment":[{"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/media?parent=762478"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/categories?post=762478"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/newe
d.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/tags?post=762478"},{"taxonomy":"msr-research-area","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/research-area?post=762478"},{"taxonomy":"msr-region","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-region?post=762478"},{"taxonomy":"msr-event-type","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-event-type?post=762478"},{"taxonomy":"msr-locale","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-locale?post=762478"},{"taxonomy":"msr-post-option","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-post-option?post=762478"},{"taxonomy":"msr-impact-theme","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-impact-theme?post=762478"},{"taxonomy":"msr-promo-type","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-promo-type?post=762478"},{"taxonomy":"msr-podcast-series","embeddable":true,"href":"https:\/\/newed.any0.dpdns.org\/en-us\/research\/wp-json\/wp\/v2\/msr-podcast-series?post=762478"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}