Preface ix

0 Initialization 1
0.1 An Effective Theory Approach 2
0.2 The Theoretical Minimum 3

1 Pretraining 11
1.1 Gaussian Integrals 12
1.2 Probability, Correlation and Statistics, and All That 21
1.3 Nearly-Gaussian Distributions 26

2 Neural Networks 37
2.1 Function Approximation 37
2.2 Activation Functions 43
2.3 Ensembles 47

3 Effective Theory of Deep Linear Networks at Initialization 53
3.1 Deep Linear Networks 54
3.2 Criticality 56
3.3 Fluctuations 59
3.4 Chaos 65

4 RG Flow of Preactivations 71
4.1 First Layer: Good-Old Gaussian 73
4.2 Second Layer: Genesis of Non-Gaussianity 79
4.3 Deeper Layers: Accumulation of Non-Gaussianity 90
4.4 Marginalization Rules 96
4.5 Subleading Corrections 100
4.6 RG Flow and RG Flow 103

5 Effective Theory of Preactivations at Initialization 109
5.1 Criticality Analysis of the Kernel 110
5.2 Criticality for Scale-Invariant Activations 123
5.3 Universality Beyond Scale-Invariant Activations 125
5.3.1 General Strategy 126
5.3.2 No Criticality: Sigmoid, Softplus, Nonlinear Monomials, etc. 128
5.3.3 K* = 0 Universality Class: tanh, sin, etc. 130
5.3.4 Half-Stable Universality Classes: SWISH, etc. and GELU, etc. 135
5.4 Fluctuations 137
5.4.1 Fluctuations for the Scale-Invariant Universality Class 139
5.4.2 Fluctuations for the K* = 0 Universality Class 141
5.5 Finite-Angle Analysis for the Scale-Invariant Universality Class 146

6 Bayesian Learning 153
6.1 Bayesian Probability 154
6.2 Bayesian Inference and Neural Networks 156
6.2.1 Bayesian Model Fitting 157
6.2.2 Bayesian Model Comparison 165
6.3 Bayesian Inference at Infinite Width 169
6.3.1 The Evidence for Criticality 169
6.3.2 Let's Not Wire Together 173
6.3.3 Absence of Representation Learning 178
6.4 Bayesian Inference at Finite Width 179
6.4.1 Hebbian Learning, Inc. 179
6.4.2 Let's Wire Together 182
6.4.3 Presence of Representation Learning 186

7 Gradient-Based Learning 191
7.1 Supervised Learning 192
7.2 Gradient Descent and Function Approximation 194

8 RG Flow of the Neural Tangent Kernel 199
8.0 Forward Equation for the NTK 200
8.1 First Layer: Deterministic NTK 206
8.2 Second Layer: Fluctuating NTK 207
8.3 Deeper Layers: Accumulation of NTK Fluctuations 211
8.3.0 Interlude: Interlayer Correlations 211
8.3.1 NTK Mean 215
8.3.2 NTK-Preactivation Cross Correlations 216
8.3.3 NTK Variance 221

9 Effective Theory of the NTK at Initialization 227
9.1 Criticality Analysis of the NTK 228
9.2 Scale-Invariant Universality Class 233
9.3 K* = 0 Universality Class 236
9.4 Criticality, Exploding and Vanishing Problems, and None of That 241

10 Kernel Learning 247
10.1 A Small Step 248
10.1.1 No Wiring 250
10.1.2 No Representation Learning 250
10.2 A Giant Leap 252
10.2.1 Newton's Method 253
10.2.2 Algorithm Independence 257
10.2.3 Aside: Cross-Entropy Loss 259
10.2.4 Kernel Prediction 261
10.3 Generalization 264
10.3.1 Bias-Variance Tradeoff and Criticality 267
10.3.2 Interpolation and Extrapolation 277
10.4 Linear Models and Kernel Methods 282
10.4.1 Linear Models 282
10.4.2 Kernel Methods 284
10.4.3 Infinite-Width Networks as Linear Models 287

11 Representation Learning 291
11.1 Differential of the Neural Tangent Kernel 293
11.2 RG Flow of the dNTK 296
11.2.0 Forward Equation for the dNTK 297
11.2.1 First Layer: Zero dNTK 299
11.2.2 Second Layer: Nonzero dNTK 300
11.2.3 Deeper Layers: Growing dNTK 301
11.3 Effective Theory of the dNTK at Initialization 310
11.3.1 Scale-Invariant Universality Class 312
11.3.2 K* = 0 Universality Class 314
11.4 Nonlinear Models and Nearly-Kernel Methods 317
11.4.1 Nonlinear Models 318
11.4.2 Nearly-Kernel Methods 324
11.4.3 Finite-Width Networks as Nonlinear Models 330

∞ The End of Training 335
∞.1 Two More Differentials 337
∞.2 Training at Finite Width 347
∞.2.1 A Small Step Following a Giant Leap 351
∞.2.2 Many Many Steps of Gradient Descent 358
∞.2.3 Prediction at Finite Width 373
∞.3 RG Flow of the ddNTKs: The Full Expressions 384

ε Epilogue: Model Complexity from the Macroscopic Perspective 389

A Information in Deep Learning 399
A.1 Entropy and Mutual Information 400
A.2 Information at Infinite Width: Criticality 409
A.3 Information at Finite Width: Optimal Aspect Ratio 411

B Residual Learning 425
B.1 Residual Multilayer Perceptrons 428
B.2 Residual Infinite Width: Criticality Analysis 429
B.3 Residual Finite Width: Optimal Aspect Ratio 431
B.4 Residual Building Blocks 436

References 439
Index 445