|
|
Preface |
|
xxvii | |
P.1 Emphasis on Foundations |
|
xxvii | |
P.2 Glimpse of History |
|
xxix | |
P.3 Organization of the Text |
|
xxxi | |
P.4 How to Use the Text |
|
xxxiv | |
P.5 Simulation Datasets |
|
xxxvii | |
P.6 Acknowledgments |
|
xi | |
Notation |
|
xiv | |
|
|
1 | (58) |
|
|
1 | (4) |
|
1.2 Positive-Definite Matrices |
|
|
5 | (2) |
|
1.3 Range Spaces and Nullspaces |
|
|
7 | (4) |
|
|
11 | (3) |
|
1.5 Cholesky Factorization |
|
|
14 | (4) |
|
|
18 | (2) |
|
1.7 Singular Value Decomposition |
|
|
20 | (2) |
|
|
22 | (2) |
|
|
24 | (6) |
|
1.10 Vector and Matrix Norms |
|
|
30 | (7) |
|
1.11 Perturbation Bounds on Eigenvalues |
|
|
37 | (1) |
|
|
38 | (1) |
|
1.13 Complex-Valued Matrices |
|
|
39 | (2) |
|
1.14 Commentaries and Discussion |
|
|
41 | (9) |
|
|
47 | (3) |
|
1.A Proof of Spectral Theorem |
|
|
50 | (2) |
|
1.B Constructive Proof of SVD |
|
|
52 | (7) |
|
|
53 | (6) |
|
|
59 | (9) |
|
|
59 | (3) |
|
|
62 | (1) |
|
2.3 Matrix Differentiation |
|
|
63 | (2) |
|
2.4 Commentaries and Discussion |
|
|
65 | (3) |
|
|
65 | (2) |
|
|
67 | (1) |
|
|
68 | (64) |
|
3.1 Probability Density Functions |
|
|
68 | (3) |
|
|
71 | (6) |
|
3.3 Dependent Random Variables |
|
|
77 | (16) |
|
|
93 | (3) |
|
3.5 Properties of Covariance Matrices |
|
|
96 | (1) |
|
3.6 Illustrative Applications |
|
|
97 | (9) |
|
3.7 Complex-Valued Variables |
|
|
106 | (3) |
|
3.8 Commentaries and Discussion |
|
|
109 | (10) |
|
|
112 | (7) |
|
3.A Convergence of Random Variables |
|
|
119 | (3) |
|
3.B Concentration Inequalities |
|
|
122 | (10) |
|
|
128 | (4) |
|
|
132 | (35) |
|
4.1 Scalar Gaussian Variables |
|
|
132 | (2) |
|
4.2 Vector Gaussian Variables |
|
|
134 | (4) |
|
4.3 Useful Gaussian Manipulations |
|
|
138 | (6) |
|
4.4 Jointly Distributed Gaussian Variables |
|
|
144 | (6) |
|
|
150 | (5) |
|
4.6 Circular Gaussian Distribution |
|
|
155 | (2) |
|
4.7 Commentaries and Discussion |
|
|
157 | (10) |
|
|
160 | (5) |
|
|
165 | (2) |
|
5 Exponential Distributions |
|
|
167 | (29) |
|
|
167 | (2) |
|
|
169 | (9) |
|
|
178 | (5) |
|
|
183 | (4) |
|
5.5 Commentaries and Discussion |
|
|
187 | (5) |
|
|
189 | (3) |
|
5.A Derivation of Properties |
|
|
192 | (4) |
|
|
195 | (1) |
|
|
196 | (44) |
|
6.1 Information and Entropy |
|
|
196 | (8) |
|
6.2 Kullback Leibler Divergence |
|
|
204 | (5) |
|
6.3 Maximum Entropy Distribution |
|
|
209 | (2) |
|
|
211 | (2) |
|
6.5 Fisher Information Matrix |
|
|
213 | (4) |
|
|
217 | (10) |
|
|
227 | (4) |
|
6.8 Commentaries and Discussion |
|
|
231 | (9) |
|
|
234 | (3) |
|
|
237 | (3) |
|
|
240 | (21) |
|
|
240 | (5) |
|
7.2 Power Spectral Density |
|
|
245 | (7) |
|
7.3 Spectral Factorization |
|
|
252 | (3) |
|
7.4 Commentaries and Discussion |
|
|
255 | (6) |
|
|
257 | (2) |
|
|
259 | (2) |
|
|
261 | (41) |
|
|
261 | (2) |
|
|
263 | (2) |
|
|
265 | (1) |
|
|
266 | (2) |
|
8.5 Hessian Matrix Conditions |
|
|
268 | (4) |
|
|
272 | (7) |
|
|
279 | (2) |
|
|
281 | (4) |
|
|
285 | (5) |
|
8.10 Commentaries and Discussion |
|
|
290 | (12) |
|
|
293 | (6) |
|
|
299 | (3) |
|
|
302 | (28) |
|
9.1 Convex Optimization Problems |
|
|
302 | (8) |
|
|
310 | (2) |
|
9.3 Motivating the KKT Conditions |
|
|
312 | (3) |
|
9.4 Projection onto Convex Sets |
|
|
315 | (7) |
|
9.5 Commentaries and Discussion |
|
|
322 | (8) |
|
|
323 | (5) |
|
|
328 | (2) |
|
|
330 | (11) |
|
|
330 | (2) |
|
|
332 | (5) |
|
10.3 Commentaries and Discussion |
|
|
337 | (4) |
|
|
338 | (2) |
|
|
340 | (1) |
|
|
341 | (34) |
|
11.1 Definition and Properties |
|
|
341 | (6) |
|
11.2 Proximal Point Algorithm |
|
|
347 | (2) |
|
11.3 Proximal Gradient Algorithm |
|
|
349 | (5) |
|
|
354 | (2) |
|
11.5 Douglas-Rachford Algorithm |
|
|
356 | (2) |
|
11.6 Commentaries and Discussion |
|
|
358 | (8) |
|
|
362 | (4) |
|
11.A Convergence under Convexity |
|
|
366 | (3) |
|
11.B Convergence under Strong Convexity |
|
|
369 | (6) |
|
|
372 | (3) |
|
12 Gradient-Descent Method |
|
|
375 | (66) |
|
12.1 Empirical and Stochastic Risks |
|
|
375 | (4) |
|
12.2 Conditions on Risk Function |
|
|
379 | (2) |
|
|
381 | (11) |
|
12.4 Iteration-Dependent Step-Sizes |
|
|
392 | (10) |
|
12.5 Coordinate-Descent Method |
|
|
402 | (11) |
|
12.6 Alternating Projection Algorithm |
|
|
413 | (5) |
|
12.7 Commentaries and Discussion |
|
|
418 | (15) |
|
|
425 | (8) |
|
12.A Zeroth-Order Optimization |
|
|
433 | (8) |
|
|
436 | (5) |
|
13 Conjugate Gradient Method |
|
|
441 | (30) |
|
13.1 Linear Systems of Equations |
|
|
441 | (13) |
|
13.2 Nonlinear Optimization |
|
|
454 | (5) |
|
13.3 Convergence Analysis |
|
|
459 | (6) |
|
13.4 Commentaries and Discussion |
|
|
465 | (6) |
|
|
466 | (3) |
|
|
469 | (2) |
|
|
471 | (36) |
|
14.1 Subgradient Algorithm |
|
|
471 | (4) |
|
14.2 Conditions on Risk Function |
|
|
475 | (4) |
|
14.3 Convergence Behavior |
|
|
479 | (4) |
|
|
483 | (3) |
|
14.5 Exponential Smoothing |
|
|
486 | (3) |
|
14.6 Iteration-Dependent Step Sizes |
|
|
489 | (4) |
|
14.7 Coordinate-Descent Algorithms |
|
|
493 | (3) |
|
14.8 Commentaries and Discussion |
|
|
496 | (5) |
|
|
498 | (3) |
|
14.A Deterministic Inequality Recursion |
|
|
501 | (6) |
|
|
505 | (2) |
|
15 Proximal and Mirror-Descent Methods |
|
|
507 | (40) |
|
15.1 Proximal Gradient Method |
|
|
507 | (8) |
|
15.2 Projection Gradient Method |
|
|
515 | (4) |
|
15.3 Mirror-Descent Method |
|
|
519 | (18) |
|
15.4 Comparison of Convergence Rates |
|
|
537 | (2) |
|
15.5 Commentaries and Discussion |
|
|
539 | (8) |
|
|
541 | (3) |
|
|
544 | (3) |
|
16 Stochastic Optimization |
|
|
547 | (52) |
|
16.1 Stochastic Gradient Algorithm |
|
|
548 | (17) |
|
16.2 Stochastic Subgradient Algorithm |
|
|
565 | (4) |
|
16.3 Stochastic Proximal Gradient Algorithm |
|
|
569 | (5) |
|
|
574 | (2) |
|
|
576 | (6) |
|
16.6 Commentaries and Discussion |
|
|
582 | (8) |
|
|
586 | (4) |
|
16.A Switching Expectation and Differentiation |
|
|
590 | (9) |
|
|
595 | (4) |
|
17 Adaptive Gradient Methods |
|
|
599 | (33) |
|
|
599 | (4) |
|
|
603 | (5) |
|
|
608 | (2) |
|
|
610 | (4) |
|
17.5 Momentum Acceleration Methods |
|
|
614 | (5) |
|
|
619 | (7) |
|
17.7 Commentaries and Discussion |
|
|
626 | (6) |
|
|
630 | (2) |
|
17 A Regret Analysis for ADAM |
|
|
632 | (10) |
|
|
640 | (2) |
|
|
642 | (41) |
|
|
642 | (3) |
|
18.2 Smooth Risk Functions |
|
|
645 | (3) |
|
18.3 Gradient Noise for Smooth Risks |
|
|
648 | (12) |
|
18.4 Nonsmooth Risk Functions |
|
|
660 | (5) |
|
18.5 Gradient Noise for Nonsmooth Risks |
|
|
665 | (8) |
|
18.6 Commentaries and Discussion |
|
|
673 | (4) |
|
|
675 | (2) |
|
18.A Averaging over Mini-Batches |
|
|
677 | (2) |
|
18.B Auxiliary Variance Result |
|
|
679 | (4) |
|
|
681 | (2) |
|
19 Convergence Analysis I: Stochastic Gradient Algorithms |
|
|
683 | (47) |
|
|
683 | (3) |
|
19.2 Convergence under Uniform Sampling |
|
|
686 | (5) |
|
19.3 Convergence of Mini-Batch Implementation |
|
|
691 | (1) |
|
19.4 Convergence under Vanishing Step Sizes |
|
|
692 | (6) |
|
19.5 Convergence under Random Reshuffling |
|
|
698 | (3) |
|
19.6 Convergence under Importance Sampling |
|
|
701 | (6) |
|
19.7 Convergence of Stochastic Conjugate Gradient |
|
|
707 | (5) |
|
19.8 Commentaries and Discussion |
|
|
712 | (8) |
|
|
716 | (4) |
|
19.A Stochastic Inequality Recursion |
|
|
720 | (2) |
|
19.B Proof of Theorem 19.5 |
|
|
722 | (8) |
|
|
727 | (3) |
|
20 Convergence Analysis II: Stochastic Subgradient Algorithms |
|
|
730 | (26) |
|
|
730 | (5) |
|
20.2 Convergence under Uniform Sampling |
|
|
735 | (3) |
|
20.3 Convergence with Pocket Variables |
|
|
738 | (2) |
|
20.4 Convergence with Exponential Smoothing |
|
|
740 | (5) |
|
20.5 Convergence of Mini-Batch Implementation |
|
|
745 | (2) |
|
20.6 Convergence under Vanishing Step Sizes |
|
|
747 | (3) |
|
20.7 Commentaries and Discussion |
|
|
750 | (6) |
|
|
753 | (1) |
|
|
754 | (2) |
|
21 Convergence Analysis III: Stochastic Proximal Algorithms |
|
|
756 | (23) |
|
|
756 | (5) |
|
21.2 Convergence under Uniform Sampling |
|
|
761 | (4) |
|
21.3 Convergence of Mini-Batch Implementation |
|
|
765 | (1) |
|
21.4 Convergence under Vanishing Step Sizes |
|
|
766 | (3) |
|
21.5 Stochastic Projection Gradient |
|
|
769 | (2) |
|
21.6 Mirror-Descent Algorithm |
|
|
771 | (3) |
|
21.7 Commentaries and Discussion |
|
|
774 | (5) |
|
|
775 | (1) |
|
|
776 | (3) |
|
22 Variance-Reduced Methods I: Uniform Sampling |
|
|
779 | (37) |
|
|
779 | (3) |
|
22.2 Naive Stochastic Gradient Algorithm |
|
|
782 | (3) |
|
22.3 Stochastic Average-Gradient Algorithm (SAGA) |
|
|
785 | (8) |
|
22.4 Stochastic Variance-Reduced Gradient Algorithm (SVRG) |
|
|
793 | (6) |
|
22.5 Nonsmooth Risk Functions |
|
|
799 | (7) |
|
22.6 Commentaries and Discussion |
|
|
806 | (4) |
|
|
808 | (2) |
|
22.A Proof of Theorem 22.2 |
|
|
810 | (3) |
|
22.B Proof of Theorem 22.3 |
|
|
813 | (3) |
|
|
815 | (1) |
|
23 Variance-Reduced Methods II: Random Reshuffling |
|
|
816 | (36) |
|
23.1 Amortized Variance-Reduced Gradient Algorithm (AVRG) |
|
|
816 | (2) |
|
23.2 Evolution of Memory Variables |
|
|
818 | (4) |
|
|
822 | (5) |
|
|
827 | (3) |
|
|
830 | (1) |
|
23.6 Nonsmooth Risk Functions |
|
|
831 | (1) |
|
23.7 Commentaries and Discussion |
|
|
832 | (2) |
|
|
833 | (1) |
|
|
834 | (4) |
|
|
838 | (4) |
|
23.C Proof of Theorem 23.1 |
|
|
842 | (3) |
|
|
845 | (4) |
|
23.E Proof of Theorem 23.2 |
|
|
849 | (3) |
|
|
851 | (1) |
|
24 Nonconvex Optimization |
|
|
852 | (50) |
|
24.1 First- and Second-Order Stationarity |
|
|
852 | (8) |
|
24.2 Stochastic Gradient Optimization |
|
|
860 | (5) |
|
24.3 Convergence Behavior |
|
|
865 | (7) |
|
24.4 Commentaries and Discussion |
|
|
872 | (4) |
|
|
874 | (2) |
|
24.A Descent in the Large Gradient Regime |
|
|
876 | (1) |
|
24.B Introducing a Short-Term Model |
|
|
877 | (11) |
|
24.C Descent Away from Strict Saddle Points |
|
|
888 | (9) |
|
24.D Second-Order Convergence Guarantee |
|
|
897 | (5) |
|
|
900 | (2) |
|
25 Decentralized Optimization I: Primal Methods |
|
|
902 | (67) |
|
|
903 | (6) |
|
|
909 | (4) |
|
25.3 Aggregate and Local Risks |
|
|
913 | (5) |
|
25.4 Incremental, Consensus, and Diffusion |
|
|
918 | (17) |
|
25.5 Formal Derivation as Primal Methods |
|
|
935 | (5) |
|
25.6 Commentaries and Discussion |
|
|
940 | (7) |
|
|
943 | (4) |
|
|
947 | (2) |
|
25.B Proof of Property (25.71) |
|
|
949 | (1) |
|
25.C Convergence of Primal Algorithms |
|
|
949 | (20) |
|
|
965 | (4) |
|
26 Decentralized Optimization II: Primal-Dual Methods |
|
|
969 | (40) |
|
|
969 | (1) |
|
|
970 | (2) |
|
26.3 EXACT Diffusion Algorithm |
|
|
972 | (3) |
|
26.4 Distributed Inexact Gradient Algorithm |
|
|
975 | (3) |
|
26.5 Augmented Decentralized Gradient Method |
|
|
978 | (1) |
|
|
979 | (4) |
|
26.7 Unified Decentralized Algorithm |
|
|
983 | (2) |
|
26.8 Convergence Performance |
|
|
985 | (2) |
|
|
987 | (3) |
|
26.10 Decentralized Nonconvex Optimization |
|
|
990 | (5) |
|
26.11 Commentaries and Discussion |
|
|
995 | (5) |
|
|
998 | (2) |
|
26.A Convergence of Primal-Dual Algorithms |
|
|
1000 | (9) |
|
|
1006 | (3) |
Author Index |
|
1009 | (24) |
Subject Index |
|
1033 | |