Preface .......... vii

1 .......... 1
  1.1 What is Machine Learning .......... 2
    1.1.1 Symbolical Learning .......... 2
    1.1.2 Statistical Machine Learning .......... 3
    1.1.3 Supervised and Unsupervised Machine Learning .......... 6
  1.2 It all began with the Perceptron .......... 7
    1.2.1 .......... 8
    1.2.2 .......... 10
    1.2.3 .......... 17
  1.3 Road to Deep Learning .......... 19
    1.3.1 .......... 19
  1.4 .......... 19
    1.4.1 .......... 21
  1.5 Exercises and Answers .......... 23
|
2 Probability and Information .......... 37
  2.1 .......... 38
    2.1.1 Conditional probability .......... 39
    2.1.2 Law of Total Probability .......... 40
    2.1.3 .......... 41
    2.1.4 .......... 43
    2.1.5 .......... 44
  2.2 .......... 46
    2.2.1 Gaussian Distribution .......... 46
    2.2.2 Laplace Distribution .......... 51
    2.2.3 Bernoulli Distribution .......... 51
  2.3 .......... 54
    2.3.1 Surprise and Information .......... 55
    2.3.2 .......... 56
    2.3.3 Conditional Entropy .......... 62
    2.3.4 .......... 62
    2.3.5 .......... 62
  2.4 .......... 63
  2.5 Exercises and Answers .......... 68
|
3 Linear Algebra and Optimization .......... 105
  3.1 .......... 106
    3.1.1 .......... 106
    3.1.2 .......... 107
    3.1.3 .......... 108
    3.1.4 Linear Independent Vectors .......... 109
    3.1.5 .......... 110
    3.1.6 .......... 110
    3.1.7 .......... 111
    3.1.8 Element-wise division .......... 111
    3.1.9 .......... 111
    3.1.10 .......... 111
  3.3 Gradient based Numerical Optimization .......... 113
    3.3.1 .......... 113
    3.3.2 .......... 116
    3.3.3 Second and First Order Optimization .......... 118
  3.4 Dilemmas in Machine Learning .......... 120
    3.4.1 The Curse of Dimensionality .......... 120
    3.4.2 Numerical Computation .......... 122
  3.5 Exercises and Answers .......... 122
|
4 Linear and Nonlinear Regression .......... 131
  4.1 .......... 132
    4.1.1 Regression of a Line .......... 132
    4.1.2 Multiple Linear Regression .......... 132
    4.1.3 .......... 134
    4.1.4 .......... 134
    4.1.5 Closed-Form Solution .......... 135
    4.1.6 .......... 136
    4.1.7 Moore-Penrose Matrix .......... 138
  4.2 Linear Basis Function Models .......... 141
    4.2.1 Example Logarithmic Curve .......... 142
    4.2.2 Example Polynomial Regression .......... 142
  4.3 .......... 143
  4.4 .......... 146
    4.4.1 Maximizing the Likelihood or the Posterior .......... 147
    4.4.2 .......... 148
    4.4.3 Maximizing a posteriori .......... 151
    4.4.4 Relation between Regularized Least-Squares and MAP .......... 151
    4.4.5 .......... 153
  4.5 Linear Regression for classification .......... 155
  4.6 Exercises and Answers .......... 156
|
|
5 .......... 187
  5.1 Linear Regression and Linear Artificial Neuron .......... 188
    5.1.1 .......... 191
    5.1.2 Stochastic gradient descent .......... 191
  5.2 Continuous Differentiable Activation Functions .......... 192
    5.2.1 Sigmoid Activation Functions .......... 193
    5.2.2 Perceptron with sgn .......... 194
    5.2.3 Cross Entropy Loss Function .......... 196
    5.2.4 Linear Unit versus Sigmoid Unit .......... 199
    5.2.5 Logistic Regression .......... 199
  5.3 Multiclass Linear Discriminant .......... 200
    5.3.1 Cross Entropy Loss Function for softmax .......... 203
    5.3.2 Logistic Regression Algorithm .......... 205
  5.4 Multilayer Perceptron .......... 206
  5.5 Exercises and Answers .......... 207
|
|
6 .......... 223
  6.1 .......... 224
  6.2 Networks with Hidden Nonlinear Layers .......... 224
    6.2.1 .......... 226
    6.2.2 .......... 228
    6.2.3 Activation Function .......... 232
  6.3 Cross Entropy Error Function .......... 234
    6.3.1 .......... 235
    6.3.2 .......... 237
    6.3.3 .......... 238
    6.3.4 .......... 238
  6.4 .......... 240
    6.4.1 .......... 241
    6.4.2 Early-Stopping Rule .......... 241
    6.4.3 .......... 242
  6.5 Deep Learning and Backpropagation .......... 244
  6.6 Exercises and Answers .......... 245
|
|
7 .......... 251
  7.1 Supervised Classification Problem .......... 252
  7.2 Probability of a bad sample .......... 253
  7.3 Infinite hypotheses set .......... 256
  7.4 .......... 258
  7.5 A Fundamental Trade-off .......... 259
  7.6 Computing VC Dimension .......... 262
    7.6.1 The VC Dimension of a Perceptron .......... 264
    7.6.2 A Heuristic way to measure hypotheses space complexity .......... 267
  7.7 The Regression Problem .......... 267
    7.7.1 .......... 270
  7.8 Exercises and Answers .......... 275
|
|
8 .......... 293
  8.1 .......... 294
    8.1.1 Precision and Recall .......... 294
    8.1.2 .......... 295
  8.2 Validation Set and Test Set .......... 295
  8.3 .......... 297
  8.4 Minimum-Description-Length .......... 298
    8.4.1 .......... 299
    8.4.2 Kolmogorov complexity theory .......... 299
    8.4.3 Learning as Data Compression .......... 300
    8.4.4 Two-part code MDL principle .......... 301
  8.5 Paradox of Deep Learning Complexity .......... 303
  8.6 Exercises and Answers .......... 306
|
|
9 .......... 309
  9.1 .......... 310
  9.2 .......... 310
    9.2.1 .......... 313
    9.2.2 .......... 314
  9.3 .......... 315
    9.3.1 EM for Gaussian Mixtures .......... 317
    9.3.2 Algorithm: EM for Gaussian mixtures .......... 320
    9.3.3 .......... 322
  9.4 EM and k-means Clustering .......... 324
  9.5 Exercises and Answers .......... 324
|
|
10 .......... 377
  10.1 .......... 378
    10.1.1 Cover's theorem on the separability (1965) .......... 379
  10.2 Interpolation Problem .......... 379
    10.2.1 Micchelli's Theorem .......... 381
  10.3 Radial Basis Function Networks .......... 381
    10.3.1 Modifications of Radial Basis Function Networks .......... 382
    10.3.2 Interpretation of Hidden Units .......... 383
  10.4 Exercises and Answers .......... 384
|
11 Support Vector Machines .......... 391
  11.1 .......... 392
  11.2 Optimal Hyperplane for Linear Separable Patterns .......... 394
  11.3 .......... 395
  11.4 Quadratic Optimization for Finding the Optimal Hyperplane .......... 396
    11.4.1 .......... 397
  11.5 Optimal Hyperplane for Non-separable Patterns .......... 400
    11.5.1 .......... 401
  11.6 Support Vector Machine as a Kernel Machine .......... 401
    11.6.1 .......... 403
    11.6.2 .......... 404
    11.6.3 .......... 405
  11.7 Constructing Kernels .......... 408
    11.7.1 .......... 409
    11.7.2 .......... 409
    11.7.3 Generative model Kernels .......... 410
  11.8 .......... 410
    11.8.1 SVMs, MLPs and RBFNs .......... 411
  11.9 Exercises and Answers .......... 412
|
|
12 .......... 423
  12.1 .......... 424
    12.1.1 .......... 424
    12.1.2 .......... 425
  12.2 .......... 426
    12.2.1 Hierarchical Organization .......... 427
    12.2.2 .......... 427
    12.2.3 Curse of dimensionality .......... 427
    12.2.4 .......... 428
    12.2.5 Can represent big training sets .......... 428
    12.2.6 Efficient Model Selection .......... 428
    12.2.7 Criticism of Deep Neural Networks .......... 429
  12.3 Vanishing Gradients Problem .......... 429
    12.3.1 Rectified Linear Unit (ReLU) .......... 430
    12.3.2 .......... 433
    12.3.3 Batch Normalization .......... 434
  12.4 Regularization by Dropout .......... 436
  12.5 Weight Initialization .......... 437
  12.6 .......... 437
    12.6.1 .......... 438
    12.6.2 .......... 438
    12.6.3 .......... 438
    12.6.4 .......... 439
    12.6.5 .......... 439
    12.6.6 .......... 440
  12.7 .......... 441
  12.8 .......... 441
  12.9 Exercises and Answers .......... 442
|
13 Convolutional Networks .......... 489
  13.1 Hierarchical Networks .......... 490
    13.1.1 .......... 490
    13.1.2 .......... 490
    13.1.3 Map transformation cascade .......... 492
  13.2 Convolutional Neural Networks .......... 496
    13.2.1 CNNs and Kernels in Image Processing .......... 499
    13.2.2 .......... 506
    13.2.3 .......... 506
  13.3 Exercises and Answers .......... 507
|
|
14 .......... 521
  14.1 .......... 522
  14.2 Recurrent Neural Networks .......... 524
    14.2.1 Elman recurrent neural networks .......... 524
    14.2.2 Jordan recurrent neural networks .......... 526
    14.2.3 .......... 527
    14.2.4 Backpropagation Through Time .......... 529
    14.2.5 Deep Recurrent Networks .......... 531
  14.3 Long Short Term Memory .......... 531
  14.4 .......... 535
  14.5 Exercises and Answers .......... 537
|
|
15 .......... 575
  15.1 Eigenvectors and Eigenvalues .......... 576
  15.2 The Karhunen-Loève transform .......... 576
    15.2.1 Principal component analysis .......... 578
  15.3 Singular Value Decomposition .......... 582
    15.3.1 .......... 583
    15.3.2 .......... 583
    15.3.3 .......... 584
  15.4 .......... 584
  15.5 Undercomplete Autoencoders .......... 586
  15.6 Overcomplete Autoencoders .......... 587
    15.6.1 Denoising Autoencoders .......... 589
  15.7 Exercises and Answers .......... 590
|
|
 .......... 609

Bibliography .......... 613

Index .......... 621