Preface  xvii

1 Why Learn the Mathematics of AI?  1
    What Is AI?  2
    Why Is AI So Popular Now?  3
    What Is AI Able to Do?  3
    An AI Agent's Specific Tasks  4
    What Are AI's Limitations?  6
    What Happens When AI Systems Fail?  8
    Where Is AI Headed?  8
    Who Are the Current Main Contributors to the AI Field?  10
    What Math Is Typically Involved in AI?  10
    Summary and Looking Ahead  11

2 Data, Data, Data  13
    Data for AI  14
    Real Data Versus Simulated Data  16
    Mathematical Models: Linear Versus Nonlinear  16
    An Example of Real Data  18
    An Example of Simulated Data  21
    Mathematical Models: Simulations and AI  25
    Where Do We Get Our Data From?  27
    The Vocabulary of Data Distributions, Probability, and Statistics  29
        Random Variables  30
        Probability Distributions  31
        Marginal Probabilities  31
        The Uniform and the Normal Distributions  31
        Conditional Probabilities and Bayes' Theorem  31
        Conditional Probabilities and Joint Distributions  32
        Prior Distribution, Posterior Distribution, and Likelihood Function  32
        Mixtures of Distributions  32
        Sums and Products of Random Variables  33
        Using Graphs to Represent Joint Probability Distributions  33
        Expectation, Mean, Variance, and Uncertainty  33
        Covariance and Correlation  33
        Markov Process  34
        Normalizing, Scaling, and/or Standardizing a Random Variable or Data Set  34
        Common Examples  35
    Continuous Distributions Versus Discrete Distributions (Density Versus Mass)  35
    The Power of the Joint Probability Density Function  37
    Distribution of Data: The Uniform Distribution  38
    Distribution of Data: The Bell-Shaped Normal (Gaussian) Distribution  40
    Distribution of Data: Other Important and Commonly Used Distributions  43
    The Various Uses of the Word "Distribution"  47
    A/B Testing  48
    Summary and Looking Ahead  48

3 Fitting Functions to Data  51
    Traditional and Very Useful Machine Learning Models  53
    Numerical Solutions Versus Analytical Solutions  55
    Regression: Predict a Numerical Value  56
        Training Function  58
        Loss Function  60
        Optimization  71
    Logistic Regression: Classify into Two Classes  85
        Training Function  85
        Loss Function  86
        Optimization  88
    Softmax Regression: Classify into Multiple Classes  88
        Training Function  90
        Loss Function  92
        Optimization  92
    Incorporating These Models into the Last Layer of a Neural Network  93
    Other Popular Machine Learning Techniques and Ensembles of Techniques  93
        Support Vector Machines  94
        Decision Trees  98
        Random Forests  107
        k-means Clustering  108
    Performance Measures for Classification Models  108
    Summary and Looking Ahead  110

4 Optimization for Neural Networks  113
    The Brain Cortex and Artificial Neural Networks  113
    Training Function: Fully Connected, or Dense, Feed Forward Neural Networks  115
        A Neural Network Is a Computational Graph Representation of the Training Function  117
        Linearly Combine, Add Bias, Then Activate  117
        Common Activation Functions  122
        Universal Function Approximation  125
        Approximation Theory for Deep Learning  131
    Loss Functions  131
    Optimization  133
        Mathematics and the Mysterious Success of Neural Networks  134
        Gradient Descent ω⃗ᵢ₊₁ = ω⃗ᵢ - η∇L(ω⃗ᵢ)  135
        Explaining the Role of the Learning Rate Hyperparameter η  137
        Convex Versus Nonconvex Landscapes  140
        Stochastic Gradient Descent  143
        Initializing the Weights ω⃗₀ for the Optimization Process  144
    Regularization Techniques  145
        Dropout  145
        Early Stopping  146
        Batch Normalization of Each Layer  146
        Control the Size of the Weights by Penalizing Their Norm  147
        Penalizing the l2 Norm Versus Penalizing the l1 Norm  150
        Explaining the Role of the Regularization Hyperparameter α  151
    Hyperparameter Examples That Appear in Machine Learning  152
    Chain Rule and Backpropagation: Calculating ∇L(ω⃗ᵢ)  153
        Backpropagation Is Not Too Different from How Our Brain Learns  154
        Why Is It Better to Backpropagate?  155
        Backpropagation in Detail  155
    Assessing the Significance of the Input Data Features  157
    Summary and Looking Ahead  158

5 Convolutional Neural Networks and Computer Vision  161
    Convolution and Cross-Correlation  163
        Translation Invariance and Translation Equivariance  167
        Convolution in Usual Space Is a Product in Frequency Space  167
    Convolution from a Systems Design Perspective  168
        Convolution and Impulse Response for Linear and Translation Invariant Systems  169
    Convolution and One-Dimensional Discrete Signals  171
    Convolution and Two-Dimensional Discrete Signals  172
        Filtering Images  174
        Feature Maps  178
    Linear Algebra Notation  179
        The One-Dimensional Case: Multiplication by a Toeplitz Matrix  182
        The Two-Dimensional Case: Multiplication by a Doubly Block Circulant Matrix  182
    Pooling  183
    A Convolutional Neural Network for Image Classification  184
    Summary and Looking Ahead  186

6 Singular Value Decomposition: Image Processing, Natural Language Processing, and Social Media  187
    Matrix Factorization  188
    Diagonal Matrices  191
    Matrices as Linear Transformations Acting on Space  193
        Action of A on the Right Singular Vectors  194
        Action of A on the Standard Unit Vectors and the Unit Square Determined by Them  195
        Action of A on the Unit Circle  196
        Breaking Down the Circle-to-Ellipse Transformation According to the Singular Value Decomposition  197
        Rotation and Reflection Matrices  198
        Action of A on a General Vector x⃗  199
    Three Ways to Multiply Matrices  200
    The Big Picture  201
    The Condition Number and Computational Stability  203
    The Ingredients of the Singular Value Decomposition  204
    Singular Value Decomposition Versus the Eigenvalue Decomposition  204
    Computation of the Singular Value Decomposition  206
        Computing an Eigenvector Numerically  207
    The Pseudoinverse  208
    Applying the Singular Value Decomposition to Images  209
    Principal Component Analysis and Dimension Reduction  212
    Principal Component Analysis and Clustering  214
    A Social Media Application  214
    Latent Semantic Analysis  215
    Randomized Singular Value Decomposition  216
    Summary and Looking Ahead  216

7 Natural Language and Finance AI: Vectorization and Time Series  219
    Natural Language AI  222
    Preparing Natural Language Data for Machine Processing  223
    Statistical Models and the log Function  226
    Zipf's Law for Term Counts  226
    Various Vector Representations for Natural Language Documents  227
        Term Frequency Vector Representation of a Document or Bag of Words  227
        Term Frequency-Inverse Document Frequency Vector Representation of a Document  228
        Topic Vector Representation of a Document Determined by Latent Semantic Analysis  228
        Topic Vector Representation of a Document Determined by Latent Dirichlet Allocation  232
        Topic Vector Representation of a Document Determined by Latent Discriminant Analysis  233
        Meaning Vector Representations of Words and of Documents Determined by Neural Network Embeddings  234
    Cosine Similarity  241
    Natural Language Processing Applications  243
        Sentiment Analysis  243
        Spam Filters  244
        Search and Information Retrieval  244
        Machine Translation  246
        Image Captioning  247
        Chatbots  247
        Other Applications  247
    Transformers and Attention Models  247
        The Transformer Architecture  248
        The Attention Mechanism  251
        Transformers Are Far from Perfect  255
    Convolutional Neural Networks for Time Series Data  255
    Recurrent Neural Networks for Time Series Data  257
        How Do Recurrent Neural Networks Work?  258
        Gated Recurrent Units and Long Short-Term Memory Units  260
    An Example of Natural Language Data  261
    Finance AI  261
    Summary and Looking Ahead  262

8 Probabilistic Generative Models  263
    What Are Generative Models Useful For?  264
    The Typical Mathematics of Generative Models  265
    Shifting Our Brain from Deterministic Thinking to Probabilistic Thinking  268
    Maximum Likelihood Estimation  270
    Explicit and Implicit Density Models  272
    Explicit Density-Tractable: Fully Visible Belief Networks  273
        Example: Generating Images via PixelCNN and Machine Audio via WaveNet  273
    Explicit Density-Tractable: Change of Variables Nonlinear Independent Component Analysis  276
    Explicit Density-Intractable: Variational Autoencoders Approximation via Variational Methods  277
    Explicit Density-Intractable: Boltzmann Machine Approximation via Markov Chain  279
    Implicit Density-Markov Chain: Generative Stochastic Network  279
    Implicit Density-Direct: Generative Adversarial Networks  280
        How Do Generative Adversarial Networks Work?  281
        Example: Machine Learning and Generative Networks for High Energy Physics  283
    Other Generative Models  285
        Naive Bayes Classification Model  286
        Gaussian Mixture Model  288
    The Evolution of Generative Models  289
        Hopfield Nets  290
        Boltzmann Machine  291
        Restricted Boltzmann Machine (Explicit Density and Intractable)  291
        The Original Autoencoder  292
    Probabilistic Language Modeling  293
    Summary and Looking Ahead  295

9 Graph Models  297
    Graphs: Nodes, Edges, and Features for Each  299
    Example: PageRank Algorithm  302
    Inverting Matrices Using Graphs  307
    Cayley Graphs of Groups: Pure Algebra and Parallel Computing  308
    Message Passing Within a Graph  309
    The Limitless Applications of Graphs  310
        Brain Networks  311
        Spread of Disease  312
        Spread of Information  312
        Detecting and Tracking Fake News Propagation  312
        Web-Scale Recommendation Systems  314
        Fighting Cancer  314
        Biochemical Graphs  316
        Molecular Graph Generation for Drug and Protein Structure Discovery  316
        Citation Networks  316
        Social Media Networks and Social Influence Prediction  316
        Sociological Structures  317
        Bayesian Networks  317
        Traffic Forecasting  317
        Logistics and Operations Research  318
        Language Models  318
        Graph Structure of the Web  320
        Automatically Analyzing Computer Programs  321
        Data Structures in Computer Science  321
        Load Balancing in Distributed Networks  322
        Artificial Neural Networks  323
    Random Walks on Graphs  324
    Node Representation Learning  326
    Tasks for Graph Neural Networks  327
        Node Classification  327
        Graph Classification  328
        Clustering and Community Detection  329
        Graph Generation  329
        Influence Maximization  329
        Link Prediction  330
    Dynamic Graph Models  330
    Bayesian Networks  331
        A Bayesian Network Represents a Compactified Conditional Probability Table  333
        Making Predictions Using a Bayesian Network  334
        Bayesian Networks Are Belief Networks, Not Causal Networks  334
        Keep This in Mind About Bayesian Networks  335
        Chains, Forks, and Colliders  336
        Given a Data Set, How Do We Set Up a Bayesian Network for the Involved Variables?  337
    Graph Diagrams for Probabilistic Causal Modeling  338
    A Brief History of Graph Theory  340
    Main Considerations in Graph Theory  341
        Spanning Trees and Shortest Spanning Trees  341
        Cut Sets and Cut Vertices  342
        Planarity  342
        Graphs as Vector Spaces  343
        Realizability  343
        Coloring and Matching  344
        Enumeration  344
    Algorithms and Computational Aspects of Graphs  344
    Summary and Looking Ahead  345

10 Operations Research  347
    No Free Lunch  349
    Complexity Analysis and O() Notation  350
    Optimization: The Heart of Operations Research  353
    Thinking About Optimization  356
        Optimization: Finite Dimensions, Unconstrained  357
        Optimization: Finite Dimensions, Constrained Lagrange Multipliers  357
        Optimization: Infinite Dimensions, Calculus of Variations  360
    Optimization on Networks  365
        Traveling Salesman Problem  365
        Minimum Spanning Trees  366
        Shortest Path  367
        Max-Flow Min-Cut  368
        Max-Flow Min-Cost  369
        The Critical Path Method for Project Design  369
    The n-Queens Problem  370
    Linear Optimization  371
        The General Form and the Standard Form  372
        Visualizing a Linear Optimization Problem in Two Dimensions  373
        Convex to Linear  374
        The Geometry of Linear Optimization  377
        The Simplex Method  379
        Transportation and Assignment Problems  386
        Duality, Lagrange Relaxation, Shadow Prices, Max-Min, Min-Max, and All That  386
        Sensitivity  401
    Game Theory and Multiagents  402
    Queuing  404
    Inventory  405
    Machine Learning for Operations Research  405
    Hamilton-Jacobi-Bellman Equation  406
    Operations Research for AI  407
    Summary and Looking Ahead  407

11 Probability  411
    Where Did Probability Appear in This Book?  412
    What More Do We Need to Know That Is Essential for AI?  415
    Causal Modeling and the Do Calculus  415
        An Alternative: The Do Calculus  417
    Paradoxes and Diagram Interpretations  420
        Monty Hall Problem  420
        Berkson's Paradox  422
        Simpson's Paradox  422
    Large Random Matrices  424
        Examples of Random Vectors and Random Matrices  424
        Main Considerations in Random Matrix Theory  427
        Random Matrix Ensembles  429
        Eigenvalue Density of the Sum of Two Large Random Matrices  430
        Essential Math for Large Random Matrices  430
    Stochastic Processes  432
        Bernoulli Process  433
        Poisson Process  433
        Random Walk  434
        Wiener Process or Brownian Motion  435
        Martingale  435
        Levy Process  436
        Branching Process  436
        Markov Chain  436
        Itô's Lemma  437
    Markov Decision Processes and Reinforcement Learning  438
        Examples of Reinforcement Learning  438
        Reinforcement Learning as a Markov Decision Process  439
        Reinforcement Learning in the Context of Optimal Control and Nonlinear Dynamics  441
        Python Library for Reinforcement Learning  441
    Theoretical and Rigorous Grounds  441
        Which Events Have a Probability?  442
        Can We Talk About a Wider Range of Random Variables?  443
        A Probability Triple (Sample Space, Sigma Algebra, Probability Measure)  443
        Where Is the Difficulty?  444
        Random Variable, Expectation, and Integration  445
        Distribution of a Random Variable and the Change of Variable Theorem  446
        Next Steps in Rigorous Probability Theory  447
    The Universality Theorem for Neural Networks  448
    Summary and Looking Ahead  448

12 Mathematical Logic  451
    Various Logic Frameworks  452
    Propositional Logic  452
        From Few Axioms to a Whole Theory  455
        Codifying Logic Within an Agent  456
        How Do Deterministic and Probabilistic Machine Learning Fit In?  456
    First-Order Logic  457
        Relationships Between For All and There Exist  458
    Probabilistic Logic  460
    Fuzzy Logic  460
    Temporal Logic  461
    Comparison with Human Natural Language  462
    Machines and Complex Mathematical Reasoning  462
    Summary and Looking Ahead  463

13 Artificial Intelligence and Partial Differential Equations  465
    What Is a Partial Differential Equation?  466
    Modeling with Differential Equations  467
        Models at Different Scales  468
        The Parameters of a PDE  468
        Changing One Thing in a PDE Can Be a Big Deal  469
        Can AI Step In?  471
    Numerical Solutions Are Very Valuable  472
        Continuous Functions Versus Discrete Functions  472
        PDE Themes from My Ph.D. Thesis  474
        Discretization and the Curse of Dimensionality  477
        Finite Differences  478
        Finite Elements  484
        Variational or Energy Methods  489
        Monte Carlo Methods  490
    Some Statistical Mechanics: The Wonderful Master Equation  493
    Solutions as Expectations of Underlying Random Processes  495
    Transforming the PDE  495
        Fourier Transform  495
        Laplace Transform  498
    Solution Operators  499
        Example Using the Heat Equation  499
        Example Using the Poisson Equation  501
        Fixed Point Iteration  503
    AI for PDEs  509
        Deep Learning to Learn Physical Parameter Values  509
        Deep Learning to Learn Meshes  510
        Deep Learning to Approximate Solution Operators of PDEs  512
        Numerical Solutions of High-Dimensional Differential Equations  519
        Simulating Natural Phenomena Directly from Data  520
    Hamilton-Jacobi-Bellman PDE for Dynamic Programming  522
    PDEs for AI?  528
    Other Considerations in Partial Differential Equations  528
    Summary and Looking Ahead  530

14 Artificial Intelligence, Ethics, Mathematics, Law, and Policy  531
    533
    534
    536
    536
    537
    538
    Unintended Outcomes of Generative Models  539
    539
        Addressing Underrepresentation in Training Data  539
        Addressing Bias in Word Vectors  540
        540
        541
        Injecting Morality into AI  542
        Democratization and Accessibility of AI to Nonexperts  543
        Prioritizing High Quality Data  543
    Distinguishing Bias from Discrimination  544
    545
    546

Index  549