Foreword  xv

1 (12)
1.1 Who should read this book?  1 (2)
1.2 What is data science?  3 (3)
6 (2)
1.4 What can I expect from this book?  8 (2)
1.5 What will this book expect from me?  10 (3)

13 (30)
14 (1)
2.2 The computing environment  14 (9)
14 (1)
15 (1)
2.2.3 Programming languages  16 (1)
2.2.4 Integrated development environments (IDEs)  17 (1)
18 (4)
22 (1)
23 (7)
2.3.1 Write readable code  23 (3)
2.3.2 Don't repeat yourself  26 (1)
2.3.3 Set seeds for random processes  27 (1)
2.3.4 Profile, benchmark, and optimize judiciously  27 (1)
28 (1)
2.3.6 Don't rely on black boxes  29 (1)
30 (11)
30 (1)
30 (1)
31 (1)
31 (2)
2.4.1.4 Other sources and concerns  33 (1)
34 (1)
35 (1)
36 (1)
37 (1)
38 (1)
38 (2)
2.4.4 Exploratory data analysis (EDA)  40 (1)
41 (1)
41 (2)

43 (56)
44 (7)
3.1.1 Data, vectors, and matrices  44 (2)
3.1.2 Term-by-document matrices  46 (1)
3.1.3 Matrix storage and manipulation issues  47 (4)
3.2 Matrix decompositions  51 (27)
3.2.1 Matrix decompositions and data science  51 (1)
3.2.2 The LU decomposition  51 (1)
3.2.2.1 Gaussian elimination  51 (2)
3.2.2.2 The matrices L and U  53 (2)
55 (1)
3.2.2.4 Computational notes  56 (2)
3.2.3 The Cholesky decomposition  58 (2)
3.2.4 Least-squares curve-fitting  60 (3)
3.2.5 Recommender systems and the QR decomposition  63 (1)
3.2.5.1 A motivating example  63 (2)
3.2.5.2 The QR decomposition  65 (5)
3.2.5.3 Applications of the QR decomposition  70 (1)
3.2.6 The singular value decomposition  71 (3)
3.2.6.1 SVD in our recommender system  74 (3)
3.2.6.2 Further reading on the SVD  77 (1)
3.3 Eigenvalues and eigenvectors  78 (14)
78 (4)
3.3.2 Finding eigenvalues  82 (2)
84 (2)
86 (6)
92 (3)
3.4.1 Floating point computing  92 (1)
3.4.2 Floating point arithmetic  92 (2)
94 (1)
95 (4)
3.5.1 Creating a database  95 (1)
3.5.2 The QR decomposition and query-matching  96 (1)
3.5.3 The SVD and latent semantic indexing  96 (1)
96 (3)

99 (86)
100 (3)
4.2 Exploratory data analysis and visualizations  103 (8)
4.2.1 Descriptive statistics  106 (3)
109 (2)
111 (13)
112 (4)
4.3.2 Polynomial regression  116 (1)
4.3.3 Group-wise models and clustering  117 (1)
118 (4)
4.3.5 Maximum likelihood estimation  122 (2)
124 (9)
4.4.1 The sampling distribution  125 (2)
4.4.2 Confidence intervals from the sampling distribution  127 (3)
4.4.3 Bootstrap resampling  130 (3)
133 (12)
133 (1)
133 (3)
4.5.1.2 General strategy for hypothesis testing  136 (1)
4.5.1.3 Inference to compare two populations  137 (1)
4.5.1.4 Other types of hypothesis tests  138 (1)
4.5.2 Randomization-based inference  139 (3)
4.5.3 Type I and Type II error  142 (1)
4.5.4 Power and effect size  142 (1)
4.5.5 The trouble with p-hacking  143 (1)
4.5.6 Bias and scope of inference  144 (1)
145 (14)
145 (1)
4.6.2 Outliers and high leverage points  146 (2)
4.6.3 Multiple regression, interaction  148 (4)
4.6.4 What to do when the regression assumptions fail  152 (3)
4.6.5 Indicator variables and ANOVA  155 (4)
4.7 The linear algebra approach to statistics  159 (14)
4.7.1 The general linear model  160 (5)
4.7.2 Ridge regression and penalized regression  165 (1)
4.7.3 Logistic regression  166 (5)
4.7.4 The generalized linear model  171 (1)
4.7.5 Categorical data analysis  172 (1)
173 (4)
4.8.1 Experimental design  173 (3)
176 (1)
177 (3)
177 (1)
4.9.2 Prior and posterior distributions  178 (2)
180 (2)
180 (1)
181 (1)
182 (1)
182 (3)

185 (54)
186 (2)
5.1.1 What is clustering?  186 (1)
5.1.2 Example applications  186 (1)
5.1.3 Clustering observations  187 (1)
188 (1)
189 (4)
5.4 Partitioning and the k-means algorithm  193 (11)
5.4.1 The k-means algorithm  193 (2)
5.4.2 Issues with k-means  195 (2)
5.4.3 Example with wine data  197 (3)
200 (4)
5.4.5 Other partitioning algorithms  204 (1)
5.5 Hierarchical clustering  204 (7)
205 (1)
206 (1)
5.5.3 Hierarchical simple example  207 (1)
5.5.4 Dendrograms and wine example  208 (3)
5.5.5 Other hierarchical algorithms  211 (1)
211 (6)
212 (2)
5.6.2 Hierarchical results  214 (1)
5.6.3 Case study conclusions  215 (2)
217 (7)
217 (1)
218 (2)
5.7.3 Mclust and model selection  220 (1)
5.7.4 Example with wine data  220 (1)
5.7.5 Model-based versus k-means  221 (3)
5.8 Density-based methods  224 (4)
5.8.1 Example with iris data  226 (2)
5.9 Dealing with network data  228 (4)
5.9.1 Network clustering example  229 (3)
232 (2)
232 (1)
5.10.2 Hierarchical clusters  233 (1)
5.10.3 Overlapping clusters, or fuzzy clustering  234 (1)
234 (5)

239 (52)
6.1 History and background  241 (3)
6.1.1 How does OR connect to data science?  241 (1)
242 (1)
6.1.3 Balance between efficiency and complexity  243 (1)
244 (16)
6.2.1 Complexity-tractability trade-off  246 (1)
6.2.2 Linear optimization  247 (2)
6.2.2.1 Duality and optimality conditions  249 (3)
6.2.2.2 Extension to integer programming  252 (1)
6.2.3 Convex optimization  252 (4)
6.2.3.1 Duality and optimality conditions  256 (2)
6.2.4 Non-convex optimization  258 (2)
260 (13)
6.3.1 Probability principles of simulation  261 (1)
6.3.2 Generating random variables  262 (1)
6.3.2.1 Simulation from a known distribution  262 (5)
6.3.2.2 Simulation from an empirical distribution: bootstrapping  267 (1)
6.3.2.3 Markov Chain Monte Carlo (MCMC) methods  267 (2)
6.3.3 Simulation techniques for statistical and machine learning model assessment  269 (1)
6.3.3.1 Bootstrapping confidence intervals  269 (1)
270 (1)
6.3.4 Simulation techniques for prescriptive analytics  271 (1)
6.3.4.1 Discrete-event simulation  272 (1)
6.3.4.2 Agent-based modeling  272 (1)
6.3.4.3 Using these tools for prescriptive analytics  273 (1)
6.4 Stochastic optimization  273 (4)
6.4.1 Dynamic programming formulation  274 (1)
6.4.2 Solution techniques  275 (2)
6.5 Putting the methods to use: prescriptive analytics  277 (3)
6.5.1 Bike-sharing systems  277 (1)
6.5.2 A customer choice model for online retail  278 (1)
6.5.3 HIV treatment and prevention  279 (1)
280 (3)
6.6.1 Optimization solvers  281 (1)
6.6.2 Simulation software and packages  282 (1)
6.6.3 Stochastic optimization software and packages  283 (1)
6.7 Looking to the future  283 (2)
285 (6)
6.8.1 The vehicle routing problem  285 (1)
6.8.2 The unit commitment problem for power systems  286 (3)
289 (1)
289 (2)

7 Dimensionality Reduction  291 (48)
292 (2)
7.2 The geometry of data and dimension  294 (4)
7.3 Principal Component Analysis  298 (6)
7.3.1 Derivation and properties  298 (2)
300 (1)
7.3.3 How PCA is used for dimension estimation and data reduction  300 (1)
7.3.4 Topological dimension  301 (2)
7.3.5 Multidimensional scaling  303 (1)
304 (2)
7.5 Non-integer dimensions  306 (6)
7.5.1 Background on dynamical systems  307 (1)
308 (1)
7.5.3 The correlation dimension  309 (2)
7.5.4 Correlation dimension of the Lorenz attractor  311 (1)
7.6 Dimension reduction on the Grassmannian  312 (6)
7.7 Dimensionality reduction in the presence of symmetry  318 (3)
7.8 Category theory applied to data visualization  321 (5)
326 (7)
7.9.1 Nonlinear Principal Component Analysis  326 (4)
7.9.2 Whitney's reduction network  330 (1)
7.9.3 The generalized singular value decomposition  331 (1)
7.9.4 False nearest neighbors  332 (1)
332 (1)
7.10 Interesting theorems on dimension  333 (3)
333 (1)
333 (1)
7.10.3 Nash embedding theorems  334 (1)
7.10.4 Johnson-Lindenstrauss lemma  335 (1)
336 (3)
7.11.1 Summary and method of application  336 (1)
7.11.2 Suggested exercises  336 (3)

339 (70)
340 (2)
8.1.1 Core concepts of supervised learning  341 (1)
8.1.2 Types of supervised learning  342 (1)
8.2 Training dataset and test dataset  342 (4)
342 (2)
8.2.2 Methods for data separation  344 (2)
8.3 Machine learning workflow  346 (14)
8.3.1 Step 1: obtaining the initial dataset  348 (2)
8.3.2 Step 2: preprocessing  350 (1)
8.3.2.1 Missing values and outliers  351 (1)
8.3.2.2 Feature engineering  352 (1)
8.3.3 Step 3: creating training and test datasets  353 (1)
8.3.4 Step 4: model creation  354 (1)
8.3.4.1 Scaling and normalization  354 (1)
8.3.4.2 Feature selection  355 (2)
8.3.5 Step 5: prediction and evaluation  357 (1)
8.3.6 Iterative model building  358 (2)
8.4 Implementing the ML workflow  360 (4)
360 (3)
8.4.2 Transformer objects  363 (1)
364 (6)
364 (1)
8.5.2 A powerful optimization tool  365 (1)
8.5.3 Application to regression  366 (1)
8.5.4 Support for regularization  367 (3)
370 (7)
8.6.1 Logistic regression framework  371 (1)
8.6.2 Parameter estimation for logistic regression  371 (2)
8.6.3 Evaluating the performance of a classifier  373 (4)
8.7 Naive Bayes classifier  377 (5)
377 (2)
8.7.1.1 Estimating the probabilities  379 (1)
8.7.1.2 Laplace smoothing  379 (1)
8.7.2 Health care example  380 (2)
8.8 Support vector machines  382 (10)
8.8.1 Linear SVMs in the case of linear separability  383 (3)
8.8.2 Linear SVMs without linear separability  386 (3)
389 (3)
392 (10)
8.9.1 Classification trees  395 (3)
8.9.2 Regression decision trees  398 (1)
399 (3)
402 (4)
403 (1)
403 (1)
404 (2)
406 (3)

409 (32)
410 (3)
410 (1)
9.1.2 History of neural networks  411 (2)
9.2 Multilayer perceptrons  413 (5)
414 (3)
417 (1)
9.2.3 Neural networks for classification  417 (1)
418 (4)
419 (1)
9.3.2 Optimization algorithms  419 (2)
421 (1)
9.3.4 Batch normalization  421 (1)
9.3.5 Weight regularization  421 (1)
422 (1)
9.4 Convolutional neural networks  422 (7)
423 (1)
9.4.2 Convolutional architectures for ImageNet  424 (5)
9.5 Recurrent neural networks  429 (2)
430 (1)
431 (4)
431 (1)
432 (2)
9.6.3 Self-attention layers  434 (1)
434 (1)
434 (1)
9.7 Deep learning frameworks  435 (5)
9.7.1 Hardware acceleration  435 (1)
9.7.2 History of deep learning frameworks  436 (2)
9.7.3 TensorFlow with Keras  438 (2)
440 (1)
9.9 Exercises and solutions  440 (1)

10 Topological Data Analysis  441 (34)
441 (2)
10.2 Example applications  443 (3)
443 (1)
10.2.2 Molecule configurations  443 (2)
10.2.3 Agent-based modeling  445 (1)
445 (1)
446 (1)
10.4 Simplicial complexes  447 (2)
449 (8)
10.5.1 Simplicial homology  450 (1)
10.5.2 Homology definitions  451 (1)
452 (1)
10.5.4 Homology computation using linear algebra  453 (4)
457 (6)
10.7 Sublevelset persistence  463 (1)
10.8 Software and exercises  464 (3)
467 (1)
10.10 Appendix: stability of persistent homology  467 (8)
10.10.1 Distances between datasets  468 (3)
10.10.2 Bottleneck distance and visualization  471 (2)
10.10.3 Stability results  473 (2)

Bibliography  475 (40)
Index  515