Foreword |
|
xiii | |
Preface |
|
xv | |
Acknowledgments |
|
xxi | |
About the Authors |
|
xxiii | |
|
I Data Science with Hadoop---An Overview |
|
|
1 | (52) |
|
1 Introduction to Data Science |
|
|
3 | (16) |
|
|
3 | (1) |
|
Example: Search Advertising |
|
|
4 | (1) |
|
A Bit of Data Science History |
|
|
5 | (1) |
|
Statistics and Machine Learning |
|
|
6 | (1) |
|
Innovation from Internet Giants |
|
|
7 | (1) |
|
Data Science in the Modern Enterprise |
|
|
8 | (1) |
|
Becoming a Data Scientist |
|
|
8 | (1) |
|
|
8 | (1) |
|
|
9 | (1) |
|
Transitioning to a Data Scientist Role |
|
|
9 | (2) |
|
Soft Skills of a Data Scientist |
|
|
11 | (1) |
|
Building a Data Science Team |
|
|
12 | (1) |
|
The Data Science Project Life Cycle |
|
|
13 | (1) |
|
|
14 | (1) |
|
|
15 | (1) |
|
Data Cleaning: Taking Care of Data Quality |
|
|
15 | (1) |
|
Explore the Data and Design Model Features |
|
|
16 | (1) |
|
Building and Tuning the Model |
|
|
17 | (1) |
|
|
17 | (1) |
|
Managing a Data Science Project |
|
|
18 | (1) |
|
|
18 | (1) |
|
2 Use Cases for Data Science |
|
|
19 | (12) |
|
Big Data---A Driver of Change |
|
|
19 | (1) |
|
Volume: More Data Is Now Available |
|
|
20 | (1) |
|
|
20 | (1) |
|
Velocity: Fast Data Ingest |
|
|
21 | (1) |
|
|
21 | (1) |
|
|
21 | (1) |
|
|
22 | (1) |
|
|
22 | (1) |
|
Sales Leads Prioritization |
|
|
23 | (1) |
|
|
24 | (1) |
|
|
25 | (1) |
|
|
26 | (1) |
|
|
26 | (1) |
|
Predictive Medical Diagnosis |
|
|
27 | (1) |
|
Predicting Patient Re-admission |
|
|
28 | (1) |
|
Detecting Anomalous Record Access |
|
|
28 | (1) |
|
|
29 | (1) |
|
Predicting Oil and Gas Well Production Levels |
|
|
29 | (1) |
|
|
29 | (2) |
|
3 Hadoop and Data Science |
|
|
31 | (22) |
|
|
31 | (1) |
|
|
32 | (2) |
|
Resource Manager and Scheduler |
|
|
34 | (1) |
|
Distributed Data Processing Frameworks |
|
|
35 | (2) |
|
|
37 | (1) |
|
Hadoop Tools for Data Science |
|
|
38 | (1) |
|
|
39 | (1) |
|
|
39 | (1) |
|
|
40 | (1) |
|
|
41 | (1) |
|
|
42 | (3) |
|
|
45 | (1) |
|
Java Machine Learning Packages |
|
|
46 | (1) |
|
Why Hadoop Is Useful to Data Scientists |
|
|
46 | (1) |
|
|
46 | (1) |
|
|
47 | (1) |
|
Unstructured and Semi-Structured Data |
|
|
48 | (1) |
|
|
48 | (1) |
|
Robust Scheduling and Resource Management |
|
|
49 | (1) |
|
Levels of Distributed Systems Abstractions |
|
|
49 | (1) |
|
Scalable Creation of Models |
|
|
50 | (1) |
|
Scalable Application of Models |
|
|
51 | (1) |
|
|
51 | (2) |
|
II Preparing and Visualizing Data with Hadoop |
|
|
53 | (72) |
|
4 Getting Data Into Hadoop |
|
|
55 | (30) |
|
|
56 | (2) |
|
The Hadoop Distributed File System (HDFS) |
|
|
58 | (1) |
|
Direct File Transfer to Hadoop HDFS |
|
|
58 | (1) |
|
Importing Data from Files into Hive Tables |
|
|
59 | (1) |
|
Import CSV Files into Hive Tables |
|
|
59 | (3) |
|
Importing Data into Hive Tables Using Spark |
|
|
62 | (1) |
|
Import CSV Files into HIVE Using Spark |
|
|
63 | (1) |
|
Import a JSON File into HIVE Using Spark |
|
|
64 | (1) |
|
Using Apache Sqoop to Acquire Relational Data |
|
|
65 | (1) |
|
Data Import and Export with Sqoop |
|
|
66 | (1) |
|
Apache Sqoop Version Changes |
|
|
67 | (1) |
|
Using Sqoop V2: A Basic Example |
|
|
68 | (6) |
|
Using Apache Flume to Acquire Data Streams |
|
|
74 | (2) |
|
Using Flume: A Web Log Example Overview |
|
|
76 | (3) |
|
Manage Hadoop Work and Data Flows with Apache Oozie |
|
|
79 | (2) |
|
|
81 | (1) |
|
What's Next in Data Ingestion? |
|
|
82 | (1) |
|
|
82 | (3) |
|
5 Data Munging with Hadoop |
|
|
85 | (22) |
|
Why Hadoop for Data Munging? |
|
|
86 | (1) |
|
|
86 | (1) |
|
|
86 | (1) |
|
Dealing with Data Quality Issues |
|
|
87 | (5) |
|
Using Hadoop for Data Quality |
|
|
92 | (1) |
|
|
93 | (1) |
|
Choosing the "Right" Features |
|
|
94 | (1) |
|
Sampling: Choosing Instances |
|
|
94 | (2) |
|
|
96 | (1) |
|
|
97 | (3) |
|
|
100 | (1) |
|
Features from Complex Data Types |
|
|
101 | (1) |
|
|
102 | (1) |
|
|
103 | (3) |
|
|
106 | (1) |
|
6 Exploring and Visualizing Data |
|
|
107 | (18) |
|
|
107 | (1) |
|
Motivating Example: Visualizing Network Throughput |
|
|
108 | (2) |
|
Visualizing the Breakthrough That Never Happened |
|
|
110 | (2) |
|
|
112 | (1) |
|
|
113 | (1) |
|
|
114 | (3) |
|
|
117 | (1) |
|
|
118 | (3) |
|
Using Visualization for Data Science |
|
|
121 | (1) |
|
Popular Visualization Tools |
|
|
121 | (1) |
|
|
121 | (1) |
|
Python: Matplotlib, Seaborn, and Others |
|
|
122 | (1) |
|
|
122 | (1) |
|
|
123 | (1) |
|
|
123 | (1) |
|
Other Visualization Tools |
|
|
123 | (1) |
|
Visualizing Big Data with Hadoop |
|
|
123 | (1) |
|
|
124 | (1) |
|
III Applying Data Modeling with Hadoop |
|
|
125 | (76) |
|
7 Machine Learning with Hadoop |
|
|
127 | (6) |
|
Overview of Machine Learning |
|
|
127 | (1) |
|
|
128 | (1) |
|
Task Types in Machine Learning |
|
|
129 | (1) |
|
Big Data and Machine Learning |
|
|
130 | (1) |
|
Tools for Machine Learning |
|
|
131 | (1) |
|
The Future of Machine Learning and Artificial Intelligence |
|
|
132 | (1) |
|
|
132 | (1) |
|
|
133 | (18) |
|
Overview of Predictive Modeling |
|
|
133 | (1) |
|
Classification Versus Regression |
|
|
134 | (2) |
|
Evaluating Predictive Models |
|
|
136 | (1) |
|
|
136 | (3) |
|
Evaluating Regression Models |
|
|
139 | (1) |
|
|
139 | (1) |
|
Supervised Learning Algorithms |
|
|
140 | (1) |
|
Building Big Data Predictive Model Solutions |
|
|
141 | (1) |
|
|
141 | (2) |
|
|
143 | (1) |
|
|
144 | (1) |
|
Example: Sentiment Analysis |
|
|
145 | (1) |
|
|
145 | (1) |
|
|
145 | (1) |
|
|
146 | (3) |
|
|
149 | (1) |
|
|
150 | (1) |
|
|
151 | (14) |
|
|
151 | (1) |
|
|
152 | (1) |
|
Designing a Similarity Measure |
|
|
153 | (1) |
|
|
153 | (1) |
|
|
154 | (1) |
|
|
154 | (1) |
|
Example: Clustering Algorithms |
|
|
155 | (1) |
|
|
155 | (2) |
|
Latent Dirichlet Allocation |
|
|
157 | (1) |
|
Evaluating the Clusters and Choosing the Number of Clusters |
|
|
157 | (1) |
|
Building Big Data Clustering Solutions |
|
|
158 | (2) |
|
Example: Topic Modeling with Latent Dirichlet Allocation |
|
|
160 | (1) |
|
|
160 | (2) |
|
Running Latent Dirichlet Allocation |
|
|
162 | (1) |
|
|
163 | (2) |
|
10 Anomaly Detection with Hadoop |
|
|
165 | (16) |
|
|
165 | (1) |
|
Uses of Anomaly Detection |
|
|
166 | (1) |
|
Types of Anomalies in Data |
|
|
166 | (1) |
|
Approaches to Anomaly Detection |
|
|
167 | (1) |
|
|
167 | (1) |
|
Supervised Learning Methods |
|
|
168 | (1) |
|
Unsupervised Learning Methods |
|
|
168 | (2) |
|
Semi-Supervised Learning Methods |
|
|
170 | (1) |
|
Tuning Anomaly Detection Systems |
|
|
170 | (1) |
|
Building a Big Data Anomaly Detection Solution with Hadoop |
|
|
171 | (1) |
|
Example: Detecting Network Intrusions |
|
|
172 | (1) |
|
|
172 | (4) |
|
|
176 | (1) |
|
|
177 | (2) |
|
|
179 | (2) |
|
11 Natural Language Processing |
|
|
181 | (14) |
|
Natural Language Processing |
|
|
181 | (1) |
|
|
182 | (1) |
|
|
182 | (1) |
|
|
183 | (1) |
|
|
183 | (1) |
|
|
184 | (1) |
|
|
184 | (1) |
|
|
184 | (1) |
|
Tooling for NLP in Hadoop |
|
|
184 | (1) |
|
|
184 | (2) |
|
|
186 | (1) |
|
|
187 | (1) |
|
|
187 | (1) |
|
|
188 | (1) |
|
Sentiment Analysis Example |
|
|
189 | (1) |
|
|
189 | (1) |
|
Using Spark for Sentiment Analysis |
|
|
189 | (4) |
|
|
193 | (2) |
|
12 Data Science with Hadoop---The Next Frontier |
|
|
195 | (6) |
|
|
195 | (2) |
|
|
197 | (2) |
|
|
199 | (2) |
|
A Book Web Page and Code Download |
|
|
201 | (2) |
|
|
203 | (6) |
|
Quick Command Reference |
|
|
204 | (1) |
|
General User HDFS Commands |
|
|
204 | (1) |
|
|
205 | (1) |
|
|
206 | (1) |
|
|
206 | (1) |
|
|
207 | (1) |
|
|
207 | (1) |
|
Delete a File within HDFS |
|
|
207 | (1) |
|
Delete a Directory in HDFS |
|
|
207 | (1) |
|
Get an HDFS Status Report (Administrators) |
|
|
207 | (1) |
|
Perform an FSCK on HDFS (Administrators) |
|
|
208 | (1) |
|
C Additional Background on Data Science and Apache Hadoop and Spark |
|
|
209 | (4) |
|
General Hadoop/Spark Information |
|
|
209 | (1) |
|
Hadoop/Spark Installation Recipes |
|
|
210 | (1) |
|
|
210 | (1) |
|
|
211 | (1) |
|
|
211 | (1) |
|
|
211 | (1) |
|
|
212 | (1) |
Index |
|
213 | |