Foreword | xiii
Preface | xvii
1 Big Data Technology Primer | 1 | (30)
| 3 | (23)
| 5 | (5)
| 10 | (4)
| 14 | (4)
| 18 | (7)
| 25 | (1)
| 25 | (1)
| 26 | (5)
Part I. Infrastructure
| 31 | (14)
Reasons for Multiple Clusters | 31 | (4)
Multiple Clusters for Resiliency | 31 | (1)
Multiple Clusters for Software Development | 32 | (1)
Multiple Clusters for Workload Isolation | 33 | (1)
Multiple Clusters for Legal Separation | 34 | (1)
Multiple Clusters and Independent Storage and Compute | 35 | (1)
| 35 | (2)
Requirements for Multitenancy | 36 | (1)
| 37 | (4)
| 38 | (2)
| 40 | (1)
| 41 | (1)
| 41 | (2)
The Drivers of Cluster Growth | 42 | (1)
Implementing Cluster Growth | 42 | (1)
| 43 | (1)
Replication for Software Development | 43 | (1)
Replication and Workload Isolation | 43 | (1)
| 44 | (1)
| 45 | (62)
Computer Architecture for Hadoop | 46 | (9)
| 46 | (2)
| 48 | (2)
| 50 | (4)
| 54 | (1)
| 55 | (1)
Commoditized Storage Meets the Enterprise | 55 | (3)
Modularity of Compute and Storage | 57 | (1)
| 57 | (1)
Replication or Erasure Coding? | 57 | (1)
| 58 | (1)
Hadoop and the Linux Storage Stack | 58 | (13)
| 58 | (3)
| 61 | (1)
| 62 | (3)
Short-Circuit and Zero-Copy Reads | 65 | (4)
| 69 | (2)
Erasure Coding Versus Replication | 71 | (10)
| 76 | (3)
| 79 | (2)
| 81 | (10)
| 81 | (3)
| 84 | (7)
| 91 | (5)
| 94 | (1)
| 95 | (1)
| 96 | (1)
Cluster Configurations and Node Types | 97 | (7)
| 98 | (1)
| 99 | (1)
| 100 | (1)
| 101 | (1)
Small Cluster Configurations | 101 | (1)
Medium Cluster Configurations | 102 | (1)
Large Cluster Configurations | 103 | (1)
| 104 | (3)
| 107 | (32)
How Services Use a Network | 107 | (7)
Remote Procedure Calls (RPCs) | 107 | (2)
| 109 | (4)
| 113 | (1)
| 113 | (1)
| 114 | (1)
| 114 | (14)
Small Cluster Architectures | 115 | (1)
Medium Cluster Architectures | 116 | (8)
Large Cluster Architectures | 124 | (4)
| 128 | (3)
Reusing an Existing Network | 128 | (1)
Creating an Additional Network | 129 | (2)
Network Design Considerations | 131 | (7)
| 131 | (2)
| 133 | (2)
| 135 | (3)
| 138 | (1)
5 Organizational Challenges | 139 | (20)
| 140 | (1)
Is It Infrastructure, Middleware, or an Application? | 140 | (1)
Case Study: A Typical Business Intelligence Project | 141 | (16)
| 141 | (2)
| 143 | (3)
Compartmentalization of IT | 146 | (1)
Revised Team Setup for Hadoop in the Enterprise | 147 | (7)
Solution Overview with Hadoop | 154 | (1)
| 155 | (1)
| 156 | (1)
| 156 | (1)
Do I Need a Center of Excellence/Competence? | 157 | (1)
| 157 | (2)
6 Datacenter Considerations | 159 | (26)
| 159 | (1)
Basic Datacenter Concepts | 160 | (8)
| 162 | (1)
| 163 | (1)
| 164 | (1)
Rack Awareness and Rack Failures | 165 | (2)
| 167 | (1)
Space and Racking Constraints | 168 | (1)
Ingest and Intercluster Connectivity | 169 | (2)
| 169 | (1)
| 170 | (1)
| 171 | (1)
| 172 | (1)
| 172 | (9)
| 172 | (1)
| 173 | (8)
| 181 | (4)
Part II. Platform
| 185 | (26)
| 185 | (9)
| 187 | (1)
OS Configuration for Hadoop | 188 | (5)
Automated Configuration Example | 193 | (1)
| 194 | (8)
| 196 | (1)
Database Integration Options | 197 | (4)
| 201 | (1)
| 202 | (8)
| 202 | (3)
| 205 | (1)
Distribution Architecture | 206 | (2)
| 208 | (2)
| 210 | (1)
| 211 | (26)
| 212 | (1)
| 213 | (1)
| 213 | (14)
| 213 | (3)
| 216 | (5)
| 221 | (6)
| 227 | (7)
| 228 | (2)
| 230 | (4)
Validating Other Components | 234 | (2)
| 235 | (1)
| 236 | (1)
| 237 | (44)
| 237 | (5)
| 238 | (2)
SASL Quality of Protection | 240 | (1)
Enabling in-Flight Encryption | 241 | (1)
| 242 | (8)
| 242 | (5)
| 247 | (1)
| 248 | (1)
| 249 | (1)
| 250 | (20)
| 251 | (2)
Superusers and Supergroups | 253 | (4)
Hadoop Service Level Authorization | 257 | (1)
Centralized Security Management | 258 | (2)
| 260 | (1)
| 261 | (1)
| 262 | (1)
| 263 | (1)
| 264 | (1)
| 264 | (1)
| 265 | (1)
| 266 | (1)
| 266 | (1)
| 266 | (3)
| 269 | (1)
| 270 | (1)
| 270 | (9)
Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server | 273 | (1)
HDFS Transparent Data Encryption | 274 | (5)
Encrypting Temporary Files | 279 | (1)
| 279 | (2)
10 Integration with Identity Management Providers | 281 | (30)
| 281 | (1)
| 282 | (3)
Scenario 1: Writing a File to HDFS | 282 | (1)
Scenario 2: Submitting a Hive Query | 283 | (1)
Scenario 3: Running a Spark Job | 284 | (1)
| 285 | (2)
| 287 | (9)
| 287 | (2)
| 289 | (1)
| 290 | (1)
| 290 | (2)
| 292 | (4)
| 296 | (8)
| 296 | (2)
| 298 | (6)
| 304 | (5)
| 305 | (2)
| 307 | (1)
| 308 | (1)
| 309 | (1)
| 309 | (2)
11 Accessing and Interacting with Clusters | 311 | (18)
| 311 | (2)
| 311 | (1)
| 312 | (1)
| 312 | (1)
| 313 | (10)
| 314 | (2)
| 316 | (2)
| 318 | (1)
| 318 | (5)
| 323 | (1)
| 324 | (1)
| 324 | (2)
| 324 | (1)
| 325 | (1)
| 326 | (2)
| 328 | (1)
| 329 | (48)
High Availability Defined | 330 | (1)
| 330 | (1)
| 330 | (1)
| 331 | (1)
| 331 | (1)
| 331 | (1)
| 331 | (1)
| 331 | (1)
Playbooks and Postmortems | 332 | (1)
| 332 | (13)
| 332 | (2)
| 334 | (7)
| 341 | (2)
| 343 | (2)
| 345 | (2)
Separation of Master and Worker Processes | 345 | (1)
Separation of Identical Service Roles | 345 | (1)
Master Servers in Separate Failure Domains | 346 | (1)
Balanced Master Configurations | 346 | (1)
Optimized Server Configurations | 346 | (1)
High Availability of Cluster Services | 347 | (29)
| 347 | (1)
| 348 | (5)
| 353 | (3)
| 356 | (2)
| 358 | (1)
| 359 | (3)
| 362 | (5)
| 367 | (2)
| 369 | (2)
| 371 | (1)
| 372 | (3)
| 375 | (1)
| 375 | (1)
| 376 | (1)
13 Backup and Disaster Recovery | 377 | (34)
| 377 | (11)
| 377 | (1)
| 378 | (1)
| 379 | (3)
| 382 | (1)
| 383 | (3)
| 386 | (1)
| 386 | (1)
| 387 | (1)
| 388 | (1)
| 388 | (3)
| 389 | (1)
| 389 | (1)
| 390 | (1)
| 391 | (1)
| 391 | (14)
| 394 | (4)
Case Study: Automating Backups with Oozie | 398 | (7)
| 405 | (1)
| 406 | (5)
Part III. Taking Hadoop to the Cloud
14 Basics of Virtualization for Hadoop | 411 | (22)
| 412 | (3)
Virtual Machine Distribution | 413 | (1)
| 414 | (1)
| 415 | (8)
Virtualizing Local Storage | 416 | (1)
| 417 | (4)
Object Storage and Network-Attached Storage | 421 | (2)
| 423 | (2)
Cluster Life Cycle Models | 425 | (5)
| 430 | (3)
15 Solutions for Private Clouds | 433 | (22)
| 435 | (4)
Automation and Integration | 436 | (1)
| 436 | (2)
| 438 | (1)
| 438 | (1)
| 439 | (3)
| 439 | (1)
| 440 | (1)
| 441 | (1)
| 441 | (1)
VMware and Pivotal Cloud Foundry | 442 | (1)
| 442 | (6)
| 445 | (1)
| 446 | (1)
| 446 | (1)
| 447 | (1)
Object Storage for Private Clouds | 448 | (5)
| 448 | (2)
| 450 | (3)
| 453 | (2)
16 Solutions in the Public Cloud | 455 | (42)
| 455 | (2)
| 457 | (16)
| 457 | (7)
| 464 | (6)
| 470 | (3)
| 473 | (22)
| 473 | (5)
Storage and Life Cycle Models | 478 | (6)
| 484 | (4)
| 488 | (7)
| 495 | (2)
17 Automated Provisioning | 497 | (16)
| 497 | (13)
Configuration and Templating | 498 | (1)
| 499 | (3)
| 502 | (3)
| 505 | (1)
| 505 | (1)
Hooking Into a Provisioning Life Cycle | 505 | (1)
| 506 | (2)
| 508 | (2)
| 510 | (1)
Sharing Metadata Services | 511 | (1)
| 512 | (1)
| 513 | (48)
| 513 | (2)
| 515 | (2)
| 515 | (1)
| 516 | (1)
Identity Provider Options for Hadoop | 517 | (6)
Option A: Cloud-Only Self-Contained ID Services | 519 | (1)
Option B: Cloud-Only Shared ID Services | 520 | (1)
Option C: On-Premises ID Services | 521 | (2)
Object Storage Security and Hadoop | 523 | (12)
Identity and Access Management | 523 | (1)
Amazon Simple Storage Service | 524 | (3)
| 527 | (4)
| 531 | (4)
| 535 | (1)
Encryption for Data at Rest | 535 | (15)
Requirements for Key Material | 536 | (1)
Options for Encryption in the Cloud | 537 | (2)
On-Premises Key Persistence | 539 | (1)
Encryption via the Cloud Provider | 539 | (8)
Encryption Feature and Interoperability Summary | 547 | (2)
Recommendations and Summary for Cloud Encryption | 549 | (1)
Encrypting Data in Flight in the Cloud | 550 | (1)
Perimeter Controls and Firewalling | 551 | (8)
| 553 | (2)
| 555 | (2)
| 557 | (2)
| 559 | (2)
A Backup Onboarding Checklist | 561 | (10)
Index | 571