
E-book: Architecting Modern Data Platforms: A Guide to Enterprise Hadoop at Scale

4.20/5 (26 ratings by Goodreads)
  • Pages: 636
  • Publication date: 05-Dec-2018
  • Publisher: O'Reilly Media
  • Language: English
  • ISBN-13: 9781491969229
  • Format: EPUB+DRM
  • Price: €46.20*
  • * This is the final price, i.e., no additional discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned, and no refunds are issued for purchased e-books.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has supplied this book in encrypted form, which means you need to install free software to unlock and read it. To read this e-book, you must create an Adobe ID. More information here. The e-book can be read and downloaded on up to 6 devices (by one user with the same Adobe ID).

    Required software
    To read this e-book on a mobile device (phone or tablet), you will need to install this free app: PocketBook Reader (iOS / Android)

    To download and read this e-book on a PC or Mac, you will need Adobe Digital Editions (this is a free app designed specifically for e-books. It is not the same as Adobe Reader, which you may already have on your computer).

    You cannot read this e-book on an Amazon Kindle.

There's a lot of information about big data technologies, but splicing these technologies into an end-to-end enterprise data platform is a daunting task not widely covered. With this practical book, you'll learn how to build big data infrastructure both on-premises and in the cloud and successfully architect a modern data platform.

Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects. You'll explore the vast landscape of tools available in the Hadoop and big data realm in a thorough technical primer before diving into:

  • Infrastructure: Look at all component layers in a modern data platform, from the server to the data center, to establish a solid foundation for data in your enterprise
  • Platform: Understand aspects of deployment, operation, security, high availability, and disaster recovery, along with everything you need to know to integrate your platform with the rest of your enterprise IT
  • Taking Hadoop to the cloud: Learn the important architectural aspects of running a big data platform in the cloud while maintaining enterprise security and high availability
Table of contents

Foreword
Preface
1. Big Data Technology Primer
   A Tour of the Landscape
   Core Components
   Computational Frameworks
   Analytical SQL Engines
   Storage Engines
   Ingestion
   Orchestration
   Summary
Part I. Infrastructure
2. Clusters
   Reasons for Multiple Clusters
   Multiple Clusters for Resiliency
   Multiple Clusters for Software Development
   Multiple Clusters for Workload Isolation
   Multiple Clusters for Legal Separation
   Multiple Clusters and Independent Storage and Compute
   Multitenancy
   Requirements for Multitenancy
   Sizing Clusters
   Sizing by Storage
   Sizing by Ingest Rate
   Sizing by Workload
   Cluster Growth
   The Drivers of Cluster Growth
   Implementing Cluster Growth
   Data Replication
   Replication for Software Development
   Replication and Workload Isolation
   Summary
3. Compute and Storage
   Computer Architecture for Hadoop
   Commodity Servers
   Server CPUs and RAM
   Nonuniform Memory Access
   CPU Specifications
   RAM
   Commoditized Storage Meets the Enterprise
   Modularity of Compute and Storage
   Everything Is Java
   Replication or Erasure Coding?
   Alternatives
   Hadoop and the Linux Storage Stack
   User Space
   Important System Calls
   The Linux Page Cache
   Short-Circuit and Zero-Copy Reads
   Filesystems
   Erasure Coding Versus Replication
   Discussion
   Guidance
   Low-Level Storage
   Storage Controllers
   Disk Layer
   Server Form Factors
   Form Factor Comparison
   Guidance
   Workload Profiles
   Cluster Configurations and Node Types
   Master Nodes
   Worker Nodes
   Utility Nodes
   Edge Nodes
   Small Cluster Configurations
   Medium Cluster Configurations
   Large Cluster Configurations
   Summary
4. Networking
   How Services Use a Network
   Remote Procedure Calls (RPCs)
   Data Transfers
   Monitoring
   Backup
   Consensus
   Network Architectures
   Small Cluster Architectures
   Medium Cluster Architectures
   Large Cluster Architectures
   Network Integration
   Reusing an Existing Network
   Creating an Additional Network
   Network Design Considerations
   Layer 1 Recommendations
   Layer 2 Recommendations
   Layer 3 Recommendations
   Summary
5. Organizational Challenges
   Who Runs It?
   Is It Infrastructure, Middleware, or an Application?
   Case Study: A Typical Business Intelligence Project
   The Traditional Approach
   Typical Team Setup
   Compartmentalization of IT
   Revised Team Setup for Hadoop in the Enterprise
   Solution Overview with Hadoop
   New Team Setup
   Split Responsibilities
   Do I Need DevOps?
   Do I Need a Center of Excellence/Competence?
   Summary
6. Datacenter Considerations
   Why Does It Matter?
   Basic Datacenter Concepts
   Cooling
   Power
   Network
   Rack Awareness and Rack Failures
   Failure Domain Alignment
   Space and Racking Constraints
   Ingest and Intercluster Connectivity
   Software
   Hardware
   Replacements and Repair
   Operational Procedures
   Typical Pitfalls
   Networking
   Cluster Spanning
   Summary
Part II. Platform
7. Provisioning Clusters
   Operating Systems
   OS Choices
   OS Configuration for Hadoop
   Automated Configuration Example
   Service Databases
   Required Databases
   Database Integration Options
   Database Considerations
   Hadoop Deployment
   Hadoop Distributions
   Installation Choices
   Distribution Architecture
   Installation Process
   Summary
8. Platform Validation
   Testing Methodology
   Useful Tools
   Hardware Validation
   CPU
   Disks
   Network
   Hadoop Validation
   HDFS Validation
   General Validation
   Validating Other Components
   Operations Validation
   Summary
9. Security
   In-Flight Encryption
   TLS Encryption
   SASL Quality of Protection
   Enabling In-Flight Encryption
   Authentication
   Kerberos
   LDAP Authentication
   Delegation Tokens
   Impersonation
   Authorization
   Group Resolution
   Superusers and Supergroups
   Hadoop Service Level Authorization
   Centralized Security Management
   HDFS
   YARN
   ZooKeeper
   Hive
   Impala
   HBase
   Solr
   Kudu
   Oozie
   Hue
   Kafka
   Sentry
   At-Rest Encryption
   Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server
   HDFS Transparent Data Encryption
   Encrypting Temporary Files
   Summary
10. Integration with Identity Management Providers
   Integration Areas
   Integration Scenarios
   Scenario 1: Writing a File to HDFS
   Scenario 2: Submitting a Hive Query
   Scenario 3: Running a Spark Job
   Integration Providers
   LDAP Integration
   Background
   LDAP Security
   Load Balancing
   Application Integration
   Linux Integration
   Kerberos Integration
   Kerberos Clients
   KDC Integration
   Certificate Management
   Signing Certificates
   Converting Certificates
   Wildcard Certificates
   Automation
   Summary
11. Accessing and Interacting with Clusters
   Access Mechanisms
   Programmatic Access
   Command-Line Access
   Web UIs
   Access Topologies
   Interaction Patterns
   Proxy Access
   Load Balancing
   Edge Node Interactions
   Access Security
   Administration Gateways
   Workbenches
   Hue
   Notebooks
   Landing Zones
   Summary
12. High Availability
   High Availability Defined
   Lateral/Service HA
   Vertical/Systemic HA
   Measuring Availability
   Percentages
   Percentiles
   Operating for HA
   Monitoring
   Playbooks and Postmortems
   HA Building Blocks
   Quorums
   Load Balancing
   Database HA
   Ancillary Services
   General Considerations
   Separation of Master and Worker Processes
   Separation of Identical Service Roles
   Master Servers in Separate Failure Domains
   Balanced Master Configurations
   Optimized Server Configurations
   High Availability of Cluster Services
   ZooKeeper
   HDFS
   YARN
   HBase
   KMS
   Hive
   Impala
   Solr
   Kafka
   Oozie
   Hue
   Other Services
   Autoconfiguration
   Summary
13. Backup and Disaster Recovery
   Context
   Many Distributed Systems
   Policies and Objectives
   Failure Scenarios
   Suitable Data Sources
   Strategies
   Data Types
   Consistency
   Validation
   Summary
   Data Replication
   HBase
   Cluster Management Tools
   Kafka
   Summary
   Hadoop Cluster Backups
   Subsystems
   Case Study: Automating Backups with Oozie
   Restore
   Summary
Part III. Taking Hadoop to the Cloud
14. Basics of Virtualization for Hadoop
   Compute Virtualization
   Virtual Machine Distribution
   Anti-Affinity Groups
   Storage Virtualization
   Virtualizing Local Storage
   SANs
   Object Storage and Network-Attached Storage
   Network Virtualization
   Cluster Life Cycle Models
   Summary
15. Solutions for Private Clouds
   OpenStack
   Automation and Integration
   Life Cycle and Storage
   Isolation
   Summary
   OpenShift
   Automation
   Life Cycle and Storage
   Isolation
   Summary
   VMware and Pivotal Cloud Foundry
   Do It Yourself?
   Automation
   Isolation
   Life Cycle Model
   Summary
   Object Storage for Private Clouds
   EMC Isilon
   Ceph
   Summary
16. Solutions in the Public Cloud
   Key Things to Know
   Cloud Providers
   AWS
   Microsoft Azure
   Google Cloud Platform
   Implementing Clusters
   Instances
   Storage and Life Cycle Models
   Network Architecture
   High Availability
   Summary
17. Automated Provisioning
   Long-Lived Clusters
   Configuration and Templating
   Deployment Phases
   Vendor Solutions
   One-Click Deployments
   Homegrown Automation
   Hooking Into a Provisioning Life Cycle
   Scaling Up and Down
   Deploying with Security
   Transient Clusters
   Sharing Metadata Services
   Summary
A. Backup Onboarding Checklist
Index
Jan Kunigk has worked on enterprise Hadoop solutions since 2010. Before he joined Cloudera in 2014, Jan built optimized systems architectures for Hadoop at IBM and implemented a Hadoop-as-a-Service offering at T-Systems. In his current role as a Solutions Architect, he makes Hadoop projects at Cloudera's enterprise customers successful, covering the full spectrum from architectural decisions to the implementation of big data applications across all industry sectors on a day-to-day basis.

Ian Buss began his journey into distributed computing with parallel computational electromagnetics whilst studying for a PhD in photonics at the University of Bristol. After simulating LEDs on supercomputers, he made the move from big compute in academia to big data in the public sector, first encountering Hadoop in 2012. After having fun building, deploying, managing and using Hadoop clusters, Ian joined Cloudera as a Solutions Architect in 2014. His day job now involves integrating Hadoop into enterprises and making stuff work in the real world.

Paul Wilkinson has been wrestling with big data in the public sector since before Hadoop existed and was very glad when it arrived in his life in 2009. He became a Cloudera consultant in 2012, advising customers on all things Hadoop: application design, information architecture, cluster management, and infrastructure planning, the full stack. After a torrent of professional services work across financial services, cybersecurity, adtech, gaming, and government, he's seen it all, warts and all. Or at least, he hopes he has.

Lars George has been involved with Hadoop and HBase since 2007, and became a full HBase committer in 2009. He has spoken at many Hadoop User Group meetings, and at conferences such as Hadoop World, Hadoop Summit, ApacheCon, FOSDEM, and QCon. He also started the Munich OpenHUG meetings. Lars worked for Cloudera for over five years as the EMEA Chief Architect, acting as a liaison between the Cloudera professional services team and customers as well as partners in and around Europe, building the next data-driven solutions. In 2016 he started his own Hadoop advisory firm, building on what he has learned and seen in the field for more than 8 years. He is also the author of O'Reilly's "HBase: The Definitive Guide."