
E-book: Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale

  • Format: 256 pages
  • Publication date: 08-Dec-2016
  • Publisher: Addison Wesley
  • Language: English
  • ISBN-13: 9780134029726
  • Format: EPUB+DRM
  • Price: 24,41 €*
  • * This is the final price; no additional discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned, and no refunds are issued for purchased e-books.

DRM restrictions

  • Copying (copy/paste): not allowed

  • Printing: not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has supplied this book in encrypted form, which means you must install free software to unlock and read it. To read this e-book, you need to create an Adobe ID. More information is available here. The e-book can be read and downloaded on up to six devices (by a single user with the same Adobe ID).

    Required software
    To read this e-book on a mobile device (phone or tablet), you will need to install this free app: PocketBook Reader (iOS / Android).

    To download and read this e-book on a PC or Mac, you will need Adobe Digital Editions (a free app designed specifically for e-books; it is not the same as Adobe Reader, which you may already have on your computer).

    You cannot read this e-book on an Amazon Kindle.

The Complete Guide to Data Science with Hadoop---For Technical Professionals, Businesspeople, and Students

Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.

The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization.

Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP).

This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives.

Learn

  • What data science is, how it has evolved, and how to plan a data science career
  • How data volume, variety, and velocity shape data science use cases
  • Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark
  • Data importation with Hive and Spark
  • Data quality, preprocessing, preparation, and modeling
  • Visualization: surfacing insights from huge data sets
  • Machine learning: classification, regression, clustering, and anomaly detection
  • Algorithms and Hadoop tools for predictive modeling
  • Cluster analysis and similarity functions
  • Large-scale anomaly detection
  • NLP: applying data science to human language
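To give a flavor of the feature-generation material covered in the predictive modeling and NLP chapters, here is a minimal bag-of-words sketch in plain Python. This is an illustration only, not code from the book (which builds such features at scale with Spark); the whitespace tokenizer and the `bag_of_words` helper are simplified stand-ins for a real text pipeline.

```python
from collections import Counter

def tokenize(text):
    # Naive lowercase/whitespace tokenizer; real pipelines use richer text processing.
    return text.lower().split()

def bag_of_words(docs):
    """Build a sorted vocabulary and a term-frequency feature matrix from raw documents."""
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    matrix = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        # One feature row per document: the count of each vocabulary term.
        matrix.append([counts.get(tok, 0) for tok in vocab])
    return vocab, matrix

docs = ["spark is fast", "hadoop stores big data", "spark and hadoop"]
vocab, X = bag_of_words(docs)
print(vocab)  # alphabetical vocabulary shared by all rows
print(X)      # one term-frequency row per document
```

The resulting matrix is exactly the kind of "feature matrix" the book's data munging chapter describes: rows are instances (documents), columns are features (terms).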
Table of Contents

Foreword
Preface
Acknowledgments
About the Authors
I Data Science with Hadoop---An Overview
1 Introduction to Data Science
  What Is Data Science?
  Example: Search Advertising
  A Bit of Data Science History
  Statistics and Machine Learning
  Innovation from Internet Giants
  Data Science in the Modern Enterprise
  Becoming a Data Scientist
  The Data Engineer
  The Applied Scientist
  Transitioning to a Data Scientist Role
  Soft Skills of a Data Scientist
  Building a Data Science Team
  The Data Science Project Life Cycle
  Ask the Right Question
  Data Acquisition
  Data Cleaning: Taking Care of Data Quality
  Explore the Data and Design Model Features
  Building and Tuning the Model
  Deploy to Production
  Managing a Data Science Project
  Summary
2 Use Cases for Data Science
  Big Data---A Driver of Change
  Volume: More Data Is Now Available
  Variety: More Data Types
  Velocity: Fast Data Ingest
  Business Use Cases
  Product Recommendation
  Customer Churn Analysis
  Customer Segmentation
  Sales Leads Prioritization
  Sentiment Analysis
  Fraud Detection
  Predictive Maintenance
  Market Basket Analysis
  Predictive Medical Diagnosis
  Predicting Patient Re-admission
  Detecting Anomalous Record Access
  Insurance Risk Analysis
  Predicting Oil and Gas Well Production Levels
  Summary
3 Hadoop and Data Science
  What Is Hadoop?
  Distributed File System
  Resource Manager and Scheduler
  Distributed Data Processing Frameworks
  Hadoop's Evolution
  Hadoop Tools for Data Science
  Apache Sqoop
  Apache Flume
  Apache Hive
  Apache Pig
  Apache Spark
  Python
  Java Machine Learning Packages
  Why Hadoop Is Useful to Data Scientists
  Cost Effective Storage
  Schema on Read
  Unstructured and Semi-Structured Data
  Multi-Language Tooling
  Robust Scheduling and Resource Management
  Levels of Distributed Systems Abstractions
  Scalable Creation of Models
  Scalable Application of Models
  Summary
II Preparing and Visualizing Data with Hadoop
4 Getting Data Into Hadoop
  Hadoop as a Data Lake
  The Hadoop Distributed File System (HDFS)
  Direct File Transfer to Hadoop HDFS
  Importing Data from Files into Hive Tables
  Import CSV Files into Hive Tables
  Importing Data into Hive Tables Using Spark
  Import CSV Files into HIVE Using Spark
  Import a JSON File into HIVE Using Spark
  Using Apache Sqoop to Acquire Relational Data
  Data Import and Export with Sqoop
  Apache Sqoop Version Changes
  Using Sqoop V2: A Basic Example
  Using Apache Flume to Acquire Data Streams
  Using Flume: A Web Log Example Overview
  Manage Hadoop Work and Data Flows with Apache Oozie
  Apache Falcon
  What's Next in Data Ingestion?
  Summary
5 Data Munging with Hadoop
  Why Hadoop for Data Munging?
  Data Quality
  What Is Data Quality?
  Dealing with Data Quality Issues
  Using Hadoop for Data Quality
  The Feature Matrix
  Choosing the "Right" Features
  Sampling: Choosing Instances
  Generating Features
  Text Features
  Time-Series Features
  Features from Complex Data Types
  Feature Manipulation
  Dimensionality Reduction
  Summary
6 Exploring and Visualizing Data
  Why Visualize Data?
  Motivating Example: Visualizing Network Throughput
  Visualizing the Breakthrough That Never Happened
  Creating Visualizations
  Comparison Charts
  Composition Charts
  Distribution Charts
  Relationship Charts
  Using Visualization for Data Science
  Popular Visualization Tools
  R
  Python: Matplotlib, Seaborn, and Others
  SAS
  Matlab
  Julia
  Other Visualization Tools
  Visualizing Big Data with Hadoop
  Summary
III Applying Data Modeling with Hadoop
7 Machine Learning with Hadoop
  Overview of Machine Learning
  Terminology
  Task Types in Machine Learning
  Big Data and Machine Learning
  Tools for Machine Learning
  The Future of Machine Learning and Artificial Intelligence
  Summary
8 Predictive Modeling
  Overview of Predictive Modeling
  Classification Versus Regression
  Evaluating Predictive Models
  Evaluating Classifiers
  Evaluating Regression Models
  Cross Validation
  Supervised Learning Algorithms
  Building Big Data Predictive Model Solutions
  Model Training
  Batch Prediction
  Real-Time Prediction
  Example: Sentiment Analysis
  Tweets Dataset
  Data Preparation
  Feature Generation
  Building a Classifier
  Summary
9 Clustering
  Overview of Clustering
  Uses of Clustering
  Designing a Similarity Measure
  Distance Functions
  Similarity Functions
  Clustering Algorithms
  Example: Clustering Algorithms
  k-means Clustering
  Latent Dirichlet Allocation
  Evaluating the Clusters and Choosing the Number of Clusters
  Building Big Data Clustering Solutions
  Example: Topic Modeling with Latent Dirichlet Allocation
  Feature Generation
  Running Latent Dirichlet Allocation
  Summary
10 Anomaly Detection with Hadoop
  Overview
  Uses of Anomaly Detection
  Types of Anomalies in Data
  Approaches to Anomaly Detection
  Rules-based Methods
  Supervised Learning Methods
  Unsupervised Learning Methods
  Semi-Supervised Learning Methods
  Tuning Anomaly Detection Systems
  Building a Big Data Anomaly Detection Solution with Hadoop
  Example: Detecting Network Intrusions
  Data Ingestion
  Building a Classifier
  Evaluating Performance
  Summary
11 Natural Language Processing
  Natural Language Processing
  Historical Approaches
  NLP Use Cases
  Text Segmentation
  Part-of-Speech Tagging
  Named Entity Recognition
  Sentiment Analysis
  Topic Modeling
  Tooling for NLP in Hadoop
  Small-Model NLP
  Big-Model NLP
  Textual Representations
  Bag-of-Words
  Word2vec
  Sentiment Analysis Example
  Stanford CoreNLP
  Using Spark for Sentiment Analysis
  Summary
12 Data Science with Hadoop---The Next Frontier
  Automated Data Discovery
  Deep Learning
  Summary
A Book Web Page and Code Download
B HDFS Quick Start
  Quick Command Dereference
  General User HDFS Commands
  List Files in HDFS
  Make a Directory in HDFS
  Copy Files to HDFS
  Copy Files from HDFS
  Copy Files within HDFS
  Delete a File within HDFS
  Delete a Directory in HDFS
  Get an HDFS Status Report (Administrators)
  Perform an FSCK on HDFS (Administrators)
C Additional Background on Data Science and Apache Hadoop and Spark
  General Hadoop/Spark Information
  Hadoop/Spark Installation Recipes
  HDFS
  MapReduce
  Spark
  Essential Tools
  Machine Learning
Index
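As a taste of the clustering material in Chapter 9, here is a minimal k-means sketch in plain Python. This is an illustration only, not the book's code (the book applies clustering at scale with Spark); the naive first-k initialization and the toy 2-D points are assumptions made for brevity.

```python
def kmeans(points, k, iters=10):
    """Minimal Lloyd's-algorithm k-means over 2-D points (illustrative sketch)."""
    centers = list(points[:k])  # naive init: take the first k points as centers
    for _ in range(iters):
        # Assignment step: attach each point to its nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each center to the mean of its assigned points.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers

pts = [(0.0, 0.1), (0.2, 0.0), (9.8, 10.0), (10.1, 9.9)]
print(sorted(kmeans(pts, 2)))  # one center settles near each of the two point groups
```

The same assign/update loop is what distributed implementations parallelize: the assignment step is embarrassingly parallel across points, and the update step is a per-cluster aggregation.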
Ofer Mendelevitch is Vice President of Data Science at Lendup, where he is responsible for Lendup's machine learning and advanced analytics group. Prior to joining Lendup, Ofer was Director of Data Science at Hortonworks, where he was responsible for helping Hortonworks customers apply data science with Hadoop and Spark to big data across various industries, including healthcare, finance, and retail. Before Hortonworks, Ofer served as Entrepreneur in Residence at XSeed Capital, VP of Engineering at Nor1, and Director of Engineering at Yahoo!.

Casey Stella is a Principal Software Engineer focusing on data science at Hortonworks, which provides an open source Hadoop distribution. Casey's primary responsibility is leading the analytics/data science team for the Apache Metron (Incubating) Project, an open source cybersecurity project. Prior to Hortonworks, Casey was an architect at Explorys, a medical informatics startup spun out of the Cleveland Clinic. In the more distant past, Casey served as a developer at Oracle, a research geophysicist at ION Geophysical, and a poor graduate student in Mathematics at Texas A&M.

Douglas Eadline, PhD, began his career as an analytical chemist with an interest in computer methods. Starting with the first Beowulf how-to document, Doug has written hundreds of articles, white papers, and instructional documents covering many aspects of HPC and Hadoop computing. Prior to starting and editing the popular ClusterMonkey.net website in 2005, he served as editor-in-chief for ClusterWorld Magazine and was senior HPC editor for Linux Magazine. He has practical hands-on experience in many aspects of HPC and Apache Hadoop, including hardware and software design, benchmarking, storage, GPU, cloud computing, and parallel computing. Currently, he is a writer and consultant to the HPC/analytics industry and leader of the Limulus Personal Cluster Project (http://limulus.basement-supercomputing.com). He is author of the Apache Hadoop® Fundamentals LiveLessons and Apache Hadoop® YARN Fundamentals LiveLessons videos from Pearson; co-author of Apache Hadoop® YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2 and author of Hadoop® 2 Quick Start Guide: Learn the Essentials of Big Data Computing in the Apache Hadoop 2 Ecosystem, both from Addison-Wesley; and author of High Performance Computing for Dummies.