
Data Science on the Google Cloud Platform: Implementing End-to-End Real-Time Data Pipelines: From Ingest to Machine Learning, 2nd edition [Paperback]

4.20/5 (10 ratings by Goodreads)
  • Format: Paperback / softback, 446 pages
  • Publication date: 30-Apr-2022
  • Publisher: O'Reilly Media
  • ISBN-10: 1098118952
  • ISBN-13: 9781098118952
  • Paperback
  • Price: €73.03*
  • * This is the final price, i.e., no additional discounts apply
  • List price: €85.92
  • Save 15%
Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build using Google Cloud Platform (GCP). This hands-on guide shows data engineers and data scientists how to implement an end-to-end data pipeline with cloud native tools on GCP.

Throughout this updated second edition, you'll work through a sample business decision by employing a variety of data science approaches. Follow along by building a data pipeline in your own project on GCP, and discover how to solve data science problems in a transformative and more collaborative way.

You'll learn how to:

  • Employ best practices in building highly scalable data and ML pipelines on Google Cloud
  • Automate and schedule data ingest using Cloud Run
  • Create and populate a dashboard in Data Studio
  • Build a real-time analytics pipeline using Pub/Sub, Dataflow, and BigQuery (see the sketch after this list)
  • Conduct interactive data exploration with BigQuery
  • Create a Bayesian model with Spark on Cloud Dataproc
  • Forecast time series and do anomaly detection with BigQuery ML
  • Aggregate within time windows with Dataflow
  • Train explainable machine learning models with Vertex AI
  • Operationalize ML with Vertex AI Pipelines
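
For a flavor of the streaming ingest the book covers, here is a minimal sketch of publishing a simulated flight event to Cloud Pub/Sub with the google-cloud-pubsub client library. The project ID, topic name, and event payload are illustrative assumptions, not the book's own code.

    import json
    from google.cloud import pubsub_v1

    # Hypothetical project and topic names; replace with your own.
    publisher = pubsub_v1.PublisherClient()
    topic_path = publisher.topic_path("my-project", "flight-events")

    # A simulated flight event; Pub/Sub message bodies must be bytes.
    event = {"flight": "AA1234", "dep_delay": 12.0,
             "timestamp": "2022-04-30T10:15:00Z"}
    future = publisher.publish(topic_path, json.dumps(event).encode("utf-8"))
    print("Published message ID:", future.result())

A Dataflow pipeline subscribed to such a topic could then window and aggregate these events before writing them to BigQuery, which is the pattern the streaming chapters develop.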
Table of contents:

Preface
1. Making Better Decisions Based on Data
  • Many Similar Decisions
  • The Role of Data Scientists
    • Scrappy Environment
    • Full Stack Cloud Data Scientists
    • Collaboration
  • Best Practices
    • Simple to Complex Solutions
    • Cloud Computing
    • Serverless
  • A Probabilistic Decision
    • Probabilistic Approach
    • Probability Density Function
    • Cumulative Distribution Function
  • Choices Made
    • Choosing Cloud
    • Not a Reference Book
    • Getting Started with the Code
  • Agile Architecture for Data Science on Google Cloud
    • What Is Agile Architecture?
    • No-Code, Low-Code
    • Use Managed Services
  • Summary
  • Suggested Resources
2. Ingesting Data into the Cloud
  • Airline On-Time Performance Data
    • Knowability
    • Causality
    • Training-Serving Skew
    • Downloading Data
    • Hub-and-Spoke Architecture
    • Dataset Fields
  • Separation of Compute and Storage
    • Scaling Up
    • Scaling Out with Sharded Data
    • Scaling Out with Data-in-Place
  • Ingesting Data
    • Reverse Engineering a Web Form
    • Dataset Download
    • Exploration and Cleanup
    • Uploading Data to Google Cloud Storage
  • Loading Data into Google BigQuery
    • Advantages of a Serverless Columnar Database
    • Staging on Cloud Storage
    • Access Control
    • Ingesting CSV Files
    • Partitioning
  • Scheduling Monthly Downloads
    • Ingesting in Python
    • Cloud Run
    • Securing Cloud Run
    • Deploying and Invoking Cloud Run
    • Scheduling Cloud Run
  • Summary
  • Code Break
  • Suggested Resources
3. Creating Compelling Dashboards
  • Explain Your Model with Dashboards
    • Why Build a Dashboard First?
    • Accuracy, Honesty, and Good Design
  • Loading Data into Cloud SQL
    • Create a Google Cloud SQL Instance
    • Create Table of Data
    • Interacting with the Database
  • Querying Using BigQuery
    • Schema Exploration
    • Using Preview
    • Using Table Explorer
    • Creating BigQuery View
  • Building Our First Model
    • Contingency Table
    • Threshold Optimization
  • Building a Dashboard
    • Getting Started with Data Studio
    • Creating Charts
    • Adding End-User Controls
    • Showing Proportions with a Pie Chart
    • Explaining a Contingency Table
  • Modern Business Intelligence
    • Digitization
    • Natural Language Queries
    • Connected Sheets
  • Summary
  • Suggested Resources
4. Streaming Data: Publication and Ingest with Pub/Sub and Dataflow
  • Designing the Event Feed
    • Transformations Needed
    • Architecture
    • Getting Airport Information
    • Sharing Data
  • Time Correction
    • Apache Beam/Cloud Dataflow
    • Parsing Airports Data
    • Adding Time Zone Information
    • Converting Times to UTC
    • Correcting Dates
    • Creating Events
    • Reading and Writing to the Cloud
    • Running the Pipeline in the Cloud
  • Publishing an Event Stream to Cloud Pub/Sub
    • Speed-Up Factor
    • Get Records to Publish
    • How Many Topics?
    • Iterating Through Records
    • Building a Batch of Events
    • Publishing a Batch of Events
  • Real-Time Stream Processing
    • Streaming in Dataflow
    • Windowing a Pipeline
    • Streaming Aggregation
    • Using Event Timestamps
    • Executing the Stream Processing
    • Analyzing Streaming Data in BigQuery
  • Real-Time Dashboard
  • Summary
  • Suggested Resources
5. Interactive Data Exploration with Vertex AI Workbench
  • Exploratory Data Analysis
    • Exploration with SQL
    • Reading a Query Explanation
  • Exploratory Data Analysis in Vertex AI Workbench
    • Jupyter Notebooks
    • Creating a Notebook
    • Jupyter Commands
    • Installing Packages
    • Jupyter Magic for Google Cloud
  • Exploring Arrival Delays
    • Basic Statistics
    • Plotting Distributions
    • Quality Control
    • Arrival Delay Conditioned on Departure Delay
  • Evaluating the Model
    • Random Shuffling
    • Splitting by Date
    • Training and Testing
  • Summary
  • Suggested Resources
6. Bayesian Classifier with Apache Spark on Cloud Dataproc
  • MapReduce and the Hadoop Ecosystem
    • How MapReduce Works
    • Apache Hadoop
  • Google Cloud Dataproc
    • Need for Higher-Level Tools
    • Jobs, Not Clusters
    • Preinstalling Software
  • Quantization Using Spark SQL
    • JupyterLab on Cloud Dataproc
    • Independence Check Using BigQuery
    • Spark SQL in JupyterLab
    • Histogram Equalization
  • Bayesian Classification
    • Bayes in Each Bin
    • Evaluating the Model
    • Dynamically Resizing Clusters
    • Comparing to Single Threshold Model
  • Orchestration
    • Submitting a Spark Job
    • Workflow Template
    • Cloud Composer
    • Autoscaling
    • Serverless Spark
  • Summary
  • Suggested Resources
7. Logistic Regression Using Spark ML
  • Logistic Regression
    • How Logistic Regression Works
    • Spark ML Library
    • Getting Started with Spark Machine Learning
  • Spark Logistic Regression
    • Creating a Training Dataset
    • Training the Model
    • Predicting Using the Model
    • Evaluating a Model
  • Feature Engineering
    • Experimental Framework
    • Feature Selection
    • Feature Transformations
    • Feature Creation
    • Categorical Variables
    • Repeatable, Real Time
  • Summary
  • Suggested Resources
8. Machine Learning with BigQuery ML
  • Logistic Regression
    • Presplit Data
    • Interrogating the Model
    • Evaluating the Model
    • Scale and Simplicity
  • Nonlinear Machine Learning
    • XGBoost
    • Hyperparameter Tuning
    • Vertex AI AutoML Tables
  • Time Window Features
    • Taxi-Out Time
    • Compounding Delays
    • Causality
  • Time Features
    • Departure Hour
    • Transform Clause
    • Categorical Variable
    • Feature Cross
  • Summary
  • Suggested Resources
9. Machine Learning with TensorFlow in Vertex AI
  • Toward More Complex Models
    • Preparing BigQuery Data for TensorFlow
    • Reading Data into TensorFlow
  • Training and Evaluation in Keras
    • Model Function
    • Features
    • Inputs
    • Training the Keras Model
    • Saving and Exporting
    • Deep Neural Network
  • Wide-and-Deep Model in Keras
    • Representing Air Traffic Corridors
    • Bucketing
    • Feature Crossing
    • Wide-and-Deep Classifier
  • Deploying a Trained TensorFlow Model to Vertex AI
    • Concepts
    • Uploading Model
    • Creating Endpoint
    • Deploying Model to Endpoint
    • Invoking the Deployed Model
  • Summary
  • Suggested Resources
10. Getting Ready for MLOps with Vertex AI
  • Developing and Deploying Using Python
    • Writing model.py
    • Writing the Training Pipeline
    • Predefined Split
    • AutoML
  • Hyperparameter Tuning
    • Parameterize Model
    • Shorten Training Run
    • Metrics During Training
    • Hyperparameter Tuning Pipeline
    • Best Trial to Completion
  • Explaining the Model
    • Configuring Explanations Metadata
    • Creating and Deploying Model
    • Obtaining Explanations
  • Summary
  • Suggested Resources
11. Time-Windowed Features for Real-Time Machine Learning
  • Time Averages
    • Apache Beam and Cloud Dataflow
    • Reading and Writing
    • Time Windowing
  • Machine Learning Training
    • Machine Learning Dataset
    • Training the Model
  • Streaming Predictions
    • Reuse Transforms
    • Input and Output
    • Invoking Model
    • Reusing Endpoint
    • Batching Predictions
  • Streaming Pipeline
    • Writing to BigQuery
    • Executing Streaming Pipeline
    • Late and Out-of-Order Records
    • Possible Streaming Sinks
  • Summary
  • Suggested Resources
12. The Full Dataset
  • Four Years of Data
    • Creating Dataset
    • Training Model
    • Evaluation
  • Summary
  • Suggested Resources
Conclusion
Appendix: Considerations for Sensitive Data Within Machine Learning Datasets
Index
Valliappa (Lak) Lakshmanan is the director of analytics and AI solutions at Google Cloud, where he leads a team building cross-industry solutions to business problems. His mission is to democratize machine learning so that it can be done by anyone, anywhere. Lak is the author or coauthor of Practical Machine Learning for Computer Vision, Machine Learning Design Patterns, Data Governance: The Definitive Guide, Google BigQuery: The Definitive Guide, and Data Science on the Google Cloud Platform.