Section 1 Get off to a fast start |
|
|
Chapter 1 Introduction to Python for data analysis |
|
|
|
Introduction to data analysis |
|
|
4 | (6) |
|
|
4 | (2) |
|
The five phases of data analysis and visualization |
|
|
6 | (2) |
|
The IDEs for Python data analysis |
|
|
8 | (2) |
|
The Python skills that you need for data analysis |
|
|
10 | (6) |
|
How to install and import the Python modules for data analysis |
|
|
10 | (2) |
|
How to call and chain methods |
|
|
12 | (2) |
|
The coding basics for Python data analysis |
|
|
14 | (2) |
|
How to use JupyterLab as your IDE |
|
|
16 | (12) |
|
How to start JupyterLab and work with a Notebook |
|
|
16 | (2) |
|
How to edit and run the cells in a Notebook |
|
|
18 | (2) |
|
How to use the Tab completion and tooltip features |
|
|
20 | (2) |
|
How syntax and runtime errors work |
|
|
22 | (2) |
|
How to use Markdown language |
|
|
24 | (2) |
|
How to get reference information |
|
|
26 | (2) |
|
Two more skills for working with JupyterLab |
|
|
28 | (4) |
|
How to split the screen between two Notebooks |
|
|
28 | (2) |
|
How to use Magic Commands |
|
|
30 | (2) |
|
Introduction to the case studies |
|
|
32 | (14) |
|
|
32 | (2) |
|
The Forest Fires case study |
|
|
34 | (2) |
|
The Social Survey case study |
|
|
36 | (2) |
|
The Sports Analytics case study |
|
|
38 | (8) |
|
Chapter 2 The Pandas essentials for data analysis |
|
|
|
Introduction to the Pandas DataFrame |
|
|
46 | (6) |
|
|
46 | (2) |
|
Two ways to get data into a DataFrame |
|
|
48 | (2) |
|
How to save and restore a DataFrame |
|
|
50 | (2) |
|
|
52 | (6) |
|
How to display the data in a DataFrame |
|
|
52 | (2) |
|
How to use the attributes of a DataFrame |
|
|
54 | (2) |
|
How to use the info(), nunique(), and describe() methods |
|
|
56 | (2) |
|
How to access the columns and rows |
|
|
58 | (8) |
|
|
58 | (2) |
|
|
60 | (2) |
|
How to access a subset of rows and columns |
|
|
62 | (2) |
|
Another way to access a subset of rows and columns |
|
|
64 | (2) |
|
How to work with the data |
|
|
66 | (8) |
|
|
66 | (2) |
|
How to use the statistical methods |
|
|
68 | (2) |
|
How to use Python for column arithmetic |
|
|
70 | (2) |
|
How to modify the string data in columns |
|
|
72 | (2) |
|
|
74 | (6) |
|
|
74 | (2) |
|
|
76 | (2) |
|
|
78 | (2) |
|
|
80 | (12) |
|
|
80 | (2) |
|
How to aggregate the data |
|
|
82 | (2) |
|
|
84 | (8) |
|
Chapter 3 The Pandas essentials for data visualization |
|
|
|
Introduction to data visualization |
|
|
92 | (8) |
|
The Python libraries for data visualization |
|
|
92 | (2) |
|
Long vs. wide data for data visualization |
|
|
94 | (2) |
|
How the Pandas plot() method works by default |
|
|
96 | (2) |
|
The three basic parameters for the Pandas plot() method |
|
|
98 | (2) |
|
How to create 8 types of plots |
|
|
100 | (10) |
|
How to create a line plot or an area plot |
|
|
100 | (2) |
|
How to create a scatter plot |
|
|
102 | (2) |
|
|
104 | (2) |
|
How to create a histogram or a density plot |
|
|
106 | (2) |
|
How to create a box plot or a pie plot |
|
|
108 | (2) |
|
|
110 | (10) |
|
How to improve the appearance of a plot |
|
|
110 | (2) |
|
How to work with subplots |
|
|
112 | (2) |
|
How to use chaining to get the plots you want |
|
|
114 | (6) |
|
Chapter 4 The Seaborn essentials for data visualization |
|
|
|
|
120 | (8) |
|
The Seaborn methods for plotting |
|
|
120 | (2) |
|
The general methods vs. the specific methods |
|
|
122 | (2) |
|
How to use the basic Seaborn parameters |
|
|
124 | (2) |
|
How to use the Seaborn parameters for working with subplots |
|
|
126 | (2) |
|
How to enhance and save plots |
|
|
128 | (10) |
|
How to set the title, x label, and y label |
|
|
128 | (2) |
|
How to set the ticks, x limits, and y limits |
|
|
130 | (2) |
|
How to set the background style |
|
|
132 | (2) |
|
How to work with subplots |
|
|
134 | (2) |
|
|
136 | (2) |
|
How to create relational plots |
|
|
138 | (4) |
|
How to create a line plot |
|
|
138 | (2) |
|
How to create a scatter plot |
|
|
140 | (2) |
|
How to create categorical plots |
|
|
142 | (4) |
|
|
142 | (2) |
|
|
144 | (2) |
|
How to create distribution plots |
|
|
146 | (6) |
|
How to create a histogram |
|
|
146 | (2) |
|
How to create a KDE or ECDF plot |
|
|
148 | (2) |
|
How to enhance a distribution plot |
|
|
150 | (2) |
|
Other techniques for enhancing a plot |
|
|
152 | (18) |
|
How to use other Axes methods to enhance a plot |
|
|
152 | (2) |
|
|
154 | (2) |
|
How to set the color palette |
|
|
156 | (2) |
|
How to enhance a plot that has subplots |
|
|
158 | (2) |
|
How to customize the titles for subplots |
|
|
160 | (2) |
|
How to set the size of a specific plot |
|
|
162 | (8) |
Section 2 The critical skills for success on the job |
|
|
Chapter 5 How to get the data |
|
|
|
How to find the data that you want to analyze |
|
|
170 | (2) |
|
|
170 | (1) |
|
How to find and select the data that you want |
|
|
170 | (2) |
|
How to import data into a DataFrame |
|
|
172 | (6) |
|
How to import data directly into a DataFrame |
|
|
172 | (2) |
|
How to download a file to disk before importing it |
|
|
174 | (2) |
|
How to work with a zip file on disk |
|
|
176 | (2) |
|
How to get database data into a DataFrame |
|
|
178 | (4) |
|
How to run queries against a database |
|
|
178 | (2) |
|
How to use a SQL query to import data into a DataFrame |
|
|
180 | (2) |
|
How to work with a Stata file |
|
|
182 | (4) |
|
How to get and explore the metadata of a Stata file |
|
|
182 | (2) |
|
How to build DataFrames for the metadata and the data |
|
|
184 | (2) |
|
How to work with a JSON file |
|
|
186 | (12) |
|
How to download a JSON file to disk |
|
|
186 | (1) |
|
How to open a JSON file in JupyterLab |
|
|
186 | (2) |
|
How to drill down into the data |
|
|
188 | (2) |
|
How to build a DataFrame for the data |
|
|
190 | (8) |
|
Chapter 6 How to clean the data |
|
|
|
Introduction to data cleaning |
|
|
198 | (8) |
|
A general plan for cleaning the data |
|
|
198 | (2) |
|
What the info() method can tell you |
|
|
200 | (2) |
|
What the unique values can tell you |
|
|
202 | (2) |
|
What the value counts can tell you |
|
|
204 | (2) |
|
|
206 | (6) |
|
How to drop rows based on conditions |
|
|
206 | (1) |
|
How to drop duplicate rows |
|
|
206 | (2) |
|
|
208 | (2) |
|
|
210 | (2) |
|
How to find and fix missing values |
|
|
212 | (6) |
|
How to find missing values |
|
|
212 | (2) |
|
How to drop rows with missing values |
|
|
214 | (2) |
|
How to fill missing values |
|
|
216 | (2) |
|
How to fix data type problems |
|
|
218 | (10) |
|
How to find dates and numbers that are imported as objects |
|
|
218 | (2) |
|
How to convert date and time strings to the datetime data type |
|
|
220 | (2) |
|
How to convert object columns to numeric data types |
|
|
222 | (2) |
|
How to work with the category data type |
|
|
224 | (2) |
|
How to replace invalid values and convert a column's data type |
|
|
226 | (2) |
|
How to fix data problems when you import the data |
|
|
228 | (12) |
|
How to find and fix outliers |
|
|
230 | (1) |
|
|
230 | (2) |
|
|
232 | (8) |
|
Chapter 7 How to prepare the data |
|
|
|
How to add and modify columns |
|
|
240 | (6) |
|
How to work with datetime columns |
|
|
240 | (2) |
|
How to work with string columns |
|
|
242 | (1) |
|
How to work with numeric columns |
|
|
242 | (2) |
|
How to add a summary column to a DataFrame |
|
|
244 | (2) |
|
How to apply functions and lambda expressions |
|
|
246 | (8) |
|
How to apply functions to rows or columns |
|
|
246 | (2) |
|
How to apply user-defined functions |
|
|
248 | (2) |
|
How lambda expressions work with DataFrames |
|
|
250 | (2) |
|
How to apply lambda expressions |
|
|
252 | (2) |
|
|
254 | (4) |
|
How to set and remove an index |
|
|
254 | (2) |
|
How to unstack indexed data |
|
|
256 | (2) |
|
How to combine DataFrames |
|
|
258 | (8) |
|
How to join DataFrames with an inner join |
|
|
258 | (2) |
|
How to join DataFrames with a left or outer join |
|
|
260 | (2) |
|
|
262 | (2) |
|
How to concatenate DataFrames |
|
|
264 | (2) |
|
How to handle the SettingWithCopyWarning |
|
|
266 | (8) |
|
What the warning is telling you |
|
|
266 | (2) |
|
What to do when the warning is displayed |
|
|
268 | (1) |
|
What to watch for when the warning isn't displayed |
|
|
268 | (6) |
|
Chapter 8 How to analyze the data |
|
|
|
How to create and plot long data |
|
|
274 | (4) |
|
How to melt columns to create long data |
|
|
274 | (2) |
|
How to plot melted columns |
|
|
276 | (2) |
|
How to group and aggregate the data |
|
|
278 | (6) |
|
How to group and apply a single aggregate method |
|
|
278 | (2) |
|
How to work with a DataFrameGroupBy object |
|
|
280 | (2) |
|
How to apply multiple aggregate methods |
|
|
282 | (2) |
|
How to create and use pivot tables |
|
|
284 | (4) |
|
How to use the pivot() method |
|
|
284 | (2) |
|
How to use the pivot_table() method |
|
|
286 | (2) |
|
|
288 | (6) |
|
How to create bins of equal size |
|
|
288 | (2) |
|
How to create bins with equal numbers of values |
|
|
290 | (2) |
|
|
292 | (2) |
|
More skills for data analysis |
|
|
294 | (12) |
|
How to select the rows with the largest values |
|
|
294 | (2) |
|
How to calculate the percent change |
|
|
296 | (2) |
|
|
298 | (2) |
|
How to find other methods for analysis |
|
|
300 | (6) |
|
Chapter 9 How to analyze time-series data |
|
|
|
How to reindex time-series data |
|
|
306 | (10) |
|
How to generate time periods |
|
|
306 | (2) |
|
How to reindex with datetime indexes |
|
|
308 | (2) |
|
How to reindex with a semi-month index |
|
|
310 | (2) |
|
How a user-defined function can improve a datetime index |
|
|
312 | (2) |
|
How reindexing with an improved index can improve plots |
|
|
314 | (2) |
|
How to resample time-series data |
|
|
316 | (6) |
|
How to use the resample() method |
|
|
316 | (2) |
|
How to use the label and closed parameters when you downsample |
|
|
318 | (2) |
|
How downsampling can improve plots |
|
|
320 | (2) |
|
How to work with rolling windows |
|
|
322 | (6) |
|
The concept of rolling windows |
|
|
322 | (2) |
|
How to create rolling windows |
|
|
324 | (2) |
|
How to plot rolling window data |
|
|
326 | (2) |
|
How to work with running totals |
|
|
328 | (10) |
|
How to create running totals |
|
|
328 | (2) |
|
How to plot running totals |
|
|
330 | (8) |
Section 3 An introduction to predictive analysis |
|
|
Chapter 10 How to make predictions with a linear regression model |
|
|
|
Introduction to predictive analysis |
|
|
338 | (2) |
|
Types of predictive models |
|
|
338 | (1) |
|
Introduction to regression analysis |
|
|
338 | (2) |
|
How to find correlations between variables |
|
|
340 | (10) |
|
|
340 | (2) |
|
How to identify correlations with a scatter plot |
|
|
342 | (2) |
|
How to identify correlations with a grid of scatter plots |
|
|
344 | (2) |
|
How to identify correlations with r-values |
|
|
346 | (2) |
|
How to identify correlations with a heatmap |
|
|
348 | (2) |
|
How to use Scikit-learn to work with a linear regression |
|
|
350 | (10) |
|
A procedure for creating and using a regression model |
|
|
350 | (2) |
|
The function and methods for linear regression models |
|
|
352 | (2) |
|
How to create, validate, and use a linear regression model |
|
|
354 | (2) |
|
How to plot the predicted data |
|
|
356 | (2) |
|
How to plot the residuals |
|
|
358 | (2) |
|
How to plot regression models with Seaborn |
|
|
360 | (12) |
|
The lmplot() method and some of its parameters |
|
|
360 | (2) |
|
How to plot a simple linear regression |
|
|
362 | (1) |
|
How to plot a logistic regression |
|
|
362 | (2) |
|
How to plot a polynomial regression |
|
|
364 | (1) |
|
How to plot a lowess regression |
|
|
364 | (2) |
|
How to use the residplot() method to plot the residuals |
|
|
366 | (6) |
|
Chapter 11 How to make predictions with a multiple regression model |
|
|
|
A simple regression model for a Cars dataset |
|
|
372 | (6) |
|
|
372 | (2) |
|
How to create a simple regression model |
|
|
374 | (2) |
|
How to plot the residuals of a simple regression |
|
|
376 | (2) |
|
How to work with a multiple regression model |
|
|
378 | (4) |
|
How to create a multiple regression model |
|
|
378 | (2) |
|
How to plot the residuals of a multiple regression |
|
|
380 | (2) |
|
How to work with categorical variables |
|
|
382 | (10) |
|
How to identify categorical variables |
|
|
382 | (2) |
|
How to review categorical variables |
|
|
384 | (2) |
|
How to create dummy variables |
|
|
386 | (2) |
|
How to rescale the data and check the correlations |
|
|
388 | (2) |
|
How to create a multiple regression that includes dummy variables |
|
|
390 | (2) |
|
How to improve a multiple regression model |
|
|
392 | (14) |
|
How to select the independent variables |
|
|
392 | (2) |
|
How to test different combinations of variables |
|
|
394 | (2) |
|
How to use Scikit-learn to select the variables |
|
|
396 | (2) |
|
How to select the right number of variables |
|
|
398 | (8) |
Section 4 The case studies |
|
|
Chapter 12 The Polling case study |
|
|
|
|
406 | (2) |
|
Import the modules that you will need |
|
|
406 | (1) |
|
|
406 | (1) |
|
|
406 | (2) |
|
|
408 | (8) |
|
|
408 | (4) |
|
|
412 | (2) |
|
|
414 | (1) |
|
|
414 | (1) |
|
|
414 | (1) |
|
Take an early plot with Pandas |
|
|
414 | (1) |
|
|
414 | (2) |
|
|
416 | (6) |
|
Add columns for grouping and filtering |
|
|
416 | (2) |
|
Create a new DataFrame in long form |
|
|
418 | (1) |
|
Take an early plot of the long data with Seaborn |
|
|
418 | (2) |
|
Add monthly bins to the DataFrame |
|
|
420 | (1) |
|
Add an average percent column for each month |
|
|
420 | (1) |
|
Save the wide and long DataFrames |
|
|
420 | (2) |
|
|
422 | (8) |
|
Plot the national and swing state polls |
|
|
422 | (2) |
|
|
424 | (2) |
|
Plot the last two months of polling |
|
|
426 | (2) |
|
Plot the gap changes in selected states |
|
|
428 | (2) |
|
More preparation and analysis |
|
|
430 | (12) |
|
Prepare the gap data for the last week of polling |
|
|
430 | (2) |
|
Plot the gap data for the last week of polling |
|
|
432 | (2) |
|
Prepare the weekly gap data for the swing states |
|
|
434 | (2) |
|
Plot the weekly gap data for the swing states |
|
|
436 | (6) |
|
Chapter 13 The Forest Fires case study |
|
|
|
|
442 | (2) |
|
Download and unzip the SQLite database |
|
|
442 | (1) |
|
Connect and query the database |
|
|
442 | (1) |
|
Import the data into a DataFrame |
|
|
442 | (2) |
|
|
444 | (6) |
|
|
444 | (1) |
|
Improve the readability of the data |
|
|
444 | (2) |
|
|
446 | (1) |
|
|
446 | (1) |
|
Convert dates to datetime objects |
|
|
446 | (2) |
|
Check for missing contain dates |
|
|
448 | (2) |
|
|
450 | (2) |
|
Add fire_month and days_burning columns |
|
|
450 | (1) |
|
Examine the contain_date and days_burning columns |
|
|
450 | (2) |
|
|
452 | (12) |
|
Analyze the data for California |
|
|
452 | (2) |
|
Two more plots for California fires |
|
|
454 | (2) |
|
Rank the states by total acres burned |
|
|
456 | (2) |
|
Prepare a DataFrame for total acres burned by year within state |
|
|
458 | (1) |
|
Prepare a DataFrame for the top 4 states |
|
|
458 | (2) |
|
Plot the acres burned total by year for the top 4 states |
|
|
460 | (2) |
|
Review the 20 largest fires in California |
|
|
462 | (2) |
|
Use GeoPandas to plot the fires on a map |
|
|
464 | (10) |
|
Use GeoPandas to plot the California map |
|
|
464 | (2) |
|
Use GeoPandas or Seaborn to plot the California fires on a map |
|
|
466 | (2) |
|
Plot the fires in the continental United States |
|
|
468 | (6) |
|
Chapter 14 The Social Survey case study |
|
|
|
Introduction to the Social Survey |
|
|
474 | (2) |
|
Download and unzip the zip file for the data |
|
|
474 | (1) |
|
Build a DataFrame for the metadata |
|
|
474 | (2) |
|
|
476 | (10) |
|
Use the codebook and read the data that you want |
|
|
476 | (2) |
|
|
478 | (2) |
|
Plot the data and reduce the number of categories |
|
|
480 | (2) |
|
Plot the total counts of the responses |
|
|
482 | (2) |
|
Convert the counts to percents and plot them |
|
|
484 | (2) |
|
The work-life balance data |
|
|
486 | (8) |
|
Search the codebook for small question sets |
|
|
486 | (2) |
|
Read and review the work-life data |
|
|
488 | (2) |
|
Plot the responses for the first question |
|
|
490 | (2) |
|
Plot the responses for the second and third questions |
|
|
492 | (2) |
|
How to expand the scope of the analysis |
|
|
494 | (8) |
|
Use the codebook to find related columns |
|
|
494 | (2) |
|
Use the codebook to find follow-up questions |
|
|
496 | (2) |
|
Select the columns for an expanded DataFrame |
|
|
498 | (2) |
|
Bin the data for a column |
|
|
500 | (2) |
|
How to use a hypothesis to guide your analysis |
|
|
502 | (10) |
|
Develop and test a first hypothesis |
|
|
502 | (2) |
|
Develop and test a second hypothesis |
|
|
504 | (2) |
|
Develop and test a third hypothesis |
|
|
506 | (6) |
|
Chapter 15 The Sports Analytics case study |
|
|
|
Get the data and build the DataFrame |
|
|
512 | (2) |
|
|
512 | (1) |
|
|
512 | (2) |
|
|
514 | (2) |
|
Locate and drop unneeded rows |
|
|
514 | (1) |
|
Locate and drop unneeded columns |
|
|
514 | (1) |
|
Convert the game_date column to datetime data |
|
|
514 | (2) |
|
|
516 | (4) |
|
Add a column for the season |
|
|
516 | (1) |
|
Add a column for the shot result |
|
|
516 | (2) |
|
Add a column for points made for each shot |
|
|
518 | (1) |
|
Add three summary columns |
|
|
518 | (2) |
|
|
520 | (2) |
|
Plot the points per game by season |
|
|
520 | (1) |
|
Plot the averages of shots, shots made, and points per game by season |
|
|
520 | (2) |
|
|
522 | (10) |
|
Plot the shot locations for two games |
|
|
522 | (2) |
|
Plot the shot locations for two seasons |
|
|
524 | (2) |
|
Plot the shot density for one season |
|
|
526 | (2) |
|
Plot the shot density for two seasons |
|
|
528 | |
Appendix A How to set up Windows for this book |
|
|
How to install and use Anaconda |
|
|
532 | (4) |
|
|
532 | (2) |
|
How to use the Anaconda Prompt |
|
|
534 | (1) |
|
How to use the Anaconda Navigator |
|
|
534 | (2) |
|
How to install and use the files for this book |
|
|
536 | (6) |
|
How to install the files for this book |
|
|
536 | (2) |
|
How to make sure Anaconda is installed correctly |
|
|
538 | (1) |
|
How to download the large data files for this book |
|
|
538 | (4) |
Appendix B How to set up macOS for this book |
|
|
How to install and use Anaconda |
|
|
542 | (4) |
|
|
542 | (2) |
|
How to run conda commands |
|
|
544 | (1) |
|
How to use the Anaconda Navigator |
|
|
544 | (2) |
|
How to install and use the files for this book |
|
|
546 | |
|
How to install the files for this book |
|
|
546 | (2) |
|
How to make sure Anaconda is installed correctly |
|
|
548 | (1) |
|
How to download the large data files for this book |
|
|
548 | |