List of figures and tables


xiii |

Discover this textbook's online resources!


xv |

About the author


xvii |

Acknowledgements


xix |

Prologue


xxiii |

0.1 Scaling up: Thinking about programming in the social sciences


xxiii |

0.2 Who is this book for?


xxv |

0.3 Why Python (and not R, Stata, Java, C, etc.)?


xxvi |

0.3.1 How much Python should I already know?


xxvii |

0.4 What version of Python?


xxviii |

0.4.1 Part I. Thinking programmatically


xxix |

0.4.2 Part II. Accessing and converting data


xxix |

0.4.3 Part III. Interpreting data: Expectations versus observations


xxx |

0.4.4 Part IV. Social data science in practice: Four approaches


xxxi |

0.5 What about statistics?


xxxi |

0.6 Writing and coding considerations


xxxii |

0.6.1 My final tip before we go


xxxiii |
|
PART I Thinking Programmatically |
|
|
1 | (2) |
|
1 Introduction: Thinking of life at scale |
|
|
3 | (1) |
|
1.1 From social science to what? |
|
|
4 | (1) |
|
1.2 (PO)DIKW: A potential theoretical framework for data science |
|
|
5 | (4) |
|
|
5 | (3) |
|
1.2.2 From data to wisdom |
|
|
8 | (1) |
|
|
9 | (2) |
|
1.4 Fixed, variable, and marginal costs: Why not to build a barn |
|
|
11 | (3) |
|
1.4.1 From economics to data science |
|
|
11 | (2) |
|
1.4.2 The challenges of maximising fixed costs |
|
|
13 | (1) |
|
|
14 | (4) |
|
|
14 | (1) |
|
|
15 | (1) |
|
|
16 | (1) |
|
|
17 | (1) |
|
1.6 Pseudocode (and pseudo-pseudocode) |
|
|
18 | (1) |
|
1.6.1 Attempt 1. Pseudocode as written word |
|
|
18 | (1) |
|
1.6.2 Attempt 2. Pseudocode as mathematical formula |
|
|
19 | (1) |
|
1.6.3 Attempt 3. Pseudocode as written code |
|
|
19 | (1) |
|
1.6.4 Attempt 4. Slightly more formal pseudocode (in a Python style) |
|
|
19 | (1) |
|
|
19 | (1) |
|
|
20 | (1) |
|
1.9 Extensions and reflections |
|
|
21 | (2) |
|
2 The Series: Taming the distribution |
|
|
23 | (24) |
|
2.1 Introducing the Series: Python's way to store a distribution |
|
|
24 | (14) |
|
|
26 | (2) |
|
2.1.2 Working from values (and masking) |
|
|
28 | (2) |
|
2.1.3 Working from distributions |
|
|
30 | (2) |
|
2.1.4 Adding data to a Series |
|
|
32 | (3) |
|
2.1.5 Deleting data from a Series |
|
|
35 | (1) |
|
2.1.6 Working with missing data in a Series |
|
|
36 | (1) |
|
2.1.7 Getting unique values in a Series |
|
|
37 | (1) |
|
|
38 | (7) |
|
2.2.1 Changing the order of items in the Series |
|
|
38 | (1) |
|
2.2.2 Changing the type of the Series |
|
|
39 | (2) |
|
2.2.3 Changing Series values I: Arithmetic operators |
|
|
41 | (1) |
|
2.2.4 Changing Series values II: Recoding values using map |
|
|
42 | (1) |
|
2.2.5 Changing Series values III: Defining your own mapping
|
|
43 | (2) |
|
|
45 | (1) |
|
2.4 Extensions and reflections |
|
|
45 | (2) |
|
3 The DataFrame: Python's tabular format |
|
|
47 | (28) |
|
3.1 From the Series to the DataFrame |
|
|
48 | (2) |
|
3.2 A DataFrame with multiple columns |
|
|
50 | (3) |
|
3.2.1 From a list of lists |
|
|
50 | (1) |
|
|
51 | (2) |
|
3.3 Getting data from a DataFrame: Querying, masking, and slicing |
|
|
53 | (4) |
|
3.3.1 Getting data about the DataFrame itself |
|
|
53 | (1) |
|
3.3.2 Returning a single row or column |
|
|
54 | (1) |
|
3.3.3 Returning multiple columns |
|
|
55 | (1) |
|
3.3.4 Returning a single element |
|
|
55 | (1) |
|
3.3.5 Returning a slice of data |
|
|
56 | (1) |
|
3.4 Changing data at different scales |
|
|
57 | (10) |
|
3.4.1 Adding data to an existing DataFrame |
|
|
57 | (3) |
|
3.4.2 Adding one DataFrame to another |
|
|
60 | (1) |
|
3.4.3 Changing a column or the entire DataFrame: apply, map, and applymap |
|
|
61 | (4) |
|
3.4.4 Deep versus shallow copies |
|
|
65 | (2) |
|
3.5 Advanced topics: numpy and numpy arrays |
|
|
67 | (5) |
|
|
69 | (2) |
|
3.5.2 Linear algebra and numpy |
|
|
71 | (1) |
|
|
72 | (1) |
|
|
72 | (1) |
|
3.8 Extensions and reflections |
|
|
73 | (2) |
|
PART II Accessing and Converting Data |
|
|
75 | (94) |
|
4 File types: Getting data in |
|
|
77 | (26) |
|
4.1 Importing data to a DataFrame |
|
|
78 | (2) |
|
4.1.1 An important note on file organisation
|
|
79 | (1) |
|
|
79 | (1) |
|
4.2 Rectangular data: CSV |
|
|
80 | (3) |
|
4.2.1 Using the csv library |
|
|
80 | (2) |
|
4.2.2 Using the pandas CSV reader: read_csv()
|
|
82 | (1) |
|
4.3 Rectangular rich data: Excel |
|
|
83 | (3) |
|
|
86 | (5) |
|
|
87 | (4) |
|
4.5 Nested markup languages: HTML and XML |
|
|
91 | (9) |
|
4.5.1 HTML: Hypertext Markup Language |
|
|
91 | (1) |
|
4.5.2 Wikipedia as a data source |
|
|
92 | (1) |
|
|
92 | (1) |
|
4.5.4 Using Beautiful Soup (bs4) for markup data |
|
|
93 | (1) |
|
|
94 | (2) |
|
|
96 | (4) |
|
|
100 | (1) |
|
4.6.1 Long-term storage: Pickles and feather |
|
|
101 | (1) |
|
|
101 | (1) |
|
4.8 Extensions and reflections |
|
|
102 | (1) |
|
5 Merging and grouping data |
|
|
103 | (28) |
|
5.1 Combining data across tables |
|
|
104 | (1) |
|
5.2 A review of adding data to a DataFrame using concat |
|
|
104 | (6) |
|
|
104 | (3) |
|
|
107 | (2) |
|
5.2.3 Multi-level indexed data |
|
|
109 | (1) |
|
5.2.4 Transposing a DataFrame |
|
|
110 | (1) |
|
|
110 | (4) |
|
5.3.1 One-to-many versus one-to-one relationships |
|
|
111 | (3) |
|
|
114 | (4) |
|
5.4.1 A join as a kind of set logic |
|
|
114 | (2) |
|
|
116 | (1) |
|
|
116 | (1) |
|
|
117 | (1) |
|
|
117 | (1) |
|
5.5 Grouping and aggregating data |
|
|
118 | (4) |
|
|
120 | (2) |
|
5.6 Long versus wide data |
|
|
122 | (1) |
|
|
123 | (1) |
|
|
123 | (5) |
|
|
124 | (2) |
|
5.7.2 Using SQL for aggregation and filtering |
|
|
126 | (2) |
|
|
128 | (1) |
|
|
129 | (1) |
|
5.10 Extensions and reflections |
|
|
129 | (2) |
|
6 Accessing data on the World Wide Web using code |
|
|
131 | (18) |
|
6.1 Accessing data I: Remote access of webpages |
|
|
132 | (6) |
|
|
133 | (2) |
|
|
135 | (1) |
|
6.1.3 What is a web request? |
|
|
136 | (2) |
|
6.2 An example web collection task using paging |
|
|
138 | (5) |
|
6.3 Other web-related issues to consider |
|
|
143 | (1) |
|
6.3.1 When to use your own versus someone else's program |
|
|
143 | (1) |
|
6.3.2 Are there ways to simulate a browser? |
|
|
143 | (1) |
|
6.4 Ethical issues to consider |
|
|
143 | (3) |
|
6.4.1 What is public data and how public? |
|
|
143 | (1) |
|
6.4.2 Considering data minimisation as a basic ethical principle |
|
|
144 | (2) |
|
|
146 | (1) |
|
6.6 Further reading in ethics of data access and privacy |
|
|
146 | (1) |
|
6.7 Extensions and reflections |
|
|
147 | (2) |
|
7 Accessing APIs, including Twitter and Reddit |
|
|
149 | (20) |
|
7.1 Accessing APIs: Abstracting from the web |
|
|
150 | (4) |
|
7.1.1 Identifying yourself: Keys and tokens |
|
|
150 | (2) |
|
7.1.2 Securely using credentials |
|
|
152 | (2) |
|
7.2 Accessing Twitter data through the API |
|
|
154 | (6) |
|
7.2.1 Troubleshooting requests |
|
|
155 | (1) |
|
7.2.2 Access rights and Twitter |
|
|
156 | (1) |
|
7.2.3 Strategies for navigating Twitter's API |
|
|
157 | (3) |
|
7.3 Using an API wrapper to simplify data access |
|
|
160 | (3) |
|
7.3.1 Collecting Reddit data using praw |
|
|
160 | (2) |
|
7.3.2 Building a comment tree on Reddit |
|
|
162 | (1) |
|
7.4 Considerations for a data collection pipeline |
|
|
163 | (2) |
|
7.4.1 Version control systems and servers |
|
|
163 | (1) |
|
7.4.2 Storing data remotely |
|
|
164 | (1) |
|
7.4.3 Jupyter in the browser as an alternative |
|
|
164 | (1) |
|
7.5 APIs and epistemology: How data access can mean knowledge access |
|
|
165 | (2) |
|
|
167 | (1) |
|
|
167 | (1) |
|
7.8 Extensions and reflections |
|
|
168 | (1) |
|
PART III Interpreting Data: Expectations versus Observations
|
|
169 | (50) |
|
|
171 | (14) |
|
|
172 | (1) |
|
8.1.1 What is a research question? |
|
|
172 | (1) |
|
8.2 Inductive, deductive, and abductive research questions |
|
|
173 | (3) |
|
8.2.1 Deductive research questions and the null hypothesis |
|
|
174 | (1) |
|
8.2.2 Abductive reasoning and the educated guess |
|
|
175 | (1) |
|
8.3 Avoiding description: Expectation and systematic observation in science |
|
|
176 | (1) |
|
8.4 Prediction versus explanation |
|
|
177 | (2) |
|
8.4.1 Prediction and resampling |
|
|
178 | (1) |
|
8.5 Linking hypotheses to approaches |
|
|
179 | (1) |
|
|
180 | (1) |
|
8.7 Boundedness and research questions |
|
|
181 | (1) |
|
|
182 | (1) |
|
|
183 | (1) |
|
8.10 Extensions and reflections |
|
|
183 | (2) |
|
9 Visualising expectations: Comparing statistical tests and plots |
|
|
185 | (34) |
|
9.1 Introduction: Why show data? |
|
|
186 | (2) |
|
9.2 Visualising distributions |
|
|
188 | (4) |
|
9.2.1 Uniform distribution with histogram |
|
|
190 | (2) |
|
9.3 Testing a uniform distribution using a chi-squared test |
|
|
192 | (2) |
|
9.4 Testing a uniform distribution using regression |
|
|
194 | (10) |
|
9.4.1 Testing against a uniform distribution: Births in the UK |
|
|
198 | (3) |
|
9.4.2 Annotating a figure |
|
|
201 | (3) |
|
9.4.3 Normal versus skewed distributions as being interesting |
|
|
204 | (1) |
|
9.5 Comparing two distributions versus two groups |
|
|
204 | (12) |
|
9.5.1 Constraining our work based on the properties of data |
|
|
205 | (2) |
|
9.5.2 Two continuous distributions |
|
|
207 | (2) |
|
|
209 | (4) |
|
9.5.4 Comparing distinct groups |
|
|
213 | (2) |
|
|
215 | (1) |
|
9.6 Further reading in visualisation |
|
|
216 | (1) |
|
9.7 Extensions and reflections |
|
|
217 | (2) |
|
PART IV Social Data Science in Practice: Four Approaches |
|
|
219 | (124) |
|
10 Cleaning data for socially interesting features |
|
|
221 | (28) |
|
10.1 Data as a form of social context |
|
|
223 | (3) |
|
10.2 A sustained example for cleaning: Stack Exchange |
|
|
226 | (5) |
|
10.2.1 Quick summaries of the dataset |
|
|
229 | (2) |
|
|
231 | (1) |
|
10.4 Handling missing data |
|
|
232 | (1) |
|
10.5 Cleaning numeric data |
|
|
233 | (2) |
|
10.6 Cleaning up web data |
|
|
235 | (3) |
|
|
236 | (1) |
|
10.6.2 Stripping HTML from text |
|
|
236 | (1) |
|
10.6.3 Extracting links from HTML |
|
|
237 | (1) |
|
10.7 Cleaning up lists of data |
|
|
238 | (3) |
|
|
241 | (1) |
|
|
242 | (4) |
|
10.9.1 Further learning for regular expressions |
|
|
244 | (1) |
|
10.9.2 Regular expressions and ground truth |
|
|
245 | (1) |
|
|
246 | (1) |
|
|
246 | (1) |
|
|
247 | (1) |
|
10.13 Extensions and reflections |
|
|
247 | (2) |
|
11 Introducing natural language processing: Cleaning, summarising, and classifying text |
|
|
249 | (20) |
|
11.1 Reading language: Encoding text |
|
|
250 | (2) |
|
11.1.1 Key definitions in text |
|
|
251 | (1) |
|
11.2 From text to language |
|
|
252 | (1) |
|
11.3 A sample simple NLP workflow |
|
|
253 | (5) |
|
11.3.1 Preprocessing text |
|
|
254 | (4) |
|
11.4 NLP approaches to analysis |
|
|
258 | (8) |
|
11.4.1 Scoring documents with sentiment analysis |
|
|
258 | (3) |
|
11.4.2 Extracting keywords: TF-IDF scores |
|
|
261 | (2) |
|
11.4.3 Text classification |
|
|
263 | (3) |
|
|
266 | (1) |
|
|
267 | (1) |
|
11.7 Extensions and reflections |
|
|
268 | (1) |
|
12 Introducing time-series data: Showing periods and trends |
|
|
269 | (20) |
|
12.1 Introduction: It's about time |
|
|
270 | (1) |
|
12.2 Dates and the datetime module |
|
|
271 | (4) |
|
|
272 | (2) |
|
|
274 | (1) |
|
12.2.3 Localisation and time |
|
|
274 | (1) |
|
12.3 Revisiting the Movie Stack Exchange data |
|
|
275 | (1) |
|
12.4 Pandas datetime feature extraction
|
|
276 | (3) |
|
12.5 Resampling as a way to group by time period |
|
|
279 | (2) |
|
12.6 Slicing and the datetime index in pandas |
|
|
281 | (2) |
|
12.7 Moving window in data |
|
|
283 | (3) |
|
12.7.1 Missing data in a rolling window |
|
|
284 | (2) |
|
|
286 | (1) |
|
12.9 Further explorations |
|
|
287 | (1) |
|
12.10 Extensions and reflections |
|
|
288 | (1) |
|
13 Introducing network analysis: Structuring relationships |
|
|
289 | (26) |
|
13.1 Introduction: The connections that signal social structure |
|
|
290 | (1) |
|
13.1.1 Doing network analysis in Python |
|
|
291 | (1) |
|
13.2 Creating network graphs |
|
|
291 | (3) |
|
13.2.1 Selecting a graph type |
|
|
292 | (1) |
|
|
293 | (1) |
|
|
293 | (1) |
|
|
294 | (3) |
|
13.3.1 Working with distributions of attributes: The case of degree |
|
|
295 | (2) |
|
|
297 | (4) |
|
13.4.1 Considering layouts for a graph |
|
|
299 | (2) |
|
13.5 Subgroups and communities in a network |
|
|
301 | (2) |
|
13.5.1 A goodness-of-fit metric for communities |
|
|
302 | (1) |
|
13.6 Creating a network from data |
|
|
303 | (9) |
|
13.6.1 Whole networks versus partial networks |
|
|
305 | (1) |
|
|
306 | (2) |
|
13.6.3 Bipartite networks |
|
|
308 | (4) |
|
|
312 | (1) |
|
|
313 | (1) |
|
13.9 Extensions and reflections |
|
|
314 | (1) |
|
14 Introducing geographic information systems: Data across space and place |
|
|
315 | (24) |
|
14.1 Introduction: From space to place |
|
|
316 | (1) |
|
14.2 Kinds of spatial data |
|
|
316 | (10) |
|
14.2.1 From a sphere to a rectangle |
|
|
317 | (2) |
|
14.2.2 Mapping places onto spaces |
|
|
319 | (2) |
|
14.2.3 Introducing the geopandas GeoDataFrame |
|
|
321 | (2) |
|
14.2.4 Splitting the data into intervals using mapclassify
|
|
323 | (2) |
|
|
325 | (1) |
|
14.3 Creating your own GeoDataFrame |
|
|
326 | (8) |
|
14.3.1 Loading your own maps |
|
|
327 | (2) |
|
14.3.2 Linking maps to other data sources |
|
|
329 | (5) |
|
|
334 | (1) |
|
14.5 Further topics and reading |
|
|
335 | (1) |
|
14.6 Extensions and reflections |
|
|
336 | (3) |
|
15 Conclusion: There (to data science) and back again (to social science) |
|
|
339 | (4) |
References |
|
343 | (10) |
Index |
|
353 | |