
E-book: Text as Data: A New Framework for Machine Learning and the Social Sciences

4.22/5 (70 ratings by Goodreads)
  • Format: 360 pages
  • Publication date: 04-Jan-2022
  • Publisher: Princeton University Press
  • Language: eng
  • ISBN-13: 9780691207995
  • File format: EPUB+DRM
  • Price: 40.45 €*
  • * This is the final price; no additional discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned, and no refunds are issued for purchased e-books.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has supplied this book in encrypted form, which means you must install free software to unlock and read it. To read this e-book, you will need to create an Adobe ID. More information here. The e-book can be read and downloaded on up to 6 devices (by a single user with the same Adobe ID).

    Required software
    To read this e-book on a mobile device (phone or tablet), you will need to install this free app: PocketBook Reader (iOS / Android)

    To download and read this e-book on a PC or Mac, you will need Adobe Digital Editions (this is a free app designed specifically for e-books; it is not the same as Adobe Reader, which you may already have on your computer).

    You cannot read this e-book on an Amazon Kindle.

A guide for using computational text analysis to learn about the social world

From social media posts and text messages to digital government documents and archives, researchers are bombarded with a deluge of text reflecting the social world. This textual data gives unprecedented insights into fundamental questions in the social sciences, humanities, and industry. Meanwhile, new machine learning tools are rapidly transforming the way science and business are conducted. Text as Data shows how to combine new sources of data, machine learning tools, and social science research design to develop and evaluate new insights.

Text as Data is organized around the core tasks in research projects using text—representation, discovery, measurement, prediction, and causal inference. The authors offer a sequential, iterative, and inductive approach to research design. Each research task is presented complete with real-world applications, example methods, and a distinct style of task-focused research.

Bridging many divides—computer science and social science, the qualitative and the quantitative, and industry and academia—Text as Data is an ideal resource for anyone wanting to analyze large collections of text in an era when data is abundant and computation is cheap, but the enduring challenges of social science remain.


  • Overview of how to use text as data
  • Research design for a world of data deluge
  • Examples from across the social sciences and industry
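As a small illustration of the kind of workflow the book covers (this sketch is not taken from the book itself), here is a minimal bag-of-words document-feature matrix in pure Python; the tokenization choices (lowercasing, stripping simple punctuation) are illustrative defaults, not the book's prescription:

```python
from collections import Counter

def tokenize(text):
    # Lowercase and split on whitespace; strip simple surrounding punctuation.
    return [w.strip(".,!?;:\"'()").lower() for w in text.split()]

def document_feature_matrix(docs):
    # Build a shared vocabulary across documents, then count each feature per document.
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks if w})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

docs = ["The cat sat.", "The cat sat on the mat."]
vocab, matrix = document_feature_matrix(docs)
print(vocab)   # ['cat', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each row is a document and each column a vocabulary term; this is the representation step that the book's later discovery and measurement methods build on.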

Reviews

"Among the metaverse of possible books on Text as Data that could have been published . . . I was pleased that my universe produced this one. I will assign this book as a critical part of my own course on content analysis for years to come, and it has already altered and improved the coherence of my own vocabulary and articulation for several critical choices underlying the process of turning text into data. . . . Highly recommend."---James Evans, Sociological Methods & Research

Preface xvii
Prerequisites and Notation xvii
Uses for This Book xviii
What This Book Is Not xix
PART I PRELIMINARIES
1(32)
Chapter 1 Introduction
3(10)
1.1 How This Book Informs the Social Sciences
5(3)
1.2 How This Book Informs the Digital Humanities
8(1)
1.3 How This Book Informs Data Science in Industry and Government
9(1)
1.4 A Guide to This Book
10(1)
1.5 Conclusion
11(2)
Chapter 2 Social Science Research and Text Analysis
13(20)
2.1 Discovery
15(1)
2.2 Measurement
16(1)
2.3 Inference
17(1)
2.4 Social Science as an Iterative and Cumulative Process
17(1)
2.5 An Agnostic Approach to Text Analysis
18(2)
2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media
20(2)
2.7 Six Principles of Text Analysis
22(10)
2.7.1 Social Science Theories and Substantive Knowledge are Essential for Research Design
22(2)
2.7.2 Text Analysis does not Replace Humans--It Augments Them
24(2)
2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation
26(2)
2.7.4 Text Analysis Methods Distill Generalizations from Language
28(1)
2.7.5 The Best Method Depends on the Task
29(1)
2.7.6 Validations are Essential and Depend on the Theory and the Task
30(2)
2.8 Conclusion: Text Data and Social Science
32(1)
PART II SELECTION AND REPRESENTATION
33(66)
Chapter 3 Principles of Selection and Representation
35(6)
3.1 Principle 1: Question-Specific Corpus Construction
35(1)
3.2 Principle 2: No Values-Free Corpus Construction
36(1)
3.3 Principle 3: No Right Way to Represent Text
37(1)
3.4 Principle 4: Validation
38(1)
3.5 State of the Union Addresses
38(1)
3.6 The Authorship of the Federalist Papers
39(1)
3.7 Conclusion
40(1)
Chapter 4 Selecting Documents
41(7)
4.1 Populations and Quantities of Interest
42(1)
4.2 Four Types of Bias
43(3)
4.2.1 Resource Bias
43(1)
4.2.2 Incentive Bias
44(1)
4.2.3 Medium Bias
44(1)
4.2.4 Retrieval Bias
45(1)
4.3 Considerations of "Found Data"
46(1)
4.4 Conclusion
46(2)
Chapter 5 Bag of Words
48(12)
5.1 The Bag of Words Model
48(1)
5.2 Choose the Unit of Analysis
49(1)
5.3 Tokenize
50(2)
5.4 Reduce Complexity
52(3)
5.4.1 Lowercase
52(1)
5.4.2 Remove Punctuation
52(1)
5.4.3 Remove Stop Words
53(1)
5.4.4 Create Equivalence Classes (Lemmatize/Stem)
54(1)
5.4.5 Filter by Frequency
55(1)
5.5 Construct Document-Feature Matrix
55(2)
5.6 Rethinking the Defaults
57(2)
5.6.1 Authorship of the Federalist Papers
57(1)
5.6.2 The Scale Argument against Preprocessing
58(1)
5.7 Conclusion
59(1)
Chapter 6 The Multinomial Language Model
60(10)
6.1 Multinomial Distribution
61(2)
6.2 Basic Language Modeling
63(3)
6.3 Regularization and Smoothing
66(1)
6.4 The Dirichlet Distribution
66(3)
6.5 Conclusion
69(1)
Chapter 7 The Vector Space Model and Similarity Metrics
70(8)
7.1 Similarity Metrics
70(3)
7.2 Distance Metrics
73(2)
7.3 tf-idf Weighting
75(2)
7.4 Conclusion
77(1)
Chapter 8 Distributed Representations of Words
78(12)
8.1 Why Word Embeddings
79(2)
8.2 Estimating Word Embeddings
81(5)
8.2.1 The Self-Supervision Insight
81(1)
8.2.2 Design Choices in Word Embeddings
81(1)
8.2.3 Latent Semantic Analysis
82(1)
8.2.4 Neural Word Embeddings
82(2)
8.2.5 Pretrained Embeddings
84(1)
8.2.6 Rare Words
84(1)
8.2.7 An Illustration
85(1)
8.3 Aggregating Word Embeddings to the Document Level
86(1)
8.4 Validation
87(1)
8.5 Contextualized Word Embeddings
88(1)
8.6 Conclusion
89(1)
Chapter 9 Representations from Language Sequences
90(9)
9.1 Text Reuse
90(1)
9.2 Parts of Speech Tagging
91(3)
9.2.1 Using Phrases to Improve Visualization
92(2)
9.3 Named-Entity Recognition
94(1)
9.4 Dependency Parsing
95(1)
9.5 Broader Information Extraction Tasks
96(1)
9.6 Conclusion
97(2)
PART III DISCOVERY
99(72)
Chapter 10 Principles of Discovery
103(8)
10.1 Principle 1: Context Relevance
103(1)
10.2 Principle 2: No Ground Truth
104(1)
10.3 Principle 3: Judge the Concept, Not the Method
105(1)
10.4 Principle 4: Separate Data Is Best
106(1)
10.5 Conceptualizing the US Congress
106(3)
10.6 Conclusion
109(2)
Chapter 11 Discriminating Words
111(12)
11.1 Mutual Information
112(3)
11.2 Fightin' Words
115(2)
11.3 Fictitious Prediction Problems
117(4)
11.3.1 Standardized Test Statistics as Measures of Separation
118(1)
11.3.2 χ² Test Statistics
118(3)
11.3.3 Multinomial Inverse Regression
121(1)
11.4 Conclusion
121(2)
Chapter 12 Clustering
123(24)
12.1 An Initial Example Using k-Means Clustering
124(3)
12.2 Representations for Clustering
127(1)
12.3 Approaches to Clustering
127(10)
12.3.1 Components of a Clustering Method
128(2)
12.3.2 Styles of Clustering Methods
130(2)
12.3.3 Probabilistic Clustering Models
132(2)
12.3.4 Algorithmic Clustering Models
134(3)
12.3.5 Connections between Probabilistic and Algorithmic Clustering
137(1)
12.4 Making Choices
137(7)
12.4.1 Model Selection
137(3)
12.4.2 Careful Reading
140(1)
12.4.3 Choosing the Number of Clusters
140(4)
12.5 The Human Side of Clustering
144(1)
12.5.1 Interpretation
144(1)
12.5.2 Interactive Clustering
144(1)
12.6 Conclusion
145(2)
Chapter 13 Topic Models
147(15)
13.1 Latent Dirichlet Allocation
147(4)
13.1.1 Inference
149(1)
13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases
149(2)
13.2 Interpreting the Output of Topic Models
151(2)
13.3 Incorporating Structure into LDA
153(4)
13.3.1 Structure with Upstream, Known Prevalence Covariates
154(1)
13.3.2 Structure with Upstream, Known Content Covariates
154(2)
13.3.3 Structure with Downstream, Known Covariates
156(1)
13.3.4 Additional Sources of Structure
157(1)
13.4 Structural Topic Models
157(2)
13.4.1 Example: Discovering the Components of Radical Discourse
159(1)
13.5 Labeling Topic Models
159(1)
13.6 Conclusion
160(2)
Chapter 14 Low-Dimensional Document Embeddings
162(9)
14.1 Principal Component Analysis
162(5)
14.1.1 Automated Methods for Labeling Principal Components
163(1)
14.1.2 Manual Methods for Labeling Principal Components
164(1)
14.1.3 Principal Component Analysis of Senate Press Releases
164(1)
14.1.4 Choosing the Number of Principal Components
165(2)
14.2 Classical Multidimensional Scaling
167(2)
14.2.1 Extensions of Classical MDS
168(1)
14.2.2 Applying Classical MDS to Senate Press Releases
168(1)
14.3 Conclusion
169(2)
PART IV MEASUREMENT
171(60)
Chapter 15 Principles of Measurement
173(5)
15.1 From Concept to Measurement
174(1)
15.2 What Makes a Good Measurement
174(2)
15.2.1 Principle 1: Measures should have Clear Goals
175(1)
15.2.2 Principle 2: Source Material should Always be Identified and Ideally Made Public
175(1)
15.2.3 Principle 3: The Coding Process should be Explainable and Reproducible
175(1)
15.2.4 Principle 4: The Measure should be Validated
175(1)
15.2.5 Principle 5: Limitations should be Explored, Documented and Communicated to the Audience
176(1)
15.3 Balancing Discovery and Measurement with Sample Splits
176(2)
Chapter 16 Word Counting
178(6)
16.1 Keyword Counting
178(2)
16.2 Dictionary Methods
180(1)
16.3 Limitations and Validations of Dictionary Methods
181(2)
16.3.1 Moving Beyond Dictionaries: Wordscores
182(1)
16.4 Conclusion
183(1)
Chapter 17 An Overview of Supervised Classification
184(5)
17.1 Example: Discursive Governance
185(1)
17.2 Create a Training Set
186(1)
17.3 Classify Documents with Supervised Learning
186(1)
17.4 Check Performance
187(1)
17.5 Using the Measure
187(1)
17.6 Conclusion
188(1)
Chapter 18 Coding a Training Set
189(8)
18.1 Characteristics of a Good Training Set
190(1)
18.2 Hand Coding
190(3)
18.2.1 1: Decide on a Codebook
191(1)
18.2.2 2: Select Coders
191(1)
18.2.3 3: Select Documents to Code
191(1)
18.2.4 4: Manage Coders
192(1)
18.2.5 5: Check Reliability
192(1)
18.2.6 Managing Drift
192(1)
18.2.7 Example: Making the News
192(1)
18.3 Crowdsourcing
193(2)
18.4 Supervision with Found Data
195(1)
18.5 Conclusion
196(1)
Chapter 19 Classifying Documents with Supervised Learning
197(14)
19.1 Naive Bayes
198(4)
19.1.1 The Assumptions in Naive Bayes are Almost Certainly Wrong
200(1)
19.1.2 Naive Bayes is a Generative Model
200(1)
19.1.3 Naive Bayes is a Linear Classifier
201(1)
19.2 Machine Learning
202(5)
19.2.1 Fixed Basis Functions
203(2)
19.2.2 Adaptive Basis Functions
205(1)
19.2.3 Quantification
206(1)
19.2.4 Concluding Thoughts on Supervised Learning with Random Samples
207(1)
19.3 Example: Estimating Jihad Scores
207(3)
19.4 Conclusion
210(1)
Chapter 20 Checking Performance
211(8)
20.1 Validation with Gold-Standard Data
211(3)
20.1.1 Validation Set
212(1)
20.1.2 Cross-Validation
213(1)
20.1.3 The Importance of Gold-Standard Data
213(1)
20.1.4 Ongoing Evaluations
214(1)
20.2 Validation without Gold-Standard Data
214(2)
20.2.1 Surrogate Labels
214(1)
20.2.2 Partial Category Replication
215(1)
20.2.3 Nonexpert Human Evaluation
215(1)
20.2.4 Correspondence to External Information
215(1)
20.3 Example: Validating Jihad Scores
216(1)
20.4 Conclusion
217(2)
Chapter 21 Repurposing Discovery Methods
219(12)
21.1 Unsupervised Methods Tend to Measure Subject Better than Subtleties
219(1)
21.2 Example: Scaling via Differential Word Rates
220(1)
21.3 A Workflow for Repurposing Unsupervised Methods for Measurement
221(4)
21.3.1 1: Split the Data
223(1)
21.3.2 2: Fit the Model
223(1)
21.3.3 3: Validate the Model
223(2)
21.3.4 4: Fit to the Test Data and Revalidate
225(1)
21.4 Concerns in Repurposing Unsupervised Methods for Measurement
225(4)
21.4.1 Concern 1: The Method Always Returns a Result
226(1)
21.4.2 Concern 2: Opaque Differences in Estimation Strategies
226(1)
21.4.3 Concern 3: Sensitivity to Unintuitive Hyperparameters
227(1)
21.4.4 Concern 4: Instability in results
227(1)
21.4.5 Rethinking Stability
228(1)
21.5 Conclusion
229(2)
PART V INFERENCE
231(64)
Chapter 22 Principles of Inference
233(8)
22.1 Prediction
233(1)
22.2 Causal Inference
234(4)
22.2.1 Causal Inference Places Identification First
235(1)
22.2.2 Prediction Is about Outcomes That Will Happen, Causal Inference is about Outcomes from Interventions
235(1)
22.2.3 Prediction and Causal Inference Require Different Validations
236(1)
22.2.4 Prediction and Causal Inference Use Features Differently
237(1)
22.3 Comparing Prediction and Causal Inference
238(1)
22.4 Partial and General Equilibrium in Prediction and Causal Inference
238(2)
22.5 Conclusion
240(1)
Chapter 23 Prediction
241(18)
23.1 The Basic Task of Prediction
242(1)
23.2 Similarities and Differences between Prediction and Measurement
243(1)
23.3 Five Principles of Prediction
244(5)
23.3.1 Predictive Features do not have to Cause the Outcome
244(1)
23.3.2 Cross-Validation is not Always a Good Measure of Predictive Power
244(2)
23.3.3 It's Not Always Better to be More Accurate on Average
246(1)
23.3.4 There can be Practical Value in Interpreting Models for Prediction
247(1)
23.3.5 It can be Difficult to Apply Prediction to Policymaking
247(2)
23.4 Using Text as Data for Prediction: Examples
249(8)
23.4.1 Source Prediction
249(4)
23.4.2 Linguistic Prediction
253(1)
23.4.3 Social Forecasting
254(2)
23.4.4 Nowcasting
256(1)
23.5 Conclusion
257(2)
Chapter 24 Causal Inference
259(13)
24.1 Introduction to Causal Inference
260(3)
24.2 Similarities and Differences between Prediction and Measurement, and Causal Inference
263(1)
24.3 Key Principles of Causal Inference with Text
263(3)
24.3.1 The Core Problems of Causal Inference Remain, even when Working with Text
263(1)
24.3.2 Our Conceptualization of the Treatment and Outcome Remains a Critical Component of Causal Inference with Text
264(1)
24.3.3 The Challenges of Making Causal Inferences with Text Underscore the Need for Sequential Science
264(2)
24.4 The Mapping Function
266(3)
24.4.1 Causal Inference with g
267(1)
24.4.2 Identification and Overfitting
268(1)
24.5 Workflows for Making Causal Inferences with Text
269(2)
24.5.1 Define g before Looking at the Documents
269(1)
24.5.2 Use a Train/Test Split
269(2)
24.5.3 Run Sequential Experiments
271(1)
24.6 Conclusion
271(1)
Chapter 25 Text as Outcome
272(5)
25.1 An Experiment on Immigration
272(3)
25.2 The Effect of Presidential Public Appeals
275(1)
25.3 Conclusion
276(1)
Chapter 26 Text as Treatment
277(8)
26.1 An Experiment Using Trump's Tweets
279(2)
26.2 A Candidate Biography Experiment
281(3)
26.3 Conclusion
284(1)
Chapter 27 Text as Confounder
285(10)
27.1 Regression Adjustments for Text Confounders
287(3)
27.2 Matching Adjustments for Text
290(2)
27.3 Conclusion
292(3)
PART VI CONCLUSION
295(8)
Chapter 28 Conclusion
297(6)
28.1 How to Use Text as Data in the Social Sciences
298(1)
28.1.1 The Focus on Social Science Tasks
298(1)
28.1.2 Iterative and Sequential Nature of the Social Sciences
298(1)
28.1.3 Model Skepticism and the Application of Machine Learning to the Social Sciences
299(1)
28.2 Applying Our Principles beyond Text Data
299(1)
28.3 Avoiding the Cycle of Creation and Destruction in Social Science Methodology
300(3)
Acknowledgments 303(4)
Bibliography 307(24)
Index 331
Justin Grimmer is professor of political science and a senior fellow at the Hoover Institution at Stanford University. Twitter @justingrimmer

Margaret E. Roberts is associate professor in political science and the Halıcıoğlu Data Science Institute at the University of California, San Diego. Twitter @mollyeroberts

Brandon M. Stewart is assistant professor of sociology and Arthur H. Scribner Bicentennial Preceptor at Princeton University. Twitter @b_m_stewart