Series Editor's Introduction |
|
xiii | |
Preface |
|
xv | |
Acknowledgments |
|
xvii | |
About the Authors |
|
xix | |
|
|
1 | (25) |
|
|
1 | (5) |
|
1.1.1 Introducing the Definitions |
|
|
2 | (2) |
|
|
4 | (1) |
|
1.1.3 File Formats to Save and Store Text Information |
|
|
5 | (1) |
|
1.2 The Two Applications Considered in This Book |
|
|
6 | (1) |
|
1.3 Introductory Example and Its Analysis Using the R Statistical Software |
|
|
7 | (15) |
|
1.4 The Introductory Example Revisited, Illustrating Concordance and Collocation Using Alternative Software |
|
|
22 | (2) |
|
|
24 | (1) |
|
|
25 | (1) |
|
Chapter 2 A Description of the Studied Text Corpora and a Discussion of Our Modeling Strategy |
|
|
26 | (10) |
|
2.1 Introduction to the Corpora: Selecting the Texts |
|
|
26 | (1) |
|
2.2 Debates of the 39th U.S. Congress, as recorded in the Congressional Globe |
|
|
27 | (2) |
|
2.3 The Territorial Papers of the United States |
|
|
29 | (3) |
|
2.4 Analyzing Text Data: Bottom-Up or Top-Down Analysis |
|
|
32 | (2) |
|
|
34 | (1) |
|
Appendix to Chapter 2: The Complete Congressional Record |
|
|
35 | (1) |
|
Chapter 3 Preparing Text for Analysis: Text Cleaning and Formatting |
|
|
36 | (13) |
|
|
36 | (7) |
|
3.1.1 Compacting Multiple Word Sets Into a Single Word |
|
|
42 | (1) |
|
|
43 | (3) |
|
3.2.1 Formatting by Marking Versus Formatting by Deleting |
|
|
44 | (1) |
|
3.2.2 Formatting Beyond Metavariables: Telling the Computer What Sections to Skip When Running the Analysis |
|
|
44 | (2) |
|
|
46 | (2) |
|
|
48 | (1) |
|
Chapter 4 Word Distributions: Document-Term Matrices of Word Frequencies and the "Bag of Words" Representation |
|
|
49 | (13) |
|
4.1 Document-Term Matrices of Frequencies |
|
|
49 | (4) |
|
4.1.1 Creating the Document-Term Matrix in R |
|
|
51 | (1) |
|
4.1.2 Dropping Sparse Words That Do Not Occur in Many Documents |
|
|
52 | (1) |
|
4.2 Displaying Word Frequencies |
|
|
53 | (3) |
|
4.3 Co-Occurrence of Terms in the Same Document |
|
|
56 | (3) |
|
4.4 The Zipf Law: An Interesting Fact About the Distribution of Word Frequencies |
|
|
59 | (2) |
|
|
61 | (1) |
|
Chapter 5 Metavariables and Text Analysis Stratified on Metavariables |
|
|
62 | (22) |
|
5.1 The Significance of Stratification and the Importance of Metavariables |
|
|
62 | (1) |
|
5.2 Analysis of the Territorial Papers |
|
|
63 | (9) |
|
5.2.1 Territorial Papers: Visualization of the Metavariables |
|
|
64 | (5) |
|
5.2.2 Territorial Papers: Stratified Text Analysis |
|
|
69 | (3) |
|
5.3 Analysis of Speeches From the 39th Congress |
|
|
72 | (11) |
|
5.3.1 Speeches From the 39th Congress: Visualization of the Metavariables |
|
|
73 | (4) |
|
5.3.2 Speeches From the 39th Congress: Stratified Text Analysis |
|
|
77 | (6) |
|
|
83 | (1) |
|
Chapter 6 Sentiment Analysis |
|
|
84 | (13) |
|
6.1 Lexicons of Sentiment-Charged Words |
|
|
84 | (4) |
|
6.1.1 Attaching Sentiment to a Document |
|
|
85 | (2) |
|
6.1.2 Sentiment Analysis for the Corpus and Its Documents |
|
|
87 | (1) |
|
6.1.3 Importance of Sentiment Analysis |
|
|
88 | (1) |
|
6.2 Applying Sentiment Analysis to the Letters of the Territorial Papers |
|
|
88 | (3) |
|
6.3 Using Other Sentiment Dictionaries and the R Software tidytext for Sentiment Analysis |
|
|
91 | (3) |
|
6.4 Concluding Remarks: An Alternative Approach for Sentiment Analysis |
|
|
94 | (1) |
|
|
95 | (2) |
|
Chapter 7 Clustering of Documents |
|
|
97 | (13) |
|
|
97 | (1) |
|
7.2 Measures for the Closeness and the Distance of Documents |
|
|
98 | (3) |
|
7.3 Methods for Clustering Documents |
|
|
101 | (5) |
|
7.3.1 Hierarchical Agglomerative Clustering and Dendrograms |
|
|
101 | (2) |
|
|
103 | (2) |
|
|
105 | (1) |
|
7.4 Illustrating Clustering Methods on a Simulated Example |
|
|
106 | (3) |
|
|
109 | (1) |
|
Chapter 8 Classification of Documents |
|
|
110 | (11) |
|
|
110 | (1) |
|
8.2 Classification Procedures |
|
|
111 | (5) |
|
8.2.1 The k-Nearest Neighbor Algorithm |
|
|
111 | (2) |
|
8.2.2 Naive Bayesian Analysis |
|
|
113 | (2) |
|
8.2.3 Fisher Linear Discriminant Method and Linear Scoring (SVM) Methods |
|
|
115 | (1) |
|
8.2.4 Evaluating Classification Rules on Hold-Out Samples |
|
|
116 | (1) |
|
8.3 Two Examples Using the Congressional Speech Database |
|
|
116 | (3) |
|
8.4 Concluding Remarks on Authorship Attribution: Commenting on the Field of Stylometry |
|
|
119 | (1) |
|
|
120 | (1) |
|
Chapter 9 Modeling Text Data: Topic Models |
|
|
121 | (21) |
|
|
121 | (9) |
|
9.1.1 Some More Technical Details and a Brief Primer on Dirichlet Distributions |
|
|
126 | (2) |
|
9.1.2 Model Extensions and Useful Software, With a Tip of the Hat to Their Developers |
|
|
128 | (1) |
|
|
129 | (1) |
|
9.2 Fitting Topic Models to the Two Corpora Studied in This Book |
|
|
130 | (10) |
|
9.2.1 Topic Models for the Corpus of the Territorial Papers |
|
|
130 | (4) |
|
9.2.2 Topic Models for the Corpus of Speeches From the 39th U.S. Congress |
|
|
134 | (6) |
|
|
140 | (2) |
|
Chapter 10 n-Grams and Other Ways of Analyzing Adjacent Words |
|
|
142 | (9) |
|
|
142 | (1) |
|
10.2 Text Windows to Measure Word Associations Within a Neighborhood of Words and a Discussion of the R Package text2vec |
|
|
143 | (3) |
|
10.3 Illustrating the Use of n-Grams: Speeches of the 39th Congress |
|
|
146 | (5) |
|
Chapter 11 Concluding Remarks |
|
|
151 | (4) |
Appendix: Listing of Website Resources |
|
155 | (6) |
Index |
|
161 | |