foreword |
|
xiii | |
preface |
|
xiv | |
acknowledgments |
|
xvii | |
about this book |
|
xix | |
about the cover illustration |
|
xxii | |
|
1 Getting started taming text |
|
|
1 | (15) |
|
1.1 Why taming text is important |
|
|
2 | (2) |
|
1.2 Preview: A fact-based question answering system |
|
|
4 | (4) |
|
|
5 | (3) |
|
1.3 Understanding text is hard |
|
|
8 | (2) |
|
|
10 | (1) |
|
1.5 Text and the intelligent app: search and beyond |
|
|
11 | (3) |
|
|
12 | (1) |
|
|
13 | (1) |
|
|
13 | (1) |
|
An intelligent application |
|
|
14 | (1) |
|
|
14 | (1) |
|
|
14 | (2) |
|
2 Foundations of taming text |
|
|
16 | (21) |
|
2.1 Foundations of language |
|
|
17 | (4) |
|
Words and their categories |
|
|
18 | (1) |
|
|
19 | (1) |
|
|
20 | (1) |
|
2.2 Common tools for text processing |
|
|
21 | (10) |
|
String manipulation tools |
|
|
21 | (1) |
|
|
22 | (2) |
|
Part of speech assignment |
|
|
24 | (1) |
|
|
25 | (2) |
|
|
27 | (1) |
|
|
28 | (2) |
|
|
30 | (1) |
|
2.3 Preprocessing and extracting content from common file formats |
|
|
31 | (5) |
|
The importance of preprocessing |
|
|
31 | (2) |
|
Extracting content using Apache Tika |
|
|
33 | (3) |
|
|
36 | (1) |
|
|
36 | (1) |
|
3 Searching |
|
|
37 | (47) |
|
3.1 Search and faceting example: Amazon.com |
|
|
38 | (2) |
|
3.2 Introduction to search concepts |
|
|
40 | (12) |
|
|
41 | (2) |
|
|
43 | (3) |
|
Ranking documents with the vector space model |
|
|
46 | (3) |
|
|
49 | (3) |
|
3.3 Introducing the Apache Solr search server |
|
|
52 | (5) |
|
Running Solr for the first time |
|
|
52 | (2) |
|
Understanding Solr concepts |
|
|
54 | (3) |
|
3.4 Indexing content with Apache Solr |
|
|
57 | (6) |
|
|
58 | (1) |
|
Extracting and indexing content using Solr and Apache Tika |
|
|
59 | (4) |
|
3.5 Searching content with Apache Solr |
|
|
63 | (6) |
|
Solr query input parameters |
|
|
64 | (3) |
|
Faceting on extracted content |
|
|
67 | (2) |
|
3.6 Understanding search performance factors |
|
|
69 | (5) |
|
|
69 | (4) |
|
|
73 | (1) |
|
3.7 Improving search performance |
|
|
74 | (8) |
|
|
74 | (1) |
|
|
75 | (1) |
|
Query performance improvements |
|
|
76 | (3) |
|
Alternative scoring models |
|
|
79 | (1) |
|
Techniques for improving Solr performance |
|
|
80 | (2) |
|
|
82 | (1) |
|
|
83 | (1) |
|
|
83 | (1) |
|
4 Fuzzy string matching |
|
|
84 | (31) |
|
4.1 Approaches to fuzzy string matching |
|
|
86 | (8) |
|
Character overlap measures |
|
|
86 | (3) |
|
|
89 | (3) |
|
|
92 | (2) |
|
4.2 Finding fuzzy string matches |
|
|
94 | (6) |
|
Using prefixes for matching with Solr |
|
|
94 | (1) |
|
Using a trie for prefix matching |
|
|
95 | (4) |
|
Using n-grams for matching |
|
|
99 | (1) |
|
4.3 Building fuzzy string matching applications |
|
|
100 | (14) |
|
Adding type-ahead to search |
|
|
101 | (4) |
|
Query spell-checking for search |
|
|
105 | (4) |
|
|
109 | (5) |
|
|
114 | (1) |
|
|
114 | (1) |
|
5 Identifying people, places, and things |
|
|
115 | (25) |
|
5.1 Approaches to named-entity recognition |
|
|
117 | (2) |
|
Using rules to identify names |
|
|
117 | (1) |
|
Using statistical classifiers to identify names |
|
|
118 | (1) |
|
5.2 Basic entity identification with OpenNLP |
|
|
119 | (4) |
|
Finding names with OpenNLP |
|
|
120 | (1) |
|
Interpreting names identified by OpenNLP |
|
|
121 | (1) |
|
Filtering names based on probability |
|
|
122 | (1) |
|
5.3 In-depth entity identification with OpenNLP |
|
|
123 | (5) |
|
Identifying multiple entity types with OpenNLP |
|
|
123 | (3) |
|
Under the hood: how OpenNLP identifies names |
|
|
126 | (2) |
|
5.4 Performance of OpenNLP |
|
|
128 | (4) |
|
|
129 | (1) |
|
|
130 | (1) |
|
|
131 | (1) |
|
5.5 Customizing OpenNLP entity identification for a new domain |
|
|
132 | (6) |
|
The whys and hows of training a model |
|
|
132 | (1) |
|
Training an OpenNLP model |
|
|
133 | (1) |
|
|
134 | (2) |
|
|
136 | (2) |
|
|
138 | (1) |
|
|
139 | (1) |
|
6 Clustering text |
|
|
140 | (35) |
|
6.1 Google News document clustering |
|
|
141 | (1) |
|
6.2 Clustering foundations |
|
|
142 | (7) |
|
Three types of text to cluster |
|
|
142 | (2) |
|
Choosing a clustering algorithm |
|
|
144 | (1) |
|
|
145 | (1) |
|
|
146 | (1) |
|
How to evaluate clustering results |
|
|
147 | (2) |
|
6.3 Setting up a simple clustering application |
|
|
149 | (1) |
|
6.4 Clustering search results using Carrot2 |
|
|
149 | (5) |
|
|
150 | (1) |
|
Clustering Solr search results |
|
|
151 | (3) |
|
6.5 Clustering document collections with Apache Mahout |
|
|
154 | (8) |
|
Preparing the data for clustering |
|
|
155 | (3) |
|
|
158 | (4) |
|
6.6 Topic modeling using Apache Mahout |
|
|
162 | (2) |
|
6.7 Examining clustering performance |
|
|
164 | (8) |
|
Feature selection and reduction |
|
|
164 | (3) |
|
Carrot2 performance and quality |
|
|
167 | (1) |
|
Mahout clustering benchmarks |
|
|
168 | (4) |
|
|
172 | (1) |
|
|
173 | (1) |
|
|
173 | (2) |
|
7 Classification, categorization, and tagging |
|
|
175 | (65) |
|
7.1 Introduction to classification and categorization |
|
|
177 | (3) |
|
7.2 The classification process |
|
|
180 | (9) |
|
Choosing a classification scheme |
|
|
181 | (1) |
|
Identifying features for text categorization |
|
|
182 | (1) |
|
The importance of training data |
|
|
183 | (3) |
|
Evaluating classifier performance |
|
|
186 | (2) |
|
Deploying a classifier into production |
|
|
188 | (1) |
|
7.3 Building document categorizers using Apache Lucene |
|
|
189 | (13) |
|
Categorizing text with Lucene |
|
|
189 | (2) |
|
Preparing the training data for the MoreLikeThis categorizer |
|
|
191 | (2) |
|
Training the MoreLikeThis categorizer |
|
|
193 | (4) |
|
Categorizing documents with the MoreLikeThis categorizer |
|
|
197 | (2) |
|
Testing the MoreLikeThis categorizer |
|
|
199 | (2) |
|
MoreLikeThis in production |
|
|
201 | (1) |
|
7.4 Training a naive Bayes classifier using Apache Mahout |
|
|
202 | (13) |
|
Categorizing text using naive Bayes classification |
|
|
202 | (2) |
|
Preparing the training data |
|
|
204 | (3) |
|
|
207 | (1) |
|
|
208 | (1) |
|
|
209 | (1) |
|
Improving the bootstrapping process |
|
|
210 | (2) |
|
Integrating the Mahout Bayes classifier with Solr |
|
|
212 | (3) |
|
7.5 Categorizing documents with OpenNLP |
|
|
215 | (12) |
|
Regression models and maximum entropy document categorization |
|
|
216 | (3) |
|
Preparing training data for the maximum entropy document categorizer |
|
|
219 | (1) |
|
Training the maximum entropy document categorizer |
|
|
220 | (4) |
|
Testing the maximum entropy document classifier |
|
|
224 | (1) |
|
Maximum entropy document categorization in production |
|
|
225 | (2) |
|
7.6 Building a tag recommender using Apache Solr |
|
|
227 | (11) |
|
Collecting training data for tag recommendations |
|
|
229 | (2) |
|
Preparing the training data |
|
|
231 | (1) |
|
Training the Solr tag recommender |
|
|
232 | (2) |
|
Creating tag recommendations |
|
|
234 | (2) |
|
Evaluating the tag recommender |
|
|
236 | (2) |
|
|
238 | (1) |
|
|
239 | (1) |
|
8 Building an example question answering system |
|
|
240 | (20) |
|
8.1 Basics of a question answering system |
|
|
242 | (1) |
|
8.2 Installing and running the QA code |
|
|
243 | (2) |
|
8.3 A sample question answering architecture |
|
|
245 | (3) |
|
8.4 Understanding questions and producing answers |
|
|
248 | (10) |
|
Training the answer type classifier |
|
|
248 | (3) |
|
|
251 | (1) |
|
Computing the answer type |
|
|
252 | (3) |
|
|
255 | (1) |
|
Ranking candidate passages |
|
|
256 | (2) |
|
8.5 Steps to improve the system |
|
|
258 | (1) |
|
|
259 | (1) |
|
|
259 | (1) |
|
9 Untamed text: exploring the next frontier |
|
|
260 | (27) |
|
9.1 Semantics, discourse, and pragmatics: exploring higher levels of NLP |
|
|
261 | (5) |
|
|
262 | (1) |
|
|
263 | (1) |
|
|
264 | (2) |
|
9.2 Document and collection summarization |
|
|
266 | (2) |
|
9.3 Relationship extraction |
|
|
268 | (5) |
|
|
270 | (2) |
|
|
272 | (1) |
|
Tools for relationship extraction |
|
|
273 | (1) |
|
9.4 Identifying important content and people |
|
|
273 | (3) |
|
Global importance and authoritativeness |
|
|
274 | (1) |
|
|
275 | (1) |
|
Resources and pointers on importance |
|
|
275 | (1) |
|
9.5 Detecting emotions via sentiment analysis |
|
|
276 | (6) |
|
|
276 | (2) |
|
|
278 | (1) |
|
A basic polarity algorithm |
|
|
279 | (1) |
|
|
280 | (1) |
|
Open source libraries for sentiment analysis |
|
|
281 | (1) |
|
9.6 Cross-language information retrieval |
|
|
282 | (2) |
|
|
284 | (1) |
|
|
284 | (3) |
Index |
|
287 | |