
E-book: Taming Text: How to Find, Organize, and Manipulate It

3.80/5 (220 ratings by Goodreads)
  • Pages: 320
  • Publication date: 20-Dec-2012
  • Publisher: Manning Publications
  • Language: English
  • ISBN-13: 9781638353867
  • Format: EPUB+DRM
  • Price: 39,56 €*
  • * This is the final price; no additional discounts apply.
  • This e-book is intended for personal use only. E-books cannot be returned, and money paid for purchased e-books is not refunded.

DRM RESTRICTIONS

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has supplied this book in encrypted form, which means you need to install free software in order to unlock and read it. To read this e-book you must create an Adobe ID. The e-book can be read and downloaded on up to 6 devices (by a single user with the same Adobe ID).

    Required software
    To read this e-book on a mobile device (phone or tablet), you will need to install this free app: PocketBook Reader (iOS / Android).

    To download and read this e-book on a PC or Mac, you will need Adobe Digital Editions (a free app designed specifically for e-books; it is not the same as Adobe Reader, which you may already have on your computer).

    You cannot read this e-book on an Amazon Kindle.

DESCRIPTION

It is no secret that the world is drowning in text and data. This causes real problems for everyday users who need to make sense of all the information available, and for software engineers who want to make their text-based applications more useful and user-friendly. Whether building a search engine for a corporate website, automatically organizing email, or extracting important nuggets of information from the news, dealing with unstructured text can be daunting.

Taming Text is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. It explores how to automatically organize text, using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. This book gives examples illustrating each of these topics, as well as the foundations upon which they are built.
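The book's examples are written in Java on top of open source libraries such as Apache Lucene, Solr, OpenNLP, and Mahout. As a taste of the first of those approaches, here is a minimal full-text indexing and search sketch with Apache Lucene. It is not taken from the book: it assumes a Lucene 8.x dependency (lucene-core plus lucene-queryparser), and the class name SearchSketch and the "body" field are invented for illustration.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    // A minimal full-text search sketch in the spirit of the book; not from the book itself.
    public class SearchSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();         // in-memory index
            StandardAnalyzer analyzer = new StandardAnalyzer(); // tokenizes and lowercases text

            // Index two tiny documents, each with a single searchable "body" field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                for (String text : new String[] {
                        "Taming Text covers search, clustering, and classification.",
                        "Unstructured text is everywhere; search makes it useful." }) {
                    Document doc = new Document();
                    doc.add(new TextField("body", text, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }

            // Parse a user query against the "body" field and print the matching documents.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", analyzer).parse("clustering");
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("body"));
                }
            }
        }
    }

Chapter 3 develops these ideas at larger scale with Apache Solr, which wraps Lucene behind a server interface; later chapters build the other approaches on OpenNLP, Mahout, and Carrot2.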

KEY POINTS

  • One-stop shop for learning how to process text
  • Clear, concise, and practical advice
  • Builds on high-quality open source libraries
TABLE OF CONTENTS

Foreword  xiii
Preface  xiv
Acknowledgments  xvii
About this book  xix
About the cover illustration  xxii

1 Getting started taming text  1
  1.1 Why taming text is important  2
  1.2 Preview: A fact-based question answering system  4
      Hello, Dr. Frankenstein  5
  1.3 Understanding text is hard  8
  1.4 Text, tamed  10
  1.5 Text and the intelligent app: search and beyond  11
      Searching and matching  12
      Extracting information  13
      Grouping information  13
      An intelligent application  14
  1.6 Summary  14
  1.7 Resources  14

2 Foundations of taming text  16
  2.1 Foundations of language  17
      Words and their categories  18
      Phrases and clauses  19
      Morphology  20
  2.2 Common tools for text processing  21
      String manipulation tools  21
      Tokens and tokenization  22
      Part of speech assignment  24
      Stemming  25
      Sentence detection  27
      Parsing and grammar  28
      Sequence modeling  30
  2.3 Preprocessing and extracting content from common file formats  31
      The importance of preprocessing  31
      Extracting content using Apache Tika  33
  2.4 Summary  36
  2.5 Resources  36

3 Searching  37
  3.1 Search and faceting example: Amazon.com  38
  3.2 Introduction to search concepts  40
      Indexing content  41
      User input  43
      Ranking documents with the vector space model  46
      Results display  49
  3.3 Introducing the Apache Solr search server  52
      Running Solr for the first time  52
      Understanding Solr concepts  54
  3.4 Indexing content with Apache Solr  57
      Indexing using XML  58
      Extracting and indexing content using Solr and Apache Tika  59
  3.5 Searching content with Apache Solr  63
      Solr query input parameters  64
      Faceting on extracted content  67
  3.6 Understanding search performance factors  69
      Judging quality  69
      Judging quantity  73
  3.7 Improving search performance  74
      Hardware improvements  74
      Analysis improvements  75
      Query performance improvements  76
      Alternative scoring models  79
      Techniques for improving Solr performance  80
  3.8 Search alternatives  82
  3.9 Summary  83
  3.10 Resources  83

4 Fuzzy string matching  84
  4.1 Approaches to fuzzy string matching  86
      Character overlap measures  86
      Edit distance measures  89
      N-gram edit distance  92
  4.2 Finding fuzzy string matches  94
      Using prefixes for matching with Solr  94
      Using a trie for prefix matching  95
      Using n-grams for matching  99
  4.3 Building fuzzy string matching applications  100
      Adding type-ahead to search  101
      Query spell-checking for search  105
      Record matching  109
  4.4 Summary  114
  4.5 Resources  114

5 Identifying people, places, and things  115
  5.1 Approaches to named-entity recognition  117
      Using rules to identify names  117
      Using statistical classifiers to identify names  118
  5.2 Basic entity identification with OpenNLP  119
      Finding names with OpenNLP  120
      Interpreting names identified by OpenNLP  121
      Filtering names based on probability  122
  5.3 In-depth entity identification with OpenNLP  123
      Identifying multiple entity types with OpenNLP  123
      Under the hood: how OpenNLP identifies names  126
  5.4 Performance of OpenNLP  128
      Quality of results  129
      Runtime performance  130
      Memory usage in OpenNLP  131
  5.5 Customizing OpenNLP entity identification for a new domain  132
      The whys and hows of training a model  132
      Training an OpenNLP model  133
      Altering modeling inputs  134
      A new way to model names  136
  5.6 Summary  138
  5.7 Further reading  139

6 Clustering text  140
  6.1 Google News document clustering  141
  6.2 Clustering foundations  142
      Three types of text to cluster  142
      Choosing a clustering algorithm  144
      Determining similarity  145
      Labeling the results  146
      How to evaluate clustering results  147
  6.3 Setting up a simple clustering application  149
  6.4 Clustering search results using Carrot2  149
      Using the Carrot2 API  150
      Clustering Solr search results using Carrot2  151
  6.5 Clustering document collections with Apache Mahout  154
      Preparing the data for clustering  155
      K-Means clustering  158
  6.6 Topic modeling using Apache Mahout  162
  6.7 Examining clustering performance  164
      Feature selection and reduction  164
      Carrot2 performance and quality  167
      Mahout clustering benchmarks  168
  6.8 Acknowledgments  172
  6.9 Summary  173
  6.10 References  173

7 Classification, categorization, and tagging  175
  7.1 Introduction to classification and categorization  177
  7.2 The classification process  180
      Choosing a classification scheme  181
      Identifying features for text categorization  182
      The importance of training data  183
      Evaluating classifier performance  186
      Deploying a classifier into production  188
  7.3 Building document categorizers using Apache Lucene  189
      Categorizing text with Lucene  189
      Preparing the training data for the MoreLikeThis categorizer  191
      Training the MoreLikeThis categorizer  193
      Categorizing documents with the MoreLikeThis categorizer  197
      Testing the MoreLikeThis categorizer  199
      MoreLikeThis in production  201
  7.4 Training a naive Bayes classifier using Apache Mahout  202
      Categorizing text using naive Bayes classification  202
      Preparing the training data  204
      Withholding test data  207
      Training the classifier  208
      Testing the classifier  209
      Improving the bootstrapping process  210
      Integrating the Mahout Bayes classifier with Solr  212
  7.5 Categorizing documents with OpenNLP  215
      Regression models and maximum entropy document categorization  216
      Preparing training data for the maximum entropy document categorizer  219
      Training the maximum entropy document categorizer  220
      Testing the maximum entropy document classifier  224
      Maximum entropy document categorization in production  225
  7.6 Building a tag recommender using Apache Solr  227
      Collecting training data for tag recommendations  229
      Preparing the training data  231
      Training the Solr tag recommender  232
      Creating tag recommendations  234
      Evaluating the tag recommender  236
  7.7 Summary  238
  7.8 References  239

8 Building an example question answering system  240
  8.1 Basics of a question answering system  242
  8.2 Installing and running the QA code  243
  8.3 A sample question answering architecture  245
  8.4 Understanding questions and producing answers  248
      Training the answer type classifier  248
      Chunking the query  251
      Computing the answer type  252
      Generating the query  255
      Ranking candidate passages  256
  8.5 Steps to improve the system  258
  8.6 Summary  259
  8.7 Resources  259

9 Untamed text: exploring the next frontier  260
  9.1 Semantics, discourse, and pragmatics: exploring higher levels of NLP  261
      Semantics  262
      Discourse  263
      Pragmatics  264
  9.2 Document and collection summarization  266
  9.3 Relationship extraction  268
      Overview of approaches  270
      Evaluation  272
      Tools for relationship extraction  273
  9.4 Identifying important content and people  273
      Global importance and authoritativeness  274
      Personal importance  275
      Resources and pointers on importance  275
  9.5 Detecting emotions via sentiment analysis  276
      History and review  276
      Tools and data needs  278
      A basic polarity algorithm  279
      Advanced topics  280
      Open source libraries for sentiment analysis  281
  9.6 Cross-language information retrieval  282
  9.7 Summary  284
  9.8 References  284

Index  287
ABOUT THE AUTHORS

Grant Ingersoll is an independent consultant developing search and natural language processing tools. He has worked on a number of text processing applications involving information retrieval, question answering, clustering, summarization, and categorization. Grant is a committer, as well as a speaker and trainer, on the Apache Lucene Java project and a co-founder of the Apache Mahout machine-learning project.

Thomas Morton writes software and performs research in the area of text processing and machine learning. He has been the primary developer and maintainer of the OpenNLP text processing project and Maximum Entropy machine learning project for the last 5 years. Currently, he works as a software architect for Comcast Interactive Media in Philadelphia.

Drew Farris is a professional software developer and technology consultant whose interests focus on large-scale analytics, distributed computing, and machine learning. He has contributed to a number of open source projects, including Apache Mahout, Lucene, and Solr, and holds a master's degree in Information Resource Management from Syracuse University's iSchool and a B.F.A. in Computer Graphics.