E-book: Taming Text: How to Find, Organize, and Manipulate It

3.80/5 (220 ratings by Goodreads)

Andrew Farris, Thomas Morton, Grant Ingersoll

Format: 320 pages
Publication date: 20-Dec-2012
Publisher: Manning Publications
Language: eng
ISBN-13: 9781638353867

Format: EPUB+DRM
Price: 39,56 €*
* This is the final price; no additional discounts apply.
This e-book is for personal use only. E-books cannot be returned, and money paid for purchased e-books is not refunded.

DRM restrictions

Copying (copy/paste): not allowed
Printing: not allowed
Usage:

Digital Rights Management (DRM)
The publisher has supplied this book in encrypted form, which means that you need to install free software to unlock and read it. To read this e-book, you must create an Adobe ID. The e-book can be read and downloaded on up to 6 devices (by a single user with the same Adobe ID).

Required software
To read this e-book on a mobile device (phone or tablet), you will need to install this free app: PocketBook Reader (iOS / Android)

To download and read this e-book on a PC or Mac, you need Adobe Digital Editions (a free app developed specifically for e-books. It is not the same as Adobe Reader, which may already be installed on your computer.)

You cannot read this e-book on an Amazon Kindle.

DESCRIPTION

It is no secret that the world is drowning in text and data. This causes real problems for everyday users who need to make sense of all the information available, and for software engineers who want to make their text-based applications more useful and user-friendly. Whether building a search engine for a corporate website, automatically organizing email, or extracting important nuggets of information from the news, dealing with unstructured text can be daunting.

Taming Text is a hands-on, example-driven guide to working with unstructured text in the context of real-world applications. It explores how to automatically organize text, using approaches such as full-text search, proper name recognition, clustering, tagging, information extraction, and summarization. This book gives examples illustrating each of these topics, as well as the foundations upon which they are built.

KEY POINTS

- One-stop shop for learning how to process text
- Clear, concise, and practical advice
- Builds on high-quality open source libraries

TABLE OF CONTENTS

Foreword
Preface
Acknowledgments
About this book
About the cover illustration

1 Getting started taming text
1.1 Why taming text is important
1.2 Preview: A fact-based question answering system
    Hello, Dr. Frankenstein
1.3 Understanding text is hard
1.4 Text, tamed
1.5 Text and the intelligent app: search and beyond
    Searching and matching
    Extracting information
    Grouping information
    An intelligent application
1.6 Summary
1.7 Resources

2 Foundations of taming text
2.1 Foundations of language
    Words and their categories
    Phrases and clauses
    Morphology
2.2 Common tools for text processing
    String manipulation tools
    Tokens and tokenization
    Part of speech assignment
    Stemming
    Sentence detection
    Parsing and grammar
    Sequence modeling
2.3 Preprocessing and extracting content from common file formats
    The importance of preprocessing
    Extracting content using Apache Tika
2.4 Summary
2.5 Resources

3 Searching
3.1 Search and faceting example: Amazon.com
3.2 Introduction to search concepts
    Indexing content
    User input
    Ranking documents with the vector space model
    Results display
3.3 Introducing the Apache Solr search server
    Running Solr for the first time
    Understanding Solr concepts
3.4 Indexing content with Apache Solr
    Indexing using XML
    Extracting and indexing content using Solr and Apache Tika
3.5 Searching content with Apache Solr
    Solr query input parameters
    Faceting on extracted content
3.6 Understanding search performance factors
    Judging quality
    Judging quantity
3.7 Improving search performance
    Hardware improvements
    Analysis improvements
    Query performance improvements
    Alternative scoring models
    Techniques for improving Solr performance
3.8 Search alternatives
3.9 Summary
3.10 Resources

4 Fuzzy string matching
4.1 Approaches to fuzzy string matching
    Character overlap measures
    Edit distance measures
    N-gram edit distance
4.2 Finding fuzzy string matches
    Using prefixes for matching with Solr
    Using a trie for prefix matching
    Using n-grams for matching
4.3 Building fuzzy string matching applications
    Adding type-ahead to search
    Query spell-checking for search
    Record matching
4.4 Summary
4.5 Resources

5 Identifying people, places, and things
5.1 Approaches to named-entity recognition
    Using rules to identify names
    Using statistical classifiers to identify names
5.2 Basic entity identification with OpenNLP
    Finding names with OpenNLP
    Interpreting names identified by OpenNLP
    Filtering names based on probability
5.3 In-depth entity identification with OpenNLP
    Identifying multiple entity types with OpenNLP
    Under the hood: how OpenNLP identifies names
5.4 Performance of OpenNLP
    Quality of results
    Runtime performance
    Memory usage in OpenNLP
5.5 Customizing OpenNLP entity identification for a new domain
    The whys and hows of training a model
    Training an OpenNLP model
    Altering modeling inputs
    A new way to model names
5.6 Summary
5.7 Further reading

6 Clustering text
6.1 Google News document clustering
6.2 Clustering foundations
    Three types of text to cluster
    Choosing a clustering algorithm
    Determining similarity
    Labeling the results
    How to evaluate clustering results
6.3 Setting up a simple clustering application
6.4 Clustering search results using Carrot2
    Using the Carrot2 API
    Clustering Solr search results using Carrot2
6.5 Clustering document collections with Apache Mahout
    Preparing the data for clustering
    K-Means clustering
6.6 Topic modeling using Apache Mahout
6.7 Examining clustering performance
    Feature selection and reduction
    Carrot2 performance and quality
    Mahout clustering benchmarks
6.8 Acknowledgments
6.9 Summary
6.10 References

7 Classification, categorization, and tagging
7.1 Introduction to classification and categorization
7.2 The classification process
    Choosing a classification scheme
    Identifying features for text categorization
    The importance of training data
    Evaluating classifier performance
    Deploying a classifier into production
7.3 Building document categorizers using Apache Lucene
    Categorizing text with Lucene
    Preparing the training data for the MoreLikeThis categorizer
    Training the MoreLikeThis categorizer
    Categorizing documents with the MoreLikeThis categorizer
    Testing the MoreLikeThis categorizer
    MoreLikeThis in production
7.4 Training a naive Bayes classifier using Apache Mahout
    Categorizing text using naive Bayes classification
    Preparing the training data
    Withholding test data
    Training the classifier
    Testing the classifier
    Improving the bootstrapping process
    Integrating the Mahout Bayes classifier with Solr
7.5 Categorizing documents with OpenNLP
    Regression models and maximum entropy document categorization
    Preparing training data for the maximum entropy document categorizer
    Training the maximum entropy document categorizer
    Testing the maximum entropy document classifier
    Maximum entropy document categorization in production
7.6 Building a tag recommender using Apache Solr
    Collecting training data for tag recommendations
    Preparing the training data
    Training the Solr tag recommender
    Creating tag recommendations
    Evaluating the tag recommender
7.7 Summary
7.8 References

8 Building an example question answering system
8.1 Basics of a question answering system
8.2 Installing and running the QA code
8.3 A sample question answering architecture
8.4 Understanding questions and producing answers
    Training the answer type classifier
    Chunking the query
    Computing the answer type
    Generating the query
    Ranking candidate passages
8.5 Steps to improve the system
8.6 Summary
8.7 Resources

9 Untamed text: exploring the next frontier
9.1 Semantics, discourse, and pragmatics: exploring higher levels of NLP
    Semantics
    Discourse
    Pragmatics
9.2 Document and collection summarization
9.3 Relationship extraction
    Overview of approaches
    Evaluation
    Tools for relationship extraction
9.4 Identifying important content and people
    Global importance and authoritativeness
    Personal importance
    Resources and pointers on importance
9.5 Detecting emotions via sentiment analysis
    History and review
    Tools and data needs
    A basic polarity algorithm
    Advanced topics
    Open source libraries for sentiment analysis
9.6 Cross-language information retrieval
9.7 Summary
9.8 References

Index

Grant Ingersoll is an independent consultant developing search and natural language processing tools. He has worked on a number of text processing applications involving information retrieval, question answering, clustering, summarization, and categorization. Grant is a committer, as well as a speaker and trainer, on the Apache Lucene Java project and a co-founder of the Apache Mahout machine-learning project.

Thomas Morton writes software and performs research in the area of text processing and machine learning. He has been the primary developer and maintainer of the OpenNLP text processing project and Maximum Entropy machine learning project for the last 5 years. Currently, he works as a software architect for Comcast Interactive Media in Philadelphia.

Drew Farris is a professional software developer and technology consultant whose interests focus on large-scale analytics, distributed computing, and machine learning. He has contributed to a number of open source projects, including Apache Mahout, Lucene, and Solr, and holds a master's degree in Information Resource Management from Syracuse University's iSchool and a B.F.A. in Computer Graphics.

Permanent link: https://www.kriso.lv/db/97816383538676e.html
