E-book: Text Mining with Machine Learning: Principles and Techniques

(Masaryk University, Czech Republic), (Mendel University, Czech Republic)
  • Format: 366 pages
  • Publication date: 31-Oct-2019
  • Publisher: CRC Press
  • ISBN-13: 9780429890260
  • Format: EPUB+DRM
  • Price: 57,60 €*
  • * this is the final price, i.e., no additional discounts are applied
  • This e-book is intended for personal use only. E-books cannot be returned, and no refunds are given for purchased e-books.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has supplied this book in encrypted form, which means you need to install free software to unlock and read it. To read this e-book you must create an Adobe ID. The e-book can be read and downloaded on up to 6 devices (by a single user with the same Adobe ID).

    Required software
    To read this e-book on a mobile device (phone or tablet), you will need to install this free app: PocketBook Reader (iOS / Android)

    To download and read this e-book on a PC or Mac, you will need Adobe Digital Editions (a free app designed specifically for e-books; it is not the same as Adobe Reader, which you may already have on your computer).

    You cannot read this e-book on an Amazon Kindle.

This book provides a perspective on the application of machine learning-based methods to knowledge discovery from natural language texts. By analysing various data sets, conclusions that are not normally evident emerge and can be used for various purposes and applications. The book explains the principles of time-proven machine learning algorithms applied in text mining, together with step-by-step demonstrations of how to reveal the semantic content of real-world datasets using the popular R language and its implemented machine learning algorithms. The book is not aimed only at IT specialists, but at a wider audience that needs to process large sets of text documents and has basic knowledge of the subject, e.g. e-mail service providers, online shoppers, librarians, etc.

The book starts with an introduction to text-based natural language data processing, its goals, and its problems. It focuses on machine learning, presenting various algorithms with their uses and possibilities, and reviewing their positives and negatives. Beginning with the initial data pre-processing, a reader can follow the steps in the R language, including the incorporation of various available packages into the resulting software tool. A big advantage is that R already contains many libraries implementing machine learning algorithms, so a reader can concentrate on the principal task without having to implement the details of the algorithms her- or himself. To make sense of the results, the book also explains the algorithms themselves, which supports the final evaluation and interpretation of the results. The examples are demonstrated using real-world data from commonly accessible Internet sources.
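To give a concrete flavour of the workflow the description outlines, here is a minimal, hypothetical sketch (not code from the book): it pre-processes a few toy documents into a bag-of-words document-term matrix and trains a naive Bayes classifier. It assumes the third-party R packages tm and e1071 are installed; the tiny corpus and class labels are invented purely for illustration.

    library(tm)      # text pre-processing and document-term matrices
    library(e1071)   # provides the naiveBayes() classifier

    # A toy corpus with made-up class labels (illustrative only)
    texts  <- c("cheap meds buy now", "meeting moved to friday",
                "win a free prize now", "quarterly report attached")
    labels <- factor(c("spam", "ham", "spam", "ham"))

    # Basic pre-processing: lower-casing, punctuation and stop-word removal
    corpus <- VCorpus(VectorSource(texts))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeWords, stopwords("english"))

    # Bag-of-words representation: documents as rows, terms as columns
    dtm   <- DocumentTermMatrix(corpus)
    train <- as.data.frame(as.matrix(dtm))

    # Train a naive Bayes model and check its predictions on the training data
    model <- naiveBayes(train, labels)
    predict(model, train)

The book itself walks through these steps in far more detail, from structured text representations and pre-processing (Chapter 3) to the individual classifiers (Chapters 5 to 11) and the evaluation of their results.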
Preface v
Authors' Biographies xiii
1 Introduction to Text Mining with Machine Learning 1(12)
1.1 Introduction 1(1)
1.2 Relation of Text Mining to Data Mining 2(3)
1.3 The Text Mining Process 5(1)
1.4 Machine Learning for Text Mining 6(3)
1.4.1 Inductive Machine Learning 8(1)
1.5 Three Fundamental Learning Directions 9(2)
1.5.1 Supervised Machine Learning 9(1)
1.5.2 Unsupervised Machine Learning 9(1)
1.5.3 Semi-supervised Machine Learning 10(1)
1.6 Big Data 11(1)
1.7 About This Book 11(2)
2 Introduction to R 13(62)
2.1 Installing R 14(1)
2.2 Running R 15(2)
2.3 RStudio 17(2)
2.3.1 Projects 18(1)
2.3.2 Getting Help 19(1)
2.4 Writing and Executing Commands 19(2)
2.5 Variables and Data Types 21(1)
2.6 Objects in R 22(9)
2.6.1 Assignment 25(1)
2.6.2 Logical Values 26(1)
2.6.3 Numbers 27(1)
2.6.4 Character Strings 28(1)
2.6.5 Special Values 29(2)
2.7 Functions 31(4)
2.8 Operators 35(1)
2.9 Vectors 36(7)
2.9.1 Creating Vectors 36(2)
2.9.2 Naming Vector Elements 38(2)
2.9.3 Operations with Vectors 40(2)
2.9.4 Accessing Vector Elements 42(1)
2.10 Matrices and Arrays 43(4)
2.11 Lists 47(2)
2.12 Factors 49(2)
2.13 Data Frames 51(4)
2.14 Functions Useful in Machine Learning 55(6)
2.15 Flow Control Structures 61(4)
2.15.1 Conditional Statement 61(3)
2.15.2 Loops 64(1)
2.16 Packages 65(2)
2.16.1 Installing Packages 66(1)
2.16.2 Loading Packages 67(1)
2.17 Graphics 67(8)
3 Structured Text Representations 75(62)
3.1 Introduction 75(4)
3.2 The Bag-of-Words Model 79(1)
3.3 The Limitations of the Bag-of-Words Model 80(3)
3.4 Document Features 83(2)
3.5 Standardization 85(5)
3.6 Texts in Different Encodings 90(2)
3.7 Language Identification 92(1)
3.8 Tokenization 92(1)
3.9 Sentence Detection 93(1)
3.10 Filtering Stop Words, Common, and Rare Terms 94(4)
3.11 Removing Diacritics 98(1)
3.12 Normalization 99(5)
3.12.1 Case Folding 99(1)
3.12.2 Stemming and Lemmatization 100(2)
3.12.3 Spelling Correction 102(2)
3.13 Annotation 104(5)
3.13.1 Part of Speech Tagging 104(3)
3.13.2 Parsing 107(2)
3.14 Calculating the Weights in the Bag-of-Words Model 109(5)
3.14.1 Local Weights 109(1)
3.14.2 Global Weights 110(1)
3.14.3 Normalization Factor 111(3)
3.15 Common Formats for Storing Structured Data 114(9)
3.15.1 Attribute-Relation File Format (ARFF) 114(1)
3.15.2 Comma-Separated Values (CSV) 115(2)
3.15.3 C5 format 117(4)
3.15.4 Matrix Files for CLUTO 121(1)
3.15.5 SVMlight Format 121(1)
3.15.6 Reading Data in R 122(1)
3.16 A Complex Example 123(14)
4 Classification 137(8)
4.1 Sample Data 137(3)
4.2 Selected Algorithms 140(2)
4.3 Classifier Quality Measurement 142(3)
5 Bayes Classifier 145(18)
5.1 Introduction 145(1)
5.2 Bayes' Theorem 146(2)
5.3 Optimal Bayes Classifier 148(1)
5.4 Naive Bayes Classifier 149(1)
5.5 Illustrative Example of Naive Bayes 150(3)
5.6 Naive Bayes Classifier in R 153(10)
5.6.1 Running Naive Bayes Classifier in RStudio 154(2)
5.6.2 Testing with an External Dataset 156(2)
5.6.3 Testing with 10-Fold Cross-Validation 158(5)
6 Nearest Neighbors 163(10)
6.1 Introduction 163(1)
6.2 Similarity as Distance 164(2)
6.3 Illustrative Example of k-NN 166(2)
6.4 k-NN in R 168(5)
7 Decision Trees 173(20)
7.1 Introduction 173(1)
7.2 Entropy Minimization-Based c5 Algorithm 174(7)
7.2.1 The Principle of Generating Trees 174(4)
7.2.2 Pruning 178(3)
7.3 C5 Tree Generator in R 181(12)
7.3.1 Generating a Tree 181(3)
7.3.2 Information Acquired from C5-Tree 184(3)
7.3.3 Using Testing Samples to Assess Tree Accuracy 187(1)
7.3.4 Using Cross-Validation to Assess Tree Accuracy 188(1)
7.3.5 Generating Decision Rules 189(4)
8 Random Forest 193(8)
8.1 Introduction 193(2)
8.1.1 Bootstrap 193(2)
8.1.2 Stability and Robustness 195(1)
8.1.3 Which Tree Algorithm? 195(1)
8.2 Random Forest in R 195(6)
9 Adaboost 201(10)
9.1 Introduction 201(1)
9.2 Boosting Principle 201(1)
9.3 Adaboost Principle 202(2)
9.4 Weak Learners 204(1)
9.5 Adaboost in R 205(6)
10 Support Vector Machines 211(12)
10.1 Introduction 211(2)
10.2 Support Vector Machines Principles 213(4)
10.2.1 Finding Optimal Separation Hyperplane 213(1)
10.2.2 Nonlinear Classification and Kernel Functions 214(1)
10.2.3 Multiclass SVM Classification 215(1)
10.2.4 SVM Summary 216(1)
10.3 SVM in R 217(6)
11 Deep Learning 223(12)
11.1 Introduction 223(2)
11.2 Artificial Neural Networks 225(2)
11.3 Deep Learning in R 227(8)
12 Clustering 235(52)
12.1 Introduction to Clustering 235(1)
12.2 Difficulties of Clustering 236(2)
12.3 Similarity Measures 238(4)
12.3.1 Cosine Similarity 239(1)
12.3.2 Euclidean Distance 240(1)
12.3.3 Manhattan Distance 240(1)
12.3.4 Chebyshev Distance 241(1)
12.3.5 Minkowski Distance 241(1)
12.3.6 Jaccard Coefficient 241(1)
12.4 Types of Clustering Algorithms 242(4)
12.4.1 Partitional (Flat) Clustering 242(1)
12.4.2 Hierarchical Clustering 243(2)
12.4.3 Graph Based Clustering 245(1)
12.5 Clustering Criterion Functions 246(3)
12.5.1 Internal Criterion Functions 247(1)
12.5.2 External Criterion Function 248(1)
12.5.3 Hybrid Criterion Functions 248(1)
12.5.4 Graph Based Criterion Functions 248(1)
12.6 Deciding on the Number of Clusters 249(2)
12.7 K-Means 251(1)
12.8 K-Medoids 252(1)
12.9 Criterion Function Optimization 253(1)
12.10 Agglomerative Hierarchical Clustering 253(4)
12.11 Scatter-Gather Algorithm 257(2)
12.12 Divisive Hierarchical Clustering 259(1)
12.13 Constrained Clustering 260(1)
12.14 Evaluating Clustering Results 261(9)
12.14.1 Metrics Based on Counting Pairs 263(1)
12.14.2 Purity 264(1)
12.14.3 Entropy 264(1)
12.14.4 F-Measure 265(1)
12.14.5 Normalized Mutual Information 266(1)
12.14.6 Silhouette 267(2)
12.14.7 Evaluation Based on Expert Opinion 269(1)
12.15 Cluster Labeling 270(1)
12.16 A Few Examples 271(16)
13 Word Embeddings 287(14)
13.1 Introduction 287(2)
13.2 Determining the Context and Word Similarity 289(2)
13.3 Context Windows 291(1)
13.4 Computing Word Embeddings 291(3)
13.5 Aggregation of Word Vectors 294(1)
13.6 An Example 295(6)
14 Feature Selection 301(22)
14.1 Introduction 301(2)
14.2 Feature Selection as State Space Search 303(1)
14.3 Feature Selection Methods 304(9)
14.3.1 Chi Squared (Χ2) 306(1)
14.3.2 Mutual Information 307(4)
14.3.3 Information Gain 311(2)
14.4 Term Elimination Based on Frequency 313(1)
14.5 Term Strength 314(1)
14.6 Term Contribution 315(1)
14.7 Entropy-Based Ranking 315(1)
14.8 Term Variance 316(1)
14.9 An Example 316(7)
References 323(24)
Index 347
Jan Žižka is a consultant in machine learning and data mining. He has worked as a system programmer, a developer of advanced software systems, and a researcher. For the last 25 years, he has devoted himself to AI and machine learning, especially text mining. He has been a faculty member at a number of universities and research institutes and has authored approximately 100 international publications.

František Dařena is an associate professor and the head of the Text Mining and NLP group at the Department of Informatics, Mendel University, Brno. He has published numerous articles in international scientific journals, conference proceedings, and monographs, and is a member of the editorial boards of several international journals. His research includes text/data mining, intelligent data processing, and machine learning.

Arnošt Svoboda is an expert programmer. His specialities include programming languages and systems such as R, Assembler, Matlab, PL/1, Cobol, Fortran, Pascal, and others. He started as a system programmer, and for the last 20 years he has also worked as a teacher and researcher at Masaryk University in Brno. His current interests are machine learning and data mining.