Search icon CANCEL
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Conferences
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Machine Learning in Java

You're reading from   Machine Learning in Java Helpful techniques to design, build, and deploy powerful machine learning applications in Java

Arrow left icon
Product type Paperback
Published in Nov 2018
Publisher Packt
ISBN-13 9781788474399
Length 300 pages
Edition 2nd Edition
Languages
Tools
Arrow right icon
Authors (2):
Arrow left icon
Ashish Bhatia Ashish Bhatia
Author Profile Icon Ashish Bhatia
Ashish Bhatia
Bostjan Kaluza Bostjan Kaluza
Author Profile Icon Bostjan Kaluza
Bostjan Kaluza
Arrow right icon
View More author details
Toc

Table of Contents (13) Chapters Close

Preface 1. Applied Machine Learning Quick Start FREE CHAPTER 2. Java Libraries and Platforms for Machine Learning 3. Basic Algorithms - Classification, Regression, and Clustering 4. Customer Relationship Prediction with Ensembles 5. Affinity Analysis 6. Recommendation Engines with Apache Mahout 7. Fraud and Anomaly Detection 8. Image Recognition with Deeplearning4j 9. Activity Recognition with Mobile Phone Sensors 10. Text Mining with Mallet - Topic Modeling and Spam Detection 11. What Is Next? 12. Other Books You May Enjoy

Working with text data

One of the main challenges in text mining is transforming unstructured written natural language into structured attribute-based instances. The process involves many steps, as shown here:

First, we extract some text from the internet, existing documents, or databases. At the end of the first step, the text could still be present in the XML format or some other proprietary format. The next step is to extract the actual text and segment it into parts of the document, for example, title, headline, abstract, and body. The third step is involved with normalizing text encoding to ensure the characters are presented in the same way; for example, documents encoded in formats such as ASCII, ISO 8859-1 and Windows-1250 are transformed into Unicode encoding. Next, tokenization splits the document into particular words, while the next step removes frequent words that...

lock icon The rest of the chapter is locked
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at €18.99/month. Cancel anytime