
You're reading from Clojure for Data Science

Product type: Book

Published in: Sep 2015

Publisher: Packt

ISBN-13: 9781784397180

Pages: 608

Edition: 1st Edition

Languages: Clojure

Concepts: Data Analysis

Author: Henry Garner

Table of Contents (18) Chapters

Clojure for Data Science

Credits

About the Author

Acknowledgments

About the Reviewer

www.PacktPub.com

Preface

1. Statistics

2. Inference

3. Correlation

4. Classification

5. Big Data

6. Clustering

7. Recommender Systems

8. Network Analysis

9. Time Series

10. Visualization

Index

A

A* algorithm
- URL / Finding the shortest path
Abracad library
- URL / Distributed unique IDs with Hadoop
AcmeContent
- about / Introducing AcmeContent
- sample code / Download the sample code
acyclic / Visualizing graphs with Loom
Adaptive Boosting (AdaBoost) / Bagging and boosting
Akaike Information Criterion (AIC)
- models, identifying / Identifying better models with Akaike Information Criterion
- about / Identifying better models with Akaike Information Criterion
ALS
- movie recommendations / Movie recommendations with alternating least squares
- using, with Spark / ALS with Spark and MLlib
- using, with MLlib / ALS with Spark and MLlib
- used, for making predictions / Making predictions with ALS
- evaluating / Evaluating ALS
Anscombe's Quartet / The importance of visualizations
Apache Commons Math
- about / Estimating the maximum likelihood, Nelder-Mead optimization with Apache Commons Math
- URL / Estimating the maximum likelihood, Nelder-Mead optimization with Apache Commons Math
- used, for Nelder-Mead optimization / Nelder-Mead optimization with Apache Commons Math
ARMA model order
- determining, with ACF and PACF / Determining ARMA model order with ACF and PACF
autocorrelation function (ACF)
- about / Determining autocorrelation in AR models
- ARMA model order, determining / Determining ARMA model order with ACF and PACF
- plotting, of airline data / ACF and PACF of airline data
autocovariance
- about / Autocovariance
autoregressive (AR) models
- about / Autoregressive models
- autocorrelation, determining / Determining autocorrelation in AR models
- combining, with Moving Average (MA) models / Combining the AR and MA models
Autoregressive Integrated Moving Average (ARIMA) model
- about / Removing seasonality with differencing

B

B1
- about / B1
- URL / B1
bag-of-words / The bag-of-words and Euclidean distance
bagging
- about / Bagging and boosting
balanced F-score / F-measure and the harmonic mean
batch gradient descent
- about / Stochastic gradient descent
Bayesian view / Probability
Bayes theorem
- about / Bayes theorem
- with multiple predictors / Bayes theorem with multiple predictors, Naive Bayes classification
bias
- about / Bias
- high bias, addressing / Addressing high bias
bias term / Multiple linear regression
big data
- code, downloading / Downloading the code and data
- example code, URL / Downloading the code and data
- inspecting / Inspecting the data
- records, counting / Counting the records
bigrams
- about / Better clustering with n-grams
bimodal
- about / Visualizing different populations
binning
- about / Binning data
binomial distribution
- about / The binomial distribution
bipartite / Visualizing graphs with Loom
bivariate linear regression / Multiple linear regression
Bloom filters
- used, for testing large sets membership / Testing set membership with Bloom filters
Bonferroni correction
- about / The Bonferroni correction
boosting
- about / Bagging and boosting
bounce
- about / Introducing AcmeContent
box and whisker plots
- about / Box plots
breadth-first search / Breadth-first and depth-first search

C

C4.5 algorithm / Building a decision tree in clj-ml
categorical variables / Categorical and dummy variables
central limit theorem / The central limit theorem
Chi-squared multiple significance testing
- about / Chi-squared multiple significance testing
- categories, visualizing / Visualizing the categories
- chi-squared test / The chi-squared test, The chi-squared test
- chi-squared statistic / The chi-squared statistic
chi-squared statistic / The chi-squared statistic
chi-squared test / The chi-squared test, The chi-squared test
classifier
- data / About the data
- data, inspecting / Inspecting the data
- relative risk and odds, comparing with / Comparisons with relative risk and odds
- saving, to file / Saving the classifier to a file
clj-ml
- classification with / Classification with clj-ml
- URL / Classification with clj-ml
- data, loading with / Loading data with clj-ml
- decision tree, building / Building a decision tree in clj-ml
clj-time library
- URL / Inspecting the airline data
clojure-opennlp library
- URL / Tokenizing the Reuters files
Clojure libraries
- URL / Exploratory data visualization
Clojure library succession
- URL / Calculating the likelihood
Clojure library Tesser
- URL / Mathematical folds with Tesser
Clojure reducers library
- URL / Counting the records
- about / The reducers library
- parallel folds / Parallel folds with reducers
- large files, loading with iota / Loading large files with iota
- reducers processing pipeline, creating / Creating a reducers processing pipeline
- curried reductions, with reducers / Curried reductions with reducers
- statistical folds / Statistical folds with reducers
- associativity / Associativity
- mean calculating, fold used / Calculating the mean using fold
- variance calculating, fold used / Calculating the variance using fold
cluster evaluation, measures
- about / Cluster evaluation measures
- inter-cluster density / Inter-cluster density
- intra-cluster density / Intra-cluster density
- root mean square error, calculating with Parkour / Calculating the root mean square error with Parkour
- clustered points and centroids, loading / Loading clustered points and centroids
- cluster RMSE, calculating / Calculating the cluster RMSE
- optimal k, determining with elbow method / Determining optimal k with the elbow method
- optimal k, determining with Dunn index / Determining optimal k with the Dunn index
- optimal k, determining with Davies-Bouldin index / Determining optimal k with the Davies-Bouldin index
clustering
- data, downloading / Downloading the data
- data, extracting / Extracting the data
- data, inspecting / Inspecting the data
clustering, text
- about / Clustering text
- set-of-words / Set-of-words and the Jaccard index
- Jaccard index / Set-of-words and the Jaccard index
- Reuters files, tokenizing / Tokenizing the Reuters files
- text, representing as vectors / Representing text as vectors
- dictionary, creating / Creating a dictionary
cluster RMSE
- calculating / Calculating the cluster RMSE
code
- downloading / Download the code and data
- downloading, URL / Download the code and data
coefficient of determination / Goodness-of-fit and R-square
coefficient of multiple determination / Multiple R-squared
collinearity
- about / Collinearity
- multicollinearity / Multicollinearity
columns
- adding / Adding columns, Adding derived columns
combinations function
- URL / Determining optimal k with the Dunn index
communities, with label propagation
- detecting / Detecting communities with label propagation
- map vertices / Step one – map vertices
- vertex attribute, sending / Step two – send the vertex attribute
- aggregate value / Step three – aggregate value
- vertex function / Step four – vertex function
- maximum iterations count, setting / Step five – set the maximum iterations count
comparative visualizations
- about / Comparative visualizations
- box and whisker plots / Box plots
- cumulative distribution functions / Cumulative distribution functions
- probability mass function (PMF) / Probability mass functions
- scatter plots / Scatter plots
- scatter transparency / Scatter transparency
confidence interval
- about / Confidence intervals, Confidence intervals
confounding variables
- about / Regression
confusion matrix / The confusion matrix
connected components
- running / Running connected components
- largest connected component, size calculating / Calculating the size of the largest connected component
connected components, with Pregel API
- about / Connected components with the Pregel API
- map vertices / Step one – map vertices
- message function / Steps two and three – the message function
- attributes, updating / Step four – update the attributes
- convergence, iterating to / Step five – iterate to convergence
construction
- about / Construction
content-based filtering / Types of recommender systems
content distribution network (CDN) / jStat
covariance
- about / Covariance
- calculating, with Tesser / Calculating covariance with Tesser
cross-validation
- about / Cross-validation
cumulative distribution function (CDF)
- about / Hypothesis testing
cumulative distribution functions (CDFs)
- about / Cumulative distribution functions

D

daily means distribution
- about / The distribution of daily means
data
- inspecting / Load and inspect the data, Inspecting the data, Inspect the data
- loading / Load and inspect the data
- about / About the data
- Guardian's excellent data blog, URL / About the data
- visualizing / Visualizing the data
- downloading / Download the code and data
- downloading, URL / Download the code and data
- parsing / Parse the data
data scrubbing
- about / Data scrubbing
Davies-Bouldin index
- used, for determining optimal k / Determining optimal k with the Davies-Bouldin index
decision trees
- about / Decision trees
- information / Information
- entropy / Entropy
- information gain / Information gain
- information gain, using to identify best predictor / Using information gain to identify the best predictor
- building, recursively / Recursively building a decision tree
- using, for classification / Using the decision tree for classification
- classifier, evaluating / Evaluating the decision tree classifier
- building, in clj-ml / Building a decision tree in clj-ml
degenerate matrices / Inversion
degrees of freedom
- about / Student's t-distribution, Degrees of freedom
Delta rule / The gradient descent update rule
dependent variable
- about / Regression
depth-first search / Breadth-first and depth-first search
descriptive statistics
- about / Descriptive statistics
- mean / The mean
- mathematical notation, interpreting / Interpreting mathematical notation
- median / The median
dictionary
- creating / Creating a dictionary
dimensionality reduction
- about / Dimensionality reduction
- Iris dataset, plotting / Plotting the Iris dataset
- principal component analysis (PCA) / Principle component analysis
- Singular Value Decomposition (SVD) / Singular value decomposition
dimensions
- about / Dimensions
Directed Acyclic Graph (DAG) / Visualizing graphs with Loom
Discounted Cumulative Gain (DCG) / Normalized discounted cumulative gain
discrete time models
- about / Discrete time models
- random walks / Random walks
- autoregressive (AR) models / Autoregressive models
- autocorrelation, determining in AR models / Determining autocorrelation in AR models
- Moving Average (MA) models / Moving-average models
- partial autocorrelation function (PACF), calculating / Calculating partial autocorrelation
- seasonality, removing with differencing / Removing seasonality with differencing
distance measures, evaluating
- about / Evaluating distance measures
- Pearson correlation similarity / The Pearson correlation similarity
- Spearman's rank similarity / Spearman's rank similarity
distributed cache
- data, sharing with / Sharing data with the distributed cache
distributed unique IDs
- creating / Creating distributed unique IDs
- with Hadoop / Distributed unique IDs with Hadoop
dot product / Matrix-vector multiplication
dummy variables / Categorical and dummy variables
Dunn index
- used, for determining optimal k / Determining optimal k with the Dunn index
Durbin-Levinson recursion
- used, for calculating partial autocorrelation function (PACF) / PACF with Durbin-Levinson recursion
- about / PACF with Durbin-Levinson recursion
- URL / PACF with Durbin-Levinson recursion
dwell time
- about / Introducing AcmeContent
dwell times
- visualizing / Visualizing the dwell times

E

edge-list format / Inspecting the data
elbow method
- used, for determining optimal k / Determining optimal k with the elbow method
ensemble learning
- about / Ensemble learning and random forests
entropy / Entropy
explained sum of squares (ESS) / The F-test of model significance
exploratory data visualization / Exploratory data visualization
exponential distribution
- about / The exponential distribution

F

F-distribution
- about / The F-distribution
F-statistic
- about / The F-statistic
F-test / The F-test of model significance
- about / The F-test
F1 measure / F-measure and the harmonic mean
feature matrix
- creating / Creating a feature matrix
Fisher z-transformation
- about / Confidence intervals
Flambo
- URL / Large-scale machine learning with Apache Spark and MLlib
fold / Parallel folds with reducers
frequency vectors / Representing text as vectors
frequentist / Probability
Fressian
- URL / Chaining mappers and reducers with Parkour graph
fs library
- URL / Clustering the Reuters documents

G

Gaussian distribution
- about / The normal distribution
- central limit theorem / The central limit theorem
Giraph
- URL / Distributed graph computation with GraphX
GitHub
- URL / Downloading the sample code
Glittering
- URL / Creating RDGs with Glittering
gradient descent
- about / The logistic regression cost function, Multiple regression with gradient descent
- multiple regression with / Multiple regression with gradient descent
- update rule / The gradient descent update rule
- learning rate / The gradient descent learning rate
- feature scaling / Feature scaling
- feature extraction / Feature extraction
- custom Tesser fold, creating / Creating a custom Tesser fold
- total model error, calculating / Calculating the total model error
- matrix-mean fold, creating / Creating a matrix-mean fold
- single step, applying / Applying a single step of gradient descent
- iterative gradient descent, running / Running iterative gradient descent
- scaling with Hadoop / Scaling gradient descent with Hadoop
gradient descent on Hadoop, with Tesser and Parkour
- about / Gradient descent on Hadoop with Tesser and Parkour
- Parkour distributed sources and sinks / Parkour distributed sources and sinks
- feature scale fold, running with Hadoop / Running a feature scale fold with Hadoop
- gradient descent, running with Hadoop / Running gradient descent with Hadoop
- code, preparing for Hadoop cluster / Preparing our code for a Hadoop cluster
- uberjar, building / Building an uberjar
- uberjar, submitting to Hadoop / Submitting the uberjar to Hadoop
graphs
- visualizing, Loom used / Visualizing graphs with Loom
graph traversal
- with Loom / Graph traversal with Loom
- Königsberg city, seven bridges / The seven bridges of Königsberg
GraphViz
- URL / Visualizing graphs with Loom
GraphX / Scale-free networks
- distributed graph computation / Distributed graph computation with GraphX
- RDGs, creating with Glittering / Creating RDGs with Glittering
- graph density, measuring with triangle counting / Measuring graph density with triangle counting
- partitioning strategies / GraphX partitioning strategies
- built-in triangle counting algorithm, running / Running the built-in triangle counting algorithm
- triangle counting, implementing with Glittering / Implement triangle counting with Glittering
- custom triangle counting algorithm, running / Running the custom triangle counting algorithm
- Pregel API / The Pregel API
- Pregel API, connected components / Connected components with the Pregel API
- connected components, running / Running connected components
- largest connected component, size calculating / Calculating the size of the largest connected component
- communities with label propagation, detecting / Detecting communities with label propagation
- label propagation, running / Running label propagation
- flow formulation / The flow formulation
- PageRank, implementing with Glittering / Implementing PageRank with Glittering
- PageRank, running to determine community influencers / Running PageRank to determine community influencers
gross domestic product (GDP)
- about / About the data

H

Hadoop Distributed File System (HDFS) / Large-scale machine learning with Apache Spark and MLlib
Hadoop installation guides
- URL / Submitting the uberjar to Hadoop
Hama
- URL / Distributed graph computation with GraphX
heteroscedasticity
- about / Visualizing the airline data
histogram
- about / Histograms
hypothesis testing
- about / Visualizing different populations, Hypothesis testing, Hypothesis testing
- significance testing / Significance

I

Ideal Discounted Cumulative Gain (IDCG) / Normalized discounted cumulative gain
identity matrix / The identity matrix
Incanter
- gradient descent with / Gradient descent with Incanter
- logistic regression, implementing with / Implementing logistic regression with Incanter
Incanter's linear model
- about / Incanter's linear model
- F-test / The F-test of model significance
Incanter library
- URL / Inspecting the data
independent variable
- about / Regression
indices function / Testing set membership with Bloom filters
inferential statistics
- about / Descriptive statistics
information gain
- about / Information gain
- used, for identifying best predictor / Using information gain to identify the best predictor
Information Retrieval statistics (IR stats) evaluator
- about / Information retrieval statistics
- precision / Precision
- recall / Recall
- of Mahout / Mahout's information retrieval evaluator
- F-measure / F-measure and the harmonic mean
- harmonic mean / F-measure and the harmonic mean
- false positive rate / Fall-out
- fall-out / Fall-out
- Discounted Cumulative Gain (DCG) / Normalized discounted cumulative gain
- normalized discounted cumulative gain / Normalized discounted cumulative gain
- results, plotting / Plotting the information retrieval results
- implicit, versus explicit feedback / Implicit versus explicit feedback
inter-cluster density / Inter-cluster density
interface
- binding / Binding the interface
interquartile range / Quantiles
intra-cluster density / Intra-cluster density
inversion matrix / Inversion
iota
- URL / Loading large files with iota
- used, for loading large files / Loading large files with iota
IRS data definition
- URL / Inspecting the data
IRS Statistics of Income (SoI)
- URL / Downloading the code and data
item-based recommenders
- about / Item-based and user-based recommenders
- practical considerations / Practical considerations for user and item recommenders

J

Jaccard index / Set-of-words and the Jaccard index
- applying, to documents / Applying the Jaccard index to documents
jStat
- about / jStat
- URL / jStat

K

k-means
- drawbacks / The drawbacks of k-means
- Mahalanobis distance measure / The Mahalanobis distance measure
- dimensionality, curse / The curse of dimensionality
k-means clustering
- about / Clustering with k-means and Incanter
- with Term Frequency-Inverse Document Frequency (TF-IDF) / k-means clustering with TF-IDF
k-means clustering, running with Mahout
- about / Running k-means clustering with Mahout
- results, viewing / Viewing k-means clustering results
- clustered output, interpreting / Interpreting the clustered output
k-nearest neighbors (k-NN)
- about / k-nearest neighbors
kappa statistic model / The kappa statistic
k hash functions / Testing set membership with Bloom filters

L

label propagation
- about / Detecting communities with label propagation
- running / Running label propagation
large-scale clustering, with Mahout
- about / Large-scale clustering with Mahout
- text documents, converting to sequence file / Converting text documents to a sequence file
- Mahout vectors creating, Parkour used / Using Parkour to create Mahout vectors
- distributed unique IDs, creating / Creating distributed unique IDs
- distributed unique IDs, with Hadoop / Distributed unique IDs with Hadoop
- data, sharing with distributed cache / Sharing data with the distributed cache
- Mahout vectors, building from input documents / Building Mahout vectors from input documents
large-scale machine learning
- MLlib, using / Large-scale machine learning with Apache Spark and MLlib
- Spark, using / Large-scale machine learning with Apache Spark and MLlib
- data, loading with Sparkling / Loading data with Sparkling
- data, mapping / Mapping data
- tuples / Distributed datasets and tuples
- distributed datasets / Distributed datasets and tuples
- data, filtering / Filtering data
- persistence / Persistence and caching
- caching / Persistence and caching
larger sets
- probabilistic methods / Probabilistic methods for large sets
- membership, testing with Bloom filters / Testing set membership with Bloom filters
- Jaccard similarity, with MinHash / Jaccard similarity for large sets with MinHash
learning rate / The gradient descent update rule
locality-sensitive hashing (LSH)
- used, for reducing pair comparisons / Reducing pair comparisons with locality-sensitive hashing
- about / Reducing pair comparisons with locality-sensitive hashing
- signatures, bucketing / Bucketing signatures
- URL / Bucketing signatures
log-linear / Visualizing the dwell times
log-log chart / Visualizing the dwell times
log-normal distribution
- about / The log-normal distribution
- correlation, visualizing / Visualizing correlation
- jittering / Jittering
logistic regression
- and naive Bayes approaches, comparing / Comparing the logistic regression and naive Bayes approaches
logistic regression, classifying
- about / Classification with logistic regression
- sigmoid function / The sigmoid function
- logistic regression cost function / The logistic regression cost function
- parameter optimization, with gradient descent / Parameter optimization with gradient descent
- gradient descent, with Incanter / Gradient descent with Incanter
- convexity / Convexity
logistic regression, implementing with Incanter
- about / Implementing logistic regression with Incanter
- feature matrix, creating / Creating a feature matrix
- logistic regression classifier, evaluating / Evaluating the logistic regression classifier
- confusion matrix / The confusion matrix
- kappa statistic / The kappa statistic
logistic regression classifier
- evaluating / Evaluating the logistic regression classifier
logistic regression cost function
- about / The logistic regression cost function
Loom
- used, for visualizing graphs / Visualizing graphs with Loom
- URL / Visualizing graphs with Loom
- graph traversal with / Graph traversal with Loom
loss function
- about / Ordinary least squares, Parameter optimization with gradient descent

M

machine learning
- movie recommendations, with ALS / Movie recommendations with alternating least squares
- ALS, evaluating / Evaluating ALS
- sum of squared errors, calculating / Calculating the sum of squared errors
Mahalanobis distance measure / The Mahalanobis distance measure
Mahout
- URL / Large-scale clustering with Mahout
- used, for building user-based recommenders / Building a user-based recommender with Mahout
- used, for evaluating recommenders / Recommender evaluation with Mahout
- Information Retrieval statistics (IR stats) evaluator / Mahout's information retrieval evaluator
Mahout vectors
- creating, Parkour used / Using Parkour to create Mahout vectors
- building, from input documents / Building Mahout vectors from input documents
matrix
- about / Matrices
- dimensions / Dimensions
- vectors / Vectors
- construction / Construction
- scalar multiplication / Addition and scalar multiplication
- scalar addition / Addition and scalar multiplication
- matrix-vector multiplication / Matrix-vector multiplication
- matrix-matrix multiplication / Matrix-matrix multiplication
- transposition / Transposition
- identity matrix / The identity matrix
- inversion / Inversion
matrix-matrix multiplication / Matrix-matrix multiplication
matrix-vector multiplication / Matrix-vector multiplication
maximum likelihood, time series
- estimating / Maximum likelihood estimation, Estimating the maximum likelihood
- calculating / Calculating the likelihood
- estimating, with Nelder-Mead optimization / Nelder-Mead optimization with Apache Commons Math
- estimating, with Akaike Information Criterion / Identifying better models with Akaike Information Criterion
maximum likelihood estimation
- about / Removing seasonality with differencing
m bits / Testing set membership with Bloom filters
mean
- calculating, fold used / Calculating the mean using fold
mean square error (MSE) / The F-test of model significance
mean square model (MSM) / The F-test of model significance
Medley
- URL / Performing a z-test
memoryless / The exponential distribution
meta-algorithm / Bagging and boosting
MinHash
- used, for Jaccard similarity for larger sets / Jaccard similarity for large sets with MinHash
MLlib
- used, for large-scale machine learning / Large-scale machine learning with Apache Spark and MLlib
- URL / Large-scale machine learning with Apache Spark and MLlib, Machine learning on Spark with MLlib
- used, for machine learning on Spark / Machine learning on Spark with MLlib
- using, with ALS / ALS with Spark and MLlib
- using, with Spark / ALS with Spark and MLlib
- ALS, evaluating / Evaluating ALS
Monte Carlo simulation
- used, for forecasting time series / Forecasting with Monte Carlo simulation
Moving Average (MA) models
- about / Moving-average models
- autocorrelation, determining / Determining autocorrelation in MA models
- combining, with autoregressive (AR) models / Combining the AR and MA models
multimodal
- about / Visualizing different populations
multiple comparisons
- about / Multiple comparisons
multiple designs
- testing / Testing multiple designs
multiple linear regression / Multiple linear regression
multiple tests
- simulating / Simulating multiple tests

N

n-gram
- about / Better clustering with n-grams
Naive Bayes classification
- about / Naive Bayes classification
- implementing / Implementing a naive Bayes classifier
- evaluating / Evaluating the naive Bayes classifier
natural logarithm / The log-normal distribution
Nelder-Mead optimization
- about / Estimating the maximum likelihood
- with Apache Commons Math / Nelder-Mead optimization with Apache Commons Math
network analysis
- data, downloading / Download the data
- data, inspecting / Inspecting the data
- graphs, visualizing with Loom / Visualizing graphs with Loom
new site design
- testing / Testing a new site design
nonresponse bias
- about / Bias
normal distribution
- about / The normal distribution
normal equation
- about / The normal equation
- features / More features
null hypothesis / Hypothesis testing

O

one-sample t-test
- about / One-sample t-test
one-tailed tests
- about / Two-tailed tests
optimal k
- determining, with elbow method / Determining optimal k with the elbow method
- determining, with Dunn index / Determining optimal k with the Dunn index
- determining, with Davies-Bouldin index / Determining optimal k with the Davies-Bouldin index
Ordinary Least Squares (OLS)
- about / Ordinary least squares
- slope / Slope and intercept
- intercept / Slope and intercept
- interpretation / Interpretation
- visualization / Visualization
- assumptions / Assumptions
over-fitting
- about / Bias and variance, Overfitting

P

PageRank
- used, for measuring community influence / Measuring community influence using PageRank
- implementing, with Glittering / Implementing PageRank with Glittering
- highest influence, sorting by / Sort by highest influence
- running, to determine community influencers / Running PageRank to determine community influencers
Parkour
- URL / Parkour distributed sources and sinks
- used, for creating Mahout vectors / Using Parkour to create Mahout vectors
partial autocorrelation
- calculating / Calculating partial autocorrelation
- autocovariance / Autocovariance
- plotting / Plotting partial autocorrelation
partial autocorrelation function (PACF)
- about / Calculating partial autocorrelation
- calculating, with Durbin-Levinson recursion / PACF with Durbin-Levinson recursion
- ARMA model order, determining / Determining ARMA model order with ACF and PACF
- plotting, of airline data / ACF and PACF of airline data
parts-of-speech taggers / Tokenizing the Reuters files
Pearson's correlation
- about / Pearson's correlation
- sample r and population rho / Sample r and population rho
phi-quantile / Quantiles
Poincaré's baker
- about / Poincaré's baker
- distributions, generating / Generating distributions
polytope
- about / Estimating the maximum likelihood
populations
- about / Samples and populations
- visualizing / Visualizing different populations
precision
- true positives / Precision
- about / Precision
- false positives / Precision
prediction
- about / Prediction
- confidence interval / The confidence interval of a prediction
- model, scope / Model scope
- final model / The final model
prediction intervals / The confidence interval of a prediction
Pregel API
- about / The Pregel API
- connected components with / Connected components with the Pregel API
probability
- about / Probability
- Bayes theorem / Bayes theorem
- Bayes theorem, with multiple predictors / Bayes theorem with multiple predictors
probability densities
- plotting / Plotting probability densities
probability mass function (PMF)
- about / Probability mass functions
processing
- URL / Using Quil for visualization
Pythagoras formula / The bag-of-words and Euclidean distance

Q

quantile-quantile plots
- about / Quantile-quantile plots
quantiles
- about / Quantiles
- URL / Quantiles
quartiles / Quantiles
Quil, used for visualization
- URL / Using Quil for visualization
- about / Using Quil for visualization
- sketch window, drawing to / Drawing to the sketch window
- coordinate system / Quil's coordinate system
- grid, plotting / Plotting the grid
- fill color, specifying / Specifying the fill color
- color and fill / Color and fill
- image file, outputting / Outputting an image file
- PDF, output to / Output to PDF

R

R-squared
- multiple / Multiple R-squared
- adjusted / Adjusted R-squared
random forests
- about / Ensemble learning and random forests
random walks
- about / Random walks
RDGs
- creating, with Glittering / Creating RDGs with Glittering
reagent
- about / State and Reagent
recommenders
- evaluating, with Mahout / Recommender evaluation with Mahout
recommenders, evaluating
- Mahout, using / Recommender evaluation with Mahout
- distance measures / Evaluating distance measures
- optimum neighborhood size, determining / Determining optimum neighborhood size
- information retrieval statistics / Information retrieval statistics
- recommendation with Boolean preferences / Recommendation with Boolean preferences
recommender systems
- types / Types of recommender systems
- collaborative filtering / Collaborative filtering
regression
- about / Regression
- linear equations / Linear equations
- residuals / Residuals
regression lines
- about / Regression
relative power / Relative power
resampling
- about / Resampling
residual plot / Visualization
Resilient Distributed Datasets (RDDs) / Distributed datasets and tuples
Reuters dataset
- URL / Downloading the data
Reuters documents
- clustering / Clustering the Reuters documents
Reuters files, tokenizing
- about / Tokenizing the Reuters files
- Jaccard index, applying to documents / Applying the Jaccard index to documents
- Euclidean distance / The bag-of-words and Euclidean distance
- bag-of-words / The bag-of-words and Euclidean distance
- frequency vectors / Representing text as vectors
root mean square error
- calculating, with Parkour / Calculating the root mean square error with Parkour
Root mean square error (RMSE) / Recommender evaluation with Mahout
Russian election data
- visualizing / Visualizing the Russian election data

S

samples
- about / Samples and populations
- comparing / Sample comparisons
- means, calculating / Calculating sample means
Scalable Vector Graphics (SVG)
- about / Scalable Vector Graphics
scalar
- multiplication / Addition and scalar multiplication
- addition / Addition and scalar multiplication
scale-free networks
- about / Scale-free networks
scatter plots
- about / Scatter plots
scatter transparency
- about / Scatter transparency
shortest path
- finding / Finding the shortest path
- minimum spanning trees / Minimum spanning trees
- connected components / Subgraphs and connected components
- subgraphs / Subgraphs and connected components
- web, bow-tie structure / SCC and the bow-tie structure of the web
- SCC / SCC and the bow-tie structure of the web
sigmoid function / Classification with logistic regression, The sigmoid function
significance testing
- about / Significance
significance testing proportions
- about / Significance testing proportions
simplex method
- about / Estimating the maximum likelihood
simulation
- about / Introducing the simulation
- compiling / Compile the simulation
- browser simulation / The browser simulation
singular matrices / Inversion
Singular Value Decomposition (SVD) / Singular value decomposition
skewed normal distribution / Generating distributions
skewness
- about / Skewness
- quantile-quantile plots / Quantile-quantile plots
Slope One predictors / Item-based and user-based recommenders
Slope One recommenders
- about / Slope One recommenders
- URL / Slope One recommenders
- item differences, calculating / Calculating the item differences
- recommendations, creating / Making recommendations
Spark
- URL / Large-scale machine learning with Apache Spark and MLlib
- used, for large-scale machine learning / Large-scale machine learning with Apache Spark and MLlib
Sparkling
- URL / Large-scale machine learning with Apache Spark and MLlib
- used, for loading data / Loading data with Sparkling
- used, for mapping data / Mapping data
standard deviation / Variance
standard error
- about / Standard error
- of proportion / The standard error of a proportion
- bootstrapping, estimating with / Estimation using bootstrapping
- of proportion, formula / The standard error of a proportion formula
standard errors
- adjusting, for large samples / Adjusting standard errors for large samples
Standard Generalized Markup Language (SGML) / Extracting the data
state
- about / State and Reagent
- updating / Updating state
stationary
- about / Visualizing the airline data
stationary time series
- about / Stationarity
statistics
- sample code, downloading / Downloading the sample code
- URL / Downloading the sample code
- examples, running / Running the examples
- data, downloading / Downloading the data
- data, inspecting / Inspecting the data
- data scrubbing / Data scrubbing
- descriptive statistics / Descriptive statistics
stemmers
- URL / Stemming
stemming
- about / Stemming
Stochastic gradient descent
- about / Stochastic gradient descent
- with Parkour / Stochastic gradient descent with Parkour
- mapper, defining / Defining a mapper
- shaping functions / Parkour shaping functions
- reducer, defining / Defining a reducer
- Hadoop jobs, specifying with Parkour graph / Specifying Hadoop jobs with Parkour graph
- mappers, chaining with Parkour graph / Chaining mappers and reducers with Parkour graph
- reducers, chaining with Parkour graph / Chaining mappers and reducers with Parkour graph
stochastic gradient descent (SGD) / Stochastic gradient descent
summary statistics
- about / Descriptive statistics
sum of residual squares (RSS) / The F-test of model significance
sum of squared errors (SSE) / Calculating the root mean square error with Parkour
supersteps / The Pregel API
SVG maps
- URL / Improving the clarity with illustrations

T

t-distribution
- about / Student's t-distribution
t-statistic
- about / The t-statistic
t-test
- performing / Performing the t-test
Tanimoto coefficient / Recommendation with Boolean preferences
term frequency (tf) / Representing text as vectors
Term Frequency-Inverse Document Frequency (TF-IDF)
- about / Better clustering with TF-IDF
- Zipf's law / Zipf's law
- weight, calculating / Calculating the TF-IDF weight
- k-means clustering with / k-means clustering with TF-IDF
- clustering, with n-grams / Better clustering with n-grams
term frequency vectors
- creating / Creating term frequency vectors
- vector space model / The vector space model and cosine distance
- cosine distance / The vector space model and cosine distance
- stop words, removing / Removing stop words
Tesser
- mathematical folds / Mathematical folds with Tesser
- covariance, calculating with / Calculating covariance with Tesser
- commutativity / Commutativity
- simple linear regression / Simple linear regression with Tesser
- correlation matrix, calculating / Calculating a correlation matrix
- matrix-sum fold, creating / Creating a matrix-sum fold
time series
- Longley dataset / About the data
- Airline dataset / About the data
- Longley data, loading / Loading the Longley data
- Longley data, plotting with linear model / Fitting curves with a linear model
- decomposition / Time series decomposition
- airline data, inspecting / Inspecting the airline data
- airline data, visualizing / Visualizing the airline data
- stationary time series / Stationarity
- de-trending / De-trending and differencing
- differencing / De-trending and differencing
- reference link / Discrete time models
- maximum likelihood estimation / Maximum likelihood estimation
- forecasting / Time series forecasting
- forecasting, with Monte Carlo simulation / Forecasting with Monte Carlo simulation
Toeplitz matrices
- about / PACF with Durbin-Levinson recursion
tokenization
- about / Set-of-words and the Jaccard index
transduce library
- URL / Distributed unique IDs with Hadoop
triangle counting
- graph density, measuring with / Measuring graph density with triangle counting
- built-in triangle counting algorithm, running / Running the built-in triangle counting algorithm
- implementing, with Glittering / Implement triangle counting with Glittering
- neighbor IDs, collecting / Step one – collecting neighbor IDs
- aggregate messages / Steps two, three, and four – aggregate messages
- counts, dividing / Step five – dividing the counts
- custom triangle counting algorithm, running / Running the custom triangle counting algorithm
Twitter's intent API
- URL / Download the data, Running PageRank to determine community influencers
two-dimensional histogram
- representing / Representing a two-dimensional histogram
two-tailed tests
- about / Two-tailed tests

U

uberjar
- building / Building an uberjar
- submitting, to Hadoop / Submitting the uberjar to Hadoop
user-based recommenders
- about / Item-based and user-based recommenders
- practical considerations / Practical considerations for user and item recommenders
- building, with Mahout / Building a user-based recommender with Mahout

V

variance
- about / Variance
- analysis / Analysis of variance
- calculating, fold used / Calculating the variance using fold
vectors
- about / Vectors
visualization
- code, downloading / Download the code and data
- data, downloading / Download the code and data
- exploratory data visualization / Exploratory data visualization
- two-dimensional histogram, representing / Representing a two-dimensional histogram
- Quil, using / Using Quil for visualization
visualization, for communication
- about / Visualization for communication
- wealth distribution, visualizing / Visualizing wealth distribution
- data, bringing to life / Bringing data to life with Quil
- bars of differing widths, drawing / Drawing bars of differing widths
- axis labels, adding / Adding a title and axis labels
- title, adding / Adding a title and axis labels
- clarity, improving with illustrations / Improving the clarity with illustrations
- text, adding to bars / Adding text to the bars
- additional data, incorporating / Incorporating additional data
- complex shapes, drawing / Drawing complex shapes, Drawing curves
- compound charts, plotting / Plotting compound charts
visualizations
- comparative visualizations / Comparative visualizations, Comparative visualizations
- about / The importance of visualizations
- electorate data / Visualizing electorate data
- comparative visualizations, of electorate data / Comparative visualizations of electorate data

W

Waikato Environment for Knowledge Analysis (Weka)
- URL / Classification with clj-ml
weighted graph / Visualizing graphs with Loom
Welch's t-test / Two-tailed tests
whole-graph analysis
- about / Whole-graph analysis
Widrow-Hoff learning rule / The gradient descent update rule
wiki
- URL / Running the examples, Downloading the data

Z

z-test
- performing / Performing a z-test
Zipf's law / Zipf's law
Zipf scale / Scale-free networks
