Clojure for Data Science

Product type: Book
Published: Sep 2015
ISBN-13: 9781784397180
Pages: 608
Edition: 1st
Author: Henry Garner

Table of Contents (18) Chapters

Clojure for Data Science
Credits
About the Author
Acknowledgments
About the Reviewer
www.PacktPub.com
Preface
1. Statistics
2. Inference
3. Correlation
4. Classification
5. Big Data
6. Clustering
7. Recommender Systems
8. Network Analysis
9. Time Series
10. Visualization
Index

Index

A

  • A* algorithm
    • URL / Finding the shortest path
  • Abracad library
    • URL / Distributed unique IDs with Hadoop
  • AcmeContent
    • about / Introducing AcmeContent
    • sample code / Download the sample code
  • acyclic / Visualizing graphs with Loom
  • Adaptive Boosting (AdaBoost) / Bagging and boosting
  • Akaike Information Criterion (AIC)
    • models, identifying / Identifying better models with Akaike Information Criterion
    • about / Identifying better models with Akaike Information Criterion
  • ALS
    • movie recommendations / Movie recommendations with alternating least squares
    • using, with Spark / ALS with Spark and MLlib
    • using, with MLlib / ALS with Spark and MLlib
    • used, for making predictions / Making predictions with ALS
    • evaluating / Evaluating ALS
  • Anscombe's Quartet / The importance of visualizations
  • Apache Commons Math
    • about / Estimating the maximum likelihood, Nelder-Mead optimization with Apache Commons Math
    • URL / Estimating the maximum likelihood, Nelder-Mead optimization with Apache Commons Math
    • used, for Nelder-Mead optimization / Nelder-Mead optimization with Apache Commons Math
  • ARMA model order
    • determining, with ACF and PACF / Determining ARMA model order with ACF and PACF
  • autocorrelation function (ACF)
    • about / Determining autocorrelation in AR models
    • ARMA model order, determining / Determining ARMA model order with ACF and PACF
    • plotting, of airline data / ACF and PACF of airline data
  • autocovariance
    • about / Autocovariance
  • autoregressive (AR) models
    • about / Autoregressive models
    • autocorrelation, determining / Determining autocorrelation in AR models
    • combining, with Moving Average (MA) models / Combining the AR and MA models
  • Autoregressive Integrated Moving Average (ARIMA) model
    • about / Removing seasonality with differencing

B

  • B1
    • about / B1
    • URL / B1
  • bag-of-words / The bag-of-words and Euclidean distance
  • bagging
    • about / Bagging and boosting
  • balanced F-score / F-measure and the harmonic mean
  • batch gradient descent
    • about / Stochastic gradient descent
  • Bayesian view / Probability
  • Bayes theorem
    • about / Bayes theorem
    • with multiple predictors / Bayes theorem with multiple predictors, Naive Bayes classification
  • bias
    • about / Bias
    • high bias, addressing / Addressing high bias
  • bias term / Multiple linear regression
  • big data
    • code, downloading / Downloading the code and data
    • example code, URL / Downloading the code and data
    • inspecting / Inspecting the data
    • records, counting / Counting the records
  • bigrams
    • about / Better clustering with n-grams
  • bimodal
    • about / Visualizing different populations
  • binning
    • about / Binning data
  • binomial distribution
    • about / The binomial distribution
  • bipartite / Visualizing graphs with Loom
  • bivariate linear regression / Multiple linear regression
  • Bloom filters
    • used, for testing large sets membership / Testing set membership with Bloom filters
  • Bonferroni correction
    • about / The Bonferroni correction
  • boosting
    • about / Bagging and boosting
  • bounce
    • about / Introducing AcmeContent
  • box and whisker plots
    • about / Box plots
  • breadth-first search / Breadth-first and depth-first search

C

  • C4.5 algorithm / Building a decision tree in clj-ml
  • categorical variables / Categorical and dummy variables
  • central limit theorem / The central limit theorem
    • about / The central limit theorem
  • Chi-squared multiple significance testing
    • about / Chi-squared multiple significance testing
    • categories, visualizing / Visualizing the categories
    • chi-squared test / The chi-squared test, The chi-squared test
    • chi-squared statistic / The chi-squared statistic
  • chi-squared statistic / The chi-squared statistic
  • chi-squared test / The chi-squared test, The chi-squared test
  • classifier
    • data / About the data
    • data, inspecting / Inspecting the data
    • relative risk and odds, comparing with / Comparisons with relative risk and odds
    • saving, to file / Saving the classifier to a file
  • clj-ml
    • classification with / Classification with clj-ml
    • URL / Classification with clj-ml
    • data, loading with / Loading data with clj-ml
    • decision tree, building / Building a decision tree in clj-ml
  • clj-time library
    • URL / Inspecting the airline data
  • clojure-opennlp library
    • URL / Tokenizing the Reuters files
  • Clojure libraries
    • URL / Exploratory data visualization
  • Clojure library succession
    • URL / Calculating the likelihood
  • Clojure library Tesser
    • URL / Mathematical folds with Tesser
  • Clojure reducers library
    • URL / Counting the records
    • about / The reducers library
    • parallel folds / Parallel folds with reducers
    • parallel folds with / Parallel folds with reducers
    • large files, loading with iota / Loading large files with iota
    • reducers processing pipeline, creating / Creating a reducers processing pipeline
    • curried reductions, with reducers / Curried reductions with reducers
    • statistical folds / Statistical folds with reducers
    • associativity / Associativity
    • mean, calculating with fold / Calculating the mean using fold
    • variance, calculating with fold / Calculating the variance using fold
  • cluster evaluation, measures
    • about / Cluster evaluation measures
    • inter-cluster density / Inter-cluster density
    • intra-cluster density / Intra-cluster density
    • root mean square error, calculating with Parkour / Calculating the root mean square error with Parkour
    • clustered points and centroids, loading / Loading clustered points and centroids
    • cluster RMSE, calculating / Calculating the cluster RMSE
    • optimal k, determining with elbow method / Determining optimal k with the elbow method
    • optimal k, determining with Dunn index / Determining optimal k with the Dunn index
    • optimal k, determining with Davies-Bouldin index / Determining optimal k with the Davies-Bouldin index
  • clustering
    • data, downloading / Downloading the data
    • data, extracting / Extracting the data
    • data, inspecting / Inspecting the data
  • clustering, text
    • about / Clustering text
    • set-of-words / Set-of-words and the Jaccard index
    • Jaccard index / Set-of-words and the Jaccard index
    • Reuters files, tokenizing / Tokenizing the Reuters files
    • text, representing as vectors / Representing text as vectors
    • dictionary, creating / Creating a dictionary
  • cluster RMSE
    • calculating / Calculating the cluster RMSE
  • code
    • downloading / Download the code and data
    • downloading, URL / Download the code and data
  • coefficient of determination / Goodness-of-fit and R-square
  • coefficient of multiple determination / Multiple R-squared
  • collinearity
    • about / Collinearity
    • multicollinearity / Multicollinearity
  • columns
    • adding / Adding columns, Adding derived columns
  • combinations function
    • URL / Determining optimal k with the Dunn index
  • communities, with label propagation
    • detecting / Detecting communities with label propagation
    • map vertices / Step one – map vertices
    • vertex attribute, sending / Step two – send the vertex attribute
    • aggregate value / Step three – aggregate value
    • vertex function / Step four – vertex function
    • maximum iterations count, setting / Step five – set the maximum iterations count
  • comparative visualizations
    • about / Comparative visualizations
    • box and whisker plots / Box plots
    • cumulative distribution functions / Cumulative distribution functions
    • probability mass function (PMF) / Probability mass functions
    • scatter plots / Scatter plots
    • scatter transparency / Scatter transparency
  • confidence interval
    • about / Confidence intervals, Confidence intervals
  • confounding variables
    • about / Regression
  • confusion matrix / The confusion matrix
  • connected components
    • running / Running connected components
    • largest connected component, calculating size / Calculating the size of the largest connected component
  • connected components, with Pregel API
    • about / Connected components with the Pregel API
    • map vertices / Step one – map vertices
    • message function / Steps two and three – the message function
    • attributes, updating / Step four – update the attributes
    • convergence, iterating to / Step five – iterate to convergence
  • construction
    • about / Construction
  • content-based filtering / Types of recommender systems
  • content distribution network (CDN) / jStat
  • covariance
    • about / Covariance
    • calculating, with Tesser / Calculating covariance with Tesser
  • cross-validation
    • about / Cross-validation
  • cumulative distribution function (CDF)
    • about / Hypothesis testing
  • cumulative distribution functions (CDFs)
    • about / Cumulative distribution functions

D

  • daily means distribution
    • about / The distribution of daily means
  • data
    • inspecting / Load and inspect the data, Inspecting the data, Inspect the data
    • loading / Load and inspect the data
    • about / About the data
    • Guardian's excellent data blog, URL / About the data
    • visualizing / Visualizing the data
    • downloading / Download the code and data
    • downloading, URL / Download the code and data
    • parsing / Parse the data
  • data scrubbing
    • about / Data scrubbing
  • Davies-Bouldin index
    • used, for determining optimal k / Determining optimal k with the Davies-Bouldin index
  • decision trees
    • about / Decision trees
    • information / Information
    • entropy / Entropy
    • information gain / Information gain
    • information gain, using to identify best predictor / Using information gain to identify the best predictor
    • building, recursively / Recursively building a decision tree
    • using, for classification / Using the decision tree for classification
    • classifier, evaluating / Evaluating the decision tree classifier
    • building, in clj-ml / Building a decision tree in clj-ml
  • degenerate matrices / Inversion
  • degrees of freedom
    • about / Student's t-distribution, Degrees of freedom
  • Delta rule / The gradient descent update rule
  • dependent variable
    • about / Regression
  • depth-first search / Breadth-first and depth-first search
  • descriptive statistics
    • about / Descriptive statistics
    • mean / The mean
    • mathematical notation, interpreting / Interpreting mathematical notation
    • median / The median
  • dictionary
    • creating / Creating a dictionary
  • dimensionality reduction
    • about / Dimensionality reduction
    • Iris dataset, plotting / Plotting the Iris dataset
    • principal component analysis (PCA) / Principle component analysis
    • Singular Value Decomposition (SVD) / Singular value decomposition
  • dimensions
    • about / Dimensions
  • Directed Acyclic Graph (DAG) / Visualizing graphs with Loom
  • Discounted Cumulative Gain (DCG) / Normalized discounted cumulative gain
  • discrete time models
    • about / Discrete time models
    • random walks / Random walks
    • autoregressive (AR) models / Autoregressive models
    • autocorrelation, determining in AR models / Determining autocorrelation in AR models
    • Moving Average (MA) models / Moving-average models
    • partial autocorrelation function (PACF), calculating / Calculating partial autocorrelation
    • seasonality, removing with differencing / Removing seasonality with differencing
  • distance measures, evaluating
    • about / Evaluating distance measures
    • Pearson correlation similarity / The Pearson correlation similarity
    • Spearman's rank similarity / Spearman's rank similarity
  • distributed cache
    • data, sharing with / Sharing data with the distributed cache
  • distributed unique IDs
    • creating / Creating distributed unique IDs
    • with Hadoop / Distributed unique IDs with Hadoop
  • dot product / Matrix-vector multiplication
  • dummy variables / Categorical and dummy variables
  • Dunn index
    • used, for determining optimal k / Determining optimal k with the Dunn index
  • Durbin-Levinson recursion
    • used, for calculating partial autocorrelation function (PACF) / PACF with Durbin-Levinson recursion
    • about / PACF with Durbin-Levinson recursion
    • URL / PACF with Durbin-Levinson recursion
  • dwell time
    • about / Introducing AcmeContent
  • dwell times
    • visualizing / Visualizing the dwell times

E

  • edge-list format / Inspecting the data
  • elbow method
    • used, for determining optimal k / Determining optimal k with the elbow method
  • ensemble learning
    • about / Ensemble learning and random forests
  • entropy / Entropy
  • explained sum of squares (ESS) / The F-test of model significance
  • exploratory data visualization / Exploratory data visualization
  • exponential distribution
    • about / The exponential distribution

F

  • F-distribution
    • about / The F-distribution
  • F-statistic
    • about / The F-statistic
  • F-test
    • about / The F-test
    • model significance / The F-test of model significance
  • F1 measure / F-measure and the harmonic mean
  • feature matrix
    • creating / Creating a feature matrix
  • Fisher z-transformation
    • about / Confidence intervals
  • Flambo
    • URL / Large-scale machine learning with Apache Spark and MLlib
  • fold / Parallel folds with reducers
  • frequency vectors / Representing text as vectors
  • frequentist / Probability
  • Fressian
    • URL / Chaining mappers and reducers with Parkour graph
  • fs library
    • URL / Clustering the Reuters documents

G

  • Gaussian distribution
    • about / The normal distribution
    • central limit theorem / The central limit theorem
  • Giraph
    • URL / Distributed graph computation with GraphX
  • GitHub
    • URL / Downloading the sample code
  • Glittering
    • URL / Creating RDGs with Glittering
  • gradient descent
    • about / The logistic regression cost function, Multiple regression with gradient descent
    • multiple regression with / Multiple regression with gradient descent
    • update rule / The gradient descent update rule
    • learning rate / The gradient descent learning rate
    • feature scaling / Feature scaling
    • feature extraction / Feature extraction
    • custom Tesser fold, creating / Creating a custom Tesser fold
    • total model error, calculating / Calculating the total model error
    • matrix-mean fold, creating / Creating a matrix-mean fold
    • single step, applying / Applying a single step of gradient descent
    • iterative gradient descent, running / Running iterative gradient descent
    • scaling with Hadoop / Scaling gradient descent with Hadoop
  • gradient descent on Hadoop, with Tesser and Parkour
    • about / Gradient descent on Hadoop with Tesser and Parkour
    • Parkour distributed sources and sinks / Parkour distributed sources and sinks
    • feature scale fold, running with Hadoop / Running a feature scale fold with Hadoop
    • gradient descent, running with Hadoop / Running gradient descent with Hadoop
    • code, preparing for Hadoop cluster / Preparing our code for a Hadoop cluster
    • uberjar, building / Building an uberjar
    • uberjar, submitting to Hadoop / Submitting the uberjar to Hadoop
  • graphs
    • visualizing, Loom used / Visualizing graphs with Loom
  • graph traversal
    • with Loom / Graph traversal with Loom
    • Königsberg city, seven bridges / The seven bridges of Königsberg
  • GraphViz
    • URL / Visualizing graphs with Loom
  • GraphX / Scale-free networks
    • distributed graph computation / Distributed graph computation with GraphX
    • RDGs, creating with Glittering / Creating RDGs with Glittering
    • graph density, measuring with triangle counting / Measuring graph density with triangle counting
    • partitioning strategies / GraphX partitioning strategies
    • built-in triangle counting algorithm, running / Running the built-in triangle counting algorithm
    • triangle counting, implementing with Glittering / Implement triangle counting with Glittering
    • custom triangle counting algorithm, running / Running the custom triangle counting algorithm
    • Pregel API / The Pregel API
    • Pregel API, connected components / Connected components with the Pregel API
    • connected components, running / Running connected components
    • largest connected component, calculating size / Calculating the size of the largest connected component
    • communities with label propagation, detecting / Detecting communities with label propagation
    • label propagation, running / Running label propagation
    • flow formulation / The flow formulation
    • PageRank, implementing with Glittering / Implementing PageRank with Glittering
    • PageRank, running to determine community influencers / Running PageRank to determine community influencers
  • gross domestic product (GDP)
    • about / About the data

H

  • Hadoop Distributed File System (HDFS) / Large-scale machine learning with Apache Spark and MLlib
  • Hadoop installation guides
    • URL / Submitting the uberjar to Hadoop
  • Hama
    • URL / Distributed graph computation with GraphX
  • heteroscedasticity
    • about / Visualizing the airline data
  • histogram
    • about / Histograms
  • hypothesis testing
    • about / Visualizing different populations, Hypothesis testing, Hypothesis testing
    • significance testing / Significance

I

  • Ideal Discounted Cumulative Gain (IDCG) / Normalized discounted cumulative gain
  • identity matrix / The identity matrix
  • Incanter
    • gradient descent with / Gradient descent with Incanter
    • logistic regression, implementing with / Implementing logistic regression with Incanter
  • Incanter's linear model
    • about / Incanter's linear model
    • F-test / The F-test of model significance
  • Incanter library
    • URL / Inspecting the data
  • independent variable
    • about / Regression
  • indices function / Testing set membership with Bloom filters
  • inferential statistics
    • about / Descriptive statistics
  • information gain
    • about / Information gain
    • used, for identifying best predictor / Using information gain to identify the best predictor
  • Information Retrieval statistics (IR stats) evaluator
    • about / Information retrieval statistics
    • precision / Precision
    • recall / Recall
    • of Mahout / Mahout's information retrieval evaluator
    • F-measure / F-measure and the harmonic mean
    • harmonic mean / F-measure and the harmonic mean
    • false positive rate / Fall-out
    • fall-out / Fall-out
    • Discounted Cumulative Gain (DCG) / Normalized discounted cumulative gain
    • normalized discounted cumulative gain / Normalized discounted cumulative gain
    • results, plotting / Plotting the information retrieval results
    • implicit, versus explicit feedback / Implicit versus explicit feedback
  • inter-cluster density / Inter-cluster density
  • interface
    • binding / Binding the interface
  • interquartile range / Quantiles
  • intra-cluster density / Intra-cluster density
  • inversion, matrix / Inversion
  • iota
    • URL / Loading large files with iota
    • used, for loading large files / Loading large files with iota
  • IRS data definition
    • URL / Inspecting the data
  • IRS Statistics of Income (SoI)
    • URL / Downloading the code and data
  • item-based recommenders
    • about / Item-based and user-based recommenders
    • practical considerations / Practical considerations for user and item recommenders

J

  • Jaccard index / Set-of-words and the Jaccard index
    • applying, to documents / Applying the Jaccard index to documents
  • jStat
    • about / jStat
    • URL / jStat

K

  • k-means
    • drawbacks / The drawbacks of k-means
    • Mahalanobis distance measure / The Mahalanobis distance measure
    • dimensionality, curse / The curse of dimensionality
  • k-means clustering
    • about / Clustering with k-means and Incanter
    • with Term Frequency-Inverse Document Frequency (TF-IDF) / k-means clustering with TF-IDF
  • k-means clustering, running with Mahout
    • about / Running k-means clustering with Mahout
    • results, viewing / Viewing k-means clustering results
    • clustered output, interpreting / Interpreting the clustered output
  • k-nearest neighbors (k-NN)
    • about / k-nearest neighbors
  • kappa statistic model / The kappa statistic
  • k hash functions / Testing set membership with Bloom filters

L

  • label propagation
    • about / Detecting communities with label propagation
    • running / Running label propagation
  • large-scale clustering, with Mahout
    • about / Large-scale clustering with Mahout
    • text documents, converting to sequence file / Converting text documents to a sequence file
    • Mahout vectors, creating with Parkour / Using Parkour to create Mahout vectors
    • distributed unique IDs, creating / Creating distributed unique IDs
    • distributed unique IDs, with Hadoop / Distributed unique IDs with Hadoop
    • data, sharing with distributed cache / Sharing data with the distributed cache
    • Mahout vectors, building from input documents / Building Mahout vectors from input documents
  • large-scale machine learning
    • MLlib, using / Large-scale machine learning with Apache Spark and MLlib
    • Spark, using / Large-scale machine learning with Apache Spark and MLlib
    • data, loading with Sparkling / Loading data with Sparkling
    • data, mapping / Mapping data
    • tuples / Distributed datasets and tuples
    • distributed datasets / Distributed datasets and tuples
    • data, filtering / Filtering data
    • persistence / Persistence and caching
    • caching / Persistence and caching
  • large sets
    • probabilistic methods / Probabilistic methods for large sets
    • membership, testing with Bloom filters / Testing set membership with Bloom filters
    • Jaccard similarity, with MinHash / Jaccard similarity for large sets with MinHash
  • learning rate / The gradient descent update rule
  • locality-sensitive hashing (LSH)
    • used, for reducing pair comparisons / Reducing pair comparisons with locality-sensitive hashing
    • about / Reducing pair comparisons with locality-sensitive hashing
    • signatures, bucketing / Bucketing signatures
    • URL / Bucketing signatures
  • log-linear / Visualizing the dwell times
  • log-log chart / Visualizing the dwell times
  • log-normal distribution
    • about / The log-normal distribution
    • correlation, visualizing / Visualizing correlation
    • jittering / Jittering
  • logistic regression
    • and naive Bayes approaches, comparing / Comparing the logistic regression and naive Bayes approaches
  • logistic regression, classification with
    • about / Classification with logistic regression
    • sigmoid function / The sigmoid function
    • logistic regression cost function / The logistic regression cost function
    • parameter optimization, with gradient descent / Parameter optimization with gradient descent
    • gradient descent, with Incanter / Gradient descent with Incanter
    • convexity / Convexity
  • logistic regression, implementing with Incanter
    • about / Implementing logistic regression with Incanter
    • feature matrix, creating / Creating a feature matrix
    • logistic regression classifier, evaluating / Evaluating the logistic regression classifier
    • confusion matrix / The confusion matrix
    • kappa statistic / The kappa statistic
  • logistic regression classifier
    • evaluating / Evaluating the logistic regression classifier
  • logistic regression cost function
    • about / The logistic regression cost function
  • Loom
    • used, for visualizing graphs / Visualizing graphs with Loom
    • URL / Visualizing graphs with Loom
    • graph traversal with / Graph traversal with Loom
  • loss function
    • about / Ordinary least squares, Parameter optimization with gradient descent

M

  • machine learning
    • movie recommendations, with ALS / Movie recommendations with alternating least squares
    • ALS, evaluating / Evaluating ALS
    • sum of squared errors, calculating / Calculating the sum of squared errors
  • Mahalanobis distance measure / The Mahalanobis distance measure
  • Mahout
    • URL / Large-scale clustering with Mahout
    • used, for building user-based recommenders / Building a user-based recommender with Mahout
    • used, for evaluating recommenders / Recommender evaluation with Mahout
    • Information Retrieval statistics (IR stats) evaluator / Mahout's information retrieval evaluator
  • Mahout vectors
    • creating, Parkour used / Using Parkour to create Mahout vectors
    • building, from input documents / Building Mahout vectors from input documents
  • matrix
    • about / Matrices
    • dimensions / Dimensions
    • vectors / Vectors
    • construction / Construction
    • scalar multiplication / Addition and scalar multiplication
    • scalar addition / Addition and scalar multiplication
    • -vector multiplication / Matrix-vector multiplication
    • -matrix multiplication / Matrix-matrix multiplication
    • transposition / Transposition
    • identity matrix / The identity matrix
    • inversion / Inversion
  • matrix-matrix multiplication / Matrix-matrix multiplication
  • matrix-vector multiplication / Matrix-vector multiplication
  • maximum likelihood, time series
    • estimating / Maximum likelihood estimation, Estimating the maximum likelihood
    • calculating / Calculating the likelihood
    • estimating, with Nelder-Mead optimization / Nelder-Mead optimization with Apache Commons Math
    • estimating, with Akaike Information Criterion / Identifying better models with Akaike Information Criterion
  • maximum likelihood estimation
    • about / Removing seasonality with differencing
  • m bits / Testing set membership with Bloom filters
  • mean
    • calculating, fold used / Calculating the mean using fold
  • mean square error (MSE) / The F-test of model significance
  • mean square model (MSM) / The F-test of model significance
  • Medley
    • URL / Performing a z-test
  • memoryless / The exponential distribution
  • meta-algorithm / Bagging and boosting
  • MinHash
    • used, for Jaccard similarity for large sets / Jaccard similarity for large sets with MinHash
  • MLlib
    • used, for large-scale machine learning / Large-scale machine learning with Apache Spark and MLlib
    • URL / Large-scale machine learning with Apache Spark and MLlib, Machine learning on Spark with MLlib
    • used, for machine learning on Spark / Machine learning on Spark with MLlib
    • using, with ALS / ALS with Spark and MLlib
    • using, with Spark / ALS with Spark and MLlib
    • ALS, evaluating / Evaluating ALS
  • Monte Carlo simulation
    • used, for forecasting time series / Forecasting with Monte Carlo simulation
  • Moving Average (MA) models
    • about / Moving-average models
    • autocorrelation, determining / Determining autocorrelation in MA models
    • combining, with autoregressive (AR) models / Combining the AR and MA models
  • multimodal
    • about / Visualizing different populations
  • multiple comparisons
    • about / Multiple comparisons
  • multiple designs
    • testing / Testing multiple designs
  • multiple linear regression / Multiple linear regression
  • multiple tests
    • simulating / Simulating multiple tests

N

  • n-gram
    • about / Better clustering with n-grams
  • Naive Bayes classification
    • about / Naive Bayes classification
    • implementing / Implementing a naive Bayes classifier
    • evaluating / Evaluating the naive Bayes classifier
  • natural logarithm / The log-normal distribution
  • Nelder-Mead optimization
    • about / Estimating the maximum likelihood
    • with Apache Commons Math / Nelder-Mead optimization with Apache Commons Math
  • network analysis
    • data, downloading / Download the data
    • data, inspecting / Inspecting the data
    • graphs, visualizing with Loom / Visualizing graphs with Loom
  • new site design
    • testing / Testing a new site design
  • nonresponse bias
    • about / Bias
  • normal distribution
    • about / The normal distribution
  • normal equation
    • about / The normal equation
    • features / More features
  • null hypothesis / Hypothesis testing

O

  • one-sample t-test
    • about / One-sample t-test
  • one-tailed tests
    • about / Two-tailed tests
  • optimal k
    • determining, with elbow method / Determining optimal k with the elbow method
    • determining, with Dunn index / Determining optimal k with the Dunn index
    • determining, with Davies-Bouldin index / Determining optimal k with the Davies-Bouldin index
  • Ordinary Least Squares (OLS)
    • about / Ordinary least squares
    • slope / Slope and intercept
    • intercept / Slope and intercept
    • interpretation / Interpretation
    • visualization / Visualization
    • assumptions / Assumptions
  • over-fitting
    • about / Bias and variance, Overfitting

P

  • PageRank
    • used, for measuring community influence / Measuring community influence using PageRank
    • implementing, with Glittering / Implementing PageRank with Glittering
    • highest influence, sorting by / Sort by highest influence
    • running, to determine community influencers / Running PageRank to determine community influencers
  • Parkour
    • URL / Parkour distributed sources and sinks
    • used, for creating Mahout vectors / Using Parkour to create Mahout vectors
  • partial autocorrelation
    • calculating / Calculating partial autocorrelation
    • autocovariance / Autocovariance
    • plotting / Plotting partial autocorrelation
  • partial autocorrelation function (PACF)
    • about / Calculating partial autocorrelation
    • calculating, with Durbin-Levinson recursion / PACF with Durbin-Levinson recursion
    • ARMA model order, determining / Determining ARMA model order with ACF and PACF
    • plotting, of airline data / ACF and PACF of airline data
  • parts-of-speech taggers / Tokenizing the Reuters files
  • Pearson's correlation
    • about / Pearson's correlation
    • sample r and population rho / Sample r and population rho
  • phi-quantile / Quantiles
  • Poincaré's baker
    • about / Poincaré's baker
    • distributions, generating / Generating distributions
  • polytope
    • about / Estimating the maximum likelihood
  • populations
    • about / Samples and populations
    • visualizing / Visualizing different populations
  • precision
    • true positives / Precision
    • about / Precision
    • false positives / Precision
  • prediction
    • about / Prediction
    • confidence interval / The confidence interval of a prediction
    • model, scope / Model scope
    • final model / The final model
  • prediction intervals / The confidence interval of a prediction
  • Pregel API
    • about / The Pregel API
    • connected components with / Connected components with the Pregel API
  • probability
    • about / Probability
    • Bayes theorem / Bayes theorem
    • Bayes theorem, with multiple predictors / Bayes theorem with multiple predictors
  • probability densities
    • plotting / Plotting probability densities
  • probability mass function (PMF)
    • about / Probability mass functions
  • processing
    • URL / Using Quil for visualization
  • Pythagoras formula / The bag-of-words and Euclidean distance

Q

  • quantile-quantile plots
    • about / Quantile-quantile plots
  • quantiles
    • about / Quantiles
    • URL / Quantiles
  • quartiles / Quantiles
  • Quil, used for visualization
    • URL / Using Quil for visualization
    • about / Using Quil for visualization
    • sketch window, drawing to / Drawing to the sketch window
    • coordinate system / Quil's coordinate system
    • grid, plotting / Plotting the grid
    • fill color, specifying / Specifying the fill color
    • color and fill / Color and fill
    • image file, outputting / Outputting an image file
    • PDF, output to / Output to PDF

R

  • R-squared
    • multiple / Multiple R-squared
    • adjusted / Adjusted R-squared
  • random forests
    • about / Ensemble learning and random forests
  • random walks
    • about / Random walks
  • RDGs
    • creating, with Glittering / Creating RDGs with Glittering
  • reagent
    • about / State and Reagent
  • recommenders
    • evaluating, with Mahout / Recommender evaluation with Mahout
  • recommenders, evaluating
    • Mahout, using / Recommender evaluation with Mahout
    • distance measures / Evaluating distance measures
    • optimum neighborhood size, determining / Determining optimum neighborhood size
    • information retrieval statistics / Information retrieval statistics
    • recommendation with Boolean preferences / Recommendation with Boolean preferences
  • recommender systems
    • types / Types of recommender systems
    • collaborative filtering / Collaborative filtering
  • regression
    • about / Regression
    • linear equations / Linear equations
    • residuals / Residuals
  • regression lines
    • about / Regression
  • relative power / Relative power
  • resampling
    • about / Resampling
  • residual plot / Visualization
  • Resilient Distributed Datasets (RDDs) / Distributed datasets and tuples
  • Reuters dataset
    • URL / Downloading the data
  • Reuters documents
    • clustering / Clustering the Reuters documents
  • Reuters files, tokenizing
    • about / Tokenizing the Reuters files
    • Jaccard index, applying to documents / Applying the Jaccard index to documents
    • Euclidean distance / The bag-of-words and Euclidean distance
    • bag-of-words / The bag-of-words and Euclidean distance
    • frequency vectors / Representing text as vectors
  • root mean square error
    • calculating, with Parkour / Calculating the root mean square error with Parkour
  • Root mean square error (RMSE) / Recommender evaluation with Mahout
  • Russian election data
    • visualizing / Visualizing the Russian election data

S

  • samples
    • about / Samples and populations
    • comparing / Sample comparisons
    • means, calculating / Calculating sample means
  • Scalable Vector Graphics (SVG)
    • about / Scalable Vector Graphics
  • scalar
    • multiplication / Addition and scalar multiplication
    • addition / Addition and scalar multiplication
  • scale-free networks
    • about / Scale-free networks
  • scatter plots
    • about / Scatter plots
  • scatter transparency
    • about / Scatter transparency
  • shortest path
    • finding / Finding the shortest path
    • minimum spanning trees / Minimum spanning trees
    • connected components / Subgraphs and connected components
    • subgraphs / Subgraphs and connected components
    • web, bow-tie structure / SCC and the bow-tie structure of the web
    • SCC / SCC and the bow-tie structure of the web
  • sigmoid function / Classification with logistic regression, The sigmoid function
  • significance testing
    • about / Significance
  • significance testing proportions
    • about / Significance testing proportions
  • simplex method
    • about / Estimating the maximum likelihood
  • simulation
    • about / Introducing the simulation
    • compiling / Compile the simulation
    • browser simulation / The browser simulation
  • singular matrices / Inversion
  • Singular Value Decomposition (SVD) / Singular value decomposition
  • skewed normal distribution / Generating distributions
  • skewness
    • about / Skewness
    • quantile-quantile plots / Quantile-quantile plots
  • Slope One predictors / Item-based and user-based recommenders
  • Slope One recommenders
    • about / Slope One recommenders
    • URL / Slope One recommenders
    • item differences, calculating / Calculating the item differences
    • recommendations, creating / Making recommendations
  • Spark
    • URL / Large-scale machine learning with Apache Spark and MLlib
    • used, for large-scale machine learning / Large-scale machine learning with Apache Spark and MLlib
  • Sparkling
    • URL / Large-scale machine learning with Apache Spark and MLlib
    • used, for loading data / Loading data with Sparkling
    • used, for mapping data / Mapping data
  • standard deviation / Variance
  • standard error
    • about / Standard error
    • of proportion / The standard error of a proportion
    • bootstrapping, estimating with / Estimation using bootstrapping
    • of proportion, formula / The standard error of a proportion formula
  • standard errors
    • adjusting, for large samples / Adjusting standard errors for large samples
  • Standard Generalized Markup Language (SGML) / Extracting the data
  • state
    • about / State and Reagent
    • updating / Updating state
  • stationary
    • about / Visualizing the airline data
  • stationary time series
    • about / Stationarity
  • statistics
    • sample code, downloading / Downloading the sample code
    • URL / Downloading the sample code
    • examples, running / Running the examples
    • data, downloading / Downloading the data
    • data, inspecting / Inspecting the data
    • data scrubbing / Data scrubbing
    • descriptive statistics / Descriptive statistics
  • stemmers
    • URL / Stemming
  • stemming
    • about / Stemming
  • Stochastic gradient descent
    • about / Stochastic gradient descent
    • with Parkour / Stochastic gradient descent with Parkour
    • mapper, defining / Defining a mapper
    • shaping functions / Parkour shaping functions
    • reducer, defining / Defining a reducer
    • Hadoop jobs, specifying with Parkour graph / Specifying Hadoop jobs with Parkour graph
    • mappers, chaining with Parkour graph / Chaining mappers and reducers with Parkour graph
    • reducers, chaining with Parkour graph / Chaining mappers and reducers with Parkour graph
  • stochastic gradient descent (SGD) / Stochastic gradient descent
  • summary statistics
    • about / Descriptive statistics
  • sum of residual squares (RSS) / The F-test of model significance
  • sum of squared errors (SSE) / Calculating the root mean square error with Parkour
  • supersteps / The Pregel API
  • SVG maps
    • URL / Improving the clarity with illustrations

T

  • t-distribution
    • about / Student's t-distribution
  • t-statistic
    • about / The t-statistic
  • t-test
    • performing / Performing the t-test
  • Tanimoto coefficient / Recommendation with Boolean preferences
  • term frequency (tf) / Representing text as vectors
  • Term Frequency-Inverse Document Frequency (TF-IDF)
    • about / Better clustering with TF-IDF
    • Zipf's law / Zipf's law
    • weight, calculating / Calculating the TF-IDF weight
    • k-means clustering with / k-means clustering with TF-IDF
    • clustering, with n-grams / Better clustering with n-grams
  • term frequency vectors
    • creating / Creating term frequency vectors
    • vector space model / The vector space model and cosine distance
    • cosine distance / The vector space model and cosine distance
    • stop words, removing / Removing stop words
  • Tesser
    • mathematical folds / Mathematical folds with Tesser
    • covariance, calculating with / Calculating covariance with Tesser
    • commutativity / Commutativity
    • simple linear regression / Simple linear regression with Tesser
    • correlation matrix, calculating / Calculating a correlation matrix
    • matrix-sum fold, creating / Creating a matrix-sum fold
  • time series
    • Longley dataset / About the data
    • Airline dataset / About the data
    • Longley data, loading / Loading the Longley data
    • Longley data, plotting with linear model / Fitting curves with a linear model
    • decomposition / Time series decomposition
    • airline data, inspecting / Inspecting the airline data
    • airline data, visualizing / Visualizing the airline data
    • stationary time series / Stationarity
    • de-trending / De-trending and differencing
    • differencing / De-trending and differencing
    • reference link / Discrete time models
    • maximum likelihood estimation / Maximum likelihood estimation
    • forecasting / Time series forecasting
    • forecasting, with Monte Carlo simulation / Forecasting with Monte Carlo simulation
  • Toeplitz matrices
    • about / PACF with Durbin-Levinson recursion
  • tokenization
    • about / Set-of-words and the Jaccard index
  • transduce library
    • URL / Distributed unique IDs with Hadoop
  • triangle counting
    • graph density, measuring with / Measuring graph density with triangle counting
    • built-in triangle counting algorithm, running / Running the built-in triangle counting algorithm
    • implementing, with Glittering / Implement triangle counting with Glittering
    • neighbor IDs, collecting / Step one – collecting neighbor IDs
    • aggregate messages / Steps two, three, and four – aggregate messages
    • counts, dividing / Step five – dividing the counts
    • custom triangle counting algorithm, running / Running the custom triangle counting algorithm
  • Twitter's intent API
    • URL / Download the data, Running PageRank to determine community influencers
  • two-dimensional histogram
    • representing / Representing a two-dimensional histogram
  • two-tailed tests
    • about / Two-tailed tests

U

  • uberjar
    • building / Building an uberjar
    • submitting, to Hadoop / Submitting the uberjar to Hadoop
  • user-based recommenders
    • about / Item-based and user-based recommenders
    • practical considerations / Practical considerations for user and item recommenders
    • building, with Mahout / Building a user-based recommender with Mahout

V

  • variance
    • about / Variance
    • analysis / Analysis of variance
    • calculating, fold used / Calculating the variance using fold
  • vectors
    • about / Vectors
  • visualization
    • code, downloading / Download the code and data
    • data, downloading / Download the code and data
    • exploratory data visualization / Exploratory data visualization
    • two-dimensional histogram, representing / Representing a two-dimensional histogram
    • Quil, using / Using Quil for visualization
  • visualization, for communication
    • about / Visualization for communication
    • wealth distribution, visualizing / Visualizing wealth distribution
    • data, bringing to life / Bringing data to life with Quil
    • bars of differing widths, drawing / Drawing bars of differing widths
    • axis labels, adding / Adding a title and axis labels
    • title, adding / Adding a title and axis labels
    • clarity, improving with illustrations / Improving the clarity with illustrations
    • text, adding to bars / Adding text to the bars
    • additional data, incorporating / Incorporating additional data
    • complex shapes, drawing / Drawing complex shapes, Drawing curves
    • compound charts, plotting / Plotting compound charts
  • visualizations
    • comparative visualizations / Comparative visualizations
    • about / The importance of visualizations
    • electorate data / Visualizing electorate data
    • comparative visualizations, of electorate data / Comparative visualizations of electorate data

W

  • Waikato Environment for Knowledge Analysis (Weka)
    • URL / Classification with clj-ml
  • weighted graph / Visualizing graphs with Loom
  • Welch's t-test / Two-tailed tests
  • whole-graph analysis
    • about / Whole-graph analysis
  • Widrow-Hoff learning rule / The gradient descent update rule
  • wiki
    • URL / Running the examples, Downloading the data

Z

  • z-test
    • performing / Performing a z-test
  • Zipf's law / Zipf's law
  • Zipf scale / Scale-free networks