Index
A
- A* algorithm
- URL / Finding the shortest path
- abracad library
- URL / Distributed unique IDs with Hadoop
- AcmeContent
- about / Introducing AcmeContent
- sample code / Download the sample code
- acyclic / Visualizing graphs with Loom
- Adaptive Boosting (AdaBoost) / Bagging and boosting
- Akaike Information Criterion (AIC)
- models, identifying / Identifying better models with Akaike Information Criterion
- about / Identifying better models with Akaike Information Criterion
- ALS
- movie recommendations / Movie recommendations with alternating least squares
- using, with Spark / ALS with Spark and MLlib
- using, with MLlib / ALS with Spark and MLlib
- used, for making predictions / Making predictions with ALS
- evaluating / Evaluating ALS
- Anscombe's Quartet / The importance of visualizations
- Apache Commons Math
- about / Estimating the maximum likelihood, Nelder-Mead optimization with Apache Commons Math
- URL / Estimating the maximum likelihood, Nelder-Mead optimization with Apache Commons Math
- used, for Nelder-Mead optimization / Nelder-Mead optimization with Apache Commons Math
- ARMA model order
- determining, with ACF and PACF / Determining ARMA model order with ACF and PACF
- autocorrelation function (ACF)
- about / Determining autocorrelation in AR models
- ARMA model order, determining / Determining ARMA model order with ACF and PACF
- plotting, of airline data / ACF and PACF of airline data
- autocovariance
- about / Autocovariance
- autoregressive (AR) models
- about / Autoregressive models
- autocorrelation, determining / Determining autocorrelation in AR models
- combining, with Moving Average (MA) models / Combining the AR and MA models
- Autoregressive Integrated Moving Average (ARIMA) model
- about / Removing seasonality with differencing
B
- B1
- about / B1
- URL / B1
- bag-of-words / The bag-of-words and Euclidean distance
- bagging
- about / Bagging and boosting
- balanced F-score / F-measure and the harmonic mean
- batch gradient descent
- about / Stochastic gradient descent
- Bayesian view / Probability
- Bayes theorem
- about / Bayes theorem
- with multiple predictors / Bayes theorem with multiple predictors, Naive Bayes classification
- bias
- about / Bias
- high bias, addressing / Addressing high bias
- bias term / Multiple linear regression
- big data
- code, downloading / Downloading the code and data
- example code, URL / Downloading the code and data
- inspecting / Inspecting the data
- records, counting / Counting the records
- bigrams
- about / Better clustering with n-grams
- bimodal
- about / Visualizing different populations
- binning
- about / Binning data
- binomial distribution
- about / The binomial distribution
- bipartite / Visualizing graphs with Loom
- bivariate linear regression / Multiple linear regression
- Bloom filters
- used, for testing large sets membership / Testing set membership with Bloom filters
- Bonferroni correction
- about / The Bonferroni correction
- boosting
- about / Bagging and boosting
- bounce
- about / Introducing AcmeContent
- box and whisker plots
- about / Box plots
- breadth-first search / Breadth-first and depth-first search
C
- C4.5 algorithm / Building a decision tree in clj-ml
- categorical variables / Categorical and dummy variables
- central limit theorem / The central limit theorem
- about / The central limit theorem
- Chi-squared multiple significance testing
- about / Chi-squared multiple significance testing
- categories, visualizing / Visualizing the categories
- chi-squared test / The chi-squared test, The chi-squared test
- chi-squared statistic / The chi-squared statistic
- classifier
- data / About the data
- data, inspecting / Inspecting the data
- relative risk and odds, comparing with / Comparisons with relative risk and odds
- saving, to file / Saving the classifier to a file
- clj-ml
- classification with / Classification with clj-ml
- URL / Classification with clj-ml
- data, loading with / Loading data with clj-ml
- decision tree, building / Building a decision tree in clj-ml
- clj-time library
- URL / Inspecting the airline data
- clojure-opennlp library
- URL / Tokenizing the Reuters files
- Clojure libraries
- URL / Exploratory data visualization
- Clojure library succession
- URL / Calculating the likelihood
- Clojure library Tesser
- URL / Mathematical folds with Tesser
- Clojure reducers library
- URL / Counting the records
- about / The reducers library
- parallel folds / Parallel folds with reducers
- large files, loading with iota / Loading large files with iota
- reducers processing pipeline, creating / Creating a reducers processing pipeline
- curried reductions, with reducers / Curried reductions with reducers
- statistical folds / Statistical folds with reducers
- associativity / Associativity
- mean calculating, fold used / Calculating the mean using fold
- variance calculating, fold used / Calculating the variance using fold
- cluster evaluation, measures
- about / Cluster evaluation measures
- inter-cluster density / Inter-cluster density
- intra-cluster density / Intra-cluster density
- root mean square error, calculating with Parkour / Calculating the root mean square error with Parkour
- clustered points and centroids, loading / Loading clustered points and centroids
- cluster RMSE, calculating / Calculating the cluster RMSE
- optimal k, determining with elbow method / Determining optimal k with the elbow method
- optimal k, determining with Dunn index / Determining optimal k with the Dunn index
- optimal k, determining with Davies-Bouldin index / Determining optimal k with the Davies-Bouldin index
- clustering
- data, downloading / Downloading the data
- data, extracting / Extracting the data
- data, inspecting / Inspecting the data
- clustering, text
- about / Clustering text
- set-of-words / Set-of-words and the Jaccard index
- Jaccard index / Set-of-words and the Jaccard index
- Reuters files, tokenizing / Tokenizing the Reuters files
- text, representing as vectors / Representing text as vectors
- dictionary, creating / Creating a dictionary
- cluster RMSE
- calculating / Calculating the cluster RMSE
- code
- downloading / Download the code and data
- downloading, URL / Download the code and data
- coefficient of determination / Goodness-of-fit and R-square
- coefficient of multiple determination / Multiple R-squared
- collinearity
- about / Collinearity
- multicollinearity / Multicollinearity
- columns
- adding / Adding columns, Adding derived columns
- combinations function
- URL / Determining optimal k with the Dunn index
- communities, with label propagation
- detecting / Detecting communities with label propagation
- map vertices / Step one – map vertices
- vertex attribute, sending / Step two – send the vertex attribute
- aggregate value / Step three – aggregate value
- vertex function / Step four – vertex function
- maximum iterations count, setting / Step five – set the maximum iterations count
- comparative visualizations
- about / Comparative visualizations
- box and whisker plots / Box plots
- cumulative distribution functions / Cumulative distribution functions
- probability mass function (PMF) / Probability mass functions
- scatter plots / Scatter plots
- scatter transparency / Scatter transparency
- confidence interval
- about / Confidence intervals, Confidence intervals
- confounding variables
- about / Regression
- confusion matrix / The confusion matrix
- connected components
- running / Running connected components
- largest connected component, size calculating / Calculating the size of the largest connected component
- connected components, with Pregel API
- about / Connected components with the Pregel API
- map vertices / Step one – map vertices
- message function / Steps two and three – the message function
- attributes, updating / Step four – update the attributes
- convergence, iterating to / Step five – iterate to convergence
- construction
- about / Construction
- content-based filtering / Types of recommender systems
- content distribution network (CDN) / jStat
- covariance
- about / Covariance
- calculating, with Tesser / Calculating covariance with Tesser
- cross-validation
- about / Cross-validation
- cumulative distribution function (CDF)
- about / Hypothesis testing
- Cumulative distribution functions (CDFs)
- about / Cumulative distribution functions
D
- daily means distribution
- about / The distribution of daily means
- data
- inspecting / Load and inspect the data, Inspecting the data, Inspect the data
- loading / Load and inspect the data
- about / About the data
- Guardian's excellent data blog, URL / About the data
- visualizing / Visualizing the data
- downloading / Download the code and data
- downloading, URL / Download the code and data
- parsing / Parse the data
- data scrubbing
- about / Data scrubbing
- Davies-Bouldin index
- used, for determining optimal k / Determining optimal k with the Davies-Bouldin index
- decision trees
- about / Decision trees
- information / Information
- entropy / Entropy
- information gain / Information gain
- information gain, using to identify best predictor / Using information gain to identify the best predictor
- building, recursively / Recursively building a decision tree
- using, for classification / Using the decision tree for classification
- classifier, evaluating / Evaluating the decision tree classifier
- building, in clj-ml / Building a decision tree in clj-ml
- degenerate matrices / Inversion
- degrees of freedom
- about / Student's t-distribution, Degrees of freedom
- Delta rule / The gradient descent update rule
- dependent variable
- about / Regression
- depth-first search / Breadth-first and depth-first search
- descriptive statistics
- about / Descriptive statistics
- mean / The mean
- mathematical notation, interpreting / Interpreting mathematical notation
- median / The median
- dictionary
- creating / Creating a dictionary
- dimensionality reduction
- about / Dimensionality reduction
- Iris dataset, plotting / Plotting the Iris dataset
- principal component analysis (PCA) / Principal component analysis
- Singular Value Decomposition (SVD) / Singular value decomposition
- dimensions
- about / Dimensions
- Directed Acyclic Graph (DAG) / Visualizing graphs with Loom
- Discounted Cumulative Gain (DCG) / Normalized discounted cumulative gain
- discrete time models
- about / Discrete time models
- random walks / Random walks
- autoregressive (AR) models / Autoregressive models
- autocorrelation, determining in AR models / Determining autocorrelation in AR models
- Moving Average (MA) models / Moving-average models
- partial autocorrelation function (PACF), calculating / Calculating partial autocorrelation
- seasonality, removing with differencing / Removing seasonality with differencing
- distance measures, evaluating
- about / Evaluating distance measures
- Pearson correlation similarity / The Pearson correlation similarity
- Spearman's rank similarity / Spearman's rank similarity
- distributed cache
- data, sharing with / Sharing data with the distributed cache
- distributed unique IDs
- creating / Creating distributed unique IDs
- with Hadoop / Distributed unique IDs with Hadoop
- dot product / Matrix-vector multiplication
- dummy variables / Categorical and dummy variables
- Dunn index
- used, for determining optimal k / Determining optimal k with the Dunn index
- Durbin-Levinson recursion
- used, for calculating partial autocorrelation function (PACF) / PACF with Durbin-Levinson recursion
- about / PACF with Durbin-Levinson recursion
- URL / PACF with Durbin-Levinson recursion
- dwell time
- about / Introducing AcmeContent
- dwell times
- visualizing / Visualizing the dwell times
E
- edge-list format / Inspecting the data
- elbow method
- used, for determining optimal k / Determining optimal k with the elbow method
- ensemble learning
- about / Ensemble learning and random forests
- entropy / Entropy
- explained sum of squares (ESS) / The F-test of model significance
- exploratory data visualization / Exploratory data visualization
- exponential distribution
- about / The exponential distribution
F
- F-distribution
- about / The F-distribution
- F-statistic
- about / The F-statistic
- F-test
- about / The F-test
- F1 measure / F-measure and the harmonic mean
- feature matrix
- creating / Creating a feature matrix
- Fisher z-transformation
- about / Confidence intervals
- Flambo
- URL / Large-scale machine learning with Apache Spark and MLlib
- fold / Parallel folds with reducers
- frequency vectors / Representing text as vectors
- frequentist / Probability
- Fressian
- URL / Chaining mappers and reducers with Parkour graph
- fs library
- URL / Clustering the Reuters documents
G
- Gaussian distribution
- about / The normal distribution
- central limit theorem / The central limit theorem
- Giraph
- URL / Distributed graph computation with GraphX
- GitHub
- URL / Downloading the sample code
- Glittering
- URL / Creating RDGs with Glittering
- gradient descent
- about / The logistic regression cost function, Multiple regression with gradient descent
- multiple regression with / Multiple regression with gradient descent
- update rule / The gradient descent update rule
- learning rate / The gradient descent learning rate
- feature scaling / Feature scaling
- feature extraction / Feature extraction
- custom Tesser fold, creating / Creating a custom Tesser fold
- total model error, calculating / Calculating the total model error
- matrix-mean fold, creating / Creating a matrix-mean fold
- single step, applying / Applying a single step of gradient descent
- iterative gradient descent, running / Running iterative gradient descent
- scaling with Hadoop / Scaling gradient descent with Hadoop
- gradient descent on Hadoop, with Tesser and Parkour
- about / Gradient descent on Hadoop with Tesser and Parkour
- Parkour distributed sources and sinks / Parkour distributed sources and sinks
- feature scale fold, running with Hadoop / Running a feature scale fold with Hadoop
- gradient descent, running with Hadoop / Running gradient descent with Hadoop
- code, preparing for Hadoop cluster / Preparing our code for a Hadoop cluster
- uberjar, building / Building an uberjar
- uberjar, submitting to Hadoop / Submitting the uberjar to Hadoop
- graphs
- visualizing, Loom used / Visualizing graphs with Loom
- graph traversal
- with Loom / Graph traversal with Loom
- Königsberg city, seven bridges / The seven bridges of Königsberg
- GraphViz
- URL / Visualizing graphs with Loom
- GraphX / Scale-free networks
- distributed graph computation / Distributed graph computation with GraphX
- RDGs, creating with Glittering / Creating RDGs with Glittering
- graph density, measuring with triangle counting / Measuring graph density with triangle counting
- partitioning strategies / GraphX partitioning strategies
- built-in triangle counting algorithm, running / Running the built-in triangle counting algorithm
- triangle counting, implementing with Glittering / Implement triangle counting with Glittering
- custom triangle counting algorithm, running / Running the custom triangle counting algorithm
- Pregel API / The Pregel API
- Pregel API, connected components / Connected components with the Pregel API
- connected components, running / Running connected components
- largest connected component, size calculating / Calculating the size of the largest connected component
- communities with label propagation, detecting / Detecting communities with label propagation
- label propagation, running / Running label propagation
- flow formulation / The flow formulation
- PageRank, implementing with Glittering / Implementing PageRank with Glittering
- PageRank, running to determine community influencers / Running PageRank to determine community influencers
- gross domestic product (GDP)
- about / About the data
H
- Hadoop Distributed File System (HDFS) / Large-scale machine learning with Apache Spark and MLlib
- Hadoop installation guides
- URL / Submitting the uberjar to Hadoop
- Hama
- URL / Distributed graph computation with GraphX
- heteroscedasticity
- about / Visualizing the airline data
- histogram
- about / Histograms
- hypothesis testing
- about / Visualizing different populations, Hypothesis testing, Hypothesis testing
- significance testing / Significance
I
- Ideal Discounted Cumulative Gain (IDCG) / Normalized discounted cumulative gain
- identity matrix / The identity matrix
- Incanter
- gradient descent with / Gradient descent with Incanter
- logistic regression, implementing with / Implementing logistic regression with Incanter
- Incanter's linear model
- about / Incanter's linear model
- F-test / The F-test of model significance
- Incanter library
- URL / Inspecting the data
- independent variable
- about / Regression
- indices function / Testing set membership with Bloom filters
- inferential statistics
- about / Descriptive statistics
- information gain
- about / Information gain
- used, for identifying best predictor / Using information gain to identify the best predictor
- Information Retrieval statistics (IR stats) evaluator
- about / Information retrieval statistics
- precision / Precision
- recall / Recall
- of Mahout / Mahout's information retrieval evaluator
- F-measure / F-measure and the harmonic mean
- harmonic mean / F-measure and the harmonic mean
- false positive rate / Fall-out
- fall-out / Fall-out
- Discounted Cumulative Gain (DCG) / Normalized discounted cumulative gain
- normalized discounted cumulative gain / Normalized discounted cumulative gain
- results, plotting / Plotting the information retrieval results
- implicit, versus explicit feedback / Implicit versus explicit feedback
- inter-cluster density / Inter-cluster density
- interface
- binding / Binding the interface
- interquartile range / Quantiles
- intra-cluster density / Intra-cluster density
- inversion matrix / Inversion
- iota
- URL / Loading large files with iota
- used, for loading large files / Loading large files with iota
- IRS data definition
- URL / Inspecting the data
- IRS Statistics of Income (SoI)
- URL / Downloading the code and data
- item-based recommenders
- about / Item-based and user-based recommenders
- practical considerations / Practical considerations for user and item recommenders
J
- Jaccard index / Set-of-words and the Jaccard index
- applying, to documents / Applying the Jaccard index to documents
- jStat
- about / jStat
- URL / jStat
K
- k-means
- drawbacks / The drawbacks of k-means
- Mahalanobis distance measure / The Mahalanobis distance measure
- dimensionality, curse / The curse of dimensionality
- k-means clustering
- about / Clustering with k-means and Incanter
- with Term Frequency-Inverse Document Frequency (TF-IDF) / k-means clustering with TF-IDF
- k-means clustering, running with Mahout
- about / Running k-means clustering with Mahout
- results, viewing / Viewing k-means clustering results
- clustered output, interpreting / Interpreting the clustered output
- k-nearest neighbors (k-NN)
- about / k-nearest neighbors
- kappa statistic model / The kappa statistic
- k hash functions / Testing set membership with Bloom filters
L
- label propagation
- about / Detecting communities with label propagation
- running / Running label propagation
- large-scale clustering, with Mahout
- about / Large-scale clustering with Mahout
- text documents, converting to sequence file / Converting text documents to a sequence file
- Mahout vectors creating, Parkour used / Using Parkour to create Mahout vectors
- distributed unique IDs, creating / Creating distributed unique IDs
- distributed unique IDs, with Hadoop / Distributed unique IDs with Hadoop
- data, sharing with distributed cache / Sharing data with the distributed cache
- Mahout vectors, building from input documents / Building Mahout vectors from input documents
- large-scale machine learning
- MLlib, using / Large-scale machine learning with Apache Spark and MLlib
- Spark, using / Large-scale machine learning with Apache Spark and MLlib
- data, loading with Sparkling / Loading data with Sparkling
- data, mapping / Mapping data
- tuples / Distributed datasets and tuples
- distributed datasets / Distributed datasets and tuples
- data, filtering / Filtering data
- persistence / Persistence and caching
- caching / Persistence and caching
- larger sets
- probabilistic methods / Probabilistic methods for large sets
- membership, testing with Bloom filters / Testing set membership with Bloom filters
- Jaccard similarity, with MinHash / Jaccard similarity for large sets with MinHash
- learning rate / The gradient descent update rule
- locality-sensitive hashing (LSH)
- used, for reducing pair comparisons / Reducing pair comparisons with locality-sensitive hashing
- about / Reducing pair comparisons with locality-sensitive hashing
- signatures, bucketing / Bucketing signatures
- URL / Bucketing signatures
- log-linear / Visualizing the dwell times
- log-log chart / Visualizing the dwell times
- log-normal distribution
- about / The log-normal distribution
- correlation, visualizing / Visualizing correlation
- jittering / Jittering
- logistic regression
- and naive Bayes approaches, comparing / Comparing the logistic regression and naive Bayes approaches
- logistic regression, classifying
- about / Classification with logistic regression
- sigmoid function / The sigmoid function
- logistic regression cost function / The logistic regression cost function
- parameter optimization, with gradient descent / Parameter optimization with gradient descent
- gradient descent, with Incanter / Gradient descent with Incanter
- convexity / Convexity
- logistic regression, implementing with Incanter
- about / Implementing logistic regression with Incanter
- feature matrix, creating / Creating a feature matrix
- logistic regression classifier, evaluating / Evaluating the logistic regression classifier
- confusion matrix / The confusion matrix
- kappa statistic / The kappa statistic
- logistic regression classifier
- evaluating / Evaluating the logistic regression classifier
- logistic regression cost function
- about / The logistic regression cost function
- Loom
- used, for visualizing graphs / Visualizing graphs with Loom
- URL / Visualizing graphs with Loom
- graph traversal with / Graph traversal with Loom
- loss function
- about / Ordinary least squares, Parameter optimization with gradient descent
M
- machine learning
- movie recommendations, with ALS / Movie recommendations with alternating least squares
- ALS, evaluating / Evaluating ALS
- sum of squared errors, calculating / Calculating the sum of squared errors
- Mahalanobis distance measure / The Mahalanobis distance measure
- Mahout
- URL / Large-scale clustering with Mahout
- used, for building user-based recommenders / Building a user-based recommender with Mahout
- used, for evaluating recommenders / Recommender evaluation with Mahout
- Information Retrieval statistics (IR stats) evaluator / Mahout's information retrieval evaluator
- Mahout vectors
- creating, Parkour used / Using Parkour to create Mahout vectors
- building, from input documents / Building Mahout vectors from input documents
- matrix
- about / Matrices
- dimensions / Dimensions
- vectors / Vectors
- construction / Construction
- scalar multiplication / Addition and scalar multiplication
- scalar addition / Addition and scalar multiplication
- matrix-vector multiplication / Matrix-vector multiplication
- matrix-matrix multiplication / Matrix-matrix multiplication
- transposition / Transposition
- identity matrix / The identity matrix
- inversion / Inversion
- matrix-matrix multiplication / Matrix-matrix multiplication
- matrix-vector multiplication / Matrix-vector multiplication
- maximum likelihood, time series
- estimating / Maximum likelihood estimation, Estimating the maximum likelihood
- calculating / Calculating the likelihood
- estimating, with Nelder-Mead optimization / Nelder-Mead optimization with Apache Commons Math
- estimating, with Akaike Information Criterion / Identifying better models with Akaike Information Criterion
- maximum likelihood estimation
- about / Removing seasonality with differencing
- m bits / Testing set membership with Bloom filters
- mean
- calculating, fold used / Calculating the mean using fold
- mean square error (MSE) / The F-test of model significance
- mean square model (MSM) / The F-test of model significance
- Medley
- URL / Performing a z-test
- memoryless / The exponential distribution
- meta-algorithm / Bagging and boosting
- MinHash
- used, for Jaccard similarity for larger sets / Jaccard similarity for large sets with MinHash
- MLlib
- used, for large-scale machine learning / Large-scale machine learning with Apache Spark and MLlib
- URL / Large-scale machine learning with Apache Spark and MLlib, Machine learning on Spark with MLlib
- used, for machine learning on Spark / Machine learning on Spark with MLlib
- using, with ALS / ALS with Spark and MLlib
- using, with Spark / ALS with Spark and MLlib
- ALS, evaluating / Evaluating ALS
- Monte Carlo simulation
- used, for forecasting time series / Forecasting with Monte Carlo simulation
- Moving Average (MA) models
- about / Moving-average models
- autocorrelation, determining / Determining autocorrelation in MA models
- combining, with autoregressive (AR) models / Combining the AR and MA models
- multimodal
- about / Visualizing different populations
- multiple comparisons
- about / Multiple comparisons
- multiple designs
- testing / Testing multiple designs
- multiple linear regression / Multiple linear regression
- multiple tests
- simulating / Simulating multiple tests
N
- n-gram
- about / Better clustering with n-grams
- Naive Bayes classification
- about / Naive Bayes classification
- implementing / Implementing a naive Bayes classifier
- evaluating / Evaluating the naive Bayes classifier
- natural logarithm / The log-normal distribution
- Nelder-Mead optimization
- about / Estimating the maximum likelihood
- with Apache Commons Math / Nelder-Mead optimization with Apache Commons Math
- network analysis
- data, downloading / Download the data
- data, inspecting / Inspecting the data
- graphs, visualizing with Loom / Visualizing graphs with Loom
- new site design
- testing / Testing a new site design
- nonresponse bias
- about / Bias
- normal distribution
- about / The normal distribution
- normal equation
- about / The normal equation
- features / More features
- null hypothesis / Hypothesis testing
O
- one-sample t-test
- about / One-sample t-test
- one-tailed tests
- about / Two-tailed tests
- optimal k
- determining, with elbow method / Determining optimal k with the elbow method
- determining, with Dunn index / Determining optimal k with the Dunn index
- determining, with Davies-Bouldin index / Determining optimal k with the Davies-Bouldin index
- Ordinary Least Squares (OLS)
- about / Ordinary least squares
- slope / Slope and intercept
- intercept / Slope and intercept
- interpretation / Interpretation
- visualization / Visualization
- assumptions / Assumptions
- over-fitting
- about / Bias and variance, Overfitting
P
- PageRank
- used, for measuring community influence / Measuring community influence using PageRank
- implementing, with Glittering / Implementing PageRank with Glittering
- highest influence, sorting by / Sort by highest influence
- running, to determine community influencers / Running PageRank to determine community influencers
- Parkour
- URL / Parkour distributed sources and sinks
- used, for creating Mahout vectors / Using Parkour to create Mahout vectors
- partial autocorrelation
- calculating / Calculating partial autocorrelation
- autocovariance / Autocovariance
- plotting / Plotting partial autocorrelation
- partial autocorrelation function (PACF)
- about / Calculating partial autocorrelation
- calculating, with Durbin-Levinson recursion / PACF with Durbin-Levinson recursion
- ARMA model order, determining / Determining ARMA model order with ACF and PACF
- plotting, of airline data / ACF and PACF of airline data
- parts-of-speech taggers / Tokenizing the Reuters files
- Pearson's correlation
- about / Pearson's correlation
- sample r and population rho / Sample r and population rho
- phi-quantile / Quantiles
- Poincaré's baker
- about / Poincaré's baker
- distributions, generating / Generating distributions
- polytope
- about / Estimating the maximum likelihood
- populations
- about / Samples and populations
- visualizing / Visualizing different populations
- precision
- true positives / Precision
- about / Precision
- false positives / Precision
- prediction
- about / Prediction
- confidence interval / The confidence interval of a prediction
- model, scope / Model scope
- final model / The final model
- prediction intervals / The confidence interval of a prediction
- Pregel API
- about / The Pregel API
- connected components with / Connected components with the Pregel API
- probability
- about / Probability
- Bayes theorem / Bayes theorem
- Bayes theorem, with multiple predictors / Bayes theorem with multiple predictors
- probability densities
- plotting / Plotting probability densities
- probability mass function (PMF)
- about / Probability mass functions
- Processing
- URL / Using Quil for visualization
- Pythagoras formula / The bag-of-words and Euclidean distance
Q
- quantile-quantile plots
- about / Quantile-quantile plots
- quantiles
- about / Quantiles
- URL / Quantiles
- quartiles / Quantiles
- Quil, used for visualization
- URL / Using Quil for visualization
- about / Using Quil for visualization
- sketch window, drawing to / Drawing to the sketch window
- coordinate system / Quil's coordinate system
- grid, plotting / Plotting the grid
- fill color, specifying / Specifying the fill color
- color and fill / Color and fill
- image file, outputting / Outputting an image file
- PDF, output to / Output to PDF
R
- R-squared
- multiple / Multiple R-squared
- adjusted / Adjusted R-squared
- random forests
- about / Ensemble learning and random forests
- random walks
- about / Random walks
- RDGs
- creating, with Glittering / Creating RDGs with Glittering
- Reagent
- about / State and Reagent
- recommenders
- evaluating, with Mahout / Recommender evaluation with Mahout
- recommenders, evaluating
- Mahout, using / Recommender evaluation with Mahout
- distance measures / Evaluating distance measures
- optimum neighborhood size, determining / Determining optimum neighborhood size
- information retrieval statistics / Information retrieval statistics
- recommendation with Boolean preferences / Recommendation with Boolean preferences
- recommender systems
- types / Types of recommender systems
- collaborative filtering / Collaborative filtering
- regression
- about / Regression
- linear equations / Linear equations
- residuals / Residuals
- regression lines
- about / Regression
- relative power / Relative power
- resampling
- about / Resampling
- residual plot / Visualization
- Resilient Distributed Datasets (RDDs) / Distributed datasets and tuples
- Reuters dataset
- URL / Downloading the data
- Reuters documents
- clustering / Clustering the Reuters documents
- Reuters files, tokenizing
- about / Tokenizing the Reuters files
- Jaccard index, applying to documents / Applying the Jaccard index to documents
- Euclidean distance / The bag-of-words and Euclidean distance
- bag-of-words / The bag-of-words and Euclidean distance
- frequency vectors / Representing text as vectors
- root mean square error
- calculating, with Parkour / Calculating the root mean square error with Parkour
- Root mean square error (RMSE) / Recommender evaluation with Mahout
- Russian election data
- visualizing / Visualizing the Russian election data
S
- samples
- about / Samples and populations
- comparing / Sample comparisons
- means, calculating / Calculating sample means
- Scalable Vector Graphics (SVG)
- about / Scalable Vector Graphics
- scalar
- multiplication / Addition and scalar multiplication
- addition / Addition and scalar multiplication
- scale-free networks
- about / Scale-free networks
- scatter plots
- about / Scatter plots
- scatter transparency
- about / Scatter transparency
- shortest path
- finding / Finding the shortest path
- minimum spanning trees / Minimum spanning trees
- connected components / Subgraphs and connected components
- subgraphs / Subgraphs and connected components
- web, bow-tie structure / SCC and the bow-tie structure of the web
- SCC / SCC and the bow-tie structure of the web
- sigmoid function / Classification with logistic regression, The sigmoid function
- significance testing
- about / Significance
- significance testing proportions
- about / Significance testing proportions
- simplex method
- about / Estimating the maximum likelihood
- simulation
- about / Introducing the simulation
- compiling / Compile the simulation
- browser simulation / The browser simulation
- singular matrices / Inversion
- Singular Value Decomposition (SVD) / Singular value decomposition
- skewed normal distribution / Generating distributions
- skewness
- about / Skewness
- quantile-quantile plots / Quantile-quantile plots
- Slope One predictors / Item-based and user-based recommenders
- Slope One recommenders
- about / Slope One recommenders
- URL / Slope One recommenders
- item differences, calculating / Calculating the item differences
- recommendations, creating / Making recommendations
- Spark
- URL / Large-scale machine learning with Apache Spark and MLlib
- used, for large-scale machine learning / Large-scale machine learning with Apache Spark and MLlib
- Sparkling
- URL / Large-scale machine learning with Apache Spark and MLlib
- used, for loading data / Loading data with Sparkling
- used, for mapping data / Mapping data
- standard deviation / Variance
- standard error
- about / Standard error
- of proportion / The standard error of a proportion
- bootstrapping, estimating with / Estimation using bootstrapping
- of proportion, formula / The standard error of a proportion formula
- standard errors
- adjusting, for large samples / Adjusting standard errors for large samples
- Standard Generalized Markup Language (SGML) / Extracting the data
- state
- about / State and Reagent
- updating / Updating state
- stationary
- about / Visualizing the airline data
- stationary time series
- about / Stationarity
- statistics
- sample code, downloading / Downloading the sample code
- URL / Downloading the sample code
- examples, running / Running the examples
- data, downloading / Downloading the data
- data, inspecting / Inspecting the data
- data scrubbing / Data scrubbing
- descriptive statistics / Descriptive statistics
- stemmers
- URL / Stemming
- stemming
- about / Stemming
- stochastic gradient descent
- about / Stochastic gradient descent
- with Parkour / Stochastic gradient descent with Parkour
- mapper, defining / Defining a mapper
- shaping functions / Parkour shaping functions
- reducer, defining / Defining a reducer
- Hadoop jobs, specifying with Parkour graph / Specifying Hadoop jobs with Parkour graph
- mappers, chaining with Parkour graph / Chaining mappers and reducers with Parkour graph
- reducers, chaining with Parkour graph / Chaining mappers and reducers with Parkour graph
- stochastic gradient descent (SGD) / Stochastic gradient descent
- summary statistics
- about / Descriptive statistics
- sum of residual squares (RSS) / The F-test of model significance
- sum of squared errors (SSE) / Calculating the root mean square error with Parkour
- supersteps / The Pregel API
- SVG maps
- URL / Improving the clarity with illustrations
T
- t-distribution
- about / Student's t-distribution
- t-statistic
- about / The t-statistic
- t-test
- performing / Performing the t-test
- Tanimoto coefficient / Recommendation with Boolean preferences
- term frequency (tf) / Representing text as vectors
- Term Frequency-Inverse Document Frequency (TF-IDF)
- about / Better clustering with TF-IDF
- Zipf's law / Zipf's law
- weight, calculating / Calculating the TF-IDF weight
- k-means clustering with / k-means clustering with TF-IDF
- clustering, with n-grams / Better clustering with n-grams
- term frequency vectors
- creating / Creating term frequency vectors
- vector space model / The vector space model and cosine distance
- cosine distance / The vector space model and cosine distance
- stop words, removing / Removing stop words
- Tesser
- mathematical folds / Mathematical folds with Tesser
- covariance, calculating with / Calculating covariance with Tesser
- commutativity / Commutativity
- simple linear regression / Simple linear regression with Tesser
- correlation matrix, calculating / Calculating a correlation matrix
- matrix-sum fold, creating / Creating a matrix-sum fold
- time series
- Longley dataset / About the data
- Airline dataset / About the data
- Longley data, loading / Loading the Longley data
- Longley data, plotting with linear model / Fitting curves with a linear model
- decomposition / Time series decomposition
- airline data, inspecting / Inspecting the airline data
- airline data, visualizing / Visualizing the airline data
- stationary time series / Stationarity
- de-trending / De-trending and differencing
- differencing / De-trending and differencing
- reference link / Discrete time models
- maximum likelihood estimation / Maximum likelihood estimation
- forecasting / Time series forecasting
- forecasting, with Monte Carlo simulation / Forecasting with Monte Carlo simulation
- Toeplitz matrices
- about / PACF with Durbin-Levinson recursion
- tokenization
- about / Set-of-words and the Jaccard index
- transduce library
- URL / Distributed unique IDs with Hadoop
- triangle counting
- graph density, measuring with / Measuring graph density with triangle counting
- built-in triangle counting algorithm, running / Running the built-in triangle counting algorithm
- implementing, with Glittering / Implement triangle counting with Glittering
- neighbor IDs, collecting / Step one – collecting neighbor IDs
- aggregate messages / Steps two, three, and four – aggregate messages
- counts, dividing / Step five – dividing the counts
- custom triangle counting algorithm, running / Running the custom triangle counting algorithm
- Twitter's intent API
- URL / Download the data, Running PageRank to determine community influencers
- two-dimensional histogram
- representing / Representing a two-dimensional histogram
- two-tailed tests
- about / Two-tailed tests
U
- uberjar
- building / Building an uberjar
- submitting, to Hadoop / Submitting the uberjar to Hadoop
- user-based recommenders
- about / Item-based and user-based recommenders
- practical considerations / Practical considerations for user and item recommenders
- building, with Mahout / Building a user-based recommender with Mahout
V
- variance
- about / Variance
- analysis / Analysis of variance
- calculating, with fold / Calculating the variance using fold
- vectors
- about / Vectors
- visualization
- code, downloading / Download the code and data
- data, downloading / Download the code and data
- exploratory data visualization / Exploratory data visualization
- two-dimensional histogram, representing / Representing a two-dimensional histogram
- Quil, using / Using Quil for visualization
- visualization, for communication
- about / Visualization for communication
- wealth distribution, visualizing / Visualizing wealth distribution
- data, bringing to life / Bringing data to life with Quil
- bars of differing widths, drawing / Drawing bars of differing widths
- axis labels, adding / Adding a title and axis labels
- title, adding / Adding a title and axis labels
- clarity, improving with illustrations / Improving the clarity with illustrations
- text, adding to bars / Adding text to the bars
- additional data, incorporating / Incorporating additional data
- complex shapes, drawing / Drawing complex shapes, Drawing curves
- compound charts, plotting / Plotting compound charts
- visualizations
- comparative visualizations / Comparative visualizations
- about / The importance of visualizations
- electorate data / Visualizing electorate data
- comparative visualizations, of electorate data / Comparative visualizations of electorate data
W
- Waikato Environment for Knowledge Analysis (Weka)
- URL / Classification with clj-ml
- weighted graph / Visualizing graphs with Loom
- Welch's t-test / Two-tailed tests
- whole-graph analysis
- about / Whole-graph analysis
- Widrow-Hoff learning rule / The gradient descent update rule
- wiki
- URL / Running the examples, Downloading the data
Z
- z-test
- performing / Performing a z-test
- Zipf's law / Zipf's law
- Zipf scale / Scale-free networks