Indexing is used both to optimize data access and to supply parameters to specific machine learning algorithms in a format they accept.
We will incorporate the race variable into the decision tree model, so the first step is to determine which distinct values race takes. We do this by again using SQL, this time to count the frequency of each race value. Note that we can write either GROUP BY race or GROUP BY 1; the latter is shorthand for the first column in the SELECT list, which here is race:
%python
# Count the rows for each distinct race value; GROUP BY 1 refers to the first SELECT column.
dfx = spark.sql("SELECT race, count(*) FROM stopfrisk GROUP BY 1")
dfx.show()
Observe that there are eight distinct values: Q, B, U, Z, A, W, I, and P.
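The equivalence of the two GROUP BY forms can be checked on a small table. The sketch below is only an illustration: it uses Python's built-in sqlite3 module rather than Spark SQL so that it runs without a cluster, and the rows are hypothetical toy values, not the actual stop-and-frisk data.

```python
import sqlite3

# Hypothetical toy rows; these are NOT the real stop-and-frisk data.
rows = [("B",), ("W",), ("B",), ("Q",), ("B",), ("W",)]

# Assumption: an in-memory SQLite table stands in for the Spark table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stopfrisk (race TEXT)")
conn.executemany("INSERT INTO stopfrisk VALUES (?)", rows)

# Group by the column name.
by_name = conn.execute(
    "SELECT race, count(*) FROM stopfrisk GROUP BY race ORDER BY race"
).fetchall()

# Group by the position of the column in the SELECT list.
by_position = conn.execute(
    "SELECT race, count(*) FROM stopfrisk GROUP BY 1 ORDER BY race"
).fetchall()
conn.close()

print(by_name)                 # frequency of each race value
print(by_name == by_position)  # both forms give the same result
```

The positional form is shorter to type, but it silently changes meaning if the SELECT list is reordered, so the named form is often the safer choice in code that will be maintained.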
Next, call indexer.fit(df2) and then transform. This maps each value of the string factor (race) to a numeric index (race_indexed):
%python
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="race"...
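To make the mapping concrete, here is a minimal pure-Python sketch of what a frequency-based string indexer does. It is not Spark's implementation. It assumes the default StringIndexer ordering, in which the most frequent label receives index 0.0, and it breaks ties alphabetically as a simplifying assumption. The function name fit_string_index and the toy values are hypothetical.

```python
from collections import Counter

def fit_string_index(values):
    """Return a label -> index mapping, most frequent label first.

    Hypothetical helper for illustration; not part of pyspark.
    """
    counts = Counter(values)
    # Sort by descending frequency; ties are broken alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    # Indices are floats, matching the double-typed output column in Spark.
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column; NOT the real stop-and-frisk data.
race = ["B", "W", "B", "Q", "B", "W"]

mapping = fit_string_index(race)            # the "fit" step learns the mapping
race_indexed = [mapping[v] for v in race]   # the "transform" step applies it

print(mapping)       # {'B': 0.0, 'W': 1.0, 'Q': 2.0}
print(race_indexed)  # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

Because the indices only reflect frequency rank, they carry no natural ordering among the race values; they are a numeric encoding of a categorical variable, not a measurement.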