Case study: fitting classifier models in PySpark
Now that we have examined several algorithms for fitting classifier models in the scikit-learn library, let us look at how we might implement a similar model in PySpark. We can use the same census dataset from earlier in this chapter, and start by loading the data as a text RDD after starting the Spark context:
>>> censusRdd = sc.textFile('census.data')
Next, we need to split each line into individual fields and strip the surrounding whitespace from each one:
>>> censusRddSplit = censusRdd.map(lambda x: [e.strip() for e in x.split(',')])
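To see what the function passed to map does to a single record, the following sketch applies the same split-and-strip logic to one plain Python string, with no Spark context required. The sample line and the helper name split_and_strip are made up for illustration and are not taken from census.data:

```python
# Hypothetical record in the style of a comma-delimited census line;
# the values are invented for illustration.
sample_line = "42, Private, 123456, Masters, 14, Divorced"


def split_and_strip(line):
    """Split a comma-delimited record and strip whitespace from each field."""
    return [field.strip() for field in line.split(',')]


fields = split_and_strip(sample_line)
print(fields)
```

Every field comes back as a string, including the numeric ones, which is why the next step has to inspect the contents of each field to decide how to treat it.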
Now, as before, we need to determine which of our features are categorical and need to be re-encoded using one-hot encoding. We do this by taking a single row and asking whether the string in each position represents a digit; any field that does not is treated as a categorical variable:
>>> categoricalFeatures = [e for e,i in enumerate(censusRddSplit.take(1)[0]) if not i.isdigit()]
>>> allFeatures = [e for e,i in enumerate(censusRddSplit...
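As a minimal sketch of the digit test, the example below runs the same check on one row of already-split fields held in a plain Python list. The row contents and the variable names sample_row, categorical_idx, and numeric_idx are assumptions made for illustration:

```python
# Hypothetical row of stripped string fields; values are invented.
sample_row = ['42', 'Private', '123456', 'Masters', '14', 'Divorced']

# Column positions whose value is not made up purely of digits are
# treated as categorical; the rest are treated as numeric.
categorical_idx = [idx for idx, value in enumerate(sample_row)
                   if not value.isdigit()]
numeric_idx = [idx for idx, value in enumerate(sample_row)
               if value.isdigit()]

print(categorical_idx)
print(numeric_idx)
```

Note that str.isdigit() returns False for strings such as '-5' or '3.5', so this test would classify a column holding negative or decimal values as categorical. It also relies on the single sampled row being representative of every row in the dataset.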