Analyzing the therapy bot session dataset
It is always important to first analyze any dataset before applying models to it. In this section, we profile the distribution of the label column and the length of the response_text entries in the therapy bot session data.
Getting ready
This section requires importing functions from pyspark.sql so that they can be applied to our dataframe:
import pyspark.sql.functions as F
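A SparkSession and a dataframe, df, holding the therapy bot session data are also assumed to be in place. The following is a minimal sketch of that setup; the file name and format are assumptions and should be adjusted to wherever the data actually lives:

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession
spark = SparkSession.builder \
    .appName('TherapyBotSessionAnalysis') \
    .getOrCreate()

# Hypothetical path and format; point this at the actual dataset
df = spark.read.csv('TherapyBotSession.csv', header=True, inferSchema=True)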
How to do it...
The following section walks through the steps to profile the text data.
- Execute the following script to group the label column and generate a count distribution of its values:
  df.groupBy("label") \
      .count() \
      .orderBy("count", ascending=False) \
      .show()
- Add a new column, word_count, to the dataframe, df, using the following script:
  import pyspark.sql.functions as F
  df = df.withColumn('word_count', F.size(F.split(F.col('response_text'), ' ')))
- Aggregate the average word count, avg_word_count, by label using the following script:
  df.groupBy('label') \
      .agg(F.avg('word_count').alias('avg_word_count')) \
      .orderBy('avg_word_count', ascending=False) \
      .show()
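If preferred, the count and average word count aggregations above can be computed in a single pass over the data. The following is a sketch of the two steps combined into one groupBy call:

df.groupBy('label') \
    .agg(F.count('*').alias('count'),
         F.avg('word_count').alias('avg_word_count')) \
    .orderBy('count', ascending=False) \
    .show()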
How it works...
The following section explains how each of the profiling scripts works. The first script groups the rows of df by the label column, counts how many rows carry each label, and orders the labels from most to least frequent. The second script uses F.split to break each response_text on the space character into an array of words, and F.size to return the length of that array, which is stored in the new word_count column. The final script groups the dataframe by label once more and applies F.avg to compute the mean word_count for each label, sorted from the longest average response to the shortest.
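As a quick illustration of the word_count logic, the following sketch applies the same expression to a tiny, made-up dataframe; the sample rows and their labels are purely hypothetical and are not drawn from the actual dataset:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample rows, used only to show how word_count is derived
sample = spark.createDataFrame(
    [('flagged', 'I feel very alone'),
     ('not_flagged', 'Thanks for listening')],
    ['label', 'response_text'])

# F.split turns the text into an array of words; F.size counts its elements
sample = sample.withColumn(
    'word_count', F.size(F.split(F.col('response_text'), ' ')))
sample.show()

Splitting on a single space produces a four-element array for the first row and a three-element array for the second, so their word_count values are 4 and 3, respectively.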