Chapter 01: The Python Data Science Stack
Activity 1: IPython and Jupyter
Open the python_script_student.py file in a text editor, copy the contents to a notebook in IPython, and execute the operations.
Copy and paste the code from the Python script into a Jupyter notebook:
import numpy as np

def square_plus(x, c):
    return np.power(x, 2) + c
Now, update the values of the x and c variables and call the function. Afterward, you can change the definition of the function and rerun the cell:
x = 10
c = 100
result = square_plus(x, c)
print(result)
The output is as follows:
200
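Because square_plus is built on np.power, it should also work element-wise on NumPy arrays. The following is a minimal sketch of that behavior; the array values are chosen here only for illustration and are not part of the activity:

```python
import numpy as np


def square_plus(x, c):
    # Square x (a scalar or an array, element-wise) and add the constant c
    return np.power(x, 2) + c


# Scalar input, as in the activity
print(square_plus(10, 100))  # 200

# Array input (illustrative values, not from the activity)
values = np.array([1, 2, 3])
print(square_plus(values, 100))  # [101 104 109]
```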
Activity 2: Working with Data Problems
Import the pandas and NumPy libraries:
import pandas as pd
import numpy as np
Read the RadNet dataset from the U.S. Environmental Protection Agency, available from the Socrata project:
url = "https://opendata.socrata.com/api/views/cf4r-dfwe/rows.csv?accessType=DOWNLOAD"
df = pd.read_csv(url)
Create a list with numeric columns for radionuclides in the RadNet dataset:
columns = df.columns
id_cols = ['State', 'Location', 'Date Posted', 'Date Collected', 'Sample Type', 'Unit']
columns = list(set(columns) - set(id_cols))
columns
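Note that a set difference does not preserve the original column order. If the order matters, a list comprehension is one alternative. The sketch below uses a small hypothetical DataFrame that only mimics the RadNet layout:

```python
import pandas as pd

# Hypothetical miniature frame that mimics the RadNet layout
df = pd.DataFrame({
    'State': ['CA'],
    'Location': ['San Bernardino'],
    'Cs-134': ['Non-detect'],
    'I-131': ['0.5'],
    'I-132': ['ND'],
})

id_cols = ['State', 'Location']

# Keep every column that is not an identifier, in its original order
columns = [col for col in df.columns if col not in id_cols]
print(columns)  # ['Cs-134', 'I-131', 'I-132']
```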
Use the apply method on one column, with a lambda function that compares the Non-detect string:
df['Cs-134'] = df['Cs-134'].apply(lambda x: np.nan if x == "Non-detect" else x)
df.head()
The output is as follows:
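Replace the text values Non-detect and ND with np.nan in all the numeric columns at once, using the same lambda comparison with the applymap method and the list of columns created earlier:
df.loc[:, columns] = df.loc[:, columns].applymap(lambda x: np.nan if x == 'Non-detect' else x)
df.loc[:, columns] = df.loc[:, columns].applymap(lambda x: np.nan if x == 'ND' else x)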
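As an alternative to two applymap passes, the replace method can swap both placeholder strings in a single call. This is only a sketch on hypothetical sample values, not the book's solution. It is also worth knowing that pandas 2.1 and later deprecate applymap in favor of DataFrame.map, so the code above may emit a warning on newer versions:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values; the real dataset is much larger
df = pd.DataFrame({
    'Cs-134': ['Non-detect', '0.2', 'ND'],
    'I-131': ['1.5', 'ND', 'Non-detect'],
})
columns = ['Cs-134', 'I-131']

# Replace both placeholder strings with NaN in a single call
df[columns] = df[columns].replace(['Non-detect', 'ND'], np.nan)

# Count the missing values in one column
print(int(df['Cs-134'].isna().sum()))  # 2
```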
Use the applymap method on the string columns to strip any leading and trailing whitespace from their values:
string_cols = ['State', 'Location', 'Sample Type', 'Unit']
df.loc[:, string_cols] = df.loc[:, string_cols].applymap(lambda x: x.strip())
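The same cleanup can be expressed with the vectorized .str.strip() accessor, which also leaves missing values untouched instead of raising an error. A minimal sketch, assuming a hypothetical frame with padded strings:

```python
import pandas as pd

# Hypothetical values with stray whitespace
df = pd.DataFrame({
    'State': [' CA', 'NY '],
    'Location': ['San Bernardino ', ' Albany'],
})
string_cols = ['State', 'Location']

# Strip leading and trailing whitespace column by column
for col in string_cols:
    df[col] = df[col].str.strip()

print(df['Location'].tolist())  # ['San Bernardino', 'Albany']
```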
Inspect the data types to identify the remaining columns that are not numeric:
df.dtypes
The output is as follows:
Convert the date columns into datetime objects using the to_datetime function, and the radionuclide columns into floats using the to_numeric function:
df['Date Posted'] = pd.to_datetime(df['Date Posted'])
df['Date Collected'] = pd.to_datetime(df['Date Collected'])
for col in columns:
    df[col] = pd.to_numeric(df[col])
df.dtypes
The output is as follows:
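If some placeholder strings were missed in the earlier cleaning steps, to_numeric raises an error by default. Passing errors='coerce' turns anything unparseable into NaN instead. The following sketch uses a hypothetical Series rather than the RadNet data:

```python
import pandas as pd

# Hypothetical raw values, including two placeholder strings
raw = pd.Series(['0.5', '1.25', 'Non-detect', 'ND'])

# Unparseable entries become NaN instead of raising a ValueError
numeric = pd.to_numeric(raw, errors='coerce')

print(numeric.dtype)         # float64
print(numeric.isna().sum())  # 2
```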
Using the selection and filtering methods, verify that the values in the string columns don't have any leading or trailing spaces:
string_cols = ['State', 'Location', 'Sample Type', 'Unit']
for col in string_cols:
    # Select the rows whose value still starts or ends with whitespace
    padded = df.loc[df[col].str.contains(r'^\s|\s$', na=False), col]
    print(col, len(padded))
The output is as follows:
Activity 3: Plotting Data with Pandas
Use the RadNet DataFrame that we have been working with.
Fix all the data type problems, as we saw before.
Create a plot filtered by Location, selecting the city of San Bernardino and one radionuclide, with the collection date on the x-axis and the I-131 concentration on the y-axis:
df.loc[df.Location == 'San Bernardino'].plot(x='Date Collected', y='I-131')
The output is as follows:
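The filter in this step is a Boolean mask passed to loc. The sketch below isolates that selection on a hypothetical three-row frame and sorts the result by date, which is often useful before drawing a line plot:

```python
import pandas as pd

# Hypothetical rows that mimic the cleaned RadNet columns
df = pd.DataFrame({
    'Location': ['San Bernardino', 'Albany', 'San Bernardino'],
    'Date Collected': pd.to_datetime(['2011-03-20', '2011-03-18', '2011-03-15']),
    'I-131': [0.8, 0.1, 0.3],
})

# Boolean mask: True for the rows collected in San Bernardino
mask = df['Location'] == 'San Bernardino'

# Select the matching rows and order them chronologically
subset = df.loc[mask].sort_values('Date Collected')
print(subset['I-131'].tolist())  # [0.3, 0.8]
```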
Create a scatter plot with the concentration of two related radionuclides, I-131 and I-132:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(x=df['I-131'], y=df['I-132'])
_ = ax.set(
    xlabel='I-131',
    ylabel='I-132',
    title='Comparison between concentrations of I-131 and I-132'
)
The output is as follows:
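A scatter plot suggests visually how two variables relate, and a correlation coefficient can complement it with a single number. The following sketch is not part of the activity. It uses hypothetical concentrations and selects the non-interactive Agg backend only so that it runs outside a notebook:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical concentrations, not taken from the RadNet dataset
df = pd.DataFrame({
    'I-131': [0.1, 0.4, 0.9, 1.3],
    'I-132': [0.2, 0.5, 1.1, 1.2],
})

fig, ax = plt.subplots()
ax.scatter(x=df['I-131'], y=df['I-132'])
ax.set(xlabel='I-131', ylabel='I-132')

# Pearson correlation between the two columns (NaN pairs are skipped)
corr = df['I-131'].corr(df['I-132'])
print(round(corr, 3))
```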