Understanding method chaining
Method chaining is a technique or way of structuring your code. It’s commonly used across DataFrame libraries such as pandas and PySpark. As the name tells you, it means that you chain methods one after another. This makes your code more readable, concise, and maintainable. It follows a natural flow from one operation to another, which makes your code easy to follow. All of that helps you focus on the data transformation logic and problems you’re trying to solve.
The good news is that Polars is a good fit for method chaining. Polars utilizes expressions and other methods that can easily be stacked on each other.
Getting ready
This recipe requires the titanic dataset. Make sure to read it into a DataFrame:
df = pl.read_csv('../data/titanic_dataset.csv')
How to do it...
Let’s say that you’re doing a few operations on the dataset. First, we will predefine the columns that we want to select:
cols = ['Name', 'Sex', 'Age', 'Fare', 'Cabin', 'Pclass', 'Survived']
If you’re not using method chaining, you might want to write code like this:
df = df.select(cols) df = df.filter(pl.col('Age')>=35) df = df.sort(by=['Age', 'Name'])
When you use method chaining, it’d look like this:
df = df.select(cols).filter(pl.col('Age')>=35).sort(by=['Age', 'Name'])
To go one step further, let’s stack these methods vertically. This is the preferred way of writing your code with method chaining:
df = ( df .select(cols) .filter(pl.col('Age')>=35) .sort(by=['Age', 'Name']) )
All of the preceding code produces the same output:
Figure 1.35 – The output after column selection, filtering, and sorting
How it works...
The first example I showed defines each method line by line, storing each result in a variable each time. The last example involved method chaining, aligning the beginning of each method vertically. Some users don’t even know that you can stack your methods on top of each other, especially users who are just getting started. You might have a habit of defining your transformations line by line, like in the first example.
Having looked at a few examples, which pattern do you think is best? I’d say the one using method chaining, stacking each method vertically. Aligning the beginning of each method helps with readability. Having all the logic in the same place makes it easier to maintain the code and figure things out later. It also helps you streamline your workflows by making your code more concise and ensuring that it is organized in a logical way.
How does this help with testing and debugging though? You can comment out or add another method within the parentheses to test the result:
df = ( df .select(cols) # .filter(pl.col('Age')>=35) .sort(by=['Age', 'Name']) ) df.head()
The preceding code will return the following output:
Figure 1.36 – The first five rows without the filtering condition
We’ll cover testing and debugging in more detail in Chapter 12, Testing and Debugging in Polars.
One caveat is that when your chain is too long, it may make your code hard to read and work with. This increased complexity that comes with a long chain can make your debugging hard, too. It can become challenging to understand each intermediary step in a long chain. In that case, you should break your logic down into smaller pieces to help reduce the complexity and length of your chain. With all of that said, it all comes down to the fact that a balance is needed to make testing your code feasible.
In the interest of full disclosure, remember that you don’t have an obligation to use method chaining. If it feels more comfortable or appropriate to write your code line by line separately, that’s all good and fine. Method chaining is just another practice, and many people find it helpful. I can confidently say that method chaining has done me more good than harm.
There’s more...
When you stack your methods vertically, you can also use backslashes instead of using parentheses:
df = df \ .select(cols) \ .filter(pl.col('Age')>=35) \ .sort(by=['Age', 'Name'])
I have to say that adding a backslash for each method is a little bit of work. Also, if you comment out the last method in the chain for testing and debugging purposes, it messes up the whole chain because you can’t end your code with a backslash. I’d choose using parentheses over backslashes any day.
See also
These are useful resources to learn more about method chaining: