Pivot Tables on steroids: Excel-to-pandas equivalences

Juan Felipe Alvarez Jaramillo
5 min readJul 30, 2019

--

Pivot Tables are, without question, one of the most powerful applications of Ms Excel. With Pivot Tables you can summarize vast amounts of information with just a few commands and get a general feeling of the dataset you are working with. I have been using Pivot Tables in Excel as my default workhorse for all kinds of problems for a long time and they have lived up to their promise….so far.

Ms Excel is great, but it does have some limitations dealing with large, complex datasets. Ever since I began to use Python to tackle more complicated business questions, I started to realize that as the dataset grew in size, and the information tended to flow faster, Pandas attractiveness was more appealing. Moreover, pandas has a very ambitious goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. I guess in the future we might all get used to transition to more powerful open-source solutions like pandas.

When I started to use pandas, I quickly realized that pd.pivot_table was not very straightforward to interact with and I must recognize that my learning curve to use this function gave me a constant headache.

So, to make life easier, in this blog I intend to perform a side-by-side equivalence of the key commands you will need to do in pandas to accomplish what you are used to do in Pivot Tables. I will use a simple dummy dataset I created using this website, which has these fields:

date: month of observation

client: client’s name

region: client’s region

sales: value of sales made

client_type: client category (A/B/C)

Summarizing time series: one common use of the Pivot Table is to aggregate a numerical field (in this case sales) by some type of categories (in this case the month of each sale). Below, we see a table that sums all sales per every month. To achieve the same result in pandas, we need to create an object calling the pandas pd.pivot_table function.

Five important arguments to highlight: data is the name of the dataframe that contains the information, index is the field that I want to use as “rows” of my pivot table, values is the numerical field that I will be aggregating, aggfunc is the argument to specify that I want a sum and margins is an argument that calculates the total sum of all rows.

Adding more row labels: Excel’s pivot table also has a “tabular form” you can access in the report layout design options. This enables us to have more than 2 fields to be aggregated as rows, like in the example below. To achieve something like this in pandas, we just need to pass a list of columns in the index argument, instead of only 1 column, like we did before. The output dataframe while have a multiindex with the specified fields, which is super handy when you want to make advanced merges.

Specifying the type of aggregating function: Sometimes you want to obtain two different metrics per row, like for example volume of sales and number of transactions. We typically achieve the latter by performing a count of the rows. Below, I have specified a list of aggregating functions in the aggfunc argument. I could have specified a custom function if I wanted, something that Excel can’t do (for example if I wanted to calculate a very specific function like a log-transform, I could just define the function first and then pass it to the aggfunc argument)

Cross-Tab: Another very useful thing you might have done countless times with Excel Pivot table is to cross-tab one variable against another. The example below has replaced the rows/index by “region” and has added client_type to a new dimension in the columns field. Now I have a table that shows the sum of sales, broken down by client_type for every region in the dataset. In pandas, I have added the columns argument to specify which field (or fields, If I passed a list) I want to put on top of the table. I have also added a fill_value argument to specify that missing values should be replaced by zero.

Filtering views: Another super-useful thing that you can achieve with Excel pivot tables is to filter a view by segmenting the dataset according to a criteria. In the example below, I have filtered the same pivot table we had before, so it shows only the sales corresponding to the month of January 2018. To achieve this in pandas, we just need to use pandas boolean indexing to pass on what subset of the data frame you want to create the pivot table from. You will see that first I created a new data frame called pivot_filter, which already filters the values of date to include only the rows containing January 2018. Then I use that data frame to create the pivot table. Using different selection criteria, you apply multiple filters to your data.

Recap: As you could see, pandas pivot_table delivers all that Excel pivot tables do. And they are capable of much, much more. Once that you understand the key arguments, you will be ready to make the transition; and then, you will be creating those pivot tables as a pro and taking advantage of the powerful things that come after manipulating the information.

This photo summarizes what we have covered so far:

If you want to connect and have a conversation about this or any other topic, please feel free to find me on LinkedIn:

--

--

Juan Felipe Alvarez Jaramillo
Juan Felipe Alvarez Jaramillo

Written by Juan Felipe Alvarez Jaramillo

Data and analytics expert, driven by curiosity and fueled by a hacker’s mentality. MSc Business Analytics from Alliance Manchester Business School.

No responses yet