I have been playing around with the
pivot_table function in the Python package,
pandas. The reason for this is because I was using
groupby and wanted to see the difference in functionality and how it could impact my work.
pivot_table function is a lot like PivotTables that you may have used or seen in Excel where you can reshape data and quickly apply indexes and columns to perform analysis. One thing to note before starting is that
pivot_table is one Python function that is constructed of arguments and lists.
pivot_table is just one way to do analysis on data which I find to be the fun part in learning.
In my example, I explored
pivot_table through a fictitious data set on gift officer data. This sample only scrapes the surface of this function but I hope you enjoy!
Questions that directed my experimentation with
- How many gift officers are in this sample?
- How can I see who is assigned to each manager?
- How do I aggregate information for each manager?
- How do I filter on specific information, like accepted proposals?
As I started up my project, I loaded in my
DataFrame and reviewed all the data.
After looking at my data, I wanted to see how many managers were actually in the file. To start looking at manager information, I identified my index to be Manager data. Note that in this section, Ask columns look strange and do not total the correct amounts. Also note that the table is created referencing the
DataFrame (df), and the function is under the pandas namespace and so begins with
pd., rather than being a method of a
df.groupby). Due to muscle memory, I make the mistake of beginning to type
My next step was to then look at the ID information to see what IDs were linked to specific managers. This helps me see that record 4413 has worked with three of the four managers in the sample. The actual values in the data set are incorrect because the default of
pivot_table is to report the mean. The returned value can be specified to return sums and lengths.
My next step was to add in Columns, or the section that data is compared. I specified the column to be Status while keeping the index the same. As you can see, this yielded an error because Status does not contain numeric data. The important thing here to remember is that errors are normal, especially when you are practicing new skills or constructing your argument differently. The great thing about errors is that you can learn why they occur and how to modify your code.
Still wanting to check out the Status and learning that only numeric data can be applied to the columns, I rearranged my variables and added Ask_Amount, Actual_Ask and Accepted_Ask. Note here that in the output, the columns are arranged alphabetically.
After looking at the last table, I determined that I did not care to see the IDs anymore in the data. That was because I felt that I did not need to know information on a record by record level, but rather in aggregate. To show the sum for each manager I applied the argument
aggfunc and set the value to return the sum. What a world a a difference that makes, huh?
My final thought in working through this data set was about filtering. What if I wanted to filter on a specific value such as accepted ask amounts only? My first step was define a new set of data. I did that by making a variable called Officer_table. You will see that all the code after the equals sign is the
pivot_table function previously used before. Then I used
.query and specified that I only want look ask information if the Status was Accepted. To do that I used the operator
==, make sure you check out other Boolean operations and comparisons if you want to get into specifying data in your work.
Hope you enjoyed this quick overview of
pivot_table! You can learn more about them in the pandas documentation.
Source: montko.com, December 15, 2016
To learn more about analytics at Bentz Whaley Flessner, check out BWF Insight, https://www.bwfinsight.net/