With groupby, you can group data in a DataFrame and apply calculations on those groups in various ways.

This Cheatbook (Cheatsheet + Notebook) introduces you to the core functionality of pandas’ groupby function. Here you can find the executable Jupyter Notebook version to play around with directly!

Example Scenario

This is an excerpt of a file list from a directory with the following information as separate columns / Series:

  • file: The name of the file
  • dir: The name of the directory the file lives in
  • bytes: The size of the file in bytes

This data is stored in a pandas DataFrame named df.

In [1]:
import pandas as pd
df = pd.DataFrame({
    "file" : ['tutorial.md', 'hello.java', 'controller.java', "build.sh", "deploy.sh"],
    "dir" : ["docs", "src", "src", "src", "src"],
    "bytes" : [124, 54, 36, 78, 62]
    })
df
Out[1]:
              file   dir  bytes
0      tutorial.md  docs    124
1       hello.java   src     54
2  controller.java   src     36
3         build.sh   src     78
4        deploy.sh   src     62

When to use it

groupby is a great way to summarize data and build a higher-level view of it (e.g., to go from code level to module level). In our scenario, for example, we could count the number of files per directory. Let’s take a look at this use case step by step.
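By the way, if all you need is the plain file count per directory, the value_counts function on the dir Series gives you that number directly (just a quick sketch based on the df from above; the groupby approach is developed step by step below):

# count how often each directory name occurs in the dir Series
df['dir'].value_counts()  # src occurs 4 times, docs once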

Basic Principles

You can call the groupby function on our DataFrame df. As a parameter, you pass the name (or a list of names) of the Series you want to group by. In our case, we want to group by the directories, i.e., the Series dir.
In [2]:
df.groupby('dir')
Out[2]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f118630f6d0>

This gives you a GroupBy object. We can take a look at the built groups by inspecting its groups attribute.

In [3]:
df.groupby('dir').groups
Out[3]:
{'docs': Int64Index([0], dtype='int64'),
 'src': Int64Index([1, 2, 3, 4], dtype='int64')}

The groups attribute shows you the groups and their members, identified by their index values.
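If you want to see the actual rows of a single group rather than just its index values, you can pull them out with get_group (a small sketch using the df from In [1]):

# return all rows of the 'src' group as a regular DataFrame
df.groupby('dir').get_group('src')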

Aggregating Values

Now we have built some groups, but what next? The next step is to decide what we want to do with the values that belong to a group. This means we need to tell the GroupBy object how we want to aggregate the values. We can apply a multitude of aggregating functions here, e.g.,

  • count: count the number of entries of each group
  • first: take the first entry of each group
  • max: take the entry with the highest value

We can also apply dedicated functions to each group, e.g.,

  • agg: apply a variety of aggregating functions on the groups (e.g., building the sum as well as counting the values at once)
  • apply: apply a custom function on each group to execute calculations as you like
  • transform: calculate a summarizing value for each group (e.g., the sum of all entries) and assign it to every row of that group

But step by step!
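Before going through the simpler aggregations in detail, here is a quick taste of what agg and apply can look like (only a sketch, using the df from In [1]; which functions you pass in is up to you):

# agg: apply several aggregating functions to the bytes Series at once
df.groupby('dir')['bytes'].agg(['count', 'sum'])

# apply: run a custom function on each group,
# e.g., the share of the largest file within its directory
df.groupby('dir')['bytes'].apply(lambda group_bytes: group_bytes.max() / group_bytes.sum())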

Counting values

We group by the directories and use count on the built groups.

In [4]:
df.groupby('dir').count()
Out[4]:
      file  bytes
dir
docs     1      1
src      4      4

This gives us the number of entries in the file and bytes Series for each directory.

Summing up

We group by the directories and use sum on the built groups.

In [5]:
df.groupby('dir').sum()
Out[5]:
      bytes
dir
docs    124
src     230

This gives us the total number of bytes of all files that reside in each directory. Note that the file Series is gone because it doesn’t contain any values we could sum up, so it was dropped.
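Depending on your pandas version, summing a non-numeric Series may produce a warning or an error instead of silently dropping it. If you want to be explicit (and independent of the version), you can select the Series you want to aggregate beforehand; a minimal sketch:

# restrict the aggregation to the bytes Series before summing
df.groupby('dir')['bytes'].sum()

# or keep the DataFrame but sum only numeric columns
df.groupby('dir').sum(numeric_only=True)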

More Advanced Use Cases

Let’s dig deeper into our example scenario. We want to find out which kinds of files occupy how much space in which directory. For this, we extract the files’ extensions from the file Series. We use the string split function to split at the . sign and keep just the last piece of the split file name (which is the file’s extension).

In [6]:
df['ext'] = df["file"].str.split(".").str[-1]
df
Out[6]:
              file   dir  bytes   ext
0      tutorial.md  docs    124    md
1       hello.java   src     54  java
2  controller.java   src     36  java
3         build.sh   src     78    sh
4        deploy.sh   src     62    sh

We can then group this data in a more sophisticated way by using two Series for our groups. We sum up the numeric values (i.e., the bytes) for each group.

In [7]:
dir_ext_bytes = df.groupby(['dir', 'ext']).sum()
dir_ext_bytes
Out[7]:
           bytes
dir  ext
docs md      124
src  java     90
     sh      140

Last, we want to calculate the ratio of the files’ bytes for each extension. We first calculate the overall size of each directory by using transform. The transform function doesn’t reduce a group to a single row. Instead, it computes one result per group and assigns it to every row of that group.

In [8]:
bytes_per_dir = dir_ext_bytes.groupby('dir').transform('sum')
bytes_per_dir
Out[8]:
           bytes
dir  ext
docs md      124
src  java    230
     sh      230

In our case, we summed up the bytes of all files per directory and got this total for every extension row. We can add this new information to our existing DataFrame.

In [9]:
dir_ext_bytes['all'] = bytes_per_dir
dir_ext_bytes
Out[9]:
           bytes  all
dir  ext
docs md      124  124
src  java     90  230
     sh      140  230

Now we are able to calculate the ratio.

In [10]:
dir_ext_bytes['ratio'] = dir_ext_bytes['bytes'] / dir_ext_bytes['all']
dir_ext_bytes
Out[10]:
           bytes  all     ratio
dir  ext
docs md      124  124  1.000000
src  java     90  230  0.391304
     sh      140  230  0.608696

The result gives us the ratios of file sizes per file type for each directory.

Summary

The groupby function is an excellent way to summarize data. It will create a higher-level view of your fine-grained raw data.
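For reference, the whole ratio calculation from above can also be written as one short chain (a sketch under the same assumptions as the notebook cells; it yields the same ratio values):

# sum the bytes per directory and extension ...
dir_ext = df.groupby(['dir', 'ext'])['bytes'].sum()
# ... and divide by the per-directory totals broadcast back via transform
dir_ext / dir_ext.groupby('dir').transform('sum')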
