With groupby, you can group data in a DataFrame and apply calculations on those groups in various ways.

This Cheatbook (Cheatsheet + Notebook) introduces you to the core functionality of pandas’ groupby function. You can find the executable Jupyter Notebook version here to play around with it directly!

Example Scenario

This is an excerpt of a file list from a directory with the following information as separate columns / Series:

  • file: The name of the file
  • dir: The name of the directory where the file lives
  • bytes: The size of the file in bytes

This data is stored in a pandas DataFrame named df.

In [1]:
import pandas as pd
df = pd.DataFrame({
    "file": ["tutorial.md", "hello.java", "controller.java", "build.sh", "deploy.sh"],
    "dir": ["docs", "src", "src", "src", "src"],
    "bytes": [124, 54, 36, 78, 62]
})
df
Out[1]:
file dir bytes
0 tutorial.md docs 124
1 hello.java src 54
2 controller.java src 36
3 build.sh src 78
4 deploy.sh src 62

When to use it

groupby is a great way to summarize data and build a higher-level view of it (e.g., to go from code level to module level). In our scenario, for example, we could count the number of files per directory. Let’s take a look at this use case step by step.

Basic Principles

You can use the groupby function on our DataFrame df. As a parameter, you pass the name (or a list of names) of the Series you want to group by. In our case, we want to group by the directories, i.e., the Series dir.
In [2]:
df.groupby('dir')
Out[2]:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f118630f6d0>

This gives you a GroupBy object. We can take a look at the groups that were built by inspecting the groups attribute of the GroupBy object.

In [3]:
df.groupby('dir').groups
Out[3]:
{'docs': Int64Index([0], dtype='int64'),
 'src': Int64Index([1, 2, 3, 4], dtype='int64')}

The groups attribute shows you the groups and their members, referenced by their indexes.
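If you want to look at the rows of a single group, the GroupBy object’s get_group function returns that group’s slice of the DataFrame. A small sketch using the df from above (the variable name src_files is just for illustration):

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    "file": ["tutorial.md", "hello.java", "controller.java", "build.sh", "deploy.sh"],
    "dir": ["docs", "src", "src", "src", "src"],
    "bytes": [124, 54, 36, 78, 62]
})

# get_group returns all rows of one group as a regular DataFrame
src_files = df.groupby("dir").get_group("src")
```

This is handy for spot-checking what actually ended up in a group before you aggregate anything.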

Aggregating Values

Now we have built some groups, but what comes next? The next step is to decide what we want to do with the values that belong to each group. This means we need to tell the GroupBy object how to aggregate these values. We can apply a multitude of aggregating functions here, e.g.

  • count: count the number of entries of each group
  • first: take the first entry of each group
  • max: take the entry with the highest value

We can also apply dedicated functions to each group, e.g., with

  • agg: apply a variety of aggregating functions on the groups (e.g., building the sum as well as counting the values at once)
  • apply: apply a custom function on each group to execute calculations as you like
  • transform: calculate summarizing values for each group (e.g., the sum of all entries for each group)
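To give you a first idea of agg and apply before we go through the details, here is a small sketch using the df from above (the variable names stats and byte_range are just for illustration):

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    "file": ["tutorial.md", "hello.java", "controller.java", "build.sh", "deploy.sh"],
    "dir": ["docs", "src", "src", "src", "src"],
    "bytes": [124, 54, 36, 78, 62]
})

# agg: run several aggregating functions on each group's bytes at once
stats = df.groupby("dir")["bytes"].agg(["sum", "count"])

# apply: run a custom function on each group (here: the size range per directory)
byte_range = df.groupby("dir")["bytes"].apply(lambda s: s.max() - s.min())
```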

But step by step!

Counting values

We group by the directories and use count on the built groups.

In [4]:
df.groupby('dir').count()
Out[4]:
file bytes
dir
docs 1 1
src 4 4

This gives us the number of non-null entries in the file and bytes Series for each directory.
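A close relative of count is size, which counts the rows of each group as a whole instead of the non-null values per Series. A small sketch, assuming the df from above:

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    "file": ["tutorial.md", "hello.java", "controller.java", "build.sh", "deploy.sh"],
    "dir": ["docs", "src", "src", "src", "src"],
    "bytes": [124, 54, 36, 78, 62]
})

# size counts all rows of each group, regardless of null values in any Series
files_per_dir = df.groupby("dir").size()
```

With no missing values, count and size agree; with missing values, size is the one that tells you how many files there actually are per directory.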

Summing up

We group by the directories and use sum on the built groups.

In [5]:
df.groupby('dir').sum()
Out[5]:
bytes
dir
docs 124
src 230

This gives us the total size in bytes of all files that reside in each directory. Note that the file Series is gone: it doesn’t contain any values we could sum up, so it was dropped.
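If you are only interested in one Series anyway, you can also select it right after grouping. This keeps the result focused and sidesteps the question of what sum should do with non-numeric Series (the exact behavior differs between pandas versions). A small sketch using the df from above:

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    "file": ["tutorial.md", "hello.java", "controller.java", "build.sh", "deploy.sh"],
    "dir": ["docs", "src", "src", "src", "src"],
    "bytes": [124, 54, 36, 78, 62]
})

# Selecting the 'bytes' Series first restricts the sum to just that Series
bytes_per_dir = df.groupby("dir")["bytes"].sum()
```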

More Advanced Use Cases

Let’s dig deeper into our example scenario. We want to find out which kinds of files occupy how much space in which directory. For this, we extract the files’ extensions from the file Series. We use the string split function to split on the . character and keep just the last piece of the split file name (which is the file’s extension).

In [6]:
df['ext'] = df["file"].str.split(".").str[-1]
df
Out[6]:
file dir bytes ext
0 tutorial.md docs 124 md
1 hello.java src 54 java
2 controller.java src 36 java
3 build.sh src 78 sh
4 deploy.sh src 62 sh

We can then group this data in a more sophisticated way by using two Series for our groups. We sum up the numeric values (i.e., the bytes) for each group.

In [7]:
dir_ext_bytes = df.groupby(['dir', 'ext']).sum()
dir_ext_bytes
Out[7]:
bytes
dir ext
docs md 124
src java 90
sh 140

Last, we want to calculate the ratio of the files’ bytes for each extension. We first calculate the overall size of each directory by using transform. Unlike an aggregation, transform doesn’t collapse each group into a single value. Instead, it computes a result per group and assigns it to every entry of that group, so the output keeps the shape of the input.

In [8]:
bytes_per_dir = dir_ext_bytes.groupby('dir').transform('sum')
bytes_per_dir
Out[8]:
bytes
dir ext
docs md 124
src java 230
sh 230

In our case, we summed up the bytes of all file extensions per directory. We can add this new information to our existing DataFrame.

In [9]:
dir_ext_bytes['all'] = bytes_per_dir
dir_ext_bytes
Out[9]:
bytes all
dir ext
docs md 124 124
src java 90 230
sh 140 230

Now we are able to calculate the ratio.

In [10]:
dir_ext_bytes['ratio'] = dir_ext_bytes['bytes'] / dir_ext_bytes['all']
dir_ext_bytes
Out[10]:
bytes all ratio
dir ext
docs md 124 124 1.000000
src java 90 230 0.391304
sh 140 230 0.608696

The result gives us the ratios of file sizes per file type for each directory.
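As a side note, the same ratios can also be computed without the intermediate all column by dividing the grouped Series by its transformed sum directly. This is just an alternative sketch, not the notebook’s approach:

```python
import pandas as pd

# Same example data as above, including the extracted extensions
df = pd.DataFrame({
    "file": ["tutorial.md", "hello.java", "controller.java", "build.sh", "deploy.sh"],
    "dir": ["docs", "src", "src", "src", "src"],
    "bytes": [124, 54, 36, 78, 62]
})
df["ext"] = df["file"].str.split(".").str[-1]

# bytes per (dir, ext) combination
dir_ext_bytes = df.groupby(["dir", "ext"])["bytes"].sum()

# divide each combination's bytes by its directory's total
ratio = dir_ext_bytes / dir_ext_bytes.groupby(level="dir").transform("sum")
```

Whether you prefer this one-liner or the explicit intermediate columns is a matter of taste; the intermediate columns make each step inspectable.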

Summary

The groupby function is an excellent way to summarize data. It will create a higher-level view of your fine-grained raw data.
