Python: itertools.groupby() and pd.DataFrame.groupby() are different!

Yao Yao on February 11, 2020

Looks like I’ve been wrong for years… maybe because I am too familiar with the SQL-style groupby operation.

import pandas as pd

# A dataframe with only one column `grade`
df = pd.DataFrame(data={'grade': list('AAAABBBCCDAABBB')})

# Every accumulated `g` has the same dataframe structure with `df`
for k, g in df.groupby(by='grade'):
    print(k, ":", list(g.grade))

# Output:
    # A : ['A', 'A', 'A', 'A', 'A', 'A']
    # B : ['B', 'B', 'B', 'B', 'B', 'B']
    # C : ['C', 'C']
    # D : ['D']

Works as expected.

from itertools import groupby

# Every accumulated `g` is a generator
for k, g in groupby('AAAABBBCCDAABBB'):
    print(k, ":", list(g))

# Output:
    # A : ['A', 'A', 'A', 'A']
    # B : ['B', 'B', 'B']
    # C : ['C', 'C']
    # D : ['D']
    # A : ['A', 'A']
    # B : ['B', 'B', 'B']

Yes, 2 groups with key 'A' and 2 groups with key 'B'. They won’t combine!

Simply put:

  • pd.DataFrame.groupby() groups data globally
    • since it can see the whole data
  • itertools.groupby() groups data locally
    • since it has to iterate through data entries one by one.
    • Previous groups are not referred to when processing latter ones.

blog comments powered by Disqus