Introducing Pandas-Sets: Set-oriented Operations in Pandas

I frequently find myself storing standard Python set objects in DataFrame columns. This usually happens when I have some kind of tags or labels column for each observation. It can also be the output of a groupby operation where the end result needs to be a list-like (or set-like) object before it’s aggregated. Set operations (union, intersection, etc.) come in handy in such cases.
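As a rough illustration of the groupby case, here is a minimal sketch in plain pandas (no extension involved). The `rows` frame and its column names are made up for the example.

```python
import pandas as pd

# Hypothetical long-format data: one row per (post, tag) pair
rows = pd.DataFrame({
    "post": [1, 1, 2, 3],
    "tag": ["python", "pandas", "philosophy", "pandas"],
})

# Collapse each post's tags into a single set object
tags_by_post = rows.groupby("post")["tag"].apply(set)
```

The result is a Series indexed by post whose values are ordinary Python sets, which is exactly the kind of column the rest of this post deals with.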

To tackle those scenarios, however, I end up writing code like df.tags.map(lambda x: set(x).add(elem)), which, apart from being ugly, also doesn’t allow for pandas-like immutable-based compositions (aka one-liners).
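To make the problem concrete, here is a small sketch in plain pandas. The sample data is invented; the point is only that set.add mutates in place and returns None, so mapping it over a column does not give back a column of sets.

```python
import pandas as pd

tags = pd.Series([{"python", "pandas"}, {"philosophy"}])

# set.add works in place and returns None, so the mapped result
# is a Series of None values rather than a Series of sets
mutated = tags.map(lambda s: set(s).add("data"))

# A non-mutating workaround: the | operator builds a new set each time
extended = tags.map(lambda s: s | {"data"})
```

The second form composes, but it still hides the intent behind a lambda, which is the ugliness I wanted to get rid of.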

Ideally, I would like to be able to treat the tags column as a set-like one, so I could write code like df.tags.set.add(elem), filter like df[df.tags.set.contains(elem)], or combine like df.tags.set.union({'t1', 't2', 't3'}).

To achieve this, I wrote pandas-sets, a Pandas extension that adds set-like properties to existing Series objects, provided that they already store set objects.

You can check out the code on GitHub.

The pandas_sets package adds a .set accessor to any pandas Series object; it’s like .dt for datetime or .str for string, but for set.

It exposes all public methods available in the standard set.
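For readers curious how such an accessor can be wired up, here is a minimal sketch using pandas' public register_series_accessor hook. This is not the pandas-sets implementation; the accessor name demo_set, the class DemoSetAccessor, and the two methods are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical accessor, registered under a made-up name so it
# cannot clash with the real .set accessor from pandas_sets
@pd.api.extensions.register_series_accessor("demo_set")
class DemoSetAccessor:
    def __init__(self, series):
        self._series = series

    def contains(self, elem):
        # Element-wise membership test, returned as a boolean Series
        return self._series.map(lambda s: elem in s).astype(bool)

    def union(self, other):
        # Element-wise union; builds new sets and leaves the originals alone
        return self._series.map(lambda s: s | set(other))

posts = pd.Series([{"python", "pandas"}, {"philosophy"}, {"pandas"}])
mask = posts.demo_set.contains("pandas")
widened = posts.demo_set.union({"data"})
```

Each method returns a new Series, so calls can be chained or used as a boolean mask, in the same spirit as .str and .dt.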

Using it is pretty simple. First install with pip.

pip install pandas-sets

Then, just import the pandas_sets package and it will register a .set accessor to any Series object.

import pandas_sets
import pandas as pd

df = pd.DataFrame({
    'post': [1, 2, 3, 4],
    'tags': [{'python', 'pandas'}, {'philosophy', 'strategy'},
             {'scikit-learn'}, {'pandas'}],
})

# Keep only the posts whose tag set contains 'pandas'
pandas_posts = df[df.tags.set.contains('pandas')]

pandas_posts.tags.set.add('data')

pandas_posts.tags.set.update({'data', 'analysis'})

The implementation is very primitive for now and draws heavily from pandas’ core StringMethods implementation.

Next steps include: further testing with edge-case scenarios, adding detailed docstrings and more fine-grained NA handling.

Some day it may be incorporated into pandas core itself.
