Introducing Pandas-Sets: Set-oriented Operations in Pandas

I frequently find myself storing standard Python set objects in DataFrame columns. This usually happens when I have some kind of tags or labels column for each observation. It can also be the output of a groupby operation where the end result needs to be a list-like (or set-like) object before it’s aggregated. Using set operations (union, intersection, etc.) can come in handy in such cases.

To tackle those scenarios, however, I end up writing code like df.tags.map(lambda x: x | {elem}), which, apart from being ugly, doesn’t allow for pandas-like, immutable-style compositions (aka one-liners).
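To make the pain point concrete, here is a minimal sketch of the plain-pandas workaround. The small DataFrame and the elem value are made up for illustration. Note that set.add mutates in place and returns None, so mapping it over a column produces a column of None rather than the updated sets.

```python
import pandas as pd

# Illustrative data only; not taken from a real dataset
df = pd.DataFrame({
    "post": [1, 2],
    "tags": [{"python", "pandas"}, {"strategy"}],
})
elem = "data"

# set.add returns None, so this yields a Series of None values
broken = df.tags.map(lambda x: set(x).add(elem))

# Building a new set with the union operator returns the sets themselves
with_elem = df.tags.map(lambda x: x | {elem})

# Filtering on membership needs yet another lambda
has_pandas = df[df.tags.map(lambda x: "pandas" in x)]
```

Every operation needs its own lambda, which is the boilerplate a set-aware accessor would remove.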

Ideally, I would like to be able to treat the tags column as a set-like one, so I could write code like df.tags.set.add(elem), filter like df[df.tags.set.contains(elem)], or combine like df.tags.set.union({'t1', 't2', 't3'}).

To achieve this, I wrote pandas-sets, a Pandas extension that adds set-like properties to existing Series objects, provided that they already store set objects.

You can check out the code on GitHub.

The pandas_sets package adds a .set accessor to any pandas Series object; it’s like .dt for datetime or .str for string, but for set.

It exposes all public methods available in the standard set.
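For readers curious about how such an accessor can be wired up, here is a minimal sketch using pandas’ public register_series_accessor hook. This is not the pandas_sets implementation: the accessor name toyset, the class ToySetAccessor, and the two methods shown are hypothetical and chosen only for illustration.

```python
import pandas as pd


# "toyset" is a made-up name so this sketch cannot clash with the real .set accessor
@pd.api.extensions.register_series_accessor("toyset")
class ToySetAccessor:
    def __init__(self, series):
        # pandas passes in the Series the accessor was called on
        self._series = series

    def contains(self, elem):
        # Boolean Series: True where the row's set holds elem
        return self._series.map(lambda s: elem in s).astype(bool)

    def union(self, other):
        # Series of new sets; the stored sets are left untouched
        return self._series.map(lambda s: s | set(other))


tags = pd.Series([{"python", "pandas"}, {"strategy"}, {"pandas"}])
mask = tags.toyset.contains("pandas")
merged = tags.toyset.union({"data"})
```

Because each method returns a new Series, calls can be chained or used directly as a boolean mask, in the same spirit as .str and .dt.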

Using it is pretty simple. First, install it with pip.

pip install pandas-sets

Then, just import the pandas_sets package and it will register a .set accessor on every Series object.

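import pandas as pd
import pandas_sets  # importing the package registers the .set accessor

df = pd.DataFrame({
    'post': [1, 2, 3, 4],
    'tags': [{'python', 'pandas'}, {'philosophy', 'strategy'},
             {'scikit-learn'}, {'pandas'}],
})

# Keep only the posts tagged 'pandas'
pandas_posts = df[df.tags.set.contains('pandas')]

# Add a single tag to each post's set
pandas_posts.tags.set.add('data')

# Add several tags at once
pandas_posts.tags.set.update({'data', 'analysis'})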

The implementation is very primitive for now and draws heavily from pandas’ core StringMethods implementation.

Next steps include further testing with edge-case scenarios, adding detailed docstrings, and more fine-grained NA handling.
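As an illustration of what NA handling could mean here, the sketch below uses plain pandas (not pandas_sets) to apply a set operation while skipping missing rows. It relies on the na_action='ignore' option of Series.map; the sample data is made up, and how pandas_sets itself treats missing values is not something this sketch claims to show.

```python
import pandas as pd

# One row has no tags at all
tags = pd.Series([{"a"}, None, {"b"}])

# na_action='ignore' leaves missing rows alone instead of passing them to the function
merged = tags.map(lambda s: s | {"x"}, na_action="ignore")

# Missing rows stay missing; the other rows get the extra tag
still_missing = merged.isna()
```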

Some day it may be incorporated into pandas core itself.