Pandas dataframes are a great way of preserving relational database structure while gaining a huge API of vectorized functions for cleaning and analysis. However, I’ve always been frustrated by the lack of merge-between functionality in pandas. So, I’ve come up with a fairly simple method for returning a merge-between-like result set with good runtime and manageable memory usage that’s also easy to parallelize.
Using SQL, I can easily perform a left join between two datasets on a between condition that associates each left table key with all matching data from the right table. For example:
SELECT a.*, b.*
FROM a
LEFT JOIN b
  ON b.timestamp >= a.start_time
  AND b.timestamp <= a.end_time
will return all the rows in b that have timestamps between the start and end time for each row in a. Admittedly, it can be very slow, but it’s a useful process. As of pandas 0.20, there’s no simple way to perform this operation.
Pandas 0.20 does have merge_asof, which matches each row in a to the nearest entry in b in a specified direction. But for many data analysis applications, I don’t want the closest entry, I want all the entries in between. For example, I might want to take the mean over a time period, or do a more sophisticated aggregation. Or, maybe my timestamps aren’t perfectly aligned, so I’d rather pull in a range of data for each entry in a and perform an estimation from the values in b. Either way, merge_asof doesn’t cut it.
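To make the gap concrete, here is a minimal sketch on a pair of toy frames (hypothetical names, not the ad data used below) of what merge_asof does return: exactly one nearest row from the right table per left row, when what I want is every row inside the window.

import pandas as pd

# two toy frames; both must be sorted on their keys for merge_asof
left = pd.DataFrame({'start_time': pd.to_datetime(['2017-01-01 09:00:00',
                                                   '2017-01-01 09:05:00'])})
right = pd.DataFrame({'timestamp': pd.to_datetime(['2017-01-01 09:00:10',
                                                   '2017-01-01 09:00:20',
                                                   '2017-01-01 09:05:30']),
                      'value': [1, 2, 3]})

# one matched row of `right` per row of `left`, even though two rows in `right`
# fall shortly after the 09:00:00 start; merge_asof can only return one of them
print(pd.merge_asof(left, right, left_on='start_time', right_on='timestamp',
                    direction='forward'))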
If all the start and end times in a are distinct (no overlapping events), then I can easily interleave the data in a and b by concatenating, sorting on the timestamp, and boolean masking for data that falls between the same start and end, though the runtime isn’t very good. I don’t want to go through the extra data cleaning steps to make my data distinct, and that may not be the right approach anyway. Instead, I’m going to use a little iterating and some fancy indexing to return an efficient merge-between-like result set. The approach is also naturally suited to parallelization, since I can do the operations in distinct chunks (like splitting time series data by date), group by a unique identifier from my left table, and return a condensed result set per chunk. Since each condensed result set is at most len(left table) rows, and Python frees the intermediate objects once a chunk’s function call returns, I can also manage memory usage by intelligent chunking, and process more data than I can load into working memory at any one time.
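As an aside, here is roughly what that shortcut for the non-overlapping case could look like. This is a sketch under my own assumptions: hypothetical frames events (with a timestamp column) and intervals (with non-overlapping start_time/end_time columns and a unique id), none of which are part of the ad example below.

import pandas as pd

def merge_between_sorted(events, intervals):
    # mark real events so they can be separated from interval boundaries later
    events = events.assign(_is_event=True)
    starts = intervals.rename(columns={'start_time': 'timestamp'}).assign(_is_event=False)
    # put boundaries and events on one timeline; a stable sort keeps a boundary
    # row ahead of any event that shares its exact timestamp
    combined = pd.concat([starts, events]).sort_values('timestamp', kind='mergesort')
    # carry each interval's id and end_time forward onto the events that follow it
    combined[['id', 'end_time']] = combined[['id', 'end_time']].ffill()
    # keep only real events that fall on or before the current interval's end
    keep = combined['_is_event'] & (combined['timestamp'] <= combined['end_time'])
    return combined.loc[keep].drop(['_is_event'], axis=1)

The reason this breaks down is that a single forward-fill can only track one open interval at a time, so overlapping events would silently steal each other’s rows; that limitation is what the iterating approach below avoids.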
Let’s create some random data using numpy’s methods for sampling with replacement. In this example, I have a website selling merchandise, and three different ad campaigns promoting my website. The ad campaigns are running on multiple channels, and have different spot lengths. I want to know if during the ad spots, there’s any notable activity on my website.
Let’s say my website sells socks, shirts, pants, backpacks and pencils, and records the time when a user either clicks, adds an item to their cart, or purchases it. This is clearly a bit cleaner than any data you’ll ever find in the wild, but gets the job done.
For my advertisement data, I’ll assume my ads have three different sets of content, and can be 30, 45, or 60 seconds long. Each ad will also have a unique integer id.
import pandas as pd
import numpy as np

PRODUCTS = ['socks', 'shirt', 'pants', 'backpack', 'pencil']
ACTION_LIST = ['click', 'add_to_cart', 'purchase']
HOURS = [9, 10, 11, 12, 13]
AD_LENGTHS = [30, 45, 60]
AD_CONTENT = ['foo', 'bar', 'baz']
You can use pandas functions to create a series of regularly spaced times, but I want to demonstrate the functionality on time series data without any assumptions of regularity.
First, this means that I can use data of arbitrary length and sparsity/density. Many real-world time series data sets are naturally extremely long, with some dense and some sparse periods. Resampling (as much of the pandas documentation recommends) is extremely slow on long data sets. If some of my data is nanoseconds apart, and some of my data is hours apart, resampling to nanoseconds will create an absurdly huge dataframe. Resampling to minutes combines my denser data, and I want to preserve its granularity.
Second, this approach works on time series data after minimal data cleaning. A website might have multiple users performing actions at the same time, so duplicates are valid. I may be running two ads on different channels that overlap with each other — also valid.
The following generates two dataframes, one with my website information, and one with my advertisement information. The website dataframe has 10,000 rows, each representing an action on a product, and the ad dataframe has 200 rows, each representing a unique advertisement that ran on 1/1/2017. The website dataframe is sorted by timestamp, and has a timestamp index.
# create a dataframe of random actions at random times
website_actions = pd.DataFrame(data={'product': np.random.choice(PRODUCTS, 10000),
                                     'action': np.random.choice(ACTION_LIST, 10000),
                                     'hour': np.random.choice(HOURS, 10000),
                                     'minute': np.random.choice(60, 10000),
                                     'second': np.random.choice(60, 10000)})
# all on the same date
website_actions['year'] = 2017
website_actions['month'] = 1
website_actions['day'] = 1
# coalesce all datetime columns into one timestamp
website_actions['timestamp'] = pd.to_datetime(
    website_actions[['year', 'month', 'day', 'hour', 'minute', 'second']])
# clean up the df and create a time index
website_actions.drop(['year', 'month', 'day', 'hour', 'minute', 'second'],
                     axis=1, inplace=True)
website_actions = website_actions.sort_values('timestamp').set_index('timestamp')

# do the same thing for ad spots
advertisements = pd.DataFrame(data={'content': np.random.choice(AD_CONTENT, 200),
                                    'hour': np.random.choice(HOURS, 200),
                                    'minute': np.random.choice(60, 200),
                                    'second': np.random.choice(60, 200),
                                    'ad_length': np.random.choice(AD_LENGTHS, 200),
                                    'id': range(200)})
advertisements['year'] = 2017
advertisements['month'] = 1
advertisements['day'] = 1
advertisements['start_time'] = pd.to_datetime(
    advertisements[['year', 'month', 'day', 'hour', 'minute', 'second']])
advertisements['end_time'] = advertisements['start_time'] + pd.to_timedelta(
    advertisements['ad_length'], unit='s')
advertisements.drop(['year', 'month', 'day', 'hour', 'minute', 'second'],
                    axis=1, inplace=True)
I try to never iterate over dataframes (or use apply), but in this case, it’s necessary, since a given website row can correspond to arbitrarily many ads and vice versa. For each ad, we can index into the website dataframe between the start and end times to find the actions during the ad. If we label each slice with the unique ad id, we can simply concatenate all the slices together and return all the actions during each ad, labeled by the ad id.
ad_actions = []
# pass over the dataframe. itertuples is the most efficient method, and we don't need an index
for tup in advertisements.itertuples(index=False):
    # use the advertisement start and end times to index into the website df
    # (copy so we can label the slice without a SettingWithCopyWarning)
    site_slice = website_actions[tup.start_time:tup.end_time].copy()
    # label the slice
    site_slice['id'] = tup.id
    # add it to the list of slices
    ad_actions.append(site_slice)
# when we're done iterating, concatenate all the slices into our final dataframe
ad_actions = pd.concat(ad_actions)
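Because each ad's slice is computed independently, this loop also parallelizes naturally. Below is a hedged sketch of one way to do it with multiprocessing; the chunk count, the use of np.array_split, and the assumption that website_actions is visible inside the worker processes (true on fork-based platforms) are my choices, not part of the recipe above.

from multiprocessing import Pool

def actions_during_ads(ad_chunk):
    # same loop as above, applied to one chunk of the advertisement dataframe
    slices = []
    for tup in ad_chunk.itertuples(index=False):
        site_slice = website_actions[tup.start_time:tup.end_time].copy()
        site_slice['id'] = tup.id
        slices.append(site_slice)
    # return one condensed frame per chunk so memory use stays bounded
    return pd.concat(slices) if slices else pd.DataFrame()

if __name__ == '__main__':
    with Pool(4) as pool:
        chunks = np.array_split(advertisements, 4)
        ad_actions = pd.concat(pool.map(actions_during_ads, chunks))

For data that doesn't fit in memory, the worker function could instead load just its own chunk (say, one date) from disk, aggregate it down, and return only the condensed result.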
As mentioned above, an advantage of this method is that you can define your own split-apply-combine functions to use on the chunks of data. Since we preserved the unique ad id for each slice of website data, we can group on the id and easily merge our final results back into the advertisement dataframe to do further aggregations.
One simple analysis task might be to determine which ad content generated the most revenue. Let’s say we make $1 for every pair of socks we sell, $2.50 for every shirt, $3.25 for every pair of pants, and $2.75 for every backpack, and we lose $.05 for every pencil we sell. One way of looking at the effectiveness of an ad is comparing what we could have sold to what we did sell. We could have sold anything that appears in our website dataframe, since that represents items people clicked on, items people added to their carts (but didn’t check out), and final sales. We sold anything that has action = ‘purchase’.
# define product revenues
revenue_by_prod = {'socks': 1, 'shirt': 2.5, 'pants': 3.25, 'backpack': 2.75, 'pencil': -.05}
# every row has potential revenue
ad_actions['potential_revenue'] = ad_actions['product'].replace(revenue_by_prod)
# but it only has real revenue if it was purchased
ad_actions['real_revenue'] = np.where(ad_actions['action'] == 'purchase',
                                      ad_actions['potential_revenue'], 0)
# compute the real and potential revenue by ad
revenue_by_ad = ad_actions.groupby('id')[['potential_revenue', 'real_revenue']].sum()
# missed is potential - real
revenue_by_ad['missed_revenue'] = revenue_by_ad['potential_revenue'] - revenue_by_ad['real_revenue']
# print the worst 5 ads by missed revenue
print(revenue_by_ad.sort_values('missed_revenue', ascending=False).head())
# merge the revenue back into the advertisement df on the id column
advertisements = pd.merge(advertisements, revenue_by_ad, left_on=['id'], right_index=True)
# now aggregate by content
revenue_by_content = advertisements.groupby('content')[['missed_revenue']].sum()
print(revenue_by_content[['missed_revenue']])
Since the values are randomly generated, every run will produce a different result.
I see:
content  missed_revenue
'bar'           2311.05
'baz'           2069.90
'foo'           2005.45
Not the most sophisticated analysis! For example, I haven’t taken into account the way content may have overlapped during ad time. I also don’t know if ‘foo’ content missed the least revenue because people declined to buy the most pencils. I can add a little more granularity to my results by counting the pencils per ad:
pencils_by_ad = (ad_actions.groupby(['product', 'id']).size()
                 .loc['pencil'].to_frame().rename(columns={0: 'num_pencils'}))
Or using a bit of fancy indexing, the unbought pencils per ad:
unbought_pencils_by_ad = (ad_actions.groupby(['product', 'action', 'id']).size()
                          .loc['pencil'].loc[['click', 'add_to_cart']]
                          .reset_index().groupby('id')[[0]].sum()
                          .rename(columns={0: 'num_unbought_pencils'}))
And then seeing which content has the most unbought pencils:
tot_pencils = pencils_by_ad.join(unbought_pencils_by_ad)
advertisements = pd.merge(advertisements, tot_pencils, left_on='id', right_index=True)
print(advertisements.groupby('content')[['num_pencils', 'num_unbought_pencils']].sum())
Which gives me:
content  num_pencils  num_unbought_pencils
'bar'            383                 222.0
'baz'            310                 206.0
'foo'            339                 204.0
It doesn’t look like any set of content is correlated with whether or not site visitors look at or purchase pencils (which makes sense, since numpy’s sampling functions draw uniformly by default, though you can supply your own probability weights).