Python for Business: Identifying Duplicate Data

Data Preparation is one of those critical tasks that most digital analysts take for granted, since many of the analytics platforms we use take care of it for us, or at least we like to believe they do. That said, Data Preparation is a task every good analyst should complete as part of any data investigation.

Wes McKinney, author of Python for Data Analysis, defines Data Preparation as “cleaning, munging, combining, normalizing, reshaping, slicing, dicing, and transforming data for analysis.”

In this post, I am going to walk you through a real-world example, focused on Data Preparation, of how Python can be a very powerful tool for business-focused data analysis.


The Problem

We have been asked to analyze the company’s sales performance for 2015 and have been provided a 300MB Excel Workbook containing millions of rows of sales transactions. We imported the Workbook into Python and, while completing an initial round of Data Preparation, flagged several sales transactions that appeared to be duplicates.

Before we move on, we are going to investigate this duplicate issue further to understand how widespread it is and the potential impact it could be having on key metrics such as items sold and total company revenue.


The Dataset

Our dataset contains every order transaction for 2015. The data is structured in such a way that each item purchased, in an order, is a unique row in the data. If an order contained three unique product SKUs, that one order would have three rows in the dataset.

Sales Data Example

order_id: A unique id that serves as the key to group line items into a single order
order_item_cd: Product SKU
order_item_quantity: The number of units of each SKU purchased in an order
order_item_cost_price: The unit price of the SKU purchased

An order that contains duplicated order line items would have a SKU that appears in more than one row of data, for a given order_id, as shown below.

Sales Data Example with Duplicates
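To make the structure concrete, here is a tiny hypothetical DataFrame in the same shape — the column names match the field list above, but the values are invented for illustration. SKU 1001 appears in two rows of order 100, which is exactly the kind of duplicated line item we are hunting for:

```python
import pandas as pd

# Hypothetical miniature version of the sales data described above;
# SKU 1001 appears in two rows of order 100, i.e. a duplicated line item
sample = pd.DataFrame({
    'order_id':              [100, 100, 100, 101],
    'order_item_cd':         [1001, 1001, 1002, 1001],
    'order_item_quantity':   [1, 1, 2, 3],
    'order_item_cost_price': [9.99, 9.99, 4.50, 9.99],
})
print(sample)
```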


The Solution using Python

As with most (all) analysis work I do in Python, I make use of pandas, so we will begin by importing the pandas library. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for Python.

In [1]:
# We will use data structures and data analysis tools provided in the pandas library
import pandas as pd

Next, we will use pandas to import all of the sales data from our Excel Workbook into a pandas DataFrame. The pandas DataFrame, along with Series, is one of the most important data structures you will use as a data analyst. Creating a DataFrame is one of the first things I typically do after launching Python.

A DataFrame is tabular in nature and has a “spreadsheet like” data structure containing an ordered collection of columns and rows.

In [2]:
# Import retail sales data from an Excel Workbook into a data frame
path = '/Documents/analysis/python/examples/2015sales.xlsx'
xlsx = pd.ExcelFile(path)
df = pd.read_excel(xlsx, 'Sheet1')

We have our data in a DataFrame, which resembles the structure shown in the example sales data above. Now we need a way to identify all of the duplicate rows in our dataset. To accomplish this, I’m going to add a new column to my DataFrame, named is_duplicated, that will hold a boolean value identifying whether the row is a duplicate. To populate the boolean value for each row in the dataset, we are going to use a very powerful method called ‘duplicated’, to which we will feed the columns we want to evaluate for duplication. In this case, we will mark a row as a duplicate if we find more than one row that has the same order_id and order_item_cd.

In [3]:
# Let's add a new boolean column to our DataFrame that will identify a duplicated order line item (False=Not a duplicate; True=Duplicate)
df['is_duplicated'] = df.duplicated(['order_id', 'order_item_cd'])
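One thing worth knowing here: by default, duplicated uses keep='first', so the first occurrence of each (order_id, order_item_cd) pair is marked False and only the repeats are marked True — which is exactly what we want when counting extra rows. Passing keep=False would instead flag every member of a duplicated group. A small sketch with made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({
    'order_id':      [100, 100, 101],
    'order_item_cd': [1001, 1001, 1001],
})

# Default keep='first': only the second occurrence of (100, 1001) is flagged
print(demo.duplicated(['order_id', 'order_item_cd']).tolist())              # → [False, True, False]

# keep=False flags both members of the duplicated pair
print(demo.duplicated(['order_id', 'order_item_cd'], keep=False).tolist())  # → [True, True, False]
```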

Before we go any further, let’s get a sense of how big of an issue we have here. I’m going to sum up all of the rows that were marked as a duplicate. This will give me the number of duplicate line items in the dataset.

In [4]:
# We can sum on a boolean column to get a count of duplicate order line items
df['is_duplicated'].sum()
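This boolean-sum shortcut works because pandas treats True as 1 and False as 0 in arithmetic, so summing the column counts the True values. A minimal sketch:

```python
import pandas as pd

flags = pd.Series([True, False, True, True])
# True counts as 1 and False as 0, so the sum is the number of True values
print(flags.sum())  # → 3
```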

Ok, so this isn’t just a few records here and there; our dataset contains 13,372 line items that have been marked as duplicates. Before we do any cleanup, let’s further understand the impact of this dirty data. Let’s find out how many duplicate units are in our dataset. We would expect at least 13,372 units, but customers often purchase more than one unit of a given SKU, so the real number is likely higher.

Like we did in the previous step, let’s sum up the number of items purchased that were marked as being duplicates in our dataset.

In [5]:
# Now let's quantify the impact of duplicate line items in terms of over-counted units sold
df.loc[df['is_duplicated'], 'order_item_quantity'].sum()

Had we not taken a critical look at the quality of our data, we could have very easily jumped ahead in our analysis and summarized last year’s sales data, reporting back to management an extra 63,234 units that we actually didn’t sell last year. This is getting serious.

Let’s see what the impact to revenue is. In order to get the impacted revenue, we will add a new column to our DataFrame which will be the line item total for each row in the dataset. To get this value, we will multiply the quantity purchased by the item cost for each row.

Once we have the total for each line item, we can again sum all of the duplicated line items, this time using our revenue value.

In [6]:
# With pandas, we can use vectorization to multiply two columns and place the result in a new column
# This saves us from having to loop over every row in our sales data
df['line_item_total'] = df.order_item_quantity * df.order_item_cost_price

# We can use our boolean-sum shortcut again to get the sum of revenue from duplicated order line items
df.loc[df['is_duplicated'], 'line_item_total'].sum()

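The element-wise multiply in the cell above is an example of vectorized arithmetic: pandas aligns the two columns and multiplies them value by value, with no explicit Python loop. A tiny self-contained sketch with made-up numbers:

```python
import pandas as pd

lines = pd.DataFrame({'qty': [2, 3], 'price': [5.0, 4.0]})
# Element-wise multiply across whole columns, no loop required
lines['total'] = lines.qty * lines.price
print(lines['total'].tolist())  # → [10.0, 12.0]
```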

Jobs are on the line now: our dataset contains overstated revenue of $4,736,155. Good thing we caught this before any numbers were published.

Let’s clean this data up, so we can move forward with our 2015 sales analysis.

In [7]:
# We are only interested in the non-duplicated order line items, so let's get those now
df_nodup = df.loc[~df['is_duplicated']]
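
For completeness: if you don’t need the intermediate is_duplicated column, pandas’ drop_duplicates gets to the same result in one step, keeping the first occurrence of each (order_id, order_item_cd) pair by default. A small sketch with made-up rows:

```python
import pandas as pd

# Hypothetical miniature dataset: SKU 1001 repeats within order 100
orders = pd.DataFrame({
    'order_id':      [100, 100, 101],
    'order_item_cd': [1001, 1001, 1001],
})

# Keeps the first row of each (order_id, order_item_cd) pair, drops the repeats
deduped = orders.drop_duplicates(['order_id', 'order_item_cd'])
print(len(deduped))  # → 2
```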

It’s good practice to check our work as we go. Let’s do a quick sanity check to make sure we have gotten rid of the problematic duplicate data.

In [8]:
# Let's do a quick sanity check to make sure we now have a dataset with no identified duplicate order line items


I feel so much better that we took the time to properly prepare our data before we jumped to building the pretty visualizations for the board members, don’t you?

Let’s export this cleaned up dataset so that we can do additional analysis of our 2015 sales data without having to remove all the duplicate data again.

In [9]:
# Finally let's save our cleaned up data to a csv file
df_nodup.to_csv('2015sales_nodup.csv', encoding='utf-8')
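
One small gotcha with to_csv: by default it also writes the DataFrame’s index, which comes back as an extra unnamed column when the file is re-imported. Passing index=False avoids that. A quick sketch using a throwaway file name:

```python
import pandas as pd

df_clean = pd.DataFrame({'order_id': [100, 101], 'order_item_cd': [1001, 1002]})

# index=False keeps the row index out of the file
df_clean.to_csv('sales_nodup_sample.csv', encoding='utf-8', index=False)

# Re-importing shows only the original columns
print(pd.read_csv('sales_nodup_sample.csv').columns.tolist())  # → ['order_id', 'order_item_cd']
```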

If you find this framework for identifying duplicate data helpful, you are more than welcome to use, extend, and improve the code to solve your own unique data challenges.

Jason Thompson
Jason is the co-founder of 33 Sticks. In addition to being an amateur chef and bass player, he can also eat large amounts of sushi. As an analytics and optimization expert, he brings over 12 years of experience, going back to being part of the original team at Omniture.