The Classic Problem: Duplicate Records

Marketers know the duplicate record problem well. Multiple records for the same person or account signal that you have inaccurate or stale data, which leads to bad reporting, skewed metrics, and poor sender reputation. It can even result in different sales representatives calling on the same account.

De-duplication is the process of identifying duplicate records and merging them, keeping the best data from each.

The less well-known problem of duplicate data fields also afflicts many companies.

This article will discuss...

  1. Duplicate data fields and what causes them
  2. How to minimize the problem
  3. How to perform data unification
  4. Recommended tools and resources for the job

1. The Emerging Problem: Duplicate Data Fields

Today, there are endless options for acquiring, enriching, and validating leads. Each option has its strengths and weaknesses, so it is common for marketers to use and explore many data sources:

  • List sellers and renters
  • Contact enrichment and email validation services
  • Content-based lead generation services
  • Advertising and social selling platforms
  • Predictive lead sourcing and scoring services
  • Event organizers

A lead record typically originates from one source and then gets validated and enriched by multiple sources over time. After each enhancement effort, marketers usually keep both the old and new data, with the intention of auditing the quality of the new data before unification, and reverting to the old data if necessary.

However, when work schedules get busy, auditing and unification are postponed indefinitely. Consequently, duplicate data fields accumulate, resulting in records with (for example) two job titles, three emails, four sets of addresses, five phone numbers, six industries, and seven company sizes.
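
To make this concrete, here is a hypothetical sketch, in Python, of what a single record can look like after several enrichment passes. The field names, vendor suffixes, and values are invented for illustration:

    # A hypothetical CRM record after several enrichment passes.
    # Vendor suffixes and values are invented for illustration.
    record = {
        "email": "jane.doe@example.com",            # original source
        "email__validator": "jane.doe@example.com", # validation pass
        "email__list_vendor": "jdoe@example.net",   # enrichment pass
        "industry": "Automotive",
        "industry__dnb": "Motor Vehicle Manufacturing",
        "industry__lead_vendor": "Transportation",
        "company_size": "1,000-5,000",
        "company_size__lead_vendor": "5,001-10,000",
    }

Which of these values is current and trustworthy? Without labels for source and age, nobody can say.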

In addition, as institutional knowledge fades over time, the current marketing team can't determine the source and age of the duplicate data fields.

Let's take a look at some solutions to this issue.

2. First, Minimize Field Duplication

When duplicate records grow, the cleansing effort grows roughly linearly: the work to identify and merge four duplicate records isn't much greater than the work to de-duplicate two.

In contrast, when duplicate fields increase, the cleansing effort grows exponentially: unifying four industry fields takes far more than twice the effort of unifying two, because the unification logic becomes ever more complex to develop and execute as the number of duplicate fields grows.

So how can you minimize duplicate fields?

Audit and unify promptly

It seems obvious, but the best advice is to unify fields promptly while the data is fresh and the institutional knowledge is available.

Audit a small sample size and automate the unification

If a database has over 10,000 records, it is unlikely you can review every single one. Auditing a representative sample of a few hundred to a thousand records gives you a sense of the new data's quality. You can then decide how to unify the new data with the old, and automate the execution. Audit, unify, and move on.
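
As a minimal sketch of the sampling step, assuming the records have been exported to a CSV file (the file name and sample size below are illustrative), drawing a random audit sample in Python might look like this:

    import csv
    import random

    # Load the exported records; "leads.csv" is a hypothetical export.
    with open("leads.csv", newline="") as f:
        records = list(csv.DictReader(f))

    # Draw up to 500 records at random for manual review.
    audit_sample = random.sample(records, min(500, len(records)))

    # Write the sample out for the auditors.
    with open("audit_sample.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(audit_sample)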

Clearly label the data source and age

If you do have to delay the unification work, ensure you clearly label the new data with its source and age, and provide sufficient documentation to enable future unification efforts.
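
One lightweight convention, sketched below in Python with an invented naming scheme, is to encode the source and load date directly into each new field name, so that future unification scripts can read the provenance without relying on anyone's memory:

    from datetime import date

    def labeled_field(base_name, source, load_date=None):
        """Build a field name that encodes source and age, e.g.
        'industry__dnb__2024_05_01'. The naming scheme is
        illustrative, not a standard."""
        load_date = load_date or date.today()
        return f"{base_name}__{source}__{load_date:%Y_%m_%d}"

    record = {}
    record[labeled_field("industry", "dnb")] = "Motor Vehicle Manufacturing"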

3. Tips for Data Unification

Apply a consistent unification logic

Let's say you have four different industry data fields. How should you unify them? Take these drivers into consideration (a code sketch of this logic follows the list):

  1. Source authority: Which data source do you trust more? For example, industry data from Dun & Bradstreet is probably more authoritative than the equivalent from a lead vendor.
  2. Source focus: Which data source is more aligned with your market perspective? A lead source that specializes in your industry should provide more precise data than a broad-market source.
  3. Age of data: Industry data changes slowly, but contact and company-size data change frequently. More recent data is often the better data.
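
Here is a minimal Python sketch of such logic. The source rankings, field layout, and vendor names are assumptions for illustration; your authority order will depend on your own sources:

    from datetime import date

    # Rank sources by authority; a lower number wins. This ranking
    # is an assumption for illustration, not a recommendation.
    SOURCE_AUTHORITY = {"dnb": 0, "niche_vendor": 1, "list_vendor": 2}

    def unify_field(candidates):
        """Pick one value from (value, source, load_date) tuples:
        prefer the most authoritative source, then the most
        recent load date."""
        best = min(
            candidates,
            key=lambda c: (SOURCE_AUTHORITY.get(c[1], 99), -c[2].toordinal()),
        )
        return best[0]

    candidates = [
        ("Transportation", "list_vendor", date(2024, 1, 15)),
        ("Motor Vehicle Manufacturing", "dnb", date(2023, 6, 1)),
    ]
    print(unify_field(candidates))  # -> Motor Vehicle Manufacturing

Because the logic lives in one function, it can be applied uniformly to every record rather than decided case by case.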

Avoid ad-hoc decisions

Resist the urge to manually review every record and make ad-hoc, record-by-record decisions. Though ad-hoc decisions may yield better results for specific records, the approach doesn't scale; moreover, you are unlikely to have sufficient information to evaluate most records well. Applied to your entire database, a consistent logic will yield better overall results than ad-hoc decisions.

Take the opportunity to normalize and re-map

What is better than unified data? Unified and normalized data. With minimal effort, you can (and should) normalize data such as industry, company size, job function, job level, state, country, and phone number. For example, how could you effectively run campaigns with 2,000+ distinct industry values? Re-map the 2,000+ industries to the 10 that you have defined for your business. Say your business is Internet of Things: an automotive company such as Toyota would then be re-mapped to "industry = Vehicle Telematics," a non-standard industry segment, but a target industry segment for you.
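
One simple way to encode such a re-mapping, sketched here in Python with made-up category names, is a lookup table from raw vendor values to your own target segments, plus a catch-all bucket so unmapped values surface for review instead of being guessed at:

    # Raw vendor industries -> the segments you have defined.
    # All entries are illustrative.
    INDUSTRY_MAP = {
        "Automotive": "Vehicle Telematics",
        "Motor Vehicle Manufacturing": "Vehicle Telematics",
        "Hospitals & Clinics": "Connected Healthcare",
        "Electric Utilities": "Smart Grid",
    }

    def normalize_industry(raw_value):
        """Map a raw industry value to a target segment; route
        unknown values to a review bucket."""
        return INDUSTRY_MAP.get(raw_value, "Other - Needs Review")

    print(normalize_industry("Automotive"))  # -> Vehicle Telematics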

4. Tools and Resources You Will Need

What tools and resources are available to perform data unification work?

Use low-cost labor

This is the most popular means because it is the easiest to set up and requires no new technology. However, very detailed unification instructions are required, and the accuracy of the results varies based on the quality of your personnel. In the long run, manual unification is expensive and challenging to scale when the data set is larger than a few hundred thousand records.

Hire a database developer

This is not low-cost labor. This approach requires a technical person to set up a database and write SQL scripts to extract, transform, and load data. What you pay for is infinite flexibility.
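
As a rough illustration of that extract-transform-load pattern, the sketch below uses Python's built-in sqlite3 module; the table, columns, and the simple fallback rule are invented for the example:

    import sqlite3

    # Extract: load staged records into a working table.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE leads (
            id INTEGER PRIMARY KEY,
            industry_dnb TEXT,
            industry_vendor TEXT
        )""")
    conn.execute("INSERT INTO leads VALUES (1, NULL, 'Transportation')")

    # Transform: prefer the more authoritative field, fall back
    # to the vendor value when it is missing.
    conn.execute("ALTER TABLE leads ADD COLUMN industry_unified TEXT")
    conn.execute("""
        UPDATE leads
        SET industry_unified = COALESCE(industry_dnb, industry_vendor)""")

    # Load: in practice you would write the result back to the CRM.
    print(conn.execute("SELECT industry_unified FROM leads").fetchall())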

Find a data automation solution

Once you've defined the unification logic, you can easily automate the task using a data automation solution, which can be either cloud-based or on-premises licensed software. A software-as-a-service solution helps keep costs low and is easy for nontechnical marketing team members to use.

ABOUT THE AUTHOR

Ed King is the founder and CEO of Openprise, a data orchestration SaaS company. He has over 20 years of experience in enterprise software.

Twitter: @ekwking

LinkedIn: Ed King