How I built a Customer Data Platform based on Google Analytics unsampled data

The goal of this article is to give you a brief overview of how I built a homemade CDP (Customer Data Platform), why I did it, and which technologies I used. It focuses on the key steps; detailed technical issues will follow in other articles. I wish you a good read!


What is a Customer Data Platform?


“A CDP is a marketing system that unifies a company’s customer data from marketing and other channels to enable customer modeling and optimize the timing and targeting of messages and offers.” – Gartner Report

In other words, a Customer Data Platform is a user-centric database handled by the marketing team and updated at each user interaction, based on both cold and behavioural data.

It thus enables a company to get real-time user profiles and reach each user at the best moment of their customer journey. This is the first step in setting up an omnichannel marketing strategy.

To really push the point, we could say that the core value of data marketing today no longer lies in the CRM software but in the CDP.

Why Google Analytics data?

There are many ways and trackers to collect behavioural data on your website. The first one I tried is really great but quite expensive.

I was then looking into one of the numerous free alternatives when I came across an article by Dmitri Ilin explaining how to export data from Google Analytics to Google BigQuery.

This article grabbed my attention because it solves a well-known issue in the analytics world: the sampling of Google Analytics data. Indeed, although Google Analytics is probably the most advanced free analytics tool, it samples the data when the report you request covers a large number of sessions (so that you cannot access the detail of each visitor journey).

Retrieving unsampled Google Analytics data thus solves this issue and enables you to access the full journey of each visitor. Moreover, and unlike other analytics tools, it natively includes the tracking of Google Analytics events, which are super powerful and easy to set up. This is why I chose this method to build my Customer Data Platform.

The way I built the CDP

Retrieving, storing and processing the data

Dmitri Ilin’s article was a great inspiration on how to get Google Analytics unsampled data, but this was only the first step towards the business goal I had set myself: building a Customer Data Platform and being able to plug my CRM into it. Reaching this goal would give me precise information on all my clients and let me address them at the best time.

Moreover, storing all the logs in Google BigQuery did not look like the best solution for the objective I wanted to reach. It seemed far more appropriate to store raw data in a data lake than in such an expensive database.

Since I am more familiar with the AWS environment, I decided to adapt Dmitri Ilin’s tutorial to it. I created and deployed a Flask web application on Elastic Beanstalk which retrieves the logs and stores them in AWS S3 (which plays the role of a data lake here).

The global architecture

The logs are concatenated daily at 3am with Pandas (a very popular Python library for processing data) and stored in time- and userID-based directories in my S3 bucket.
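To make the directory layout more concrete, here is a minimal sketch of the daily grouping step. It uses plain Python instead of Pandas so that it runs without any dependency, and it does not talk to S3 at all: it only computes the object keys and the content that would be written. The key layout, the function names and the hit fields (user_id, timestamp) are assumptions made for the illustration, not the exact ones used in my bucket.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(user_id, timestamp):
    """Build a time- and user-based object key (hypothetical layout)."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return day.strftime("logs/year=%Y/month=%m/day=%d") + f"/user={user_id}/hits.jsonl"


def concatenate_daily(hits):
    """Group raw hits by partition key and join each group as JSON lines."""
    grouped = defaultdict(list)
    # Sort by timestamp so each file keeps the hits in chronological order.
    for hit in sorted(hits, key=lambda h: h["timestamp"]):
        grouped[partition_key(hit["user_id"], hit["timestamp"])].append(hit)
    return {
        key: "\n".join(json.dumps(hit, sort_keys=True) for hit in group)
        for key, group in grouped.items()
    }
```

Each key of the returned dictionary maps to one file per user and per day, which is what makes the per-user calculations later on straightforward.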

How do I link a user to all of his interactions?

This step is the key to having a good analytics tool and complete user profiles.

My userID (the one from Auth0) is passed as a custom dimension in Google Analytics, so that I am able to identify a user when he signs in.

In order to have a complete view of the customer journey, I then match the Google Analytics ClientID (unique for each client, i.e. each browser or device) to the userID.

This way, all interactions before the sign-in and after the logout are also stored and attributed to the right user.
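The matching logic can be sketched as follows. This is an illustration under simplifying assumptions, not my production code: hits are plain dictionaries, the field names client_id and user_id are hypothetical, and a ClientID seen with several userIDs (a shared computer, for instance) is simply attributed to the last one.

```python
def build_client_to_user(hits):
    """Map each ClientID to the userID seen with it on a signed-in hit."""
    mapping = {}
    for hit in hits:
        user_id = hit.get("user_id")
        if user_id:
            # Last userID wins if a ClientID was used by several users.
            mapping[hit["client_id"]] = user_id
    return mapping


def attribute_hits(hits):
    """Return copies of the hits with user_id backfilled via the ClientID."""
    mapping = build_client_to_user(hits)
    return [
        {**hit, "user_id": hit.get("user_id") or mapping.get(hit["client_id"])}
        for hit in hits
    ]
```

Hits whose ClientID was never seen on a signed-in session keep an empty user_id: they belong to visitors who are still anonymous.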

Calculate the aggregates, score and store the user profiles

Once we have a directory per user on S3, it is quite easy to calculate aggregates and scores for each one of them.

Aggregates are global stats computed on the raw logs, and they are crucial information in marketing. They are very useful when it comes to knowing how many times a user has visited a particular web page, or how many times he has triggered a web-based event.

In my case, I set up three aggregates: the list of ClientIDs, the number of login events, and the number of times a user has visited the pricing page.
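As an illustration, these three aggregates could be computed like this. The sketch is written in plain Python to show the logic only (the actual calculation described in this article runs as SQL), and the hit fields (type, event_action, page) as well as the "/pricing" path are assumptions.

```python
def compute_aggregates(hits):
    """Per user: sorted ClientIDs, number of logins, pricing page views."""
    profiles = {}
    for hit in hits:
        user_id = hit.get("user_id")
        if not user_id:
            continue  # hits not attributed to a user are skipped
        profile = profiles.setdefault(
            user_id, {"client_ids": set(), "logins": 0, "pricing_views": 0}
        )
        profile["client_ids"].add(hit["client_id"])
        if hit.get("type") == "event" and hit.get("event_action") == "login":
            profile["logins"] += 1
        if hit.get("type") == "pageview" and hit.get("page") == "/pricing":
            profile["pricing_views"] += 1
    # Turn the sets into sorted lists so the result is easy to serialize.
    return {
        user_id: {**profile, "client_ids": sorted(profile["client_ids"])}
        for user_id, profile in profiles.items()
    }
```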

Preview (on Data Studio) of the ClientIDs, the number of logins and the number of times each user visited the pricing page

The scores are based on the aggregates. Scoring is a way to find out which leads are the most likely to become customers, so that you can focus your marketing efforts on them.

For example, I created an “activity score”, based on the number of logins per week. The rules are the following:

  • if “number of logins per week” is 0: score = 0
  • if “number of logins per week” is >0 and <3: score = 1
  • if “number of logins per week” is >=3: score = 2
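
The rules above translate into a very small function. In this sketch, exactly 3 logins per week is treated as the top tier, which is an assumption; the function name is also just illustrative.

```python
def activity_score(logins_per_week):
    """Return the activity score (0, 1 or 2) for a weekly login count."""
    if logins_per_week < 0:
        raise ValueError("logins_per_week cannot be negative")
    if logins_per_week == 0:
        return 0
    if logins_per_week < 3:
        return 1
    # Assumption: 3 logins or more per week fall into the top tier.
    return 2
```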


It then enables me to send a dedicated newsletter to the leads with the best activity score, because they seem to be very interested in my product.

The aggregates and scores are calculated daily with AWS Athena, which is a querying tool for S3 (using SQL syntax). The output is stored in a CSV file on S3, which is then loaded into a PostgreSQL database on RDS (Amazon Relational Database Service).

Calculate the aggregates and load them to the DB

This database is my CDP: every user has only one profile, which is refreshed daily based on behavioural data and contains the marketing aggregates I use to communicate better with him.
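The "one profile per user, refreshed daily" behaviour boils down to an upsert keyed on the userID. Here is a minimal sketch of that idea. To keep it runnable without a database server, it uses SQLite from the Python standard library as a stand-in for PostgreSQL; the ON CONFLICT clause is also valid in PostgreSQL, although the placeholder syntax depends on the driver. The table and column names are assumptions.

```python
import csv
import io
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS user_profiles (
    user_id TEXT PRIMARY KEY,
    logins INTEGER NOT NULL,
    pricing_views INTEGER NOT NULL,
    activity_score INTEGER NOT NULL
)
"""

UPSERT = """
INSERT INTO user_profiles (user_id, logins, pricing_views, activity_score)
VALUES (:user_id, :logins, :pricing_views, :activity_score)
ON CONFLICT (user_id) DO UPDATE SET
    logins = excluded.logins,
    pricing_views = excluded.pricing_views,
    activity_score = excluded.activity_score
"""


def load_profiles(conn, csv_text):
    """Insert or refresh one row per user from the daily CSV export."""
    rows = [
        {
            "user_id": row["user_id"],
            "logins": int(row["logins"]),
            "pricing_views": int(row["pricing_views"]),
            "activity_score": int(row["activity_score"]),
        }
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    with conn:  # commits on success, rolls back on error
        conn.execute(SCHEMA)
        conn.executemany(UPSERT, rows)
    return len(rows)
```

Loading the export of a second day updates the existing rows instead of creating duplicates, so the table always holds a single, up-to-date profile per user.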

Next steps

The use cases described here are really basic. There are so many other things we can do with such a tool: segmenting the user profiles based on behavioural data, personalizing the website content, setting up a multi-touch attribution model… I will definitely take a deeper look at these use cases.

Another challenge is to have the user profile refreshed in real time (not only on a daily basis anymore). I have some ideas about this, and I’ll get back to you soon with more news.

Thanks for reading! If you are interested in this topic, please stay in touch.