Speaker 1: All right. Hello and good morning and good afternoon to everyone joining today. I'm Freddie, a solutions engineer at Snowplow, and I've been at the business for around two and a half years. Primarily I work with our customers on their initial engagements, but I equally work within the sales team to help with technical evaluations of our products. I'll begin by sharing my screen, and on here I can help contextualize the products that I'll be demonstrating today. For the no-slides conference today, I've chosen to focus on supercharging data quality using Snowplow.

If you haven't heard of Snowplow before, I'll do a brief introduction. Snowplow is a customer data infrastructure product. Its primary purpose is delivering behavioral data from various sources, such as your websites, apps, servers, and webhook applications, validating and enriching it, and landing it in a cloud data platform of your choice in real time. The platform and the pipeline itself can either be hosted by us using our cloud-hosted offering, or, as some of our customers opt for, deployed via our private SaaS option, which means the infrastructure is embedded within your own cloud account, whether that be on AWS, Google Cloud, or Azure.

We'll do a quick left-to-right tour of the various parts of the platform before we dive into the demo in greater detail. On the left-hand side, as I mentioned, are our source domains: website applications; server-side tracking, such as a back end written in Java or Python, for example; and webhooks, for when you want to understand interaction data across emails, for example, whether a user has clicked on an email and subsequently opened a link. We've got 20-plus software development kits that can be used to track all of the behavioral data associated with those different domains.

Once the data has been tracked, it is sent to the pipeline, ingested by a collector, and then passed through this validation layer here. This is the key value proposition of Snowplow: we're able to deliver high-quality data natively into the warehouse, and it's done using this validation. The validation is performed against JSON schemas. You can think of these schemas as a particular set of rules that you dictate; we like to call them data contracts. Should an event not conform to those data structures, it will fail validation and ultimately land in a failed events table in Snowflake, for example, or in an equivalent failed events area on the other cloud data platforms. This means the pipeline remains non-lossy: events that don't conform to the data structures you create can still be repaired and ultimately added to your cloud data warehouse or lake destination, such as S3, using open table formats such as Apache Iceberg, Hudi, and Delta.

After the validation, we also have an enrichment layer. This is where we take first-party and third-party data sources and add them contextually to events as they're processed through the pipeline. The first enrichment I'll talk about is an IP lookup, which is commonly used across all of our customers: we take the IP address and use it to derive geographical information about the user.
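To make the idea of a data contract concrete, the sketch below shows roughly what one of these JSON schemas can look like, written here as a JavaScript object for readability. The vendor, event name, and properties are hypothetical placeholders, not a schema from this demo.

```javascript
// Illustrative sketch of a self-describing JSON schema ("data contract"),
// written as a JavaScript object for readability. The vendor, name, and
// properties are hypothetical placeholders.
const formSubmissionContract = {
  $schema: 'http://iglucentral.com/schemas/com.snowplowanalytics.self-desc/schema/jsonschema/1-0-0#',
  self: {
    vendor: 'com.example',
    name: 'form_submission',
    format: 'jsonschema',
    version: '1-0-0',
  },
  type: 'object',
  properties: {
    form_id: { type: 'string' },
    field_count: { type: 'integer', minimum: 1 },
  },
  required: ['form_id'],
  // Events missing required properties, or carrying unexpected ones,
  // fail validation and are routed to the failed events table.
  additionalProperties: false,
};
```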
Another enrichment a user might look to use is our PII pseudonymization enrichment, which prevents personally identifiable data from landing in its raw format inside your cloud data platform of choice. You can also do more custom and complex enrichments at this stage. You could use a JavaScript enrichment, which runs a JavaScript function on top of each behavioral event as it's processed through the pipeline, or a SQL query enrichment, where we frequently see customers querying data against a vector database, for example, before that data is ultimately processed into their cloud data warehouse or lakehouse of choice.

All right, so once the data has landed within the cloud data platform, this is where we can then model it. With Snowplow, we've built out a collection of dbt models that can be used to aggregate that atomic, behavioral-level data into different use cases. We offer a unified digital model, which can easily aggregate user sessions across web and mobile applications, for example, to support customer-360-type use cases. For customers focused on marketing use cases, we've built out an attribution model, and we've also built media and e-commerce models for media analytics and e-commerce use cases. It's worth noting that these dbt models are fully transparent and available on dbt. We like to think of them as getting you about 80% of the way there, and then you can build on top of them whatever additional logic your business requires.

Once the data has been modeled, it can then be used with our visualizations. We've built out data applications such as user and marketing analytics, attribution modeling, media, and e-commerce. These can be thought of as quick wins, enabling the downstream analysts and consumers of the data to answer simple questions while freeing the data engineering team to work on more complex use cases, such as building out AI-based use cases. Additionally, as I mentioned, we're able to feed these cloud data platforms in real time, but we also offer a real-time stream, which could be a Kinesis stream, a Pub/Sub topic, or maybe even an Apache Kafka topic, depending on the customer's cloud of choice. This real-time stream can then be used to filter and send data on to other destinations of choice in real time.

For the purposes of today's demonstration, to contextualize this a little further: within Snowplow, one thing we're particularly interested in is how our customers ultimately look to request a demo. On the Snowplow website, a user may click this book a demo button, then fill in a form, and then, at the bottom, click on this book a demo now button. What we can now do is create this user journey using the Snowplow UI, and then, after we've created it as part of our data products builder, we can instrument the tracking that can be used across our website. So within the console, I'm going to move into the data products section now.
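Before moving into the console, here is a rough sketch of the kind of process(event) function that the JavaScript enrichment mentioned earlier runs against each event in-stream. The getPage_url() accessor and the returned schema URI are assumptions for illustration; the exact event API available to the function depends on your pipeline version.

```javascript
// Rough sketch of a custom JavaScript enrichment. Snowplow runs a
// user-supplied process(event) function against every event in-stream.
// The getPage_url() accessor and the returned schema URI are illustrative
// assumptions; check the JavaScript enrichment docs for your pipeline version.
function process(event) {
  // Derive a simple classification from the page URL (hypothetical logic).
  var pageUrl = event.getPage_url();
  var isDemoPage = pageUrl !== null && pageUrl.indexOf('/book-a-demo') > -1;

  // Return an array of self-describing contexts to attach to the event.
  return [{
    schema: 'iglu:com.example/page_classification/jsonschema/1-0-0',
    data: { is_demo_page: isDemoPage }
  }];
}
```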
Now, the data products: you can think of them as a collection of event specifications and rules that you might build out as part of a tracking plan. If I click into this demo request funnel data product, what we can see is a collection of event specifications. We've added some metadata at the top, and I can subscribe to see any changes made to this tracking plan, so that if I were a downstream consumer, I could make any necessary amendments to my data model, for example. But what we'll do now is drill a little further into this request a demo event specification.

Within here, we can see some of the validation that we set up. For the event data structure, I set up this demo request action, and the action property must be one of demo request, complete form, or book demo. If I send an event whose action property doesn't conform to those values, it will fail validation, and ultimately this means that all data that flows into our cloud data platform remains of very high quality.

Additionally, within the data product and this event specification that I've created, we can also add what we call entities. Entities can be thought of as additional contextual pieces of information that can be modularly attached to each event. As an example, a team running A/B tests might want to understand whether making this book a demo button a little bit bigger makes a user more likely to go through the funnel. So here we've added an A/B test entity, whose property must be of type string, along with the version of the A/B test that we're running. I've also added a page entity that can be used as well.

Below, I've added some more metadata that the development team can use to understand exactly where this event should be triggered. Within here, I've added an image showing where we should be booking a demo, which ultimately makes it easy for the developer to know where that event should be tracked. At the bottom, within this Explore section, some very basic SQL is provided so that we can easily interpret the events that we've created.

Having built out this set of event specifications, which can be used to track the various clicks, form fills, and submissions a user might make on the website, what I can now do is convert that information into easy-to-use tracking code, and I'll do so using Snowtype. Snowtype is a command-line interface tool that essentially bridges the gap between this tracking plan and my IDE. I take these commands, run them inside my IDE terminal, and it creates a snowplow.js file containing these event specifications, making it as easy and seamless as possible to instrument these events. To show what that looks like, I'll move into VS Code. Within the terminal here, I can run those commands, and it creates this snowplow.js file. So for that request a demo specification, it's created a function that can be used to make the tracking implementation as seamless and easy as possible for our front-end and back-end engineers.
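Conceptually, the tracking that such a generated helper wraps is a self-describing event with its attached entities. The sketch below shows that shape using the @snowplow/browser-tracker package directly; the schema URIs and property names are illustrative placeholders, not the ones Snowtype generates for this data product.

```javascript
// Illustrative sketch of tracking the demo request event with its entities
// using the Snowplow browser tracker directly. The schema URIs and property
// names are placeholders, not the ones generated for this data product.
import { trackSelfDescribingEvent } from '@snowplow/browser-tracker';

trackSelfDescribingEvent({
  event: {
    schema: 'iglu:com.example/demo_request/jsonschema/1-0-0',
    // `action` must be one of the enumerated values from the event
    // specification, e.g. demo_request, complete_form, or book_demo.
    data: { action: 'demo_request' },
  },
  // Entities travel as an array of context objects attached to the event.
  context: [
    {
      schema: 'iglu:com.example/ab_test/jsonschema/1-0-0',
      data: { name: 'button_size', variation: 'large' },
    },
    {
      schema: 'iglu:com.example/page/jsonschema/1-0-0',
      data: { name: 'homepage' },
    },
  ],
});
```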
Now, if I move into my React application where I'm implementing the tracking, I can start adding that action as part of the demo request tracking call. As I start typing it in, you can see that the three enumerated values I added to that list have become available, so using that function gives me additional visibility and helpers on top of my tracking, making it as easy as possible to instrument. I can click to say the action should be demo request. But if I make a typo, for example, you'll also see that a squiggly line appears under action, along with an error message explaining why. This means that when I'm a front-end or back-end engineer instrumenting these event specifications, I can ensure they are as expected before they ever reach that validation layer, and as such prevent the data from being filtered into, say, a failed events table within Snowflake. So we've got two levels of data quality checks inside the pipeline: one done up front, so that the developers instrumenting the tracking, the front-end and back-end engineers, can easily see whether they're instrumenting the events correctly, and the validation layer itself, which filters out any events that have, unfortunately, still been instrumented incorrectly.

All right. So having taken a look at my tracking code, I've built it out with my demo action, and I've added my A/B testing entity and my page entity. What I can now do is take a look at what this code looks like live on the website. For today, I've added a little dummy application here as part of this funnel. You can see a page view has already fired; this is just one of our out-of-the-box events. But if I click on this book a demo button, we can now see an event start to fire. On the right-hand side is the Snowplow Inspector, which is really useful for teams developing the code, as they can see events as they're fired in real time. As part of this inspector, we can also see the data that we sent with the event. Here we can see the action, demo request, that we sent along with it, as well as the entities: the button size variation of our A/B test, and the page entity.

One additional entity that I've sent through with this event is this event specification entity, which is specific to the event specification that I created within here. The benefit of that is that if I've instrumented the events correctly, we get visual feedback within the UI: based upon an ID sent with those events, I can see exactly which applications have had that event instrumented correctly. So we can see that the webinar demo, publisher demo, and snowplow demo have had the event added to them.

All right, so if I move back to my architectural diagram: I've shown you the tracking instrumentation, the validation that might be taking place, and the enrichment. Now let's take a look at the data as it lands within, in this example, Snowflake in real time. I'll just add a few more events in here, and then I can move into my atomic events table and run this query. Having done so, we can see some of these events as they've started to land.
Now, in order to demonstrate that this has landed in real time, we can compare the timestamp created on the device with the timestamp recorded when the data was loaded, and what we can see, using our Snowflake streaming loader, is that we're able to get the data from the point of creation into the warehouse in just over one second.

Whilst we're in Snowflake, I can talk a little about Snowplow's data structures. Snowplow uses a single wide table, which means that regardless of the application the event is tracked on, or the event itself, everything lands in one table. The benefit is that building use cases across marketing or product, for example, requires very few joins. You can very easily start to build data models on top of this data and combine it with other data sources. From a data engineering perspective, this makes Snowplow very simple to use and ultimately reduces the time to value.

If I scroll to the right-hand side, we'll see a few more of the properties I've selected for the purposes of today's demo. This is a small subset of everything that can be fired out of the box; we have just over 100 properties that can be added to each event, but I'm showing just a few here. Here we've got the action for the demo request, so we can see it's of type demo request. To the right, we've got that A/B test, where we're varying the button size, as mentioned. And then further right, I've got the page entity, just so we can add additional color as to which page this is occurring on.

What you'll notice here is that I've got a user IP address, and it doesn't look much like an IP address. That's because we've used our PII enrichment in-stream, ensuring that when the IP address lands within the warehouse it's not in its raw format; it's been hashed and obfuscated. Additionally, we've got a network user ID, which is essentially a cookie. If set up in a first-party manner, it can last up to 400 days on Safari, meaning that if a customer doesn't, say, log in or provide you with any further information to identify themselves, you're still able to track that customer journey across many, many different sessions. If I scroll further to the right, you can see that based upon the IP address, before I pseudonymized it, I was able to do an IP lookup using that IP lookups enrichment, and from that I've determined my city as well as my country. Further to the right, I also used a bot enrichment, which checks against the IAB bot database whether I was a bot or a spider; based upon my user activity, it's deemed that I was not. And, as I mentioned, we can also add in this event specification entity, which makes it very easy to understand whether the event was instrumented using that data product I created inside the Snowplow UI.
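To give a feel for that single wide table, here is a simplified sketch of one event row, shown as a JavaScript object. The column names are standard Snowplow atomic fields, but the values are invented for illustration, and the real table carries many more columns plus the self-describing event and entity columns.

```javascript
// Simplified sketch of one row from the single wide atomic events table,
// shown as a JavaScript object. The column names are standard Snowplow
// atomic fields; the values are invented for illustration.
const exampleEventRow = {
  app_id: 'example-web-app',
  event_name: 'demo_request',                     // the self-describing event
  dvce_created_tstamp: '2024-01-01 10:00:00.000', // created on the device
  load_tstamp: '2024-01-01 10:00:01.200',         // loaded into Snowflake
  user_ipaddress: '2c7a9f41d8e05b12',             // hashed by the PII enrichment
  network_userid: 'f3a1c9d2-7b44-4a1e-9c10-2d6f8e5a1b34', // first-party cookie ID
  geo_country: 'GB',                              // from the IP lookup enrichment
  geo_city: 'London',
};
```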
So if we go back one step and take a look at this architectural diagram once more: we created the data using Snowtype, ensuring it was of very high quality in the first place when we set up the tracking. The data was then validated against the set of JSON schemas I mentioned, enriched in-stream, and ultimately fed into Snowflake in real time. We can then model that data for a range of additional use cases, but for today, what I'll be doing is building a basic funnel on top of the actions a user took when they booked a demo, completed the form, and made a submission. I've done so using the Funnel Builder tool.

The Funnel Builder is one of the Snowplow data applications we've built, and it's designed so that users with minimal SQL knowledge can build out visualizations with ease. Here I can create the steps a user might take: in this example, a user arrives on the homepage, then views the book a demo page, and then submits the demo form. That's just three events I've defined, and on top of them I've built a visualization. On the right-hand side, we can see the session counts by funnel step, so we can see where users might be dropping off, and we can also see the funnel conversion rate, the abandonment at each step, and a summary table at the bottom. What I could do here is add in additional properties. For example, I added an A/B test entity, so I could use that to understand whether, say, increasing the size of that book a demo button ultimately led to a higher propensity for users to go through this workflow. Additionally, based upon the geographical information collected as a result of the IP lookup, I could add that in to see where users are most likely coming from when they submit this demo form.

OK, so having taken a look at the funnel, what I can now do is download the SQL. One of the benefits of the Funnel Builder is that, having created the visualizations in the data application, I can then port the SQL elsewhere and use it in other BI tools, for example, or build on top of the SQL that's been generated and customize it, potentially inside Snowflake or in, say, Tableau.

So that covers everything I had intended to demonstrate today. I'll just summarize once more how Snowplow can improve your data quality by moving back to this architectural diagram. Up front, we're able to instrument tracking using our command-line interface tool Snowtype, which helps prevent failures from creeping in at the point where front-end or back-end engineers instrument the events. If that safeguard doesn't catch a poorly instrumented event, we have the validation layer, which is built on top of those JSON schemas, as I mentioned. Ultimately, this ensures that the data that lands within Snowflake is of very high quality and is ready for both BI and AI use cases. All right, that's everything I was intending to show today, so I'll ask if there have been any questions so far. All right, if there are no questions, thank you very much for your time today, and I hope you enjoy the rest of the sessions.