Messy data? We’ve all been there. In an ideal world, your data would be squeaky clean in your reporting interface. Google Analytics would provide row after row and column after column of sense-making numbers that all added up. You wouldn’t have to manipulate a report and add the values of multiple rows together to get an accurate number for a GA event, or account for different spellings of a campaign name. You wouldn’t have to apply four advanced segments in an effort to catch all the known errors in your GA data. You wouldn’t have to export multiple reports to CSVs to manipulate for basic weekly reporting. You know what I’m talking about!
We all have our workarounds to make sense of our data. If you know your business and your customers and your website, you know how you need to look at things for the numbers to make sense in GA. And for heavier lifts, your business likely already has a process in place to transform the data before it hits your database for storage and queries. I should say now that the need for ETL (Extract, Transform, Load) tasks will never go away when it comes to data warehousing needs (and we don’t want them to)...but what if the numbers in GA were better? What if a new analyst on your team could jump into GA and immediately understand what was going on? Wouldn’t that be nice?
To clean your data in Google Analytics, you need to fix it at its source or fix it on its way in.
We’re talking about the pesky details that can cause big problems. If you are analyzing the wrong data, your analysis doesn’t matter. If your test data and internal traffic are mixed in with your user data, the waters are muddy.
There are the usual suspects when it comes to messy data: a campaign name has multiple spelling variations; personally identifiable information (PII) snuck its way in (fix this ASAP, it’s against GA’s terms of service); there are discrepancies in event casing; there are inconsistencies in naming conventions (of events, products, attributes, elements on the page, etc). What are the known culprits in your GA data?
Inconsistent Event Categories in Google Analytics due to casing issues.
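To see what that casing problem costs you, here is a minimal Python sketch of the manual cleanup an analyst ends up doing on an exported report. The rows and the `merge_by_casing` helper are made up for illustration; they are not from a real GA export.

```python
from collections import defaultdict

# Hypothetical exported rows: (event category, total events).
rows = [
    ("Video", 120),
    ("video", 45),
    ("VIDEO", 5),
    ("Navigation", 30),
    ("navigation", 10),
]


def merge_by_casing(rows):
    """Sum event counts for categories that differ only by casing."""
    totals = defaultdict(int)
    for category, count in rows:
        # Normalize whitespace and casing so variants share one key.
        totals[category.strip().lower()] += count
    return dict(totals)


merged = merge_by_casing(rows)
print(merged)  # {'video': 170, 'navigation': 40}
```

Five rows collapse into two, which is the number you actually wanted to report in the first place.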
Fixing Data at Its Source
The best thing would be if, at its source, the data were perfect. If everything in your data layer were pristine, it would be pristine in GA, so why not fix it there first? This is a great solution! It’s the solution I’d prefer to use every time if I could, as long as the task at hand isn’t too difficult. If, for example, it’s a matter of leveraging your CMS to attach an attribute in the data layer to all hero images, then do it! But depending on the ask, fixing data at its source could be a developer’s nightmare. It could tie up your resources for an extended period and cost too much money, time, and effort for little payoff, especially when an analyst could simply pull a CSV report and clean the data after the fact.
But if you can fairly easily fix data at its source, or if you determine that spending the development time up front will save many analyst hours down the road, do it.
Fixing Data as It's Being Processed
If your data is incorrect at the source (inconsistent or appears in a way you don’t prefer), you can transform it on its way into Google Analytics. There are two main methods:
- GA Filters
- Google Tag Manager
After a hit is sent to GA, it is processed before it lands in your GA reports (and in BigQuery). You can imagine the Google server catching all the data in a sieve, then determining what can pass through to each Google Analytics View based on the filter rules set up. If you have a filter in place to remove all your internal traffic (using an Exclude filter based on your internal IP addresses), then Google checks each hit and says “does it match this IP? No? Then it can go through to this view.” Only the hits that pass every rule are deposited in your GA views. GA Filters have long been a great tool for cleaning up your data. In addition to exclude/include filters, you can create lowercase filters (think: event categories/actions/labels, campaign names, hostname, etc.), Search and Replace filters (which you might use to remove a final slash from Request URIs or to rewrite URIs with a cleaner folder structure), and Advanced filters (which can extract a parameter from a URL and write it into a custom dimension).
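If the sieve picture is hard to hold in your head, a rough Python sketch of the idea may help. This is not how Google implements filters; the hit fields, the `203.0.113.0/24` internal range, and all function names are assumptions chosen for illustration.

```python
import ipaddress

# Hypothetical internal network; swap in your own office ranges.
INTERNAL_NETWORK = ipaddress.ip_network("203.0.113.0/24")


def exclude_internal(hit):
    """Exclude filter: drop the hit (return None) if it comes from an internal IP."""
    if ipaddress.ip_address(hit["ip"]) in INTERNAL_NETWORK:
        return None
    return hit


def lowercase_event_category(hit):
    """Lowercase filter: normalize the casing of the event category."""
    return {**hit, "event_category": hit["event_category"].lower()}


def apply_filters(hit, filters):
    """Run a hit through each filter in order; None means it never reaches the view."""
    for view_filter in filters:
        hit = view_filter(hit)
        if hit is None:
            return None
    return hit


view_filters = [exclude_internal, lowercase_event_category]
hits = [
    {"ip": "203.0.113.7", "event_category": "Video"},   # internal, excluded
    {"ip": "198.51.100.4", "event_category": "VIDEO"},  # external, kept
]
filtered = (apply_filters(hit, view_filters) for hit in hits)
view = [hit for hit in filtered if hit is not None]
print(view)  # [{'ip': '198.51.100.4', 'event_category': 'video'}]
```

The internal hit never lands in the view, and the external hit arrives with its category already lowercased.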
Things to remember:
- Filters permanently affect your data. Always test filters in a duplicate testing view, which has the same setup as the view you’d like the filter to end up in. Once you’re happy with how it’s performing, then move it to the appropriate view.
- Filter order matters. Do not put a Search and Replace filter that relies on query parameters BELOW a filter that strips all query parameters. They are processed in order. If you have multiple filters that work in tandem, consider labeling them appropriately so that it’s easy to see at a glance if they’re in order (e.g. Remove Internal Traffic 1/3, Remove Internal Traffic 2/3, Remove Internal Traffic 3/3).
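The ordering point is easy to demonstrate with a small Python sketch. The `promo` parameter, the URI, and both filter functions are hypothetical stand-ins for a Search and Replace filter and a filter that strips query parameters.

```python
import re


def strip_query(uri):
    """Stand-in for a filter that strips all query parameters."""
    return uri.split("?", 1)[0]


def rewrite_by_promo(uri):
    """Stand-in for a Search and Replace filter that relies on a promo parameter."""
    return re.sub(r"^/landing\?promo=(\w+).*$", r"/landing/\1", uri)


def apply_in_order(uri, filters):
    """Apply each filter to the URI in the order given."""
    for uri_filter in filters:
        uri = uri_filter(uri)
    return uri


uri = "/landing?promo=spring&utm_source=email"
good = apply_in_order(uri, [rewrite_by_promo, strip_query])
bad = apply_in_order(uri, [strip_query, rewrite_by_promo])
print(good)  # /landing/spring
print(bad)   # /landing
```

In the second ordering the parameter is already gone by the time the rewrite runs, so the rewrite silently does nothing.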
Google Tag Manager
Google Tag Manager functions as the intermediary between your website/data layer and Google Analytics (as well as other tools where you send data). GTM is reading and assessing the data against a set of rules and choosing which data to send where at what time. GTM can also transform data on its way to GA (or elsewhere).
For situations like forcing values to lowercase, the new Format Value feature within user-defined variables in GTM allows you to easily format the output of your variables for consistency. And while you do need an understanding of GTM, you do not need to be a developer to use this feature! It is a great, simple way to automate some of the process and reconcile discrepancies so that your GA data is cleaner and easier to understand.
GTM Format Value feature (available within user-defined variable configuration).
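Format Value is configured with checkboxes in the GTM interface, not with code, but the effect can be imitated in a few lines of Python. Treat this as a loose sketch only: the `format_value` function, its parameters, and the "(not set)" fallback are assumptions, and the real feature may handle edge cases differently.

```python
def format_value(value, change_case=None, convert_none_to=None):
    """Loosely imitate two formatting options: a case change and a fallback for empty values."""
    if value is None:
        return convert_none_to
    if isinstance(value, str):
        if change_case == "lowercase":
            return value.lower()
        if change_case == "uppercase":
            return value.upper()
    return value


# Hypothetical raw variable outputs before formatting.
raw_labels = ["Hero Banner", "HERO BANNER", None]
clean = [
    format_value(label, change_case="lowercase", convert_none_to="(not set)")
    for label in raw_labels
]
print(clean)  # ['hero banner', 'hero banner', '(not set)']
```

The point is that the cleanup happens once, in the variable, so every tag that uses the variable sends the same consistent value.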
GA vs. GTM
There is no hard and fast rule that determines when to use GTM vs. a GA Filter to keep your data clean. If you already have GA filters in place for forcing Event Categories, Actions, and Labels to lowercase, you don’t need to switch to using the GTM feature. Regardless of the strategy you establish for cleaning up your data, the goal is to report on and analyze the best numbers you can - hopefully with as much accuracy, consistency, and automation as possible. The less time you have to spend cleaning, the more time you can spend analyzing (and Metricstory’s automated analytics solution can help with that too). The last thing you want is your team looking at clunky, inaccurate data - so clean it up!