Strategies for importing and extracting GA4 event data from BigQuery into SQL Server

Google Analytics 4

Strategies for importing and extracting GA4 event data from BigQuery into SQL Server

Posted by Oliver on 16 September 2022 at 9:44 pm

Let’s break it down. We’ve been pulling daily event CSV files from BigQuery using Python. These files have a bunch of data columns, some of which hold .json data that might even have more .json data embedded in there – it’s like a data Matryoshka doll situation. Each of these daily CSVs hold around 1 million records.

Using the Google GA4 Events Schema, we’ve started setting up corresponding tables in SQL server to keep everything organised. The tables we’re working on include: ‘event’, ‘user’, ‘device’, ‘geo’, ‘app_info’, ‘traffic_source’, ‘stream and platform’, ‘ecommerce’ and ‘items’.

Now, here’s where we’d love your advice – what’s the best approach to parse these CSVs, with their potential layers of .json, into their respective SQL server tables? We had a crack at creating an SSIS package that would import the CSV into a staging table, and a script that would parse the rows. But we quickly realised that we’d end up with a super complex query filled with cross apply statements, which got us worried about performance issues. Any ideas?

Abdul replied 1 year ago 3 Members · 2 Replies
2 Replies

Taylor

Member
13 October 2022 at 12:23 pm

Hey there! It sounds like you have quite the Matryoshka doll of data on your hands! Just a heads up, the best practice for this kind of situation is to flatten your BigQuery GA4 export before you even export it. Basically, you want to simplify that json data in your event property/custom dimension column so you don’t have to worry about it later. A common approach is to create a unique column for each event property or custom dimension.

Then, once everything’s all organized and simplified like that, you can go ahead and shift your data into the SQL server tables. It’s a relatively straightforward move at that point. If you decide to transform the data any further, it’ll mostly depend on your specific requirements. Just keep in mind that ETL logic can get a bit hefty as you make modifications over time. But no worries, it’s all part of refining and improving your data, right?

I hope that helps! Feel free to reach out if you have any other data dilemmas you need a hand with.
Abdul

Member
15 June 2023 at 8:15 pm

One solution could be to use Python, which excels in handling and parsing JSON data, to handle the complex parsing process. Firstly, you can write a script that reads the CSV file and the potentially nested JSON objects in each row. Python’s JSON module (json.loads) can convert JSON data into Python dictionaries, which is a format you can work with more easily. You can then transform the data and segregate it as needed, matching the target structure of your SQL Server tables. Lastly, use a Python SQL toolkit like SQLAlchemy to insert the data into your SQL server tables. You can also handle any issues with data conversion, formatting, and error checking before loading the data into SQL Server tables, making this approach more robust. Of course, remember to test this process on a smaller scale before running it on your full 1 million records data set.

Strategies for importing and extracting GA4 event data from BigQuery into SQL Server

Taylor

Abdul