Querying GA4 Events data in BigQuery

GA4 schema - nested records in BigQuery

To understand how events are stored in BigQuery for Google Analytics 4, we will explore how events and their parameters are stored in BigQuery. Google Analytics event parameters are recorded in nested and repeatable field called event_params. We will explore what is unnesting and why it is mandatory to filter and extract data from BigQuery. I am using Google’s sample dataset for Google Analytics 4 for this article you can access the dataset here for querying alongside the article.

Event Schema

A complete reference to the schema can be found here. I am using key fields of interest which can help us in understanding how events are represented in BigQuery for GA4 and how we can use SQL to access it.

GA4 main events fields
GA4 main events fields

Here is a visual view of the above schema to highlight where the nested columns are.

GA4 schema - nested records in BigQuery
GA4 schema – nested records in BigQuery

If we look at a single-row preview of how this data is actually stored in BigQuery, we get the following preview.

GA4 event fields preview in bigQuery
GA4 event fields preview in bigQuery

Let us go into bit more detail and reproduce the table in an excel like format to understand it .

wdt_ID event_date event_timestamp event_name event_params.key event_params.value.string_value event_params.value.int_value event_params.value.float_value event_params.value.double_value
1 20210131 2147483647 page_view gclid
2 gclsrc
3 debug_mode 1
4 ga_session_number 1
5 all_data
6 page_location https://shop.googlemerchandisestore.com/
7 entrances 1
8 session_engaged 0
9 ga_session_id 661.084.800
10 clean_event gtm.js

The first 3 columns event_date, event_timestamp and event_name are not nested. They constitute 1 row. However, from the 4th column onwards, we see “nesting” starts. All the parameters of the event page_view are stored as key, value pairs. These two are represented differently in nested records.

  1. Key : event_params.key the column highlighted in the green is the key for each parameter.
  2. value : Value is represented using next 4 columns. In case of string value for paramter it is stored in .string_value , if the event had an integer value it will be stored in .int_value column and so on.

Limitation of Traditional SQL

Traditional SQL queries row data.There is no provision which can allow us to query nested data. In the above table, say you want to get page_location from the table, you cannot get it because 1 row has several nested rows.

SELECT *
    FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210130`

    WHERE event_params.key='page_location'

It will throw an error saying,

“Cannot access field key on a value with type ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, float_value FLOAT64, …”

The error above means that our query is attempting to access a field named ‘key‘ from a value with an ARRAY type, which is not allowed.

If you refer back to our schema diagram above, you will see the following :

  1. event_params : has a data type RECORD
  2. event_params.key : has a data type STRING
  3. event_params.value has a data type RECORD
  4. Individual value columns under event_params.value : have either STRING, INTEGER or FLOAT data type

You could only query the above data, if it were in the following format where each row was accessible uniquely.

wdt_ID event_date event_timestamp event_name event_params.key event_params.value.string_value event_params.value.int_value event_params.value.float_value event_params.value.double_value
1 20210131 2147483647 page_view gclid
2 20210131 2147483647 page_view gclsrc
3 20210131 2147483647 page_view debug_mode 1
4 20210131 2147483647 page_view ga_session_number 1
5 20210131 2147483647 page_view all_data
6 20210131 2147483647 page_view page_location https://shop.googlemerchandisestore.com/
7 20210131 2147483647 page_view entrances 1
8 20210131 2147483647 page_view session_engaged 0
9 20210131 2147483647 page_view ga_session_id 661.084.800
10 20210131 2147483647 page_view clean_event gtm.js

In this table, we have manually filled first 3 columns with repeated values so that we can get the data in a standard row format. This is exactly what unnesting does, it allows us to query data as if it was from a unique row. We can thenaccess any row of data and filter on it.

Unnesting the events

Let us begin with a query to recreate the table above where we have event_date, event_timestamp, and event_name repeated for each nested record. We can then decipher the query step-by-step to understand how unnesting works in BigQuery.

WITH unnested_params AS (
  SELECT
    event_date,
    event_timestamp,
    event_name,
    params.key AS param_key,
    params.value.string_value AS param_string_value,
    params.value.int_value AS param_int_value,
    params.value.double_value AS param_double_value,
    params.value.float_value AS float_int_value

  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210130`
  CROSS JOIN
    UNNEST(event_params) AS params
)

SELECT
  *
FROM
  unnested_params

This query above will generate the same result as our table above, where we manually unnested by pasting the repeating values in first 3 columns

unnesting query result for understanding GA4 events
unnesting query result for understanding GA4 events

Let us dissect the query and understand what is happening behind the scenes.

GA4 Un-nesting in BigQuery
GA4 Un-nesting in BigQuery

1. With Clause

Using with clause is optional. I am using it here to make the query easy to digest. “with” clause creates a temporary result set for our query within it. We can then use the temporary result set as our table for next select statement. In ou case, we will get the temporary result set for our query within the blue highlighted area un unnested_params variable. We can then query unnested_params using a select statement.

2. Key,value fields

Under SELECT statement, we are selecting normal columns event_date, event_timestamp,event_name. In addition to these, we are also selecting unnested keys using params.key and values from params.value RECORD.

3. CROSS JOIN

Cross join takes every row on the left side of it and joins it with every row on the right side of it. The SQL on left side of CROSS JOIN selected the parameters as discussed above and for each row on left side appended a row from the Right side. The right side rows are obtained from unnesting the event_params RECORD.

4. UNNEST

The UNNEST keyword as the name suggests Undo the nesting of the field passed to it in brackets. E.g. the UNNEST(event_params) command takes the event_params record out of the context of nested record and makes it available as a separate column on its own. When we CROSS JOIN the event names and key values with unnested record , we get the result we were after.

Getting Events with specific parameters

We can now extend the above query to do meaningful data extraction. Let us say, we want to find all page views where page location contained they keyword “Apparel”.

We can rewrite the above query with the conditions

  1. params.keys : Get only page_location parameter
  2. params.value.string_value : Get only pages where we have “Apparel” in page_location parameter key.
  3. event_name: Get only page_view event.
WITH unnested_params AS (
  SELECT
    event_date,
    event_timestamp,
    event_name,
    params.key AS param_key,
    params.value.string_value AS param_string_value,
    params.value.int_value AS param_int_value,
    params.value.double_value AS param_double_value,
    params.value.float_value AS float_int_value
  FROM
    `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210130`

  CROSS JOIN
    UNNEST(event_params) AS params
    where params.key='page_location'
    AND
    params.value.string_value LIKE '%Apparel%'
    AND
    event_name='page_view'
)
SELECT
  *
FROM
  unnested_params

This query will now give us results filtered which were possible only after unnesting the event_params RECORD.

GA4 BigQuery unnested event_params RECORD.
GA4 BigQuery unnested event_params RECORD.

Summary

In this article, we explored the RECORD type fields in BigQuery in the context of Google Analytics 4 data set. BigQuery RECORD type fields are used for nested records. Google Analytics 4 events parameters are stored as key, value pairs in event_params field of RECORD type. To filter on the basis of either keys or values we have to un-nest the field. This can be done with the help of “UNNEST” keyword and using a CROSS JOIN with the fields that we want to combine the data with.

Related Articles

Responses

Your email address will not be published. Required fields are marked *