What could be causing abnormal retention rates in bigquery compared to GA4?

Google Analytics 4

What could be causing abnormal retention rates in bigquery compared to GA4?

Posted by Olivia on 5 December 2022 at 1:33 am
Hey fellow data wranglers, I’m in a pickle and could really use your help!

Here’s the deal: I’ve been feeding Google Analytics 4 (GA4) data into BigQuery. Things are going decently, but I’ve hit a snag with the SQL query I crafted for the retention table. On Day 0, the number of users appears as you’d expect. It’s when we look at days after that, things get wonky – the numbers are ridiculously high!

The weird thing is that I know this doesn’t jive because when I compare it to the retention table GA4 itself kicks out or the one in AppMetrica, my numbers are off. Plus, I’ve used this very same SQL query for a couple of other apps and it worked like a charm, lining up neatly with what GA4 and AppMetrica showed. So, I don’t think this query is the issue.

What’s really bugging me is that Day 0 is alright. So I suspect something fishy is going on when we crunch the numbers with varying dates.

Here’s my SQL query, I wonder if any of you ninja coders see something I am missing or could suggest some tweaks?
```
   
WITH
    table1 AS (
    SELECT
     user_pseudo_id AS user_pseudo_id,
     MIN(event_date) AS birthday,
    FROM
     table
    WHERE
     event_name = 'first_open'
    GROUP BY user_pseudo_id
    )
    , table2 AS (
    SELECT
     r.birthday AS birthday,
     l.event_date AS event_date,
     DATE_DIFF(l.event_date, r.birthday, DAY) AS diff,
     l.user_pseudo_id AS user_pseudo_id
    
    FROM
     table AS l
    JOIN
     table1 AS r
    ON
     l.user_pseudo_id = r.user_pseudo_id
    )
    , table3 AS (
    SELECT
     birthday,
     diff,
     COUNT(DISTINCT user_pseudo_id) as count_users,
     FIRST_VALUE(CASE WHEN diff = 0 THEN COUNT(DISTINCT user_pseudo_id) END) OVER (PARTITION BY birthday ORDER BY diff ASC) AS users_day0
    
    FROM table2
    GROUP BY
     birthday,
     diff
    )
    
    SELECT
     birthday,
     diff,
     count_users,
     users_day0,
     count_users/users_day0 AS retention
    
    FROM table3
```
If any of you have faced something similar or have ideas on what could be causing this, I could really use your input! I’m all ears.
Ethan replied 1 year ago 3 Members · 2 Replies
2 Replies

John

Member
22 May 2023 at 2:23 am

Your SQL query isn’t inherently flawed, but there seems to be a mismatch between your actual data distribution and how the query processes the data. One possible reason for sky-high values from Day 1 onward might be multiple records per user in the ‘table’ you are using for the analysis. Ensure that the ‘first_open’ event is unique per user – ideally, there should only be one occurrence of ‘first_open’ per user. If there are multiple ‘first_open’ actions per user, that might create a redundancy in the ‘user_pseudo_id’ leading to inflated numbers after Day 0.

You may want to check your data source or tracking setup for any setting that might log multiple ‘first_open’ events for the same user. Another possibility could be that the ‘user_pseudo_id’ isn’t persistent for each user across different days, causing inaccurate counts of ‘unique’ users in your output.

Remember that even though the structure of your SQL query is okay, the input data quality and data tracking setup play a significant role. The query can only analyze based on the data it is given, and if that data is flawed in some way, it reflects in your analysis.
Ethan

Member
8 June 2023 at 6:35 am

One possible issue could be related to how “first_open” events are handled. If there are users who have multiple “first_open” events (which sometimes can happen in GA4 if users reinstall the app, clear their cookies, or switch devices), they would be counted multiple times in your retention calculation. This could particularly inflate numbers for days after Day 0. To fix this issue, you can either filter out duplicate “first_open” events or retain only the earliest “first_open” event per user. Another issue could be related to the window of your data. If some users have their “first_open” event close to the end of your observed period, their activities in the following days could not be captured, resulting in underestimation of retention rates. It might help if you exclude users who have their “first_open” event in the last few days of your observed period from your calculation.

What could be causing abnormal retention rates in bigquery compared to GA4?

John

Ethan