Import of historical data #1466

Closed
wants to merge 138 commits into from

Conversation

m-col
Contributor

@m-col m-col commented Nov 12, 2021

Changes

As a bit of an update, here is the current state of my work on importing data from Google Analytics (as a first step - generic CSV input to be added later).

My ongoing to do list:

  • Add import panel to general settings page
  • Use existing google auth setup to get access to GA data
  • Fetch GA data via Reporting API (currently just getting visitors for main graph)
  • Insert visitors data into new clickhouse table
  • Add rendering code to dashboard main graph
  • Add query to fetch a domain's imported visitors data in main graph's date range
  • Hook up queried data to main graph
  • Annotate graph/dashboard in some way to indicate imported data
  • Make clicking on main graph imported data indicator toggle showing imported data
  • Add way to delete imported data
  • Name imported data (may not be necessary if only one imported dataset per domain can exist at a time)* EDIT: currently only 1 import exists but it is named
  • Automatically unlink google account if it was only connected for the import Edit: the unlink button is there, so this might not be a good idea e.g. if someone imports from the wrong GA profile they'd need to connect their account again.
  • Add "unlink Google Account" button to data import settings panel
  • Import sources from GA
  • Only display imported data on dashboard when there are no filters (except time)
  • Exclude imported data when filtering for goal?
  • Correctly map referrers from google's data to plausible's data (i.e. google.com vs Google)
  • Correctly map GA deviceType/screenResolution to plausible devices
  • Check exit pages modal exit rate
  • Ensure calculations used for modal columns are correct
  • Fix rendering of bars in mediums - why are they thinner?
  • Map imported OSes to Plausible OSes (check the differences that we see and ensure we are mapping to preferable values)
  • Map imported browsers to Plausible browsers (check the differences that we see etc)
  • Drop OS and browser versions
  • Run referrers through RefInspector upon import
  • Run import in background
  • Lots of testing
  • Test for data coming in - ensure referrals are resolved etc? Devices etc are normalised?
  • What needs doing to support 'events' metric? If "events" is in the metrics list, imported data is ignored.

* We must also consider CSV-imported data. It covers a number of tables, so it will be imported from multiple CSVs. A representation of what is available, and the ability to delete individual components, may be needed so that a batch of CSVs does not have to be imported and validated all at once.
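
The two to-do items about filters and the 'events' metric amount to a small guard in front of every dashboard query. The following is only a sketch of that rule in Python, not the code from this PR: the function name and the filter key names are hypothetical, and it assumes filters arrive as a simple mapping.

```python
from typing import Iterable, Mapping

# Hypothetical names for the keys that only restrict the time range.
TIME_FILTER_KEYS = {"period", "date", "from", "to"}


def should_merge_imported(filters: Mapping[str, str], metrics: Iterable[str]) -> bool:
    """Return True when imported data may be merged into a query result."""
    # Imported tables are stored per dimension, so any non-time filter
    # cannot be applied to them.
    has_other_filters = any(key not in TIME_FILTER_KEYS for key in filters)
    # Per the to-do list: if "events" is requested, imported data is ignored.
    return not has_other_filters and "events" not in set(metrics)
```

For example, `should_merge_imported({"period": "30d"}, ["visitors"])` is true, while adding a `source` filter or the `events` metric makes it false.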

Tests

  • Automated tests have been added

Changelog

  • Entry has been added to changelog

Documentation

  • Docs have been updated

@m-col
Contributor Author

m-col commented Nov 17, 2021

What do you think would be a nice way to show that blue is plausible and red is imported? I was thinking a legend right above the graph, below the top stats, and then if the colours are consistent throughout the dashboard then that might also make it clear enough when there is imported data displayed in the other panels.
image

@metmarkosaric
Copy link
Contributor

do we need to show that difference at all?

we should not allow any imported data for the period after your Plausible stats have started tracking.

the graph and presentation can stay the same for both first party and imported data.

the best way to show that there's some imported data in any specific view would be to introduce a little "i" icon perhaps next to the download icon that tells something like "this date range includes some imported data". from here we can link to the docs that tells a bit more about the imported data and what it lacks compared to the first party data. this "i" only shows up in views that feature imported data and is not there otherwise.

seems clean and minimal way of doing this. what do you think?

we just need to understand what the actual differences in usage there will be between first party and imported data. like for instance in the "month to date" view will people be able to click on a day that has imported data only like they can on a normal day that has Plausible data? if that feature is there, then at least for the top chart the third party data has a parity with the first party data so no real reason to visually differentiate between them

@m-col
Contributor Author

m-col commented Nov 18, 2021

we should not allow any imported data for the period after your Plausible stats have started tracking.

Ah sorry, I should have clarified: currently data is only imported up to the date of the first plausible data point. For testing I have disabled that so I could visualise things better, and that's what is in the image. The red and blue lines will never overlap in time.
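
The cutoff described here can be pictured as clamping the requested date range. This is a sketch only, not the PR's code: the function name is hypothetical, and it assumes the cutoff is exclusive, i.e. the import ends the day before the first native data point.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def import_date_range(
    ga_start: date, ga_end: date, first_native: Optional[date]
) -> Optional[Tuple[date, date]]:
    """Clamp the GA range so that it ends before native stats begin."""
    if first_native is not None:
        # Assumption: the day of the first native data point is not imported.
        ga_end = min(ga_end, first_native - timedelta(days=1))
    if ga_end < ga_start:
        return None  # native data already covers the whole requested range
    return ga_start, ga_end
```

With this rule the imported and native series cannot overlap, whatever range the user asks for.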

the graph and presentation can stay the same for both first party and imported data.

Do you mean that they would form the same line (in the case of the visitors graph), and be counted together into the same counts in the other tables? I think this might work for some metrics (like the visitors graph), but for others it might not make sense from a statistics point of view. However, I can try to combine the stats for each panel as I add it to the feature, keeping this in mind, so for the time being we can assume they will be merged.
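
For the visitors graph, merging into one line could look roughly like the sketch below. It assumes each source is a plain mapping from ISO date string to visitor count, which is an illustration rather than the PR's actual data shape, and the function name is hypothetical.

```python
from collections import Counter
from typing import Dict


def merge_visitors(native: Dict[str, int], imported: Dict[str, int]) -> Dict[str, int]:
    """Combine two {iso_date: visitors} series into one, ordered by date."""
    totals = Counter(imported)
    # Counter.update adds to existing counts instead of overwriting them.
    totals.update(native)
    # ISO 8601 date strings sort chronologically when sorted as text.
    return dict(sorted(totals.items()))
```

Because the two series never cover the same day, each day's value simply passes through. The addition only matters if an overlap is introduced, and summing unique visitors across sources would then overcount.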

seems clean and minimal way of doing this. what do you think?

Yeah that's nice!

we just need to understand what the actual differences in usage there will be between first party and imported data. like for instance in the "month to date" view will people be able to click on a day that has imported data only like they can on a normal day that has Plausible data? if that feature is there, then at least for the top chart the third party data has a parity with the first party data so no real reason to visually differentiate between them

That would be ideal, and what I'm aiming for :)

@metmarkosaric
Contributor

That's great! And yeah, we keep imported data consistent visually with the native data. No differences at all in the visual presentation (same line, color, font etc). Then we describe any possible drawbacks with the third party data in the docs. Say "third party data cannot be aggregated with native data when we do the calculation for visit duration" or whatever the drawbacks end up being in the final implementation.

@ACPK

ACPK commented Nov 23, 2021

@metmarkosaric @m-col - How does this work for events that are sent to GA (ex: "viewed product", "added product to shopping cart", "made purchase of multiple items")?

FYI - We've manually imported this historical data to Clickhouse via CSV imports. One pain has been importing the event names to "goals" because the events are in Clickhouse and the goals table is in Postgres. As such, I've been exporting the goal list grouped by goal name to CSV and then importing the unique goal names to Postgres. My colleague asked why the goals table cannot show all "goals" directly from Clickhouse grouped by goal name.

@m-col
Contributor Author

m-col commented Nov 29, 2021

To update: the current state has all of the required data being imported from GA, and all of it is being merged into the plausible dashboard in the corresponding panel and tab. There are still a few issues that need ironing out. I've tried to keep the checkbox list on the first post up to date so they should be listed there. There are a few questions which would help to improve/fix the implementation.

Currently plausible distinguishes between 4 device types in the dashboard: desktop, laptop, tablet, mobile. Google Analytics uses 3: desktop, tablet and mobile. However I think the definitions are the same: An extract from the Plausible docs (link):

Shows the width of the screens used by your visitors. We measure the width of the browser window where your site is actually rendered rather than the full screen width. Anything under 576px screen size is considered a mobile device, up to 992px is considered a tablet, up to 1440px is considered a laptop and anything above 1440px is considered a desktop.

"Screen size" seems misleading here, because both the device screen resolution and the web page's viewport are valid and attainable metrics for a session, and "screen size" implies the former. GA exposes both screen size and browser size. The screen size thresholds that distinguish device types also seem odd. My low-end 5+ year old phone has a 1080x2160 screen size, and my laptop is 1920x1080. These would be considered a tablet and desktop according to plausible. It's possible that I've misunderstood how the calculations are done, or how the data is reported by the browser though! The relevance for this PR is that I'm wondering what the best way is to get the equivalent from GA. The options are the screen size or browser size, but also device category directly. That latter one sounds like what the plausible "screen size" is meant to be, so I wonder how they are deducing that information.

Regarding locations: with regions and cities on the way, it might make sense to also fetch these from GA at this point. They would likely have to be fetched by the time this PR is merged, even if cities and regions aren't yet tracked by plausible, as the import is a one-shot "get everything then you no longer need your google account" feature. Think I should work that into the clickhouse table now, and leave it out of the queries etc, or just leave it out altogether?

@ACPK The plans for our import are to consider each dimension individually, which makes import/export much easier, but that means that filtering and goals won't really be compatible. I tried importing everything into one table to enable this but the data can't be fetched all together from google (they limit what can be requested per query), leading to possibly duplicated visitor counts. Similarly data is exported spread across individual CSVs from Fathom (like Plausible) and this imported data also cannot be filtered.

@ACPK

ACPK commented Nov 30, 2021

@m-col
Contributor Author

m-col commented Nov 30, 2021

It's using Reporting API v4, which has equivalent methods and they have the same limits on dimensions etc. Is there an advantage to using Google Analytics Data API v1 (GA4) that I may have overlooked?

@ukutaht
Contributor

ukutaht commented Nov 30, 2021

About devices

Yes I think your criticism of the screen size thing in Plausible is correct. We are planning to stop using the viewport size and instead infer the device type from the User-Agent in the future. I believe this is what GA's deviceCategory means as well. So I think we should import deviceCategory from GA.

We probably have to get rid of the 'laptop' category when we use the user-agent anyways, so it's OK if the import doesn't have any data for 'Laptop'.
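
To make the difference concrete, here is a sketch of the two rules side by side: the viewport thresholds quoted from the docs earlier in this thread, and a direct translation of GA's deviceCategory. The function names are hypothetical. Treating "up to" as inclusive, the capitalised labels, and the empty-string fallback for unknown values are all assumptions, not something the docs or this PR state.

```python
def device_from_viewport(width_px: int) -> str:
    """Bucket a viewport width using the thresholds quoted from the docs."""
    if width_px < 576:
        return "Mobile"
    if width_px <= 992:  # assumption: "up to 992px" includes 992
        return "Tablet"
    if width_px <= 1440:  # assumption: "up to 1440px" includes 1440
        return "Laptop"
    return "Desktop"


# The three categories GA is said to use in this thread.
GA_DEVICE_CATEGORIES = {"desktop": "Desktop", "tablet": "Tablet", "mobile": "Mobile"}


def device_from_ga(device_category: str) -> str:
    """Translate GA's deviceCategory; unknown values become an empty string."""
    return GA_DEVICE_CATEGORIES.get(device_category.strip().lower(), "")
```

Nothing in the GA mapping can produce "Laptop", which is why imported rows would never fall into that category.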

Locations

Yeah, since the import cannot be run again in the future, it's best to get all of the data in one go. It would be great if GA could export the same identifiers that we use - ISO 3166-2 for regions and geoname_id for city. I just checked and that does not seem to be the case.

@m-col how are you dealing with countries at the moment? Are we getting it as a name or as an identifier from GA? How is it merged with our own data?

@m-col
Contributor Author

m-col commented Nov 30, 2021

@m-col how are you dealing with countries at the moment? Are we getting it as a name or as an identifier from GA? How is it merged with our own data?

Countries are being fetched as countryIsoCode: "Users' country's ISO code (in ISO-3166-1 alpha-2 format), derived from their IP addresses or Geographical IDs. For example, BR for Brazil, CA for Canada." link.

Regions can also be fetched in the format we want: "Users' region ISO code in ISO-3166-2 format, derived from their IP addresses or Geographical IDs." (ga:regionIsoCode).

City is less clear: we have ga:cityId: "Users' city ID, derived from their IP addresses or Geographical IDs. The city IDs are the same as the Criteria IDs found at https://developers.google.com/analytics/devguides/collection/protocol/v1/geoid." And also just the name ga:city: "Users' city, derived from their IP addresses or Geographical IDs.".
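
For reference, a Reporting API v4 `reports:batchGet` request body that asks for these location dimensions might be built as in the sketch below. The function name, the choice of `ga:users` as the metric and the inclusion of `ga:date` are assumptions made for illustration. The thread does not show the request the PR actually sends.

```python
def build_locations_request(view_id: str, start_date: str, end_date: str) -> dict:
    """Build a Reporting API v4 batchGet body for country and region data."""
    return {
        "reportRequests": [
            {
                "viewId": view_id,
                # Dates are strings in YYYY-MM-DD form.
                "dateRanges": [{"startDate": start_date, "endDate": end_date}],
                # Assumption: visitors are taken from ga:users.
                "metrics": [{"expression": "ga:users"}],
                "dimensions": [
                    {"name": "ga:date"},
                    {"name": "ga:countryIsoCode"},
                    {"name": "ga:regionIsoCode"},
                ],
            }
        ]
    }
```

The resulting dictionary can be serialised with `json.dumps` and sent as the POST body. Authentication and paging are left out of the sketch.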

@ukutaht
Contributor

ukutaht commented Nov 30, 2021

That's great for countries and regions!

It does look like cities might be a pain. Let's see what we can do but even if they're missing from the import it's not a huge deal I think

@m-col
Contributor Author

m-col commented Dec 21, 2021

There are some updates I need to make in line with some of the new features, but in its current form it is working 100%!

I'm first going to rebase and make those updates but will then be focussing on getting tests written for all of the changes. Any review/comments on its current form would be appreciated!

@m-col
Contributor Author

m-col commented Dec 21, 2021

The import is not yet run in the background. It's convenient for development for it to be synchronous so I'm leaving that until the end.

@ukutaht
Contributor

ukutaht commented Dec 22, 2021

Sweet! Sounds good @m-col

@m-col m-col force-pushed the import branch 2 times, most recently from 69d408b to 195d5e9 Compare January 3, 2022 12:02
@bundlemon

bundlemon bot commented Jan 3, 2022

BundleMon

Files updated (1)
Path | Size | Limits
static/js/dashboard.js | 284.2KB (+305B +0.1%) | -

Unchanged files (6)
Path | Size | Limits
static/css/app.css | 514.8KB | -
static/js/app.js | 12.13KB | -
static/js/embed.host.js | 5.58KB | -
static/js/embed.content.js | 5.06KB | -
tracker/js/plausible.js | 750B | -
static/js/applyTheme.js | 314B | -

Total files change +305B +0.04%

Final result: ✅


@m-col
Contributor Author

m-col commented Jan 3, 2022

Quick update: the PR is rebased and updated such that utm_term and utm_content are imported and merged, as are regions. Cities are added to the imported_locations table but are not imported from GA. As discussed on matrix, this is due to the GA city data not being compatible with the city data used by plausible, and so the data cannot be merged. The field is kept in the table for future imports from other sources.

I am now continuing work on fixing current tests + adding new ones.

Contributor

@ukutaht ukutaht left a comment

Looks good overall.

I'm concerned about the big change from strings to atoms for some metrics. The state it's currently in feels very error-prone. Maybe the right solution is to move completely to using atoms.

Missing from my perspective:

  1. Tests
  2. Run the import job in the background

I think it would be useful to show an indicator to the user for what time period we will import. Currently I don't know if the user gets any feedback about that.

Review threads (all outdated and resolved):

  • lib/plausible/google/api.ex (3 threads)
  • lib/plausible/imported/browsers.ex (2 threads)
  • lib/plausible/stats/base.ex (3 threads)
  • lib/plausible/stats/breakdown.ex (1 thread)
  • lib/plausible_web/controllers/api/stats_controller.ex (1 thread)

@m-col
Contributor Author

m-col commented Jan 4, 2022

I'm getting some incompatibilities with the countries and regions from data imported from GA, which allegedly follows the ISO standard. Still investigating whether the issue is with GA or Location. The patch to ensure the query doesn't fail is the commit 4f22c33

@m-col m-col marked this pull request as ready for review January 4, 2022 16:44
@ukutaht
Contributor

ukutaht commented Jan 5, 2022

The ISO standard changes all the time. My local region had its ISO code change as recently as 2019. Maybe we need some mappings from older to newer ones.
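
Such a mapping could be as small as a lookup table applied before region codes are stored or joined. This is a sketch only: the function name is hypothetical and the codes in the table are placeholders, not real ISO 3166-2 entries.

```python
from typing import Mapping

# Placeholder entries only: "XX-01" and "XX-A1" are not real ISO 3166-2 codes.
LEGACY_REGION_CODES = {"XX-01": "XX-A1"}


def upgrade_region_code(code: str, legacy: Mapping[str, str] = LEGACY_REGION_CODES) -> str:
    """Return the current code for a retired one; pass other codes through."""
    return legacy.get(code, code)
```

Codes that are not in the table are returned unchanged, so an incomplete mapping never makes matters worse than having no mapping.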

I like the patch, we had some cases on prod as well so this is useful.

@m-col
Contributor Author

m-col commented Jan 5, 2022

I like the patch, we had some cases on prod as well so this is useful.

Happy to submit it as a standalone PR if you'd like it sooner.

@ACPK

ACPK commented Jan 5, 2022

@m-col Will importing CSVs be part of this PR or an additional PR?

@m-col
Contributor Author

m-col commented Jan 6, 2022

@m-col Will importing CSVs be part of this PR or an additional PR?

It won't be part of this PR. Importing via CSVs is something we have discussed and are open to adding but there are no immediate plans to implement it.

@m-col
Contributor Author

m-col commented Jan 15, 2022

I'm seeing all region values fetched from GA being (not set). This is the ga:regionIsoCode metric. Fetching ga:regionId (a non-standard ID like their city IDs) in the same request shows that most rows do in fact have a non-null region value. This means that for whatever reason GA doesn't want to export ISO standard region data. I think it would be good to leave that metric in the request in case it changes in the future though; it doesn't really cost anything to include it.
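
If the field stays in the request, the (not set) placeholder has to be treated as missing data before rows are inserted. A minimal sketch of that step follows. The function name is hypothetical, and storing missing values as an empty string is an assumption.

```python
from typing import Optional

GA_NOT_SET = "(not set)"  # placeholder GA returns when a value is unavailable


def normalise_ga_value(value: Optional[str]) -> str:
    """Turn GA's "(not set)" placeholder (or None) into an empty string."""
    if value is None:
        return ""
    cleaned = value.strip()
    return "" if cleaned == GA_NOT_SET else cleaned
```

Rows normalised this way keep their visitor counts and simply carry an empty region, so the import would start picking up real codes if GA ever begins returning them.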

@ukutaht
Contributor

ukutaht commented Jan 15, 2022

That's disappointing but yeah, let's attempt to import in case it changes

@m-col
Contributor Author

m-col commented Feb 26, 2022

Rebased.

What kind of timeline do you envisage for merging this feature?

@ukutaht
Contributor

ukutaht commented Feb 28, 2022

Thanks @m-col. The plan is to integrate it this week and start testing with customers next week.

@ukutaht
Contributor

ukutaht commented Mar 10, 2022

Thanks for all the work @m-col. We'll do some internal testing tomorrow and real user testing next week.

@ukutaht
Contributor

ukutaht commented Mar 10, 2022

This was completed in #1753

@ukutaht ukutaht closed this Mar 10, 2022