remove analytics #408

Merged
merged 1 commit into from
May 15, 2021

ivanov
Member

@ivanov ivanov commented May 15, 2021

per discussion a year ago on the steering council list
https://groups.google.com/u/1/g/jupyter-steering/c/7j7F0lyQY84/m/Ch9Qj9nMAAAJ

@choldgraf
Collaborator

Is there a different way to keep track of traffic on the website? (Sorry if that was discussed in the steerco conversation, that link isn't available to non members)

@ivanov
Member Author

ivanov commented May 15, 2021

Hi @choldgraf! Here's some context: a little over a year ago, a thread came up on the Jupyter Steering Committee list about removing Google Analytics from ipython.org, specifically pointing to this article. There was agreement to do so, which is what happened in ipython/ipython-website#150, and also a proposal to do the same for jupyter.org, with some discussion about possible alternatives to GA that could be used instead. So far, no one has stepped up to do that work, but in the meantime jupyter.org has continued to track users. I have just pinged the @jupyter/steeringcouncil on the list, pointing to this PR and encouraging further discussion to take place in the open here.

@fperez
Member

fperez commented May 15, 2021

+1 from me - in particular, the cookies point seems like a clear violation with the current setup. Basically we're collecting cookies without offering either disclosure or opt-out options. These days just about every website has at least offered a consent/opt-out pop-up (which I always open to deactivate all I can).

At first I thought we should try to add an alternative option as part of the removal, but after reading the article it seems pretty clear to me that right now, if we wanted to keep Google Analytics at all, we'd have to at least add the tools for cookie consent/opt-out.

Since we're obviously not going to do that (it would probably be ~ as much work as replacing GA with something less intrusive), we might as well get rid of GA immediately.

And then, we have an issue of how to do the work to add back some minimal analytics with a different tool - I'm all for at least tracking basic metrics of site access b/c that's useful to know, but we should do it with something better than GA.

In summary, +1 for this PR in its current form, even if it leaves us without analytics for now. The cookies/GDPR argument I think is very strong and calls for immediate action.

Contributor

@blink1073 blink1073 left a comment

Thank you!

@SylvainCorlay SylvainCorlay merged commit 4448f38 into jupyter:master May 15, 2021
@minrk
Member

minrk commented May 15, 2021

Rough website analytics has just come up in requested information from the CZI proposals, so it would be nice to deploy something like matomo, which we do for mybinder.org. Matomo lets you do the kind of much more crude, privacy-respecting analytics that we are actually interested in (very coarse hit counter with some vague geographic distributions), and when self-hosted we don't need the "trust a third party to tell the truth" part.

FWIW, I believe the {storage: 'none'} option meant we shouldn't have been setting a google analytics cookie, but that presumably results in some fingerprint-based tracking that's arguably less "identifiable," but not less gross.

It's relatively easy for mybinder.org to deploy matomo, since that's already a big kubernetes application, so one more pod and a GCP-managed SQL server is no big deal. But for what's so far a static web site, adding a persistent managed server is a big step in continuous maintenance and cost. We could pay for matomo, but at our current traffic level of ~2M pageviews/month, that would be around $500/month. Plausible Analytics (the source of that "don't use GA" article) would be closer to $70/month. I don't know anything about it, but would be happy to give it a try, and less than $1k/year is probably fine for us? Self-hosting matomo (or plausible) would likely not be a lot cheaper than that (possibly more), especially taking maintenance time into account.
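
For reference, the annual figures implied by the monthly prices quoted above can be checked with a quick sketch. The dictionary keys are illustrative labels, and the prices are the rough estimates from this comment, not vendor quotes:

```python
# Approximate monthly prices quoted in the comment above (May 2021 estimates
# at ~2M pageviews/month). The key names are illustrative labels only.
MONTHLY_COST_USD = {"matomo_hosted": 500, "plausible": 70}


def annual_cost(monthly_usd: int) -> int:
    """Convert a monthly price into a yearly one."""
    return monthly_usd * 12


annual = {name: annual_cost(price) for name, price in MONTHLY_COST_USD.items()}
for name, cost in annual.items():
    print(f"{name}: ${cost}/year")
```

Under these assumed prices, only the Plausible estimate stays below the $1k/year mark mentioned above.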

We also have Google Analytics sprinkled on some other sites like nbviewer and readthedocs, with cookies disabled where possible, so we might not be done stripping this out.

@choldgraf
Collaborator

I'm pretty sure we still use GA on mybinder.org as well, it is the easiest way to demonstrate impact of that service, IMO. It would be great if we can have an alternative strategy for how to demonstrate "who is using the jupyter website, where are they coming from, etc" because this can be very important in grant proposals.

Also while I generally agree that we should try to move away from Google analytics, I didn't realize that the article linked above was from a company that directly competes with Google Analytics. Does anybody know of a good writeup from someone without an obvious conflict of interest?

@ivanov
Member Author

ivanov commented May 16, 2021

Does anybody know of a good writeup from someone without an obvious conflict of interest?

I think "conflict of interest" does not properly describe the linked article. You can say the piece may be biased due to the vested interest of the authors - but everything here is above board and transparent - it's literally on their company blog.

I think you'd be hard-pressed to find someone describing the wound without proposing a salve. Surely we can separate the two, we don't have to buy the salve to acknowledge the wound. There's company A that makes its money selling advertisements that provides an analytics service for free, thereby increasing the quality and quantity of their product (ad space and eyeballs). And there's company B that makes its money charging for an analytics service describing the hidden costs associated with using Company A's free analytics service, "surveillance capitalism" being chief among them. To their credit, company B's article does not limit the salve to their own paid solutions: this is what gives the article credibility. They point the reader to alternatives: using server logs directly, a competitor company C, and even an alternative way to get relevant data from company A, before proposing company B's own offering.

Getting back to "conflict of interest" - I think the kind of article you seek, "good writeup from someone without an obvious [vested] interest" would be the kind of article that is susceptible to having a conflict of interest. Suppose someone unaffiliated with company A writes an article "Don't believe the haters: free analytics from company A is just fine." You won't know if that article was written because the author was approached and compensated by Company A for the piece, is trying to get a job there or has a family member who works there, or owns a bunch of company A stock, and so on.

@choldgraf
Collaborator

Sorry if my statement came across as strongly skeptical - a better question would have been asking about a writeup of many alternatives that didn't also come from one of those alternatives. I am just bummed that we don't know how much traffic the site is getting anymore, which pages, etc, and trying to figure out if there's another easy option (I asked in the Plausible repo if they had plans for an open source plan, but no dice)

@slel
Contributor

slel commented May 17, 2021

Not a write-up, but just a list of analytics tools: https://github.com/0xnr/awesome-analytics

@SylvainCorlay
Member

It is general knowledge that using stock google analytics poses issues wrt GDPR.

@Carreau
Member

Carreau commented May 17, 2021

One thing, unrelated to GDPR: other analytics tools like plausible, simple-analytics and co. allow the metrics to be public (not sure about matomo), which I think is good. I believe having the community be able to look at metrics is important for them to be able to bring up issues.

Second thing: most of the above-mentioned analytics tools allow multiple domains; it could be a good idea to have an account at the numfocus level to make it easy for other/new projects that don't have many visits to ride on the plan of the bigger projects.

@ellisonbg
Collaborator

I am supportive of moving to a more privacy respecting analytics/telemetry tool, and do see the value in having some data to help demonstrate impact and reach in a quantitative manner.

@SylvainCorlay
Member

Just for reference, it appears that using Google Analytics is now really illegal in France and Austria (and other EU countries will probably follow).

https://www.lesechos.fr/tech-medias/hightech/lutilisation-de-google-analytics-enfreint-le-droit-europeen-selon-la-cnil-1386157

@fperez
Member

fperez commented Feb 14, 2022

Do we have any uses of it left still? If so, I think there's reasonable agreement on proceeding with the removal. @choldgraf is it still on Binder, you think?

This legal change raises the priority of the removal. We'll need to figure out an alternative that's not as invasive (and hopefully easy to use), as getting some analytics on usage is obviously very important both internally and for funders.

@choldgraf
Collaborator

Yes, it's still on Binder. I'll open an issue to share this context. I probably won't be able to act on anything myself this week though, as I am on vacation.

@choldgraf
Collaborator

choldgraf commented Feb 14, 2022

issue here: jupyterhub/team-compass#491

Can we find some funds to pay for something like Plausible analytics across the project? It would really be a shame if we can't keep track of which pages our users are hitting anymore. For those of us that apply for grants, it is one of the only quantifiable metrics we have to demonstrate impact and reach.

Note: I also suspect it is being used on several documentation sites that use ReadTheDocs, because they let you embed GA links directly on pages. Here's their issue where they concluded that Google Analytics wasn't an issue as long as you respected "do not track". I am not an expert in this at all, so I defer to others on how this impacts Jupyter.

@choldgraf
Collaborator

As a side-note: I'd be happy to explore whether we can use grant funds to pay for Plausible, if it would help simplify our workflows. I bet that this would be in-scope for most grants focused on supporting Jupyter. In my opinion, paying $99/mo for something is totally worth it if it means we don't have to spend ~any time thinking about maintaining it or providing access.

@fperez
Member

fperez commented Feb 15, 2022

I'm pretty sure we have the funds for something like that, but I can't confirm right now - pinging @afshin @jasongrout @Ruv7 @ellisonbg so we don't forget to check on this at the Friday call (our best place right now for things like this). Totally legitimate points Chris, thx!

@ellisonbg
Collaborator

ellisonbg commented Feb 15, 2022 via email

@jasongrout
Member

jasongrout commented Jul 15, 2022

As a follow-up: I'm sitting in a SciPy talk right now about scientific-python.org, which is (if I understand correctly) setting up a plausible instance for various projects in the community to share and use: https://views.scientific-python.org/login

@Carreau
Member

Carreau commented Jul 15, 2022

To add to Jason's message: the instance does not track users, but only page views, and it runs on a server that scientific-python.org owns and that is GDPR and European compliant.
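
As a rough illustration of what counting page views without tracking users can mean, here is a minimal sketch that tallies hits per page from web server access-log lines and never stores the client address. It assumes logs in Common Log Format; the sample lines and the `count_page_views` helper are hypothetical, and this is not how Plausible itself is implemented:

```python
import re
from collections import Counter

# Matches the request and status part of a Common Log Format line,
# e.g. "GET /index.html HTTP/1.1" 200
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def count_page_views(log_lines):
    """Count successful page views per path, ignoring who made the request."""
    views = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group("status") == "200":
            # Drop the query string so only the page path is kept.
            views[match.group("path").split("?", 1)[0]] += 1
    return views


# Hypothetical sample lines (addresses come from documentation-only IP ranges).
sample_log = [
    '203.0.113.5 - - [15/Jul/2022:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '203.0.113.9 - - [15/Jul/2022:10:00:05 +0000] "GET /index.html?ref=x HTTP/1.1" 200 512',
    '198.51.100.7 - - [15/Jul/2022:10:01:00 +0000] "GET /missing HTTP/1.1" 404 0',
]
page_views = count_page_views(sample_log)
print(page_views)
```

Because the client address is discarded, a counter like this cannot report unique visitors, only totals per page.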

@jasongrout
Member

but only page view

In the example image in the talk, it did have a statistic for "unique visitors", but I'm not sure how they do that.

@choldgraf
Collaborator

I think we should open an issue to track setting up plausible (or some other analytics tracker like matomo) for the jupyter website. I think it'd be well worth the price if we can find a way to use analytics data as part of fundraising etc.

@Carreau
Member

Carreau commented Jul 21, 2022

I think what is missing from Jason's message and my reply is that such a server is already set up, at https://views.scientific-python.org, and we can just ask them for a tracking code for jupyter.org. And really when I say "them" it's also "us", as the folks who did that are Stefan, Jarod, ...

@choldgraf
Collaborator

choldgraf commented Jul 21, 2022

@Carreau ah that is excellent, I was just asking about this in the scientific-python Discord. It seems like a good idea to me.

For what it's worth we are also adding support for Plausible to the PyData theme, so we could re-use that across the Jupyter projects:

10 participants