TL;DR: No Google Groups ingestion currently because of changes to Google Groups, causing scraping code to fail.
Discovered while trying to update dependencies.
Zero topics
Monthly pipeline processing was showing 0 topics returned:
2022/11/01 08:01:32 GOOGLEGROUPS loading golang-checkins:
2022/11/01 08:01:32 All topics captured: total topics captured are 0.
Checking the Go code for how topic counts are captured, the regex doesn't match the current Google Groups UI (there may have been some MaterialUI changes since this code was written).
So because the topic counts are 0, it's affecting loops later on (in my estimation).
Nested unit tests
Additionally, when trying to run the unit tests, it appears that running just mailinglists/ doesn't run the tests in the nested mailing list packages, so the unit tests for googlegroups weren't being run (and are currently breaking).
Failing topic unit tests
Now running the unit tests:
=== RUN TestTopicIDToRawMsgUrlMap/Pull_topic_ids_for_date
2022/11/15 22:40:43 No message ID found in topicId: 8sv65_WCOS4.
googlegroups_data_test.go:300: Result response does not match.
got: map[2018-09.txt:[]]
want: map[2018-09.txt:[https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ]]
Infinite redirects
This URL is no longer a valid URL format, as trying to curl it gets stuck in an infinite 301 redirect loop:
$ curl https://groups.google.com/forum/message/raw\?msg\=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ
<HTML>
<HEAD>
<TITLE>Moved Permanently</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF" TEXT="#000000">
<H1>Moved Permanently</H1>
The document has moved <A HREF="https://groups.google.com/forum/message/raw?msg=golang-checkins/8sv65_WCOS4/3Fc-diD_AwAJ">here</A>.
</BODY>
</HTML>
Summary
This is going to take some re-engineering to work out what's changed in the Google Groups format and get this code working again.
As part of Project OCEAN's Open Source Data Ecosystem, @nyghtowl (Xoogler) and members of the 20% Dive Crew scoped, designed, and built a data pipeline to aggregate mailing lists from multiple communities, including: Python, Angular, and Go.
This dataset was used in multiple research projects with our academic partners, including an accepted dataset track submission at MSR 2022.
As outlined by @glasnt, there are several updates that need to be made in the open source project and the GCP project to maintain this dataset. After polling our research stakeholders, we found that this dataset is not currently being used for any ongoing research project.
Any changes currently made would most likely need to be maintained with future open source dependency version changes, GCP product updates, and Google Groups API/RSS supported features.
Rather than update a project no one is using, we are going to put it all on the shelf with proper documentation for future explorers and experimentation.
Checking the Go code for how topic counts are captured, the regex doesn't match the current Google Groups UI (there may have been some MaterialUI changes since this code was written).
E.g. https://groups.google.com/g/golang-checkins shows "1–30 of 81553", where the dash is U+2013 EN DASH. The regex in getTotalTopics specifies "-" (U+002D HYPHEN-MINUS).
So because the topic counts are 0, it's affecting loops later on (in my estimation).