A collection of tools to process the Google Plus-related data from Google Takeout.
Copyright (C) 2018-2020 Filip H.F. "FiXato" Slagter
With the announcement of Google+ shutting its doors for consumers, and the second announcement of Google+ expediting its shutdown, those of us who've accumulated a large collection of posts and other data over the lifetime of the platform, are in need of tools to process the data archives that Google are providing us through Google Takeout, and possibly live data through the Google+ API. With the shutdown originally being scheduled for August 2019, but following the second announcement, expedited to April 2nd, 2019, and the Google+ APIs being turned off even sooner, on Sunday March 7th, 2019 at the latest, time has ran out.
This repository will hopefully provide some of those tools.
This section is divided into two parts:
-
platform-specific instructions to get the Plexodus-Tools toolset installed
-
generic platform-independent instructions to set up Plexodus-Tools' dependencies and run its wrapper command.
Select your desired platform below, and follow its instructions:
While most versions of Android don't come with a terminal emulator, the Google Play Store does have an excellent app called Termux, which allows you to install and run various Linux applications.
-
Open the Termux App
-
Update Termux packages, by running on the command-line prompt:
pkg upgrade
- Install git and bash, by running on the command-line prompt:
pkg install git bash
- You should now be able to continue with 2. Plexodus-Tools Installation Instructions
-
Start your preferred Terminal emulator. While I personally prefer iTerm2, macOS itself already comes with Terminal.app, which should work fine as well.
-
Install Homebrew. Homebrew is the package manager used by Plexodus-Tools, and is one of the most popular CLI package managers for macOS used to install CLI tools and dependencies. On Homebrew's homepage you can find the preferred command to install Homebrew on macOS (
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
), but you can also find more advanced installation instructions on Homebrew's Installation page -
Update Bash and install Git:
brew install bash git
-
You should now be able to continue with 2. Plexodus-Tools Installation Instructions
These instructions are for Ubuntu, but should work for other versions of Linux as well, replacing apt-get with your preferred package manager.
- Install Bash and Git:
sudo apt-get install bash git
- That's it. You should now be able to continue with 2. Plexodus-Tools Installation Instructions
On the command-line prompt, run:
git clone https://github.com/FiXato/Plexodus-Tools && cd Plexodus-Tools
This will clone the source code into the Plexodus-Tools directory, and change the current working directory to this newly created directory.
Next you can run the wrapper script:
./bin/plexodus-tools.sh
This should bring up a console-based menu. Press "1" followed by your enter key to set up the required dependencies for Plexodus-Tools.
Once the Setup
task has run, you should be able to run all the other scripts without issues.
If your (Takeout) Zip-archive is greater than 2GB, and has to be extracted on a platform that doesn't support extracting zip64 files natively, I suggest you install p7zip, a port of 7-Zip for POSIX systems. On macOS the easiest would be to install it through Homebrew with brew install p7zip
. Once (p)7-Zip has been installed, you can extract all files while retaining their directory structure with:
7z x takeout-20181025T124927Z-001.zip
If you want to just extract the JSON files, you can use this instead:
7z x takeout-20181025T124927Z-001.zip '*.json' -r
To extract them to a different directory (while retaining their directory structure):
7z x takeout-20181025T124927Z-001.zip '*.json' -r -o/path/to/output/dir/
To extract them to a different directory (without creating sub-directories):
7z e takeout-20181025T124927Z-001.zip '*.json' -r -o/path/to/output/dir/
Extract all *.json
, *.html
, *.csv
, *.vcf
and *.ics
files from multi-part Zip-archives:
7z x -an -ai'!takeout-20181111T153533Z-00*.zip' '*.json' '*.html' '*.csv' '*.vcf' '*.ics' -r -oextracted/2018-11-11/
Argument | Explanation |
---|---|
e |
Extract archive into current folder, without retaining folder structure. |
x |
eXtract archive while retaining folder structure. |
t |
Test (matched) contents of archive. |
l |
List (matched) contents of archive. |
-an |
No Archive Name matching. Recommended since we're doing a 'wildcard' archive match with -ai! . |
-ai |
Use Archive Include to define the input archives. We're wrapping the (masked/wildcarded) filename in quotes to prevent shell interpretation of the exclamation (!) mark. The filename is prefixed with an exclamation ( ! ) mark to allow for wildcards with the asterisk (* ) character. |
-r |
Recurse through the archive. Needed to match the wildcard filename patterns through the entire archive. |
-o |
Specify Output path. Files will be extracted with this folder as their root folder. It should be directly followed by the path; no space between the flag ( -o ) and the path (extracted/2018-11-11 ).If the path does not exist yet, it will be automatically created. |
*.json |
apply archive operation on archived files matching *.json (JavaScript Object Notation) filename pattern. |
*.html |
apply archive operation on archived files matching *.html (HyperText Markup Language) filename pattern. |
*.csv |
apply archive operation on archived files matching *.csv (Comma Separated Values) filename pattern. |
*.vcf |
apply archive operation on archived files matching *.vcf (Virtual Contact File vCards) filename pattern. |
*.ics |
apply archive operation on archived files matching *.ics (Internet Calendaring and Scheduling) filename pattern. |
One of those tools is plexodus-tools.jq, a library of filter methods for the excellent commandline JSON processor jq. With the library you'll be able to chain filters and sort methods to limit your Google+ Takeout JSON files to a subset of data you can then pass on to other tools or services. For instance, it will allow you to limit your Activity data to just public posts, or those with comments or other interactions with one or more specific users.
It's useful to combine all the separate JSON activity files into a single JSON file:
jq -s '.' "Takeout/Google+ Stream/Posts/*.json" > combined_activities.json
If you run into the 'argument list too long' error, you can instead use this solution:
gfind 'Takeout/Google+ Stream/Posts/' -iname '*.json' -exec cat {} + | jq -s '.' > combined_activities.json
(Using gfind
rather than find
to indicate I'm using GNU's find
which I've installed through Homebrew, rather than the default BSD find
available on macOS.
This way you can directly use this single file for your future jq
filter commands.
jq -L /path/to/Plexodus-Tools/ 'include "plexodus-tools"; . ' /path/to/combined_activities.json
You specify the directory in which the plexodus-tools.jq
library is located with: -L /path/to/Plexodus-Tools/
and then load it by specifying include "plexodus-tools";
before your actual jq query.
Filter Name | Description |
---|---|
not_empty |
Exclude empty results |
with_comments |
Only return Activity results that have Comments |
without_comments |
Only return Activity results that lack any Comments |
with_image |
Only return Activity results that have an Image Attachment |
with_video |
Only return Activity results that have a Video Attachment |
with_audio |
Only return Activity results that have an Audio Attachment |
with_media |
Only return Activity results that have any kind of Media Attachment |
without_media |
Exclude Activity items without any kind of Media Attachment from the results |
has_legacy_acl |
Only return Activity results whose postAcl contains an isLegacyAcl item. |
has_collection_acl |
Only return Activity results whose postAcl contains a collectionAcl item. |
has_community_acl |
Only return Activity results whose postAcl contains a communityAcl item. |
has_event_acl |
Only return Activity results whose postAcl contains an eventAcl item. |
has_circle_acl |
Only return Activity results whose postAcl .visibleToStandardAcl contains a circles item. |
has_public_circles_acl |
Only return Public Activity results; i.e. those that have CIRCLE_TYPE_PUBLIC as visibleToStandardAcl 'circle' type. |
is_public |
For now an alias for has_public_circles_acl .Note that this might not (yet) include posts that were posted to public Collections or publicly accessible Communities; this may change in the future. |
has_extended_circles_acl |
Only return Extended Circled Activity results; i.e. those that have CIRCLE_TYPE_EXTENDED_CIRCLES as visibleToStandardAcl 'circle' type. These are posts that were set to only be visible to your 'Extended Circles'. |
has_own_circles_acl |
Only return Private Activity results; i.e. those that have CIRCLE_TYPE_YOUR_CIRCLES as visibleToStandardAcl 'circle' type. These are posts that were set to only be visible to your 'Your Circles'. |
has_your_circles_acl |
Alias for has_own_circles_acl . |
with_interaction_with(displayNames) |
Only return Activity items that have some form of interaction with users whose displayName is an exact match for one of the specified displayNames. displayNames can be either a string, or an array of strings. |
with_comment_by(displayNames) |
Only return Activity items as results when they have Comments by any of the users whose displayName is an exact match for one of the specified displayNames. displayNames can be either a string, or an array of strings. |
url_from_domain(domains) |
Only return Activity results with url items that match any of the specified domains . domains can be either a string, or an array of strings. |
from_collection(displayNames) |
Only return Activity results that were posted to a Collection of which the displayName is an exact match for one of the specified displayNames. The supplied displayNames can be either a string, or an array of strings. Note that collections by different owners could have the same name. If you only want to match activities in a specific collection, you'll have to find its resourceName and use that with the from_collection_with_resource_name(resourceNames) filter instead. |
from_collection_with_resource_name(resourceNames) |
Similar to from_collection , but rather than compare to the displayName of the Collection, compares items to the unique resourceName instead. |
from_collection_with_resource_id(resourceNames) |
Similar to from_collection_with_resource_name , but only needs a resourceId rather that the resourceName ; i.e. it doesn't need the collections/ prefix. |
sort_by_creation_time |
Sort results by the Activity's creationTime . |
sort_by_update_time |
Sort results by the Activity's updateTime . |
sort_by_last_modified |
Alias for sort_by_update_time . |
sort_by_url |
Sort results by the Activity's url item. |
sort_activity_log_by_ts |
Sort ActivityLog items by their timestampMs timestamp item. |
get_circles |
Get list of all unique circles items from the current results. |
get_all_circle_types |
Get list of all unique circle types from the current results. |
get_all_circle_display_names |
Get list of all unique circle displayNames from the current results. |
get_all_circle_resource_names |
Get list of all unique circle resourceNames from the current results. |
get_all_acl_keys |
Get list of all unique Access Control List keys from the current results. |
get_all_community_names |
Get list of all unique Community displayNames from the current results. |
get_all_collection_names |
Get list of all unique Collection displayNames from the current results. |
get_all_event_resource_names |
Get list of all unique Event resourceNames from the current results. |
get_all_media_content_types |
Get list of all unique media content-types from the current results. |
Return just the activities that are marked as 'public', and have comments by a user whose displayName is FiXato
, and sort the results by the creation time of the Actvity:
jq -L /path/to/Plexodus-Tools/ 'include "plexodus-tools"; . | is_public | with_comment_by("FiXato") | sort_by_creation_time' combined_activities.json
Return just the activities that have any kind of interaction with a users whose displayName is either FiXato
or Filip H.F. Slagter
, have some form of media attachment, and sort the results by the last modified (updateTime) time of the Actvity:
jq -L /path/to/Plexodus-Tools/ 'include "plexodus-tools"; . | with_interaction_with(["FiXato", "Filip H.F. Slagter"]) | with_media | sort_by_last_modified' combined_activities.json
Return just the activities that were posted to a Collection with the name Google Plus Revisited
and sort by their creation time:
jq -L /path/to/Plexodus-Tools/ 'include "plexodus-tools"; . | from_collection("Google Plus Revisited") | sort_by_creation_time' combined_activities.json
Get a list of all the unique Circle types in your JSON archive:
jq -L /path/to/Plexodus-Tools/ 'include "plexodus-tools";get_all_circle_types' combined_activities.json
The result of this is likely:
[
"CIRCLE_TYPE_EXTENDED_CIRCLES",
"CIRCLE_TYPE_PUBLIC",
"CIRCLE_TYPE_USER_CIRCLE",
"CIRCLE_TYPE_YOUR_CIRCLES"
]
Just the last file extension:
7z l -an -ai'!takeout-20181111T153533Z-00*.zip' | gsed -E 's/\s+/ /g' | gcut -d' ' -f1,2,3,4,5 --complement | ggrep -E -o '\.([^.]+)$' | sort -u
Up to the last 3 file extensions, of which the first 2 can be at most 4 characters long, while the last (primary) file extension can be of arbitrary length:
7z l -an -ai'!takeout-20181111T153533Z-00*.zip' | gsed -E 's/\s+/ /g' | gcut -d' ' -f1,2,3,4,5 --complement | ggrep -E -o '(\.[^.]{1,4}){0,2}\.([^.]+)$' | sort -u
Note: I'm using gsed
, gcut
and ggrep
here to indicate I'm using the GNU versions of the utilities, rather than the BSD versions supplied by macOS. These versions can be installed (and linked with g
-prefixes) through Homebrew on macOS. For instance with brew install sed cut grep
. On other platforms such as Linux and Windows Cygwin, you're likely installing the GNU versions anyway.
Aside the various Ruby scripts, this toolset also contains some self-contained Bash scripts for data retrieval and processing:
While the Google Plus APIs have been terminated on March 7th, 2019, the profile data (at least for numeric IDs) is still available through the newer Google People API. This toolset allows you to retrieve this data.
This requires an API key (or OAuth workflow). You can find official developers.google.com instructions on how to get started with the Google People API, and in particular https://console.developers.google.com/apis/credentials/key to create request an API key for your project.
You also need to set the GPLUS_APIKEY
ENVironment variable for your Google Plus API key: export GPLUS_APIKEY=Z382rQPoNMlKJIhGfEDc_Ba
(This above key value is obviously an example one; you'll need to replace them with your own actual keys.)
You just need to pass the (numeric) profile id to the people_api_data_for_gplus_profile.sh
script:
./people_api_data_for_gplus_profile.sh 123456
Passing the ID for a Custom Profile URL (e.g. +YonatanZunger for https://plus.google.com/+YonatanZunger), should also work:
./people_api_data_for_gplus_profile.sh +YonatanZunger
Even passing the URL should work:
./people_api_data_for_gplus_profile.sh https://plus.google.com/112064652966583500522
If you have a list of userIDs or profile URLs stored in memberslist.txt
, with each ID on a separate line, you can use xargs
to pass all these to the script. For instance with 3 request running in parallel, and deleting the target JSON file if a retrieval error occurs:
rm logs/failed-profile-retrievals.txt; cat memberslist.txt | xargs -L 1 -P 3 -I __UID__ ./people_api_data_for_gplus_profile.sh __UID__ --delete-target
(__UID__
will automatically be filled in by xargs)
Or leave the JSON output files intact when a retrieval error occurs, so you can debug more easily, and so profiles that no longer exist (and thus return a http 404 return code) won't be retried:
rm logs/failed-profile-retrievals.txt; cat memberslist.txt | xargs -L 1 -P 3 ./people_api_data_for_gplus_profile.sh
I would not recommend increasing the amount of parallel processes beyond 3, as you're more likely to hit User Rate Limit Exceeded errors then.
grep
is a CLI tool for regular expression-based filtering of files and data. Since the BSD version of grep that comes with macOS is rather limited, please install GNU grep instead. Not sure if it's through brew install grep
or brew install gnu-grep
.
Once installed on macOS, you should have access to GNU's version of grep via the g
-prefix: ggrep
. The Bash scripts in this repository will automatically use ggrep
rather than grep
if they find it available on your system.
sed
the Stream EDitor is used for regular expression substitution. Since the BSD version of sed that comes with macOS is rather limited, you will likely need to install GNU sed instead with brew install gsed
and replace calls to sed
with gsed
Once installed on macOS, you should have access to GNU's version of sed via the g
-prefix: gsed
. The scripts in this repository will automatically use gsed
rather than sed
if they find it available on your system.
curl
is a CLI tool for retrieving online resources. For the tools in this project, I use it to send HTTP GET requests to the APIs.
jq
is an excellent CLI tool for parsing and filtering JSON:
https://stedolan.github.io/jq/
It can be downloaded from its website, or through your package manager (for instance on macOS through brew install jq
).
- Michael Prescott for having a need for this script, and for bouncing ideas back and forth with me.
- Abdelghafour Elkaaba, for giving some ideas on how to import this back into Blogger's native comments
- Edward Morbius, for moderating the Google+ Mass Migration community on Google+
- Peggy K, for signal boosting and being a soundboard
You can follow related discussions here:
- Google+ Exodus Collection
- G+ Exodus: PSA: Google+ Comments for Blogger owners have 3 days to manually archive the comments on all their blog posts
- G+MM: You have only till February 4th to archive all the comments on your Blogger blogs
- G+ Comment on Blogger's official GoogleBlog: Lack of Migration Path
- Google+ Help Community: Where are the Google+ Comments on my Blogger blog located?
Some of the other tools will assist in converting the (filtered) data to other formats, such as for instance HTML, or possibly Atom of json-ld, for import into other platforms.
This project is licensed under the terms of the GPLv3 license.
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.