This project builds redundancy in critical business reports for business continuity purposes.
Note that this does not backup all the raw data. It backs up all required reports for the time period for which they are available (Jan-2017 to Mar-2019 note that date range can be modified by using the GSheet mentioned below). These are reports that include top line metrics like pageviews, unique visitors, etc which can be broken down my dimensions that we tend to use most often at JPIMedia. These reports are daily, weekly and monthly reports.
To look at all the reports that are actually downloaded by this script visit this Google Sheet.
The script was run on a Google VM instance using the R version 3.5.3. with the following required packages:
- readr
- dplyr
- lubridate
- stringr
- googlesheets
Each report was downloaded as a csv file.
The data was downloaded onto a Google Cloud Storage bucket gs://ibm_da_backup ]
Two types of reports were saved:
- Raw reports (as is reports from IBM)
- Clean reports (cleaned up reports which are more user friendly and ready for analysis)
The naming convention is "report_name"_"unit_of_analysis"_"frequency"_"period".csv
The raw reports are stored in a folder called raw_reports
and clean reports in clean_reports
.
Report number 3 and 10 had a column called Page Attribute: URL Top Second Category
. For an article that is published under /sport/football
the value for this column would be "Sport | Football"
. For the sake of analysis this has been split into two columns called "Top Level Category" and "Top Second Level Category"
For the sake of analysis this has been converted in seconds.
The script generates two logs.
- Download log - a log of all reports downloaded. It contains the following information:
- Number of rows per downloaded report
- Report name
- Report period
- Warning log - a log of all instances when a report hit its size limit, i.e., reached 50k rows. This means that the data returned by that report is no longer accurate and hence cannot be used for complete analysis.
This section will be populated once we know which reports hit their size limit.