GitHub - rachitkinger/data_pipeline_coremeterics_to_gcs

Building data pipeline from IBM Coremetrics to GCS

This project builds redundancy in critical business reports for business continuity purposes.

Note that this does not backup all the raw data. It backs up all required reports for the time period for which they are available (Jan-2017 to Mar-2019 note that date range can be modified by using the GSheet mentioned below). These are reports that include top line metrics like pageviews, unique visitors, etc which can be broken down my dimensions that we tend to use most often at JPIMedia. These reports are daily, weekly and monthly reports.

To look at all the reports that are actually downloaded by this script visit this Google Sheet.

The script was run on a Google VM instance using the R version 3.5.3. with the following required packages:

readr
dplyr
lubridate
stringr
googlesheets

Each report was downloaded as a csv file.

The data was downloaded onto a Google Cloud Storage bucket gs://ibm_da_backup ]

Additional usage notes

Two types of reports were saved:

Raw reports (as is reports from IBM)
Clean reports (cleaned up reports which are more user friendly and ready for analysis)

The naming convention is "report_name"_"unit_of_analysis"_"frequency"_"period".csv
The raw reports are stored in a folder called raw_reports and clean reports in clean_reports.

Notes on column names

"Page Attribute: URL Top Second Category", "Top Level Category", "Top Second Level Category"

Report number 3 and 10 had a column called Page Attribute: URL Top Second Category. For an article that is published under /sport/football the value for this column would be "Sport | Football". For the sake of analysis this has been split into two columns called "Top Level Category" and "Top Second Level Category"

"Average Time On Page"

For the sake of analysis this has been converted in seconds.

Logs

The script generates two logs.

Download log - a log of all reports downloaded. It contains the following information:

Number of rows per downloaded report
Report name
Report period

Warning log - a log of all instances when a report hit its size limit, i.e., reached 50k rows. This means that the data returned by that report is no longer accurate and hence cannot be used for complete analysis.

What data not to use

This section will be populated once we know which reports hit their size limit.

Name		Name	Last commit message	Last commit date
Latest commit History 35 Commits
api_based_scripts_to_source_data_from_ibm		api_based_scripts_to_source_data_from_ibm
.gitignore		.gitignore
README.md		README.md
debugging.R		debugging.R
download_log.csv		download_log.csv
ibm_da_backup.Rproj		ibm_da_backup.Rproj
script.R		script.R
warning_log.csv		warning_log.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Building data pipeline from IBM Coremetrics to GCS

Additional usage notes

Notes on column names

"Page Attribute: URL Top Second Category", "Top Level Category", "Top Second Level Category"

"Average Time On Page"

Logs

What data not to use

About

Releases

Packages

Languages

rachitkinger/data_pipeline_coremeterics_to_gcs

Folders and files

Latest commit

History

Repository files navigation

Building data pipeline from IBM Coremetrics to GCS

Additional usage notes

Notes on column names

"Page Attribute: URL Top Second Category", "Top Level Category", "Top Second Level Category"

"Average Time On Page"

Logs

What data not to use

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages