This Dataform project processes various marketing data sources and creates a Marketing Data Store (MDS) to be used in several use cases:
- Retain historical marketing data
- Create high performance dashboards
- Perform Data Analytics
- Implement GenAI use cases
- Train ML models
There are currently two domains:
- Google Analytics 4
- Google Ads
All assets are tagged with domain-specific tags - "ga4" or "ads".
This project is structured to support multiple environments, typically "dev", "stage" and "prod".
Every domain consists of two or more BigQuery datasets.
"Base" dataset contains views into the source data that are used to process the source data and tables to store the processed data. These datasets are named as "marketing_base", e.g., marketing_ga4_base_dev.
"Product" dataset(s) contain only views into the base datasets. These datasets are named as "marketing_v<version_number>", e.g., marketing_g4_v1_dev.
Consumers of the domain data should only access data through the product views. This will help with maintaining stable "data interfaces" across releases of the MDS.
Source datasets and destination datasets are controled by the values in dataform.json file.
Variables:
-
ga4_export_project
- GA4 project -
ga4_export_datase
- GA4 dataset -
ga4_export_table_suffix
- typically_*
-
ga4_incremental_processing_days_back
- number of days to look back when processing incremental data. Default is 3. -
ads_export_data
- an array of datasets where Ads data is exported. See the file for the exact syntax. -
ads_metrics_lookback_days
- number of days to query and load from the Ads Data Transfer Service landing tables to the fact tables for Ads Performance and Ads Conversions. Default is 365.
Destination datasets are created in the project specified by defaultDatabase
.
Environment in the dataset name is defined by env
variable.
The shell script mde_repo_sync.sh
is provided in the root directory to assist in synchronizing changes from this repository to your own.
It assumes that you have a local directory in your development environment for both repos.
Due to limitations on how the patch process works, it is required that both directories are located in the same parent directory and that relative paths are passed to the script.
In a Linux command line, execute the script by entering:
bash mde_repo_sync.sh . ../<destination>
The script is designed to run in the source directory. Destination is the directory of the repository where you want to synchronize changes.
The script will perform the following steps:
Destination Repo Prep
- Stash any uncommitted changes in the current branch of the destination repo
- Switch to the main branch of the destination repo
- Pull any changes from the remote of the destination repo
- Create a new branch named
sync-YYYYMMDD-HHMMSS
based on the execution time that is a copy of the main branch
Source Repo Prep
- Stash any uncommitted changes in the current branch of the source repo
- Switch to the main branch of the source repo
- Pull any changes from the remote of the source repo
Diff & Patch Changes
- Execute a diff command between the two local directories.
The diff will ignore changes in
dataform.json
,.gitignore
and the.git
directory. - Write the changes to a file named
patch.file
. Note: The process will fail ff patch.file already exists. You will need to delete or move the existing file. - Patch changes to the destination directory.
- Delete
patch.file
Once the patch is completed, the changes will be uncommitted in the destination directory. You will need to do the following:
- In the destination directory, add changes using
git add .
- Commit the changes to the
sync-YYYYMMDD-HHMMSS
branch usinggit commit -m "Meaningful message"
. - Push changes to the remote or merge the changes to the main branch using your preferred deployment process.
- Delete the
sync-YYYYMMDD-HHMMSS
branch when the changes have been merged.
Assuming you have a directory structure of:
- ~/marketing-analytics-jumpstart-dataform (this repo)
- ~/our-marketing-analytics-jumpstart-dataform (a copy of this repo containing your personalized dataform environment)
If you want to synchronize changes from marketing-analytics-jumpstart-dataform
to our-marketing-analytics-jumpstart-dataform
, you would execute the following:
cd ~/marketing-analytics-jumpstart-dataform
bash mde_repo_sync.sh . ../our-marketing-analytics-jumpstart-dataform