Orchestra is not an official Google Product
- Overview
- Setting up your Orchestra environment in GCP
- Service Accounts
- Configuring Orchestra
- GMP Reporting
- Structured Data Files
- Additional info
Composer is a Google Cloud managed version of Apache Airflow, an open source project for managing ETL workflows. We use it for this solution because you can deploy your code to production simply by moving files to Google Cloud Storage. It also provides monitoring and logging, and software installation, updates and bug fixes for Airflow are fully managed.
It is recommended that you install this solution through the Google Cloud Platform UI.
We recommend familiarising yourself with Composer here.
Orchestra is an open source project, built on top of Composer, for managing common Display and Video 360 ETL tasks such as downloading Entity Read Files and uploading them to BigQuery.
It is available on github.
Below we will explain how to set up an environment for Composer, which files to use from Orchestra, and how to grant your Cloud Project access to your DV360 account.
This will create a fully managed workflow that - in our example - will import your required Entity Read Files to BigQuery.
Composer and BigQuery - two of the main Google Cloud Platform tools which Orchestra is built on - require a GCP Project with a valid billing account.
See this article on Google Cloud Billing for more information.
In your GCP Project menu (or directly through this link), access the API Library and enable the following APIs:
- Cloud Composer
- Cloud Dataproc
- Cloud Storage APIs
- BigQuery
Follow these steps to create a Composer environment in Google Cloud Platform - please note that this can take 20 to 30 minutes.
For your installation you must set your Python version to 2, and we are assuming you are using the default service account.
Environment Variables, Tags and Configuration Properties (airflow.cfg) can all be left as standard, and you can use the default values for number of nodes, machine types and disk size (you can choose a smaller disk size if you want to save on costs).
Google Cloud uses service accounts to automate tasks between services. This includes other Google services such as DV360 and CM.
You can see full documentation for Service Accounts here:
https://cloud.google.com/iam/docs/service-accounts
By default, in the IAM section of your Project you will see a default service account for Composer ("Cloud Composer Service Agent") and a default service account for Compute Engine ("Compute Engine default service account"), each with its own email address.
These service accounts have access to all Cloud APIs enabled for your project, making them a good fit for Orchestra. We recommend using the Compute Engine service account ("Compute Engine default service account") as the main "Orchestra" service account, as it is the one used by the individual Compute Engine virtual machines that will run your tasks.
If you wish to use another account, you will have to give it access to BigQuery and full permissions for the Storage APIs.
Your Service Account will need to be set up as a DV360 user so that it can access the required data from your DV360 account.
You need to have partner-level access to your DV360 account to be able to add a new user; follow the simple steps to create a new user in DV360, using this configuration:
- Give this user the email of the service account you wish to use.
- Select all the advertisers you want to be able to access
- Give **Read & Write** permissions
- Save!
Entity Read Files are large JSON files showing the state of an account. They are held in Google Cloud Storage, and access is granted via a Google Group.
The group can be found in:
Settings > Basic Details > Entity Read Files Configuration > Entity Read Files Read Google Group
Add the service account email to this Google Group so that it can read private Entity Read Files.
You can find more info on Entity Read Files access here: https://developers.google.com/bid-manager/guides/entity-read/overview.
If you intend to use many Google Groups, it is also possible to set up a single Google Group containing all the other Google Groups; you can then add the Service Account to this one group to grant access to all accounts at once.
You have now set up the Composer environment in GCP and granted the proper permissions to its default Service Account.
You're ready to configure Orchestra!
The Orchestra project will require several variables to run.
These can be set via the Admin section in the Airflow UI (accessible from the list of Composer Environments, clicking on the corresponding link under "Airflow Web server").
| Area | Variable Name | Value | Needed For |
| --- | --- | --- | --- |
| Cloud Project | gce_zone | Your Google Compute Engine zone (you can find it under "Location" in the list of Composer Environments) | All |
| Cloud Project | gcs_bucket | The Cloud Storage bucket for your Airflow DAGs (you can find a link to the bucket in the Environments page - see Image 1) | All |
| Cloud Project | cloud_project_id | The Project ID you can find in your GCP console homepage | All |
| BigQuery | erf_bq_dataset | The name of the BigQuery dataset you wish to use - see Image 2 and documentation here | ERFs |
| DV360 | partner_ids | The comma-separated list of partner IDs from DV360, used for Entity Read Files | All |
| DV360 | private_entity_types | A comma-separated list of Private Entity Read Files you would like to import | ERFs |
| DV360 | sequential_erf_dag_name | The name of your DAG as it will show up in the UI. Name it whatever makes sense for you (alphanumeric characters, dashes, dots and underscores exclusively) | ERFs |
| DV360 | dv360_sdf_advertisers | Dictionary of partners (keys) and advertisers (values) which will be used to download SDFs. Initially you can set the value to: {"partner_id": ["advertiser_id1", "advertiser_id2"]} and use the dv360_get_sdf_advertisers_from_report_dag DAG to update it programmatically | SDFs |
| DV360 | dv360_sdf_advertisers_report_id | DV360 report ID which will be used to get a list of all active partners and advertisers. Initially, you can set the value to: 1 and use the dv360_create_sdf_advertisers_report_dag DAG to update it programmatically | SDFs, Reports |
| DV360 | number_of_advertisers_per_sdf_api_call | Number of advertiser IDs which will be included in each call to the DV360 API to retrieve SDFs. Set the value to: 1 | SDFs |
| DV360 | sdf_api_version | SDF version (column names, types, order) in which the entities will be returned. Set the value to: 4.2 (no other versions are currently supported) | SDFs |
| BigQuery | sdf_bq_dataset | The name of the BigQuery dataset you wish to use to store SDFs | SDFs |
| BigQuery | sdf_file_types | Comma-separated list of SDF types that will be returned (e.g. LINE_ITEM, AD_GROUP). Currently, this solution supports: LINE_ITEM, AD_GROUP, AD, INSERTION_ORDER and CAMPAIGN | SDFs |
Image 1: [screenshot: the Composer Environments page, showing the link to the DAGs Cloud Storage bucket]
Image 2: [screenshot: the BigQuery dataset to be used for Entity Read Files]
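If you prefer to set these variables programmatically rather than one by one in the Airflow UI, here is a minimal sketch using Airflow's Variable model. It assumes you run it where the Airflow libraries and metadata database are available (e.g. inside the Composer environment), and every value shown (project ID, bucket, partner and advertiser IDs, dataset names) is a placeholder for illustration only.

```python
# Minimal sketch: set the Orchestra variables programmatically.
# All values below are placeholders; replace them with your own.
import json

from airflow.models import Variable

ORCHESTRA_VARIABLES = {
    "gce_zone": "europe-west1-b",
    "gcs_bucket": "your-composer-dags-bucket",
    "cloud_project_id": "your-gcp-project-id",
    "erf_bq_dataset": "erf_dataset",
    "partner_ids": "1234567,2345678",
    "private_entity_types": "LineItem,Campaign",
    "sequential_erf_dag_name": "erf_to_bq_dag",
    # Stored as a JSON string; the SDF DAGs read it back as a dictionary.
    "dv360_sdf_advertisers": json.dumps({"1234567": ["11111111", "22222222"]}),
    "dv360_sdf_advertisers_report_id": "1",
    "number_of_advertisers_per_sdf_api_call": "1",
    "sdf_api_version": "4.2",
    "sdf_bq_dataset": "sdf_dataset",
    "sdf_file_types": "LINE_ITEM,AD_GROUP",
}

for name, value in ORCHESTRA_VARIABLES.items():
    Variable.set(name, value)
```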
As with any other Airflow deployment, you will need DAG files describing your workflows to schedule and run your tasks, plus hooks, operators and other libraries to help build those tasks.
You can find the core files for Orchestra in our github repository: clone the repo (or directly download the files) and you will obtain the following folders:
- dags: includes a sample DAG file to upload multiple partners ERF files from the Cloud Storage Bucket to BigQuery
- hooks: includes the hooks needed to connect to the reporting APIs of GMP platforms (CM and DV360)
- operators: includes two subfolders for basic operators for CM and DV360 APIs, respectively
- schema: includes files describing the structure of most CM and DV360 entities (useful when creating a new report or providing the schema to create a BigQuery table)
- utils: a general purpose folder to include utility files
You can then design the dags you wish to run and add them to the dags folder.
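To illustrate what a custom DAG file dropped into the dags folder could look like, here is a minimal sketch. It is not one of the files shipped with Orchestra and only uses core Airflow operators plus the variables configured above; the DAG name, schedule and placeholder task are assumptions, so use the sample DAG in the repository as your actual reference for the Orchestra operators.

```python
# Minimal sketch of a custom DAG file for the dags folder.
# Replace the placeholder task with the Orchestra operators you need.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "airflow",
    "start_date": datetime(2019, 1, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}


def print_configured_partners(**context):
    """Placeholder task: log the partner IDs configured for Orchestra."""
    partner_ids = Variable.get("partner_ids").split(",")
    print("Configured partner IDs: %s" % partner_ids)


dag = DAG(
    "my_custom_orchestra_dag",  # placeholder DAG name
    default_args=default_args,
    schedule_interval="@daily",
)

print_partners = PythonOperator(
    task_id="print_configured_partners",
    python_callable=print_configured_partners,
    provide_context=True,
    dag=dag,
)
```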
Upload all the DAGs and other required files to the DAGs Storage Folder that you can access from the Airflow UI.
This will automatically generate the DAGs and schedule them to run (you will be able to see them in the Airflow UI).
From now on, you can use (the Composer-managed instance of) Airflow as you normally would, including the different available functionalities for scheduling, troubleshooting, …
With the sample DAG provided, if all proper accesses have been granted to the Service Account, you will be able to see the results directly in BigQuery: in the Dataset you've selected in the corresponding Variable, you will find different tables for all the Entity Read Files entities that you've chosen to import.
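As a quick check of the output, you can query one of the imported tables with the BigQuery client library. The sketch below is only an example: the project ID, dataset name and table name (a LineItem table produced from the LineItem Entity Read File) are assumptions you should adapt to the entity types you chose to import.

```python
# Minimal sketch: inspect one of the Entity Read File tables imported into
# BigQuery. Project, dataset and table names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="your-gcp-project-id")  # placeholder

query = """
    SELECT COUNT(*) AS row_count
    FROM `your-gcp-project-id.erf_dataset.LineItem`
"""

for row in client.query(query).result():
    print("Imported LineItem rows: %d" % row.row_count)
```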
Congratulations!
The example workflow we've just set up (importing Entity Read Files from Cloud Storage to BigQuery) doesn't require access to DV360 (or, in general, GMP) reports, but that's a task you might need in other workflows. For instance, rather than (or in addition to) Entity Read Files data, you might want to add aggregated performance data to BigQuery.
In order to do this, you'll need to set up a Connection to GMP reporting (i.e. specify the Service Account credentials to be used with the GMP APIs) and then create a report (and collect its results).
An Airflow Connection to the GMP Reporting API is needed for the tasks which will collect DV360 (or CM) reports.
First of all, you will need to enable your Service Account to access GMP Reporting API (and the DV360 Reporting API in particular):
- From the API & Services > Library menu in the GCP console, look for and enable the DoubleClick Bid Manager API (DoubleClick Bid Manager is the former name of DV360)
- If necessary, also enable the DCM/DFA Reporting And Trafficking API and/or the _DoubleClick Search API_ for CM and SA360 reporting respectively.
- From the IAM & admin > Service Accounts menu in the GCP console, **look for the Compute Engine default service account** (or your custom Service Account if you aren't using the default one) and click on the three-dots button under "Action" to Create a key. Pick the JSON option and store the file securely.
- Upload the JSON keyfile you've just downloaded to the Storage Bucket linked to your Composer environment (the same bucket where you're uploading DAG and other python files, but in another subfolder - e.g. "data")
You are now ready to access the Connections list in the Airflow UI (Admin > Connections) and click on Create.
Use the following values (please note that the list of fields changes depending on the "Connection Type" you select, so don't worry if you don't see these exact fields initially):
| Field | Value |
| --- | --- |
| Conn Id | gmp_reporting |
| Conn Type | Google Cloud Platform |
| Project Id | [Your Cloud Project ID] |
| Keyfile Path | The path to the JSON keyfile you've uploaded to the Storage Bucket in the previous steps. In particular, if you have uploaded your keyfile to a "data" folder, enter: "/home/airflow/gcs/data/[keyfile_name].json" |
| Keyfile JSON | [empty] |
| Scopes (comma separated) | https://www.googleapis.com/auth/doubleclickbidmanager (plus, if necessary, any other scopes required by the additional APIs you have enabled) |
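To see what this connection provides, here is a minimal sketch of building a DV360 (DoubleClick Bid Manager) reporting client directly from the same JSON keyfile and scope, using the google-auth and google-api-python-client libraries. The keyfile path is the placeholder from the table above; this is only a sanity check outside of Orchestra's own hooks.

```python
# Minimal sketch: build a DoubleClick Bid Manager (DV360 reporting) client
# from the service account keyfile referenced by the Airflow connection.
from google.oauth2 import service_account
from googleapiclient.discovery import build

KEYFILE_PATH = "/home/airflow/gcs/data/keyfile.json"  # placeholder path
SCOPES = ["https://www.googleapis.com/auth/doubleclickbidmanager"]

credentials = service_account.Credentials.from_service_account_file(
    KEYFILE_PATH, scopes=SCOPES)

# 'doubleclickbidmanager' v1 is the API used for DV360 reporting.
dbm_service = build("doubleclickbidmanager", "v1", credentials=credentials)

# List the queries (reports) the service account can see, as a quick check.
response = dbm_service.queries().listqueries().execute()
print(response.get("queries", []))
```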
You can follow these simple steps to have your Service Account create a DV360 report, so that a subsequent task can, following our example, collect the report result and push it to BigQuery.
It's important that the Service Account creates the report because if you create it directly in the DV360 UI the Service Account won't be able to access the resulting files!
The Service Account needs to have read access to the Partners/Advertisers you're running reports for.
In the example below we provide a DAG file (dv360_create_report_dag.py) which lets you manually launch the corresponding **DV360_Create_Query** DAG to create a new report; in order to do that, you must first configure which kind of report you want to create.
To do this, you will need to add and populate a specific Variable in the Airflow UI, called dv360_report_body, which corresponds to the "body" of the DV360 query to be created.
Comprehensive documentation on how this object can be populated with Filters, Dimensions (GroupBys), Metrics can be found here, and we suggest you first test your request through the API Explorer: https://developers.google.com/apis-explorer/#p/doubleclickbidmanager/v1/doubleclickbidmanager.queries.createquery
(clicking on the small downward arrow on the right you can switch from "Structured editor" to "Freeform editor", which allows you to directly copy and paste the JSON structure).
Here's an example of a basic request body:
{
  "kind": "doubleclickbidmanager#query",
  "metadata": {
    "title": "myTest",
    "dataRange": "LAST_30_DAYS",
    "format": "CSV",
    "sendNotification": false
  },
  "params": {
    "type": "TYPE_GENERAL",
    "groupBys": [
      "FILTER_ADVERTISER",
      "FILTER_INSERTION_ORDER"
    ],
    "filters": [
      {
        "type": "FILTER_PARTNER",
        "value": "12345678"
      }
    ],
    "metrics": [
      "METRIC_IMPRESSIONS",
      "METRIC_CLICKS"
    ],
    "includeInviteData": true
  },
  "schedule": {
    "frequency": "DAILY",
    "nextRunMinuteOfDay": 0,
    "nextRunTimezoneCode": "Europe/London"
  }
}
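For illustration only, here is a minimal sketch of how such a body can be submitted to the API with the client libraries used earlier; the provided DV360_Create_Query DAG handles this for you, so the keyfile path and exact steps below are assumptions, not Orchestra's internals.

```python
# Minimal sketch: create a DV360 query from the report body stored in the
# dv360_report_body Airflow variable (for illustration only; the provided
# DV360_Create_Query DAG does this for you).
from airflow.models import Variable
from google.oauth2 import service_account
from googleapiclient.discovery import build

credentials = service_account.Credentials.from_service_account_file(
    "/home/airflow/gcs/data/keyfile.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/doubleclickbidmanager"])
dbm_service = build("doubleclickbidmanager", "v1", credentials=credentials)

report_body = Variable.get("dv360_report_body", deserialize_json=True)
query = dbm_service.queries().createquery(body=report_body).execute()

# The new query ID is what ends up in dv360_latest_report_id.
Variable.set("dv360_latest_report_id", query["queryId"])
```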
You can then (manually) launch your DAG from the Airflow UI: identify the DAG named "DV360_Create_Query" in the list and launch it by clicking on the "Trigger Dag" link (the play-button-like icon).
Once the DAG has completed successfully, you will find a new variable called dv360_latest_report_id in the list of Variables, populated with the ID of the generated report that you can use in the following steps of your pipeline.
We use the requests library to handle larger report files.
You can add this package to the project via the Environment page.
Full details are covered here.
Simply follow the instructions and add requests as the name of the package (no version required).
We're providing other DV360 operators, to be used in your DAGs, so that you're able to run reports, check their status and read their results:
| Operator | Function |
| --- | --- |
| dv360_run_query_operator | Takes a query ID and runs the report (useful when you haven't set up the report to run on a schedule) |
| dv360_get_report_file_path_operator | Given a query ID, collects the latest file path of the resulting report file and stores it in an XCOM variable |
| dv360_download_report_by_file_path | Reads the report file path from an XCOM variable and downloads the file to a Cloud Storage bucket |
| dv360_upload_bq_operator | Loads a report CSV file from Cloud Storage, infers the schema and uploads the data to a BigQuery table. Note: the BigQuery dataset needs to exist |
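To make the kind of work these operators do more concrete, here is a minimal sketch, independent of the operators themselves, of fetching the latest report file path for a query and streaming it to a local file with the requests library mentioned above. The query ID and keyfile path are placeholders; field names follow the DoubleClick Bid Manager v1 API.

```python
# Minimal sketch: find the latest report file for a DV360 query and stream
# it to disk with requests (the Orchestra operators do similar work, but
# write to Cloud Storage / BigQuery instead).
import requests
from google.oauth2 import service_account
from googleapiclient.discovery import build

credentials = service_account.Credentials.from_service_account_file(
    "/home/airflow/gcs/data/keyfile.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/doubleclickbidmanager"])
dbm_service = build("doubleclickbidmanager", "v1", credentials=credentials)

QUERY_ID = 123456789  # placeholder, e.g. the value of dv360_latest_report_id

# The query metadata exposes the path of the latest generated report file.
query = dbm_service.queries().getquery(queryId=QUERY_ID).execute()
file_path = query["metadata"]["googleCloudStoragePathForLatestReport"]

# Stream the (potentially large) CSV to disk in chunks.
with requests.get(file_path, stream=True) as response:
    response.raise_for_status()
    with open("report.csv", "wb") as report_file:
        for chunk in response.iter_content(chunk_size=1024 * 1024):
            report_file.write(chunk)
```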
Below we will explain how to set up a workflow which will import your DV360 Structured Data Files (SDFs) to BigQuery.
If you haven’t created an Airflow Connection for the GMP APIs, follow the Create the Airflow Connection to GMP reporting step above to create one. Make sure that in the last step the following scopes are added:
https://www.googleapis.com/auth/doubleclickbidmanager, https://www.googleapis.com/auth/devstorage.full_control, https://www.googleapis.com/auth/bigquery
This report will contain all active advertiser IDs with their Partner IDs, which will be used to retrieve SDFs via the API. To create a new SDF advertisers report, in Airflow, please manually run:
dv360_create_sdf_advertisers_report_dag
The above DAG will create a scheduled DV360 report which will run daily. After it has successfully completed, you should see that the dv360_sdf_advertisers_report_id Airflow variable has been updated with the newly created report ID.
Note: These scheduled reports expire on the 1st of January 2029.
Once you’ve created the SDF advertisers report, please manually run the following DAG:
dv360_run_sdf_advertisers_report_dag
After it’s completed, manually run:
dv360_get_sdf_advertisers_from_report_dag
After it’s completed, you should be able to verify that the dv360_sdf_advertisers Airflow variable now contains the relevant partner and advertiser IDs which will be used to retrieve SDFs. The above DAG will be automatically configured to run daily.
Please manually run the following DAG to upload SDFs to BigQuery:
dv360_sdf_uploader_to_bq_dag
The process will use the dictionary stored in the dv360_sdf_advertisers Airflow variable to make API requests and store the responses in your BigQuery dataset.
Once the DAG has completed successfully, you will find new tables in your BigQuery dataset. The tables correspond to the SDF types you’ve configured in the sdf_file_types Airflow variable (e.g. if you’ve configured LINE_ITEM, you should see a table called "SDFLineItem"). The above DAG will be automatically configured to run daily.
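For context on the kind of request these SDF DAGs make, here is a minimal sketch of a single SDF download call via the DoubleClick Bid Manager v1 API, using one advertiser ID of the sort stored in the dv360_sdf_advertisers dictionary. The IDs and keyfile path are placeholders, and the actual DAGs handle batching and the BigQuery load on top of this.

```python
# Minimal sketch: one SDF download call for a single advertiser, mirroring
# the kind of request the SDF DAGs make. IDs and paths are placeholders.
from google.oauth2 import service_account
from googleapiclient.discovery import build

credentials = service_account.Credentials.from_service_account_file(
    "/home/airflow/gcs/data/keyfile.json",  # placeholder path
    scopes=["https://www.googleapis.com/auth/doubleclickbidmanager"])
dbm_service = build("doubleclickbidmanager", "v1", credentials=credentials)

request_body = {
    "fileTypes": ["LINE_ITEM"],   # matches sdf_file_types
    "filterType": "ADVERTISER_ID",
    "filterIds": ["11111111"],    # an advertiser from dv360_sdf_advertisers
    "version": "4.2",             # matches sdf_api_version
}

response = dbm_service.sdf().download(body=request_body).execute()

# The response contains one CSV string per requested file type,
# e.g. "lineItems" for LINE_ITEM.
print(response.get("lineItems", "")[:500])
```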
To sum up, we’ve scheduled two DAGs which run daily and independently from each other. The first DAG downloads a report and updates an Airflow variable with your partner and advertiser IDs. The second DAG fetches Structured Data Files using partner and advertiser IDs from the Airflow variable and uploads them to BigQuery.
Full details can be found here. Please note that files created by Composer are not automatically deleted; you will need to remove them manually or they will continue to incur storage costs. The same applies to the BigQuery datasets.
Orchestra is a framework that allows powerful API access to your data. Liability for how you use that data is your own. It is important that all data you keep is secure and that you have legal permission to work with and transfer all data you use. Orchestra can operate across multiple Partners; please be sure that this access is covered by legal agreements with your clients before implementing Orchestra. This project is covered by the Apache License.