# Go Data Ingestion

## Introduction

This project contains a function for ingesting data from Google Cloud Storage to BigQuery using batch load. The function can be deployed to Cloud Functions and triggered with an HTTP POST request.

The data sources are configured in the file `resources/config/batch-ingestion.yaml`. There are four examples, each for a different source format. The source objects are ingested into time-partitioned BigQuery tables. The configuration file is templated with the fields `.Bucket` and `.Date`, which represent the bucket containing the data and the date of ingestion.

Below are the options that can be configured in the YAML configuration file. See `pkg/data_source.go` for details about the options specific to each file type. A sketch of an example entry follows the table.

| Key | Description |
| --- | --- |
| `source_uri` | GCS source URI |
| `source_format` | Source format: `AVRO`, `CSV`, `NEWLINE_DELIMITED_JSON`, `PARQUET` |
| `avro_options` | Options for Avro files |
| `csv_options` | Options for CSV files |
| `parquet_options` | Options for Parquet files |
| `auto_detect` | Whether the schema should be auto-detected from the file |
| `dataset_id` | BigQuery dataset ID |
| `table_id` | BigQuery table ID |
| `create_disposition` | BigQuery create disposition: `CREATE_IF_NEEDED`, `CREATE_NEVER` |
| `write_disposition` | BigQuery write disposition: `WRITE_APPEND`, `WRITE_EMPTY`, `WRITE_TRUNCATE` |
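For illustration, a single CSV source might look roughly like the sketch below. The keys are the ones documented above and the `{{ .Bucket }}`/`{{ .Date }}` template fields follow the templating described earlier, but the list layout, nesting, and the field under `csv_options` are assumptions; refer to the four examples in `resources/config/batch-ingestion.yaml` and to `pkg/data_source.go` for the authoritative structure.

```yaml
# Hypothetical entry; nesting and the csv_options field are assumptions.
- source_uri: "gs://{{ .Bucket }}/csv/{{ .Date }}/*.csv"
  source_format: CSV
  csv_options:
    skip_leading_rows: 1
  auto_detect: true
  dataset_id: my_dataset    # placeholder dataset ID
  table_id: my_table        # placeholder table ID
  create_disposition: CREATE_IF_NEEDED
  write_disposition: WRITE_TRUNCATE
```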

## Prerequisites

- Go 1.19
- Google Cloud SDK (`gcloud` CLI)

## Development

### Setup

Install dependencies:

```sh
go mod download
```

### Testing

Run unit tests:

```sh
go test ./...
```

### Running the application

Set environment variables:

| Variable | Description |
| --- | --- |
| `PROJECT` | GCP project |
| `BUCKET` | Bucket with the source data |
| `FUNCTION_TARGET` | Function entrypoint. Should be set to `IngestBatch` |
| `PORT` | Port for the web server. Defaults to 8080 if not set |
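For example, for a local run (the project ID, bucket name, and port below are placeholders):

```sh
export PROJECT="my-gcp-project"        # placeholder project ID
export BUCKET="my-source-data-bucket"  # placeholder bucket name
export FUNCTION_TARGET="IngestBatch"
export PORT="8080"
```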

Run the application:

```sh
go run cmd/main.go
```
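As a rough illustration of why `FUNCTION_TARGET` must match the entrypoint name, a `main.go` for the Functions Framework for Go typically looks something like the sketch below. This is not the project's actual `cmd/main.go`; the registration style and the handler body are assumptions.

```go
package main

import (
	"log"
	"net/http"
	"os"

	"github.com/GoogleCloudPlatform/functions-framework-go/funcframework"
	"github.com/GoogleCloudPlatform/functions-framework-go/functions"
)

func init() {
	// The name registered here is what FUNCTION_TARGET selects.
	functions.HTTP("IngestBatch", func(w http.ResponseWriter, r *http.Request) {
		// Hypothetical handler: parse the {"date":"YYYY-MM-DD"} body and
		// start the configured BigQuery load jobs.
	})
}

func main() {
	port := os.Getenv("PORT")
	if port == "" {
		port = "8080" // default documented above
	}
	if err := funcframework.Start(port); err != nil {
		log.Fatalf("funcframework.Start: %v", err)
	}
}
```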

Set the URL:

```sh
URL="localhost:${PORT:-8080}"
```

Invoke the function by making an HTTP POST request with a body containing a date in the format `%Y-%m-%d`:

```sh
curl -i -X POST ${URL} \
  -H "Content-Type: application/json" \
  -d '{"date":"2022-04-30"}'
```

## Deployment

### Deploying to Cloud Functions

Set environment variables:

| Variable | Description |
| --- | --- |
| `PROJECT` | GCP project |
| `REGION` | Region |
| `CLOUD_FUNCTION_SA` | Email of the service account used for the Cloud Function. Needs the roles `roles/bigquery.dataEditor`, `roles/bigquery.jobUser`, and `roles/storage.objectViewer` |
| `FUNCTION_NAME` | Name of the function |
| `BUCKET` | Bucket with the source data |
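For example (all values below are placeholders):

```sh
export PROJECT="my-gcp-project"
export REGION="europe-west1"
export CLOUD_FUNCTION_SA="ingest-sa@my-gcp-project.iam.gserviceaccount.com"
export FUNCTION_NAME="batch-ingestion"
export BUCKET="my-source-data-bucket"
```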

Deploy to Cloud Functions:

```sh
gcloud functions deploy ${FUNCTION_NAME} \
  --gen2 \
  --project=${PROJECT} \
  --region=${REGION} \
  --trigger-http \
  --runtime=go119 \
  --entry-point=IngestBatch \
  --service-account=${CLOUD_FUNCTION_SA} \
  --set-env-vars=PROJECT=${PROJECT},BUCKET=${BUCKET} \
  --no-allow-unauthenticated
```

### Invoking the Cloud Function

Set the URL and token:

```sh
URL=$(gcloud functions describe ${FUNCTION_NAME} --gen2 --region=${REGION} --format="value(serviceConfig.uri)")

TOKEN=$(gcloud auth print-identity-token)
```

Invoke the function by making an HTTP POST request with a body containing a date in the format `%Y-%m-%d`:

```sh
curl -i -X POST ${URL} \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"date":"2022-04-30"}'
```