Build an automatic process to ingest data on an on-demand basis. The data represents trips taken by different vehicles, and includes a city, a point of origin, and a destination. Example data point:
```json
{
    "region": "Prague",
    "origin_coord": "POINT (14.4973794438195 50.00136875782316)",
    "destination_coord": "POINT (14.43109483523328 50.04052930943246)",
    "datetime": "2018-05-28 9:03:40",
    "datasource": "funny_car"
}
```
I did everything first in the UI. Since the solution should be delivered in a repository, it would be very weird to have a repo full of videos/images of the UI, so I decided to learn Terraform and do it that way. It's probably not a best practice to have the entire project in one file, but it worked! It was very fun.
I went for the GCP services that I am most familiar with, enabling a fully serverless architecture.
For an Open Source solution, I would consider RabbitMQ for messaging, Knative for functions, and Druid for data warehousing.
Data Scientists can access the data by querying the BigQuery table directly. GCP has great documentation for manipulating GIS data: https://cloud.google.com/bigquery/docs/gis
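For example, trip distances can be computed with BigQuery's GIS functions. This is a minimal sketch assuming the defaults used below: table `GIS_DATA.trips`, coordinates stored as WKT strings (drop the `ST_GEOGFROMTEXT` calls if they are stored as `GEOGRAPHY`), and `datetime` as a TIMESTAMP (the time filter is mandatory, see the notes further down):

```sh
bq query --use_legacy_sql=false '
SELECT
  region,
  ST_DISTANCE(ST_GEOGFROMTEXT(origin_coord),
              ST_GEOGFROMTEXT(destination_coord)) AS distance_m  -- straight-line distance in meters
FROM `GIS_DATA.trips`
WHERE datetime >= TIMESTAMP("2018-01-01")  -- time filter on the partitioned field
LIMIT 10'
```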
- Install Terraform on your system of preference: https://www.terraform.io/downloads.html
- Navigate to your folder of preference
- Clone this repo

  ```sh
  git clone https://github.com/andrecsq/trips_challenge.git
  ```

- Navigate to its folder

  ```sh
  cd trips_challenge
  ```
- Set up your GCP account as per the Terraform documentation on "Setting up GCP": Link
- Move the created Service Account JSON to the repo's root folder as `credentials.json`
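  For example (the download path below is just illustrative; use wherever your browser saved the key):

  ```sh
  mv ~/Downloads/service-account-key.json ./credentials.json
  ```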
- Check that the local variables in the `main.tf` file are appropriate. (TODO: move `locals` to a separate file)
- Initialize Terraform

  ```sh
  terraform init
  ```

- Create the infrastructure

  ```sh
  terraform apply
  ```
Now you can publish messages to the PubSub topic `pubsub_topic_id` (default value `trips`) and have them processed and loaded into the BigQuery table (default `GIS_DATA.trips`). Publishing cannot be done directly via plain HTTP; see the [guide here](https://cloud.google.com/pubsub/docs/publisher).
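For instance, with the gcloud CLI (the topic name assumes the default `trips`; note the space replaced by `T` in the `datetime` field, per the caveats below):

```sh
gcloud pubsub topics publish trips --message '{
  "region": "Prague",
  "origin_coord": "POINT (14.4973794438195 50.00136875782316)",
  "destination_coord": "POINT (14.43109483523328 50.04052930943246)",
  "datetime": "2018-05-28T9:03:40",
  "datasource": "funny_car"
}'
```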
The challenge requirements:

- There must be an automated process to ingest and store the data.
- Trips with similar origin, destination, and time of day should be grouped together.
- Develop a way to obtain the weekly average number of trips for an area, defined by a bounding box (given by coordinates) or by a region.
- Develop a way to inform the user about the status of the data ingestion without using a polling solution.
- The solution should be scalable to 100 million entries. Simplifying the data with a data model is encouraged. Please add proof that the solution is scalable.
- Use a SQL database.
How the requirements were addressed:

- I grouped the data using BigQuery's partitioning (time of day, by hour) and clustering (origin, destination). The time-of-day filter is mandatory, but the clustering filters are not (see the query sketch after this list).
- The solution is Serverless, so it can scale indefinitely.
- There isn't a centralized way to monitor messages being processed, but each service can be monitored individually via the Logs Explorer. These logs are temporary, but it is possible to create a sink from any service's logs to BigQuery.
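As an illustration, the weekly average number of trips for a bounding box can be obtained with a query like the one below. This is only a sketch: it assumes the default table `GIS_DATA.trips`, WKT-string coordinates, and a TIMESTAMP `datetime`, and the bounding box around Prague is made up:

```sh
bq query --use_legacy_sql=false '
WITH weekly AS (
  SELECT TIMESTAMP_TRUNC(datetime, WEEK) AS week, COUNT(*) AS trips
  FROM `GIS_DATA.trips`
  WHERE datetime BETWEEN TIMESTAMP("2018-01-01") AND TIMESTAMP("2019-01-01")  -- mandatory time filter
    AND ST_WITHIN(
          ST_GEOGFROMTEXT(origin_coord),
          -- bounding box as a WKT polygon; origin is the first clustered field
          ST_GEOGFROMTEXT("POLYGON((14.2 49.9, 14.7 49.9, 14.7 50.2, 14.2 50.2, 14.2 49.9))"))
  GROUP BY week
)
SELECT AVG(trips) AS weekly_avg_trips FROM weekly'
```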
Caveats:

- You need to replace `' '` with `'T'` in the `datetime` field before publishing a message (as in the publish example above).
- PubSub topics can have a schema. This topic has no schema because Terraform doesn't currently support PubSub schemas.
- PubSub doesn't guarantee deduplication (delivery is at-least-once), so at high volume the BigQuery table will contain duplicate rows. A possible workaround is sketched after this list.
- BigQuery's clustering only works in the order of the clustered fields, which is origin first and destination second; e.g., if you filter by destination but not by origin, the clustering won't help.
- The `message_generator` is a work in progress.
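One way to work around the duplicates is to deduplicate at query time. A sketch, assuming the five fields together identify a trip (adapt the key to your data model):

```sh
bq query --use_legacy_sql=false '
SELECT * FROM `GIS_DATA.trips`
WHERE datetime BETWEEN TIMESTAMP("2018-01-01") AND TIMESTAMP("2019-01-01")  -- mandatory time filter
QUALIFY ROW_NUMBER() OVER (
  -- keep one row per distinct trip
  PARTITION BY region, origin_coord, destination_coord, datetime, datasource) = 1'
```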