Data Pulse

Overview

Data Pulse is a pipeline that processes and transforms pageview data collected by a web scraper. Built on Apache Beam and Google Cloud Dataflow, it reads, parses, filters, and enriches the scraped data, then writes it to Google BigQuery for storage and analysis.

Features

JSON Parsing: Reads JSON lines and converts them into PageView objects.
Data Filtering: Retains only pageviews with a post type of "product".
Data Enrichment: Adds country information based on the user's IP address using the MaxMind GeoLite2 database.
BigQuery Integration: Writes processed data to Google BigQuery.
Robust Logging and Error Handling: Ensures data integrity throughout the pipeline.
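
The filtering and enrichment features above can be illustrated with a minimal plain-Java sketch. It deliberately leaves out Apache Beam and the MaxMind reader so that it runs on its own. The PageView fields, the class and method names, the "UNKNOWN" fallback, and the stubbed IP-to-country lookup are all assumptions made for illustration and are not taken from the project's source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class PageViewStages {

    // Hypothetical PageView shape; the project's real class may have other fields.
    static final class PageView {
        final String url;
        final String postType;
        final String ip;
        final String country; // null until the enrichment step runs

        PageView(String url, String postType, String ip, String country) {
            this.url = url;
            this.postType = postType;
            this.ip = ip;
            this.country = country;
        }

        PageView withCountry(String newCountry) {
            return new PageView(url, postType, ip, newCountry);
        }
    }

    // Filtering step: keep only pageviews whose post type is "product".
    static boolean isProduct(PageView pv) {
        return "product".equals(pv.postType);
    }

    // Enrichment step: the lookup stands in for a GeoLite2 query.
    // Falling back to "UNKNOWN" is an assumption of this sketch.
    static PageView enrich(PageView pv, Function<String, String> countryLookup) {
        String country = countryLookup.apply(pv.ip);
        return pv.withCountry(country == null ? "UNKNOWN" : country);
    }

    // Filter, then enrich, in the order the feature list describes.
    static List<PageView> process(List<PageView> input,
                                  Function<String, String> countryLookup) {
        return input.stream()
                .filter(PageViewStages::isProduct)
                .map(pv -> enrich(pv, countryLookup))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Stub lookup table replacing the GeoLite2 database (illustrative data).
        Map<String, String> fakeGeoDb = new HashMap<>();
        fakeGeoDb.put("203.0.113.7", "AU");

        List<PageView> input = Arrays.asList(
                new PageView("/shoes", "product", "203.0.113.7", null),
                new PageView("/blog/hello", "article", "198.51.100.2", null));

        for (PageView pv : process(input, fakeGeoDb::get)) {
            System.out.println(pv.url + " -> " + pv.country);
        }
    }
}
```

In a Beam pipeline, each of these steps would typically be a separate transform, with the lookup backed by the GeoLite2 database rather than an in-memory map.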

Setup

Prerequisites

Java 8 or higher
Apache Beam SDK
Google Cloud SDK
Access to Google Cloud Platform with BigQuery and Cloud Storage

Installation

Clone the repository

git clone https://github.com/yourusername/data-pulse.git
cd data-pulse

Set up your Google Cloud environment

gcloud init
gcloud auth application-default login

Modify configuration files

Update the helpers/Config.java file with your GCP project ID, dataset ID, and Cloud Storage path.
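
As a rough guide to what this configuration step involves, the sketch below shows one way such a class could hold the three values. It is not the contents of the project's helpers/Config.java: the class name ConfigSketch, the constant names, the table name, and every value are placeholders assumed for illustration.

```java
// Hypothetical stand-in for helpers/Config.java; all names and values are placeholders.
final class ConfigSketch {
    static final String PROJECT_ID = "my-gcp-project";       // your GCP project ID
    static final String DATASET_ID = "my_dataset";           // your BigQuery dataset ID
    static final String TABLE_ID = "pageviews";              // assumed table name
    static final String GCS_PATH = "gs://my-bucket/data-pulse"; // your Cloud Storage path

    private ConfigSketch() {
        // Constants holder; not meant to be instantiated.
    }

    // BigQuery table reference in the "project:dataset.table" form.
    static String tableSpec() {
        return PROJECT_ID + ":" + DATASET_ID + "." + TABLE_ID;
    }

    // Staging and temp folders derived from the single Cloud Storage path.
    static String stagingLocation() {
        return GCS_PATH + "/staging";
    }

    static String tempLocation() {
        return GCS_PATH + "/tmp";
    }

    public static void main(String[] args) {
        System.out.println(tableSpec());
        System.out.println(stagingLocation());
        System.out.println(tempLocation());
    }
}
```

Keeping these values in one place means the project ID, dataset, and bucket only need to be changed once when moving between environments.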

Running the Pipeline

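The command below submits the pipeline to Google Cloud Dataflow. To run it locally instead, change --runner=DataflowRunner to --runner=DirectRunner (this assumes the Beam direct runner is available on the classpath). Replace <YOUR_PROJECT_ID> and <YOUR_BUCKET> with your own values: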

mvn compile exec:java -Dexec.mainClass=org.apache.beam.DataPulsePipeline -Dexec.args="--project=<YOUR_PROJECT_ID> --stagingLocation=gs://<YOUR_BUCKET>/staging --tempLocation=gs://<YOUR_BUCKET>/tmp --runner=DataflowRunner --inputFile=gs://<YOUR_BUCKET>/input/input.json"

fatimashehab99/data-pulse-pipeline
