This is a learn-as-you-go project aimed at learning data engineering technologies and practices. The overarching aim is to tap into a source - the Ethereum blockchain network - and deliver the data to a consumer - Tableau or an equivalent front-end. The project is meant to be expansive: deploying the most optimal solution is not necessarily the top priority, but learning how to do so definitely is. The project timeline will proceed in phases, and the project structure will remain fluid.
Ethereum Blockchain > Web3.eth.py > MongoDB / Hadoop > Kafka > Spark / Flink > Hadoop / Doris > Tableau / Plotly Dash / ML-Pytorch-Streamlit
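As a sketch of the first hop (Ethereum Blockchain > Web3.eth.py > MongoDB), the snippet below flattens a web3 block into a Mongo-friendly document. The node endpoint URL, the `flatten_block` helper, and the `ethereum.blocks` collection are illustrative assumptions, not part of the project yet:

```python
def flatten_block(block) -> dict:
    """Reduce a web3 block (an AttributeDict) to a flat, Mongo-friendly document."""
    block_hash = block["hash"]
    return {
        "_id": block["number"],  # block number doubles as the document key
        "hash": block_hash.hex() if hasattr(block_hash, "hex") else block_hash,
        "timestamp": block["timestamp"],
        "tx_count": len(block["transactions"]),
        "gas_used": block["gasUsed"],
    }

if __name__ == "__main__":
    # Assumes the web3 package and a reachable node; replace the URL with your provider.
    from web3 import Web3

    w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))
    doc = flatten_block(w3.eth.get_block("latest"))
    # Hypothetical MongoDB sink (assumes pymongo and a local MongoDB):
    # from pymongo import MongoClient
    # MongoClient()["ethereum"]["blocks"].replace_one({"_id": doc["_id"]}, doc, upsert=True)
    print(doc)
```

Keeping the flattening step as a pure function makes it testable without a node or a database, which helps in the early phases.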
Setup of goals, basic project structure, timeline, virtual environments and version control.
Get the entire pipeline running, in whatever form, as quickly as possible.
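For a quick end-to-end run, each stage can start as a thin stub. A minimal sketch of the Kafka hop, assuming the kafka-python package and a local broker; the `eth-blocks` topic name is an assumption:

```python
import json


def to_kafka_message(doc: dict) -> bytes:
    """Serialize a block document to the JSON bytes a Kafka producer sends."""
    return json.dumps(doc, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    # Assumes kafka-python and a broker on localhost:9092.
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("eth-blocks", to_kafka_message({"_id": 1, "gas_used": 21000}))
    producer.flush()
```

Sorting the keys keeps the serialized form deterministic, which simplifies later deduplication and testing.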
Begin refactoring towards OOP and modular programming. Build infrastructure for health checks, security, exception handling and logging where not already implemented.
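One reusable piece of that exception-and-logging infrastructure could be a retry decorator that logs each failure of a flaky pipeline step before re-raising; this is a minimal sketch, with the `with_retries` name and defaults chosen for illustration:

```python
import functools
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")


def with_retries(attempts: int = 3, delay: float = 1.0):
    """Retry a flaky pipeline step, logging each failure before giving up."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    log.warning("%s failed (attempt %d/%d): %s",
                                fn.__name__, attempt, attempts, exc)
                    if attempt == attempts:
                        raise  # out of retries: surface the original error
                    time.sleep(delay)
        return wrapper
    return decorator
```

A step such as a node fetch or a Mongo write can then be wrapped with `@with_retries(attempts=5, delay=2.0)` without changing its body.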
Assessment of 'Data Product' requirements; design and planning of the data transformations required at each point of the pipeline to deliver the required 'Data Product' to the consumer.
Exploration of other data sources and of new data products built from them, e.g. X/TikTok sentiment data sources feeding a sentiment-analysis front-end.
Patches and red-teaming of the data pipeline's integrity. Cost-benefit analysis of version upgrades.