2021/v1.1 (#33)
* Create CHANGELOG for version 2021

* Initial text version for visually impaired users

* Add 2021 updates (see CHANGELOG)

* Include link to text version

* Format markdown

* Format markdown
alexandraabbas authored Jan 15, 2021
1 parent e7b19ea commit b805848
Showing 6 changed files with 232 additions and 1 deletion.
23 changes: 23 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,23 @@
## Roadmap 2021

### Update 2021-01-15

* Added text version for visually impaired users (issue #10)
* Math & statistics basics have been added to CS fundamentals (issue #22)
* Dimensional modelling has been added to Database fundamentals
* Added section for Object storage (issue #7)
* Azure CosmosDB has been added to Document databases
* Apache Impala has been moved from Batch processing to Data Warehouses
* Azure Synapse Analytics (issue #18) and ClickHouse (issue #24) have been added to Data Warehouses
* Lambda & Kappa architectures have been added to Cluster computing fundamentals (issue #31)
* Azure Data Lake has been added to Managed Hadoop
* Apache NiFi has been added to Hybrid data processing
* Cloud-specific messaging services have been added to Messaging (issue #8)
* Luigi has been added to Workflow scheduling
* AWS CDK has replaced AWS CloudFormation in Infrastructure provisioning (issue #4, issue #6)
* Power BI has been added to data visualisation tools (issue #29)
* MLflow has been added to Machine Learning Ops (issue #30)

## Roadmap 2020

[Modern Data Engineer Roadmap 2020](https://github.com/datastacktv/data-engineer-roadmap/tree/8b1ccdce4524961bfd37495de20117c47766b1eb)
6 changes: 5 additions & 1 deletion README.md
@@ -4,7 +4,7 @@
> Roadmap to becoming a data engineer in 2021

[![Twitter](https://img.shields.io/badge/-Twitter-1DA1F2)](https://twitter.com/datastacktv)
[![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](https://www.youtube.com/channel/UCQSbqkMlvf_J949HDWxOt7Q)
[![YouTube](https://img.shields.io/badge/-YouTube-FF0000)](http://youtube.com/c/datastacktv)
[![Website](https://img.shields.io/badge/-Website-565CD8)](https://datastack.tv/)

This roadmap aims to give a **complete picture of the modern data engineering landscape** and serve as a **study guide** for aspiring data engineers.
@@ -17,10 +17,14 @@ This roadmap aims to give a **complete picture of the modern data engineering la
***

> [Text version for visually impaired users](text/roadmap.md)

![Data Engineer Roadmap](img/roadmap.png)

## Nice to have 😎

> [Text version for visually impaired users](text/extras.md)

![Data Engineer Roadmap Extras](img/extras.png)

## Contributions are welcome 💜
Binary file modified img/extras.png
Binary file modified img/roadmap.png
29 changes: 29 additions & 0 deletions text/extras.md
@@ -0,0 +1,29 @@
> Text version for visually impaired users

*Note: Data engineers often work closely with Data scientists, Data analysts and Machine Learning engineers. It’s good to have a basic understanding of the tools they use.*

* Visualise data
  * Tableau [general recommendation]
  * Looker [personal recommendation]
  * Grafana [general recommendation]
  * Jupyter Notebook [general recommendation]
  * Microsoft Power BI

* Machine Learning fundamentals
  * Terminology [general recommendation]
    * Supervised vs unsupervised learning
    * Classification vs regression
    * Evaluation metrics
  * scikit-learn [general recommendation]
  * TensorFlow [personal recommendation]
  * Keras [personal recommendation]
  * PyTorch [general recommendation]

* Machine Learning Ops
  * TensorFlow Extended (TFX) [general recommendation]
  * Kubeflow [personal recommendation]
  * MLflow
  * Amazon SageMaker
  * Google Cloud AI Platform

*Note: Keep learning...*
175 changes: 175 additions & 0 deletions text/roadmap.md
@@ -0,0 +1,175 @@
> Text version for visually impaired users

# Data Engineer in 2021

* CS fundamentals
  * Basic terminal usage [general recommendation]
  * Data structures & algorithms [general recommendation]
  * APIs [general recommendation]
  * REST [general recommendation]
  * Structured vs unstructured data [general recommendation]
  * Serialisation
  * Linux [general recommendation]
    * CLI
    * Vim
    * Shell scripting
    * Cronjobs
  * How does the computer work? [general recommendation]
  * How does the Internet work? [general recommendation]
  * Git — Version control [general recommendation]
  * Math & statistics basics [general recommendation]

*Note: Git is used for tracking changes in source code and coordinating work among programmers. In your day-to-day work you will use a hosted Git service such as GitHub, GitLab or Bitbucket.*

* Learn a programming language
  * Python [personal recommendation]
  * Java [general recommendation]
  * Scala
  * Go

*Note: Learn how to write clean, extensible code. Spend some time understanding programming paradigms (functional vs OOP) and best practices (design patterns, YAGNI, stateful vs stateless applications). Get familiar with an IDE or code editor like VS Code.*

* Testing
  * Unit testing [general recommendation]
  * Integration testing [general recommendation]
  * Functional testing [general recommendation]

* Database fundamentals
  * SQL [general recommendation]
  * Normalisation [general recommendation]
  * ACID transactions [general recommendation]
  * CAP theorem [general recommendation]
  * OLTP vs OLAP [general recommendation]
  * Horizontal vs vertical scaling [general recommendation]
  * Dimensional modeling [general recommendation]

* Relational databases
  * MySQL [general recommendation]
  * PostgreSQL [general recommendation]
  * MariaDB
  * Amazon Aurora

* Non-relational databases
  * Document databases
    * MongoDB [general recommendation]
    * Elasticsearch [general recommendation]
    * Apache CouchDB
    * Azure CosmosDB
  * Wide column databases
    * Apache Cassandra [general recommendation]
    * Apache HBase [general recommendation]
    * Google Cloud Bigtable [personal recommendation]
  * Graph databases
    * Neo4j
    * Amazon Neptune
  * Key-value stores
    * Redis [personal recommendation]
    * Memcached
    * Amazon DynamoDB [general recommendation]

*Note: Understand the difference between Document, Wide column, Graph and Key-value NoSQL databases. We recommend mastering one database from each category.*

* Data warehouses
  * Snowflake [general recommendation]
  * Presto
  * Apache Hive
  * Apache Impala
  * Amazon Redshift [general recommendation]
  * Google BigQuery [personal recommendation]
  * Azure Synapse
  * ClickHouse

* Object storage
  * AWS S3 [general recommendation]
  * Azure Blob Storage
  * Google Cloud Storage

* Cluster computing fundamentals
  * Apache Hadoop [general recommendation]
  * HDFS [general recommendation]
  * MapReduce [general recommendation]
  * Lambda & Kappa architectures
  * Managed Hadoop [general recommendation]
    * Amazon EMR
    * Google Dataproc
    * Azure Data Lake

*Note: Most modern data processing frameworks are based on Apache Hadoop and MapReduce to some extent. Understanding these concepts can help you learn modern data processing frameworks much quicker.*

* Data processing
  * Batch
    * Apache Pig [general recommendation]
    * Apache Arrow
    * data build tool [personal recommendation]
  * Hybrid
    * Apache Spark [general recommendation]
    * Apache Beam [personal recommendation]
    * Apache Flink [general recommendation]
    * Apache NiFi
  * Streaming
    * Apache Kafka [personal recommendation]
    * Apache Storm [general recommendation]
    * Apache Samza
    * Amazon Kinesis

*Note: Hybrid frameworks are able to process both batch and streaming data. Batch data processing is often done by analytical data warehouse applications. See the Data warehouses section for more.*

* Messaging
  * RabbitMQ [general recommendation]
  * Apache ActiveMQ
  * Amazon SNS & SQS
  * Google PubSub
  * Azure Service Bus

* Workflow scheduling
  * Apache Airflow [personal recommendation]
  * Google Composer
  * Apache Oozie
  * Luigi

*Note: Cloud Composer is a managed Apache Airflow service on Google Cloud Platform.*

* Monitoring data pipelines
  * Prometheus [general recommendation]
  * Datadog [general recommendation]
  * Sentry [general recommendation]
  * StatsD

* Networking
  * Protocols [general recommendation]
    * HTTP / HTTPS
    * TCP
    * SSH
    * IP
    * DNS
  * Firewalls [general recommendation]
  * VPN [general recommendation]
  * VPC [general recommendation]

* Infrastructure as Code
  * Containers
    * Docker [personal recommendation]
    * LXC
  * Container orchestration
    * Kubernetes [general recommendation]
    * Docker Swarm
    * Apache Mesos
    * Google Kubernetes Engine (GKE) [general recommendation]
  * Infrastructure provisioning
    * Terraform [personal recommendation]
    * Pulumi
    * AWS CDK [general recommendation]

* CI/CD
  * GitHub Actions [general recommendation]
  * Jenkins [general recommendation]

* Identity and access management
  * Active Directory [general recommendation]
  * Azure Active Directory

* Data security & privacy
  * Legal compliance [general recommendation]
  * Encryption [general recommendation]
  * Key management [general recommendation]
  * Data governance & integrity
