mhhnr/Data-Engineer-Job-Prep

Data Engineer Job Prep

A comprehensive guide to help you prepare for Data Engineering roles.


Table of Contents

  1. Data Structures and Algorithms
  2. System Design
  3. Data Engineering

Data Structures and Algorithms

S.No Topic Subtopics
1 Sorting - Introduction to Sorting
- Basics of Asymptotic Analysis and Worst Case & Average Case Analysis
- Different Sorting Algorithms and their comparison
- Algorithm paradigms: Divide & Conquer, Decrease & Conquer, Transform & Conquer
- Presorting
- Extensions of Merge Sort, Quick Sort, Heap Sort
- Common sorting-related coding interview problems
2 Recursion - Recursion as a Lazy Manager's Strategy
- Recursive Mathematical Functions
- Combinatorial Enumeration
- Backtracking
- Exhaustive Enumeration & General Template
- Common recursion- and backtracking-related coding interview problems
3 Trees - Dictionaries & Sets, Hash Tables
- Modeling data as Binary Trees and Binary Search Trees, and performing operations on them
- Tree Traversals and Constructions
- Breadth-First Search (BFS) & Depth-First Search (DFS) Coding Patterns
- Tree Construction from its traversals
- Common trees-related coding interview problems
4 Graphs - Overview of Graphs
- Seven Bridges of Königsberg problem
- Graph storage (Adjacency Lists, Matrices, Maps)
- Graph traversal: BFS and DFS
- Graphs in Interviews
- Common graphs-related coding interview problems
5 Dynamic Programming - Introduction to Dynamic Programming (DP)
- Modeling problems as recursive functions
- Detecting overlapping subproblems
- Top-down Memoization
- Bottom-up Tabulation
- Optimizing Bottom-up Tabulation
- Common DP-related coding interview problems
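
The progression named in the Dynamic Programming row (recursive model, overlapping subproblems, top-down memoization, bottom-up tabulation, optimized tabulation) can be sketched on one small problem. This is an illustrative sketch only: the climbing-stairs problem and all function names are assumptions chosen for the example, not material taken from this guide.

```python
from functools import lru_cache


def ways_recursive(n: int) -> int:
    """Plain recursion: ways to climb n steps taking 1 or 2 steps at a time.

    The same subproblems are recomputed many times (overlapping subproblems).
    """
    if n <= 1:
        return 1
    return ways_recursive(n - 1) + ways_recursive(n - 2)


@lru_cache(maxsize=None)
def ways_memo(n: int) -> int:
    """Top-down memoization: same recursion, but each result is cached."""
    if n <= 1:
        return 1
    return ways_memo(n - 1) + ways_memo(n - 2)


def ways_table(n: int) -> int:
    """Bottom-up tabulation: fill a table from the base cases upward."""
    table = [1] * (n + 1)  # table[0] and table[1] are the base cases
    for i in range(2, n + 1):
        table[i] = table[i - 1] + table[i - 2]
    return table[n]


def ways_optimized(n: int) -> int:
    """Optimized tabulation: keep only the last two table entries."""
    prev, curr = 1, 1
    for _ in range(2, n + 1):
        prev, curr = curr, prev + curr
    return curr
```

All four functions return the same value for a given `n`; they differ in how much work is repeated and how much memory is held.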

System Design

S.No Topic Subtopics
6 Online Processing Systems - The client-server model of Online processing
- Top-down steps for system design interview
- Depth and breadth analysis
- Cryptographic hash function
- Network Protocols, Web Server, Hash Index
- Scaling
- Performance Metrics
- Service Level Objectives (SLOs) and Service Level Agreements (SLAs)
- Proxy: Reverse and Forward
- Load balancing
- CAP Theorem
- Content Delivery Network (CDN)
- Cache
- Sharding
- Consistent Hashing
- Storage
- Case Studies: URL Shortener, Instagram, Uber, Twitter, Messaging Services
7 Batch Processing Systems - Inverted Index
- External Sort-Merge
- K-way External Sort-Merge
- Distributed File System
- MapReduce Framework
- Distributed Sorting
- Case Studies: Search Engine, Graph Processor, Typeahead Suggestions, Recommendation Systems
8 Stream Processing Systems - Case Studies: Application Performance Monitoring (APM), Social Connections, Netflix, Google Maps, Trending Topics, YouTube
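
The K-way External Sort-Merge topic in the Batch Processing Systems row can be illustrated with a small sketch: sort fixed-size chunks, write each as a sorted run to disk, then merge all runs in one pass. The function names, the one-integer-per-line run format, and the `chunk_size` parameter are assumptions made for this example. A production version would also stream the merged output to a file rather than return a list.

```python
import heapq
import os
import tempfile
from typing import Iterable, Iterator, List


def _write_run(sorted_chunk: List[int]) -> str:
    """Write one sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(suffix=".run", text=True)
    with os.fdopen(fd, "w") as f:
        for value in sorted_chunk:
            f.write(f"{value}\n")
    return path


def _read_run(path: str) -> Iterator[int]:
    """Lazily read a run back, so only one line per run is in memory."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(values: Iterable[int], chunk_size: int) -> List[int]:
    """Sort values using in-memory chunks of at most chunk_size items."""
    run_paths: List[str] = []
    chunk: List[int] = []
    for value in values:
        chunk.append(value)
        if len(chunk) == chunk_size:
            run_paths.append(_write_run(sorted(chunk)))
            chunk = []
    if chunk:  # final, possibly shorter, run
        run_paths.append(_write_run(sorted(chunk)))

    try:
        # heapq.merge performs the k-way merge over the sorted runs.
        # The result is collected into a list only to keep the sketch short.
        return list(heapq.merge(*(_read_run(p) for p in run_paths)))
    finally:
        for path in run_paths:
            os.remove(path)
```

For example, `external_sort([5, 3, 9, 1, 7, 2, 8], chunk_size=3)` creates three runs on disk and merges them into a single sorted sequence.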

Data Engineering

S.No Topic Subtopics
9 SQL Programming - Derive business insights from a food delivery app using SQL
- Intermediate SQL concepts: Case Statements, Subqueries
- Advanced SQL concepts: Joins, Analytical functions
- Window functions: LEAD, LAG, RANK, DENSE_RANK
- Complex SQL problems: customer-merchant dependence
- Comparison of joins
- Thematic SQL interview problems
- Guide to SQL interviews
10 Data Modeling - Design Data Warehouse tables for Uber
- Conceptual and logical models
- Fact and Dimension tables
- Best practices for keys
- Normalization
- Slowly Changing Dimensions
- Star vs. Snowflake schemas
- Interview problems from Meta, Amazon, Uber
- Guide to atypical interview questions
11 ETL and Pipeline Design - Data pipeline for Netflix
- Extract, Transform, Load (ETL) design: data ingestion, file formats, storage, metrics
- Performance parameters
- Pipeline architecture
- Handling unstructured data
- Machine Learning (ML) platform architecture
- Data Engineering role in large-scale systems
- Interview tips
12 Data Platforms - Design data platform for a gaming company
- High-level components: Ingestion, Warehousing, Transformation, Governance
- High-performance platform design with Kafka and Spark
- Success metrics and data relevance
- Data backup strategies
- Optimization techniques: partitioning, distributed platform, cloud services
- Product Sense
- Interview tips
13 Big Data Frameworks - Introduction to Big Data ecosystems
- Overview of Distributed Storage Systems: Hadoop Distributed File System (HDFS), Amazon Simple Storage Service (S3)
- Data processing with Spark
- Partitioning and Shuffling
- Fault Tolerance in distributed environments
- Cluster Management: Yet Another Resource Negotiator (YARN), Mesos, Kubernetes
- Optimizing Spark queries
- Handling petabyte-scale data
- Case Studies: ETL in e-commerce, streaming analytics for real-time systems
14 Data Warehousing - Introduction to Data Warehousing
- Online Analytical Processing (OLAP) vs. Online Transaction Processing (OLTP) systems
- Schema Design: Star vs. Snowflake schemas
- Best practices for Fact and Dimension tables
- Partitioning strategies
- Data Ingestion: Batch, Micro-batch, Real-time
- Data Governance and Security
- Performance tuning
- Cloud-based Warehouses: Redshift, BigQuery, Snowflake
- Case Studies: Retail, Financial Services, Healthcare
15 Data Governance - Importance of Data Governance
- Data Quality and Management
- Data Lineage and Provenance tracking
- Data Privacy and Security: General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA) compliance
- Access Control, Authentication, Encryption
- Data Masking
- Monitoring and Auditing
- Tools for Data Governance
- Governance strategy
- Case Studies: Banking, healthcare, public institutions
16 Cloud Data Engineering - Cloud platforms: Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure
- Cloud Storage: Amazon Simple Storage Service (S3), Google Cloud Storage (GCS), Azure Blob Storage
- Serverless Data Engineering: AWS Lambda, GCP Cloud Functions
- Cloud-based Data Warehousing: Redshift, BigQuery, Azure Synapse
- Real-time data processing: Amazon Kinesis, Google Cloud Dataflow
- Infrastructure as Code: Terraform, CloudFormation
- Monitoring cloud environments
- Cost Optimization
- Case Studies: Data Lakes, Internet of Things (IoT) pipelines
17 Machine Learning Integration - Introduction to ML Pipelines
- Feature Engineering at scale
- Model Lifecycle Management
- Serving ML models with low latency
- MLOps frameworks: Kubeflow, MLflow, TensorFlow Extended (TFX)
- Data drift and Model retraining
- Integrating Data Engineering with ML workflows
- Case Studies: Fraud detection, recommendation engines, predictive analytics
