A comprehensive guide to help you prepare for Data Engineering roles.
S.No | Topic | Subtopics |
---|---|---|
1 | Sorting | - Introduction to Sorting - Basics of Asymptotic Analysis and Worst Case & Average Case Analysis - Different Sorting Algorithms and their comparison - Algorithm paradigms: Divide & Conquer, Decrease & Conquer, Transform & Conquer - Presorting - Extensions of Merge Sort, Quick Sort, Heap Sort - Common sorting-related coding interview problems |
2 | Recursion | - Recursion as a Lazy Manager's Strategy - Recursive Mathematical Functions - Combinatorial Enumeration - Backtracking - Exhaustive Enumeration & General Template - Common recursion- and backtracking-related coding interview problems |
3 | Trees | - Dictionaries & Sets, Hash Tables - Modeling data as Binary Trees and Binary Search Trees, and performing operations on them - Tree Traversals and Constructions - Breadth-First Search (BFS) & Depth-First Search (DFS) Coding Patterns - Tree Construction from its traversals - Common tree-related coding interview problems |
4 | Graphs | - Overview of Graphs - Seven Bridges of Königsberg problem - Graph storage (Adjacency Lists, Matrices, Maps) - Graph traversal: BFS and DFS - Graphs in Interviews - Common graph-related coding interview problems |
5 | Dynamic Programming | - Introduction to Dynamic Programming (DP) - Modeling problems as recursive functions - Detecting overlapping subproblems - Top-down Memoization - Bottom-up Tabulation - Optimizing Bottom-up Tabulation - Common DP-related coding interview problems |
6 | Online Processing Systems | - The client-server model of Online processing - Top-down steps for system design interview - Depth and breadth analysis - Cryptographic hash function - Network Protocols, Web Server, Hash Index - Scaling - Performance Metrics - Service Level Objectives (SLOs) and Service Level Agreements (SLAs) - Proxy: Reverse and Forward - Load balancing - CAP Theorem - Content Delivery Network (CDN) - Cache - Sharding - Consistent Hashing - Storage - Case Studies: URL Shortener, Instagram, Uber, Twitter, Messaging Services |
7 | Batch Processing Systems | - Inverted Index - External Sort-Merge - K-way External Sort-Merge - Distributed File System - MapReduce Framework - Distributed Sorting - Case Studies: Search Engine, Graph Processor, Typeahead Suggestions, Recommendation Systems |
8 | Stream Processing Systems | - Case Studies: Application Performance Monitoring (APM), Social Connections, Netflix, Google Maps, Trending Topics, YouTube |
9 | SQL Programming | - Derive business insights from a food delivery app using SQL - Intermediate SQL concepts: CASE statements, Subqueries - Advanced SQL features: Joins, Analytical functions - Window functions: LEAD, LAG, RANK, DENSE_RANK - Complex SQL problems: customer-merchant dependence - Comparison of joins - Thematic SQL interview problems - Guide to SQL interviews |
10 | Data Modeling | - Design Data Warehouse tables for Uber - Conceptual and logical models - Fact and Dimension tables - Best practices for keys - Normalization - Slowly Changing Dimensions - Star vs. Snowflake schemas - Interview problems from Meta, Amazon, Uber - Guide to atypical interview questions |
11 | ETL and Pipeline Design | - Data pipeline for Netflix - Extract, Transform, Load (ETL) design: data ingestion, file formats, storage, metrics - Performance parameters - Pipeline architecture - Handling unstructured data - Machine Learning (ML) platform architecture - Data Engineering role in large-scale systems - Interview tips |
12 | Data Platforms | - Design a data platform for a gaming company - High-level components: Ingestion, Warehousing, Transformation, Governance - High-performance platform design with Kafka and Spark - Success metrics and data relevance - Data backup strategies - Optimization techniques: partitioning, distributed platform, cloud services - Product Sense - Interview tips |
13 | Big Data Frameworks | - Introduction to Big Data ecosystems - Overview of Distributed Storage Systems: Hadoop Distributed File System (HDFS), Amazon Simple Storage Service (S3) - Data processing with Spark - Partitioning and Shuffling - Fault Tolerance in distributed environments - Cluster Management: Yet Another Resource Negotiator (YARN), Mesos, Kubernetes - Optimizing Spark queries - Handling petabyte-scale data - Case Studies: ETL in e-commerce, streaming analytics for real-time systems |
14 | Data Warehousing | - Introduction to Data Warehousing - Online Analytical Processing (OLAP) vs. Online Transaction Processing (OLTP) systems - Schema Design: Star vs. Snowflake schemas - Best practices for Fact and Dimension tables - Partitioning strategies - Data Ingestion: Batch, Micro-batch, Real-time - Data Governance and Security - Performance tuning - Cloud-based Warehouses: Redshift, BigQuery, Snowflake - Case Studies: Retail, Financial Services, Healthcare |
15 | Data Governance | - Importance of Data Governance - Data Quality and Management - Data Lineage and Provenance tracking - Data Privacy and Security: General Data Protection Regulation (GDPR), Health Insurance Portability and Accountability Act (HIPAA) compliance - Access Control, Authentication, Encryption - Data Masking - Monitoring and Auditing - Tools for Data Governance - Governance strategy - Case Studies: Banking, healthcare, public institutions |
16 | Cloud Data Engineering | - Cloud platforms: Amazon Web Services (AWS), Google Cloud Platform (GCP), Microsoft Azure - Cloud Storage: Amazon Simple Storage Service (S3), Google Cloud Storage (GCS), Azure Blob Storage - Serverless Data Engineering: AWS Lambda, Google Cloud Functions - Cloud-based Data Warehousing: Redshift, BigQuery, Azure Synapse - Real-time data processing: Amazon Kinesis, Google Cloud Dataflow - Infrastructure as Code: Terraform, AWS CloudFormation - Monitoring cloud environments - Cost Optimization - Case Studies: Data Lakes, Internet of Things (IoT) pipelines |
17 | Machine Learning Integration | - Introduction to ML Pipelines - Feature Engineering at scale - Model Lifecycle Management - Serving ML models with low latency - MLOps frameworks: Kubeflow, MLflow, TensorFlow Extended (TFX) - Data drift and Model retraining - Integrating Data Engineering with ML workflows - Case Studies: Fraud detection, recommendation engines, predictive analytics |