From 571df483c41cca01fa46fd3a58d0f6f8f5164a63 Mon Sep 17 00:00:00 2001 From: S B <64776588+seblum@users.noreply.github.com> Date: Thu, 3 Aug 2023 17:19:39 +0200 Subject: [PATCH] Chapter/infrastructure-deployment (#12) * added initial deployment files * new naming of deployment files * split deployment into separate files * added code for deployment usage * draft for root and essentials * added exemplary acknowledgements * added rough overview over modules * added infrastructure essentials text and code * fix comma * overview graph to new structure and descriptions * added airflow and jupyterhub modules * added initial deployment files * new naming of deployment files * split deployment into separate files * added code for deployment usage * draft for root and essentials * added rough overview over modules * added infrastructure essentials text and code * fix comma * overview graph to new structure and descriptions * added airflow and jupyterhub modules * removing old files * removed redundant 1.5 section * delete redundant 1.4 section * restructured toc * rephrased book overview * Create LICENSE * added readme and contributing * removed temporary structure * remove old files --- CONTRIBUTING.md | 15 + LICENSE | 201 +++++ README.md | 81 +- manuscript/01.4-Introduction-Ops_practices.md | 164 ---- ...duction-MLOps_Engineering_with_Airflow.Rmd | 27 - .../02-Overview_about_book_tutorials.Rmd | 5 + ...h_Airflow.Rmd => 07-ML-Project_Design.Rmd} | 4 +- .../08-Deployment-Infrastructure_Overview.md | 62 +- .../08.1-Deployment_Infrastructure_Root.md | 133 +++ ....2-Deployment-Infrastructure_Essentials.md | 849 ++++++++++++++++++ .../08.3-Deployment-Infrastructure_Modules.md | 561 ++++++++++++ ...loyment-Infrastructure_Design_Decisions.md | 1 + ...terhub.md => 09.1-Deployment-Usage_IDE.md} | 2 +- ...eployment-Usage_Building-Model-Pipeline.md | 3 - manuscript/10-Acknowledgements.Rmd | 7 + manuscript/_bookdown.yml | 14 +- temporary_structure/02-MLOps.Rmd | 11 - temporary_structure/03-Airflow.Rmd | 11 - temporary_structure/041-k8s.md | 195 ---- temporary_structure/05-Terraform.Rmd | 30 - temporary_structure/06-MLFlow_DVC.Rmd | 31 - temporary_structure/08-NeuralNetworks.tex | 584 ------------ temporary_structure/09-Deployment.Rmd | 34 - temporary_structure/10-blocks.Rmd | 30 - temporary_structure/10-citations.Rmd | 15 - temporary_structure/10-parts.Rmd | 12 - temporary_structure/10-references.Rmd | 3 - temporary_structure/10-share.Rmd | 31 - 28 files changed, 1907 insertions(+), 1209 deletions(-) create mode 100644 CONTRIBUTING.md create mode 100644 LICENSE delete mode 100644 manuscript/01.4-Introduction-Ops_practices.md delete mode 100644 manuscript/01.5-Introduction-MLOps_Engineering_with_Airflow.Rmd create mode 100644 manuscript/02-Overview_about_book_tutorials.Rmd rename manuscript/{02-Project-MLOps_Engineering_with_Airflow.Rmd => 07-ML-Project_Design.Rmd} (96%) create mode 100644 manuscript/08.1-Deployment_Infrastructure_Root.md create mode 100644 manuscript/08.2-Deployment-Infrastructure_Essentials.md create mode 100644 manuscript/08.3-Deployment-Infrastructure_Modules.md create mode 100644 manuscript/08.4-Deployment-Infrastructure_Design_Decisions.md rename manuscript/{09.1-Deployment-Usage_Jupyterhub.md => 09.1-Deployment-Usage_IDE.md} (98%) create mode 100644 manuscript/10-Acknowledgements.Rmd delete mode 100644 temporary_structure/02-MLOps.Rmd delete mode 100644 temporary_structure/03-Airflow.Rmd delete mode 100644 temporary_structure/041-k8s.md delete mode 100644 
temporary_structure/05-Terraform.Rmd delete mode 100644 temporary_structure/06-MLFlow_DVC.Rmd delete mode 100755 temporary_structure/08-NeuralNetworks.tex delete mode 100644 temporary_structure/09-Deployment.Rmd delete mode 100644 temporary_structure/10-blocks.Rmd delete mode 100644 temporary_structure/10-citations.Rmd delete mode 100644 temporary_structure/10-parts.Rmd delete mode 100644 temporary_structure/10-references.Rmd delete mode 100644 temporary_structure/10-share.Rmd diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md new file mode 100644 index 0000000..7a65457 --- /dev/null +++ b/CONTRIBUTING.md @@ -0,0 +1,15 @@ +# Contributing + +I enthusiastically welcome any contributions. Whether it's spotting a typo, suggesting better sentence formulations, or proposing stylistic improvements, the initial step involves forking the repository. After making your changes, you can create a pull request. + +If you're interested in making a more substantial contribution, such as writing a chapter or providing examples, I'd be thrilled! Please open an issue on GitHub, and we can discuss your ideas further. + +Fork the repository and submit a pull request (PR) to propose your changes. You may use "[WIP]" in the PR title if you're still working on it. + +## Formatting Guidelines + ++ Please write in clear and concise language to make the content easily understandable. ++ Use proper headings, subheadings, and bullet points to structure the text. ++ Code examples should be well-formatted and properly documented. ++ For adding images or diagrams, make sure they are clear and relevant to the topic. + diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..7b85a07 --- /dev/null +++ b/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. + + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). 
+ + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. 
You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. 
In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [2023] [Sebastian Blum] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. diff --git a/README.md b/README.md index b6e9347..dd0c98e 100644 --- a/README.md +++ b/README.md @@ -1,26 +1,73 @@ -Welcome! +# MLOps Engineering Book -This is a minimal example of a book based on R Markdown and **bookdown** (https://github.com/rstudio/bookdown). +![MLOps Engineering Book](https://raw.githubusercontent.com/seblum/mlops-engineering-book/main/images/mlops_book.jpg) -This template provides a skeleton file structure that you can edit to create your book. +Welcome to the MLOps Engineering Book repository! This project hosts the source code and content for the MLOps Engineering Book, an online resource dedicated to MLOps practices, methodologies, and tools for efficiently deploying, managing, and scaling machine learning models. -The contents inside the .Rmd files provide some pointers to help you get started, but feel free to also delete the content in each file and start fresh. 
+## Introduction
-Additional resources:
+MLOps, short for "Machine Learning Operations," is the practice of integrating machine learning models into the software development and deployment processes. This book aims to provide readers with a comprehensive guide to MLOps principles, best practices, and technologies that help streamline the lifecycle of machine learning projects.
-The **bookdown** book: https://bookdown.org/yihui/bookdown/
+The book covers a wide range of topics, including:
-The **bookdown** package reference site: https://pkgs.rstudio.com/bookdown
+- Setting up an MLOps infrastructure and an ML platform
+- Data management and versioning for ML projects
+- Model training and evaluation
+- Model deployment and monitoring
+- Continuous Integration/Continuous Deployment (CI/CD) pipelines for ML
+- Kubernetes and containerization for ML
+- And much more!
+Whether you are a data scientist, machine learning engineer, or software developer, this book is designed to equip you with the knowledge and tools needed to implement MLOps effectively.
-## Table of Contents:
+## Getting Started
-1. Introduction
-2. MLOps
-3. Airflow
-4. MLflow
-5. Terraform
-6. Kuberenetes
-7. Example
-    7. Infrastructure
-    8. Modeling
\ No newline at end of file
+To get started with the MLOps Engineering Book, you have two options:
+
+1. **Read Online:** You can access the book online at [https://seblum.github.io/mlops-engineering-book/](https://seblum.github.io/mlops-engineering-book/). The website provides a user-friendly interface for easy navigation through chapters and sections.
+
+2. **Build Locally:** If you prefer to read the book on your local machine or contribute to its development, you can build it using the [Bookdown](https://bookdown.org/) framework.
+
+   Here are the steps to build the book locally:
+
+   1. Clone this repository to your local machine:
+
+      ```bash
+      git clone https://github.com/seblum/mlops-engineering-book.git
+      ```
+
+   2. Install the required dependencies:
+
+      ```bash
+      # Assuming you have R and RStudio installed
+      # Install the required R packages using RStudio or the following command:
+      Rscript -e "install.packages(c('bookdown', 'rmarkdown'))"
+      ```
+
+   3. Navigate to the `manuscript` directory:
+
+      ```bash
+      cd mlops-engineering-book/manuscript
+      ```
+
+   4. Build the book using Bookdown:
+
+      ```bash
+      # For HTML output
+      Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::gitbook')"
+
+      # For PDF output (requires LaTeX)
+      Rscript -e "bookdown::render_book('index.Rmd', 'bookdown::pdf_book')"
+      ```
+
+   5. Once the build process is complete, you can find the output files in the `_book` directory.
+
+## Contributing
+
+We welcome contributions to the MLOps Engineering Book! If you would like to improve existing content, fix errors, or add new chapters, feel free to open issues and submit pull requests. Please ensure that your contributions align with the book's theme and follow the [contribution guidelines](CONTRIBUTING.md).
+
+## License
+
+This repository is licensed under the Apache License, Version 2.0. The Apache License is an open-source license that allows users to freely use, modify, distribute, and sublicense the code.
+
+Please refer to the [LICENSE](LICENSE) file in this repository for the full text of the Apache License, Version 2.0. By using, contributing, or distributing this repository, you agree to be bound by the terms and conditions of the Apache License.
diff --git a/manuscript/01.4-Introduction-Ops_practices.md b/manuscript/01.4-Introduction-Ops_practices.md deleted file mode 100644 index 6d0265c..0000000 --- a/manuscript/01.4-Introduction-Ops_practices.md +++ /dev/null @@ -1,164 +0,0 @@ - -## Ops Tools & Principles - -MLOps integrates a range of DevOps techniques and tools to enhance the development and deployment of machine learning models. By promoting cooperation between development and operations teams, MLOps strives to improve communication, enhance efficiency, and reduce delays in the development process. Advanced version control systems can be employed to achieve these objectives. - -Automation plays a significant role in achieving these goals. For instance, CI/CD pipelines streamline repetitive tasks like building, testing, and deploying software. The management of infrastructure can also be automated, by using infrastructure as code to facilitate an automated provisioning, scaling, and management of infrastructure. - -To enhance flexibility and scalability in the operational process, containers and microservices are used to package and deploy software. Finally, monitoring and logging tools are used to track the performance of deployed and containerized software and address any issues that arise. - - -### Containerization - -Containerization is an essential component in operations as it enables deploying and running applications in a standardized, portable, and scalable way. This is achieved by packaging an application and its dependencies into a container image, which contains all the necessary code, runtime, system tools, libraries, and settings needed to run the application, isolated from the host operating system. Containers are lightweight, portable, and can run on any platform that supports containerization, such as Docker or Kubernetes. - -All of this makes them beneficial compared to deploying an application on a virtual machine or traditionally directly on a machine. Virtual machines would emulate an entire computer system and require a hypervisor to run, which introduces additional overhead. Similarly, a traditional deployment involves installing software directly onto a physical or virtual machine without the use of containers or virtualization. Not to mention the lack of portability of both. - -![](./images/01-Introduction/ops-containerization.drawio.svg) - -The concept of container images is analogous to shipping containers in the physical world. Like shipping containers can be loaded with different types of cargo, a container image can be used to create different containers with various applications and configurations. Both the physical containers and container images are standardized, just like blueprints, enabling multiple operators to work with them. This allows for the deployment and management of applications in various environments and cloud platforms, making containerization a versatile solution. - -Containerization offers several benefits for MLOps teams. By packaging the machine learning application and its dependencies into a container image, reproducibility is achieved, ensuring consistent results across different environments and facilitating troubleshooting. Containers are portable which enables easy movement of machine learning applications between various environments, including development, testing, and production. 
Scalability is also a significant advantage of containerization, as scaling up or down compute resources in an easy fashion allows to handle large-scale machine learning workloads and adjust to changing demand quickly. Additionally, containerization enables version control of machine learning applications and their dependencies, making it easier to track changes, roll back to previous versions, and maintain consistency across different environments. To effectively manage model versions, simply saving the code into a version control system is insufficient. It's crucial to include an accurate description of the environment, which encompasses Python libraries, their versions, system dependencies, and more. Virtual machines (VMs) can provide this description, but container images have become the preferred industry standard due to their lightweight nature. -Finally, containerization facilitates integration with other DevOps tools and processes, such as CI/CD pipelines, enhancing the efficiency and effectiveness of MLOps operations. - - - - -### Version Control - -Version control is a system that records changes to a file or set of files over time, to be able to recall specific versions later. It is an essential tool for any software development project as it allows multiple developers to work together, track changes, and easily rollback in case of errors. There are two main types of version control systems: centralized and distributed. - -1. Centralized Version Control Systems (CVCS) : In a centralized version control system, there is a single central repository that contains all the versions of the files, and developers must check out files from the repository in order to make changes. Examples of CVCS include Subversion and Perforce. - -2. Distributed Version Control Systems (DVCS) : In a distributed version control system, each developer has a local copy of the entire repository, including all the versions of the files. This allows developers to work offline, and it makes it easy to share changes with other developers. Examples of DVCS include Git, Mercurial and Bazaar - -Version control is a vital component of software development that offers several benefits. First, it keeps track of changes made to files, enabling developers to revert to a previous version in case something goes wrong. Collaboration is also made easier with version control, as it allows multiple developers to work on a project simultaneously and share changes with others. In addition, it provide backup capabilities by keeping a history of all changes, allowing you to retrieve lost files. Version control also allows auditing of changes, tracking who made a specific change, when, and why. Finally, it enables developers to create branches of a project, facilitating simultaneous work on different features without affecting the main project, with merging later. - -Versioning all components of a machine learning project, such as code, data, and models, is essential for reproducibility and managing models in production. While versioning code-based components is similar to typical software engineering projects, versioning machine learning models and data requires specific version control systems. There is no universal standard for versioning machine learning models, and the definition of "a model" can vary depending on the exact setup and tools used. - -Popular tools such as Azure ML, AWS Sagemaker, Kubeflow, and MLflow offer their own mechanisms for model versioning. 
For data versioning, there are tools like Data Version Control (DVC) and Git Large File Storage (LFS). The de-facto standard for code versioning is Git. The code-versioning system Github is used for this project, which will be depicted in more detail in the following. - -#### Github - -GitHub provides a variety of branching options to enable flexible collaboration workflows. Each branch serves a specific purpose in the development process, and using them effectively can help teams collaborate more efficiently and effectively. - -![](./images/01-Introduction/ops-version-control.drawio.svg) - -*Main Branch:* The main branch is the default branch in a repository. It represents the latest stable version and production-ready state of a codebase, and changes to the code are merged into the main branch as they are completed and tested. -*Feature Branch:* A feature branch is used to develop a new feature or functionality. It is typically created off the main branch, and once the feature is completed, it can be merged back into the main branch. -*Hotfix Branch:* A hotfix branch is used to quickly fix critical issues in the production code. It is typically created off the main branch, and once the hotfix is completed, it can be merged back into the main branch. -*Release Branch:* A release branch is a separate branch that is created specifically for preparing a new version of the software for release. Once all the features and fixes for the new release have been added and tested, the release branch is merged back into the main branch, and a new version of the software is tagged and released. - -#### Git lifecycle - -After a programmer has made changes to their code, they would typically use Git to manage those changes through a series of steps. First, they would use the command `git status` to see which files have been changed and are ready to be committed. They would then stage the changes they want to include in the commit using the command `git add `, followed by creating a new commit with a message describing the changes using `git commit -m "MESSAGE"`. - -After committing changes locally, the programmer may want to share those changes with others. They would do this by pushing their local commits to a remote repository using the command `git push`. Once the changes are pushed, others can pull those changes down to their local machines and continue working on the project by using the command `git pull`. - -![](./images/01-Introduction/ops-git-commands.png) - -If the programmer is collaborating with others, they may need to merge their changes with changes made by others. This can be done using the `git merge ` command, which combines two branches of development history. The programmer may need to resolve any conflicts that arise during the merge. - -If the programmer encounters any issues or bugs after pushing their changes, they can use Git to revert to a previous version of the code by checking out an older commit using the command git checkout. Git's ability to track changes and revert to previous versions makes it an essential tool for managing code in collaborative projects. - -While automating the code review process is generally viewed as advantageous, it is still typical to have a manual code review as the final step before approving a pull or merge request to be merged into the main branch. It is considered a best practice to mandate a manual approval from one or more reviewers who are not the authors of the code changes. 
- - -### CI/CD - -Continuous Integration (CI) and Continuous Delivery / Continuous Delivery (CD) are related software development practices that work together to automate and streamline the software development and deployment process of code changes to production. Deploying new software and models without CI/CD often requires a lot of implicit knowledge and manual steps. - -![](./images/01-Introduction/ops-ci-cd.drawio.svg) - -1. *Continuous Integration (CI)*: is a software development practice that involves frequently integrating code changes into a shared central repository. The goal of CI is to catch and fix integration errors as soon as they are introduced, rather than waiting for them to accumulate over time. This is typically done by running automated tests and builds, to catch any errors that might have been introduced with new code changes, for example when merging a Git feature branch into the main branch. - -2. *Continuous Delivery (CD)*: is the practice that involves automating the process of building, testing, and deploying software to a production-like environment. The goal is to ensure that code changes can be safely and quickly deployed to production. This is typically done by automating the deployment process and by testing the software in a staging environment before deploying it to production. - -3. *Continuous Deployment (CD):* is the practice of automatically deploying code changes to production once they pass automated tests and checks. The goal is to minimize the time it takes to get new features and bug fixes into the hands of end-users. In this process, the software is delivered directly to the end-user without manual testing and verification. - -The terms *Continuous Delivery* and *Continuous Deployment* are often used interchangeably, but they have distinct meanings. Continuous delivery refers to the process of building, testing, and running software on a production-like environment, while continuous deployment refers specifically to the process of running the new version on the production environment itself. However, fully automated deployments may not always be desirable or feasible, depending on the organization's business needs and the complexity of the software being deployed. While continuous deployment builds on continuous delivery, the latter can offer significant value on its own. - -CI/CD integrates the principles of continuous integration and continuous delivery in a seamless workflow, allowing teams to catch and fix issues early and quickly deliver new features to users. The pipeline is often triggered by a code commit. Ideally, a Data Scientist would push the changes made to the code at each incremental step of development to a share repository, including metadata and documentation. This code commit would trigger the CI/CD pipeline to build, test, package, and deploy the model software. In contrast to the local development, the CI/CD steps will test the model changes on the full dataset and aiming to deploy for production. - -CI and CD practices help to increase the speed and quality of software development, by automating repetitive tasks and catching errors early, reducing the time and effort required to release new features, and increasing the stability of the deployed software. 
Examples for CI/CD Tools that enable automated testing with already existing build servers are for example GitHub Actions, Gitlab CI/CD, AWS Code Build, or Azure DevOps - -The following code snippet shows an exemplary GitHub Actions pipeline to test, build and push a Docker image to the DockerHub registry. The code is structured in three parts. -At first, the environment variables are defined under `env`. Two variables are defined here which are later called with by the command `env.VARIABLE`. -The second part defines when the pipeline is or should be triggered. The exampele shows three possibilites to trigger a pipelines, when pushing on the master branch `push`, when a pull request to the master branch is granted `pull_request`, or when the pipeline is triggered manually via the Github interface `workflow_dispatch`. -The third part of the code example introduces the actual jobs and steps performed by the pipeline. The pipeline consists of two jobs `pytest` and `docker`. The first represents the CI part of the pipeline. The run environment of the job is set up and the necessary requirements are installed. Afterward unit tests are run using the pytest library. If the `pytest` job was successful, the `docker` job will be triggered. The job builds the Dockerfile and pushes it automatically to the specified Dockerhub repository specified in `tags`. The step introduces another variable just like the `env.Variable` before, the `secrets.`. Secrets are a way by Github to safely store classified information like username and passwords. They can be set up using the Github Interface and used in the Github Actions CI using `secrets.SECRET-NAME`. - -```yaml -name: Docker CI base - -env: - DIRECTORY: base - DOCKERREPO: seblum/mlops-public - -on: - push: - branches: master - paths: $DIRECTORY/** - pull_request: - branches: [ master ] - workflow_dispatch: - -jobs: - pytest: - runs-on: ubuntu-latest - defaults: - run: - working-directory: ./${{ env.DIRECTORY }} - steps: - - uses: actions/checkout@v3 - - name: Set up Python - uses: actions/setup-python@v4 - with: - python-version: '3.x' - - name: Install dependencies - run: | - python -m pip install --upgrade pip - pip install -r requirements.txt - pip install pytest - pip install pytest-cov - - name: Test with pytest - run: | - pytest test_app.py --doctest-modules --junitxml=junit/test-results.xml --cov=com --cov-report=xml --cov-report=html - docker: - needs: pytest - runs-on: ubuntu-latest - steps: - - name: Set up QEMU - uses: docker/setup-qemu-action@v2 - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v2 - - name: Login to DockerHub - uses: docker/login-action@v2 - with: - username: ${{ secrets.DOCKERHUB_USERNAME }} - password: ${{ secrets.DOCKERHUB_TOKEN }} - - name: Build and push - uses: docker/build-push-action@v3 - with: - file: ./${{ env.DIRECTORY }}/Dockerfile - push: true - tags: ${{ env.DOCKERREPO }}:${{ env.DIRECTORY }} -``` - - -### Infrastructure as code - -Infrastructure as Code (IaC) is a software engineering approach that enables the automation of infrastructure provisioning and management using machine-readable configuration files rather than manual processes or interactive interfaces. - -This means that the infrastructure is defined using code, instead of manually setting up servers, networks, and other infrastructure components. This code can be version controlled, tested, and deployed just like any other software code. 
It also allows to automate the process of building and deploying infrastructure resources, enabling faster and more reliable delivery of services, as well as ensuring to provide the same environment every time. It also comes with the benefit of an increased scalability, improved security, and better visibility into infrastructure changes. - -It is recommended to utilize infrastructure-as-code to deploy an MLOps platform. Popular tools for implementing IaC are for example Terraform, CloudFormation, and Ansible. Chapter 6 gives a more detailed description and a tutorial on how to use Infrastructure as code using *Terraform*. diff --git a/manuscript/01.5-Introduction-MLOps_Engineering_with_Airflow.Rmd b/manuscript/01.5-Introduction-MLOps_Engineering_with_Airflow.Rmd deleted file mode 100644 index 9e2b0b5..0000000 --- a/manuscript/01.5-Introduction-MLOps_Engineering_with_Airflow.Rmd +++ /dev/null @@ -1,27 +0,0 @@ -## MLOps Engineering with Airflow and MLflow on Kubernetes - -MLOps platforms can be set up in various ways to apply MLOps practices to the machine learning workflow. -(1) SaaS tools provide an integrated development and management experience, with an aim to offer an end-to-end process. (2) Custom-made platforms offer high flexibility and can be tailored to specific needs. However, integrating multiple different services requires significant engineering effort. (3) Many cloud providers offer a mix of SaaS and custom-tailored platforms, providing a relatively well-integrated experience while remaining open enough to integrate other services. - - This project involves building a custom-tailored MLOps platform focused on MLOps engineering, as the entire infrastructure will be set up from scratch. An exemplary MLOps platform will be developed using Airflow and MLflow for management during the machine learning lifecycle and JupyterHub to provide a development environment. - - Even though there are workflow tools better designed for machine learning pipelines, for example Kubeflow Pipelines, Airflow and MLflow can leverage and an combine there functionalities to provide similar capabilites. Airflow provides the workflow management for the platform whilst MLflow is used for machine learning tracking. MLflow further allow to register each model effortlessly. As an MLOps plattform should also provide an environment to develop machine learning model code, JupyterHub will be deployed to be able to develop code in the cloud and without the need for a local setup. The coding environment will synchronize with Airflow's DAG repository to seamlessly integrate the defined models within the workflow management. -Airflow and MLflow are very flexible with their running environment and their stack would be very suitable for small scale systems, where there is no need for a setup maintaining a Kubernetes cluster. While it would be possible to run anything on a docker/docker-compose setup, this work will scale the mentioned tools to a Kubernetes cluster in the cloud to fully enable the concept of an MLOps plattform. -The infrastructure will be maintained using the Infrastructure as Code tool *Terraform*, and incorporate best Ops practices such as CI/CD and automation. The project will also incorporate the work done by data and machine learning scientists since basic machine learning models will be implemented and run on the platform. - - -![](images/01-Introduction/airflow-on-eks-basic.drawio.svg) - -The following chapters give an introductory tutorial on each of the previously introduced tools. 
A machine learning workflow using Airflow is set up on the deployed infrastructure, including data preprocessing, model training, and model deployment, as well as tracking the experiment and deploying the model into production using MLFlow.
-
-The necessary AWS infrastructure is set up using Terraform. This includes creating an AWS EKS cluster and the associated ressources like a virtual private cloud (VPC), subnets, security groups, IAM roles, as well as further AWS ressources needed to deploy Airflow and MLflow.
-Once the EKS cluster is set up, Kubernetes can be used to deploy and manage applications on the cluster. Helm, a package manager for Kubernetes, is used to manage the deployment of Airflow and MLflow. The EKS cluster allows for easy scalability and management of the platforms. The code is made public on a Github repository and Github Actions is used for automating the deployment of the infrastructure using CI/CD principles.
-
-Once the infrastructure is set up, machine learning models can be deployed to the EKS cluster as Kubernetes pods, using Airflows scheduling processes. Airflow's ability to scan local directories or Git repositories will be used to import the relevant machine learning code from second Github repository.
-Similarly, to building Airflow workflows, the machine learning code will also include using the MLFlow API to allow for model tracking. Github Actions is used as a CI/CD pipeline to automatically build, test, and deploy machine learning models to this repository similarly as it is used in the repository for the infrastructure.
-
-
-
-Whereas the deployment of the infrastructure would be taken care of by MLOps-, DevOps-, and Data Engineers, the development of the Airflow workflows including MLFlow would be taken care of by Data Scientist and ML Engineers.
diff --git a/manuscript/02-Overview_about_book_tutorials.Rmd b/manuscript/02-Overview_about_book_tutorials.Rmd
new file mode 100644
index 0000000..3dcfb2d
--- /dev/null
+++ b/manuscript/02-Overview_about_book_tutorials.Rmd
@@ -0,0 +1,5 @@
+# Overview about book tutorials
+
+The book contains two sections with distinct focuses. The first section comprises Chapters 3 to 6, which consist of tutorials on the aforementioned tools. These chapters also serve as prerequisites for the subsequent sections. Among these tutorials, the chapters dedicated to *Airflow* and *MLflow* are oriented towards Data Scientists, providing insights into their usage. The chapters centered around *Kubernetes* and *Terraform* target Data- and MLOps Engineers, offering detailed guidance on deploying and managing these tools.
+
+The second section, comprising Chapters 7 to 9, delves into an exemplary ML Platform. This section demands a strong background in engineering due to its complexity. While these chapters cover the essential tools introduced in the previous section, they may not explore certain intricate aspects, such as OAuth authentication and networking details, in great depth. Moreover, it is crucial to note that the ML Platform example presented is not intended for production deployment, as significant security concerns would first need to be addressed. Instead, its main purpose is to serve as an informative illustration of ML platforms and MLOps engineering principles.
diff --git a/manuscript/02-Project-MLOps_Engineering_with_Airflow.Rmd b/manuscript/07-ML-Project_Design.Rmd
similarity index 96%
rename from manuscript/02-Project-MLOps_Engineering_with_Airflow.Rmd
rename to manuscript/07-ML-Project_Design.Rmd
index 4fdf6ab..a949592 100644
--- a/manuscript/02-Project-MLOps_Engineering_with_Airflow.Rmd
+++ b/manuscript/07-ML-Project_Design.Rmd
@@ -1,6 +1,6 @@
-# MLOps Engineering with Airflow and MLflow on Kubernetes
+# ML Platform Design
 
-MLOps platforms can be set up in various ways to apply MLOps practices to the machine learning workflow.
+ML platforms can be set up in various ways to apply MLOps practices to the machine learning workflow.
 (1) SaaS tools provide an integrated development and management experience, with an aim to offer an end-to-end process. (2) Custom-made platforms offer high flexibility and can be tailored to specific needs. However, integrating multiple different services requires significant engineering effort. (3) Many cloud providers offer a mix of SaaS and custom-tailored platforms, providing a relatively well-integrated experience while remaining open enough to integrate other services.
 
 This project involves building a custom-tailored MLOps platform focused on MLOps engineering, as the entire infrastructure will be set up from scratch. An exemplary MLOps platform will be developed using Airflow and MLflow for management during the machine learning lifecycle and JupyterHub to provide a development environment.
diff --git a/manuscript/08-Deployment-Infrastructure_Overview.md b/manuscript/08-Deployment-Infrastructure_Overview.md
index df932ba..a1d9159 100644
--- a/manuscript/08-Deployment-Infrastructure_Overview.md
+++ b/manuscript/08-Deployment-Infrastructure_Overview.md
@@ -1,5 +1,63 @@
-# Infrastructure Deployment
+# ML Platform Deployment
+> **_NOTE:_** The chapter discussing the deployment of an ML platform with Airflow and MLflow on AWS EKS, utilizing Terraform for deployment, is currently in the writing phase. The information provided in this disclaimer is based on the current state of knowledge up until July 2023. Thank you for your understanding and patience as I work on completing this chapter.
+
+The provided directory structure represents the Terraform project for managing the infrastructure of our ML platform. It follows a modular organization to promote reusability and maintainability of the codebase. The full codebase is also available and can be accessed on [GitHub](https://github.com/seblum/mlops-airflow-on-eks).
+
+```bash
+root
+│   main.tf
+│   variables.tf
+│   outputs.tf
+│   providers.tf
+│
+└── infrastructure
+│   │
+│   └── vpc
+│   │
+│   └── eks
+│   │
+│   └── networking
+│   │
+│   └── rds
+│
+└── modules
+    │
+    └── airflow
+    │
+    └── mlflow
+    │
+    └── jupyterhub
+```
+
+By structuring the Terraform project this way, it becomes easier to manage, scale, and maintain the infrastructure as the project grows. Each module can be independently developed, tested, and reused across different projects, promoting consistency and reducing duplication of code and effort.
+
+### Root {.unlisted .unnumbered}
+
+The *root* directory of the Terraform project contains the general configuration files related to the overall infrastructure setup.
+
+* The `main.tf` Terraform configuration file, where all major resources are defined and organized into modules.
+* The `variables.tf` file, containing the definitions of the input variables used throughout the project, allowing users to customize the infrastructure setup.
+* The `outputs.tf` file, defining the output variables that expose relevant information about the deployed infrastructure.
+* The `providers.tf` file, defining and configuring the providers used in the project, for example AWS, Kubernetes, and Helm; a minimal sketch of such a file is shown below.
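+
+As an illustration, a root `providers.tf` for this kind of setup could look roughly as follows. This is a sketch rather than the repository's actual file: the provider versions, the region, and the exec-based EKS authentication are assumptions, while the `module.eks` outputs correspond to the EKS module used later in this chapter.
+
+```javascript
+# Hypothetical sketch of a root providers.tf; versions, region, and
+# authentication wiring are illustrative assumptions.
+terraform {
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = "~> 4.0"
+    }
+    kubernetes = {
+      source  = "hashicorp/kubernetes"
+      version = "~> 2.20"
+    }
+    helm = {
+      source  = "hashicorp/helm"
+      version = "~> 2.9"
+    }
+  }
+}
+
+provider "aws" {
+  region = "eu-central-1"
+}
+
+# The Kubernetes and Helm providers authenticate against the EKS
+# cluster created by the eks module, using a token fetched via the
+# AWS CLI.
+provider "kubernetes" {
+  host                   = module.eks.cluster_endpoint
+  cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
+
+  exec {
+    api_version = "client.authentication.k8s.io/v1beta1"
+    command     = "aws"
+    args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
+  }
+}
+
+provider "helm" {
+  kubernetes {
+    host                   = module.eks.cluster_endpoint
+    cluster_ca_certificate = base64decode(module.eks.cluster_certificate_authority_data)
+
+    exec {
+      api_version = "client.authentication.k8s.io/v1beta1"
+      command     = "aws"
+      args        = ["eks", "get-token", "--cluster-name", module.eks.cluster_name]
+    }
+  }
+}
+```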
+
+### Infrastructure {.unlisted .unnumbered}
+
+The *infrastructure* directory holds the individual modules responsible for provisioning specific components of the AWS Cloud and EKS setup.
+
+* `vpc` defines a module that configures resources related to the Virtual Private Cloud (VPC), such as subnets, route tables, and internet gateways.
+* The `eks` module is responsible for creating and configuring an Amazon Elastic Kubernetes Service (EKS) cluster, including worker nodes and other related resources like the Cluster Autoscaler, Elastic Block Storage, or Elastic File System.
+* `networking` contains networking components that provide access to the cluster using ingresses and DNS records, for example the AWS Application Load Balancer or an External DNS.
+* The `rds` module provides resources to deploy an Amazon Relational Database Service (RDS), such as database instances, subnets, and security groups. This module is needed for the specific tools and components of our ML platform.
+
+
+### Modules {.unlisted .unnumbered}
+
+The *modules* directory contains Terraform modules that are specific to setting up our ML platform and provide the components that integrate the MLOps framework, such as tools for model tracking (MLflow), workflow management (Airflow), or an integrated development environment (JupyterHub).
+
+* `airflow` provides the Terraform module to deploy an Apache Airflow instance based on the Helm provider, which enables us to orchestrate our ML workflows. The module is highly customized as it sets up necessary connections to other services, sets Airflow variables that can be used by Data Scientists, creates an ingress resource, and enables user management and authentication using GitHub.
+* The `mlflow` module sets up MLflow for managing machine learning experiments and models. As MLflow does not natively provide a solution to deploy on Kubernetes, a custom Helm deployment is integrated that configures the necessary deployment, services, and ingress resources.
+* `jupyterhub` deploys a JupyterHub environment via Helm that provides a multi-user notebook environment, suitable for collaborative data science and machine learning work. The Helm chart is highly customized, providing user management and authentication via GitHub, provisioning ingress resources, and cloning a custom GitHub repository that contains all our Data Science and Machine Learning code.
-The chapter discussing the deployment of an ML platform with Airflow and MLflow on AWS EKS, utilizing Terraform for deployment, is currently in the writing phase. The information provided in this disclaimer is based on the current state of knowledge up until June 20203. Thank you for your understanding and patience as I work on completing this chapter.
diff --git a/manuscript/08.1-Deployment_Infrastructure_Root.md b/manuscript/08.1-Deployment_Infrastructure_Root.md
new file mode 100644
index 0000000..08e4725
--- /dev/null
+++ b/manuscript/08.1-Deployment_Infrastructure_Root.md
@@ -0,0 +1,133 @@
+
+## Root directory module
+
+The root directory of the Terraform infrastructure consists of the main module, which calls other submodules that deploy specific infrastructure settings or tools.
+This provides an overview of the whole deployment in one place.
At first, the necessary cluster infrastructure is deployed, such as the `vpc` and the `eks` cluster itself. Afterward, the custom tools to be run on EKS are deployed, such as `airflow`, `mlflow`, and `jupyterhub`.
+
+The following takes a more detailed look at the call of the `airflow` module, although much of it also applies to the other modules.
+The module call is structured in three sections. First, general information about the module is given, such as the `name` of the module or the `cluster_name`, as well as more specific variables needed for specific Terraform calls in the module, like `cluster_endpoint`.
+Terraform does not provide the functionality to *activate* or *deactivate* a module by itself. As this is a useful feature, a custom workaround is used by setting the count of a module as follows: `count = var.deploy_airflow ? 1 : 0`. This sets the count of the module to `0` or `1`, depending on the `var.deploy_airflow` variable. This workaround is applied to all custom modules; a sketch of the corresponding flag declarations is shown below.
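+
+The feature flags themselves are declared as input variables in the root `variables.tf`. The following is a minimal sketch of how they might look; the variable names match the flags used in `main.tf` below, while the defaults and descriptions are assumptions for illustration:
+
+```javascript
+# Sketch of the deployment flags in the root variables.tf. The names
+# match the flags used in main.tf; defaults and descriptions are assumed.
+variable "deploy_airflow" {
+  description = "Whether to deploy the Airflow module"
+  type        = bool
+  default     = true
+}
+
+variable "deploy_mlflow" {
+  description = "Whether to deploy the MLflow module"
+  type        = bool
+  default     = true
+}
+
+variable "deploy_jupyterhub" {
+  description = "Whether to deploy the JupyterHub module"
+  type        = bool
+  default     = true
+}
+```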
+
+Secondly, as Airflow needs access to an RDS database, the RDS module is called. Therefore, the relevant information to create the RDS instance with the correct settings, like `vpc_id`, `rds_engine`, or `storage_type`, needs to be passed along.
+
+Third, variable values for the Airflow Helm chart are passed to the module. Using Helm makes the deployment of Airflow straightforward. Since the deployment is customized, for example with a connection to the Airflow DAG repository on GitHub, it is necessary to specify this information beforehand and to integrate it into the deployment.
+
+```javascript
+locals {
+  cluster_name = "${var.name_prefix}-eks"
+  vpc_name = "${var.name_prefix}-vpc"
+  port_airflow = var.port_airflow
+  port_mlflow = var.port_mlflow
+  mlflow_s3_bucket_name = "${var.name_prefix}-mlflow-bucket"
+  force_destroy_s3_bucket = true
+  storage_type = "gp2"
+  max_allocated_storage = var.max_allocated_storage
+  airflow_github_ssh = var.airflow_github_ssh
+  git_username = var.git_username
+  git_token = var.git_token
+  git_repository_url = var.git_repository_url
+  git_branch = var.git_branch
+}
+
+data "aws_caller_identity" "current" {}
+
+
+# INFRASTRUCTURE
+module "vpc" {
+  source = "./infrastructure/vpc"
+  cluster_name = local.cluster_name
+  vpc_name = local.vpc_name
+}
+
+module "eks" {
+  source = "./infrastructure/eks"
+  cluster_name = local.cluster_name
+  eks_cluster_version = "1.23"
+  vpc_id = module.vpc.vpc_id
+  private_subnets = module.vpc.private_subnets
+  security_group_id_one = [module.vpc.worker_group_mgmt_one_id]
+  security_group_id_two = [module.vpc.worker_group_mgmt_two_id]
+  depends_on = [
+    module.vpc
+  ]
+}
+
+# CUSTOM TOOLS
+module "airflow" {
+  count = var.deploy_airflow ? 1 : 0
+  source = "./modules/airflow"
+  name = "airflow"
+  cluster_name = local.cluster_name
+  cluster_endpoint = module.eks.cluster_endpoint
+
+  # RDS
+  vpc_id = module.vpc.vpc_id
+  private_subnets = module.vpc.private_subnets
+  private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks
+  rds_port = local.port_airflow
+  rds_name = "airflow"
+  rds_engine = "postgres"
+  rds_engine_version = "13.3"
+  rds_instance_class = "db.t3.micro"
+  storage_type = local.storage_type
+  max_allocated_storage = local.max_allocated_storage
+
+  # HELM
+  helm_chart_repository = "https://airflow-helm.github.io/charts"
+  helm_chart_name = "airflow"
+  helm_chart_version = "8.6.1"
+  git_username = local.git_username
+  git_token = local.git_token
+  git_repository_url = local.git_repository_url
+  git_branch = local.git_branch
+
+  depends_on = [
+    module.eks
+  ]
+}
+
+
+module "mlflow" {
+  count = var.deploy_mlflow ? 1 : 0
+  source = "./modules/mlflow"
+  name = "mlflow"
+  mlflow_s3_bucket_name = local.mlflow_s3_bucket_name
+  s3_force_destroy = local.force_destroy_s3_bucket
+
+  # RDS
+  vpc_id = module.vpc.vpc_id
+  private_subnets = module.vpc.private_subnets
+  private_subnets_cidr_blocks = module.vpc.private_subnets_cidr_blocks
+  rds_port = local.port_mlflow
+  rds_name = "mlflow"
+  rds_engine = "mysql"
+  rds_engine_version = "8.0.30"
+  rds_instance_class = "db.t3.micro"
+  storage_type = local.storage_type
+  max_allocated_storage = local.max_allocated_storage
+
+  depends_on = [
+    module.eks
+  ]
+}
+
+
+module "jupyterhub" {
+  count = var.deploy_jupyterhub ? 1 : 0
+  source = "./modules/jupyterhub"
+  name = "jupyterhub"
+  cluster_name = local.cluster_name
+  cluster_endpoint = module.eks.cluster_endpoint
+
+  # HELM
+  helm_chart_repository = "https://jupyterhub.github.io/helm-chart/"
+  helm_chart_name = "jupyterhub"
+  helm_chart_version = "2.0.0"
+
+  depends_on = [
+    module.eks
+  ]
+}
+
+
+```
\ No newline at end of file
diff --git a/manuscript/08.2-Deployment-Infrastructure_Essentials.md b/manuscript/08.2-Deployment-Infrastructure_Essentials.md
new file mode 100644
index 0000000..4867a60
--- /dev/null
+++ b/manuscript/08.2-Deployment-Infrastructure_Essentials.md
@@ -0,0 +1,849 @@
+## Infrastructure
+
+The subdirectory `infrastructure` consists of four main modules, `vpc`, `eks`, `networking`, and `rds`. The former three are responsible for creating the cluster itself, as well as the necessary tools to implement the platform functionalities. The `rds` module is merely an extension linked to the cluster, needed to store the data of tools like Airflow or MLflow. The `rds` module is therefore called from the corresponding modules where an AWS RDS is needed, even though the module is placed in the infrastructure directory.
+
+### Virtual Private Cloud
+
+The provided code in the `vpc` module establishes a Virtual Private Cloud (VPC) with associated subnets and security groups. It configures the required networking and security infrastructure to serve as the foundation for deploying an AWS EKS cluster.
+
+The VPC is created using the `terraform-aws-modules/vpc/aws` module version 5.0.0. The VPC is assigned the IPv4 CIDR block `"10.0.0.0/16"` and spans the three available AWS availability zones within the specified region `eu-central-1`. It includes both public and private subnets, with the private subnets routed through a single NAT gateway for internet access. DNS hostnames are enabled for the instances launched within the VPC.
+
+The VPC subnets are tagged with specific metadata relevant to Kubernetes cluster management.
The public subnets are tagged with `"kubernetes.io/cluster/${local.cluster_name}"` set to `"shared"` and `"kubernetes.io/role/elb"` set to `1`. The private subnets are tagged with `"kubernetes.io/cluster/${local.cluster_name}"` set to `"shared"` and `"kubernetes.io/role/internal-elb"` set to `1`.
+
+Additionally, three security groups are defined to provide secure management access to the worker nodes within the EKS cluster. Two of these security groups, `"worker_group_mgmt_one"` and `"worker_group_mgmt_two"`, allow SSH access from specific CIDR blocks. The third security group, `"all_worker_mgmt"`, allows SSH access from multiple CIDR blocks, including `"10.0.0.0/8"`, `"172.16.0.0/12"`, and `"192.168.0.0/16"`.
+
+```javascript
+locals {
+  cluster_name = var.cluster_name
+}
+
+data "aws_availability_zones" "available" {}
+
+module "vpc" {
+  source  = "terraform-aws-modules/vpc/aws"
+  version = "5.0.0"
+
+  name = var.vpc_name
+
+  cidr = "10.0.0.0/16"
+  azs  = slice(data.aws_availability_zones.available.names, 0, 3)
+
+  private_subnets = ["10.0.1.0/24", "10.0.2.0/24", "10.0.3.0/24"]
+  public_subnets  = ["10.0.4.0/24", "10.0.5.0/24", "10.0.6.0/24"]
+
+  enable_nat_gateway   = true
+  single_nat_gateway   = true
+  enable_dns_hostnames = true
+
+  public_subnet_tags = {
+    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
+    "kubernetes.io/role/elb"                      = 1
+  }
+
+  private_subnet_tags = {
+    "kubernetes.io/cluster/${local.cluster_name}" = "shared"
+    "kubernetes.io/role/internal-elb"             = 1
+  }
+}
+
+resource "aws_security_group" "worker_group_mgmt_one" {
+  name_prefix = "worker_group_mgmt_one"
+  vpc_id      = module.vpc.vpc_id
+
+  ingress {
+    from_port = 22
+    to_port   = 22
+    protocol  = "tcp"
+
+    cidr_blocks = [
+      "10.0.0.0/8",
+    ]
+  }
+}
+
+resource "aws_security_group" "worker_group_mgmt_two" {
+  name_prefix = "worker_group_mgmt_two"
+  vpc_id      = module.vpc.vpc_id
+
+  ingress {
+    from_port = 22
+    to_port   = 22
+    protocol  = "tcp"
+
+    cidr_blocks = [
+      "192.168.0.0/16",
+    ]
+  }
+}
+
+resource "aws_security_group" "all_worker_mgmt" {
+  name_prefix = "all_worker_management"
+  vpc_id      = module.vpc.vpc_id
+
+  ingress {
+    from_port = 22
+    to_port   = 22
+    protocol  = "tcp"
+
+    cidr_blocks = [
+      "10.0.0.0/8",
+      "172.16.0.0/12",
+      "192.168.0.0/16",
+    ]
+  }
+}
+```
+
+### Elastic Kubernetes Service
+
+The provided Terraform code sets up an AWS EKS (Elastic Kubernetes Service) cluster with specific configurations and multiple node groups. The `"eks"` module is used to create the EKS cluster, specifying its name and version. The cluster has public and private access endpoints enabled, and a managed AWS authentication configuration. The creation of the `vpc` module is a prerequisite for the `"eks"` module, as the latter requires information like the `vpc_id` or `subnet_ids` for a successful creation.
+
+The EKS cluster itself is composed of three managed node groups: `"group_t3_small"`, `"group_t3_medium"`, and `"group_t3_large"`. Each node group uses a different instance type (`t3.small`, `t3.medium`, and `t3.large`) and has specific scaling policies. All three node groups have auto-scaling enabled. The node group `"group_t3_medium"` sets the minimum and desired number of nodes to `4`, which ensures a base amount of nodes, and thus resources, to manage further deployments. The `"group_t3_large"` node group is tainted with `NoSchedule`. It can be used for more resource-intensive tasks by specifying a matching toleration on a pod, as sketched below.
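+
+To illustrate the taint, consider a minimal, hypothetical pod definition that opts into the `"group_t3_large"` nodes. The resource name and container image are assumptions made for this sketch; only the `role` label and the `dedicated:t3_large` taint come from the node group definition above:
+
+```javascript
+# Hypothetical pod that tolerates the NoSchedule taint of the t3.large group
+# and explicitly selects those nodes via the "role" label.
+resource "kubernetes_pod" "resource_intensive_task" {
+  metadata {
+    name = "resource-intensive-task"
+  }
+  spec {
+    node_selector = {
+      role = "t3_large"
+    }
+    toleration {
+      key      = "dedicated"
+      operator = "Equal"
+      value    = "t3_large"
+      effect   = "NoSchedule"
+    }
+    container {
+      name  = "task"
+      image = "python:3.10-slim" # placeholder image
+    }
+  }
+}
+```
+
+Without the toleration, the scheduler would refuse to place the pod on the tainted nodes; without the node selector, it might still land on one of the other groups.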
+
+The `eks` module also deploys several Kubernetes add-ons, including `coredns`, `kube-proxy`, `aws-ebs-csi-driver`, and `vpc-cni`. The `vpc-cni` add-on is configured with specific environment settings, enabling prefix delegation for IP addresses.
+
+- `CoreDNS` provides DNS-based service discovery, allowing pods and services to communicate with each other using domain names, and thus enabling seamless communication within the cluster without the need for explicit IP addresses.
+- `kube-proxy` is responsible for network proxying on Kubernetes nodes, ensuring that network traffic is properly routed to the appropriate pods, services, and endpoints. It allows for seamless communication between different parts of the cluster.
+- `aws-ebs-csi-driver` (Container Storage Interface) is an add-on that enables Kubernetes pods to use Amazon Elastic Block Store (EBS) volumes for persistent storage, allowing data to be retained across pod restarts and ensuring data durability for stateful applications. The EBS configuration and deployment are described in the following subsection, but the respective `service_account_role_arn` is linked to the EKS cluster on creation.
+- `vpc-cni` (Container Network Interface) is essential for AWS EKS clusters, as it enables networking for pods using AWS VPC (Virtual Private Cloud) networking. It ensures that each pod gets an IP address from the VPC subnet and can communicate securely with other AWS resources within the VPC.
+
+```javascript
+locals {
+  cluster_name                         = var.cluster_name
+  cluster_namespace                    = "kube-system"
+  ebs_csi_service_account_name         = "ebs-csi-controller-sa"
+  ebs_csi_service_account_role_name    = "${var.cluster_name}-ebs-csi-controller"
+  autoscaler_service_account_name      = "autoscaler-controller-sa"
+  autoscaler_service_account_role_name = "${var.cluster_name}-autoscaler-controller"
+
+  nodegroup_t3_small_label  = "t3_small"
+  nodegroup_t3_medium_label = "t3_medium"
+  nodegroup_t3_large_label  = "t3_large"
+
+  eks_asg_tag_list_nodegroup_t3_small_label = {
+    "k8s.io/cluster-autoscaler/enabled" : true
+    "k8s.io/cluster-autoscaler/${local.cluster_name}" : "owned"
+    "k8s.io/cluster-autoscaler/node-template/label/role" : local.nodegroup_t3_small_label
+  }
+
+  eks_asg_tag_list_nodegroup_t3_medium_label = {
+    "k8s.io/cluster-autoscaler/enabled" : true
+    "k8s.io/cluster-autoscaler/${local.cluster_name}" : "owned"
+    "k8s.io/cluster-autoscaler/node-template/label/role" : local.nodegroup_t3_medium_label
+  }
+
+  eks_asg_tag_list_nodegroup_t3_large_label = {
+    "k8s.io/cluster-autoscaler/enabled" : true
+    "k8s.io/cluster-autoscaler/${local.cluster_name}" : "owned"
+    "k8s.io/cluster-autoscaler/node-template/label/role" : local.nodegroup_t3_large_label
+    "k8s.io/cluster-autoscaler/node-template/taint/dedicated" : "${local.nodegroup_t3_large_label}:NoSchedule"
+  }
+
+  tags = {
+    Owner = "terraform"
+  }
+}
+
+data "aws_caller_identity" "current" {}
+
+#
+# EKS
+#
+module "eks" {
+  source  = "terraform-aws-modules/eks/aws"
+  version = "19.5.1"
+
+  cluster_name              = local.cluster_name
+  cluster_version           = var.eks_cluster_version
+  cluster_enabled_log_types = ["api", "controllerManager", "scheduler"]
+
+  vpc_id     = var.vpc_id
+  subnet_ids = var.private_subnets
+
+  cluster_endpoint_private_access = true
+  cluster_endpoint_public_access  = true
+  manage_aws_auth_configmap       = true
+
+  # aws_auth_users = local.cluster_users # add users in later step
+
+  cluster_addons = {
+    coredns = {
+      most_recent = true
+    },
+    kube-proxy = {
+      most_recent = true
+    },
+    aws-ebs-csi-driver = {
service_account_role_arn = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.ebs_csi_service_account_role_name}" + }, + vpc-cni = { + most_recent = true + before_compute = true + service_account_role_arn = module.vpc_cni_irsa.iam_role_arn + configuration_values = jsonencode({ + env = { + # Reference docs https://docs.aws.amazon.com/eks/latest/userguide/cni-increase-ip-addresses.html + ENABLE_PREFIX_DELEGATION = "true" + WARM_PREFIX_TARGET = "1" + } + }) + } + + } + + eks_managed_node_group_defaults = { + ami_type = "AL2_x86_64" + disk_size = 10 + iam_role_attach_cni_policy = true + enable_monitoring = true + } + + eks_managed_node_groups = { + group_t3_small = { + name = "ng0_t3_small" + + instance_types = ["t3.small"] + + min_size = 0 + max_size = 6 + desired_size = 0 + capacity_type = "ON_DEMAND" + labels = { + role = local.nodegroup_t3_small_label + } + tags = { + "k8s.io/cluster-autoscaler/enabled" = "true" + "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned" + "k8s.io/cluster-autoscaler/node-template/label/role" = "${local.nodegroup_t3_small_label}" + } + } + group_t3_medium = { + name = "ng1_t3_medium" + + instance_types = ["t3.medium"] + + min_size = 4 + max_size = 6 + desired_size = 4 + capacity_type = "ON_DEMAND" + labels = { + role = local.nodegroup_t3_medium_label + } + tags = { + "k8s.io/cluster-autoscaler/enabled" = "true" + "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned" + "k8s.io/cluster-autoscaler/node-template/label/role" = "${local.nodegroup_t3_medium_label}" + } + } + group_t3_large = { + name = "ng2_t3_large" + + instance_types = ["t3.large"] + + min_size = 0 + max_size = 3 + desired_size = 0 + capacity_type = "ON_DEMAND" + labels = { + role = local.nodegroup_t3_large_label + } + taints = [ + { + key = "dedicated" + value = local.nodegroup_t3_large_label + effect = "NO_SCHEDULE" + } + ] + tags = { + "k8s.io/cluster-autoscaler/enabled" = "true" + "k8s.io/cluster-autoscaler/${local.cluster_name}" = "owned" + "k8s.io/cluster-autoscaler/node-template/label/role" = "${local.nodegroup_t3_large_label}" + "k8s.io/cluster-autoscaler/node-template/taint/dedicated" = "${local.nodegroup_t3_large_label}:NoSchedule" + } + } + } + tags = local.tags +} + +# Role for Service Account +module "vpc_cni_irsa" { + source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks" + version = "~> 5.0" + + role_name_prefix = "VPC-CNI-IRSA" + attach_vpc_cni_policy = true + vpc_cni_enable_ipv4 = true + + oidc_providers = { + main = { + provider_arn = module.eks.oidc_provider_arn + namespace_service_accounts = ["kube-system:aws-node"] + } + } +} +``` + +#### Elastic Block Store + +The EBS CSI controller (Elastic Block Store Container Storage Interface) is set up by defining an IAM (Identity and Access Management) role using the `"ebs_csi_controller_role"` module. The role allows the EBS CSI controller to assume a specific IAM role with OIDC (OpenID Connect) authentication, granting it the necessary permissions for EBS-related actions in the AWS environment by an IAM policy. The IAM policy associated with the role is created likewise and permits various EC2 actions, such as attaching and detaching volumes, creating and deleting snapshots, and describing instances and volumes. + +The code also configures the default Kubernetes StorageClass named `"gp2"` and annotates it as not the default storage class for the cluster, managing how storage volumes are provisioned and utilized in the cluster. 
Ensuring that the `"gp2"` StorageClass does not become the default storage class is needed as we additionally create an EFS (Elastic File System) storage, which is described in the next subsection.
+
+```javascript
+#
+# EBS CSI controller
+#
+module "ebs_csi_controller_role" {
+  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
+  version                       = "5.11.1"
+  create_role                   = true
+  role_name                     = local.ebs_csi_service_account_role_name
+  provider_url                  = replace(module.eks.cluster_oidc_issuer_url, "https://", "")
+  role_policy_arns              = [aws_iam_policy.ebs_csi_controller_sa.arn]
+  oidc_fully_qualified_subjects = ["system:serviceaccount:${local.cluster_namespace}:${local.ebs_csi_service_account_name}"]
+}
+
+resource "aws_iam_policy" "ebs_csi_controller_sa" {
+  name        = local.ebs_csi_service_account_name
+  description = "EKS ebs-csi-controller policy for cluster ${var.cluster_name}"
+
+  policy = jsonencode({
+    "Version" : "2012-10-17",
+    "Statement" : [
+      {
+        "Action" : [
+          "ec2:AttachVolume",
+          "ec2:CreateSnapshot",
+          "ec2:CreateTags",
+          "ec2:CreateVolume",
+          "ec2:DeleteSnapshot",
+          "ec2:DeleteTags",
+          "ec2:DeleteVolume",
+          "ec2:DescribeInstances",
+          "ec2:DescribeSnapshots",
+          "ec2:DescribeTags",
+          "ec2:DescribeVolumes",
+          "ec2:DetachVolume",
+        ],
+        "Effect" : "Allow",
+        "Resource" : "*"
+      }
+  ] })
+}
+
+resource "kubernetes_annotations" "ebs-no-default-storageclass" {
+  api_version = "storage.k8s.io/v1"
+  kind        = "StorageClass"
+  force       = "true"
+
+  metadata {
+    name = "gp2"
+  }
+  annotations = {
+    "storageclass.kubernetes.io/is-default-class" = "false"
+  }
+}
+```
+
+#### Elastic File System
+
+The EFS CSI (Elastic File System Container Storage Interface) driver permits EKS pods to use EFS as a persistent volume for data storage, enabling pods to use EFS as a scalable and shared storage solution. The driver itself is deployed using a Helm chart through the `"helm_release"` resource. An IAM role for the EFS CSI driver is needed as well. It is created using the `"attach_efs_csi_role"` module, which allows the driver to assume a role with OIDC authentication and grants the necessary permissions for working with EFS, similar to the EBS setup.
+
+For security, the code creates an AWS security group named `"allow_nfs"` that allows inbound NFS traffic on port 2049 from the private subnets of the VPC. This allows the EFS mount targets to communicate with the EFS file system securely. The EFS file system and its mount targets are created explicitly for each private subnet, mapping the `"aws_efs_mount_target"` resources to the `"aws_efs_file_system"` resource.
+
+Finally, the code defines a Kubernetes StorageClass named `"efs"` using the `"kubernetes_storage_class_v1"` resource. The StorageClass specifies the EFS CSI driver as the storage provisioner and the EFS file system created earlier as the backing storage. Additionally, the `"efs"` StorageClass is marked as the default storage class for the cluster using an annotation. This allows dynamic provisioning of EFS-backed persistent volumes for Kubernetes pods by default, simplifying the process of handling storage in the EKS cluster. This is used, for example, for the Airflow deployment in a later step.
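+
+As a brief, hypothetical illustration of what this default enables, a persistent volume claim like the following would be provisioned on EFS without naming the StorageClass explicitly; the claim name, namespace, and requested size are assumptions for this sketch:
+
+```javascript
+# Hypothetical PVC served by the default "efs" StorageClass defined below.
+# ReadWriteMany is the typical reason to prefer EFS over EBS here: many pods
+# can mount the same volume concurrently.
+resource "kubernetes_persistent_volume_claim" "shared_workspace" {
+  metadata {
+    name      = "shared-workspace"
+    namespace = "default"
+  }
+  spec {
+    access_modes = ["ReadWriteMany"]
+    resources {
+      requests = {
+        storage = "5Gi"
+      }
+    }
+  }
+}
+```
+
+The actual EFS setup of the `eks` module looks as follows.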
+ +```javascript +# +# EFS +# +resource "helm_release" "aws_efs_csi_driver" { + chart = "aws-efs-csi-driver" + name = "aws-efs-csi-driver" + namespace = "kube-system" + repository = "https://kubernetes-sigs.github.io/aws-efs-csi-driver/" + set { + name = "controller.serviceAccount.create" + value = true + } + set { + name = "controller.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn" + value = module.attach_efs_csi_role.iam_role_arn + } + set { + name = "controller.serviceAccount.name" + value = "efs-csi-controller-sa" + } +} + +module "attach_efs_csi_role" { + source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks" + role_name = "efs-csi" + attach_efs_csi_policy = true + oidc_providers = { + ex = { + provider_arn = module.eks.oidc_provider_arn + namespace_service_accounts = ["kube-system:efs-csi-controller-sa"] + } + } +} + +resource "aws_security_group" "allow_nfs" { + name = "allow nfs for efs" + description = "Allow NFS inbound traffic" + vpc_id = var.vpc_id + + ingress { + description = "NFS from VPC" + from_port = 2049 + to_port = 2049 + protocol = "tcp" + cidr_blocks = var.private_subnets_cidr_blocks + } + egress { + from_port = 0 + to_port = 0 + protocol = "-1" + cidr_blocks = ["0.0.0.0/0"] + ipv6_cidr_blocks = ["::/0"] + } +} + +resource "aws_efs_file_system" "stw_node_efs" { + creation_token = "efs-for-stw-node" +} + +resource "aws_efs_mount_target" "stw_node_efs_mt_0" { + file_system_id = aws_efs_file_system.stw_node_efs.id + subnet_id = var.private_subnets[0] + security_groups = [aws_security_group.allow_nfs.id] +} + +resource "aws_efs_mount_target" "stw_node_efs_mt_1" { + file_system_id = aws_efs_file_system.stw_node_efs.id + subnet_id = var.private_subnets[1] + security_groups = [aws_security_group.allow_nfs.id] +} + +resource "aws_efs_mount_target" "stw_node_efs_mt_2" { + file_system_id = aws_efs_file_system.stw_node_efs.id + subnet_id = var.private_subnets[2] + security_groups = [aws_security_group.allow_nfs.id] +} + +resource "kubernetes_storage_class_v1" "efs" { + metadata { + name = "efs" + annotations = { + "storageclass.kubernetes.io/is-default-class" = "true" + } + } + + storage_provisioner = "efs.csi.aws.com" + parameters = { + provisioningMode = "efs-ap" # Dynamic provisioning + fileSystemId = aws_efs_file_system.stw_node_efs.id # module.efs.id + directoryPerms = "777" + } + + mount_options = [ + "iam" + ] +} +``` + +#### Cluster Autoscaler + +The EKS Cluster Autoscaler ensures that the cluster can automatically scale its worker nodes based on the workload demands, ensuring optimal resource utilization and performance. + +The necessary IAM settings are set up prior to deploying the Autoscaler. First, an IAM policy named `"node_additional"` is created to grant permission to describe EC2 instances and related resources. This enables the Autoscaler to gather information about the current state of the worker nodes and make informed decisions regarding scaling. For each managed node group in the EKS cluster (defined by the `"eks_managed_node_groups"` module output), the IAM policy is attached to its corresponding IAM role. This ensures that all worker nodes have the required permissions to work with the Autoscaler. After setting up the IAM policies, tags are added to provide the necessary information for the EKS Cluster Autoscaler to identify and manage the Auto Scaling Groups effectively and to support cluster autoscaling from zero for each node group. 
The tags are created for each node group (`"nodegroup_t3_small"`, `"nodegroup_t3_medium"`, and `"nodegroup_t3_large"`) and are based on the specified tag lists defined in the `"local.eks_asg_tag_list_*"` variables.
+
+The EKS Cluster Autoscaler itself is instantiated using the custom `"eks_autoscaler"` module at the bottom of the code snippet. The module is called to set up the Autoscaler for the EKS cluster, and the required input variables are provided accordingly. Its components are described in detail in the following.
+
+```javascript
+#
+# EKS Cluster autoscaler
+#
+resource "aws_iam_policy" "node_additional" {
+  name        = "${local.cluster_name}-additional"
+  description = "${local.cluster_name} node additional policy"
+
+  policy = jsonencode({
+    Version = "2012-10-17"
+    Statement = [
+      {
+        Action = [
+          "ec2:Describe*",
+        ]
+        Effect   = "Allow"
+        Resource = "*"
+      },
+    ]
+  })
+}
+
+resource "aws_iam_role_policy_attachment" "additional" {
+  for_each = module.eks.eks_managed_node_groups
+
+  policy_arn = aws_iam_policy.node_additional.arn
+  role       = each.value.iam_role_name
+}
+
+# Tags for the ASG to support cluster-autoscaler scale up from 0 for nodegroup2
+resource "aws_autoscaling_group_tag" "nodegroup_t3_small" {
+  for_each               = local.eks_asg_tag_list_nodegroup_t3_small_label
+  autoscaling_group_name = element(module.eks.eks_managed_node_groups_autoscaling_group_names, 2)
+  tag {
+    key                 = each.key
+    value               = each.value
+    propagate_at_launch = true
+  }
+}
+
+resource "aws_autoscaling_group_tag" "nodegroup_t3_medium" {
+  for_each               = local.eks_asg_tag_list_nodegroup_t3_medium_label
+  autoscaling_group_name = element(module.eks.eks_managed_node_groups_autoscaling_group_names, 1)
+  tag {
+    key                 = each.key
+    value               = each.value
+    propagate_at_launch = true
+  }
+}
+
+resource "aws_autoscaling_group_tag" "nodegroup_t3_large" {
+  for_each               = local.eks_asg_tag_list_nodegroup_t3_large_label
+  autoscaling_group_name = element(module.eks.eks_managed_node_groups_autoscaling_group_names, 0)
+  tag {
+    key                 = each.key
+    value               = each.value
+    propagate_at_launch = true
+  }
+}
+
+module "eks_autoscaler" {
+  source                          = "./autoscaler"
+  cluster_name                    = local.cluster_name
+  cluster_namespace               = local.cluster_namespace
+  aws_region                      = var.aws_region
+  cluster_oidc_issuer_url         = module.eks.cluster_oidc_issuer_url
+  autoscaler_service_account_name = local.autoscaler_service_account_name
+}
+```
+
+The configuration of the Cluster Autoscaler begins with the creation of a Helm release named `"cluster-autoscaler"` using the `"helm_release"` resource. The Helm chart is sourced from the `"kubernetes.github.io/autoscaler"` repository with the chart version `"9.10.7"`. The settings inside the Helm release include the AWS region, RBAC (Role-Based Access Control) settings for the service account, cluster auto-discovery settings, and the creation of the service account with the required permissions.
+
+The necessary resources for these settings are created accordingly in the following. The service account is created using the `"iam_assumable_role_admin"` module with an assumable IAM role that allows the service account to access the necessary resources for scaling. It is associated with the OIDC (OpenID Connect) provider of the cluster to permit access.
+
+An IAM policy named `"cluster_autoscaler"` is created to permit the Cluster Autoscaler to interact with Auto Scaling Groups, EC2 instances, launch configurations, and tags. The policy includes two statements: `"clusterAutoscalerAll"` and `"clusterAutoscalerOwn"`.
The first statement grants read access to Auto Scaling Group-related resources, while the second statement allows the Cluster Autoscaler to modify the desired capacity of the Auto Scaling Groups and terminate instances. The policy also includes conditions to ensure that the Cluster Autoscaler can only modify resources with specific tags. The conditions check that the Auto Scaling Group has a tag `"k8s.io/cluster-autoscaler/enabled"` set to `"true"` and a tag `"k8s.io/cluster-autoscaler/"` set to `"owned"`. If you remember it, we have set these tags when setting up the managed node groups for the EKS Cluster in the previous step. + +```javascript +resource "helm_release" "cluster-autoscaler" { + name = "cluster-autoscaler" + namespace = var.cluster_namespace + repository = "https://kubernetes.github.io/autoscaler" + chart = "cluster-autoscaler" + version = "9.10.7" + create_namespace = false + + set { + name = "awsRegion" + value = var.aws_region + } + set { + name = "rbac.serviceAccount.name" + value = var.autoscaler_service_account_name + } + set { + name = "rbac.serviceAccount.annotations.eks\\.amazonaws\\.com/role-arn" + value = module.iam_assumable_role_admin.iam_role_arn + type = "string" + } + set { + name = "autoDiscovery.clusterName" + value = var.cluster_name + } + set { + name = "autoDiscovery.enabled" + value = "true" + } + set { + name = "rbac.create" + value = "true" + } +} + +module "iam_assumable_role_admin" { + source = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc" + version = "~> 4.0" + create_role = true + role_name = "cluster-autoscaler" + provider_url = replace(var.cluster_oidc_issuer_url, "https://", "") + role_policy_arns = [aws_iam_policy.cluster_autoscaler.arn] + oidc_fully_qualified_subjects = ["system:serviceaccount:${var.cluster_namespace}:${var.autoscaler_service_account_name}"] +} + +resource "aws_iam_policy" "cluster_autoscaler" { + name_prefix = "cluster-autoscaler" + description = "EKS cluster-autoscaler policy for cluster ${var.cluster_name}" + policy = data.aws_iam_policy_document.cluster_autoscaler.json +} + +data "aws_iam_policy_document" "cluster_autoscaler" { + statement { + sid = "clusterAutoscalerAll" + effect = "Allow" + + actions = [ + "autoscaling:DescribeAutoScalingGroups", + "autoscaling:DescribeAutoScalingInstances", + "autoscaling:DescribeLaunchConfigurations", + "autoscaling:DescribeTags", + "ec2:DescribeLaunchTemplateVersions", + ] + + resources = ["*"] + } + + statement { + sid = "clusterAutoscalerOwn" + effect = "Allow" + + actions = [ + "autoscaling:SetDesiredCapacity", + "autoscaling:TerminateInstanceInAutoScalingGroup", + "autoscaling:UpdateAutoScalingGroup", + ] + + resources = ["*"] + + condition { + test = "StringEquals" + variable = "autoscaling:ResourceTag/k8s.io/cluster-autoscaler/${var.cluster_name}" + values = ["owned"] + } + condition { + test = "StringEquals" + variable = "autoscaling:ResourceTag/k8s.io/cluster-autoscaler/enabled" + values = ["true"] + } + } +} +``` + +### Networking + +The `networking` module of the infrastructure directory integrates an *Application Load Balancer* (ALB) and *External DNS* in the cluster. Both play crucial roles in managing and exposing Kubernetes applications within the EKS cluster to the outside world. The ALB serves as an Ingress Controller to route external traffic to Kubernetes services, while External DNS automates the management of DNS records, making it easier to access services using user-friendly domain names. 
The root module of `networking` simply calls both submodules, which are described in detail in the following sections.
+
+```javascript
+module "external-dns" {
+  ...
+}
+
+module "application-load-balancer" {
+  ...
+}
+```
+
+#### AWS Application Load Balancer (ALB)
+
+The ALB is a managed load balancer service provided by AWS. In the context of an EKS cluster, the ALB serves as an Ingress Controller and is thus responsible for routing external traffic to the appropriate services and pods running inside the Kubernetes cluster. The ALB acts as the entry point to our applications and enables us to expose multiple services over a single public IP address or domain name, which simplifies access for users and clients.
+
+The code starts by defining some local variables, followed by the creation of an assumable IAM role for the AWS Load Balancer Controller service account via the `aws_load_balancer_controller_controller_role` module. The service account holds the necessary permissions and is associated with the OIDC provider of the EKS cluster; it is the same module call that has already been used multiple times before. The IAM policy for the role is defined in the `"aws_iam_policy.aws_load_balancer_controller_controller_sa"` resource.
+
+Since its policy document is quite extensive, it is loaded from a file named `"AWSLoadBalancerControllerPolicy.json"`. In summary, the AWS IAM document allows the AWS Elastic Load Balancing (ELB) controller, specifically the Elastic Load Balancer V2 (ELBV2) API, to perform various actions related to managing load balancers, target groups, listeners, rules, and tags. The document includes several "Allow" statements that grant permissions for actions like describing and managing load balancers, target groups, listeners, and rules. It also allows the controller to create and delete load balancers, target groups, and listeners, as well as modify their attributes. Additionally, the document permits the addition and removal of tags for ELBV2 resources.
+
+After setting up the IAM role, the code proceeds to install the AWS Load Balancer Controller using Helm. The Helm chart is sourced from the `"aws.github.io/eks-charts"` repository, with the controller image pinned to tag `"v2.4.2"`. The service account configuration is provided to the Helm release's values, including the name of the service account and annotations to associate it with the IAM role created earlier. The `"eks.amazonaws.com/role-arn"` annotation points to the ARN of the IAM role associated with the service account, allowing the controller to assume that role and operate with the appropriate permissions.
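+
+To give an impression of the file's contents, the following is an abbreviated, illustrative excerpt of the kind of statements it contains; the real policy document shipped with the AWS Load Balancer Controller project is considerably longer and more fine-grained:
+
+```javascript
+// Illustrative excerpt only, not the complete policy document.
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Action": [
+        "elasticloadbalancing:DescribeLoadBalancers",
+        "elasticloadbalancing:DescribeTargetGroups",
+        "elasticloadbalancing:DescribeListeners",
+        "elasticloadbalancing:DescribeRules"
+      ],
+      "Resource": "*"
+    },
+    {
+      "Effect": "Allow",
+      "Action": [
+        "elasticloadbalancing:CreateTargetGroup",
+        "elasticloadbalancing:CreateListener",
+        "elasticloadbalancing:ModifyLoadBalancerAttributes",
+        "elasticloadbalancing:AddTags",
+        "elasticloadbalancing:RemoveTags"
+      ],
+      "Resource": "*"
+    }
+  ]
+}
+```
+
+The corresponding Terraform code of the module is shown below.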
+
+```javascript
+locals {
+  aws_load_balancer_controller_service_account_role_name = "aws-load-balancer-controller-role"
+  aws_load_balancer_controller_service_account_name      = "aws-load-balancer-controller-sa"
+}
+
+data "aws_caller_identity" "current" {}
+data "aws_region" "current" {} #
+
+module "aws_load_balancer_controller_controller_role" {
+  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
+  version                       = "5.11.1"
+  create_role                   = true
+  role_name                     = local.aws_load_balancer_controller_service_account_role_name
+  provider_url                  = replace(var.cluster_oidc_issuer_url, "https://", "")
+  role_policy_arns              = [aws_iam_policy.aws_load_balancer_controller_controller_sa.arn]
+  oidc_fully_qualified_subjects = ["system:serviceaccount:kube-system:${local.aws_load_balancer_controller_service_account_name}"]
+}
+
+resource "aws_iam_policy" "aws_load_balancer_controller_controller_sa" {
+  name        = local.aws_load_balancer_controller_service_account_name
+  description = "AWS Load Balancer Controller policy for cluster ${var.cluster_name}"
+
+  policy = file("${path.module}/AWSLoadBalancerControllerPolicy.json")
+}
+
+resource "helm_release" "aws-load-balancer-controller" {
+  name             = var.helm_chart_name
+  namespace        = var.namespace
+  chart            = "aws-load-balancer-controller"
+  create_namespace = false
+
+  repository = "https://aws.github.io/eks-charts"
+  version    = var.helm_chart_version
+
+  values = [yamlencode({
+    clusterName = var.cluster_name
+    image = {
+      tag = "v2.4.2"
+    },
+    serviceAccount = {
+      name = "${local.aws_load_balancer_controller_service_account_name}"
+      annotations = {
+        "eks.amazonaws.com/role-arn" = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.aws_load_balancer_controller_service_account_role_name}"
+      }
+    }
+  })]
+}
+```
+
+#### External DNS
+
+External DNS is a Kubernetes add-on that automates the creation and management of DNS records for Kubernetes services. It is particularly useful when services are exposed to the internet through the ALB or any other Ingress Controller. When an Ingress resource is created that defines how external traffic should be routed to services within the EKS cluster, External DNS automatically updates the DNS provider with the corresponding DNS records (in our case this is Route 53 in AWS). Automatically configuring DNS records ensures that the records are always up to date, which helps maintain consistency and reliability in the DNS configuration, and users can access the Kubernetes services using user-friendly domain names rather than relying on IP addresses.
+
+The code is structured similarly to the ALB and defines local variables first, followed by the creation of a service account to interact with AWS resources. The service account, its role with OIDC, and the policy with the relevant permissions are created by the `external_dns_controller_role` module, following the same pattern as in the previous implementations. The policy allows the External DNS controller to operate within the specified AWS Route 53 hosted zone, such as changing resource record sets, and listing hosted zones and resource record sets.
+
+Finally, Helm is used to deploy the External DNS controller as a Kubernetes resource. The Helm release configuration includes the previously created service account, the IAM `role-arn` associated with it, the `aws.region` where the Route 53 hosted zone exists, and a `domainFilter` which restricts the controller to the specific domain we provide.
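+
+As a hypothetical usage sketch, an Ingress such as the following would be picked up by External DNS, which would then create the matching Route 53 record automatically; the hostname, service name, and port are assumptions for illustration (the JupyterHub module later in this chapter sets the same kind of annotation through its Helm values):
+
+```javascript
+# Hypothetical Ingress: External DNS reads the hostname annotation and creates
+# the Route 53 record, while the ALB controller provisions the load balancer.
+resource "kubernetes_ingress_v1" "example_app" {
+  metadata {
+    name      = "example-app"
+    namespace = "default"
+    annotations = {
+      "external-dns.alpha.kubernetes.io/hostname" = "app.${var.domain_name}"
+      "kubernetes.io/ingress.class"               = "alb"
+      "alb.ingress.kubernetes.io/scheme"          = "internet-facing"
+    }
+  }
+  spec {
+    rule {
+      host = "app.${var.domain_name}"
+      http {
+        path {
+          path      = "/"
+          path_type = "Prefix"
+          backend {
+            service {
+              name = "example-app-service"
+              port {
+                number = 80
+              }
+            }
+          }
+        }
+      }
+    }
+  }
+}
+```
+
+The Terraform code of the External DNS deployment itself follows.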
+
+```javascript
+locals {
+  external_dns_service_account_role_name = "external-dns-role"
+  external_dns_service_account_name      = "external-dns-sa"
+}
+
+data "aws_caller_identity" "current" {}
+data "aws_region" "current" {} #
+
+module "external_dns_controller_role" {
+  source                        = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
+  version                       = "5.11.1"
+  create_role                   = true
+  role_name                     = local.external_dns_service_account_role_name
+  provider_url                  = replace(var.cluster_oidc_issuer_url, "https://", "")
+  role_policy_arns              = [aws_iam_policy.external_dns_controller_sa.arn]
+  oidc_fully_qualified_subjects = ["system:serviceaccount:${var.namespace}:${local.external_dns_service_account_name}"]
+}
+
+resource "aws_iam_policy" "external_dns_controller_sa" {
+  name        = local.external_dns_service_account_name
+  description = "External DNS policy for cluster ${var.cluster_name}"
+
+  policy = jsonencode({
+    "Version" : "2012-10-17",
+    "Statement" : [
+      {
+        "Effect" : "Allow",
+        "Action" : [
+          "route53:ChangeResourceRecordSets"
+        ],
+        "Resource" : [
+          "arn:aws:route53:::hostedzone/*"
+        ]
+      },
+      {
+        "Effect" : "Allow",
+        "Action" : [
+          "route53:ListHostedZones",
+          "route53:ListResourceRecordSets"
+        ],
+        "Resource" : [
+          "*"
+        ]
+      }
+    ]
+  })
+}
+
+resource "helm_release" "external_dns" {
+  name             = var.name
+  namespace        = var.namespace
+  chart            = var.helm_chart_name
+  create_namespace = false
+
+  repository = "https://charts.bitnami.com/bitnami"
+  version    = var.helm_chart_version
+
+  values = [yamlencode({
+    serviceAccount = {
+      create = true
+      name   = "${local.external_dns_service_account_name}"
+      annotations = {
+        "eks.amazonaws.com/role-arn" = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:role/${local.external_dns_service_account_role_name}"
+      }
+    },
+    aws = {
+      zoneType = "public"
+      region   = "${data.aws_region.current.name}"
+    },
+    policy = "sync"
+    domainFilter = [
+      "${var.domain_name}"
+    ]
+    provider   = "aws"
+    txtOwnerId = "${var.name}"
+  })]
+}
+```
+
+### Relational Database Service
+
+The Amazon RDS (Relational Database Service) instance is provisioned by the `aws_db_instance` resource. It configures the instance with the specified settings, such as `allocated_storage`, `storage_type`, `engine`, `db_name`, `username`, and `password`. All these parameters are provided whenever the module is invoked, e.g. in the Airflow or MLflow modules. Setting `skip_final_snapshot` to `true` states that no final DB snapshot will be created when the instance is deleted.
+
+The resource `aws_db_subnet_group` creates an RDS subnet group with the name `"vpc-subnet-group-${local.rds_name}"`. It associates the RDS instance with the private subnets specified in the `VPC` module and is used to define the subnets in which the RDS instance can be launched. Similar to the subnet group, the RDS instance uses its own security group. The security group `aws_security_group` is attached to the RDS instance. It specifies `ingress` (inbound) and `egress` (outbound) rules to control network traffic. In this case, it allows inbound access on the specified port used by the RDS engine (5432 for PostgreSQL) from the CIDR blocks specified in the `private_subnets_cidr_blocks`, and allows all outbound traffic (`0.0.0.0/0`) from the RDS instance.
+
+The `rds` module is not necessarily needed to run a Kubernetes cluster properly. It is merely an extension of the cluster and is needed to store relevant data of the tools used, such as Airflow or MLflow. The module is thus called directly from the airflow and mlflow modules themselves.
+
+```javascript
+locals {
+  rds_name           = var.rds_name
+  rds_engine         = var.rds_engine
+  rds_engine_version = var.rds_engine_version
+  rds_port           = var.rds_port
+}
+
+resource "aws_db_subnet_group" "default" {
+  name       = "vpc-subnet-group-${local.rds_name}"
+  subnet_ids = var.private_subnets
+}
+
+resource "aws_db_instance" "rds_instance" {
+  allocated_storage      = var.max_allocated_storage
+  storage_type           = var.storage_type
+  engine                 = local.rds_engine
+  engine_version         = local.rds_engine_version
+  instance_class         = var.rds_instance_class
+  db_name                = "${local.rds_name}_db"
+  username               = "${local.rds_name}_admin"
+  password               = var.rds_password
+  identifier             = "${local.rds_name}-${local.rds_engine}"
+  port                   = local.rds_port
+  vpc_security_group_ids = [aws_security_group.rds_sg.id]
+  db_subnet_group_name   = aws_db_subnet_group.default.name
+  skip_final_snapshot    = true
+}
+
+resource "aws_security_group" "rds_sg" {
+  name   = "${local.rds_name}-${local.rds_engine}-sg"
+  vpc_id = var.vpc_id
+
+  ingress {
+    description = "Enable postgres access"
+    from_port   = local.rds_port
+    to_port     = local.rds_port
+    protocol    = "tcp"
+    cidr_blocks = var.private_subnets_cidr_blocks
+  }
+  egress {
+    from_port   = 0
+    to_port     = 0
+    protocol    = "-1"
+    cidr_blocks = ["0.0.0.0/0"]
+  }
+}
+```
\ No newline at end of file
diff --git a/manuscript/08.3-Deployment-Infrastructure_Modules.md b/manuscript/08.3-Deployment-Infrastructure_Modules.md
new file mode 100644
index 0000000..27d9201
--- /dev/null
+++ b/manuscript/08.3-Deployment-Infrastructure_Modules.md
@@ -0,0 +1,561 @@
+## Modules
+
+Within the setup, there are multiple custom modules, namely airflow, mlflow, jupyterhub, and monitoring. Each module is responsible for deploying a specific workflow tool.
+
+These module names also align with their corresponding namespaces within the cluster.
+
+### Airflow
+
+The `Airflow` module is responsible for provisioning all components related to the deployment of Airflow. Being a crucial workflow orchestration tool in our ML platform, Airflow is tightly integrated with various other components in the Terraform codebase, which requires it to receive multiple input variables and configurations.
+Airflow itself is deployed in the Terraform code via a Helm chart. The provided Terraform code also integrates the Airflow deployment with AWS S3 for efficient data storage and logging. It also utilizes an AWS RDS instance from the infrastructure section to serve as the metadata storage. Additionally, relevant Kubernetes secrets are incorporated into the setup to ensure a secure deployment.
+
+The code starts by declaring several local variables that store the names of Kubernetes secrets and S3 buckets for data storage and logging. Next, it creates a Kubernetes namespace for Airflow to isolate the deployment.
+
+```javascript
+locals {
+  k8s_airflow_db_secret_name   = "${var.name_prefix}-${var.namespace}-db-auth"
+  git_airflow_repo_secret_name = "${var.name_prefix}-${var.namespace}-https-git-secret"
+  git_organization_secret_name = "${var.name_prefix}-${var.namespace}-organization-git-secret"
+  s3_data_bucket_secret_name   = "${var.name_prefix}-${var.namespace}-${var.s3_data_bucket_secret_name}"
+  s3_data_bucket_name          = "${var.name_prefix}-${var.namespace}-${var.s3_data_bucket_name}"
+  s3_log_bucket_name           = "${var.name_prefix}-${var.namespace}-log-storage"
+}
+
+data "aws_caller_identity" "current" {}
+data "aws_region" "current" {} #
+
+resource "kubernetes_namespace" "airflow" {
+  metadata {
+    name = var.namespace
+  }
+}
+
+#
+# Log Storage
+#
+module "s3-remote-logging" {
+  source             = "./remote_logging"
+  s3_log_bucket_name = local.s3_log_bucket_name
+  namespace          = var.namespace
+  s3_force_destroy   = var.s3_force_destroy
+  oidc_provider_arn  = var.oidc_provider_arn
+}
+
+#
+# Data Storage
+#
+module "s3-data-storage" {
+  source                     = "./data_storage"
+  s3_data_bucket_name        = local.s3_data_bucket_name
+  namespace                  = var.namespace
+  s3_force_destroy           = var.s3_force_destroy
+  s3_data_bucket_secret_name = local.s3_data_bucket_secret_name
+}
+```
+Afterward, two custom modules, `"s3-remote-logging"` and `"s3-data-storage"`, set up S3 buckets for remote logging and data storage. Both modules handle creating the S3 buckets and the necessary IAM roles for accessing them. The Terraform code of both modules is not depicted here; it is available on [GitHub](https://github.com/seblum/mlops-airflow-on-eks), though.
+The main difference between the modules lies in the assume role policies that are needed for the different use cases of storing and reading data, or logging to S3. While the `"s3_log_bucket_role"` allows a Federated entity, specified by an OIDC provider ARN, to assume the role using `"sts:AssumeRoleWithWebIdentity"`, the `"s3_data_bucket_role"` allows both a specific IAM user (constructed from the user's ARN) and the Amazon S3 service itself to assume the role using `"sts:AssumeRole"`.
+
+**s3-data-storage role policy**
+
+```javascript
+# s3-data-storage role policy
+resource "aws_iam_role" "s3_data_bucket_role" {
+  name                 = "${var.namespace}-s3-data-bucket-role"
+  max_session_duration = 28800
+
+  # The original policy body was lost in this excerpt; the following is a
+  # reconstruction based on the description above (a specific IAM user and the
+  # Amazon S3 service may assume the role), not verbatim source code.
+  assume_role_policy = <<EOF
+{
+  "Version": "2012-10-17",
+  "Statement": [
+    {
+      "Effect": "Allow",
+      "Principal": {
+        "AWS": "${data.aws_caller_identity.current.arn}",
+        "Service": "s3.amazonaws.com"
+      },
+      "Action": "sts:AssumeRole"
+    }
+  ]
+}
+EOF
+}
+```
+
+### Jupyterhub
+
+JupyterHub is utilized in the setup to provide an IDE (Integrated Development Environment). The Terraform code below defines a `helm_release` that deploys JupyterHub on our EKS cluster. In contrast to the other components of our ML platform, no other resources are needed to run JupyterHub.
+The Helm configuration specifies various settings and customizations to include a JupyterHub instance with a single-user Jupyter notebook server, for example a post-start lifecycle hook that runs a Git clone command inside the single-user notebook server container, or an extra environment variable for the single-user server, namely `"MLFLOW_TRACKING_URI"`, pointing to the previously specified MLflow service.
+
+The Helm configuration also enables an Ingress resource to expose JupyterHub on the specified domain, and adds annotations to control routing and manage the AWS Application Load Balancer (ALB). It also includes settings for the JupyterHub proxy and enables a culling mechanism to automatically shut down idle user sessions.
+
+Similar to the Airflow deployment, the JupyterHub instance is configured to use the GitHub OAuthenticator for user authentication.
The OAuthenticator is configured with the provided GitHub `client_id` and `client_secret`, and the `oauth_callback_url` to set a specific endpoint under the specified domain name. + +```javascript +resource "helm_release" "jupyterhub" { + name = var.name + namespace = var.name + create_namespace = var.create_namespace + + repository = "https://jupyterhub.github.io/helm-chart/" + chart = var.helm_chart_name + version = var.helm_chart_version + + values = [yamlencode({ + singleuser = { + defaultUrl = "/lab" + image = { + name = "seblum/jupyterhub-server" + tag = "latest" + }, + lifecycleHooks = { + postStart = { + exec = { + command = ["git", "clone", "${var.git_repository_url}"] + } + } + }, + extraEnv = { + "MLFLOW_TRACKING_URI" = "http://mlflow-service.mlflow.svc.cluster.local" + } + }, + ingress = { + enabled : true + annotations = { + "external-dns.alpha.kubernetes.io/hostname" = "${var.domain_name}" + "alb.ingress.kubernetes.io/scheme" = "internet-facing" + "alb.ingress.kubernetes.io/target-type" = "ip" + "kubernetes.io/ingress.class" = "alb" + "alb.ingress.kubernetes.io/group.name" = "mlplatform" + } + hosts = ["${var.domain_name}", "www.${var.domain_name}"] + }, + proxy = { + service = { + type = "ClusterIP" + } + secretToken = var.proxy_secret_token + } + cull = { + enabled = true + users = true + } + hub = { + baseUrl = "/${var.domain_suffix}" + config = { + GitHubOAuthenticator = { + client_id = var.git_client_id + client_secret = var.git_client_secret + oauth_callback_url = "http://${var.domain_name}/${var.domain_suffix}/hub/oauth_callback" + } + JupyterHub = { + authenticator_class = "github" + } + } + } + })] +} +``` + + + diff --git a/manuscript/08.4-Deployment-Infrastructure_Design_Decisions.md b/manuscript/08.4-Deployment-Infrastructure_Design_Decisions.md new file mode 100644 index 0000000..4bf3472 --- /dev/null +++ b/manuscript/08.4-Deployment-Infrastructure_Design_Decisions.md @@ -0,0 +1 @@ +## Design Decisions \ No newline at end of file diff --git a/manuscript/09.1-Deployment-Usage_Jupyterhub.md b/manuscript/09.1-Deployment-Usage_IDE.md similarity index 98% rename from manuscript/09.1-Deployment-Usage_Jupyterhub.md rename to manuscript/09.1-Deployment-Usage_IDE.md index f7e7ed0..d43d600 100644 --- a/manuscript/09.1-Deployment-Usage_Jupyterhub.md +++ b/manuscript/09.1-Deployment-Usage_IDE.md @@ -1,4 +1,4 @@ -## Jupyterhub +## Integrated Development Environment Jupyterhub serves as the integrated server environment within the ML platform, providing an Integrated Development Environment (IDE). However, it deviates from the traditional Jupyter Notebooks and instead utilizes VSCode as the IDE. Upon initialization, Jupyterhub clones the GitHub repository *mlops-airflow-DAGs* that contains the code for the use case. This is the same repository that is synchronized with Airflow to load the provided DAGs. The purpose of this approach is to offer a user-friendly and efficient development experience to platform users. 
diff --git a/manuscript/09.3-Deployment-Usage_Building-Model-Pipeline.md b/manuscript/09.3-Deployment-Usage_Building-Model-Pipeline.md index 7d1afc7..7bd5dfe 100644 --- a/manuscript/09.3-Deployment-Usage_Building-Model-Pipeline.md +++ b/manuscript/09.3-Deployment-Usage_Building-Model-Pipeline.md @@ -340,8 +340,5 @@ After the model training, the trained model is tested on the test data and its p # Return run ID, model name, model version, and current stage of the model return run_id, mv.name, mv.version, mv.current_stage - ``` - - diff --git a/manuscript/10-Acknowledgements.Rmd b/manuscript/10-Acknowledgements.Rmd new file mode 100644 index 0000000..8737b2f --- /dev/null +++ b/manuscript/10-Acknowledgements.Rmd @@ -0,0 +1,7 @@ +# Acknowledgements + +I would like to express my gratitude to the everyone contributing to this project and everyone who has provided invaluable insights and support throughout the development of this bookdown project: + + +I am grateful to everyone who have contributed their time and expertise in reviewing drafts and providing valuable feedback. Your input has undoubtedly shaped the final outcome of this bookdown, and I am deeply appreciative of your efforts. + diff --git a/manuscript/_bookdown.yml b/manuscript/_bookdown.yml index 3df1039..a2b0ea1 100644 --- a/manuscript/_bookdown.yml +++ b/manuscript/_bookdown.yml @@ -10,7 +10,7 @@ rmd_files: ["index.Rmd", "01.2-Introduction-MLOps.Rmd", "01.3-Introduction-Roles_and_Tasks.Rmd", "01.4-Introduction-Ops_practices.Rmd", - "02-Project-MLOps_Engineering_with_Airflow.Rmd", + "02-Overview_about_book_tutorials.Rmd", "03-Airflow.Rmd", "03.1-Airflow-Core_Components.Rmd", "03.2-Airflow-Exemplary_ML_Workflow.Rmd", @@ -32,16 +32,18 @@ rmd_files: ["index.Rmd", "06.3-Terraform-Modules.Rmd", "06.4-Terraform-Tips_and_Tricks.Rmd", "06.5-Terraform-Exemplary_deployment.Rmd", + "07-ML-Project_Design.Rmd", "08-Deployment-Infrastructure_Overview.md", - #"08.1-Deployment_Infrastructure_Root.Rmd" - #"08.2-Deployment-Infrastructure_Essentials.md", - #"08.3-Deployment-Infrastructure_Modules.md", - #"08.4-Deployment-Infrastructure_Applications.md", + "08.1-Deployment_Infrastructure_Root.md", + "08.2-Deployment-Infrastructure_Essentials.md", + "08.3-Deployment-Infrastructure_Modules.md", + "08.4-Deployment-Infrastructure_Design_Decisions.md", "09-Deployment-Usage_Overview.md", - "09.1-Deployment-Usage_Jupyterhub.md", + "09.1-Deployment-Usage_IDE.md", "09.2-Deployment-Usage_Pipeline-Workflow.md", "09.3-Deployment-Usage_Building-Model-Pipeline.md", "09.4-Deployment-Usage_Model-Serving.md", + "10-Acknowledgements.Rmd" ] diff --git a/temporary_structure/02-MLOps.Rmd b/temporary_structure/02-MLOps.Rmd deleted file mode 100644 index 312861f..0000000 --- a/temporary_structure/02-MLOps.Rmd +++ /dev/null @@ -1,11 +0,0 @@ -# Hello bookdown - -All chapters start with a first-level heading followed by your chapter title, like the line above. There should be only one first-level heading (`#`) per .Rmd file. - -## A section - -All chapter sections start with a second-level (`##`) or higher heading followed by your section title, like the sections above and below here. You can have as many as you want within a chapter. - -### An unnumbered section {-} - -Chapters and sections are numbered by default. To un-number a heading, add a `{.unnumbered}` or the shorter `{-}` at the end of the heading, like in this section. 
diff --git a/temporary_structure/03-Airflow.Rmd b/temporary_structure/03-Airflow.Rmd deleted file mode 100644 index 312861f..0000000 --- a/temporary_structure/03-Airflow.Rmd +++ /dev/null @@ -1,11 +0,0 @@ -# Hello bookdown - -All chapters start with a first-level heading followed by your chapter title, like the line above. There should be only one first-level heading (`#`) per .Rmd file. - -## A section - -All chapter sections start with a second-level (`##`) or higher heading followed by your section title, like the sections above and below here. You can have as many as you want within a chapter. - -### An unnumbered section {-} - -Chapters and sections are numbered by default. To un-number a heading, add a `{.unnumbered}` or the shorter `{-}` at the end of the heading, like in this section. diff --git a/temporary_structure/041-k8s.md b/temporary_structure/041-k8s.md deleted file mode 100644 index 242d79e..0000000 --- a/temporary_structure/041-k8s.md +++ /dev/null @@ -1,195 +0,0 @@ -### The problem {-} - -Deploy a web application into a node. Consisting of one virtual machine of 8GB RAM, 4 Cores. Usually one would deploy one container. Having more users of the application has the need to scale the application, meaning to create yet another node with the same application running. - -Even worse, say if we have a versioned application, e.g. v2, that we want to deploy. We have to deploy a new node before destroying a prior version, e.g. v1. - -This leaves us with a problem. We would have 12 cores, 24GB RAM, and three containers overall. This is a bit insane. Now, this is where Kubernetes comes into play and can help. - -Kubernetes will take a single node and then utilize the resources in the correct manner. So instead create new nodes, it will fill one node with as many pods as it can (In K8s you can think of pods as a container). This means, instead of having one container per node, we have multiple containers in one node. Kubernetes orchestrates this cluster of us. - -![](multiple_containers.png) - -**Kubernetes (K8s) is an application orchestrator** - -More specifically: - -* K8s deploys and manages containers that run an application (..or else). -* K8s scales up and down according to demand -* K8s performs zero downtime deployments -* Rollbacks, etc ... - - -*Minikube* - -describe minikube - -## Resources within Kubernetes {-} - -Let's first have a look at its components. - -## Cluster - -A cluster is a set of nodes. A node can be a virtual (VM) or a physical machine, running on the cloud, e.g. Azure, AWS, GCP, or on premise. - -## Nodes - -It is important to distinguish between nodes within a K8s cluster. In particular between *master nodes* and *worker nodes*. - -The **master node** can be seen as the brain of the cluster. This is where all of the decisions are made. Within the master node, there multiple components that make up the *control plane*, e.g. scheduler, cluster store, API server, cloud controller manager, controller manager. - -* scheduler -* cluster store -* API server -* cloud controller manager -* controller manager - -The **worker nodes** are responsible for the "heavy lifting" of running an application. - -Within one cluster there are often more than one worker node but only one master node. -Master and worker nodes communicate to each other via the *kubelet*. 
- -![](kubernetes_cluster.png) - - -## Services - -Lets let pods talk to each other - -customer microservices performs a REST api call to order microservice to fetch some order information - -bad way: -get ip of order -go to customer deployment -insert spec env: name order-service -after deleting the pods, the ip address will change. Thus our hardcoded way does not work anymore. never rely on ip adress and uses Services instead (ClusterIpServces). - -using service: -containerPort of pod and targetPort have to match, and selector-app and pod-app -clusterip service get endpoints -access only service ip - what if the service pod is restarting? -still need portforward to test because we did not implement external ip service yet -can also use minikube service customer-node to open directly - - -* ClusterIP (Default): Default Kubernetes Service Type. Only used for internal access and no external. When letting customer talk to order, we use a order service of type clusterIP calling service-name:port. kubernetes clusterIP is created on default to be able to talk to the kube-apiserver-minikube -* NodePort: Allows to open a port on all nodes. Port range between 30000-32767. Example of two nodes: Nodeport opens one port to both nodes so the client can choose which node to access under one port. The NodePort Service handles this request and checks which pod is healthy and only send requests there. Client wants to run on node one, but pod is only running on node 2 -> Nodeport will send request to pod on node two. Disadvantage is that we can only have one service per port: changes usage of ingress (? what example). If node IP change, then we have a problem -* ExternalName -* LoadBalancer: Standard way of exposing applications to the internet. Creates a load balancer per service (a second service needs a second LB). AWS & GCP create a network load balancer (NLB). NLB distributes traffic between instances. minikube tunnel to run locally. Cloud controller manager talks to underlying cloud provider (which it creates an NLB). - - -## Deployments - -## ReplicaSets - -$$ z = x_{1}w_{11}^{(1)} + x_{2}w_{21}^{(1)} + b_{1}$$ - - - -## DaemonSets - -```{=latex} -\begin{table}[h!] - \begin{center} - \small\sffamily\renewcommand{\arraystretch}{0.9} - \begin{tabular}{p{0.5cm}p{0.5cm}p{0.5cm}} - x1 & x2 & y \\ - \midrule - 1 & 1 & 0 \\ - 1 & 0 & 1 \\ - 0 & 1 & 1 \\ - 0 & 0 & 0 \\ - % \underline{(1,2)} - \end{tabular} - \end{center} - \caption[XOR truth table]{XOR truth table} - \label{tab:event3} -\end{table} -``` - - - -## Commands {-} - -To interact with the cluster from our local machine, *kubectl* is needed. - -**kubectl** is a command line tool to run commands against our cluster, e.g. deploy, inspect, edit resources, debug, view logs, etc.. -kubectl is also used to connect your cluster, whether it's running in production or any environment. - - -```bash -# Start a cluster with two nodes -minikube start --nodes=2 - -# check status of minikube nodes -minikube status - -# access the running application using the appname -minikube service myapp - -# show docker containers created within nodes -docker ps - -docker run --rm -p 80:80 amigoscode/kubernetes:customer-v1 - -# to interact with the cluster use kubectl -# to show available nodes -kubectl get nodes - -# to show all available pods in all namespaces -# this also shows pods of control pane -kubectl get pods -A - -# apply configuration file to run. 
-kubectl apply -f deployment.yml - -# to show the cluster-ips and ports -kubectl get svc - -kubectl describe pod -kubectl describe node minikube-m02 - -# list all kubectl api commands -kubectl api-resources - -kubectl port-forward deployment/customer 8080:8080 - -kubectl exec -it hello-world -- ls / -kubectl exec -it hello-world -c hello-world -- bash - -kubectl logs hello-world -kubectl logs hello-world -c hello-world - -kubectl delete pod hello-world - -cat pod.yml | kubectl apply -f - - -kubectl run hello-world --image=amigoscode/kubernetes:hello-world --port=80 - -kubectl get endpoints - -kubectl describe service order - -minikube ip -minikube ip -n minikube-m02 - -# open up the url to the service -minikube service customer-node - -kubectl exec -it order-7d87cb7758-664rl -- sh - -# watch for changes -kubectl get svc -w - -# access LoadBalancer on minikube -minikube tunnel -``` - -## Exemplary deployment {-} - -To run K8s locally create a local cluster for example using minikube. Make sure do install Docker and Minikube. - -To apply the deployment.yml configuration using kubectl use *kubectl apply -f deployment.yml*. *kubectl get pods* should show two pods running now. Check the cluster-ips and ports using *kubectl get svc*. The ports should denote the same as specified within the deployment.yml. Access the running application using the appname -by using *minikube service myapp*. The app can be run within the browser using the shown ip adress. - - diff --git a/temporary_structure/05-Terraform.Rmd b/temporary_structure/05-Terraform.Rmd deleted file mode 100644 index 4784463..0000000 --- a/temporary_structure/05-Terraform.Rmd +++ /dev/null @@ -1,30 +0,0 @@ -# Blocks - -## Equations - -Here is an equation. - -\begin{equation} - f\left(k\right) = \binom{n}{k} p^k\left(1-p\right)^{n-k} - (\#eq:binom) -\end{equation} - -You may refer to using `\@ref(eq:binom)`, like see Equation \@ref(eq:binom). - - -## Theorems and proofs - -Labeled theorems can be referenced in text using `\@ref(thm:tri)`, for example, check out this smart theorem \@ref(thm:tri). - -::: {.theorem #tri} -For a right triangle, if $c$ denotes the *length* of the hypotenuse -and $a$ and $b$ denote the lengths of the **other** two sides, we have -$$a^2 + b^2 = c^2$$ -::: - -Read more here . - -## Callout blocks - - -The R Markdown Cookbook provides more help on how to use custom blocks to design your own callouts: https://bookdown.org/yihui/rmarkdown-cookbook/custom-blocks.html diff --git a/temporary_structure/06-MLFlow_DVC.Rmd b/temporary_structure/06-MLFlow_DVC.Rmd deleted file mode 100644 index ad51e36..0000000 --- a/temporary_structure/06-MLFlow_DVC.Rmd +++ /dev/null @@ -1,31 +0,0 @@ -# Sharing your book - -## Publishing - -HTML books can be published online, see: https://bookdown.org/yihui/bookdown/publishing.html - -## 404 pages - -By default, users will be directed to a 404 page if they try to access a webpage that cannot be found. If you'd like to customize your 404 page instead of using the default, you may add either a `_404.Rmd` or `_404.md` file to your project root and use code and/or Markdown syntax. - -## Metadata for sharing - -Bookdown HTML books will provide HTML metadata for social sharing on platforms like Twitter, Facebook, and LinkedIn, using information you provide in the `index.Rmd` YAML. To setup, set the `url` for your book and the path to your `cover-image` file. Your book's `title` and `description` are also used. 
- - - -This `gitbook` uses the same social sharing data across all chapters in your book- all links shared will look the same. - -Specify your book's source repository on GitHub using the `edit` key under the configuration options in the `_output.yml` file, which allows users to suggest an edit by linking to a chapter's source file. - -Read more about the features of this output format here: - -https://pkgs.rstudio.com/bookdown/reference/gitbook.html - -Or use: - -```{r eval=FALSE} -?bookdown::gitbook -``` - - diff --git a/temporary_structure/08-NeuralNetworks.tex b/temporary_structure/08-NeuralNetworks.tex deleted file mode 100755 index fb0fac3..0000000 --- a/temporary_structure/08-NeuralNetworks.tex +++ /dev/null @@ -1,584 +0,0 @@ -%\setcounter{chapter}{-1} -\chapter{Artificial Neural Networks} -% -%% ------------------------------------------------------------------------------- -% -\section{Introduction} -% -%% ------------------------------------------------------------------------------- -% -Artificial Intelligence aims to mimic human intelligence using various mathematical and logical tools. Initial AI systems were rule based systems and thus based on learning of formal mathematical rules. However, what about problems which do not have any formal rules, for example identifying objects, understanding spoken words etc.? This is where Artificial Neural Networks (ANN) come into play. \\ - -While neural networks were inspired by the human mind, their goal are not to copy the human mind, but to use mathematical tools to solve problems without formal rules like image recognition, speech/dialogue, language translation, art generation etc. This is done by learning a model to depict the given problem space. The most basic ANN is called a \textit{Perceptron} and proposed by Frank Rosenblatt. A Perceptron is based on the simplification of a neuron architecture as proposed by McCulloch–Pitts. Not going into much details, it has two inputs and one output and the neuron itself (it is sometimes also referred to as unit as it's not biological anymore) has a predefined threshold. Now, if we feed the inputs to the neuron and the sum of inputs exceed the threshold of the unit, the output is active else it is inactive. 
-
-This means we have a linear activation like
-
-$$ z = x_{1}w_{11}^{(1)} + x_{2}w_{21}^{(1)} + b_{1}$$
-
-
-\begin{figure}[h]
-    \begin{center}
-    \begin{tikzpicture}[>=latex,scale=0.2]
-        % \draw [help lines, black!30, step=0.5] (0,0) grid (20,50);
-
-        % Styles for states, and state edges
-        \tikzstyle{unit} = [draw, thick, fill=white, circle, minimum height=1em, minimum width=1em, node distance=1em, font={\sffamily}]
-        \tikzstyle{stateEdgePortion} = [black,thick];
-        \tikzstyle{stateEdge} = [stateEdgePortion,->];
-        \tikzstyle{edgeLabel} = [pos=0.5, text centered, font={\sffamily\small}];
-
-        % ACT-R
-        \node[unit, name = input1, scale = .75] {$x_{1}$};
-        \node[unit, name = input2, below =1cm of input1, scale = .75] {$x_{2}$};
-        \node[unit, name = bias1, below =0.5cm of input2, scale = .75] {$b_{1}$};
-
-        \node[unit, name = unit, below right =0.2cm and 2cm of input1, scale = 1.5] {$z$};
-
-        \node[unit, name = output, right =1cm of unit, scale = .75] {$y$};
-
-        \draw (input1) edge [stateEdge] node[edgeLabel]{} (unit);
-        \draw (input2) edge [stateEdge] node[edgeLabel]{} (unit);
-        \draw (bias1) edge [stateEdge] node[edgeLabel]{} (unit);
-        \draw (unit) edge [stateEdge] node[edgeLabel]{} (output);
-
-    \end{tikzpicture}
-    \end{center}
-    \caption[Perceptron]{Perceptron}
-    \label{fig:perceptron}
-\end{figure}
-
-However, Minsky and Papert concluded that perceptrons can only separate linearly separable classes and are thus incapable of learning even very simple functions that are not linearly separable. They chose the Exclusive-OR (XOR) problem to prove that single-layer perceptrons cannot learn beyond linearly separable data.
-%
-%% -------------------------------------------------------------------------------
-%
-\section{The XOR Problem}
-%
-%% -------------------------------------------------------------------------------
-%
-In the XOR problem, we try to train a perceptron to mimic a 2D XOR function. The logical “exclusive OR” function states that for two given logical statements, XOR returns TRUE if exactly one of the statements is true, and FALSE if both or neither of them are true. Written out, we get the following truth table:
-
-%
-\vspace{0.5cm}
-\begin{table}[h!]
-    \begin{center}
-    \small\sffamily\renewcommand{\arraystretch}{0.9}
-    \begin{tabular}{p{0.5cm}p{0.5cm}p{0.5cm}}
-    x1 & x2 & y \\
-    \midrule
-    1 & 1 & 0 \\
-    1 & 0 & 1 \\
-    0 & 1 & 1 \\
-    0 & 0 & 0 \\
-    % \underline{(1,2)}
-    \end{tabular}
-    \end{center}
-    \caption[XOR truth table]{XOR truth table}
-    \label{tab:event3}
-\end{table}
-
-However, a perceptron can only converge on linearly separable data. Have a look at the table and try to fit a linear function that separates the two classes: it is just not possible. Therefore, a perceptron is not capable of imitating the XOR function. The solution to this problem is to expand beyond the single-layer architecture of a perceptron by adding an additional layer of units, also known as a hidden layer. Now we come a bit closer to a "real" network. The "vanilla" one can be referred to as a multilayer perceptron (MLP).
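-
-To make this limitation concrete, here is a minimal NumPy sketch of a single-layer perceptron trained on the XOR truth table. The learning rate and epoch count are illustrative choices; no setting will make it converge, since no single line separates the two classes.
-
-\begin{lstlisting}[language=Python]
-    import numpy as np
-
-    # XOR truth table
-    X = np.array([[0,0],[0,1],[1,0],[1,1]])
-    y = np.array([0,1,1,0])
-
-    # single-layer perceptron: one weight per input plus a bias
-    w = np.zeros(2)
-    b = 0.0
-    lr = 0.1
-
-    for epoch in range(100):
-        for x_i, y_i in zip(X, y):
-            # threshold unit: active if the weighted sum exceeds 0
-            y_hat = int(np.dot(w, x_i) + b > 0)
-            # classic perceptron update on the misclassification error
-            w += lr * (y_i - y_hat) * x_i
-            b += lr * (y_i - y_hat)
-
-    preds = [int(np.dot(w, x_i) + b > 0) for x_i in X]
-    print(preds)  # never matches [0, 1, 1, 0]
-\end{lstlisting}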
-%
-%% -------------------------------------------------------------------------------
-%
-\section{MLP}
-%
-%% -------------------------------------------------------------------------------
-%
-A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN), meaning it forwards the input through the network to output the resulting predictions. An MLP consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. Its multiple layers and non-linear activations distinguish an MLP from a linear perceptron and allow it to separate data that is not linearly separable.
-\clearpage
-
-\begin{figure}[h]
-    \begin{center}
-    \begin{tikzpicture}[>=latex,scale=0.2]
-        % \draw [help lines, black!30, step=0.5] (0,0) grid (20,50);
-
-        % Styles for states, and state edges
-        \tikzstyle{unit} = [draw, thick, fill=white, circle, minimum height=1em, minimum width=1em, node distance=1em, font={\sffamily}]
-        \tikzstyle{stateEdgePortion} = [black,thick];
-        \tikzstyle{stateEdge} = [stateEdgePortion,->];
-        \tikzstyle{edgeLabel} = [pos=0.5, text centered, font={\sffamily\small}];
-
-        % ACT-R
-        \node[unit, name = input1, scale = .75] {$x_{1}$};
-        \node[unit, name = input2, below =1cm of input1, scale = .75] {$x_{2}$};
-        \node[unit, name = bias1, below =0.5cm of input2, scale = .75] {$b_{1}$};
-
-        \node[unit, name = hidden1, below right =0.2cm and 2cm of input1, scale = 1.25] {$h_{1}$};
-        \node[unit, name = hidden2, below =1cm of hidden1, scale = 1.25] {$h_{2}$};
-        \node[unit, name = bias2, below =0.5cm of hidden2, scale = .75] {$b_{2}$};
-
-        \node[unit, name = output, below right =0.2cm and 2cm of hidden1, scale = 1.25] {$y$};
-
-        \draw (input1) edge [stateEdge] node[edgeLabel]{} (hidden1);
-        \draw (input2) edge [stateEdge] node[edgeLabel]{} (hidden1);
-        \draw (bias1) edge [stateEdge] node[edgeLabel]{} (hidden1);
-
-        \draw (input1) edge [stateEdge] node[edgeLabel]{} (hidden2);
-        \draw (input2) edge [stateEdge] node[edgeLabel]{} (hidden2);
-        \draw (bias1) edge [stateEdge] node[edgeLabel]{} (hidden2);
-
-        \draw (hidden2) edge [stateEdge] node[edgeLabel]{} (output);
-        \draw (hidden1) edge [stateEdge] node[edgeLabel]{} (output);
-        \draw (bias2) edge [stateEdge] node[edgeLabel]{} (output);
-
-    \end{tikzpicture}
-    \end{center}
-    \caption[MLP]{MLP}
-    \label{fig:mlp}
-\end{figure}
-
-The figure above shows an MLP with $x$ denoting the input, $y$ the output, $w_{ij}^{(l)}$ the weights, and $b$ the bias term. Since MLPs are fully connected, each node in one layer connects with a certain weight to every node in the following layer.
-
-The weight $ w_{12}^{(1)}$ belongs to the 1st layer $^{(1)}$ and connects the 1st neuron of that layer to the 2nd neuron in the next layer $_{12}$. \\
-
-But what happens inside the neuron? There can be two kinds of activation: linear and non-linear. We have
-
-$$ z = x_{1}w_{11}^{(1)} + x_{2}w_{21}^{(1)} + b_{1}$$
-
-as a linear activation, and the sigmoid function
-
-
-$$ a_{1,1}^{(1)} = \sigma(z) $$
-
-$$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
-
-as a non-linear activation function. \\
-
-The activation function decides whether a neuron should be activated by calculating the weighted sum of its inputs and adding the bias to it. Its purpose is to introduce non-linearity into the output of a neuron: a neural network without an activation function is essentially just a linear regression model. The activation function thus applies a non-linear transformation to the input, making the network capable of learning and performing more complex tasks.
-
-No matter how many layers we have, if all of them are linear in nature, the final activation of the last layer is nothing but a linear function of the input to the first layer.
-
-An MLP can be seen as a very shallow ANN.
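-
-Putting the two activations together, a full forward pass through the MLP above fits in a few lines of NumPy. This is a minimal sketch: the weight and input values are illustrative assumptions, with shapes following the figure (two inputs, two hidden units, one output).
-
-\begin{lstlisting}[language=Python]
-    import numpy as np
-
-    def sigmoid(x):
-        return 1 / (1 + np.exp(-x))
-
-    x = np.array([1.0, 0.0])        # input (x1, x2)
-    w_1 = np.array([[0.5, -0.3],    # layer-1 weights w_ij^(1)
-                    [0.8,  0.2]])
-    b_1 = np.array([0.1, 0.1])
-    w_2 = np.array([0.7, -0.4])     # layer-2 weights
-    b_2 = 0.05
-
-    z_1 = x @ w_1 + b_1             # linear activation of the hidden layer
-    a_1 = sigmoid(z_1)              # non-linear activation
-    z_2 = a_1 @ w_2 + b_2
-    y_hat = sigmoid(z_2)            # network output
-    print(y_hat)
-\end{lstlisting}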
-%
-%% -------------------------------------------------------------------------------
-%
-\section{How it works}
-%
-%% -------------------------------------------------------------------------------
-%
-Learning of the MLP occurs by changing the connection weights after each piece of data is processed, based on the amount of error in the output compared to the expected result. Thus, we need to know the expected result, which makes the learning supervised. The change of the weights is carried out through an algorithm called backpropagation, a generalization of the least mean squares algorithm in the linear perceptron.
-
-
-In general, there are four (or five) stages of training an ANN:
-
-\begin{enumerate}
-    \item Initialize the weights "somehow", i.e. randomly – as this is only done once, it is sometimes counted as a proper "step" and sometimes not. Regardless of the terminology, weights have to be initialized
-    \item Forward pass of the inputs
-    \item Calculation of the loss/cost function to evaluate the prediction
-    \item Backpropagation to determine the influence of each weight on the prediction
-    \item Weight update such that the loss decreases in future forward steps
-\end{enumerate}
-%
-%% -------------------------------------------------------------------------------
-%
-\section{The math behind it}
-%
-%% -------------------------------------------------------------------------------
-%
-Now the math is the more tricky part. So far, we have only done a forward pass and probably got some wrong predictions, and we have calculated each activation separately. What is the big picture?
-
-\begin{align*}
-    \hat{y} = a^{(2)} &= \sigma(z^{(2)}) \\
-    &= \sigma(a^{(1)}w^{(2)}+b^{(2)}) \\
-    &= \sigma(\sigma(z^{(1)})w^{(2)}+b^{(2)}) \\
-    &= \sigma(\sigma(xw^{(1)}+b^{(1)})w^{(2)}+b^{(2)}) \\
-\end{align*}
-
-Now the cost function comes into play to see how good this prediction is. \\
-
-\paragraph{Loss function}
-To optimize the network we need a function that specifies the error of our prediction with respect to the expected output. Typically, we seek to minimize this error. As such, the objective function is often referred to as a cost function or a loss function, and the value calculated by it is referred to simply as the "loss". The cost or loss function has an important job in that it must faithfully distill all aspects of the model down into a single number, in such a way that improvements in that number are a sign of a better model.
-
-A well-known example is the mean squared error (MSE), here written for a single sample:
-
-$$ L = \frac{1}{2}(y_{true}-\hat{y})^2 $$
-
-Now that we have the loss of our network, what do we do with it? The idea is to take small steps towards the minimum of the loss function.
-This is achieved by updating the weights with a fraction of the (negative) gradient. \\
-
-\paragraph{Gradient Descent}
-The "gradient" in gradient descent refers to an error gradient. The gradient descent algorithm seeks to change the weights so that the next evaluation reduces the loss of the model, meaning the optimization algorithm is navigating down the gradient (or slope) of the error:
-
-$$ \nabla L $$
-
-Here the gradient is a vector of partial derivatives, which points in the direction of the steepest slope. This vector is multiplied by a negative step size (or learning rate) and thus moves the weights towards the minimum.
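-
-Written in code, one gradient descent step is just "weight minus learning rate times gradient". Here is a minimal sketch on a simple one-dimensional quadratic loss; the function and the learning rate are illustrative assumptions, not part of the network above.
-
-\begin{lstlisting}[language=Python]
-    # gradient descent on L(w) = (w - 3)^2, minimum at w = 3
-    def grad(w):
-        return 2 * (w - 3)       # dL/dw
-
-    w = 0.0                      # initial weight
-    alpha = 0.1                  # learning rate (step size)
-
-    for step in range(50):
-        w = w - alpha * grad(w)  # step against the gradient
-
-    print(w)                     # close to 3
-\end{lstlisting}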
-Now suppose we want to calculate $\frac{\partial c}{\partial a}$, that is, the gradient from $c$ to $a$. This gradient represents the influence of the node $a$ on the result $c$: if $a$ changes, so does $c$, and the gradient indicates by how much. \\
-
-\paragraph{Learning rate}
-
-We want to take small steps towards the minimum. The step size is controlled by the learning rate $\alpha$: it should be small, but not too small, as we do not want training to take forever.
-
-The aim is to update the weights with only a fraction of the (negative) gradient, trading off the speed of training against the "closeness to the minimum" achieved.
-In practice there are lots of training strategies: continuous decay of the learning rate, decay on plateau, manual tuning, or adaptive optimizers like RMSprop. \\
-
-\paragraph{Backpropagation}
-
-With such a small graph, we could compute $\frac{\partial c}{\partial a}$ in one go, by determining the derivative of $c$ with respect to $a$. However, this would be
-impractical for more extensive graphs. Instead of a naive direct computation of the gradient with respect to each weight individually, we determine the gradient using backpropagation. The backpropagation algorithm works by computing the gradient one layer at a time by the chain rule, iterating backward from the last layer to avoid redundant calculations of intermediate terms. Using the chain rule makes it suitable for graphs of arbitrary size and for training multilayer networks. According to this rule, to calculate $\frac{\partial c}{\partial a}$, we need to do the following:
-
-\begin{enumerate}
-    \item We traverse the graph backwards from $c$ to $a$.
-    \item We compute the local gradient for each intervening operation, that is, the derivative of the output of that operation with respect to its input.
-    \item We multiply all local gradients.
-\end{enumerate}
-
-% You will probably never have to implement the backpropagation algorithm yourself in real-world projects as modern ML libraries already bring their own ready-made implementations. It's good to understand how it works though, as it helps when something is failing.
-
-\subsection{A practical example}
-
-This example is done for the MLP above and shows the backpropagation step for the second-layer weight $w^{(2)}$.
-
-$$\frac{\partial L}{\partial w^{(2)}} = \frac{\partial L}{\partial a^{(2)}} * \frac{\partial a^{(2)}}{\partial z^{(2)}} * \frac{\partial z^{(2)}}{\partial w^{(2)}}$$
-
-Remember, since we want to descend along the gradient, we need the individual derivatives. This means, in order to backpropagate to the second weight (as in this example), we need the derivative of the cost function, $\frac{\partial L}{\partial a^{(2)}}$, of the (sigmoid) activation function, $\frac{\partial a^{(2)}}{\partial z^{(2)}}$, and of the linear function of the inputs, $\frac{\partial z^{(2)}}{\partial w^{(2)}}$. \\
-
-Let's start with the cost function: $L = \frac{1}{2}(y_{true}-\hat{y})^2$
-
-\begin{align*}
-    \frac{\partial L}{\partial \hat{y}} &= \frac{1}{2}*2*(y_{true}-\hat{y})*(-1) \\
-    &= -(y_{true} - \hat{y}) \text{ with } \hat{y} = a^{(2)} \\
-    &= (a^{(2)} - y_{true})
-\end{align*}
-
-Now let's get the derivative of the (sigmoid) activation function $\hat{y} = a^{(2)} = \sigma(z^{(2)})$. With $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ and $a^{(2)} = \sigma(z^{(2)})$ it follows:
-
-$$\frac{\partial a^{(2)}}{\partial z^{(2)}} = a^{(2)}(1-a^{(2)})$$
-
-This leaves us with the derivative of the linear neuron input $z^{(2)} = a^{(1)}w^{(2)}+b^{(2)}$:
-
-$$\frac{\partial z^{(2)}}{\partial w^{(2)}} = a^{(1)}$$
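-
-These three factors can be verified numerically: a finite-difference estimate of $\frac{\partial L}{\partial w^{(2)}}$ should match their product. A minimal sketch for the scalar case, with illustrative values:
-
-\begin{lstlisting}[language=Python]
-    import numpy as np
-
-    def sigmoid(x):
-        return 1 / (1 + np.exp(-x))
-
-    a_1, w_2, b_2, y_true = 0.6, 0.4, 0.1, 1.0
-
-    def loss(w):
-        a_2 = sigmoid(a_1 * w + b_2)
-        return 0.5 * (y_true - a_2) ** 2
-
-    # analytic gradient: dL/da2 * da2/dz2 * dz2/dw2
-    a_2 = sigmoid(a_1 * w_2 + b_2)
-    analytic = (a_2 - y_true) * a_2 * (1 - a_2) * a_1
-
-    # finite-difference approximation of the same gradient
-    eps = 1e-6
-    numeric = (loss(w_2 + eps) - loss(w_2 - eps)) / (2 * eps)
-
-    print(analytic, numeric)  # the two values agree closely
-\end{lstlisting}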
-Now this was only the backpropagation to the second weight. If we want to go back to the bottom of the net and calculate the influence of $w^{(1)}$ on the output, we have for the weights:
-
-$$\frac{\partial L}{\partial w^{(1)}} = \frac{\partial L}{\partial a^{(2)}} * \frac{\partial a^{(2)}}{\partial z^{(2)}} * \frac{\partial z^{(2)}}{\partial a^{(1)}} * \frac{\partial a^{(1)}}{\partial z^{(1)}} * \frac{\partial z^{(1)}}{\partial w^{(1)}}$$
-
-and for the bias:
-
-$$\frac{\partial L}{\partial b^{(1)}} = \frac{\partial L}{\partial a^{(2)}} * \frac{\partial a^{(2)}}{\partial z^{(2)}} * \frac{\partial z^{(2)}}{\partial a^{(1)}} * \frac{\partial a^{(1)}}{\partial z^{(1)}} * \frac{\partial z^{(1)}}{\partial b^{(1)}}$$
-
-
-
-This leaves us with the calculated gradients, i.e. the influence of each weight on the output $a^{(2)}$. Finally, we need to update the weights to minimize the loss of the next evaluation of the model. This is done by multiplying the calculated gradient with the learning rate and subtracting it from the current weight, since we want to move against the gradient:
-
-$$w^{(2)}_{new} = w^{(2)} - \alpha * \frac{\partial L}{\partial w^{(2)}}$$
-
-with $\alpha$ as the learning rate. Unwinding the function gives us:
-
-\begin{align*}
-    w^{(2)}_{new} &= w^{(2)} - \alpha * \frac{\partial L}{\partial w^{(2)}} \\
-    &= w^{(2)} - \alpha * \frac{\partial L}{\partial a^{(2)}} * \frac{\partial a^{(2)}}{\partial z^{(2)}} * \frac{\partial z^{(2)}}{\partial w^{(2)}} \\
-    &= w^{(2)} + \alpha * (y_{true} - a^{(2)}) * (a^{(2)}(1-a^{(2)})) * a^{(1)} \\
-\end{align*}
-
-Now the same has to be done to update the weight $w^{(1)}$ and the biases $b^{(2)}$ and $b^{(1)}$ as well. And this is only for two layers; imagine how much work it is to calculate this with even more!
-%
-%% -------------------------------------------------------------------------------
-%
-\section{Hyperparameters}
-%
-%% -------------------------------------------------------------------------------
-%
-In the above example and calculations, we took some things for granted. However, there are multiple possibilities for different problems, and sometimes finding the best-performing network means adjusting the hyperparameters.
-%
-%% -------------------------------------------------------------------------------
-%
-\subsection{Activation functions}
-%
-%% -------------------------------------------------------------------------------
-%
-
-\paragraph{Tanh}
-
-\paragraph{ReLU}
-
-The Rectified Linear Unit (ReLU) combats the vanishing gradient problem that occurs with sigmoids. ReLU is also easier to compute and generates sparsity (which is not always beneficial).
-
-\paragraph{Leaky ReLU}
-
-\paragraph{Softmax}
-
-%
-%% -------------------------------------------------------------------------------
-%
-\subsection{Loss functions}
-%
-%% -------------------------------------------------------------------------------
-The loss is the prediction error of the neural net, and the loss function is the method of evaluating how well a specific algorithm models the given data: if the predictions deviate too much from the actual results, the loss is large. We want to minimize our loss function, and the loss is later used to calculate the gradients.
-
-In calculating the error of the model during the optimization process, a loss function must be chosen.
-
-This can be a challenging problem, as the function must capture the properties of the problem and be motivated by concerns that are important to the project and stakeholders.
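-
-Before looking at the individual loss functions below, here is a minimal NumPy sketch comparing the mean squared error used so far with binary cross entropy on the same predictions; the values are illustrative.
-
-\begin{lstlisting}[language=Python]
-    import numpy as np
-
-    y_true = np.array([1.0, 0.0, 1.0, 1.0])
-    y_hat  = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probabilities
-
-    # mean squared error, as introduced above (averaged over samples)
-    mse = np.mean(0.5 * (y_true - y_hat) ** 2)
-
-    # binary cross entropy: punishes confident wrong predictions heavily
-    eps = 1e-12  # numerical safety against log(0)
-    bce = -np.mean(y_true * np.log(y_hat + eps)
-                   + (1 - y_true) * np.log(1 - y_hat + eps))
-
-    print(mse, bce)
-\end{lstlisting}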
-
-%
-\paragraph{Binary cross entropy}
-
-\paragraph{Categorical (Multiclass) cross entropy}
-%
-%% -------------------------------------------------------------------------------
-%
-\subsection{Optimizers}
-%
-%% -------------------------------------------------------------------------------
-%
-The optimizer solves the gradient descent optimization problem and is directly influenced by the learning rate. Keep in mind that the gradients are used to update the weights of the neural net.
-
-\paragraph{SGD}
-
-SGD (stochastic gradient descent): the gradient can be approximated on a small number of batches instead of the entire dataset (which would otherwise be computationally expensive).
-
-\paragraph{RMSprop}
-
-RMSprop: adapts the learning rate in each step by weighting it with the gradient values. Large gradient values get penalized, resulting in an adapted speed.
-
-\paragraph{Adam}
-
-Adam: includes an adaptive learning rate and a smoothed gradient descent direction (it accumulates the direction from previous steps).
-
-
-%
-%% -------------------------------------------------------------------------------
-%
-\section{Coding an ANN}
-%
-%% -------------------------------------------------------------------------------
-%
-
-Deep learning extends the topic by adding more layers: the more layers, the more advanced the problems that can be tackled. However, going deep is also expensive and sometimes not the best choice.
-
-How do we battle overfitting and achieve generalization? By regularizing the model! For neural networks this means: adding penalties for large weights to the loss function (the L1 or L2 norm of the weights can be used), or dropout, i.e. training multiple architectures at the same time by randomly "dropping out" nodes in the network.
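-
-A minimal Keras sketch of both regularization ideas; the layer sizes, the L2 factor, and the dropout rate are illustrative assumptions:
-
-\begin{lstlisting}[language=Python]
-    from tensorflow import keras
-    from tensorflow.keras import layers, regularizers
-
-    model = keras.models.Sequential(
-        [
-            keras.Input(shape = (2,)),
-            # L2 penalty on the weights of this layer
-            layers.Dense(units = 16, activation = 'relu',
-                kernel_regularizer = regularizers.l2(0.01)),
-            # randomly "drop out" 20% of the nodes during training
-            layers.Dropout(0.2),
-            layers.Dense(units = 1, activation = 'sigmoid'),
-        ]
-    )
-\end{lstlisting}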
-
-Now let's recreate the XOR problem from the beginning of this chapter with the initial settings:
-optimizer Adam, sigmoid activation functions, and mean squared error as the loss.
-%
-%
-\clearpage
-%
-%% -------------------------------------------------------------------------------
-%
-\subsection{...plain and custom}
-%
-%% -------------------------------------------------------------------------------
-%
-\vspace{-1cm}
-\begin{algorithm}
-    \begin{lstlisting}[language=Python]
-    import numpy as np
-
-    def sigmoid(x):
-        return 1/(1 + np.exp(-x))
-
-    def sigmoid_derivative(x):
-        return x * (1 - x)
-
-    # Input datasets
-    inputs = np.array([[0,0],[0,1],[1,0],[1,1]])
-    y_true = np.array([[0],[1],[1],[0]])
-
-    epochs = 10000
-    lr = 0.1
-    input_size, hidden_size, output_size = 2, 2, 1
-
-    # Random weights and bias initialization
-    w_1 = np.random.uniform(size=(input_size, hidden_size))
-    b_1 = np.random.uniform(size=(1, hidden_size))
-    w_2 = np.random.uniform(size=(hidden_size, output_size))
-    b_2 = np.random.uniform(size=(1, output_size))
-
-    # Training the network
-    for i in range(epochs):
-        # Forward propagation
-        z_1 = np.dot(inputs, w_1) + b_1
-        a_1 = sigmoid(z_1)
-        z_2 = np.dot(a_1, w_2) + b_2
-        a_2 = sigmoid(z_2)
-
-        y_hat = a_2
-
-        # Calculation of Loss (cost function)
-        L = np.mean(0.5 * np.square(y_true - y_hat))
-
-        # Backpropagation
-        # d_y_hat holds the negative gradient (y_true - y_hat),
-        # which is why the updates below are added to the weights
-        d_y_hat = y_true - y_hat
-        d_z_2 = d_y_hat * sigmoid_derivative(a_2)
-        d_a_1 = d_z_2.dot(w_2.T)
-        d_z_1 = d_a_1 * sigmoid_derivative(a_1)
-        # d_z_1 corresponds to the long chain-rule product derived above
-
-        # Updating Weights and Biases
-        w_2 += a_1.T.dot(d_z_2) * lr
-        b_2 += np.sum(d_z_2, axis=0, keepdims=True) * lr
-        w_1 += inputs.T.dot(d_z_1) * lr
-        b_1 += np.sum(d_z_1, axis=0, keepdims=True) * lr
-
-    print(f"Output after training 10,000 epochs: {a_2}")
-
-\end{lstlisting}
-\caption[Custom implementation of the \textit{XOR} Problem]{Custom implementation of the \textit{XOR} Problem.}
-\label{alg:xor_ann_own}
-\end{algorithm}
-%\vspace{0.5cm}
-
-%
-\clearpage
-
-
-%
-%% -------------------------------------------------------------------------------
-%
-\subsection{...with Keras}
-%
-%% -------------------------------------------------------------------------------
-%
-
-\vspace{0.5cm}
-\begin{algorithm}
-    \begin{lstlisting}[language=Python]
-    import numpy as np
-    from tensorflow import keras
-    from tensorflow.keras import layers
-
-    # Input datasets
-    # y_true is denoted as y_train
-    # inputs are denoted as x_train
-    x_train = np.array([[0,0],[0,1],[1,0],[1,1]])
-    y_train = np.array([[0],[1],[1],[0]])
-
-    # hyperparameters
-    epochs = 10000
-    lr = 0.1
-
-    # model size
-    input_shape, hidden_shape, output_shape = 2, 2, 1
-
-    model = keras.models.Sequential(
-        [
-            keras.Input(shape = (input_shape,)),
-            layers.Dense(units = hidden_shape, activation = 'sigmoid'),
-            layers.Dense(units = output_shape, activation = 'sigmoid')
-        ]
-    )
-
-    model.summary()
-
-    model.compile(loss = 'mean_squared_error',
-        optimizer = keras.optimizers.Adam(learning_rate = lr),
-        metrics = ['mean_squared_error']
-        )
-
-    # train the model according to y_true
-    model.fit(x_train, y_train, epochs = epochs)
-
-    predictions = model.predict(x_train)
-    print(predictions)
-\end{lstlisting}
-\caption[Keras implementation of the \textit{XOR} Problem]{Exemplary implementation of the \textit{XOR} Problem by implementing an ANN using Keras.}
-\label{alg:xor_ann_keras}
-\end{algorithm}
-\vspace{0.5cm}
-%
-\vspace{2em}
-%
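-
-The sigmoid output is a probability rather than a hard class label. To read off the predicted classes, round the predictions; the example values below are illustrative of what a successfully trained model returns:
-
-\begin{lstlisting}[language=Python]
-    import numpy as np
-
-    # predictions as returned by model.predict(x_train) above, e.g.:
-    predictions = np.array([[0.03], [0.97], [0.96], [0.04]])
-
-    classes = np.round(predictions).astype(int)
-    print(classes.ravel())  # [0 1 1 0], the XOR truth table
-\end{lstlisting}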
-
-%
-%% -------------------------------------------------------------------------------
-%
-\subsection{...for images}
-%
-%% -------------------------------------------------------------------------------
-%
-
-Classifying images works basically the same way. However, the data are somewhat different: each image is a two-dimensional array, which we have to flatten first. As an example, we have images of the digits 0 to 9 and want to classify which digit is shown. This already tells us the number of output nodes: one per digit. The input size is given by the flattened image, while the size of the hidden layer is ours to choose.
-
-
-\vspace{0.5cm}
-\begin{algorithm}
-    \begin{lstlisting}[language=Python]
-    from tensorflow import keras
-    from tensorflow.keras import layers
-    import matplotlib.pyplot as plt
-    from collections import Counter
-
-    # load the data
-    (X_train, y_train), (X_test, y_test) = keras.datasets.mnist.load_data()
-
-    # check the shape of the training data and how many labels there are
-    print(X_train.shape)
-    # (60000, 28, 28)
-    label_len = len([*Counter(y_train)])
-    # 10
-
-    # reshape image to vector and normalize between 0 and 1
-    img_size = 28*28
-    X_train_flat = X_train.reshape(X_train.shape[0], img_size) / 255
-    X_test_flat = X_test.reshape(X_test.shape[0], img_size) / 255
-
-    epochs = 10
-    lr = 0.001
-
-    input_shape, hidden_shape, output_shape = img_size, 128, label_len
-
-    model = keras.models.Sequential(
-        [
-            keras.Input(shape=(input_shape,)),
-            layers.Dense(units=hidden_shape, activation='relu'),
-            layers.Dense(units=output_shape, activation='softmax'),
-        ]
-    )
-
-    model.compile( loss = 'sparse_categorical_crossentropy',
-        optimizer = keras.optimizers.Adam(learning_rate = lr),
-        metrics = ['accuracy']
-        )
-
-    # keep the training history for plotting below
-    history = model.fit(X_train_flat, y_train, epochs = epochs)
-
-    predictions = model.predict(X_test_flat)
-    print(f"Predictions: {predictions}")
-
-    # plot loss and accuracy
-    plt.plot(history.history['accuracy'])
-    plt.plot(history.history['loss'])
-    plt.legend(['accuracy','loss'], loc='upper right')
-    plt.show()
-\end{lstlisting}
-\caption[Keras implementation for \textit{MNIST} classification]{Keras implementation for \textit{MNIST} classification}
-\label{alg:mnist_ann_keras}
-\end{algorithm}
-\vspace{0.5cm}
-
-
-
-https://towardsdatascience.com/stochastic-gradient-descent-with-momentum-a84097641a5d
-
-https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234
-
-https://towardsdatascience.com/the-5-classification-evaluation-metrics-you-must-know-aa97784ff226
-
-https://towardsdatascience.com/entropy-cross-entropy-kl-divergence-binary-cross-entropy-cb8f72e72e65
-
-https://towardsdatascience.com/understanding-maximum-likelihood-estimation-fa495a03017a
-
-https://towardsdatascience.com/the-five-discrete-distributions-every-statistician-should-know-131400f77782
-
-https://towardsdatascience.com/common-loss-functions-in-machine-learning-46af0ffc4d23
-
-https://towardsdatascience.com/importance-of-loss-function-in-machine-learning-eddaaec69519
-
-https://towardsdatascience.com/tagged/loss-function
-
-https://towardsdatascience.com/backpropagation-for-people-who-are-afraid-of-math-936a2cbebed7
-%
\ No newline at end of file
diff --git a/temporary_structure/09-Deployment.Rmd b/temporary_structure/09-Deployment.Rmd
deleted file mode 100644
index 3d1ddea..0000000
--- a/temporary_structure/09-Deployment.Rmd
+++ /dev/null
@@ -1,34 +0,0 @@
-# Cross-references {#cross}
-
-Cross-references make it easier for your readers to find and link to elements in your book.
-
-## Chapters and sub-chapters
-
-There are two steps to cross-reference any heading:
-
-1. Label the heading: `# Hello world {#nice-label}`. 
-    - Leave the label off if you like the automated heading generated based on your heading title: for example, `# Hello world` = `# Hello world {#hello-world}`.
- - To label an un-numbered heading, use: `# Hello world {-#nice-label}` or `{# Hello world .unnumbered}`. - -1. Next, reference the labeled heading anywhere in the text using `\@ref(nice-label)`; for example, please see Chapter \@ref(cross). - - If you prefer text as the link instead of a numbered reference use: [any text you want can go here](#cross). - -## Captioned figures and tables - -Figures and tables *with captions* can also be cross-referenced from elsewhere in your book using `\@ref(fig:chunk-label)` and `\@ref(tab:chunk-label)`, respectively. - -See Figure \@ref(fig:nice-fig). - -```{r nice-fig, fig.cap='Here is a nice figure!', out.width='80%', fig.asp=.75, fig.align='center', fig.alt='Plot with connected points showing that vapor pressure of mercury increases exponentially as temperature increases.'} -par(mar = c(4, 4, .1, .1)) -plot(pressure, type = 'b', pch = 19) -``` - -Don't miss Table \@ref(tab:nice-tab). - -```{r nice-tab, tidy=FALSE} -knitr::kable( - head(pressure, 10), caption = 'Here is a nice table!', - booktabs = TRUE -) -``` diff --git a/temporary_structure/10-blocks.Rmd b/temporary_structure/10-blocks.Rmd deleted file mode 100644 index 4784463..0000000 --- a/temporary_structure/10-blocks.Rmd +++ /dev/null @@ -1,30 +0,0 @@ -# Blocks - -## Equations - -Here is an equation. - -\begin{equation} - f\left(k\right) = \binom{n}{k} p^k\left(1-p\right)^{n-k} - (\#eq:binom) -\end{equation} - -You may refer to using `\@ref(eq:binom)`, like see Equation \@ref(eq:binom). - - -## Theorems and proofs - -Labeled theorems can be referenced in text using `\@ref(thm:tri)`, for example, check out this smart theorem \@ref(thm:tri). - -::: {.theorem #tri} -For a right triangle, if $c$ denotes the *length* of the hypotenuse -and $a$ and $b$ denote the lengths of the **other** two sides, we have -$$a^2 + b^2 = c^2$$ -::: - -Read more here . - -## Callout blocks - - -The R Markdown Cookbook provides more help on how to use custom blocks to design your own callouts: https://bookdown.org/yihui/rmarkdown-cookbook/custom-blocks.html diff --git a/temporary_structure/10-citations.Rmd b/temporary_structure/10-citations.Rmd deleted file mode 100644 index 2bc1a29..0000000 --- a/temporary_structure/10-citations.Rmd +++ /dev/null @@ -1,15 +0,0 @@ -# Footnotes and citations - -## Footnotes - -Footnotes are put inside the square brackets after a caret `^[]`. Like this one ^[This is a footnote.]. - -## Citations - -Reference items in your bibliography file(s) using `@key`. - -For example, we are using the **bookdown** package [@R-bookdown] (check out the last code chunk in index.Rmd to see how this citation key was added) in this sample book, which was built on top of R Markdown and **knitr** [@xie2015] (this citation was added manually in an external file book.bib). -Note that the `.bib` files need to be listed in the index.Rmd with the YAML `bibliography` key. - - -The RStudio Visual Markdown Editor can also make it easier to insert citations: diff --git a/temporary_structure/10-parts.Rmd b/temporary_structure/10-parts.Rmd deleted file mode 100644 index 0a6b2f0..0000000 --- a/temporary_structure/10-parts.Rmd +++ /dev/null @@ -1,12 +0,0 @@ -# Parts - -You can add parts to organize one or more book chapters together. Parts can be inserted at the top of an .Rmd file, before the first-level chapter heading in that same file. 
- -Add a numbered part: `# (PART) Act one {-}` (followed by `# A chapter`) - -Add an unnumbered part: `# (PART\*) Act one {-}` (followed by `# A chapter`) - -Add an appendix as a special kind of un-numbered part: `# (APPENDIX) Other stuff {-}` (followed by `# A chapter`). Chapters in an appendix are prepended with letters instead of numbers. - - - diff --git a/temporary_structure/10-references.Rmd b/temporary_structure/10-references.Rmd deleted file mode 100644 index b216bb7..0000000 --- a/temporary_structure/10-references.Rmd +++ /dev/null @@ -1,3 +0,0 @@ -`r if (knitr::is_html_output()) ' -# References {-} -'` diff --git a/temporary_structure/10-share.Rmd b/temporary_structure/10-share.Rmd deleted file mode 100644 index ad51e36..0000000 --- a/temporary_structure/10-share.Rmd +++ /dev/null @@ -1,31 +0,0 @@ -# Sharing your book - -## Publishing - -HTML books can be published online, see: https://bookdown.org/yihui/bookdown/publishing.html - -## 404 pages - -By default, users will be directed to a 404 page if they try to access a webpage that cannot be found. If you'd like to customize your 404 page instead of using the default, you may add either a `_404.Rmd` or `_404.md` file to your project root and use code and/or Markdown syntax. - -## Metadata for sharing - -Bookdown HTML books will provide HTML metadata for social sharing on platforms like Twitter, Facebook, and LinkedIn, using information you provide in the `index.Rmd` YAML. To setup, set the `url` for your book and the path to your `cover-image` file. Your book's `title` and `description` are also used. - - - -This `gitbook` uses the same social sharing data across all chapters in your book- all links shared will look the same. - -Specify your book's source repository on GitHub using the `edit` key under the configuration options in the `_output.yml` file, which allows users to suggest an edit by linking to a chapter's source file. - -Read more about the features of this output format here: - -https://pkgs.rstudio.com/bookdown/reference/gitbook.html - -Or use: - -```{r eval=FALSE} -?bookdown::gitbook -``` - -