Added AIOps page #57
Conversation
Looks good!
Looks good, excited to get feedback from the group! Thanks @profvjreddi for helping us.
Looks good to me. Excited for this chapter!
Looks good!
ops.qmd (outdated)

As Agile methodologies became more popular, organizations realized the need for better collaboration and communication between development and operations teams. The siloed nature of development and operations teams often led to inefficiencies, conflicts, and delays in software delivery. This need for better collaboration and integration between development and operations teams led to the [DevOps](https://www.atlassian.com/devops) movement.

The term "DevOps" was first coined in 2009 by [Patrick Debois](https://www.jedi.be/blog/2010/02/12/what-devops-means-to-me/), a consultant and Agile practitioner. Debois organized the first [DevOpsDays](https://www.devopsdays.org/) conference in Ghent, Belgium, in 2009, which brought together development and operations professionals to discuss ways to improve collaboration and automate processes. The conference was a success, and the DevOps movement started to gain momentum.
Small thing, but the link on "Patrick Debois" gives me a 404 Not Found error...
Good point! Thanks for catching that. Fixing it. https://www.jedi.be/ and https://www.youtube.com/watch?v=o7-IuYS0iSE&feature=youtu.be
CI/CD pipelines orchestrate key steps, including checking out new code changes, transforming data, training and registering new models, validation testing, containerization, deploying to environments like staging clusters, and promoting to production. Teams leverage popular CI/CD solutions like [Jenkins](https://www.jenkins.io/), [CircleCI](https://circleci.com/) and [GitHub Actions](https://github.com/features/actions) to execute these MLOps pipelines, while [Prefect](https://www.prefect.io/), [Metaflow](https://metaflow.org/) and [Kubeflow](https://www.kubeflow.org/) offer ML-focused options.

For example, when a data scientist checks improvements to an image classification model into a [GitHub](https://github.com/) repository, this actively triggers a Jenkins CI/CD pipeline. The pipeline reruns data transformations and model training on the latest data, tracking experiments with [MLflow](https://mlflow.org/). After automated validation testing, teams deploy the model container to a [Kubernetes](https://kubernetes.io/) staging cluster for further QA. Once approved, Jenkins facilitates a phased rollout of the model to production with [canary deployments](https://kubernetes.io/docs/concepts/cluster-administration/manage-deployment/#canary-deployments) to catch any issues. If anomalies are detected, the pipeline enables teams to roll back to the previous model version gracefully.
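As an illustration of the experiment-tracking step referenced above, here is a minimal sketch of what a CI job might run with MLflow; the toy dataset, model, and registered model name are stand-ins, not the chapter's actual pipeline:

```python
# Minimal sketch of the experiment-tracking step a CI job might run.
# The toy dataset and model stand in for the image classifier described above;
# the registered model name is a hypothetical placeholder.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="ci-retrain"):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))

    mlflow.log_param("max_iter", 1000)
    mlflow.log_metric("test_accuracy", acc)
    # Register the candidate so a later pipeline stage can promote it to staging/production.
    mlflow.sklearn.log_model(model, "model", registered_model_name="image-classifier-candidate")
```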
I found this example very helpful!
👍
ops.qmd (outdated)

![Diagram showing some tasks affect other future tasks](images/ai_ops/data_cascades.png)

Building models sequentially creates risky dependencies where later models rely on earlier ones. For example, taking an existing model and fine-tuning it for a new use case seems efficient. However, this bakes in assumptions from the original model that may eventually need correction. Modifying foundational components then becomes extremely costly due to the cascading effects on subsequent models. One mitigation is to only augment existing models when absolutely necessary to reuse some capabilities. Often, it is safer to train models for new use cases from scratch to avoid baking in dependencies. Careful thought should be given to identifying points where introducing fresh model architectures can avoid correction cascades down the line.
I wonder if it'd be worth explicitly defining what a "correction cascade" is in this paragraph?
Done, thanks for the feedback.
ops.qmd (outdated)

Careful monitoring and canary deployments help detect feedback. But fundamental challenges remain in understanding complex model interactions. Architectural choices that reduce entanglement and coupling mitigate analysis debt's compounding effect.

### ML System Antipatterns
I wonder if it'd be clearer for this section to be called "Pipeline Jungles" instead of "ML System Antipatterns"?
Done! Thanks for the feedback.
ops.qmd (outdated)

* Evaluating model performance through metrics like accuracy, AUC, F1 scores. Performing error analysis to identify areas for improvement.
* Developing new model versions by incorporating new data, testing different approaches, and optimizing model behavior. Maintaining documentation and lineage for models.

For example, a data scientist may leverage TensorFlow and TensorFlow Probability to develop a demand forecasting model for retail inventory planning. They would iterate on different sequence models like LSTMs and experiment with features derived from product, sales and seasonal data. The model would be evaluated based on error metrics versus actual demand before deployment. The data scientist monitors performance and retrains/enhances the model as new data comes in.
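For readers who want a concrete picture of the sequence-model iteration described here, a minimal TensorFlow/Keras sketch; the window size, feature count, and synthetic data are illustrative placeholders for real product, sales, and seasonal features:

```python
# Illustrative LSTM for demand forecasting; shapes, hyperparameters, and the
# random stand-in data are placeholders for real engineered features.
import numpy as np
import tensorflow as tf

WINDOW, N_FEATURES = 30, 8  # e.g., 30 days of history, 8 engineered features

X = np.random.rand(256, WINDOW, N_FEATURES).astype("float32")  # stand-in training windows
y = np.random.rand(256, 1).astype("float32")                   # next-period demand

model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(WINDOW, N_FEATURES)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])  # error metrics vs. actual demand
model.fit(X, y, epochs=5, batch_size=32, validation_split=0.2, verbose=0)
```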
I think it'd be nice to have a link for TensorFlow Probability: https://www.tensorflow.org/probability.
👍 Appreciate it.
ops.qmd (outdated)

ML engineers enable models data scientists develop to be productized and deployed at scale. Their expertise makes models reliably serve predictions in applications and business processes. Their main responsibilities include:

* Taking prototype models from data scientists and hardening them for production environments through coding best practices.
* Building APIs and microservices for model deployment using tools like Flask, FastAPI. Containerizing models with Docker.
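To make the second bullet concrete, a minimal sketch of a prediction microservice with FastAPI; the feature schema and pickled model path are hypothetical:

```python
# Minimal FastAPI prediction service; the feature schema and pickled model
# path are hypothetical. Run with: uvicorn main:app --port 8000
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    values: list[float]  # hypothetical flat feature vector

with open("model.pkl", "rb") as f:  # assumes a pickled scikit-learn-style model
    model = pickle.load(f)

@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}
```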
Possible links for Flask and FastAPI:
🙏
ops.qmd (outdated)

* Provisioning and managing cloud infrastructure for ML workflows using IaC tools like Terraform, Docker, Kubernetes.
* Developing CI/CD pipelines for model retraining, validation, and deployment. Integrating ML tools into the pipeline like MLflow, Kubeflow.
* Monitoring model and infrastructure performance using tools like Prometheus, Grafana, ELK stack. Building alerts and dashboards.
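As a sketch of the monitoring bullet, here is one way model-serving metrics might be exposed with the Prometheus Python client; the metric names and simulated inference are illustrative:

```python
# Expose illustrative model-serving metrics for Prometheus to scrape;
# metric names and the simulated inference are placeholders.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions_total", "Total predictions served")
LATENCY = Histogram("model_inference_seconds", "Inference latency in seconds")

def serve_one_prediction():
    with LATENCY.time():                        # records how long the block takes
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
    PREDICTIONS.inc()

if __name__ == "__main__":
    start_http_server(8000)                     # metrics at http://localhost:8000/metrics
    while True:
        serve_one_prediction()
```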
Done!
ops.qmd (outdated)

Skilled project managers enable MLOps teams to work synergistically to deliver maximum business value from ML investments rapidly. Their leadership and organization align with diverse teams.

## Challenges in Embedded MLOps
I wonder if it'd be better for the "Traditional MLOps vs. Embedded MLOps" section to come before this section? The transition felt a bit abrupt to me...
Hmm. The issue is the title, I think, so I changed the title to just be a review of the system challenges, and then we get into the Embedded Ops section. Hope that resolves the confusion.
I agree with this
ops.qmd (outdated)

Furthermore, the models themselves need to use simplified architectures optimized for low-power edge hardware. There is no access to high-end GPUs for intensive deep learning given the compute limitations. Training leverages lower-powered edge servers and clusters with distributed approaches to spread load.

![A diagram showing how transfer learning targets updates to specific layers of a model.(images/ai_ops/transfer_learning.png)
Just a heads-up: this image doesn't seem to render in quarto. I think it might be because there's no end bracket after "model."
Got fixed already by @mrdrangonbear
Co-Authored-By: sophiacho1 <67521139+sophiacho1@users.noreply.github.com>
ops.qmd (outdated)

Explanation: This subsection sets the groundwork for the discussions to follow, elucidating the fundamental concept of MLOps and its critical role in enhancing the efficiency, reliability, and scalability of embedded AI systems. It outlines the unique characteristics of implementing MLOps in an embedded context, emphasizing its significance in the streamlined deployment and management of machine learning models.

Machine Learning Operations (MLOps) is a systematic approach that combines machine learning (ML), data science, and software engineering to automate the end-to-end ML lifecycle. This includes everything from data preparation and model training to deployment and maintenance. MLOps ensures that ML models are developed, deployed, and maintained efficiently and effectively.

Consider a ridesharing company that wants to deploy a machine-learning model to predict rider demand in real time. The data science team spends months developing a model, but when it's time to deploy, they realize it needs to be compatible with the engineering team's production environment. Deploying the model requires rebuilding it from scratch - costing weeks of additional work. This is where MLOps comes in.
This is a small thing, but opening with a non-edge ML example feels slightly disconnected from the rest of the book. I don't think you would need to change it, but I might just add a sentence specifying that it's not an edge model; specifically, mention where the "production environment" is.
Good point, fixed it. Thanks @alxrod
With MLOps, there are protocols and tools in place to ensure that the model developed by the data science team can be seamlessly deployed and integrated into the production environment. In essence, MLOps removes friction during the development, deployment, and maintenance of ML systems. It improves collaboration between teams through defined workflows and interfaces. MLOps also accelerates iteration speed by enabling continuous delivery for ML models.

For the ridesharing company, implementing MLOps means their demand prediction model can be frequently retrained and deployed based on new incoming data. This keeps the model accurate despite changing rider behavior. MLOps also allows the company to experiment with new modeling techniques since models can be quickly tested and updated.
Yeah because then you say deployed data here which suggests maybe an edge model, but I think in practice this prediction would be done in the cloud since you want information from many devices
Not necessarily, esp. in the context of autonomous vehicles. But you are right that the context is not setup correctly. Will try to improve.
Canary testing releases a model to a small subset of users to evaluate real-world performance before wide deployment. Teams incrementally route traffic to the canary release while monitoring for issues.

For example, a retailer evaluates a personalized product recommendation model against historical test data, reviewing accuracy and diversity metrics. Teams also calculate metrics on live customer data over time, detecting decreased accuracy over the last 2 weeks. Before full rollout, the new model is released to 5% of web traffic to ensure no degradation.
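To make the 5% canary split concrete, a minimal sketch of deterministic, hash-based traffic routing; the fraction and model interface are illustrative assumptions:

```python
# Deterministic hash-based routing: a fixed fraction of users see the canary model.
import hashlib

CANARY_FRACTION = 0.05  # 5% of traffic, as in the example above

def use_canary(user_id: str) -> bool:
    # Hashing the user id keeps each user consistently in the same bucket.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < CANARY_FRACTION * 100

def recommend(user_id: str, stable_model, canary_model):
    model = canary_model if use_canary(user_id) else stable_model
    return model.predict(user_id)  # hypothetical model interface
```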
These examples are great! Super helpful way to think through things
Teams actively monitor key model aspects including analyzing samples of live predictions to track metrics like accuracy and [confusion matrix](https://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html) over time.

When monitoring performance, it is important for teams to profile incoming data to check for model drift - a steady decline in model accuracy over time after production deployment. Model drift can occur in one of two ways: [concept drift](https://en.wikipedia.org/wiki/Concept_drift) and data drift. Concept drift refers to a fundamental change observed in the relationship between the input data and the target outcomes. For instance, as the COVID-19 pandemic progressed, e-commerce and retail sites had to correct their model recommendations, since purchase data was overwhelmingly skewed towards items like hand sanitizer. Data drift describes changes in the distribution of data over time. For example, image recognition algorithms used in self-driving cars will need to account for seasonality in observing their surroundings. Teams also track application performance metrics like latency and errors for model integrations.
Are there any frameworks for detecting data drift / concept drift over time? Like monitoring data distributions and registering changes in the raw input data?
https://www.reddit.com/r/mlops/comments/15z4fwk/how_do_you_monitor_data_drift_in_your_cv/
I did a little googling and this Reddit thread had some good links.
I'm just curious if there are monitoring tools for just the data and not the model performance. If you have many models across your system working off of a single dataset, it feels like you would want to be able to monitor stats about the data that were model agnostic.
I'm not super familiar with any tools available to do so in the context of MLOps but wondering if anyone was.
That's a great question @alxrod. I'd have to sweat this a bit cause I know of a good number of frameworks but none that directly integrate with the rest of the MLOps pipelines.
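As a framework-agnostic illustration of the kind of check such tools perform, here is a minimal sketch that flags drift in a single numeric feature using a two-sample Kolmogorov-Smirnov test; the data and threshold are synthetic placeholders:

```python
# Compare a feature's training-time distribution against recent production data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5000)  # reference distribution
live_feature = rng.normal(loc=0.4, scale=1.0, size=1000)   # recent data with a shifted mean

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # illustrative significance threshold
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.4f})")
else:
    print("No significant drift detected")
```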
ops.qmd (outdated)

![Figure 14.3: The flowchart depicts the concept of correction cascades in the ML workflow, from problem statement to model deployment. The arcs represent the potential iterative corrections needed at each stage of the workflow, with different colors corresponding to distinct issues such as interacting with physical world brittleness, inadequate application-domain expertise, conflicting reward systems, and poor cross-organizational documentation. The red arrows indicate the impact of cascades, which can lead to significant revisions in the model development process, while the dotted red line represents the drastic measure of abandoning the process to restart. This visual emphasizes the complex, interconnected nature of ML system development and the importance of addressing these issues early in the development cycle to mitigate their amplifying effects downstream. [@data_cascades]](images/ai_ops/data_cascades.png)

Building models sequentially creates risky dependencies where later models rely on earlier ones. For example, taking an existing model and fine-tuning it for a new use case seems efficient. However, this bakes in assumptions from the original model that may eventually need correction. Modifying foundational components then becomes extremely costly due to the cascading effects on subsequent models. One mitigation is to only augment existing models when absolutely necessary to reuse some capabilities. Often, it is safer to train models for new use cases from scratch to avoid baking in dependencies. Careful thought should be given to identifying points where introducing fresh model architectures can avoid correction cascades down the line (see Figure 14.3).
Maybe worth mentioning here how you decide to build models sequentially or not. What factors go into that decision?
Dataset size, is the dataset growing/living or static? How much training resources do you have?
There are still scenarios where this makes sense so it is a situation of tradeoffs
Thanks for the feedback, updated the text to read:
Building models sequentially creates risky dependencies where later models rely on earlier ones. For example, taking an existing model and fine-tuning it for a new use case seems efficient. However, this bakes in assumptions from the original model that may eventually need correction. Several factors inform the decision to build models sequentially or not:
- Dataset size and rate of growth - With small, static datasets, it often makes sense to fine-tune existing models. For large, growing datasets, training custom models from scratch allows more flexibility to account for new data.
- Available computing resources - Fine-tuning requires less resources than training large models from scratch. With limited resources, leveraging existing models may be the only feasible approach.
While fine-tuning can be efficient, modifying foundational components later becomes extremely costly due to the cascading effects on subsequent models. Careful thought should be given to identifying points where introducing fresh model architectures, even with large resource requirements, can avoid correction cascades down the line (see Figure 14.3). There are still scenarios where sequential model building makes sense, so it entails weighing these tradeoffs around efficiency, flexibility, and technical debt.
### Pipeline Jungles

ML workflows often lack standardized interfaces between components. This leads teams to incrementally "glue" together pipelines with custom code. What emerges are "pipeline jungles" – tangled preprocessing steps that are brittle and resist change. Avoiding modifications to these messy pipelines causes teams to experiment through alternate prototypes. Soon, multiple ways of doing everything proliferate. The lack of abstractions and interfaces then impedes sharing, reuse, and efficiency.

Technical debt accumulates as one-off pipelines solidify into legacy constraints. Teams sink time into managing idiosyncratic code rather than maximizing model performance. Architectural principles like modularity and encapsulation are needed to establish clean interfaces. Shared abstractions enable interchangeable components, prevent lock-in, and promote best practice diffusion across teams. Breaking free of pipeline jungles ultimately requires enforcing standards that prevent accretion of abstraction debt. The benefits of interfaces and APIs that tame complexity outweigh the transitional costs.
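One way to picture the "clean interfaces" point is to encapsulate preprocessing behind a standard pipeline abstraction rather than ad-hoc glue code; a minimal scikit-learn sketch with hypothetical column names:

```python
# Encapsulating preprocessing behind a single, swappable Pipeline interface
# instead of scattering one-off transformation scripts across the codebase.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age", "income"]   # hypothetical columns
CATEGORICAL = ["country"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), NUMERIC),
    ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
])

# Any downstream model can reuse the same preprocessing contract.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
# pipeline.fit(train_df[NUMERIC + CATEGORICAL], train_df["label"])  # train_df is hypothetical
```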
I really like this section because it is so real for ML ops. At the same time, most small projects or companies accumulate technical debt because it is so hard to tell what parts of your product or models will reach what scales. I feel like it might be worth having a section explaining how one starts off in a way where they can build fast and accumulate technical debt but in a smart way. How do you build fast lightweight systems that don't actively impede your ability to implement better Ops infrastructure down the road?
Thanks Alex. I added this section before the summary section:
Navigating Technical Debt in Early Stages
It is understandable that technical debt accumulates naturally in early stages of model development. When aiming to build MVP models quickly, teams often lack complete information on what components will reach scale or require modification. Some deferred work is expected.
However, even scrappy initial systems should follow principles like "Flexible Foundations" to avoid painting themselves into corners:
- Modular code and reusable libraries allow components to be swapped later
- Loose coupling between models, data stores, and business logic facilitates change
- Abstraction layers hide implementation details that may shift over time
- Containerized model serving keeps options open on deployment requirements
Decisions that seem expedient in the moment can seriously limit future flexibility. For example, baking key business logic into model code rather than keeping it separate makes subsequent model changes extremely difficult.
With thoughtful design, though, it is possible to build quickly at first while retaining degrees of freedom to improve. As the system matures, prudent break points emerge where introducing fresh architectures proactively avoids massive rework down the line. This balances urgent timelines with reducing future correction cascades.
Labeling also faces challenges without centralized data access, requiring more automated techniques like federated learning where devices collaboratively label peers' data. With personal edge devices, data privacy and regulations are critical concerns. Data collection, transmission and storage must be secure and compliant.

For instance, a smartwatch may collect step count, heart rate, GPS coordinates throughout the day. This data is cached locally and transmitted to an edge gateway when WiFi is available. The gateway processes and filters data before syncing relevant subsets with the cloud platform to retrain models.
I don't totally know if this is in the scope of this chapter, but I think it might add to the example to explain the process of the data coming off of the sensors on device, the signal processing that happens locally there, then the transmitting, and then the processing and filtering that happens on the gateway and the cloud. Just good to give a full FOV of the data pipeline in the use case you're using. Then you can specify what part of the overall process is the specific focus of MLOps.
Hmmm. I like the idea. I will try to integrate this elsewhere. Otherwise, this might get cluttered.
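In that spirit, a minimal sketch of the cache-then-sync step on the device or gateway side; the local schema and gateway endpoint are hypothetical:

```python
# Cache sensor readings locally, then sync in batches when connectivity is available.
import json
import sqlite3
import urllib.request

DB = sqlite3.connect("sensor_cache.db")
DB.execute("CREATE TABLE IF NOT EXISTS readings (ts REAL, heart_rate INTEGER, steps INTEGER)")

def cache_reading(ts: float, heart_rate: int, steps: int) -> None:
    DB.execute("INSERT INTO readings VALUES (?, ?, ?)", (ts, heart_rate, steps))
    DB.commit()

def sync_to_gateway(url: str = "http://gateway.local/ingest") -> None:  # hypothetical endpoint
    rows = DB.execute("SELECT ts, heart_rate, steps FROM readings").fetchall()
    payload = json.dumps([{"ts": t, "hr": h, "steps": s} for t, h, s in rows]).encode()
    req = urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req, timeout=5)  # real code would retry/back off on failure
    DB.execute("DELETE FROM readings")      # clear the local cache only after a successful upload
    DB.commit()
```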
Over-the-air updates require setting up specialized servers to securely distribute model bundles to devices in the field. Rollout and rollback procedures must be carefully tailored for particular device families.

With traditional CI/CD tools less applicable, embedded MLOps relies more on custom scripts and integration. Companies take varied approaches from open source frameworks to fully in-house solutions. Tight integration between developers, edge engineers and end customers establishes trusted release processes.
I feel like federated learning would be worth mentioning or at least referencing here. See a lot of potential for CI/CD for server side updating in the long term as federated learning becomes more common and platforms are fleshed out more
Done! Updated it with this:
In traditional MLOps, new model versions are directly deployed onto servers via API endpoints. However, embedded devices require optimized delivery mechanisms to receive updated models. Over-the-air (OTA) updates provide a standardized approach to wirelessly distribute new software or firmware releases to embedded devices. Rather than direct API access, OTA packages allow remotely deploying models and dependencies as pre-built bundles. As an alternative, federated learning allows model updates without direct access to raw training data. This decentralized approach has potential for continuous model improvement, but currently lacks robust MLOps platforms.
For deeply embedded devices lacking connectivity, model delivery relies on physical interfaces like USB or UART serial connections. The model packaging still follows similar principles to OTA updates, but the deployment mechanism is tailored for the capabilities of the edge hardware. Moreover, specialized OTA protocols optimized for IoT networks are often used rather than standard WiFi or Bluetooth protocols. Key factors include efficiency, reliability, security, and telemetry like progress tracking. Solutions like Mender.io provide embedded-focused OTA services handling differential updates across device fleets.
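To illustrate the shape of such an update flow (this is not Mender.io's actual API; the manifest URL and fields are hypothetical), a minimal device-side sketch that verifies a model bundle before swapping it in:

```python
# Device-side OTA-style model update: fetch a manifest, verify the checksum,
# then swap the model file atomically. The server URL and manifest fields are hypothetical.
import hashlib
import json
import os
import urllib.request

MANIFEST_URL = "https://updates.example.com/model/manifest.json"  # hypothetical update server

def check_and_apply_update(current_version: str, model_path: str = "model.tflite") -> str:
    manifest = json.load(urllib.request.urlopen(MANIFEST_URL, timeout=10))
    if manifest["version"] == current_version:
        return current_version  # already up to date

    bundle = urllib.request.urlopen(manifest["url"], timeout=60).read()
    if hashlib.sha256(bundle).hexdigest() != manifest["sha256"]:
        raise ValueError("Checksum mismatch; refusing to install update")

    tmp_path = model_path + ".new"
    with open(tmp_path, "wb") as f:
        f.write(bundle)
    os.replace(tmp_path, model_path)  # atomic swap on the same filesystem
    return manifest["version"]
```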
Because of the time saved on data processing thanks to Edge Impulse, the Oura team was able to focus on the key drivers of their prediction. In fact, they ended up only extracting three types of sensor data: heart rate, motion, and body temperature. After partitioning the data using five-fold cross validation and classifying sleep stage, the team was able to achieve a correlation of 79% - just a few percentage points off the standard. They were able to readily deploy two types of models for sleep detection: one simplified using just the ring’s accelerometer and one more comprehensive leveraging Autonomic Nervous System (ANS)-mediated peripheral signals and circadian features. With Edge Impulse, they plan to conduct further analyses of different activity types and leverage the scalability of the platform to continue to experiment with different sources of data and subsets of features extracted.

While most ML research focuses on the model-dominant steps such as training and finetuning, this case study underscores the importance of a holistic approach to ML Ops, where even the initial steps of data aggregation and preprocessing have a fundamental impact on successful outcomes.
I feel like this is a great example of ML Ops, but it doesn't touch on the embedded ops as much as it could. Could you mention how their data and ML pipelines work in production? Or at least hypothesize if they don't publish. Just kind of applying what was mentioned above about syncing when there is access to a gateway (charging time?) and communication; this may overlap a little with the federated learning section in concepts.
I agree that it doesn't directly tie to the embedded side of things but it was more about how the data pipelines were used. So that end I agree we should talk about it. I know they use the usual data lakes etc. on the backend, so we can link back to the data engineering chapter for this. Will do that.
ops.qmd (outdated)

For the ridesharing company, implementing MLOps means their demand prediction model can be frequently retrained and deployed based on new incoming data. This keeps the model accurate despite changing rider behavior. MLOps also allows the company to experiment with new modeling techniques since models can be quickly tested and updated.

Other MLOps benefits include enhanced model lineage tracking, reproducibility, and auditing. Cataloging ML workflows and standardizing artifacts enables deeper insight into model provenance. It also facilitates regulation compliance, which is especially critical in regulated industries like healthcare and finance.
Not sure what "standardizing artifacts" means in this context. Maybe reword or add examples? "[...]and standardizing artifacts -- such as ex1, ex2-- enables deeper[...]"
Good point, I expanded it:
Other MLOps benefits include enhanced model lineage tracking, reproducibility, and auditing. Cataloging ML workflows and standardizing artifacts - such as logging model versions, tracking data lineage, and packaging models and parameters - enables deeper insight into model provenance. Standardizing these artifacts facilitates tracing a model back to its origins, replicating the model development process, and examining how a model version has changed over time. This also facilitates regulation compliance, which is especially critical in regulated industries like healthcare and finance where being able to audit and explain models is important.
ops.qmd (outdated)

The term "DevOps" was first coined in 2009 by [Patrick Debois](https://www.jedi.be/), a consultant and Agile practitioner. Debois organized the first [DevOpsDays](https://www.devopsdays.org/) conference in Ghent, Belgium, in 2009, which brought together development and operations professionals to discuss ways to improve collaboration and automate processes. The conference was a success, and the DevOps movement started to gain momentum.

The key principles of DevOps include collaboration, automation, continuous integration and delivery, and feedback. These principles are aligned with the Agile methodology, which emphasizes collaboration, customer feedback, and iterative releases. DevOps extends the Agile principles to include operations teams and focuses on automating the entire software delivery pipeline, from development to deployment.
This paragraph gives a good high-level definition of DevOps, and therefore, might make more sense to be moved to the very start of this section.
Thanks, that makes sense, done!
ops.qmd (outdated)

DevOps has its roots in the [Agile](https://agilemanifesto.org/) movement, which began in the early 2000s as a reaction to the limitations of traditional software development methodologies, such as the [Waterfall model](https://www.tutorialspoint.com/sdlc/sdlc_waterfall_model.htm). Agile emphasizes collaboration, customer feedback, and small, iterative releases, which are in stark contrast to the long, siloed development cycles and rigid structures of traditional methodologies. Agile provided the foundation for a more collaborative and responsive approach to software development.

As Agile methodologies became more popular, organizations realized the need for better collaboration and communication between development and operations teams. The siloed nature of development and operations teams often led to inefficiencies, conflicts, and delays in software delivery. This need for better collaboration and integration between development and operations teams led to the [DevOps](https://www.atlassian.com/devops) movement.
Not super sure the relationship between DevOps and Agile. Is DevOps a second movement that followed the Agile movement? What are the key differences?
If they are very similar movements, do we need to go into so much detail about Agile? Thinking specifically that the second paragraph ("As Agile..") feels redundant to the last sentence of the prior paragraph.
Good clarification request. I updated the text to this Eura.
The term "DevOps" was first coined in 2009 by Patrick Debois, a consultant and Agile practitioner. Debois organized the first DevOpsDays conference in Ghent, Belgium, in 2009, which brought together development and operations professionals to discuss ways to improve collaboration and automate processes.
DevOps has its roots in the Agile movement, which began in the early 2000s. Agile provided the foundation for a more collaborative approach to software development and emphasized small, iterative releases. However, Agile primarily focused on collaboration between development teams. As Agile methodologies became more popular, organizations realized the need to extend this collaboration to operations teams as well.
The siloed nature of development and operations teams often led to inefficiencies, conflicts, and delays in software delivery. This need for better collaboration and integration between these teams led to the DevOps movement. In a sense, DevOps can be seen as an extension of the Agile principles to include operations teams.
The key principles of DevOps include collaboration, automation, continuous integration and delivery, and feedback. DevOps focuses on automating the entire software delivery pipeline, from development to deployment. It aims to improve the collaboration between development and operations teams, utilizing tools like Jenkins, Docker, and Kubernetes to streamline the development lifecycle.
While Agile and DevOps share common principles around collaboration and feedback, DevOps specifically targets the integration of development and IT operations - expanding Agile beyond just development teams. It introduces practices and tools to automate software delivery and enhance the speed and quality of software releases.
ops.qmd (outdated)

[MLOps](https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning), on the other hand, stands for Machine Learning Operations, and it extends the principles of DevOps to the ML lifecycle. MLOps aims to automate and streamline the end-to-end ML lifecycle, from data preparation and model development to deployment and monitoring. The main focus of MLOps is to facilitate collaboration between data scientists, data engineers, and IT operations, and to automate the deployment, monitoring, and management of ML models.

Some key factors led to the rise of MLOps:
Love this list!
ops.qmd (outdated)

* Difficulty reproducing and explaining model behavior
* Lack of visibility into model performance post-deployment
* Painful retraining and deployment procedures
* Infrastructure misconfigured for ML
What does this mean?
Thanks, I expanded on these:

- Data drift - As new data enters a production system over time, its statistics and properties may change, causing model performance to degrade if the models are not retrained on updated data. MLOps provides monitoring to detect data drift.
- Reproducibility - It's challenging to reproduce the exact behavior and performance of machine learning models without capturing detailed info about code, data, and environment. MLOps enables reproducible experiments through pipeline and artifact tracking.
- Explainability - The complex inner workings of many machine learning models make it hard to explain why they make certain predictions. MLOps systems aim to increase model transparency and explainability.
- Performance monitoring - Unlike traditional software, it's not straightforward to monitor if a deployed ML model still performs well over time. MLOps provides model performance instrumentation and alerts around key metrics.
- Deployment friction - Transitioning models from development to production is often a time-consuming manual process. MLOps automates model deployment pipelines to make it easier and more reliable.
- Infrastructure optimization - Setting up and configuring infrastructure tailored for machine learning is challenging. MLOps solutions provide ready-made infrastructure optimized for ML experiments and workloads.
ops.qmd (outdated)

MLOps also requires collaboration between various stakeholders, including data scientists, data engineers, and IT operations.

While DevOps and MLOps share similarities in their goals and principles, they differ in their focus and challenges. DevOps focuses on improving the collaboration between development and operations teams and automating software delivery. In contrast, MLOps focuses on streamlining and automating the ML lifecycle and facilitating collaboration between data scientists, data engineers, and IT operations.
Appreciate this specific compare and contrast.
What do you think about adding a table like the following to the beginning of this section, to provide overview?
| | Dev Ops | ML Ops |
|---|---|---|
| Goal is to automate... | Software delivery | ML Lifecycle |
| Facilitates collaboration between... | Development and operations teams | Data scientists, data engineers, IT operations |
| Lifecycle components include... | ?? | Deployment, monitoring, and management of ML models |
Love the idea! Added this.
| Aspect | DevOps | MLOps |
|---|---|---|
| Objective | Streamlining software development and operations processes | Optimizing the lifecycle of machine learning models |
| Methodology | Continuous Integration and Continuous Delivery (CI/CD) for software development | Similar to CI/CD but focuses on machine learning workflows |
| Primary Tools | Version control (Git), CI/CD tools (Jenkins, Travis CI), Configuration management (Ansible, Puppet) | Data versioning tools, Model training and deployment tools, CI/CD pipelines tailored for ML |
| Primary Concerns | Code integration, Testing, Release management, Automation, Infrastructure as code | Data management, Model versioning, Experiment tracking, Model deployment, Scalability of ML workflows |
| Typical Outcomes | Faster and more reliable software releases, Improved collaboration between development and operations teams | Efficient management and deployment of machine learning models, Enhanced collaboration between data scientists and engineers |
ops.qmd (outdated)

## Key Components of MLOps

In this chapter, we will provide an overview of the core components of MLOps, an emerging set of practices that enables robust delivery and lifecycle management of ML models in production. While some MLOps elements like automation and monitoring were covered in previous chapters, we will integrate them into an integrated framework and expand on additional capabilities like governance. By the end, we hope that you will understand the end-to-end MLOps methodology that takes models from ideation to sustainable value creation within organizations.
Maybe mention in a sentence that, in this section, you describe and link to popular tools that are used within each component (e.g. you point out things like "LabelStudio" for data labeling). This is what I found personally most helpful about this section!
Good feedback, fixed: In this chapter, we will provide an overview of the core components of MLOps, an emerging set of practices that enables robust delivery and lifecycle management of ML models in production. While some MLOps elements like automation and monitoring were covered in previous chapters, we will integrate them into an integrated framework and expand on additional capabilities like governance. Additionally, we will describe and link to popular tools used within each component, such as LabelStudio for data labeling. By the end, we hope that you will understand the end-to-end MLOps methodology that takes models from ideation to sustainable value creation within organizations.
Automating and standardizing model training empowers teams to accelerate experimentation and achieve the rigor needed for production of ML systems.

### Model Evaluation
Overall, I think this section reads a bit like a laundry list and would benefit from linking the evaluation options (e.g. do I use accuracy or recall?) to some sort of meaningful downstream application.
Specifically, the product recommendation example could be moved to the start of the section and then used as an example throughout. For example, after explaining all the metrics (accuracy, AUC, recall, F1), you could have a sentence explaining how you would choose a metric with the product example-- "For product recommendations, we might focus on minimizing false negatives, to avoid withholding items that may interest the customer."
@eurashin thanks for the candid feedback. Did you mean the model evaluation section? Or the one ending just before that?
## Embedded System Challenges

We will briefly review the challenges with embedded systems so taht it sets the context for the specific challenges that emerge with embedded MLOps that we will discuss in the following section.
typo in "taht"
Gotcha
Co-Authored-By: Alex Rodriguez <alexbrodriguez@gmail.com>
Co-Authored-By: eurashin <25086665+eurashin@users.noreply.github.com>
This is not the final draft yet but pushing so we can iterate on it together. Notably it still needs Learning Objectives, a conclusion and proper citations.
Content
References & Citations
Quarto Website Rendering
Grammar & Style
Collaboration
Miscellaneous
Final Steps