
Fix image path in efficient_ai.qmd and update data_engineering.qmd #301

Closed
wants to merge 6 commits

Conversation

Sara-Khosravi
Contributor

@Sara-Khosravi Sara-Khosravi commented Jul 2, 2024

Before submitting your Pull Request, please ensure that you have carefully reviewed and completed all items on this checklist.

  1. Content

    • The chapter content is complete and covers the topic in detail.
    • All technical terms are well-defined and explained.
    • Any code snippets or algorithms are well-documented and tested.
    • The chapter follows a logical flow and structure.
  2. References & Citations

    • All references are correctly listed at the end of the chapter.
    • In-text citations are used appropriately and match the references.
    • All figures, tables, and images have proper sources and are cited correctly.
  3. Quarto Website Rendering

    • The chapter has been locally built and tested using Quarto.
    • All images, figures, and tables render properly without any glitches.
    • All images have a source or they are properly linked to external sites.
    • Any interactive elements or widgets work as intended.
    • The chapter's formatting is consistent with the rest of the book.
  4. Grammar & Style

    • The chapter has been proofread for grammar and spelling errors.
    • The writing style is consistent with the rest of the book.
    • Any jargon is clearly explained or avoided where possible.
  5. Collaboration

    • All group members have reviewed and approved the chapter.
    • Any feedback from previous reviews or discussions has been addressed.
  6. Miscellaneous

    • All external links (if any) are working and lead to the intended destinations.
    • If datasets or external resources are used, they are properly credited and linked.
    • Any necessary permissions for reused content have been obtained.
  7. Final Steps

    • The chapter is pushed to the correct branch on the repository.
    • The Pull Request is made with a clear title and description.
    • The Pull Request includes any necessary labels or tags.
    • The Pull Request mentions any stakeholders or reviewers who should take a look.

@Sara-Khosravi
Contributor Author

Sara-Khosravi commented Jul 2, 2024

@profvjreddi

Hi Professor Vijay,

I hope all is well! I've made significant updates and fixes that are crucial for our project's progress:

Fixed the image path in efficient_ai: I corrected the reference so the image renders correctly in the document.
Updated data_engineering: I refined the content for better clarity and coherence.
Rendered the site locally with Quarto to confirm the changes are reflected in the output.

Additionally, I am working on telecom outage prediction and am fully committed to applying TinyML in this domain to the best of my abilities.

Working on this book has been a truly enjoyable experience, especially given my over 7 years of industry experience. I am deeply committed to this project and look forward to our continued collaboration.

Please let me know your feedback, and I am ready to work on other chapters.

Warm regards,
Sara

@profvjreddi
Contributor

@Sara-Khosravi thanks again for these edits. In the future, could you please make sure that you modify only one file at a time, as it is easier to do merges and rollbacks if needed?

@Sara-Khosravi
Contributor Author

Sara-Khosravi commented Jul 4, 2024 via email

Contributor

@profvjreddi left a comment


Could you please look over the comments and make the small tweaks?

@@ -173,7 +173,8 @@ Another important consideration is the relationship between model complexity and

Furthermore, while benchmark datasets, such as ImageNet [@russakovsky2015imagenet], COCO [@lin2014microsoft], Visual Wake Words [@chowdhery2019visual], Google Speech Commands [@warden2018speech], etc. provide a standardized performance metric, they might not capture the diversity and unpredictability of real-world data. Two facial recognition models with similar benchmark scores might exhibit varied competencies when faced with diverse ethnic backgrounds or challenging lighting conditions. Such disparities underscore the importance of robustness and consistency across varied data. For example, @fig-stoves from the Dollar Street dataset shows stove images across extreme monthly incomes. Stoves have different shapes and technological levels across different regions and income levels. A model that is not trained on diverse datasets might perform well on a benchmark but fail in real-world applications. So, if a model was trained on pictures of stoves found in wealthy countries only, it would fail to recognize stoves from poorer regions.

![Different types of stoves. Credit: Dollar Street stove images.](https://pbs.twimg.com/media/DmUyPSSW0AAChGa.jpg){#fig-stoves}
![Different types of stoves. Credit: Dollar Street stove images.](images/jpg/DmUyPSSW0AAChGa.jpg))
Contributor


Could you please rename the file to something like dollar_street.jpg?

@@ -173,7 +173,8 @@ Another important consideration is the relationship between model complexity and

Furthermore, while benchmark datasets, such as ImageNet [@russakovsky2015imagenet], COCO [@lin2014microsoft], Visual Wake Words [@chowdhery2019visual], Google Speech Commands [@warden2018speech], etc. provide a standardized performance metric, they might not capture the diversity and unpredictability of real-world data. Two facial recognition models with similar benchmark scores might exhibit varied competencies when faced with diverse ethnic backgrounds or challenging lighting conditions. Such disparities underscore the importance of robustness and consistency across varied data. For example, @fig-stoves from the Dollar Street dataset shows stove images across extreme monthly incomes. Stoves have different shapes and technological levels across different regions and income levels. A model that is not trained on diverse datasets might perform well on a benchmark but fail in real-world applications. So, if a model was trained on pictures of stoves found in wealthy countries only, it would fail to recognize stoves from poorer regions.

![Different types of stoves. Credit: Dollar Street stove images.](https://pbs.twimg.com/media/DmUyPSSW0AAChGa.jpg){#fig-stoves}
![Different types of stoves. Credit: Dollar Street stove images.](images/jpg/DmUyPSSW0AAChGa.jpg))
{#fig-stoves}
Contributor


For consistency, could we please put this next to the closing `]`?
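
Taken together with the rename suggested in the earlier comment, the corrected figure line would presumably look like this (a sketch; the dollar_street.jpg filename is the reviewer's suggestion, not yet in the repo):

```markdown
![Different types of stoves. Credit: Dollar Street stove images.](images/jpg/dollar_street.jpg){#fig-stoves}
```

In Quarto, the `{#fig-stoves}` attribute must immediately follow the closing parenthesis of the image link for the `@fig-stoves` cross-reference in the text to resolve.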

@@ -8,7 +8,7 @@ bibliography: data_engineering.bib
Resources: [Slides](#sec-data-engineering-resource), [Videos](#sec-data-engineering-resource), [Exercises](#sec-data-engineering-resource), [Labs](#sec-data-engineering-resource)
:::

![_DALL·E 3 Prompt: Create a rectangular illustration visualizing the concept of data engineering. Include elements such as raw data sources, data processing pipelines, storage systems, and refined datasets. Show how raw data is transformed through cleaning, processing, and storage to become valuable information that can be analyzed and used for decision-making._](images/png/cover_data_engineering.png)
Contributor


We should keep this as-is because this is verbatim what went into the DALL·E model :)

Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format for machine learning model development.
Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality.
By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems. This includes applications in embedded systems and TinyML, where resource constraints demand particularly efficient and effective data-handling practices. In the context of TinyML, data engineering practices take on a unique character. Resource-constrained devices often necessitate smaller datasets with high signal-to-noise ratios. Data collection may be limited to on-device sensors or specific environmental conditions. Crowdsourcing and synthetic data generation have become precious tools for generating specialized datasets with limited memory and processing power. Careful optimization techniques for data cleansing, feature selection, and model compression are essential for TinyML applications. By understanding these nuances, data engineers can empower the development of efficient and effective AI solutions at the edge.
## Resources {#sec-data-engineering-resource .unnumbered}

Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing, and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means, including existing datasets, web scraping, crowdsourcing, and synthetic data generation. Each approach involves tradeoffs between cost, speed, privacy, and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems, including embedded and TinyML applications.
Contributor


Does this need to be deleted based on the above text?

There seems to be some repetition.

@profvjreddi
Contributor

profvjreddi commented Jul 4, 2024 via email

@Sara-Khosravi
Contributor Author

Sara-Khosravi commented Jul 4, 2024 via email

@profvjreddi
Contributor

Looked over this and the changes are already merged in from other edits we did, so these updates are already in!
