Unraveling mysteries hidden within datasets, a relentless data detective, transforming chaos into knowledge.
- 👋 Hi, I’m @TatjanaChernenko
- 👀 I’m interested in Data Science, ML/DL, and NLP.
- 📫 How to reach me: tatjana.chernenko.work@gmail.com
- 📁 New Public Repository: This new public GitHub profile contains both my older projects (from approx. 2015 onward) and my newer ones, uploaded now after years of working in a private capacity due to the privacy policies of my employers.
- 📁 Project Uploads: All projects uploaded here are from my personal endeavors and university research. Due to privacy policies at SAP SE, where I am employed, I am unable to share work-related projects publicly. These repositories exclusively feature my private projects and are newly uploaded to this fresh GitHub profile. Thank you for your understanding.
CHERTOY: Word Sense Induction for Web Search Result Clustering
- GitHub: CHERTOY System
Data-to-text Generation
- GitHub: Data-to-text Generation
Text Summarization with LexRank
- GitHub: Text Summarization with LexRank
LSTM for predictive maintenance of aircraft machines
Anomaly Detection for Time Series with IBM API (SVR), K-Means clustering, statsmodels decomposition and Fourier analysis
Reinforcement Learning Agent for Bomberman
- GitHub: RL Agent for Bomberman
Speech-to-text with Transfer Learning
Data Augmentation Techniques for Classification
- GitHub: Data Augmentation Techniques
[My Playground (smaller projects / samples)](#playground):
- EDA (Exploratory Data Analysis)
- Basic NLP Examples
- Text Categorisation Task with ML
- Dialogue Systems
- Recommendation Systems
- Sentiment Analysis
- Voice technologies (speech-to-text, speech-to-speech, text-to-speech)
- Various ML tasks
- Apps with ChatGPT and OpenAI
- Databases, SQL, NoSQL, web scraping, email notifications
- NMT
- 2017/2018 CHERTOY: Word Sense Induction for better web search result clustering - An approach to improve word sense induction systems (WSI) for web search result clustering. Exploring the boundaries of vector space models for the WSI Task. CHERTOY system. Authors: Tatjana Chernenko, Utaemon Toyota.
Whitepaper - link
Key words: word sense induction, web search results clustering, ML, NLP, word2vec, sent2vec, data science, data processing.
- 2018 Data-to-text: Natural Language Generation from structured inputs - This project investigates the generation of image descriptions, focusing on the spatial relationships between objects and on sufficient attributes for the objects. Leveraging an encoder-decoder architecture with LSTM cells (based on Dong et al., 2017), the system transforms normalized vector representations of attributes into fixed-length vectors. These vectors serve as initial states for a decoder generating target sentences from sequences in description sentences.
Whitepaper - link
Key words: natural language generation, encoder-decoder, ML, NLP, data science, feed-forward neural network, LSTMs.
- 2018 Text Summarization research: Optimizing LexRank system with ECNU features - enhancing the LexRank-based text summarization system by incorporating semantic similarity measures from the ECNU system. The LexRank-based text summarization system employs a stochastic graph-based method to compute the relative importance of textual units for extractive multi-document text summarization. This implementation initially utilizes cosine similarity between sentences as a key metric. In this model, a connectivity matrix based on intra-sentence cosine similarity is used as the adjacency matrix of the graph representation of sentences. The objective is to explore the impact of replacing cosine similarity with a combination of features from the ECNU system, known for its semantic similarity measure. This modification aims to improve the summarization effectiveness of the LexRank approach.
Whitepaper - link
Key words: natural language processing, text summarization, ML, NLP, data science, LexRank, ECNU, semantic similarity metrics, multi-document text summarization, cosine similarity, connectivity matrix, optimization.
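The graph-based scoring described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes plain bag-of-words cosine similarity (no idf weighting), and the `threshold`, `damping` and `iterations` values are illustrative choices. The `similarity` parameter is a hypothetical hook marking where a different measure, such as ECNU-style features, could be plugged in.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def lexrank(sentences, similarity=cosine, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by centrality in a thresholded similarity graph."""
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    if n == 0:
        return []
    # Connectivity matrix: 1 where the pairwise similarity exceeds the threshold.
    adj = [[1.0 if similarity(bags[i], bags[j]) > threshold else 0.0 for j in range(n)]
           for i in range(n)]
    # Row-normalise so that every row is a probability distribution.
    for row in adj:
        total = sum(row)
        for j in range(n):
            row[j] = row[j] / total if total else 1.0 / n
    # Damped power iteration towards the stationary distribution.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n + damping * sum(adj[i][j] * scores[i] for i in range(n))
            for j in range(n)
        ]
    return scores


sentences = [
    "neural networks summarize long documents",
    "neural networks need training data",
    "long documents contain many sentences",
]
scores = lexrank(sentences)
best = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[best])
```

Swapping the similarity measure only changes how the connectivity matrix is built; the ranking step stays the same.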
- 2019, Reinforcement Learning agent for Bomberman game - Training an RL agent for the multi-player game Bomberman using deep Q-learning with a dueling network architecture, separate decision and target networks, and prioritized experience replay.
Whitepaper - link
Key words: reinforcement learning, q-learning.
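The three ingredients named above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the agent's actual code: the function names are hypothetical, the networks are replaced by already-computed numbers, and `gamma`, `eps` and `alpha` are illustrative defaults.

```python
from typing import List, Sequence


def dueling_q_values(state_value: float, advantages: Sequence[float]) -> List[float]:
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over actions of A(s, a)."""
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + a - mean_adv for a in advantages]


def td_target(reward: float, next_q_target: Sequence[float],
              gamma: float = 0.99, done: bool = False) -> float:
    """Bootstrapped target computed from the target network's Q-values."""
    return reward if done else reward + gamma * max(next_q_target)


def priority(td_error: float, eps: float = 1e-3, alpha: float = 0.6) -> float:
    """Replay priority: larger TD errors are sampled more often."""
    return (abs(td_error) + eps) ** alpha


q_now = dueling_q_values(1.0, [1.0, 2.0, 3.0])        # decision network output
target = td_target(1.0, [0.0, 2.0], gamma=0.5)         # target network output
print(q_now, target, priority(target - q_now[2]))
```

Subtracting the mean advantage keeps the value and advantage streams identifiable, and the separate target network keeps the bootstrapped target from shifting with every gradient step.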
- 2018, Speech-to-text: Transfer Learning for Automatic Speech Translation (playground) - Playground for Automated Speech Translation (AST) with transfer learning vs. AST trained from scratch; hyperparameter tuning and evaluation.
Report - link
Key words: transfer learning, automated speech translation
- 2018, Data Augmentation techniques for binary- and multi-label classification - Exploring Data Augmentation techniques (Thesaurus and Backtranslation, a winning Kaggle technique) to expand existing datasets, evaluated on binary- and multi-label classification tasks (spam/not spam and news article classification). This matters when training data is limited, especially in Machine Learning (ML) or Deep Learning (DL) applications. The core idea is to alter the text while retaining its meaning, which increases the diversity of the dataset.
Key words: data augmentation, data science, ML, DL, binary and multi-class classification
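As a rough illustration of the thesaurus idea, the sketch below replaces known words with synonyms. Everything in it is an assumption made for the example: the tiny `THESAURUS` dictionary, the `augment` function and the `prob` parameter are hypothetical, and a real setup would draw synonyms from a proper lexical resource.

```python
import random

# Hypothetical toy thesaurus; a real one would come from a lexical resource.
THESAURUS = {
    "cheap": ["inexpensive", "affordable"],
    "win": ["gain", "earn"],
    "free": ["complimentary"],
}


def augment(text, thesaurus=THESAURUS, prob=0.5, rng=None):
    """Return a copy of text with some known words swapped for synonyms."""
    rng = rng or random.Random(0)  # seeded so that runs are repeatable
    out = []
    for token in text.split():
        synonyms = thesaurus.get(token.lower())
        if synonyms and rng.random() < prob:
            out.append(rng.choice(synonyms))
        else:
            out.append(token)
    return " ".join(out)


print(augment("win a free prize", prob=1.0))
```

Each augmented sentence keeps the label of its source sentence, so the dataset grows without new annotation work.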
- LSTM for predictive maintenance of aircraft machines: failure and RUL (remaining useful life) prediction - Predictive Maintenance: Use LSTM to predict failure (binary classification) and RUL (remaining useful life or time to failure with regression) of aircraft engines.
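The two prediction targets can be derived from run-to-failure data as in the sketch below. It assumes each engine is observed once per cycle until it fails; the function names, the `window` size and the sequence length are hypothetical choices for illustration, not values taken from the project.

```python
def rul_labels(num_cycles, window=30):
    """For an engine that ran num_cycles cycles until failure, return one
    (rul, fails_within_window) pair per cycle."""
    labels = []
    for cycle in range(1, num_cycles + 1):
        rul = num_cycles - cycle                   # regression target
        labels.append((rul, int(rul <= window)))   # binary classification target
    return labels


def sliding_windows(rows, seq_len):
    """Cut a per-cycle series into overlapping sequences for an LSTM."""
    return [rows[i:i + seq_len] for i in range(len(rows) - seq_len + 1)]


print(rul_labels(5, window=2))
print(sliding_windows([1, 2, 3, 4], 3))
```

Each window of sensor readings would then be paired with the label of its last cycle before being fed to the network.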
Anomaly Detection for Time Series with IBM API (SVR), K-Means clustering, statsmodels decomposition and Fourier analysis
- IBM API for anomaly detection, univariate data - Jupyter Notebook
Categorization task with ML Algorithms for Reuters text categorization benchmark dataset - LinearSVC (Linear Support Vector Classifier), Decision Tree, Random Forest, Logistic Regression, k-Nearest Neighbors (k-NN), Naive Bayes, AdaBoost, LDA (Linear Discriminant Analysis), RBM (Restricted Boltzmann Machine), MLP (Multilayer Perceptron).
Collection of chatbots, dialogue systems
(coming soon)
- Exploratory Data Analysis of Airbnb rental prices in New York, 2019 - Jupyter Notebook
(further projects coming soon)
- NLP examples - Jupyter Notebook with data preprocessing, top words, word cloud, frequencies, AgglomerativeClustering, PCA, Sentiment analysis, Topic Detection
- REGEX examples - simple summary of regex examples in Jupyter Notebook.
- LinkedIn web scraping, saving data to local MongoDB and csv, filtering and updating the user via email - automatically extracting job postings from LinkedIn according to predefined settings, storing them in a local MongoDB database and a csv file, searching for relevant positions based on keywords, and notifying the user via email (Gmail API) about relevant job opportunities.
- https://github.com/TatjanaChernenko/ml_playground
- Regression Task: Predicting Airbnb rental prices in New York - Regression task to predict rental prices in New York, playground. Models used: Linear Regression, Decision Trees, NNs.
(coming soon)
- OpenAI basic app - an update of the basic OpenAI sample app that generates pet names, adapted to the changes in OpenAI's code (January 2024)
- [fork: GPT Chatbot - customizable](https://github.com/TatjanaChernenko/customizable-gpt-chatbot) - A dynamic, scalable AI chatbot built with Django REST framework, supporting custom training from PDFs, documents, websites, and YouTube videos. Leveraging OpenAI's GPT-3.5, Pinecone, FAISS, and Celery for seamless integration and performance.
(coming soon)
- Question answering with DistilBERT - Question answering with DistilBERT, HuggingFace
- Document Question Answering with LayoutLM - This is a fine-tuned version of the multi-modal LayoutLM model for the task of question answering on documents. It has been fine-tuned using both the SQuAD2.0 and DocVQA datasets.
Own projects: (to be uploaded soon)
Forks:
- Recommendation System with TensorFlow, approx.2020 - TensorFlow Recommenders is a library for building recommender system models using TensorFlow. Fork from smellslikeml
- TF-Recommenders with Kubernetes - Example of kubernetes deployment for tf-recommenders model
(to be uploaded soon)
Forks:
- Tweet Analysis - Analyzing ChatGPT-related tweets to observe technology interest trends over time
Own projects: (to be uploaded soon)
Forks:
- Whisper OpenAI - Robust Speech Recognition via Large-Scale Weak Supervision
- WhisperX Timestamps (& Diarization) - Automatic Speech Recognition with Word-level Timestamps (& Diarization)
- Whisper real-time - real-time speech-to-text conversion with Whisper
- SpeechGPT - detects microphone input and converts it to text using Google's Speech Recognition API. It then opens ChatGPT and inputs the recognized text using selenium. It can be used with a wake word, and it can also use text to speech to repeat ChatGPT's answer to the query.
- Speaker Diarization Whisper - Whisper with Speaker Diarization based on OpenAI Whisper
- Speech-to-Text-WaveNet: End-to-end sentence level English speech recognition based on DeepMind's WaveNet and tensorflow (forked from buriburisuri)
- Speech-to-text via Whisper and GPT-4 - transcribing dictations to text using Whisper, then turning the resulting transcriptions into usable text using GPT-4 (forked from MNoichl)
- TensorFlow Speech Recognition - audio processing and speech classification with Tensorflow - convolution neural networks (forked from harshel)
- Watson_STT_CustomModel - a custom speech model using IBM Watson Speech to Text; an old one (approx. 2018)
- Simple Speech Recognition with Python - very simple setup using SpeechRecognition Python module
- CTTS - Controllable Text-to-speech system, based on Microsoft's FastSpeech2
- Google Sheets to Speech - Excel-to-speech, forked from Renoncio: A Python script for generating audio from a list of sentences in Google Sheets.
- StreamlitTTS - a Streamlit app that converts text to audio files using Microsoft Edge's online text-to-speech service.
- Dolla Llama: Real-Time Co-Pilot for Closing the Deal - forked from smellslikeml; power a real-time speech-to-text agent with retrieval augmented generation based on webscraped customer use-cases, implements speech-to-text (STT) and retrieval-augmented generation (RAG) to assist live sales calls.
- Text-to-Speech on AWS - forked from codets1989; converting text to speech with AWS Lambda and Polly and creating an automated pipeline
- Whisper speech-to-text Telegram bot - forked from loyal-pelmen; Speech-to-Text Telegram bot
- DeepSpeech on devices - embedded (offline, on-device) speech-to-text engine which can run in real time ranging from a Raspberry Pi 4 to high power GPU servers
- Bash Whisper - using a Digital Voice Recorder (DVR) - Bash function to ease the transcription of audio files with OpenAI's whisper.
- Awesome Whisper - model variants and playgrounds
- TikTok Analyzer - Video Scraping and Content Analysis Tool. Search & download TikTok videos by username and/or video tag, and analyze video contents. Transcribe video speech to text and perform NLP analysis tasks (e.g., keyword and topic discovery; emotion/sentiment analysis). Isolate audio signal and perform signal processing analysis tasks (e.g., pitch, prosody and sentiment analysis). Isolate visual stream and perform image tasks (e.g., object detection; face detection).
- SpeechBrain - an open-source PyTorch toolkit that accelerates Conversational AI development; spans speech recognition, speaker recognition, speech enhancement, speech separation, language modeling, dialogue, and beyond. Over 200 competitive training recipes on more than 40 datasets supporting 20 speech and text processing tasks. Supports both training from scratch and fine-tuning pretrained models such as Whisper, Wav2Vec2, WavLM, Hubert, GPT2, Llama2, and beyond. The models on HuggingFace can be easily plugged in and fine-tuned.
- Speech Synthesis Markup - SSML - XML-based markup language that you can use to fine-tune your text to speech output attributes (tutorial from Microsoft).
(further projects coming soon)
(coming soon)
Own projects: (to be uploaded soon)
Forks/Inspiration:
- Laundry Sorting with a robotic arm - Sorting laundry autonomously with a robotic arm and Computer Vision
- AutoXlicker - A lightweight and customizable autoclicker tool for automating repetitive clicking tasks.
- [Hand Gesture Computer Interface](https://github.com/TatjanaChernenko/Hand-Gesture-Computer-Interface) - Software that allows you to interact with your computer through hand gestures.
Category | Project Title | GitHub |
---|---|---|
Research Repositories | CHERTOY: Word Sense Induction for better web search result clustering | CHERTOY System |
Research Repositories | Data-to-text: Natural Language Generation from structured inputs | Data-to-text Generation |
Research Repositories | Text Summarization research: Optimizing LexRank system with ECNU features | Text Summarization with LexRank |
Research Repositories | Reinforcement Learning agent for Bomberman game | RL Agent for Bomberman |
Research Repositories | Speech-to-text: Transfer Learning for Automatic Speech Translation (playground) | Speech-to-text with Transfer Learning |
Research Repositories | Data Augmentation techniques for binary- and multi-label classification | Data Augmentation Techniques |
Predictive Maintenance | LSTM for predictive maintenance of aircraft machines: failure and RUL (remaining useful life) prediction | Predictive Maintenance with LSTM |
Anomaly Detection | Anomaly Detection for Time Series with IBM API (SVR), K-Means clustering, statsmodels decomposition and Fourier analysis | IBM API for anomaly detection, univariate data |
Text Categorisation | Text Categorisation Task with ML (Reuters) | Categorization task with ML Algorithms for Reuters text categorization benchmark dataset |
Playground | Exploratory Data Analysis of Airbnb rental prices in New York, 2019 | EDA of Airbnb Prices in New York |
Playground | Basic NLP Examples | NLP examples |
Databases, SQL, NoSQL, web scraping, email notifications | LinkedIn web scraping, saving data to local MongoDB and csv, filtering and updating the user via email | LinkedIn Web Scraping and Email Notifications |
Various ML tasks | Regression Task: Predicting Airbnb rental prices in New York | Regression Task with Airbnb Data |
Dialogue Systems | Question answering with DistilBERT | DistilBERT Question Answering |
Dialogue Systems | Document Question Answering with LayoutLM | LayoutLM Document QA |
Recommendation Systems | Recommendation System with TensorFlow | TensorFlow Recommenders |
Sentiment Analysis | Sentiment Analysis | (to be uploaded soon) |
Voice Technologies | Speech-to-Text-WaveNet | Speech-to-Text-WaveNet |
Voice Technologies | Speech-to-text via Whisper and GPT-4 | Speech-to-text with Whisper to GPT |
Voice Technologies | TensorFlow Speech Recognition | TensorFlow Speech Recognition |
Voice Technologies | Watson_STT_CustomModel | Watson STT Custom Model |
Voice Technologies | Simple Speech Recognition with Python | Simple Speech Recognition |
Voice Technologies | CTTS | CTTS |
Voice Technologies | Google Sheets to Speech | Google Sheets to Speech |
Voice Technologies | StreamlitTTS | StreamlitTTS |
Voice Technologies | Dolla Llama: Real-Time Co-Pilot for Closing the Deal | Dolla Llama |
Voice Technologies | Text-to-Speech on AWS | Text-to-Speech on AWS |
Voice Technologies | Whisper speech-to-text Telegram bot | Whisper Speech-to-Text Telegram Bot |
NMT | NMT (Neural Machine Translation) | (coming soon) |
- AI for Time Series - papers, tutorials, surveys (!)
- IBM Hub API Tutorial - Anomaly Detection - use IBM API for anomaly detection
- IBM API for Anomaly Detection - playing around with the Anomaly Detection service to be made available on IBM API Hub
- AWS Forecast - end-to-end guide - Prediction with AWS
- Amazon Monitron Guidance for Predictive Maintenance - predictive maintenance management in industrial environments using Amazon Monitron and other AWS services.
- Azure Predictive Maintenance Template - Regression: predict the Remaining Useful Life (RUL), or Time to Failure (TTF); binary classification: predict if an asset will fail within certain time frame (e.g. days). Multi-class classification: Predict if an asset will fail in different time windows: E.g., fails in window [1, w0] days; fails in the window [w0+1,w1] days; not fail within w1 days.
- AI for Time Series - Tutorials
- PredictionIO - Apache; a machine learning server for developers and ML engineers.
- Conformal Prediction Tutorials - A professionally curated list of awesome Conformal Prediction videos, tutorials, books, papers, PhD and MSc theses, articles and open-source libraries.
- Time Series Prediction - LSTM Neural Network for Time Series Prediction
- Stock Prediction Models - gathers machine learning and deep learning models for Stock forecasting including trading bots and simulations
- Lime: Explaining the predictions of any machine learning classifier
- Time Series Prediction - TensorFlow Tutorial for Time Series Prediction
- Awesome-time-series - A comprehensive survey on the time series domains
- Predictive Maintenance Datasets - Datasets for predictive maintenance
- Data Science Resources - learning - The open-source curriculum for learning to be a Data Scientist (quite basic, but nice links to books, etc.)
- Data Science Resources - A Data Science repository to learn from and apply to real-world problems.
- Data Science Cheatsheets - List of Data Science Cheatsheets to rule the world
- Python Data Science Handbook - full text in Jupyter Notebooks
- Data science Python notebooks - Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.
- Datasets from Huggingface - collection of Huggingface datasets
- Huggingface - web API for visualizing and exploring datasets - Lightweight web API for visualizing and exploring all types of datasets - computer vision, speech, text, and tabular - stored on the Hugging Face Hub
- Huggingface - analyse datasets - EDA from Huggingface (Developing tools to automatically analyze datasets)
- Jupyter Notebooks for Big Data - with Spark and Hadoop - A guide on how to use Jupyter Notebook with big data frameworks like Apache Spark and Hadoop, including recommended libraries and tools.
- NLP state-of-the-art - Tracking Progress in Natural Language Processing
- NMT Tutorial - Neural Machine Translation (NMT) tutorial. Data preprocessing, model training, evaluation, and deployment.
- NMT - An educational tool to train, inspect, evaluate and translate using neural engines
- FasterNMT - NMT incl. data preprocessing, model training, evaluation, and deployment with great performance.
- DeepLearningForNLPInPytorch - an IPython Notebook tutorial on deep learning for natural language processing, including structure prediction.
- allennlp - An open-source NLP research library, built on PyTorch.
- Natural Language Processing Tutorial for Deep Learning Researchers
- Oxford Deep NLP 2017 course
- awesome-nlp - A curated list of resources dedicated to Natural Language Processing (NLP)
- German-NLP Datasets
- Scrapy - a fast high-level web crawling & scraping framework for Python.
- Resources for redacting personally identifiable information - resources for programmatically redacting personally identifiable information
- Simple ML baselines - Jupyter Notebooks - simple ML baselines
- Huggingface - Transformer - Huggingface - Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX; various tasks
- Named Entity Recognition with Electra - Named Entity Recognition with Electra, Huggingface
- Text Generation with GPT-2 - Text Generation with GPT-2, Huggingface
- Natural Language Inference with RoBERTa - Natural Language Inference with RoBERTa, Huggingface
- Summarization with BART - Text Summarization with BART, Huggingface
- Data processing pipelines - data processing pipelines from Huggingface
- Tokenizers from Huggingface - Fast State-of-the-Art Tokenizers optimized for Research and Production
- text-generation-inference from Huggingface - Large Language Model Text Generation Inference. Text Generation Inference (TGI) is a toolkit for deploying and serving Large Language Models (LLMs). TGI enables high-performance text generation for the most popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, GPT-NeoX, and more.
- Open-source AI cookbook - Huggingface - Open-source AI cookbook (Fine_tuning_Code_LLM_on_single_GPU.ipynb, etc.)
- Styleformer - A neural language style transfer framework to transfer text smoothly between styles.
- conv-emotion - Implementation of different architectures for emotion recognition in conversations.
- Evaluate from Huggingface - Evaluate is a library that makes evaluating and comparing models and reporting their performance easier and more standardized. Implementations of dozens of popular metrics: the existing metrics cover a variety of tasks spanning from NLP to Computer Vision
- NMT Evaluation framework - A useful framework to evaluate and compare different Machine Translation engines with each other on a variety of datasets.
- FastChat - LLM chatbots evaluation platform - FastChat is an open platform for training, serving, and evaluating large language model based chatbots.
- ParlAI - a framework for training and evaluating AI models on a variety of openly available dialogue datasets.
- AutoGluon - if you prefer more control over the forecasting model exploration, training, and evaluation processes.
- tune from Huggingface - A benchmark for comparing Transformer-based models.
- Activity detection - Real-Time Spatio-Temporally Localized Activity Detection by Tracking Body Keypoints
- Dance transfer - acquire pose estimates from a participant, train a pix2pix model, transfer source dance video, and generate a dance gif; Motion transfer booth for a 1 hour everybody dance now video generation using EdgeTPU and Tensorflow 2.0
- Video embeddings and similarity - Training CNN model to generate image embeddings
- Deep Fakes Detection - (2019) Repository to detect deepfakes, an opensource project as part of AI Geeks effort.
- Diffusers from Huggingface - Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch
- Speech Cognitive Service - A Jupyter Notebook that details how to use Azure's Speech Cognitive Service to Translate speech
- Audio-Speech Tutorial, 2022 - an introduction on the topic of audio and speech processing - from basics to applications (approx. 2022)
- espnet - End-to-End Speech Processing Toolkit
- TTS - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
- Speech-to-text benchmark - speech-to-text benchmarking framework
- Speech-to-text - with Whisper and Python, March 2023
- Multilingual Text-to-Speech - Tomáš Nekvinda and Ondřej Dušek, One Model, Many Languages: Meta-Learning for Multilingual Text-to-Speech, 2020, Proc. Interspeech 2020
- Unified Speech Tokenizer for Speech Language Models - SpeechTokenizer; SpeechTokenizer: Unified Speech Tokenizer for Speech Language Models, Xin Zhang and Dong Zhang and Shimin Li and Yaqian Zhou and Xipeng Qiu, 2023
- FunASR - a Fundamental End-to-End Speech Recognition Toolkit and Open Source SOTA Pretrained Models; hopes to build a bridge between academic research and industrial applications on speech recognition. By supporting the training & finetuning of the industrial-grade speech recognition model, researchers and developers can conduct research and production of speech recognition models more conveniently, and promote the development of speech recognition ecology
- Whisper model - OpenAI Whisper
- Wenet - Production First and Production Ready End-to-End Speech Recognition Toolkit
- Distilled variant of Whisper - Distilled variant of Whisper for speech recognition. 6x faster, 50% smaller, within 1% word error rate.
- Fine-tune Whisper - Fine-Tune Whisper For Multilingual ASR with Transformers
- Applied ML - (not really up-to-date, but good) Papers & tech blogs by companies sharing their work on data science & machine learning in production.
- 500 AI projects - 500 AI Machine learning, Deep learning, Computer vision, NLP Projects with code
- Parameter-Efficient Fine-Tuning from Huggingface - PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
- Huggingface notebooks for various(!) tasks - Notebooks using the Hugging Face libraries
- Huggingface educational resources - Educational materials
- Huggingface: notifications - Knock Knock: Get notified when your training ends with only two additional lines of code
- Huggingface: No-code training and deployment of state-of-the-art machine learning models - AutoTrain is a no-code tool for training state-of-the-art models for Natural Language Processing (NLP) tasks, for Computer Vision (CV) tasks, for Speech tasks, and even for Tabular tasks.
- Huggingface: Transformer Tutorials - transformers-tutorials (by @nielsrogge) - Tutorials for applying multiple models on real-world datasets.
- OpenAI - simple app - My note: a model used and several functions are already deprecated; my version above has things updated.
- Retrieval-Augmented Generation in Azure using Azure AI search - A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
- A collection of custom OpenAI WebApps
- Real time speech2text - Build real time speech2text web apps using OpenAI's Whisper
- OpenAI cookbook
- OpenAI WhatsApp Chatbot
- GPT-engineer - Specify what you want it to build, the AI asks for clarification, and then builds it.
- Prompt-engineering Guide
- PDF search app with OpenAI - an AI-app that allows you to upload a PDF and ask questions about it. It uses OpenAI's LLMs to generate a response.
- OpenAI Code Automation - Fully coded Apps by GPT-4 and ChatGPT. Power of AI coding automation and new way of developing.
- Semantic Search - Tutorial and template for a semantic search app powered by the Atlas Embedding Database, Langchain, OpenAI and FastAPI
- OptiGuide - Large Language Models for Supply Chain Optimization
- Generative AI lessons - 12 Lessons, Get Started Building with Generative AI
- LLMOps Workshop - Learn how to build solutions with Large Language Models.
- Data Science Lessons
- AI Lessons
- unilm - Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities.
- Old Photo Restoration via Deep Latent Space Translation - Bringing Old Photo Back to Life (CVPR 2020 oral)
- NNI - An open source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression and hyper-parameter tuning.
- From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
- Seamless: Speech-to-speech translation (S2ST), Speech-to-text translation (S2TT), Text-to-speech translation (T2ST), Text-to-text translation (T2TT), Automatic speech recognition (ASR)
- Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling and other text generation tasks.
- Faiss is a library for efficient similarity search and clustering of dense vectors.
- PyTorch-BigGraph (PBG) is a distributed system for learning graph embeddings for large graphs, particularly big web interaction graphs with up to billions of entities and trillions of edges.
- Llama 2 Fine-tuning - examples to quickly get started with fine-tuning for domain adaptation and how to run inference for the fine-tuned models. For ease of use, the examples use Hugging Face converted versions of the models.
- Pearl - A Production-ready Reinforcement Learning AI Agent Library
- TorchRecipes - Recipes are a standard, well supported set of blueprints for machine learning engineers to rapidly train models using the latest research techniques without significant engineering overhead.
- fastText is a library for efficient learning of word representations and sentence classification.
- ParlAI - a framework for training and evaluating AI models on a variety of openly available dialogue datasets.
- Image Generator with Stable Diffusion on Amazon Bedrock using Streamlit - A quick demonstration of deploying a Stable Diffusion web application with containers running on Amazon ECS. The model is provided by Amazon Bedrock in this example.
- Transactional Data Lake using Apache Iceberg with AWS Glue Streaming and MSK Connect (Debezium) - Stream CDC into an Amazon S3 data lake in Apache Iceberg format with AWS Glue Streaming using Amazon MSK Serverless and MSK Connect (Debezium)
- MLOps using Amazon SageMaker and GitHub Actions - MLOps example using Amazon SageMaker Pipeline and GitHub Actions
- Near-Real Time Usage Anomaly Detection using OpenSearch - Detect AWS usage anomalies in near-real time using OpenSearch Anomaly Detection and CloudTrail for improved cost management and security
- Amazon DocumentDB (with MongoDB compatibility) samples - Code samples that demonstrate how to use Amazon DocumentDB
- Marketing Content Generator - CDK Deployment for a sample marketing portal using generative AI for content generation and distribution; Marketing Content Generation and Distribution powered by Generative AI
- Amazon SageMaker and AWS Trainium Examples - Text classification using Transformers, Pretrain BERT using Wiki Data, Pretrain/Fine tune Llama using Wiki Data.
- AWS SageMaker Local Mode - Amazon SageMaker Local Mode Examples
- End-to-end AIoT w/ SageMaker and Greengrass 2.0 on NVIDIA Jetson Nano - Hands-on lab from ML model training to model compilation to edge device model deployment on the AWS Cloud. It covers the detailed method of compiling SageMaker Neo for the target device, including cloud instance and edge device, and how to write and deploy Greengrass-v2 components from scratch.
- InsuranceLake ETL with CDK Pipeline - This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake Infrastructure project.
- Amazon Forecast - for a low-code/no-code fully managed time series AI/ML forecasting service.
- AutoGluon - if you prefer more control over the forecasting model exploration, training, and evaluation processes.
- Retrieval Augmented Generation with Streaming LLM - leverage LLMs for RAG (Retrieval Augmented Generation).
- Build generative AI agents with Amazon Bedrock, Amazon DynamoDB, Amazon Kendra, Amazon Lex, and LangChain
- Deep Learning Examples - State-of-the-Art Deep Learning scripts organized by models - easy to train and deploy with reproducible accuracy and performance on enterprise-grade infrastructure.
- NeMo: a toolkit for conversational AI