Learning materials/paths related to Graph Analytics and AI4DB.
Thanks to the following people for organizing and guiding this project
- graph-analytics-starter-pack
- Main Contributor
- Table-of-Contents
- Introduction
- Courses
- 1: Cohesive Subgraph Discovery
- 2: Generalized Anomaly Detection
- 3: AIGC-LLM
- 4: Differential Privacy
- 5: Graph Analytics on GPUs
- 6: Graph Similarity Computation
- 7: Subgraph Matching and Counting
- 8: Cardinality Estimation
- 9: Graph for DB and tabular data
- 10: Vector Database
- 12: GNN-based recommendation system
- 13: Others
Graph Analytics and AI for Database are two important branches in today's data analysis and artificial intelligence fields. Graph Analytics helps people better understand and utilize data by analyzing graph data to reveal patterns and relationships behind the data. AI4DB uses machine learning and artificial intelligence techniques to process and manage large-scale databases, solve NP-hard graph-related problems, and improve the efficiency and accuracy of data processing.
In the fields of Graph Analytics and AI for Database, we generally focus on academic papers from the following conferences:
Category | Conference | Link | Comment |
---|---|---|---|
Database | SIGMOD | DBLP, Official website | Pioneering conference in Database |
Database | VLDB | VLDB | |
Database | ICDE | ICDE | |
AI/ML/DL | ICML | ICML | |
AI/ML/DL | ICLR | ICLR | |
AI/ML/DL | NeurIPS | NeurIPS | |
Data Mining | KDD | KDD | |
GNNs are commonly implemented in PyTorch. PyTorch Geometric provides a series of tutorials: https://pytorch-geometric.readthedocs.io/en/latest/
If you're not familiar enough with deep learning, you can check out (the following are for reference, you don't need to look at all of them):
- This book: Deep Learning https://github.com/exacity/deeplearningbook-chinese
- This course: Search for "李宏毅 机器学习" (Hung-yi Lee Machine Learning) on Bilibili
- Course + Book: Hands-on Deep Learning
Other potentially valuable Stanford University courses: cs224n (NLP), cs224w (graph), cs229 (ML), cs231n (CV), cs285 (RL)
- Stanford CS224W Machine Learning with Graphs: Course Website
- Stanford CS224W Machine Learning with Graphs: Course Video
- UNSW COMP9312 Data Analytics for Graphs
- Stanford CS520 Knowledge Graphs (2021)
- Stanford CS246 Big Data Mining (2019)
- Stanford Course Explore
Week | Content | Reading List | Material |
---|---|---|---|
1 | Node Embedding | 1: DeepWalk: Online Learning of Social Representations 2: node2vec: Scalable Feature Learning for Networks | Node representation is one of the most fundamental problems in graph learning. You can refer to the content of CS224W 3rd and COMP9312 week 6. For traditional matrix factorization methods, you can refer to: matrix factorization |
2 | Graph Neural Networks | 1: Semi-Supervised Classification with Graph Convolutional Networks 2: Graph Attention Networks | The structure of the model is the core of learning (CS224W 4th). The reading list provides two classic models: GCN and GAT. Regarding the structure of each layer in neural networks, you can refer to the tutorial: Learning basic |
3 | GNN Augmentation and Training | 1: RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space 2: Hyper-Path-Based Representation Learning for Hyper-Networks | For an introduction to the entire GNN process, you can refer to CS224W 6th. The reading list provides other representation learning methods and their extensions on knowledge graphs/hypergraphs. For how to do embedding, you can refer to the tutorial: Node embedding |
4 | Theory of Graph Neural Networks | 1: Anomaly Detection on Attributed Networks via Contrastive Self-Supervised Learning 2: Sub-graph Contrast for Scalable Self-Supervised Graph Representation Learning | For analysis of neural networks, you can refer to CS224W 7th. The reading list provides current mainstream self-supervised embedding methods and network frameworks. For how to use learning to perform basic tasks, refer to the tutorial: Downstream Application: node classification, link prediction and graph classification |
5 | Label Propagation on Graphs | 1: GLSearch: Maximum Common Subgraph Detection via Learning to Search 2: Computing Graph Edit Distance via Neural Graph Matching | Refer to CS224W 8th. The reading list provides two methods for solving the graph similarity computation (NP-hard) problem. To implement the GCN structure yourself, you can refer to the tutorial: Graph Convolutional Network |
6 | Subgraph Matching and Counting | 1: A Learned Sketch for Subgraph Counting 2: Neural Subgraph Counting with Wasserstein Estimator | Refer to CS224W 12th. The reading list provides two papers on the subgraph counting (NP-hard) problem (from SIGMOD 2021 and SIGMOD 2022). The repository code for the SubCon paper is at Contrastive Learning on Graph |
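The course table suggests implementing the GCN structure yourself. As a rough starting point, here is a minimal sketch of a single GCN propagation step, `ReLU(D^-1/2 (A + I) D^-1/2 X W)`, in plain Python. It assumes a dense adjacency matrix given as nested lists and a fixed weight matrix (no training); the names `matmul` and `gcn_layer` are illustrative and are not taken from the linked tutorials or from PyTorch Geometric.

```python
import math


def matmul(a, b):
    """Multiply two dense matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def gcn_layer(adj, features, weight):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Degrees of the self-looped graph (always >= 1, so no division by zero).
    deg = [sum(row) for row in a_hat]
    # Symmetric normalisation of the adjacency matrix.
    norm = [
        [a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    # Aggregate neighbour features, then apply the linear transform and ReLU.
    out = matmul(matmul(norm, features), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Two connected nodes with one feature each and an identity weight.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0], [3.0]]
weight = [[1.0]]
print(gcn_layer(adj, features, weight))  # each node ends up with the average, 2.0
```

In practice you would use a library layer with sparse operations and learned weights; the sketch only shows what one layer computes.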
Cohesive Subgraph Discovery is the problem of finding highly cohesive subgraphs in graph data. The survey A Survey on Machine Learning Solutions for Graph Pattern Extraction (pay close attention to Ch. 2.6, community search) clearly explains several baseline papers for our work.
For more details, please refer to 1: Cohesive Subgraph Discovery
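To make "cohesive subgraph" concrete, here is a small sketch of one classic cohesiveness model, the k-core (the largest subgraph in which every node has at least k neighbours inside the subgraph). The k-core is chosen here only as an illustration; it is an assumption, not necessarily the model used in the linked materials. The graph is assumed to be undirected and given as a dict mapping each node to a set of neighbours, and the function name `k_core` is hypothetical.

```python
def k_core(adjacency, k):
    """Return the node set of the k-core by repeatedly peeling nodes of degree < k."""
    degree = {u: len(nbrs) for u, nbrs in adjacency.items()}
    alive = set(adjacency)
    queue = [u for u in alive if degree[u] < k]
    while queue:
        u = queue.pop()
        if u not in alive:
            continue  # already peeled via an earlier queue entry
        alive.remove(u)
        # Removing u lowers the degree of each remaining neighbour.
        for v in adjacency[u]:
            if v in alive:
                degree[v] -= 1
                if degree[v] < k:
                    queue.append(v)
    return alive


# A triangle (a, b, c) with a pendant node d attached to c.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(sorted(k_core(graph, 2)))  # ['a', 'b', 'c']
print(sorted(k_core(graph, 3)))  # []
```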
Generalized Anomaly Detection covers many similar problems, such as anomaly detection, novelty detection, open set recognition, out-of-distribution detection, and outlier detection.
- Generalized Out-of-Distribution Detection: A Survey [paper]
- DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection [paper] [project page]
- ADBench: Anomaly Detection Benchmark [paper] [project page]
Conference | Paper | Material | Abstract | Highlights |
---|---|---|---|---|
ICLR2022 | Anomaly detection for tabular data with internal contrastive learning. | [code] | KNN and Contrastive Learning for Tabular data | -- |
ICDE2023 | Unsupervised Graph Outlier Detection: Problem Revisit, New Insight, and Superior Method. | --- | --- | --- |
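As a point of reference for the tabular setting, here is a minimal sketch of a plain k-nearest-neighbour distance score for outlier detection. This is a generic baseline, not the method of any paper listed above. It assumes numeric rows of equal length and Euclidean distance, and the function name `knn_anomaly_scores` is illustrative.

```python
import math


def knn_anomaly_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbour (larger = more anomalous)."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in ascending order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Four points on a unit square plus one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_anomaly_scores(points, k=2)
print(scores.index(max(scores)))  # 4, the far-away point
```

The score needs at least k + 1 points; with fewer, `dists[k - 1]` raises an `IndexError`.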
- Introductory survey papers
- Introductory paper list
- Introductory code demos
- TKDE, community-aware anti-money laundering: Anti-Money Laundering by Group-Aware Deep Graph Learning
- AAAI, risk-aware anti-fraud: Semi-Supervised Credit Card Fraud Detection via Attribute-Driven Graph Representation
- TKDE, spatial-aware anti-fraud: Graph Neural Network for Fraud Detection via Spatial-temporal Attention
Please refer to 3: AIGC-LLM
Please refer to 4: Differential Privacy and to DP & ML: https://github.com/JeffffffFu/Awesome-Differential-Privacy-and-Meachine-Learning
Please refer to 5: Graph Analytics on GPUs
Graphs are a valuable tool for representing connections between entities, while tabular or relational data is a convenient and user-friendly way to store information. Researchers frequently employ graphs to depict interdependencies among records, attributes, elements, and schemas within and across tables. It is worth noting that in contemporary usage, the term "tabular deep learning" is often used to refer to the application of deep learning techniques to relational data organized as records, while the term "database" is often reserved to refer specifically to the software and infrastructure used to manage and manipulate such data.
Conference | Paper | Material | Abstract | Highlights |
---|---|---|---|---|
PODS2023 | Databases as Graphs: Predictive Queries for Declarative Machine Learning. | --- | Using hypergraph to model the relationship behind the records | -- |
--- | Enabling tabular deep learning when d ≫ n with an auxiliary knowledge graph | --- | Capture the relation between two attributes by KG | -- |
CIKM22 | Local Contrastive Feature learning for Tabular Data | --- | Capture the relation between two attributes by maximum spanning tree | -- |
NIPS22 | Learning enhanced representations for tabular data via neighborhood propagation | --- | --- | -- |
DLP-KDD 2021 | TabGNN: Multiplex Graph Neural Network for Tabular Data Prediction | --- | --- | -- |
Similarity search at a very large scale.
Conference | Paper | Material | Abstract | Highlights |
---|---|---|---|---|
SIGMOD2023 | Near-Duplicate Sequence Search at Scale for Large Language Model Memorization. | --- | --- | -- |
Talk 1: Vector Database for Large Language Models in Production (Sam Partee)
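To illustrate what the similarity search at the heart of a vector database computes, here is a brute-force top-k search by cosine similarity. It is only a sketch: real systems use approximate indexes to reach very large scale, and the names `cosine_similarity` and `top_k` are illustrative. Vectors are assumed to be non-zero and of equal length.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Return the keys of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]


store = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(top_k([1.0, 0.1], store, k=2))  # ['a', 'c']
```

The exhaustive scan costs one similarity computation per stored vector, which is the cost that approximate nearest-neighbour indexes are designed to avoid.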
Recommender systems encompass numerous sub-domains, so some areas will inevitably be missed. Here we introduce only some of the sub-domains that apply GNNs.
Collaborative Filtering (CF) is one of the most classic and widely used methods in recommender systems. Its core idea is to find association patterns between similar users or items based on historical user behavior data. It mainly falls into two categories: User-based CF makes recommendations by finding similar user groups, while Item-based CF makes recommendations through similarity relationships between items.
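As a concrete illustration of the item-based variant, the sketch below computes cosine similarities between items from co-rating users and scores unseen items for a user. It is a toy example under stated assumptions: ratings are given as a nested dict `user -> {item: rating}`, and the names `item_similarities` and `recommend` are hypothetical, not taken from any library.

```python
import math
from collections import defaultdict


def item_similarities(ratings):
    """Cosine similarity between every pair of distinct items, based on user ratings."""
    # Invert the data: item -> {user: rating}.
    item_vectors = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            item_vectors[item][user] = rating
    sims = {}
    for i, vi in item_vectors.items():
        for j, vj in item_vectors.items():
            if i == j:
                continue
            # Only users who rated both items contribute to the dot product.
            dot = sum(vi[u] * vj[u] for u in set(vi) & set(vj))
            norm = math.sqrt(sum(r * r for r in vi.values())) * math.sqrt(
                sum(r * r for r in vj.values())
            )
            sims[(i, j)] = dot / norm if norm else 0.0
    return sims


def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of seen items."""
    sims = item_similarities(ratings)
    seen = ratings[user]
    all_items = {item for items in ratings.values() for item in items}
    scores = {
        cand: sum(sims[(cand, item)] * r for item, r in seen.items())
        for cand in all_items - set(seen)
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


ratings = {
    "alice": {"A": 5, "B": 4},
    "bob": {"A": 5, "B": 5, "C": 4},
    "carol": {"B": 1, "D": 5},
}
print(recommend(ratings, "alice"))  # ['C']
```

Item C wins for alice because it is co-rated with both items she liked, while D shares only one low-rated item with her history.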
Here are some cornerstone papers of GNN-based CF. Some of them, such as LightGCN, still serve as baselines or even as benchmark components in many state-of-the-art studies.
Paper | Conference | Year | Highlights |
---|---|---|---|
Graph Convolutional Neural Networks for Web-Scale Recommender Systems | KDD | 2018 | First paper of GNN-based Collaborative Filtering, PinSAGE |
Neural Graph Collaborative Filtering | SIGIR | 2019 | Refinement of PinSAGE, NGCF |
LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation | SIGIR | 2020 | The most popular benchmark model, a refined version of NGCF, LightGCN |
Self-supervised Graph Learning for Recommendation | SIGIR | 2021 | One of the first papers in GNN-based CF utilizing self-supervised method, SGL |
Session-based Recommender Systems focus on predicting the next item within an ongoing session, where a session is typically a short-term interaction sequence (such as a browsing session) with no user identification required. The key problem is: Given a sequence of user interactions
Sequence-based Recommender Systems consider the complete historical sequence of user interactions across multiple sessions over time. The problem is defined as: Given a user u and their entire historical interaction sequence
For detailed information about session-based and sequence-based Rec.Sys., please refer to:
Here are some cornerstone papers of GNN-based session-based/sequence-based Rec.Sys.
Paper | Conference | Year | Highlights |
---|---|---|---|
Session-Based Recommendation with Graph Neural Networks | AAAI | 2019 | SR-GNN, still a popular baseline choice. |
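That said, if you want to contribute to this area, think twice and ask yourself first: what advantage does my approach have over sequence models, especially Transformer-based ones?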
Substitute recommendation aims to suggest interchangeable items for a given query item. Most traditional methods infer substitute relationships from item similarities and extract semantic information from consumer reviews for this purpose. More recently, because substitutes and complements are semantically connected (e.g., a substitute's complement is often also a complement of the original item), and with the development of GNNs, mainstream models primarily learn substitute relationships from networks of co-view and co-purchase relationships. They employ various methods to explore the latent relationships between different item interactions.
Complement recommendation aims to suggest complementary items (such as a mouse for a computer or a game controller for a PS5) for a given query item. Due to the semantic complexity of the task, most methods are GNN-based, just as in substitute recommendation.
Here are some cornerstone papers of GNN-based substitute & complement Rec.Sys.
Paper | Conference | Year | Highlights |
---|---|---|---|
Inferring Networks of Substitutable and Complementary Products | KDD | 2015 | Most popular dataset: Amazon datasets |
Measuring the Value of Recommendation Links on Product Demand | ISR | 2019 | A business paper (ISR, a UTD24 journal), if you need one |
Decoupled Graph Convolution Network for Inferring Substitutable and Complementary Items | CIKM | 2020 | One of the first GNN-based Sub. & Com. Rec.Sys., predicting substitute and complement relationships simultaneously |
Heterogeneous graph neural networks with neighbor-SIM attention mechanism for substitute product recommendation | AAAI | 2021 | |
Decoupled Hyperbolic Graph Attention Network for Modeling Substitutable and Complementary Item Relationships | CIKM | 2022 | |
Enhanced Multi-Relationships Integration Graph Convolutional Network for Inferring Substitutable and Complementary Items | AAAI | 2023 | |
Please refer to Awesome-Cold-Start-Recommendation
Missing data imputation focuses on methods to fill missing values in datasets, improving data completeness quality. It includes techniques based on statistical methods, regression models, and graph models.
Category | Title | Source | Year |
---|---|---|---|
Vision and Survey Paper | An Experimental Survey of Missing Data Imputation Algorithms | TKDE | 2023 |
Vision and Survey Paper | Can Foundation Models Wrangle Your Data? | VLDB (Vision) | 2022 |
Tabular Data Imputation | Missing Data Imputation with Uncertainty-Driven Network | SIGMOD | 2024 |
Tabular Data Imputation | ReMasker: Imputing Tabular Data with Masked Autoencoding | ICLR | 2024 |
Tabular Data Imputation | Transformed distribution matching for missing value imputation | ICML | 2023 |
Tabular Data Imputation | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms | NeurIPS | 2021 |
Tabular Data Imputation | GAIN: Missing Data Imputation using Generative Adversarial Nets | ICML | 2018 |
Tabular Data Imputation | Multiple Imputation by Chained Equations | | 2011 |
Graph Data Imputation | Data Imputation from the Perspective of Graph Dirichlet Energy | CIKM | 2024 |
Graph Data Imputation | Handling Missing Data via Max-Entropy Regularized Graph Autoencoder | AAAI | 2023 |
Graph Data Imputation | Accurate Node Feature Estimation with Structured Variational Graph Autoencoder | KDD | 2022 |
Graph Data Imputation | Learning on Attribute-Missing Graphs | TPAMI | 2021 |
Graph Data Imputation | Deconvolutional Networks on Graph Data | NeurIPS | 2021 |
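For orientation, the simplest statistical baseline that the papers above improve on is column-mean imputation. The sketch below assumes a table given as a list of equal-length numeric rows with `None` marking missing values; the function name `mean_impute` is illustrative.

```python
def mean_impute(rows):
    """Replace each missing value (None) with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        # A column with no observed values cannot be imputed and stays None.
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[c] if v is None else v for c, v in enumerate(row)] for row in rows]


table = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(mean_impute(table))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

Mean imputation ignores relationships between columns and between records, which is exactly what the regression-based and graph-based methods in the table try to exploit.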