graph-analytics-starter-pack

Learning materials/paths related to Graph Analytics and AI4DB.

Main Contributor

Thanks to the following people for organizing and guiding this project

Introduction

Graph Analytics and AI for Database (AI4DB) are two important branches of today's data analysis and artificial intelligence research. Graph Analytics analyzes graph data to reveal the patterns and relationships behind it, helping people better understand and use that data. AI4DB applies machine learning and artificial intelligence techniques to process and manage large-scale databases, tackle NP-hard graph-related problems, and improve the efficiency and accuracy of data processing.

In Graph Analytics and AI4DB, we generally follow academic papers from these conferences:

Category	Conference	Link	Comment
Database	SIGMOD	DBLP, Official website	Pioneering conference in Database
	VLDB	VLDB
	ICDE	ICDE
AI/ML/DL	ICML	ICML
	ICLR	ICLR
	NeurIPS	NeurIPS
Data Mining	KDD	KDD

[A Comprehensive Survey on Graph Neural Networks] https://ieeexplore.ieee.org/document/9046288

GNNs are commonly implemented with PyTorch. The PyTorch Geometric documentation provides a series of tutorials: https://pytorch-geometric.readthedocs.io/en/latest/

If you are not yet comfortable with deep learning, the following resources may help (they are for reference only; you do not need to go through all of them):

Book: Deep Learning https://github.com/exacity/deeplearningbook-chinese
Course: search for "李宏毅机器学习" (Hung-yi Lee Machine Learning) on Bilibili
Course + Book: Hands-on Deep Learning

Other potentially valuable Stanford University courses: cs224n (NLP), cs224w (graph), cs229 (ML), cs231n (CV), cs285 (RL)

Courses

Main Courses

Reference Courses

Key Chapters

Week	Content	Reading List	Material
1	Node Embedding	1: DeepWalk: Online Learning of Social Representations 2: node2vec: Scalable Feature Learning for Networks	Node representation is one of the most fundamental problems in graph learning. You can refer to the content of CS224W 3rd and COMP9312 week 6. For traditional matrix factorization methods, you can refer to: matrix factorization
2	Graph Neural Networks	1: Semi-Supervised Classification with Graph Convolutional Networks 2: Graph Attention Networks	The structure of the model is the core of learning (CS224W 4th). The reading list provides two classic models: GCN and GAT. Regarding the structure of each layer in neural networks, you can refer to the tutorial: Learning basic
3	GNN Augmentation and Training	1: RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space 2: Hyper-Path-Based Representation Learning for Hyper-Networks.	For an introduction to the entire GNN process, you can refer to CS224W 6th. The reading list provides other representation learning methods and their extensions on knowledge graphs/hypergraphs. For how to do embedding, you can refer to the tutorial: Node embedding
4	Theory of Graph Neural Networks	1: Anomaly Detection on Attributed Networks via Contrastive Self-Supervised Learning 2: Sub-graph Contrast for Scalable Self-Supervised Graph Representation Learning	For analysis of neural networks, you can refer to CS224W 7th. The reading list provides current mainstream self-supervised embedding methods and network frameworks. For how to use learning to perform basic tasks, refer to the tutorial: Downstream Application: node classification, link prediction and graph classification
5	Label Propagation on Graphs	1: GLSearch: Maximum Common Subgraph Detection via Learning to Search 2: Computing Graph Edit Distance via Neural Graph Matching	Refer to CS224W 8th. The reading list provides two methods for solving the graph similarity computation (NP-hard) problem. To implement the GCN structure yourself, you can refer to the tutorial: Graph Convolutional Network
6	Subgraph Matching and Counting	1: A Learned Sketch for Subgraph Counting 2: Neural Subgraph Counting with Wasserstein Estimator	Refer to CS224W 12th. The reading list provides two papers on the subgraph counting (NP-hard) problem (from SIGMOD 2021 and SIGMOD 2022). The repository code for the SubCon paper is at Contrastive Learning on Graph
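
As a companion to the GCN reading in week 2 and the "implement the GCN structure yourself" tutorial in week 5, here is a minimal sketch of one GCN propagation step, $H' = \mathrm{ReLU}(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W)$, in plain Python. The function name `gcn_layer`, the toy graph, and the weights are illustrative assumptions and are not taken from the tutorials; for real work you would use PyTorch Geometric instead.

```python
import math

def gcn_layer(adj, features, weight):
    """One GCN propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W).

    adj:      n x n 0/1 adjacency matrix of an undirected graph
    features: n x f node feature matrix H
    weight:   f x o weight matrix W
    """
    n = len(adj)
    # Add self-loops: A_hat = A + I
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Node degrees in A_hat
    deg = [sum(row) for row in a_hat]
    # Symmetric normalisation of A_hat
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbour features: norm @ H
    f = len(features[0])
    agg = [[sum(norm[i][k] * features[k][c] for k in range(n)) for c in range(f)]
           for i in range(n)]
    # Linear transform followed by ReLU
    o = len(weight[0])
    return [[max(0.0, sum(agg[i][c] * weight[c][d] for c in range(f))) for d in range(o)]
            for i in range(n)]

# Toy path graph 0 - 1 - 2 with a single feature that is non-zero only on node 0
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
features = [[1.0], [0.0], [0.0]]
weight = [[1.0]]
out = gcn_layer(adj, features, weight)
print(out)
```

After one layer the feature of node 0 has spread to its neighbour node 1 but not yet to node 2, which is why stacking layers widens the receptive field.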

1: Cohesive Subgraph Discovery

Cohesive Subgraph Discovery is the problem of finding highly cohesive subgraphs in graph data. The survey A Survey on Machine Learning Solutions for Graph Pattern Extraction (pay close attention to Ch. 2.6, community search) clearly explains several baseline papers for our work.

For more details, please refer to 1: Cohesive Subgraph Discovery
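
To make "cohesive" concrete, one classic model is the k-core: the maximal subgraph in which every vertex has at least k neighbours inside the subgraph. The k-core is picked here only as an illustration (this section does not single out a model), and `k_core` and the toy graph are hypothetical names. A minimal peeling sketch:

```python
def k_core(adj, k):
    """Return the vertex set of the k-core of an undirected graph.

    adj: dict mapping each vertex to the set of its neighbours.
    Vertices whose remaining degree drops below k are peeled repeatedly.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    alive = set(adj)
    queue = [v for v in alive if degree[v] < k]
    while queue:
        v = queue.pop()
        if v not in alive:
            continue  # already peeled via an earlier queue entry
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                degree[u] -= 1
                if degree[u] < k:
                    queue.append(u)
    return alive

# A triangle (a, b, c) with a pendant vertex d attached to c
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(sorted(k_core(graph, 2)))
```

With k = 2 the pendant vertex is peeled and the triangle remains; with k = 3 nothing survives.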

2: Generalized Anomaly Detection

Generalized Anomaly Detection covers many closely related problems, such as anomaly detection, novelty detection, open set recognition, out-of-distribution detection, and outlier detection.
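
What these problems share is the need to score how far a sample lies from the "normal" data. As a rough illustration, the sketch below scores each point by its mean distance to its k nearest neighbours. This is a generic baseline chosen for brevity, not the method of any paper listed in this section, and `knn_outlier_scores` is a hypothetical name.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four points on a unit square plus one point far away
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (8.0, 8.0)]
scores = knn_outlier_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])
```

A higher score means the point sits in a sparser region; choosing k and a decision threshold is left to the application.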

2.1 Survey of anomaly detection and Benchmarks

Generalized Out-of-Distribution Detection: A Survey [paper]
DGraph: A Large-Scale Financial Dataset for Graph Anomaly Detection [paper] [project page]
ADBench: Anomaly Detection Benchmark [paper] [project page]

2.2 Anomaly Detection

Conference	Paper	Material	Abstract	Highlights
ICLR2022	Anomaly detection for tabular data with internal contrastive learning.	[code]	KNN and Contrastive Learning for Tabular data	--
ICDE2023	Unsupervised Graph Outlier Detection: Problem Revisit, New Insight, and Superior Method.	---	---	---

2.3 Fraud Detection

3: AIGC-LLM

Please refer to 3: AIGC-LLM

4: Differential Privacy

Please refer to 4: Differential Privacy

See also DP & ML: https://github.com/JeffffffFu/Awesome-Differential-Privacy-and-Meachine-Learning

5: Graph Analytics on GPUs

Please refer to 5: Graph Analytics on GPUs

6: Graph Similarity Computation

7: Subgraph Matching and Counting

8: Cardinality Estimation

9: Graph for DB and tabular data

Graphs are a valuable tool for representing connections between entities, while tabular or relational data is a convenient and user-friendly way to store information. Researchers frequently employ graphs to depict interdependencies among records, attributes, elements, and schemas within and across tables. It is worth noting that in contemporary usage, the term "tabular deep learning" is often used to refer to the application of deep learning techniques to relational data organized as records, while the term "database" is often reserved to refer specifically to the software and infrastructure used to manage and manipulate such data.

Conference	Paper	Material	Abstract	Highlights
PODS2023	Databases as Graphs: Predictive Queries for Declarative Machine Learning.	---	Uses a hypergraph to model the relationships behind the records	--
---	Enabling tabular deep learning when d ≫ n with an auxiliary knowledge graph	---	Captures the relation between two attributes with a knowledge graph (KG)	--
CIKM2022	Local Contrastive Feature learning for Tabular Data	---	Captures the relation between two attributes with a maximum spanning tree	--
NeurIPS2022	Learning enhanced representations for tabular data via neighborhood propagation	---	---	--
DLP-KDD2021	TabGNN: Multiplex Graph Neural Network for Tabular Data Prediction	---	---	--
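
One simple way to picture "graphs over records" is to link two records whenever they share an attribute value. The sketch below does exactly that; each group of records sharing a value can also be read as a hyperedge. This is an illustrative construction under my own assumptions (the function `records_to_graph` and the toy columns are hypothetical) and not the construction used by any specific paper in the table.

```python
from collections import defaultdict
from itertools import combinations

def records_to_graph(records, columns):
    """Link two records when they share a value in any of the given columns.

    Returns a set of edges (i, j) with i < j, where i and j are record indices.
    """
    groups = defaultdict(list)  # (column, value) -> indices of records with that value
    for idx, record in enumerate(records):
        for col in columns:
            groups[(col, record[col])].append(idx)
    edges = set()
    for members in groups.values():
        edges.update(combinations(members, 2))
    return edges

records = [
    {"city": "Sydney", "plan": "basic"},
    {"city": "Sydney", "plan": "pro"},
    {"city": "Perth", "plan": "pro"},
]
edges = records_to_graph(records, ["city", "plan"])
print(sorted(edges))
```

The resulting edge list can be fed to any GNN library as the graph structure, with the original attributes kept as node features.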

10: Vector Database

Similarity search at a very large scale.

Conference	Paper	Material	Abstract	Highlights
SIGMOD2023	Near-Duplicate Sequence Search at Scale for Large Language Model Memorization.	---	---	--

Talk 1: Vector Database for Large Language Models in Production (Sam Partee)
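
At its core, the similarity search mentioned at the top of this section is a nearest-neighbour query over embedding vectors. The sketch below is an exact brute-force version using cosine similarity; it only illustrates the query semantics. Real vector databases rely on approximate indexes to reach large scale, and `cosine`, `top_k`, and the stored vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_k(query, store, k=2):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(store, key=lambda key: cosine(query, store[key]), reverse=True)
    return ranked[:k]

store = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
}
result = top_k([1.0, 0.0, 0.0], store, k=2)
print(result)
```

The scan costs one similarity computation per stored vector, which is what approximate indexes are designed to avoid.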

12: GNN-based recommendation system

Recommender systems span numerous sub-domains, so some are inevitably left out. Here we introduce only some of the sub-domains that apply GNNs.

GNN-based Collaborative Filtering

Collaborative Filtering (CF) is one of the most classic and widely used methods in recommender systems. Its core idea is to find association patterns between similar users or items based on historical user behavior data. It mainly falls into two categories: User-based CF makes recommendations by finding similar user groups, while Item-based CF makes recommendations through similarity relationships between items.
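
As a minimal sketch of the item-based variant described above, the code below works on implicit feedback (sets of items each user interacted with) and scores unseen items by their cosine similarity to the items the user already has. The function names and the toy data are hypothetical, and this is classical CF without any GNN.

```python
import math

def item_users(interactions):
    """Invert user -> items into item -> set of users who interacted with it."""
    users_of = {}
    for user, items in interactions.items():
        for item in items:
            users_of.setdefault(item, set()).add(user)
    return users_of

def item_based_recommend(interactions, user, top_n=1):
    """Rank unseen items by summed cosine similarity to the user's items."""
    users_of = item_users(interactions)
    seen = interactions[user]
    scores = {}
    for cand, cand_users in users_of.items():
        if cand in seen:
            continue
        # Binary cosine similarity: |U_a & U_b| / sqrt(|U_a| * |U_b|)
        scores[cand] = sum(
            len(cand_users & users_of[item]) / math.sqrt(len(cand_users) * len(users_of[item]))
            for item in seen
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

interactions = {
    "alice": {"book", "pen"},
    "bob": {"book", "pen", "notebook"},
    "carol": {"book", "notebook"},
    "dave": {"lamp"},
}
print(item_based_recommend(interactions, "alice"))
```

User-based CF follows the same pattern with the roles of users and items swapped.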

Here are some cornerstone papers on GNN-based CF. Some of them, such as LightGCN, still serve as baselines or even as benchmark components in much SOTA research.

Paper	Conference	Year	Highlights
Graph Convolutional Neural Networks for Web-Scale Recommender Systems	KDD	2018	First paper of GNN-based Collaborative Filtering, PinSAGE
Neural Graph Collaborative Filtering	SIGIR	2019	Refinement of PinSAGE, NGCF
LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation	SIGIR	2020	The most popular benchmark model, a refinement version of NGCF, LightGCN
Self-supervised Graph Learning for Recommendation	SIGIR	2021	One of the first papers in GNN-based CF utilizing self-supervised method, SGL
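
To give a feel for why LightGCN is called a simplification, the sketch below follows the propagation rule described in the LightGCN paper: each layer is a symmetrically normalised sum over neighbours, with no weight matrix and no non-linearity, and the final embedding is the mean over all layers. Training (for example with a ranking loss) is omitted, and the function name, argument layout, and toy embeddings are my own assumptions rather than the authors' code.

```python
import math

def lightgcn_propagate(user_items, user_emb, item_emb, num_layers=2):
    """Propagate embeddings over the user-item graph, LightGCN style.

    Assumes every user and item that appears in user_items has an embedding.
    Returns (final_user_embeddings, final_item_embeddings).
    """
    item_users = {}
    for user, items in user_items.items():
        for item in items:
            item_users.setdefault(item, set()).add(user)
    u_deg = {u: len(items) for u, items in user_items.items()}
    i_deg = {i: len(users) for i, users in item_users.items()}
    dim = len(next(iter(user_emb.values())))

    def aggregate(neighbours, own_deg, nbr_emb, nbr_deg):
        # Sum neighbour embeddings scaled by 1 / sqrt(own degree * neighbour degree)
        out = [0.0] * dim
        for n in neighbours:
            coef = 1.0 / math.sqrt(own_deg * nbr_deg[n])
            out = [o + coef * x for o, x in zip(out, nbr_emb[n])]
        return out

    def layer_mean(layers, key):
        return [sum(layer[key][d] for layer in layers) / len(layers) for d in range(dim)]

    u_layers, i_layers = [user_emb], [item_emb]
    for _ in range(num_layers):
        prev_u, prev_i = u_layers[-1], i_layers[-1]
        u_layers.append({u: aggregate(user_items[u], u_deg[u], prev_i, i_deg) for u in user_items})
        i_layers.append({i: aggregate(item_users[i], i_deg[i], prev_u, u_deg) for i in item_users})
    return (
        {u: layer_mean(u_layers, u) for u in user_items},
        {i: layer_mean(i_layers, i) for i in item_users},
    )

user_items = {"u1": {"i1", "i2"}, "u2": {"i2"}}
user_emb = {"u1": [1.0, 0.0], "u2": [0.0, 1.0]}
item_emb = {"i1": [0.5, 0.5], "i2": [1.0, 1.0]}
final_users, final_items = lightgcn_propagate(user_items, user_emb, item_emb)
```

A recommendation score for a user-item pair is then the inner product of the two final embeddings.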

GNN-based session-based/sequence-based Rec.Sys.

Session-based Recommender Systems focus on predicting the next item within an ongoing session, where a session is typically a short-term interaction sequence (such as a browsing session) with no user identification required. The key problem is: given a sequence of user interactions $S = \{i_1, i_2, \cdots, i_t\}$ in the current session, predict the next item $i_{t+1}$ that the user is likely to interact with.

Sequence-based Recommender Systems consider the complete historical sequence of user interactions across multiple sessions over time. The problem is defined as: given a user $u$ and their entire historical interaction sequence $H = \{i_1, i_2, \cdots, i_n\}$ ordered by timestamp, predict the next items that the user will interact with in the future.
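
The session-based formulation can be sketched by turning sessions into a directed item-transition graph and predicting $i_{t+1}$ from the last item $i_t$. Graph-based session models start from a similar graph and then learn over it; here the learned model is deliberately replaced by a plain frequency baseline, and all names and data are hypothetical.

```python
from collections import Counter, defaultdict

def build_transition_graph(sessions):
    """Directed item graph: the weight of edge (a -> b) counts how often b directly follows a."""
    graph = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):
            graph[a][b] += 1
    return graph

def predict_next(graph, session):
    """Predict the next item as the most frequent successor of the session's last item."""
    successors = graph.get(session[-1])
    if not successors:
        return None  # the last item was never followed by anything in the training sessions
    return successors.most_common(1)[0][0]

sessions = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "mouse"],
    ["phone", "charger"],
]
graph = build_transition_graph(sessions)
print(predict_next(graph, ["laptop", "phone"]))
```

This baseline only looks at the last item; the point of GNN and sequence models is to use the whole session context instead.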

For detailed information about session-based and sequence-based Rec.Sys., please refer to:

Graph and Sequential Neural Networks in Session-based Recommendation: A Survey

Here are some cornerstone papers of GNN-based session-based/sequence-based Rec.Sys.

Paper	Conference	Year	Highlights
Session-Based Recommendation with Graph Neural Networks	AAAI	2018	SR-GNN, still a popular baseline choice.

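That said, if you want to contribute to this area, think twice and ask yourself: what advantages does my approach have over sequence models, especially Transformer-based ones?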

GNN-based substitute & complement Rec.Sys.

Substitute recommendation aims to suggest interchangeable items for a given query item. Most traditional methods infer substitute relationships from item similarity, extracting semantic information from consumer reviews for this purpose. More recently, because substitutes and complements are semantically connected (e.g., a substitute's complement is often also a complement of the original item) and GNNs have matured, mainstream models primarily learn substitute relationships from networks of co-view and co-purchase relationships, using various methods to explore the latent relationships between different types of item interactions.

Complement recommendation aims to suggest complementary items (such as a mouse for a computer, or a game controller for a PS5) for a given query item. Because of the semantic complexity involved, most approaches are GNN-based, just as in substitute recommendation.
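
The co-view and co-purchase networks mentioned above can be built by counting how often two items appear together in the same browsing session or purchase basket. The sketch below shows that counting step only; treating co-view edges as a substitute signal and co-purchase edges as a complement signal is a common heuristic rather than a rule, and the function name and data are hypothetical.

```python
from collections import Counter
from itertools import combinations

def co_occurrence_edges(baskets):
    """Count how often each unordered item pair appears in the same basket.

    Pairs are stored as alphabetically sorted tuples so (a, b) and (b, a) coincide.
    """
    edges = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            edges[(a, b)] += 1
    return edges

co_purchase = co_occurrence_edges([["laptop", "mouse"], ["laptop", "mouse", "bag"]])
co_view = co_occurrence_edges([["laptop", "tablet"], ["laptop", "tablet"], ["mouse", "trackpad"]])
print(co_purchase.most_common(1), co_view.most_common(1))
```

The two weighted edge sets form the multi-relation item graph that GNN-based models then learn over.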

Here are some cornerstone papers of GNN-based substitute & complement Rec.Sys.

Paper	Conference	Year	Highlights
Inferring Networks of Substitutable and Complementary Products	KDD	2015	Most popular dataset: Amazon datasets
Measuring the Value of Recommendation Links on Product Demand	ISR	2019	A business paper (ISR, UTD24), if you need one
Decoupled Graph Convolution Network for Inferring Substitutable and Complementary Items	CIKM	2020	One of the first GNN-based Sub. & Com. Rec.Sys., predicting substitute and complement relationships simultaneously
Heterogeneous graph neural networks with neighbor-SIM attention mechanism for substitute product recommendation	AAAI	2021
Decoupled Hyperbolic Graph Attention Network for Modeling Substitutable and Complementary Item Relationships	CIKM	2022
Enhanced Multi-Relationships Integration Graph Convolutional Network for Inferring Substitutable and Complementary Items	AAAI	2023

GNN-based cold-start Rec.Sys.

Please refer to Awesome-Cold-Start-Recommendation

13: Missing Data Imputation

Missing data imputation studies methods for filling in missing values in datasets, improving data completeness and quality. Techniques include statistical methods, regression models, and graph-based models.
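
The simplest statistical technique is column-mean imputation, sketched below as a point of reference for the learned methods in the table. The function name and the toy table are hypothetical, and missing values are assumed to be encoded as `None`.

```python
def mean_impute(rows):
    """Replace None in each column with the mean of that column's observed values.

    A column with no observed values is left as None.
    """
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [row[c] for row in rows if row[c] is not None]
        means.append(sum(observed) / len(observed) if observed else None)
    return [[means[c] if row[c] is None else row[c] for c in range(n_cols)] for row in rows]

table = [[1.0, 10.0], [None, 20.0], [3.0, None]]
imputed = mean_impute(table)
print(imputed)
```

Mean imputation ignores relationships between columns and between records, which is the gap that regression-based and graph-based methods aim to close.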

Category	Title	Source	Year
Vision and Survey Paper	An Experimental Survey of Missing Data Imputation Algorithms	IEEE TKDE	2023
	Can Foundation Models Wrangle Your Data?	VLDB (Vision)	2022
Tabular Data Imputation	Missing Data Imputation with Uncertainty-Driven Network	SIGMOD	2024
	ReMasker: Imputing Tabular Data with Masked Autoencoding	ICLR	2024
	Transformed distribution matching for missing value imputation	ICML	2023
	MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms	NeurIPS	2021
	GAIN: Missing Data Imputation using Generative Adversarial Nets	ICML	2018
	Multiple Imputation by Chained Equations		2011
Graph Data Imputation	Data Imputation from the Perspective of Graph Dirichlet Energy	CIKM	2024
	Handling Missing Data via Max-Entropy Regularized Graph Autoencoder	AAAI	2023
	Accurate Node Feature Estimation with Structured Variational Graph Autoencoder	KDD	2022
	Learning on Attribute-Missing Graphs	TPAMI	2021
	Deconvolutional Networks on Graph Data	NeurIPS	2021
