guaiyoui/graph-analytics-starter-pack

graph-analytics-starter-pack

Awesome License: MIT Made With Love

Learning materials/paths related to Graph Analytics and AI4DB.

Main Contributor

Thanks to the following people for organizing and guiding this project


Table-of-Contents

Introduction

Graph Analytics and AI for Database (AI4DB) are two important branches of today's data analysis and artificial intelligence fields. Graph Analytics analyzes graph data to reveal the patterns and relationships behind it, helping people better understand and use that data. AI4DB applies machine learning and artificial intelligence techniques to process and manage large-scale databases, tackle NP-hard graph-related problems, and improve the efficiency and accuracy of data processing.

In the fields of Graph Analytics and AI for Database, we generally focus on academic papers from the following conferences:

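| Category | Conference | Link | Comment |
| --- | --- | --- | --- |
| Database | SIGMOD | DBLP, Official website | Pioneering conference in Database |
| | VLDB | VLDB | |
| | ICDE | ICDE | |
| AI/ML/DL | ICML | ICML | |
| | ICLR | ICLR | |
| | NeurIPS | NeurIPS | |
| Data Mining | KDD | KDD | |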
[A Comprehensive Survey on Graph Neural Networks] https://ieeexplore.ieee.org/document/9046288

GNNs are commonly implemented in PyTorch. The PyTorch Geometric documentation provides a series of tutorials: https://pytorch-geometric.readthedocs.io/en/latest/

If you're not familiar enough with deep learning, you can check out (the following are for reference, you don't need to look at all of them):

This book: Deep Learning https://github.com/exacity/deeplearningbook-chinese

This course: Search for "李宏毅 机器学习" (Hung-yi Lee Machine Learning) on Bilibili

Course + Book: Hands-on Deep Learning

Other potentially valuable Stanford University courses: cs224n (NLP), cs224w (graph), cs229 (ML), cs231n (CV), cs285 (RL)

Courses

Main Courses

Reference Courses

Key Chapters

| Week | Content | Reading List | Material |
| --- | --- | --- | --- |
| 1 | Node Embedding | 1: DeepWalk: Online Learning of Social Representations; 2: node2vec: Scalable Feature Learning for Networks | Node representation is one of the most fundamental problems in graph learning. You can refer to the content of CS224W 3rd and COMP9312 week 6. For traditional matrix factorization methods, you can refer to: matrix factorization |
| 2 | Graph Neural Networks | 1: Semi-Supervised Classification with Graph Convolutional Networks; 2: Graph Attention Networks | The structure of the model is the core of learning (CS224W 4th). The reading list provides two classic models: GCN and GAT. Regarding the structure of each layer in neural networks, you can refer to the tutorial: Learning basic |
| 3 | GNN Augmentation and Training | 1: RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space; 2: Hyper-Path-Based Representation Learning for Hyper-Networks | For an introduction to the entire GNN process, you can refer to CS224W 6th. The reading list provides other representation learning methods and their extensions on knowledge graphs/hypergraphs. For how to do embedding, you can refer to the tutorial: Node embedding |
| 4 | Theory of Graph Neural Networks | 1: Anomaly Detection on Attributed Networks via Contrastive Self-Supervised Learning; 2: Sub-graph Contrast for Scalable Self-Supervised Graph Representation Learning | For analysis of neural networks, you can refer to CS224W 7th. The reading list provides current mainstream self-supervised embedding methods and network frameworks. For how to use learning to perform basic tasks, refer to the tutorial: Downstream Application: node classification, link prediction and graph classification |
| 5 | Label Propagation on Graphs | 1: GLSearch: Maximum Common Subgraph Detection via Learning to Search; 2: Computing Graph Edit Distance via Neural Graph Matching | Refer to CS224W 8th. The reading list provides two methods for solving the graph similarity computation (NP-hard) problem. To implement the GCN structure yourself, you can refer to the tutorial: Graph Convolutional Network |
| 6 | Subgraph Matching and Counting | 1: A Learned Sketch for Subgraph Counting; 2: Neural Subgraph Counting with Wasserstein Estimator | Refer to CS224W 12th. The reading list provides two papers on the subgraph counting (NP-hard) problem (from SIGMOD 2021 and SIGMOD 2022). The repository code for the SubCon paper is at Contrastive Learning on Graph |
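
As a companion to the GCN reading in week 2 and the "implement the GCN structure yourself" pointer in week 5, here is a minimal sketch of a single GCN propagation step, ReLU(Â H W) with Â = D^{-1/2}(A + I)D^{-1/2}, written in plain Python rather than PyTorch. The toy path graph, the feature and weight values, and the function names are illustrative assumptions and do not come from the linked tutorials.

```python
import math

def normalized_adjacency(n, edges):
    """Return D^{-1/2} (A + I) D^{-1/2} as a dense list of lists."""
    a = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0          # undirected edge
    for i in range(n):
        a[i][i] = 1.0                    # add self-loops (A + I)
    deg = [sum(row) for row in a]        # degrees including the self-loop
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(x, y):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def gcn_layer(a_hat, h, w):
    """One propagation step: aggregate neighbours, project, apply ReLU."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, value) for value in row] for row in z]

# Hypothetical toy input: a 3-node path graph 0 - 1 - 2 with 2-d features.
edges = [(0, 1), (1, 2)]
a_hat = normalized_adjacency(3, edges)
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # node features
w = [[1.0], [-1.0]]                        # weights (fixed here, learned in practice)
out = gcn_layer(a_hat, h, w)
```

In a real model the weight matrix is trained by backpropagation and the adjacency is stored sparsely; the dense lists here only make the arithmetic easy to follow.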

1: Cohesive Subgraph Discovery

Cohesive Subgraph Discovery is the problem of finding highly cohesive subgraphs in graph data. The survey "A Survey on Machine Learning Solutions for Graph Pattern Extraction" (pay close attention to Ch. 2.6, community search) clearly explains several baseline papers for our work.

For more details, please refer to 1: Cohesive Subgraph Discovery
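
To make "highly cohesive" concrete, the sketch below computes k-core numbers by repeatedly peeling the lowest-degree vertex. The section above does not name a specific cohesiveness model; k-core is assumed here only because it is one widely used choice, and the function names and toy graph are hypothetical.

```python
from collections import defaultdict

def core_numbers(edges):
    """Return the core number of every vertex via minimum-degree peeling."""
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:                       # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Peel the vertex with the smallest remaining degree.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])            # core numbers never decrease while peeling
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core

def k_core(edges, k):
    """Vertices of the k-core: those whose core number is at least k."""
    return {v for v, c in core_numbers(edges).items() if c >= k}

# Hypothetical toy graph: a triangle a-b-c with a pendant vertex d.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
core = core_numbers(edges)
dense_part = k_core(edges, 2)
```

This simple version rescans for the minimum each round, so it is quadratic in the number of vertices; bucket-based implementations reach linear time.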

2: Generalized Anomaly Detection

Generalized Anomaly Detection covers many closely related problems, such as anomaly detection, novelty detection, open set recognition, out-of-distribution detection, and outlier detection.

2.1 Survey of anomaly detection and Benchmarks

2.2 Anomaly Detection

| Conference | Paper | Material | Abstract | Highlights |
| --- | --- | --- | --- | --- |
| ICLR2022 | Anomaly detection for tabular data with internal contrastive learning. | [code] | KNN and Contrastive Learning for Tabular data | -- |
| ICDE2023 | Unsupervised Graph Outlier Detection: Problem Revisit, New Insight, and Superior Method. | --- | --- | --- |
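
For readers new to the area, a plain k-nearest-neighbour distance score is a common starting baseline for tabular anomaly detection. The sketch below is that generic baseline only; it is not the method of either paper listed above, and the sample points and function name are made up for illustration.

```python
import math

def knn_anomaly_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        # Distances to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical data: a tight cluster near the origin plus one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_anomaly_scores(points, k=2)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
```

A higher score means the point sits farther from its neighbours; choosing k and a decision threshold is dataset-dependent.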

2.3 Fraud Detection

3: AIGC-LLM

Please refer to 3: AIGC-LLM

4: Differential Privacy

Please refer to 4: Differential Privacy

and

DP & ML: https://github.com/JeffffffFu/Awesome-Differential-Privacy-and-Meachine-Learning

5: Graph Analytics on GPUs

Please refer to 5: Graph Analytics on GPUs

6: Graph Similarity Computation

7: Subgraph Matching and Counting

8: Cardinality Estimation

9: Graph for DB and tabular data

Graphs are a valuable tool for representing connections between entities, while tabular or relational data is a convenient and user-friendly way to store information. Researchers frequently employ graphs to depict interdependencies among records, attributes, elements, and schemas within and across tables. It is worth noting that in contemporary usage, the term "tabular deep learning" is often used to refer to the application of deep learning techniques to relational data organized as records, while the term "database" is often reserved to refer specifically to the software and infrastructure used to manage and manipulate such data.

| Conference | Paper | Material | Abstract | Highlights |
| --- | --- | --- | --- | --- |
| PODS2023 | Databases as Graphs: Predictive Queries for Declarative Machine Learning. | --- | Using hypergraph to model the relationship behind the records | -- |
| --- | Enabling tabular deep learning when $d \gg n$ with an auxiliary knowledge graph | --- | Capture the relation between two attributes by KG | -- |
| CIKM22 | Local Contrastive Feature learning for Tabular Data | --- | Capture the relation between two attributes by maximum spanning tree | -- |
| NIPS22 | Learning enhanced representations for tabular data via neighborhood propagation | --- | --- | -- |
| DLP-KDD 2021 | TabGNN: Multiplex Graph Neural Network for Tabular Data Prediction | --- | --- | -- |
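
One simple way to picture how a table becomes a graph is to connect two records whenever they share an attribute value. The sketch below does exactly that; it is an assumed, simplified construction for illustration and is not taken from any of the papers above. The column names and rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def records_to_graph(rows):
    """Link record indices that share an (attribute, value) pair.

    Returns {(i, j): {attributes on which records i and j agree}} with i < j.
    """
    by_value = defaultdict(list)
    for idx, row in enumerate(rows):
        for attr, value in row.items():
            by_value[(attr, value)].append(idx)
    edges = defaultdict(set)
    for (attr, _value), members in by_value.items():
        # Every pair of records holding this value gets an edge labelled by attr.
        for u, v in combinations(members, 2):
            edges[(u, v)].add(attr)
    return dict(edges)

# Hypothetical table with three records.
rows = [
    {"city": "Sydney", "plan": "basic"},
    {"city": "Sydney", "plan": "pro"},
    {"city": "Perth", "plan": "pro"},
]
graph = records_to_graph(rows)
```

Keeping the attribute name on each edge yields a multiplex-style graph, where each attribute defines its own relation type.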

10: Vector Database

Similarity search at a very large scale.
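
At its core, similarity search ranks stored vectors by closeness to a query vector. The sketch below is an exact, brute-force cosine top-k scan in plain Python; production vector databases typically rely on approximate indexes instead, so treat this only as a reference for what such an index approximates. The keys and vectors are hypothetical.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def top_k(query, vectors, k=2):
    """Return the k (similarity, key) pairs most similar to the query."""
    return heapq.nlargest(k, ((cosine(query, v), key) for key, v in vectors.items()))

# Hypothetical stored embeddings.
vectors = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
results = top_k([1.0, 0.0], vectors, k=2)
```

The linear scan costs one similarity computation per stored vector, which is what makes indexing necessary at very large scale.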

Conference Paper Material Abstract Highlights
SIGMOD2023 Near-Duplicate Sequence Search at Scale for Large Language Model Memorization. --- --- --

Talk 1: Vector Database for Large Language Models in Production (Sam Partee)

12: GNN-based recommendation system

Recommender systems encompass numerous sub-domains, so some areas are inevitably left out. Here we introduce only some of the sub-domains that apply GNNs.

GNN-based Collaborative Filtering

Collaborative Filtering (CF) is one of the most classic and widely used methods in recommender systems. Its core idea is to find association patterns between similar users or items based on historical user behavior data. It mainly falls into two categories: User-based CF makes recommendations by finding similar user groups, while Item-based CF makes recommendations through similarity relationships between items.
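
The item-based variant described above can be sketched without any GNN: compute cosine similarity between items from the users who rated them, then score unseen items by similarity-weighted ratings. The rating data, user names, and function names below are hypothetical, and this is a classical baseline rather than any of the GNN models listed in this section.

```python
import math
from collections import defaultdict

def item_similarity(ratings):
    """Cosine similarity between every ordered pair of distinct items."""
    item_users = defaultdict(dict)
    for user, items in ratings.items():
        for item, rating in items.items():
            item_users[item][user] = rating
    norm = {i: math.sqrt(sum(r * r for r in users.values()))
            for i, users in item_users.items()}
    sims = {}
    for a in item_users:
        for b in item_users:
            if a == b:
                continue
            common = item_users[a].keys() & item_users[b].keys()
            dot = sum(item_users[a][u] * item_users[b][u] for u in common)
            sims[(a, b)] = dot / (norm[a] * norm[b])
    return sims

def recommend(user, ratings, sims, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings."""
    seen = ratings[user]
    scores = defaultdict(float)
    for item, rating in seen.items():
        for (a, b), s in sims.items():
            if a == item and b not in seen:
                scores[b] += s * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Hypothetical explicit ratings (all positive, so norms are non-zero).
ratings = {
    "alice": {"book": 5, "pen": 4},
    "bob": {"book": 5, "pen": 5, "lamp": 4},
    "carol": {"lamp": 5, "mug": 4},
    "dave": {"book": 4, "lamp": 5},
}
sims = item_similarity(ratings)
suggestion = recommend("alice", ratings, sims)
```

User-based CF follows the same pattern with the roles of users and items swapped.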

Here are some cornerstone papers on GNN-based CF. Some of them, like LightGCN, still serve as baselines or even as benchmark components in many SOTA studies.

| Paper | Conference | Year | Highlights |
| --- | --- | --- | --- |
| Graph Convolutional Neural Networks for Web-Scale Recommender Systems | KDD | 2018 | First paper of GNN-based Collaborative Filtering, PinSAGE |
| Neural Graph Collaborative Filtering | SIGIR | 2019 | Refinement of PinSAGE, NGCF |
| LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation | SIGIR | 2020 | The most popular benchmark model, a refined version of NGCF, LightGCN |
| Self-supervised Graph Learning for Recommendation | SIGIR | 2021 | One of the first papers in GNN-based CF utilizing a self-supervised method, SGL |

GNN-based session-based/sequence-based Rec.Sys.

Session-based Recommender Systems focus on predicting the next item within an ongoing session, where a session is typically a short-term interaction sequence (such as a browsing session) with no user identification required. The key problem is: Given a sequence of user interactions $S = \{i_1, i_2, \cdots, i_t\}$ in the current session, predict the next item $i_{t+1}$ that the user is likely to interact with.

Sequence-based Recommender Systems consider the complete historical sequence of user interactions across multiple sessions over time. The problem is defined as: Given a user $u$ and their entire historical interaction sequence $H = \{i_1, i_2, \cdots, i_n\}$ ordered by timestamp, predict the next items that the user will interact with in the future.

For detailed information about session-based and sequence-based Rec.Sys., please refer to:

Here are some cornerstone papers of GNN-based session-based/sequence-based Rec.Sys.

| Paper | Conference | Year | Highlights |
| --- | --- | --- | --- |
| Session-Based Recommendation with Graph Neural Networks | AAAI | 2018 | SR-GNN, still a popular baseline choice. |
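
GNN-based session models start by turning the click sequence $S$ into a graph. The sketch below builds a directed session graph whose edges are consecutive clicks, weighted by transition count divided by the total outgoing transitions of the source item. It is written in the spirit of SR-GNN's session-graph construction, but the exact normalization and all names here are assumptions; consult the paper for the authors' definition.

```python
from collections import Counter, defaultdict

def session_graph(session):
    """Directed graph over items: edge (a, b) for each consecutive click a -> b.

    Each edge weight is its count divided by the total outgoing transitions of a.
    """
    counts = Counter(zip(session, session[1:]))
    out_total = defaultdict(int)
    for (src, _dst), c in counts.items():
        out_total[src] += c
    return {edge: c / out_total[edge[0]] for edge, c in counts.items()}

# Hypothetical session in which item v2 is visited twice.
session = ["v1", "v2", "v3", "v2", "v4"]
graph = session_graph(session)
```

Repeated items collapse into a single node, which is what lets a GNN share information across revisits instead of treating the session as a flat sequence.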

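That said, if you want to contribute to this area, think twice and ask yourself: what advantage does my approach have over sequence models, especially Transformer-based ones?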

GNN-based substitute & complement Rec.Sys.

Substitute recommendation aims to suggest interchangeable items for a given query item. Most traditional methods infer substitute relationships from item similarities, extracting semantic information from consumer reviews for this purpose. More recently, because substitutes and complements are semantically connected (e.g., a substitute's complement is often also a complement of the original item), and with the development of GNNs, mainstream models primarily learn substitute relationships from networks of co-view and co-purchase relationships. They employ various methods to explore the latent relationships between different item interactions.

Complement recommendation aims to suggest complementary items (such as a mouse for a computer or a controller for a PS5) for a given query item. Because of the semantic complexity involved, most methods are GNN-based, just as in substitute Rec.Sys.

Here are some cornerstone papers of GNN-based substitute & complement Rec.Sys.

| Paper | Conference | Year | Highlights |
| --- | --- | --- | --- |
| Inferring Networks of Substitutable and Complementary Products | KDD | 2015 | Most popular dataset: Amazon datasets |
| Measuring the Value of Recommendation Links on Product Demand | ISR | 2019 | A business paper (ISR, UTD24), if you need one |
| Decoupled Graph Convolution Network for Inferring Substitutable and Complementary Items | CIKM | 2020 | One of the first GNN-based Sub. & Com. Rec.Sys., predicting substitute and complement relationships simultaneously |
| Heterogeneous graph neural networks with neighbor-SIM attention mechanism for substitute product recommendation | AAAI | 2021 | |
| Decoupled Hyperbolic Graph Attention Network for Modeling Substitutable and Complementary Item Relationships | CIKM | 2022 | |
| Enhanced Multi-Relationships Integration Graph Convolutional Network for Inferring Substitutable and Complementary Items | AAAI | 2023 | |

GNN-based cold-start Rec.Sys.

Please refer to Awesome-Cold-Start-Recommendation

13: Missing Data Imputation

Missing data imputation focuses on methods for filling in missing values in datasets, improving data completeness and quality. It includes techniques based on statistical methods, regression models, and graph models.
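
Two of the families named above can be illustrated with very small baselines: column-mean imputation (statistical) and neighbour averaging on a graph (graph-based). Both functions below are simplified sketches with hypothetical names and data; they are not the algorithms of the papers in the table that follows.

```python
def mean_impute(rows):
    """Replace None in each column with that column's observed mean."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        observed = [r[c] for r in rows if r[c] is not None]
        # Fall back to 0.0 for a column with no observed values (an assumption).
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[c] if v is None else v for c, v in enumerate(r)] for r in rows]

def neighbor_impute(features, adj):
    """Fill a node's missing feature with the mean of its observed neighbours."""
    filled = dict(features)
    for node, value in features.items():
        if value is None:
            observed = [features[n] for n in adj.get(node, ())
                        if features[n] is not None]
            if observed:                 # leave as None if no neighbour is observed
                filled[node] = sum(observed) / len(observed)
    return filled

# Hypothetical tabular data with two missing cells.
table = [[1.0, None], [3.0, 4.0], [None, 8.0]]
completed = mean_impute(table)

# Hypothetical graph data: node b has no feature but two observed neighbours.
features = {"a": 1.0, "b": None, "c": 3.0}
adj = {"b": ["a", "c"]}
smoothed = neighbor_impute(features, adj)
```

Mean imputation ignores relationships between records, while the graph variant uses structure; the learned methods listed below aim to do better than both.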

| Category | Title | Source | Year |
| --- | --- | --- | --- |
| Vision and Survey Paper | An Experimental Survey of Missing Data Imputation Algorithm | IEEE TKDE | 2023 |
| | Can Foundation Models Wrangle Your Data? | VLDB (Vision) | 2022 |
| Tabular Data Imputation | Missing Data Imputation with Uncertainty-Driven Network | SIGMOD | 2024 |
| | ReMasker: Imputing Tabular Data with Masked Autoencoding | ICLR | 2024 |
| | Transformed distribution matching for missing value imputation | ICML | 2023 |
| | MIRACLE: Causally-Aware Imputation via Learning Missing Data Mechanisms | NeurIPS | 2021 |
| | GAIN: Missing Data Imputation using Generative Adversarial Nets | ICML | 2018 |
| | Multiple Imputation by Chained Equations | | 2011 |
| Graph Data Imputation | Data Imputation from the Perspective of Graph Dirichlet Energy | CIKM | 2024 |
| | Handling Missing Data via Max-Entropy Regularized Graph Autoencoder | AAAI | 2023 |
| | Accurate Node Feature Estimation with Structured Variational Graph Autoencoder | KDD | 2022 |
| | Learning on Attribute-Missing Graphs | TPAMI | 2021 |
| | Deconvolutional Networks on Graph Data | NeurIPS | 2021 |

14: Others

PaperWriting

FigureDrawing

Tools
