
Regarding some questions in the paper and code. #1

hdd1009 opened this issue Jul 17, 2024 · 1 comment
hdd1009 commented Jul 17, 2024

I am very pleased to see your interesting paper and hope it will be accepted by a top conference soon! I would like to study it in depth but have encountered the following problems. I apologize for taking up your time and would like to ask for your guidance.
In the file gssc_layer.py:

  1. Why is self.deg_coef on line 12 defined as nn.Parameter(torch.zeros(1, 1, cfg.gnn.dim_inner, 2))?
  2. My understanding of lines 32-34 is that node degree information is added to the node embedding. I don't understand why this is done, and I couldn't find it in Equation 5 of the paper. Did I miss something when reading the paper?
  3. In lines 36-39, why is the cfg.extra.reweigh_self branch applied? I don't quite understand this and couldn't find the corresponding part in the paper. Did I overlook something?
  4. In lines 44-45, I don't fully understand the execution logic inside torch.einsum(). Even after comparing it with Equation 5 in the paper, I am still unclear. Could you please explain it to me?
  5. In line 46, why is an additional mapping step performed? This is not mentioned in the paper. Did I miss something?
  6. I only see one set of qk mappings in the code, but there are two qk terms in Equation 5. Why is there only one in the code?
  7. On page 5 of the paper, you state that all W ∈ R^{m×m} with different subscripts are learnable weight matrices, but when I read the code, the dimensions of the W matrices are not all the same. Why is that?

I apologize for the number of questions and for taking up your time. I look forward to your reply.

siqim (Member) commented Aug 18, 2024

Hi dongdong,

Thanks for your interest and questions! They will definitely help us improve the manuscript. I hope the following resolves your questions.

  1. These define learnable parameters that control the impact of node degree along each hidden dimension. This trick better incorporates node degree information and has been used in several previous works; see this paper for more details. (A small sketch of how deg_coef is typically applied is given after this list.)
  2. Eq. 5 demonstrates the main idea of the paper, i.e., that graph attention can be linearized via the factorization property. The implementation in gssc_layer.py also includes a few empirical tricks introduced in previous papers. For this particular trick, which incorporates node degree info, please refer to the paper mentioned above (and its implementation), where a detailed description is provided.
  3. It's the second term in Eq. 4. Basically, the first term in Eq. 4 aggregates information from the entire graph to update the embedding of node v, while the second term updates node v's embedding using its own PE. Combining these two embeddings can be shown to help improve expressive power.
  4. For this one, I'd suggest running the code on the provided datasets a few times to see how the matrix multiplication is done. The high-level idea is the one mentioned in point 3, i.e., self-reweighting + global aggregation, and lines 44-45 implement the global aggregation. Imagine three matrices, A = left_pe, B = right_pe, and C = x, where left_pe and right_pe are the factors obtained by decomposing the Laplacian matrix (LapPE), so that their product AB approximates the Laplacian, which encodes relative distance information in the graph, and x holds the node embeddings. To use LapPE as attention for aggregating global information, one could compute ABC (the product of the three matrices), but forming AB directly has quadratic complexity. We instead leverage the factorization property: first compute D = BC and then compute AD. The exact implementation is a bit more involved because I also incorporate multi-head attention from the standard transformer, which adds one extra dimension that can be understood as heads. (See the second sketch after this list for a toy comparison of the two computation orders.)
  5. These mappings are inspired by the standard transformer, i.e., the linear mappings for q, k, v, and the mapping before the outputs.
  6. It's just a different naming convention. You can view reweigh_pe and reweigh_pe_2 as another set of qk mappings. (The third sketch after this list spells this reading out.)
  7. I'm not sure I have fully understood your question, but all the W's in the paper correspond to the linear mappings in the code.
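
To make point 1 concrete, here is a minimal sketch of the degree-scaling trick as it commonly appears in prior work; the shapes and the log1p normalization are my assumptions for illustration, not necessarily the exact code in gssc_layer.py:

```python
import torch
import torch.nn as nn

num_nodes, dim_inner = 10, 64                          # toy sizes; dim_inner plays the role of cfg.gnn.dim_inner

x = torch.randn(1, num_nodes, dim_inner)               # node embeddings (batch, nodes, dim)
deg = torch.randint(1, 5, (1, num_nodes, 1)).float()   # node degrees
log_deg = torch.log1p(deg)                             # assumed normalization: log(1 + deg)

# (1, 1, dim, 2): one learnable coefficient per hidden dimension for each of the
# two "views" of a node -- its plain embedding and its degree-scaled embedding.
# Initialized to zeros here; the coefficients are learned during training.
deg_coef = nn.Parameter(torch.zeros(1, 1, dim_inner, 2))

x_views = torch.stack([x, x * log_deg], dim=-1)        # (1, nodes, dim, 2)
x = (x_views * deg_coef).sum(dim=-1)                   # (1, nodes, dim): learnable mix of the two views
```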
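
And for point 4, a toy comparison of the quadratic order (AB)C versus the linear order A(BC) that lines 44-45 exploit; left_pe, right_pe, and x below are random stand-ins, the head dimension is dropped, and the commented einsum is only my guess at the shape convention, not the actual code:

```python
import torch

n, m, d = 500, 16, 64                  # nodes, PE rank, hidden dim (toy sizes)
left_pe = torch.randn(n, m)            # A: query-side PE factor
right_pe = torch.randn(m, n)           # B: key-side PE factor (A @ B approximates the n-by-n "attention" matrix)
x = torch.randn(n, d)                  # C: node embeddings

# Quadratic order: materializes an n-by-n matrix first -- O(n^2) time and memory.
out_quadratic = (left_pe @ right_pe) @ x

# Linear order: associativity lets us contract over the node axis first -- O(n * m * d).
out_linear = left_pe @ (right_pe @ x)

assert torch.allclose(out_quadratic, out_linear, rtol=1e-4, atol=1e-3)

# With a leading head axis, the same contraction could be written, e.g.:
# out = torch.einsum('hnm,hmd->hnd', left_pe, torch.einsum('hmn,hnd->hmd', right_pe, x))
```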
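
Finally, a reading of points 5 and 6 in standard-transformer terms; only reweigh_pe and reweigh_pe_2 are names from the discussion above, the other module names are illustrative:

```python
import torch.nn as nn

dim = 64  # assumed hidden size

# One q/k-like pair applied to the two PE factors (point 6), a v-like map on the
# node embeddings, and the usual projection before the output (point 5).
reweigh_pe = nn.Linear(dim, dim)     # "q"-like mapping
reweigh_pe_2 = nn.Linear(dim, dim)   # "k"-like mapping
value_map = nn.Linear(dim, dim)      # "v"-like mapping (illustrative name)
out_proj = nn.Linear(dim, dim)       # mapping before outputs (illustrative name)
```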

Thanks again for your questions; we'll definitely use them to improve the paper's writing. Just let us know whether these answers resolve your questions or if you have any others!

Best,
Siqi
