Kaggle link: https://www.kaggle.com/competitions/cs-mlns-22/overview
The problem deals with a graph of a citation network where each node represents a research paper and the nodes are connected upon having the citations between the research paper. A complete graph is when all the research papers are rightly cited where due and one can analyse the graph to identify the most cited research paper and the areas of research inspired through citation. This connected graph canalso help in finding the correlated areas of research through interactions among them. The data provided for the graph network, with 27770 nodes containing the title, year, journal, author and the abstract. From the citations (edges) some edges were purposely removed and classifying whether there is an edge between two remaining nodes was the goal of this Kaggle competition.
In preprocessing:
- stopwords removal
- stemming
- TF-IDF vectorization
- Word Embedding
- other miscellaneous actions
Feature Engineering: Based on 5 initial features on each node, we created 33 features for each pair of nodes.
The best performed classifiers are SVM and XGBoost. More details are attached in the report.pdf.