This repository contains the code for a trial Wikipedia article quality classification project (09/2021).
The idea behind this project was to build a proof of concept and analyze how well conventional text classification algorithms and neural networks can evaluate the quality of Wikipedia articles automatically. As training and evaluation data, it uses articles that have been manually curated and assigned the "good article" badge by Wikipedia editors. It is an example of a simple, standalone pipeline covering data creation, curation, cleaning, and analysis. The report included in this repository describes the data sources, data processing, and analysis in detail and can serve as a basis for further improvements of conventional text classification models.
- parsing.py: Script to parse data from the Wikipedia data dump, available from https://dumps.wikimedia.org/enwiki/20210901/
- dataloading.py: Script to create datasets; also contains functions for simple data loading
- cleaning_utils.py: Some functions to clean data used during dataloading
- statistics.py: Optional script to analyze the data and its distribution; not required for the classification analysis
- textvectorization.py: Script to create tf-idf text vectorizations of parsed and cleaned Wikipedia articles
- classifComp.py: Script to compare the performance of some of the most common classical ML classifiers on tf-idf data
- logisticRegression.py: More detailed analysis of logistic regression classification, including feature importance
- randomForest.py: More detailed analysis of random forest classification, including feature importance
- gloveModel.py: Script to run a more complex classification model combining GloVe word embeddings (a method related to word2vec) with a DNN
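
To illustrate the kind of tf-idf vectorization that textvectorization.py is described as producing, here is a minimal sketch using only the Python standard library. It is not the code from this repository: the function names `tokenize` and `tfidf_vectors`, the toy corpus, and the unsmoothed weighting (term frequency times `log(N / df)`) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) with one sparse tf-idf dict per document.

    tf is the term count divided by the document length; idf is
    log(N / df), where df is the number of documents containing the term.
    This is an illustrative weighting, not the repository's actual one.
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vocabulary = sorted(doc_freq)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for empty documents
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vocabulary, vectors


# Hypothetical toy corpus standing in for parsed and cleaned articles.
articles = [
    "The castle was built in the twelfth century.",
    "The castle is a stub article.",
    "Football is a team sport.",
]
vocabulary, vectors = tfidf_vectors(articles)
```

Terms that occur in only one document (such as "football" here) receive a higher weight than terms shared across documents, which is what makes tf-idf features useful as classifier input. Library vectorizers such as scikit-learn's `TfidfVectorizer` apply a smoothed idf and normalize each vector by default, so their values will differ from this sketch.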