WikipediaClassification

This repository contains the code for a trial project on Wikipedia article quality classification (09/2021).

Idea

The idea behind this project was to build a proof of concept and to analyze how well conventional text classification algorithms, as well as neural networks, can evaluate the quality of Wikipedia articles automatically. As training and evaluation data, it uses articles that have been manually curated and assigned the "good article" badge by Wikipedia editors. It is an example implementation of a simple, standalone pipeline covering data creation, curation, cleaning, and analysis. The report included in this repository (NLP_Report_Leuner_final.pdf) contains a detailed description of the data sources, data processing, and analysis, and can serve as a starting point for further improvements of conventional text classification models.

Code

parsing.py: Parses data from the Wikipedia data dump; the data is available from https://dumps.wikimedia.org/enwiki/20210901/
dataloading.py: Script that creates the datasets and contains functions for simple data loading
cleaning_utils.py: Functions for cleaning the data, used during data loading
statistics.py: Script to analyze the data and its distribution; optional, not required for the analysis
textvectorization.py: Script to create tf-idf text vectorizations of the parsed and cleaned Wikipedia articles
classifComp.py: Script to compare the performance of some of the most common classical ML classifiers on the tf-idf data
logisticRegression.py: More detailed analysis of logistic regression classification, including feature importance
randomForest.py: More detailed analysis of random forest classification, including feature importance
gloveModel.py: Script to run a more complex GloVe (W2V derivative) and DNN classification model
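
As an illustration of the tf-idf step that several of the scripts above build on, here is a minimal sketch in plain Python. It is not the code from textvectorization.py: the function names (tokenize, tfidf_vectors), the whitespace tokenizer, the smoothed idf formula, and the sample texts are all assumptions chosen for this example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one tf-idf vector per document.

    Hypothetical helper for illustration only; the weighting below
    (relative term frequency times a smoothed idf) is one common
    variant, not necessarily the one used in this repository.
    """
    tokenized = [tokenize(doc) for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)

    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in tokenized for tok in set(doc))

    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


# Made-up sample texts standing in for cleaned article bodies.
sample_docs = ["good article text", "stub text"]
vocab, vectors = tfidf_vectors(sample_docs)
```

Tokens that occur in every document (here "text") receive a lower weight than tokens specific to one document, which is the property the classifiers rely on when working with tf-idf features.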
