This documentation covers four projects developed by Cristiana Gewerc and me for the Data Wrangling unit of the Monash MDS. The main topics covered in these projects were:
- parse data in the required format;
- assess the quality of data for problem identification;
- resolve data quality issues to prepare the data for analysis;
- integrate data sources for data enrichment;
- document the wrangling process for professional reporting;
- write program scripts for data wrangling processes.
In a nutshell, the projects were developed in Jupyter Notebook (Python 3) and cover the following:
parsing-data: Extraction of data from semi-structured text files using only the `re` and `pandas` libraries. Takes a `TXT` file and generates a `JSON` and a `CSV` file.
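A minimal sketch of this kind of pipeline, not the actual assignment code; the input file name and record layout are hypothetical:

```python
import re
import json
import pandas as pd

records = []
with open("input.txt") as fh:  # hypothetical input file
    for line in fh:
        # Assume each line looks like "id: 1; name: Alice; city: Melbourne"
        match = re.match(r"id:\s*(\d+);\s*name:\s*(\w+);\s*city:\s*(\w+)", line)
        if match:
            records.append({"id": int(match.group(1)),
                            "name": match.group(2),
                            "city": match.group(3)})

# Write the parsed records to JSON and CSV
with open("output.json", "w") as fh:
    json.dump(records, fh, indent=2)

pd.DataFrame(records).to_csv("output.csv", index=False)
```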
text-preprocessing: Extraction of a set of published papers from an unstructured format, preprocessing, and conversion into numerical representations.
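A minimal sketch of turning raw paper text into a count-based numerical representation; the sample texts are placeholders and scikit-learn's `CountVectorizer` is used here only for illustration, not necessarily the library choice in the notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer

papers = [
    "Data wrangling transforms raw data into a usable form.",
    "Preprocessing text includes tokenisation and normalisation.",
]  # hypothetical stand-ins for the extracted paper texts

# Lowercase, tokenise, drop stop words, and build a document-term count matrix
vectorizer = CountVectorizer(lowercase=True, stop_words="english")
counts = vectorizer.fit_transform(papers)

print(vectorizer.get_feature_names_out())  # vocabulary terms
print(counts.toarray())                    # per-document term counts
```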
cleansing-raw-data: Outlier analysis and removal, missing-data imputation, and fixing of data anomalies.
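A minimal sketch of these cleansing steps with pandas; the `price` column, values, and IQR threshold are illustrative, not taken from the project data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"price": [10.0, 12.5, np.nan, 11.0, 950.0, 9.5]})

# Flag outliers with the IQR rule
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Remove outliers, then impute remaining missing values with the median
df = df[in_range | df["price"].isna()]
df["price"] = df["price"].fillna(df["price"].median())
print(df)
```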
data-integration-reshaping: Integration of multiple data sources, including web-scraped data, XML files, shapefiles, TXT, GTFS data, CSV, and XLSX.
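A minimal sketch of combining a few of the listed source types; the file names, columns, and join key are hypothetical, and shapefile/GTFS handling (e.g. via geopandas) is omitted for brevity:

```python
import pandas as pd

csv_df = pd.read_csv("properties.csv")    # tabular CSV source
xlsx_df = pd.read_excel("suburbs.xlsx")   # Excel source (needs openpyxl)
xml_df = pd.read_xml("schools.xml")       # XML source (pandas >= 1.3)

# Integrate the sources on a shared key, then reshape into long format
merged = (csv_df
          .merge(xlsx_df, on="suburb", how="left")
          .merge(xml_df, on="suburb", how="left"))
tidy = merged.melt(id_vars=["suburb"], var_name="attribute", value_name="value")
print(tidy.head())
```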