ATU H.Dip in Computer Programming in Data Analytics: Repository for Programming for Data Analysis - Assignment 2

AndrewShanahan/PfDA_2

Programming for Data Analysis - Project 2

Project Description:

This project investigates the Wisconsin Breast Cancer dataset, with a focus on the following:

  • Undertake an analysis/review of the dataset and present an overview and background.
  • Provide a literature review on classifiers which have been applied to the dataset and compare their performance.
  • Present a statistical analysis of the dataset.
  • Using a range of machine learning algorithms, train a set of classifiers on the dataset (using SKLearn etc.) and present classification performance results.
  • Detail the rationale for the parameter selections made while training the classifiers.
  • Compare, contrast and critique the results with reference to the literature.
  • Discuss and investigate how the dataset could be extended using data synthesis of new tumour datapoints.
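As a rough illustration of the classifier-training step, the sketch below fits two scikit-learn classifiers and reports test accuracy. It is not the notebook's actual code. It assumes scikit-learn's bundled `load_breast_cancer` data, which is the Diagnostic variant (569 rows, 30 features) rather than the 699-row original dataset, and the split size, `random_state`, `n_estimators` and `n_neighbors` values are illustrative choices only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled scikit-learn dataset (Diagnostic variant) stands in
# for the original Wisconsin Breast Cancer data used in this project.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 25% of the rows for testing; stratify to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative parameter choices, not tuned values.
classifiers = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # KNN is distance based, so the features are standardised first.
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
}

scores = {}
for name, model in classifiers.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Accuracy alone can hide errors on the malignant class, so a fuller comparison would probably also report precision, recall and a confusion matrix.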

About the Dataset:

Title: Wisconsin Breast Cancer Database (January 8, 1991)
Source: Dr. William H. Wolberg (physician), University of Wisconsin Hospitals, Madison, Wisconsin, USA
Additional Sources:

  • O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
  • William H. Wolberg and O.L. Mangasarian: "Multisurface method of pattern separation for medical diagnosis applied to breast cytology", Proceedings of the National Academy of Sciences, U.S.A., Volume 87, December 1990, pp 9193-9196.
  • O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition via linear programming: Theory and application to medical diagnosis", in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
  • K. P. Bennett & O. L. Mangasarian: "Robust linear programming discrimination of two linearly inseparable sets", Optimization Methods and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).

Data Set Characteristics: Multivariate
Number of Instances: 699
Area: Life
Attribute Characteristics: Integer
Number of Attributes: 10
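To make the file layout concrete, here is a minimal parsing sketch using only the standard library. It assumes the UCI layout for the original dataset: comma-separated rows with a sample ID, nine integer cytology features, and a class label (2 for benign, 4 for malignant), with `?` marking a missing value. The sample rows are made up for illustration, and the short column names are hypothetical labels rather than names taken from the data file.

```python
import csv
import io

# Hypothetical rows in the assumed UCI layout; the values are made up.
SAMPLE = """\
1000001,5,1,1,1,2,1,3,1,1,2
1000002,8,10,10,8,7,10,9,7,1,4
1000003,6,8,8,1,3,?,3,7,1,2
"""

# Hypothetical short names: sample ID, nine features, class label.
COLUMNS = [
    "sample_id", "clump_thickness", "cell_size_uniformity",
    "cell_shape_uniformity", "marginal_adhesion", "epithelial_cell_size",
    "bare_nuclei", "bland_chromatin", "normal_nucleoli", "mitoses", "class",
]


def load_rows(handle):
    """Parse rows into dicts, mapping '?' to None and decoding the class."""
    rows = []
    for record in csv.reader(handle):
        if not record:
            continue  # skip blank lines
        row = {
            name: (None if value == "?" else int(value))
            for name, value in zip(COLUMNS, record)
        }
        row["diagnosis"] = "benign" if row["class"] == 2 else "malignant"
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))
incomplete = [r["sample_id"] for r in rows if None in r.values()]
print(f"{len(rows)} rows parsed, rows with missing values: {incomplete}")
```

For the real file, the same function could be called on `open(path, newline="")`. Rows with missing values need to be dropped or imputed before training a classifier.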

Requirements/How to use:

In order to run this on your PC, you require the following:

Install Anaconda: https://www.anaconda.com/products/individual. This distribution includes Python and several packages used in this assignment, including the numpy package.
Install Jupyter: https://jupyter.org/ to run numpy-random.ipynb
GitHub: https://github.com/AndrewShanahan/PfDA_2
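Once the installation is done, a quick check such as the sketch below can confirm that the main packages are importable before opening the notebook. The package list is an assumption based on the libraries referenced in this README (numpy, pandas, matplotlib and scikit-learn), and `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed package list, taken from the libraries referenced in this README.
# Note that scikit-learn is imported under the name "sklearn".
REQUIRED = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```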

References:

Initial set-up - Troubleshooting repository set up:

[01] https://stackoverflow.com/questions/17096311/why-do-i-need-to-explicitly-push-a-new-branch/17096880#17096880
[02] https://www.educative.io/answers/the-fatal-refusing-to-merge-unrelated-histories-git-error

Readme:

[03] https://www.freecodecamp.org/news/how-to-write-a-good-readme-file/

Wisconsin Breast Cancer dataset:

[04] Dataset info/description: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%290
[05] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
[06] https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)
[07] https://data.world/health/breast-cancer-wisconsin
[08] Importing dataset and troubleshooting: https://stackoverflow.com/questions/31797013/how-to-open-a-data-file-extension
[09] Attribute Information: https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28original%29
[10] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html

Breast Cancer:

[11] https://www2.hse.ie/conditions/breast-cancer-women/?gclid=CjwKCAiAnZCdBhBmEiwA8nDQxV9FZLIuR4GMAzCaJFwNTvHQGzP8oK-LCGZ-jOYXBTyNlzBNjKMK6RoCzLkQAvD_BwE&gclsrc=aw.ds
[12] https://www.cancer.ie/cancer-information-and-support/cancer-types/breast-cancer
[13] https://www.who.int/news-room/fact-sheets/detail/breast-cancer

Literature Review:

[14] Lavanya, D. (05/11/2011) ‘Analysis of feature selection with classification: Breast Cancer Datasets’, Indian Journal of Computer Science and Engineering, ISSN : 0976-5166, P. xxx. Available at: http://ijcse.com/docs/INDJCSE11-02-05-167.pdf (Accessed: 29/12/2022).

[15] Siham, M et al. (11/07/2020) 'Analysis of Breast Cancer Detection Using Different Machine Learning Techniques', Communications in Computer and Information Science, ISSN 1234, P. xxx. Available at: https://link.springer.com/chapter/10.1007/978-981-15-7205-0_10#Sec2 (Accessed: 01/01/2022).

Sklearn:

[16] https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
[17] https://scikit-learn.org/stable/modules/neighbors.html
[18] https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Other Resources Used:

[19] Datacamp - numerous courses/tracks completed over the last number of months have supported this exercise: https://www.datacamp.com/
[20] Udemy course: https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/8680110?start=94#overview
[21] W3schools - Resource used on a regular basis: https://www.w3schools.com/
[22] Stackoverflow - Resource used to help troubleshoot problems and help with coding: https://stackoverflow.com/
[23] matplotlib.pyplot: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html
[24] pandas: https://pandas.pydata.org/
[25] numpy: https://numpy.org/doc/stable/index.html

Contact:

G00217642@atu.ie
