ASSOCIATION MINING


Description :-

Reference: https://en.wikipedia.org/wiki/Association_rule_learning

Association mining is a form of UNSUPERVISED LEARNING. It is used to find patterns or rules among 2 or more items in a dataset. In this sample, associations are calculated as follows:-

  • Association from Diagnosis to Service,
  • Association from Diagnosis and Provider to Service.

Screenshots towards the bottom of this page show that even if one does not have a medical background, one can get a pretty good understanding of associated diagnoses and services.

An association may be POSITIVE, i.e. presence of an item makes presence of another item more likely, or NEGATIVE, i.e. presence of an item makes presence of another item less likely. Whether an association is positive or negative is read from LIFT, a ratio of two components :-

  • LIFT = ACTUAL / EXPECTED , tells us - " How much more than expected is our Association ? "
  • When LIFT = 1, it means there is neither positive nor negative association, i.e. items compared are independent.
  • When LIFT > 1, it means there is positive association, i.e. the two items occur together more often than expected if they were independent.
  • When LIFT < 1, it means there is negative association, i.e. the two items occur together less often than expected if they were independent.

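The ACTUAL and EXPECTED metrics are calculated using concepts called SUPPORT and CONFIDENCE. In the formulas below, ACTUAL is the Confidence and EXPECTED is the Expected Confidence.

  • SUPPORT represents the frequency of an item in the dataset, i.e. the fraction of transactions that contain it.
  • CONFIDENCE represents conditional probability, i.e. probability of finding the RHS item given that the LHS item is present.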
  • Support A = (No. of transactions containing A) / (Total No. of transactions)
  • Support A to B = (No. of transactions containing A and B) / (Total No. of transactions)
  • Confidence A to B = (No. of transactions containing A and B) / (No. of transactions containing A)
  • Expected Confidence A to B = (No. of transactions containing B) / (Total No. of transactions)
  • Lift A to B = (Confidence A to B) / (Expected Confidence A to B)
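
The formulas above can be sketched in a few lines of Python. This is only an illustration, not the code in AssociationMining.py: the function name `association_metrics`, the item labels and the five toy transactions are all invented for the example.

```python
def association_metrics(transactions, a, b):
    """Compute the metrics listed above for the rule a -> b.

    `transactions` is a list of sets of items (hypothetical toy input).
    """
    total = len(transactions)
    count_a = sum(1 for t in transactions if a in t)
    count_b = sum(1 for t in transactions if b in t)
    count_ab = sum(1 for t in transactions if a in t and b in t)
    if total == 0 or count_a == 0 or count_b == 0:
        raise ValueError("a and b must each occur in at least one transaction")

    confidence = count_ab / count_a        # Confidence A to B
    expected_confidence = count_b / total  # Expected Confidence A to B
    return {
        "support_a": count_a / total,      # Support A
        "support_ab": count_ab / total,    # Support A to B
        "confidence": confidence,
        "expected_confidence": expected_confidence,
        "lift": confidence / expected_confidence,
    }


# Invented toy data: D* stand for diagnoses, S* for services.
transactions = [
    {"D1", "S1"},
    {"D1", "S1"},
    {"D1", "S2"},
    {"D2", "S1"},
    {"D2", "S2"},
]

metrics = association_metrics(transactions, "D1", "S1")
print(metrics)
```

Here D1 appears in 3 of 5 transactions, S1 in 3 of 5, and both together in 2, so Confidence is 2/3, Expected Confidence is 3/5, and Lift is slightly above 1, a weak positive association.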

Python programs :-

(1) Clean raw data:-

  • Python program to clean raw csv files: ~/association_mining/step01_clean_raw_data/CleanRawData.py
  • INPUT: Raw input csv files at ~/association_mining/step01_clean_raw_data/raw_csv_files/*.csv
  • OUTPUT: Clean csv files at ~/association_mining/step02_association_mining/clean_csv_files/

(2) Association Mining:-

  • Python program to find associations: ~/association_mining/step02_association_mining/AssociationMining.py
  • INPUT: Clean csv files at ~/association_mining/step02_association_mining/clean_csv_files/*.csv
  • OUTPUT: ~/association_mining/step02_association_mining/clean_csv_files/tran_df.csv

Data explained :- (Input data intentionally not provided)

Data is a sample of claims data. Columns explained below:-

RAW CSV FILES

(1) raw_csv_files/tran.csv

  • tid: Transaction ID. This is equivalent to a claim id. A claim is submitted by a provider for receiving payment. This tid is the metric counted for finding associations.
  • servprov: Servicing Provider ID. This is just an ID column.
  • diagcode: Diagnosis ID. This is just an ID column, not the actual diagnosis code.
  • servcode: Service Code. This is just an ID column, not the actual service code. The claim tells us which provider rendered what service against which diagnoses.
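
As a rough sketch of how rows shaped like tran.csv could feed both kinds of association (Diagnosis to Service, and Diagnosis and Provider to Service), the snippet below counts distinct tid values per LHS and per RHS and derives lift. The sample rows and the helper name `lift_table` are invented for illustration; the actual AssociationMining.py may work differently.

```python
from collections import defaultdict

# Invented sample rows using the tran.csv column names.
rows = [
    {"tid": 1, "servprov": 10, "diagcode": 100, "servcode": 500},
    {"tid": 2, "servprov": 10, "diagcode": 100, "servcode": 500},
    {"tid": 3, "servprov": 11, "diagcode": 100, "servcode": 501},
    {"tid": 4, "servprov": 11, "diagcode": 101, "servcode": 500},
    {"tid": 5, "servprov": 10, "diagcode": 101, "servcode": 501},
]


def lift_table(rows, lhs_columns, rhs_column="servcode"):
    """Return {(lhs, rhs): lift}, counting distinct tid values."""
    all_tids = set()
    lhs_tids = defaultdict(set)  # LHS value(s) -> tids containing them
    rhs_tids = defaultdict(set)  # RHS value -> tids containing it
    for row in rows:
        lhs = tuple(row[col] for col in lhs_columns)
        all_tids.add(row["tid"])
        lhs_tids[lhs].add(row["tid"])
        rhs_tids[row[rhs_column]].add(row["tid"])

    total = len(all_tids)
    table = {}
    for lhs, l_tids in lhs_tids.items():
        for rhs, r_tids in rhs_tids.items():
            both = len(l_tids & r_tids)
            if both == 0:
                continue  # the pair never occurs together; skip it
            confidence = both / len(l_tids)
            expected_confidence = len(r_tids) / total
            table[(lhs, rhs)] = confidence / expected_confidence
    return table


diag_to_serv = lift_table(rows, ["diagcode"])
diag_prov_to_serv = lift_table(rows, ["diagcode", "servprov"])
print(diag_to_serv[((100,), 500)])
print(diag_prov_to_serv[((100, 10), 500)])
```

In this toy data, adding the provider to the LHS raises the lift for service 500, because provider 10 rendered that service on every claim carrying diagnosis 100.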

(2) raw_csv_files/diag.csv

  • dimDiagnosisID: same as diagcode in the transactions file. This is just an ID column.
  • DiagnosisCode: Diagnosis code present on the claim.
  • DiagnosisShortDesc: Short description of the diagnosis.
  • DiagnosisLongDesc: Long description of the diagnosis.

(3) raw_csv_files/prov.csv

  • dimProviderID: same as servprov in the transactions file. This is just an ID column.
  • ProviderName: Provider's name

(4) raw_csv_files/serv.csv

  • dimServiceCodeID: same as servcode in the transactions file. This is just an ID column.
  • ServiceCode: Service code present on the claim.
  • ServiceCodeShortDesc: Short description of the service rendered.
  • ServiceCodeLongDesc: Long description of the service rendered.
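
Because the transactions file holds only ID columns, the lookup files are needed to turn a row into something readable. A minimal sketch of that lookup using the standard `csv` module follows; the file contents, codes and descriptions are placeholders invented for the example, and the helper `load_lookup` is hypothetical.

```python
import csv
import io

# Placeholder contents standing in for diag.csv and serv.csv (invented values).
diag_csv = io.StringIO(
    "dimDiagnosisID,DiagnosisCode,DiagnosisShortDesc,DiagnosisLongDesc\n"
    "100,DX-EXAMPLE,Example diagnosis,Example diagnosis long description\n"
)
serv_csv = io.StringIO(
    "dimServiceCodeID,ServiceCode,ServiceCodeShortDesc,ServiceCodeLongDesc\n"
    "500,SV-EXAMPLE,Example service,Example service long description\n"
)


def load_lookup(handle, key_column):
    """Index the rows of a lookup file by its ID column."""
    return {row[key_column]: row for row in csv.DictReader(handle)}


diag_by_id = load_lookup(diag_csv, "dimDiagnosisID")
serv_by_id = load_lookup(serv_csv, "dimServiceCodeID")

# One invented transaction row; csv readers yield strings, so IDs are strings here.
tran_row = {"tid": "1", "servprov": "10", "diagcode": "100", "servcode": "500"}

described = {
    "tid": tran_row["tid"],
    "diagnosis": diag_by_id[tran_row["diagcode"]]["DiagnosisShortDesc"],
    "service": serv_by_id[tran_row["servcode"]]["ServiceCodeShortDesc"],
}
print(described)
```

With real files, `io.StringIO(...)` would be replaced by `open(path, newline="")` on the corresponding csv file.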

CLEAN CSV FILES

(1) clean_csv_files/clean_tran.csv. This is generated by cleaning the raw csv file.

  • tid: Transaction ID. This is equivalent to a claim id. A claim is submitted by a provider for receiving payment. This tid is the metric counted for finding associations.
  • servprov: Servicing Provider ID. This is just an ID column.
  • diagcode: Diagnosis ID. This is just an ID column, not the actual diagnosis code.
  • servcode: Service Code. This is just an ID column, not the actual service code. The claim tells us which provider rendered what service against which diagnoses.

(2) clean_csv_files/clean_diag.csv. This is generated by cleaning the raw csv file.

  • dimDiagnosisID: same as diagcode in the transactions file. This is just an ID column.
  • DiagnosisCode: Diagnosis code present on the claim.
  • DiagnosisShortDesc: Short description of the diagnosis.
  • DiagnosisLongDesc: Long description of the diagnosis.

(3) clean_csv_files/clean_prov.csv. This is generated by cleaning the raw csv file.

  • dimProviderID: same as servprov in the transactions file. This is just an ID column.
  • ProvName: Provider's name randomly scrambled.

(4) clean_csv_files/clean_serv.csv. This is generated by cleaning the raw csv file.

  • dimServiceCodeID: same as servcode in the transactions file. This is just an ID column.
  • ServiceCode: Service code present on the claim.
  • ServiceCodeShortDesc: Short description of the service rendered.
  • ServiceCodeLongDesc: Long description of the service rendered.

OUTPUT CSV FILE: clean_csv_files/tran_df.csv. This file is the final output with all association mining metrics calculated.


A few Power BI report screenshots of the final output :-

(1) Services associated with diagnosis TYPE 2 DIABETES MELLITUS PDR MACULAR EDEMA BILATERAL

screenshot_01.png

(2) Diagnoses associated with service Treatment of extensive or progressive retinopathy (eg, diabetic retinopathy), photocoagulation

screenshot_02.png

(3) Services associated with diagnosis Primary osteoarthritis, right hand

screenshot_03.png

(4) Diagnoses associated with service APPLICATION CAST ELBOW FINGER SHORT ARM

screenshot_04.png

(5) Services associated with diagnosis Osteonecrosis in diseases classified elsewhere, left thigh

screenshot_05.png

(6) Services associated with diagnosis Eyelid retraction left upper eyelid

screenshot_06.png

These associations have also been deployed as an API Web Application. See https://github.com/nsb700/association-mining-webapp