This repository contains an extract of a broader framework built to discover anomalous patterns among providers in MBS and PBS datasets, as part of an Industry PhD (IPhD) scholarship.
It was designed to allow rapid prototyping of anomaly detection processes, to provide flexibility in input data, and to support reproducibility and traceability through automated logging functions.
Implemented detection processes will be released as the papers describing them are published.
Three processes are currently included:
• An association rule mining process which compares graphs of reference and provider models, to be presented at MEDINFO 2023 [1].
• A context discovery and cost estimation process, presented at HEALTHINF 2023 [2].
• A sequence pattern detection process, presented at the Machine Learning and Artificial Intelligence in Bioinformatics and Medical Informatics workshop of the IEEE International Conference on Bioinformatics and Biomedicine 2022 [3].
More information about the design rationale behind the processes was presented at the 2023 Health Informatics and Knowledge Management conference [4].
The data analysis framework is set up as a module, with analyses specified in related test cases within sub-modules.
The code is split into two types within the src folder:
• The 'core' folder contains tools shared across analyses, such as logging, graphing, and file I/O. It also contains the abstract classes for data extraction and analysis, which act as templates for data analyses.
• The 'analyses' folder contains the implemented anomaly detection processes, which extend the abstract classes from the 'core' folder. Each analysis is separated into its own sub-module, and contains at minimum an implementation of the abstract analysis class, and a data extraction folder with an implementation of the abstract data extraction class. Each analysis includes its own README.md for further information. Output from each run of an analysis is located in the top-level 'Output' folder.
Parameters for each analysis are passed in when an analysis is run. Available parameters are established in the RequiredParams sub-class within each analysis file, and may be set either in the analysis file or in the analysis_runner file.
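As a minimal sketch of this parameter mechanism, the example below nests a RequiredParams class inside an analysis and lets a runner override its defaults. Only the name RequiredParams comes from this repository; the use of a dataclass, the parameter names `min_support` and `state`, and the `ExampleAnalysis` class are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


class ExampleAnalysis:
    """Hypothetical analysis; only the RequiredParams name is from the repo."""

    @dataclass
    class RequiredParams:
        # Hypothetical parameters, with defaults set in the analysis file
        min_support: float = 0.05
        state: str = "all"

    def __init__(self, params=None):
        # Values passed in (e.g. from the runner) override the defaults
        overrides = params or {}
        known = set(self.RequiredParams.__dataclass_fields__)
        unknown = set(overrides) - known
        if unknown:
            raise KeyError(f"Unknown parameters: {sorted(unknown)}")
        self.params = replace(self.RequiredParams(), **overrides)


# Defaults come from the analysis file; one value is overridden by the caller
analysis = ExampleAnalysis({"min_support": 0.1})
print(analysis.params)
```

Rejecting unknown keys is a design choice in this sketch: it makes a misspelt parameter fail immediately rather than silently fall back to a default.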
Analyses are run with analysis_runner.py in the top-level folder. The runner calls the functions that the abstract data extraction and analysis classes require to exist. What those functions do is specified in the sub-modules of the analyses folder, but their definitions should remain the same.
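The relationship between the abstract classes, an analysis, and the runner might look roughly like the sketch below. All class and method names here (`AbstractDataExtraction`, `AbstractAnalysis`, `extract_data`, `run_analysis`, `run`) are hypothetical stand-ins and are not taken from the repository; the point is only that the runner depends on the fixed function definitions, not on what each analysis does inside them.

```python
from abc import ABC, abstractmethod


class AbstractDataExtraction(ABC):
    """Hypothetical stand-in for the abstract data extraction class."""

    @abstractmethod
    def extract_data(self):
        """Return the data subset needed by one analysis."""


class AbstractAnalysis(ABC):
    """Hypothetical stand-in for the abstract analysis class."""

    @abstractmethod
    def run_analysis(self, data):
        """Run the detection process on the extracted data."""


class ExampleExtraction(AbstractDataExtraction):
    def extract_data(self):
        # Toy claims as (provider id, item code) pairs
        return [("P1", "A"), ("P1", "B"), ("P2", "A")]


class ExampleAnalysis(AbstractAnalysis):
    def run_analysis(self, data):
        # Count claims per provider
        counts = {}
        for provider, _item in data:
            counts[provider] = counts.get(provider, 0) + 1
        return counts


def run(extraction: AbstractDataExtraction, analysis: AbstractAnalysis):
    """Runner: relies only on the method names fixed by the abstract classes."""
    data = extraction.extract_data()
    return analysis.run_analysis(data)


result = run(ExampleExtraction(), ExampleAnalysis())
print(result)  # {'P1': 2, 'P2': 1}
```

Because the abstract methods are declared with `abstractmethod`, a sub-class that omits one cannot be instantiated, so a missing function is caught before the runner starts.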
Medical claims datasets have millions of rows and many features. It is expected that a subset will be used for each analysis, as the class of problems each solves is not generally applicable to a whole dataset.
The framework expects each analysis folder to contain a 'data_extraction' folder, which contains scripts extending the abstract data extraction class. The scripts should contain code for extracting a data subset from the primary data source.
Loading previously extracted and processed data from a file is a possible alternative, which bypasses the extraction and processing steps.
In this case, data is expected to be stored in the top-level 'data' folder.
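One way to picture the choice between extracting data and loading a previously saved file is the sketch below. It is an assumption-laden illustration, not the framework's implementation: the helper name `load_or_extract`, the use of pickle as the file format, and the file name are all hypothetical, and the sketch also saves the extracted data automatically, which the framework may not do.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_extract(data_file: Path, extract):
    """Load data from data_file if it exists; otherwise extract and save it."""
    if data_file.exists():
        # Bypass the extraction and processing steps
        with data_file.open("rb") as f:
            return pickle.load(f)
    data = extract()
    data_file.parent.mkdir(parents=True, exist_ok=True)
    with data_file.open("wb") as f:
        pickle.dump(data, f)
    return data


calls = []


def toy_extract():
    # Stand-in for a slow extraction from the primary data source
    calls.append(1)
    return [("P1", "A"), ("P2", "B")]


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "data" / "example.pkl"  # hypothetical file name
    first = load_or_extract(data_file, toy_extract)   # runs the extraction
    second = load_or_extract(data_file, toy_extract)  # loads from the file
```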
For the IPhD project, two data sources were used: a publicly released (now retracted) sample of 10% of patients in the Australian Medicare Benefits Schedule (MBS) and Pharmaceutical Benefits Scheme (PBS); and the full, current versions of those data sets stored at the Australian Government Department of Health. The former was contained in parquet files, and was often used for prototyping even where the full set was used for the final publication. Data from the latter was accessed using a custom wrapper for a Spark/Hadoop implementation. As the wrapper contains some sensitive information, it has been removed from the public release of this code.
This code was built with Python 3.8.5 on Ubuntu 20.04.1 LTS.
Python package versions are documented in requirements.txt.
Some external packages may be required.
• The 'sequence_detection' analysis relies on SPMF, available here.
• Python's pygraphviz package depends on graphviz. Documentation for installing pygraphviz can be found here.
• Construction of graph images can alternatively be done with rpy2 and R's visNetwork package.
• A config file must be set up with header information. This allows the same analysis to be used across data sets that have similar features but different headers. An example is given in example_config.json.
• Test parameters, including the config file location and which data analysis file to run, should be specified in analysis_runner.py in the parent directory. Required parameters for an analysis can be found within the relevant file.
• The test can then be run with python analysis_runner.py
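The role of the config file in the steps above can be sketched as a mapping from the names an analysis uses to the headers of a particular data set. The keys and header names below are hypothetical; the real structure is the one defined in example_config.json, which may differ.

```python
import json

# Hypothetical config contents; the real keys are in example_config.json
config_text = """
{
    "provider_id": "PRVDR_NUM",
    "item_code": "ITEM",
    "cost": "FEE_CHARGED"
}
"""


def rename_headers(row: dict, config: dict) -> dict:
    """Map data-set-specific headers to the names used by an analysis."""
    reverse = {header: name for name, header in config.items()}
    # Headers not listed in the config are passed through unchanged
    return {reverse.get(header, header): value for header, value in row.items()}


config = json.loads(config_text)
row = {"PRVDR_NUM": "P1", "ITEM": "A", "FEE_CHARGED": 10.0, "OTHER": 1}
print(rename_headers(row, config))
```

Under this assumption, pointing the same analysis at a second data set only requires a second config file with that data set's headers.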
Unit tests are located in the top-level 'tests' folder. They can be run within VSCode using the unittest framework, or programmatically, e.g. with the unittests script in the parent directory.
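Running tests programmatically with the standard unittest module could look like the sketch below. It does not reproduce the repository's unittests script; `ExampleTest` and `run_tests` are hypothetical names, and an inline test case is used so that the example is self-contained.

```python
import unittest


class ExampleTest(unittest.TestCase):
    """Hypothetical test case standing in for those in the 'tests' folder."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)


def run_tests(suite: unittest.TestSuite) -> bool:
    """Run a suite and report whether every test passed."""
    result = unittest.TextTestRunner(verbosity=2).run(suite)
    return result.wasSuccessful()


# For a whole folder, the suite could instead come from
# unittest.defaultTestLoader.discover("tests")
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
all_passed = run_tests(suite)
```

The same discovery is available from the command line with python -m unittest discover -s tests.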
Please contact me with any questions.
[1] J. Kemp, C. Barker, N. Good, and M. Bain, “Graphical association analysis for identifying variation in provider claims for joint replacement surgery,” in Proceedings of the 19th World Congress on Medical and Health Informatics. Amsterdam, Holland: IOS Press, 2023 (accepted for publication).
[2] J. Kemp, C. Barker, N. Good, and M. Bain, “Context discovery and cost prediction for detection of anomalous medical claims, with ontology structure providing domain knowledge,” in Proceedings of the 16th International Joint Conference on Biomedical Engineering Systems and Technologies - Volume 5: HEALTHINF. California, USA: SCITEPRESS, 2023, pp. 29–40.
[3] J. Kemp, C. Barker, N. Good, and M. Bain, “Sequential pattern detection for identifying courses of treatment and anomalous claim behaviour in medical insurance,” in 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2022, pp. 3039–3046.
[4] J. Kemp, C. Barker, N. Good, and M. Bain, “Developing an anomaly detection framework for Medicare claims,” in Proceedings of ACSW 2023: Australasian Computer Science Week 2023. New York, NY, USA: Association for Computing Machinery, 2023 (accepted for publication).