Pharmaceutical Product Clustering
This project clusters pharmaceutical products from multiple sources into coherent groups based on their names, dosages, and forms. The goal is a structured, categorized dataset for further analysis or applications in the pharmaceutical domain.
The dataset used for this project consists of pharmaceutical product information, including:
- Medicine names
- Dosages
- Forms
- Sources
The data comes from various pharmaceutical suppliers and is grouped into clusters according to the similarity of the medicine names.
The data preprocessing steps include:
- Cleaning and standardizing the medicine names
- Handling missing data
- Creating a consistent naming format
- Assigning cluster labels to each product
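The cleaning and standardizing steps above could look roughly like the following. This is a minimal sketch, not the notebooks' actual code: the helper name `normalize_name`, the list of dosage units, and the chosen naming format (lowercase, no punctuation, dosage and unit joined) are all assumptions made for illustration.

```python
import math
import re


def normalize_name(raw):
    """Return a cleaned product name, or None if the value is missing.

    Hypothetical helper: the rules below are illustrative assumptions,
    not the project's documented naming format.
    """
    # Treat None, NaN, and blank strings as missing data.
    if raw is None or (isinstance(raw, float) and math.isnan(raw)):
        return None
    if not str(raw).strip():
        return None

    name = str(raw).lower()
    # Replace punctuation with spaces, keeping decimal points for dosages.
    name = re.sub(r"[^a-z0-9.\s]", " ", name)
    # Join a number and its unit: "500 mg" becomes "500mg" (assumed unit list).
    name = re.sub(r"(\d)\s+(mg|ml|g|mcg)\b", r"\1\2", name)
    # Collapse repeated whitespace.
    return re.sub(r"\s+", " ", name).strip()


cleaned = normalize_name("  Paracetamol  500 MG, Tablet ")
```

Normalizing before matching means that differences in case, spacing, and punctuation do not lower the similarity score between two spellings of the same product.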
The clustering process involves:
- Using RapidFuzz and PolyFuzz for matching similar product names
- Assigning each product to a cluster and subcluster
- Organizing the products into coherent groups
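To illustrate the matching step, here is a small sketch that groups names by fuzzy similarity. It is an assumption about how such grouping could work, not the method used in the notebooks: the greedy first-match strategy, the `cluster_names` function, and the threshold of 85 are all hypothetical. It uses RapidFuzz's `fuzz.ratio` when the library is installed and falls back to the standard library's `difflib` otherwise; PolyFuzz is not used here.

```python
try:
    from rapidfuzz import fuzz

    def similarity(a, b):
        # RapidFuzz returns a score between 0 and 100.
        return fuzz.ratio(a, b)
except ImportError:
    from difflib import SequenceMatcher

    def similarity(a, b):
        # Stand-in when RapidFuzz is unavailable; scaled to 0-100.
        return 100.0 * SequenceMatcher(None, a, b).ratio()


def cluster_names(names, threshold=85.0):
    """Greedily assign each name to the first cluster whose representative
    scores at or above the threshold; otherwise start a new cluster.

    Hypothetical sketch: the threshold and strategy are assumptions.
    """
    representatives = []  # first name seen in each cluster
    labels = []
    for name in names:
        for cluster_id, rep in enumerate(representatives):
            if similarity(name, rep) >= threshold:
                labels.append(cluster_id)
                break
        else:
            representatives.append(name)
            labels.append(len(representatives) - 1)
    return labels


names = [
    "paracetamol 500mg tablet",
    "paracetamol 500 mg tablet",
    "ibuprofen 200mg capsule",
    "ibuprofen 200 mg capsule",
    "amoxicillin 250mg syrup",
]
labels = cluster_names(names)
```

A greedy pass is order-dependent and compares each name only against one representative per cluster, so it is best read as a starting point. Subclusters could be produced by rerunning the same function within each cluster at a stricter threshold.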
The project's file structure includes:
- The dataset in CSV format
- Jupyter notebooks for data preprocessing and clustering
- The final clustered dataset in CSV format
- This README file
To use the project, follow these steps:
- Clone the GitHub repository to your local machine.
- Run the Jupyter notebooks for data preprocessing and clustering.
- Access the final clustered dataset for your analysis or applications.
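Once the final clustered dataset is available, it can be read with the standard library and grouped by cluster label. The sketch below is illustrative only: the column names `name` and `cluster` are assumptions about the CSV layout, and the in-memory sample stands in for the real file.

```python
import csv
import io
from collections import defaultdict


def group_by_cluster(csv_file, cluster_col="cluster", name_col="name"):
    """Map each cluster label to the list of product names in it.

    The column names are hypothetical; adjust them to the actual CSV header.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[cluster_col]].append(row[name_col])
    return dict(groups)


# Illustrative sample standing in for the project's output file.
sample = io.StringIO(
    "name,cluster\n"
    "paracetamol 500mg tablet,0\n"
    "paracetamol 500 mg tablet,0\n"
    "ibuprofen 200mg capsule,1\n"
)
groups = group_by_cluster(sample)
```

For the real dataset, pass an open file handle for the output CSV (opened with `newline=""`) instead of the sample.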
If you would like to contribute to this project, please follow these steps:
- Fork the project.
- Create a new branch for your feature or bug fix.
- Make your changes and commit them.
- Submit a pull request to the main project repository.
This project is licensed under the MIT License; see the LICENSE file for details.