This project extends the ChemNLP library to address challenges in materials chemistry using natural language processing (NLP). Developed as part of a Master's thesis at the University of Paris, it aims to automate document classification and named entity recognition for chemical texts using current NLP techniques.
The main objectives of the ChemNLP library, as refined by this research, are to:
- Develop ChemNLP for advanced NLP applications in materials chemistry.
- Curate datasets from arXiv and PubChem for subsequent NLP processing.
- Compare and develop Machine Learning models for data classification.
- Tackle domain-specific challenges and generate abstract summaries.
- Integrate with external datasets, such as Density Functional Theory (DFT) data, for enriched data analysis.
- Build a robust infrastructure to advance research in materials chemistry.
Mentored by Mr. Nadif Mohamed, the team includes:
- Meryem Belkaid
- Abir Oumghar
- Hafsa Boanani
The repository includes several key components critical to our analysis:
- notebooks/: A collection of Jupyter notebooks that capture the project's methodology and results:
  - ChemNLP_Part1: Data Preprocessing and Initial Exploration
  - ChemNLP_Part2: Feature Engineering and Model Training
  - ChemNLP_Part3: Model Evaluation and Result Interpretation
- pickles/: Serialized Python objects containing model states and preprocessed data to expedite the research process and ensure reproducibility.
- Report/: The detailed project report provides an in-depth analysis and discussion of the research conducted and outcomes achieved.
Note: The pickle files in the pickles/ directory contain all the necessary preprocessed data and model states. You can load them to run the second notebook, ChemNLP_Part2: Feature Engineering and Model Training, directly, without repeating the initial data processing steps.
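As a minimal sketch of how the serialized objects might be loaded before running the second notebook, the snippet below reads every pickle file in a directory into a dictionary. The `.pkl` extension, the helper names `load_pickle` and `load_all_pickles`, and the demo file name are assumptions made for illustration; check the actual file names in pickles/ and adjust accordingly. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load one serialized object from a pickle file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def load_all_pickles(directory):
    """Return {file stem: object} for every *.pkl file in a directory."""
    return {p.stem: load_pickle(p) for p in sorted(Path(directory).glob("*.pkl"))}


# Demonstration with a throwaway directory standing in for pickles/.
# "demo_abstracts.pkl" is a hypothetical name, not a file from this repository.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_abstracts.pkl"
    with open(demo_path, "wb") as handle:
        pickle.dump({"texts": ["graphene band gap"], "labels": ["cond-mat"]}, handle)
    artifacts = load_all_pickles(tmp)

print(sorted(artifacts))
```

In a notebook, the same helper could be pointed at the repository directory, for example `load_all_pickles("pickles")`, assuming the files there use the `.pkl` extension.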
To explore the foundational work that our project builds on, see the original ChemNLP project on GitHub.
The technical workflow of our project encompasses:
- Utilization of advanced NLP techniques for text analysis within the domain of materials chemistry.
- Implementation of Machine Learning algorithms for data classification and named entity recognition.
- Application of dimensionality reduction techniques such as PCA, t-SNE, and UMAP for insightful data visualization.
- Usage of serialized pickle files for effective data management and reproducibility of results.
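The classification and visualization steps listed above can be illustrated with a small scikit-learn sketch. It is not the project's actual pipeline: the TF-IDF features, the logistic regression classifier, and the toy texts and labels are all assumptions chosen to keep the example short. PCA is used for the 2D projection because it ships with scikit-learn; t-SNE or UMAP could be substituted at the same step.

```python
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; the texts and labels are invented for illustration only.
texts = [
    "density functional theory band gap of perovskite",
    "band structure and electronic density of graphene",
    "synthesis of organic molecule via catalytic reaction",
    "reaction yield of catalytic organic synthesis",
]
labels = ["physics", "physics", "chemistry", "chemistry"]

# Turn raw text into sparse TF-IDF feature vectors.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)

# Fit a simple linear classifier on the document vectors.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(features, labels)

# PCA expects a dense array, so densify the small TF-IDF matrix first.
coords = PCA(n_components=2).fit_transform(features.toarray())
print(coords.shape)  # (4, 2): one 2D point per document

# Classify an unseen snippet with the same vectorizer.
predicted = classifier.predict(vectorizer.transform(["catalytic reaction synthesis"]))
print(predicted[0])
```

The two columns of `coords` can be passed to any plotting library to produce a scatter plot colored by label. For a large corpus, densifying the TF-IDF matrix may not fit in memory, in which case a sparse-friendly reduction such as `TruncatedSVD` is a common alternative.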
Our study has successfully advanced the application of NLP in materials chemistry. We have highlighted innovative applications while critically addressing specific challenges. This project paves the