
πŸ” ChemNLP-MaterialsAnalysis: Enhancing materials chemistry research with advanced NLP. Key features: πŸ“š Integrates with arXiv & PubChem datasets πŸ€– Applies BERT embeddings & ML clustering (KMeans, t-SNE, UMAP, PCA) πŸ”„ Uses pickle for efficient data handling 🌐 Aims for deeper insights & accelerated discovery in materials science.


AbirOumghar/ChemNLP-MaterialsAnalysis


ChemNLP-MaterialsAnalysis

Introduction

This project extends the ChemNLP library to address the challenges of materials chemistry with NLP. It was created for a Master's thesis at the University of Paris and aims to automate the classification and recognition of chemical documents using recent NLP techniques for in-depth text analysis.

Objectives

The main objectives of the ChemNLP library, as refined by this research, are to:

  • Develop ChemNLP for advanced NLP applications in materials chemistry.
  • Curate datasets from arXiv and PubChem for subsequent NLP processing.
  • Compare and develop Machine Learning models for data classification.
  • Tackle domain-specific challenges and generate abstract summaries.
  • Integrate with Density Functional Theory (DFT) datasets for enriched data analysis.
  • Build a robust infrastructure to advance research in materials chemistry.

Team

Mentored by Mr. Nadif Mohamed, the team includes:

  • Meryem Belkaid
  • Abir Oumghar
  • Hafsa Boanani

Repository Structure

The repository includes several key components critical to our analysis:

Note: The pickle files in the pickles/ directory contain all the necessary preprocessed data and model states. You can use them to run the second notebook, ChemNLP_Part2: Feature Engineering and Model Training, directly, without repeating the initial data processing steps.
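As an illustration of how serialized state can be saved and restored between notebooks, here is a minimal sketch using Python's standard pickle module. The helper names (save_state, load_state), the file name example_state.pkl, and the dictionary contents are assumptions made for this example; they are not the repository's actual files or API.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(obj, path):
    """Serialize obj to path, creating parent directories if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(obj, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_state(path):
    """Load a pickled object. Only unpickle files from a source you trust."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


if __name__ == "__main__":
    # Hypothetical stand-in for preprocessed data; not the real pickle contents.
    state = {"titles": ["Example abstract"], "labels": [0]}
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "pickles" / "example_state.pkl"
        save_state(state, target)
        restored = load_state(target)
    print(restored == state)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded when they come from a trusted source, such as this repository's own pickles/ directory.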

To explore the foundational work that our project builds upon, please visit the original ChemNLP project here: ChemNLP on GitHub.

Technical Workflow

The technical workflow of our project consists of:

  • Advanced NLP techniques for text analysis in materials chemistry.
  • Machine Learning algorithms for data classification and named entity recognition.
  • Dimensionality reduction techniques such as PCA, t-SNE, and UMAP for data visualization.
  • Serialized pickle files for data management and reproducibility of results.
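The clustering step of such a workflow can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the project's implementation: a bag-of-words count vector stands in for the BERT embeddings mentioned in the project description, and a plain-Python Lloyd's algorithm stands in for a library KMeans. The function names and sample texts are hypothetical.

```python
import math
import random
from collections import Counter


def bag_of_words(texts):
    """Turn texts into term-count vectors over a shared, sorted vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([float(counts[term]) for term in vocab])
    return vectors, vocab


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster index per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Hypothetical sample texts, not taken from the arXiv or PubChem data.
    texts = [
        "perovskite solar cell",
        "perovskite solar cell efficiency",
        "polymer battery electrolyte",
        "polymer battery electrolyte lithium",
    ]
    vectors, vocab = bag_of_words(texts)
    print(kmeans(vectors, k=2))
```

In practice, the count vectors would be replaced by sentence or document embeddings, and the resulting clusters would typically be inspected after projecting the vectors to two dimensions with PCA, t-SNE, or UMAP.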

Conclusion

Our study has successfully advanced the application of NLP in materials chemistry. We have highlighted innovative applications while critically addressing specific challenges. This project paves the
