- https://github.com/modelscope/data-juicer
- https://github.com/katanaml/sparrow
- https://github.com/stochasticai/xTuring
- https://github.com/labring/FastGPT (FastGPT includes data processing features)
-
Data Collection
-
Data Pre-processing
-
Data cleaning
- Handling missing values
- Noise reduction
- Consistency checks: ensure the data across the dataset adheres to consistent formats, rules, or conventions
- Deduplication
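
The four cleaning steps above can be sketched on a small in-memory dataset. This is a minimal illustration using only the Python standard library; the records, the `clean` helper, and the plausible temperature range are hypothetical and chosen only to make each step visible.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "city": " Paris ", "temp_c": 21.5},
    {"id": 2, "city": "paris", "temp_c": None},
    {"id": 3, "city": "Berlin", "temp_c": 19.0},
    {"id": 3, "city": "Berlin", "temp_c": 19.0},   # exact duplicate
    {"id": 4, "city": "BERLIN", "temp_c": 400.0},  # implausible reading
]

def clean(rows, low=-60.0, high=60.0):
    # Consistency check: trim whitespace and use one casing convention for city names.
    rows = [{**r, "city": r["city"].strip().title()} for r in rows]

    # Noise reduction: treat readings outside the assumed plausible range as missing.
    def in_range(value):
        return value is not None and low <= value <= high
    rows = [{**r, "temp_c": r["temp_c"] if in_range(r["temp_c"]) else None} for r in rows]

    # Handling missing values: impute with the median of the remaining readings.
    fill = median(r["temp_c"] for r in rows if r["temp_c"] is not None)
    rows = [{**r, "temp_c": fill if r["temp_c"] is None else r["temp_c"]} for r in rows]

    # Deduplication: keep the first occurrence of each identical record.
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

cleaned = clean(records)
for row in cleaned:
    print(row)
```

Median imputation and a fixed value range are only one possible choice; real datasets may call for dropping rows, model-based imputation, or domain-specific rules instead.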
-
Data Feature Engineering
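import pandas as pd

# Sample data with a date column, a categorical column, and a numeric column
data = {
    'date': ['2024-01-01', '2024-01-02', '2024-01-03'],
    'category': ['A', 'B', 'A'],
    'value': [10, 20, 30]
}
df = pd.DataFrame(data)

# Parse the date strings, then derive calendar features from them
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day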
-
Data Parsing
Data parsing converts data from one format to another. It is widely used for data structuring, generally to make existing, often unstructured and hard-to-read data more comprehensible.
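
As a small illustration of parsing, the sketch below turns ad-hoc text lines into structured records and serializes them as JSON. The line layout ("date level key=value ...") and the `parse_line` helper are assumptions made up for this example, not a format used elsewhere in this document.

```python
import json
import re

# Hypothetical raw lines in an ad-hoc "date level key=value ..." layout.
raw_lines = [
    "2024-01-01 INFO user=alice action=login",
    "2024-01-02 ERROR user=bob action=upload",
    "not a valid line",
]

LINE_RE = re.compile(r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<rest>.*)$")

def parse_line(line):
    """Return a dict for a well-formed line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    record = {"date": match["date"], "level": match["level"]}
    # Split the remaining "key=value" pairs into named fields.
    for pair in match["rest"].split():
        key, sep, value = pair.partition("=")
        if sep:
            record[key] = value
    return record

# Keep only the lines that could be parsed.
parsed = [r for r in map(parse_line, raw_lines) if r is not None]
as_json = json.dumps(parsed, indent=2)
print(as_json)
```

Returning `None` for malformed lines keeps the decision about bad input (skip, log, or fail) with the caller.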
-
Data Normalization
Normalization is a crucial pre-processing technique that standardizes textual data, ensuring uniform and consistent language usage and reducing complexity for NLP models. It involves converting text to a common case, typically lowercase, to eliminate variations arising from capitalization.
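
A minimal sketch of text normalization with the standard library follows. Besides lowercasing, it also applies Unicode NFKC normalization and whitespace collapsing; those two extra steps are assumptions about what a typical pipeline might include, and the `normalize_text` name is hypothetical.

```python
import re
import unicodedata

def normalize_text(text):
    # Unicode normalization folds compatibility forms (e.g. full-width letters) into canonical ones.
    text = unicodedata.normalize("NFKC", text)
    # Lowercasing removes variation caused by capitalization.
    text = text.lower()
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()

samples = ["  The  Capital of FRANCE\tis Paris. ", "Ｐａｒｉｓ"]
normalized = [normalize_text(s) for s in samples]
print(normalized)
```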
-
-
Data Storage
-
Data Analysis
Data analysis surfaces insights to share with the team. Data visualization is a part of data analysis that helps present those insights clearly.
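
As a rough sketch of the analysis step, the example below aggregates values per category and renders a plain-text bar chart. The rows are hypothetical, and the text chart is only a dependency-free stand-in for a plotting library; `text_bar_chart` is a made-up helper.

```python
from collections import defaultdict

# Hypothetical rows with a date, a category, and a numeric value.
rows = [
    {"date": "2024-01-01", "category": "A", "value": 10},
    {"date": "2024-01-02", "category": "B", "value": 20},
    {"date": "2024-01-03", "category": "A", "value": 30},
]

# Analysis: total value per category.
totals = defaultdict(int)
for row in rows:
    totals[row["category"]] += row["value"]

def text_bar_chart(totals, width=20):
    """Render one bar per category, scaled so the largest total spans `width` characters."""
    peak = max(totals.values())
    lines = []
    for name in sorted(totals):
        bar = "#" * round(totals[name] / peak * width)
        lines.append(f"{name} | {bar} {totals[name]}")
    return "\n".join(lines)

chart = text_bar_chart(totals)
print(chart)
```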
Here is an example of Retrieval-Augmented Generation (RAG) that uses LangChain with a language model and data processing. It demonstrates how to preprocess data, set up a retrieval mechanism, and generate answers with a language model.
pip install langchain openai faiss-cpu tiktoken pandas
# Note: these import paths match older LangChain releases; newer releases
# move them to the langchain_community and langchain_openai packages.
# The OPENAI_API_KEY environment variable must be set.
import pandas as pd
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# 1. Create a sample dataset
data = {
    'context': [
        "The capital of France is Paris.",
        "Jane Austen wrote 'Pride and Prejudice'.",
        "The blue whale is the largest mammal.",
        # ......
    ]
}
df = pd.DataFrame(data)

# 2. Data preprocessing: convert context to a list
contexts = df['context'].tolist()

# 3. Create an OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# 4. Build a vector store using FAISS
vectorstore = FAISS.from_texts(contexts, embedding_model)

# 5. Initialize the LangChain retrieval QA chain using OpenAI's chat model
llm = ChatOpenAI(model_name="gpt-3.5-turbo")  # Change to "gpt-4" if you have access
retrieval_qa = RetrievalQA.from_chain_type(llm=llm, retriever=vectorstore.as_retriever())

# 6. Use the model for inference
questions = [
    "What is the capital of France?",
    "Who wrote 'Pride and Prejudice'?",
    "What is the largest mammal?"
]
for question in questions:
    answer = retrieval_qa.run(question)
    print(f"Q: {question}\nA: {answer}\n")
- Create a Sample Dataset: We define a simple dataset with contexts that will provide answers.
- Data Preprocessing: Convert the context column to a list for easier handling.
- Create an Embedding Model: Use the OpenAIEmbeddings model to create embeddings for the contexts.
- Build a Vector Store: Use FAISS to index the embedded contexts for efficient retrieval.
- Initialize the RAG Chain: Use LangChain to combine the retrieval and generation processes with OpenAI's language model.
- Inference: Loop through predefined questions, retrieve relevant contexts, and generate answers.
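
To make the retrieval step in the list above concrete without API keys, here is a toy, dependency-free sketch. It is not how OpenAIEmbeddings or FAISS work internally: the bag-of-words `embed` function, the linear scan in `retrieve`, and the `build_prompt` helper are all hypothetical stand-ins that only illustrate the embed, index, retrieve, and prompt flow.

```python
import math
import re
from collections import Counter

contexts = [
    "The capital of France is Paris.",
    "Jane Austen wrote 'Pride and Prejudice'.",
    "The blue whale is the largest mammal.",
]

def embed(text):
    """Toy embedding: a bag-of-words count vector (stand-in for a learned embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# "Vector store": every context is stored next to its embedding.
index = [(embed(c), c) for c in contexts]

def retrieve(question, k=1):
    """Return the k contexts most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question):
    """Combine the retrieved context and the question into a prompt for a language model."""
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_prompt("What is the largest mammal?"))
```

A real vector store replaces the linear scan with an approximate nearest-neighbor index, and a learned embedding captures meaning beyond shared words; the overall flow stays the same.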
The preprocessing code can be found in the code folder.
- Normalizing the Content
- Metadata Extraction and Chunking
- Preprocessing PDFs and Images
- Extracting Tables
- Example: Build Your Own RAG Bot
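
As a rough illustration of the "Metadata Extraction and Chunking" topic listed above, the sketch below splits text into overlapping character windows and attaches simple metadata to each chunk. It is not the implementation from the code folder; the `chunk_text` helper, the metadata fields, and the window sizes are assumptions made for this example.

```python
def chunk_text(text, source, chunk_size=40, overlap=10):
    """Split text into overlapping character windows, attaching metadata to each chunk."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        chunks.append({
            "text": piece,
            "source": source,          # where the chunk came from
            "start": start,            # character offsets into the original text
            "end": start + len(piece),
        })
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

# Hypothetical document and file name.
document = ("Data parsing converts data from one format to another. " * 3).strip()
chunks = chunk_text(document, source="example.txt")
for chunk in chunks:
    print(chunk["start"], chunk["end"], repr(chunk["text"]))
```

The overlap keeps sentences that straddle a boundary retrievable from at least one chunk; splitting on sentence or token boundaries instead of raw characters is a common refinement.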