Data Processing Instructions

Copy subset_1_filtered_updated_final_output.csv into the local folder.
- This file is the raw output of the complete PDF extraction pipeline.
Process the raw OCR output with process_raw_output.ipynb.
- The notebook demonstrates how regular expressions are applied through the extract_meaningful_text function.
- Pipeline:
  1. Filter out pipeline errors caused by dirty web-scraped PDFs.
  2. Apply regular expressions to produce continuous training text.
  3. Modify the extract_meaningful_text function as needed for different outputs.
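
The pipeline steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the error markers, the `text` column name, and the helper names `is_pipeline_error`, `extract_meaningful_text_sketch`, and `process_rows` are all assumptions made for the example. The real `extract_meaningful_text` function in process_raw_output.ipynb may use different patterns.

```python
import re

# Hypothetical error markers; the markers the real pipeline emits are not documented here.
ERROR_PATTERN = re.compile(r"\s*(ERROR|Traceback|Exception)\b", re.IGNORECASE)


def is_pipeline_error(text):
    """Return True for empty cells or cells that start like an error message."""
    return not text.strip() or ERROR_PATTERN.match(text) is not None


def extract_meaningful_text_sketch(text):
    """Stand-in for extract_meaningful_text: turn raw OCR text into continuous text."""
    # Rejoin words hyphenated across a line break, e.g. "learn-\ning" -> "learning".
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse newlines, tabs, and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def process_rows(rows, text_column="text"):
    """Step 1: drop error rows. Step 2: clean the remaining text with regex."""
    cleaned = []
    for row in rows:
        raw = row.get(text_column) or ""
        if is_pipeline_error(raw):
            continue
        cleaned.append(extract_meaningful_text_sketch(raw))
    return cleaned


# In-memory rows standing in for the CSV contents (assumed column name: "text").
sample = [
    {"text": "Deep learn-\ning  for\nOCR."},
    {"text": "ERROR: could not parse PDF"},
    {"text": ""},
]
print(process_rows(sample))
```

To run this against the real file, the rows could be read with `csv.DictReader` from subset_1_filtered_updated_final_output.csv, with `text_column` set to whichever column holds the OCR output. Step 3 then amounts to editing the regex substitutions in the extraction function to suit the desired output.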
