This project scrapes retailer store locator data using REST APIs and BeautifulSoup.
The primary goal is to extract name and address pairs from various eyewear retailers and merge them into a consolidated dataset with accurate, consistently formatted addresses.
Names and addresses can be entered into company databases in slightly different formats from firm to firm (e.g., "DR SMITH, 1001 Parkway East, Binghamton, NY 13905" versus "Dr. Smith, 1001 Pkwy E., Binghamton, NY 13905"), so any given true name and address pair is likely to be counted more than once.
To mitigate this, I use fuzzy string matching to account for subtle differences in names and addresses, and I aggregate pairs under a moderate tolerance for how much the same entry may differ from firm to firm.
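As a rough illustration of the fuzzy-matching idea, the sketch below computes a Levenshtein edit distance in plain Python and scales it to a similarity score between 0 and 1. The function names and the 0.8 threshold are assumptions made for this example, not the project's actual code or settings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_probable_match(a: str, b: str, threshold: float = 0.8) -> bool:
    # Case-insensitive comparison; the threshold is an illustrative assumption.
    return similarity(a.lower(), b.lower()) >= threshold


first = "DR SMITH, 1001 Parkway East, Binghamton, NY 13905"
second = "Dr. Smith, 1001 Pkwy E., Binghamton, NY 13905"
print(is_probable_match(first, second))
```

With this threshold the two example entries are treated as the same name and address pair. A lower threshold merges more aggressively and risks combining distinct locations; a higher one leaves more duplicates.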
.
├── .idea # IDE-specific configurations
├── data # Raw data from scraping
├── final_data # Processed data ready for merging
├── merged_data # Final merged dataset
├── scrapers # Scripts for scraping store locators
├── utils # Utility scripts for data processing
├── README.md # Project documentation
└── requirements.txt # Python dependencies
- Data Scraping: Tailored scripts for scraping store locator data using API requests and HTML parsing.
- Data Processing: Standardization and preprocessing scripts to ensure data consistency.
- Data Merging: Fuzzy string matching (Levenshtein distance) to merge datasets with consistent address formats.
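The standardization step can be sketched as follows, assuming a small abbreviation table and simple punctuation stripping. The table contents and the `standardize` and `dedupe` helpers are hypothetical and only show the general shape of such preprocessing; the scripts in `utils` may work differently.

```python
import re

# Hypothetical abbreviation table; a realistic one would be much larger.
ABBREVIATIONS = {
    "parkway": "pkwy",
    "east": "e",
    "street": "st",
    "avenue": "ave",
    "doctor": "dr",
}


def standardize(text: str) -> str:
    """Lowercase, drop punctuation, and abbreviate known tokens."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes whitespace
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


def dedupe(records):
    """Keep one (name, address) pair per standardized key."""
    seen = {}
    for name, address in records:
        key = standardize(f"{name}, {address}")
        seen.setdefault(key, (name, address))  # keep the first spelling seen
    return list(seen.values())


records = [
    ("DR SMITH", "1001 Parkway East, Binghamton, NY 13905"),
    ("Dr. Smith", "1001 Pkwy E., Binghamton, NY 13905"),
]
print(dedupe(records))
```

Here both spellings reduce to the same key, so only the first record survives. Exact matching on standardized keys only catches differences the table anticipates, which is why the merge step relies on fuzzy matching for the rest.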
To set up and run this project locally, follow these steps:
- Clone the repository:
git clone https://github.com/yourusername/your-repository-name.git
- Navigate to the project directory:
cd eyewear-retailer-scraper-aggregator
- Install requirements:
pip install -r requirements.txt
- Run the scrapers:
python scrapers/<script_name>.py
- Run the data processing scripts:
python utils/<script_name>.py
This project is licensed under the MIT License - see the LICENSE.md file for details.