This Python script performs web scraping on various product categories from an online supermarket and saves the product details into a CSV file.
- Uses SeleniumBase Driver for web navigation and BeautifulSoup for parsing HTML. SeleniumBase automatically downloads the necessary driver version.
- Accepts cookies before starting the scraping process.
- Iterates over multiple categories, each with its own URL.
- Extracts the number of pages for each category.
- Extracts the total number of ads in each category.
- For each page, it extracts the HTML content and parses it to find all ads.
- Each ad's details (ID, date, category, title, price, href, img_src) are saved in a dictionary.
- The dictionary is appended to a list of all ads.
- The list of ads is saved to a CSV file using the eci_csv function.
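The last three steps (one dictionary per ad, a list of all ads, a CSV file) can be sketched as follows. This is a minimal illustration, not the script's actual eci_csv implementation, which is not shown here: the names `FIELDS`, `build_ad`, and `save_ads_csv` are hypothetical, and the column order simply follows the field list above.

```python
import csv
from datetime import date

# Column order taken from the field list in this README.
FIELDS = ["ID", "date", "category", "title", "price", "href", "img_src"]


def build_ad(ad_id, category, title, price, href, img_src, scraped_on=None):
    """Return one ad as a dictionary (hypothetical helper)."""
    return {
        "ID": ad_id,
        # Assumption: the date column holds the scrape date in ISO format.
        "date": scraped_on or date.today().isoformat(),
        "category": category,
        "title": title,
        "price": price,
        "href": href,
        "img_src": img_src,
    }


def save_ads_csv(ads, path):
    """Write a list of ad dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(ads)
```

Passing `newline=""` when opening the file lets the csv module control line endings itself, which avoids blank rows on Windows.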
- Ensure that the required third-party libraries, BeautifulSoup and SeleniumBase, are installed. The other modules the script imports (os, re, csv, time, random, sqlite3, datetime) are part of the Python standard library.
- Run the script. It will start by initializing the SeleniumBase Driver and maximizing the window.
- The script will then start the scanning process, iterating over each category and each page within the category.
- For each ad found, it will extract the details and save them to a CSV file.
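The scanning process described in these steps can be sketched roughly as below. It is an illustration under stated assumptions rather than the script's real code: `scan_categories` and its callable parameters are hypothetical, the `?page=N` URL pattern is invented for the example, and the random pause between requests is an assumption suggested only by the script's use of time and random.

```python
import random
import time


def scan_categories(categories, get_page_html, count_pages, parse_ads,
                    delay_range=(1.0, 3.0)):
    """Visit every page of every category and collect the parsed ads.

    categories:    dict mapping category name -> base URL
    get_page_html: callable(url) -> HTML string (e.g. a wrapper around the driver)
    count_pages:   callable(html) -> number of pages in the category
    parse_ads:     callable(html, category) -> list of ad dictionaries
    """
    all_ads = []
    for category, base_url in categories.items():
        first_html = get_page_html(base_url)
        total_pages = count_pages(first_html)
        for page in range(1, total_pages + 1):
            if page == 1:
                html = first_html  # already fetched to read the page count
            else:
                # Hypothetical pagination scheme; the real site may differ.
                html = get_page_html(f"{base_url}?page={page}")
            all_ads.extend(parse_ads(html, category))
            # Assumed politeness delay between page loads.
            time.sleep(random.uniform(*delay_range))
    return all_ads
```

Passing the fetching and parsing steps in as callables keeps the loop independent of the browser, so it can be exercised with stub functions before pointing it at a live site.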
The script starts by initializing the SeleniumBase Driver, which navigates the site and reduces the chance of being detected as a bot. Here is how the driver is set up:
from seleniumbase import Driver

driver = Driver(
    browser="chrome",
    uc=True,
    headless2=False,
    incognito=False,
    agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36',
    do_not_track=True,
    undetectable=True
)
Here’s what each option means:
browser="chrome": This indicates that you’re using Google Chrome as your browser.
uc=True: This enables “Undetectable Chrome” capabilities that make it harder for websites to detect that you’re using a bot.
headless2=False: This indicates that you want the browser to display while the script is running. If you change this to True, the browser will run in the background.
incognito=False: This indicates that you don’t want the browser to run in incognito mode. If you change this to True, the browser will run in incognito mode.
agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36': This sets the user agent of the browser. By changing the user agent, you can make your bot appear like a normal browser.
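do_not_track=True: This enables the “Do Not Track” setting in the browser.
undetectable=True: SeleniumBase treats this as another name for the same undetected mode that uc=True enables, so setting both is redundant but harmless.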
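To switch between a visible and a background browser without editing the Driver call each time, the options can be collected in a small helper. The sketch below is only one possible arrangement: `build_driver_kwargs` and its parameters are hypothetical names, and the keys it returns are the same options explained above.

```python
def build_driver_kwargs(show_browser=True, incognito=False, user_agent=None):
    """Return keyword arguments for seleniumbase.Driver (hypothetical helper)."""
    kwargs = {
        "browser": "chrome",
        "uc": True,
        # headless2=True runs the browser in the background.
        "headless2": not show_browser,
        "incognito": incognito,
        "do_not_track": True,
    }
    # Only override the user agent when one is explicitly supplied.
    if user_agent is not None:
        kwargs["agent"] = user_agent
    return kwargs


# Usage with SeleniumBase installed (not executed here):
#   from seleniumbase import Driver
#   driver = Driver(**build_driver_kwargs(show_browser=False))
#   ...
#   driver.quit()
```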
This script is intended for educational purposes only.