🛠️ Web Scraping Exploration with Selenium
Take a gentle dive into the basics of web scraping with this repository! Using Selenium, the project walks you through extracting data from books and quotes websites. It's a simple yet effective exercise to get hands-on experience with web scraping techniques. The data collected is neatly organized into a CSV file, offering a practical glimpse into data processing. Whether you're new to web scraping or just looking for a straightforward example, this repository provides a humble starting point for your exploration. Happy coding!
- booksToScrape.csv: Books data as a CSV file
- quotesToScrape.csv: Quotes data as a CSV file
- books_web_scraping.py: Web scraping code for the books website
- quotes_web_scraping.py: Web scraping code for the quotes website
Selenium is a powerful web automation library for Python, widely used for web scraping and testing.
pip install selenium
Pandas is a versatile data manipulation library in Python, commonly employed for data analysis and storage, such as saving data to CSV files.
pip install pandas
- Create a webdriver instance
from selenium import webdriver

# Launch a Chrome browser and open the books website
driver = webdriver.Chrome()
url = "http://books.toscrape.com/"
driver.get(url)
- Chrome should open and display the banner:
Chrome is being controlled by automated test software.
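If you would rather not watch the browser window while it scrapes, Chrome can also be started headless. This is a minimal optional sketch using ChromeOptions; it is not part of the original scripts:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)
driver.get("http://books.toscrape.com/")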
Use explicit waits so elements are fully loaded before you interact with them:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Locator tuple for the "next" page link (selector assumed for books.toscrape.com)
next_page_button_locator = (By.CSS_SELECTOR, "li.next > a")

try:
    # Explicitly wait for the next page button to be present
    WebDriverWait(driver, 20).until(EC.presence_of_element_located(next_page_button_locator))
    # Explicitly wait for the next page button to be clickable
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(next_page_button_locator))
    # Find the next page button and click it
    next_page_button = driver.find_element(*next_page_button_locator)
    next_page_button.click()
except Exception as e:
    print(f"Exception: {type(e).__name__} - {e}. Refreshing the page and retrying click.")
    driver.refresh()
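Putting the two waits together, a pagination loop might look like the sketch below. It is an illustrative example, not the exact repository code, and it assumes the next_page_button_locator defined above:

from selenium.common.exceptions import TimeoutException

while True:
    # Scrape the current page here, then move on to the next one.
    try:
        # Wait for the next-page link to become clickable, then click it.
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable(next_page_button_locator)
        ).click()
    except TimeoutException:
        # No clickable next-page link within the timeout: assume this was the last page.
        break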
Use the By class to locate elements with different locator strategies:
from selenium.webdriver.common.by import By
find_element(By.CSS_SELECTOR, some_string)
Finds an element using a CSS selector; it replaces the deprecated find_element_by_css_selector.
find_element(By.XPATH, some_string)
Finds an element by XPath; it replaces the deprecated find_element_by_xpath.
find_element(By.CLASS_NAME, some_string)
Finds an element by class name; it replaces the deprecated find_element_by_class_name.
These methods return an instance of WebElement.
element.click()
Clicks the element.
element.get_attribute('class')
Accesses an attribute such as class, title, etc.
element.text
Accesses the element's text.
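For example, these calls can be combined to collect the list saved later. The sketch below assumes the markup used on books.toscrape.com (each book inside an article.product_pod element, with the title in the h3 link's title attribute and the price in a price_color element):

books_list = []
for book in driver.find_elements(By.CSS_SELECTOR, "article.product_pod"):
    # The title is stored in the "title" attribute of the h3 link
    title = book.find_element(By.CSS_SELECTOR, "h3 > a").get_attribute("title")
    # The displayed price text, e.g. "£51.77"
    price = book.find_element(By.CLASS_NAME, "price_color").text
    books_list.append([title, price])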
Save a list of lists as a DataFrame using Pandas
import pandas as pd

df = pd.DataFrame(books_list)
Save the data frame to a CSV file for further use
df.to_csv('path-to-folder/booksToScrape.csv', index=True)
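If you prefer named columns and no extra index column in the CSV, one option is shown below; the column names here are assumptions for illustration, not the repository's:

df = pd.DataFrame(books_list, columns=["title", "price"])
df.to_csv('path-to-folder/booksToScrape.csv', index=False)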
Close the browser
driver.quit()