🛠️ Web Scraping Exploration with Selenium
Take a gentle dive into the basics of web scraping with this repository! Using Selenium, the project walks you through extracting data from books and quotes websites. It's a simple yet effective exercise to get hands-on experience with web scraping techniques. The data collected is neatly organized into a CSV file, offering a practical glimpse into data processing. Whether you're new to web scraping or just looking for a straightforward example, this repository provides a humble starting point for your exploration. Happy coding!
- booksToScrape.csv: Books data as a CSV file
- quotesToScrape.csv: Quotes data as a CSV file
- books_web_scraping.py: Web scraping code for the books website
- quotes_web_scraping.py: Web scraping code for the quotes website
Selenium is a powerful web automation library for Python, widely used for web scraping and testing.
pip install selenium
Pandas is a versatile data manipulation library in Python, commonly employed for data analysis and storage, such as saving data to CSV files.
pip install pandas
- Create a webdriver instance
from selenium import webdriver

# Launch a Chrome browser and open the books website
driver = webdriver.Chrome()
url = "http://books.toscrape.com/"
driver.get(url)
- Chrome should open and display the banner:
Chrome is being controlled by automated test software.
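If you would rather not watch the browser window while it scrapes, Chrome can also be started headless. This is a minimal optional sketch using ChromeOptions; it is not part of the original scripts:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)
driver.get("http://books.toscrape.com/")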
Use explicit waits so elements are fully loaded before you interact with them:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Locator tuple for the "next" page link (selector assumed for books.toscrape.com)
next_page_button_locator = (By.CSS_SELECTOR, "li.next > a")

try:
    # Explicitly wait for the next page button to be present
    WebDriverWait(driver, 20).until(EC.presence_of_element_located(next_page_button_locator))
    # Explicitly wait for the next page button to be clickable
    WebDriverWait(driver, 20).until(EC.element_to_be_clickable(next_page_button_locator))
    # Find the next page button and click it
    next_page_button = driver.find_element(*next_page_button_locator)
    next_page_button.click()
except Exception as e:
    print(f"Exception: {type(e).__name__} - {e}. Refreshing the page and retrying click.")
    driver.refresh()
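Putting the two waits together, a pagination loop might look like the sketch below. It is an illustrative example, not the exact repository code, and it assumes the next_page_button_locator defined above:

from selenium.common.exceptions import TimeoutException

while True:
    # Scrape the current page here, then move on to the next one.
    try:
        # Wait for the next-page link to become clickable, then click it.
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable(next_page_button_locator)
        ).click()
    except TimeoutException:
        # No clickable next-page link within the timeout: assume this was the last page.
        break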
Use the By class to locate elements with different locator strategies:
from selenium.webdriver.common.by import By
find_element(By.CSS_SELECTOR, some_string)
Finds an element using a CSS selector; it replaces the deprecated find_element_by_css_selector.
find_element(By.XPATH, some_string)
Finds an element by XPath; it replaces the deprecated find_element_by_xpath.
find_element(By.CLASS_NAME, some_string)
Finds an element by class name; it replaces the deprecated find_element_by_class_name.
These methods return an instance of WebElement.
element.click()
Clicks the element.
element.get_attribute('class')
Accesses an attribute such as class, title, etc.
element.text
Accesses the element's text.
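For example, these calls can be combined to collect the list saved later. The sketch below assumes the markup used on books.toscrape.com (each book inside an article.product_pod element, with the title in the h3 link's title attribute and the price in a price_color element):

books_list = []
for book in driver.find_elements(By.CSS_SELECTOR, "article.product_pod"):
    # The title is stored in the "title" attribute of the h3 link
    title = book.find_element(By.CSS_SELECTOR, "h3 > a").get_attribute("title")
    # The displayed price text, e.g. "£51.77"
    price = book.find_element(By.CLASS_NAME, "price_color").text
    books_list.append([title, price])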
Save a list of lists as a DataFrame using Pandas
import pandas as pd

df = pd.DataFrame(books_list)
Save the data frame to a CSV file for further use
df.to_csv('path-to-folder/booksToScrape.csv', index=True)
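If you prefer named columns and no extra index column in the CSV, one option is shown below; the column names here are assumptions for illustration, not the repository's:

df = pd.DataFrame(books_list, columns=["title", "price"])
df.to_csv('path-to-folder/booksToScrape.csv', index=False)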
Close the browser
driver.quit()