Largest Public Companies in the US Web Scraping Project
Overview:
This project is a Python-based web scraper that extracts the list of the largest public companies in the United States by revenue from Wikipedia. It fetches the page with the requests library, parses the HTML with BeautifulSoup, loads the table into a pandas DataFrame, and exports the result to a CSV file for further analysis.
Features:
Scrapes data from a Wikipedia page containing a table of the largest public companies in the US.
Extracts company information such as ranking, name, revenue, and other details from the table.
Stores the scraped data in a pandas DataFrame.
Exports the data to a CSV file.
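The steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script. The URL, the User-Agent string, the output filename, and the function names fetch_html, table_to_dataframe, and scrape_to_csv are all assumptions made for the example. It also assumes the first table on the page is the one of interest, which may not hold for the real page.

```python
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Assumed URL: the README does not state the exact page address.
URL = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_the_United_States_by_revenue"


def fetch_html(url: str) -> str:
    """Download the page and return its HTML text."""
    # The User-Agent value is a placeholder chosen for this example.
    headers = {"User-Agent": "largest-companies-scraper-example"}
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.text


def table_to_dataframe(html: str) -> pd.DataFrame:
    """Parse the first <table> in the HTML into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found in the HTML")
    rows = table.find_all("tr")
    # The header cells of the first row give the column names.
    columns = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        if len(cells) == len(columns):  # skip rows that do not match the header
            records.append(cells)
    return pd.DataFrame(records, columns=columns)


def scrape_to_csv(url: str, path: str) -> pd.DataFrame:
    """Fetch the page, extract the table, and write it to a CSV file."""
    df = table_to_dataframe(fetch_html(url))
    df.to_csv(path, index=False)
    return df
```

Calling scrape_to_csv(URL, "largest_companies.csv") would run the whole pipeline. Keeping the parsing step in its own function means it can be tried on a saved HTML string without touching the network.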
Technologies Used:
Python: The core programming language used to write the script.
Requests: To fetch the HTML content of the Wikipedia page.
BeautifulSoup: For parsing and navigating the HTML content to extract data.
pandas: For data manipulation and exporting the scraped data to a CSV file.
Jupyter Notebook (Optional): For testing and experimenting with the code interactively.
Prerequisites:
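Ensure you have the following libraries installed (for example, with pip install requests beautifulsoup4 pandas):
requests: For making HTTP requests to fetch the webpage.
BeautifulSoup: For parsing the HTML page. It is published on PyPI as beautifulsoup4 and imported as bs4.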
pandas: For data manipulation and CSV export.
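One way to confirm the prerequisites are in place before running the scraper is a short check like the one below. It is only a sketch using the standard library. The helper name missing_modules is made up for this example, and the list uses import names, so BeautifulSoup appears as bs4.

```python
from importlib.util import find_spec

# Import names of the three required libraries (BeautifulSoup is imported as "bs4").
REQUIRED_MODULES = ["requests", "bs4", "pandas"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```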