In this LIHKG scraping task, we use the Selenium Python library to simulate a browser session, navigate to the LIHKG website's API endpoints, retrieve the JSON data for each thread and page, and store it in a CSV file.
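As a rough illustration of this setup, the sketch below launches a headless Chrome instance with Selenium, builds a per-page API URL, and appends parsed rows to a CSV file. The exact endpoint path (https://lihkg.com/api_v2/thread/{thread_id}/page/{page}) and the CSV column names are assumptions for illustration, not details taken from the original script.

```python
# Minimal sketch, assuming the api_v2 endpoint format and an illustrative CSV schema.
import csv

from selenium import webdriver
from selenium.webdriver.chrome.options import Options


def make_driver() -> webdriver.Chrome:
    options = Options()
    options.add_argument("--headless=new")   # run Chrome without a visible window
    options.add_argument("--disable-gpu")
    return webdriver.Chrome(options=options)


def api_url(thread_id: int, page: int) -> str:
    # Assumed LIHKG thread endpoint; returns JSON describing one page of replies.
    return f"https://lihkg.com/api_v2/thread/{thread_id}/page/{page}?order=reply_time"


def save_rows(rows: list[dict], path: str = "lihkg_posts.csv") -> None:
    # Append parsed replies to a CSV file keyed by thread ID and page number.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(
            f, fieldnames=["thread_id", "page", "msg_num", "user", "msg"]
        )
        if f.tell() == 0:          # write the header only for a fresh file
            writer.writeheader()
        writer.writerows(rows)
```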
The script begins by launching a headless Chrome browser and directing it to the specified thread page. It then dynamically creates an <a> (anchor) tag within the page and uses it to navigate to the API URL, fetching the corresponding JSON data, which is then parsed. After each retrieval, the data is saved to a file, organised by thread ID and page number. The script supports resuming from the last saved position, allowing it to continue scraping from where it left off in the event of an interruption. To emulate typical user behaviour and mitigate anti-scraping mechanisms, random delays and robust error handling are incorporated throughout the process.
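The per-page loop might look roughly like the following, reusing the hypothetical make_driver, api_url, and save_rows helpers from the previous sketch. The last_position.txt checkpoint file, the anchor-click navigation, the example thread ID, and the response/item_data field names are assumptions here and would need to be checked against LIHKG's actual responses.

```python
# Sketch of the scraping loop: in-page navigation, resume support, random delays.
import json
import random
import time

from selenium.common.exceptions import WebDriverException
from selenium.webdriver.common.by import By

STATE_FILE = "last_position.txt"   # hypothetical "thread_id,page" checkpoint


def load_position() -> tuple[int, int]:
    # Resume from the last saved thread/page if a checkpoint exists.
    try:
        with open(STATE_FILE, encoding="utf-8") as f:
            thread_id, page = f.read().split(",")
            return int(thread_id), int(page)
    except FileNotFoundError:
        return 0, 0


def save_position(thread_id: int, page: int) -> None:
    with open(STATE_FILE, "w", encoding="utf-8") as f:
        f.write(f"{thread_id},{page}")


def fetch_page_json(driver, url: str) -> dict:
    # Navigate inside the existing browser session by injecting and clicking an
    # <a> tag, so the request carries the page's cookies and headers.
    driver.execute_script(
        "const a = document.createElement('a');"
        "a.href = arguments[0];"
        "document.body.appendChild(a);"
        "a.click();",
        url,
    )
    time.sleep(random.uniform(2, 5))          # behave like a human reader
    raw = driver.find_element(By.TAG_NAME, "pre").text  # Chrome shows raw JSON in a <pre>
    return json.loads(raw)


def scrape_thread(driver, thread_id: int, start_page: int = 1) -> None:
    page = start_page
    while True:
        try:
            data = fetch_page_json(driver, api_url(thread_id, page))
        except (WebDriverException, json.JSONDecodeError):
            break                              # stop this thread on persistent errors
        items = data.get("response", {}).get("item_data", [])  # assumed field names
        if not items:
            break                              # past the last page
        rows = [
            {
                "thread_id": thread_id,
                "page": page,
                "msg_num": item.get("msg_num"),
                "user": item.get("user_nickname"),
                "msg": item.get("msg"),
            }
            for item in items
        ]
        save_rows(rows)
        save_position(thread_id, page)         # checkpoint after each page
        page += 1
        time.sleep(random.uniform(3, 8))       # random delay between pages


if __name__ == "__main__":
    driver = make_driver()
    last_thread, last_page = load_position()
    target = 3000000                           # hypothetical thread ID
    start = last_page + 1 if last_thread == target else 1
    try:
        # Land on the normal thread page first so the session looks like a real visit.
        driver.get(f"https://lihkg.com/thread/{target}/page/1")
        time.sleep(random.uniform(2, 5))
        scrape_thread(driver, target, start_page=start)
    finally:
        driver.quit()
```

Reading the JSON back out of the rendered page keeps every request inside the same browser session, which is what lets the anchor-tag trick reuse the site's cookies and headers instead of issuing bare HTTP requests.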