Please read How to Use and watch the video before messaging me with any concerns or issues.
A GUI based web scraper written in Python. Useful for data collectors who want a nice UI for scraping data from many sites.
I was looking through Reddit for fun project ideas and came across a thread where someone was complaining that there was no GUI web scraper. So I started working on G-Scraper.
(✅ means that it is implemented. ❌ means that I am working on it.)
- ✅ Supports 2 request types; GET & POST (at the moment)
- ✅ Shows all your added info in a list
- ✅ Can scrape multiple URLs
- ✅ Can scrape multiple elements from the same URL (webpage)
- ✅ So putting the two together, can scrape multiple elements from multiple URLs, ensuring that the element is from the URL it was assigned to
- ✅ Can pass request parameters along with the scrape request, EXCEPT FILES (for now)
- ✅ Since parameters can be passed, it can also handle logins/signups
- ✅ Saves the scraped data in a separate 'data/scraped-data' folder
- Has a logging function: logs 3 types of outputs
- ✅ Elemental (for elements)
- ✅ Pagical (for webpages)
- ✅ Error (for errors)
- ✅ Handles all types of errors
- ✅ Request function runs in a separate thread from the GUI, so you can do other things while your request is running
- ✅ Functionality to edit the variables once they have been added
- ✅ All errors are handled and logged
- ✅ Can delete an unwanted item from the list of added variables
- ✅ Can reset the entire app to start brand new after a scrape/set of scrapes
- ❌ Provides verbose output to user in the GUI
- ✅ User can set 'presets': if you run the same scrape repeatedly, you can save it as a preset, then just load and run it without having to define the variables each time
- ✅ Can scrape links
- ✅ Generates a unique filename for each log AND saved data file so that no mix-ups happen
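
The unique filename feature can be pictured with a small sketch. This is only an illustration, not the app's actual code: the function name `unique_filename`, the timestamp format, and the 6-digit random suffix are all assumptions. It simply combines `datetime` and `random`, the two modules this project uses for file creation.

```python
import os
import random
from datetime import datetime


def unique_filename(prefix, folder="data/scraped-data", ext=".txt", now=None):
    """Build a path that is very unlikely to collide with an earlier file.

    Hypothetical helper: the naming scheme is an assumption, not G-Scraper's own.
    """
    now = now or datetime.now()
    stamp = now.strftime("%Y-%m-%d_%H-%M-%S")
    # The random part separates files created within the same second.
    suffix = random.randint(100000, 999999)
    return os.path.join(folder, f"{prefix}_{stamp}_{suffix}{ext}")


if __name__ == "__main__":
    print(unique_filename("scrape"))
```

A timestamp alone would clash if two files were written in the same second, which is why a random suffix is added on top.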
Main:
- PyQt5 (for the GUI) 💻
- Requests (for the web requests) 📶
- BeautifulSoup4 (for scraping and parsing the HTML) 🍲
- threading (for the separate threads) 🧵
- datetime (used in logging and saved data file creation) 📅⌚
- random (used in file creation) ❔
- os (used to get current working directory) ⚡
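
As a rough sketch of why `threading` is on this list: running the request in a worker thread keeps the GUI free while the request is in progress. The code below shows the general pattern only and is not the app's implementation. `run_in_background` and `fake_request` are hypothetical names, and `fake_request` stands in for a real Requests call so that the example runs offline.

```python
import threading
import time


def fake_request(url):
    """Stand-in for a real web request (hypothetical; avoids network access)."""
    time.sleep(0.1)  # simulate network latency
    return f"<html>response from {url}</html>"


def run_in_background(func, *args):
    """Run func(*args) in a worker thread; return the thread and a result holder."""
    result = {}

    def worker():
        try:
            result["value"] = func(*args)
        except Exception as exc:  # keep the error so the caller can log it
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread, result


if __name__ == "__main__":
    thread, result = run_in_background(fake_request, "https://someurl.com")
    print("main thread is still free while the request runs")
    thread.join()
    print(result["value"])
```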
Here
STEP 0: Install The App
-Clone this repository on your machine
git clone https://github.com/muaaz-ur-habibi/G-Scraper.git
-Move into the directory G-Scraper
-Run the command
pip install -r requirements.txt
to install the libraries
-Run the command
python gui.py
inside your terminal to launch the app
STEP 1: Adding URLs
-Add sites to scrape.
-To do this, select the "Set the Site to scrape" button and enter the URL of each website you wish to scrape, along with its request method (THIS IS COMPULSORY).
-Then just click on the "+" button and it is added.
-Note: The URL should have a format like 'https://someurl.com'; simply click the URL bar at the top of the webpage, Ctrl+C, then Ctrl+V in the textbox.
-Note 2: Add one URL at a time. Don't just enter the entire list into the textbox.
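
If you are unsure whether a URL has the right format, a quick check like the one below can help. It is a minimal sketch using Python's standard `urllib.parse`; the function name `looks_like_full_url` is hypothetical and the app itself may validate URLs differently (or not at all).

```python
from urllib.parse import urlparse


def looks_like_full_url(text):
    """Return True if text has an http/https scheme and a host part."""
    parts = urlparse(text.strip())
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    for candidate in ("https://someurl.com", "someurl.com"):
        print(candidate, looks_like_full_url(candidate))
```

A URL copied straight from the browser's address bar normally includes the scheme, which is why copy-pasting is the safest way to enter it.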
STEP 2: Adding Elements (OPTIONAL)
-Add elements of that site to scrape.
-This is optional in the sense that if you don't specify any elements the app will scrape the entire webpage.
-To specify, click the "Set the elements to scrape" button.
-In here you are presented with 3 text boxes: one for the element name, one for the attribute to specify (OPTIONAL) and one for the attribute value (OPTIONAL).
-So if you want to scrape a div with class of text-box, in the HTML of the webpage it would look like: div class="text-box". Here, "div" is the element name, "class" is the element attribute, "text-box" is the attribute value.
-Once you have entered the element, you must then select the URL/site this element belongs to from the URLs you added in the previous step.
-Finally, click on the "+" button and it's added. Note: if there are multiple elements with the same properties you specified, the script will scrape all their data.
-Note 2: It is possible to specify only the element name and nothing else; this will scrape all the elements with that tag.
-Note 3: In order to obtain the necessary info about an element, you will have to inspect it. Just right click on the element and select 'Inspect'; you will then be presented with the HTML of the element. Use the info in the HTML to scrape it.
-Note 4: If you have specified an a tag (a.k.a. a link tag) to be scraped, it won't scrape the text it has, but rather its link/href value. You can override this by going into 'requestExecutor.py' and finding the part where it says "if x['name'] == 'a'"; then just comment out the else part, and the a tag's text will be scraped.
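
To make the three text boxes concrete, here is a minimal BeautifulSoup sketch of how an element name, an attribute, and an attribute value can select elements, including the href behaviour for a tags described in Note 4. It is an illustration under assumptions, not the code in 'requestExecutor.py'; `scrape_elements` is a hypothetical name.

```python
from bs4 import BeautifulSoup


def scrape_elements(html, name, attribute=None, value=None):
    """Return the text of every matching element, or the href for <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    # With no attribute given, every element with that tag name matches.
    attrs = {attribute: value} if attribute and value else {}
    results = []
    for element in soup.find_all(name, attrs=attrs):
        if name == "a":
            results.append(element.get("href"))  # link target instead of link text
        else:
            results.append(element.get_text(strip=True))
    return results


if __name__ == "__main__":
    page = """
    <div class="text-box">first</div>
    <div class="other">skip me</div>
    <div class="text-box">second</div>
    <a href="https://someurl.com/page">a link</a>
    """
    print(scrape_elements(page, "div", "class", "text-box"))
    print(scrape_elements(page, "div"))
    print(scrape_elements(page, "a"))
```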
STEP 3: Specifying Request Parameters
-Add the web request parameters/payloads to send with your request.
-Click on "Set Payloads or Headers for scrape".
-First, select the site you want to associate these parameters with.
-Then select the type. Currently, only FILE has not been fully worked on, so it will probably throw an unexpected error.
-The rest work fine. (NOTE: IF YOU DON'T WANT TO SEND ANY PARAMETERS YOU MUST SPECIFY SO BY SELECTING THE SITE YOU DON'T WANT ANY PARAMETERS FOR AND SELECTING THE "NO PARAMETER" VALUE. LEAVE THE REST EMPTY AND ADD).
-After you have selected your parameter, specify its contents, then "ADD (+)"
-Note: If you want to obtain the payload, headers, or any other web parameter data, you can do so in the Network tab of Dev Tools.
-Note 2: For sending files, more specifically images (currently only images are tested for files), just type the payload name then specify the complete path to the image file.
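
The idea behind parameter types can be sketched as grouping each entry by its type before the request is sent. This is a hedged illustration, not the app's code: `group_parameters` is a hypothetical name, and the "PAYLOAD" and "HEADER" labels are assumptions (only "FILE" and "NO PARAMETER" are taken from the text above).

```python
def group_parameters(parameters):
    """Group (type, name, value) entries by the kind of request data they carry."""
    grouped = {"data": {}, "headers": {}, "file_paths": {}}
    for kind, name, value in parameters:
        if kind == "NO PARAMETER":
            continue  # explicit marker meaning "send nothing extra"
        if kind == "PAYLOAD":  # assumed label
            grouped["data"][name] = value
        elif kind == "HEADER":  # assumed label
            grouped["headers"][name] = value
        elif kind == "FILE":
            grouped["file_paths"][name] = value  # value is the full path to the file
        else:
            raise ValueError(f"unknown parameter type: {kind}")
    return grouped


# With the Requests library, the groups could then be sent roughly like this:
#   files = {n: open(p, "rb") for n, p in grouped["file_paths"].items()}
#   requests.request("POST", url, data=grouped["data"],
#                    headers=grouped["headers"], files=files)

if __name__ == "__main__":
    print(group_parameters([
        ("PAYLOAD", "username", "muaaz"),
        ("HEADER", "User-Agent", "G-Scraper"),
    ]))
```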
STEP 4: Starting Scrape
-Once you have everything set, you can start the scrape by clicking on "Start Scraping".
-Then once you have reviewed all the details, you can select "Yes".
-Note: If you haven't specified any elements to scrape, the app will give you a warning. If you forgot to, you can go back and specify them. Otherwise, you can just click on "Yes".
STEP 5: Setting Presets (OPTIONAL):
-You can also set presets, which are just what they sound like: you save some values, then in the future you can load those values without having to specify them again.
-Currently, you can only set a preset for one URL at a time, but you can add as many elements and web parameters for that URL as you like.
-To set a preset, just enter the values as you normally would, as specified above. But now, instead of starting the scrape, click on the 'Set/Run Presets' button in the menu bar.
-Here you will be presented with an option to 'create a preset'.
-Then to load that preset in the future,
- First load them from the database using the 'Load presets from database' button
- Next select the preset you would like to run
-Note: Preset names are case-sensitive, so muaazkhan, muaazKhan and Muaazkhan are all different.
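
To show what saving and loading a preset by a case-sensitive name can look like, here is a minimal sketch. It assumes an SQLite database and a JSON-encoded config, neither of which is confirmed by this README; the table layout and the function names `open_preset_db`, `save_preset`, and `load_preset` are hypothetical.

```python
import json
import sqlite3


def open_preset_db(path=":memory:"):
    """Open (or create) a preset database. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS presets (name TEXT PRIMARY KEY, config TEXT NOT NULL)"
    )
    return conn


def save_preset(conn, name, config):
    """Store one preset; saving under an existing name overwrites it."""
    with conn:  # commits on success
        conn.execute(
            "INSERT OR REPLACE INTO presets (name, config) VALUES (?, ?)",
            (name, json.dumps(config)),
        )


def load_preset(conn, name):
    """Return the saved config, or None. SQLite's default text comparison is case-sensitive."""
    row = conn.execute("SELECT config FROM presets WHERE name = ?", (name,)).fetchone()
    return json.loads(row[0]) if row else None


if __name__ == "__main__":
    conn = open_preset_db()
    save_preset(conn, "muaazkhan", {"url": "https://someurl.com", "method": "GET"})
    print(load_preset(conn, "muaazkhan"))
    print(load_preset(conn, "muaazKhan"))
```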
As of now, there really isn't a way to give verbose output to the user. So once you start the scrape, just wait for a few seconds and check the scraped data folder in the data folder. Alternatively, if you find nothing there, you can check the logs folder to see if any error has occurred.
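
Until verbose output exists, a tiny helper like this can list the most recent files in a folder so you don't have to dig through it by hand. It is a sketch only: `newest_files` is a hypothetical name, and while 'data/scraped-data' comes from the feature list, the logs folder path used in the usage example is an assumption.

```python
import os


def newest_files(folder, limit=5):
    """Return up to `limit` file names in folder, newest first; [] if the folder is missing."""
    if not os.path.isdir(folder):
        return []
    paths = [os.path.join(folder, name) for name in os.listdir(folder)]
    files = [p for p in paths if os.path.isfile(p)]
    files.sort(key=os.path.getmtime, reverse=True)  # most recently modified first
    return [os.path.basename(p) for p in files[:limit]]


if __name__ == "__main__":
    print("scraped data:", newest_files("data/scraped-data"))
    print("logs:", newest_files("logs"))  # assumed folder name
```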
- URL editing is implemented, but not request type.
- Only images are supported in the files payload, since only they have been tested so far
- Added functionality to scrape the links of a tags
- Fixed some code mess
- Started working on preset adding function
- Finished the presetting GUI elements
- Completed the basic presetting functionality, i.e being able to take, clean and process all the necessary data
- Also added some ifs and elses so that presetting now also can support webpage scraping
- Completed the presetting functionality, with the exception of deleting a preset
- Added a pop-up to let the user know when a scrape has started
- Added the functionality to delete a preset