Niaev/link_tracking

link_tracking

A simple Python script tool and package that uses web crawling concepts to find links and pages around the internet and SQLite databases to store found data.

Using

You can clone this Git repository and add it to your project to use link_tracking as a package or to use the tracker.py script.

$ git clone https://github.com/Niaev/link_tracking.git

This package is not yet available on the Python Package Index (PyPI).

tracker script

This script can be found in the root of this repository. Follow the example below:

$ python3 tracker.py SEEDS_FILE [DEPTH]

  • SEEDS_FILE - path to a text file listing internet links, one per line. Example:
http://link-one.com/
https://link.org/two
...
  • DEPTH - an optional integer (default is 2) that sets the link tracking depth, that is, how many levels of child pages the tracker enters, recursively searching each one for further links.
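
The two arguments can be pictured with a minimal sketch. This is not the code in tracker.py: `read_seeds`, `track`, and the `get_links` callback are hypothetical names, and the reading of DEPTH (seeds sit at level 0 and links are followed at most DEPTH levels below them) is an assumption about how the script counts. The sketch also walks the links level by level in a loop rather than recursively, which keeps the depth limit easy to check.

```python
from typing import Callable, Iterable, List, Set


def read_seeds(text: str) -> List[str]:
    """Return the non-empty lines of a seeds file as a list of links."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def track(seeds: Iterable[str],
          get_links: Callable[[str], Iterable[str]],
          depth: int = 2) -> Set[str]:
    """Collect every link reachable from the seeds within `depth` levels.

    `get_links` is a hypothetical callback returning the links found on a
    page; a real crawler would download and parse the page at this point.
    """
    frontier = list(dict.fromkeys(seeds))  # de-duplicate, keep order
    found = set(frontier)
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            for child in get_links(url):
                if child not in found:  # never visit a page twice
                    found.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return found


# A tiny in-memory "web" standing in for real pages (assumed data).
graph = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
reachable = track(["a"], lambda url: graph.get(url, []), depth=2)
print(sorted(reachable))  # → ['a', 'b', 'c']
```

With a depth of 2 the page `d` is never reached, because it sits three levels below the seed.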

The script tracks links starting from your seeds, scrapes the corresponding pages, and stores the results in a SQLite database at data/pages.db.
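
Storing scraped pages in SQLite can be sketched with the standard sqlite3 module. The table layout below (a `pages` table with `url`, `title`, and `content` columns) and the `store_pages` helper are assumptions for illustration; the actual schema of data/pages.db is not described here. The example uses an in-memory database so it leaves no file behind.

```python
import sqlite3


def store_pages(conn, pages):
    """Insert (url, title, content) tuples into an assumed `pages` table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " title TEXT,"
        " content TEXT)"
    )
    # INSERT OR REPLACE overwrites the row when a URL is scraped again.
    conn.executemany(
        "INSERT OR REPLACE INTO pages (url, title, content) VALUES (?, ?, ?)",
        pages,
    )
    conn.commit()


# Pass a file path instead of ":memory:" to persist the data on disk.
conn = sqlite3.connect(":memory:")
store_pages(conn, [
    ("http://link-one.com/", "Link One", "first page text"),
    ("https://link.org/two", "Link Two", "second page text"),
])
print(conn.execute("SELECT COUNT(*) FROM pages").fetchone()[0])  # → 2
```

Using the URL as the primary key is one simple way to keep a page from being stored twice when several seeds lead to it.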

as a package

It has two modules: crawler and indexer.

crawler has some functions and the Crawler class, responsible for web crawling.

Class that receives a URL and uses urllib and bs4 to get page information, with functions to track and scrape links
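
The link-finding step can be illustrated without any network access. The sketch below swaps bs4 for the standard library's html.parser so that it runs without extra dependencies; `LinkExtractor` and `extract_links` are hypothetical names and are not part of the crawler module.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # urljoin turns relative links such as "/two" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the absolute links found in an HTML string, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


html = '<p><a href="/two">Two</a> <a href="http://link-one.com/">One</a></p>'
print(extract_links(html, "https://link.org/"))
# → ['https://link.org/two', 'http://link-one.com/']
```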

indexer has just the Indexer class, responsible for handling, organizing, and storing the collected data.

Class that receives a list of links to organize and index

The code is documented with docstrings and comments. More in-depth documentation is planned for this repository's wiki, which is not yet available.
