Refactoring & Improvements #31

Merged
merged 16 commits into from
Oct 31, 2024

Conversation

MikeMeliz
Owner

@MikeMeliz MikeMeliz commented Oct 31, 2024

Description

I went through the code to improve its maintainability and to make it easier for new contributors. Each commit covers one change, as follows:

  • a873088 [Maintainability]: Fixed several typos and grammar errors across the files, which were producing warnings in IDEs.
  • e4e1f57 [Maintainability]: Renamed functions and variables to be more readable.
  • 1e20434 [Fix]: Since we're on Python 3, it's not necessary to list the module names inside the init file.
  • b59a593 [Fix]: Fixed the tests so they no longer fail with HTTPS websites.
  • 4b95c19 [EasinessToUse]: Shortened the now() timestamp in file names for easier navigation.
  • 1109342 [Maintainability]: Introduced an /output folder (with a .gitkeep inside) to keep the project clean of output files and make navigation easier.
  • d3070dc [Fix]: When crawling and extracting, if a file with the same name already existed (or index.htm, if no filename could be fetched), it was overwritten.
  • ddca979 [Fix]: Changed the now() timestamp of links.txt to match the rest of the results.
  • 0ddfb9f [Performance]: Removed the 1-second pause between requests, which greatly improved performance.
  • e474760 [Fix]: Small fix to the output of extractor.py to make the file easier to find.
  • b689557 [Performance]: External links, telephones and mails were not being appended to the file.
  • 6a1a9da [Fix]: Logs didn't work, or were too similar to links.txt. Added the time and module name.
  • fd66e20 [Maintainability]: Added some comments for future reference.
  • e960721 [Maintainability]: Updated README.md.
  • f7bdb9d [Fix]: Fixed the log function and included 40x errors as well.
  • 63c1e3c [Maintainability]: Reduced the size of .gitignore by removing unnecessary exclusions.
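
For readers unfamiliar with the overwrite problem described in d3070dc, here is a minimal sketch of one common way to avoid it: pick a file name that is not yet taken before writing. This is not the code from the commit. The helper names `unique_path` and `save_page`, the `(1)`, `(2)` counter suffix and the folder layout are all assumptions made for illustration.

```python
from pathlib import Path


def unique_path(folder: Path, filename: str = "index.htm") -> Path:
    """Return a path inside `folder` that does not exist yet.

    Hypothetical helper: if `filename` is taken, a counter is added
    before the extension (index.htm, index(1).htm, index(2).htm, ...).
    """
    candidate = folder / filename
    stem, suffix = candidate.stem, candidate.suffix
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem}({counter}){suffix}"
        counter += 1
    return candidate


def save_page(folder: Path, filename: str, content: bytes) -> Path:
    """Write `content` into `folder` without overwriting an existing file.

    Falls back to index.htm when no file name could be determined,
    mirroring the fallback mentioned in the commit message.
    """
    folder.mkdir(parents=True, exist_ok=True)
    target = unique_path(folder, filename or "index.htm")
    target.write_bytes(content)
    return target
```

Calling `save_page(Path("output"), "index.htm", b"...")` twice under this scheme would leave two files side by side instead of one overwritten file. The check-then-write approach is not safe against concurrent writers, which would matter if parallel scanning is added later.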

Motivation and Context

I had been planning for far too long to go over the code and make it easier for new contributors to understand how it works.
Also, the script was failing during several tests, which drives users away from it, so several fixes were needed.
It should now be at an okay-ish stage to start including improvements (e.g. parallel scanning, IP rotation, etc.).

How Has This Been Tested?

python torcrawl.py -w -u google.com -c -d 1 -p 1 -l -v -e
python torcrawl.py -w -u google.com -c -d 1 -p 1 -l -v
python torcrawl.py -w -u google.com -c -d 3 -l -v

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.


sonarcloud bot commented Oct 31, 2024

@MikeMeliz MikeMeliz marked this pull request as ready for review October 31, 2024 21:00
@MikeMeliz MikeMeliz self-assigned this Oct 31, 2024
@MikeMeliz MikeMeliz merged commit 568a859 into master Oct 31, 2024
3 checks passed
@MikeMeliz MikeMeliz deleted the refactoring branch October 31, 2024 21:03