Document Similarity Checker

This tool helps find duplicate and near-duplicate files in a directory tree.
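To make "duplicate files in a directory tree" concrete, here is a minimal sketch of one common way to detect exact duplicates: hashing each file's contents and grouping files that share a digest. This is an illustration only, not necessarily how this tool works internally; the function name `find_exact_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_exact_duplicates(root):
    """Group files under `root` whose contents are byte-for-byte identical.

    Hypothetical helper for illustration; not part of this repository.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Identical contents produce identical SHA-256 digests.
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    # Keep only digests shared by two or more files.
    return [paths for paths in groups.values() if len(paths) > 1]
```

Hashing only catches files that are exactly the same. Near-duplicates need a similarity measure, which is what the sensitivity setting described under Usage controls.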

Reproducibility

Creating the .exe file

To create the executable, follow these steps.

Clone the repo
Install Python
Install dependencies (preferably in a virtual environment)

# Open your command line and navigate to where you cloned this repo.

# Create virtual environment
python -m venv venv

# Activate the virtual environment
venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install package to create the executable
pip install cx_freeze

Modify the venv\Lib\site-packages\streamlit\web\cli.py script

# Paste the following function inside the cli.py script
def main_run_clExplicit(file, command_line, args=[], flag_options=[]):
    main.is_running_with_streamlit = True
    bootstrap.run(file, command_line, args, flag_options)

Run this command: python setup.py build
Copy the .streamlit/ folder and the app.py script into the new build/exe.win-amd64-3.10/ folder.
Copy the streamlit, imapclient, and sklearn folders from venv/Lib/site-packages/ to build/exe.win-amd64-3.10/lib/, replacing any duplicate files.
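The two copy steps above can also be scripted. The following is a minimal sketch, assuming Python 3.8 or later (for `dirs_exist_ok`); the function `copy_build_assets` and its parameters are hypothetical and not part of this repository, so adjust the paths to your checkout.

```python
import shutil
from pathlib import Path


def copy_build_assets(repo_root, build_dir, site_packages, app_script,
                      packages=("streamlit", "imapclient", "sklearn")):
    """Copy the files the frozen app needs into the cx_Freeze build folder.

    Hypothetical helper; all paths are supplied by the caller.
    """
    repo_root, build_dir = Path(repo_root), Path(build_dir)
    site_packages, app_script = Path(site_packages), Path(app_script)

    # The .streamlit/ folder and app.py go at the top level of the build folder.
    shutil.copytree(repo_root / ".streamlit", build_dir / ".streamlit",
                    dirs_exist_ok=True)
    shutil.copy2(app_script, build_dir / "app.py")

    # Package folders go under lib/; files that already exist are overwritten.
    for name in packages:
        shutil.copytree(site_packages / name, build_dir / "lib" / name,
                        dirs_exist_ok=True)
```

For example, from the repo root you might call `copy_build_assets(".", "build/exe.win-amd64-3.10", "venv/Lib/site-packages", "deduplication/app.py")`. The location of app.py is an assumption here; point `app_script` at wherever the script lives in your checkout.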

If you want to develop the tool, activate the virtual environment and run the command streamlit run deduplication/app.py. The tool will then run without needing to build the executable.

Code flow diagram

Usage

This application is distributed as an executable file.

Note: The .exe file must be in the same folder as the app.py script and the .streamlit folder.

Once the application is open, paste the path to the folder whose documents you want to check for duplicates, then press Enter.

The list of duplicates will appear at the bottom after the application finishes analyzing the files.

You can then select a similarity sensitivity to find documents that are highly similar to one another.

Start with a low sensitivity and increase it gradually to see how this parameter affects the clustering.
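To illustrate why a similarity setting changes the clustering, here is a small standard-library sketch. It is not this tool's implementation: it assumes a simple Jaccard similarity over word sets and groups documents whenever their similarity reaches a threshold. How the tool's sensitivity maps onto such a threshold is not specified here, and the names `jaccard` and `cluster_by_threshold` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts (0.0 to 1.0)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster_by_threshold(docs, threshold):
    """Group document names linked by pairwise similarity >= threshold.

    `docs` maps a document name to its text. Illustrative sketch only.
    """
    parent = {name: name for name in docs}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    # Merge the clusters of every pair that is similar enough.
    for a, b in combinations(docs, 2):
        if jaccard(docs[a], docs[b]) >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for name in docs:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in clusters.values())
```

In this sketch, a permissive threshold merges loosely related documents into the same cluster, while a strict one leaves only close matches grouped together. Changing the setting step by step makes it easy to see which groupings appear or disappear.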

farrael004/Document-Duplicate-Checker

Folders and files

Latest commit

History

Repository files navigation

Document Similarity Checker

Reproducibility

Creating the .exe file

Code flow diagram

Usage

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages