This tool helps find duplicate and near-duplicate files in a directory tree.
To create the executable, follow these steps.
- Clone the repo
- Install Python
- Install dependencies (preferably in a virtual environment)
# Open your command line and navigate to where you cloned this repo.
# Create virtual environment
python -m venv venv
# Activate the virtual environment
venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install package to create the executable
pip install cx_freeze
- Modify the venv\Lib\site-packages\streamlit\web\cli.py script
# Paste the following function inside the cli.py script
def main_run_clExplicit(file, command_line, args=[], flag_options=[]):
    main.is_running_with_streamlit = True
    bootstrap.run(file, command_line, args, flag_options)
- Run this command:
python setup.py build
- Copy the .streamlit/ folder and the app.py script into the new build/exe.win-amd64-3.10/ folder.
- Copy the streamlit, imapclient, and sklearn folders from venv/Lib/site-packages/ to build/exe.win-amd64-3.10/lib/, replacing any duplicate files.
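The two copy steps above can also be scripted. The following is a minimal sketch and not part of the repository: the function name `copy_runtime_files`, its parameters, and the `PACKAGES` constant are hypothetical, and it assumes `app.py` sits directly in the repository root. Adjust the paths if your layout or Python version differs.

```python
import shutil
from pathlib import Path

# Package folders that the steps above say to copy from site-packages.
PACKAGES = ("streamlit", "imapclient", "sklearn")


def copy_runtime_files(repo_root, build_dir, site_packages, app_script="app.py"):
    """Copy .streamlit/, the app script, and the listed packages into the build folder.

    Illustrative helper: all names here are assumptions, not project API.
    Returns the list of destination paths that were written.
    """
    repo_root, build_dir, site_packages = map(Path, (repo_root, build_dir, site_packages))
    written = []

    # The .streamlit/ folder and the app script go next to the executable.
    shutil.copytree(repo_root / ".streamlit", build_dir / ".streamlit", dirs_exist_ok=True)
    written.append(build_dir / ".streamlit")
    script_dest = build_dir / Path(app_script).name
    shutil.copy2(repo_root / app_script, script_dest)
    written.append(script_dest)

    # Package folders go into lib/, overwriting files that already exist there.
    for name in PACKAGES:
        dest = build_dir / "lib" / name
        shutil.copytree(site_packages / name, dest, dirs_exist_ok=True)
        written.append(dest)
    return written


# Example call, using the paths named in the steps above:
# copy_runtime_files(".", "build/exe.win-amd64-3.10", "venv/Lib/site-packages")
```

`shutil.copytree(..., dirs_exist_ok=True)` requires Python 3.8 or later and overwrites files that already exist at the destination, which matches the "replace any duplicate files" instruction.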
To develop the tool, activate the virtual environment and run the command streamlit run deduplication/app.py
and the tool will run without needing to build the executable.
An executable file is distributed for running this application.
Note: The .exe file must be in the same folder as the app.py script and the .streamlit folder.
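If the executable fails to start, it may help to confirm that this layout is in place. The sketch below is one way to check it; `missing_runtime_items` is a hypothetical helper name, not something shipped with the tool.

```python
from pathlib import Path


def missing_runtime_items(exe_path):
    """Return the required items that are absent from the .exe file's folder.

    Checks for the app.py script and the .streamlit folder next to the
    executable, as described in the note above.
    """
    folder = Path(exe_path).resolve().parent
    missing = []
    if not (folder / "app.py").is_file():
        missing.append("app.py")
    if not (folder / ".streamlit").is_dir():
        missing.append(".streamlit")
    return missing
```

An empty list means both items were found alongside the path you passed in.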
Once the application is open, paste the path to the folder whose documents you want to check for duplicates, then press Enter.
The list of duplicates will appear at the bottom after the application finishes analyzing the files.
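For intuition, exact duplicates can be found by grouping files whose contents hash to the same value. The sketch below illustrates that idea with the standard library only. It is not the tool's actual implementation, and `file_digest` and `find_exact_duplicates` are hypothetical names.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Walk a directory tree and return groups of byte-identical files."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    # Keep only digests shared by two or more files.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Hashing only catches files that are identical byte for byte; near-duplicates need a similarity measure instead.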
You can then select an appropriate similarity sensitivity to find documents that have a high degree of similarity.
Start with lower sensitivities and increase gradually to see how this parameter affects the clustering.
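To see why a similarity setting changes the clustering, consider the toy sketch below. It is an assumption-laden illustration, not the tool's algorithm: it uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity measure, and the `threshold` parameter is hypothetical. How the tool's sensitivity setting maps to such a threshold is not documented here.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def group_similar(texts, threshold):
    """Greedily group texts by similarity.

    Each text joins the first group whose first member it matches at or
    above the threshold; otherwise it starts a new group.
    """
    groups = []
    for text in texts:
        for group in groups:
            if similarity(text, group[0]) >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


samples = [
    "invoice 2021 final",
    "invoice 2021 final",
    "invoice 2022 final",
    "holiday photos",
]
strict = group_similar(samples, 1.0)   # only identical texts share a group
relaxed = group_similar(samples, 0.9)  # the two invoices are also merged
```

In this sketch a stricter threshold tends to produce more, smaller groups, while a looser one merges more documents together, which is why it is worth changing the setting gradually and watching the result.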