- Software Title: DAPP (Detector of Anonymized Parts in PDFs)
- Description: DAPP is a tool for analyzing PDF documents to detect anonymized sections. It is designed as a web service that receives input data as an HTTP request with local paths or URLs to PDF files and returns analysis results in JSON format.
- Purpose and Description
- Usage
- Objectives and Requirements
- Architecture and Design
- Testing and Validation
- Timeline
- Configuration (Versioning)
- Assumptions
- Limitations
- Install .NET 7.0 or higher
- Install ImageMagick (https://imagemagick.org/)
- Install Ghostscript (https://www.ghostscript.com/)
- Clone the repository.
git clone https://github.com/Oranged9922/detection-anonymized-parts-in-pdfs-bachelor-thesis.git
- Navigate to the project folder and build the solution.
cd implementation/Dapp
dotnet build
Run the console application (in the ConsoleApp folder) with the following command-line options:
- --file-location (mandatory): Path to the file to analyze.
- --return-images (optional, default=false): Set to true if you want images to be returned.
- --output-folder (optional): Directory where the images will be saved. If not specified, images won't be saved.
Analyzing a document without returning images:
dotnet run -- --file-location /path/to/file
Analyzing a document and returning images, but not saving them:
dotnet run -- --file-location /path/to/file --return-images true
Analyzing a document, returning and saving images:
dotnet run -- --file-location /path/to/file --return-images true --output-folder /path/to/output/folder
The console will print the JSON data received from the API, including the document ID. If an output folder is specified, images will be saved as original_{i}.jpg and result_{i}.jpg.
That's it. Follow the steps to effectively use the console application.
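For scripted use, the three invocations above can be assembled programmatically. The following is a minimal sketch in Python, not part of DAPP itself; the helper name build_dapp_command is hypothetical, and only the flags documented above are used. It assumes the command is run from the ConsoleApp folder.

```python
from typing import List, Optional


def build_dapp_command(file_location: str,
                       return_images: bool = False,
                       output_folder: Optional[str] = None) -> List[str]:
    """Assemble the `dotnet run` argument list for the console application.

    Hypothetical helper for illustration; it only mirrors the documented flags.
    """
    if not file_location:
        raise ValueError("--file-location is mandatory")
    # Everything after the bare "--" is passed to the application, not to dotnet.
    command = ["dotnet", "run", "--", "--file-location", file_location]
    if return_images:
        command += ["--return-images", "true"]
    if output_folder is not None:
        command += ["--output-folder", output_folder]
    return command


if __name__ == "__main__":
    # Print the command line for the "return and save images" case.
    print(" ".join(build_dapp_command("/path/to/file", True, "/path/to/output/folder")))
```

The returned list can be handed to subprocess.run(command, cwd=...) with cwd pointing at the ConsoleApp folder.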
The aim of this project is to develop a tool that detects anonymized sections in PDF documents. The software will be written in C# and will expose a minimal API. The output will be in JSON format and will describe the analyzed PDF: the number of pages, the percentage of anonymization on each page, and the overall average anonymization.
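To make the shape of that output concrete, here is a minimal sketch in Python of how such a summary could be assembled from per-page percentages. The field names (pageCount, anonymizationPerPage, averageAnonymization) are assumptions made for illustration and are not taken from the DAPP API. The sketch also assumes that the overall average is the unweighted mean of the per-page values.

```python
import json
from typing import Dict, List


def summarize(page_percentages: List[float]) -> Dict[str, object]:
    """Build a result summary from per-page anonymization percentages.

    Field names are hypothetical; the real DAPP response may differ.
    """
    if not page_percentages:
        raise ValueError("document has no pages")
    # Assumption: overall average = unweighted mean over all pages.
    average = sum(page_percentages) / len(page_percentages)
    return {
        "pageCount": len(page_percentages),
        "anonymizationPerPage": list(page_percentages),
        "averageAnonymization": round(average, 2),
    }


if __name__ == "__main__":
    # Three pages: untouched, a quarter redacted, half redacted.
    print(json.dumps(summarize([0.0, 25.0, 50.0]), indent=2))
```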
- The software must be able to accept an HTTP request containing a URL link to a PDF file or a local path to the file.
- The software must be able to read and process PDF files.
- The software must be able to detect anonymized sections in PDF documents.
- The software must be able to analyze and calculate the percentage of anonymization on each page of the PDF file and the overall average anonymization.
- The software must be able to return the results in a specified JSON format containing the number of pages, the percentage of anonymization on each page, and the overall average anonymization.
- The software must be accompanied by appropriate documentation, both user and developer.
- The software must be written in C#.
- The software will use the minimal API interface for receiving requests and returning results.
- The software must be compatible with .NET 7 or higher.
- The software must support the processing of PDF files.
- The software must support the JSON format for output data.
- Client-Server: The client sends requests to the server, which processes PDF files and returns results in JSON format.
- Language: C#
- Framework: .NET 7
- API: Minimal API
- PDF Processing: Custom implementation of analyzer for working with PDFs in C#
- Input: HTTP Request (URL link to a PDF file or a local path)
- Output: JSON Response (Number of pages, percentage of anonymization on each page, overall average anonymization)
- PDF File Reader: Custom implementation using third-party libraries to read and analyze PDF files.
- Anonymization Detection: Custom implementation using image processing and data analysis.
- Unit Testing: For each implemented method and function.
- Integration Testing: For testing the interaction between different components.
- Acceptance Testing: For verifying that the application meets the requirements specified.
- The output results will be validated by analyzing a set of sample PDFs that contain known amounts of anonymized sections.
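The detection and validation items above can be illustrated together. The sketch below is in Python and is not DAPP's detector: it substitutes a deliberately naive technique (the share of dark pixels in a grayscale page image) for the project's custom image processing, so it would also count dark text as anonymized. The threshold, the tolerance, and both function names are assumptions.

```python
from typing import Sequence

# Assumption: grayscale values 0-255; values at or below this count as redacted.
DARK_THRESHOLD = 40


def anonymized_percentage(page: Sequence[Sequence[int]],
                          threshold: int = DARK_THRESHOLD) -> float:
    """Return the percentage of pixels on a page that are at or below the threshold."""
    total = sum(len(row) for row in page)
    if total == 0:
        return 0.0
    dark = sum(1 for row in page for value in row if value <= threshold)
    return 100.0 * dark / total


def matches_expected(measured: Sequence[float],
                     expected: Sequence[float],
                     tolerance: float = 1.0) -> bool:
    """Check per-page results against known values, within a tolerance in percentage points."""
    if len(measured) != len(expected):
        return False
    return all(abs(m - e) <= tolerance for m, e in zip(measured, expected))


if __name__ == "__main__":
    # A 2x4 "page" in which 2 of 8 pixels are black.
    sample_page = [[0, 0, 255, 255],
                   [255, 255, 255, 255]]
    measured = [anonymized_percentage(sample_page)]
    print(measured, matches_expected(measured, [25.0]))
```

A validation run over sample PDFs with known anonymized amounts would call matches_expected once per document and report the documents that fall outside the tolerance.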
- Define Objectives and Requirements: [Date]
- Develop Preliminary Design: [Date]
- Implement Core Features: [Date]
- Testing: [Date]
- Finalize and Deploy: [Date]
The project will be maintained on GitHub, and versioning will be managed through Git. Periodic backups and branches for new features will be created.
- The user will provide valid paths or URLs to PDF files.
- The PDF files will be properly formatted.
- The accuracy of anonymization detection may vary depending on the quality and content of the PDFs.
- The software may not be able to handle very large PDF files due to hardware limitations.
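The first two assumptions above can be checked cheaply before any analysis starts. The following Python sketch is an illustration only and is not taken from the DAPP code base; the function names are hypothetical. It distinguishes an HTTP(S) URL from a local path and checks for the "%PDF-" marker with which PDF files begin.

```python
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "url" for an http(s) URL with a host, otherwise "path"."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    # Anything else, including Windows paths such as C:\docs\a.pdf, is treated as a path.
    return "path"


def looks_like_pdf(first_bytes: bytes) -> bool:
    """Check whether the data starts with the PDF header marker."""
    return first_bytes.startswith(b"%PDF-")


if __name__ == "__main__":
    for candidate in ("https://example.com/report.pdf", "/tmp/report.pdf"):
        print(candidate, "->", classify_source(candidate))
    print(looks_like_pdf(b"%PDF-1.7\n"))
```

This check only confirms the header; it does not guarantee that the rest of the file is well formed.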
The DAPP system will be a valuable tool for detecting anonymized sections in PDF documents, providing quick and accurate results. By following the specified objectives and requirements, the project is expected to meet the needs of the users effectively.