File_Search_Server is a project that implements and benchmarks several file search algorithms for finding strings in large text files. This README covers setup instructions, descriptions of the implemented algorithms, how to run the benchmarks, and how to interpret the results.
- Overview
- Project Structure
- Setup Instructions
- Usage Instructions
- Running as a Linux Service
- Unit Testing
- Implemented Algorithms
- Running the Benchmarks
- Analyzing the Results
- Limitations and Recommendations
- Future Enhancements
- Conclusion
The project is organized into the following directories and files:
File_Search_Server/
│
├── README.md
├── requirements.txt
├── setup.sh
├── 200k.txt
├── server.key
├── server.csr
├── server.crt
├── server.py
├── client.py
├── config.ini
├── file-search_algorithms.py
├── benchmark_results.txt
├── test-suite_server.py
├── multiple-queries_simulation.py
├── speed_report.py
└── benchmark_chart.png
Ensure you have the following installed on your machine:
- Python 3.7+
- pip (Python package installer)
- Set up the project directory and the virtual environment.
- Install the required Python packages:
pip install -r requirements.txt
Running the Server
To start the server, run:
python server.py
Running the Client
To interact with the server, run:
python client.py
To run the File_Search_Server as a Linux daemon or service, follow these steps:
- Create a systemd service file:
sudo nano /etc/systemd/system/File_Search_Server.service
- Add the following content to the service file:
[Unit]
Description=AS Introductory Task File Search Server
After=network.target
[Service]
User=yourusername
WorkingDirectory=/path/to/File_Search_Server
ExecStart=/usr/bin/python3 /path/to/File_Search_Server/server.py
Restart=always
[Install]
WantedBy=multi-user.target
Replace /path/to/File_Search_Server with the actual path to the project directory and yourusername with your Linux username.
- Reload the systemd daemon to recognize the new service:
sudo systemctl daemon-reload
- Start the service:
sudo systemctl start File_Search_Server.service
- Enable the service to start on boot:
sudo systemctl enable File_Search_Server.service
- Check the status of the service:
sudo systemctl status File_Search_Server.service
The File_Search_Server should now be running as a Linux service.
You can stop the service with
sudo systemctl stop File_Search_Server.service
and restart it with
sudo systemctl restart File_Search_Server.service
- Querying the server: Navigate to the project directory where the client.py script is located:
cd /path/to/File_Search_Server
- Run the client script:
python client.py
- Enter the string you want to search for in the 200k.txt file.
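For readers who want to script queries instead of typing them, the exchange above could be sketched roughly as follows. This is not the code in client.py: the wire format (a newline-terminated UTF-8 request and a short UTF-8 reply) and the helper names `query_server` and `run_once` are assumptions made for illustration, and the host and port must match whatever server.py actually listens on.

```python
import socket


def query_server(sock: socket.socket, text: str, bufsize: int = 1024) -> str:
    """Send one search string over a connected socket and return the reply.

    Assumes a newline-terminated UTF-8 request and a short UTF-8 reply that
    arrives in a single recv() call; the real protocol may differ.
    """
    sock.sendall(text.encode("utf-8") + b"\n")
    return sock.recv(bufsize).decode("utf-8").strip()


def run_once(host: str, port: int, text: str) -> str:
    """Open a plain TCP connection, send a single query, and close it."""
    with socket.create_connection((host, port), timeout=5) as sock:
        return query_server(sock, text)
```

The certificate files in the project tree suggest the server may use TLS; if so, the socket would need to be wrapped with `ssl.SSLContext.wrap_socket` before calling `query_server`.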
Run the test suite with pytest to execute all the unit test cases:
pytest -vv test-suite_server.py
This will run all the test cases and give you a summary of the test session results.
The following file search algorithms are implemented in this project:
- Naive Search: A straightforward approach that checks each position in the text for a match.
- Binary Search: Requires the data to be sorted; it divides the search interval in half repeatedly.
- Knuth-Morris-Pratt (KMP) Algorithm: Uses the preprocessing of the pattern to avoid unnecessary comparisons.
- Boyer-Moore Algorithm: Skips sections of the text, making it efficient for large alphabets and long patterns.
- Rabin-Karp Algorithm: Uses hashing to find any one of a set of pattern strings in a text.
- Z Algorithm: Computes the Z array, which is used for pattern matching in linear time.
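To make three of the descriptions above concrete, here is a minimal sketch of KMP, the Z algorithm, and a binary search over sorted lines. These are textbook versions written for illustration, not the code in file-search_algorithms.py; the function names are hypothetical, and `z_search` assumes the separator character never appears in the pattern or the text.

```python
from bisect import bisect_left
from typing import List


def kmp_search(text: str, pattern: str) -> List[int]:
    """Return the start index of every occurrence of pattern in text (KMP)."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = failure[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = failure[k - 1]
    return matches


def z_array(s: str) -> List[int]:
    """z[i] = length of the longest substring starting at i that matches a
    prefix of s (z[0] is left as 0 by convention)."""
    n = len(s)
    z = [0] * n
    left = right = 0  # [left, right) is the rightmost prefix match seen so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def z_search(text: str, pattern: str, sep: str = "\x00") -> List[int]:
    """Return all match positions using the Z array of pattern + sep + text."""
    if not pattern:
        return []
    z = z_array(pattern + sep + text)
    offset = len(pattern) + 1
    return [i - offset for i in range(offset, len(z)) if z[i] == len(pattern)]


def sorted_line_exists(sorted_lines: List[str], query: str) -> bool:
    """Binary search for an exact line; the list must already be sorted."""
    i = bisect_left(sorted_lines, query)
    return i < len(sorted_lines) and sorted_lines[i] == query
```

Note that the binary search answers a different question from the other two: it looks up a whole line in sorted data, whereas KMP and the Z algorithm find a pattern anywhere inside unsorted text.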
The config.ini file contains the following setting:
- REREAD_ON_QUERY: If set to True, the data will be re-read from the file on every query. If False, the data will be read once and reused for all queries.
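The effect of this setting can be sketched as follows. The section name `server`, the `EXAMPLE_CONFIG` contents, the `LineStore` class, and the whole-line matching rule are all assumptions for illustration; the actual layout of config.ini and the logic in server.py may differ.

```python
import configparser
from pathlib import Path
from typing import Optional, Set

# Hypothetical file contents; the section name "server" is an assumption.
EXAMPLE_CONFIG = """
[server]
REREAD_ON_QUERY = False
"""


def load_reread_flag(config_text: str, section: str = "server") -> bool:
    """Parse REREAD_ON_QUERY from INI text, defaulting to False if absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    return parser.getboolean(section, "REREAD_ON_QUERY", fallback=False)


class LineStore:
    """Answers whole-line lookups, re-reading or caching the file as configured."""

    def __init__(self, path: str, reread_on_query: bool) -> None:
        self.path = Path(path)
        self.reread_on_query = reread_on_query
        self._cache: Optional[Set[str]] = None

    def _read(self) -> Set[str]:
        with self.path.open(encoding="utf-8") as handle:
            return {line.rstrip("\n") for line in handle}

    def contains(self, query: str) -> bool:
        if self.reread_on_query:
            # Always reflects the file's current contents, at the cost of I/O.
            return query in self._read()
        if self._cache is None:
            # Read once on first use, then reuse for every later query.
            self._cache = self._read()
        return query in self._cache
```

With re-reading enabled, edits to the data file are visible on the next query; with it disabled, queries are faster but only see the contents as of the first read.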
To benchmark the search algorithms, run the file-search_algorithms.py script:
python file-search_algorithms.py
This script will benchmark each algorithm on the generated test files and save the results to benchmark_results.txt.
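As a rough idea of what such a benchmark involves, the sketch below times arbitrary search callables over synthetic texts of increasing size. It is not the project's benchmark script: the helper names, the synthetic `line-N` data, and the tab-separated output layout are all assumptions, and the real format of benchmark_results.txt may differ.

```python
import time
from typing import Callable, Dict, List

SearchFn = Callable[[str, str], object]


def time_search(search: SearchFn, text: str, pattern: str, repeats: int = 5) -> float:
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        search(text, pattern)
        best = min(best, time.perf_counter() - start)
    return best


def benchmark(
    algorithms: Dict[str, SearchFn], sizes: List[int], pattern: str
) -> Dict[str, Dict[int, float]]:
    """Time every algorithm on a synthetic text with `size` lines."""
    results: Dict[str, Dict[int, float]] = {}
    for name, search in algorithms.items():
        results[name] = {}
        for size in sizes:
            # Synthetic stand-in for the generated test files.
            text = "\n".join(f"line-{i}" for i in range(size))
            results[name][size] = time_search(search, text, pattern)
    return results


def format_results(results: Dict[str, Dict[int, float]]) -> str:
    """Render results as tab-separated rows: algorithm, lines, seconds."""
    rows = []
    for name, by_size in results.items():
        for size, seconds in by_size.items():
            rows.append(f"{name}\t{size}\t{seconds:.6f}")
    return "\n".join(rows)
```

For example, `benchmark({"str.find": lambda t, p: t.find(p)}, [10_000, 100_000], "line-99999")` would time Python's built-in substring search as a baseline. Taking the minimum over several repeats reduces noise from other processes on the machine.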
Once you have the benchmark results, generate the speed report by running the speed_report.py script:
python speed_report.py
This will create two charts of the benchmark results: benchmark_chart_reread.png for REREAD_ON_QUERY = True, and benchmark_chart_no_reread.png for REREAD_ON_QUERY = False.
The benchmark results are saved in benchmark_results.txt and can be visualized using the generated charts, benchmark_chart_reread.png and benchmark_chart_no_reread.png. The results include:
- Execution time for each algorithm across different file sizes.
- Comparative performance analysis of all algorithms with and without re-reading the file.
From the generated charts, key observations include:
- Naive Search shows a linear increase in execution time with file size.
- Binary Search performs well for smaller datasets but requires sorted data.
- KMP Algorithm and Boyer-Moore Algorithm demonstrate efficient performance for large datasets.
- Rabin-Karp Algorithm is effective for multiple pattern searches but less efficient for very large datasets.
- Z Algorithm outperforms the other algorithms in these benchmarks while running in linear time.
- Impact of REREAD_ON_QUERY: Comparing the two charts shows how re-reading the file affects the performance of each algorithm.
- Memory Usage: Higher memory consumption for certain algorithms (e.g., Rabin-Karp).
- Execution Time: Inefficiency of Naive Search and Binary Search for large datasets.
- Scalability: Performance degradation with high concurrency or extremely large datasets.
- File Re-reading Overhead: Additional overhead when REREAD_ON_QUERY is set to True, which might affect execution time.
- Use Z Algorithm for Large Datasets: Implement the Z Algorithm for efficient performance.
- Optimize Memory Usage: Optimize algorithms to reduce memory consumption.
- Enhance Scalability: Implement load balancing and optimize server code for better scalability.
- Assess Impact of REREAD_ON_QUERY: Choose the REREAD_ON_QUERY setting based on specific use cases to balance between overhead and performance.
- Further Testing: Conduct additional tests with diverse data and query patterns.
- Advanced Query Options: Support regex and fuzzy searching.
- Real-Time Analytics: Monitor server performance and client query statistics in real-time.
- Scalability Improvements: Explore distributed computing techniques.
- Enhanced Benchmarking: Incorporate more detailed benchmarking to evaluate the impact of file re-reading and other parameters on performance.
The File_Search_Server project provides an efficient solution for searching strings in large text files. By benchmarking various algorithms, we have identified the most suitable algorithm for different scenarios. The project offers a robust framework for further enhancements and optimizations.