This repo was created for the pre-interview challenge at LSEG.
This project reads and processes stock exchange data from CSV files, identifies outliers in the data, and saves the results to new CSV files. The main requirements for the task are implemented in main.py. In addition, plot_stocks.py creates plots of a function fitted to the stock values; the coefficients for that function were generated using MATLAB.
- Read Multiple CSV Files: Reads one or two CSV files for each stock exchange directory.
- Data Sampling: Extracts 30 data points starting from a random timestamp from each CSV file.
- Outlier Detection: Processes the sampled data to identify outliers based on statistical analysis.
- Logging and Error Handling: Logs warnings and errors with timestamps for better traceability.
project-root/
│
├── inputs/
│ ├── exchange1/
│ │ ├── file1.csv
│ │ └── file2.csv
│ ├── exchange2/
│ │ ├── file1.csv
│ │ └── file2.csv
│ └── ...
├── output/
│ └── (outlier files will be saved here)
├── main.py
└── README.md
- Python 3.x
- Pandas library
- Clone the repository:
git clone https://github.com/Grosoiu/Grosoiu_Andrei_Submission.git
- Install the required dependencies:
pip install -r requirements.txt
- Run the script:
python main.py 1 or python main.py 2, where the argument is the number of files to process per stock exchange.
- Optionally, to run the script that plots the function fitted to the stock values:
Install the required dependencies:
pip install -r requirements_extra.txt
Run the script:
python plot_stocks.py
Create a directory for each stock exchange in the inputs folder and add CSV files for each stock. The results will appear in the output folder.
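The layout described above can be walked with a few lines of Python. This is only a sketch under assumptions: the helper name `find_csv_files` is hypothetical, and picking files in alphabetical order is a guess at how "one or two files per exchange" might be chosen; main.py may do this differently.

```python
from pathlib import Path


def find_csv_files(inputs_dir, num_files):
    """Map each exchange directory name to its first `num_files` CSV paths."""
    if num_files not in (1, 2):
        raise ValueError("num_files must be 1 or 2")
    selected = {}
    for exchange_dir in sorted(Path(inputs_dir).iterdir()):
        if not exchange_dir.is_dir():
            continue  # ignore stray files placed next to the exchange folders
        # Assumption: files are taken in alphabetical order.
        csv_paths = sorted(exchange_dir.glob("*.csv"))
        selected[exchange_dir.name] = csv_paths[:num_files]
    return selected
```

Calling `find_csv_files("inputs", 2)` on the tree shown earlier would give a dictionary keyed by `exchange1`, `exchange2`, and so on, each with at most two paths.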
- Parameters:
- num_files (int): Number of files to read (1 or 2).
- Returns:
- A dictionary mapping each stock exchange directory name to a list of dataframes, one per stock, each holding 30 data points.
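The sampling step (30 consecutive data points from a random starting position) can be illustrated as follows. This sketch uses a plain Python list to stay dependency-free, whereas the script itself works with pandas dataframes; the function name `sample_consecutive` and the `rng` parameter are hypothetical.

```python
import random


def sample_consecutive(rows, sample_size=30, rng=random):
    """Return `sample_size` consecutive rows starting at a random index."""
    if len(rows) < sample_size:
        raise ValueError(f"need at least {sample_size} rows, got {len(rows)}")
    # Choose the start so that a full window always fits in the data.
    start = rng.randint(0, len(rows) - sample_size)
    return rows[start:start + sample_size]
```

With a pandas dataframe, the same window would be `df.iloc[start:start + sample_size]`. Passing a seeded `random.Random` instance as `rng` makes the sample reproducible for testing.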
- Parameters:
- processed_data (dict): A dictionary with stock exchange directory names as keys and lists of dataframes with 30 data points each as values.
- Returns:
- Writes the outliers to the output folder, naming each file {stock_exchange}_{stock}_outlier.csv.
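This README does not spell out the statistical rule used for outlier detection, so the following is only one plausible sketch: it assumes an outlier is a price more than two standard deviations from the mean of the 30 sampled points. The function names and the threshold are hypothetical; only the file naming rule comes from the description above.

```python
import statistics


def find_outliers(prices, threshold=2.0):
    """Return (index, price) pairs further than `threshold` std devs from the mean."""
    mean = statistics.mean(prices)
    stdev = statistics.pstdev(prices)
    if stdev == 0:
        return []  # all values identical, so nothing can be an outlier
    # Assumed rule: flag points whose distance from the mean exceeds threshold * stdev.
    return [(i, p) for i, p in enumerate(prices) if abs(p - mean) > threshold * stdev]


def outlier_filename(stock_exchange, stock):
    """Build the output file name following the naming rule above."""
    return f"{stock_exchange}_{stock}_outlier.csv"
```

For example, `outlier_filename("exchange1", "file1")` yields `exchange1_file1_outlier.csv`.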
The script logs messages at several levels, each with a timestamp, to help with debugging and tracking the process flow. Log messages include warnings for missing files, critical errors for empty or insufficient data, and info messages for successfully saved outlier files. In my experience working in monitoring, I have seen how important logs are for tracking the health of software.
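A minimal sketch of timestamped logging with Python's standard `logging` module is shown here. The logger name, the message wording, and the helper `check_sample_size` are assumptions made for illustration and are not taken from main.py.

```python
import logging

# %(asctime)s adds the timestamp to every record.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
logger = logging.getLogger("outlier_detection")


def check_sample_size(file_name, row_count, required=30):
    """Log the outcome of a size check and return True if the file is usable."""
    if row_count == 0:
        logger.critical("%s is empty", file_name)
        return False
    if row_count < required:
        logger.critical("%s has %d rows, %d required", file_name, row_count, required)
        return False
    logger.info("%s has enough data (%d rows)", file_name, row_count)
    return True
```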
The script generates fitting plots for each stock and saves them in the output directory. The coefficients were calculated using MATLAB's polyfit function. Examples:
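As a rough illustration of how such coefficients can be turned into fitted values in Python: MATLAB's polyfit returns coefficients ordered from the highest power to the constant term, so they can be evaluated with Horner's method. The function name and the coefficient values below are made up for the example and are not the ones used in plot_stocks.py.

```python
def evaluate_polynomial(coefficients, x):
    """Evaluate a polynomial whose coefficients are ordered highest power first."""
    result = 0.0
    for c in coefficients:
        # Horner's method: fold in one coefficient per step.
        result = result * x + c
    return result


# Hypothetical coefficients for 2*x**2 - 3*x + 5.
fitted_value = evaluate_polynomial([2, -3, 5], 4)
```

Evaluating the function at each timestamp index gives the curve that is drawn over the raw stock values.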