The csv_loader
function efficiently loads a partial portion of a large CSV file containing time-series data into a pandas DataFrame.
The function allows:
- Loading the last N lines from the end of the file.
- Loading the last N lines from a specific date.
It can load any type of time-series (both timezone aware and Naive) and daily or intraday data.
csv_loader_v2
is the latest update and recommended to use.
- It is faster than the previous version and more robust. See performance below.
- If a file is less than 19kb or less than chunk size it is fully loaded into memory and returned. This is faster for smaller files.
- Algorithm has been simplified to run faster.
- Unit tests has been rewritten to cover a wide variety of scenarios.
csv_loader_v1
is the original version and works well for daily data, but had issues loading intraday data due to its underlying algorithm.
There will be no further development or bugfix on csv_loader_v1
. I have left it here for comparison and will eventually be removed.
Loading a portion of a large file is significantly faster than loading the entire file in memory. Files used in the test were not particularly large. You may need to tweak the chunk_size parameter for your use case.
It is slower for smaller files or if you're loading nearly the entire portion of the file.
Here is the broad algorithm without getting too detailed. The function is well documented and commented on for understanding.
Assume you wish to load 160 lines from the end of the file. Dates can be any format passed to pandas.to_datetime.
- Read the first line to get the column header.
- Seek to the end of the file.
- Read the last N bytes (Chunk) of the file.
- On the first chunk, get a count of line breaks (
\n
) in the chunk to estimate lines per chunk. - On every chunk,
- Update the number of lines read by adding the lines per chunk.
- Store the chunks in a list
- Once we have the desired number of lines,
- Combine the column header and final chunk, append it to the list
- Reverse the list and join the list into a string.
- Load it into a Pandas DataFrame and return the slice of data required.
If end_date argument is specified, we parse the first date string in the chunk and check if we're past the end_date. If yes, we continue from Step 5, until desired number of lines have been loaded.
At the minimum, the CSV file must contain a Date and another column with newline chars at the end to correctly parse and load.
Date,Price\n
2023-12-01,200\n
def csv_loader_v2(
file_path: Path,
period: int = 160,
end_date: Optional[datetime] = None,
date_column_name: str = "Date",
chunk_size: int = 1024 * 6,
) -> pd.DataFrame
Parameters:
- file_path (Path): The path to the CSV file to be loaded.
- period (int): Number of lines/candles to return. The default is 160.
- end_date (Optional[datetime]): Load N lines up to this date.
- If None, will load the last N lines from file
- If the date is provided, load the last N lines from this date
- date_column_name (str): Name of the date column. Defaults to
Date
- chunk_size (int): The size of data chunks loaded into memory. The default is 6144 bytes (6 KB).
I chose a 6Kb chunk size based on testing with my specific requirements.
csv_loader_v2.py
is the main file containing the function.
run.py
measures the execution time of csv_loader_v2 vs. loading the entire file in Pandas DataFrame.
test_csv_loader.py
is the unit test file.
To run the test:
py -m unittest discover src/
You are free to use and improve the function as you see fit for your requirements.