Instead of converting columns with `pd.to_datetime` after calling `pd.read_csv()`, we can do it directly in the function call by passing a `parse_dates` argument:
```python
import pandas as pd
from pathlib import Path

rides = pd.read_csv(
    Path('green_tripdata_2019-01.csv'),
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime']
)
rides.info()  # info() prints directly; wrapping it in print() would also print "None"
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 630918 entries, 0 to 630917
Data columns (total 20 columns):
 #   Column                 Non-Null Count   Dtype
---  ------                 --------------   -----
 0   VendorID               630918 non-null  int64
 1   lpep_pickup_datetime   630918 non-null  datetime64[ns]
 2   lpep_dropoff_datetime  630918 non-null  datetime64[ns]
(...)
dtypes: datetime64[ns](2), float64(10), int64(7), object(1)
```
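For comparison, here is a minimal sketch of the post-hoc approach the first sentence refers to, converting the columns with `pd.to_datetime` after the file has been read (same file and column names assumed as above):

```python
import pandas as pd
from pathlib import Path

# Without parse_dates, the datetime columns arrive as plain object (string) dtype.
rides = pd.read_csv(Path('green_tripdata_2019-01.csv'))

# Convert each column explicitly after the fact.
for col in ['lpep_pickup_datetime', 'lpep_dropoff_datetime']:
    rides[col] = pd.to_datetime(rides[col])
```

Both approaches produce the same `datetime64[ns]` dtypes; `parse_dates` just saves the extra step.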
If your goal is to partition the ingestion of the data into a SQL database (rather than the reading of the `.csv` file into memory), you can do it directly with the `chunksize` argument of `pd.DataFrame.to_sql()`, e.g.:
```python
import pandas as pd
from pathlib import Path
from sqlalchemy import create_engine

HOST = ...
PORT = ...
USER = ...
PWD = ...
DB_NAME = ...

engine = create_engine(f'postgresql://{USER}:{PWD}@{HOST}:{PORT}/{DB_NAME}')

rides = pd.read_csv(
    Path('green_tripdata_2019-01.csv'),
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime']
)

# chunksize splits the write into batches of 10,000 rows per INSERT.
rides.to_sql(name='rides', con=engine, index=False, if_exists='append', chunksize=10000)
```
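If you also want to avoid loading the whole `.csv` into memory at once, `pd.read_csv` itself accepts a `chunksize` argument and then returns an iterator of DataFrames instead of one big frame. A sketch, reusing the `engine` created in the previous snippet (the chunk size of 100,000 rows is an arbitrary choice):

```python
import pandas as pd
from pathlib import Path

# With chunksize set, read_csv yields DataFrames of at most 100,000 rows,
# so memory usage stays bounded regardless of the file size.
chunks = pd.read_csv(
    Path('green_tripdata_2019-01.csv'),
    parse_dates=['lpep_pickup_datetime', 'lpep_dropoff_datetime'],
    chunksize=100_000,
)

for chunk in chunks:
    # if_exists='append' adds each chunk to the same table.
    chunk.to_sql(name='rides', con=engine, index=False, if_exists='append')
```

This partitions both the reading and the ingestion, at the cost of pandas inferring dtypes per chunk rather than over the whole file.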
As mentioned in this stackoverflow answer, we can access the host's network - and thus `localhost` - from within a container by passing `--network="host"` to the `docker run` command, e.g.:
```bash
docker run -it --network="host" --entrypoint=bash ingest:latest
```
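With host networking in place, a connection string pointing at `localhost` from inside the container reaches the database running on the host. A quick sketch of a connectivity check, assuming Postgres on its default port 5432 and the same placeholder credentials as above:

```python
from sqlalchemy import create_engine, text

USER = ...     # same placeholders as in the ingestion snippet
PWD = ...
DB_NAME = ...

# Thanks to --network="host", localhost here resolves to the host machine;
# port 5432 is Postgres's default and an assumption about your setup.
engine = create_engine(f'postgresql://{USER}:{PWD}@localhost:5432/{DB_NAME}')

with engine.connect() as conn:
    # SELECT 1 is a cheap way to confirm the container can reach the host DB.
    print(conn.execute(text('SELECT 1')).scalar())
```

Note that `--network="host"` only behaves this way on Linux; on Docker Desktop (macOS/Windows) the usual workaround is the special hostname `host.docker.internal` instead.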