Add possibility to iterate over table data (streaming) #440
Comments
You mention that pandas uses the Python read engine under the hood; afaics that is correct when reading up on the topic in the blogpost linked here. Alternatively, you can buffer the data and return them once they reach the desired chunk_size:
import pandas as pd
import pyarrow as pa
import pyarrow.csv as csv

data_file = "files.csv"

block_size = 141
block_size = 19  # returns same result as streaming from PARQUET
read_options = csv.ReadOptions(column_names=["file", "int"], skip_rows=1, block_size=block_size)
# read_options = csv.ReadOptions(column_names=["file", "int"], skip_rows=1)
# db is an audformat database loaded elsewhere, providing the CSV schema
convert_options = csv.ConvertOptions(column_types=db["files"]._pyarrow_csv_schema())
chunk_size = 10
stream = csv.open_csv(data_file, read_options=read_options, convert_options=convert_options)
df_spillover = None
for next_chunk in stream:
    if next_chunk is None:
        break
    df = pa.Table.from_batches([next_chunk]).to_pandas()
    if df_spillover is not None:
        df = pd.concat([df_spillover, df], axis=0)
    # split the accumulated frame into chunks of chunk_size rows
    list_df = [df[i:i + chunk_size] for i in range(0, df.shape[0], chunk_size)]
    for df_chunk in list_df:
        if chunk_size == df_chunk.shape[0]:
            print(df_chunk.shape)
            df_spillover = None
        else:
            # keep the incomplete chunk and prepend it to the next block
            df_spillover = df_chunk
# emit the remaining rows that never reached chunk_size
df_chunk = df_spillover
print(df_chunk.shape)
print()
Combining streaming with audb

A few thoughts on how we might want to combine streaming with audb. The implementation discussed above for PARQUET could, for example, be used to look up the entry of a single file in a streamed table:
import pyarrow as pa
import pyarrow.parquet as parquet

batch_size = 1000       # example value, not given in the original snippet
file = "audio/001.wav"  # file to look up, example value

stream = parquet.ParquetFile("db.parquet")
for batch in stream.iter_batches(batch_size=batch_size):
    selection = batch.filter(pa.compute.equal(batch["file"], file))
    if len(selection):
        # use the selected row to get infos to download table
        break
This would be a straightforward approach and is based on what we have, with the Artifactory set up as the "server side". What you do in the linked issue is to have the Artifactory mounted as a kind of filesystem. One would then have to locally cache the bandwidth-sensitive media data on the local machine in order to further consume them by means of a download. This would be a kind of pseudo-streaming mode, and if it meets the requirements that lurk around the corner, it will be a good way to go.

In case it does not meet the speed / bandwidth requirements, one could also think more radically, but this would require a stronger departure from what we have already, as one would probably need the capability to stream from the server side and more flexible behavior on the server side. For example, Apache Arrow comes with "arrow-flight" (see https://blog.djnavarro.net/posts/2022-10-18_arrow-flight/). This is probably out of scope for now, and I have no clue about future requirements. But as you say, for now this would be a good approach.
Stream CSV tables
Thanks for the CSV suggestion, this totally makes sense. For the actual implementation, we then only need to decide if we want to use a fixed block_size or set it variable. I also made my own version of your code:
import pandas as pd
import pyarrow.csv as csv

batch_size = 10   # number of rows to return from CSV
block_size = 200  # could be set variable, e.g. 1000 * batch_size
read_options = csv.ReadOptions(column_names=["file", "int"], skip_rows=1, block_size=block_size)
convert_options = csv.ConvertOptions(column_types=db["files"]._pyarrow_csv_schema())
stream = csv.open_csv("files.csv", read_options=read_options, convert_options=convert_options)
df_buffer = pd.DataFrame([])
for block in stream:
    df_buffer = pd.concat([df_buffer, block.to_pandas()])
    while len(df_buffer) >= batch_size:
        df = df_buffer.iloc[:batch_size, :]
        df_buffer = df_buffer.iloc[batch_size:, :]
        print(df)  # return value here
Great. This is more condensed and less clunky, as it for example does not require a separate print statement (or yield, if it were implemented accordingly) after looping is finished.
To support loading large tables that might not fit into memory, it would be a good idea to add an option to Table.get() (or another method?) to read the data piece-wise.

Let us first create an example table and store it as CSV and PARQUET:
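For example, a minimal sketch with pandas and pyarrow, assuming a table with a "file" and an "int" column and hypothetical file names; the snippets above additionally assume an audformat database db that provides the CSV schema:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as parquet

# Hypothetical example data: 100 rows with a "file" and an "int" column
df = pd.DataFrame(
    {
        "file": [f"audio/{n:03d}.wav" for n in range(100)],
        "int": range(100),
    }
)

# Store as CSV and PARQUET
df.to_csv("files.csv", index=False)
parquet.write_table(pa.Table.from_pandas(df, preserve_index=False), "db.parquet")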
Stream PARQUET tables
The table stored in PARQUET can be iterated with:
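A sketch of how this looks with pyarrow.parquet.ParquetFile and iter_batches(); batch_size=10 is an assumed value:

import pyarrow.parquet as parquet

batch_size = 10  # assumed number of rows per batch
stream = parquet.ParquetFile("db.parquet")
for batch in stream.iter_batches(batch_size=batch_size):
    df = batch.to_pandas()
    print(df.shape)  # every batch holds at most batch_size rows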
Stream CSV tables
Streaming a CSV file with pyarrow seems to be more complicated, as we cannot directly pass the number of rows we want per batch, but only the size in bytes of a batch. The problem is that the number of bytes per line can vary, so a fixed block_size returns batches with a varying number of rows, as the sketch below shows.
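A sketch of this behavior, using the files.csv from above and block_size=141, one of the values that appear in the comments above:

import pyarrow.csv as csv

block_size = 141  # size of a batch in bytes
read_options = csv.ReadOptions(block_size=block_size)
stream = csv.open_csv("files.csv", read_options=read_options)
for batch in stream:
    # how many rows end up in a batch depends on how many lines fit into block_size
    print(batch.num_rows)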
If we find a way to calculate the correct block_size value we could do something like the sketch below.
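A hedged sketch of one possible heuristic (not an exact calculation): estimate the bytes per row from the file size and scale by the desired batch size. The variable names and the heuristic itself are assumptions, not part of the original proposal:

import os
import pyarrow.csv as csv

batch_size = 10  # desired number of rows per batch
n_rows = 100     # known number of rows of the example table created above
# rough estimate of block_size: average bytes per row times desired rows per batch
block_size = int(os.path.getsize("files.csv") / n_rows * batch_size)

read_options = csv.ReadOptions(column_names=["file", "int"], skip_rows=1, block_size=block_size)
stream = csv.open_csv("files.csv", read_options=read_options)
for batch in stream:
    print(batch.num_rows)  # roughly batch_size rows, but not guaranteed to be exact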
Fallback solution using pandas

As far as I understand this, it uses the Python read engine under the hood, which should be much slower than pyarrow:
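A minimal sketch, assuming the chunksize argument of pandas.read_csv() is what was meant here:

import pandas as pd

batch_size = 10  # number of rows per chunk
for df in pd.read_csv("files.csv", chunksize=batch_size):
    print(df.shape)  # each chunk is a DataFrame with at most batch_size rows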
Argument name
The most straightforward implementation seems to me to add a single argument to audformat.Table.get() specifying the number of rows we want to read. This could be named nrows, n_rows, chunksize, batch_size, or similar. We might want to add a second argument to specify an offset to the first row we start reading. This way, we would be able to read a particular part of the table, e.g. as in the hypothetical sketch below.
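A hypothetical sketch of such an interface; batch_size and offset are just two of the possible argument names listed above, not an existing audformat API:

# hypothetical, not an existing audformat API
df = db["files"].get(batch_size=1000)               # read the first 1000 rows
df = db["files"].get(batch_size=1000, offset=1000)  # read rows 1000..1999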
We might also want to consider integration with audb already. In audb we might want to have the option to stream the data directly from the backend and not from the cache. This means we load the requested part of the table file from the backend, and we also load the corresponding media files.

/cc @ChristianGeng