Coming Soon

ccoffey edited this page Jun 29, 2011 · 4 revisions

This section outlines the features that I am currently working on.

1) Querying across multiple .csv files

This will work much like a SQL join. I expect the API to look something like this:

# Import sql4csv and create some data sets.
from novacode import sql4csv
ds_0 = sql4csv('ds_0.csv')
ds_1 = sql4csv('ds_1.csv')

# Join two data sets using their common age field.
ds_2 = ds_0.join(ds_1, '#0.age = #1.age')

# Query the joined data set as normal.
ds_2.query('select * where $age > 8')
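
To make the intended join semantics concrete, here is a minimal sketch of an inner join on a shared field, written with only the Python standard library. It is not sql4csv code: `hash_join` is a hypothetical helper, the two CSV snippets are made-up stand-ins for ds_0.csv and ds_1.csv, and the assumption that the left row wins on a column name clash is mine.

```python
import csv
import io
from collections import defaultdict


def hash_join(left_rows, right_rows, field):
    """Inner-join two lists of dict rows on a shared field (hypothetical helper)."""
    # Index the right-hand rows by their value in the join field.
    index = defaultdict(list)
    for row in right_rows:
        index[row[field]].append(row)

    joined = []
    for left in left_rows:
        # Pair each left row with every right row that has the same value.
        for right in index.get(left[field], []):
            merged = dict(right)
            merged.update(left)  # assumption: the left value wins on a name clash
            joined.append(merged)
    return joined


# In-memory stand-ins for ds_0.csv and ds_1.csv (made-up data).
ds_0 = list(csv.DictReader(io.StringIO("name,age\nann,9\nbob,7\n")))
ds_1 = list(csv.DictReader(io.StringIO("age,grade\n9,4\n7,2\n")))

# Roughly what ds_0.join(ds_1, '#0.age = #1.age') is meant to produce.
ds_2 = hash_join(ds_0, ds_1, "age")

# Roughly what 'select * where $age > 8' would then return.
older = [row for row in ds_2 if int(row["age"]) > 8]
```

Building the index on one side first keeps the join to a single pass over each file, instead of comparing every row against every other row.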

2) Embarrassingly parallel execution

Currently sql4csv iterates through a .csv file row by row. However, each row is processed independently of every other row, so running a query in parallel should give a worthwhile speedup, particularly on large files.

This feature will be optional. Below is an example of what it should look like.

# Import sql4csv and create some data sets.
from novacode import sql4csv
ds_small = sql4csv('ds_small.csv')
ds_large = sql4csv('ds_large.csv')

# Run one query sequentially and a second in parallel.
ds_small.query(some_query)
ds_large.parallel_query(some_query)
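
As a rough illustration of why independent rows parallelise so easily, the sketch below splits the rows into chunks, filters each chunk on a worker, and stitches the results back together in order. None of this is sql4csv code: `matches`, `filter_chunk`, `sequential_query` and the free function `parallel_query` are hypothetical names, and the rows are made-up data. It uses threads to stay short; for a CPU-bound predicate in CPython, `ProcessPoolExecutor` (which offers the same `map` interface) would likely be the better fit.

```python
from concurrent.futures import ThreadPoolExecutor


def matches(row):
    # Stand-in for a compiled query such as 'select * where $age > 8'.
    return int(row["age"]) > 8


def filter_chunk(chunk):
    # Each chunk is filtered on its own; no row depends on any other row.
    return [row for row in chunk if matches(row)]


def sequential_query(rows):
    return filter_chunk(rows)


def parallel_query(rows, workers=4, chunk_size=5):
    # Split the rows into fixed-size chunks, one unit of work per chunk.
    chunks = [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in input order, so row order is preserved.
        parts = pool.map(filter_chunk, chunks)
        return [row for part in parts for row in part]


# Made-up rows standing in for the contents of a .csv file.
rows = [{"name": "r%d" % i, "age": str(i)} for i in range(20)]

sequential_result = sequential_query(rows)
parallel_result = parallel_query(rows)
```

Both calls return the same rows in the same order; only the way the work is scheduled differs, which is what makes the parallel path safe to offer as an opt-in.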

I will post some experimental results here soon: a comparison of different queries run on files of different sizes.
