Consider adding simple filtering #10

krassowski · 2020-03-07T23:44:14Z

Simple filtering proposal - idea 1

~~To enable high-performance subsetting, a simple, grep-like pre-filtering will be provided:~~

Import only first five rows:

%vault from notebook import large_frame.rows[:5] as large_frame_head

When subsetting, the use of as would be required to prevent potential confusion of the original large_frame object with its subset.

To import only rows including text "SNP":

%vault from notebook import large_frame.grep("SNP") as large_frame_snps

By design, no advanced filtering is intended at this step.

However, if your file is too big to fit into memory and you need more advanced filtering,
you can provide your custom import function to the low-level load_storage_object magic:

def your_function(f):
    return [
        line
        for i, line in enumerate(f)
        if i % 2 == 0   # replace with fancy filtering as needed
    ]
%vault import 'notebook_path/variable.tsv' as variable with your_function

Advanced filtering can already be achieved with existing code.

Simple filtering proposal - idea 2

Import the first 5 rows:

from data_vault import subset
%vault import 'notebook_path/variable.tsv' as variable with subset.rows[:5]

To be implemented with nrows.

Import the first 5 columns:

%vault import 'notebook_path/variable.tsv' as variable with subset.columns[:5]

To be implemented with usecols.
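A minimal sketch of how the subset.rows[:5] and subset.columns[:5] syntax could be translated into reader options. It assumes that nrows and usecols refer to the pandas.read_csv parameters of the same names; the classes `_Sliceable` and `Subset` are hypothetical and are not part of data_vault.

```python
class _Sliceable:
    """Translate `obj[:n]` into keyword arguments for a table reader (hypothetical helper)."""

    def __init__(self, to_kwargs):
        self._to_kwargs = to_kwargs

    def __getitem__(self, key):
        # Only the simple "first n" form is handled in this sketch.
        is_head_slice = (
            isinstance(key, slice)
            and key.start in (None, 0)
            and key.step in (None, 1)
            and isinstance(key.stop, int)
            and key.stop >= 0
        )
        if not is_head_slice:
            raise ValueError("only simple [:n] slices are supported in this sketch")
        return self._to_kwargs(key.stop)


class Subset:
    # rows[:n]    -> read only the first n data rows
    rows = _Sliceable(lambda n: {"nrows": n})
    # columns[:n] -> read only the first n columns, selected by position
    columns = _Sliceable(lambda n: {"usecols": list(range(n))})


subset = Subset()
```

Under that assumption, the resulting dictionary could be forwarded to the reader, for example pandas.read_csv(path, sep='\t', **subset.rows[:5]).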

Import rows containing a string:

%vault import 'notebook_path/variable.tsv' as variable with subset.contains('text')

Import rows matching a regular expression:

%vault import 'notebook_path/variable.tsv' as variable with subset.matches('.*? text')

Both to be implemented with a custom IO iterator that discards non-matching lines on the fly.
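A rough sketch of such an iterator, using only the standard library. The names `filter_lines`, `contains` and `matches` are hypothetical. Passing the header line through unfiltered and using re.search rather than a full-line match are assumptions of this sketch, not decisions made in the proposal.

```python
import io
import re


def filter_lines(handle, keep, header=True):
    """Lazily yield the lines of `handle` for which `keep(line)` is true.

    If `header` is true, the first line is always passed through so that
    column names survive the filtering (an assumption of this sketch).
    """
    lines = iter(handle)
    if header:
        first = next(lines, None)
        if first is not None:
            yield first
    for line in lines:
        if keep(line):
            yield line


def contains(text):
    """Build a predicate that keeps lines containing `text`."""
    return lambda line: text in line


def matches(pattern):
    """Build a predicate that keeps lines where `pattern` is found anywhere."""
    compiled = re.compile(pattern)
    return lambda line: compiled.search(line) is not None


# Usage: only the header and the SNP rows are ever materialised.
example = io.StringIO("id\ttype\n1\tSNP\n2\tindel\n3\tSNP\n")
snp_lines = list(filter_lines(example, contains("SNP")))
```

Because the generator reads one line at a time, the filtered stream could be wrapped in a file-like object and handed to a parser without loading the whole file into memory.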

Challenges:

How to support the variety of delimiters and options?
- subset.using(sep='csv').rows[:5]?


krassowski · 2020-03-07T23:46:08Z

with subset.containing('text') and with subset.matching('.*? text') might read better...
