Consider adding simple filtering #10

Open
krassowski opened this issue Mar 7, 2020 · 1 comment

Comments

@krassowski
Owner

Simple filtering proposal - idea 1

To enable high-performance subsetting, simple grep-like pre-filtering would be provided:

Import only the first five rows:

%vault from notebook import large_frame.rows[:5] as large_frame_head

When subsetting, the use of as would be required to prevent potential confusion of the original large_frame object with its subset.

To import only rows including text "SNP":

%vault from notebook import large_frame.grep("SNP") as large_frame_snps
By design, no advanced filtering is intended at this step.

However, if your file is too big to fit into memory and you need more advanced filtering,
you can provide your custom import function to the low-level load_storage_object magic:

def your_function(f):
    return [
        line
        for i, line in enumerate(f)
        if i % 2 == 0   # replace with fancy filtering as needed
    ]
%vault import 'notebook_path/variable.tsv' as variable with your_function

Advanced filtering can already be achieved with the existing code.

Simple filtering proposal - idea 2

Import the first 5 rows:

from data_vault import subset
%vault import 'notebook_path/variable.tsv' as variable with subset.rows[:5]

to be implemented with nrows

Import the first 5 columns:

%vault import 'notebook_path/variable.tsv' as variable with subset.columns[:5]

to be implemented with usecols
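
A minimal sketch of how the `subset.rows[:5]` and `subset.columns[:5]` syntax could be translated into reader arguments. It assumes that `nrows` and `usecols` refer to the `pandas.read_csv` parameters of the same names; the `_AxisSlicer` and `_Subset` classes are hypothetical helpers, not part of data_vault, and only the "first n" form is handled:

```python
class _AxisSlicer:
    """Hypothetical helper: turn subset.rows[:n] / subset.columns[:n] into reader kwargs."""

    def __init__(self, axis):
        self.axis = axis  # either "rows" or "columns"

    def __getitem__(self, item):
        # Only the simple "first n" slices from the proposal are supported here.
        if not isinstance(item, slice):
            raise NotImplementedError("only [:n] slices are supported in this sketch")
        if item.start not in (None, 0) or item.step not in (None, 1):
            raise NotImplementedError("only [:n] slices are supported in this sketch")
        if item.stop is None or item.stop < 0:
            raise NotImplementedError("an explicit non-negative stop is required")
        if self.axis == "rows":
            # Assumed to be passed on as pandas.read_csv(..., nrows=n)
            return {"nrows": item.stop}
        # Assumed to be passed on as pandas.read_csv(..., usecols=[0, ..., n - 1])
        return {"usecols": list(range(item.stop))}


class _Subset:
    rows = _AxisSlicer("rows")
    columns = _AxisSlicer("columns")


subset = _Subset()

print(subset.rows[:5])     # {'nrows': 5}
print(subset.columns[:5])  # {'usecols': [0, 1, 2, 3, 4]}
```

Returning plain keyword arguments keeps the slicing syntax separate from the reader, so the same object could be reused if a different loading function were chosen later.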

Import rows containing a string:

%vault import 'notebook_path/variable.tsv' as variable with subset.contains('text')

Import rows matching a regular expression:

%vault import 'notebook_path/variable.tsv' as variable with subset.matches('.*? text')

Both to be implemented with a custom IO iterator that discards, on the fly, any lines that do not match the criteria.
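
One way such an on-the-fly line filter might look, as a rough sketch only. The names `filter_lines`, `contains` and `matches` are hypothetical, keeping the first line as a header is an assumption, and `re.match` (anchored at the start of the line) is assumed because the example pattern begins with `.*?`:

```python
import io
import re


def filter_lines(handle, predicate, keep_header=True):
    """Lazily yield lines from handle, dropping those that fail predicate."""
    for i, line in enumerate(handle):
        # The first line is assumed to be a header and is kept unconditionally.
        if (keep_header and i == 0) or predicate(line):
            yield line


def contains(text):
    """Predicate: the line includes the given substring."""
    return lambda line: text in line


def matches(pattern):
    """Predicate: the regular expression matches from the start of the line."""
    compiled = re.compile(pattern)
    return lambda line: compiled.match(line) is not None


data = io.StringIO("id\ttype\n1\tSNP\n2\tindel\n3\tSNP\n")
kept = list(filter_lines(data, contains("SNP")))
print(kept)  # ['id\ttype\n', '1\tSNP\n', '3\tSNP\n']
```

Because `filter_lines` is a generator, only one line is held in memory at a time; the surviving lines could then be handed to whatever parser builds the final object.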

Challenges:

  • how to support the variety of delimiters and options?
    • subset.using(sep='csv').rows[:5]?
@krassowski
Owner Author

with subset.containing('text') and with subset.matching('.*? text') might read better...
