Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement data.table-like j argument, without evalutation #37

Merged
merged 3 commits into from
May 7, 2018

Conversation

martinblostein
Copy link
Contributor

@martinblostein martinblostein commented May 3, 2018

Expressions are simplified to be in terms of on-disk columns and unknown symbols, but are not handled beyond that stage. Columns can be selected, rearranged, renamed and duplicated.

Example of functionality:

x <- data.table::data.table(a = 1:10, b = LETTERS[1:10])
tmp <- tempfile()
fst::write_fst(x, tmp)
ft <- fsttable::fst_table(tmp)
ft[, .(x = a, y = a + b)][1:5, .(z = y / 2, w = x)]

# <fst file>
# 5 rows, 2 columns
# 
#           z     w
#        <NA> <int>
# 1 (a + b)/2     1
# 2 (a + b)/2     2
# 3 (a + b)/2     3
# 4 (a + b)/2     4
# 5 (a + b)/2     5

The main idea of this PR is the introduction of the colexps field in the table proxy object. This object is used to track the current column-state of the table. It is a list of expressions, with symbols of either on-disk columns or extra in-memory data.

To go with the colexps field, I added a table_proxy_transform method (terminology roughly based on dplyr). This method is used to update colexps by substituting the exsting column expressions into the new ones. This method is extremely flexible (probably to a fault) and agnostic to the interface implementation.

Note: A lot of the code in table_proxy_methods.R is just hacked together so that the tables print nicely. Before working on this further I think we should consider unifying or refactoring table_proxy_read_full and table_proxy_read_range to avoid code duplication. I also wonder if the range method is even necessary; reading a range of a proxy table is the same as reading a "full" proxy table with a different slice_map.

…ons.

Expressions are simplified to be in terms of on-disk columns and unknown symbols, but are not handled beyond that stage. These virutal columns are printed simply as their expressions, with type NA. Columns can be selected, rearranged, renamed and duplicated.
@MarcusKlik MarcusKlik self-assigned this May 3, 2018
@MarcusKlik MarcusKlik added this to the fsttable v0.2.0 milestone May 3, 2018
@MarcusKlik
Copy link
Collaborator

Hi @martinblostein, thanks a lot for your pull request!

Nice work, so the variables in the expression are categorized as known columns in the fst file or as other variables. And further processing is done according to the specific category. That looks great!

I'm going to spent some time on your pull request to study it in more detail (as it deserves) so please be patient a little while longer until it's merged!

@martinblostein
Copy link
Contributor Author

martinblostein commented May 4, 2018

Nice work, so the variables in the expression are categorized as known columns in the fst file or as other variables. And further processing is done according to the specific category. That looks great!

That is the idea, but I didn't make that classification explicit yet. I think moving forward we definitely should! You may notice that the current code gets confused because when a new expression is introduced that is in the disk table, but not in the current table state, it thinks we want to refer to the on-disk column. Making this categorization explicit will fix that.

@MarcusKlik MarcusKlik merged commit bd45361 into fstpackage:develop May 7, 2018
@MarcusKlik
Copy link
Collaborator

Hi @martinblostein, I've merged your pull request into the dev branch with some minor tweaks, nice work!

At first, my idea was to completely exclude any expression parsing and substitutions from the table_proxy object. But as you mention in the code, the parsing mechanism for a dplyr interface will be very similar, so why not share that code among interfaces? The only rule would be that columns are present in the expression object as unevaluated symbols.

@MarcusKlik MarcusKlik added the enhancement New feature or request label Jan 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants