
ARROW-2828: [JS] Refactor Data, Vectors, Visitor, Typings, build, tests, dependencies #3290

Closed · wants to merge 20 commits

Conversation

@trxcllnt (Contributor) commented Dec 31, 2018

It's the big one: The Great ArrowJS Refactor of 2018. Thanks for bearing with me through yet another huge PR. Check out this sweet gif of all the new features in action. With streaming getting to a good place, we've already started working on demos/integrations with other projects like uber/deck.gl 🎉

The JIRAs

In addition to everything I detail below, this PR closes the following JIRAs:

The stats

The gulp scripts have been updated to parallelize as much as possible. These are the numbers from my Intel Core i7-8700K CPU @ 3.70GHz × 12 running Ubuntu 18.04 and node v11.6.0:

$ time npm run build
[22:11:04] Finished 'build' after 39 s

real	0m40.341s
user	4m55.428s
sys	0m5.559s
$ npm run test:coverage
=============================== Coverage summary ===============================
Statements   : 90.45% ( 4321/4777 )
Branches     : 76.7% ( 1570/2047 )
Functions    : 84.62% ( 1106/1307 )
Lines        : 91.5% ( 3777/4128 )
================================================================================

Test Suites: 21 passed, 21 total
Tests:       5644 passed, 5644 total
Snapshots:   0 total
Time:        16.023s

The fixes

  • Vector#indexOf(value) now works for all DataTypes
  • Vector#set(i, value) now works for all DataTypes (a short sketch of both follows this list)
  • Reading from node streams is now fully zero-copy
  • The IPC writers now serialize dictionaries of nested Vectors correctly (ARROW-3337)
  • DictionaryBatches marked as isDelta now correctly update the dictionaries for all Vectors that point to that dictionary, even if those Vectors were created before the delta batch arrived
  • A few arrow2csv fixes:
    • Ignores stdin if it's a TTY
    • Reads all the Arrow formats from stdin
    • Always shows the help text when it doesn't understand the input
    • Proper backpressure support to play nicely with other Unix utilities like head and less
  • Fixes an unfiled bug we encountered last week where JS would throw an error creating RowProxies for a Table or Struct with duplicate column names
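
Here's a minimal sketch of the first two fixes above. The file path and column name are hypothetical, and the chunks field used here is described under "The upgrades" below:

    import { readFileSync } from 'fs';
    import { Table } from 'apache-arrow';

    // Hypothetical fixture: any Arrow IPC file with a numeric `lat` column would do.
    const table = Table.from(readFileSync('flights.arrow'));
    const lat = table.getColumn('lat');   // a Column; see "The upgrades" below
    const chunk = lat.chunks[0];          // Chunked exposes its inner chunks

    const i = chunk.indexOf(47.6);        // Vector#indexOf(value) now works for every DataType
    if (i !== -1) {
      chunk.set(i, 47.61);                // ...and so does Vector#set(i, value)
      console.log(chunk.get(i));
    }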

The upgrades

  • New zero-copy Message/RecordBatchReaders!
  • New RecordBatchWriters
    • Adds RecordBatchJSONWriter, RecordBatchFileWriter and RecordBatchStreamWriter
    • Adds static RecordBatchWriter.writeAll() method to easily write a Table or stream of RecordBatches
    • Both sync and async flushes based on the WritableSink
  • Full integration with platform I/O primitives
  • Generic type parameters inherited from DataType now flow recursively (a fuller sketch follows this list)
    const table = Table.from<{ str: Utf8, i32: Int32, bools: List<Bool> }>(data);
    table.get(0); // will be of type { str: string, i32: number, bools: BoolVector }
  • New simplified Data class
  • New simplified, faster Visitor class with support for optional, narrower visitT implementations
    • New specialized Visitor implementations to enable runtime reflection (e.g. dynamically look up the Vector constructor for a given DataType)
  • New abstract Chunked base class for the applicative (concat) operation
    • public chunkedInst.chunks field is the list of inner chunks
  • New Column class extends Chunked, combines Field with the chunks (provides access to the field name from the Schema)
  • RecordBatch#concat(...batchesOrTables) now returns a Table
  • Table now extends Chunked, so it inherits:
    • Table#slice(from, to)
    • Table#concat(...batchesOrTables)
    • Table#getChildAt(i) exists as an alias of getColumnAt(i)
  • Table#getColumn[At]() returns a Column
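
To tie a few of these together, here's a hedged sketch of the new surface. The IPC bytes and field names are hypothetical, and the final comment is deliberately vague: how you collect the writer's output depends on which sink you pair it with.

    import { readFileSync } from 'fs';
    import { Table, Utf8, Int32, Bool, List, RecordBatchStreamWriter } from 'apache-arrow';

    // Hypothetical IPC bytes; they could just as well come from fetch(), a socket, or stdin.
    const data = readFileSync('example.arrow');

    // Generic parameters flow through recursively, so rows and columns come back typed.
    const table = Table.from<{ str: Utf8, i32: Int32, bools: List<Bool> }>(data);
    const row = table.get(0);               // { str: string, i32: number, bools: BoolVector }
    if (row) { console.log(row.str, row.i32); }

    // A Column combines the Field with its chunks, so schema metadata rides along with the data.
    const i32 = table.getColumn('i32');
    console.log(i32.name, i32.length);

    // The static writeAll() serializes a whole Table (or a stream of RecordBatches); the
    // resulting writer is then flushed to whichever sink (node or DOM stream) you pair it with.
    const writer = RecordBatchStreamWriter.writeAll(table);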

The breaking changes

  • All the old IPC functions are gone, but the new APIs will live for much longer
  • Table#batches is now Table#chunks, which it inherits from Chunked (maybe controversial, open to aliasing)
  • Table#batchesUnion is now just... the Table instance itself (also maybe controversial, open to aliasing)
  • DataType#TType is now DataType#typeId -- it should always have been this; the old name was a typo. Easy to alias if necessary.
  • The complicated View classes are now gone, logic centralized as specialized Visitors

The tests

  • Tests no longer rely on any C++ or Java generated integration files
  • Integration tests have been moved into bin/integration.js, and they finish much quicker
  • The tsconfig files have been tweaked to speed up test run time and improve the async debugging experience
  • A streaming RecordBatchJSONWriter has been implemented so we can easily debug and validate written output
    • The JSON results are also tested against the corresponding binary representation, similar to the integration tests
  • A suite of test-data helpers has been added to auto-generate data for validation at runtime
    • They produce the underlying Arrow VectorData buffers, as well as the expected plain-JS-value representation for verification
    • This allows us to test all possible type configuration combinations, e.g. that all types Dictionary-encode correctly, that all types serialize when nested, etc.
  • A suite of IO test helpers has been added
    • We use memfs to mock the file system, which contributes to the test performance improvements (a minimal jest-style sketch follows this list)
    • This enables us to easily test all the flavors of io primitives across node and browser environments
  • A vscode debugging launch configuration has been added to ease the process of contributing more tests (and because I've been asked for mine so often)
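
For the memfs bullet above, the mocking pattern looks roughly like this, sketched as a standalone jest test rather than the project's real helpers (the path and bytes here are made up):

    import { fs as memfs, vol } from 'memfs';

    // Route node's `fs` module to the in-memory volume for this test file.
    jest.mock('fs', () => require('memfs').fs);

    beforeEach(() => vol.reset());

    test('round-trips bytes through the in-memory file system', () => {
      const bytes = Buffer.from([0x41, 0x52, 0x52, 0x4f, 0x57]);  // placeholder bytes, not real IPC data
      memfs.writeFileSync('/simple.arrow', bytes);
      expect(memfs.readFileSync('/simple.arrow')).toEqual(bytes);
    });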

The build

  • Faster
  • Node 11+ (needs Symbol.asyncIterator enabled)
  • Closure-compiler upgrades and build enhancements mean we can auto-generate the externs file during compilation, rather than maintaining it by hand

Misc

  • Added arrow2csv to js/bin/arrow2csv, so anybody with the JS project dependencies installed can easily view a CSV-ish thing (cat foo.arrow | js/bin/arrow2csv.js)

Todos

  • Docs/Recipes/Examples
  • Highlight/write more tools (like arrow2csv)
  • Flesh out the RecordBatchWriters a bit more
  • Gather feedback on the new RecordBatchReader APIs

@codecov-io commented Jan 2, 2019

Codecov Report

Merging #3290 into master will decrease coverage by 0.1%.
The diff coverage is 90.49%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3290      +/-   ##
==========================================
- Coverage   88.57%   88.46%   -0.11%     
==========================================
  Files         540      631      +91     
  Lines       73106    79286    +6180     
  Branches        0     1069    +1069     
==========================================
+ Hits        64753    70140    +5387     
- Misses       8250     9031     +781     
- Partials      103      115      +12
Impacted Files Coverage Δ
js/src/vector/interval.ts 100% <100%> (ø)
js/src/enum.ts 100% <100%> (ø)
js/src/Arrow.node.ts 100% <100%> (ø)
js/src/ipc/node/writer.ts 100% <100%> (ø)
js/src/ipc/node/iterable.ts 100% <100%> (ø)
js/src/vector/decimal.ts 100% <100%> (ø)
js/src/compute/predicate.ts 100% <100%> (ø)
js/src/vector/bool.ts 100% <100%> (ø)
js/src/Arrow.dom.ts 100% <100%> (ø)
js/src/visitor/set.ts 100% <100%> (ø)
... and 244 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 46b1bc7...2ef150f. Read the comment docs.

@xhochy (Member) commented Jan 4, 2019

@trxcllnt This seems to include some unwanted commits. Can you rebase on latest master?

@trxcllnt (Contributor, Author) commented Jan 4, 2019

@xhochy I've tried doing git rebase apache/master, but each time it rebases all the way back to the initial commit, then fails with a merge conflict on the README.md file. I'm happy to give it another shot if you have any tips (also available in slack).

@xhochy (Member) commented Jan 5, 2019

@trxcllnt Also had a look at this and I don't know what's going wrong here. Really puzzled.

@wesm (Member) commented Jan 5, 2019

I can look at fixing this tomorrow. How much do you care about maintaining the commit history?

@wesm (Member) commented Jan 5, 2019

The problem is that this patch started before the Parquet monorepo merge. Something about the merged repo history confuses rebase when the base commit is before the merge

@trxcllnt (Contributor, Author) commented Jan 5, 2019

@wesm we can drop the commit history; I have it all in my fork if I need something. It went through a bunch of iterations, so a lot of the code in those commits has been factored out at this point.

@wesm (Member) commented Jan 5, 2019

Done. Please git reset --hard to that commit

@wesm (Member) commented Jan 5, 2019

Let me know when this is merge-ready and I'll take a glance through as I'm able. I figure a lot of feedback / improvements will be handled in follow-up PRs.

@trxcllnt (Contributor, Author) commented Jan 5, 2019

Thanks @wesm, I pulled and it all looks good. We can do feedback in follow-up PRs or discussions on the mailing list. Let me know if I can help point out where things are, or if you have any questions about what's going on.

The generated-data tests are a great place to start. In addition to comprehensive tests across all the kinds of I/O primitives we want to support, I've tried to add as many real-world scenarios as possible (like reading and writing multiple tables to the same stream), which come from use cases at Graphistry. If you're curious about how the I/O tests work, check out ipc/helpers.ts; it contains the helper functions used to generate and validate each I/O primitive.

Here's a checklist to build and test locally:

  • install/update to node v11.x (nvm install node will update to the latest)
  • $ cd arrow/js && rm -rf node_modules && npm install to ensure libraries are up to date
  • $ npx gulp will clean, build, and test everything

While there are always things to improve, I feel good this is ready to merge. I've been updating the Graphistry codebase to use the new APIs with no issues, and I've started work (tweet) on libraries to integrate with other tools in the node ecosystem.

@TheNeuralBit (Member) commented:

Thanks for this, Paul!
I've just been going through some smoke tests for now and everything looks good to me so far. I ran the benchmarks and noticed some changes. I posted the output from a run on master and a run on the refactor branch here.

Some notable changes (master -> refactor):
Accessing numeric vector by index - (49.81 ms -> 81.31 ms)
Accessing dictionary-encoded utf8 vector by index - (2758.32 ms -> 824.92 ms) 👍
Iterating over dictionary-encoded utf8 vector - (3152 ms -> 583.7 ms) 👍
slice toArray numeric vector - (1.34 ms -> 1.1 ms) 👍
slice toArray dictionary-encoded utf8 vector - (3147.78 ms -> 659.42 ms) 👍
DataFrame countBy dictionary-encoded utf8 vector - (9.95 ms -> 17.24 ms)
DataFrame filter-scan dictionary-encoded utf8 vector - (25.62ms -> 31.9ms)

Interestingly there seems to be a huge improvement in anything that decodes utf8 data. Is that because we're using Buffer.fromString to decode utf8 in node now?

There are some minor regressions elsewhere, but nothing to be too concerned about. Might be worth trying to tackle in some follow-up PRs though.

@wesm (Member) commented Jan 7, 2019

@TheNeuralBit if you could let me know when you're +1 on merging this, I can go ahead. I am swamped this week so I don't want to hold this up, and I will do my best to leave some comments when I can for post-merge follow-up work.

@trxcllnt (Contributor, Author) commented Jan 7, 2019

@TheNeuralBit: Interestingly there seems to be a huge improvement in anything that decodes utf8 data. Is that because we're using Buffer.fromString to decode utf8 in node now?

Yep, node's Buffer is implemented in C++, so I try to use that when it's available. Theoretically browsers should be able to do something similar with the TextEncoder/TextDecoder APIs, but I haven't validated that yet.
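
Roughly, the decode path boils down to a pattern like this (a simplified sketch, not the actual ArrowJS internals; it assumes a global TextDecoder, which Node 11+ and modern browsers provide):

    // Prefer node's C++-backed Buffer when it exists; otherwise use the standard TextDecoder.
    export function decodeUtf8(bytes: Uint8Array): string {
      if (typeof Buffer !== 'undefined') {
        return Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength).toString('utf8');
      }
      return new TextDecoder('utf-8').decode(bytes);
    }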

For the chunked-random-access tests, I was expecting to take a bit of a hit, since the binary-search in ChunkedVector#search() isn't inlined anymore and incurs a few more hash lookups + a function call.

But before we change anything there, I was thinking we could probably get a boost by making the search a bit more intelligent. If we pick the starting chunk as the search index divided by the average chunk length (discarding the shortest and longest chunks), then instead of starting at 0 every time and almost guaranteeing a log(n) lookup, we'd often land on the right chunk within the first few iterations.
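
Here's a rough sketch of that heuristic (illustrative only, not the real ChunkedVector#search(), and it skips the discard-the-shortest-and-longest refinement):

    // `chunkOffsets` has numChunks + 1 entries: chunkOffsets[i] is the absolute index
    // at which chunk i begins, and chunkOffsets[numChunks] is the total length.
    function guessChunkIndex(index: number, chunkOffsets: Uint32Array): number {
      const numChunks = chunkOffsets.length - 1;
      const totalLength = chunkOffsets[numChunks];
      if (numChunks <= 1 || totalLength === 0) { return 0; }
      // Start near where the element would live if every chunk had the average length.
      const avgChunkLength = totalLength / numChunks;
      let guess = Math.min(numChunks - 1, Math.floor(index / avgChunkLength));
      // Walk left or right from the guess; for evenly-sized chunks this ends in a step or two.
      while (guess > 0 && index < chunkOffsets[guess]) { guess--; }
      while (guess < numChunks - 1 && index >= chunkOffsets[guess + 1]) { guess++; }
      return guess;
    }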

There's also probably a bit of a hit due to the vector method rebinding starting here, as well as the getters for each buffer being pass-throughs back to the Vector's Data instance, instead of copying them onto the Vector on startup.

I tried to bind them as closely to the Vector instance as possible (at the bottom of src/vector/index.ts, where I bind again on vector initialization), but we may have to go further and update the Visitors that need to be super fast (like the GetVisitor) to operate directly on Data instances instead of Vectors. I think I made sure the Visitor typings would allow that, but we'll just have to update the implementations.

DictionaryVectors are going to be a bit trickier. When we see a DictionaryBatch with the isDelta flag set, there may be DictionaryVectors that need to be told about the new concatenated dictionary values (for example, someone may update the indices, or concat two DictionaryVectors together). Even trickier, Vectors at different levels in the Schema tree may reference the same dictionary, which means we can't just scan the children of the RecordBatch or Table to know which ones to update (or serialize, which was the source of bug ARROW-3337).

But since the Dictionary DataType only exists once per dictionary id, and the Vector hierarchy matches the Schema Field hierarchy, the solution I came up with was to store both the dictionary value DataTypes (Utf8, etc.) and dictionary-encoded Fields (along with their Dictionary wrapper pseudo-DataTypes) in the Schema, and elevate the dictionary values reference to hang off the Dictionary wrapper DataType.

When we're reading or writing, we keep track of the singleton dictionary values DataType, to ensure it's always set as the T in a Dictionary<T> for each new dictionary-encoded Field. As we read each new or isDelta DictionaryBatch, we load (or load-and-concat) the dictionary values Vector and assign it to the Dictionary wrapper instance of each dictionary-encoded Field.
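
In miniature, the ownership model looks something like this (purely illustrative classes, not the real ArrowJS ones): the Dictionary "type" object is created once per dictionary id and shared by every dictionary-encoded Field, so applying a delta in one place updates them all.

    class ArrayValues<T> {
      constructor(public readonly values: T[]) {}
      concat(other: ArrayValues<T>) { return new ArrayValues(this.values.concat(other.values)); }
    }

    class SharedDictionary<T> {
      constructor(public readonly id: number, public dictionary: ArrayValues<T>) {}
      applyDictionaryBatch(isDelta: boolean, incoming: ArrayValues<T>) {
        // isDelta appends to the existing values; otherwise the batch replaces them outright.
        this.dictionary = isDelta ? this.dictionary.concat(incoming) : incoming;
      }
    }

    // Two Fields at different levels of the Schema can hold the same SharedDictionary instance,
    // so neither needs to be found and patched individually when a delta batch arrives.
    const shared = new SharedDictionary(0, new ArrayValues(['SEA', 'SFO']));
    shared.applyDictionaryBatch(true, new ArrayValues(['JFK']));
    console.log(shared.dictionary.values);   // ['SEA', 'SFO', 'JFK']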

@trxcllnt (Contributor, Author) commented Jan 7, 2019

Oh, I forgot: another thing probably affecting the results is that Table#getColumnAt() always returns a Column (which is Chunked), even if there's only a single chunk.

The Column class is a pair of <Field<T>, Vector<T>[]>, which lets us access properties from the Field via the returned child (table.getColumn('foo').name === 'foo'). I was thinking for this case we could have a special SingleChunkColumn that skips the search() method entirely and passes through to the underlying Vector, or perhaps be even dirtier and dynamically bind the Field properties to a clone of the Chunk and fake the return type with an interface.
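
Here's a hedged sketch of that SingleChunkColumn idea (hypothetical class and a minimal interface, not code from this PR): when there's exactly one chunk, delegate straight through and skip the chunk search entirely.

    interface Chunk<T> { readonly length: number; get(index: number): T | null; }

    class SingleChunkColumn<T> {
      constructor(public readonly name: string, private readonly chunk: Chunk<T>) {}
      get length() { return this.chunk.length; }
      get(index: number): T | null {
        // No binary search over chunk offsets: there is only one chunk to look in.
        return this.chunk.get(index);
      }
    }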

@trxcllnt (Contributor, Author) commented Jan 9, 2019

@TheNeuralBit I did a pass last night on performance -- less binding, fewer getters, stuff like that. Here are the results I'm seeing now:

                                  test name |   master (ops/sec) |  branch (ops/sec) | % change   |
Table.from                                  |             914.00 |            624.00 |  -31.73% ❌ |
readBatches                                 |           1,113.00 |            623.00 |  -44.03% ❌ |
lat.get(idx)                                |              19.16 |             19.22 |    0.31% ➖ |
lng.get(idx)                                |              19.24 |             19.05 |   -0.99% ➖ |
origin.get(idx)                             |               0.50 |              1.12 |  124.00% ✔️ |
destination.get(idx)                        |               0.52 |              1.07 |  105.77% ✔️ |
lat[Symbol.iterator]()                      |              16.69 |             16.14 |   -3.30% ➖ |
lng[Symbol.iterator]()                      |              16.72 |             17.10 |    2.27% ✔️ |
origin[Symbol.iterator]()                   |               0.51 |              1.11 |  117.65% ✔️ |
destination[Symbol.iterator]()              |               0.53 |              1.04 |   96.23% ✔️ |
lat.slice().toArray()                       |           1,381.00 |          1,520.00 |   10.07% ✔️ |
lng.slice().toArray()                       |           1,568.00 |          1,426.00 |   -9.06% ➖ |
origin.slice().toArray()                    |               0.49 |              1.12 |  128.57% ✔️ |
destination.slice().toArray()               |               0.48 |              1.10 |  129.17% ✔️ |
lat.slice()                                 |          27,965.00 |         48,678.00 |   74.07% ✔️ |
lng.slice()                                 |          29,425.00 |         42,739.00 |   45.25% ✔️ |
origin.slice()                              |           5,092.00 |         41,802.00 |  720.93% ✔️ |
destination.slice()                         |           4,814.00 |         48,367.00 |  904.72% ✔️ |
DF CountBy [origin]                         |             153.00 |            285.00 |   86.27% ✔️ |
DF CountBy [destination]                    |             153.00 |            262.00 |   71.24% ✔️ |
DF Filter-Scan-Count [lat > 0]              |              30.00 |             35.19 |   17.30% ✔️ |
DF Filter-Scan-Count [lng > 0]              |              29.81 |             34.75 |   16.57% ✔️ |
DF Filter-Scan-Count [origin === "Seattle"] |              59.68 |             56.08 |   -6.03% ➖ |
DF Count [lat > 0]                          |             121.00 |            180.00 |   48.76% ✔️ |
DF Count [lng > 0]                          |             119.00 |            182.00 |   52.94% ✔️ |
DF Count [origin === "Seattle"]             |               0.55 |              1.26 |  129.09% ✔️ |

@TheNeuralBit (Member) commented:

Wow! Thanks for tackling the perf stuff; those numbers look great! I re-ran the benchmarks on my laptop and found a similar improvement 🎉

I don't think I'll have time for a thorough code review in the near future, but I've run your branch through various tests to make sure the features I'm interested in still work, so I'm 👍 on this.

LGTM @wesm

@trxcllnt (Contributor, Author) commented:

And after the latest pass, I got the numbers for the two parse tests back up:

                      test name |   master (ops/sec) |  branch (ops/sec) | % change   |
Table.from                      |             914.00 |             1,549 |   64.61% ✔️ |
readBatches                     |           1,113.00 |             1,671 |   50.13% ✔️ |

@wesm changed the title from "[JS]: Refactor Data, Vectors, Visitor, Typings, build, tests, dependencies" to "ARROW-2828: [JS] Refactor Data, Vectors, Visitor, Typings, build, tests, dependencies" on Jan 13, 2019
@wesm (Member) commented Jan 13, 2019

Merging this. Can you resolve all the relevant JIRAs? Thank you!

@wesm closed this in 5598d2f on Jan 13, 2019
@trxcllnt (Contributor, Author) commented:

thanks @wesm!
