Parquetjs browser support #17

Merged
merged 23 commits into from
Jun 30, 2021
14 changes: 11 additions & 3 deletions .babelrc.js
@@ -1,6 +1,14 @@
 module.exports = {
+  sourceType: "unambiguous",
+  plugins: ["babel-plugin-add-module-exports"],
   presets: [
-    '@babel/preset-env',
-    '@babel/preset-typescript',
-  ],
+    ['@babel/preset-env', {
+      loose: true,
+      modules: "auto",
+      "useBuiltIns": "entry", // to ensure regeneratorRuntime is defined; see bootstrap.js
+      "corejs": 3, // use w/ "useBuiltIns", defaults=2, must match what is in package.json
+      // "targets": "> 0.25%, not dead"
+    }],
+    '@babel/preset-typescript'
+  ]
 };
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ npm-debug.log
 .nyc_output
 dist
 !test/test-files/*.parquet
+examples/server/package-lock.json
57 changes: 30 additions & 27 deletions README.md
@@ -14,26 +14,37 @@ for compatibility with Apache's Java [reference implementation](https://github.c
write a large amount of structured data to a file, compress it and then read parts
of it back out efficiently. The Parquet format is based on [Google's Dremel paper](https://www.google.co.nz/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&uact=8&ved=0ahUKEwj_tJelpv3UAhUCm5QKHfJODhUQFggsMAE&url=http%3A%2F%2Fwww.vldb.org%2Fpvldb%2Fvldb2010%2Fpapers%2FR29.pdf&usg=AFQjCNGyMk3_JltVZjMahP6LPmqMzYdCkw).

-Forked Notice
--------------
+## Forked Notice

This is a forked repository with code from various sources:
- Primary source [ironSource](https://github.com/ironSource/parquetjs) [npm: parquetjs](https://www.npmjs.com/package/parquetjs)
- Secondary source [ZJONSSON](https://github.com/ZJONSSON/parquetjs) [npm: parquetjs-lite](https://www.npmjs.com/package/parquetjs-lite)

-Installation
-------------
-
-To use parquet.js with node.js, install it using npm:
+## Installation
+_parquet.js requires node.js >= 14.16.0_

```
$ npm install @dsnp/parquetjs
```

-_parquet.js requires node.js >= 14.16.0_
+### NodeJS
+To use with Node.js:
+```javascript
+import parquetjs from "@dsnp/parquetjs"
+```
+
+### Browser
+To use in a browser, configure your bundler with the appropriate plugin or resolver, depending on your needs, to point to:
+```javascript
+"node_modules/@dsnp/parquetjs/browser/parquetjs"
+```
+or:
+
+```javascript
+import parquetjs from "@dsnp/parquetjs/browser/parquetjs"
+```

-Usage: Writing files
---------------------
+## Usage: Writing files

Once you have installed the parquet.js library, you can import it as a single
module:
@@ -110,8 +121,7 @@ The following options are provided to have the ability to adjust the split-block

Note that if numFilterBytes is provided then falsePositiveRate and numDistinct options are ignored.

-Usage: Reading files
---------------------
+## Usage: Reading files

A parquet reader allows retrieving the rows from a parquet file in order.
The basic usage is to create a reader and then retrieve a cursor/iterator
@@ -224,8 +234,7 @@ const file = fs.readFileSync('fruits.parquet');
let reader = await parquet.ParquetReader.openBuffer(file);
```

-Encodings
----------
+## Encodings

Internally, the Parquet format will store values from each field as consecutive
arrays which can be compressed/encoded using a number of schemes.
@@ -257,8 +266,7 @@ var schema = new parquet.ParquetSchema({
```


-Optional Fields
----------------
+### Optional Fields

By default, all fields are required to be present in each row. You can also mark
a field as 'optional' which will let you store rows with that field missing:
@@ -275,8 +283,7 @@ await writer.appendRow({name: 'banana' }); // not in stock
```


-Nested Rows & Arrays
---------------------
+### Nested Rows & Arrays

Parquet supports nested schemas that allow you to store rows that have a more
complex structure than a simple tuple of scalar values. To declare a schema
@@ -346,14 +353,12 @@ of that, knowing about the type of a field allows us to compress the remaining
data more efficiently.


-Nested Lists for Hive / Athena
------------------------
+### Nested Lists for Hive / Athena

Lists have to be annotated to be queryable with AWS Athena. See [parquet-format](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists) for more detail, and [`test/list.js`](test/list.js) in the test directory for a full working example with comments.


-List of Supported Types & Encodings
------------------------------------
+### List of Supported Types & Encodings

We aim to be feature-complete and add new features as they are added to the
Parquet specification; this is the list of currently implemented data types and
@@ -386,8 +391,7 @@ encodings:
</table>


-Buffering & Row Group Size
---------------------------
+## Buffering & Row Group Size

When writing a Parquet file, the `ParquetWriter` will buffer rows in memory
until a row group is complete (or `close()` is called) and then write out the row
@@ -403,14 +407,13 @@ writer.setRowGroupSize(8192);
```


-Dependencies
--------------
+## Dependencies

Parquet uses [thrift](https://thrift.apache.org/) to encode the schema and other
metadata, but the actual data does not use thrift.

-Notes
------
+
+## Notes

Currently parquet-cpp doesn't fully support DATA_PAGE_V2. You can work around this
by setting the useDataPageV2 option to false.
5 changes: 5 additions & 0 deletions bootstrap.js
@@ -0,0 +1,5 @@
import "regenerator-runtime/runtime";
import "core-js/stable";
const coreImportPromise = import('./parquet').catch(e => console.error('Error importing `parquet.js`:', e))

export const core = coreImportPromise;
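
Because the `.catch` in `bootstrap.js` logs the error and returns nothing, the exported `core` promise resolves to `undefined` when the import fails instead of rejecting. A consumer therefore has to check the resolved value. The sketch below illustrates this with a stand-in loader; `loadCore`, `requireCore`, and the `importer` parameter are hypothetical names, not part of this PR.

```javascript
// Mirrors the shape of bootstrap.js: log on failure, so the returned promise
// resolves to undefined rather than rejecting.
// `importer` is a hypothetical stand-in for `() => import('./parquet')`.
function loadCore(importer) {
  return importer().catch(e => console.error('Error importing `parquet.js`:', e));
}

// Consumer-side guard: turn the silent `undefined` back into a hard failure.
async function requireCore(corePromise) {
  const mod = await corePromise;
  if (mod === undefined) {
    throw new Error('parquet module failed to load; see the console for details');
  }
  return mod;
}
```

A caller would then write something like `const parquet = await requireCore(core)` and handle the thrown error in one place.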
74 changes: 74 additions & 0 deletions esbuild-plugins.js
@@ -0,0 +1,74 @@
/**
* This plugin resolves to a browser version of compression.js that
* does not include LZO or Brotli compression.
*/
const compressionBrowserPlugin = {
name: 'compressionBrowser',
setup(build) {
let path = require('path')
build.onResolve({filter: /^\.\/compression$/}, args => {
return {
path: path.resolve(__dirname, "lib","browser","compression.js")
}
})
}
}

// Lifted from https://esbuild.github.io/plugins/#webassembly-plugin
const wasmPlugin = {
name: 'wasm',
setup(build) {
let path = require('path')
let fs = require('fs')

// Resolve ".wasm" files to a path with a namespace
build.onResolve({ filter: /\.wasm$/ }, args => {
// If this is the import inside the stub module, import the
// binary itself. Put the path in the "wasm-binary" namespace
// to tell our binary load callback to load the binary file.
if (args.namespace === 'wasm-stub') {
return {
path: args.path,
namespace: 'wasm-binary',
}
}

// Otherwise, generate the JavaScript stub module for this
// ".wasm" file. Put it in the "wasm-stub" namespace to tell
// our stub load callback to fill it with JavaScript.
//
// Resolve relative paths to absolute paths here since this
// resolve callback is given "resolveDir", the directory to
// resolve imports against.
if (args.resolveDir === '') {
return // Ignore unresolvable paths
}
return {
path: path.isAbsolute(args.path) ? args.path : path.join(args.resolveDir, args.path),
namespace: 'wasm-stub',
}
})

// Virtual modules in the "wasm-stub" namespace are filled with
// the JavaScript code for compiling the WebAssembly binary. The
// binary itself is imported from a second virtual module.
build.onLoad({ filter: /.*/, namespace: 'wasm-stub' }, async (args) => ({
contents: `import wasm from ${JSON.stringify(args.path)}
export default (imports) =>
WebAssembly.instantiate(wasm, imports).then(
result => result.instance.exports)`,
}))

// Virtual modules in the "wasm-binary" namespace contain the
// actual bytes of the WebAssembly file. This uses esbuild's
// built-in "binary" loader instead of manually embedding the
// binary data inside JavaScript code ourselves.
build.onLoad({ filter: /.*/, namespace: 'wasm-binary' }, async (args) => ({
contents: await fs.promises.readFile(args.path),
loader: 'binary',
}))
},
}

module.exports = { compressionBrowserPlugin, wasmPlugin}
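
The `compressionBrowser` filter matches only the exact specifier `./compression`, so imports written as `./compression.js` or via another relative path are not redirected. The sketch below exercises the same resolve logic without esbuild. `makeCompressionBrowserPlugin` and `fakeResolve` are hypothetical helpers, and the fake `build` object implements only the `onResolve` registration used here.

```javascript
const path = require('path');

// Same shape as compressionBrowserPlugin, with the base directory passed in
// (the PR uses __dirname) so the result is easy to inspect.
function makeCompressionBrowserPlugin(baseDir) {
  return {
    name: 'compressionBrowser',
    setup(build) {
      build.onResolve({ filter: /^\.\/compression$/ }, () => ({
        path: path.resolve(baseDir, 'lib', 'browser', 'compression.js'),
      }));
    },
  };
}

// Minimal stand-in for esbuild's build object: record the handlers, then
// run the first one whose filter matches the import specifier.
function fakeResolve(plugin, importPath) {
  const handlers = [];
  plugin.setup({ onResolve: (options, callback) => handlers.push({ options, callback }) });
  for (const { options, callback } of handlers) {
    if (options.filter.test(importPath)) {
      return callback({ path: importPath });
    }
  }
  return undefined; // no plugin matched; a real build would resolve normally
}

const plugin = makeCompressionBrowserPlugin('/repo');
console.log(fakeResolve(plugin, './compression'));    // redirected to lib/browser
console.log(fakeResolve(plugin, './compression.js')); // not matched by the filter
```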

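The `wasm` plugin resolves each `.wasm` import twice: first into the `wasm-stub` namespace, then, when the generated stub imports the same path, into `wasm-binary`. The sketch below extracts that decision as a pure function and shows what the stub's default export does with the bytes. `resolveWasm`, `makeStub`, and `EMPTY_WASM` are hypothetical names introduced for illustration.

```javascript
const path = require('path');

// The onResolve decision from wasmPlugin, extracted as a pure function.
function resolveWasm(args) {
  if (args.namespace === 'wasm-stub') {
    // Second pass: the import comes from the stub, so load the raw binary.
    return { path: args.path, namespace: 'wasm-binary' };
  }
  if (args.resolveDir === '') {
    return undefined; // unresolvable path, left to other resolvers
  }
  // First pass: hand back an absolute path in the stub namespace.
  return {
    path: path.isAbsolute(args.path) ? args.path : path.join(args.resolveDir, args.path),
    namespace: 'wasm-stub',
  };
}

// Equivalent of the generated stub's default export, given the binary bytes.
const makeStub = (wasmBytes) => (imports) =>
  WebAssembly.instantiate(wasmBytes, imports).then(result => result.instance.exports);

// Smallest valid WebAssembly module: magic number plus version, no sections.
const EMPTY_WASM = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);

const first = resolveWasm({ path: './codec.wasm', namespace: 'file', resolveDir: '/src' });
const second = resolveWasm({ path: first.path, namespace: first.namespace, resolveDir: '/src' });
console.log(first.namespace, '->', second.namespace);
```

In bundled code the stub is consumed as a function: importing a `.wasm` file yields a loader, and calling it with an imports object returns a promise for the instance's exports.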
23 changes: 23 additions & 0 deletions esbuild-serve.js
@@ -0,0 +1,23 @@
/**
* Use this to serve the parquetjs bundle at http://localhost:8000/main.js
* It attaches the parquet.js exports to a "parquetjs" global variable.
* See the example server for how to use it.
*/
const {compressionBrowserPlugin, wasmPlugin} = require("./esbuild-plugins");
// esbuild has TypeScript support by default. It will use tsconfig.json
require('esbuild')
.serve({
servedir: __dirname,
}, {
entryPoints: ['parquet.js'],
outfile: 'main.js',
define: {"process.env.NODE_DEBUG": "false", "process.env.NODE_ENV": "\"production\"", global: "window" },
platform: 'browser',
plugins: [compressionBrowserPlugin,wasmPlugin],
sourcemap: "external",
bundle: true,
globalName: 'parquetjs',
inject: ['./esbuild-shims.js']
}).then(server => {
console.log("serving parquetjs", server)
})
3 changes: 3 additions & 0 deletions esbuild-shims.js
@@ -0,0 +1,3 @@
const buffer = require("buffer/").Buffer;
export let Buffer = buffer;

32 changes: 32 additions & 0 deletions esbuild.js
@@ -0,0 +1,32 @@
const path = require("path")
const {compressionBrowserPlugin, wasmPlugin} = require("./esbuild-plugins");
// esbuild has TypeScript support by default
const outfile = 'parquet-bundle.min.js'
require('esbuild')
.build({
bundle: true,
entryPoints: ['parquet.js'],
outdir: path.resolve(__dirname, "dist","browser"),
define: {
"process.env.NODE_DEBUG": "false",
"process.env.NODE_ENV": "\"production\"",
global: "window"
},
globalName: 'parquetjs',
inject: ['./esbuild-shims.js'],
minify: true,
platform: 'browser', // default
plugins: [compressionBrowserPlugin, wasmPlugin],
target: "esnext" // default
})
.then(res => {
if (!res.warnings.length) {
console.log("built with no errors or warnings")
}
})
.catch(e => {
console.error("Finished with errors: ", e.toString());
});
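
The `.then` handler in `esbuild.js` prints only when there are no warnings, so a build that succeeds with warnings produces no output. A helper along the following lines could report them. `summarizeBuild` is a hypothetical name, and the sketch assumes each entry in `warnings` carries a `text` string, as esbuild's message objects do.

```javascript
// Hypothetical helper: turn an esbuild-style build result into a log message.
// Assumes `result.warnings` is an array of objects with a `text` string.
function summarizeBuild(result) {
  const warnings = result.warnings || [];
  if (warnings.length === 0) {
    return 'built with no errors or warnings';
  }
  const lines = warnings.map(w => `  - ${w.text}`);
  return [`built with ${warnings.length} warning(s):`, ...lines].join('\n');
}

// Usage with stand-in result objects (no esbuild required):
console.log(summarizeBuild({ warnings: [] }));
console.log(summarizeBuild({ warnings: [{ text: 'example warning' }] }));
```

In the build script this would replace the body of the `.then` callback, for example `.then(res => console.log(summarizeBuild(res)))`.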



8 changes: 8 additions & 0 deletions examples/server/README.md
@@ -0,0 +1,8 @@
# Example Server
This is a toy server that illustrates how to use the parquetjs library built with esbuild.
To run it:
1. `npm install`
1. View and edit the files in `views` to taste
1. `node app.js`
1. Build and serve the parquetjs bundle in the main parquetjs directory: `npm run serve`
1. Visit `http://localhost:3000`, click the buttons, and experiment in the console.
20 changes: 20 additions & 0 deletions examples/server/app.js
@@ -0,0 +1,20 @@
const express = require('express')
const path = require("path")
const app = express()
const port = 3000

app.use(express.static(path.join(__dirname, 'public')));
app.engine('ejs', require('ejs').__express);

app.set('view engine', 'ejs');

app.get('/', (req, res) => {
res.render('parquetFiles', {
title: "Parquet Files",
port: port
})
})

app.listen(port, () => {
console.log(`Example app listening at http://localhost:${port}`)
})
13 changes: 13 additions & 0 deletions examples/server/package.json
@@ -0,0 +1,13 @@
{
"name": "untitled",
"version": "1.0.0",
"main": "index.js",
"license": "MIT",
"dependencies": {
"@dsnp/parquetjs": "../parquetjs",
"ejs": "^3.1.6"
},
"devDependencies": {
"express": "^4.17.1"
}
}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added examples/server/public/files/fruits.parquet
Binary file not shown.
Binary file not shown.
Binary file added examples/server/public/files/list.parquet
Binary file not shown.