Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added gzipped WARC support, started TypeScript bindings #13

Merged
merged 6 commits into from
Oct 28, 2018

Conversation

hyl
Copy link
Contributor

@hyl hyl commented Oct 12, 2018

Hi! Few different themes in this PR - happy to revisit each of them individually, but figured it was worth starting the conversation.

I've added support in the base WARC writer for writing .warc.gz files - the GZIP spec specifies that records should be compressed individually, so it calls zlib.gzipSync on each buffer that goes into writeRecordBlock if the environment variable NODEWARC_WRITE_GZIPPED exists. Do you want it to be controlled by an envvar or can we pass in some kind of config? I couldn't see an obvious way to do the latter, so I decided to go for the former.

I've also started working on a TypeScript definition file - a lot of it is auto-gen and a bit rough and ready, but adding to it as I go. I'll continue working on this as I go.

@N0taN3rd
Copy link
Owner

@hyl thanks for adding gzip support and the typescript definition file!

Per NODEWARC_WRITE_GZIPPED I would like to have writing .warc.gz vs .warc configurable via both the envvar NODEWARC_WRITE_GZIPPED and config option (if possible).
I as well do not see an obvious way go about it but here are my initial thoughts.

Could change the second arg of WARCWriterBase.initWARC to accept an options object rather than a boolean indicating appending e.g.

const WARCInitOpts = {
  appending: false,
  gzip: process.env.NODEWARC_WRITE_GZIPPED != null
}

initWARC (warcPath, options = {}){
  this.opts = { ...options, ...WARCInitOpts }
}

This would be the simplest way to go about this unless this configuration goes into the constructor.
Do you have any thoughts on this / suggestions to improve how WARC writing is done?

Per the typescript definition file, I am currently in the process of improving the parsing and refactoring node-warc to be less janky when it comes to its exports.
Would working off the parsing-fix branch help in generating a typescript def file for node-warc?
I think typescript recently got the ability to lint regular JS via JSDOC comments so maybe that would help?
Either way I'm a big +1 for the typescript definition file, thanks for creating one.

@hyl
Copy link
Contributor Author

hyl commented Oct 15, 2018

👍 to the introduction of WARCInitOpts - think that's the best way to manage config, and makes it extensible if we need to add anything in the future. I really like the rest of the WARC writing process - it's a smart solution, and buffers per-record makes extensibility a lot easier. The only thing I can think of writing-wise is automatic WARC chunking - it's something I've imlpemented manually as a wrapper in my project, but could be nice to have an option to split and write new files after <x> MB.

I'll see if dts-gen does a better job on the parsing-fix branch - hopefully should pick up the args correctly. If not then I'm going to spend a bit of time doing it properly and then we'll just have to manually update it with releases. First pass was just to get some definitions in to stop my IDE complaining at me!

@N0taN3rd N0taN3rd changed the base branch from master to dev October 21, 2018 19:39
@N0taN3rd
Copy link
Owner

@hyl apologies for the delay in my reply... been a busy week for me.
Ok awesome glad you agree the WARCInitOpts is a good idea will go with that.

I changed the merge base from master to dev and broke this PR....
Would you mind merging my origin/dev into your branch?? (the merge does not appear to be too complex thankfully)
Or I can do it if you do not have the time.
Want to get this PR in and finally do another npm release now that I was able to finally put some TLC into the project....
Putting priority on node-warc this week

# Conflicts:
#	README.md
#	lib/writers/warcWriterBase.js
@N0taN3rd
Copy link
Owner

N0taN3rd commented Oct 28, 2018

@hyl I merged N0taN3rd/node-warc@dev into your branch. Merging this PR thanks for it!

@N0taN3rd N0taN3rd merged commit f595893 into N0taN3rd:dev Oct 28, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants