Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export improvements #400

Closed
wants to merge 1 commit into from
Closed

Conversation

seidtgeist
Copy link
Contributor

@seidtgeist seidtgeist commented May 28, 2020

Based on #302 I’ve been trying to improve export, and this has lead me on a path™. Have you considered using one of these options?

  1. VACUUM INTO a temporary file that’s read, unlinked and returned
  2. Using sqlite’s Online Backup API to create a copy that’s returned
  3. Duplicating and returning the working memory via deserialize

In this PR, I’m giving (1) a shot. I’m very interested to hear what you have to say. Ultimately, option (2) might be better suited for a generic export function since it doesn’t incur a VACUUM, and I will most likely give that a try as well.

On a related note: Since the current filesystem is MEMFS, have you considered just using an in-memory database instead? Right now, sqlite believes it’s writing to a persistent filesystem, but MEMFS is backed by memory. Since everything’s happening in-memory anyway, why not use sqlite’s native memory-backed mechanics and drop the MEMFS indirection along the way?

I have no experience with IDBFS (#397), and boy would I prefer if we had something like NativeIO.

Some notes to self:

@seidtgeist seidtgeist changed the title Persistence improvements Export improvements May 28, 2020
@seidtgeist
Copy link
Contributor Author

seidtgeist commented May 28, 2020

Regarding the test failure:

  • Only the sql-asm.js target fails, all other builds (wasm, asm.js debug, workers) pass
  • It’s breaking because getTempRet0 is undefined when called
  • AFAIK getTempRet0 should be provided by emcc since it’s being called

Looking through emscripten issues it looks like there have been various fun moments with getTempRet0/setTempRet0 over the years 😅

@lovasoa
Copy link
Member

lovasoa commented May 28, 2020

The issue55 test is the one in particular that tests this part of the code, I would like to make sure it passes before merging.

Since the current filesystem is MEMFS, have you considered just using an in-memory database instead?

👍 This seems like the most promising and logical thing to me. This should allow us to build with -s FILESYSTEM=0, and hopefully reduce the size of the distributed bundle, without loosing any functionality, just using sqlite3_serialize and sqlite3_deserialize.

@seidtgeist
Copy link
Contributor Author

The issue55 test is the one in particular that tests this part of the code, I would like to make sure it passes before merging.

Yeah, I’m trying to figure out how to make emscripten include env.getTempRet0 in the sql-asm.js target output. If I copy the implementation from one of the other files it passes.

Also, VACUUM can take some time and possibly lock up UI or workers. Do you have any concerns with changing the implementation in such a way?

👍 This seems like the most promising and logical thing to me. This should allow us to build with -s FILESYSTEM=0, and hopefully reduce the size of the distributed bundle, without loosing any functionality, just using sqlite3_serialize and sqlite3_deserialize.

I’ll give this a try, too. Having way too much fun with things!

@seidtgeist
Copy link
Contributor Author

@kripken Since this was/is your baby, do you have any opinions on exporting and in-memory vs filesystem? And if you have any getTempRet0 tricks up your sleeve... 😜

@kripken
Copy link
Collaborator

kripken commented May 28, 2020

I think going directly to memory might be better than using a filesystem, yeah. Could be much faster to just malloc some room and use that.

For tempRet0, if you need it you can export it to make sure it exists

-s "EXTRA_EXPORTED_RUNTIME_METHODS=['getTempRet0']"

@Taytay
Copy link
Contributor

Taytay commented Jul 16, 2020

For what it's worth, I would love to avoid the overhead of the virtual FS and just tell SQLite that it's writing to memory. Thank you for exploring this @seidtgeist!

@Taytay
Copy link
Contributor

Taytay commented Mar 12, 2021

@seidtgeist, did you do any experimentation with the memory-based option that would allow us to move away from a virtual FS? I No pressure! I just didn't want to duplicate work if I were to dive in.

@seidtgeist
Copy link
Contributor Author

@Taytay I have not, but there’s a custom VFS in deno-sqlite that might interest you: https://github.com/dyedgreen/deno-sqlite/blob/master/build/src/vfs.c

@rhashimoto
Copy link

Since the current filesystem is MEMFS, have you considered just using an in-memory database instead?

👍 This seems like the most promising and logical thing to me. This should allow us to build with -s FILESYSTEM=0, and hopefully reduce the size of the distributed bundle, without loosing any functionality, just using sqlite3_serialize and sqlite3_deserialize.

One drawback of that is export will use more RAM, potentially doubling peak WASM memory usage. An application could reach a point where the database can no longer be exported because there is not enough contiguous WASM memory for sqlite3_deserialize. I don't think this is an issue with the current implementation of export.

@seidtgeist
Copy link
Contributor Author

One drawback of that is export will use more RAM, potentially doubling peak WASM memory usage. An application could reach a point where the database can no longer be exported because there is not enough contiguous WASM memory for sqlite3_deserialize. I don't think this is an issue with the current implementation of export.

I agree. I have since learned more about sqlite and changed my opinion on this topic:

There is great utility in having a filesystem (totally aware of the irony expressed) exposed to sqlite, such as attaching multiple databases, vacuuming into a new file and other useful features that rely on a filesystem being present.

Areas worth investigating:

  • Choosing VFS compile time flags accommodating an in-memory filesystem or IDB
  • A custom VFS accommodating the same

@rhashimoto
Copy link

Areas worth investigating:

  • Choosing VFS compile time flags accommodating an in-memory filesystem or IDB
  • A custom VFS accommodating the same

I recently built a SQLite VFS that uses IndexedDB as a block device. It doesn't support multiple connections so the locking calls are just stubs. It works, but I don't have good tests or performance measures for it so I would call it a proof of concept.

The main obstacle is that the VFS model is synchronous while IndexedDB is not, which I bridge with Asyncify. This creates two issues:

  1. There's a performance hit. SQLite is effectively a worst case performance scenario for using Asyncify. I'm not sure whether the added slowdowns within SQLite itself will be dominated by IndexedDB throughput for typical usage, though.

  2. It's incompatible with the existing SQL.js synchronous API (I wrote my own rudimentary API). Using asynchronous storage propagates the asynchronicity to the application code. It can't be a drop-in replacement unless you're just mirroring memory, like IDBFS.

Getting a synchronous Storage Foundation API would bypass both issues...but it isn't clear when that would arrive, if ever. Note that IndexedDB originally had a synchronous API that ultimately was removed. Also I don't know of any existing or proposed way to implement synchronous locks, which might mean asynchronicity would still be required to support multiple connections.

@seidtgeist
Copy link
Contributor Author

I recently built a SQLite VFS that uses IndexedDB as a block device. It doesn't support multiple connections so the locking calls are just stubs. It works, but I don't have good tests or performance measures for it so I would call it a proof of concept.

That’s exciting! Would you be able to share it? It would help me with writing a memory VFS without needing to through the FS module.

The main obstacle is that the VFS model is synchronous while IndexedDB is not, which I bridge with Asyncify. This creates two issues:

  1. There's a performance hit. SQLite is effectively a worst case performance scenario for using Asyncify. I'm not sure whether the added slowdowns within SQLite itself will be dominated by IndexedDB throughput for typical usage, though.
  2. It's incompatible with the existing SQL.js synchronous API (I wrote my own rudimentary API). Using asynchronous storage propagates the asynchronicity to the application code. It can't be a drop-in replacement unless you're just mirroring memory, like IDBFS.

Getting a synchronous Storage Foundation API would bypass both issues...but it isn't clear when that would arrive, if ever. Note that IndexedDB originally had a synchronous API that ultimately was removed. Also I don't know of any existing or proposed way to implement synchronous locks, which might mean asynchronicity would still be required to support multiple connections.

Ok, I’ve been wondering about this. Since I don’t expect a sync filesystem API to arrive ever (persistence? sync? low level? lol), would it be more reasonable to keep files in memory and persist them to in a separate process?

@rhashimoto
Copy link

I recently built a SQLite VFS that uses IndexedDB as a block device.

That’s exciting! Would you be able to share it? It would help me with writing a memory VFS without needing to through the FS module.

I don't quite understand the advantages of a memory VFS over the current in-memory filesystem, can you explain? In any case, I can share what I have; I'll try to DM you.

Since I don’t expect a sync filesystem API to arrive ever (persistence? sync? low level? lol), would it be more reasonable to keep files in memory and persist them to in a separate process?

I wouldn't count a synchronous API in or out completely, at least for Worker contexts. There apparently is an actual implementation in the Chrome origin trial described here and mentioned in #302. I'm prepared for disappointment, but I do have hope for it.

I would like to see SQL.js either define its own official asynchronous API or make it easier to develop and use alternative APIs (including asynchronous ones). That's what I was getting at in my #377 comment.

@seidtgeist
Copy link
Contributor Author

I don't quite understand the advantages of a memory VFS over the current in-memory filesystem, can you explain? In any case, I can share what I have; I'll try to DM you.

Based on your previous comments I assume you mean emscripten’s MEMFS and not sqlite’s :memory: databases, so that database files can be easily accessed without needing to serialize or copy.

I haven’t looked much into sqlite’s VFS interface yet, so I don’t know whether there’s additional information/parameters available to the VFS interface that might be useful in implementing something better. Looking at https://www.sqlite.org/src/doc/trunk/src/test_demovfs.c it doesn’t seem that way.

  • The performance gain of an in-memory VFS compared to the default VFS → posix shims → MEMFS path is probably minimal.
  • The VFS interface is very close to posix and there’s no additional information on that layer.

I’m interested in this area because I once considered streaming WAL pages to remote persistent storage.

I wouldn't count a synchronous API in or out completely, at least for Worker contexts. There apparently is an actual implementation in the Chrome origin trial described here and mentioned in #302. I'm prepared for disappointment, but I do have hope for it.

Fair enough, but it’s not something I’ll expect to be available universally within the next three years or so.

I would like to see SQL.js either define its own official asynchronous API or make it easier to develop and use alternative APIs (including asynchronous ones). That's what I was getting at in my #377 comment.

I agree with your points. I’ve started my own emscripten-compiled version of the sqlite amalgamation to learn more about the C API and optimize it for my use case. Since I‘m not banking on widely available and reliable persistent storage any time soon I’m considering IDB for local persistence and S3 as a remote key-value store for reliable persistence. I find it interesting to think about guaranteeing data consistency on the application level and minimizing deltas.

@Taytay
Copy link
Contributor

Taytay commented Apr 21, 2021

One of the concerns I have with the VFS shims is that it appears to use exceptions for some unexceptional things. When I debug SQL.js and tell it to breaking on all exceptions (not just uncaught), it's constantly breaking due to some sort of exception-based flow control. I think it's due to SQLite performing stat calls on non-existent lock files, but it's been a long time since I looked.

@rhashimoto
Copy link

rhashimoto commented Apr 21, 2021

One of the concerns I have with the VFS shims is that it appears to use exceptions for some unexceptional things. When I debug SQL.js and tell it to breaking on all exceptions (not just uncaught), it's constantly breaking due to some sort of exception-based flow control. I think it's due to SQLite performing stat calls on non-existent lock files, but it's been a long time since I looked.

You're right that SQL.js throws and catches a bunch of exceptions looking for files on start up, but I think that's actually all in the Emscripten filesystem. When it sets up its initial directories and devices, it checks to see if creating them is allowed here:

    mayCreate: function(dir, name) {
      try {
        var node = FS.lookupNode(dir, name);
        return {{{ cDefine('EEXIST') }}};
      } catch (e) {
      }
      return FS.nodePermissions(dir, 'wx');
    },

And indeed the non-existence of a file is signaled with an exception. Amusingly, the error condition here occurs on the lack of an exception.

But I don't think this is related to SQLite VFS, other that the standard VFS uses the Emscripten FS. I don't see any exceptions, caught or uncaught, using my own VFS implementations (other than from my bugs).

@Taytay
Copy link
Contributor

Taytay commented Apr 24, 2021

Ah yes - you're right. I wasn't clear in my statement. The built in Emcripten posix-emulating file system was what concerned me. It sounds like your implementation is avoiding those issues and that sounds promising.

@seidtgeist seidtgeist closed this Sep 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants