
Allow URLs to be inputted as BED files #329

Merged: 16 commits into vgteam:master, Aug 30, 2023
Conversation

@ducku (Collaborator) commented Aug 18, 2023

#295

  • URLs can be entered by pasting them into the text field and clicking the create option in the dropdown
  • BED files are fetched and scanned, then discarded
  • Chunks specified in the BED file are stored until server termination

@adamnovak (Member) left a comment

This looks pretty good, but I think the file management/sandboxing isn't quite robust enough.

  1. Because there's no unified safe fetch wrapper function, we only limit content length on the BED file itself, and we don't seem to do things like check for unsuccessful response codes.
  2. The path manipulation is done with a mixture of + and path.resolve, neither of which will properly protect against attacks like using ../ or a leading / to place stuff outside of the dedicated sandbox directories for the downloads. We should have a function that gets a target path based on a chunk-local filename, some identifier for the chunk, and some identifier for the BED file, and guarantees that the result is actually inside the directory we want it in, and we should consistently use that (a sketch follows this list).
  3. The same chunk name seems like it will collide if it is used in different BED files at the same time. We might want to use some hashing to make up unique directories per BED.
  4. I don't think downloading all the chunks up front (and in serial!) is going to scale to larger BED files. We probably want to download chunks when needed, and download all the files in the chunk in parallel (by launching the downloads and then awaiting all of them).
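
A minimal sketch of the sandboxed path helper points 2 and 3 ask for, assuming a single download root; the helper name, the SHA-256 hashing of the BED URL, and the directory layout are illustrative assumptions, not this PR's code:

import path from "path";
import crypto from "crypto";

const DOWNLOAD_ROOT = path.resolve("tmp/downloads");

// Returns a path guaranteed to live inside DOWNLOAD_ROOT, with a per-BED
// directory derived from a hash of the BED URL so identical chunk names
// in different BED files cannot collide.
function sandboxedChunkPath(bedURL, chunkName, fileName) {
  const bedHash = crypto.createHash("sha256").update(bedURL).digest("hex");
  const target = path.resolve(DOWNLOAD_ROOT, bedHash, chunkName, fileName);
  // path.resolve collapses ../ and honors a leading /, so checking the
  // prefix afterwards rejects anything that escaped the sandbox.
  if (!target.startsWith(DOWNLOAD_ROOT + path.sep)) {
    throw new Error("Path escapes download sandbox: " + fileName);
  }
  return target;
}

For point 4, the per-file downloads within a chunk could then be launched together and awaited with Promise.all rather than run in serial.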

@@ -1,6 +1,7 @@
import React, { Component } from "react";
import PropTypes from "prop-types";
import Select from "react-select";
//import Select from "react-select";

Suggested change
//import Select from "react-select";

@@ -63,13 +64,16 @@ export class SelectionDropdown extends Component {
};

return (
<Select
<CreatableSelect

Has everywhere we use this been updated to cope with/disallow creating novel options?

src/server.mjs Outdated
Comment on lines 65 to 66
const SCRATCH_DATA_PATH = "tmp/";
const TEMP_DATA_PATH = config.tempDirPath;

These seem very similar; either they should be the same thing or we should justify why they can't be.

src/server.mjs Outdated
const retrieveChunks = async(bedURL, chunks) => {
// go up a directory level in the URL
const dataPath = bedURL.split('/').slice(0, -1).join('/');
for (const dirName of chunks) {

It looks like this is assuming that the BED chunk paths are always directory names (chunk1 or chunks/chunk1), and never things like full URLs (http://example.com/beds/bedfile.bed pointing to http://example.com/chunks/chunk1), or strings with upwards path traversals in them (http://example.com/beds/bedfile.bed pointing to ../chunks/chunk1).

Even for non-malicious files, where the references are interpretable as both relative URLs and relative file paths, I'm not sure we can rely on that. If we want to impose that constraint, we need to document and enforce it.
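
For concreteness, standard WHATWG URL resolution (Node's new URL()) already accepts all three forms as references against the BED file's own URL:

const bed = "http://example.com/beds/bedfile.bed";
new URL("chunk1", bed).toString();
// -> "http://example.com/beds/chunk1"
new URL("../chunks/chunk1", bed).toString();
// -> "http://example.com/chunks/chunk1"
new URL("http://example.com/chunks/chunk1", bed).toString();
// -> "http://example.com/chunks/chunk1"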

src/server.mjs Outdated
const destinationPath = path.resolve(dataPath, fileName);

const fileStream = fs.createWriteStream(destinationPath, { flags: 'wx' });
await finished(Readable.fromWeb(response.body).pipe(fileStream));

I think the 15-second fetch timeout we're using only applies to the time between making the fetch request and getting a response object. So this pipe seems to be allowed to run forever. We should probably limit the total number of bytes transferred by the pipe, and error out if the file is too big.

We also probably want to check the reported content length as soon as the response starts to come in, and error out if that is too big.
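
A sketch of both checks, under the assumption of a hypothetical MAX_FILE_BYTES limit; byteLimiter and checkDeclaredLength are invented names for illustration:

import { Transform } from "stream";

const MAX_FILE_BYTES = 100 * 1024 * 1024;

// Reject early if the server declares an oversized body. Servers can omit
// or lie about content-length, so this is only a first line of defense.
function checkDeclaredLength(response, maxBytes) {
  const declared = Number(response.headers.get("content-length"));
  if (declared && declared > maxBytes) {
    throw new Error(`Declared content-length ${declared} exceeds limit`);
  }
}

// Counts bytes flowing through the pipe and errors once the cap is passed,
// bounding the transfer even when content-length was missing or wrong.
function byteLimiter(maxBytes) {
  let seen = 0;
  return new Transform({
    transform(chunk, encoding, callback) {
      seen += chunk.length;
      if (seen > maxBytes) callback(new Error("Download exceeded byte limit"));
      else callback(null, chunk);
    },
  });
}

The existing pipe would then become Readable.fromWeb(response.body).pipe(byteLimiter(MAX_FILE_BYTES)).pipe(fileStream).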

src/server.mjs Outdated
Comment on lines 1318 to 1320
const dirName = bed_info["chunk"][i];
const chunk_path = `${dataPath}${dirName}`;
let track_json = `${chunk_path}/tracks.json`

I think this is the third place we manually paste together stuff from the BED line to get the path to a file. The third time you just implement the same thing is about when it's time to make it a function. We probably want a function we can call to get us the local path for a file, to either put it at or read it from.


I feel like getChunkPath wants to be the function that, given a BED URL and a chunk path string from that BED, determines the local path at which to look for stuff. But we can't use it here because it also does other stuff like actually downloading the files. Maybe it needs to get split up?

src/server.mjs Outdated
const chunkContent = await response.text();
const fileNames = chunkContent.split("\n");

for (const fileName of fileNames) {

We probably want some limit on the number of file names here.

src/server.mjs Outdated
let chunkContentURL = dataPath + "/" + dirName + "/chunk_contents.txt";
let response = await fetch(chunkContentURL, options);

const chunkContent = await response.text();

It looks like there's no limit on the size of the chunk contents file that we will accept. It could be 30 gigabytes and we will run out of memory to store it and crash.
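
A sketch combining the size cap and the file-count cap asked for in the last two comments, reusing checkDeclaredLength from the earlier sketch; the constants and the helper name are assumptions:

const MAX_CHUNK_CONTENTS_BYTES = 1024 * 1024;
const MAX_FILE_NAMES = 1000;

// Parse chunk_contents.txt with hard limits, so a huge or hostile listing
// errors out instead of exhausting memory. A fully robust version would
// apply the streaming byte cap sketched earlier instead of buffering.
async function parseChunkContents(response) {
  checkDeclaredLength(response, MAX_CHUNK_CONTENTS_BYTES);
  const text = await response.text();
  if (text.length > MAX_CHUNK_CONTENTS_BYTES) {
    throw new Error("chunk_contents.txt is too large");
  }
  const fileNames = text.split("\n").filter((name) => name !== "");
  if (fileNames.length > MAX_FILE_NAMES) {
    throw new Error("chunk_contents.txt lists too many files");
  }
  return fileNames;
}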

src/server.mjs Outdated
@@ -1231,6 +1393,9 @@ export function start() {
connections: undefined,
// Shut down the server
close: async () => {
// remove the temporary directory
fs.rmSync(config.tempDirPath, { recursive: true, force: true });

Cleaning the whole thing at the end is a pretty good idea.

Comment on lines +60 to +63
do
printf "$file\n" >> $OUTDIR/chunk_contents.txt
done

Do we have our own loop just to stop chunk_contents.txt from listing itself?

@adamnovak (Member) left a comment

The new fetch wrapper looks pretty great! It does require storing whole files in memory, though, which we would need to change to support bigger files.

But I'm very confused about the baseURL.txt system (it doesn't seem like it should be needed), and about the whole notion of partially downloading all the chunks in serial at the beginning, and then completely downloading just the ones we need later. We might need to introduce a new API endpoint to get the tracks for a particular BED region, so that the server doesn't have to load them all up and send them all to the client at once; that all-at-once behavior is why we currently need to visit every BED region right away on the server side.
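
A hedged sketch of the kind of endpoint this implies; the route name, request shape, and getChunkTracks helper are assumptions, not anything in this PR:

// Serve the track list for one BED region on demand, so the server does
// not have to prefetch track data for every region in the BED file.
app.post("/getChunkTracks", (req, res, next) => {
  const { bedFile, chunk } = req.body;
  getChunkTracks(bedFile, chunk)
    .then((tracks) => res.json({ tracks }))
    .catch(next); // route async errors into the Express error middleware
});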

src/server.mjs Outdated
const chunkDirPath = path.resolve(config.tempDirPath, chunkPath);
const baseURLPath = path.resolve(chunkDirPath, "baseURL.txt");
// if the chunk directory exists and the baseURL.txt is in it
if (fs.existsSync(chunkDirPath) && fs.existsSync(baseURLPath)) {

I don't think the baseURLPath file can exist without the parent directory existing.

src/server.mjs Outdated
@@ -654,9 +657,10 @@ function returnErrorMiddleware(err, req, res, next) {
app.use(returnErrorMiddleware);

// Gets the chunk path from a region specified in a bedfile

This comment makes this look like it is a pure function that determines what path on the local disk some data for a BED file ought to live at.

Actually it is responsible for downloading that data if it doesn't exist. That's surprising and we should probably include in the comment something about how this function ensures that the chunk data is available on the local disk and returns a path to it.
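
For example, the comment could read something like this (suggested wording only):

// Ensures the chunk data for a region specified in a BED file is present
// on the local disk, downloading it from the BED file's URL if it is not
// already there, and returns the local path to the chunk.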

src/server.mjs Outdated
Comment on lines 1344 to 1348
const tempTracksPath = path.resolve(config.tempDirPath, sanitize(dirName), "tracks.json");
// check the temporary directory for downloaded tracks info
if (fs.existsSync(tempTracksPath)) {
track_json = tempTracksPath;
}

What if this file happens to exist, but the BED file didn't come from a URL and was actually a local file?

}
}

return new Response(new Blob(dataRead), { headers: response.headers });

Maybe the interface of this function should change to just return a Blob?

Alternately, if we want to support downloading files that are larger than is convenient to hold in memory, but still use a size limit, we'd want to return a stream from this function, but with a size-limiting filter on the stream that will raise an error when the reader goes to read past the size limit.
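
A sketch of that stream-returning alternative, reusing byteLimiter and checkDeclaredLength from the earlier sketch; safeFetchStream is a hypothetical name:

import { Readable } from "stream";

// Returns a Node stream instead of buffering the whole body, but still
// enforces the size limit: reading past maxBytes raises an error.
async function safeFetchStream(url, options, maxBytes) {
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Fetch of ${url} failed with status ${response.status}`);
  }
  checkDeclaredLength(response, maxBytes);
  return Readable.fromWeb(response.body).pipe(byteLimiter(maxBytes));
}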

src/server.mjs Outdated
}

// downloads files of the specified chunk name from the destination of the bedURL
// when includeContent is false, only the tracks.json file is downloaded

I don't think this is a good design. When used with a false flag, this isn't really retrieving the chunk in a meaningful sense.

If we actually do want to create N directories, each with a base URL file and a track metadata file, when we first see a BED file with N entries, then we should split that out from downloading the actual chunk files. We don't need the code to make the base URL file when actually retrieving the full chunk, and we don't need the code to look at the chunk_contents.txt file when retrieving tracks.json (because we already know the name and we can just fetch it if it exists).

src/server.mjs Outdated
Comment on lines 1215 to 1216
// baseURL/dirName/"chunk_contents.txt"
let chunkContentURL = new URL(path.join(sanitize(chunk), "chunk_contents.txt"), baseURL).toString();

I don't think the sanitize is appropriate here when creating the URL to fetch stuff from. I think / and .. are meant to be allowed in the URL that the BED file points to for its chunk. This looks like it is going to strip them out.

We also want to support e.g. baseURL/dirName/subDirName/chunk_contents.txt.

src/server.mjs Outdated
Comment on lines 1235 to 1241
// write the baseURL to a file
const baseURLFile = path.resolve(chunkDir, "baseURL.txt");
fs.writeFile(baseURLFile, baseURL, function (error) {
if (error){
throw error;
}
})

I don't understand why this system is necessary.

When the user asks to get chunk data for a tube map rendering, they send along the BED URL, and we are already finding the right region in the BED file to get the chunk information so that we know where to look on the server for the downloaded chunk data. So we can always make the base URL of the chunk with a new URL() based on the BED file's own URL and whatever it lists as the path to the chunk, right? Why do we need to make it once and save it in a flat file for every chunk?


@adamnovak (Member)

Plan:

  • Make sure you can't "go" without any tracks
  • Make having tracks.json mandatory for a premade chunk
  • Blank out the tracks when the user picks a BED region, and request the tracks for that chunk from the server. (We might need to tell the client whether each BED region has a premade chunk, and only do this for the ones that do)
  • When the tracks for the chunk arrive, put them in, which will re-enable "go"

Then we will be able to only fetch tracks for the BED regions we actually need them for.

@adamnovak (Member)

We talked about this more, and now the plan is to merge this as soon as the minor problems/cross-talk issues are fixed, and then refactor so we can download large files without holding them in memory and avoid downloading all the track data up front, since neither of those seems like it will be a problem for Lancet data in particular.

@adamnovak (Member)

OK, I've refactored this a bit to convince myself that we won't end up letting people write stuff on the server outside of where they are supposed to. I also fixed error handling, which doesn't magically work for async API functions we pass to Express since we don't have Express 5 yet (it's still in beta). I think eventually we will have to move all the guts of the tube map from next()-based error handling and careful callback ordering to promises and async/await, but I am not going to do that today.

Also I think I found a use case for premade chunks with no track JSON, which is that cactus.bed seems to use one, which got deleted and which I am restoring.

@adamnovak adamnovak merged commit 586b044 into vgteam:master Aug 30, 2023
1 check passed

Successfully merging this pull request may close these issues.

Make BED system able to retrieve pre-made chunks from URLs rather than local folders
2 participants