
Allow URLs to be inputted as BED files #329

Merged: 16 commits into vgteam:master, Aug 30, 2023
Conversation

@ducku (Collaborator) commented Aug 18, 2023

#295

  • URLs can be entered by pasting them into the text field and clicking the create option in the dropdown
  • BED files are fetched and scanned, then discarded
  • Chunks specified in the BED file are stored until server termination

@adamnovak (Member) left a comment

This looks pretty good, but I think the file management/sandboxing isn't quite robust enough.

  1. Because there's no unified safe fetch wrapper function, we only limit content length on the BED file itself, and we don't seem to do things like check for unsuccessful response codes.
  2. The path manipulation is done with a mixture of + and path.resolve, neither of which will properly protect against attacks like using ../ or a leading / to place stuff outside of the dedicated sandbox directories for the downloads. We should have a function that gets a target path based on a chunk-local filename, some identifier for the chunk, and some identifier for the BED file, and guarantees that the result is actually inside the directory we want it in, and we should consistently use that (a sketch follows this list).
  3. The same chunk name seems like it will collide if it is used in different BED files at the same time. We might want to use some hashing to make up unique directories per BED.
  4. I don't think downloading all the chunks up front (and in serial!) is going to scale to larger BED files. We probably want to download chunks when needed, and download all the files in the chunk in parallel (by launching the downloads and then awaiting all of them).
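
A minimal sketch of the sandboxed path helper points 2 and 3 ask for, assuming a single download root; the helper name, the SHA-256 hashing of the BED URL, and the directory layout are illustrative assumptions, not this PR's code:

import path from "path";
import crypto from "crypto";

const DOWNLOAD_ROOT = path.resolve("tmp/downloads");

// Returns a path guaranteed to live inside DOWNLOAD_ROOT, with a per-BED
// directory derived from a hash of the BED URL so identical chunk names
// in different BED files cannot collide.
function sandboxedChunkPath(bedURL, chunkName, fileName) {
  const bedHash = crypto.createHash("sha256").update(bedURL).digest("hex");
  const target = path.resolve(DOWNLOAD_ROOT, bedHash, chunkName, fileName);
  // path.resolve collapses ../ and honors a leading /, so checking the
  // prefix afterwards rejects anything that escaped the sandbox.
  if (!target.startsWith(DOWNLOAD_ROOT + path.sep)) {
    throw new Error("Path escapes download sandbox: " + fileName);
  }
  return target;
}

For point 4, the per-file downloads within a chunk could then be launched together and awaited with Promise.all rather than run in serial.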

@@ -1,6 +1,7 @@
import React, { Component } from "react";
import PropTypes from "prop-types";
import Select from "react-select";
//import Select from "react-select";

Suggested change
//import Select from "react-select";

@@ -63,13 +64,16 @@ export class SelectionDropdown extends Component {
};

return (
<Select
<CreatableSelect

Has everywhere we use this been updated to cope with/disallow creating novel options?

src/server.mjs Outdated
Comment on lines 65 to 66
const SCRATCH_DATA_PATH = "tmp/";
const TEMP_DATA_PATH = config.tempDirPath;

These seem very similar; either they should be the same thing or we should justify why they can't be.

src/server.mjs Outdated
const retrieveChunks = async(bedURL, chunks) => {
// go up a directory level in the URL
const dataPath = bedURL.split('/').slice(0, -1).join('/');
for (const dirName of chunks) {

It looks like this is assuming that the BED chunk paths are always directory names (chunk1 or chunks/chunk1), and never things like full URLs (http://example.com/beds/bedfile.bed pointing to http://example.com/chunks/chunk1), or strings with upwards path traversals in them (http://example.com/beds/bedfile.bed pointing to ../chunks/chunk1).

Even for non-malicious files, where the references are interpretable as both relative URLs and relative file paths, I'm not sure we can rely on that. If we want to impose that constraint, we need to document and enforce it.
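
For concreteness, standard WHATWG URL resolution (Node's new URL()) already accepts all three forms as references against the BED file's own URL:

const bed = "http://example.com/beds/bedfile.bed";
new URL("chunk1", bed).toString();
// -> "http://example.com/beds/chunk1"
new URL("../chunks/chunk1", bed).toString();
// -> "http://example.com/chunks/chunk1"
new URL("http://example.com/chunks/chunk1", bed).toString();
// -> "http://example.com/chunks/chunk1"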

src/server.mjs Outdated
const destinationPath = path.resolve(dataPath, fileName);

const fileStream = fs.createWriteStream(destinationPath, { flags: 'wx' });
await finished(Readable.fromWeb(response.body).pipe(fileStream));

I think the 15-second fetch timeout we're using only applies to the time between making the fetch request and getting a response object. So this pipe seems to be allowed to run forever. We should probably limit the total number of bytes transferred by the pipe, and error out if the file is too big.

We also probably want to check the reported content length as soon as the response starts to come in, and error out if that is too big.
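
A sketch of both checks, under the assumption of a hypothetical MAX_FILE_BYTES limit; byteLimiter and checkDeclaredLength are invented names for illustration:

import { Transform } from "stream";

const MAX_FILE_BYTES = 100 * 1024 * 1024;

// Reject early if the server declares an oversized body. Servers can omit
// or lie about content-length, so this is only a first line of defense.
function checkDeclaredLength(response, maxBytes) {
  const declared = Number(response.headers.get("content-length"));
  if (declared && declared > maxBytes) {
    throw new Error(`Declared content-length ${declared} exceeds limit`);
  }
}

// Counts bytes flowing through the pipe and errors once the cap is passed,
// bounding the transfer even when content-length was missing or wrong.
function byteLimiter(maxBytes) {
  let seen = 0;
  return new Transform({
    transform(chunk, encoding, callback) {
      seen += chunk.length;
      if (seen > maxBytes) callback(new Error("Download exceeded byte limit"));
      else callback(null, chunk);
    },
  });
}

The existing pipe would then become Readable.fromWeb(response.body).pipe(byteLimiter(MAX_FILE_BYTES)).pipe(fileStream).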

src/server.mjs Outdated
Comment on lines 1318 to 1320
const dirName = bed_info["chunk"][i];
const chunk_path = `${dataPath}${dirName}`;
let track_json = `${chunk_path}/tracks.json`

I think this is the third place we manually paste together stuff from the BED line to get the path to a file. The third time you just implement the same thing is about when it's time to make it a function. We probably want a function we can call to get us the local path for a file, to either put it at or read it from.


I feel like getChunkPath wants to be the function that, given a BED URL and a chunk path string from that BED, determines the local path at which to look for stuff. But we can't use it here because it also does other stuff like actually downloading the files. Maybe it needs to get split up?

src/server.mjs Outdated
const chunkContent = await response.text();
const fileNames = chunkContent.split("\n");

for (const fileName of fileNames) {

We probably want some limit on the number of file names here.

src/server.mjs Outdated
let chunkContentURL = dataPath + "/" + dirName + "/chunk_contents.txt";
let response = await fetch(chunkContentURL, options);

const chunkContent = await response.text();

It looks like there's no limit on the size of the chunk contents file that we will accept. It could be 30 gigabytes and we will run out of memory to store it and crash.
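
A sketch combining the size cap and the file-count cap asked for in the last two comments, reusing checkDeclaredLength from the earlier sketch; the constants and the helper name are assumptions:

const MAX_CHUNK_CONTENTS_BYTES = 1024 * 1024;
const MAX_FILE_NAMES = 1000;

// Parse chunk_contents.txt with hard limits, so a huge or hostile listing
// errors out instead of exhausting memory. A fully robust version would
// apply the streaming byte cap sketched earlier instead of buffering.
async function parseChunkContents(response) {
  checkDeclaredLength(response, MAX_CHUNK_CONTENTS_BYTES);
  const text = await response.text();
  if (text.length > MAX_CHUNK_CONTENTS_BYTES) {
    throw new Error("chunk_contents.txt is too large");
  }
  const fileNames = text.split("\n").filter((name) => name !== "");
  if (fileNames.length > MAX_FILE_NAMES) {
    throw new Error("chunk_contents.txt lists too many files");
  }
  return fileNames;
}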

src/server.mjs Outdated
@@ -1231,6 +1393,9 @@ export function start() {
connections: undefined,
// Shut down the server
close: async () => {
// remove the temporary directory
fs.rmSync(config.tempDirPath, { recursive: true, force: true });

Cleaning the whole thing at the end is a pretty good idea.

Comment on lines +60 to +63
do
printf "$file\n" >> $OUTDIR/chunk_contents.txt
done

Do we have our own loop just to stop chunk_contents.txt from listing itself?

@adamnovak (Member) left a comment

The new fetch wrapper looks pretty great! It does require storing whole files in memory, though, which we would need to change to support bigger files.

But I'm very confused about the baseURL.txt system (it doesn't seem like it should be needed), and about the whole notion of partially downloading all the chunks in serial at the beginning, and then completely downloading just the ones we need later. We might need to introduce a new API endpoint to get the tracks for a particular BED region, so that the server doesn't have to load them all up and send them all to the client at once; that all-at-once behavior is why we currently need to visit every BED region right away on the server side.
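
A hedged sketch of the kind of endpoint this implies; the route name, request shape, and getChunkTracks helper are assumptions, not anything in this PR:

// Serve the track list for one BED region on demand, so the server does
// not have to prefetch track data for every region in the BED file.
app.post("/getChunkTracks", (req, res, next) => {
  const { bedFile, chunk } = req.body;
  getChunkTracks(bedFile, chunk)
    .then((tracks) => res.json({ tracks }))
    .catch(next); // route async errors into the Express error middleware
});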

src/server.mjs Outdated
const chunkDirPath = path.resolve(config.tempDirPath, chunkPath);
const baseURLPath = path.resolve(chunkDirPath, "baseURL.txt");
// if the chunk directory exists and the baseURL.txt is in it
if (fs.existsSync(chunkDirPath) && fs.existsSync(baseURLPath)) {

I don't think the baseURLPath file can exist without the parent directory existing.

src/server.mjs Outdated
@@ -654,9 +657,10 @@ function returnErrorMiddleware(err, req, res, next) {
app.use(returnErrorMiddleware);

// Gets the chunk path from a region specified in a bedfile

This comment makes this look like it is a pure function that determines what path on the local disk some data for a BED file ought to live at.

Actually it is responsible for downloading that data if it doesn't exist. That's surprising and we should probably include in the comment something about how this function ensures that the chunk data is available on the local disk and returns a path to it.
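
For example, the comment could read something like this (suggested wording only):

// Ensures the chunk data for a region specified in a BED file is present
// on the local disk, downloading it from the BED file's URL if it is not
// already there, and returns the local path to the chunk.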

src/server.mjs Outdated
Comment on lines 1344 to 1348
const tempTracksPath = path.resolve(config.tempDirPath, sanitize(dirName), "tracks.json");
// check the temporary directory for downloaded tracks info
if (fs.existsSync(tempTracksPath)) {
track_json = tempTracksPath;
}

What if this file happens to exist, but the BED file didn't come from a URL and was actually a local file?

}
}

return new Response(new Blob(dataRead), { headers: response.headers });

Maybe the interface of this function should change to just return a Blob?

Alternately, if we want to support downloading files that are larger than is convenient to hold in memory, but still use a size limit, we'd want to return a stream from this function, but with a size-limiting filter on the stream that will raise an error when the reader goes to read past the size limit.
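
A sketch of that stream-returning alternative, reusing byteLimiter and checkDeclaredLength from the earlier sketch; safeFetchStream is a hypothetical name:

import { Readable } from "stream";

// Returns a Node stream instead of buffering the whole body, but still
// enforces the size limit: reading past maxBytes raises an error.
async function safeFetchStream(url, options, maxBytes) {
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Fetch of ${url} failed with status ${response.status}`);
  }
  checkDeclaredLength(response, maxBytes);
  return Readable.fromWeb(response.body).pipe(byteLimiter(maxBytes));
}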

src/server.mjs Outdated
}

// downloads files of the specified chunk name from the destination of the bedURL
// when includeContent is false, only the tracks.json file is downloaded

I don't think this is a good design. When used with a false flag, this isn't really retrieving the chunk in a meaningful sense.

If we actually do want to create N directories, each with a base URL file and a track metadata file, when we first see a BED file with N entries, then we should split that out from downloading the actual chunk files. We don't need the code to make the base URL file when actually retrieving the full chunk, and we don't need the code to look at the chunk_contents.txt file when retrieving tracks.json (because we already know the name and we can just fetch it if it exists).

src/server.mjs Outdated
Comment on lines 1215 to 1216
// baseURL/dirName/"chunk_contents.txt"
let chunkContentURL = new URL(path.join(sanitize(chunk), "chunk_contents.txt"), baseURL).toString();

I don't think the sanitize is appropriate here when creating the URL to fetch stuff from. I think / and .. are meant to be allowed in the URL that the BED file points to for its chunk. This looks like it is going to strip them out.

We also want to support e.g. baseURL/dirName/subDirName/chunk_contents.txt.

src/server.mjs Outdated
Comment on lines 1235 to 1241
// write the baseURL to a file
const baseURLFile = path.resolve(chunkDir, "baseURL.txt");
fs.writeFile(baseURLFile, baseURL, function (error) {
if (error){
throw error;
}
})

I don't understand why this system is necessary.

When the user asks to get chunk data for a tube map rendering, they send along the BED URL, and we are already finding the right region in the BED file to get the chunk information so that we know where to look on the server for the downloaded chunk data. So we can always make the base URL of the chunk with a new URL() based on the BED file's own URL and whatever it lists as the path to the chunk, right? Why do we need to make it once and save it in a flat file for every chunk?


@adamnovak (Member)

Plan:

  • Make sure you can't "go" without any tracks
  • Make having tracks.json mandatory for a premade chunk
  • Blank out the tracks when the user picks a BED region, and request the tracks for that chunk from the server. (We might need to tell the client whether each BED region has a premade chunk, and only do this for the ones that do)
  • When the tracks for the chunk arrive, put them in, which will re-enable "go"

Then we will be able to only fetch tracks for the BED regions we actually need them for.

@adamnovak (Member)

We talked about this more, and now the plan is to merge this as soon as the minor problems/cross-talk issues are fixed, and then refactor so we can download large files without holding them in memory and avoid downloading all the track data up front, since neither of those seems like it will be a problem for Lancet data in particular.

@adamnovak (Member)

OK, I've refactored this a bit to convince myself that we won't end up letting people write stuff on the server outside of where they are supposed to. I also fixed error handling, which doesn't magically work for async API functions we pass to Express since we don't have Express 5 yet (it's still in beta). I think eventually we will have to move all the guts of the tube map from next()-based error handling and careful callback ordering to promises and async/await, but I am not going to do that today.

Also I think I found a use case for premade chunks with no track JSON, which is that cactus.bed seems to use one, which got deleted and which I am restoring.

@adamnovak adamnovak merged commit 586b044 into vgteam:master Aug 30, 2023
1 check passed

Successfully merging this pull request may close these issues.

Make BED system able to retrieve pre-made chunks from URLs rather than local folders
2 participants