Allow URLs to be inputted as BED files #329
Conversation
This looks pretty good, but I think the file management/sandboxing isn't quite robust enough.
- Because there's no unified safe fetch wrapper function, we only limit content length on the BED file itself, and we don't seem to do things like check for unsuccessful response codes.
- The path manipulation is done with a mixture of `+` and `path.resolve`, neither of which will properly protect against attacks like using `../` or a leading `/` to place stuff outside of the dedicated sandbox directories for the downloads. We should have a function that gets a target path based on a chunk-local filename, some identifier for the chunk, and some identifier for the BED file, and guarantees that the result is actually inside the directory we want it in, and we should consistently use that (see the sketch after this list).
- The same chunk name seems like it will collide if it is used in different BED files at the same time. We might want to use some hashing to make unique directories per BED.
- I don't think downloading all the chunks up front (and in serial!) is going to scale to larger BED files. We probably want to download chunks when needed, and download all the files in a chunk in parallel (by launching the downloads and then awaiting all of them).
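A minimal sketch of the kind of unified sandboxing helper this is asking for, assuming Node's `path` and `crypto` modules; the names `SANDBOX_ROOT` and `getSandboxedChunkFilePath` are hypothetical, not from the PR:

```js
import path from "node:path";
import crypto from "node:crypto";

// Hypothetical absolute directory that all downloads must stay inside.
const SANDBOX_ROOT = "/srv/tubemap/downloads";

// Resolve a chunk-local file name to a path guaranteed to be inside the
// sandbox: one directory per BED file (a hash of its URL, which also avoids
// cross-BED chunk-name collisions), one subdirectory per chunk.
function getSandboxedChunkFilePath(bedURL, chunkName, fileName) {
  const bedDir = crypto.createHash("sha256").update(bedURL).digest("hex");
  const target = path.resolve(SANDBOX_ROOT, bedDir, chunkName, fileName);
  // path.resolve() has already applied any "../" or leading "/", so a
  // prefix check is enough to reject anything that escaped the sandbox.
  if (!target.startsWith(SANDBOX_ROOT + path.sep)) {
    throw new Error(`Unsafe path in BED file: ${chunkName}/${fileName}`);
  }
  return target;
}
```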
@@ -1,6 +1,7 @@
import React, { Component } from "react";
import PropTypes from "prop-types";
import Select from "react-select";
//import Select from "react-select";
@@ -63,13 +64,16 @@ export class SelectionDropdown extends Component {
};

return (
<Select
<CreatableSelect
Has everywhere we use this been updated to cope with/disallow creating novel options?
src/server.mjs
Outdated
const SCRATCH_DATA_PATH = "tmp/";
const TEMP_DATA_PATH = config.tempDirPath;
These seem very similar; either they should be the same thing or we should justify why they can't be.
src/server.mjs
Outdated
const retrieveChunks = async (bedURL, chunks) => {
  // go up a directory level in the url
  const dataPath = bedURL.split('/').slice(0, -1).join('/');
  for (const dirName of chunks) {
It looks like this is assuming that the BED chunk paths are always directory names (`chunk1` or `chunks/chunk1`), and never things like full URLs (`http://example.com/beds/bedfile.bed` pointing to `http://example.com/chunks/chunk1`), or strings with upwards path traversals in them (`http://example.com/beds/bedfile.bed` pointing to `../chunks/chunk1`).
Even for non-malicious files, where the references are interpretable as both relative URLs and relative file paths, I'm not sure we can rely on that. If we want to impose that constraint, we need to document and enforce it (see the sketch below).
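If we do impose that constraint, enforcement could be a small validator along these lines (the function name is hypothetical):

```js
// Accept only plain relative paths like "chunk1" or "chunks/chunk1":
// no URL scheme, no leading "/", and no "." or ".." segments.
function isPlainRelativeChunkPath(chunkPath) {
  if (/^[a-zA-Z][a-zA-Z0-9+.-]*:/.test(chunkPath)) return false; // full URL
  if (chunkPath.startsWith("/")) return false; // absolute path
  return chunkPath
    .split("/")
    .every((segment) => segment !== "" && segment !== "." && segment !== "..");
}
```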
src/server.mjs
Outdated
const destinationPath = path.resolve(dataPath, fileName);

const fileStream = fs.createWriteStream(destinationPath, { flags: 'wx' });
await finished(Readable.fromWeb(response.body).pipe(fileStream));
I think the fetch 15-second timeout we're using only applies to the time between making the fetch request and getting a response object, so this pipe seems to be allowed to run forever. We should probably limit the total number of bytes transferred by the pipe, and error out if the file is too big.
We also probably want to check the reported content length as soon as the response starts to come in, and error out if that is too big.
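A sketch of what that early check might look like in a unified fetch wrapper (`checkedFetch` and the limit are hypothetical; this assumes Node 18+ global `fetch`). The header is optional and can be wrong, so the pipe itself still needs its own byte cap:

```js
const MAX_DOWNLOAD_BYTES = 100 * 1024 * 1024; // hypothetical 100 MiB cap

async function checkedFetch(url, options) {
  const response = await fetch(url, options);
  if (!response.ok) {
    throw new Error(`Download failed: ${response.status} ${response.statusText}`);
  }
  // Fail fast if the server declares an oversized body up front.
  const contentLength = Number(response.headers.get("content-length"));
  if (contentLength > MAX_DOWNLOAD_BYTES) {
    throw new Error(`Remote file too large: ${contentLength} bytes`);
  }
  return response;
}
```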
src/server.mjs
Outdated
const dirName = bed_info["chunk"][i];
const chunk_path = `${dataPath}${dirName}`;
let track_json = `${chunk_path}/tracks.json`
I think this is the third place we manually paste together stuff from the BED line to get the path to a file. The third time you just implement the same thing is about when it's time to make it a function. We probably want a function we can call to get us the local path for a file, to either put it at or read it from.
src/server.mjs
Outdated
const chunkContent = await response.text();
const fileNames = chunkContent.split("\n");

for (const fileName of fileNames) {
We probably want some limit on the number of file names here.
src/server.mjs
Outdated
let chunkContentURL = dataPath + "/" + dirName + "/chunk_contents.txt";
let response = await fetch(chunkContentURL, options);

const chunkContent = await response.text();
It looks like there's no limit on the size of the chunk contents file that we will accept. It could be 30 gigabytes and we will run out of memory to store it and crash.
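A sketch of capping both the size of `chunk_contents.txt` and the number of file names it lists (the limits and function name are illustrative; a streaming read with a cap would be the fully robust version, since the content-length header can be absent or wrong):

```js
const MAX_CHUNK_CONTENTS_BYTES = 1024 * 1024; // hypothetical 1 MiB cap
const MAX_CHUNK_FILES = 1000; // hypothetical cap on listed files

async function readChunkContents(chunkContentURL, options) {
  const response = await fetch(chunkContentURL, options);
  if (!response.ok) {
    throw new Error(`Fetching chunk contents failed: ${response.status}`);
  }
  // Reject based on the declared length first, then re-check the actual text.
  const declared = Number(response.headers.get("content-length"));
  if (declared > MAX_CHUNK_CONTENTS_BYTES) {
    throw new Error("chunk_contents.txt is too large");
  }
  const text = await response.text();
  if (text.length > MAX_CHUNK_CONTENTS_BYTES) {
    throw new Error("chunk_contents.txt is too large");
  }
  const fileNames = text.split("\n").filter((line) => line !== "");
  if (fileNames.length > MAX_CHUNK_FILES) {
    throw new Error("chunk_contents.txt lists too many files");
  }
  return fileNames;
}
```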
src/server.mjs
Outdated
@@ -1231,6 +1393,9 @@ export function start() {
connections: undefined,
// Shut down the server
close: async () => {
// remove the temporary directory
fs.rmSync(config.tempDirPath, { recursive: true, force: true });
Cleaning the whole thing at the end is a pretty good idea.
do
  printf "$file\n" >> $OUTDIR/chunk_contents.txt
done
Do we have our own loop just to stop `chunk_contents.txt` from listing itself?
The new fetch wrapper looks pretty great! It does require storing whole files in memory, though, which we would need to change to support bigger files.
But I'm very confused about the `baseURL.txt` system (it doesn't seem like it should be needed), and about the whole notion of partially downloading all the chunks in serial at the beginning and then completely downloading just the ones we need later. We might need to introduce a new API endpoint to get the tracks for a particular BED region, so we don't have to have the server load them all up and send them all to the client at once; that is the reason we currently need to visit every BED region right away on the server side.
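If that endpoint materializes, it might be as small as this (the route name, request shape, and `getTracksForBedRegion` helper are all hypothetical):

```js
// Hypothetical endpoint: return track metadata for one BED region on demand,
// instead of preloading every region's tracks when the BED file is first read.
app.post("/getChunkTracks", (req, res, next) => {
  const { bedURL, chunk } = req.body;
  getTracksForBedRegion(bedURL, chunk) // hypothetical helper
    .then((tracks) => res.json({ tracks }))
    .catch(next); // hand async errors to the Express error middleware
});
```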
src/server.mjs
Outdated
const chunkDirPath = path.resolve(config.tempDirPath, chunkPath);
const baseURLPath = path.resolve(chunkDirPath, "baseURL.txt");
// if the chunk directory exists and the baseURL.txt is in it
if (fs.existsSync(chunkDirPath) && fs.existsSync(baseURLPath)) {
I don't think the `baseURLPath` file can exist without the parent directory existing.
src/server.mjs
Outdated
@@ -654,9 +657,10 @@ function returnErrorMiddleware(err, req, res, next) {
app.use(returnErrorMiddleware);

// Gets the chunk path from a region specified in a bedfile
This comment makes this look like it is a pure function that determines what path on the local disk some data for a BED file ought to live at.
Actually it is responsible for downloading that data if it doesn't exist. That's surprising and we should probably include in the comment something about how this function ensures that the chunk data is available on the local disk and returns a path to it.
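For example, something along these lines (the wording is only a suggestion):

```js
/**
 * Ensure that the data for the given BED chunk is available on the local
 * disk, downloading it from the BED file's URL if it is not already there,
 * and return the local path where the chunk's files live.
 */
```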
src/server.mjs
Outdated
const tempTracksPath = path.resolve(config.tempDirPath, sanitize(dirName), "tracks.json");
// check the temporary directory for downloaded tracks info
if (fs.existsSync(tempTracksPath)) {
  track_json = tempTracksPath;
}
What if this file happens to exist, but the BED file didn't come from a URL and was actually a local file?
  }
}

return new Response(new Blob(dataRead), { headers: response.headers });
Maybe the interface of this function should change to just return a Blob?
Alternatively, if we want to support downloading files that are larger than is convenient to hold in memory, but still enforce a size limit, we'd want to return a stream from this function, with a size-limiting filter on the stream that raises an error when the reader tries to read past the limit (see the sketch below).
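A sketch of that size-limiting filter as a Node `Transform` stream (the helper name is hypothetical):

```js
import { Transform } from "node:stream";

// Pass bytes through until maxBytes is exceeded, then error the pipeline.
function byteLimiter(maxBytes) {
  let seen = 0;
  return new Transform({
    transform(chunk, encoding, callback) {
      seen += chunk.length;
      if (seen > maxBytes) {
        callback(new Error(`Download exceeded ${maxBytes} byte limit`));
      } else {
        callback(null, chunk);
      }
    },
  });
}
```

The wrapper would then connect `Readable.fromWeb(response.body)`, the limiter, and the destination with `stream.pipeline()` rather than bare `.pipe()`, so an overflow error tears the whole pipeline down.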
src/server.mjs
Outdated
}

// downloads files of the specified chunk name from the destination of the bedURL
// includeContent only downloads the tracks.json file when set to false
I don't think this is a good design. When used with a false flag, this isn't really retrieving the chunk in a meaningful sense.
If we actually do want to create N directories, each with a base URL file and a track metadata file, when we first see a BED file with N entries, then we should split that out from downloading the actual chunk files. We don't need the code to make the base URL file when actually retrieving the full chunk, and we don't need the code to look at the `chunk_contents.txt` file when retrieving `tracks.json` (because we already know the name and we can just fetch it if it exists).
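In other words, something shaped roughly like this pair of functions (both names hypothetical, bodies elided):

```js
// Called once when a BED URL is first seen: create a directory per region
// and fetch only its tracks.json metadata.
async function prepareBedRegions(bedURL, chunks) { /* ... */ }

// Called on demand for a single region: fetch every file listed in its
// chunk_contents.txt into that region's directory.
async function downloadChunkFiles(bedURL, chunk) { /* ... */ }
```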
src/server.mjs
Outdated
// baseURL/dirName/"chunk_contents.txt"
let chunkContentURL = new URL(path.join(sanitize(chunk), "chunk_contents.txt"), baseURL).toString();
I don't think the sanitize is appropriate here when creating the URL to fetch stuff from. I think `/` and `..` are meant to be allowed in the URL that the BED file points to for its chunk, and this looks like it is going to strip them out.
We also want to support e.g. `baseURL/dirName/subDirName/chunk_contents.txt`.
src/server.mjs
Outdated
// write the baseURL to a file
const baseURLFile = path.resolve(chunkDir, "baseURL.txt");
fs.writeFile(baseURLFile, baseURL, function (error) {
  if (error) {
    throw error;
  }
})
I don't understand why this system is necessary.
When the user asks to get chunk data for a tube map rendering, they send along the BED URL, and we are already finding the right region in the BED file to get the chunk information so that we know where to look on the server for the downloaded chunk data. So we can always make the base URL of the chunk with a `new URL()` based on the BED file's own URL and whatever it lists as the path to the chunk, right? Why do we need to make it once and save it in a flat file for every chunk?
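For illustration, standard relative-URL resolution against the BED file's own URL already handles sibling paths and upward traversals (the example URLs here are the ones from earlier in this review):

```js
const bedURL = "http://example.com/beds/bedfile.bed";
new URL("chunk1/chunk_contents.txt", bedURL).href;
// => "http://example.com/beds/chunk1/chunk_contents.txt"
new URL("../chunks/chunk1/chunk_contents.txt", bedURL).href;
// => "http://example.com/chunks/chunk1/chunk_contents.txt"
```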
src/server.mjs
Outdated
const dirName = bed_info["chunk"][i];
const chunk_path = `${dataPath}${dirName}`;
let track_json = `${chunk_path}/tracks.json`
I feel like `getChunkPath` wants to be the function that, given a BED URL and a chunk path string from that BED, determines the local path at which to look for stuff. But we can't use it here because it also does other stuff, like actually downloading the files. Maybe it needs to get split up?
Plan:
Then we will be able to only fetch tracks for the BED regions we actually need them for.
We talked about this more, and now the plan is to merge this as soon as the minor problems/cross-talk issues are fixed, and then refactor so we can download large files without holding them in memory and avoid downloading all the track data up front, since neither of those seems like it will be a problem for Lancet data in particular.
…, go button doesn't work without tracks
…f directory, path or URL pairs
OK, I've refactored this a bit to convince myself that we won't end up letting people write stuff on the server outside of where they are supposed to. I also fixed error handling, which doesn't magically work for async API functions we pass to Express, since we don't have Express 5 yet (it's still in beta). I think eventually we will have to move all the guts of the tube map from …
Also, I think I found a use case for premade chunks with no track JSON, which is that …
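For reference, the usual pre-Express-5 workaround is a tiny wrapper that forwards rejected promises to `next()`; this is a sketch, not necessarily what this PR does:

```js
// Wrap an async route handler so rejections reach the error middleware,
// which Express 4 does not do automatically.
const wrapAsync = (handler) => (req, res, next) => {
  Promise.resolve(handler(req, res, next)).catch(next);
};

// Usage: app.get("/route", wrapAsync(async (req, res) => { /* ... */ }));
```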
#295