Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

downsample can't process star file with particles in multiple folders. #386

Open
sholalkere opened this issue Jul 1, 2024 · 4 comments
Open
Assignees

Comments

@sholalkere
Copy link

Describe the bug
If a star file has particles in multiple folders (e.g. EMPIAR 10028) then StarfileSource is unable to load the data. This is due to appending only the particle paths' basename to the datadir. In the case of EMPIAR 10028 the parent folder name is needed as well as there are duplicate basenames.

To Reproduce
Here is the file structure:

.
+-- Particles
|    +-- MRC_0601
|        +-- 001_particles_shiny_nb50_new.mrcs
|    +-- MRC_1901
|        +-- 001_particles_shiny_nb50_new.mrcs
|    +-- shiny_2sets.star

Here is the command:
cryodrgn downsample Particles/shiny_2sets.star -D 256 -o particles.256.mrcs

Expected behavior
The command is expected to not error. Instead there is an (expected) FileNotFoundError.

Additional context
It is impossible to use --datadir to fix this as suggested in the README. The issue arises from the fact that for StarfileSource the _MRCDataFrameSource constructor is called with a datadir argument (even if none is provided through the command line) which always triggers this branch. A hacky way to fix this is to set up the file structure as shown above and uncomment the aforementioned branch.

@michal-g
Copy link
Collaborator

michal-g commented Jul 1, 2024

Hey, thanks for the heads up, we are hoping to refactor this part of the code to make it easier to support .star files with multiple folders some time soon!

In this case, without the --datadir flag, datadir will be set to Particles/, since in StarfileSource.__init__() we have:

if not datadir:
    datadir = os.path.dirname(filepath)

and filepath in this case is Particles/shiny_2sets.star. This conflicts with the parent class' _MRCDataFrameSource.__init__(), where letting datadir=None allows for use of basename() to be avoided, as mentioned above, thus allowing file names with different folder names to be used:

if datadir:
    self.df["__mrc_filepath"] = self.df["__mrc_filename"].apply(
        lambda filename: os.path.join(datadir, os.path.basename(filename))
        )
else:
    self.df["__mrc_filepath"] = self.df["__mrc_filename"]

So a somewhat less hacky fix might be to add better logic in StarfileSource.__init__() when setting the value of datadir — I will try to look into this for v3.4.0!

@michal-g michal-g self-assigned this Jul 2, 2024
@zhonge
Copy link
Collaborator

zhonge commented Jul 19, 2024

Were you able to get it to work using --datadir? IIRC, the behavior should be to append the relative path in the .star file to the provided --datadir, and if that doesn't work, then append just the filename to --datadir.

We may have changed this behavior after refactoring to ImageSource, but I recall fixing this.

Alternatively, assuming your .mrcs filename are all unique, you can symlink them all to a new directory that you can then provide to --datadir, e.g.:

mkdir new_datadir
cd new_datadir
for i in ../path/to/MRC_0601/*; do ln -s $i .; done
for i in ../path/to/MRC_1901/*; do ln -s $i .; done

@sholalkere
Copy link
Author

sholalkere commented Jul 19, 2024

I wasn't able to get it to work without modifying the code. I don't think it appends the starfile path to the datadir (see 1 and 2). Since StarfileSource is always initialized with a non-empty datadir, the constructor for _MRCDataFrameSource always appends the os.basename of the particle paths, perhaps it was intended to append the full path instead? Doing this fixes the issue.

That solution seems reasonable, however the 10028 starfile provided in the cryodrgn_empiar repo doesn't have unique filenames.

@michal-g
Copy link
Collaborator

michal-g commented Aug 15, 2024

For the upcoming v3.4.0 release, I have ended up doing a small refactor of the ImageSource class hierarchy — and especially the classes that handle .star files — in order to resolve issues like this one and also to simplify and consolidate the code used to parse and operate on .star input.

I have tried an early version of the v3.4.0 release on the same 80S case you described above. When running using v3.3.3, it produces the following error as expected, with and without --datadir=empiar_10028_80S/360:

FileNotFoundError: [Errno 2] No such file or directory: 'empiar_10028_80S/360/001_particles_shiny_nb50_new.mrcs'

However, now with v3.4.0, we can use shiny_2sets.star as input without any problems with and without --datadir. We can also still use --datadir to specify alternate input directories in the case of e.g. downsampling (--datadir=empiar_10028_80S/256), but the default behaviour when --datadir is not given remains treating the directory in which the input .star file is located as the --datadir.

In the codebase, we have consolidated the --datadir parsing logic into MRCDataFrameSource.parse_filename() which is used by __init__(), thus making this logic also accessible by other parts of the code. We only use basename in the specific circumstance where --datadir is given, and the file names are absolute but don't match --datadir:

def StarfileSource.__init__():
    ...
    self.df["__mrc_filepath"] = self.df["__mrc_filename"].apply(self.parse_filename)
...
def parse_filename(self, filename: str) -> str:
    newname = (
        os.path.abspath(filename) if os.path.isabs(filename) else str(filename)
    )
    if self.datadir is not None:
        if os.path.isabs(newname):
            if os.path.commonprefix([newname, self.datadir]) != self.datadir:
                newname = os.path.join(self.datadir, os.path.basename(newname))
        else:
            newname = os.path.join(self.datadir, newname)

    return newname

This in some part recapitulates (and now replaces) the logic in the old starfile.prefix_paths() method, which was used in some older versions of cryoDRGN but not in v3.3.3:

def prefix_paths(mrcs: List, datadir: str):
    mrcs1 = ["{}/{}".format(datadir, os.path.basename(x)) for x in mrcs]
    mrcs2 = ["{}/{}".format(datadir, x) for x in mrcs]
    try:
        for path in set(mrcs1):
            assert os.path.exists(path)
        mrcs = mrcs1
    except AssertionError:
        for path in set(mrcs2):
            assert os.path.exists(path), f"{path} not found"
        mrcs = mrcs2
    return mrcs

I would much rather explicitly write out the cases where we want to use basename, rather than trying basename and then using the full filename which may lead unexpected behaviour — but if we decide to re-employ this method at some point in the future, it should be easier to do now by incorporating it into parse_filename.

You can access this early version (v3.4.0-a6 and above) using pip install -i https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ 'cryodrgn<=3.4.0' --pre*. I have also created a write-up explaining the motivation and implementation details of this refactoring.

Please let us know if this works for you, and if you spot any other problems! I may still make some minor tweaks to this code over the next few days as I finalize testing for the v3.4.0 release, so stay tuned!

* note the use of single-quotes when specifying the cryoDRGN version, which is necessary in most bash environments!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants