Skip to content

Latest commit

 

History

History
273 lines (251 loc) · 13.4 KB

README.org

File metadata and controls

273 lines (251 loc) · 13.4 KB

TL;DR

fusta GRCh38.fa -o hg38
cat hg38/fasta/chr{X,Y}.fa > ~/sex_chrs.fa
cat hg38/get/chr17:18108706-18179802 > MYO15A.fa
rm hg38/seq/chr{3,5}.seq
fusermount -u hg38

What is FUSTA?

FUSTA is a FUSE-based virtual filesystem mirroring a (multi)FASTA file as a hierarchy of individual virtual files, simplifying efficient data extraction and bulk/automated processing of FASTA files.

fusta.png

The virtual files exposed by FUSTA behave like standard flat text files, and provide automatic compatibility with all existing programs. When handling large multiFASTA files, the intrinsic file caching capacities of the OS are leveraged to ensure the best experience to the user.

Citation

If you use FUSTA, please cite FUSTA: leveraging FUSE for manipulation of multiFASTA files at scale, https://doi.org/10.1093/bioadv/vbac091

Licensing

FUSTA is distributed under the CeCILL-C (LGPLv3 compatible) license. Please see the LICENSE file for details.

Installation

Ubuntu

sudo apt install cargo fuse3 libfuse3-dev pkg-config
cargo install --git https://github.com/delehef/fusta

You can now find fusta in $HOME/cargo/bin/; you should add this this path to your $PATH for easier use.

Fedora/Rocky Linux/Alma Linux

sudo yum install rust cargo fuse3 fuse3-devel
cargo install --git https://github.com/delehef/fusta

You can now find fusta in $HOME/cargo/bin/; you should add this this path to your $PATH for easier use.

Debian

sudo apt install curl fuse3 libfuse3-dev
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh # Debian cargo is outdated

then rebot your shell to update the PATH environment variable.

Finally, install FUSTA:

cargo install --git https://github.com/delehef/fusta

You can now find fusta in $HOME/cargo/bin/; you should add this this path to your $PATH for easier use.

Scientific Linux

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
sudo yum install fuse3 fuse3-devel

then rebot your shell to update the PATH environment variable.

Finally, install FUSTA:

cargo install --git https://github.com/delehef/fusta

You can now find fusta in $HOME/cargo/bin/; you should add this this path to your $PATH for easier use.

macOS

On macOS, you will need to install the build tools if you have not them ready yet: xcode-select --install

You must then download and install FUSE for macOS in order to be able to use FUSTA.

Finally, to build FUSTA, you need to install the Rust compiler. You can then build FUSTA by running cargo, the Rust build tool:

cargo install --git https://github.com/delehef/fusta

FreeBSD

sudo pkg install rust pkgconf fusefs-libs # Install build dependencies
sudo sysctl vfs.usermount=1  # enable FUSE mounting without requiring administrator permissions
sudo kldload fuse # load the FUSE kernel module

Finally, install FUSTA:

cargo install --git https://github.com/delehef/fusta

You can now find fusta in $HOME/cargo/bin/; you should add this this path to your $PATH for easier use.

From Sources

You should install FUSE (as well as its potential devel package), from your package manager – note that a reboot might be necessary for the kernel module to be loaded.

To build FUSTA, you need to install the Rust compiler. You can then build FUSTA by running cargo, the Rust build tool:

cargo install --git https://github.com/delehef/fusta

You can now find fusta in $HOME/cargo/bin/; you should add this this path to your $PATH for easier use.

Usage

Quick Start

These commands run fusta in the background, mount the FASTA file file.fa in an automatically created fusta folder, exposing all the sequences contained in file.fa there. The call to tree will display the virtual hierarchy, then fusermount is called to cleanly unmount the file.

fusta file.fa
tree -h fusta/
fusermount -u fusta

Description

Once started, fusta will expose the content of a FASTA file in a way that makes it usable by any piece of software using as if it were a set of independent files, detailed as follow.

For instance, here is the virtual hierarchy created by fusta after mounting a FASTA file containing A. thaliana genome

fusta
├── append
├── fasta
│   ├── 1.fa
│   ├── 2.fa
│   ├── 3.fa
│   ├── 4.fa
│   ├── 5.fa
│   ├── Mt.fa
│   └── Pt.fa
├── get
├── infos.csv
├── infos.txt
├── labels.txt
└── seqs
    ├── 1.seq
    ├── 2.seq
    ├── 3.seq
    ├── 4.seq
    ├── 5.seq
    ├── Mt.seq
    └── Pt.seq

FUSTA supports all FUSTA files using UNIX-style line endings, including but not restricted to DNA files, protein files, gapped files, mixed-case files, and independently of their inner formatting (line wrapping, line length, etc.).

infos.csv

This read-only CSV file contains a list of all the fragments present in the mounted FASTA file, with, for each of them, the standard id and additional informations field, plus a third one containing the length of the sequence.

infos.txt

This read-only text file provides the same informations, but in a more human-readable format.

labels.txt

This read-only file contains a list of all the sequence headers present in the mounted FASTA file.

fasta

This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read-only FASTA files.

seqs

This folder contains all the individual sequences present in the original FASTA file, exposed as virtually independent read/write files containing only the sequences - without the FASTA headers, but with any newline preserved. These files can be read, copied, removed, edited, etc. as normal files, and any alteration will be reflected on the original FASTA file when fusta is closed.

append

This folder should be used to add new sequences to the mounted FASTA file. Any valid fasta file copied or moved to this directory will be appended to the original FASTA files. It should be noted that the process is completely transparent and the the folder will remain empty, even though the operation is successful.

get

This folder is used for range-access to the sequences in the mounted FASTA file. Although it is empty, any read access to a (non-existing) file following the pattern SEQID:START-END will return the corresponding range (1-indexed, fully-closed) in the specified sequence. It should be noted that the access skip headers and newlines, so that the START-END coordinates map to actual loci in the corresponding sequence and not to bytes in the mounted FASTA file.

Examples

All the following examples assume that a FASTA file has been mounted (e.g. fusta -D genome.fa), and is unmounted after manipulation (e.g. fusermount -u fusta).

Get an overview of the file content

cat fusta/infos.txt

Extract individual sequences as FASTA files

cat fusta/fasta/chr{X,Y}.fa > ~/sex_chrs.fa

Extract a range of chromosome 12

cat fusta/get/chr12:12000000-12002000

Remove sequences from the original file

rm fusta/seq/chr{3,5}.seq

Add a new sequence

cp more_sequences.fa fusta/append

Upcasing a sequence

sed 's/[a-z]/\U&/g' fusta/seqs/chr21.seq | sponge fusta/seqs/chr21.seq

Edit the mitochondrial genome

nano fusta/seq/chrMT.seq

Batch-rename chromosomes

cd fusta/seq; for i in *; do mv ${i} chr${i}; done

Use independent sequences in external programs

blastn mydb.db -query fusta/fasta/seq25.fa
asgart fusta/fasta/chrX.fa fusta/asgart/chrY.fa --out result.json

Compressed FASTA files

FUSTA only works with uncompressed (multi)FASTA files. If you wish to use FUSTA on compressed (multi)FASTA files, we recommend to use FASTAFS as an intermediary to expose a compressed (multi)FASTA file to FUSTA without requiring to ully uncompress it.

Runtime options

USAGE:
    fusta [OPTIONS] <FASTA>

ARGS:
    <FASTA>    A (multi)FASTA file containing the sequences to mount

OPTIONS:
    -C, --max-cache <max-cache>      Set the maximum amount of memory to use to cache writes (MB)
                                     [default: 500]
        --cache <cache>              Use either mmap, fseek(2) or memory-backed cache to extract
                                     sequences from FASTA files. WARNING: memory caching use as much
                                     RAM as the size of the FASTA file should be available.
                                     [default: mmap] [possible values: file, mmap, memory]
    -D, --no-daemon                  Do not daemonize
    -h, --help                       Print help information
    -o, --mountpoint <mountpoint>    Specifies the directory to use as mountpoint; it will be
                                     created if it does not exist
    -S, --sep <csv-separator>        Set the separator to use in CSV files [default: ,]
    -v                               Sets the level of verbosity
    -V, --version                    Print version information
    -W, --allow-overwrite            allow FUSTA to overwrite existing sequences, when (i) appending
                                     new sequences conflicting with an existing ID, (ii) renaming
                                     sequences

--cache

The cache option is key in adapting FUSTA to your use, and for files of non-trivial size, a correct choice is the difference between a memory overflow and a smooth run:

file
in this mode, FUSTA store all the fragments as offsets in their file, and access them through fseek accesses. The performances will probably be the worse, but memory consumption will be kept to the minimal.
mmap
this mode is extremely similar to the previous one, safe that access will proceed through mmmap(2) reads, leveraging the caching facilities of the OS – this is the default mode.
memory
in this mode, all fragments will directly be copied to memory. Performances will be at their best, but enough memory should be available to store the entirety of the processed files.

Troubleshooting

I get a “Cannot allocate memory” error

The FASTA files may be overflowing the default setting of the memory overcommit guard. You may change the overcommiting setting with sysctl -w vm.overcommit_memory 1, or use --cache=file for less performances, but less virtual memory pressure.

I still get a “Cannot allocate memory” error

Your FASTA file may contain too many fragments w.r.t. the number of mmap pages that can be mapped by a program. You may increase max_map_count with sysctl -w vm.max_map_count 200000, or use --cache=file for less performances, but less virtual memory pressure.

I have another error

Open an issue stating your problem!

Contact

If you have any question or if you encounter a problem, do not hesitate to open an issue.

Acknowledgments

FUSTA is standing on the shoulders of, among others, fuser, clap, memmap2 and daemonize.

Changelog

v1.7.1

  • Fix missing newline in some cases

v1.7

  • Use 1-based, fully-closed genomic coordinates

v1.6.1

  • Accept more characters as FASTA sequences: \n - _ . + =

v1.6

  • Fix truncating
  • Improved error handling
  • Better notifications
  • Add a flag to allow overwrite of existing sequences as a side-effect
  • Only ASCII alphanumerical content can be written to sequence files
  • Refuse to open FASTA files with IDs containing characters invalid in a filename

v1.5.7

  • Update dependencies

v1.5.6

  • The default mount point is now fusta-{filename}

v1.5.4

  • Fix mountpoint not being created

v1.5.3

  • Improve notification system

v1.5.2

  • Daemonize by default

v1.5.1

  • Update memmap to memmap2

v1.5

  • Add an index on ino for better performances at the cost of a bit of memory

v1.4

  • Daemonize after parsing FASTA files, so that (i) errors appear immediately and (ii) performances are better when launching multiple instances in parallel.

v1.3

  • Can now cache all fragments in memory: increased RAM consumption, but starkly reduced random access time

v1.2.1

  • Bugfixes

v1.2

  • FUSTA is now based on fuster instead of fuse-rs
  • Various optimization let FUSTA handle >40GB FASTA files in 6GB of RAM and much better performances
  • Added an optional notification system behind the notifications feature gate

v1.1.1

  • Use MMAP by default. While it may lead to unpleasant load when performing heavy operation on very large files, this should be a rather uncommon case.

v1.1

  • FUSTA can now directly extract ranges from a sequence

v1.0

  • Initial release