V0.1.4 #5

Merged
merged 7 commits into from
Feb 24, 2024
6 changes: 4 additions & 2 deletions Cargo.toml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
[package]
name = "mgikit"
version = "0.1.3"
version = "0.1.4"
edition = "2021"
authors = ["Ziad Al Bkhetan <ziad.albkhetan@gmail.com>"]
repository = "https://github.com/sagc-bioinformatics/mgikit"
@@ -22,7 +22,9 @@ memchr = "2.6.4"
libdeflater = "1.12.0"
niffler = { version = "2.5.0", default-features = false, features = ["gz"]}
walkdir = "2.4.0"

glob = "0.3.1"
log = "0.4"
env_logger = "0.10.1"
[dev-dependencies]
md5 = "0.7.0"

Binary file removed bins/mgikit-V0.1.3.zip
Binary file not shown.
Binary file added bins/mgikit-V0.1.4.zip
Binary file not shown.
2 changes: 2 additions & 0 deletions docs/_data/sidebars/main.yml
@@ -7,6 +7,8 @@ subitems:
url: /template
- title: Reports
url: /report
- title: Reformat reads
url: /reformat
- title: Contact us
url: /contact_us

11 changes: 9 additions & 2 deletions docs/index.md
@@ -28,24 +28,31 @@ mgikit reports can be parsed by [mgikit-multiqc plugin](https://github.com/sagc-

### template

This command is used to detect the location and form of the indexes within the read barcode. It simply goes through a small number of the reads and investigates the number of matches with the indexes in the sample sheet within each possible location in the read barcode and considering the indexes as is and their reverse complementary.
This command is used to detect the location and form of the indexes within the read barcode. It simply goes through a small number of the reads and investigates the number of matches with the indexes in the sample sheet within each possible location in the read barcode and considers the indexes as is and as their reverse complement.

It reports matches for all possible combinations and uses the read template that has the maximum number of matches. This process happens for each sample individually and therefore, the best matching template for each sample will be reported.

Using this comprehensive scan, the tool can detect the templates for mixed libraries.

<hr/>

### report

This command merges demultiplexing and quality reports from multiple lanes into one comprehensive report for MultiQC visualisation.

<hr/>

### reformat

This command reformats fastq files generated by `splitBarcode` into Illumina format and generates quality reports.

<hr/>

## Installation

You can use the static binary under `bins` directly. However, if you would like to build it from the source code:

You need to have Rust and cargo installed first, check rust [documenation](https://doc.rust-lang.org/cargo/getting-started/installation.html)
You need to have `Rust` and `Cargo` installed first; check the Rust [documentation](https://doc.rust-lang.org/cargo/getting-started/installation.html).


```bash
18 changes: 9 additions & 9 deletions docs/pages/demultiplex.md
@@ -100,7 +100,7 @@ file naming.

This parameter is used to provide the run id when the parameter `-i` or `--input` is not provided. The parameter is mandatory when Illumina format is requested for read header and file naming.

+ **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample to be filled with data then written once to the disk. Smaller buffers will need less memory but makes the tool slower. Largeer buffers need more memory.
+ **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample to be filled with data and then written once to the disk. Smaller buffers will need less memory but make the tool slower. Larger buffers need more memory.

+ **`--comprehensive-scan`**: Enable comprehensive scan.

@@ -127,7 +127,7 @@ the number of allowed mismatches is high.

+ **`--info-file`**: The name of the info file that contains the run information. Only needed when using the `--input` parameter. [default: BioInfo.csv]

+ **`--report-level`**: The level of reporting. [default: 2]
+ **`--report-level`**: The level of reporting. 0: no reports will be generated. 1: data quality and demultiplexing reports. 2: all reports (reports on data quality, demultiplexing, undetermined and ambiguous barcodes). [default: 2]

+ **`--compression-level`**: The level of compression (between 0 and 12). 0 is fast but no compression, 12 is slow but high compression. [default: 1]

@@ -197,7 +197,7 @@ The run id will be the date and time of the run start ("YMDHmS" format). It will

If the input reads are passed using `-f` and `-r` parameters, mgikit will look for the file `BioInfo.csv` under the same directory as the read with barcodes (R1 for SE or R2 for PE). If found it will be used.

The user can also pass the path of a file formated in teh same way as `BioInfo.csv` file using the parameter `--info-file`. if this path is passed, `instrument` and `run` will be extratced from this file.
The user can also pass the path of a file formatted in the same way as the `BioInfo.csv` file using the parameter `--info-file`. If this path is passed, `instrument` and `run` will be extracted from this file.

`--lane`, `--instrument`, and `--run` will be prioritised over the information in the `BioInfo.csv` file if these parameters were provided.

@@ -299,7 +299,7 @@ Templates and indexes forms can be provided by the user, however, the command `t
### Reports {#demultipexing-reports-section}

The demultiplex command generates multiple reports with file names that start with the flowcell and lane being demultiplexed.
a MultiQC hitm report can be generated from these reports using [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as desciribed at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).
A MultiQC HTML report can be generated from these reports using the [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as described at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).

1. `flowcell.L0*.mgikit.info`

@@ -340,7 +340,7 @@ The first three reports must be generated for each run. It is unlikely that the

#### Generate MultiQC report from mgikit reports

In order to generate [multiqc](https://multiqc.info/) report from mgikit reports, multiqc needs to be installed.
In order to generate a [multiqc](https://multiqc.info/) report from mgikit reports, multiqc needs to be installed.

Here is an example of how to generate the report:

@@ -453,14 +453,14 @@ In the case of single-end, the R2 file of the dataset is used alone for demultip

### Memory utilisation

The default parameters of the tool are optimised to achive high performance. The majority of the memory needed is allocated for output buffering to reduce writing to disk operations.
The default parameters of the tool are optimised to achieve high performance. The majority of the memory needed is allocated for output buffering to reduce writing-to-disk operations.

The expected memory usage is influnced yb three main factors,
The expected memory usage is influenced by four main factors:

1. Number of samples in the sample sheet.
2. Writing buffer size (`--writing-buffer-size` parameter, default is `67108864`).
3. Compression buffer size (`--compression-buffer-size` parameter, default is `131072`).
4. Single end or paired end input data.
4. Single-end or paired-end input data.

The expected allocated memory is

@@ -474,7 +474,7 @@ When using the default parameters:

+ **Paired-end input**: `2 * number of samples * 64.25 MB`.

Reducing the writing buffer size will reduce the reqiured memory but also affect the performance time.
Reducing the writing buffer size will reduce the required memory but also affect the performance time.
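
The default-parameter estimate above can be sketched as a small shell helper. It is only a rough illustration: the `64.25` MB per-sample figure is the one quoted above for paired-end input (note that the default writing buffer of `67108864` bytes is 64 MiB), while the function name `estimate_memory_mb` and the treatment of single-end input as one output file per sample are assumptions of this sketch, not part of mgikit.

```bash
#!/usr/bin/env bash
# Rough estimate of buffer memory under the DEFAULT parameters only.
# 64.25 MB per sample per output file is the figure quoted in the text;
# single-end = one output file per sample is an assumption of this sketch.

# Usage: estimate_memory_mb SAMPLES [paired|single]
estimate_memory_mb() {
    local samples=$1
    local mode=${2:-paired}
    local files=2
    if [ "$mode" = "single" ]; then
        files=1
    fi
    # awk handles the fractional multiplier; shell arithmetic is integer-only.
    awk -v n="$samples" -v f="$files" 'BEGIN { printf "%.2f\n", n * f * 64.25 }'
}

# 96 samples, paired-end input, default buffers.
estimate_memory_mb 96 paired
```

If `--writing-buffer-size` or `--compression-buffer-size` is changed, the per-sample figure no longer applies and this helper should not be used as is.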


### Execution examples
6 changes: 3 additions & 3 deletions docs/pages/mgikit-multiqc.md
@@ -1,14 +1,14 @@
---
title: Instructions to generate MultiQC report from MGIKIT reports.
contributors: [Ziad Al-Bkhetan]
description: User guide for to generat html report summariusing the demultipexing results using mgikit output reports.
description: User guide for generating an HTML report summarising the demultiplexing results using mgikit output reports.
toc: true
type: guides
---

## mgikit Reports
The demultiplex command generates multiple reports with file names that start with the flowcell and lane being demultiplexed.
a MultiQC hitm report can be generated from these reports using [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as desciribed at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).
A MultiQC HTML report can be generated from these reports using the [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as described at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).

1. `flowcell.L0*.mgikit.info`

@@ -47,7 +47,7 @@ This report contains the top 50 frequent barcodes from the above report (6). Thi

The first three reports must be generated for each run. It is unlikely that the fourth and fifth reports will not be generated as usually there should be some undetermined reads in the run. It is highly likely that the sixth and seventh reports will not be generated. If they are generated, it is recommended to make sure that the input sample sheet does not have issues and that the allowed mismatches are less than the minimal Hamming distance between samples.

## Example: Generat MultiQC report from mgikit reports
## Example: Generate MultiQC report from mgikit reports

In order to generate a [multiqc](https://multiqc.info/) report from mgikit reports, multiqc needs to be installed.

88 changes: 88 additions & 0 deletions docs/pages/reformat.md
@@ -0,0 +1,88 @@
---
title: Instructions for reformat functionality
contributors: [Ziad Al-Bkhetan]
description: User guide for MGIKIT reformat functionality including parameter details and usage examples.
toc: true
type: guides
---

## Introduction

This functionality is performed with the command `reformat`. It reformats reads demultiplexed by the `splitBarcode` tool provided by MGI into Illumina format and generates the quality reports explained at the [mgikit reports page](/mgikit/demultiplex#demultipexing-reports-section).

This command should be used for each sample separately (either paired-end or single-end). If you have multiple samples, you need to process each of them individually.

## Command arguments

+ **`-f or --read1`**: the path to the forward reads fastq file for both paired-end and single-end input data.

+ **`-r or --read2`**: the path to the reverse reads fastq file.

+ **`-i or --input`**: the path to the directory that contains the input fastq files.

Either `-i`, `-f` and `-r` together (paired-end), or `-f` alone (single-end) should be provided for a run.

+ **`-o or --output`**: The path to the output directory.

The tool will create the directory if it does not exist
or overwrite the content if the directory exists and the parameter `--force` is used. The tool will exit
with an error if the directory exists and `--force` is not used. If this parameter is not provided, the tool
will create a directory (in the working directory) with a name based on the date and time
of the run as follows: `mgiKit_Y-m-dTHMS`, where `Y`, `m`, `d`, `H`, `M`, and `S` are the date and time format fields.

+ **`--reports`**: The path of the output reports directory.

By default, the tool writes the files of the run reports in the same output directory as the
demultiplexed fastq files (`-o` or `--output` parameter). This parameter is used to write the reports in
a different folder as specified with this parameter.

+ **`--lane`**: Lane number such as `L01`.

This parameter is used to provide the lane number when the parameter `-i` or `--input` is not
provided. The lane number is used for QC reports and it is mandatory when Illumina format is
requested for file naming.

+ **`--instrument`**: The id of the sequencing machine.

This parameter is used to provide the instrument id when the parameter `-i` or `--input`
is not provided. The parameter is mandatory when Illumina format is requested for read header and
file naming.

+ **`--run`**: The run id. It is taken from `BioInfo.csv` as the date and time of starting the run.

This parameter is used to provide the run id when the parameter `-i` or `--input` is not provided. The parameter is mandatory when Illumina format is requested for read header and file naming.

+ **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample to be filled with data and then written once to the disk. Smaller buffers will need less memory but make the tool slower. Larger buffers need more memory.

+ **`--compression-level`**: The level of compression (between 0 and 12). 0 is fast but no compression, 12 is slow but high compression. [default: 1]

+ **`--force`**: This flag forces the run and overwrites the existing output directory if it exists.

+ **`--flexible`**: By default, the tool will calculate the length of the first read and all its parts and use this information in the analysis for a quicker determination of the read boundaries. The `--flexible` option makes the tool determine the read boundaries based on the `new line` character (`\n`).

+ **`--info-file`**: The name of the info file that contains the run information. Only needed when using the `--input` parameter. [default: BioInfo.csv]

+ **`--disable-illumina`**: Reads will be left as is and only quality reports will be generated.

+ **`--umi-length`**: The length of UMI expected at the end of the read (r1 for single-end, or r2 for paired-end) [Default: 0].

+ **`--report-level`**: The level of reporting. 0: no reports will be generated. 1: data quality and demultiplexing reports. 2: all reports (reports on data quality, demultiplexing, undetermined and ambiguous barcodes). [default: 2]

+ **`--sample-index`**: The index of the sample in the sample sheet. It is required for file naming. [default: 1]

+ **`--barcode`**: The barcode of the specific sample to calculate the mismatches for the reports. If not provided, no mismatches will be calculated.
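
As an illustration of the default output directory name described under `-o`/`--output`, the pattern `mgiKit_Y-m-dTHMS` can be reproduced with `date`. This is only a sketch: it assumes that `Y`, `m`, `d`, `H`, `M`, and `S` correspond to the usual `strftime` fields (four-digit year, two-digit month, day, hour, minute, second), and the function name `default_output_dir` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of the default output directory name, assuming strftime-style fields.
# The exact formatting used by mgikit may differ.
default_output_dir() {
    date +"mgiKit_%Y-%m-%dT%H%M%S"
}

default_output_dir
```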

## Usage Examples

**1. Reformatting the paired-end reads of a single sample**


```bash
target/release/mgikit reformat \
-f testing_data/input/extras_test/FC01_L01_sample1_1.fq.gz \
-r testing_data/input/extras_test/FC01_L01_sample1_2.fq.gz \
--lane L01 -o output \
--sample-index 1 \
--info-file testing_data/input/extras_test/BioInfo.csv
```
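
Because `reformat` handles one sample per invocation, several samples can be processed with a simple loop. The sketch below only prints the commands (a dry run) so they can be inspected first. The sample names, the `<prefix>_<sample>_1.fq.gz` / `_2.fq.gz` naming pattern, the fixed lane `L01`, the per-sample output directories, and the helper name `build_reformat_cmd` are all assumptions made for illustration; only the flags themselves come from the argument list above.

```bash
#!/usr/bin/env bash
# Dry run: print one `mgikit reformat` command per sample.
# Assumed (hypothetical) layout: <input_dir>/<prefix>_<sample>_{1,2}.fq.gz

# Usage: build_reformat_cmd INPUT_DIR PREFIX SAMPLE SAMPLE_INDEX
build_reformat_cmd() {
    local input_dir=$1 prefix=$2 sample=$3 index=$4
    printf 'mgikit reformat -f %s/%s_%s_1.fq.gz -r %s/%s_%s_2.fq.gz --lane L01 -o output/%s --sample-index %s\n' \
        "$input_dir" "$prefix" "$sample" \
        "$input_dir" "$prefix" "$sample" \
        "$sample" "$index"
}

index=1
for sample in sample1 sample2 sample3; do
    build_reformat_cmd testing_data/input/extras_test FC01_L01 "$sample" "$index"
    index=$((index + 1))
done
```

Each sample gets its own output directory here because the tool exits with an error when the output directory already exists and `--force` is not used.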

2 changes: 1 addition & 1 deletion docs/pages/report.md
@@ -1,7 +1,7 @@
---
title: Instructions for report functionality
contributors: [Ziad Al-Bkhetan]
description: User guide for MGIKIT report functionality including parameters details and usage examples.
description: User guide for MGIKIT report functionality including parameter details and usage examples.
toc: true
type: guides
---