sagc-bioinformatics · ziadbkh · Feb 24, 2024 · Jan 30, 2024 · Feb 5, 2024 · Feb 9, 2024
diff --git a/Cargo.toml b/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "mgikit"
-version = "0.1.3"
+version = "0.1.4"
 edition = "2021"
 authors = ["Ziad Al Bkhetan <ziad.albkhetan@gmail.com>"]
 repository = "https://github.com/sagc-bioinformatics/mgikit"
@@ -22,7 +22,9 @@ memchr = "2.6.4"
 libdeflater = "1.12.0"
 niffler = { version = "2.5.0", default-features = false, features = ["gz"]}
 walkdir = "2.4.0"
-
+glob = "0.3.1"
+log = "0.4"
+env_logger = "0.10.1"
 [dev-dependencies]
 md5 = "0.7.0"
 

diff --git a/bins/mgikit-V0.1.3.zip b/bins/mgikit-V0.1.3.zip
diff --git a/bins/mgikit-V0.1.4.zip b/bins/mgikit-V0.1.4.zip
diff --git a/docs/_data/sidebars/main.yml b/docs/_data/sidebars/main.yml
@@ -7,6 +7,8 @@ subitems:
     url: /template
   - title: Reports
     url: /report
+  - title: Reformat reads
+    url: /reformat
   - title: Contact us
     url: /contact_us
 
diff --git a/docs/index.md b/docs/index.md
@@ -28,24 +28,31 @@ mgikit reports can be parsed by [mgikit-multiqc plugin](https://github.com/sagc-
 
 ### template
 
-This command is used to detect the location and form of the indexes within the read barcode. It simply goes through a small number of the reads and investigates the number of matches with the indexes in the sample sheet within each possible location in the read barcode and considering the indexes as is and their reverse complementary. 
+This command detects the location and form of the indexes within the read barcode. It goes through a small number of reads and counts the matches with the indexes in the sample sheet at each possible location in the read barcode, considering both the indexes as is and their reverse complements. 
 
 It reports matches for all possible combinations and uses the read template that has the maximum number of matches. This process happens for each sample individually and therefore, the best matching template for each sample will be reported. 
 
 Using this comprehensive scan, the tool can detect the templates for mixed libraries. 
 
+<hr/>
 
 ### report
 
 This command is to merge demultiplexing and quality reports from multiple lanes into one comprehensive report for MultQC reports visualisation.
 
 <hr/>
 
+### reformat
+
+This command reformats fastq files generated by `splitBarcode` into Illumina format and generates quality reports.
+
+<hr/>
+
 ## Installation
 
 You can use the static binary under bins directly, however, if you like to build it from the source code:
 
-You need to have Rust and cargo installed first, check rust [documenation](https://doc.rust-lang.org/cargo/getting-started/installation.html)
+You need to have `Rust` and `Cargo` installed first; see the Rust [documentation](https://doc.rust-lang.org/cargo/getting-started/installation.html).
 
 
 ```bash

diff --git a/docs/pages/demultiplex.md b/docs/pages/demultiplex.md
@@ -100,7 +100,7 @@ file naming.
 
 This parameter is used to provide the run id when the parameter `-i` or `--input` is not provided. The parameter is mandatory when Illumina format is requested for read header and file naming.
 
-+ **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample to be filled with data then written once to the disk. Smaller buffers will need less memory but makes the tool slower. Largeer buffers need more memory. 
++ **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample to be filled with data and then written once to the disk. Smaller buffers will need less memory but make the tool slower. Larger buffers need more memory. 
 
 + **`--comprehensive-scan`**: Enable comperhansive scan. 
 
@@ -127,7 +127,7 @@ the number of allowed mismatches is high.
 
 + **`--info-file`**: The name of the info file that contains the run information. Only needed when using the `--input` parameter. [default: BioInfo.csv]
 
-+ **`--report-level`**: The level of reporting. [default: 2]
++ **`--report-level`**: The level of reporting. `0`: no reports will be generated. `1`: data quality and demultiplexing reports. `2`: all reports (reports on data quality, demultiplexing, undetermined and ambiguous barcodes). [default: 2]
 
 + **`--compression-level`**: The level of compression (between 0 and 12). 0 is fast but no compression, 12 is slow but high compression. [default: 1]
 
@@ -197,7 +197,7 @@ The run id will be the date and time of the run start ("YMDHmS" format). It will
 
 If the input reads are passed using `-f` and `-r` parameters, mgikit will look for the file `BioInfo.csv` under the same directory as the read with barcodes (R1 for SE or R2 for PE). If found it will be used.
 
-The user can also pass the path of a file formated in teh same way as `BioInfo.csv` file using the parameter `--info-file`. if this path is passed, `instrument` and `run` will be extratced from this file.
+The user can also pass the path of a file formatted in the same way as the `BioInfo.csv` file using the parameter `--info-file`. If this path is passed, `instrument` and `run` will be extracted from this file.
 
 `--lane`, `--instrument`, and `--run` will be prioritised over the information in the `BioInfo.csv` file if these parameters were provided.
 
@@ -299,7 +299,7 @@ Templates and indexes forms can be provided by the user, however, the command `t
 ### Reports {#demultipexing-reports-section}
 
 The demultiplex command generates multiple reports with file names that start with the flowcell and lane being demultiplexed.
-a MultiQC hitm report can be generated from these reports using [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as desciribed at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).
+A MultiQC HTML report can be generated from these reports using the [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as described at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).
 
 1. `flowcell.L0*.mgikit.info`
 
@@ -340,7 +340,7 @@ The first three reports must be generated for each run. It is unlikely that the
 
 #### Generat MultiQC report from mgikit reports
 
-In order to generate [multiqc](https://multiqc.info/) report from mgikit reports, multiqc needs to be installed.
+In order to generate a [multiqc](https://multiqc.info/) report from mgikit reports, multiqc needs to be installed.
 
 Here is an example of how to generate the report:
 
@@ -453,14 +453,14 @@ In the case of single-end, the R2 file of the dataset is used alone for demultip
 
 ### Memory utilisation
 
-The default parameters of the tool are optimised to achive high performance. The majority of the memory needed is allocated for output buffering to reduce writing to disk operations.
+The default parameters of the tool are optimised to achieve high performance. The majority of the memory needed is allocated for output buffering to reduce writing-to-disk operations.
 
-The expected memory usage is influnced yb three main factors, 
+The expected memory usage is influenced by four main factors:
 
 1. Number of samples in the sample sheet.
 2. Writing buffer size (`--writing-buffer-size` parameter, default is `67108864`).
 3. Compression buffer size (`--compression-buffer-size` parameter, default is `131072`).
-4. Single end or paired end input data.
+4. Single-end or paired-end input data.
 
 The expected allocated memory is 
 
@@ -474,7 +474,7 @@ When using the default parameters:
 
 + **Paired-end input**: `2 * number of smaples 64.25 MB`.
 
-Reducing the writing buffer size will reduce the reqiured memory but also affect the performance time.
+Reducing the writing buffer size will reduce the required memory but also affect the performance time.
 
 
 ### Execution examples
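
Note on the memory utilisation hunk above: the documented per-sample figure of 64.25 MB with default parameters can be reproduced with a little arithmetic. The sketch below is only an illustration, not code from mgikit. It assumes that each sample holds one writing buffer plus two compression buffers per read file; that decomposition is inferred from the 64.25 MB figure and is not stated in this diff. The function names `per_sample_bytes` and `estimated_bytes` are hypothetical.

```rust
// Default values taken from the documentation in this diff.
const WRITING_BUFFER: u64 = 67_108_864; // --writing-buffer-size (64 MiB)
const COMPRESSION_BUFFER: u64 = 131_072; // --compression-buffer-size (128 KiB)

/// Estimated bytes held per sample and per read file.
/// ASSUMPTION: one writing buffer plus two compression buffers,
/// inferred from the documented 64.25 MB figure.
fn per_sample_bytes(writing: u64, compression: u64) -> u64 {
    writing + 2 * compression
}

/// Estimated total bytes: paired-end input doubles the single-end estimate,
/// as stated in the documentation.
fn estimated_bytes(samples: u64, paired_end: bool, writing: u64, compression: u64) -> u64 {
    let read_files = if paired_end { 2 } else { 1 };
    read_files * samples * per_sample_bytes(writing, compression)
}

fn main() {
    let mib = 1024.0 * 1024.0;
    let per_sample = per_sample_bytes(WRITING_BUFFER, COMPRESSION_BUFFER) as f64 / mib;
    println!("per sample and read file: {per_sample} MiB");

    // Hypothetical run: 96 samples, paired-end, default buffer sizes.
    let total = estimated_bytes(96, true, WRITING_BUFFER, COMPRESSION_BUFFER) as f64 / mib;
    println!("96 paired-end samples: {total} MiB");
}
```

Under these assumptions, halving `--writing-buffer-size` roughly halves the estimate, which matches the documented trade-off between memory and speed.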

diff --git a/docs/pages/mgikit-multiqc.md b/docs/pages/mgikit-multiqc.md
@@ -1,14 +1,14 @@
 ---
 title: Instructions to generate MultQC report from MGIKIT reports.
 contributors: [Ziad Al-Bkhetan]
-description: User guide for to generat html report summariusing the demultipexing results using mgikit output reports.
+description: User guide for generating an HTML report summarising the demultiplexing results using mgikit output reports.
 toc: true
 type: guides
 ---
 
 ## mgikit Reports
 The demultiplex command generates multiple reports with file names that start with the flowcell and lane being demultiplexed.
-a MultiQC hitm report can be generated from these reports using [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as desciribed at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).
+A MultiQC HTML report can be generated from these reports using the [mgikit-multiqc](https://github.com/sagc-bioinformatics/mgikit-multiqc) plugin as described at the plugin [repository](https://github.com/sagc-bioinformatics/mgikit-multiqc).
 
 1. `flowcell.L0*.mgikit.info`
 
@@ -47,7 +47,7 @@ This report contains the top 50 frequent barcodes from the above report (6). Thi
 
 The first three reports must be generated for each run. It is unlikely that the fourth and fifth reports will not be generated as usually there should be some undetermined reads in the run. It is highly likely that the sixth and seventh reports will not be generated. If they are generated, it is recommended to make sure that the input sample sheet does not have issues and that the allowed mismatches are less than the minimal Hamming distance between samples.
 
-## Example: Generat MultiQC report from mgikit reports
+## Example: Generate MultiQC report from mgikit reports
 
 In order to generate a [multiqc](https://multiqc.info/) report from mgikit reports, multiqc needs to be installed.
 

diff --git a/docs/pages/reformat.md b/docs/pages/reformat.md
@@ -0,0 +1,88 @@
+---
+title: Instructions for reformat functionality
+contributors: [Ziad Al-Bkhetan]
+description: User guide for MGIKIT reformat functionality including parameter details and usage examples.
+toc: true
+type: guides
+---
+
+## Introduction
+
+This functionality is performed with the command `reformat`. It reformats reads demultiplexed by the `splitBarcode` tool provided by MGI into Illumina format and generates the quality reports explained at the [mgikit reports page](/mgikit/demultiplex#demultipexing-reports-section).
+
+This command should be used for each sample separately (either paired-end or single-end). If you have multiple samples, you need to process each of them individually.
+
+## Command arguments
+
++ **`-f or --read1`**: the path to the forward reads fastq file for both paired-end and single-end input data.
+
++ **`-r or --read2`**: the path to the reverse reads fastq file.
+
++ **`-i or --input`**: the path to the directory that contains the input fastq files. 
+
+Either `-i`, or `-f` and `-r` (paired-end), or `-f` alone (single-end) should be provided for a run.
+
++ **`-o or --output`**: The path to the output directory.
+
+The tool will create the directory if it does not exist
+or overwrite the content if the directory exists and the parameter `--force` is used. The tool will exit
+with an error if the directory exists and `--force` is not used. If this parameter is not provided, the tool
+will create a directory (in the working directory) with a name based on the date and time
+of the run as follows: `mgiKit_Y-m-dTHMS`, where `Y`, `m`, `d`, `H`, `M`, and `S` are the date and time components.
+
++ **`--reports`**: The path of the output reports directory.
+
+By default, the tool writes the files of the run reports in the same output directory as the
+demultiplexed fastq files (`-o` or `--output` parameter). This parameter is used to write the reports in
+a different folder as specified with this parameter.
+
++ **`--lane`**: Lane number such as `L01`.
+
+This parameter is used to provide the lane number when the parameter `-i` or `--input` is not
+provided. The lane number is used for QC reports and it is mandatory when Illumina format is
+requested for file naming.
+
++ **`--instrument`**: The id of the sequencing machine. 
+
+This parameter is used to provide the instrument id when the parameter `-i` or `--input`
+is not provided. The parameter is mandatory when Illumina format is requested for read header and
+file naming.
+
++ **`--run`**: The run id. It is taken from `BioInfo.csv` as the date and time of starting the run.
+
+This parameter is used to provide the run id when the parameter `-i` or `--input` is not provided. The parameter is mandatory when Illumina format is requested for read header and file naming.
+
++ **`--writing-buffer-size`**: The default value is `67108864`. The size of the buffer for each sample to be filled with data and then written once to the disk. Smaller buffers will need less memory but make the tool slower. Larger buffers need more memory. 
+
++ **`--compression-level`**: The level of compression (between 0 and 12). 0 is fast but no compression, 12 is slow but high compression. [default: 1]
+
++ **`--force`**: This flag forces the run and overwrites the existing output directory if it exists.
+
++ **`--flexible`**: By default, the tool will calculate the length of the first read and all its parts and use this information in the analysis for a quicker determination of the read boundaries. The `--flexible` option makes the tool determine the read boundaries based on the newline character (`\n`). 
+
++ **`--info-file`**: The name of the info file that contains the run information. Only needed when using the `--input` parameter. [default: BioInfo.csv]
+
++ **`--disable-illumina`**: Reads will be left as is and only quality reports will be generated.
+
++ **`--umi-length`**: The length of UMI expected at the end of the read (r1 for single-end, or r2 for paired-end) [Default: 0].
+
++ **`--report-level`**: The level of reporting. `0`: no reports will be generated. `1`: data quality and demultiplexing reports. `2`: all reports (reports on data quality, demultiplexing, undetermined and ambiguous barcodes). [default: 2]
+
++ **`--sample-index`**: The index of the sample in the sample sheet. It is required for file naming. [default: 1]
+
++ **`--barcode`**: The barcode of the specific sample to calculate the mismatches for the reports. If not provided, no mismatches will be calculated.
+
+## Usage Examples
+
+**1. Reformatting a paired-end sample**
+
+
+```bash
+target/release/mgikit  reformat \
+    -f testing_data/input/extras_test/FC01_L01_sample1_1.fq.gz \
+    -r testing_data/input/extras_test/FC01_L01_sample1_2.fq.gz \
+    --lane L01 -o output \
+    --sample-index 1 \
+    --info-file testing_data/input/extras_test/BioInfo.csv
+```
+
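
Note on the `--flexible` option documented above: the difference between the default and flexible modes can be illustrated with a small sketch. This is not mgikit's implementation; it only assumes that a FASTQ record is four newline-terminated lines, and the function names `first_record_len`, `split_fixed`, and `split_flexible` are hypothetical. The fixed variant reuses the length of the first record for every record, while the flexible variant searches for newline characters each time.

```rust
/// Length in bytes of the first FASTQ record (four newline-terminated lines).
/// Returns None if the buffer holds fewer than four complete lines.
fn first_record_len(buf: &[u8]) -> Option<usize> {
    let mut newlines = 0;
    for (i, &b) in buf.iter().enumerate() {
        if b == b'\n' {
            newlines += 1;
            if newlines == 4 {
                return Some(i + 1);
            }
        }
    }
    None
}

/// Default-style splitting: assume every record is as long as the first one,
/// so record boundaries are found by simple offset arithmetic.
fn split_fixed(buf: &[u8]) -> Vec<&[u8]> {
    match first_record_len(buf) {
        Some(len) => buf.chunks_exact(len).collect(),
        None => Vec::new(),
    }
}

/// Flexible-style splitting: locate every record by scanning for newlines.
fn split_flexible(buf: &[u8]) -> Vec<&[u8]> {
    let mut records = Vec::new();
    let mut start = 0;
    while let Some(len) = first_record_len(&buf[start..]) {
        records.push(&buf[start..start + len]);
        start += len;
    }
    records
}

fn main() {
    // Records of equal length: both strategies agree.
    let uniform = b"@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n";
    println!("uniform input agrees: {}", split_fixed(uniform) == split_flexible(uniform));

    // Records of different lengths: only the newline scan finds the true boundaries.
    let ragged = b"@r1\nACGT\n+\nIIII\n@r22\nTTGACC\n+\nIIIIII\n";
    println!("ragged input agrees: {}", split_fixed(ragged) == split_flexible(ragged));
}
```

The sketch suggests why the fixed-length path is faster (no per-byte search after the first record) and why a flexible mode is needed when headers or reads vary in length.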
diff --git a/docs/pages/report.md b/docs/pages/report.md
@@ -1,7 +1,7 @@
 ---
 title: Instructions for report functionality
 contributors: [Ziad Al-Bkhetan]
-description: User guide for MGIKIT report functionality including parameters details and usage examples.
+description: User guide for MGIKIT report functionality including parameter details and usage examples.
 toc: true
 type: guides
 ---