Demo / sample code update #1666

Closed · wants to merge 9 commits

Conversation

vasudeva8 (Contributor)

Samples updated to use fasta/fastq indexing, tabix indexing, and thread-pool queueing of tasks.

Comment on lines 94 to 99
//close and index it
sam_close(outfile); outfile = NULL;
if (fai_build3(outname, NULL, NULL) == -1) {
    printf("Indexing failed with %d\n", errno);
    goto end;
}
Contributor

I'm not convinced this makes a good demonstration.

Fasta indices are good for references as they allow us to extract small portions of a large reference. However they are wholly inappropriate for NGS data, which is typically what we have in a SAM/BAM file.

Basically fasta/fastq have a split personality and are used for two different types of data. Messy.

We may be better off having a tiny demo of a separate "samtools faidx"-style index builder instead.
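For illustration, a minimal standalone builder along those lines might look like this (a sketch only, not code from this PR; it uses htslib's fai_build(), which writes the .fai next to the input):

    #include <stdio.h>
    #include <htslib/faidx.h>

    int main(int argc, char **argv) {
        if (argc != 2) {
            fprintf(stderr, "Usage: %s ref.fa\n", argv[0]);
            return 1;
        }
        // Builds ref.fa.fai (and ref.fa.gzi if the input is bgzip-compressed)
        if (fai_build(argv[1]) < 0) {
            fprintf(stderr, "Failed to index %s\n", argv[1]);
            return 1;
        }
        return 0;
    }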

Contributor Author (vasudeva8)

This writes reference data in fasta/fastq format and showcases creation of an index for later use while reading.
It does not use NGS data, and uses sam_open/write/close and bam1_t for this.

Contributor

Re-reading DEMO.md, I see it explains that the indexing is suitable for reference data.

As this is for education purposes, maybe put a comment just before the fai_build3 call: // NB: this is only suitable for fasta/fastq files of reference sequences, and not sequencing reads

Contributor Author (vasudeva8)

Separated into a different sample and documented as indexing of reference data.

Resolved (outdated) review threads: samples/read_with_tabix.c (×3), samples/DEMO.md, samples/qtask_thread.c (×5)
@jkbonfield (Contributor)

I ran a spell checker over the DEMO and README files; the misspellings are quoted below. Some were there before, hence not easy to add line comments here, but they're trivial to search for and fix.

Lines in README.md:
"length equal or greather to given input"
"This application shocases the use of mulitple queues,"

DEMO.md:
"Read_multireg - This application showcases the usage of mulitple regionn "
"Qtask_thread - This application shocases the use of mulitple queues, scheduling of different tasks and getting results in orderered and unordered fashion."
"Pileup shows the transposed view of the SAM alignment data, i.e. it shows the the reference positions"

@vasudeva8 (Contributor Author)

Spell check and other corrections made.

@jkbonfield (Contributor) left a comment

Comments on qtask_ordered.c only.

I'm tempted to say we should drop this for this release if we want the rest of the changes merged, as it needs major work to fix race conditions.

Resolved (outdated) review threads: samples/qtask_ordered.c (×2)
@jkbonfield (Contributor)

I have a rebased version here that we may want to use before doing further changes. It fixes that bizarre GitHub-induced merge commit.

I haven't squashed anything yet, so all commits are the same content, but we can do that when we're ready to merge.

@vasudeva8 force-pushed the demo3 branch 3 times, most recently from 3422cdf to 20aaa40 on May 1, 2024 at 13:18
@vasudeva8 (Contributor Author)

Thanks James, I have used this rebased version and pushed it.
The thread pool is now also shared with the input and output files, to make it a better sample as mentioned; see the sketch below.
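For reference, the sharing boils down to something like this sketch using htslib's hts_set_thread_pool(); the function and variable names here are illustrative, not the sample's actual code:

    #include <htslib/hts.h>
    #include <htslib/thread_pool.h>

    // Attach one shared pool to both files. The caller closes the files
    // first and then destroys the pool with hts_tpool_destroy(tpool->pool).
    static int share_pool(htsFile *infile, htsFile *outfile, int nthreads,
                          htsThreadPool *tpool) {
        tpool->pool = hts_tpool_init(nthreads);   // one pool for everything
        if (!tpool->pool)
            return -1;
        if (hts_set_thread_pool(infile, tpool) < 0 ||
            hts_set_thread_pool(outfile, tpool) < 0)
            return -1;
        return 0;
    }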

@jkbonfield (Contributor)

I'm still finding qtask_ordered to be considerably slower than it needs to be.

I reimplemented the qtask_ordered problem using my own logic, mostly copied from the way htslib multi-threads SAM: generating blocks of BAM records to operate on, with separate reading and writing threads to keep things running.

My code for this is here: https://github.com/jkbonfield/htslib/blob/PR_1666d/tv2.c

$ for i in `seq 1 5`;do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ./qtask_ordered ~/scratch/data/novaseq.10m.bam 16 /tmp;done
35.51 user	7.23 system	10.69 elapsed	399% %CPU
37.74 user	7.68 system	10.43 elapsed	435% %CPU
36.11 user	7.90 system	11.44 elapsed	384% %CPU
38.03 user	7.74 system	10.60 elapsed	431% %CPU
38.71 user	8.28 system	12.48 elapsed	376% %CPU

So the original is typically 10-11s and maybe around 400% CPU utilisation.

$ for i in `seq 1 5`;do taskset -c 0-15 /usr/bin/time -f '%U user\t%S system\t%e elapsed\t%P %%CPU' ../tv2 -@16 ~/scratch/data/novaseq.10m.bam /tmp/_.sam > /dev/null;done
18.33 user	3.06 system	2.33 elapsed	916% %CPU
19.42 user	3.43 system	2.67 elapsed	855% %CPU
19.27 user	3.56 system	2.64 elapsed	862% %CPU
18.40 user	3.05 system	2.40 elapsed	890% %CPU
19.43 user	3.54 system	2.70 elapsed	848% %CPU

This is 850-900% CPU utilisation and ~2.5s. So around 4x quicker. It's clear there is also a considerable difference in CPU usage, which I can't explain yet, but possibly it's the optimisations in the core counting algorithm. (I should probably remove those as it's not good at demonstrating the benefits of threading when the task we are performing is so quick.)

The outputs out.sam and _.sam are identical, although I had to change my code a bit because we counted differently. I think the GC content in this PR is subtly wrong in that it is the number of G and C bases over the sequence length, not over the number of A, C, G and T bases. It only differs when Ns are present (or ambiguity codes, but they didn't occur). We don't know what an N base is, so it shouldn't be counted in the maths.
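In code terms the distinction is roughly this (a sketch assuming htslib's bam1_t accessors, not the PR's actual counting loop):

    #include <stdint.h>
    #include <htslib/sam.h>

    // GC fraction over called bases only: Ns and ambiguity codes are
    // excluded from both numerator and denominator.
    static double gc_fraction(const bam1_t *b) {
        const uint8_t *seq = bam_get_seq(b);
        int64_t gc = 0, acgt = 0;
        int i;
        for (i = 0; i < b->core.l_qseq; i++) {
            switch (seq_nt16_str[bam_seqi(seq, i)]) {
                case 'G': case 'C': gc++;  acgt++; break;
                case 'A': case 'T':        acgt++; break;
                default: break;            // N / ambiguity code: ignored
            }
        }
        return acgt ? (double)gc / acgt : 0.0;
    }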

Anyway I digress. If we want a demonstration of threading then it needs to be performant.

I also have a different version of this (tv1) which is purely counting the frequency of each base type and reporting summary statistics. That doesn't write anything back. However it means the order in which we do our sums is irrelevant as A + B + C is the same as A + C + B. Hence this becomes a trivial change to unordered threads instead. With hindsight I think that's a simpler example, but the one here may still be relevant too and has practical purpose.
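A sketch of that commutativity point, with hypothetical names (per-job tallies can be folded into the total in whatever order results arrive):

    #include <stdint.h>

    typedef struct { uint64_t a, c, g, t, other; } base_tally;

    // Merging order doesn't matter: addition is commutative, so an
    // unordered result queue gives the same totals as an ordered one.
    static void tally_merge(base_tally *total, const base_tally *job) {
        total->a += job->a;  total->c += job->c;
        total->g += job->g;  total->t += job->t;
        total->other += job->other;
    }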

@jkbonfield (Contributor)

I tried changing qtask_ordered to use "wb" to write to "out.bam" instead. It obviously then uses up more CPU as compression of the output becomes significant, but it's still behind tv2's BAM output.

#qtask_ordered
100.16 user	4.93 system	10.18 elapsed	1031% %CPU
100.79 user	5.13 system	10.42 elapsed	1016% %CPU
100.58 user	4.86 system	10.18 elapsed	1035% %CPU
100.91 user	5.17 system	10.43 elapsed	1016% %CPU
100.39 user	5.15 system	10.44 elapsed	1010% %CPU

#tv2
84.84 user	2.06 system	6.31 elapsed	1375% %CPU
84.44 user	1.72 system	5.93 elapsed	1452% %CPU
84.55 user	1.98 system	5.93 elapsed	1458% %CPU
84.77 user	1.73 system	5.94 elapsed	1454% %CPU
84.38 user	1.81 system	6.02 elapsed	1431% %CPU

It's interesting we still see a significant "system" time difference, which could be something to do with malloc/free and zeroing pages, or it could be something else such as pthread interfaces. I haven't profiled it to find out.

On a larger file this seems to be more pronounced too:

#qtask_ordered
1226.54 user	290.74 system	316.69 elapsed	479% %CPU

#tv2
710.26 user	17.12 system	49.46 elapsed	1470% %CPU
715.45 user	15.22 system	50.37 elapsed	1450% %CPU

System time here is vastly bigger, which is a big hint. Note it's possible to profile the main thread, and the main thread only, to see where it's spending the CPU time. That in turn may explain why it can't parallelise as well as it ought to. E.g.:

./qtask_ordered /local/scratch01/jkb/SRR6882909.name-.bam 16 /tmp & perf record -t $!
[ Wait and then control-C ]

perf report then shows:
  50.93%  qtask_ordered  [kernel]            [k] 0xffffffff99800190
  15.16%  qtask_ordered  libc-2.27.so        [.] _int_malloc
   9.10%  qtask_ordered  libc-2.27.so        [.] calloc
   6.76%  qtask_ordered  libc-2.27.so        [.] __libc_malloc
   4.54%  qtask_ordered  qtask_ordered       [.] bam_read1
   2.64%  qtask_ordered  libc-2.27.so        [.] __memmove_sse2_unaligned_erms
   1.75%  qtask_ordered  qtask_ordered       [.] bgzf_read
   1.67%  qtask_ordered  libc-2.27.so        [.] __lll_lock_wait_private
   1.11%  qtask_ordered  qtask_ordered       [.] sam_realloc_bam_data
   1.07%  qtask_ordered  qtask_ordered       [.] sam_read1
   0.82%  qtask_ordered  libc-2.27.so        [.] memcpy@GLIBC_2.2.5
   0.74%  qtask_ordered  qtask_ordered       [.] main
   0.63%  qtask_ordered  qtask_ordered       [.] bam_tag2cigar

It's somewhere in the kernel, but that's probably memory page clearing given the other high CPU functions. Looking at it, it seems to be spending most of its time allocating bam objects. This is one way my implementation majorly differs. I don't use bam_init1 and I don't free the bams between successive calls. I have a queue of bam arrays that I push onto and pull from, so I can reuse my bam-array from a previous job. Don't give the memory back to the system until program exit and it'll be MUCH faster!
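The reuse idea might be sketched like this (hypothetical names, not the actual tv2 code): a free-list of bam1_t batches whose internal buffers survive between jobs, so sam_read1() only occasionally needs to reallocate rather than allocating every record afresh:

    #include <stdlib.h>
    #include <pthread.h>
    #include <htslib/sam.h>

    #define BATCH_SIZE 4096

    typedef struct bam_batch {
        bam1_t bams[BATCH_SIZE];   // zeroed bam1_t are valid empty records
        int n;                     // number of records currently in use
        struct bam_batch *next;    // free-list link
    } bam_batch;

    typedef struct {
        bam_batch *free_list;
        pthread_mutex_t lock;
    } batch_pool;

    static bam_batch *batch_get(batch_pool *p) {
        pthread_mutex_lock(&p->lock);
        bam_batch *b = p->free_list;
        if (b) p->free_list = b->next;
        pthread_mutex_unlock(&p->lock);
        if (!b) b = calloc(1, sizeof(*b)); // allocate only when the pool is empty
        if (b) b->n = 0;                   // keep each record's data buffer for reuse
        return b;
    }

    static void batch_put(batch_pool *p, bam_batch *b) {
        // Don't free the per-record data buffers; the next job reuses them
        pthread_mutex_lock(&p->lock);
        b->next = p->free_list;
        p->free_list = b;
        pthread_mutex_unlock(&p->lock);
    }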

@vasudeva8 (Contributor Author)

Thanks for the sample James.

The aim of the sample was just to demonstrate the usage of queues and the hts thread pool, specifically the two different ways of using it, kept as basic as possible for that purpose. It was not intended to show how best to use threading in general to get the best performance.

Based on earlier reviews, I have incorporated the thread-pool sharing with the input and output files, as it gives some direction on hts thread pool usage. Taking it to the next level with separate read/write threads may make the sample complex and shift its objective.

But if we need one such sample, maybe we should add it as a separate one? Or just refer the user to the samtools functionality?

vasudeva8 and others added 5 commits July 2, 2024 16:07
update with correction

fixed issues
Use hts_tpool_next_result_wait() in threadfn_orderedwrite() so it
blocks properly when it has nothing to do.  Remove lock and
conditional that are no longer needed.  Use a sentinel job to
tell threadfn_orderedwrite() that there is no more data and it can
terminate.

Improve error handling.  Add a data::result field so that
threadfn_orderedwrite() can report back to the main thread if
it encountered a problem.  Ensure everything is cleaned up
correctly if either the main thread or threadfn_orderedwrite()
fail, and in the right order.  Set the return value to EXIT_FAILURE
if either sam_close() or threadfn_orderedwrite() report failure.

Ensure error messages are written to stderr.
@vasudeva8 (Contributor Author)

Thanks James and Rob.
I have merged and updated the samples. Now they process alignments in chunks and reuse memory. I also moved the lookup/decoding of bases from inside the loop to outside it.
The unordered processing now just dumps the count of bases and the GC ratio.

@jkbonfield (Contributor)

I've not checked the code yet, but I can verify the speed is now similar to my own implementations.

I think there are some minor niceties with this directory to fix too.

  • The default build is -O0 when it perhaps should be -O2. People can always override it with "CFLAGS=-g" on the make command line if they wish.
  • There are no dependencies, so if a source file is modified it doesn't rebuild. It's just a matter of adding the source name to the build rules, e.g. `flags: flags.c`.

samples/DEMO.md (outdated)
Comment on lines 989 to 996
if (!(data = faidx_fetch_seq64(idx, faidx_iseq(idx, tid), beg, end, &len))) {
...
printf("Data: %"PRIhts_pos" %s\n", len, data);
free((void*)data);
data = NULL;
//get quality data for fastq
if (fmt == FAI_FASTQ) {
if (!(data = faidx_fetch_qual64(idx, faidx_iseq(idx, tid), beg, end, &len))) {
Contributor

I appreciate this is just code from the demo programs that's been cut down with "..." to get to the essence of things, but it's losing the if-else structure, which is highly confusing. We could add else clauses, but I think the correct solution is really to negate the if statements.

Ie:

if ((data = faidx_fetch_seq64(idx, faidx_iseq(idx, tid), beg, end, &len)) != NULL) {
    printf("Data: %"PRIhts_pos" %s\n", len, data);
    ...

We don't need the explicit error checking here as DEMO.md isn't designed as a guide for how to write bulletproof code; instead it's trying to describe the most important features. However, keeping an if statement with an explicit != NULL is a hint that the call can fail and it's up to the programmer to handle that (as demonstrated in the actual sample code).

It's probably easier for me to just go through this and make a new commit on to this PR than to document each case I find here.

The "..." to indicate clipped out code is useful as it removes a lot
of the unnecessary boiler plate, especially error checking, and allows
us to drill down to the essense of what we're trying to teach.
However without indentation it's not clear what bits are being
removed.

Also there are times when the code removed is an error handler plus
a subsequent else clause, which gives the false impression with the
indentation that the if-clause has been negated.  Eg:

    ...
    //select required field alone, this is useful for CRAM alone
    if (hts_set_opt(infile, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG) < 0) {
    ...
       //read header
       in_samhdr = sam_hdr_read(infile);
    ...

Clearly the `sam_hdr_read` is not within the if-clause of
`hts_set_opt` returning < 0, but it's how it reads because "..."
hid too much.  (In the above case just indenting the ... and
un-indenting the hdr read call reverses the mental image.)

My approach here is to replace all `...` error handlers with `... // error`
and to indent them accordingly.
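Applied to the fragment above, that convention would read something like
(illustrative only):

    //select required field alone, this is useful for CRAM alone
    if (hts_set_opt(infile, CRAM_OPT_REQUIRED_FIELDS, SAM_FLAG) < 0)
        ... // error

    //read header
    in_samhdr = sam_hdr_read(infile);
    ...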

Also where we only have a one-line thing, removing the open curly brace
if we've cut the closing curly brace gives a more sensible view.

None of this really changes the code we're showing; it just presents
it in a more obvious manner to those who don't understand what is and
isn't an error case.
@jkbonfield (Contributor)

Squashed, rebased and merged as f8016c0.

I didn't force-push back over this as you may wish to keep the separate commit histories (there was a lot of back and forth), so I'll leave you to decide whether to delete the branch.

@jkbonfield jkbonfield closed this Jul 18, 2024