######## Bio::ToolBox revision history #############
v2.01
- Update chromosome sorting to properly handle chromosomal arms, for
example with Drosophila
- Change '.groups.txt' group file name to '.col_groups.txt' when writing
column metadata file for scripts get_binned_data.pl and get_relative_data.pl
- Change back to '_summary.txt' file name when writing a summary file
- Change "--blacklist" option to "--exclude" in bam2wig.pl
- Improve error handling scenarios in data2wig.pl, including invalid indexes
- Fix bugs in manipulate_datasets.pl, including missing lines in the view function
and restricting the addname function to only update a proper "Name" column
- Update any remaining POD text references about 0-base indexing to 1-base
v2.0
- Version number change, no code changes
v1.70
- MAJOR UPDATE: Change all internal and user-oriented column indexing
to 1-base instead of 0-base indexing, i.e. column numbers are now
listed beginning with 1 instead of 0. WARNING!!! THIS WILL BREAK ALL
PRE-EXISTING SCRIPTS AND CODE THAT USES HARD-CODED COLUMN INDEXES!!!
- MAJOR UPDATE: Use a single unified Bio::ToolBox::Parser module with
subclasses for bed, gff, gtf, and ucsc table formats. NOTE: This changed
name capitalization of Bio::ToolBox::Parser subclasses from parser
- Improve parsing of gtf files, especially with duplicate tags
- Replaced old table sorting algorithm to use numeric, mixed digit-string,
and/or string sorting
- Improved accuracy of detecting standard columns such as name, ID, start, etc
- Add support for column median and trimmed-mean methods when generating a
summary file
- Fix bug with filtering features by transcript_support_level and gencode
- Remove Bio::Seq::IO requirement for writing fasta files in data2fasta.pl
- Changed behavior to always use 1-base coordinate when generating coordinate
strings, which is standard behavior e.g. with HTSlib (samtools and tabix) queries
- Improve support for coordinate lookup in merge_datasets.pl, including
handling either 1-base or 0-base coordinate strings.
- Always report both transcript and gene name IDs and names in text output
from get_gene_regions.pl
- Add bedpe format support
- Fix bugs with parsing file headers and assigning standard column metadata
- Fix bug with naming empirically derived introns
- Add option to skip chromosomes in get_gene_regions.pl
- Add option to adjust relative coordinates based on narrowPeak peak
in get_features.pl
- Fix edge-case bugs with low-level bam parsing
- Speed up certain stats functions and improve detection of numbers
in manipulate_datasets.pl
- Remove defunct supplementary tables in ucsc_table2gff3.pl
- Improve data format verification, and only run it when reading and writing
- Improve error reporting in scripts
- Use a proper prompting module for user-input
- Fix massive numbers of perlcritic and perltidy issues
- Hundreds of other bug fixes
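
Migration note for the 1-base column indexing change in v1.70: the
following is a minimal sketch in plain Perl, not code from Bio::ToolBox,
showing how hard-coded 0-base column indexes in an older script map onto
the new numbering. The helper name to_one_base and the sample indexes are
hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: shift legacy 0-base column indexes to 1-base numbering.
sub to_one_base {
    my @legacy = @_;
    return map { $_ + 1 } @legacy;
}

# A script that previously referred to columns 0, 3, and 5 ...
my @old_columns = ( 0, 3, 5 );

# ... now needs to refer to columns 1, 4, and 6.
my @new_columns = to_one_base(@old_columns);
print join( ',', @new_columns ), "\n";
```

Any index stored in a script, configuration file, or command line wrapper
needs the same shift.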
v1.691
- Fix critical error in script get_relative_data.pl
- Fix prerequisite version numbers leading to build failures
- Change a private function to a public function
v1.69
- Revise genomic sorting by introducing a sane, logical chromosome
ordering that smartly handles numerical, Roman, contigs, and
alternate names. Sorting is done by both start and end coordinates.
Sorting speed modestly improved.
- Improve handling of coordinates of Data Feature objects, including
caching and setting.
- Add support for narrowPeak summit coordinate as reference point
in multiple scripts.
- Improve handling of databases, including bigWigSet feature types.
Make simplification of dataset names a little less aggressive.
- Include options for excluding chromosomes and/or intervals when
generating a new list of genomic bins
- Improve tasting of file formats, keeping the file format of parsed files
in the Data object.
- Allow non-stranded values when parsing UCSC files, including bed files.
- Optimize scoring subroutines
- Remove legacy subroutines from utility module
- Include new test file for utility functions
- Numerous other small changes and fixes
v1.68
- Script bam2wig.pl can now record both ends of paired-end
fragments, rather than faking it as single-end. Paired-end start
now respects orientation. Added new option to only record either
first or second read in a pair. Added new option to ignore
zero intervals when writing bedGraph format. Changed multi-hit
scoring to preferentially use NH instead of IH.
- Scripts get_binned_data.pl and get_relative_data.pl now
can write out column names and associated datasets in separate
groups file for use in plotting. Also specify score decimal format.
- Script get_features.pl has new option to only keep features with
explicit tag value.
- High level ToolBox convenience function parse_file() now includes
basic default subfeatures exon, cds, and utr.
- Efficiency improvements in loading large text files by going
back to chomp. Should still fail appropriately with wrong
line endings.
- Feature objects now allow certain attribute methods to be both
get and set, including seq_id, start, end, strand, name, and
type, so long as the table does not contain parsed or database
SeqFeature objects.
- Add Data object function to return any single row Feature without
having to use an iterator.
- Add high level function for iterating over Bam alignments.
- Add support for intron subfeatures in Feature objects and data
collection scripts.
- Allow bigWigToBedGraph to be explicitly used
- Better handling of verified dataset names
- Bug fixes and improvements in identifying database file
formats and loading adapters.
- Bug fix in writing bgzip files.
v1.67
- Add new option of smart coverage to script bam2wig that
smartly handles paired-end alignments with gaps (introns)
- Add capability to collect from multiple datasets at once
for scripts get_binned_data and get_relative_data. Summary
files can now handle multiple datasets.
- Allow specific number of up and down windows in
script get_relative_data.
- Add option to provide list of specific feature IDs to
script get_features.
- Write shift correlation region data from bam2wig.
- Improve GTF export.
- Add utility function to simplify dataset names, used in
data collection scripts. Strips path and everything after
first period from dataset file names.
- Improve sort function in manipulate_datasets by taking a
range of columns and sort by mean. Also addname function will
overwrite a feature name if present.
- Adjust logic for setting a file extension when none is
provided.
- Lots of additional minor fixes and changes
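
Illustration of the dataset name simplification described in v1.67: a
minimal sketch in plain Perl of stripping the path and everything after
the first period. This is not the library's utility function; the name
simplify_name and the sample path are hypothetical.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical stand-in for the simplification described above:
# drop the directory path, then everything from the first period onward.
sub simplify_name {
    my $path = shift;
    my $name = basename($path);
    $name =~ s/\..*\z//s;
    return $name;
}

print simplify_name('/data/chip/sample1.rep2.bw'), "\n";
```

With this rule, the sample path is reduced to "sample1", so any part of a
file name after the first period does not appear in the column name.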
v1.66
- Optimize data2wig fast mode, about 3 times faster
- Summary files now use a cleaned-up column name. Fix
bugs with summary file generation.
- Bam2wig now properly reports alignment counts for each
strand when provided with multiple input bam files
(previously reported the same number).
- Fix bug where the Big adapter would crash when search
coordinate was out of bound, unlike UCSC, HTS, and Sam.
- Improve GTF export with correct formatting and no longer
export transcript lines.
- Improve GTF parsing where both transcripts and genes are
inferred but coordinates were not updated correctly.
v1.65
- Add function to read directly from bigWig files, and add
support for bigWig files to script manipulate_wig
- Added options for filtering transcript Gencode or biotype
in script get_gene_regions.
- Added option to discard low count features from script
get_datasets.
- Add option to explicitly set number of columns of output
bed file in script data2bed
- Update script get_feature_info to work with annotation files
- Optimize data2wig to handle fast option in more scenarios
- Coordinate string generation in manipulate_datasets takes
start values as is
- Bug fixes in Bio::ToolBox, get_relative_data,
manipulate_datasets, more
v1.64
- Added support for Encode gappedPeak files. Also support for
gleaning file formats from bed track lines. This should make
future file formats easier to support.
- Fix critical bug with skipping duplicate features from GTF
files, particularly from Ensembl where exons share the same exon ID.
- Fix double-counting of stranded alignments in bam2wig script.
Also correctly set minimum paired-end size.
- Fix bug to correctly count FPKM and TPM over length-adjusted
features in script get_datasets.
- Fix bug with filtering transcripts in script get_features.
- Reset and clarify behavior regarding stop codons when parsing
and exporting transcript features for various annotation formats.
- Add single-letter option support to script get_gene_regions.
v1.63
- Added minimal Cram file support through the HTS adapter.
Currently only supports the reference fasta listed in the Cram
file header.
- Added fast paired-end option and paired-end start point options
to script bam2wig. Temporary files now written to a temporary
subdirectory, which can be specified. Extreme depth can now be
handled properly by using 32 bit integers instead of 16. Splice
segments can now be fractionally counted.
- Brought back and updated old script correlate_position_data to
identify positional shifts in nucleosome or ChIP signal peaks.
- Added new SeqFeature methods to duplicate objects and delete
subfeatures.
- Added option to format result numbers in script get_datasets.
- Fix numerous small bugs in scripts data2gff, data2fasta,
get_intersecting_features, get_relative_data, and more
v1.62
- Added Bed parser with support for bed3-12, bedgraph, narrowPeak,
and broadPeak files. Data collection files will now parse bed
files and write a table with ID and name only, instead of
appending data columns to the original file structure. Parsing
can be turned off if you prefer the old way.
- Added support for writing bed12 transcript models to GeneTools
library and get_features script.
- Bam file alignment counting now automatically excludes all
secondary, duplicate, and supplementary marked alignments.
- Add new method to manipulate_datasets to name features, useful
for naming bed3 files.
- Added TPM option to get_datasets script.
- Fix bugs with parsing gff and gtf files at same time
- Fix bugs with detecting null and/or empty values, especially when
converting data formats
- other miscellaneous bug fixes
v1.61
- Added genomic sort and bgzip file compression support when
writing files for tabix compatibility with several scripts,
including those that write gene tables.
- Tables generated from parsed gene annotation files (GFF, etc)
no longer write a Type column.
- Simplified dataset column names in script get_datasets.
- Fix transcript filtering bugs in script get_features.
- Add helper methods for setting bam and big adapters to db_helper.
- Optimized run time by loading db_helper only on demand.
- Fix numerous POD bugs.
v1.60
- Major update to using Bio::DB::Big module for bigWig and
bigBed file support. This should be much easier to install and
support than the old UCSC library adapter modules from GMOD.
The old UCSC adapter is still supported, however. Also
included a wrapper for working with BigWigSet databases, which
are too useful to deprecate.
- Use File::Which to always locate helper applications.
- Add support for pigz when writing gzip compressed files.
- Add support for fetching genomic sequence from subfeatures
- Add single letter command line options to all scripts. This
was vaguely inherently supported before if the option was unique,
but now single letters (case sensitive) for common options are
explicit, and bundling is available.
- Add simple menu descriptions and option grouping to the Synopsis
section of every script POD documentation (about time!).
- Add new script manipulate_wig.pl.
- Add chromosome-specific normalization to bam2wig.
v1.55
- Fix bugs in bam2wig script when using negative shift values;
thanks to Piotr for reporting. Also fix bug regarding forking
in coverage mode; thanks to Naoki for reporting.
v1.54
- Update config module to stop writing unnecessary config files.
Config file will only be written when updating database or
application paths. Removed outdated validation, exclude tags, feature
classes, and default window values used by old db_helper methods.
- Complete rewrite of get_features script to handle annotation
files such as GFF3/GTF/UCSC formats in addition to SeqFeature::Lite
databases. Includes additional feature filters based on tags.
- Add additional transcript filter methods to GeneTools library,
including GENCODE basic tags and transcript_biotype.
- Update Data parse_table API, now allows for chromosome skip regex,
control simplify option, and explicitly search for mRNAs.
- Allow SeqFeature transcript collapsing and length determination
to work with features from a database.
- Tolerate weird transcript types when collecting subfeatures in
various GeneTools functions.
- Removed unnecessary primary_tag gene checks when collecting scores.
- Record extra ensemblSource data as transcript biotype when
parsing UCSC files
- Add chromosome skip regex to db_helper and big_helper methods
- Add no header options to data convertor scripts
- Long overdue update of POD and Readme.
v1.53
- Significantly streamlined GTF and GFF3 parsing to improve
loading times. By default, no subfeatures are parsed and must
be explicitly turned on as needed.
- Improved parsing gene tables (GTF, UCSC, etc) as an input
file to scripts. Now supports defining both the feature and
subfeature types to process. One more reason not to use an
annotation database.
- Fixed critical bug with collecting data across subfeatures,
e.g. get_binned_data. Subfeatures were not being properly
parsed and coordinates weren't converted to relative positions
correctly. Thanks to Zhizhou for reporting.
- New methods in Data objects for collapsing gene transcripts
and calculating transcript lengths.
- Fix bug with paired-end center span recording in bam2wig.
Thanks to Yixuan for reporting.
- Summary files now report bin midpoints based on 1000 bp length.
- Script pull_features allows multiple groups in a list file,
and write only summary files if desired.
- Bug fix in collecting sequence. Thanks to Patrick.
- Add support for collecting cds Start and Stop in script
get_gene_regions
- Numerous small bug fixes
v1.52
- Added binning option to wig files in script bam2wig. Default
is to write wig files in 10 bp bins with significant decreases
in runtime and memory usage while not appreciably diminishing
resolution.
- Add support to calculate shift values without doing wig
conversion in script bam2wig
- Add support for mRNA transcript subfeatures, including CDS,
5 prime UTR, and 3 prime UTRs, in data collection scripts
get_datasets and get_binned_data.
- Add new UTR methods to GeneTools library
- Changed behavior of reporting common and alternate exons and
introns in GeneTools. Genes with single transcripts now report
all exons and introns as common for simplicity.
- Add option to search at the 5 prime, middle, or 3 prime end
of features in script get_intersecting_features
- Fix bug in specifying which database feature to collect
regions from in script get_gene_regions
- Fix bug where tables with coordinates could not be used in
database lookups in script get_feature_info
v1.51
- Changed how bam alignments are recorded for indexed position
data hashes. Alignments are now recorded at their 5' position
instead of midpoint, which wreaked havoc with large gaps and pairs.
- Reporting indexed bam alignment names (ncount method) now returns
the actual names rather than count. The db_helper calculate_score
method can properly count these. This avoids double-counting across
exons, etc.
- Fix major bug in script bam2wig that prevented paired-end
alignments from working. Thanks to Mengyao for pointing this out.
- Add additional checks when loading malformed files that have a
missing column header or extraneous hidden columns (extra tabs)
- Add format checks for numeric columns in some file formats
- Miscellaneous code improvements here and there
v1.50
- Major upgrade of the data collection libraries to simplify data
collection and improve efficiency. The value type is no longer
specified, being rolled into the specified collection method. Low
level optimizations have been added to improve speed. Increases
from 30% to over 300% have been measured, depending on the
collection method and adapter.
- Rewrite of data collection scripts to work with the improved libraries
- Added support for the modern Bio::DB::HTS module for Bam files,
while keeping support for the older Bio::DB::Sam module.
- Added more agnostic support for multiple different fasta indexing
adapters
- Script bam2wig is completely rewritten to handle multiple bam
files for merging, independent bam scaling, improved alignment
filtering, customizable output, improved cross-strand correlation
for peak shifting, improved speed and memory management, and lots
more features.
- Updated script data2fasta
- Numerous other features and changes too small to mention
- Relaxed requirements for external modules, namely BioPerl, so
that scripts and functions that don't absolutely require them can
still be used. All database functions will require it though.
v1.45
- Fix endless loop bug with opening files with metadata but no data,
e.g. empty VCF files
- Revert support for opening bedGraphToBigWig file handles
v1.44
- Added new function to GeneTools for exporting to GTF format.
- Added new function to filter transcript subfeatures in a gene
SeqFeature object by available Ensembl Transcript Support Level tags.
- Fixed critical bug with collapsing multiple transcripts in
GeneTools function that resulted in too many overlapping exons.
- Fixed bug in exporting non-coding gene models to UCSC refFlat format.
- Other minor bug fixes.
v1.43
- Fix bug with unique option in script get_gene_regions where
too many regions were being discarded. Thanks to Mengyao.
- Fix bug with generating bigWig files in script bam2wig, and
restore option to prefer bedGraphToBigWig if so desired
- Add option to ignore extraneous attribute tags when parsing
GFF and GTF files to reduce memory (simplify). Enable this
option by default when parsing annotation files when loading a
table in Bio::ToolBox::Data.
v1.42
- Changed bigWig convertor method to use primarily the wigToBigWig
utility for simplicity
- Introduced new method to open a wigToBigWig utility filehandle to
"print" wig files directly to a bigWig
- Updated bam2wig and data2wig scripts to write directly to the
bigWig utility and skip writing temporary intermediate wig file
- Added functionality to bam2wig to record stranded shifted counts
- Fixed a critical bug in script get_gene_regions where transcripts
weren't being filtered
- Improved file format taste testing to avoid GFF false positives
- Improved UCSC gene table parser behavior
v1.41
- Added no header option when loading text files missing a
column header row. Updated script manipulate_datasets to take
advantage of the feature.
- Added option to combine multiple score columns into a single score
when converting a file to a wig file in script data2wig
- Added option to split gff or vcf data files by an attribute tag
in script split_data_file
- Improve handling of writing vcf files
- Fix critical errors with calculating cdsStart and cdsEnd in the
GeneTools library
- Fix bugs in gff parser to continue when encountering errors in
parsing and interpret transcript biotype gtf attributes
- Fix bug in properly handling start coordinates in script data2wig
v1.40
- Major update introduces new SeqFeature object Bio::ToolBox::SeqFeature
that is a little faster and more compact than equivalent BioPerl objects.
This is the default object used in gene table parsers.
- New Module Bio::ToolBox::GeneTools for working with SeqFeature objects
representing traditional nested feature gene, transcript, exon models.
The script get_gene_regions now uses this module, as do other scripts.
- Expunged many scripts that are no longer considered part of the primary
mission of the BioToolBox distribution. These are now available in a
separate repository located at https://github.com/tjparnell/HCI-Scripts.
- Bio::ToolBox::Data objects can now parse all gene tables into memory
and store the features in the object. This allows gene tables to be
used without requiring a database to be setup.
- Added a file tasting method to determine whether a file looks like a
specific file format, e.g. gff, UCSC gene table, etc.
- Added numerous little methods and method aliases here and there to
improve functionality
- Added attribute rewrite functions for both GFF and VCF files
- Improved file format testing
- Numerous little optimizations in loading files
v1.36 (git 44b9dea)
- added new option to script get_relative_data to allow user to specify
what feature types to avoid
- fix bugs in scripts manipulate_datasets when exporting log2 treeview
files and defining x axes in graph_profile
- fix annoying bug where manipulate_datasets will not re-show column list
- improve data file summarization
- some library method optimizations
v1.35 (git e489d52)
- Add new options for setting dimensions and linear regression lines in
script graph_data.
- Restored unique option in script data2gff.
- New convenience methods for Feature objects.
- Fixed bug with smoothing interpolation in get_relative_data
- Numerous other bug fixes regarding bed files, column names,
file support, warnings.
v1.34 (git 5d4803c)
- Changed the behavior of automatically converting interbase coordinates
to base coordinates upon loading a file, and converting back as necessary
when writing. This had the side effect of effectively changing coordinates
when writing out nonstandard text files. Conversion is now done on the fly
when using the start method of row Features. Start interbase coordinates
are now recognized by appending a 0 to the column name. Output files should
now look like the input files.
- Strand values are not automatically converted upon loading; they are
converted as necessary on the fly using the row Feature strand method.
- Null values are not automatically converted to internal '.' null values.
They are converted as necessary using the row Feature value method to
maintain backward compatibility.
- Scripts data2bed and data2wig go back to using a Stream input to avoid
high memory usage.
- Script data2wig now has a fast option to skip lots of checks on values
and intervals. This speeds up conversion considerably at the risk of
making improper wig files if the source file has issues.
- Script join_data_file is considerably faster by simply concatenating
data lines without processing or checking.
- Script bam2wig has new recording option, mid extend, to record the
middle portion of alignments or proper paired-end alignments. Credit to
Ohad for recommending.
- Add explicit interbase support to scripts data2gff and data2fasta.
- Fix critical bug where extensions were not scored properly for coordinate
features in script get_binned_data. Thanks to Mengyao.
- Fix bam2wig alignment illustrations in POD. Thanks to Ohad.
- Bug fixes regarding bed file integrity checking that were introduced in
the previous release.
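
Illustration of the on-the-fly interbase conversion described in v1.34: a
minimal sketch in plain Perl, assuming only what the entry states, namely
that an interbase start column is marked by a 0 appended to its name. The
function base_start and the column names are hypothetical and are not the
row Feature start method itself.

```perl
use strict;
use warnings;

# Hypothetical sketch: the stored value is never modified; a 1-base start
# is computed only when it is requested.
sub base_start {
    my ( $column_name, $value ) = @_;

    # A trailing 0 on the column name marks an interbase (0-base) start.
    return $column_name =~ /0\z/ ? $value + 1 : $value;
}

print base_start( 'Start0', 99 ),  "\n";    # interbase column
print base_start( 'Start',  100 ), "\n";    # already 1-base
```

Both calls report 100, and because the stored value is left alone, a
written file keeps the coordinates of its input.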
v.1.33 (git ba1a70e)
- Removed legacy_helper module. All scripts now properly updated to
use Bio::ToolBox::Data and related objects. This was the last step of
a long process to modernize all of the scripts to use the new libraries.
- All data collection modules are now chromosome naming-scheme agnostic,
meaning that "chr1" and "1" for chromosome can be used equally, regardless
of what the annotation or big data file uses.
- Minimal VCF file support is added, including the ability to parse INFO
and SAMPLE attributes, and verify some file format integrity.
- Significantly improve GTF file parsing.
- Improve file format verification, including printing error messages.
This should alleviate cryptic reasons for automatic file extension changes.
- Tons of bug fixes. See GitHub for a full change log.
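
Illustration of the chromosome naming-scheme agnostic lookup described in
v1.33: a minimal sketch in plain Perl of retrying a lookup under the
alternate spelling of a chromosome name. It is not the library's
implementation; the function names and the toy lengths are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the alternate spelling of a chromosome name,
# so a lookup can be retried when the first spelling is not found.
sub alternate_chr_name {
    my $chr = shift;
    return $chr =~ /^chr(.+)$/i ? $1 : "chr$chr";
}

# Toy index keyed by "chr"-prefixed names, with made-up lengths.
my %lengths = ( chr1 => 1000, chr2 => 2000 );

# Try the name as given, then fall back to the alternate spelling.
sub chr_length {
    my $chr = shift;
    return $lengths{$chr} // $lengths{ alternate_chr_name($chr) };
}

print chr_length('1'),    "\n";
print chr_length('chr2'), "\n";
```

The fallback works in either direction, so the annotation and the data
file do not need to agree on the prefix.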
v.1.32 (git 67749a7)
- Fix bug with adding a new column to Data object, particularly
when selected from a database.
- Fix bugs related to adding, deleting, or modifying columns for
a specific file format, such as BED or GFF
- Introduce additional Data structure verification tests, including
proper strand information, to verify correct file formatting, such
as BED and GFF
- Fix bugs when writing data files that incorrectly maintained
file extensions for a given format even when the structure was no
longer valid.
- Add support for .bigwig and .bigbed file extensions.
- Fix bug with opening fai fasta index and forked databases in script
CpG_calculator.
v.1.31 (git 9a4e122)
- Major addition of parsers for GFF and UCSC gene table formats.
This replaces the old gff3_parser and now supports GFF, GTF, and GFF3.
Also moved UCSC gene table parsing out of ucsc_table2gff3 and into
own parser module, available for all. This supports refFlat, genePred,
and knownGene tables. Tests for these parsers are included.
- Updated script get_gene_regions to use parsers.
- Greatly optimized bedGraph writing from script bam2wig to reduce
memory usage. Also ensure that bedGraph is written over entire chromosome.
- Fix bugs when sorting and performing math with null, NA, and inf
values, especially with script manipulate_datasets.
- Fix bug where coverage shifts by 1 bp after each write to fixedStep wig
in script bam2wig. Thanks to Magda for reporting.
v.1.30 (git 9ab9ff4)
- Major upgrade of the Bio::ToolBox::Data library internals.
Old data_helper and file_helper modules are gone, and a
legacy_helper module added for those programs that still haven't
been upgraded yet. Numerous improvements and bug fixes to Data and
Stream objects, structure verification, standard file format metadata,
file writing, and more. Several new methods have been added too.
- Added support for ncount, or name count, of bam files. By
counting unique alignment names, we can avoid double-counting
of reads in adjacent search areas. Also works for counting
paired-end reads. Supported by get_datasets script.
- Updated pull_features script to use new Data objects.
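
Illustration of the ncount idea described in v1.30: a minimal sketch in
plain Perl of counting unique alignment names across adjacent search
areas, so that a read overlapping two areas is counted once. The function
name and the read names are hypothetical, and no Bam adapter is involved.

```perl
use strict;
use warnings;

# Hypothetical helper: count unique names across one or more lists of
# alignment names, e.g. one list per exon or adjacent window.
sub count_unique_names {
    my @name_lists = @_;
    my %seen;
    for my $list (@name_lists) {
        $seen{$_} = 1 for @{$list};
    }
    return scalar keys %seen;
}

my @exon1 = qw(read1 read2 read3);
my @exon2 = qw(read3 read4);    # read3 overlaps both exons
print count_unique_names( \@exon1, \@exon2 ), "\n";
```

Summing the two list sizes would give 5, while counting unique names
gives 4.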
v.1.26 (git 21c800b)
- Removed Extras folder and outdated library functions. These
are available as a separate GitHub project, biotoolbox-extra.
- Improved GFF3 parser to handle orphans more gracefully, and
simplify parsing by adding a next_top_feature function. It is
moved out of the db_helper hierarchy, where it never really belonged.
- Changed license to exclusively Artistic License 2.0.
- Fixed bug when using input files with coordinate information in
script get_datasets. Thanks to Mengyao for reporting.
- Fixed bug when opening a new Data::Stream not based on a file or
data list.
v.1.25 (svn 955)
- Added a new option to manually specify the extension length
and allow new ways to record read coverage in the script bam2wig.pl.
A text graphic is included in the documentation to illustrate
different methods.
- Broke out database and fasta functionality from
Bio::ToolBox::db_helper into a separate sub module, which should
limit the number of modules loaded at compile time.
- Allow main Data feature_type to be specified by command line
option, useful when your input file has names of database features
but not a type column, for scripts get_feature_info.pl,
get_datasets.pl, get_binned_data.pl, and get_relative_data.pl.
- Added BED and GFF string export to Bio::ToolBox::Data::Feature
objects.
- Changed library version reporting for default new Data files.
- Fix bugs with setting and removing AUTO metadata properly
when opening and writing Data files.
- Fix bugs regarding deleting metadata, which had a side effect
of adding unwanted metadata to files written by manipulate_datasets.
- Added more name possibilities when looking for possible name
columns.
- Fix bug where a database may sometimes not be opened properly
after forking into children in data collection scripts.
- Fix bug that prevented statistics from being recovered from
child processes in script graph_data.pl.
v.1.24001 (svn 940)
- Updated tests to catch possible sources of error, including
recent UCSC BigFile libraries that power Bio::DB::BigWig
adaptors, DB_File required for GFF3 loading into memory database,
and path verification in Data metadata.
v.1.24 (svn 936)
- Added new module Bio::ToolBox::Data::Stream for working with
data files line by line instead of loading them into memory.
Moved lots of shared methods into Bio::ToolBox::Data::common.
- Added explicit file support for UCSC-style refSeq and genePred
file formats, as well as Encode narrowPeak and broadPeak files.
- Added new value type, pcount, in data collection scripts and
library score methods. Features, such as Bam alignments, must
be entirely contained within the search region, and not just
overlapping as with the count value.
- Added improved method for reloading forked children files
back into Data objects without having to call external
join_data_file script.
- Improved forking in data collection scripts, including a
delay in the parent after forking to prevent race conditions
on fast servers with high fork numbers.
- Removed all vanity names to data_helper and file_helper
subroutines. All scripts updated to reflect changes.
- Improved identification of overlapping features when avoiding
neighboring features when collecting relative data.
- Optimized Bam score data collection methods.
- Disabled bins when writing coverage in bam2wig.
- Fix bugs with writing CDT files in manipulate_datasets.
- Improved ToolBox::Data::Feature methods to handle internal nulls.
- Improved retrieval of sequence list, particularly for
SeqFeature::Store databases.
- Updated and improved library testing for Data and Stream objects
and database interaction.
- Fixed bug where negative coordinates would not be accepted
when collecting relative coordinates.
- Fixed bug where Bam and BigBed databases may not be opened
properly in some instances, such as precounting features for RPM
scores.
- Fix bug where in some cases all database features could be
returned with the method get_feature().
- Fix bug where the type option was not properly implemented in script
get_feature_info.
- Fix bug limiting to chromosome length in script
get_intersecting_features.
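The difference between the count and pcount value types added in this
release can be illustrated with a minimal sketch. It is written in Python
purely for illustration and is not the library's implementation; the
function names and the assumption of inclusive coordinates are hypothetical.

```python
# Illustrative sketch only: contrasts "count" (any overlap) with
# "pcount" (entirely contained). Coordinates are assumed inclusive.

def count(features, region_start, region_end):
    """Count features that overlap the region at all."""
    return sum(
        1
        for start, end in features
        if start <= region_end and end >= region_start
    )


def pcount(features, region_start, region_end):
    """Count features that lie entirely within the region."""
    return sum(
        1
        for start, end in features
        if start >= region_start and end <= region_end
    )


# Hypothetical alignments as (start, end) pairs
alignments = [(90, 110), (100, 150), (180, 220), (300, 320)]
print(count(alignments, 100, 200))   # three alignments overlap 100-200
print(pcount(alignments, 100, 200))  # only (100, 150) is fully contained
```

Under these assumptions an alignment straddling a region boundary adds to
count but not to pcount.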
v.1.23 (svn 915)
- Improved script get_gene_regions to recognize non_coding exons;
prompt for region, feature, and RNA type; specify more than
one feature type at a time; and avoid mixing RNA sub types from
the same gene. Thanks to Mengyao for troubleshooting.
- Fixed bugs pertaining to collecting relative windows that may
extend beyond the beginning of the chromosome. Thanks to Nate
for reporting.
- Fixed bugs sorting by genomic coordinate, especially when
only Position is provided and not Start.
- Made Bio::ToolBox::Features return smart coordinates only, no
funny values.
v.1.22 (svn 906)
- Added new export options of alternate, common, or all exons
to script get_gene_regions.
- Changed behavior of Bio::ToolBox::Data::Feature such that
database features must now be explicitly retrieved rather
than automatically retrieved, which could lead to runaway
execution if it could not be found.
- Improved how name columns are recognized and used when
retrieving database features.
- Improved writing of strand information in proper format
for Bed and GFF files.
- Fixed numerous bugs that prevented proper execution in
several scripts, including manipulate_datasets, get_feature_info,
graphing scripts. Thanks to Mengyao and Yixuan for reporting.
- Standardize data file loading message among several scripts.
v.1.21 (svn 896)
- Fixed critical bug that prevented upstream windows from
collecting data in script get_relative_data.
- Fixed critical bug that prevented some bigBed files from
being opened.
- Fixed critical bugs that prevented scripts data2fasta and
get_intersecting_features from working properly.
- Fixed bugs where strand may be inappropriately assigned or
sometimes ignored when collecting regional positioned scores.
- Fix minor bugs in output of scripts ucsc_table2gff3 and
get_ensembl_data
- Include checks in data collection scripts to exit gracefully if
datasets can't be verified.
- Interactive list of values to keep or toss is now sorted
alphanumerically in script manipulate_datasets.
v.1.20 (svn 884)
- Refactored db_helper so that all database adaptors are loaded
dynamically only as needed during runtime, rather than loading
everything all at once regardless of need. This results in
faster load times and reduced memory footprint.
- Added new methods to Bio::ToolBox::Data objects, including
sorting, genomic sorting, and feature_type.
- Split out metadata-related methods and Feature objects as
separate modules in Bio::ToolBox::Data. Feature objects will
now automatically retrieve represented database features as
necessary to collect attributes.
- Rewrote many, many scripts to use Bio::ToolBox::Data objects.
Simplify, unify, and improve all Data functions.
- Moved many specialized, outdated, or esoteric scripts to an
optional extras folder that will no longer be distributed via
CPAN but will be available through SVN.
- Added new functions to script manipulate_datasets.pl, including
processing rows with specific values, split and concatenate columns,
view table contents, and add additional manipulations prior to
writing CDT files. Also, several old functions were removed.
- Added support for converting refFlat and simple genePred
file formats to GFF3 in script ucsc_table2gff3.pl.
- Add better warnings for reading files with DOS or MAC line endings.
- Removed file extension manipulation in join_data_file script.
- Replaced fatal errors with warnings in merge_datasets script.
- Fix critical error where midpoints were not calculated correctly
for features in script get_relative_data.pl, preventing data
collection around a feature midpoint.
- Fix bug to properly collect extended bins at 3' end and avoid
undefined start errors in average_gene.pl; plus write a summary
file when executing with forks.
- Fix bugs with collecting features from a database.
- Fix bug with renaming M to UCSC-style chrMT in
get_ensembl_annotation.
- Numerous other small fixes scattered about.
v.1.19 (svn 843)
- Implemented subfeature sharing and multiple parentage when
exporting UCSC tables as GFF3. For example, exons can now be
shared between multiple transcripts of the same gene. This
leads to considerable reduction in file size at the expense
of increased complexity. Naming of subfeatures is now optional.
- Renamed script print_feature_types.pl to simply db_types.pl.
Known databases in the configuration file can now be
interactively chosen from a list.
- Added support for multiple parentage in the gff3 parser
library and script gff3_to_ucsc_table.pl.
- Added a verbose option and improved path detection in script
db_setup.pl.
- Script filter_bam.pl now works on unsorted and non-indexed
bam files, making it more useful than before.
- Bam files opened using db_helper::bam may now be sorted as
necessary before indexing.
- Increase default buffer value in script bam2wig.pl.
- Fixed bug where firstExon features were misnamed as lastExon
in script get_gene_regions.pl.
v.1.18 (svn 826)
- Fixed critical bug when calculating RPM and RPKM values in
data collection scripts. This is a long-standing bug that
produced erroneous values. The bug does not affect bam2wig.pl
rpm reporting.
- Improved methods for collecting from subfeatures such as
exons of genes or transcripts in script get_datasets.pl.
- Added option to specify which UCSC table(s) to use when
setting up a new database in script db_setup.pl.
- Added new options to extend and concatenate sequences in
script data2fasta.pl.
- Added ability to use the samtools fasta index when available
in scripts data2fasta.pl and CpG_calculator.pl. This index is
about 10-20% faster than the BioPerl fasta index.
- Fixed bug to avoid illegal characters in filenames when
splitting data files, and added an option to use a custom
file prefix in script split_data_file.pl.
- Fixed bug where ensembl gene names may not be properly
recorded in the output GFF3 file in script ucsc_table2gff3.pl.
v.1.17 (svn 808)
- Added six new method functions to Bio::ToolBox::Data for
working with columns and metadata.
- Updated script correlate_position_data.pl with parallel
execution plus an ANOVA statistical analysis between data.
- Fixed bug where the --bwapp option was not being used in
script bam2wig.pl. Thanks to Michael D. for reporting.
- Removed extraneous BioPerl warnings when opening a fasta file
or directory fails, and replaced with some suggestions.
- Fixed bug with RPM option that led to warnings in db_helper.
- Simplified warning for duplicate lookup values in script
merge_datasets.pl.
- Reorganized the POD summary and provided examples of usage
for main data collection scripts, plus provide default values
in POD summaries for a number of scripts. Thanks to Christian
for the recommendation.
v.1.16 (svn 794)
- Fixed critical bug that prevented the forward strand from
being written when generating stranded coverage in script
bam2wig.pl. Thanks to Michael D. for reporting.
- Fixed critical bug that prevented the script get_bam_seq_stats.pl
from compiling properly.
- Fixed bug that prevented filtering more than one length at
a time in script filter_bam.pl. Thanks to Yixuan for reporting.
- Fixed again the handling of a negative or zero start passed to
data collection methods in db_helper; it now issues a warning and
resets the value to 1.
v.1.15 (svn 786)
- Added Bio::ToolBox::Data method to delete column metadata
and improved adding new metadata.
- Added back cached database objects for data collection,
which brings back speed lost in the previous version.
- Original strand format is now maintained when rewriting data
files. For example, + and - from Bed and GFF files as opposed
to 1 and -1.
- Passing a negative or zero start value to data collection
methods in db_helper now issues a friendly warning and resets
the value to 1.
- Opening a BigWigSet directory of bigWig files can now infer
strand based on filename and set the metadata appropriately.
For example, files whose basename ends in f, forward, or plus
will be interpreted as strand 1.
- Script gff3_to_ucsc_table.pl was significantly updated to
address critical flaws and change the output format to refFlat.
- Script manipulate_datasets.pl no longer writes metadata for
simple file formats when using certain functions that do not
change data content.
- Script bam2wig.pl now includes a --flip strand option.
- Fixed errors and improved font and size handling in scripts
graph_data.pl and graph_profile.pl.
- Various other small bug fixes and checks for optional Perl
module installs.
- Updated shebang lines to use universal /usr/bin/perl
- Updated script POD documentation to make common options more
uniform.
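The filename-based strand inference described in this release for
BigWigSet directories can be sketched as follows. This is an illustrative
Python sketch, not the library's code: the function name is hypothetical,
only the forward suffixes named above (f, forward, plus) are handled, and
the rule is applied literally to the basename with its extension removed.

```python
import os

# Suffixes the release note names as indicating the forward strand
FORWARD_SUFFIXES = ("forward", "plus", "f")


def infer_strand(path):
    """Return 1 if the file basename ends in a forward suffix, else None.

    Assumption: the basename is taken without its final extension.
    Other suffixes are not covered by the note and are left undetermined.
    """
    base = os.path.splitext(os.path.basename(path))[0]
    if base.endswith(FORWARD_SUFFIXES):
        return 1
    return None


print(infer_strand("/data/set/sample_f.bw"))      # 1
print(infer_strand("/data/set/sample.plus.bw"))   # 1
print(infer_strand("/data/set/sample_input.bw"))  # None
```

A literal suffix test like this would also match unrelated names that
happen to end in "f", so a real implementation may be stricter.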
v.1.14.1 (svn 763)
- Changed the method of caching database objects introduced
in version 1.14, which wreaked havoc with forked child
processes. All database connections are cached by default
and returned if subsequently re-opened, unless explicitly
told to not use the cached connection. Multiple scripts
were updated to reflect the new connection caching.
- Bio::ToolBox::Data now automatically re-clones existing
database connections if you splice the data table.
- Bam file index files are now explicitly generated prior
to opening the bam file database connection. Additionally,
existing .bai files are copied as .bam.bai in preference
to creating a new .bam.bai file. Thanks to Yixuan for
reporting.
- Fixed POD errors in script bar2wig.pl and updated method
for finding the java executable file. Thanks to Guillaume
for reporting.
- Removed debugging warn statements in script
get_relative_data.pl.
- Added POD documentation to Bio::ToolBox::db_helper::useq.
v.1.14 (svn 737)
- Massive reorganization of the entire package into a proper
Perl module distribution that is installed using standard
Module::Build methods. This will install the libraries into
site-specific Perl library directories as Bio::ToolBox::*.
Scripts will install into a standard bin directory. All
scripts have been updated to reflect these changes.
- Added new module Bio::ToolBox::Data, which provides an easy
object-oriented interface to working with data files and the
rest of the Bio::ToolBox functions.
- Added new script db_setup.pl to ease generating an annotation
database with UCSC data
- Added Build tests for all major library functions, including
score collections from all binary database adaptors.
- Added capability to properly collect value types, including
score, count, and length, from useq and wiggle database adaptors
- Loosened restriction for counting Bam alignments where the
midpoint had to be within the query region; now any alignment
that overlaps the region will be counted.
- Reworked the interpolation algorithm to interpolate as many
datapoints as possible in script get_relative_data.pl.
- Removed cryptic error messages when opening databases, and
added database handle caching to avoid repeated openings
- Newly generated feature lists no longer append all aliases to
the feature name
- Added additional attributes to the list of available ones to
retrieve from the database in script get_feature_info.pl. Also
added a --type command line option to set a feature type to
named features.
- Improved data table checking to include a count of columns
for every row.
- Added max_count option to script bam2wig.pl to control for
high Bam coverage
- Fixed bug where the summary file was not created for
script get_relative_data.pl
v.1.13 (svn 691)
- Updated to include native support for USeq archive files
with data collection scripts. USeq files may be used in
the same manner as BigWig, BigBed, or Bam files for data
collection. USeq files may be generated using tools from
the USeq package (useq.sourceforge.net). The
Bio::DB::USeq adaptor is available via CPAN.
- Added new script filter_bam.pl, which can filter alignments
based on various criteria and write a new Bam file. Filters
are one or more boolean tests, including attributes, scores,
lengths, sequence, etc.
- Added new script get_bam_seq_stats.pl, which collects
information about the read sequences themselves and summarizes
the sequence composition and nucleotide frequencies, suitable
for generating sequence logos.
- Updated script manipulate_datasets.pl to allow any integer
to be used when formatting decimal values.
- Restored ability to write a new data file without collecting
data from script get_datasets.pl.
- Changed the log conversion step to avoid having to increase
read count by 1 to avoid log of 0 errors in script bam2wig.pl.
- Use the command line --log argument in preference over
metadata in script manipulate_datasets.pl.
- Method sum now writes 0 instead of null in script
bin_genomic_data.pl.
- Fixed issue where joining data files may not maintain gzip
status. This caused problems when combining forked children files.
- Fixed bug where a provided, indexed data source file
(e.g. BigWig) could not be used as a database in script
get_datasets.pl
v.1.12.6 (svn 680)
- Updated the script novo_wrapper.pl to use Parallel::ForkManager
instead of GNU Parallel. This should make it more stable,
particularly under nohup.
- Consolidated the standard out results when functions were
applied to multiple columns in script manipulate_datasets.pl.
This will make the script much less chatty.
- Fixed bug with naming temporary forked children file names.
- Fixed bugs with the generation of summary files.
- Fixed bug with the automatic identification of the X axis in
script graph_profile.pl.
- Fixed bug where features not found in a database could crash
the script get_feature_info.pl.
v.1.12.5 (svn 667)
- Improved the shift value determination to make it more robust
against outliers in script bam2wig.pl. Additionally, the model
data that is written is now centered over the shift peak to
make evaluations more interpretable.
- Fixed a bug where 0 or negative coordinates may be written
to varStep wig files in script bam2wig.pl.
v.1.12.4 (svn 662)
- Improved the efficiency of scanning for high coverage regions
and calculating 3 prime shift values in script bam2wig.pl. Each
reference sequence is now scanned in parallel. Also added a new
option to write the shift profile model and correlation data.
The efficiency of writing bedGraph files was improved, giving
up to 2X increase in performance. The default maximum duplicate
value is now unlimited. Warnings about coverage beyond the ends
of chromosomes are now silenced unless verbose is turned on.
- The script graph_data.pl can now execute in parallel to improve
efficiency when a list of datasets is provided in advance. A
list may now be provided in conjunction with the --all option.
- Improved recognition of the X-axis column in script
graph_profile.pl.
- Fixed critical error when writing extended position bedGraph
files from script bam2wig.pl where reverse reads were not
extended appropriately in the 3 prime direction.
v.1.12.3 (svn 651)
- Added user options to control the size of the memory buffer
when writing bedGraph files and the disk write frequency in
script bam2wig.pl.
- Added option to control the output order of the features from
script pull_features.pl. The order may match either the input
list or input data file. Also improved automatic column identification
and avoided empty output files.
- Script data2wig.pl will now write bedGraph files.
- Fixed bug leading to excessive memory usage when writing a
fixedStep wig file from script bam2wig.pl. Thanks to Jeff for
reporting.
- Fixed bug where strand values for gff or bed files may not be
written correctly.
- Fixed bug leading to errors loading input files with comment or
empty lines in the middle of data lines.
- Fixed bug to avoid log of 0 errors in script bam2wig.pl.
v.1.12.2 (svn 642)
- Scripts find_enriched_regions.pl and CpG_calculator.pl are now
multi-threaded. The find_enriched_regions.pl also has additional
optimizations to reduce memory usage.
- The script merge_datasets.pl now has the option to use a coordinate
string as a unique identifier when looking up features. This is
particularly helpful with BED, GFF, and other files with genomic
coordinates that do not have unique name identifiers.
- A coordinate string in the format chromo:start-stop may now be
generated from coordinate values in data files using a new function
in the script manipulate_datasets.pl.
- Fixed a bug regarding changing file extensions in script
join_data_file.pl, which gave odd output file names with scripts that
executed in parallel.
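The coordinate string format chromo:start-stop mentioned in this release,
and its use as a lookup key when features lack unique names, can be
sketched as follows. This is an illustrative Python sketch under stated
assumptions, not the scripts' actual code; the function names and the
sample rows are hypothetical.

```python
def coordinate_string(chromo, start, stop):
    """Build a chromo:start-stop identifier from coordinate values."""
    return f"{chromo}:{start}-{stop}"


def merge_by_coordinates(target_rows, lookup_rows):
    """Append the lookup score to each target row with matching coordinates.

    Each row is a tuple (chromo, start, stop, value). Rows without a
    match in the lookup table receive None.
    """
    lookup = {
        coordinate_string(c, s, e): value for c, s, e, value in lookup_rows
    }
    return [
        (c, s, e, value, lookup.get(coordinate_string(c, s, e)))
        for c, s, e, value in target_rows
    ]


# Hypothetical tables sharing genomic intervals but no name column
table_a = [("chrI", 100, 200, 1.5), ("chrI", 300, 400, 2.0)]
table_b = [("chrI", 300, 400, 7.0)]

print(coordinate_string("chrI", 100, 200))  # chrI:100-200
print(merge_by_coordinates(table_a, table_b))
```

Keying on the coordinate string treats two rows as the same feature only
when chromosome, start, and stop all match exactly.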
v.1.12.1 (svn 635)
- Fixed bugs where gzip status and file extensions may be inappropriately
inherited. This may cause problems when joining children files from
parallel process forks.
- Fixed bug where the interactive menu would exit upon an empty value
in script manipulate_datasets.pl. A "q" must now be provided to exit.
- Minor optimization when calculating shift values in script bam2wig.pl.
v.1.12 (svn 619)
- Major improvements to performance of some data collection scripts by
adding multi-threaded options. These include get_datasets.pl,
get_relative_data.pl, average_gene.pl, and bam2wig.pl. The number of
CPU forks may be specified with the --cpu option (default 2). This option
requires the installation of Parallel::ForkManager, available through
CPAN. Run the check_dependencies.pl script to install it.
- All gzip compression read and writes are now forked through an
external gzip utility for a considerable boost in performance (2-5X).
The gzip executable must be in your path for this to work (it usually
is on most Unix-like environments).
- Added --long option when collecting data from long features in script
average_gene.pl.
- Improved efficiency when collecting data from very large windows in
both get_relative_data.pl and average_gene.pl.
- Summing the total number of read alignments in Bam files is also
multi-threaded. Summing the total number of intervals in a BigBed file
is also improved.
- Fixed a critical error where not all windows had data collected when
using the script get_relative_data.pl
v.1.11 (svn 603)
- Major revision of how features are now retrieved from the database
using primary_IDs rather than relying on unique names in the database.
Generating lists of features will now return Primary_ID, Name, and Type.
The Primary_ID is unique to a database and is usually non-portable.
Current feature lists with only Name and Type will still work, and are