CHANGELOG
2024-10-25 Release 5.2.2
- fix logup error level
2024-02-05 Release 5.2.1
- make logup log level configurable via environment variable
2024-01-10 Release 5.2
- update URLs for test files on GitHub
- fixed bug in aggregators that excluded tag patterns ending in "!"
2023-09-25 Release 5.1
- reduce log level for failed number conversion from INFO to DEBUG
2023-09-19 Release 5.0.3
- remove pytest from requirements.txt, and move it to tests_require in setup.py
2023-07-13 Release 5.0.2
- allow .hxl extension for CSV data
- change hxlspec script args to allow variable input source
2023-06-01 Release 5.0.1
- loosen versions for dependencies
- remove dependency on requests_cache
2023-05-18 Release 5.0
- milestone release
- fix URL munging for direct-download HXL Proxy URLs
- add hxlinfo command-line script
- added headers and hxl_headers fields to output of hxl.input.info()
2023-04-06 Release 4.29
- remove hxl.input.ExcelInput.info() and make a top-level hxl.input.info() function that works with every data type (also alias to hxl.info())
- update Makefile for new git branch structure (just dev and prod)
2023-03-20 Release 4.28
- update requirements to allow latest versions of dependencies
- don't fall back to CSV if we have a MIME type or file extension that's not in the allow list (which is fairly liberal)
- change behaviour of ReplaceDataFilter so that with a map, it stops after the first successful replacement for each value
- change to ReplaceDataFilter means that replacement maps can now have default values at the end
- speed up ReplaceDataFilter by precompiling matching column indices for each pattern/replacement
- simplify Excel handling to remove dependency on mmap and tmpfile
- add patch from wynnw to fix crash reading some local files
- fixed issue where libhxl can't read directly from sys.stdin
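The ReplaceDataFilter change in this release can be pictured with a small standalone sketch. It does not use libhxl; the function name replace_first_match and the (pattern, replacement) rule format are assumptions made only for illustration. Because matching stops at the first rule that succeeds, a catch-all rule placed last behaves as a default value.

```python
import re

def replace_first_match(value, rules):
    """Return the replacement for the first rule whose regex matches the whole value.

    `rules` is a hypothetical list of (regex, replacement) pairs, not
    libhxl's own data structure.
    """
    for pattern, replacement in rules:
        if re.fullmatch(pattern, value, flags=re.IGNORECASE):
            return replacement  # stop after the first successful replacement
    return value  # no rule matched: leave the value unchanged

rules = [
    (r"m(ale)?", "M"),
    (r"f(emale)?", "F"),
    (r".*", "unknown"),  # catch-all default, reached only if nothing above matched
]

print([replace_first_match(v, rules) for v in ["male", "F", "n/a"]])
```

If every rule were applied in turn instead, the catch-all at the end would overwrite the earlier replacements, which is why stopping at the first match makes defaults possible.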
2022-12-12 Release 4.27.3
- decrease log level to reduce the volume of logs generated by the HXL Proxy (in prod)
2022-11-25 Release 4.27.2
- use structlog module
- add hxl.util.logup function to include function name in log message
- switch to logup in input module
- add extra help text for hxlcount script
2022-09-30 Release 4.27
- raise a HXLHTMLException (subclass of HXLIOException) when HTML markup found
- clean dates that are integers, assuming seconds or days since epoch (with cutouts for years and month days)
- bypass RequestResponseIOWrapper, because it's causing grief; need to make sure content still gets uncompressed
- fix input options for secondary datasets in scripts (e.g. an appended or joined dataset)
- more debug logging about HXL hashtag detection
- add a new logger to hxl.REMOTE_ACCESS specifically for external URL access
- various command-line script updates and fixes
2022-08-05 Release 4.26
- use loglevel CRITICAL only for infrastructure failures
- don't put tracebacks in logging (avoid logger.exception())
2022-07-15 Release 4.25.2
- raise exception for HTML input (prevents tagger exploit in HXL Proxy)
- make hxl.input.munge_url() public
- remove default tags for hxlcount script so that it can just count lines of data
2022-06-28 Release 4.25.1
- support optionally filling merged areas (XLS and XLSX only)
- support optionally scanning CKAN datasets for the first HXLated resource (in hxl.data(), not make_input())
- support Google Sheets interactive view URLs for XLSX files (but not tab identifiers)
- make Excel workbook handling more efficient
- add --encoding option to all command-line scripts (tested with CSV)
- add --expand-merged option to all command-line scripts
- add --scan-ckan-resources option to all command-line scripts
- add hxl.input.InputOption object to hold all input options
- add info method to input objects to get general info about an XLSX file (including HXLated and merged areas in each sheet)
- refactor input and model classes
- upgrade to xlrd3 version 1.1.0
2022-02-28 Release 4.25
- remove discontinued "encoding" keyword for json.load (Python 3.9)
- rename the hxl.io module to hxl.input (to avoid problems in Python on Windows)
2021-04-23 Release 4.24
- fix bug that prevented tagger from adding tags if there was a blank row before the text headers
2021-02-08 Release 4.23
- handle encoding errors in CSV more gracefully (replace with "?")
- fix bug in JSON recipes (correct "key" to "tags" in the sort filter)
- fix bug in the sort filter when there is no key and untagged columns on the right side of the dataset
2021-01-15 Release 4.22
- consolidate recent bugfix releases
- use same input code for XLS (old Excel) and XLSX (new Excel) via xlrd3 library
- more-informative exception for an out-of-range Excel sheet index
2021-01-13 Release 4.21.3
- Bug-fix release: switch to xlrd3 library, since xlrd has dropped support for XLSX files
- be more careful about detecting non-XLSX zip files
- be more efficient with memory usage for Excel files
- for web-security reasons, unless allow_local is True, block fetching datasets from localhost, *.localdomain, or any dotted quad
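The local-address restriction in this release can be approximated with the standard library alone. This is a minimal sketch under stated assumptions, not libhxl's implementation: the helper names is_local_url and check_url, the regular expression, and the choice of ValueError are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Four groups of 1-3 digits separated by dots, e.g. 127.0.0.1 (assumed definition)
DOTTED_QUAD = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")

def is_local_url(url):
    """True if the host is localhost, *.localdomain, or a dotted quad."""
    host = (urlparse(url).hostname or "").lower()
    return (
        host == "localhost"
        or host.endswith(".localdomain")
        or DOTTED_QUAD.match(host) is not None
    )

def check_url(url, allow_local=False):
    """Return the URL unchanged, or raise if it is local and not allowed."""
    if not allow_local and is_local_url(url):
        raise ValueError("Local URL not allowed: " + url)
    return url

print(is_local_url("http://127.0.0.1:8080/data.csv"))
print(is_local_url("https://example.org/data.csv"))
```

Checking the parsed hostname rather than the raw string means ports, paths, and upper-case hosts do not slip past the test.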
2020-12-08 Release 4.21.2
- Bug-fix release: corrected rare Unicode bug in lookahead buffer for CSV files
2020-08-18 Release 4.21.1
- Bug-fix release: corrected rare bug in handling HXL Proxy saved recipes as a source
2020-07-22 Release 4.21
- BACKWARDS-INCOMPATIBLE: rename "whitelist" to "includes" in
hxl.model.Dataset.with_columns() and JSON format for loading a
ColumnFilter
- BACKWARDS-INCOMPATIBLE: rename "blacklist" to "excludes" in
hxl.model.Dataset.without_columns() and JSON format for loading a
ColumnFilter
- added a new ExpandListsFilter and hxlexpand command-line script
- added --skip-untagged option to hxlcut script
- improved docstrings for hxl.io, hxl.datatypes, hxl.converters, and hxl.geo modules
- added hxlspec command-line script to process a HXL JSON spec
- added URL munging for HXL Proxy links (non-CSV/JSON)
- added URL munging for Kobo survey links (requires an Authorization: header)
- if there's an error parsing a formula, return the value "** ERROR **" and log the error
- added requests_cache to requirements (to disable caching for API calls)
2020-05-20 Release 4.20
- switch to jsonpath-ng library to support more JSONPath features
- add support for row formulas as right side of row queries (making it possible to compare one column to another)
- add today() function for row formulas (today's UTC date in ISO YYYY-mm-dd format)
- add support for parsing Unix-style epoch timestamps (nanoseconds, milliseconds, or seconds)
- add long description in setup.py for PyPI (from README.md)
- work around xlrd date-handling bug (for malformed Excel dates)
- more cleanup on docstrings
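One way to handle the three epoch units mentioned in this release is to guess the unit from the magnitude of the number. The sketch below is illustrative only: the function name parse_epoch and the cut-off values are assumptions, not the thresholds libhxl uses.

```python
from datetime import datetime, timezone

def parse_epoch(value):
    """Convert an epoch timestamp to an ISO YYYY-mm-dd date in UTC.

    The unit is guessed from the size of the number (assumed cut-offs):
    very large values are nanoseconds, mid-sized ones milliseconds,
    everything else seconds.
    """
    n = int(value)
    if n >= 10**16:
        seconds = n // 10**9   # nanoseconds
    elif n >= 10**11:
        seconds = n // 10**3   # milliseconds
    else:
        seconds = n            # seconds
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")

for raw in ["1589932800", "1589932800000", "1589932800000000000"]:
    print(raw, "->", parse_epoch(raw))  # each prints 2020-05-20
```

A magnitude test like this cannot distinguish a timestamp in seconds from a plain large number, so it is only suitable where the column is already known to hold dates.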
2020-04-23 Release 4.19
- allow selector for JSON data to be a JSON path as well as just a simple token
- add --selector option to all command-line scripts
- added toupper() and tolower() row functions
- fixed bugs with date heuristics and with JSONPath parsing of non-strings
- various code cleanup
- add a generic Makefile to simplify common testing and git tasks
2020-03-16 Release 4.18
- add ImplodeFilter and implode() method to convert a long dataset into a wide dataset
- add hxlexplode and hxlimplode command-line scripts
- change hxl.model.Column.parse to return None for an empty string, or False for a malformed one
- add logging for malformed tagspecs
- extend column-rename spec to allow filtering by header as well as tag pattern
2020-02-25 Release 4.17
- add ability to flatten lists as non-JSON (just separated by " | ")
2019-12-09 Release 4.16
- added optional encoding parameter to hxl.io.data() etc. to force a character encoding
- fix bug opening an Excel file when the server sends a
content-type: zip MIME type
- fix bug peeking at UTF-8 files with many extended characters
- fix minor stability issues (e.g. unclosed file resources)
- fix CSV delimiter detection to ensure "," is always tried first
2019-04-04 Release 4.15.1
- better error report when trying to open a zip file that's not an
XLSX or zipped CSV
2019-03-29 Release 4.15
- improve support for private datasets: fix custom HTTP headers
with CKAN URL munging; add HXL_HTTP_HEADER and --http-header
support for all scripts
- add hxl.io.HXLIOException as base class for all HXL I/O-related
exceptions, with an optional url property
- add hxl.io.HXLAuthorizationException, and throw it for private
CKAN datasets or any 403 Forbidden response
- better error reporting for external validation errors
2019-03-04 Release 4.14
- report external validation errors (e.g. missing taxonomy)
separately (and don't invalidate the dataset)
- improve date parsing, and fix bug in borderline case (month but
no year)
- stability improvements for formulas (including "NaN" result for DIV0)
- efficiency improvements in validation tests
- fix a bug when reading an empty dataset
- fix a DIV0 bug in validation
- fix lexer conflict with jsonpath_rw package
2019-02-06 Release 4.13.2
- rewrite, correct, and simplify formula parsing
- fix bug in add_column filter
- add round() function to formulas
2019-01-31 Release 4.13.1
- fix setup.py to include new hxl.formulas package in the dist files (oops!)
2019-01-31 Release 4.13
- add spreadsheet-style formulas calculated from a row's
contents: https://github.com/HXLStandard/hxl-proxy/wiki/Row-formulas
- update add-column filter to support formula substitutions
- improve error reporting for private HDX resources
- fix bug with is min/is max aggregator and mixed datatypes
- fix date normalisation to preserve original value (and log an
error) if parsing fails
- bugfix for validation error when no row and/or column present
2018-12-03 Release 4.12
- added concat() aggregator for CountFilter
- added data_hash and columns_hash properties for
hxl.model.Dataset, together with hxlhash command-line script
- fixed parsing of YYYY-MM-DD hh:mm:ss SQL dates
- support opening zipped CSV files (thanks to Orest Dubay)
2018-08-31 Release 4.11
- fixed date-cleaning bug when reading a filter definition from
JSON
- added an option to send custom HTTP headers with a request for a
remote HXL file (e.g. a custom user agent)
2018-07-31 Release 4.10
- refine delimiter detection for CSV
- added append_external_list() method (and filter support) for an
external list of files to append
- count filter now supports dates and strings for min() and max()
aggregators
- add dayfirst parameter to normalise_date() (defaults to True)
- have CleanFilter prescan the dataset for date cleaning, and
default to dayfirst unless unambiguous MMDD format is more common
2018-06-29 Release 4.9
- support additional separators besides comma for CSV-like files
(including tab, semicolon, colon, and vertical bar)
- allow absolute tag patterns ending in "!" (does not ignore extra
attributes)
- fix bug in ReplaceDataFilter that raised an error when the
replacement was empty/None
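The separator support in this release can be illustrated with a simple detection heuristic. This is a sketch under stated assumptions rather than libhxl's algorithm: the names guess_delimiter and read_rows, the candidate order, and the rule that comma wins ties are all hypothetical.

```python
import csv
import io

# Candidate separators named in the entry above; comma first so it wins ties
CANDIDATES = [",", "\t", ";", ":", "|"]

def guess_delimiter(sample):
    """Pick the candidate that occurs most often in the first line."""
    first_line = sample.splitlines()[0] if sample else ""
    best, best_count = ",", 0
    for delimiter in CANDIDATES:
        count = first_line.count(delimiter)
        if count > best_count:  # strict comparison keeps the earlier candidate on ties
            best, best_count = delimiter, count
    return best

def read_rows(text):
    """Parse CSV-like text using the guessed delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=guess_delimiter(text)))

print(read_rows("#org;#sector;#adm1\nUNICEF;Education;Coast\n"))
```

Counting characters in a single line is easily fooled by quoted values that contain a separator, so a production detector would need to inspect more of the file.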
2018-06-14 Release 4.8.4
- when reading JSON, flatten any non-scalar values into JSON
strings
- add a filter for extracting values from JSON strings using JSONPath
2018-05-05 Release 4.8.3
- handle Google Drive "open" and "file" URLs
- normalise whitespace for the count filter (so that "Guinea" and
"Guinea " won't count separately)
- fix validation test for trailing whitespace
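The whitespace normalisation for counting can be sketched in a few lines of standard-library Python. The helper names normalise_whitespace and count_values are assumptions for illustration and are not part of libhxl.

```python
from collections import Counter

def normalise_whitespace(value):
    """Trim the ends and collapse internal runs of whitespace to one space."""
    return " ".join(value.split())

def count_values(values):
    """Count values after whitespace normalisation."""
    return Counter(normalise_whitespace(v) for v in values)

counts = count_values(["Guinea", "Guinea ", "  Sierra   Leone", "Sierra Leone"])
print(counts)
```

Without the normalisation step, "Guinea" and "Guinea " would be tallied as two distinct values.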
2018-05-31 Releases 4.8.1, 4.8.2
- hotfixes for installation problem with 4.8 (in a clean install)
2018-05-31 Release 4.8
- add __version__ attribute to module
- refactor the hxl.validation module for better testing and
maintainability
- add a new default schema with useful default tests
- allow multiple tag patterns (comma-separated) for #valid_tag in
a HXL schema
- add a spelling validation test
- add a numeric-outlier validation test
- refactor CacheFilter to preserve row numbers
- add ability to generate a JSON-style validation report easily
via hxl.validate()
- new requirement: python-io-wrapper
- RowFilter (with_rows, without_rows) no longer ignores empty
cells; that will occasionally give some different results
- when multiple columns match a row query, it will succeed with at
least one success
- fixed a bug parsing "is" row queries
- handle more Google Sheets URLs
- recognise datetime formats as dates
2018-05-11 Release 4.7.1
- hotfix for bug in date parsing
2018-04-30 Release 4.7
- remove obsolete Python2 compatibility code
- added source_row_number and source_column_number to support validation
- add wildcard support to tag patterns, so that we can use
patterns like "*" or "*+f-children"
- revamped date handling to support partial dates like "2018-01"
or "2018", and also special notation like "2018W05" or "2018Q1"
- add min and max methods to hxl.model.Dataset
- HXL validation reports a validation error when a #valid_value+url is not usable
- HXL validation now reports the proper column
- HXL validation now accepts all parseable date formats
- HXL validation now has a #valid_unique constraint (single value or
compound key)
- HXL validation now has a #valid_correlation constraint (e.g. make sure
that #adm1 and #adm2 are always consistent for any given value
of #adm3)
- HXL validation can now try to infer datatypes without explicit rules
- HXL validation now calculates edit distance and suggests the
closest match when failing validation against a list
- HXL validation can now test for irregular whitespace using
#valid_value+whitespace
- add "is (not) min" and "is (not) max" support to hxl.model.RowQuery
- add is_cached flag to hxl.model.Dataset and subclasses
- updated all AbstractInput to be iterables rather than iterators
(for repeatability)
- removed hxl.common module and added hxl.datatypes, with
more-consistent data checking/conversion
- when importing JSON arrays and objects, flatten them to a usable
text representation
- update docstrings
- default to case insensitive for validation
- added static hxl.model.TagPattern.match_list method
- fixed hxl.filters.ReplaceDataFilter to allow multiple tag
patterns
- fixed bug when an empty row appears before the hashtag row
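The closest-match suggestion described in this release can be sketched with a plain Levenshtein distance. This is an illustration under stated assumptions, not the code libhxl uses: the function names edit_distance and suggest are hypothetical, and the case-insensitive comparison is a choice made for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]

def suggest(value, allowed):
    """Return the allowed value closest to `value`, or None for an empty list."""
    if not allowed:
        return None
    return min(allowed, key=lambda c: edit_distance(value.lower(), c.lower()))

print(suggest("Helth", ["Health", "Education", "WASH"]))
```

In practice a maximum distance is usually applied as well, so that a value unrelated to every list entry produces no suggestion at all.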
2018-03-29 Release 4.6
- end support for Python 2 (will die with a RuntimeError;
next release will remove Python2 compatibility code relics)
- start implementing logging support
- all command-line utilities now have a --log option to set the
logging level
- restore support for preserving original attribute order (except
for JSON object flavoured export)
- add hxl.Column.get_display_tag method with optional attribute
sorting
- add support for lat/lon normalisation to clean-data filter
- add purge option to clean_data to allow removing numbers, dates,
or lat/lon that can't be parsed during data cleaning
- fix bug opening a Google Sheet from a CKAN resource URL
- allow opening a dataset from a CKAN dataset URL (uses first
resource)
- make the 'patterns' parameter optional for the JSON count recipe
2018-02-05 Release 4.5.1
- bug-fix release: do not let a misspelled date cause a fatal
exception
2018-01-31 Release 4.5
- the merge-data filter now looks for keys in *all* candidate
columns (not just the first-matching ones)
- add skip_untagged parameter to without_columns and ColumnFilter,
for removing columns without HXL hashtags
- hxl.model.Row.get_all can take a default value
- the clean-data filter has a number_format option (e.g. "0.2f")
- the hxlclean command-line script has a --number-format option
- hxl.model.Column.display_tag always shows attributes sorted, per HXL
1.1 beta
- added hxl.model.Row.dictionary property to return row as a
Python dict
- hxl.model.Source.gen_json() has a new use_objects option to use
the JSON list-of-objects format from HXL 1.1 beta as output
- hxl.io.write_json() has the use_objects option to pass on to
hxl.model.Source.gen_json()
- try to recognise JSON data even if it doesn't have a JSON MIME
type or file extension
2017-11-22 Release 4.4
- throw proper exception for failed HTTP request from requests library
- support JSON arrays of objects as well as arrays of arrays
- recursively search for HXL data inside a JSON dataset
2017-06-13 Release 4.3
- fixed bug with disabling SSL checks via requests library
- improved Excel data handling — use integers instead of floats
when possible, and fix bug when trying to parse numbers as dates
2017-06-05 Release 4.1
- support JSON input (list of rows)
- add fill_data filter to fill empty cells from previous rows
- use MIME type and extension where available to help choose type
- grab character encoding from HTTP response if available
- add verify_ssl parameter to hxl.io.data, hxl.io.make_input,
etc. Defaults to True; if False, don't try to validate SSL certs
- add new "is (not)" operator for row queries
- add optional date_format parameter for clean data filter
- fix bug with error messages from scripts
- fix output bug in command-line scripts
- fix bug in merge filter
- fix bug in hxl.io.tagger
2016-12-02 Release 4.0
- Fully modularised JSON specs and recipes.
- Made JSON specs and recipes work recursively.
- Fix bug that caused select filter to fail after explode filter
- Added top-level hxl.tagger() function, similar to hxl.data()
- add optional default_header arg to hxl.model.Column.parse_spec
- major overhaul of the merge_data filter: now merges *all* columns
matching the pattern supplied, and doesn't create an empty
column if there are no matching columns in the merge dataset
- refactored append filter to allow multiple append files in
single filter
2016-10-17 Release 3.3
- Regex ~ and !~ operators in row queries now match anywhere in the cell
- make row query smart about date comparisons with #date hashtag
- block numeric/date conversion in row queries for ~ and !~
2016-08-30 Release 3.2
- add timeout option for opening URLs (avoids long wait in unit
tests)
- encoding fixes for Python2
- add HXLColumn.has_attribute() method
- add an optional parsed attribute to HXLRow.get() to try parsing
the value according to attributes (currently supports +list)
- add experimental support for the +list attribute
2016-07-28 Release 3.1
- change request handling to work better with requests_cache (no
more streaming directly from the raw object in the request response)
2016-07-23 Release 3.0
- use the Python requests library in hxl.io (which will allow
add-ons like requests_cache)
- the tagger now has an option for a default tag
- add unit tests for Tagger, along with an option to force a full header
match and a default tag for non-matching headers
2016-06-22 Release 2.8
- add a new Explode filter that changes series data to a more-normalised form (no command-line version yet)
- improvements to Add Column filter
- better number handling in Clean Data filter (can now handle exponential notation)
- add mask parameter to Merge Data filter
- start support for reading/writing filter chains encoded in JSON
- added add_attribute and remove_attribute methods for a dataset
- documentation and unit test improvements