Issue with character encoding during ingestion of data #369

shahmanash · 2019-11-14T10:54:25Z

Special characters (Swedish characters) seem to be handled properly during the "loading" phase: for example, Swedish characters in the field "stateProvince" are written to Cassandra correctly.
However, during the "sampling" process, special characters do not seem to be handled properly and are written incorrectly to the Cassandra database.
This error then propagates to the SOLR index, breaking search functionality in other apps such as regions and spatial-service.

For example, in the following record, the "stateProvince" field is displayed correctly but in the "Additional political boundaries information", the encoding issue appears:

https://beta.bioatlas.se/ala-hub/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89

The same record in raw format:

https://beta.bioatlas.se/biocache-service/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89

The spatial processing of the layers has been done correctly (in terms of character encoding): the values appear correctly in the PostGIS database, geoserver and the spatial-service, as can be seen here:

https://beta.bioatlas.se/spatial-service/field/cl10039
https://beta.bioatlas.se/spatial-service/objects/cl10039

https://beta.bioatlas.se/spatial-service/field/cl10064
https://beta.bioatlas.se/spatial-service/objects/cl10064

However, looking in Cassandra, the encoding is broken for the processed contextual layer field "cl_p" but not for the stateProvince field.

cqlsh:occ> select "stateProvince",cl_p from occ where rowkey='2ab38688-d6f4-440d-acce-83ffdc2d3e89';

 stateProvince | cl_p
---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Gästrikland | {"cl10038":"Sweden","cl10068":"Kuperad sydlig boreal (S, T, U, W, X, y)","cl10042":"LA1646","cl10039":"G�vleborgs","cl10053":"Baltic Sea","cl10064":"H�gmosse-region","cl10058":"Boreal biogeografisk region","cl10041":"45","cl10087":"Sweden","cl10052":"4329","cl10040":"Hofors"}

(1 rows)
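
A quick way to list which layer fields are affected in a row like this is to look for the Unicode replacement character (U+FFFD) in the cl_p JSON. The following is only a sketch: the function name is hypothetical, the sample string is a shortened copy of the row above, and it assumes the broken characters are stored as U+FFFD, as the output suggests.

```python
import json

# U+FFFD is what a lenient decoder emits for bytes it cannot interpret.
REPLACEMENT_CHAR = "\ufffd"


def damaged_fields(cl_p_json: str) -> dict:
    """Return the layer fields whose sampled value contains U+FFFD."""
    values = json.loads(cl_p_json)
    return {field: value for field, value in values.items()
            if REPLACEMENT_CHAR in value}


# Shortened copy of the cl_p value shown above.
sample = ('{"cl10038":"Sweden","cl10039":"G\ufffdvleborgs",'
          '"cl10064":"H\ufffdgmosse-region","cl10040":"Hofors"}')

for field, value in damaged_fields(sample).items():
    print(field, value)
```

For the row above this reports cl10039 and cl10064, the two layers whose values contain Swedish characters.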

It seems safe to say that the issue is introduced during the sampling stage.
The biocache-cli version is 2.4.4.

Any suggestions on how to deal with this would be very welcome.


shahmanash · 2019-11-15T14:08:02Z

The json response from spatial-service/intersect is well-formed as can be seen here:

https://beta.bioatlas.se/spatial-service/intersect/cl10040/57.7/11.96667
[{"field":"cl10040","description":"null","layername":"Kommuner","pid":"187","value":"Göteborg"}]

https://beta.bioatlas.se/spatial-service/intersect/cl10039/57.7/11.96667
[{"field":"cl10039","description":"null","layername":"Län","pid":"13","value":"Västra Götalands"}]

However, the sample.csv file created in the /intersect/batch folder of spatial-service contains the malformed characters. I guess this file is the source of the sampling-drID.txt file in the tmp folder of biocache.

Any suggestions?


djtfmartin · 2019-11-21T09:25:43Z

Moving to spatial-service as I think the problem lies there.

shahmanash · 2019-11-21T10:38:09Z

The problem was that the .dbf files in zipped shapefiles are normally NOT encoded in UTF-8.
During the batch/intersect call in the sampling process, the .dbf file is read, and this is when the garbled characters get written to the sampling file, and from there to the Cassandra db and then SOLR.
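
To illustrate the mechanism, here is a minimal sketch of how this kind of garbling can arise. It assumes the .dbf stored the name as ISO-8859-1 and that the reader decoded it as UTF-8; the actual code path in layers-store is Java and is not reproduced here.

```python
name = "Gävleborgs"

# Assumption: a legacy .dbf stores the value as ISO-8859-1,
# so "ä" is the single byte 0xE4.
dbf_bytes = name.encode("iso-8859-1")

# Decoding those bytes as UTF-8 fails on 0xE4, because it is not a valid
# UTF-8 sequence here; a lenient decoder substitutes U+FFFD for it.
wrong = dbf_bytes.decode("utf-8", errors="replace")

# Decoding with the charset the file was actually written in recovers the name.
right = dbf_bytes.decode("iso-8859-1")

print(wrong)
print(right)
```

The first value has the same shape as the "cl10039" value in the Cassandra row above, which is consistent with a charset mismatch when the .dbf is read.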

The following commits, in the spatial-service and layers-store modules respectively, fix the encoding problem in the sampling file produced by "batch/intersect" during the sampling process.

AtlasOfLivingAustralia/spatial-service@3b2c171

AtlasOfLivingAustralia/layers-store@c0c2f72

The sampling files generated using the artifacts after the aforementioned commits do not have the garbled characters.

djtfmartin · 2019-11-21T10:44:35Z

thanks @shahmanash

djtfmartin · 2019-11-27T12:51:28Z

fixed and released in 0.3.1

vjrj · 2020-01-07T22:09:56Z

Following this issue and some problems with UTF-8 layers reported by Austria in Slack, I've been testing spatial-hub and spatial-service (both 0.3.1).

This is what I have found so far. Correct me if I'm wrong:

Charset encoding in shapefiles is mainly the charset of the DBF files, where the field data is stored.
DBF files were originally restricted to ISO-8859-1, but that restriction is no longer realistic (especially if you want to use languages with non-Latin characters).
The charset of a DBF is defined in its CPG file.
ALA's spatial modules currently ignore these CPG files, so IMHO they process all shapefiles (i.e. their DBF files) as ISO-8859-1. I think this because spatial doesn't use the charset param in the layer PUT call to geoserver.
So in the geoserver vector data stores, the charset field is not configured (and is therefore interpreted as ISO-8859-1 by default).
So currently, in ALA, only ISO-8859-1 DBF files (and layers) are processed correctly.
If I understand Dave's latest commits correctly, he converts this ISO-8859-1 field data to UTF-8 so that it is sampled and persisted in Cassandra and SOLR with the correct characters.

So for now, if we use ISO-8859-1 layers, after this fix, the data should be sampled correctly.

The only problem I see is that we currently cannot use layers with non-Latin characters (Chinese, Cyrillic, etc.).

So maybe we should take into account the charset defined in the CPG files when uploading layers, and store it correctly in geoserver. Later, when the data is used during sampling, we would convert it to UTF-8 (or not) depending on this charset.
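
As a rough sketch of what that proposal could look like, the snippet below reads the charset from a .cpg sidecar file and falls back to ISO-8859-1 when the file is missing or names an unknown charset. The function names and the fallback policy are assumptions for illustration only; this is not the layers-store implementation, which is written in Java.

```python
import codecs
from pathlib import Path

# Assumed fallback, matching the default behaviour described above.
DEFAULT_CHARSET = "iso-8859-1"


def dbf_charset(shp_path) -> str:
    """Return the charset named in the .cpg sidecar of a shapefile,
    or DEFAULT_CHARSET if the sidecar is absent, empty or unrecognised."""
    cpg = Path(shp_path).with_suffix(".cpg")
    if not cpg.is_file():
        return DEFAULT_CHARSET
    name = cpg.read_text(encoding="ascii", errors="ignore").strip()
    if not name:
        return DEFAULT_CHARSET
    try:
        # codecs.lookup validates the name and returns a normalised form.
        return codecs.lookup(name).name
    except LookupError:
        return DEFAULT_CHARSET


def decode_field(raw: bytes, charset: str) -> str:
    """Decode raw DBF field bytes to text using the detected charset."""
    return raw.decode(charset, errors="replace")
```

Once a field is decoded with the right charset it is ordinary Unicode text and can be written out as UTF-8 for sampling. Real .cpg files may name their charset in a form this lookup does not recognise, so a production version would probably need a mapping table rather than a silent fallback.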

@djtfmartin, if this analysis is correct, shall I file a separate enhancement issue so we can continue with this in the future?

djtfmartin · 2020-01-08T09:38:21Z

Thanks @vjrj - yes, that's all correct.

We should add support for CPG files. I didn't have time to tackle it this time round, but it should now be easier for someone to add. For this issue I developed some docker-compose files to help a developer get up and running with a development environment (PostGIS DB, geoserver).

The changes will be required in the layers-store (https://github.com/AtlasOfLivingAustralia/layers-store) library, which has the code for reading the DBF files, and in spatial-service, to make use of the charset param that you've pointed out.

vjrj · 2020-01-08T21:14:47Z

Thanks for the detailed description of the next steps, @djtfmartin .

I added this to a list of pending internationalization issues, with future Taiwanese and Russian nodes in mind:
https://github.com/AtlasOfLivingAustralia/documentation/wiki/Known-issues-in-LA-Internationalization
An i18n label would be better, but this simple list is enough for now.

nickdos · 2020-04-21T04:25:58Z

@djtfmartin this issue is still in biocache-store - should it be in spatial-service repo?

djtfmartin · 2020-04-21T09:25:23Z

@nickdos yes, I think so. It's essentially an enhancement now to add support for CPG files.

nickdos · 2020-04-21T22:52:37Z

Issue moved to AtlasOfLivingAustralia/spatial-service #148 via ZenHub

djtfmartin referenced this issue in AtlasOfLivingAustralia/spatial-service Nov 19, 2019

Docker files for local development setup and attempt to fix encoding …

5ac6fe1

…issues for SHP ingestion reported https://github.com/AtlasOfLivingAustralia/biocache-store/issues/353

djtfmartin transferred this issue from AtlasOfLivingAustralia/biocache-store Nov 21, 2019

djtfmartin self-assigned this Nov 21, 2019

peggynewman transferred this issue from AtlasOfLivingAustralia/spatial-service Mar 31, 2020

nickdos mentioned this issue Apr 21, 2020

Issue with character encoding during ingestion of data AtlasOfLivingAustralia/spatial-service#148

Open

nickdos closed this as completed Apr 21, 2020
