Issue with character encoding during ingestion of data #148

nickdos · 2020-04-21T22:52:34Z

@shahmanash commented on Thu Nov 14 2019

Special characters (Swedish characters) seem to be handled properly during the "loading" phase; for example, Swedish characters in the field "stateProvince" are written to Cassandra correctly.
However, during the "sampling" process, special characters do not seem to be handled properly and are written to the Cassandra database garbled.
This error then propagates to the SOLR index, breaking search functionality in other apps such as regions and spatial-service.

For example, in the following record, the "stateProvince" field is displayed correctly but in the "Additional political boundaries information", the encoding issue appears:

https://beta.bioatlas.se/ala-hub/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89

The same record in raw format:

https://beta.bioatlas.se/biocache-service/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89

The spatial processing of the layers has been done correctly (in terms of character encoding): the values appear correctly in the PostGIS database, GeoServer and the spatial-service, as can be seen here:

https://beta.bioatlas.se/spatial-service/field/cl10039
https://beta.bioatlas.se/spatial-service/objects/cl10039

https://beta.bioatlas.se/spatial-service/field/cl10064
https://beta.bioatlas.se/spatial-service/objects/cl10064

However, on looking into Cassandra, the encoding is broken for the processed contextual layer field "cl_p" but not for the "stateProvince" field.

cqlsh:occ> select "stateProvince",cl_p from occ where rowkey='2ab38688-d6f4-440d-acce-83ffdc2d3e89';

 stateProvince | cl_p
---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Gästrikland | {"cl10038":"Sweden","cl10068":"Kuperad sydlig boreal (S, T, U, W, X, y)","cl10042":"LA1646","cl10039":"G�vleborgs","cl10053":"Baltic Sea","cl10064":"H�gmosse-region","cl10058":"Boreal biogeografisk region","cl10041":"45","cl10087":"Sweden","cl10052":"4329","cl10040":"Hofors"}

(1 rows)

It might be safe to say that the issue is introduced during the sampling stage.
The biocache-cli version is 2.4.4.

Any suggestions on how to deal with this would be very welcome.

@shahmanash commented on Sat Nov 16 2019

The JSON response from spatial-service/intersect is well-formed, as can be seen here:

https://beta.bioatlas.se/spatial-service/intersect/cl10040/57.7/11.96667
[{"field":"cl10040","description":"null","layername":"Kommuner","pid":"187","value":"Göteborg"}]

https://beta.bioatlas.se/spatial-service/intersect/cl10039/57.7/11.96667
[{"field":"cl10039","description":"null","layername":"Län","pid":"13","value":"Västra Götalands"}]

However, the sample.csv file created in the /intersect/batch folder of spatial-service contains the malformed characters. I guess this file is the source of the sampling-drID.txt file in the tmp folder of biocache.
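A quick way to confirm which file first carries the damage is to scan its bytes for lines that are either not valid UTF-8 or already contain the U+FFFD replacement character. The following is only a minimal sketch: the function name `suspicious_lines` and the CSV layout of the sample rows are assumptions made for illustration, not the actual format of sample.csv.

```python
REPLACEMENT = "\ufffd"  # what a lenient decoder inserts for undecodable bytes


def suspicious_lines(data: bytes):
    """Yield (line_number, reason) for lines that look mis-encoded."""
    for number, raw in enumerate(data.splitlines(), start=1):
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            yield number, "not valid UTF-8"
            continue
        if REPLACEMENT in text:
            yield number, "contains U+FFFD"


# Hypothetical rows; the column layout is assumed, not taken from a real file.
sample = (
    b"latitude,longitude,cl10039\n"
    b"57.7,11.96667,V\xe4stra G\xf6talands\n"          # raw ISO-8859-1 bytes
    b"60.6,16.8,G\xef\xbf\xbdvleborgs\n"               # U+FFFD already written out
    b"57.7,11.96667,V\xc3\xa4stra G\xc3\xb6talands\n"  # correct UTF-8
)

report = list(suspicious_lines(sample))
for number, reason in report:
    print(number, reason)
```

Running the same check against the real sample.csv and sampling-drID.txt (read with `open(path, "rb").read()`) would show whether the bytes are still ISO-8859-1 at that point or have already been replaced.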

Any suggestions?

@djtfmartin commented on Thu Nov 21 2019

Moving to spatial-service as I think the problem lies there.

@shahmanash commented on Thu Nov 21 2019

The problem was that the .dbf files in zipped shapefiles are normally NOT encoded in UTF-8.
During the batch/intersect call in the sampling process, the .dbf file is read, and this is when the garbled characters get written to the sampling file, which is then written to the Cassandra DB and on to SOLR.
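The effect can be reproduced in isolation. The sketch below assumes the .dbf stores the value in ISO-8859-1 and that the reader decodes it leniently as UTF-8; that is one plausible mechanism consistent with the `cl_p` output above, not a trace of the actual layers-store code.

```python
# Value taken from the cl10039 layer discussed above.
original = "Gävleborgs"

# In ISO-8859-1, "ä" is stored as the single byte 0xE4.
dbf_bytes = original.encode("iso-8859-1")

# Decoding those bytes as UTF-8 cannot succeed for 0xE4, so a lenient
# decoder substitutes U+FFFD (the replacement character).
garbled = dbf_bytes.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were written in is lossless.
repaired = dbf_bytes.decode("iso-8859-1")

print(garbled)
print(repaired)
```

Once U+FFFD has been written to the sampling file the original byte is gone, so the decode has to be corrected where the .dbf is read rather than downstream in Cassandra or SOLR.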

The changes in the following commits, in the spatial-service and layers-store modules respectively, address the encoding problem in the sampling file produced by "batch/intersect" during the sampling process.

3b2c171

AtlasOfLivingAustralia/layers-store@c0c2f72

The sampling files generated using the artifacts after the aforementioned commits do not have the garbled characters.

@djtfmartin commented on Thu Nov 21 2019

thanks @shahmanash

@djtfmartin commented on Wed Nov 27 2019

fixed and released in 0.3.1

@vjrj commented on Wed Jan 08 2020

After this issue and some problems with UTF-8 layers reported by Austria in Slack, I've been testing spatial-hub and spatial-service (both 0.3.1).

This is what I have found so far. Correct me if I'm wrong:

Charset encoding in shapefiles is mainly the charset of the DBF files, where the field data is stored.
These DBF files were originally restricted to ISO-8859-1, but nowadays this restriction is not very realistic (especially if you want to use languages with non-Latin characters).
The charset of a DBF file is declared in the accompanying CPG file.
ALA's spatial currently doesn't take these CPG files into account, so IMHO it processes all shapefiles (i.e. their DBF files) as ISO-8859-1. I think this because spatial doesn't use the charset param in the layer PUT call to GeoServer.
So in the GeoServer vector data stores, the charset field is not configured (and is therefore interpreted as ISO-8859-1 by default).
So currently, in ALA, only ISO-8859-1 DBF files (and layers) are processed correctly.
If I understand Dave's latest commits correctly, they convert this ISO-8859-1 field data to UTF-8 so that it is sampled and persisted in Cassandra and SOLR with the correct characters.

So for now, if we use ISO-8859-1 layers, after this fix, the data should be sampled correctly.

The only problem I see is that we currently cannot use layers with non-Latin characters (such as Chinese or Cyrillic).

So maybe we should take the charset defined in the CPG files into account when uploading layers, and store it correctly in GeoServer. Later, during sampling, we would convert to UTF-8 (or not) depending on this charset.
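As a rough illustration of that idea, the charset lookup could work along these lines. This is a sketch only: `dbf_charset` is a hypothetical helper, it assumes the CPG file sits next to the .shp with a lowercase `.cpg` extension and contains a single charset name that Python's codec registry recognises, and it falls back to ISO-8859-1 otherwise. It is not how layers-store or spatial-service currently behave.

```python
import codecs
import tempfile
from pathlib import Path

DEFAULT_DBF_CHARSET = "iso-8859-1"  # assumed fallback, matching the current behaviour


def dbf_charset(shp_path, default=DEFAULT_DBF_CHARSET):
    """Return the charset declared in the sibling .cpg file, or the default."""
    cpg = Path(shp_path).with_suffix(".cpg")
    if not cpg.is_file():
        return default
    declared = cpg.read_text(encoding="ascii", errors="ignore").strip()
    if not declared:
        return default
    try:
        # Normalise the name and reject charsets Python does not know.
        return codecs.lookup(declared).name
    except LookupError:
        return default


# Small demonstration with a throwaway directory (no real shapefile needed).
with tempfile.TemporaryDirectory() as tmp:
    shp = Path(tmp) / "layer.shp"
    without_cpg = dbf_charset(shp)
    shp.with_suffix(".cpg").write_text("UTF-8\n", encoding="ascii")
    with_cpg = dbf_charset(shp)

print(without_cpg, with_cpg)
```

The returned name would then be used both when decoding the DBF field values for sampling and as the charset passed to GeoServer when the layer is uploaded.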

@djtfmartin, if this analysis is correct, shall I file another enhancement issue so we can continue with this in the future?

@djtfmartin commented on Wed Jan 08 2020

Thanks @vjrj - yes, that's all correct.

We should add support for CPG files. I didn't have time to tackle CPG support this time round, but it should now be easier for someone to try to add it. For this issue I developed some docker-compose files to help a developer get up and running with a development environment (PostGIS DB, GeoServer).

The changes will be required in the layers-store (https://github.com/AtlasOfLivingAustralia/layers-store) library, which has the code for reading the DBF files, and in spatial-service, to make use of the charset param that you've pointed out.

@vjrj commented on Thu Jan 09 2020

Thanks for the detailed description of the next steps, @djtfmartin.

I added this to a list of pending internationalization issues, with future Taiwanese and Russian nodes in mind:
https://github.com/AtlasOfLivingAustralia/documentation/wiki/Known-issues-in-LA-Internationalization
An i18n label would be better, but this simple list is enough for now.

@nickdos commented on Tue Apr 21 2020

@djtfmartin this issue is still in biocache-store - should it be in the spatial-service repo?

@djtfmartin commented on Tue Apr 21 2020

@nickdos yes, I think so. It's essentially an enhancement now to add support for CPG files.

nickdos mentioned this issue Apr 21, 2020

Issue with character encoding during ingestion of data AtlasOfLivingAustralia/biocache-store#369

Closed
