Issue with character encoding during ingestion of data #148

Open
nickdos opened this issue Apr 21, 2020 · 0 comments

nickdos commented Apr 21, 2020

@shahmanash commented on Thu Nov 14 2019

Special characters (for example Swedish characters) seem to be handled properly during the "loading" phase: Swedish characters in the "stateProvince" field are written to Cassandra correctly.
However, during the "sampling" process, special characters are not handled properly and are written to the Cassandra database garbled.
This error then propagates to the SOLR index, breaking search functionality in other apps such as regions and spatial-service.

For example, in the following record the "stateProvince" field is displayed correctly, but the encoding issue appears in the "Additional political boundaries information" section:

https://beta.bioatlas.se/ala-hub/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89

The same record in raw format:

https://beta.bioatlas.se/biocache-service/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89

The spatial processing of the layers has been done correctly (in terms of character encoding): the values appear correctly in the PostGIS database, GeoServer and the spatial-service, as can be seen here:

https://beta.bioatlas.se/spatial-service/field/cl10039
https://beta.bioatlas.se/spatial-service/objects/cl10039

https://beta.bioatlas.se/spatial-service/field/cl10064
https://beta.bioatlas.se/spatial-service/objects/cl10064

However, looking into Cassandra, the encoding is broken for the processed contextual layer field "cl_p" but not for the "stateProvince" field.

cqlsh:occ> select "stateProvince",cl_p from occ where rowkey='2ab38688-d6f4-440d-acce-83ffdc2d3e89';

 stateProvince | cl_p
---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   Gästrikland | {"cl10038":"Sweden","cl10068":"Kuperad sydlig boreal (S, T, U, W, X, y)","cl10042":"LA1646","cl10039":"G�vleborgs","cl10053":"Baltic Sea","cl10064":"H�gmosse-region","cl10058":"Boreal biogeografisk region","cl10041":"45","cl10087":"Sweden","cl10052":"4329","cl10040":"Hofors"}

(1 rows)
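
A quick way to list which sampled layers are affected in a row like the one above is to look for U+FFFD (the "�" replacement character) in the cl_p JSON. This is only a minimal sketch: the helper name `find_garbled_fields` is hypothetical, and the input is a shortened excerpt of the cl_p value shown above, not output of any biocache tool.

```python
import json

REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, rendered as "�"


def find_garbled_fields(cl_p_json: str) -> dict:
    """Return the layer fields whose sampled value contains U+FFFD."""
    values = json.loads(cl_p_json)
    return {
        field: value
        for field, value in values.items()
        if isinstance(value, str) and REPLACEMENT_CHAR in value
    }


# Shortened excerpt of the cl_p value from the row above.
# "\\ufffd" is a JSON escape that json.loads turns into U+FFFD.
cl_p = (
    '{"cl10038":"Sweden","cl10039":"G\\ufffdvleborgs",'
    '"cl10064":"H\\ufffdgmosse-region","cl10040":"Hofors"}'
)

for field in sorted(find_garbled_fields(cl_p)):
    print(field)
```

For this excerpt the affected fields are cl10039 and cl10064, which matches what is visible in the row.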

It seems safe to say that the issue is introduced during the sampling stage.
The biocache-cli version is 2.4.4.

Any suggestions on how to deal with this would be very welcome.


@shahmanash commented on Sat Nov 16 2019

The json response from spatial-service/intersect is well-formed as can be seen here:

https://beta.bioatlas.se/spatial-service/intersect/cl10040/57.7/11.96667
[{"field":"cl10040","description":"null","layername":"Kommuner","pid":"187","value":"Göteborg"}]

https://beta.bioatlas.se/spatial-service/intersect/cl10039/57.7/11.96667
[{"field":"cl10039","description":"null","layername":"Län","pid":"13","value":"Västra Götalands"}]

However, the sample.csv file created in the /intersect/batch folder of spatial-service contains the malformed characters. I guess this file is the source of the sampling-drID.txt file in the tmp folder of biocache.

Any suggestions?


@djtfmartin commented on Thu Nov 21 2019

Moving to spatial-service as I think the problem lies there.


@shahmanash commented on Thu Nov 21 2019

The problem was that the .dbf files in zipped shapefiles are normally NOT encoded in UTF-8.
During the batch/intersect call in the sampling process, the .dbf file is read, and this is when the garbled characters are written to the sampling file, which is then written to the Cassandra DB and then to SOLR.
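
The effect can be reproduced in a few lines. This is a sketch, not the layers-store code: it assumes the .dbf attribute bytes are ISO-8859-1 and that the reader decodes them as UTF-8 with replacement of invalid bytes. The helper `decode_dbf_value` is hypothetical.

```python
def decode_dbf_value(raw: bytes, encoding: str) -> str:
    """Decode raw attribute bytes, replacing undecodable bytes with U+FFFD."""
    return raw.decode(encoding, errors="replace")


# Assumption: the value is stored in the .dbf file as ISO-8859-1,
# so "ä" is the single byte 0xE4.
dbf_bytes = "Gävleborgs".encode("iso-8859-1")

# 0xE4 followed by "v" is not a valid UTF-8 sequence, so it becomes U+FFFD.
wrong = decode_dbf_value(dbf_bytes, "utf-8")
# Decoding with the charset the bytes were written in gives the original text.
right = decode_dbf_value(dbf_bytes, "iso-8859-1")

print(wrong)
print(right)

# The replacement is lossy: the original byte cannot be recovered afterwards,
# so the fix has to happen where the .dbf file is read.
print(wrong.encode("utf-8") == dbf_bytes)
```

Under these assumptions the wrong decoding produces the same "G�vleborgs" value seen in Cassandra.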

The changes from the following commits in the spatial-service module and the layers-store modules respectively deal with the problem of encoding in the sampling file produced by "batch/intersect" during the sampling process.

3b2c171

AtlasOfLivingAustralia/layers-store@c0c2f72

The sampling files generated using the artifacts after the aforementioned commits do not have the garbled characters.


@djtfmartin commented on Thu Nov 21 2019

thanks @shahmanash


@djtfmartin commented on Wed Nov 27 2019

fixed and released in 0.3.1


@vjrj commented on Wed Jan 08 2020

After this issue and some problems with UTF-8 layers reported by Austria on Slack, I've been testing spatial-hub and spatial-service (both 0.3.1).

This is what I have found so far. Correct me if I'm wrong:

So for now, if we use ISO-8859-1 layers, after this fix, the data should be sampled correctly.

The only problem I see is that we currently cannot use layers with non-Latin characters (Chinese, Cyrillic, etc.).

So maybe we should take into account the charset defined in the CPG files when uploading layers, and store them correctly in GeoServer. Later, when we need the values during sampling, we would convert (or not) to UTF-8 depending on this charset.
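
The CPG idea could look roughly like this. It is a sketch only and not the layers-store or spatial-service implementation: it assumes the .cpg sidecar sits next to the .dbf file with the same base name and contains a single charset name, and it assumes ISO-8859-1 as the fallback. The helper `dbf_encoding` is hypothetical.

```python
import codecs
import tempfile
from pathlib import Path

DEFAULT_DBF_ENCODING = "iso-8859-1"  # assumed fallback when no usable .cpg exists


def dbf_encoding(dbf_path: Path, default: str = DEFAULT_DBF_ENCODING) -> str:
    """Return the codec declared in the .cpg sidecar, or the default."""
    cpg_path = dbf_path.with_suffix(".cpg")
    if not cpg_path.is_file():
        return default
    declared = cpg_path.read_text(encoding="ascii", errors="ignore").strip()
    if not declared:
        return default
    try:
        # Normalise the declared name to a codec Python knows about.
        return codecs.lookup(declared).name
    except LookupError:
        # Charset names Python does not recognise fall back to the default.
        return default


with tempfile.TemporaryDirectory() as tmp:
    dbf = Path(tmp) / "layer.dbf"
    print(dbf_encoding(dbf))  # no .cpg file present, so the default is used
    dbf.with_suffix(".cpg").write_text("UTF-8\n", encoding="ascii")
    print(dbf_encoding(dbf))  # charset taken from the .cpg file
```

The returned name would then be passed to whatever decodes the .dbf attribute values, so that UTF-8 layers and ISO-8859-1 layers are both read with their own charset.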

@djtfmartin, if I'm correct with this analysis, shall I file another enhancement issue so we can continue with this in the future?


@djtfmartin commented on Wed Jan 08 2020

Thanks @vjrj - yes, that's all correct.

We should add support for CPG files. I didn't have time to tackle CPG support this time round, but it should now be easier for someone to add it. For this issue I developed some docker-compose files to help a developer get up and running with a development environment (PostGIS DB, GeoServer).

The changes will be required in the layers-store (https://github.com/AtlasOfLivingAustralia/layers-store) library which has the code for reading the DBF files and in spatial-service to make use of the charset param that you've pointed out.


@vjrj commented on Thu Jan 09 2020

Thanks for the detailed description of the next steps, @djtfmartin.

I added this to a list of pending internationalization issues, with future Taiwanese and Russian nodes in mind:
https://github.com/AtlasOfLivingAustralia/documentation/wiki/Known-issues-in-LA-Internationalization
An i18n label would be better, but this simple list is enough for now.


@nickdos commented on Tue Apr 21 2020

@djtfmartin this issue is still in biocache-store - should it be in the spatial-service repo?


@djtfmartin commented on Tue Apr 21 2020

@nickdos yes, I think so. It's essentially an enhancement now to add support for CPG files.
