Issue with character encoding during ingestion of data #369
Comments
The JSON response from spatial-service/intersect is well-formed, as can be seen here: https://beta.bioatlas.se/spatial-service/intersect/cl10040/57.7/11.96667 https://beta.bioatlas.se/spatial-service/intersect/cl10039/57.7/11.96667 However, the sample.csv file created in the /intersect/batch folder of spatial-service contains the malformed characters. I guess this file is the source of the sampling-drID.txt file in the tmp folder of biocache. Any suggestions?
…issues for SHP ingestion reported https://github.com/AtlasOfLivingAustralia/biocache-store/issues/353
Moving to spatial-service as I think the problem lies there.
The problem was that the .dbf files in zipped shapefiles are normally NOT encoded in UTF-8. The following commits in the spatial-service and layers-store modules, respectively, deal with the encoding problem in the sampling file produced by "batch/intersect" during the sampling process: AtlasOfLivingAustralia/spatial-service@3b2c171 AtlasOfLivingAustralia/layers-store@c0c2f72 Sampling files generated with artifacts built after these commits do not contain the garbled characters.
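For anyone wanting to reproduce the symptom outside the pipeline, here is a minimal sketch of what happens when non-UTF-8 .dbf bytes are decoded as UTF-8. It assumes the .dbf text is ISO-8859-1, which is only a guess at the layers in question; the place name is just an illustrative value, and this is not the code from the commits above.

```python
# Illustrative value with Swedish characters (not taken from the actual layer).
name = "Västra Götaland"

# Assumption: the .dbf stores its text as ISO-8859-1 rather than UTF-8.
dbf_bytes = name.encode("iso-8859-1")

# Decoding those bytes as UTF-8 cannot work: 0xE4 and 0xF6 are not valid
# UTF-8 sequences here, so each one becomes the replacement character U+FFFD.
garbled = dbf_bytes.decode("utf-8", errors="replace")

# Decoding with the charset the bytes were written in restores the text.
restored = dbf_bytes.decode("iso-8859-1")

print(garbled)
print(restored)
```

The same bytes give either garbled or correct output depending only on the charset passed to the decoder, which is why the fix had to be made where the .dbf is read.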
Thanks @shahmanash
Fixed and released in 0.3.1
After this issue and some problems with UTF-8 layers reported by Austria in Slack, I've been testing spatial-hub and service (both This is what I have investigated so far. Correct me if I'm wrong:
So for now, if we use The only problem I see is that we currently cannot use layers with non-Latin characters (such as Chinese or Cyrillic). So maybe we should take into account the charset defined in the @djtfmartin, if I'm correct with this analysis, shall I file another enhancement issue so this can be continued in the future?
Thanks @vjrj - yes, that's all correct. We should add support for CPG files. I didn't have time to tackle CPG support this time round, but it should now be easier for someone to try to add it. For this issue I developed some docker-compose files to help a developer get up and running with a development environment (PostGIS DB, GeoServer). The changes will be required in the layers-store (https://github.com/AtlasOfLivingAustralia/layers-store) library, which has the code for reading the DBF files, and in spatial-service, to make use of the charset param that you've pointed out.
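As a rough illustration of what CPG support could look like, the sketch below picks the charset for a shapefile's .dbf from a sibling .cpg file and falls back to a default when the file is missing or names an unknown encoding. The function name `dbf_charset` and the ISO-8859-1 default are hypothetical; this is not the layers-store or spatial-service API, and real .cpg files may contain values that need extra mapping.

```python
import codecs
from pathlib import Path


def dbf_charset(shp_path, default="iso-8859-1"):
    """Return the charset declared in the sibling .cpg file, or `default`.

    Hypothetical helper: the default value is an assumption, not the
    behaviour of layers-store.
    """
    cpg = Path(shp_path).with_suffix(".cpg")
    try:
        declared = cpg.read_text(encoding="ascii", errors="ignore").strip()
    except OSError:
        # No .cpg file next to the shapefile (or it cannot be read).
        return default
    if not declared:
        return default
    try:
        # Normalise to a codec name the runtime actually knows.
        return codecs.lookup(declared).name
    except LookupError:
        # The .cpg names an encoding this runtime does not recognise.
        return default
```

The returned name could then be handed to whatever reads the .dbf, so a layer shipped with a .cpg declaring UTF-8 would be decoded as UTF-8 while older layers keep the current behaviour.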
Thanks for the detailed description of the next steps, @djtfmartin. I added this to a list of pending internationalization issues, with future Taiwanese and Russian nodes in mind:
@djtfmartin this issue is still in biocache-store. Should it be in the spatial-service repo?
@nickdos yes, I think so. It's essentially an enhancement now to add support for CPG files.
Issue moved to AtlasOfLivingAustralia/spatial-service #148 via ZenHub |
Special characters (Swedish characters) seem to be handled properly during the "loading" phase; for example, Swedish characters in the field "stateProvince" are written to Cassandra correctly.
However, during the "sampling" process, special characters do not seem to be handled properly and are written incorrectly to the Cassandra database.
This error then propagates to the SOLR index, breaking search functionality in other apps such as regions and spatial-service.
For example, in the following record the "stateProvince" field is displayed correctly, but the encoding issue appears under "Additional political boundaries information":
https://beta.bioatlas.se/ala-hub/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89
The same record in raw format:
https://beta.bioatlas.se/biocache-service/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89
The spatial processing of the layers has been done correctly (in terms of character encoding), as they appear correctly in the PostGIS database, GeoServer and spatial-service, as can be seen here:
https://beta.bioatlas.se/spatial-service/field/cl10039
https://beta.bioatlas.se/spatial-service/objects/cl10039
https://beta.bioatlas.se/spatial-service/field/cl10064
https://beta.bioatlas.se/spatial-service/objects/cl10064
However, looking into Cassandra, the encoding is broken for the processed contextual layer field "cl_p" but not for the stateProvince field.
It might be safe to say that the issue is introduced during sampling stage.
The biocache-cli version is 2.4.4
Any suggestions on how to deal with this would be very welcome.