@shahmanash commented on Thu Nov 14 2019
Special characters (Swedish characters) seem to be handled properly during the "loading" phase; for example, Swedish characters in the "stateProvince" field are written to Cassandra correctly.
However, during the "sampling" process, special characters do not seem to be handled and written properly to the Cassandra database.
This error then propagates to the SOLR index, breaking search functionality in other apps such as regions, spatial-service, etc.
For example, in the following record, the "stateProvince" field is displayed correctly, but in the "Additional political boundaries information" section the encoding issue appears:
https://beta.bioatlas.se/ala-hub/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89
The same record in raw format:
https://beta.bioatlas.se/biocache-service/occurrences/2ab38688-d6f4-440d-acce-83ffdc2d3e89
The spatial processing of the layers has been done correctly (in terms of character encoding), as the values appear correctly in the PostGIS database, GeoServer and the spatial-service, as can be seen here:
https://beta.bioatlas.se/spatial-service/field/cl10039
https://beta.bioatlas.se/spatial-service/objects/cl10039
https://beta.bioatlas.se/spatial-service/field/cl10064
https://beta.bioatlas.se/spatial-service/objects/cl10064
However, on looking into Cassandra, the encoding is broken for the processed contextual layer field "cl_p" but not for the stateProvince field.
It might be safe to say that the issue is introduced during the sampling stage.
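For reference, this kind of mojibake is exactly what appears when ISO-8859-1 bytes are decoded as UTF-8. A minimal Python sketch (using "Göteborg", one of the values from the records above, purely for illustration):

```python
# A Swedish place name encoded as ISO-8859-1 (the usual DBF encoding).
raw = "Göteborg".encode("iso-8859-1")       # b'G\xf6teborg'

# Decoding those bytes as UTF-8 fails on the 0xF6 byte and yields a
# replacement character -- the kind of garbled value seen in cl_p.
garbled = raw.decode("utf-8", errors="replace")
print(garbled)                              # G�teborg

# Decoding with the correct charset recovers the original text.
print(raw.decode("iso-8859-1"))             # Göteborg
```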
The biocache-cli version is 2.4.4
Any suggestions on how to deal with this would be very much welcome.
@shahmanash commented on Sat Nov 16 2019
The json response from spatial-service/intersect is well-formed as can be seen here:
https://beta.bioatlas.se/spatial-service/intersect/cl10040/57.7/11.96667
[{"field":"cl10040","description":"null","layername":"Kommuner","pid":"187","value":"Göteborg"}]
https://beta.bioatlas.se/spatial-service/intersect/cl10039/57.7/11.96667
[{"field":"cl10039","description":"null","layername":"Län","pid":"13","value":"Västra Götalands"}]
However, the sample.csv file created in the /intersect/batch folder of spatial-service contains the malformed characters. I guess this file is the source of the sampling-drID.txt file in the tmp folder of biocache.
Any suggestions?
@djtfmartin commented on Thu Nov 21 2019
Moving to spatial-service as I think the problem lies there.
@shahmanash commented on Thu Nov 21 2019
The problem was that the .dbf files inside zipped shapefiles are normally NOT encoded in UTF-8.
During the batch/intersect call in the sampling process, the dbf file is read, and this is when the garbled characters get written to the sampling file, which then gets written to the Cassandra DB and then to SOLR.
The changes from the following commits, in the spatial-service module and the layers-store module respectively, deal with the problem of encoding in the sampling file produced by "batch/intersect" during the sampling process.
3b2c171
AtlasOfLivingAustralia/layers-store@c0c2f72
The sampling files generated using the artifacts after the aforementioned commits do not have the garbled characters.
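As an illustration of what the fix amounts to (this is not the actual layers-store code, which is Java): decode the raw DBF field bytes with the file's real charset before the value is written to the UTF-8 sampling CSV. A hypothetical Python sketch, where `decode_dbf_field` and the sample bytes are assumptions for illustration:

```python
import csv
import io

def decode_dbf_field(raw: bytes, charset: str = "iso-8859-1") -> str:
    """Decode one fixed-width DBF character field.

    DBF character fields are space-padded byte strings; decoding them
    with the wrong charset is what garbles the sampling file.
    (Hypothetical helper -- the real fix lives in layers-store, in Java.)
    """
    return raw.decode(charset).rstrip()

# Simulated raw bytes of an 8-byte DBF field holding "Län" in ISO-8859-1.
raw_field = b"L\xe4n     "

# Wrong: treating the bytes as UTF-8 garbles the value.
garbled = raw_field.decode("utf-8", errors="replace").rstrip()

# Right: decode with the DBF's charset, then write the CSV as UTF-8 text.
value = decode_dbf_field(raw_field)
buf = io.StringIO()
csv.writer(buf).writerow(["cl10039", value])
print(buf.getvalue().strip())              # cl10039,Län
```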
@djtfmartin commented on Thu Nov 21 2019
thanks @shahmanash
@djtfmartin commented on Wed Nov 27 2019
fixed and released in 0.3.1
@vjrj commented on Wed Jan 08 2020
After this issue, and some problems with UTF-8 layers reported by Austria in Slack, I've been testing spatial-hub and spatial-service (both 0.3.1). This is what I have investigated so far. Correct me if I'm wrong:

- Shapefiles store their field data in `DBF` files.
- `DBF` files were originally restricted to ISO-8859-1, but nowadays this restriction is not very realistic (especially if you want to use some non-Latin-character languages).
- The charset of a `DBF` is defined in the `CPG` files.
- Spatial-service ignores the `CPG` files, so IMHO it processes all shapefiles (aka `DBF`) as `ISO-8859-1`. I think this because spatial doesn't use the charset param in the layer PUT call to GeoServer.
- So nowadays, in ALA, only `ISO-8859-1` `DBF` files (and layers) are processed correctly.
- If I understand Dave's last commits correctly, he converts these `ISO-8859-1` field data to `UTF-8` so that they are correctly sampled and persisted in `cassandra` and `solr` with the correct characters.
- So for now, if we use `ISO-8859-1` layers, after this fix, the data should be sampled correctly.
- The only problem I see is that we currently cannot use layers with non-Latin characters (like Chinese or Cyrillic).
- So maybe we should take into account the charset defined in the `CPG` files when uploading layers, store it correctly in GeoServer, and later convert (or not) to `UTF-8` depending on this charset when we need it during sampling.

@djtfmartin, if I'm correct with this analysis, shall I file another enhancement issue so this can be continued in the future?
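To make the `CPG` idea concrete: the sidecar .cpg file of a shapefile, when present, holds a single line naming the DBF's codepage. A hedged Python sketch of reading it, where the function name, the alias table and the ISO-8859-1 fallback are all assumptions for illustration (the real change would be in Java, in layers-store):

```python
from pathlib import Path

# Map a few common .cpg spellings to codec names.
# (Illustrative table, not an exhaustive ESRI codepage list.)
_CPG_ALIASES = {
    "UTF-8": "utf-8",
    "UTF8": "utf-8",
    "ISO-8859-1": "iso-8859-1",
    "88591": "iso-8859-1",
}

def shapefile_charset(shp_path: str, default: str = "iso-8859-1") -> str:
    """Return the charset declared by a shapefile's .cpg sidecar.

    Falls back to ISO-8859-1 (the historical DBF assumption) when no
    .cpg file exists, preserving today's behaviour for old layers.
    """
    cpg = Path(shp_path).with_suffix(".cpg")
    if cpg.is_file():
        name = cpg.read_text(encoding="ascii", errors="ignore").strip().upper()
        return _CPG_ALIASES.get(name, default)
    return default
```

With something like this, the DBF reader could decode each field with the declared charset instead of a hard-coded ISO-8859-1, which would let Chinese or Cyrillic layers work too.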
@djtfmartin commented on Wed Jan 08 2020
Thanks @vjrj - yes, that's all correct.
We should add support for CPG files. I didn't have time to tackle CPG support this time round, but it should now be easier for someone to add. For this issue I developed some docker-compose files to help a developer get up and running with a development environment (PostGIS DB, GeoServer).
The changes will be required in the layers-store library (https://github.com/AtlasOfLivingAustralia/layers-store), which has the code for reading the DBF files, and in spatial-service, to make use of the charset param that you've pointed out.
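On the spatial-service side, GeoServer's REST shapefile upload endpoint accepts a `charset` query parameter telling it how the DBF is encoded. A hypothetical sketch of building such a request URL in Python (the helper name, base URL, workspace and store names are illustrative, not spatial-service code):

```python
from urllib.parse import urlencode

def shapefile_upload_url(base: str, workspace: str, store: str,
                         charset: str = "UTF-8") -> str:
    """Build a GeoServer REST URL for uploading a zipped shapefile.

    GeoServer's file upload endpoint takes `charset` (DBF encoding) and
    `configure` query parameters; the actual PUT of the zip body is
    omitted here. (Hypothetical helper for illustration only.)
    """
    query = urlencode({"configure": "all", "charset": charset})
    return f"{base}/rest/workspaces/{workspace}/datastores/{store}/file.shp?{query}"

print(shapefile_upload_url("https://geoserver.example.org/geoserver",
                           "ALA", "cl10039", "ISO-8859-1"))
```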
@vjrj commented on Thu Jan 09 2020
Thanks for the detailed description of the next steps, @djtfmartin .
I added this to a list of pending internationalization issues, thinking of future Taiwanese and Russian nodes:
https://github.com/AtlasOfLivingAustralia/documentation/wiki/Known-issues-in-LA-Internationalization
An i18n label would be better, but this simple list is enough for now.
@nickdos commented on Tue Apr 21 2020
@djtfmartin this issue is still in biocache-store - should it be in spatial-service repo?
@djtfmartin commented on Tue Apr 21 2020
@nickdos yes, I think so. It's essentially an enhancement now to add support for CPG files.