Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Datasource *sometimes* adds null attribute to a feature when it lacks an attribute another feature has #668

Open
davidtheclark opened this issue Jul 11, 2016 · 23 comments
Assignees

Comments

@davidtheclark
Copy link

I'm noticing this oddity when I create a Datasource and iterate through the features.

Let's say feature FA in my GeoJSON has one attribute: foo: 1. And FB has one attribute: bar: 2. When I create a Datasource and check these features within that structure, FA might have two attributes now: foo: 1, bar: null; and same for FB: bar: 2, foo: null.

So it might seem that Mapnik is giving each feature all the attributes available but with null as the value if the feature did not actually have that feature in the original data.

But this is not happening consistently. In some sample data I've been using, I've found that Mapnik adds the null attributes only sometimes. Whether or not this happens does not seem connected with data type, or the number of times the attribute shows up.

I'm not sure what to make of this, or whether I should file this upstream in Mapnik itself.

@davidtheclark
Copy link
Author

More info: I'm iterating through the datasource's features according to the example in the docs:

var datasource = new mapnik.Datasource(datasourceOptions);
var features = datasource.featureset();
var feature = features.next();
while (feature) {..}

Also, here's the sample GeoJSON that is producing these odd results:

{
  "type": "FeatureCollection",
  "features": [
    {
      "type": "Feature",
      "properties": {
        "name": "foo",
        "power": 6,
        "mixed": 7,
        "numbers": [1, 2, 3],
        "astonishing": true
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          84.72656249999999,
          55.7765730186677
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "bar",
        "mixed": "7",
        "power": 99
      },
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [
            84.462890625,
            55.32914440840507
          ],
          [
            86.0009765625,
            55.751849391735284
          ],
          [
            86.044921875,
            56.04749958329888
          ]
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "baz"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          83.49609375,
          56.48676175249086
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "haha"
      },
      "geometry": {
        "type": "Point",
        "coordinates": [
          85.341796875,
          56.8249328650072
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "hoho",
        "numbers": [1, 2, 3],
        "data": 6
      },
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [
            85.8251953125,
            55.50374985927514
          ],
          [
            86.4404296875,
            55.677584411089505
          ],
          [
            86.7919921875,
            55.65279803318956
          ]
        ]
      }
    },
    {
      "type": "Feature",
      "properties": {
        "name": "hehe",
        "power": 86,
        "data": {
          "foo": "bar"
        },
        "onlyNull": null,
        "description": "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum."
      },
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [
              80.947265625,
              57.397624055000456
            ],
            [
              80.5078125,
              56.145549500679074
            ],
            [
              81.650390625,
              56.145549500679074
            ],
            [
              81.650390625,
              56.559482483762245
            ],
            [
              80.947265625,
              57.397624055000456
            ]
          ]
        ]
      }
    }
  ]
}

Attributes that are having null values added: astonishing, mixed, numbers, power

Attributes that not not having null values added: name, data, description, onlyNull

@springmeyer
Copy link
Member

@davidtheclark Mapnik (core) reads the first 5 features, by default, to assemble the schema of possible attributes. The default of 5 is conservative for performance reasons on very large feature collections. Can you try passing num_features_to_query=100 to see if this makes the behavior consistent? Refs mapnik/mapnik#3238 | mapnik/mapnik@f44b5cc

@davidtheclark
Copy link
Author

Seems like the same thing is happening when I add num_features_to_query = 100 to my options.

@artemp
Copy link
Member

artemp commented Jul 11, 2016

@davidtheclark - I'll take a look tomorrow, thanks for reporting.

@davidtheclark
Copy link
Author

Great! To be clear: what I'd like would be for no null-valued attributes to be inserted into the features. That way I can distinguish between an attribute that was actually set to null in the feature and an attribute that was not a part of (undefined for) the feature.

On Jul 11, 2016, at 11:59 AM, Artem Pavlenko notifications@github.com wrote:

@davidtheclark - I'll take a look tomorrow, thanks for reporting.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.

@artemp
Copy link
Member

artemp commented Jul 12, 2016

Using mapnik-python

for f in ds.all_features():
...     print(f["astonishing"],f["mixed"],f["numbers"],f["power"],f["name"])
... 
True 7 [1,2,3] 6 foo
None 7 None 99 bar
None None None None baz
None None None None haha
None None [1,2,3] None hoho
None None None 86 hehe

^ looks reasonable to me - features that don't have particular attribute return None in Python

>>> for f in ds.all_features():
...     f.to_geojson()
... 
'{"type":"Feature","id":1,"geometry":{"type":"Point","coordinates":[84.7265625,55.7765730186677]},"properties":{"astonishing":true,"mixed":7,"name":"foo","numbers":"[1,2,3]","power":6}}'
'{"type":"Feature","id":2,"geometry":{"type":"LineString","coordinates":[[84.462890625,55.3291444084051],[86.0009765625,55.7518493917353],[86.044921875,56.0474995832989]]},"properties":{"mixed":"7","name":"bar","power":99}}'
'{"type":"Feature","id":3,"geometry":{"type":"Point","coordinates":[83.49609375,56.4867617524909]},"properties":{"name":"baz"}}'
'{"type":"Feature","id":4,"geometry":{"type":"Point","coordinates":[85.341796875,56.8249328650072]},"properties":{"name":"haha"}}'
'{"type":"Feature","id":5,"geometry":{"type":"LineString","coordinates":[[85.8251953125,55.5037498592751],[86.4404296875,55.6775844110895],[86.7919921875,55.6527980331896]]},"properties":{"data":6,"name":"hoho","numbers":"[1,2,3]"}}'
'{"type":"Feature","id":6,"geometry":{"type":"Polygon","coordinates":[[[80.947265625,57.3976240550005],[80.5078125,56.1455495006791],[81.650390625,56.1455495006791],[81.650390625,56.5594824837622],[80.947265625,57.3976240550005]]]},"properties":{"data":"{foo:\\"bar\\"}","description":"Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.","name":"hehe","power":86}}'

Note, "astonishing" appears only once as in original ^

I think some logic is faulty on JS (mapnik.node) side.
/cc @davidtheclark @springmeyer

@artemp
Copy link
Member

artemp commented Jul 14, 2016

After some digging it turned out that the best place to fix this is in https://github.com/mapnik/mapnik feature_kv_iterator

mapnik/mapnik@3397b8f#diff-7e10cfbdff7fca38fb1824d1ce771d88R42

^ I modified feture_kv_iterator to skip over non-existent values and also fix feature generator logic to allow null values in output which are valid JSON values
mapnik/mapnik@8ce58ea

/cc @davidtheclark @springmeyer

@springmeyer
Copy link
Member

@artemp to prepare for @davidtheclark testing this I tried node-mapnik master with latest mapnik master but I'm seeing:

 624 passing (7s)
  1 pending
  3 failing

  1) mapnik.Feature  should output the same geojson that it read:
     SyntaxError: Unexpected token }
      at Object.parse (native)
      at Context.<anonymous> (test/feature.test.js:123:28)

  2) mapnik.Feature  should output the same geojson that it read (point):
     SyntaxError: Unexpected token }
      at Object.parse (native)
      at Context.<anonymous> (test/feature.test.js:148:28)

  3) mapnik.Feature  should output the same geojson that it read (line):
     SyntaxError: Unexpected token }
      at Object.parse (native)
      at Context.<anonymous> (test/feature.test.js:173:28)

Can you replicate these failures?

@artemp
Copy link
Member

artemp commented Jul 14, 2016

@springmeyer - yes, confirming I'm seeing those as well ^. Back to debugging :)

@artemp
Copy link
Member

artemp commented Jul 14, 2016

Doh ;)
Summary:
our trick to use mapnik::value_null as a placeholder for Features that don’t actually have particular attribute is getting problematic.
Specially if we want to maintain 1:1 matching of input and output (GeoJSON)
A proper solution would be to abandon ​_schema_​ based features (aka Context) in favour of “scheme-less” Features like GeoJSON - back to how it used to be.

Or add “special” value type, lets call it “undefined” where value doesn’t exists.
Latter feels like a hack and will muddy internal logic further so not sure..
Sounds like we need to sync before choosing a path

/cc @springmeyer @davidtheclark

@artemp
Copy link
Member

artemp commented Jul 14, 2016

To distill even further: If we want to support "null" values (i think we do because they are valid in GeoJSON) then we can't rely on mapnik::value_null as a special default value for 'schema' (context) based feature implementation.
/cc @springmeyer @davidtheclark

@artemp
Copy link
Member

artemp commented Jul 14, 2016

> d={"name":null}
{ name: null }
> d["name"]
null
> d["undefined"]
undefined
> 

In JavaScript null is a distinct type that can be assigned ^
/cc @springmeyer @davidtheclark

artemp added a commit that referenced this issue Jul 15, 2016
@artemp
Copy link
Member

artemp commented Jul 15, 2016

To properly support null values in JSON input and output we need to change mapnik::feature_impl - I'm planning to work on this next week.

Meanwhile, I added properties method which can optionally filter out null values as in feature json generator. This is OK because from data point of view following JSONs are equal

{ "first": 123, "second":null }
{ "first": 123}
var mapnik = require('mapnik');
mapnik.register_datasource('/opt/mapnik/lib/mapnik/input/geojson.input');
var ds = new mapnik.Datasource({type:'geojson',file:'test.json'});
var featureset = ds.featureset();
var feat = featureset.next();
while (feat)
{
    var props = feat.properties();
    console.log(props);
    console.log("------------NO NULL----------");
    props = feat.properties(false);
    console.log(props);
    console.log("-----------------------------");
    feat = featureset.next();
}

Sample output

{ astonishing: true,
  data: null,
  description: null,
  mixed: 7,
  name: 'foo',
  null: null,
  numbers: '[1,2,3]',
  onlyNull: null,
  power: 6 }
------------NO NULL----------
{ astonishing: true,
  mixed: 7,
  name: 'foo',
  numbers: '[1,2,3]',
  power: 6 }
-----------------------------
{ astonishing: null,
  data: null,
  description: null,
  mixed: '7',
  name: 'bar',
  null: null,
  numbers: null,
  onlyNull: null,
  power: 99 }
------------NO NULL----------
{ mixed: '7', name: 'bar', power: 99 }
-----------------------------
{ astonishing: null,
  data: null,
  description: null,
  mixed: null,
  name: 'baz',
  null: null,
  numbers: null,
  onlyNull: null,
  power: null }
------------NO NULL----------
{ name: 'baz' }
-----------------------------
{ astonishing: null,
  data: null,
  description: null,
  mixed: null,
  name: 'haha',
  null: null,
  numbers: null,
  onlyNull: null,
  power: null }
------------NO NULL----------
{ name: 'haha' }
-----------------------------
{ astonishing: null,
  data: 6,
  description: null,
  mixed: null,
  name: 'hoho',
  null: null,
  numbers: '[1,2,3]',
  onlyNull: null,
  power: null }
------------NO NULL----------
{ data: 6, name: 'hoho', numbers: '[1,2,3]' }
-----------------------------
{ astonishing: null,
  data: '{foo:"bar"}',
  description: 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.',
  mixed: null,
  name: 'hehe',
  null: null,
  numbers: null,
  onlyNull: null,
  power: 86 }
------------NO NULL----------
{ data: '{foo:"bar"}',
  description: 'Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.',
  name: 'hehe',
  power: 86 }
-----------------------------

/cc @springmeyer @davidtheclark

@davidtheclark
Copy link
Author

Thanks for plugging away at this @artemp.

@springmeyer
Copy link
Member

@davidtheclark: quick update: @artemp and I worked today to get the latest Mapnik core version (needed for his fix for you) working against latest node-mapnik. Still a few things needed to test, but getting close. Once there we'll be ready to further test the patch here and get you a dev build. ETA later this week. Feel free to ping for an update anytime you are thinking about this.

@davidtheclark
Copy link
Author

davidtheclark commented Aug 11, 2016

@springmeyer Any update on this?

@springmeyer
Copy link
Member

springmeyer commented Aug 11, 2016

@davidtheclark - dug into this issue just now. There is some other bug lurking so it is not as simple as updating the Mapnik core version like I thought (which is now done). I'm seeing unpredictable results when the geojson is indexed (index created with mapnik-index command) that makes me uncomfortable about the changes. So, this needs a closer look and I'm strapped on time at the moment. Can you live with your workaround for another week or two? If not hit me up on chat to scheme.

@springmeyer
Copy link
Member

@artemp as part of testing this (https://gist.github.com/springmeyer/5f9ad5831402587c258116f1e627895c) I also noticed that:

"data": { "foo": "bar" }

Is round tripping as "data":"{foo:\"bar\"}" in mapnik v3.0.12 but "data":"{\"foo\":\"bar\"}" in mapnik v.3.0.11. Although in mapnik v3.0.12 it also roundtrips as "data":"{\"foo\":\"bar\"}" when not indexed. It seems we have some different behavior when indexing that needs to be unit tested in core C++ with this file. Can you take a look at pulling the test file into mapnik core and testing how features are round tripped with and without indexing?

@artemp
Copy link
Member

artemp commented Aug 12, 2016

@springmeyer - great catch before mapnik v3.0.12 release!

The issue was missing quoting for nested objects (mapnik/mapnik#3491) and it's not related to indexing so please verify your setup ^. We already have comprehensive tests covering all JSON properties types but the expected data was bogus (corrected in mapnik/mapnik@5c11fe4)
The issue is fixed in mapnik/mapnik@8dca305

@davidtheclark
Copy link
Author

Can you live with your workaround for another week or two?

Absolutely.

@davidtheclark
Copy link
Author

@artemp @springmeyer Has there been any progress on this issue? Asking because I'm starting work on another project where this has come up.

@springmeyer
Copy link
Member

@davidtheclark In my mind what still stands here is the problem of null reporting. Mapnik creates (as it parses features) a "context". This "context" is a map of every property key found and an index to where it is. This allows for more efficient storage of strings to reduce memory in featuresets with a lot of features. The impact of this is that we loose the ability to decern null properties from missing properties. @artemp and I have previously felt this tradeoff was good, since perf/low memory > round tripping null features. However your usecase indicates round tripping nulls matters. The basic problem is we can't by design. Second you also saw inconsistency in when nulls were reported. This inconsistency has to do with two factors: 1) the order of features and 2) whether the data is indexed. In both cases the inconsistency is by design (based on the order things are encountered) and I think we've rooted out all the unintended bugs.

So overall I think the question is whether round tripping is important enough to change the way Mapnik works. The perf/memory optimizations that originally motivated the use of a "context" in Mapnik only apply when catching features in memory. Most modern use cases of Mapnik don't actually do this anymore, so we could consider changing this behavior. Let's schedule a time to talk more about this in 2017? In the meantime ponder if you think you can workaround this fully or if you'd like it fixed.

@lightmare
Copy link

So overall I think the question is whether round tripping is important enough to change the way Mapnik works.

I don't see a strong reason why the distinction between a null property and a missing property should matter anywhere mapnik-related. It's a language-specific thing; sure mapnik is tightly connected with JavaScript and Python, which do distinguish those; but if you wanted to bind to Lua or Perl, you'd have trouble. The GeoJSON spec requires "properties" to be either null or an object; the interpretation of its contents is entirely up to the application.

That being said...

This "context" is a map of every property key found and an index to where it is. This allows for more efficient storage of strings to reduce memory in featuresets with a lot of features.

... this becomes counter-productive when each feature has a handful of properties from a much larger set of possible properties. Think of fully expanding a sparse matrix. I don't know whether such use case is common, but it might be worth considering.

Regarding feature memory usage, replacing vector<value> with unordered_map<index_t, value> won't make it much higher because value can hold a UnicodeString with its SSO buffer. index_t can be unsigned, IMO there won't be a need for the range of size_t this century; or const std::string* (and then context can be a set instead of map).

This would make round-tripping possible, but I'd rather not store null values in the feature at all. get has to return null for missing properties, anyway.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants