
[datastore] new excludeFromIndexes syntax should allow for a catch-all on object properties #12

Closed
stephenplusplus opened this issue Dec 16, 2017 · 19 comments
Labels
api: datastore Issues related to the googleapis/nodejs-datastore API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@stephenplusplus
Contributor

Copied from original issue: googleapis/google-cloud-node#2510

@lostpebble
August 6, 2017 1:56 PM

Playing around with this today (after seeing this issue #1916 has been resolved) and I'm noticing that you have to define each and every embedded object property that you would like unindexed, instead of being able to just define a single "catch all" excludeFromIndexes option.

This works (actually it doesn't really, see edit below...):

{
  "key": { ... },
  "excludeFromIndexes": [
    "testEmbeddedObject.otherStringThing",
    "testEmbeddedObject.stringThing",
    "testLongString"
  ],
  "data": {
    "testLongString": "really long string here",
    "testEmbeddedObject": {
        "stringThing": "really long string here",
        "otherStringThing": "really long string here"
     }
  }
}

But this does not:

{
  "key": { ... },
  "excludeFromIndexes": [
    "testEmbeddedObject",
    "testLongString"
  ],
  "data": {
    "testLongString": "really long string here",
    "testEmbeddedObject": {
        "stringThing": "really long string here",
        "otherStringThing": "really long string here"
     }
  }
}

In the second example I'm still getting an error for stringThing and otherStringThing being longer than 1500 bytes. Is there no way to define that it catches all the properties in an embedded object?

Maybe something like this, if you'd like the catch all to have a more intentional syntax:

 "excludeFromIndexes": [
    "testEmbeddedObject.*",
    "testLongString"
  ],

Though I would hope that this propertyName.* syntax, which seems to say "catch all properties of the object at propertyName", would also catch values that are not embedded objects, such as simply a long string at that propertyName. I would just like a way to declare that the data stored at a certain property of an entity should not be indexed at all, be it a string, an embedded object, or whatever.

EDIT:

Upon thinking about it more, why does the syntax in my second example not work? Putting a * wildcard shouldn't actually be necessary. You've deliberately indicated that you do not want that property indexed, and that should mean the entire property, be it an embedded object and all its properties or just a long string. If you'd still like to index some properties of an embedded object, you'd define the ones you want unindexed and leave out the ones you want indexed.

This is confusing because upon looking at my entities in the datastore console it appears that my first example is actually wrong. I should have included the base testEmbeddedObject in my exclusion array too, so it would look like this now:

"excludeFromIndexes": [
    "testEmbeddedObject",
    "testEmbeddedObject.otherStringThing",
    "testEmbeddedObject.stringThing",
    "testLongString"
  ],

Otherwise, the datastore console still sees that "base" property as indexed even though there is no data on it.
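For completeness, here is a full entity using that expanded exclusion list (a sketch; the key is a placeholder):

```javascript
// Sketch reusing the property names from the examples above: today you
// must list the base embedded property *and* each of its inner properties
// to keep everything out of the indexes.
const entity = {
  key: {}, // placeholder; in practice something like datastore.key('Kind')
  excludeFromIndexes: [
    'testEmbeddedObject',
    'testEmbeddedObject.otherStringThing',
    'testEmbeddedObject.stringThing',
    'testLongString',
  ],
  data: {
    testLongString: 'really long string here',
    testEmbeddedObject: {
      stringThing: 'really long string here',
      otherStringThing: 'really long string here',
    },
  },
};
```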

Environment details

  • OS: Windows 10
  • Node.js version: 8.2.1
  • npm version: 5.3.0
  • google-cloud-node version: 1.1.0
@stephenplusplus stephenplusplus added api: datastore priority: p2 Moderately-important priority. Fix may not be included in next release. labels Dec 16, 2017
@stephenplusplus
Contributor Author

August 7, 2017 12:56 PM

Thanks for the detailed issue. Anything is technically possible, but I believe our discussion is turning more towards "what would the average user want?" and I'm not sure enough to make the call on this. It would be nice to get an opinion from the Datastore team behind the excludeFromIndexes API structure, and to hear whether they see the catch-all behavior as problematic in any way. @lukesneeringer, can you help with this?

@stephenplusplus
Contributor Author

@lostpebble
August 7, 2017 1:34 PM

Personally, I've come from using the Objectify datastore library for Java (https://github.com/objectify/objectify).

I found its interface to be very intuitive, so much so that I've created my own Datastore library in JavaScript (Pebblebed) to try to emulate its behaviour as much as possible.

My understanding of the Datastore has always been that embedded objects are not going to be indexed: if you want something indexed, it should be put in an indexed property on the base entity. I believe Objectify follows this same idea; it does allow embedded objects with indexed properties, but apparently creates synthetic properties in the top-level Entity object to do so (https://github.com/objectify/objectify/wiki/Entities#embedded-object-native-representation).

So from my user perspective, I see it as the norm to rather exclude an entire embedded object from being indexed instead of it being the norm to have to exclude each and every property itself.

But this is obviously just my experience with using the Datastore. I understand there might be others who expect to interact with it differently, so interested to see what others think on this issue.

All I'm hoping for is at least some way to exclude a property wholly from being indexed.

@stephenplusplus
Contributor Author

@landrito
August 7, 2017 8:15 PM

I have requested for some input from the datastore team internally.

@stephenplusplus
Contributor Author

@pcostell
August 7, 2017 8:52 PM

Somewhat of a long answer, I'll break it down into (1) why did we decide to do this and (2) why it's different from objectify. Hope this clarifies a bit why the API is defined like it is.

(1)
The rationale for requiring excludeFromIndexes on each relevant property is that we want the entire embedded entity to still be valid (i.e. convertible to a top-level, complete entity). The excludeFromIndexes bit indicates that "this property is not indexable", so it should really apply at the single property value level.

As an example of how this is useful, consider something like an Address. This type of information may belong either in an embedded entity or in a standalone entity. If you were designing this in an ORM system, you might want to write something like:

AddressModel = createModel("Address", {
  "number": IntegerProperty(),
  "street": StringProperty(),
  "zip_code": IntegerProperty(),
  "landmark_description": StringProperty({ "excludeFromIndexes": true })
});

BusinessModel = createModel("Business", {
  // ...
  "address": ModelProperty(AddressModel),
});

a = AddressModel({
  "number": 123,
  "street": "Main St",
  "zip_code": 12345,
  "landmark_description": "On the corner, across from Google building."
});

This will allow you to issue queries on part of the index (e.g. SELECT * FROM Business WHERE address.zip_code = 12345), but still have parts of your address that are un-indexable (landmark_description).
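In the Node.js client's entity format, that mixed indexing would be expressed along these lines (a sketch; the key is a placeholder and the kind name is illustrative, with only landmark_description excluded):

```javascript
// Sketch: only the landmark_description path is excluded, so zip_code
// remains indexed and queryable via "address.zip_code".
const business = {
  key: {}, // illustrative; in practice datastore.key('Business')
  excludeFromIndexes: ['address.landmark_description'],
  data: {
    address: {
      number: 123,
      street: 'Main St',
      zip_code: 12345,
      landmark_description: 'On the corner, across from Google building.',
    },
  },
};
```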

Note that it is possible to exclude the entire address from indexing. However, excludeFromIndexes must still be set on un-indexable properties. The reasoning here is that if you decide you don't want address indexed, and then later decide you want to query on it, removing the excludeFromIndexes bit will be the only action you need to take. If we did not validate the inner properties, this could potentially lead to breakage on already-stored entities (whose properties would become indexed when modified, if using any sort of ORM).

(2)
This is different from objectify because of additional support that was added to the Cloud Datastore API after this feature was already added to Objectify. In particular, with this new support, embedded entities are indexable (see Address example above). In the previous version of Cloud Datastore, indexed embedded entities were not supported. Clients (such as objectify) would serialize the entire embedded entity as a string and store it. In order to support indexing of specific properties in those entities, Objectify would then write duplicate properties, named with dots (e.g. a property named "address.zip_code"), that could then be used to query for a business based on the address. On receiving matching results, the duplicate indexed properties are discarded, and the embedded entity gets deserialized from the opaque string.

@stephenplusplus
Contributor Author

@lostpebble
August 9, 2017 3:02 PM

@pcostell Thanks for the long and detailed reply.

(just noticed you wrote "excludeFromIndexes": false while I'm sure you meant "excludeFromIndexes": true for landmark_description)

I can totally understand the new direction you guys have gone with embedded entities and being able to index properties of those objects within the parent entity. It makes a lot of sense and it's opened my mind to some new possibilities with the Datastore.

I must say though that the way it seems to have been implemented, having to include excludeFromIndexes on every single property that you would like to not index, seems to be very tedious from my perspective.

I use embedded entities to store POJOs in Java or, in JavaScript's case, regular JSON objects with large and complex structures. These are embedded objects because I know that the properties within them should never be indexed, but I would like access to them once I've queried for the entity later on. To now have to individually mark each and every property within these objects that I would not like indexed just doesn't seem intuitive in the least. Embedded objects have always been just that - objects which are embedded inside the entity. To suddenly start making properties within those objects act like top-level entity properties (as default behaviour) doesn't seem natural.

I don't fully understand your reasoning for why excludeFromIndexes needs to be set on each and every property, or how that affects changes later on should one decide to actually index those properties at a later stage. My understanding was always that if I decided to index previously unindexed properties, only entities saved from that point onward would have those properties indexed. So in your example, should I decide to now index address.landmark_description, for entities saved previously that property is as good as not indexed and non-queryable.


EDIT: I think I do understand now what you were saying about excluding the entire address entity from indexing. Essentially, the entire entity and all its properties are excluded from indexing if you set the base address property as excludeFromIndexes, but validation (the 1500-byte length limit for each field) is still enforced on all the properties, in case that changes in the future and the already-saved properties become indexed during a re-save or modification.

I guess that makes some of my points moot. But I would still like a way to just store an entire POJO / JSON object without having to think about these things.


Of course, I could in this case just do a JSON.stringify() and store that entire object as an unindexed string property and JSON.parse() when I want it again - but that doesn't really feel intuitive. In fact I'm quite surprised to find that this is the way it actually works, and I can't help but think a lot of other users will be just as surprised when they run into that 1500 bytes property length error.
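That workaround can be wrapped in a couple of helpers (a minimal sketch with no Datastore dependency; the property name payload is illustrative):

```javascript
// Serialize the whole object into one unindexed string property; a single
// excludeFromIndexes path then covers everything, at the cost of losing
// any ability to query inside it.
function toEntityData(obj) {
  return {
    excludeFromIndexes: ['payload'],
    data: { payload: JSON.stringify(obj) },
  };
}

// Reverse the serialization after reading the entity back.
function fromEntityData(data) {
  return JSON.parse(data.payload);
}
```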

I feel like this behaviour of having to exclude, instead of targeting directly what to include, will create a real issue of people creating indexes where they really shouldn't - which is something I try to watch out for as much as possible to keep my Datastore entities as lean and performant as possible. I can't imagine how many people have probably indexed entire embedded objects unknowingly already because of this behaviour.

@stephenplusplus
Contributor Author

@briangranatir
September 6, 2017 3:17 AM

Just adding an additional opinion. Datastore is backed by BigTable. This means that it doesn't have traditional indexes. Instead, every index in Datastore manifests as a new row in BigTable with a specialised rowkey.

This means that datastore tables with a large number of columns translate into a massive number of simultaneous writes to BigTable (one row for each column index). Add some ancestor structure, and write throughput drops significantly.

Since our streaming architecture can burst to thousands of writes per second, we can't afford to index all columns. I know, we could just use BigTable, but that requires maintaining instances (which are far more expensive than just using datastore). Also, BigTable doesn't guarantee writes the way datastore does.

Anyway, before writing any data to datastore, we add excludeFromIndexes to all values.

  // Build the entity's `data` array from a plain object, indexing only the
  // keys present in the allow-list `self.indexes`.
  const data = [];
  Object.keys(obj).forEach((key) => {
    const nestedObject = obj[key];
    if (nestedObject instanceof Array) {
      // Arrays are serialized into a single unindexed string property.
      data.push({ name: key, value: JSON.stringify(nestedObject), excludeFromIndexes: true });
    } else {
      data.push({
        name: key,
        value: nestedObject,
        // Exclude from indexes unless the key is in the allow-list.
        excludeFromIndexes: !(self.indexes.indexOf(key.toUpperCase().trim()) > -1)
      });
    }
  });
  return data;

Not a problem, but it would be nice to change the default to have no indexes. I believe a major reason that new users complain about performance issues when using datastore with Node is because they are indexing a large number of columns with some sort of ancestor structure.

Just our team's experiences. Happy to provide more data if needed.

@stephenplusplus
Contributor Author

@lostpebble
October 15, 2017 10:55 AM

Just to agree with basically what @briangranatir is saying, and follow up if there has been any progress or thoughts about this?

I think there should be an option to make all properties unindexed the default save behaviour - I'd much rather be explicit about what I'd like to query on than have performance issues and a potentially bloated Datastore full of properties which should never have been indexed. This is exacerbated by the current behaviour of having to explicitly say what you would not like indexed, especially in deep child objects on your entities.

Perhaps a save option could be introduced, such as an includeInIndexes array which, if present, flips the default behaviour the other way around. If both excludeFromIndexes and includeInIndexes are present, an error could be thrown. This should prevent any confusion about usage.
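Until something like that exists in the library, the include-list idea can be sketched client-side (an illustration, not an existing API; allPaths and excludeAllExcept are hypothetical helpers, and array paths like 'prop[]' are not handled):

```javascript
// Enumerate every dotted property path in a plain object.
function allPaths(obj, prefix = '') {
  return Object.keys(obj).flatMap((k) => {
    const path = prefix ? `${prefix}.${k}` : k;
    const v = obj[k];
    return v && typeof v === 'object' && !Array.isArray(v)
      ? [path, ...allPaths(v, path)]
      : [path];
  });
}

// Build an excludeFromIndexes array that excludes everything except the
// listed paths (and their ancestors, which must stay indexable).
function excludeAllExcept(data, includeInIndexes) {
  const included = new Set(includeInIndexes);
  const ancestorOfIncluded = (p) =>
    includeInIndexes.some((inc) => inc.startsWith(p + '.'));
  return allPaths(data).filter((p) => !included.has(p) && !ancestorOfIncluded(p));
}
```

For example, excludeAllExcept({name: 'x', payload: {content: '…', foo: 'bar'}}, ['name']) yields ['payload', 'payload.content', 'payload.foo'].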

@stephenplusplus
Contributor Author

@sebelga
October 25, 2017 12:30 PM

Hi, interesting discussion. I arrived here because a bug was reported in gstore-node (sebelga/gstore-node#80) mentioning that "Exclude from indexes cannot be set on a list value". It seems that since version 1.1.0 and the new array format for declaring excludeFromIndexes, it is not possible to exclude a whole Array at once (see my comment here).

I like the idea of @lostpebble. By default embedded entities should have all their indexes set to false.
We could then revert the behaviour with:

excludeAllFromIndexes: false,

and an includeInIndexes array would allow declaring the specific properties needed for the queries.

@lostpebble I didn't quite understand when you said "But validation (1500 byte length for each field) is still enacted on all the properties incase in the future". Does that mean the embedded entity is not indexed, but the limitation on its properties still applies?

@stephenplusplus
Contributor Author

@lostpebble
October 25, 2017 1:26 PM

@sebelga from what I can gather from what @pcostell wrote about this, the validation is still enacted on the inner properties of an entity you have marked as excludeFromIndexes - because if in the future you change your mind and happen to want to index and now query on one of those inner properties it would cause issues if that property was actually too large to be indexed.

All in all, it's a little confusing, and it seems like they're planning beyond what is practical in most users' eyes. Though, if the time ever comes, I might appreciate that I can now easily index those inner properties.

The only solution I see for storing regular large JavaScript objects completely and easily unindexed is to implement some kind of serialisation to a single string value which you mark as excludeFromIndexes - most probably just a JSON.stringify() before saving and a JSON.parse() after loading.

@stephenplusplus
Contributor Author

@sebelga
October 25, 2017 4:00 PM

@lostpebble thanks for the explanation.
I think that with "gstore-node" I will add the option I mentioned above (excludeAllFromIndexes) and if it is set to true (default) I will read all the keys from the object and add them to the excludeFromIndexes array (new since version 1.1.0). I think it will be better when editing the entities in the cloud console.

I am still confused about whether this new way of setting excludeFromIndexes replaces the old one (where it was set on each entity property) or whether both work together.

EDIT:
I made some more tests, and both ways of excluding indexes work on version 1.1.0, but with inconsistencies.

// This will correctly not index the property "title"
const entity = {
    data: { title: 'hello' },
    excludeFromIndexes: ['title'],
    key: {}
}

// This will do the same
const entity = {
    data: [{
        name: 'title',
        value: 'hello',
        excludeFromIndexes: true,
    }],
    key: {}
}

// This will correctly not index the property "address"
const entity = {
    data: { address: { street: 'abc', city: 'abc' } },
    excludeFromIndexes: ['address'],
    key: {}
}

// This will do the same
const entity = {
    data: [{
        name: 'address',
        value: { street: 'abc', city: 'abc' },
        excludeFromIndexes: true,
    }],
    key: {}
}

// This will correctly *not* index any of the values of the Array
const entity = {
    data: [{
        name: 'myArray',
        value: ['1', '2', '3'],
        excludeFromIndexes: true,
    }],
    key: {}
}

// Here we would expect the same behaviour but we get
// the error: "Exclude from indexes cannot be set on a list value"
const entity = {
    data: { myArray: ['1', '2', '3'] },
    excludeFromIndexes: ['myArray'],
    key: {}
}
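One possible workaround for that last case, going by the bracket path syntax discussed later in this thread, is to target the list's members rather than the property itself (a sketch; behaviour not verified here):

```javascript
// Sketch: 'myArray[]' targets the elements of the list value, avoiding
// "Exclude from indexes cannot be set on a list value", which the bare
// 'myArray' path raises in this version.
const entity = {
  data: { myArray: ['1', '2', '3'] },
  excludeFromIndexes: ['myArray[]'],
  key: {},
};
```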

@stephenplusplus
Contributor Author

@sebelga
November 8, 2017 4:51 PM

@pcostell I am coming back to ask whether the inconsistency I mentioned above "is" an inconsistency, or whether it is like that by design, with respect to the list values that can't be declared in the "array format" the way embedded entities can. Thanks 👍

@jacekkopecky

We too just got hit by the 1500-byte limit on properties we never intend to have indexed, and we too added code that inspects the objects and populates excludeFromIndexes. I would have preferred to be spared this experience.
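The inspection code we wrote looks roughly like this (a simplified sketch, not our production code; oversizedPaths is a hypothetical name):

```javascript
// Walk a plain object and collect the dotted paths of string values that
// exceed Datastore's 1500-byte limit for indexed properties, so they can
// be fed into excludeFromIndexes. Arrays are skipped for brevity.
function oversizedPaths(obj, prefix = '') {
  const paths = [];
  for (const [k, v] of Object.entries(obj)) {
    const path = prefix ? `${prefix}.${k}` : k;
    if (typeof v === 'string' && Buffer.byteLength(v, 'utf8') > 1500) {
      paths.push(path);
    } else if (v && typeof v === 'object' && !Array.isArray(v)) {
      paths.push(...oversizedPaths(v, path));
    }
  }
  return paths;
}
```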

With the excludeFromIndexes approach, I would imagine a wildcard exclusion mechanism which excludes everything under a certain path, with the warning that if you enable indexing in the future, properties over 1500 bytes will lead to errors.

The alternative includeInIndexes approach would better suit our usage, by the way, as discussed above by @lostpebble and @briangranatir. While it's not particularly difficult to just JSON.stringify most everything, it does not feel natural at all; it feels like it harms the maintainability of our code.

Also, the documentation could be improved with a section on the details of the current exclusion scheme. A single sentence, "Exclude properties from indexing using a simple JSON path notation.", plus a pointer to examples isn't sufficient to give a developer confidence that their code will work.

Are you accepting PRs?

@dankraus

Add me as another one bitten by this. I have a property that's just JSON, and I don't want any of its sub-properties indexed at all. The data can really change from entity to entity.

In the example below, payload can contain any sort of JSON, and content isn't always going to be there; it's basically schema-less. I would love to be able to completely not index payload at all.

const entity = {
    key: datastore.key('Event'),
    excludeFromIndexes: ['payload'],
    data: {
        name: "someString",
        payload: {
            content: "a very long string",
            foo: "bar"
        }
    }
};

await datastore.insert(entity);

It took me quite a long while between reading all these issue comments and looking at source to determine that it doesn't do what (I think at least) is a natural assumption. I guess in the meantime, I'll just use JSON.stringify as @jacekkopecky suggested and JSON.parse when queried.

@trollkotze
Contributor

I think this is an important issue. I assume most will be confused and surprised by this unintuitive standard behaviour.
I guess an anticipated problem with changing it is backwards-compatibility. But I think a breaking change should be done here. I have not read any compelling reasoning for the current (IMO strange) necessity to exclude any subproperty from indexing separately for a complex object.

@trollkotze
Contributor

The behaviour at the moment is really nonsensical incomplete bullshit.

We can do this:

entity = {
  key: key,
  data: [{
    name: 'someArray',
    value: [1,2,3,'whatever',{something: 'bla'}],
    excludeFromIndexes: true
  }]
};

This will mark the array property 'someArray', as well as all its members at the first level (including entity values, but not their children), as excluded from indexes.

We can do this:

entity = {
  key: key,
  data: {
    someArray: [1,2,3,'whatever',{something: 'bla'}],
  },
  excludeFromIndexes: ['someArray[]']
};

This will exclude only the primitive array members of 'someArray' from indexes - but not the 'someArray' property itself, nor any entity members of 'someArray'.

We can do this:

entity = {
  key: key,
  data: {
    someArray: [1,2,3,'whatever',{something: 'bla'}],
  },
  excludeFromIndexes: ['someArray[]', 'someArray[].something']
};

This will exclude all primitive array members of 'someArray', and the 'something' property of its entity members, from indexes. But it will not exclude the entity members of someArray themselves, nor someArray itself at its root.

So it is currently impossible with the Datastore library to exclude everything from indexes with either syntax.
The first leaves out the possibility of excluding anything nested at a deeper level.
The second, while allowing you to exclude primitive values at any level, leaves out the possibility of excluding any array or entity value at its root.

Completely stupid.

@JustinBeckwith JustinBeckwith added the 🚨 This issue needs some love. label Jun 14, 2018
@sduskis sduskis removed 🚨 This issue needs some love. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Feb 5, 2019
@JustinBeckwith JustinBeckwith added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed type: enhancement labels Jun 5, 2019
@stephenplusplus
Contributor Author

@trollkotze and everyone else -- hello! Would you mind seeing how this PR looks? #451

@jacekkopecky

Has any further consideration been given to something like includeInIndexes? It would allow us to specify positively what to index, meaning everything else is excluded. (This was raised above…)

@AVaksman
Copy link
Contributor

Closed with #451.

@google-cloud-label-sync google-cloud-label-sync bot added the api: datastore Issues related to the googleapis/nodejs-datastore API. label Jan 31, 2020
@kirillgroshkov

Is there a plan to document the exclusion syntax (e.g. .*, [], etc.)?
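For reference, my current understanding of the wildcard paths, pieced together from this thread and PR #451 rather than official documentation, so treat it as unverified:

```javascript
// Unverified illustration of the path syntax discussed in this thread:
//   'payload.*'  -- every property of the embedded entity `payload`
//   'tags[]'     -- every element of the list property `tags`
const entity = {
  key: {}, // illustrative; e.g. datastore.key('Event')
  excludeFromIndexes: ['payload.*', 'tags[]'],
  data: {
    payload: { content: 'a very long string', foo: 'bar' },
    tags: ['a', 'b'],
  },
};
```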
