Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[datastore] new excludeFromIndexes syntax should allow for a catch-all on object properties #2510

Closed
lostpebble opened this issue Aug 6, 2017 · 12 comments
Assignees
Labels
api: datastore Issues related to the Datastore API.

Comments

@lostpebble
Copy link

lostpebble commented Aug 6, 2017

Playing around with this today (after seeing this issue #1916 has been resolved) and I'm noticing that you have to define each and every embedded object property that you would like unindexed, instead of being able to just define a single "catch all" excludeFromIndexes option.

This works (actually it doesn't really, see edit below...):

{
  "key": { ... },
  "excludeFromIndexes": [
    "testEmbeddedObject.otherStringThing",
    "testEmbeddedObject.stringThing",
    "testLongString"
  ],
  "data": {
    "testLongString": "really long string here",
    "testEmbeddedObject": {
        "stringThing": "really long string here",
        "otherStringThing": "really long string here"
     }
  }
}

But this does not:

{
  "key": { ... },
  "excludeFromIndexes": [
    "testEmbeddedObject",
    "testLongString"
  ],
  "data": {
    "testLongString": "really long string here",
    "testEmbeddedObject": {
        "stringThing": "really long string here",
        "otherStringThing": "really long string here"
     }
  }
}

In the second example I'm still getting an error for stringThing and otherStringThing being longer than 1500 bytes. Is there no way to define that it catches all the properties in an embedded object?

Maybe something like this, if you'd like the catch all to have a more intentional syntax:

 "excludeFromIndexes": [
    "testEmbeddedObject.*",
    "testLongString"
  ],

Though, I would hope that for this syntax of propertyName.*, which seems to say catch all properties of the object at propertyName, would also catch things that are not embedded objects but also simply a long string at that propertyName. I would just like a way to define that data stored at a certain property of an entity should not be indexed at all - be it a string, embedded object or whatever.

EDIT:

Upon thinking about it more, why does the syntax in my second example not work? I think putting a * wildcard shouldn't be necessary actually. You've deliberately indicated you do not want that property indexed and that should mean the entire property, be it an embedded object and all it's properties or just a long string. If you'd still like to index some properties of an embedded object, then you'd define those which you want unindexed and leave out the ones you want indexed.

This is confusing because upon looking at my entities in the datastore console it appears that my first example is actually wrong. I should have included the base testEmbeddedObject in my exclusion array too, so it would look like this now:

"excludeFromIndexes": [
    "testEmbeddedObject",
    "testEmbeddedObject.otherStringThing",
    "testEmbeddedObject.stringThing",
    "testLongString"
  ],

Otherwise, the datastore console still sees that "base" property as indexed even though there is no data on it.

Environment details

  • OS: Windows 10
  • Node.js version: 8.2.1
  • npm version: 5.3.0
  • google-cloud-node version: 1.1.0
@lostpebble lostpebble changed the title [datastore] new excludeFromIndexes syntax should allow for a catch-all on child properties [datastore] new excludeFromIndexes syntax should allow for a catch-all on object properties Aug 6, 2017
@stephenplusplus stephenplusplus added the api: datastore Issues related to the Datastore API. label Aug 7, 2017
@stephenplusplus
Copy link
Contributor

Thanks for the detailed issue. Anything is technically possible, but I believe our discussion is turning more towards "what would the average user want?" and I'm not sure enough to make the call on this. It would be nice to get an opinion from the Datastore team behind the excludeFromIndexes API structure, and if they see the catchall behavior as problematic in any way-- @lukesneeringer can you help with this?

@lostpebble
Copy link
Author

Personally, I've come from using the Objectify datastore library for Java (https://github.com/objectify/objectify).

I found it's interface to be very intuitive, so much so that I've created my own Datastore library in JavaScript to try emulate it's behaviour as much as possible (Pebblebed).

My understanding of the Datastore has always been that embedded objects are not going to be indexed. That if you want something indexed it should be put in an indexed property on the base entity - I believe Objectify follows this same idea, and does allow embedded objects with properties which are indexed but apparently creates synthetic properties in the top-level Entity object to do so. (https://github.com/objectify/objectify/wiki/Entities#embedded-object-native-representation)

So from my user perspective, I see it as the norm to rather exclude an entire embedded object from being indexed instead of it being the norm to have to exclude each and every property itself.

But this is obviously just my experience with using the Datastore. I understand there might be others who expect to interact with it differently, so interested to see what others think on this issue.

All I'm hoping for is at least some way to exclude a property wholly from being indexed.

@landrito landrito added priority: p2 Moderately-important priority. Fix may not be included in next release. status: acknowledged labels Aug 7, 2017
@landrito
Copy link
Contributor

landrito commented Aug 7, 2017

I have requested for some input from the datastore team internally.

@pcostell
Copy link
Contributor

pcostell commented Aug 7, 2017

Somewhat of a long answer, I'll break it down into (1) why did we decide to do this and (2) why it's different from objectify. Hope this clarifies a bit why the API is defined like it is.

(1)
The rational for requiring excludeFromIndexes for each relevant property is that we want the entire embedded entity to still be valid (i.e. convertible to a top-level, complete entity). The excludeFromIndexes bit indicates that "this property is not indexable", so it should really apply at the single property value level.

As an example of how this is useful is looking at something like an Address. This type of information may belong in either an embedded entity, or in a stand alone entity. If you were designing this in an ORM system, you might want to write something like:

AddressModel = createModel("Address", {
  "number": IntegerProperty(),
  "street": StringProperty(),
  "zip_code": IntegerProperty(),
  "landmark_description": StringProperty({ "excludeFromIndexes": true})
  });

BusinessModel = createModel("Business", {
  // ...
  "address": ModelProperty(AddressModel),
});

a = AddressModel({
  "number": 123,
  "street": "Main St",
  "zip_code": 12345,
  "landmark_description": "On the corner, across from Google building."
});

This will allow you to issue queries on part of the index (e.g. SELECT * FROM Business WHERE address.zip_code = 12345), but still have parts of your address that are un-indexable (landmark_description).

Note that it is possible to exclude the entire address from indexing. However, excludeFromIndexes must still be set on un-indexable properties. The reasoning here is that if decide you don't want address indexed, then later decide you want to query for it, removing the excludeFromIndexes bit will be the only action you need to take. If we did not validate the inner properties, this could potentially lead to breakage on already stored entities (whose properties would become indexed when modifying them if using any sort of ORM).

(2)
This is different from objectify because of additional support that was added to the Cloud Datastore API after this feature was already added to Objectify. In particular, with this new support, embedded entities are indexable (see Address example above). In the previous version of Cloud Datastore, indexed embedded entities were not supported. Clients (such as objectify) would serialize the entire embedded entity as a string and store it. In order to support indexing of specific properties in those entities, Objectify would then write duplicate properties, named with dots (e.g. a property named "address.zip_code"), that could then be used to query for a business based on the address. On receiving matching results, the duplicate indexed properties are discarded, and the embedded entity gets deserialized from the opaque string.

@lostpebble
Copy link
Author

lostpebble commented Aug 9, 2017

@pcostell Thanks for the long and detailed reply.

(just noticed you wrote "excludeFromIndexes": false while I'm sure you meant "excludeFromIndexes": true for landmark_description)

I can totally understand the new direction you guys have gone with embedded entities and being able to index properties of those objects within the parent entity. It makes a lot of sense and it's opened my mind to some new possibilities with the Datastore.

I must say though that the way it seems to have been implemented, having to include excludeFromIndexes on every single property that you would like to not index, seems to be very tedious from my perspective.

I use embedded entities to store POJO in Java, or in JavaScript's case just regular JSON objects, of large and complex structures. These are embedded objects because I know that the properties within them should never be indexed, but I would like access to them once I've queried for the entity later on. To have to now mark each and every property within these objects individually that I would not like indexed just doesn't seem intuitive in the least. Embedded objects have always been just that - objects which are embedded inside the entity... To suddenly start making (as default behaviour) properties within those objects act like top-level entity properties doesn't seem natural.

I don't fully understand your reasoning with why excludeFromIndexes needs to be set on each and every property and how that affects changes later on should one decide to actually index those properties at a later stage (my understanding was always that if I decided to index previously unindexed properties that only entities saved from that point onward would have those properties indexed - so in your example should I decide to now index address.landmark_description - for entities saved previously that property is as good as not indexed and non-query-able).


EDIT: I think I do understand now what you were saying about excluding the entire address entity from indexing. Essentially, the entire entity and all its properties are excluded from indexing if you set the base address property as excludeFromIndexes? But validation (1500 byte length for each field) is still enacted on all the properties incase in the future that changes and those properties that are already saved now becoming indexed during a re-save / modification.

I guess that makes some of my points moot. But I would still like a way to just store an entire POJO / JSON object without having to think about these things.


Of course, I could in this case just do a JSON.stringify() and store that entire object as an unindexed string property and JSON.parse() when I want it again - but that doesn't really feel intuitive. In fact I'm quite surprised to find that this is the way it actually works, and I can't help but think a lot of other users will be just as surprised when they run into that 1500 bytes property length error.

I feel like having this behaviour of having to exclude instead of targeting directly what to include will create a real issue of people creating indexes where they really shouldn't be - which is something I try and watch out for as much as possible to keep my Datastore entities as lean and performant as possible. I can't imagine how many people have probably indexed entire embedded objects unknowingly already because of this behaviour.

@briangranatir
Copy link

Just adding an additional opinion. Datastore is backed by BigTable. This means that it doesn't have traditional indexes. Instead, every index in Datastore manifests as a new row in BigTable with a specialised rowkey.

This means that datastore tables with a large number of columns will translate into a massive number of simultaneous writes to BigTable (makes a row for each column index). Add some ancestor structure and throughput of writes drops significantly.

Since our streaming architecture can burst to thousands of writes per second, we can't afford to index all column. I know, we could just use BigTable, but that requires maintaining instances (which are far more expensive than just using datastore). Also BigTable doesn't guarantee writes like datastore.

Anyway, before writing any data to datastore, we add excludeFromIndexes to all values.

  const data = [];
  Object.keys(obj).forEach(( key ) => {
    const nestedObject = obj[key];
    if(nestedObject instanceof Array) {
      data.push({name: key, value: JSON.stringify(nestedObject), excludeFromIndexes: true});
    } else {
      data.push({
        name: key,
        value: nestedObject,
        excludeFromIndexes: !(self.indexes.indexOf(key.toUpperCase().trim()) > -1)
      });
    }
  });
  return data;

Not a problem, but it would be nice to change the default to have no indexes. I believe a major reason that new users complain about performance issues when using datastore with Node is because they are indexing a large number of columns with some sort of ancestor structure.

Just our teams experiences. Happy to provide more data if needed.

@lostpebble
Copy link
Author

Just to agree with basically what @briangranatir is saying, and follow up if there has been any progress or thoughts about this?

I think there should be an option to have all properties unindexed as the default save behaviour - I'd much rather be explicit about what I'd like to query on than have performance issues and potentially a bloated Datastore with properties which should never have been indexed. This is exacerbated by the current behaviour of having to try explicitly say what you would not like indexed, especially in deep child objects on your entities.

Perhaps if there was a property on save introduced such as an includeInIndexes array which, if present, changes the default behaviour to act the other way around. If both excludeFromIndexes and includeInIndexes are present, an error could be thrown. This should prevent any confusion about usage.

@sebelga
Copy link
Contributor

sebelga commented Oct 25, 2017

Hi
Interesting discussion, I arrived here as a bug was reported in gstore-node (sebelga/gstore-node#80) mentioning that "Exclude from indexes cannot be set on a list value". It seems that since version 1.1.0 and the new array format to declare excludeFromIndexes, it is not possible to exclude a whole Array at once (see my comment here)

I like the idea of @lostpebble. By default embedded entities should have all their indexes set to false.
We could then revert the behaviour with:

excludeAllFromIndexes: false,

and an includeInIndexes array would allow declare the specific properties needed for the queries.

@lostpebble I didn't quiet understand when you said "But validation (1500 byte length for each field) is still enacted on all the properties incase in the future". That means that the embedded entities is not indexed but the limitation on its properties still applies?

@lostpebble
Copy link
Author

@sebelga from what I can gather from what @pcostell wrote about this, the validation is still enacted on the inner properties of an entity you have marked as excludeFromIndexes - because if in the future you change your mind and happen to want to index and now query on one of those inner properties it would cause issues if that property was actually too large to be indexed.

All in all, it's a little confusing and seems like they're planning beyond what is practical in most user's eyes. Though, if the time ever comes I might appreciate that I can easily now index those inner properties.

The only solution I see for storing regular large Javascript objects unindexed completely and easily, is to implement some kind of serialisation to a single string value which you make excludeFromIndexes. Most probably just a JSON.stringify() and JSON.parse() before saving and after loading respectively.

@sebelga
Copy link
Contributor

sebelga commented Oct 25, 2017

@lostpebble thanks for the explanation.
I think that with "gstore-node" I will add the option I mentioned above (excludeAllFromIndexes) and if it is set to true (default) I will read all the keys from the object and add them to the excludeFromIndexes array (new since version 1.1.0). I think it will be better when editing the entities in the cloud console.

I am still confused if this new way of setting the excludeFromIndexes replaces the old one (where it was set on each entity property) or if both are working together.

EDIT:
I made some more tests and both ways of excluding indexes work on version 1.1.0 but with inconsistency.

// This will correctly not index the property "title"
const entity = {
    data: { title: 'hello' },
    excludeFromIndexes: ['title'],
    key: {}
}

// This will do the same
const entity = {
    data: [{
        name: 'title',
        value: 'hello',
        excludeFromIndexes: true,
    }],
    key: {}
}

// This will correctly not index the property "address"
const entity = {
    data: { address: { street: 'abc', city: 'abc' } },
    excludeFromIndexes: ['address'],
    key: {}
}

// This will do the same
const entity = {
    data: [{
        name: 'addresss',
        value: { street: 'abc', city: 'abc' },
        excludeFromIndexes: true,
    }],
    key: {}
}

// This will correctly *not* index any of the values of the Array
const entity = {
    data: [{
        name: 'myArray',
        value: ['1', '2', '3'],
        excludeFromIndexes: true,
    }],
    key: {}
}

// Here we would expect the same behaviour but we get
// the error: "Exclude from indexes cannot be set on a list value"
const entity = {
    data: { myArray: ['1', '2', '3'] },
    excludeFromIndexes: ['myArray'],
    key: {}
}

@sebelga
Copy link
Contributor

sebelga commented Nov 8, 2017

@pcostell I am coming back to ask if the inconsistency I mentioned above "is" an inconsistency or if it like that by design. Repecting the list values that can't be declared like embedded entities in the "array format". Thanks 👍

@stephenplusplus
Copy link
Contributor

This issue was moved to googleapis/nodejs-datastore#12.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: datastore Issues related to the Datastore API.
Projects
None yet
Development

No branches or pull requests

6 participants