[wip] Populate.next roadmap - feedback welcome #1292

aheckmann · 2013-01-15T18:22:39Z

DO NOT MERGE

For release 3.6 our primary objective is a rewrite of the populate functionality. Currently it suffers from several limitations:

1) Poor support for nested path population

Given a schema like the following:

new Schema({
    name: String
  , em: [{ arr: [{ person: { type: Schema.ObjectId, ref: 'Person'}}] }]
});

An attempt to populate the following path will not work as expected:

Model.find().populate('em.arr.person').exec(console.log)
// Every em.arr.person is still an ObjectId

2) Cannot further populate a document after population

Model.findByid(id, function (err, doc) {
 doc.populate('after.the.fact', callback)
})

3) Recursive sub-document population not supported

References more than one level deep are not resolved.

a.populate('b') ->
  b.populate('c')

#601

4) Original _ids not found in populate query do not exist in returned document

Given the following document:

{ _id: 3, friends: [7,8,34] }

and we query/populate the documents friends

Model.find().populate('friends').exex(cb)

All friends ids that are not found during population are dropped on the floor. For example, if friends 7 and 34 are found, 8 is dropped from the returned document:

Model.find().populate('friends').exex(function (err, doc) {
  console.log(doc.friends) // [{ _id: 7, name: 'luigi' }, { _id: 34, name: 'mario' }]
})

5) Inefficient queries

We currently execute a separate query per document per path.

6) Clumsy population syntax when specifying multiple params

Model.find().populate(path, [fields], [model], [conditions], [options])

http://mongoosejs.com/docs/api.html#query_Query-populate

7) Assigning values to populated paths casts the value back to an _id

#570

8) Lean queries cannot be populated

#1260

NOTES

So far, items #1, #2, #5, #6, #8 have been fixed in the populate-wip branch.

#1, #2, #3, #6, #8 solutions

The core populate functionality is being rewritten and exposed through Model.populate(docs, paths, cb). This core functionality will now also be made accessible through document#populate(path, options, cb). Query#populate will still exist as in previous versions.

This refactor and exposure opens the door to populating plain js objects as well as mongoose documents, meaning population of documents returned from lean queries will now be supported.

The syntax of Query#populate() is being cleaned up (yet backward compatible). Its interface will support passing an options object containing all knobs instead of having to pass nulls as placeholders:

// < 3.6
Model.find().populate(path, null, null, { sort: { name: -1 }})

// >= 3.6
var o = {};
o.path = path;
o.options = { sort: { name: -1 }}};
Model.find().populate(o)

This syntax also supports populating multiple paths with the same options:

// >= 3.6
var o = { path: 'path1 path2', options: { sort: { name: -1 }}};
Model.find().populate(o)

As for auto-populating multi-level deep refs, this will be supported through the public exposure of Model.populate(docs, options, cb) which can be utilized to continue populating returned documents ad infinitum.

#4 solution may require breaking changes

Original _ids not found in populate query do not exist in returned document

arrays

For populated arrays we could associate the original ids as an ad-hoc property so we don't lose anything. However, it makes managing array manipulation a headache, ensuring the ad-hoc list of original _ids stays in sync with the exposed array of populated documents in cases where the user then saves the original document is pretty silly.

single properties

When single fields are populated and no value was found (due to specified query matchers etc) we currently replace the value with null (the value returned by the query). This means we lose the original _id. No bookkeeping is really necessary in this case b/c the null value will not get saved unless the user modifies it somehow. But, we still lose access to the original _id.

solution

When no matching document is found for a given _id, we leave that _id in the document. This means that no _ids would be lost after population and the risk of manipulating a populated array and saving the document back to the db would be mitigated. The result is that populated arrays may contain a mix of _ids and documents and it will be up to your code to manage either type. No bookkeeping required. Same goes for populating an individual path, the result could be a path with either the original _id or the found document. At this time, this is the direction we are taking but by no means is this final yet. The goal is to remain 3.x backward compatible and hide this new behavior behind and option. Feedback is welcome.

Fixed a typo in the migration docu

skotchio · 2013-01-15T20:35:30Z

I do not agree with the 4 point. I think at now it works properly. Why should I have an ids whose documents don't exists in database?

aheckmann · 2013-01-15T20:50:57Z

@vovan22 users are allowed to pass query conditions to filter out docs as well.

query.populate(path, fields, match, options)

braunsquared · 2013-01-15T20:55:18Z

@vovan22, I would agree in the sense that the model layer should never make changes to the underlaying data without your knowledge. If it drops ids from an array of references without your knowledge then you have potential for even more corruptions in your data set.

As for #4 in general, I think consistency of data structures is critical here. The array either needs to be an array of model objects or an array of ids. There should never be a situation where the structures change depending on whether someone has previously populated a reference or not. ie. When trying to get the id of a referenced document in an array I should never have to use instanceof operators.

I think population in general needs to allow for access to the direct field with the option of retrieving the referenced document. If a field in mongo is an array of object ids then I should be able to access that field as an array of object ids or optionally retrieve the referenced documents via a populate call, but the field itself should remain an array of object ids.

For example if I have a model called Post which references an array of Subscribers as subscribers then:

post.subscribers should return an array of object ids.
post.populate('subscribers') should return an array of documents.

if populate is smart enough, it will cache the response so multiple calls to populate would return the same list. If modifications to the subscribers array are made, the cache can be dropped or altered. I know keeping these in sync would be difficult.

If I am confused in these please help clarify.

aheckmann · 2013-01-15T21:34:55Z

@braunsquared I agree that we need to expose both the original _ids as well as populated docs separately somehow. the problem i see with caching within populate is cases where we really do want to re-populate based on changes that may have occurred in the db, more of a refresh scenario. now we'd have to expose the cache management as well so users could disable it etc which is leaky IMO, but easily done within a plugin if desired.

braunsquared · 2013-01-15T22:24:42Z

@aheckmann I agree with what you are saying and understand the complexity of adding a cache management system which can add lots of bloat as well. My main concern is keeping it consistent and preventing the need to pass references around to populated fields outside of the primary object reference.

What about the possibility of wrapping the populate functionality in a DocumentArray class which understands how to populate the array of references, the query that was used etc etc. We could then manually reload if need be rather than bombard the db with unnecessary duplicate queries?

As for non-array fields. IMO the field should always return an object reference regardless of whether it is populated or not. If it hasn't been populated, the document will only have an _id field and we can call populate/load on the object directly.

Also, please use the $in call very carefully. From our experience it is faster to perform multiple queries than use the $in operator even when proper indexes are in use.

aheckmann · 2013-01-16T18:47:38Z

@braunsquared I've not observed any performance impacts when using $in.

non-array fields: "fake" docs don't seem to help this scenario. if anything it makes it less clear that a document was not found. i don't want to think any more than i have to.

wrapping in a document-array: i don't see any benefit to this. Model.populate already provides "reloading" docs. Cache management doesn't belong in mongoose.

Another option based on your earlier suggestion:

Model.findById(id).populate('friends').exec(function (err, doc) {
  var originalIds = doc.friends;
  var populatedDocs = doc.populated('friends');
})

This means original paths are never overwritten. No "syncing" between original path and populated data would occur when manipulating the original array or populated array, this exercise is left to the user. These would be strictly separate views.

aheckmann · 2013-01-16T19:41:56Z

Going with populated(path) also makes implementing backward compatibility mode simpler, as it'd just use this cache to overwrite the original path before returning the documents.

braunsquared · 2013-01-16T20:01:47Z

@aheckmann Sounds great. I'm just thinking out loud.

As for $in, it depends on the situation. When you throw in $in with sort options mongo doesn't leverage indexes as well as it should and it's very easy to fall out of the index world into the scan complete collection world. On small collections this generally doesn't cause an issue but when you are referring to millions of records of decently sized docs, the performance hit can be detrimental. This was last tested by us with v1.8 of mongod so it may have improved since then.

aheckmann · 2013-01-16T20:18:22Z

@braunsquared thanks for thinking out loud, its very helpful.

re: $in, we're ok b/c the only thing changing is the potential size of the $in array clause. After speaking with Aaron Staple about this (he sits next to me and works on indexing for MongoDB core), the only way this would introduce a new caveat is if a $sort is also used and the size of the $in array result grows so massive that it doesn't fit in RAM within mongo itself, in which case, the user is "doing it wrong".

vedmalex · 2013-01-22T06:16:41Z

Hi i've use population feature in my project and have to rewrite it for personall use.
for population and for Query i use sompething like Query. I've implemented relation definitions and store it somewhere.
so the query is very simple:
{
model:'Person',
conditions:{name:'Alex'},
childRel:'address etc'
}
this will return Person with address and etc relation keys loaded.

or like that
{
model:'Person',
conditions:{name:'Alex'},
childRel:[{
opposite:'address',
condition:{city:'New York'},
exists:true
}]
}

this will load all person that has addres with city New York

the engine can work with relation cardinality 1:1 or 1:* and even :.

the engine just traverse the query tree and builds params for populate relation, and on findExec it loads all the items within document.

childs also can have conditions onlyIds, exists notExists attribute that shape the result tree to fit specific conditions.

fixed "psuedo" typo in Readme

Fixed incorrect annotation in SchemaNumber#min

fixes - cpu spikes - reconnection issues with primaries

https://github.com/promises-aplus/promises-spec adds promise#then, promise#reject

useful when cleaning up old properties that are no longer in the schema. http://bytes.goodeggs.com/post/36553128854/how-to-remove-a-property-from-a-mongoosejs-schema

closes #1278

added doc#populate added Model#populate standardize query,document,Model populate arguments Model .find() .populate({ path: "_creator" , select: "name age" , match: { age: { $gte: 21 }} , options: { sort: { age: -1 }} }) .exec() TODO needs to support multi docs + non Mongoose Documents & friends code is in non-working state

currently retaining _ids for documents not found in populate queries

closes #1260

closes #1171 closes #1184

aheckmann · 2013-01-31T04:40:12Z

closing for now.

aheckmann · 2013-02-03T23:11:45Z

merged into 3.6x

basicallydan · 2013-02-08T10:36:20Z

Was there any decision on the solution to point 3, AKA #601?

aheckmann · 2013-02-08T15:05:05Z

Core population functionality it now exposed for users to do this manually.

basicallydan · 2013-02-08T16:14:04Z

Hey Aaron - thanks, very helpful. What about point 1?

    CompanyMembership.statics.getCompaniesForUser = function (userId, callback) {
        this.find({ user: userId })
        .exec(function (err, memberships) {
            var opts = [
                { path: 'company' },
                { path: 'company.createdBy' }
            ];
            this.populate(memberships, opts, function (err, memberships) {
                // memberships[0].company.createdBy is still just an ObjectId
            });
        }.bind(this));
    };

This isn't working quite how I expect - maybe I've got the wrong syntax? I'm expecting memberships[0].company.createdBy to be an object using whatever model I used as the ref - and I can't work it out from the source.

aheckmann · 2013-02-08T17:35:08Z

a membership looks like: { company: { createdBy: ObjectId(..) }} ?

in that case, populating company is needless. hard to say without seeing schemas. if you determine it to be a bug please open a ticket with code to reproduce.

YourDeveloperFriend · 2014-07-03T17:52:33Z

I know I'm coming in late to this discussion, but I'm wondering if any care has been taken for this comment made by @braunsquared. I'm running into the situation where I have to check to see if the field is populated:

if(_.isObject(job.user_id)) {
  user_id = job.user_id._id;
} else {
  user_id = job.user_id;
}

This is really hacky and unintuitive. I feel uncomfortable with referencing job.user_id as if it were a user object and referencing job.user as if it were a user's id. It's confusing to see job.user_id all over the place, and sometime's it's an _id, and sometimes it's an object. Also, if I decide I want to populate the job model, then I have to find all references to job.user_id and update them.

I found this to be the most relevant place for this comment. If there's somewhere else you'd like me to discuss this issue, I'd be happy to move this comment there.

mattcasey · 2014-07-03T20:34:24Z

@YourDeveloperFriend I agree, I've been working on a new project and already it's awkward explaining to new devs that "sometimes the 'person' field is an object, but in the database it's an ObjectId". however sometimes I do actually want the person field to be a personId, so in the code it makes it confusing. One suggestion is to allow an additional option: a fieldname to populate into. That way it's backwards compatible, if you don't set a special fieldName. It would be more like a virtual field, and we could always expect the initial reference field to exist:

new Schema({
  userId: {
    type: ObjectId,
    ref: 'User',
    populateField: 'user',
  }
})

YourDeveloperFriend · 2014-07-07T23:22:25Z

@mattcasey I totally agree. That was the solution I was envisioning.

Alexander Prinzhorn and others added 2 commits January 14, 2013 21:46

typo in migration docu

31e03da

Merge pull request #1291 from Prinzhorn/master

30c0f2f

Fixed a typo in the migration docu

aheckmann mentioned this pull request Jan 15, 2013

Calling populate on embedded doc of embedded doc #601

Closed

fixed "psuedo" typo in Readme

22468e7

aheckmann and others added 16 commits January 22, 2013 16:44

Merge pull request #1298 from rened/patch-1

13a8762

fixed "psuedo" typo in Readme

Merge branch 'master' of github.com:LearnBoost/mongoose

0b2784f

tests; fix buffer casting test

cb631fb

remove node < 0.6x shims

1f9a460

remove more node 0.4x thangs

bac152b

Fixed incorrect annotation for SchemaNumber#min

2654919

tests; fix name

7285b8c

Merge pull request #1316 from bilalq/master

939a025

Fixed incorrect annotation in SchemaNumber#min

updated; driver to 1.2.11

a4c4035

fixes - cpu spikes - reconnection issues with primaries

Merge branch 'master' of github.com:LearnBoost/mongoose

c3e3fbb

release 3.5.5

7b03b02

now working on 3.5.6

6876837

added; promises/A+ comformancy

f7905dd

https://github.com/promises-aplus/promises-spec adds promise#then, promise#reject

use mpromise

65838ee

updated; muri 0.2.0

f03bbed

Cast string 'true'/'false' to boolean value

052827f

aheckmann added 20 commits January 29, 2013 14:07

adjust test for muri update

2541d12

allow non-strict sets and updates

d2a8556

useful when cleaning up old properties that are no longer in the schema. http://bytes.goodeggs.com/post/36553128854/how-to-remove-a-property-from-a-mongoosejs-schema

updated; muri to 0.3.0

34124f4

added; removing subdoc with _id value support

1f84a8a

closes #1278

updated; mpath to 0.1.0

31d35d4

populate; tests; re-enabled

4bda85d

populate; tests; fix query option test

6cd6939

populate; add utils.array.flatten

f0d59b4

fixed; rebase regression

a7e9994

update; mpath to 0.1.1

7168492

populate; working code (not-backwards compat)

3df29d6

currently retaining _ids for documents not found in populate queries

populate; no need for extra nextTicks

4d50a81

docs

12cf61f

populate; TODO

3f9c443

populate; now compatible with lean

bbe35f1

closes #1260

docs;

8740ef8

old TODO

89aac17

remove old comment

96901ca

refactor internal document properties

ede53a9

closes #1171 closes #1184

aheckmann closed this Jan 31, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[wip] Populate.next roadmap - feedback welcome #1292

[wip] Populate.next roadmap - feedback welcome #1292

aheckmann commented Jan 15, 2013

skotchio commented Jan 15, 2013

aheckmann commented Jan 15, 2013

braunsquared commented Jan 15, 2013

aheckmann commented Jan 15, 2013

braunsquared commented Jan 15, 2013

aheckmann commented Jan 16, 2013

aheckmann commented Jan 16, 2013

braunsquared commented Jan 16, 2013

aheckmann commented Jan 16, 2013

vedmalex commented Jan 22, 2013

aheckmann commented Jan 31, 2013

aheckmann commented Feb 3, 2013

basicallydan commented Feb 8, 2013

aheckmann commented Feb 8, 2013

basicallydan commented Feb 8, 2013

aheckmann commented Feb 8, 2013

YourDeveloperFriend commented Jul 3, 2014

mattcasey commented Jul 3, 2014

YourDeveloperFriend commented Jul 7, 2014

[wip] Populate.next roadmap - feedback welcome #1292

[wip] Populate.next roadmap - feedback welcome #1292

Conversation

aheckmann commented Jan 15, 2013

DO NOT MERGE

1) Poor support for nested path population

2) Cannot further populate a document after population

3) Recursive sub-document population not supported

4) Original _ids not found in populate query do not exist in returned document

5) Inefficient queries

6) Clumsy population syntax when specifying multiple params

7) Assigning values to populated paths casts the value back to an _id

8) Lean queries cannot be populated

NOTES

#1, #2, #3, #6, #8 solutions

#4 solution may require breaking changes

arrays

single properties

solution

skotchio commented Jan 15, 2013

aheckmann commented Jan 15, 2013

braunsquared commented Jan 15, 2013

aheckmann commented Jan 15, 2013

braunsquared commented Jan 15, 2013

aheckmann commented Jan 16, 2013

aheckmann commented Jan 16, 2013

braunsquared commented Jan 16, 2013

aheckmann commented Jan 16, 2013

vedmalex commented Jan 22, 2013

aheckmann commented Jan 31, 2013

aheckmann commented Feb 3, 2013

basicallydan commented Feb 8, 2013

aheckmann commented Feb 8, 2013

basicallydan commented Feb 8, 2013

aheckmann commented Feb 8, 2013

YourDeveloperFriend commented Jul 3, 2014

mattcasey commented Jul 3, 2014

YourDeveloperFriend commented Jul 7, 2014