Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] Populate.next roadmap - feedback welcome #1292

Closed
wants to merge 42 commits into from
Closed

Conversation

aheckmann
Copy link
Collaborator

DO NOT MERGE

For release 3.6 our primary objective is a rewrite of the populate functionality. Currently it suffers from several limitations:

1) Poor support for nested path population

Given a schema like the following:

new Schema({
    name: String
  , em: [{ arr: [{ person: { type: Schema.ObjectId, ref: 'Person'}}] }]
});

An attempt to populate the following path will not work as expected:

Model.find().populate('em.arr.person').exec(console.log)
// Every em.arr.person is still an ObjectId

2) Cannot further populate a document after population

Model.findByid(id, function (err, doc) {
 doc.populate('after.the.fact', callback)
})

3) Recursive sub-document population not supported

References more than one level deep are not resolved.

a.populate('b') ->
  b.populate('c')

#601

4) Original _ids not found in populate query do not exist in returned document

Given the following document:

{ _id: 3, friends: [7,8,34] }

and we query/populate the documents friends

Model.find().populate('friends').exex(cb)

All friends ids that are not found during population are dropped on the floor. For example, if friends 7 and 34 are found, 8 is dropped from the returned document:

Model.find().populate('friends').exex(function (err, doc) {
  console.log(doc.friends) // [{ _id: 7, name: 'luigi' }, { _id: 34, name: 'mario' }]
})

5) Inefficient queries

We currently execute a separate query per document per path.

6) Clumsy population syntax when specifying multiple params

Model.find().populate(path, [fields], [model], [conditions], [options])

http://mongoosejs.com/docs/api.html#query_Query-populate

7) Assigning values to populated paths casts the value back to an _id

#570

8) Lean queries cannot be populated

#1260

NOTES

So far, items #1, #2, #5, #6, #8 have been fixed in the populate-wip branch.

#1, #2, #3, #6, #8 solutions

The core populate functionality is being rewritten and exposed through Model.populate(docs, paths, cb). This core functionality will now also be made accessible through document#populate(path, options, cb). Query#populate will still exist as in previous versions.

This refactor and exposure opens the door to populating plain js objects as well as mongoose documents, meaning population of documents returned from lean queries will now be supported.

The syntax of Query#populate() is being cleaned up (yet backward compatible). Its interface will support passing an options object containing all knobs instead of having to pass nulls as placeholders:

// < 3.6
Model.find().populate(path, null, null, { sort: { name: -1 }})

// >= 3.6
var o = {};
o.path = path;
o.options = { sort: { name: -1 }}};
Model.find().populate(o)

This syntax also supports populating multiple paths with the same options:

// >= 3.6
var o = { path: 'path1 path2', options: { sort: { name: -1 }}};
Model.find().populate(o)

As for auto-populating multi-level deep refs, this will be supported through the public exposure of Model.populate(docs, options, cb) which can be utilized to continue populating returned documents ad infinitum.

#4 solution may require breaking changes

Original _ids not found in populate query do not exist in returned document

arrays

For populated arrays we could associate the original ids as an ad-hoc property so we don't lose anything. However, it makes managing array manipulation a headache, ensuring the ad-hoc list of original _ids stays in sync with the exposed array of populated documents in cases where the user then saves the original document is pretty silly.

single properties

When single fields are populated and no value was found (due to specified query matchers etc) we currently replace the value with null (the value returned by the query). This means we lose the original _id. No bookkeeping is really necessary in this case b/c the null value will not get saved unless the user modifies it somehow. But, we still lose access to the original _id.

solution

When no matching document is found for a given _id, we leave that _id in the document. This means that no _ids would be lost after population and the risk of manipulating a populated array and saving the document back to the db would be mitigated. The result is that populated arrays may contain a mix of _ids and documents and it will be up to your code to manage either type. No bookkeeping required. Same goes for populating an individual path, the result could be a path with either the original _id or the found document. At this time, this is the direction we are taking but by no means is this final yet. The goal is to remain 3.x backward compatible and hide this new behavior behind and option. Feedback is welcome.

Alexander Prinzhorn and others added 2 commits January 14, 2013 21:46
@skotchio
Copy link

I do not agree with the 4 point. I think at now it works properly. Why should I have an ids whose documents don't exists in database?

@aheckmann
Copy link
Collaborator Author

@vovan22 users are allowed to pass query conditions to filter out docs as well.

query.populate(path, fields, match, options)

@braunsquared
Copy link

@vovan22, I would agree in the sense that the model layer should never make changes to the underlaying data without your knowledge. If it drops ids from an array of references without your knowledge then you have potential for even more corruptions in your data set.

As for #4 in general, I think consistency of data structures is critical here. The array either needs to be an array of model objects or an array of ids. There should never be a situation where the structures change depending on whether someone has previously populated a reference or not. ie. When trying to get the id of a referenced document in an array I should never have to use instanceof operators.

I think population in general needs to allow for access to the direct field with the option of retrieving the referenced document. If a field in mongo is an array of object ids then I should be able to access that field as an array of object ids or optionally retrieve the referenced documents via a populate call, but the field itself should remain an array of object ids.

For example if I have a model called Post which references an array of Subscribers as subscribers then:

post.subscribers should return an array of object ids.
post.populate('subscribers') should return an array of documents.

if populate is smart enough, it will cache the response so multiple calls to populate would return the same list. If modifications to the subscribers array are made, the cache can be dropped or altered. I know keeping these in sync would be difficult.

If I am confused in these please help clarify.

@aheckmann
Copy link
Collaborator Author

@braunsquared I agree that we need to expose both the original _ids as well as populated docs separately somehow. the problem i see with caching within populate is cases where we really do want to re-populate based on changes that may have occurred in the db, more of a refresh scenario. now we'd have to expose the cache management as well so users could disable it etc which is leaky IMO, but easily done within a plugin if desired.

@braunsquared
Copy link

@aheckmann I agree with what you are saying and understand the complexity of adding a cache management system which can add lots of bloat as well. My main concern is keeping it consistent and preventing the need to pass references around to populated fields outside of the primary object reference.

What about the possibility of wrapping the populate functionality in a DocumentArray class which understands how to populate the array of references, the query that was used etc etc. We could then manually reload if need be rather than bombard the db with unnecessary duplicate queries?

As for non-array fields. IMO the field should always return an object reference regardless of whether it is populated or not. If it hasn't been populated, the document will only have an _id field and we can call populate/load on the object directly.

Also, please use the $in call very carefully. From our experience it is faster to perform multiple queries than use the $in operator even when proper indexes are in use.

@aheckmann
Copy link
Collaborator Author

@braunsquared I've not observed any performance impacts when using $in.

non-array fields: "fake" docs don't seem to help this scenario. if anything it makes it less clear that a document was not found. i don't want to think any more than i have to.

wrapping in a document-array: i don't see any benefit to this. Model.populate already provides "reloading" docs. Cache management doesn't belong in mongoose.

Another option based on your earlier suggestion:

Model.findById(id).populate('friends').exec(function (err, doc) {
  var originalIds = doc.friends;
  var populatedDocs = doc.populated('friends');
})

This means original paths are never overwritten. No "syncing" between original path and populated data would occur when manipulating the original array or populated array, this exercise is left to the user. These would be strictly separate views.

@aheckmann
Copy link
Collaborator Author

Going with populated(path) also makes implementing backward compatibility mode simpler, as it'd just use this cache to overwrite the original path before returning the documents.

@braunsquared
Copy link

@aheckmann Sounds great. I'm just thinking out loud.

As for $in, it depends on the situation. When you throw in $in with sort options mongo doesn't leverage indexes as well as it should and it's very easy to fall out of the index world into the scan complete collection world. On small collections this generally doesn't cause an issue but when you are referring to millions of records of decently sized docs, the performance hit can be detrimental. This was last tested by us with v1.8 of mongod so it may have improved since then.

@aheckmann
Copy link
Collaborator Author

@braunsquared thanks for thinking out loud, its very helpful.

re: $in, we're ok b/c the only thing changing is the potential size of the $in array clause. After speaking with Aaron Staple about this (he sits next to me and works on indexing for MongoDB core), the only way this would introduce a new caveat is if a $sort is also used and the size of the $in array result grows so massive that it doesn't fit in RAM within mongo itself, in which case, the user is "doing it wrong".

@vedmalex
Copy link
Contributor

Hi i've use population feature in my project and have to rewrite it for personall use.
for population and for Query i use sompething like Query. I've implemented relation definitions and store it somewhere.
so the query is very simple:
{
model:'Person',
conditions:{name:'Alex'},
childRel:'address etc'
}
this will return Person with address and etc relation keys loaded.

or like that
{
model:'Person',
conditions:{name:'Alex'},
childRel:[{
opposite:'address',
condition:{city:'New York'},
exists:true
}]
}

this will load all person that has addres with city New York

the engine can work with relation cardinality 1:1 or 1:* and even :.

the engine just traverse the query tree and builds params for populate relation, and on findExec it loads all the items within document.

childs also can have conditions onlyIds, exists notExists attribute that shape the result tree to fit specific conditions.

added doc#populate
added Model#populate
standardize query,document,Model populate arguments

Model
.find()
.populate({
    path: "_creator"
  , select: "name age"
  , match: { age: { $gte: 21 }}
  , options: { sort: { age: -1 }}
})
.exec()

TODO
  needs to support multi docs + non Mongoose Documents & friends
  code is in non-working state
currently retaining _ids for documents not found in populate queries
@aheckmann
Copy link
Collaborator Author

closing for now.

@aheckmann aheckmann closed this Jan 31, 2013
@aheckmann
Copy link
Collaborator Author

merged into 3.6x

@basicallydan
Copy link

Was there any decision on the solution to point 3, AKA #601?

@aheckmann
Copy link
Collaborator Author

Core population functionality it now exposed for users to do this manually.

@basicallydan
Copy link

Hey Aaron - thanks, very helpful. What about point 1?

    CompanyMembership.statics.getCompaniesForUser = function (userId, callback) {
        this.find({ user: userId })
        .exec(function (err, memberships) {
            var opts = [
                { path: 'company' },
                { path: 'company.createdBy' }
            ];
            this.populate(memberships, opts, function (err, memberships) {
                // memberships[0].company.createdBy is still just an ObjectId
            });
        }.bind(this));
    };

This isn't working quite how I expect - maybe I've got the wrong syntax? I'm expecting memberships[0].company.createdBy to be an object using whatever model I used as the ref - and I can't work it out from the source.

@aheckmann
Copy link
Collaborator Author

a membership looks like: { company: { createdBy: ObjectId(..) }} ?

in that case, populating company is needless. hard to say without seeing schemas. if you determine it to be a bug please open a ticket with code to reproduce.

@YourDeveloperFriend
Copy link

I know I'm coming in late to this discussion, but I'm wondering if any care has been taken for this comment made by @braunsquared. I'm running into the situation where I have to check to see if the field is populated:

if(_.isObject(job.user_id)) {
  user_id = job.user_id._id;
} else {
  user_id = job.user_id;
}

This is really hacky and unintuitive. I feel uncomfortable with referencing job.user_id as if it were a user object and referencing job.user as if it were a user's id. It's confusing to see job.user_id all over the place, and sometime's it's an _id, and sometimes it's an object. Also, if I decide I want to populate the job model, then I have to find all references to job.user_id and update them.

I found this to be the most relevant place for this comment. If there's somewhere else you'd like me to discuss this issue, I'd be happy to move this comment there.

@mattcasey
Copy link

@YourDeveloperFriend I agree, I've been working on a new project and already it's awkward explaining to new devs that "sometimes the 'person' field is an object, but in the database it's an ObjectId". however sometimes I do actually want the person field to be a personId, so in the code it makes it confusing. One suggestion is to allow an additional option: a fieldname to populate into. That way it's backwards compatible, if you don't set a special fieldName. It would be more like a virtual field, and we could always expect the initial reference field to exist:

new Schema({
  userId: {
    type: ObjectId,
    ref: 'User',
    populateField: 'user',
  }
})

@YourDeveloperFriend
Copy link

@mattcasey I totally agree. That was the solution I was envisioning.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants