
On-the-Fly Conversions #45

Closed
aldeed opened this issue Jan 13, 2014 · 19 comments
@aldeed
Collaborator

aldeed commented Jan 13, 2014

I have a rough idea of a way to handle on-the-fly conversions.

  1. Add convert option to schema, false by default.
  2. Override the find method. If convert: true and autoValue is set, call an asynchronous update for all of the documents that find is returning, right before they are returned.
  3. For each update, get the complete current doc, pass that to autoValue for each field needing conversion, then do $set for that field with the returned auto value.

The idea is that most schema changes don't require large scale data conversions. Instead, each document can be converted the first time it is requested. Since we'd do the conversion updates asynchronously, it wouldn't affect find performance all that much. Unconverted documents would be used temporarily until the conversion completes, causing deps update.

One key would be to be able to track when the conversion has run for a doc. This likely requires creating an indexed tracking collection to be used internally by C2. This would do two things: prevent running the conversion more than once per doc, and allow the user to see when a conversion has been run on all docs, meaning that any code handling the old schema can be removed. To do this accurately, we might also need a version option, or maybe convert can be set to a version identifier instead of true.
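The steps above can be sketched outside of Meteor as a plain wrapper around a find function, with an in-memory set standing in for the indexed tracking collection and a Map standing in for the Mongo collection. All names here (`makeFind`, `convertedIds`, `store`) are hypothetical, not Collection2 APIs:

```javascript
// Sketch of steps 1-3: return docs immediately, convert asynchronously.
// `convertedIds` stands in for the indexed tracking collection;
// `store` stands in for the Mongo collection. All names are hypothetical.
const convertedIds = new Set();

function makeFind(store, schema) {
  return function find(matches) {
    const docs = [...store.values()].filter(matches);
    // Step 2: kick off asynchronous conversion updates right before returning.
    for (const doc of docs) {
      if (convertedIds.has(doc._id)) continue; // run at most once per doc
      convertedIds.add(doc._id);
      queueMicrotask(() => {
        // Step 3: pass the complete current doc to each field's autoValue,
        // then "$set" the returned value (here: mutate the stored doc).
        for (const [field, def] of Object.entries(schema)) {
          if (def.convert && def.autoValue) {
            store.get(doc._id)[field] = def.autoValue(doc);
          }
        }
      });
    }
    return docs; // unconverted docs are used until the async update lands
  };
}
```

The key property the sketch shows: the caller gets the unconverted document synchronously, and the conversion happens afterward, so find latency is mostly unaffected.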

Anyone with thoughts, feel free to comment.

@mquandalle
Contributor

How would the convert: true boolean option handle cases where I change my format more than once, without waiting for all documents to be converted? Moreover, I find it a bit confusing to use the autoValue function for that use case.

What about a version option of type Number, plus a convert/upgrade function that receives the old value and the old version number and returns the new value:

createdAt: {
  type: Date,
  autoValue: function () { ... },
  version: 2,
  upgrade: function (oldValue, version) {
    // In v1 `createdAt` was of type String
    if (version === 1)
      return new Date(oldValue);
  }
}

@aldeed
Collaborator Author

aldeed commented Jan 13, 2014

Yes, maybe it will have to be a separate function rather than using autoValue.

I don't think a simple old value -> new value function handles enough cases, though, so I'd rather use the same this context as autoValue has, so that you can examine the values of other keys, too. Although... since we don't have modifier operators to deal with, maybe just passing the entire current doc as another arg is good enough.

@mquandalle
Contributor

But for performance reasons, to avoid making more than one .find(), it's better to store the version in the document itself, in a __version field. This field would be added only if the Collection2 field definition contains a version number.

Then find will automatically add __version to the requested fields, compare it with the version in the schema, and if necessary run the migration to upgrade the doc. If a fields selector was specified in the query, find will delete the __version field before returning. I think this modified find method only needs to run on the server, and should not block queries. Basically, the migration will be run on publication of an old document.
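The version comparison described here can be sketched as a small helper: compare the document's stored __version against the version declared in each field definition, and run the field's upgrade function where the doc is behind. A minimal sketch under the assumptions above; `migrateIfStale` is a hypothetical name:

```javascript
// Sketch: compare a doc's stored __version against the version declared in
// the schema, and run `upgrade` for any field the doc is behind on.
// The __version field and upgrade signature follow the proposal above.
function migrateIfStale(doc, schema) {
  const storedVersion = doc.__version || 1;
  let migrated = false;
  for (const [field, def] of Object.entries(schema)) {
    if (def.version && def.version > storedVersion && def.upgrade) {
      doc[field] = def.upgrade(doc[field], storedVersion);
      doc.__version = def.version;
      migrated = true;
    }
  }
  return migrated; // true if a "$set" update would need to be issued
}
```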

@aldeed
Collaborator Author

aldeed commented Jan 13, 2014

__version will need to be an object with keys corresponding to all the field names, for per-field conversion tracking. Maybe that's what you said, but just making sure. Otherwise, yes, sounds perfect.

@mquandalle
Contributor

Yes, I forgot this :-) So maybe __versions

@testbird

As this essentially seems to introduce partial versioning (and an extra collection), I'll add the following info from mcrider/azimuth#69

@aldeed
Collaborator Author

aldeed commented Jan 16, 2014

As I see it, versioned data is different from versioned schemas. This issue is about supporting loose schema versioning, as a way to do simple on-the-fly data conversions when I decide to change the data model for collections that already contain production data.

Your comments, though interesting, seem to be more about versioned collections, wherein documents are kept over time rather than being replaced. Is that correct, or was I reading too quickly?

@craig-l

craig-l commented Jan 17, 2014

If I were to do a find where one of the fields in the criteria has version set, I would expect that before that find is done, all documents in that collection having an old version of that field should go through the 'migration' process; otherwise those old docs will be missed by find. What do you think?

Example.. using @mquandalle's example:

var date = new Date();
date.setDate(date.getDate() - 1);

Collection.find({createdAt: date})

I need to find docs even if createdAt was stored as a String originally.

@craig-l

craig-l commented Jan 17, 2014

If I were to add a new status field, for example, to a collection that already has production docs, all that should be necessary for the schema is:

status: {
  type: String,
  version: 1,
  allowedValues: ['open', 'closed'],
  upgrade: function (oldValue, version, doc) {
    if (doc.closed === 'yes')
      return 'closed';
    else
      return 'open';
  }
}

If there is no real reason to involve autoValue, then I think it shouldn't be necessary, assuming you take the upgrade function route. I like this pattern for adding new fields and setting "catch-up" values that aren't really defaults. In this example, status isn't a calculated field; rather, it's a field that's updated explicitly.
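The upgrade signature used above, (oldValue, version, doc), could be driven by a small helper like the following sketch. `applyUpgrade` and the per-field `__versions` map (per @aldeed's earlier note) are hypothetical names, not Collection2 APIs:

```javascript
// Sketch: apply one field's `upgrade` function to a doc that predates it,
// passing (oldValue, oldVersion, doc) as in the status example above.
// `applyUpgrade` and the per-doc `__versions` map are hypothetical.
function applyUpgrade(doc, fieldName, def) {
  const oldVersion = (doc.__versions && doc.__versions[fieldName]) || 0;
  if (oldVersion >= def.version) return doc; // already current, nothing to do
  const newValue = def.upgrade(doc[fieldName], oldVersion, doc);
  // Return a new doc with the upgraded value and updated version tracking.
  return {
    ...doc,
    [fieldName]: newValue,
    __versions: { ...doc.__versions, [fieldName]: def.version },
  };
}
```

Applied to a production doc like `{ closed: 'yes' }`, this would fill in `status: 'closed'` on first access and be a no-op afterward.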

@aldeed
Collaborator Author

aldeed commented Jan 17, 2014

@craig-l, I agree with all that. In your date find example, I think that should be cause for kicking off conversions for all the documents synchronously before returning from find. The find method would have to immediately attempt to convert every doc that matches all other query params. Could get costly.

Only a rough thought at the moment, but maybe we need both upgrade and downgrade? Maybe something involving returning a selector or partial selector to identify documents needing updates?

@aldeed
Collaborator Author

aldeed commented Jan 17, 2014

This will be simplest when querying on only fields that do not have the upgrade option, such as a simple _id query. Maybe we could get it working for this case first, and then see about the complex cases.

@testbird

schema vs. data versions

Yes, you are correct: in what I linked, the _version field is used for documents that are kept over time. And I was thinking their (old) schemas would also have to be considered if one wants on-the-fly conversion. Trying to cram my thoughts into a nutshell:

If a collection is set to be versioned: true, that "vermongo" scheme copies the current document version into a separate versions history collection, before updating it. (Maybe that pattern could be further enhanced by writing a new version to both collections right away, or storing older versions in the same collection.)

Different language versions of a post or article content, from what I read about the mongo text index, need to be stored in separate mongo documents with their language field set. They would also require a pattern to store common, non localized and meta data of content.

Now, schema version updates (on the fly or bulk) add another important vector to this. If I understand it correctly, from a C2 perspective they introduce the presence of different schemas in the same collection (nothing unusual in the mongo world).

Incidentally, maybe you already noticed that in issue #54 I posted some collection design ideas that I came across, and one is to see schemas as linked to documents rather than collections.

So all the points above introduce different schemas into C2 collections, and these should be trackable across document versions and between and within collections.

OK, where to start? (I really don't know.) But just to throw out something to compare ideas:

  1. Track document schemas with name and version: _schema: { myschema: 123 }? (Possibly break that down to the field level, as noted above.)
  2. Track document versions with _version: number?
  3. Be aware that the schemas for old versions will also need updates.
  4. ...?

@testbird

BTW: The meteor book update is said to cover a new migrations package now: https://github.com/percolatestudio/meteor-migrations
Do you also see that the on-the-fly tracking mechanism could facilitate migrations executed as background (worker) processes?

@aldeed
Collaborator Author

aldeed commented Jan 18, 2014

I was really thinking of very simple migrations only. I've looked briefly at meteor-migrations, and I think it's probably a really good solution, but it doesn't seem to be designed for on-the-fly migrations. I try to migrate data on the fly for pretty much every schema change I make.

I see your point about schema versioning having an impact on the ability to version documents. We'll keep it in mind. I have no specific ideas at the moment.

@testbird

I try to migrate data on the fly for pretty much every schema change I make.

What do you think about data that is only very seldom accessed after a while, like archives or old versions?

It doesn't feel right to me to just leave those untouched, requiring continuous consideration of, and dependence on, the migration code (e.g. avoiding direct db accesses).
I think your on-the-fly approach is superior to plain migration-run functions, as it tracks migrations per document. I think it could also enable secure, reliable, complete migrations very nicely, simply by triggering additional/artificial accesses in a way that ensures a migration completes at the desired pace and in the desired way (e.g. at night).

@mquandalle
Contributor

BTW, "on-the-fly migration" probably deserves its own package. There is no real reason to implement this in Collection2. Moreover, c2 schemas are usually defined in files shared by the client and the server, and we probably want to keep migrations on the server side only.

Since collection2 now overwrites the native Meteor.Collection object, it will be possible for a third-party package to add this feature using the same technique.

@aldeed
Collaborator Author

aldeed commented Feb 27, 2014

Agree


@aldeed aldeed closed this as completed Aug 9, 2014
@comerc

comerc commented Dec 14, 2015

Do you know about it?

transform Function
An optional transformation function. Documents will be passed through this function before being returned from fetch or findOne, and before being passed to callbacks of observe, map, forEach, allow, and deny. Transforms are not applied for the callbacks of observeChanges or to cursors returned from publish functions.
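The behaviour that documentation describes (every document passed through the transform before fetch or findOne returns it) can be sketched in plain JavaScript. `fetchWithTransform` is a hypothetical stand-in for a Meteor cursor's fetch, not a Meteor API:

```javascript
// Sketch of Meteor's `transform` collection option: each document is
// passed through the transform before being returned from fetch.
// `fetchWithTransform` is a hypothetical stand-in, not a Meteor API.
function fetchWithTransform(docs, transform) {
  return docs.map((doc) => (transform ? transform(doc) : doc));
}
```

Note that transform only affects what callers see; unlike the on-the-fly conversions discussed above, it never writes the converted value back to the database, so every read pays the conversion cost again.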

@Floriferous

Floriferous commented Aug 29, 2019

I find myself needing this, as we're often writing a lot of unnecessary migrations just to add defaultValues, autoValues, etc. to production data.

The important part is to not update some autoValues (like updatedAt, for example).

I guess a simple function like this should do the job; just call it from a method somewhere (requires dburles:mongo-collection-instances):

import { Mongo } from 'meteor/mongo';

import omit from 'lodash/omit';

const skippedCollections = ['_cacheMigrations', 'grapher_counts'];
const skippedFields = ['_id', 'createdAt', 'updatedAt'];

const makeCleanDocument = (collection, schema) => ({ _id, ...doc }) => {
  const cleanDoc = schema.clean(doc, {
    mutate: true,
    filter: true,
    autoConvert: true,
    removeEmptyStrings: false,
    trimStrings: true,
    getAutoValues: true,
  });

  const withoutSkippedFields = omit(cleanDoc, skippedFields);

  // Sometimes empty documents can slip through, and the update will fail because $set is empty
  if (!withoutSkippedFields || Object.keys(withoutSkippedFields).length === 0) {
    console.log('empty document', _id);
    return Promise.resolve();
  }

  return collection.instance
    .rawCollection()
    .update({ _id }, { $set: withoutSkippedFields });
};

const cleanCollection = (collection) => {
  if (
    !collection.name
    || skippedCollections.includes(collection.name)
    || !collection.instance._c2
  ) {
    return;
  }

  const schema = collection.instance._c2._simpleSchema;

  const allDocuments = collection.instance.find({}).fetch();

  return Promise.all(allDocuments.map(makeCleanDocument(collection, schema)));
};

export const cleanAllData = async () => {
  const collections = Mongo.Collection.getAll();

  await Promise.all(collections.map(cleanCollection));
};
