
[v1.0.0] DML migrations (seeders) #687

Open · 3 tasks · wzrdtales opened this issue Apr 19, 2020 · 7 comments

wzrdtales (Member) commented Apr 19, 2020

Description

DML migrations, commonly referred to as seeders (although that term is not entirely precise for what DML migrations in db-migrate are), are reversible operations that manipulate the data inside the schema previously defined by regular DDL migrations.

Implementation Details

There will be two modes with different APIs, one handled by db-migrate, the other handled by the user. The user-handled one is an atomically wrapped API exposing DML functionality, essentially direct access to the database with runSql or other driver-provided functions. The db-migrate-handled one follows the v2 principle of perfect traceability and atomicity. It works hand in hand with the DDL, preparing certain standards on the tables (such as flag fields), and will step by step provide a thorough API capable of transforming not just big but very large tables.
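
To sketch the user-handled mode, here is a minimal example assuming the v2 entry points used in the drafts below; runSql is the existing driver escape hatch, while the rollback entry point name is an assumption:

exports.migrate = async (db, opt) => {
  // Direct access to the database through the driver: db-migrate only
  // wraps the calls atomically, the statements are the user's own.
  await db.runSql(
    'INSERT INTO test (column1, column2) VALUES (?, ?)',
    ['1', '2']
  );
};

// Hypothetical reverse operation, mirroring the migrate entry point.
exports.rollback = async (db, opt) => {
  await db.runSql('DELETE FROM test WHERE column1 = ?', ['1']);
};

exports._meta = {
  version: 2,
  DML: true
};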

Big Table Operations

Big tables have the problem that changes made to them directly affect the running system. Besides the necessity to follow principles like zero-downtime mitigations, it is also necessary to operate on the tables in small steps, so that the live system can still work on the tables in question. This becomes especially interesting for databases that do not provide online schema changes, since it can be mitigated with cheaper DML (such as adding notNull: false columns instead of defaultValue fields and migrating the change over, or using copy actions instead of type changes). These operations will be embedded into the DDL.
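
To make the small-steps idea concrete, here is a rough sketch (not the proposed API) of updating a big table in limited batches, assuming a MySQL-style driver where runSql resolves with an affectedRows count and supports UPDATE ... LIMIT; batch size and pause are arbitrary example values:

// Illustrative only: each statement touches at most BATCH_SIZE rows,
// so locks are held briefly and live traffic gets room in between.
async function batchedUpdate(db) {
  const BATCH_SIZE = 1000;
  let affected;

  do {
    const result = await db.runSql(
      'UPDATE test SET migrated = 1 WHERE migrated = 0 LIMIT ' + BATCH_SIZE
    );
    affected = result.affectedRows;

    // Yield between batches to leave room for the live system.
    await new Promise((resolve) => setTimeout(resolve, 100));
  } while (affected > 0);
}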

Streaming continuable operations

This needs tracking of the current execution status, which will be twofold. First, the state manager will carry the current state parameters for the currently running seeding operations; second, the flag field will mark successfully updated rows so we do not depend on transactions. All operations will resolve on completion, but will also provide a streaming interface, since DML in particular operates on massive amounts of data, which requires the capability to work with partial data instead of the full available data set.
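
A rough sketch of how the flag field keeps such an operation continuable; the flag name seeded, the row shape returned by runSql, and the transformRow callback are all hypothetical, and the state-manager side is omitted:

// Hypothetical resume loop: rows with seeded = 1 were already processed,
// so a crashed run can simply start over and skip completed work instead
// of relying on one long transaction.
async function continuableSeed(db, transformRow) {
  for (;;) {
    // Fetch the next slice of unprocessed rows (driver-dependent shape).
    const rows = await db.runSql(
      'SELECT id, column1 FROM test WHERE seeded = 0 LIMIT 500'
    );
    if (rows.length === 0) break; // nothing left, the operation completed

    for (const row of rows) {
      await transformRow(row);
      // Mark the row as done; on restart it will be skipped.
      await db.runSql('UPDATE test SET seeded = 1 WHERE id = ?', [row.id]);
    }
  }
}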

Base operations

The db-migrate handled feature set will include the following base operations:

  • [ ] Insert
  • [ ] Delete
  • [ ] Update
@wzrdtales wzrdtales self-assigned this Apr 19, 2020
@wzrdtales wzrdtales added this to the v1.0.0 milestone Apr 19, 2020
@wzrdtales wzrdtales changed the title [v1.0.0] Seeder [v1.0.0] DML migrations (seeders) Apr 19, 2020
wzrdtales (Member, Author) commented Apr 19, 2020

First drafts of interfaces

insert

exports.migrate = async (db, opt) => {
  const { util } = opt.dbm;

  await db.insert(
    'test',
    // can receive an object or an array of same-structured objects
    util.insertObj([
      { column1: '1', column2: '2' },
      { column1: '1', column2: '3' },
      { column1: '2', column2: '4' }
    ])
  );
  /**
   * For efficiency insertObj shouldn't be used on big operations
   * The previous command essentially expands to:
   */
  await db.insert('test', {
    columns: ['column1', 'column2'],
    data: ['1', '2', '1', '3', '2', '4']
  });

  /**
   * The streaming interface can be accessed by passing a config object
   * with the predefinition of the incoming data and the stream to consume
   * from.
   */
  const streamConfig = {
    columns: ['column1', 'column2'],
    stream: myDataStream
  };
  await db.insert.stream('test', streamConfig);

  /**
   * Streaming can be used with helper functions such as structured readers.
   */
  await db.insert.stream(
    'test',
    util.readStructured('./data.csv', { delimiter: ';' })
  );

  /**
   * And of course it can also read data from other tables.
   */
  await db.insert.from('test', 'readme', {
    map: {
      source1: 'column1',
      source2: 'column2'
    },
    search: 'ventilation NOT NULL'
  });


  /**
   * Instead of the search shorthand, the get property
   * gives access to get directly.
   */
  await db.insert.from('test', 'readme', {
    map: {
      source1: 'column1',
      source2: 'column2'
    },
    get: { /* ... */ }
  });
};

exports._meta = {
  version: 2,
  DML: true
};

Dot notation might as well be integrated into the object as a mode property; not sure yet, probably both possibilities will be provided.
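
For illustration, the two spellings would then presumably be equivalent (reusing the streamConfig/myDataStream names from the draft above; the mode key is hypothetical):

exports.migrate = async (db, opt) => {
  const streamConfig = {
    columns: ['column1', 'column2'],
    stream: myDataStream
  };

  // Dot notation, as drafted above:
  await db.insert.stream('test', streamConfig);

  // Hypothetical equivalent with a mode property on the config object:
  await db.insert('test', { ...streamConfig, mode: 'stream' });
};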

get

exports.migrate = async (db, opt) => {
  await db.get('test', {
    columns: ['column1']
  });

  /**
   * columns can be aliased
   */
  await db.get('test', {
    columns: {
      column1: 'maxima'
    }
  });

  /**
   * We avoid tapping into complicated languages to define
   * the where clause. Instead you just use SQL or mongo schema,
   * depending on the db you use. If you want genericity, we
   * suggest using helper libraries like sql2mongo or vice versa. This
   * allows flexibility and also avoids the need to learn too many new
   * things and to reinvent the wheel just for a translation layer.
   */
  await db.get('test', {
    columns: {
      column1: 'maxima',
      column2: 'something'
    },
    search: 'maxima >= 12'
  });

  /**
   * However, we will support the simplest operations upfront. Everything
   * else can use tooling. We will provide a plugin hook so this can, if
   * needed, be tightly integrated into the migrations.
   */
  await db.get('test', {
    columns: {
      column1: 'maxima',
      column2: 'something'
    },
    search: {
      maxima: 15
    }
  });

  /**
   * When doing more than searching, you want to access it in query mode.
   * You will have to write the whole definition by hand. This guarantees
   * that we still understand what is happening (the type of operation),
   * but gives ultimate flexibility.
   *
   * The table param is optional; when using SQL, for example, it will not
   * be needed. Databases like mongodb will need it however, so it might be
   * recommended as a best practice not to omit it, depending on how final
   * the choice of database is.
   *
   * Adding LIMITs yourself is possible, but when in doubt leave this to
   * db-migrate.
   */
  await db.get('test', {
    query: 'DISTINCT value1 FROM test' // select will be forcefully prepended so we always know this will only get data
  });
};

exports._meta = {
  version: 2,
  DML: true
};

wzrdtales (Member, Author) commented Apr 19, 2020

Calling some of the latest contributors and previous feedback givers :)

/cc @BorntraegerMarc @Marsup @rfrm

As always, feedback and ideas are welcome.

BorntraegerMarc (Contributor) commented
Thanks for requesting some feedback.

First of all, I'd like to say that I'm not really a database migration expert and that I'm coming from the mongodb world. Generally, I'd say a concept like this is very much needed. So here are a couple of points from my side:

  1. Maybe it's not the right place to bring this up, but I'd like to point out that when doing migrations I usually use the underlying driver and not the library's functions, because I find it easier to do exactly what I want. So maybe, to save some time, we could consider only exposing the driver to users.
  2. Maybe I don't quite understand the streaming continuable operations: do they update the database in a staged fashion? So would the database ever be partially migrated? If that's the case, then I don't think they are so useful to us, especially because applications I worked on cannot handle partially migrated data.
  3. In a kubernetes deployment: how would we orchestrate two simultaneously running seeders so that the migration is only run once?

jtassin commented Apr 20, 2020

@wzrdtales It's a nice thought.
What could an update look like with this syntax?

wzrdtales (Member, Author) commented Apr 20, 2020

> Maybe it's not the right place to bring this up, but I'd like to point out that when doing migrations I usually use the underlying driver and not the library's functions, because I find it easier to do exactly what I want. So maybe, to save some time, we could consider only exposing the driver to users.

This will be the user-controlled version mentioned. However, you will lose any and all benefits of db-migrate's new capabilities in v2 if you do that. I will, however, see how to mitigate that and keep the flexibility.

> Maybe I don't quite understand the streaming continuable operations: do they update the database in a staged fashion? So would the database ever be partially migrated? If that's the case, then I don't think they are so useful to us, especially because applications I worked on cannot handle partially migrated data.

Yes and no.

The migrations would run through completely, but they let your old application keep working, by leaving room for other queries to be executed and not locking the table all the time.

The reason your applications can't handle those is probably that you never cared about zero downtime yet and did not follow the principles needed to accomplish it. That might be okay in the field you're working in, but most fields today would invest the tiny amount of extra time it takes to accomplish zero downtime, and often save millions by that (we are talking about very big tables where those migration processes otherwise mean an hour of downtime or longer). I am a consultant for a lot of big companies, so I know the direct needs of the industry very well; that is the main reason why I am so focused on reaching a stable framework providing zero-downtime capabilities.

> In a kubernetes deployment: how would we orchestrate two simultaneously running seeders so that the migration is only run once?

This is the task of the concurrency manager, also a v2 migration feature, but it will be applied to v1.0.0 of db-migrate in general. Please see and read #538 for that. In the future there will be an event emitter, so you can even partially activate new functions in your application depending on which migrations have already run, instead of waiting for all of them to complete.

wzrdtales (Member, Author) commented
@jtassin Pretty much like insert.from, plus simple interfaces that look like insert but accept a search param/query; roughly like the sketch below.
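
Extrapolated from that description and the insert/get drafts above (not a committed interface), an update draft might look like this:

exports.migrate = async (db, opt) => {
  // Hypothetical: like insert, but a search param restricts the rows.
  await db.update('test', {
    columns: ['column1'],
    data: ['new value'],
    search: 'column2 >= 12'
  });

  // Hypothetical: like insert.from, copying values from another table.
  await db.update.from('test', 'readme', {
    map: {
      source1: 'column1'
    },
    search: 'ventilation NOT NULL'
  });
};

exports._meta = {
  version: 2,
  DML: true
};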

ajw725 commented Jan 27, 2021

@wzrdtales I didn't mean my comment on #292 to sound snarky. I understand you (and anyone else who maintains OSS) have lots of other things to do. I wasn't asking for functionality that doesn't exist yet; I was just pointing out that the current state of the documentation is a bit confusing, because it mentions something that doesn't really seem to be implemented. If the feature exists, whatever is there should be fully documented, and if not, that's fine, but IMO it should be removed from the docs until if/when it's ready.

I would love to help out if I can find some time, and will keep an eye on this in case I can.
