Database restrictions #228

rafamanzo · 2016-07-29T17:35:40Z

We need to discuss wether database restrictions are relevant for a Rails application. The database restrictions are nice to ensure data consistency at insertion time. On the other hand raw database validations do not get along with Rails so well as ActiveRecord ones when handling errors.

@mezuro/core input here is welcome.

Initial discussion has been made through #215

My personal opinion is to leave validations with ActiveRecord because:

Handles better errors to the user
Easier to Rails developers maintain as the restrictions are explicit within the Model's code
Do not duplicate validations
Backwards compatible

danielkza · 2016-07-29T17:47:57Z

This would all be fine if validations could actually do their job. They cannot, and the existence of hundreds of thousands of garbage records in our database proves that. This post sums up my thoughts quite well: DRY is great, but it doesn't matter when you replace the repetition with one source that doesn't do it's job.

Handles better errors to the user

The validations will still exist and cover 99% of the errors. The foreign keys will catch the 1% of errors that the validations fail to. We'll see them clearly as exceptions instead of allowing the garbage in and missing the bug completely.

Easier to Rails developers maintain as the restrictions are explicit within the Model's code

Again, it doesn't matter how easy it is if it doesn't work. Also, I would really prefer that the developer adding the validation understands how it actually fits into the physical model and how it will actually work. That developer will have 0 issue adding a foreign_key line to a migration.

Do not duplicate validations

I already made this point, but DRY doesn't matter if it doesn't work. For example, why do we duplicate unit and acceptance tests? Because they do different jobs, and one cannot give us the same guarantees the other does. The same applies here: the only part of the stack that is in position to effectively enforce constraints, specially those that are relations between different models/tables, is the database itself.

Backwards compatible

What do you mean?

rafamanzo · 2016-07-29T17:54:22Z

What do you mean about they not doing their job? Looks like they are performing well after they were introduced.

I do like the idea of introducing something that raises exceptions when the standard rails validations fail to catch some case. But still, what is the insertion time impact of the duplicate checks?

You seem to insist Rails' validators do not work at KalibroProcessor, can you write a test to assert that? This can become an PR to Rails' core if we can show that.

Can you give an example of validation made through database impossible through ActiveRecord?

Backwards compatible means no already existent data removal requirement for enforcing new validations.

danielkza · 2016-07-29T18:14:44Z

What do you mean about they not doing their job? Looks like they are performing well after they were introduced.

How so? They let hundreds of thousands of garbage records in. Even recent ones such as those added with HotspotMetricResult.

I do like the idea of introducing something that raises exceptions when the standard rails validations fail to catch some case. But still, what is the insertion time impact of the duplicate checks?

The Rails validation's SELECT probably takes an order of magnitude longer than the DB's own check, due to having to parse the query, create a query plan, execute, collect results and send back, vs. just doing the check and raising an error or not.

Also, considering how poorly optimized everything else about processing is, duplicate integrity checks are the least of our worries.

You seem to insist Rails' validators do not work at KalibroProcessor, can you write a test to assert that? This can become an PR to Rails' core if we can show that.

I never said the validations don't work by themselves, but in the context of kalibro_processor. This is not my opinion, it is a fact evidenced by the existence of many invalid records in the mezuro.org database. You can verify it yourself by the select queries in the CleanInconsistencies migration in the mezuro.org DB.

Constraints in the database cannot be forgotten or ignored, as we did in dozens of places in the acceptance tests. They do not let subtle invisible errors creep in as if they didn't exist - which, again, we observed with just Rails validations.

Can you give an example of validation made through database impossible through ActiveRecord?

Every one that involves associations. The AR version is just a poor substitute that cannot handle race conditions, external changes and buggy code that doesn't handle validations correctly (which we have already seen). There is a huge difference between ensuring you do no attempt to insert invalid data given some conditions, and ensuring invalid data does not exist. The latter is much stronger than the former.

Backwards compatible means no already existent data removal requirement for enforcing new requirements.

Then we're going to need to remove the validations too, because in the first time any operation is attempted with the invalid records they will fail. One by one at different times instead of all at once.

I won't go into this discussion again, but keeping garbage data does not make it good data. It is still garbage, it will still break when it is used in any meaningful way, and it is still a violation of the model - foreign keys existing or not.

Rails has many good ideas, but ignoring the tools that databases provide to enforce data integrity is not one of them. If we have no need for consistency, then we might as well be using a much faster NoSQL database and using an architecture that doesn't have or doesn't care about those inconsistencies.

rafamanzo · 2016-07-29T18:46:39Z

So are you blaming ActiveRecord for a validation we have not programed? Seems strange... The invalid records I see are previous to most of the consistency validations added. If you insist on that I'm afraid I cannot agree with that as those invalid data is our flaw not ActiveRecord's. And worse, DB restrictions are subject to the same human flaw of not foreseeing its necessity as ActiveRecord were for what you're blaming it for.

Can you give examples of invalid data inserted after v1.1.5 deployment? If they exist, can you write an test case to reproduce that? Is impossible to fix this test case using ActiveRecord?

You have dodged my question about insertion time impact. The closest to an answer I can get is that the extra time is negligible compared to the whole, right? Can you provide references or evidences of that?

You've made lots of affirmations regarding ActiveRecord: "The AR version is just a poor substitute that cannot handle race conditions, external changes and buggy code that doesn't handle validations correctly (which we have already seen)". So I repeat my request from the last comment: please if you can write tests that reproduce these issues, it can become a PR and a larger discussion at Rails core.

If you want to leave Rails you can open a new discussion on a new issue. But to be as clear as possible I do need to stress this: the objective here is to discuss wether DB constraints are worth the extra code complexity, data loss and insertion time. ActiveRecord removal is not in question here as we rely on it for a lot of other features than just data consistency.

rafamanzo · 2016-07-29T19:00:49Z

Obviously HotspotMetricResult has invalid records. It has no validations :)

danielkza · 2016-07-29T19:12:20Z

edit: I was excessively harsh in some points here. It wasn't my intention to offend and I apologize if I failed at that.

So are you blaming ActiveRecord for a validation we have not programed? Seems strange...

Except for the part where I explicitly said that there is nothing wrong the validations themselves. The paradigm of using something other than the database to validate constraints is the problem. It could be any ORM.

The invalid records I see are previous to most of the consistency validations added. I

Did you actually look at them? I did. There are invalid records of HotspotMetricResults too, which we added recently.

And worse, DB restrictions are subject to the same human flaw of not foreseeing its necessity as ActiveRecord were for what you're blaming it for.

Let's be clear: I'm not blaming AR, I'm blaming poor engineering and believing that there is no need to think about or understand the database.

Can you give examples of invalid data inserted after v1.1.5 deployment? If they exist, can you write an test case to reproduce that? Is impossible to fix this test case using ActiveRecord?

None of our tests actually test that relational validations work, because we mock everything.

You have dodged my question about insertion time impact.

As you have ignored my questions about data integrity. And yes, the difference will be neglibible compared to the current performance situation of the whole code. No questions ever existed about how terrible the code always was, and now I need to benchmarks basic things such as adding indexes? I suggest you learn a bit about them instead of just assuming everything Rails proposes is correct. http://usetheindexluke.com. Checking foreign key constraints should be nothing more than index lookups, which are the fastest operations in a database. As I said, with none of the query work surrounding it.

You've made lots of affirmations regarding ActiveRecord: "The AR version is just a poor substitute that cannot handle race conditions, external changes and buggy code that doesn't handle validations correctly (which we have already seen)". So I repeat my request from the last comment: please if you can write tests that reproduce these issues, it can become a PR and a larger discussion at Rails core.

This is a fundamental property of databases and applies to any ORM. AR developers already know it - even if you don't. You cannot enforce constraints against multiple concurrent clients in a database without having full control and visibility into everything that is happening. An AR validation can never have that. It cannot guarantee that no changes happen between it's SELECT and the following INSERT. I suggest you read about the different isolation levels that SQL RDBMS provide to understand why any given client can never ensure integrity alone.

If you want to leave Rails you can open a new discussion on a new issue.

The fact that you think wanting to change one particular aspect of Rails means throwing it all away shows me you might be a bit too immersed in dogma instead of practical and sound engineering principles.

If you want to leave Rails you can open a new discussion on a new issue. But to be as clear as possible I do need to stress this: the objective here is to discuss wether DB constraints are worth the extra code complexity, data loss and insertion time. ActiveRecord removal is not in question here as we rely on it for a lot of other features than just data consistency.

Again, who said anything about that at all? You're creating a straw-man to pretend I'm attacking AR, when I'm not. Please read what I write carefully and honestly instead of jumping to knee-jerk conclusions derived from Rails dogma.

danielkza · 2016-07-29T19:19:23Z

Also, as evidence that Rails developers themselves recognized the need for foreign keys is their addition in Rails 4.2 and even use by default in new migrations created with the generator tools.

rafamanzo · 2016-07-29T19:55:58Z

Firstly, please keep the discussion technical. If we are going to start with personal attacks we're not getting anywhere. Sorry I'll not answer to that and I'm disappointed now. If you want, we can talk through personal issues in person as they should. Not through messages. I'll send an email to everyone scheduling a talk where we will address this.

The practical conclusions I can get from the discussion so far:

Both are subject to human errors that we must correct Results consistency #230
DB constraints
- Pros
  - The right tool for assuring data consistency
  - foreign_key constraints are now default for the Rails' migration generator
  - Retro validates existent data
- Cons
  - Retro validates existent data, thus requiring removals
  - Errors are not handled to issue friendly user messages
  - Requires the developer to inspect the schema in order to find Model related restrictions
  - ~~Requires slower DatabaseCleaner truncation strategy to work within acceptance tests~~
AR validations
- Pros
  - Takes care of issuing friendly error messages
  - Keeps all the validations at the same level
  - Do not require data removal at introduction time
- Cons
  - Do not taking advantage of DB's already existent structures for such task
  - Update of invalid data will fail
Controversial
- Can AR validations fail and allow insertion of invalid data?
- Is the DB constraint a burden for insertion?

Please feel free to add anything to the list above.

danielkza · 2016-07-29T20:19:39Z

I made no personal attacks whatsoever. Saying code was poorly engineered is not a personal attack. Questioning technical priorities in the decision making process is not a personal attack. Saying one is using dogmatic reasoning to justify positions is not a personal attack. Saying that asking for benchmarks of smaller changes doesn't make sense when there are much larger problems is not a personal attack.

Requires slower DatabaseCleaner truncation strategy to work within acceptance tests

This is wrong. The truncation strategy is required with or without foreign keys, because data from a uncommitted transaction is not visible to other connections, which will exist for any processing job that creates threads. The lack of foreign keys was just the reason we didn't see the mistake.

Requires the developer to inspect the schema in order to find Model related restrictions

This won't be a problem if we simply make DB constraints that reflect the validations. They become the safety net and last guarantee. Being able to create bad data is much more unexpected and wastes much more time than looking at the schema - which one already has to do just to see what columns exist.

Do not require data removal at introduction time

This is not an advantage of the validations, it's just deciding to ignore the problem. The data will not become valid on its own. Trying to manipulate it will cause exceptions or validation failures anyway, or even be impossible in cases of unreferenced or unreachable rows.

do-you-dare · 2016-07-29T22:13:32Z

@danielkza, I believe it was not your intention, but things like " I suggest you learn a bit about them instead of just assuming everything Rails proposes is correct" sound quite offensive, and your tone was once again too harsh. A personal meeting would be better for such comments.

I am fond of raw sql queries, but I'd rather stick with Active Record for the validations, for maintainability and legibility reasons. When eventual updates of invalid data happens, we could ask the user permission for proper removal.

About AR validations failing, we could write tests to try to duplicate the data @danielkza found, and open an issue for AR developers.

danielkza · 2016-07-29T22:24:20Z

@danielkza, I believe it was not your intention, but things like " I suggest you learn a bit about them instead of just assuming everything Rails proposes is correct" sound quite offensive, and your tone was once again too harsh. A personal meeting would be better for such comments.

Yes, I apologize for that.

I am fond of raw sql queries, but I'd rather stick with Active Record for the validations, for maintainability and legibility reasons.

I don't agree that they're more maintainable at all. Figuring out what to do with invalid data instead of preventing is it is not free.

When eventual updates of invalid data happens, we could ask the user permission for proper removal.

How is that better than never inserting invalid data in the first place?

About AR validations failing, we could write tests to try to duplicate the data @danielkza found, and open an issue for AR developers.

I already mentioned I don't believe there is any issue with the AR code itself - the limitations that it works with are pretty well known. But that doesn't mean they don't exist. As I said, it is not possible for a single client to enforce consistency in a multi-client database. Doesn't matter how good or bad AR is.

do-you-dare · 2016-07-29T22:47:45Z

When eventual updates of invalid data happens, we could ask the user permission for proper removal.

How is that better than never inserting invalid data in the first place?

This would be just to handle the invalid data that already made its way to the db; preventing their insertion is good, and I think we can get AR validators to do it.

I already mentioned I don't believe there is any issue with the AR code itself - the limitations that it works with are pretty well known. But that doesn't mean they don't exist.

If that's the case and validators aren't enough, even consistency checks using AR would be reasonable, in my opinion. A big part of Rails' power resides in its conventions, and performance, though desirable, is not critic in our case here.

danielkza · 2016-07-29T23:14:15Z

If that's the case and validators aren't enough, even consistency checks using AR would be reasonable, in my opinion.

Any type of check that comes from AR is not enough. It cannot possibly be, as I already mentioned, if multiple writers are ever involved. There is no distinction between validations on columns on the same table or multiple tables in this case. AR isn't enough for either, as it's own documentation explains:

This (uniqueness) helper validates that the attribute's value is unique right before the object gets saved.
It does not create a uniqueness constraint in the database, so it may happen that two different database connections create two records with the same value for a column that you intend to be unique.
To avoid that, you must create a unique index on that column in your database.

The impression that AR is sufficient and replaces database constraints in every situation isn't expressed in it's documentation, and certainly is not implied by any of it's developers. AR is just an abstraction to make interacting with the database easier. It doesn't free you from the work of actually designing the database. Validations are the default choice, but not the only choice.

A big part of Rails' power resides in its conventions

The whole point of this discussion is figuring out when a convention is not good enough. A convention has value when it proves to be a good choice in most cases, and when it fits your specific situation. The choice to follow has to come from understanding the reasoning, not just doing the same as everyone does. I'm not the first person to heavily disagree with the Rails creator opinion about it - even other Rails developers clearly do too, otherwise foreign key support would not have been added.

and performance

That's pretty much backwards. Rails is slower than most frameworks: it values development ease above everything, and includes lots of dynamic components, automatic magic and other similar choices. Rails validations are strictly slower than database constraints: they do all the same work, plus the job of producing and executing an actual query, and without any of the chances for optimization a database has (such as verifying multiple constraints or records simultaneously).

The choice of using or not using indexes or foreign keys really has nothing to do with Rails and everything to do with database design practices. Using Rails does not invalidate them at all.

do-you-dare · 2016-07-30T00:37:27Z

Any type of check that comes from AR is not enough. It cannot possibly be, as I already mentioned, if multiple writers are ever involved.

Well, if it doesn't work, it can't count as an option. But I don't like the retroactive validation with obligatory removal, for the reasons @diegoamc presented.

I'd like to hear the opinions of the remaining @mezuro/core members. Anyway, I'll be happy with whatever you choose.

danielkza · 2016-07-30T00:52:02Z

I think my opinion is quite clear and the code is there, so it's up for you guys to decide.

diegoamc · 2016-07-30T15:25:22Z

Hello guys,

First of all, I'd like to point out that the conversation @danielkza and I were having at #215 does not concern database validations. That was about the best way to apply changes when they remove data that belongs to an user.

I did not know about the ORM's limitations @danielkza pointed out about the uniqueness constraint, for example. They do seem alarming.

Regarding the garbage data that got in after v1.1.5 (HotspotMetricResults) it seems we did not implement the AR validation for them. That mistake is entirely ours. All other entities seem to be working fine, though. Have you detected any other kind of inconsistency on the database after v1.1.5, @danielkza?

Assuming you didn't, we have the current situation: AR validations seem to be working and the garbage we have is due to AR validations we did not implement.
Assuming you did, can we guarantee that only the foreign key constraints prevent those other inconsistencies from happening? We may need further investigation.

My final thoughts are (I'm assuming other inconsistencies weren't found because I don't remember anyone mentioning them):

AR validations seem to be working for our project. Those limitations (multi-client inserts) you mentioned do not apply to our current situation. We only have one connection to the processor's database at a time. Here I mean connections that create processings, module results and metric results. That's because we don't process repositories in parallel. Correct me if I'm wrong.
In the future, a desirable feature is to process repositories in parallel. When that time comes, it's clear to me the database will move to an invalid state eventually. Also, on those cases, for me it would not matter if the integrity checks have relevant impact on performance or not. They would have to be present anyway.
As a result, I think database constraints are not necessary right now. They will be vital in the future, though. As we cannot tell when we will start to process repositories in parallel, I suggest we do the following:
- Measure the performance impact of those database constraints.
- If we see a significant performance degradation (here we have to define what a significant performance degradation is) I vote we wait until we start to process repositories in parallel to add the constraints. That's because we won't eventually fall on an invalid database state and users will have results quickly.
- If the performance degradation is irrelevant, I vote for including the database constraints now so we ensure we do not forget about this later.
Finally, it seems to me we are doing something wrong in our develpment process. We made the same mistake (a very serious one) twice. In a personal meeting I would like to talk about possible causes and how to avoid making the same mistake for the thrid time.

danielkza · 2016-07-30T18:33:04Z

First of all, I'd like to point out that the conversation @danielkza and I were having at #215 does not concern database validations. That was about the best way to apply changes when they remove data that belongs to an user.

Yeah, but that discussion becomes pointless if we choose not to add the constraints.

I did not know about the ORM's limitations @danielkza pointed out about the uniqueness constraint, for example. They do seem alarming.
Regarding the garbage data that got in after v1.1.5 (HotspotMetricResults) it seems we did not implement the AR validation for them. That mistake is entirely ours. All other entities seem to be working fine, though. Have you detected any other kind of inconsistency on the database after v1.1.5, @danielkza?

That's the thing with inconsistencies: it's hard to figure out where and when they came from. You only find out they're there when it's already too late. It's not like we ever stopped to check if they existed or paid any special attention to it: as I said, I had to fix about 10 instances of referential integrity failures in the acceptance tests that nobody ever noticed.

Assuming you didn't, we have the current situation: AR validations seem to be working and the garbage we have is due to AR validations we did not implement.

I maintain my point: AR validations can seem to work, and even work 99% of the time, but they cannot guarantee working 100% of the time. That's just their nature. And a 99% consistent database is an inconsistent database. There are no intermediate states.

Assuming you did, can we guarantee that only the foreign key constraints prevent those other inconsistencies from happening?

Yes, AR validations can fundamentally not 100% guarantee their own enforcement. Nothing we do can change that.

AR validations seem to be working for our project. Those limitations (multi-client inserts) you mentioned do not apply to our current situation.

Do they not? Nowhere have we specified or enforced that only one worker is run at the same time. We just chose to setup mezuro.org like that.

We only have one connection to the processor's database at a time.

That's already false if we process any PHPMD metrics. That processing creates a new thread, and therefore a new connection. edit: Actually, it's always false. The service itself has one connection and the worker has another.

As a result, I think database constraints are not necessary right now. They will be vital in the future, though.

The problem is that delaying it makes it progressively harder to do it. If we had caught the problem when there were a dozen wrong records it wouldn't really matter to much. But now that there are thousands (!) we have to find creative ways to deal with it.

As a result, I think database constraints are not necessary right now. They will be vital in the future, though. As we cannot tell when we will start to process repositories in parallel, I suggest we do the following:

IMO they were necessary since the software was first created and never stopped being necessary up to now. The fact that we didn't care about the inconsistencies doesn't mean the setup was ever correct.

Measure the performance impact of those database constraints.

If we see a significant performance degradation (here we have to define what a significant performance degradation is) I vote we wait until we start to process repositories in parallel to add the constraints. That's because we won't eventually fall on an invalid database state and users will have results quickly.

If the performance degradation is irrelevant, I vote for including the database constraints now so we ensure we do not forget about this later.

What will be the comparison base? The code before or after the aggregation and/or interpreting rewrites? Also, we'll probably need a read-focused benchmark as @rafamanzo mentioned previously.

diegoamc · 2016-07-31T13:28:29Z

I may have said something incorrect. On production we run more than one worker at the same time, running potentially 12 processings simultaneously, right? Even though one processing runs sequentially, the fact that many workers run processings at the same time can raise the issue @danielkza is talking about.

If that is the case indeed (again, correct me if I'm wrong), I vote for the inclusion of the database restrictions without the need of performance tests.

rafamanzo added the question label Jul 29, 2016

rafamanzo mentioned this issue Jul 29, 2016

Overhaul database schema #215

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Database restrictions #228

Database restrictions #228

rafamanzo commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016

rafamanzo commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016 •

edited

Loading

rafamanzo commented Jul 29, 2016 •

edited

Loading

rafamanzo commented Jul 29, 2016

danielkza commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016

rafamanzo commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016 •

edited

Loading

do-you-dare commented Jul 29, 2016

danielkza commented Jul 29, 2016

do-you-dare commented Jul 29, 2016

danielkza commented Jul 29, 2016 •

edited

Loading

do-you-dare commented Jul 30, 2016 •

edited

Loading

danielkza commented Jul 30, 2016

diegoamc commented Jul 30, 2016

danielkza commented Jul 30, 2016 •

edited

Loading

diegoamc commented Jul 31, 2016

Database restrictions #228

Database restrictions #228

Comments

rafamanzo commented Jul 29, 2016 • edited Loading

danielkza commented Jul 29, 2016

rafamanzo commented Jul 29, 2016 • edited Loading

danielkza commented Jul 29, 2016 • edited Loading

rafamanzo commented Jul 29, 2016 • edited Loading

rafamanzo commented Jul 29, 2016

danielkza commented Jul 29, 2016 • edited Loading

danielkza commented Jul 29, 2016

rafamanzo commented Jul 29, 2016 • edited Loading

danielkza commented Jul 29, 2016 • edited Loading

do-you-dare commented Jul 29, 2016

danielkza commented Jul 29, 2016

do-you-dare commented Jul 29, 2016

danielkza commented Jul 29, 2016 • edited Loading

do-you-dare commented Jul 30, 2016 • edited Loading

danielkza commented Jul 30, 2016

diegoamc commented Jul 30, 2016

danielkza commented Jul 30, 2016 • edited Loading

diegoamc commented Jul 31, 2016

rafamanzo commented Jul 29, 2016 •

edited

Loading

rafamanzo commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016 •

edited

Loading

rafamanzo commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016 •

edited

Loading

rafamanzo commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016 •

edited

Loading

danielkza commented Jul 29, 2016 •

edited

Loading

do-you-dare commented Jul 30, 2016 •

edited

Loading

danielkza commented Jul 30, 2016 •

edited

Loading