Database restrictions #228
Comments
This would all be fine if validations could actually do their job. They cannot, and the existence of hundreds of thousands of garbage records in our database proves that. This post sums up my thoughts quite well: DRY is great, but it doesn't matter when you replace the repetition with one source that doesn't do its job.
The validations will still exist and cover 99% of the errors. The foreign keys will catch the 1% of errors that the validations fail to. We'll see them clearly as exceptions instead of allowing the garbage in and missing the bug completely.
Again, it doesn't matter how easy it is if it doesn't work. Also, I would really prefer that the developer adding the validation understands how it actually fits into the physical model and how it will actually work. That developer will have 0 issue adding a foreign_key line to a migration.
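For reference, adding the constraint really is a one-liner in a migration. A minimal sketch (the table and column names here are hypothetical, not taken from kalibro_processor):

```ruby
# Hypothetical migration: make the database itself reject metric_results
# rows that reference a non-existent processing.
class AddForeignKeyToMetricResults < ActiveRecord::Migration
  def change
    # Creates a real FOREIGN KEY constraint; an insert with a dangling
    # processing_id now raises an error instead of silently creating
    # a garbage record.
    add_foreign_key :metric_results, :processings
  end
end
```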
I already made this point, but DRY doesn't matter if it doesn't work. For example, why do we duplicate unit and acceptance tests? Because they do different jobs, and one cannot give us the same guarantees the other does. The same applies here: the only part of the stack that is in a position to effectively enforce constraints, especially those that are relations between different models/tables, is the database itself.
What do you mean?
What do you mean about them not doing their job? It looks like they have been performing well since they were introduced. I do like the idea of introducing something that raises exceptions when the standard Rails validations fail to catch some case. But still, what is the insertion time impact of the duplicate checks? You seem to insist Rails' validators do not work at KalibroProcessor; can you write a test to assert that? This can become a PR to Rails' core if we can show that. Can you give an example of a validation that can be made through the database but is impossible through ActiveRecord? Backwards compatible means enforcing new validations without requiring removal of already existing data.
How so? They let hundreds of thousands of garbage records in. Even recent ones such as those added with HotspotMetricResult.
The Rails validation's SELECT probably takes an order of magnitude longer than the DB's own check, due to having to parse the query, create a query plan, execute, collect results and send back, vs. just doing the check and raising an error or not. Also, considering how poorly optimized everything else about processing is, duplicate integrity checks are the least of our worries.
I never said the validations don't work by themselves, but in the context of kalibro_processor. This is not my opinion, it is a fact evidenced by the existence of many invalid records in the mezuro.org database. You can verify it yourself by the select queries in the CleanInconsistencies migration in the mezuro.org DB. Constraints in the database cannot be forgotten or ignored, as we did in dozens of places in the acceptance tests. They do not let subtle invisible errors creep in as if they didn't exist - which, again, we observed with just Rails validations.
Every one that involves associations. The AR version is just a poor substitute that cannot handle race conditions, external changes and buggy code that doesn't handle validations correctly (which we have already seen). There is a huge difference between ensuring you do not attempt to insert invalid data given some conditions, and ensuring invalid data does not exist. The latter is much stronger than the former.
Then we're going to need to remove the validations too, because the first time any operation is attempted with the invalid records they will fail. One by one at different times instead of all at once. I won't go into this discussion again, but keeping garbage data does not make it good data. It is still garbage, it will still break when it is used in any meaningful way, and it is still a violation of the model - foreign keys existing or not. Rails has many good ideas, but ignoring the tools that databases provide to enforce data integrity is not one of them. If we have no need for consistency, then we might as well be using a much faster NoSQL database and an architecture that doesn't have or doesn't care about those inconsistencies.
So are you blaming ActiveRecord for a validation we have not programmed? Seems strange... The invalid records I see predate most of the consistency validations we added. If you insist on that, I'm afraid I cannot agree, as that invalid data is our flaw, not ActiveRecord's. And worse, DB restrictions are subject to the same human flaw of not foreseeing their necessity that you are blaming ActiveRecord validations for. Can you give examples of invalid data inserted after the v1.1.5 deployment? If they exist, can you write a test case to reproduce that? Is it impossible to fix this test case using ActiveRecord? You have dodged my question about insertion time impact. The closest to an answer I can get is that the extra time is negligible compared to the whole, right? Can you provide references or evidence of that? You've made lots of affirmations regarding ActiveRecord: "The AR version is just a poor substitute that cannot handle race conditions, external changes and buggy code that doesn't handle validations correctly (which we have already seen)". So I repeat my request from the last comment: please, if you can, write tests that reproduce these issues; it can become a PR and a larger discussion at Rails core. If you want to leave Rails you can open a new discussion on a new issue. But to be as clear as possible I do need to stress this: the objective here is to discuss whether DB constraints are worth the extra code complexity, data loss and insertion time. ActiveRecord removal is not in question here, as we rely on it for a lot of other features besides data consistency.
Obviously HotspotMetricResult has invalid records. It has no validations :)
edit: I was excessively harsh in some points here. It wasn't my intention to offend and I apologize if I failed at that.
Except for the part where I explicitly said that there is nothing wrong with the validations themselves. The paradigm of using something other than the database to validate constraints is the problem. It could be any ORM.
Did you actually look at them? I did. There are invalid records of HotspotMetricResults too, which we added recently.
Let's be clear: I'm not blaming AR, I'm blaming poor engineering and believing that there is no need to think about or understand the database.
None of our tests actually test that relational validations work, because we mock everything.
As you have ignored my questions about data integrity. And yes, the difference will be negligible compared to the current performance situation of the whole code. No questions ever existed about how terrible the code always was, and now I need to benchmark basic things such as adding indexes? I suggest you learn a bit about them instead of just assuming everything Rails proposes is correct: http://usetheindexluke.com. Checking foreign key constraints should be nothing more than index lookups, which are the fastest operations in a database. As I said, with none of the query work surrounding it.
This is a fundamental property of databases and applies to any ORM. AR developers already know it - even if you don't. You cannot enforce constraints against multiple concurrent clients in a database without having full control and visibility into everything that is happening. An AR validation can never have that. It cannot guarantee that no changes happen between its SELECT and the following INSERT. I suggest you read about the different isolation levels that SQL RDBMSs provide to understand why any given client can never ensure integrity alone.
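The check-then-insert window can be illustrated without any database at all. This toy Ruby sketch (nothing here is real AR code; an array stands in for a table with no UNIQUE index) shows two clients whose validation queries both run before either write lands:

```ruby
# Toy model of the race: `check` plays the role of the validation's
# SELECT, `insert` the role of the INSERT, `table` a table without
# a UNIQUE constraint.
table  = []
check  = ->(value) { !table.include?(value) }  # validation query
insert = ->(value) { table << value }          # write

# Two clients interleave: both validate before either one writes.
a_ok = check.call('same@email.com')  # client A sees no duplicate
b_ok = check.call('same@email.com')  # client B sees no duplicate either
insert.call('same@email.com') if a_ok
insert.call('same@email.com') if b_ok

puts table.count('same@email.com')  # => 2: uniqueness silently violated
```

With a database-level UNIQUE or FOREIGN KEY constraint, the second write would fail atomically inside the database, with no such window.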
The fact that you think wanting to change one particular aspect of Rails means throwing it all away shows me you might be a bit too immersed in dogma instead of practical and sound engineering principles.
Again, who said anything about that at all? You're creating a straw-man to pretend I'm attacking AR, when I'm not. Please read what I write carefully and honestly instead of jumping to knee-jerk conclusions derived from Rails dogma.
Also, evidence that Rails developers themselves recognized the need for foreign keys is their addition in Rails 4.2, and even their use by default in new migrations created with the generator tools.
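For context, this is roughly what the generators emit in recent Rails versions: a `references` column carries a real database constraint alongside the model association. A sketch (model and column names hypothetical):

```ruby
# Roughly what `rails generate model` produces for a references column
# in recent Rails versions: the foreign_key option creates a database
# FOREIGN KEY constraint, not just an integer column.
class CreateMetricResults < ActiveRecord::Migration
  def change
    create_table :metric_results do |t|
      t.references :processing, index: true, foreign_key: true
      t.float :value
      t.timestamps null: false
    end
  end
end
```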
Firstly, please keep the discussion technical. If we are going to start with personal attacks we're not getting anywhere. Sorry, I will not answer that, and I'm disappointed now. If you want, we can talk through personal issues in person, as they should be. Not through messages. I'll send an email to everyone scheduling a talk where we will address this. The practical conclusions I can get from the discussion so far:
Please feel free to add anything to the list above.
I made no personal attacks whatsoever. Saying code was poorly engineered is not a personal attack. Questioning technical priorities in the decision making process is not a personal attack. Saying one is using dogmatic reasoning to justify positions is not a personal attack. Saying that asking for benchmarks of smaller changes doesn't make sense when there are much larger problems is not a personal attack.
This is wrong. The truncation strategy is required with or without foreign keys, because data from an uncommitted transaction is not visible to other connections, which will exist for any processing job that creates threads. The lack of foreign keys was just the reason we didn't see the mistake.
This won't be a problem if we simply make DB constraints that reflect the validations. They become the safety net and last guarantee. Being able to create bad data is much more unexpected and wastes much more time than looking at the schema - which one already has to do just to see what columns exist.
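A sketch of that pairing, with hypothetical names: the AR validation keeps the friendly error messages, while the database constraint is the backstop for anything the validation misses (races, skipped validations, raw SQL):

```ruby
# Model: the AR validation catches ordinary mistakes with nice errors.
class MetricResult < ActiveRecord::Base
  belongs_to :processing
  validates :processing, presence: true
end

# Migration: the matching database constraint is the last line of
# defense; it cannot be forgotten, mocked out, or bypassed.
class AddMetricResultsForeignKey < ActiveRecord::Migration
  def change
    add_foreign_key :metric_results, :processings
  end
end
```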
This is not an advantage of the validations, it's just deciding to ignore the problem. The data will not become valid on its own. Trying to manipulate it will cause exceptions or validation failures anyway, or even be impossible in cases of unreferenced or unreachable rows.
@danielkza, I believe it was not your intention, but things like "I suggest you learn a bit about them instead of just assuming everything Rails proposes is correct" sound quite offensive, and your tone was once again too harsh. A personal meeting would be better for such comments. I am fond of raw SQL queries, but I'd rather stick with ActiveRecord for the validations, for maintainability and legibility reasons. When eventual updates of invalid data happen, we could ask the user's permission for proper removal. About AR validations failing, we could write tests that try to duplicate the data @danielkza found, and open an issue for the AR developers.
Yes, I apologize for that.
I don't agree that they're more maintainable at all. Figuring out what to do with invalid data instead of preventing it is not free.
How is that better than never inserting invalid data in the first place?
I already mentioned I don't believe there is any issue with the AR code itself - the limitations that it works with are pretty well known. But that doesn't mean they don't exist. As I said, it is not possible for a single client to enforce consistency in a multi-client database. It doesn't matter how good or bad AR is.
This would be just to handle the invalid data that already made its way to the db; preventing their insertion is good, and I think we can get AR validators to do it.
If that's the case and validators aren't enough, even consistency checks using AR would be reasonable, in my opinion. A big part of Rails' power resides in its conventions, and performance, though desirable, is not critical in our case here.
Any type of check that comes from AR is not enough. It cannot possibly be, as I already mentioned, if multiple writers are ever involved. There is no distinction between validations on columns of the same table or of multiple tables in this case. AR isn't enough for either, as its own documentation explains:
The impression that AR is sufficient and replaces database constraints in every situation isn't expressed in its documentation, and certainly is not implied by any of its developers. AR is just an abstraction to make interacting with the database easier. It doesn't free you from the work of actually designing the database. Validations are the default choice, but not the only choice.
The whole point of this discussion is figuring out when a convention is not good enough. A convention has value when it proves to be a good choice in most cases, and when it fits your specific situation. The choice to follow it has to come from understanding the reasoning, not just doing the same as everyone does. I'm not the first person to heavily disagree with the Rails creator's opinion about it - even other Rails developers clearly do too, otherwise foreign key support would not have been added.
That's pretty much backwards. Rails is slower than most frameworks: it values development ease above everything, and includes lots of dynamic components, automatic magic and other similar choices. Rails validations are strictly slower than database constraints: they do all the same work, plus the job of producing and executing an actual query, and without any of the chances for optimization a database has (such as verifying multiple constraints or records simultaneously). The choice of using or not using indexes or foreign keys really has nothing to do with Rails and everything to do with database design practices. Using Rails does not invalidate them at all.
Well, if it doesn't work, it can't count as an option. But I don't like the retroactive validation with obligatory removal, for the reasons @diegoamc presented. I'd like to hear the opinions of the remaining @mezuro/core members. Anyway, I'll be happy with whatever you choose.
I think my opinion is quite clear and the code is there, so it's up to you guys to decide.
Hello guys, First of all, I'd like to point out that the conversation @danielkza and I were having at #215 does not concern database validations. That was about the best way to apply changes when they remove data that belongs to a user. I did not know about the ORM's limitations @danielkza pointed out about the uniqueness constraint, for example. They do seem alarming. Regarding the garbage data that got in after v1.1.5 (HotspotMetricResults), it seems we did not implement the AR validation for them. That mistake is entirely ours. All other entities seem to be working fine, though. Have you detected any other kind of inconsistency in the database after v1.1.5, @danielkza? Assuming you didn't, we have the current situation: AR validations seem to be working and the garbage we have is due to AR validations we did not implement. My final thoughts are (I'm assuming other inconsistencies weren't found because I don't remember anyone mentioning them):
Yeah, but that discussion becomes pointless if we choose not to add the constraints.
That's the thing with inconsistencies: it's hard to figure out where and when they came from. You only find out they're there when it's already too late. It's not like we ever stopped to check if they existed or paid any special attention to it: as I said, I had to fix about 10 instances of referential integrity failures in the acceptance tests that nobody ever noticed.
I maintain my point: AR validations can seem to work, and even work 99% of the time, but they cannot guarantee working 100% of the time. That's just their nature. And a 99% consistent database is an inconsistent database. There are no intermediate states.
Yes, AR validations can fundamentally not 100% guarantee their own enforcement. Nothing we do can change that.
Do they not? Nowhere have we specified or enforced that only one worker is run at the same time. We just chose to set up mezuro.org like that.
That's already false if we process any PHPMD metrics. That processing creates a new thread, and therefore a new connection. edit: Actually, it's always false. The service itself has one connection and the worker has another.
The problem is that delaying it makes it progressively harder to do. If we had caught the problem when there were a dozen wrong records it wouldn't really matter too much. But now that there are thousands (!) we have to find creative ways to deal with it.
IMO they were necessary since the software was first created and never stopped being necessary up to now. The fact that we didn't care about the inconsistencies doesn't mean the setup was ever correct.
What will be the comparison base? The code before or after the aggregation and/or interpreting rewrites? Also, we'll probably need a read-focused benchmark, as @rafamanzo mentioned previously.
I may have said something incorrect. On production we run more than one worker at the same time, running potentially 12 processings simultaneously, right? Even though one processing runs sequentially, the fact that many workers run processings at the same time can raise the issue @danielkza is talking about. If that is indeed the case (again, correct me if I'm wrong), I vote for the inclusion of the database restrictions without the need for performance tests.
We need to discuss whether database restrictions are relevant for a Rails application. The database restrictions are nice for ensuring data consistency at insertion time. On the other hand, raw database validations do not get along as well with Rails as ActiveRecord ones when handling errors.
@mezuro/core input here is welcome.
Initial discussion has been made through #215
My personal opinion is to leave validations with ActiveRecord because: