47

I've started working at a new organization and one of the patterns I've been seeing in the database is duplicating fields to make writing queries easier for the business analysts. We're using Django and its ORM.

In one case, we keep a MedicalRecordNumber object with a unique string identifying a patient in a certain context. We have Registration objects which track patients and have associated MedicalRecordNumbers, but rather than using a foreign key relationship, they duplicate the string so they can avoid writing a join (not for performance reasons). This pattern is common throughout the database.
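
To make the pattern concrete, here is a minimal Django sketch of the two approaches. The model and field names are illustrative, not our actual schema:

    from django.db import models

    class MedicalRecordNumber(models.Model):
        # The unique string identifying a patient in a certain context.
        value = models.CharField(max_length=64, unique=True)

    class Registration(models.Model):
        # The pattern in question: the MRN string is copied here verbatim
        # so that queries against Registration never need a join.
        mrn_value = models.CharField(max_length=64)

    # The normalized alternative would replace mrn_value with a foreign key,
    #
    #     mrn = models.ForeignKey(MedicalRecordNumber, on_delete=models.PROTECT)
    #
    # and queries would join through it, e.g.
    #     Registration.objects.filter(mrn__value="12345")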

For me, the importance of a clean data model is simply that I can reason about it; needless complexity is a waste of my limited cognitive processing time. It's also a systemic problem: not being comfortable writing joins is a rectifiable skills issue. I don't necessarily want to advocate going back and changing the schema, but I'd love to be able to convincingly articulate the problems with this type of duplication.

canisrufus
  • 1,091
  • 2
    What does it mean to "not be comfortable writing joins"? How do they explain that? – scriptin Apr 11 '15 at 16:48
  • 9
    Do these folks work for you? Are you their supervisor? Most of your justifications can be found here: http://en.wikipedia.org/wiki/Database_normalization. Yes, they need to get better at using joins. – Robert Harvey Apr 11 '15 at 17:00
  • 1
    Have you looked up the literature on why normalization is desirable? – Nathan Tuggy Apr 11 '15 at 17:02
  • @scriptin "Well, we put in those MRN fields [because it's duplicated multiple times] so that it's easier to query. When we query the Registration table we don't want to have to join against the MedicalRecordNumber table." The discomfort is an inference I'm making. – canisrufus Apr 11 '15 at 17:45
  • @RobertHarvey no, I'm on a 6 month contract, and they manage the projects I work on without being in charge of me. Responsibility for the database is ambiguous. I don't necessarily care about being rehired and I feel some professional responsibility to push for better practices. – canisrufus Apr 11 '15 at 17:51
  • @NathanTuggy I pulled out my database modeling and design textbook but thought, you know, there must be a more succinct source of information on this subject, and decided to solicit advice from those more knowledgeable. – canisrufus Apr 11 '15 at 17:51
  • 17
    Wouldn't adding views that do the join internally make writing queries just as easy? You could suggest them as an alternative. – CodesInChaos Apr 12 '15 at 10:14
  • In a medical setting, privacy issues might also be a consideration. There might be legal reasons for not applying joins between tables. – Kasper van den Berg Apr 12 '15 at 12:40
  • There have been two votes to close based on the question being too broad. Accepting your premise: how can it be improved? People have converged on some pretty cogent answers (justify using database normalization theory, offer alternative means of easier querying [views and materialized views]). Communicating well is hard, and you do questioners a favor by explaining how we can do better in the future! – canisrufus Apr 12 '15 at 13:26
  • @KaspervandenBerg very on point, but not true in this case. – canisrufus Apr 12 '15 at 13:27
  • You give too little information about the context. If they use duplicated columns merely as a convenience, that's wrong, of course. But keep in mind that in some contexts denormalizing (if you know what you're doing) is a way of optimizing read speed. You can use a view, of course, but (in principle) it won't beat the performance of a denormalized table. – nsn Apr 13 '15 at 10:01
  • 1
    Did you communicate this (politely) with your peers and seniors? What are their justifications, what considerations are they making? There are many possible reasons why this might be a good idea (even though you say "performance is not the reason", what evidence do you have to support that?). Before accusing them of being too lazy and/or rigid, have you considered (and asked about) the reasons they have for designing things the way they did? Maybe there are far more reads than writes (analytics-heavy DB)? Change tracking? Historical data? Ask everyone - someone might know the real reason. – Luaan Apr 14 '15 at 07:39
  • Does the duplicated data ever change once it is written to the database? – Ian Apr 14 '15 at 08:22
  • @Ian at this point, no. But it's biting us - we ended up with multiple MRNs for some patients and don't have a good way to clear it out, and it's complicated by the fact that they're sprinkled around throughout the database. – canisrufus Apr 14 '15 at 20:47
  • Be thankful that you are not trying to match together data that have come from paper-based patient records and 20-odd different systems used by the x-ray department, etc.... A patient can be seeing a few consultants for unrelated reasons within a few weeks.... – Ian Apr 15 '15 at 20:31
  • If you're the new kid on the block, it may be premature for you to start rocking the boat. Presumably, they have their reasons for designing the database the way they have; and they may or may not appreciate being told that their reasons are bad reasons. Especially by someone who is new to the organisation, and who maybe doesn't even know what the reasons are. If I were you, I would just keep quiet and tolerate it, until you gain a bit more credibility in the organisation. – Dawood ibn Kareem Apr 17 '15 at 07:44
  • @DavidWallace There is wisdom to that advice. Who the hell is this guy. Conversations I've had, though: the designer of this database adamantly asserting null foreign keys are impossible. The designer of this database adamantly asserting that including two clearly distinct entities on one table was more normalized than my design moving them to separate tables. The designer of this database arguing that the best way to track history of a table is to create an identical history table, duplicating 15 columns. – canisrufus Apr 17 '15 at 15:42
  • @DavidWallace Admittedly, this is all orthogonal to whether or not rocking the boat will be well received, or whether I should say anything. And it sucks to come into a situation and be the guy who says "this could be done better", but sometimes, you actually do know better. – canisrufus Apr 17 '15 at 15:49
  • Yes, but is it actually your job to train the database designer? If it is, then go ahead and do it. But if not, then the best thing you can do is to do your own job well, and not worry too much about other people's jobs. In other words, tolerate the consequences of the database design being what it is, until you've built up a few brownie points for doing your own job well. Also remember that changing the data model retrospectively may be an expensive process, even though it will improve things in the long run, and may not be well received by management. – Dawood ibn Kareem Apr 17 '15 at 18:52
  • @DavidWallace It's great advice. You will note in my question: "I don't necessarily want to advocate going back and changing the schema..." which is shorthand for yes, I do want to advocate for that, but I ain't no Don Quixote. I worked for the Florida legislature for a year and a half. I learned a lot there, but most importantly I learned the need to STFU when people more powerful than you don't want to hear it. All of that, though, is orthogonal to my original question: how do I convincingly argue against duplicating database columns? I'll handle the organizational politics. – canisrufus Apr 17 '15 at 19:22

7 Answers

129

Your operational database should be highly normalized, to reduce anomalies.

Your analytic database (warehouse) should be highly denormalized, to ease analysis.

If you don't have a separate analytic database, you should make some highly denormalized [materialized] views.
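
If the RDBMS supports them, a materialized view can be created once and refreshed on a schedule. Here is a sketch wrapped in a Django migration, assuming PostgreSQL (materialized-view syntax varies by RDBMS) and hypothetical table and column names:

    from django.db import migrations

    CREATE_VIEW_SQL = """
    CREATE MATERIALIZED VIEW reporting_registration AS
    SELECT r.id,
           r.registered_on,
           m.value AS mrn
    FROM   app_registration r
    JOIN   app_medicalrecordnumber m ON m.id = r.mrn_id;
    """

    class Migration(migrations.Migration):
        dependencies = [("app", "0001_initial")]
        operations = [
            migrations.RunSQL(
                CREATE_VIEW_SQL,
                reverse_sql="DROP MATERIALIZED VIEW reporting_registration;",
            ),
        ]

    # Analysts then query reporting_registration directly; a scheduled job
    # runs REFRESH MATERIALIZED VIEW reporting_registration; to keep it fresh.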

If you tell your senior business analysts / managers to do lots of joins for a simple analysis, well, you might get fired.

Agile Data Warehouse Design is a good book.

See my quick n' dirty data warehouse tips here

  • 9
    This is the right way to go. – Etheryte Apr 11 '15 at 22:23
  • 6
    +1 This is exactly what Views are intended for: allowing a denormalized view on a normalized database. – Nzall Apr 12 '15 at 15:58
  • 4
    Absolutely correct, but I think "reduce anomalies" should be emphasized more, since that's the primary answer to the question. The most common (only?) anomaly you'll see with data duplication/denormalization is that the columns will somehow get populated with contradictory data at the same time, leaving you with no way of knowing what the real data is supposed to be and no way of determining what went wrong. The latter can be mitigated with massive tracking of changes, but this won't be cheap or quick to go through and find the problem. More cost effective to avoid the problem entirely. – jpmc26 Apr 12 '15 at 22:23
  • 2
    Another angle to consider is that, even assuming the developers are capable of keeping the data correct (doubtful), it becomes a huge drain on their resources to ensure that every duplicate field gets updated when required to maintain consistency. – Nate C-K Apr 13 '15 at 04:03
  • Imagine the number of transactions that would be required for something like that. Even something like UPDATE Facility SET OpenForBusiness = 1 WHERE ID = 830 would not work, since you would have to do such a trivial thing on multiple tables in multiple update statements, requiring a transaction. – Panzercrisis Apr 13 '15 at 14:44
  • @Panzercrisis Doing all your queries inside a transaction should be your default. You should have a really, *really* good reason to deviate from that default. It's just safer that way, and the cost of doing so isn't very high. – jpmc26 Apr 13 '15 at 15:32
  • @jpmc26 But if it's a one-row, one-line update statement, isn't the transaction implicit? – Panzercrisis Apr 13 '15 at 15:44
  • 1
    @Panzercrisis The only way a transaction is "implicit" is if you have an automatic commit running at the end of your query. This should not usually be the case for a production database. In an application, transactions should be initiated automatically and a commit should be issued separately from the query. This is a small upfront investment in the application, but it simplifies code changes that involve adding database calls and reduces how much a developer has to think about (improves dev speed, reduces dev errors). That sort of design also fits in well with things like connection pooling. – jpmc26 Apr 13 '15 at 17:28
  • I would tell the analysis to write their own views. If you can't write a stack of joins you /have no business/ querying the production database which has much tighter constraints than just getting joins right. – Joshua Apr 13 '15 at 17:58
  • @Joshua it depends on who the analysts are. If it's a marketing manager using Excel, then you shouldn't :) – Neil McGuigan Apr 13 '15 at 18:06
  • @jpmc26 Assuming ID is the primary key in Facility, are you saying the line above could end up only performing part of the update, but not the rest, leaving 830 in a corrupted state? Or are you just talking about the context of when it is done alongside other update statements or when multiple rows or columns are affected? – Panzercrisis Apr 16 '15 at 13:40
  • 1
    @Panzercrisis I'm saying that the cost of using a transaction for all statements, even single row ones, is miniscule, while avoiding a transaction for an operation that is currently a single query incurs risk of forgetting to add one in the event the action becomes more complicated. Additionally, in application code, you may have post-query operations that can fail and necessitate a rollback even though the query succeeded. It's simply much safer from a development perspective to use transactions by default, while the performance impact is practically nothing for the vast majority of cases. – jpmc26 Apr 17 '15 at 00:18
57

I understand why somebody would want to avoid writing a join for each select.

But you can create a view with the join once, and use it instead of the denormalized table.

That way you combine the advantages of normalization with the convenience of an easy select.
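
A sketch of what that could look like with Django (the DDL and all names here are hypothetical): create the view once, then point an unmanaged model at it so it can be queried like any table:

    # One-time DDL, run against the database:
    #
    #   CREATE VIEW registration_with_mrn AS
    #   SELECT r.id, r.registered_on, m.value AS mrn
    #   FROM   app_registration r
    #   JOIN   app_medicalrecordnumber m ON m.id = r.mrn_id;

    from django.db import models

    class RegistrationWithMrn(models.Model):
        registered_on = models.DateField()
        mrn = models.CharField(max_length=64)

        class Meta:
            managed = False                     # Django never migrates this;
            db_table = "registration_with_mrn"  # it maps onto the view.

    # Analysts can now write RegistrationWithMrn.objects.filter(mrn="12345")
    # with no join in sight, while the stored data stays normalized.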

knut
  • 1,398
  • 12
    Views are your friends. Use them liberally. And for performance, you could even use Materialized views if your RDBMS supports them. – VH-NZZ Apr 12 '15 at 09:04
13

The answers which have already been upvoted pretty much cover the "how to avoid duplication" (using views) but not the why. They basically show that duplication of columns is the wrong solution to the problem of making it easier to write queries. But the question "why not duplicate any random column just for the heck of it?" still stands.

The answer is "Because of Murphy's Law". Murphy's law states that:

If something can go wrong, it will.

In this case, the contents of each row field of a duplicated column are supposed to be identical to the contents of each corresponding row field of the original column. What can go wrong, is that the contents of some row fields may differ from the originals, wreaking havoc. You might think that you have taken all conceivable precautions to ensure that they will not differ, but Murphy's law states that since they can differ, they will differ. And havoc will ensue.

As an example of how this may happen, simply consider the fact that the duplicated columns don't get filled by magic; someone must actually write code which stores values in them whenever rows are created in the original table, and someone must write code which keeps updating them whenever the originals get modified. Setting aside the fact that this adds undue burden to the code which enters data into the database (and which is, by definition, far more crucial than any code which simply queries the database), someone, somewhere, under certain circumstances, might forget to carry out this duplication. Then the values will differ. Or they may remember to carry out the duplication, but not within a transaction, so it may, under certain rare fault conditions, be omitted.

But I did not really need to waste my time writing these examples, and you did not really need to waste your time reading them: the beauty of Murphy's law is that it saves us from having to come up with examples of how something may go wrong on a case-by-case basis. If it can go wrong, it will.
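
To make that burden concrete, here is roughly what a single change to the original value has to look like in Django once the string is duplicated. The rename() helper is hypothetical; the model names follow the question:

    from django.db import models, transaction

    class Registration(models.Model):
        # The duplicated string; there is no foreign key to join through.
        mrn_value = models.CharField(max_length=64)

    class MedicalRecordNumber(models.Model):
        value = models.CharField(max_length=64, unique=True)

        def rename(self, new_value):
            # Changing the master value means chasing down every copy,
            # inside one transaction, or the copies drift apart.
            with transaction.atomic():
                Registration.objects.filter(mrn_value=self.value).update(
                    mrn_value=new_value
                )
                self.value = new_value
                self.save()

    # Forget this fan-out in even one code path (a bulk load, raw SQL, a
    # manual fix in the database) and the values differ, exactly as
    # Murphy's law predicts.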

Mike Nakis
  • 32,193
  • 7
  • 77
  • 112
12

Thinking of it in terms of tradeoffs rather than good/bad will be more productive. They are trading off advantages of normalization (esp. consistency) for advantages in query usability.

At one extreme, the database would become useless if the data got severely inconsistent. At the other extreme, it would be useless if it were too difficult for the people who need to query it every day to get results they can count on.

What can you do to reduce the risks and the costs?

  • Build a consistency checker tool and run it regularly (see the sketch after this list).
  • Route write access through software that updates the replicated data consistently.
  • Add views or build query tools that do the joins automatically so the business people can think in terms of the information rather than the DB internals.
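
For the first item, the checker can start as a single query that hunts for drifted copies. A sketch against the question's MRN example (hypothetical names, assuming a Registration model that duplicates the string):

    # Assumes the Registration / MedicalRecordNumber models from the question.
    def find_stale_mrn_copies():
        """Return Registrations whose duplicated MRN string no longer
        matches any MedicalRecordNumber row."""
        known = MedicalRecordNumber.objects.values_list("value", flat=True)
        return Registration.objects.exclude(mrn_value__in=known)

    # Schedule this from cron or a management command and alert on any
    # hits; catching drift early is far cheaper than untangling it later.
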
Jerry101
  • 5,407
  • 1
  • 16
  • 19
6

I think the strongest argument for data normalization for business analysts is that it promotes data integrity. If your key data is stored in only one place (one column, in one table), it's much less likely that the data will get corrupted by incorrect updates. I think they would probably care about the importance of data integrity, so this might be a good way to convince them to update their ways of interacting with the database.

A slightly more difficult method of querying is likely going to be preferable to potential data corruption.

Oleksi
  • 11,894
  • 2
  • 54
  • 54
  • 6
    His people will argue that they're good enough to make sure that all of the data is being updated properly (a premise I dispute, if they're uncomfortable with joins). Perhaps a better argument is that you lose most of the benefits of ACID that RDBMSs provide if you eschew normalization. – Robert Harvey Apr 11 '15 at 17:23
  • 4
    Probably, but it's all a question of risk. Are they willing to accept the risk of corrupting the database because it makes querying easier? – Oleksi Apr 11 '15 at 17:25
  • 1
    Playing the devil's advocate here, an obvious counterargument would be that, if someone's going to screw up an update and corrupt data anyway, that's a problem with or without normalization -- and, at least, having some redundancy in the database makes it more likely that someone will notice the corruption, and may even be able to fix it later. (Of course, ad hoc denormalization is hardly the most reliable error detection scheme, but the principle of error checking via redundancy is sound: that's how double-entry bookkeeping works.) – Ilmari Karonen Apr 12 '15 at 14:32
  • Or, to put it in other terms, there's more to data integrity than just relational integrity. With a fully normalized database, you can still maintain perfect relational integrity even if someone messes up an update, but that doesn't make the incorrectly updated data any less garbage. – Ilmari Karonen Apr 12 '15 at 14:35
0

To add to what the others have suggested above: this is a data governance issue. You need to work with the relevant stakeholders (data architects and data stewards) to develop data principles, policies, and naming conventions.

Be patient and work methodically. Change won't happen overnight.

0

Quit.

Honestly, you can spend months arguing about normalization and consistency, fighting crazy bugs caused by sheer laziness, and then quit.

Or you can just save the time and frustration and quit now.

Good programmers are very lazy people. They understand customer and management needs. But most importantly, they understand that solving problems with well-designed and well-implemented solutions saves them personally HUGE amounts of work, effort, and, above all, agony and stress.

So you would be much better off working at a place that understands and values good engineering.

Good Luck.


Afterthought: maybe what they need are BI / OLAP tools... http://en.wikipedia.org/wiki/Online_analytical_processing

AK_
  • 744