I've come across a table with about 200 columns. About 150 of these can be grouped into 5-10 tables that make real-world "sense", and since most of those values are never populated, I figured splitting them out would avoid storing a lot of NULLs and reduce the size of the database drastically.
For example, let's say the current main table has these entries:
Id | Person | DOB  | Address    | FaveColour | LeastFaveColour | MoreColourOpinions
---|--------|------|------------|------------|-----------------|-------------------
1  | Jim    | 1992 | Here       | null       | null            | null
2  | Bob    | 1991 | There      | Brown      | Orange          | I like purple
3  | Bill   | 1990 | Everywhere | null       | null            | null
As you might have guessed, I would split the columns relating to colour out into a separate table:
Id | Person | DOB  | Address
---|--------|------|-----------
1  | Jim    | 1992 | Here
2  | Bob    | 1991 | There
3  | Bill   | 1990 | Everywhere
PersonId | FaveColour | LeastFaveColour | MoreColourOpinions
---------|------------|-----------------|-------------------
2        | Brown      | Orange          | I like purple
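To make the split concrete, here's a rough sketch of the DDL I have in mind. The syntax is generic, the column types are guesses, and the table and column names just mirror my example above:

CREATE TABLE Person (
    Id      INT PRIMARY KEY,
    Person  VARCHAR(100),
    DOB     INT,
    Address VARCHAR(200)
);

-- One-to-one (really one-to-zero-or-one) table holding the rarely used colour columns;
-- a row only exists here when the person actually has colour opinions recorded.
CREATE TABLE ColourOpinions (
    PersonId           INT PRIMARY KEY REFERENCES Person (Id),
    FaveColour         VARCHAR(50),
    LeastFaveColour    VARCHAR(50),
    MoreColourOpinions VARCHAR(500)
);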
Now, I know it's perfectly fine to have one-to-one tables, but my question is about speed: what's the difference between querying the original gargantuan table and querying all the separate tables left-joined back together?
Let's say the table has half a million rows and I want to filter on one column from EVERY group, e.g.:
SELECT *
FROM Person p
LEFT JOIN ColourOpinions co ON p.Id = co.PersonId
-- ... plus another ten or more LEFT JOINs, one per split table
WHERE co.FaveColour = 'Brown'
-- ... AND another filter, one for each of the ten+ joined tables
I assume querying the original table will be faster, because there are no joins to perform; with all those joins I'm basically recreating the original table before querying it. But how much slower will the split version be? Is it a good idea to split this table up?
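For reference, the equivalent query against the current wide table is just a straight filter with no joins at all (assuming the wide table is also called Person; the "one filter per group" comment stands in for the other ten+ conditions):

SELECT *
FROM Person
WHERE FaveColour = 'Brown'
-- ... AND one filter per column group, same conditions as above, but no joins needed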
I'm thinking yes, because the speed of querying the smaller tables and joins individually, plus the reduction in database size, should more than offset the occasional query that needs to recreate the entire original table. Is that right?