241

Instead of a database I just serialize my data to JSON, saving and loading it to disk when necessary. All the data management is done in the program itself, which is faster AND easier than using SQL queries. For that reason I have never understood why databases are necessary at all.

Why should one use a database instead of just saving the data to disk?

MaiaVictor
  • 5,830
  • 70
    If managing the relationships of your data in your application is actually faster than doing it in a database (which I find extremely hard to believe) then you need to read up on SQL and database normalization. What you are experiencing is most probably the side-effect of a horribly designed database. – yannis Mar 14 '13 at 03:09
  • 5
    Well, try an example: imagine you're making a page on your site that shows a full list of members. How can you implement this? a) first query the database to get an array with the list of members then use that data to answer the request... b) just have that data stored on your program already and simply send it! See where I'm confused? Involving a database is just an additional step. How can an additional step be faster than not having it at all? – MaiaVictor Mar 14 '13 at 03:18
  • 78
    You don't need a database in the scenario you are describing because your data set is trivial. Databases are meant for more complex data sets, if all you do is read and show a list, your approach works. – yannis Mar 14 '13 at 03:24
  • 3
    @YannisRizos Perhaps: it's a list of members of a site, their passwords, login info, permissions, a list of exams and the grades of each member on those exams, a list of movies and their locations on the site and a list of messages sent between members, etc. Is this trivial? It seems. Am I fine storing that as files? Also, I'm curious: that's pretty much the kind of data a standard website needs. I've worked in some sites and it's never much different. What would be "non-trivial" data? What would need a database? Oh and thank you! – MaiaVictor Mar 14 '13 at 03:30
  • 5
    That sounds like enough complexity to justify a database. The great thing about a database is that it will work when the data is small, and it will still work when the data gets big (if it's designed properly). – Robert Harvey Mar 14 '13 at 03:32
  • 6
    Your schema is complex enough, and you should really start considering a database. Querying a database for a set of messages sent between two members on specific date ranges shouldn't take more than a few milliseconds (assuming your database is properly designed), even if you have thousands of members and millions of messages. But if we are talking about, for example, 100 members and a couple of thousand messages between them, then you could probably make it work without a database. Anything larger than that and your app will start suffering, along with your users. – yannis Mar 14 '13 at 03:36
  • 22
    What race conditions could you encounter, and are you ready for that? Will you want to scale past a single webserver? What is your backup plan if your server fails? Your answer to all of these questions is likely to be better if you have a database than if you don't. Also if you ever went over the hump of learning how to use databases, my guess is that you'd find your "easier than using SQL queries" should be amended to "easier than using SQL queries if you don't understand SQL." – btilly Mar 14 '13 at 05:21
  • 1
    @btilly I'm really getting much of what you're saying but there are some things I still don't understand, like the race condition thing. Will it happen if I store my data as a JavaScript object, and access it only from the node.js application running it? That is, you mean the request handling events on node.js are spawning threads all the time? So I just can't access global objects from those events safely? Is that it? – MaiaVictor Mar 14 '13 at 09:30
  • 1
    I like using sqlite for such things where data is small and I want to keep dependencies as small as possible. – Xolve Mar 14 '13 at 09:54
  • 44
    Database stores data to disk anyway. It's just the end result of a natural evolution of systems for storing structured data to file. Chances are if you set out to use files to store your structured data you are going to find yourself reinventing features that have already been developed in databases. So why not just use a database from the start? – Benedict Mar 14 '13 at 10:00
  • 4
    Your question should essentially be: when should you store data on disk instead of in a database? – minusSeven Mar 14 '13 at 11:45
  • 16
    Depending on how your project evolves, you may find yourself having to deal with things like concurrent access and rollbacks. They sound trivial, but aren't. By the time you get done solving them, you will find you have basically written a database. Do you really want to be in the database business, or another business? – jwernerny Mar 14 '13 at 12:35
  • 1
    Some points in favor of file systems: https://news.ycombinator.com/item?id=5229883 – arxanas Mar 14 '13 at 12:43
  • 3
    @Dokkat Node.js uses cooperative multitasking, so as long as you're only using one CPU, you have no race conditions. As soon as you start a second process up, you have the possibility that both processes try to manipulate the same data at the same time, and hence race conditions. The most common being that both read it, both update data, both write it, and one of the two gets lost. But there are worse possibilities in which the file becomes corrupted and unreadable. – btilly Mar 14 '13 at 15:08
  • You couldn't find a faster way to write data to a disk than JSON? – JeffO Mar 14 '13 at 16:04
  • 3
    @YannisRizos comment 1. Could you write that in an answer so that I can downvote it? Databases in general incur massive overheads, if you can do something faster with a database than with in-process data in a program dedicated to that task, then it is your dedicated program that sucks. – aaaaaaaaaaaa Mar 14 '13 at 16:04
  • 1
    Can you do spatial queries on a file or in-memory data structure? Because it is trivial to do it on spatial DBs. – sakisk Mar 14 '13 at 17:38
  • Just because a 'cycle' serves your purpose doesn't imply that a car is useless :) – PhD Mar 15 '13 at 00:45
  • @btilly but why would I use another process? That node.js instance is the one responsible for controlling the data, serving it and building up the state of the site/game. – MaiaVictor Mar 15 '13 at 01:08
  • @Dokkat If you ever have a popular website, then you'll find that one process does not have sufficient CPU. Likewise if redundancy becomes a business requirement, you will want to have multiple webservers able to go at a moment's notice. – btilly Mar 15 '13 at 05:40
  • @btilly oh that's interesting. So people actually open several node.js applications in different computers to serve a huge site? That makes a lot of sense now, but I have no idea on how this would be implemented. I mean, when someone makes a HTTP request it goes directly to the IP/port of the specific application... – MaiaVictor Mar 15 '13 at 05:43
  • 1
    @Dokkat Look up load balancing. Making many machines seem like one IP/port for web requests is a long-solved problem. – btilly Mar 15 '13 at 07:18
  • The OP is pretty much describing a homemade Mongo DBMS. It's a perfectly legitimate use case and was the basis for the NoSQL movement that was popularized when Google published its Map Reduce paper in the early 2000s. – Sridhar Sarnobat Mar 10 '16 at 00:18
  • This discussion is also useful I guess http://stackoverflow.com/questions/3748/storing-images-in-db-yea-or-nay – DmitryBoyko Apr 29 '16 at 15:41
  • @btilly you can have race condition even in cooperative multitasking system running in one CPU. Parallelism is not a necessary condition for race conditions. – Lie Ryan Mar 09 '20 at 09:14
  • Can't believe such a subjective/opinion-based question is still open after 7 years and no one voted it to be closed... – Tseng Mar 09 '20 at 10:55
  • @LieRyan I agree when dealing with external resources such as databases. When dealing with just node.js code race conditions are possible but really easy to avoid. – btilly Mar 09 '20 at 16:35
  • Which language are you using? If your program already keeps the data in objects AND the data is small enough to fit in computer's memory, then you don't need an extra layer of complexity (DB) at all. – IceCold Sep 18 '22 at 05:45
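
The lost update described in the comments above ("both read it, both update data, both write it, and one of the two gets lost") can be reproduced on a single thread. Below is a minimal sketch, assuming Python's `asyncio` as a stand-in for Node's event loop, since both are cooperative; the `balance` dictionary and the function names are hypothetical and not from the question. The point it illustrates is that any `await` between a read and the matching write lets another handler run in between.

```python
import asyncio

# Shared in-memory state, standing in for a global object in the application.
balance = {"alice": 100}


async def save_to_disk():
    # Stand-in for any awaited I/O (file write, network call).
    # Awaiting here hands control back to the event loop.
    await asyncio.sleep(0)


async def unsafe_withdraw(amount):
    current = balance["alice"]            # read
    await save_to_disk()                  # other handlers may run here
    balance["alice"] = current - amount   # write based on a possibly stale read


async def safe_withdraw(amount, lock):
    # Holding the lock across the await makes read-modify-write one unit.
    async with lock:
        current = balance["alice"]
        await save_to_disk()
        balance["alice"] = current - amount


async def run_unsafe():
    balance["alice"] = 100
    await asyncio.gather(unsafe_withdraw(30), unsafe_withdraw(50))
    return balance["alice"]


async def run_safe():
    balance["alice"] = 100
    lock = asyncio.Lock()
    await asyncio.gather(safe_withdraw(30, lock), safe_withdraw(50, lock))
    return balance["alice"]


unsafe_result = asyncio.run(run_unsafe())  # one withdrawal is lost
safe_result = asyncio.run(run_safe())      # both withdrawals are applied
```

With no `await` between the read and the write, the unsafe version would also be correct, which is why this kind of race is easy to avoid inside a single process. A database transaction provides the same guarantee across processes and machines, where an in-process lock cannot.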

13 Answers

306
  1. You can query data in a database (ask it questions).
  2. You can look up data from a database relatively rapidly.
  3. You can relate data from two different tables together using JOINs.
  4. You can create meaningful reports from data in a database.
  5. Your data has a built-in structure to it.
  6. Information of a given type is always stored only once.
  7. Databases are ACID.
  8. Databases are fault-tolerant.
  9. Databases can handle very large data sets.
  10. Databases are concurrent; multiple users can use them at the same time without corrupting the data.
  11. Databases scale well.

In short, you benefit from a wide range of well-known, proven technologies developed over many years by a wide variety of very smart people.

If you're worried that a database is overkill, check out SQLite.
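
As a rough illustration of points 1 to 3 and of the SQLite suggestion, here is a minimal sketch using Python's built-in `sqlite3` module. The `members` and `grades` tables, their columns, and the sample rows are hypothetical, loosely modelled on the site described in the question's comments.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
conn = sqlite3.connect(":memory:")  # use a file path instead to persist to disk
conn.executescript("""
    CREATE TABLE members (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE grades (
        member_id INTEGER NOT NULL REFERENCES members(id),
        exam      TEXT NOT NULL,
        score     INTEGER NOT NULL
    );
    -- The index lets lookups by member avoid scanning the whole table.
    CREATE INDEX grades_by_member ON grades(member_id);
""")

# Used as a context manager, the connection commits on success
# and rolls back if an exception is raised inside the block.
with conn:
    conn.executemany(
        "INSERT INTO members (id, name) VALUES (?, ?)",
        [(1, "alice"), (2, "bob")],
    )
    conn.executemany(
        "INSERT INTO grades (member_id, exam, score) VALUES (?, ?, ?)",
        [(1, "math", 90), (1, "physics", 70), (2, "math", 60)],
    )


def average_scores(connection):
    """Return (name, average score) per member, highest average first."""
    return connection.execute("""
        SELECT m.name, AVG(g.score) AS avg_score
        FROM members AS m
        JOIN grades AS g ON g.member_id = m.id
        GROUP BY m.id, m.name
        ORDER BY avg_score DESC
    """).fetchall()


rows = average_scores(conn)
```

The same report over JSON objects would be a hand-written loop for each new question. Here a new question is a new query, and the engine decides how to use the index.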

Robert Harvey
  • 199,517
  • 4
    Thanks for the input, but I still don't get why it's necessary.
    1. What's the point?
    2. How is this faster than "looking up" data that is already on the program? It's a step less.
    3. You can do the equivalent, better, using maps and reduces.
    4. I see that as a point but you can do it without databases as well.
    5. You can perfectly structure your data without databases. It's trivial to store, say, JSON objects as tables.
    6. What?
    9. So can not using it at all?
    10. Programs can be concurrent. Also I don't think it's very safe to have 2 programs editing the same data at once?
    – MaiaVictor Mar 14 '13 at 03:03
  • 27
    6. Normalization, 7. See the link, 8. Read up on fault-tolerance. Oh, and before you get sucked up into the NoSQL craze, learn about SQL databases; get to know them on their own terms. You will understand. If you're just talking about simple configuration data, JSON may be all you need. But there are many other types of data out there besides program settings. – Robert Harvey Mar 14 '13 at 03:04
  • 30
    As far as it not being safe to have two programs editing the data at once, well, that's partly why databases exist. If you ever have this need (and some or all of the other needs I mentioned), you're going to be very glad that you don't have to re-invent all this. – Robert Harvey Mar 14 '13 at 03:07
  • 3
    I kind of "understand" you. There ARE clearly many benefits in using a database. I can see that. Yet, I think I just fail to see where not using it at all will be a problem for me. I can visualize myself saving all my data on disk and loading it, without ever having a problem. I think this is the problem: I see why it's "good", but why is it "necessary"? Mainly given how MUCH, MUCH faster it is to manipulate data from memory rather than querying a database all the time. It seems like a cost that is not paid off at all, even after the benefits. – MaiaVictor Mar 14 '13 at 03:10
  • 14
    I think you're thinking in terms of small amounts of memory and data. I routinely work with data sets that are many gigabytes in size, and it's just not practical to load up all that data and tie it up on a single machine every time someone wants to work on it. I would imagine the practical working limit for a JSON data set without any indexing is probably 50 megabytes or so. – Robert Harvey Mar 14 '13 at 03:11
  • 24
    @Dokkat It's not necessary, nothing is. If your approach works for you, by all means go for it. I should mention however that most half decent rdbms support memory based storages, you can load everything you need in memory when your app wakes up (as you already do), and query them as you would a typical database (keeping all the benefits Robert mentioned). – yannis Mar 14 '13 at 03:13
  • 30
    To put it another way, sometimes you need a tent, but sometimes you need a house, and building a house is a whole different ball game than pitching a tent. – Robert Harvey Mar 14 '13 at 03:13
  • 1
    I'm getting it - indeed, I'm not even considering GBs of data at all, so I see. But let's try an example? Imagine you're making a page on your site that shows a full list of members. How can you implement this? a) first query the database to get an array with the list of members then use that data to answer the request... b) just have that data stored on your program already and simply send it! This is the basic use case for a DB for me, so involving it is just an additional step. How can an additional step be faster than not having it at all? That's what is confusing me. Thoughts on this? – MaiaVictor Mar 14 '13 at 03:21
  • 1
    @Dokkat: OK. So how do you deal with the concurrency issue? People are going to be adding new accounts while you retrieve your user list. How will you deal with that? – Robert Harvey Mar 14 '13 at 03:26
  • 2
    @Dokkat Well, you are comparing in-memory data (read 'volatile') vs. database (read 'stable')... What happens if the application crashes: there is suddenly no "list of members". Having unnecessary data in memory slows down your application... until you run out of it. Then: puff! – rae1 Mar 14 '13 at 03:26
  • 1
    @rae1n if it crashes you save the data to disk and reload it? Does having data in memory actually slow down your application if you don't run out of it? – MaiaVictor Mar 14 '13 at 03:34
  • @RobertHarvey I don't get it... how is this a problem...? If you save your data as a JavaScript object in a node.js environment, write and read it on requests, will it have concurrency issues? What will happen? (Also, would this be solved if I used immutable data? Clojure?) – MaiaVictor Mar 14 '13 at 03:36
  • 11
    @Dokkat I doubt you'll have time to save during a crash... by definition code execution just stops! – rae1 Mar 14 '13 at 03:37
  • @Dokkat And you get a limited amount of memory from the OS, which can take it away at a moment's notice (hopefully not)... not to mention GBs of data! – rae1 Mar 14 '13 at 03:39
  • 56
    @Dokkat when people are referring to crashes, they mean stuff like... your CPU blew up halfway through writing your "database" file. What happens now? Most likely your file is corrupt / unreadable (at least, it may no longer conform to your own format), and you need to restore from a backup (while most "real" DBs would only lose the last transaction). Of course, you can write code to make it handle this. Then you can write code for all the other stuff. And then you realise you've spent 6 months writing a DB, which you could have used from the start, for very little effort. – Daniel B Mar 14 '13 at 06:44
  • OK, I get the crash problem, but how about the race condition thing? What do you guys mean? I can't access global objects from within node.js's request handler events safely? – MaiaVictor Mar 14 '13 at 09:31
  • 10
    @Dokkat - Is your program going to be used by more than one person at a time? (Hint - if it's web-based, the answer is very likely "yes".) As soon as more than one person has access to the data at the same time, you run the risk of race conditions. Database management systems already have mechanisms in place to help manage these situations. – Shauna Mar 14 '13 at 17:37
  • 2
    @Shauna sorry but this is just not true. Node.js is single threaded (as I've just learned) and can perfectly serve data to thousands of people without having concurrency problems. – MaiaVictor Mar 15 '13 at 01:09
  • 2
    @Dokkat: Not quite. See http://stackoverflow.com/q/7018093 – Robert Harvey Mar 15 '13 at 02:04
  • 9
    @Dokkat: On any non-hobby project, you'd want to use more than one machine, usually for availability and sometimes for scalability (lots of users using it, etc.). Using a file system will choke right there (can't see your user JSON from the other machine!). You can use a shared disk, but now every single query goes to disk. And you have to be wary of race conditions (multi machines == multi threads!). etc. etc. Believe me, RDBMSs didn't become the de facto standard for databases for no reason. SQL is highly useful as well. The MR paradigm isn't "better" than SQL. They are just great for different things. – Enno Shioji Mar 15 '13 at 11:02
  • 3
    @Dokkat - re "You can do the equivalent, better, using maps and reduces." Not really. A map/reduce is at best equivalent to a database table scan. The database engine is very good about making use of information and indexes to do better than a table scan. Which, of course, you can do yourself, but then you've re-invented what a DBMS gives you. – parsifal Mar 20 '13 at 13:18
  • @JMoore for purposes of this answer, just substitute RDBMS everywhere you see the word "database." – Robert Harvey May 20 '13 at 00:54
  • Don't forget about views! – Jack Henahan Apr 11 '14 at 00:53
  • @Daniel Writing data to a file in a way that is crash safe was a solved problem 20 years ago. – gnasher729 Mar 09 '20 at 08:48
  • If your program already keeps the data in objects AND the data is small enough to fit in computer's memory, then you don't need an extra layer of complexity (DB) at all. – IceCold Sep 18 '22 at 05:45
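
For the crash scenario raised in the comments above (a crash halfway through writing the data file), one common approach is to write to a temporary file and then rename it over the old one. The following is a minimal sketch, assuming Python and a filesystem where `os.replace` is atomic within a single directory; the function name `save_json_atomically` is hypothetical. It covers only a single-file, single-writer save and none of the concurrency or partial-update concerns discussed in the thread.

```python
import json
import os
import tempfile


def save_json_atomically(path, data):
    """Write data as JSON so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory as the target,
    # so the final rename never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to the device
        # Swap the finished file into place in one step.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything failed before the swap.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

Readers see either the complete old file or the complete new one, never a half-written mix. Making the rename itself durable across a power loss may also require syncing the containing directory on some systems, and every save still rewrites the whole data set.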