
I have several million booking rows in a table and would now like to save the totals of the booking amounts in another table, grouped by customer, account no., product, etc.

Should this necessarily be done at application level, or can the database (MySQL InnoDB) also do this reliably with INSERT INTO SELECT? I could imagine that the database might have a performance advantage because you don't have to read the data from the database chunk by chunk. In addition, with an application-level solution I would have to mark (UPDATE) each data record that I have already processed in order to continue at the correct point if the program terminates prematurely.

How would you solve this?

root66
  • You should be able to do this entirely in the database, and simple aggregations over just millions of rows should complete almost instantly. Just ensure that you have an appropriate level of skill to cover whatever development you propose, since databases can be extremely important and disruption to their workings can have serious effects for a business. – Steve Dec 25 '23 at 12:01

2 Answers


Should this necessarily be done at application level, or can the database (MySQL InnoDB) also do this reliably with INSERT INTO SELECT?

It is reliable, until it is not.

Honestly, this all depends on how complex your calculation logic is and how well it can be implemented in SQL. Your vague description

save the totals of the booking amounts in another table, grouped by customer, account no., product,

sounds like something which can be implemented perfectly well with a few simple GROUP BY statements. "Several million" booking rows is not an order of magnitude that should be a huge problem for most contemporary database servers, as long as the subtotals can be calculated by a linear scan. However, you may not have told us the full story here; maybe your real requirements are more complex - only you know.
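To make that concrete, here is a minimal sketch of such a group-by statement. The table and column names (`booking`, `booking_totals`, `amount`, and so on) are assumptions, since the question does not show the actual schema:

    -- One pass over the booking table, one total per customer/account/product;
    -- assumes booking_totals is empty (or has just been cleared).
    INSERT INTO booking_totals (customer_id, account_no, product_id, total_amount)
    SELECT customer_id, account_no, product_id, SUM(amount)
    FROM booking
    GROUP BY customer_id, account_no, product_id;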

If I were in your shoes, I would simply try this out with SQL, which is probably the most straightforward and simple solution. In case you need to join in other, related tables, you have to take care of proper indexing, of course. If this does not work well, runs too long, etc., you can still switch to something more complicated.

I would have to mark (UPDATE) each data record that I have already processed in order to continue at the correct point if the program terminates prematurely.

Before going that route, I would first check whether you can implement this without any markers - this is simpler and may be fully sufficient. Start by creating the aggregated tables in a single transaction. If the server terminates unexpectedly during execution, the transaction should be rolled back and nothing bad happens - you can simply retry the operation at a later point in time.

If you need more than one transaction for filling an aggregate table, implement this in an idempotent way, where you can simply repeat the whole process and always get the same result. In this case, it might be sufficient to clear an aggregate table before refilling it. Or you can define the transactions which fill your aggregates in such a way that you can determine from the content of the aggregate table which data has already been processed successfully and which has not. Adding extra status columns or tables to the source data for tracking former processing steps is an optimization which has the potential to prove itself premature and unnecessary.
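A minimal sketch of the "clear and refill in one transaction" idea, with the same assumed names as above. Note that DELETE is used rather than TRUNCATE, because TRUNCATE causes an implicit commit in MySQL and would break the single-transaction property:

    START TRANSACTION;
    -- throw away the old aggregate ...
    DELETE FROM booking_totals;
    -- ... and rebuild it from scratch; if anything fails before COMMIT,
    -- InnoDB rolls the whole thing back and a later retry starts clean
    INSERT INTO booking_totals (customer_id, account_no, product_id, total_amount)
    SELECT customer_id, account_no, product_id, SUM(amount)
    FROM booking
    GROUP BY customer_id, account_no, product_id;
    COMMIT;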

When you really run into one of the potential issues sketched in Ewan's answer, there is still time enough to switch to something more complicated and find out whether a solution implemented at "the application level" will really behave better. Don't forget that getting the data out of the database and the results back into it introduces some extra overhead - in terms of running time, resource usage and programming effort - which needs to be balanced by some real benefits.

See also: If there are two ways of approaching a task, how should one choose between them?

Doc Brown

Although it seems like you could do this purely in SQL, it's generally not advisable, for the following reasons:

  1. Complexity. The business logic in one of your calculations may become too complex to handle in SQL - for example, "make a geolocation API call to convert the postcode into lat/long" or "send an email". Even if you don't have this at the moment, it could become a factor later on.

  2. Performance. Although the process itself will run fast when done purely in SQL, it won't be "performant" from the point of view of the system as a whole. The whole operation will be done in a single transaction, potentially taking up all the resources of the database server for the duration and breaking other operations. A failure halfway through may leave you with an unknown result, with some records changed and some not.

    If done in code, you can run each insert individually; the entire operation may take longer, but your database will remain up and servicing other calls for the duration.

  3. Error handling. Although SQL does allow some error handling, what starts out as a simple "insert this select" can become far more complex, with loops and child sprocs, once you have to handle individual rows failing and being stored in some errored-row table for reprocessing.

  4. Testing. SQL is generally harder to write unit tests for. If you put the logic in code, each individual function can be tested.

Ewan
  • These are all very questionable assertions indeed. Certainly (2) is completely unfounded, since there is no significant obstacle in SQL to breaking up a batch of work into individual transactions (a rough sketch of this appears after this comment thread), and the transactional guarantees the database offers will prevent inconsistency under all circumstances (or, if intractable performance problems emerge, they will demonstrate the contradictions of your design). – Steve Dec 25 '23 at 11:48
  • In my defence: A. the question at face value implies a single SQL statement, and B. if you go further, with a loop in the SQL (itself considered bad practice) and sub-transactions, you can see that it's still not trivial to solve the issues of this long-running process and of knowing which rows have been processed. If we compare a 'full' pure-SQL solution which handles all these problems vs a code solution, the apparent benefits of doing it the "easy" one-line-of-SQL way disappear, and it's not about simple vs complex anymore; it's complex vs complex, a much fairer comparison. – Ewan Dec 25 '23 at 11:58
  • Loops are considered a bad practice in SQL when a single-statement alternative exists, but if your whole argument rests on saying that a single statement isn't appropriate for this case (for example because of the duration of the locks taken, or the resources consumed for transactional consistency of the entire batch), then looping in SQL obviously becomes a legitimate solution. I certainly agree it's a complicated judgment. – Steve Dec 25 '23 at 12:13
  • Hmm, I have to disagree with you here: if you have complex "business logic" or ETL procedures in SQL, that's bad practice. A loop is a sure sign you are doing something wrong. – Ewan Dec 25 '23 at 12:16
  • The complexity of any calculations may be a reason to palm it off to a different language. But honestly, looping in its own right is not a bad practice if it is (a) still a correct solution, and (b) done specifically to fall back from a correct single-statement solution which is too demanding for the context. The reason loops are suspect in SQL is because new practitioners either (a) use them as a first resort (inefficiently, and often incorrectly), or (b) use them incorrectly in a way that doesn't preserve an appropriate transactional consistency. – Steve Dec 25 '23 at 13:24
  • It's considered bad practice; obviously if it works it works, but it's a code smell. Using your database resources to run logic is expensive. You can do it more cheaply, scalably and reliably by writing code which runs on some other box. I don't think this is a controversial opinion in 2023. You would only try to do this in a DB if you have no other option. – Ewan Dec 25 '23 at 15:26
  • I'd say it's very controversial. – Steve Dec 25 '23 at 16:20
  • Maybe if you are a DBA. – Ewan Dec 25 '23 at 16:25
  • You mean, if you're someone who is paid specifically to understand databases and their performance? Honestly it's a very complicated area; but it's your answer. – Steve Dec 25 '23 at 17:00
  • I mean, if you don't have the option to add code somewhere, you have to do it in SQL or SSIS or a job or something. Spinning up a machine and writing and deploying code is a big deal. If you have a bunch of code already, running on multiple machines, then writing a new worker and running it on some cheap load-balanced machine is nothing and the obvious choice. You don't have a tribal allegiance to the database being the best solution. – Ewan Dec 25 '23 at 18:36
  • I agree tribalism is not a justification. And if you already have a client application, then that makes putting the logic there more reasonable than it would be otherwise. But it's very difficult to make sweeping statements about performance, reliability, or any other factor. Circumstances are so various that the context always counts. I'm reluctant myself to be more specific in the rebuttal, because of the complexity of setting up the context and covering my own backside! – Steve Dec 25 '23 at 20:11
  • Why don't you write an answer with a sample sproc showing how you would deal with updating each row as "done" or "errored", allowing for batching up into, say, 1000-row groups per transaction to avoid memory consumption, and dealing with a connection timeout? Would you run it on a separate reporting box? I want to know if you just think it's so trivial you won't run into these issues, or if having complex logic in the DB is preferable in general. – Ewan Dec 26 '23 at 11:21
  • The OP's case sounds trivial enough to me that in normal circumstances it could be handled as a one-liner. He only asked abstractly which approach is better - he wasn't saying he already had a problem with a simple aggregating query. If a query was a problem, I'd need far more information about why, about the context, and about the latitude for a solution. There are all different kinds of "complicated logic" - the database is the best place for any complications involving consistency, but not necessarily the best place for all complications. – Steve Dec 26 '23 at 12:44
  • OK, well, my answer is for the case where it's non-trivial. – Ewan Dec 26 '23 at 13:54
  • But you're not being specific about which circumstances your answer would apply to. In many circumstances your answer wouldn't be right. I'm only cautioning against sweeping statements. Whether to execute server-side or client-side, and whether the solution to a found problem lies in changing the site of execution (rather than hardware reinforcement, say), is an expert professional judgment requiring a large amount of contextual information. There is no simple rule for high performance or problem-free operation, like "always the client". Unsure whether you're misunderstanding or disagreeing. – Steve Dec 26 '23 at 14:58
  • Disagreeing. It's a simple general rule to avoid doing processing on the database. Following it will result in high-performance, problem-free software. – Ewan Dec 27 '23 at 10:10
  • Simple general rules make for good, helpful answers to simple, low-detail questions. – Ewan Dec 27 '23 at 10:13
  • @Ewan: with simple, general rules, it is a little bit tricky. My answer also contains a simple, general rule for creating high-performance, problem-free software: "If one has to make a decision between two approaches, try out the simpler one first and use the more complicated one only when the simpler one proves itself not to be sufficient". – Doc Brown Dec 27 '23 at 14:12
  • Yeah, I mean you could optimise it slightly to "ask if there is a problem with the simple approach first, and then try it if not". – Ewan Dec 27 '23 at 14:57
  • @Ewan, for an ETL-style process where the source and destination are the same database engine, doing all execution server-side will always be the simpler starting approach (from an analysis and coding perspective), and quite probably the most correct and performant. This is because there is less heaving of data in and out, and because the transaction manager can oversee the entire operation. It doesn't have to be a sproc if that doesn't suit your approach to source code control or testing - the SQL code can be submitted from the client side, but with execution all server-side. (1/3) – Steve Dec 28 '23 at 12:00
  • The overwhelming risk of your client-first approach is that manually analysing the transactional consistency needs, evaluating whether there is a solution for those needs whilst moving data back and forth to the client side, and then implementing a correct algorithm, is an expert-level task. If you're exclusively a client-side developer having your database-altering work overseen by a database developer, you'd be fine to rely on these simple rules, knowing that an expert will monitor and intervene as appropriate. (2/3) – Steve Dec 28 '23 at 12:01
  • But I don't think it's reasonable to infer that the OP is working under that supervision, or that his level of expertise with databases is such that he could be expected implicitly to perform the analysis that is necessary to determine whether this task can be moved to the client side without subtle defects in its consistency when the database is under load performing concurrent writes. I suspect your awareness may be hazy, because most of us express howls of anguish at the idea of losing the transactional protection provided by the engine, and there's no reason here to move out of that realm. (3/3) – Steve Dec 28 '23 at 12:12
  • @Steve I think you need to write an answer to that effect. I contend that with a client-based approach there is no need for a deep analysis: you use the DB methods you already have for setting and retrieving, and do the calculation in unit-tested code. You are using cheap, unshared resources to perform the work in a scalable and rate-limitable way. You avoid the well-known pitfalls of overloading your central, shared, probably transactional resource with large batch calculations. – Ewan Dec 28 '23 at 12:14
  • A naive approach to this problem, even with a simple "select sum(order.cost), customerid into ..", will land you in hot water at some point. Show us how you would implement this with batches on the database. Show us how you prevent large transactions, table locking, etc. My feeling is that once you account for anything above the most trivial, the complexity of the two solutions makes the comparison obvious. – Ewan Dec 28 '23 at 12:20
  • @Ewan, the reason I won't be drawn on a formal answer which addresses and counters all your canards is because I anticipate the extraordinary complexity of making such an answer. As for implementing batching, I'd have no problem at all giving you a code example of batching (although I'd have to move off my mobile phone and onto the desktop to draft it), but I'm unclear exactly what impediment such batching is supposed to be working around (and which approaches couldn't be used), and I'm unclear what consistency constraints apply (and which approaches couldn't be used). (1/3) – Steve Dec 28 '23 at 13:27
  • None of us so far have any reason to believe that a simple INSERT INTO ... SELECT FROM ... GROUP BY one-liner wouldn't fit the bill, and if it didn't fit the bill for some performance reason, it's unclear why the same performance problem wouldn't apply to a basic SELECT FROM which returns all the necessary raw data to the client. You ask me how I would "avoid large transactions and table locking". By default, I wouldn't be trying to avoid a transaction, because it is an important (and often necessary) safety guarantee which the database engine provides simply and by default. (2/3) – Steve Dec 28 '23 at 13:27
  • IF we've established that the transaction is a problem, and IF we don't need both the read and write in the same transaction, and IF we can read the table fast enough to return raw data to a client but not to make a transactional summary and insert of the same data, then the standard solution would be simple: not batching the aggregation, but reading the raw data into a temp table and then treating that temp table as the source (and from there onwards, just adapting the same simple one-liner we began with to draw from the temp table instead of the master table). (3/3) – Steve Dec 28 '23 at 13:34
  • Continuing (4/3)!... Those three IFs are however very contingent, and if the transaction is the problem, it does not necessarily follow that the operation can simply be made non-transactional (which requires expert analysis). There may instead need to be a bespoke algorithm implemented to guarantee the necessary consistency whilst lifting the data out of the database (which requires expert development), or if the initial reading of the table itself is too slow to even get a copy out, then there may be no alternatives but hardware reinforcement or a systemic redesign. – Steve Dec 28 '23 at 13:41
  • OK. 1. We know for sure that a simple one-liner will fail at some critical number of affected rows. At some point it either eats all your DB memory with an open transaction, locks a table for longer than the other operations' timeout, or just takes too long for your connection timeout. This gets more likely as the calculation complexity, the number of affected rows, and the general rate of other requests per second on your DB increase. – Ewan Dec 28 '23 at 18:17
  • I only bring the transaction up because it's a problem with the database, not the process as a whole. If you need a transaction as part of the operation, you would obviously have to deal with that in some way. I don't see it as a problem; there are many approaches you could use, depending on the details of the calculation. – Ewan Dec 28 '23 at 18:19
  • Your assumption here that you just wouldn't have to deal with these problems is exactly why my answer is correct. If you ignore them, your application WILL hit a scalability limit and crash disastrously at some point. This is the "performance" I'm talking about. If you follow my general advice, your app will scale and you will not be getting calls at Christmas asking why the accounts run has killed the webshop during peak sale hours, etc. – Ewan Dec 28 '23 at 18:24
  • I'm giving you the benefit of the doubt; maybe you know some clever DB tricks to avoid these problems, but if it's just that you haven't come across them yet... – Ewan Dec 28 '23 at 18:25
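For reference, a rough sketch of the kind of server-side batching discussed in this thread: the bookings are processed in primary-key ranges, each range in its own short transaction, and the partial sums are merged into the aggregate table. The table and column names, the chunk size, the auto-increment `id` column and the unique key on (customer_id, account_no, product_id) in `booking_totals` are all assumptions for illustration; restart bookkeeping (remembering which ranges have already been merged) is omitted for brevity:

    DELIMITER //
    CREATE PROCEDURE aggregate_bookings_in_chunks()
    BEGIN
      DECLARE last_id BIGINT DEFAULT 0;
      DECLARE max_id  BIGINT DEFAULT 0;

      SELECT COALESCE(MAX(id), 0) INTO max_id FROM booking;

      WHILE last_id < max_id DO
        -- each chunk runs in its own short transaction, so locks are held only briefly
        START TRANSACTION;
        INSERT INTO booking_totals (customer_id, account_no, product_id, total_amount)
        SELECT customer_id, account_no, product_id, SUM(amount)
        FROM booking
        WHERE id > last_id AND id <= last_id + 100000  -- chunk of 100k ids
        GROUP BY customer_id, account_no, product_id
        ON DUPLICATE KEY UPDATE
          total_amount = total_amount + VALUES(total_amount);  -- merge the partial sums
        COMMIT;
        SET last_id = last_id + 100000;
      END WHILE;
    END //
    DELIMITER ;

Each COMMIT keeps the individual transactions short; whether this complexity is actually needed for "several million" rows, rather than the one-liner, is exactly the judgment call debated above.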