
What is the best method to parse events in daily batches from an external source (a CSV file), producing entries in a separate table whenever any of multiple conditions is met? The conditions can vary over time and need to retain a certain level of flexibility, but they are simple: a field either equals, starts with, or contains certain strings.

What I envision is two tables: the first to import the external data, and a second to keep a partial record of each event, with some type of rating depending on which condition was met. My question revolves around how to fill one from the other. My first thought was to add a trigger on the first table that checks the conditions and inserts values into the second. Another thought was to run a scheduled query on the first table, some time after the daily upload, that inserts into the second table. The first method, using triggers, sounds like it would be onerous to maintain and change as requirements evolve. The second seems like more overhead, as it would mean running two scheduled procedures: one to upload the data and a second to fill the second table.

Any thoughts are welcome. Nothing has been built at this point except the format in which the .CSVs will be received. Thank you for reading my post and to all who contribute.

nbayly
  • Rule #1: don't put business logic in the database. Write a program. – Ewan Sep 30 '16 at 07:27
  • OK, what do you mean by a program? An executable? A LAMP/WAMP stack and code in PHP? I don't have the access or knowledge to do that, but what would that look like in any case? – nbayly Sep 30 '16 at 14:20
  • How are you getting the CSVs? I'd have some sort of queue and a worker process to work through it, maybe with a web page front end to show results. I'd go with .NET myself, but choose the language you like best. – Ewan Sep 30 '16 at 14:32
  • The CSVs are created by a web tool (which we don't manage) and downloaded manually into a folder on a daily basis. Can you explain in further detail what you mean by a queue (a scheduled procedure in MySQL?) and a worker process? Also, can you provide a resource explaining why it's not a good idea to place business logic in the database? I don't have any experience in .NET, and this project isn't big enough in scope to justify investing extensively in learning it; any further suggestions? Regards – nbayly Sep 30 '16 at 15:23
  • OK, so the best-practice answer is to solve this with code and queuing; second best, an executable with a UI. But it seems you don't have the resources to do it that way. Go for the solution you find practical and don't worry about best practice. – Ewan Sep 30 '16 at 15:57

1 Answer


Database

The first question is about what you really want in your database. Your current approach would be to have a table of conditions, to keep the source data that you import, and to construct the event rating table with some parsed information:

[Diagram: Condition, SourceEvent, and EventRating tables, where one source event can match several conditions]
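To make the discussion concrete, here is a minimal sketch of what such a schema could look like. All table and column names here are assumptions (the real ones would come from your CSV format), and an in-memory SQLite database stands in for your actual server:

    import sqlite3

    # In-memory stand-in for the real database; every name below is a placeholder.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
    CREATE TABLE source_event (          -- raw rows imported from the daily CSV
        id         INTEGER PRIMARY KEY,
        batch_date TEXT NOT NULL,        -- which daily upload the row came from
        source     TEXT,
        message    TEXT
    );
    CREATE TABLE match_condition (       -- one row per (changeable) condition
        id         INTEGER PRIMARY KEY,
        field_name TEXT NOT NULL,        -- which source_event column to test
        match_type TEXT NOT NULL,        -- 'equals' | 'starts_with' | 'contains'
        pattern    TEXT NOT NULL,
        rating     INTEGER NOT NULL
    );
    CREATE TABLE event_rating (          -- one row per (event, matched condition)
        event_id     INTEGER REFERENCES source_event(id),
        condition_id INTEGER REFERENCES match_condition(id),
        rating       INTEGER NOT NULL
    );
    """)

Keeping the conditions in their own table is what gives you the flexibility you asked for: changing a rule is then an UPDATE, not a schema or code change.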

Do you really want to keep the source events in the database?

  • Do you intend to re-parse old events in case the conditions change?
  • Do you hold much more information there than in your rating table, information you might need later (for example, to print the full details of highly rated events)?
  • It is not clear whether multiple conditions can result in multiple matches (see the diagram above), or whether there is only one rating per event (in which case you'd have a one-to-one relation with the event source).

Database approach

If you need all these tables in the database, you could consider processing everything there:

  • You could use triggers, as you've described, on the SourceEvent table. But triggers are non-standard and vendor-specific, so difficult to port. Furthermore, string parsing is not done very efficiently in triggers, so this should be envisaged only if your CSV is already parsed by the upload tool.
  • You could also consider an SQL script doing some selections and an INSERT INTO ... SELECT (see the sketch after this list). If there may be only a single rating for any event, then the challenge would be to aggregate the several matching ratings per event (for example by building an aggregate view on the rating table). Your script should then use a date or a batch number to restrict the processing to the new items. Or, if you can afford it, delete the processed SourceEvents at the end of the fill if you no longer need them.
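A sketch of what that scheduled fill could look like, reusing the hypothetical names from the schema above (SQLite syntax; in MySQL the string concatenation operator || would become CONCAT()):

    # Fill event_rating from one daily batch with a single INSERT ... SELECT.
    # All table and column names are the placeholders assumed earlier.
    FILL_SQL = """
    INSERT INTO event_rating (event_id, condition_id, rating)
    SELECT s.id, c.id, c.rating
    FROM source_event s
    JOIN match_condition c
      ON CASE c.field_name            -- pick the column the condition tests
           WHEN 'source'  THEN s.source
           WHEN 'message' THEN s.message
         END LIKE
         CASE c.match_type            -- build the LIKE pattern per match type
           WHEN 'equals'      THEN c.pattern
           WHEN 'starts_with' THEN c.pattern || '%'
           WHEN 'contains'    THEN '%' || c.pattern || '%'
         END
    WHERE s.batch_date = ?            -- restrict to the new batch only
    """

    def fill_ratings(conn, batch_date):
        """Rate all source events of one daily batch in a single transaction."""
        with conn:  # sqlite3: commits on success, rolls back on error
            conn.execute(FILL_SQL, (batch_date,))

One caveat: LIKE treats % and _ inside the stored pattern as wildcards, so if your patterns may contain those characters, the 'equals' branch should use = instead.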

In both cases, you have to cope with the transactional logic: all the changes are committed only at the end, so with big input files you might experience slowdowns, increased latency, and similar performance issues.

Application approach

Another approach would be to develop an external program that automates the full processing: fetch the event source file, read and parse the CSV, and apply the rules on the fly to calculate the ratings. This will certainly be more efficient, as it avoids reading the same data several times. A minimal sketch of the rule-matching part follows.
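For example, the parsing and rule matching could look like this. The rules are hard-coded here for brevity, and the field names and patterns are invented; in practice you would load the rules from your condition table or a configuration file to keep them easy to change:

    import csv

    # Hypothetical rules: (field, match type, pattern, rating).
    RULES = [
        ("message", "equals", "login failed", 5),
        ("message", "contains", "timeout", 3),
        ("source", "starts_with", "ext-", 1),
    ]

    def matches(value, match_type, pattern):
        if match_type == "equals":
            return value == pattern
        if match_type == "starts_with":
            return value.startswith(pattern)
        if match_type == "contains":
            return pattern in value
        raise ValueError(f"unknown match type: {match_type}")

    def rate_events(csv_path):
        """Yield (row, rule, rating) for every rule each CSV row matches."""
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                for field, match_type, pattern, rating in RULES:
                    if matches(row.get(field) or "", match_type, pattern):
                        yield row, (field, match_type, pattern), rating

Run daily (for example from cron or the Windows task scheduler) right after the file is downloaded, such a program can insert only the ratings, and the raw source rows only if you decide you need to keep them.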

Christophe
  • First, thanks for reading my post and providing an in-depth answer. Here are answers to your questions: I was planning on purging the rows after a certain time, say 6 months, as their relevance dwindles with time. And though I don't plan on re-parsing old logs when conditions change, I did think of pulling the rows in detail for highly rated events, as you suggested. My apologies for not clarifying, but yes, multiple conditions can result in duplicate events. Thank you for bringing up the issue with transactions (which I'm only recently coming to understand); I will keep it in mind. – nbayly Sep 30 '16 at 20:43
  • If conditions are not retroactive, it's best to apply them at insertion only, so my advice would be the application approach. If for any reason you don't like it, opt for the script approach (it'll be easy, since you allow multiple ratings and need no aggregation). Due to potential changes in the conditions, I fear the trigger approach might cause issues (additional ratings, changes to old ratings), for example if you changed some events, or if for technical reasons you had to reload the source table or part of it after problems during the import. – Christophe Sep 30 '16 at 20:53