Efficient way/query to find related entities

Question

Let's say we have the following SQL table structure:

Entities (5-15k)
Keywords (15-20k)
EntityKeywords
ExcludedKeywords (keywords which should be excluded from common matching)

We need to find related entities: for a given entity, the other entities that share the most keywords with it, ordered by the number of shared keywords, descending.

Obviously, running this query on each page load would be too slow, because each query has to count shared keywords and order by that count. One idea is to aggregate the keywords into a single column for each entity and use full-text search over it. Is this a good approach?

Is this too much for SQL Server, so that it requires a different technology stack, or are there better ways to deal with this problem?
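
For concreteness, the straightforward per-request query could look something like the sketch below. It uses Python's built-in sqlite3 module only so that it can be run as-is. The column names (Id, EntityId, KeywordId, ...) and the reading of ExcludedKeywords as a global ignore list are assumptions, not part of the structure described above. On SQL Server the LIMIT clause would become TOP or OFFSET ... FETCH.

```python
import sqlite3

# Hypothetical schema: all column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE Entities (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Keywords (Id INTEGER PRIMARY KEY, Word TEXT);
CREATE TABLE EntityKeywords (
    EntityId  INTEGER REFERENCES Entities(Id),
    KeywordId INTEGER REFERENCES Keywords(Id),
    PRIMARY KEY (EntityId, KeywordId)
);
CREATE TABLE ExcludedKeywords (
    KeywordId INTEGER PRIMARY KEY REFERENCES Keywords(Id)
);
"""

# Self-join EntityKeywords on the keyword, skip excluded keywords,
# then count the shared keywords per other entity.
RELATED_SQL = """
SELECT other.EntityId AS RelatedId, COUNT(*) AS Shared
FROM EntityKeywords AS mine
JOIN EntityKeywords AS other
  ON other.KeywordId = mine.KeywordId
 AND other.EntityId <> mine.EntityId
WHERE mine.EntityId = ?
  AND mine.KeywordId NOT IN (SELECT KeywordId FROM ExcludedKeywords)
GROUP BY other.EntityId
ORDER BY Shared DESC, other.EntityId
LIMIT ?
"""


def related_entities(conn, entity_id, limit=10):
    """Return (related_entity_id, shared_keyword_count) rows, best match first."""
    return conn.execute(RELATED_SQL, (entity_id, limit)).fetchall()


def demo_connection():
    """Build a tiny in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO Entities VALUES (?, ?)",
                     [(1, "A"), (2, "B"), (3, "C")])
    conn.executemany("INSERT INTO Keywords VALUES (?, ?)",
                     [(10, "sql"), (20, "search"), (30, "the")])
    conn.executemany("INSERT INTO EntityKeywords VALUES (?, ?)",
                     [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20), (3, 10), (3, 30)])
    conn.execute("INSERT INTO ExcludedKeywords VALUES (30)")
    conn.commit()
    return conn
```

With the sample rows, entity 1 shares two non-excluded keywords with entity 2 and one with entity 3. Whether this is fast enough at the stated table sizes depends on indexing and would have to be measured.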

I would assume that searching through all aggregated keywords would be very slow by comparison, though I suppose the only way to know for sure would be to run some tests. Maybe there is even built-in support for this type of search. — Neil, Jan 24 '18 at 11:22
Usually the table structure has foreign keys, which tell you the relationships between tables (and there is software that can build an ER diagram from the foreign keys). Without foreign keys it is more difficult, as someone may have mis-entered a key, or the relationship sometimes exists ... In the latter case you have to review the source code to confirm the relationship. — Robert Baron, Jan 24 '18 at 11:54
Not more than that? Then it's almost certainly the best solution to just load the entire data set into a dataframe and perform all calculations in RAM. Not everything that can be done in SQL should be. — Kilian Foth, Jan 24 '18 at 13:41
It is hard to design a performant system without understanding more of the requirements. In this case, the missing one is how frequently the data set is updated. If the data set is static or hardly ever updated, that will steer you toward precomputing exactly the best data structure to answer the questions of interest, no matter how expensive that is or how many intermediate computations are required. OTOH, if the data set is constantly updated, then there are other solutions that can help, like a Rete engine, which does pattern matching on dynamically changing data. — Erik Eidt, Jan 24 '18 at 15:07
You're omitting several critical pieces of information, for example how does ExcludedKeywords enter the picture, and how frequently are those tables updated, and how (for example, can keywords be deleted? Are frequencies stored in EntityKeywords, or are they calculated by counting duplicate rows there?) — LSerni, Jan 24 '18 at 23:02
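
As a rough illustration of the "do it in RAM" and "precompute" suggestions in the comments, here is a minimal sketch in plain Python rather than a dataframe library. It assumes EntityKeywords has been loaded as (entity_id, keyword_id) pairs with no duplicate rows, and that ExcludedKeywords is a set of keyword ids to ignore everywhere. Both are assumptions, since the question does not say. The function name build_related is made up for the example.

```python
from collections import Counter, defaultdict


def build_related(entity_keywords, excluded=frozenset()):
    """Precompute, for every entity, the other entities ranked by shared keywords.

    entity_keywords: iterable of (entity_id, keyword_id) pairs (assumed unique).
    excluded: keyword ids that must not count as a match (assumed semantics).
    Returns {entity_id: [(other_entity_id, shared_count), ...]}, best first.
    Entities whose keywords are all excluded do not appear in the result.
    """
    by_keyword = defaultdict(set)  # keyword -> entities that have it
    by_entity = defaultdict(set)   # entity  -> its non-excluded keywords
    for entity_id, keyword_id in entity_keywords:
        if keyword_id in excluded:
            continue
        by_keyword[keyword_id].add(entity_id)
        by_entity[entity_id].add(keyword_id)

    related = {}
    for entity_id, keywords in by_entity.items():
        counts = Counter()
        for keyword_id in keywords:
            # Every entity sharing this keyword gets one more point.
            counts.update(by_keyword[keyword_id])
        del counts[entity_id]  # an entity is not related to itself
        # Highest shared count first; ties broken by entity id for stable output.
        related[entity_id] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return related
```

The result could be cached or written back to a lookup table and rebuilt whenever the keyword data changes. How often that is acceptable depends on the update frequency asked about above.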
