Datastructure for insertion and scored extraction

Question

I have a simple program (assume a shop) that is actually just returning the $m$ most active customers from a total list of $n$ users (so $m \leq n$). I only use this program on one machine with one process, so assume no distribution of data to be necessary.

I require a data structure storing $\mathrm{userId}$ and $\mathrm{score}$, where $\mathrm{score}$ is an unsigned integer. In the beginning it's $\mathrm{score}:=0$ for all $n$ users. I can extract $m$ users beginning with the highest $\mathrm{score}$ (filling up with any other user, so the first extract just returns any $m$ users).

I have an increment(x) function that is called often. It increases the $\mathrm{score}$ of the user with $\mathrm{userId} = x$ by $1$. Afterwards I want to call extract again, where the result may differ by $1$ user (the last incremented score may lead to that $\mathrm{userId}$ replacing another $\mathrm{userId}$ among the $m$ extracted members).

At the moment my prototype uses a hashmap for the scoring and a linear min-search, so complexity for increment is $O(1)$ and for extract is $O(n\cdot m)$. I feel like there should exist a data structure where extract is $O(m)$ (min-heap maybe?) while increment is either $O(\log n )$ or even $O(1)$. Assuming $m = \log n$, this would lead to a speedup of $n$ which would be immense even for small datasets.

What data structure should I use? Is there such a data structure for the operations extract and increment(x)? And if not, is there a theoretical proof that there is none?

score 3 · Accepted Answer · answered Jul 06 '17 at 08:21

You can use a left-threaded AVL tree keyed by score, where the value stored at each node is the set of ids with that score: either a hash set (expected $\mathcal O(1)$ insert/delete) or a plain AVL tree (logarithmic, but in a size much smaller than $n$). This keeps the data sorted at all times, with $\mathcal O(1)$ per step of inorder traversal and $\mathcal O(\log n)$ insert (your increment operation is one delete plus one insert). You might also keep the last result (which is sorted), check whether a new id has entered it, and use binary search to place it if so.

Overall this gives $\mathcal O(m)$ extract (which cannot be better) and two $\mathcal O(\log n)$ operations per increment.

score 3 · Answer 2 · answered Jul 08 '17 at 01:08

Since you're interested in the ordering of scores, use a data structure that's designed to store ordered values: a search tree. Balanced search trees have $O(\log(n))$ lookup and insertion for $n$ elements. In practical terms, there's little difference between $O(\log(n))$ and $O(1)$: for typical data sizes, the constant in front tends to matter more.

The search tree is indexed by scores. If you also need to look up users by their ID, keep multiple data structures that are updated together, e.g. a hash map from user IDs to user records as well as the search tree from scores to user records. Depending on whether you use mutable data structures and what else you have in the records, the search tree for scores may contain (pointers to) user records, or user IDs. To insert a user, add an entry to both the hash map and the score tree. To update a score, update the user record, remove the existing entry from the tree and add a new entry.

If you're changing the score by 1, its final position in the tree will often be close to the original, so this can be optimized a bit from a basic remove-then-add. This is only an optimization in some cases: in the worst case you'll need to rebalance the tree all the way to the root.

To extract the $m$ users with the highest score, walk the tree in decreasing order, and stop when you've seen $m$ users. This takes $O(\log(n)+m)$ time ($\log(n)$ is the depth of the tree to reach the largest element, then walking the next elements takes $O(m)$ time).
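Here is a minimal sketch of the two-structure scheme, assuming Python. The standard library has no balanced search tree, so a sorted list maintained with `bisect` stands in for one: the searches are $O(\log(n))$, but inserting into or deleting from a Python list shifts elements and costs $O(n)$, so only a real balanced tree gives the bounds above. The names `ScoreIndex`, `add_user`, `increment` and `extract` are made up for illustration, and user IDs are assumed to be comparable.

```python
import bisect
from itertools import islice


class ScoreIndex:
    """Hash map for scores plus a sorted list standing in for the score tree."""

    def __init__(self):
        self.score = {}    # userId -> current score
        self.ordered = []  # (score, userId) pairs, kept in ascending order

    def add_user(self, user_id):
        # Add an entry to both structures.
        self.score[user_id] = 0
        bisect.insort(self.ordered, (0, user_id))

    def increment(self, user_id):
        old = self.score[user_id]
        # Remove the existing entry: binary search, then delete.
        i = bisect.bisect_left(self.ordered, (old, user_id))
        del self.ordered[i]
        # Update the record and add the new entry.
        self.score[user_id] = old + 1
        bisect.insort(self.ordered, (old + 1, user_id))

    def extract(self, m):
        # Walk in decreasing order and stop after m users.
        return [uid for _, uid in islice(reversed(self.ordered), m)]


index = ScoreIndex()
for uid in [1, 2, 3, 4, 5]:
    index.add_user(uid)
index.increment(3)
index.increment(3)
index.increment(5)
top = index.extract(2)  # user 3 (score 2), then user 5 (score 1)
```

With `(score, userId)` pairs as keys, ties happen to be broken by user ID here; that is a side effect of the stand-in, not a recommendation.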

There are several ways to deal with ties. Given your scenario, I'd expect it to be desirable to select users at random when there are ties. So in each tree node, store a set of user IDs, using a data structure that lets you extract a random $m$-element subset easily. This can be a balanced search tree where you can extract $m$ elements among $k$ with $m$ queries of cost $\log(k)$. Another possibility is a hash map implemented as an array of buckets where you read from random buckets, though this will have a bias related to how users fit in buckets.

If you want ties to be resolved in a deterministic way (e.g. always select the older users), then use a search tree where each node contains a single user, indexed by (score, user rank) pairs (e.g. user rank = account creation date).
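To illustrate the (score, user rank) key, here is a small Python sketch of the ordering only. It uses `sorted` rather than a search tree, and the helper `tree_key`, the sample users and their ranks are all hypothetical.

```python
def tree_key(score, rank):
    # Negating the score makes ascending key order mean "highest score first";
    # among equal scores, the smaller rank (older account) comes first.
    return (-score, rank)


# Hypothetical users as (userId, score, rank); rank is an account creation year.
users = [("ann", 2, 2016), ("bob", 5, 2018), ("cid", 2, 2011), ("dee", 5, 2014)]

ordered = sorted(users, key=lambda u: tree_key(u[1], u[2]))
top3 = [uid for uid, _, _ in ordered[:3]]
```

Because every key is unique, each node holds exactly one user and the result of an extract is fully determined by the scores and ranks.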

score 3 · Answer 3 · Albert Hendriks · 2018-04-27T21:12:37.133

Here's the data structure I have in mind: a doubly linked list with one entry per distinct score, where each entry holds the Set of users that currently have that score, plus a hashmap from userId to the user's list entry.

The list is kept sorted by score.

Extracting the highest $m$ users walks the list from its end, $O(m)$.
Incrementing the score of a user looks up the user's list entry through the hashmap, then moves the user to the next list entry if that entry has score+1, or otherwise inserts a new entry with score+1 right after the current one. If the Set the user came from is then empty, that list entry is removed. The link for the user in the hashmap is updated to point to the new list entry. All of this takes expected time $O(1)$.
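A rough Python sketch of this structure follows, in case it helps. The class and method names (`Bucket`, `TopScores`, `increment`, `extract`) are my own for illustration, and a `dict` is used as an insertion-ordered set in place of a LinkedHashSet; the expected $O(1)$ bound relies on the hash-based `dict`.

```python
class Bucket:
    """One list entry: a score and the set of users that currently have it."""

    def __init__(self, score):
        self.score = score
        self.users = {}   # dict used as an insertion-ordered set of userIds
        self.prev = None  # neighbour with the next lower score
        self.next = None  # neighbour with the next higher score


class TopScores:
    def __init__(self, user_ids):
        first = Bucket(0)
        self.tail = first    # entry with the highest score
        self.bucket_of = {}  # userId -> list entry holding that user
        for uid in user_ids:
            first.users[uid] = None
            self.bucket_of[uid] = first

    def increment(self, uid):
        cur = self.bucket_of[uid]
        nxt = cur.next
        if nxt is None or nxt.score != cur.score + 1:
            # No entry for score+1 yet: splice a new one in right after cur.
            new = Bucket(cur.score + 1)
            new.prev, new.next = cur, nxt
            cur.next = new
            if nxt is None:
                self.tail = new
            else:
                nxt.prev = new
            nxt = new
        # Move the user and update the hashmap link.
        del cur.users[uid]
        nxt.users[uid] = None
        self.bucket_of[uid] = nxt
        if not cur.users:
            # The entry the user came from is empty: unlink it.
            if cur.prev is not None:
                cur.prev.next = nxt
            nxt.prev = cur.prev

    def extract(self, m):
        # Walk from the highest-score end, collecting users until m are found.
        result = []
        node = self.tail
        while node is not None and len(result) < m:
            for uid in node.users:
                if len(result) == m:
                    break
                result.append(uid)
            node = node.prev
        return result


shop = TopScores([1, 2, 3, 4, 5])
shop.increment(3)
shop.increment(3)
shop.increment(5)
top = shop.extract(2)  # user 3 (score 2), then user 5 (score 1)
```

Since empty entries are removed, every entry visited by `extract` contributes at least one user, which is what keeps the walk at $O(m)$.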

edited Apr 27 '18 at 21:12

answered Apr 27 '18 at 15:59

Albert Hendriks



This answer is simpler and in theory better than the other answers given. However, I think it is worth noting one important disadvantage of the hash-map: there is a real chance that a single operation will take $O(n)$. This can be problematic for the webshop use-case: we don't want some user to suddenly have to wait a very long time. The search tree is more 'consistent', as it guarantees $O(\log n)$, usually a negligible waiting time. Therefore, I advise testing whether the 'worst case behaviour' of this method is acceptable. If it isn't, it's a good idea to look at the other answers. – Discrete lizard Apr 28 '18 at 10:07
@Discretelizard if the userIds are consecutive with relatively few deleted users, as is often the case with webshops, the HashMap can be replaced with an array of pointers. As you probably already noticed, the LinkedHashSets can be replaced with DoubleLinkedLists that another similar array then points into. This would yield O(1) worst case. Of course, this doesn't work for other types of userIds. – Albert Hendriks Apr 28 '18 at 14:55
