I have a simple program (assume a shop) that is actually just returning the $m$ most active customers from a total list of $n$ users (so $m \leq n$). I only use this program on one machine with one process, so assume no distribution of data to be necessary.
I require a data structure storing $\mathrm{userId}$ and $\mathrm{score}$, where $\mathrm{score}$ is an unsigned integer. In the beginning it's $\mathrm{score}:=0$ for all $n$ users. I can extract
$m$ users beginning with the highest $\mathrm{score}$ (filling up with any other user, so the first extract
just returns any $m$ users).
I have an increment(x)
function that is called often. It increases $\mathrm{score}$ of the user with $\mathrm{userId} = x$ by $1$. Afterwards I want to call extract
again, where the result may differ by $1$ user (the last incremented score may lead to that $\mathrm{userId}$ replacing another $\mathrm{userId}$ within the $n$ members).
At the moment my prototype uses a hashmap for the scoring and a linear min-search, so complexity for increment
is $O(1)$ and for extract
is $O(n\cdot m)$. I feel like there should exist a data structure where extract
is $O(m)$ (min-heap maybe?) while increment
is either $O(\log n )$ or even $O(1)$. Assuming $m = \log n$, this would lead to a speedup of $n$ which would be immense even for small datasets.
What data structure should I use? Is there such a data structure for the operations extract
and increment(x)
? And if not, is there a theoretical proof that there is none?