
Recently, in an interview, I was asked to implement k-means clustering using the MapReduce architecture. I know how to implement a simple k-means clustering algorithm, but I couldn't work out how to do it with MapReduce (I know what MapReduce is). Can someone explain how to do that? More specifically, what would the map and reduce phases look like for such an implementation? I was also asked how many mappers and reducers I would need.

D.W.
user2966197
  • Since MapReduce is less a concept and more a library, this seems to be a programming question, which is offtopic here. Please clarify which conceptual issues you're facing. – Raphael Aug 16 '16 at 09:38
  • @Raphael - map / reduce is a concept - it's implemented in a number of frameworks and fitting a given algorithm into that mindset is a completely algorithmic problem. – Nathaniel Bubis Aug 16 '16 at 12:31
  • @nbubis Some people disagree. – Raphael Aug 16 '16 at 15:43

1 Answer


You can run a loop over $j\in\{1..k\}$:

  1. Create a map that sends each point $x_i$ to the pair $(x_i, 1)$ if $m_j$ is the mean nearest to $x_i$, and to the zero vector with a count of zero otherwise: $$x_i \rightarrow \begin{cases}(x_i, 1) & d(x_i, m_j) \le d(x_i, m_l) \text{ for all } l\neq j \\ (0,0) & \text{otherwise} \end{cases}$$

  2. In the reduce stage, combine the pairs by adding the vectors and the counts: $$(x_{i1}, c_1), (x_{i2}, c_2)\rightarrow(x_{i1}+x_{i2}, c_1+c_2)$$ Dividing the final sum by the final count then gives the updated mean $m_j$.

Thus, for each iteration in the k-means algorithm you will need $k$ maps and $k$ reduces.
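As a rough illustration, the loop above could be sketched in plain Python, with the built-in `map` and `functools.reduce` standing in for a real MapReduce framework. The names `make_mapper`, `reducer` and `kmeans_iteration` are made up for this sketch, $d$ is assumed to be Euclidean distance, and keeping the old mean for an empty cluster is an extra assumption that the steps above do not cover.

```python
from functools import reduce
import math


def make_mapper(j, means):
    """Build the map function for cluster j (hypothetical helper name)."""
    zero = tuple(0.0 for _ in means[j])

    def mapper(x):
        # Assumption: d is Euclidean distance (math.dist needs Python 3.8+).
        dj = math.dist(x, means[j])
        nearest = all(dj <= math.dist(x, m)
                      for l, m in enumerate(means) if l != j)
        # (x, 1) if m_j is a nearest mean, (0, 0) otherwise.
        return (tuple(x), 1) if nearest else (zero, 0)

    return mapper


def reducer(a, b):
    """Combine two (vector, count) pairs by component-wise addition."""
    (xa, ca), (xb, cb) = a, b
    return (tuple(p + q for p, q in zip(xa, xb)), ca + cb)


def kmeans_iteration(points, means):
    """One k-means iteration: one map pass and one reduce pass per cluster."""
    new_means = []
    for j in range(len(means)):
        zero = tuple(0.0 for _ in means[j])
        total, count = reduce(reducer, map(make_mapper(j, means), points),
                              (zero, 0))
        if count == 0:
            # Assumption: an empty cluster keeps its previous mean.
            new_means.append(tuple(means[j]))
        else:
            new_means.append(tuple(s / count for s in total))
    return new_means


points = [(0, 0), (0, 2), (10, 10), (10, 12)]
means = [(0, 1), (9, 9)]
new_means = kmeans_iteration(points, means)
```

Because the comparison uses $\le$, a point that is exactly equidistant from two means is counted in both clusters in this sketch; a real implementation would probably break such ties explicitly.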

Nathaniel Bubis