Implement K-means clustering with Map-Reduce

Question

Recently, in an interview, I was asked to implement k-means clustering using the Map Reduce architecture. I know how to implement a simple k-means clustering algorithm, but I couldn't wrap my head around how to do it using Map Reduce (I know what Map Reduce is). Can someone provide an explanation or algorithm for how to do that? More specifically, I am looking for what the Map and Reduce phases would look like in such an implementation. I was also asked how many Mappers and Reducers I would need.

Since MapReduce is less a concept and more a library, this seems to be a programming question, which is off-topic here. Please clarify which conceptual issues you're facing. — Raphael, Aug 16 '16 at 09:38
@Raphael - map / reduce is a concept - it's implemented in a number of frameworks and fitting a given algorithm into that mindset is a completely algorithmic problem. — Nathaniel Bubis, Aug 16 '16 at 12:31

Nathaniel Bubis · Answer 1 · 2016-08-16T07:03:56.190

You can run a loop over the clusters $j\in\{1,\dots,k\}$, doing one map and one reduce for each $j$:

Create a map that sends each point $x_i$ to the pair $(x_i, 1)$ if $m_j$ is the mean nearest to $x_i$, and to the zero vector with a count of zero otherwise: $$x_i \rightarrow \begin{cases}(x_i, 1) & d(x_i, m_j) \le d(x_i, m_l)\ \text{for all } l\neq j \\ (0,0) & \text{otherwise} \end{cases}$$
In the reduce stage, you would find the sums and counts by combining pairs: $$(x_{i1}, c_1), (x_{i2}, c_2)\rightarrow(x_{i1}+x_{i2}, c_1+c_2)$$ You then divide the resulting sum by the total count to get the updated mean $m_j$.

Thus, for each iteration in the k-means algorithm you will need $k$ maps and $k$ reduces.
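To make the scheme concrete, here is a minimal single-machine sketch of one iteration in plain Python, using the built-in `map` and `functools.reduce` in place of a real MapReduce framework. The names `make_mapper`, `reducer` and `kmeans_iteration` are made up for illustration, points are assumed to be tuples of floats, the distance $d$ is assumed to be Euclidean, and keeping the old mean for an empty cluster is my own assumption (the scheme above does not say what to do in that case).

```python
from functools import reduce
import math


def make_mapper(j, means):
    """Build the map function for cluster j (hypothetical helper)."""
    def mapper(x):
        d_j = math.dist(x, means[j])  # Euclidean distance, Python 3.8+
        # Emit (x, 1) if m_j is a nearest mean. Ties are kept by the <=,
        # so an equidistant point is counted for every tied cluster.
        if all(d_j <= math.dist(x, m) for l, m in enumerate(means) if l != j):
            return (x, 1)
        # Otherwise emit the zero vector with a count of zero.
        return (tuple(0.0 for _ in x), 0)
    return mapper


def reducer(a, b):
    """Combine two (vector_sum, count) pairs componentwise."""
    (xa, ca), (xb, cb) = a, b
    return (tuple(p + q for p, q in zip(xa, xb)), ca + cb)


def kmeans_iteration(points, means):
    """One k-means update: one map and one reduce per cluster.

    Assumes points is a non-empty list.
    """
    new_means = []
    for j in range(len(means)):
        total, count = reduce(reducer, map(make_mapper(j, means), points))
        if count == 0:
            # Assumption: an empty cluster keeps its previous mean.
            new_means.append(means[j])
        else:
            new_means.append(tuple(s / count for s in total))
    return new_means


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
means = [(1.0, 1.0), (9.0, 1.0)]
print(kmeans_iteration(points, means))
```

Because `reducer` is associative and commutative, a real framework could apply it in any order and on partial results from different machines, which is what makes the sum-and-count pair a convenient thing to reduce over.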
