I have the following code:
n - number of 1-d points (real numbers)
numbers[n] - array of n numbers
list, se_list - list variables to store cluster starting points
_____________
sort(numbers)
find_cluster(start, end, ci): \\ ci - cluster index
    best_se = +infinity \\ best squared error
    list = {start}
    for i from start to end-ci+1:
        if ci > 1:
            (se, se_list) = find_cluster(i+1, end, ci-1)
        else:
            se = get_se(i+1, end)
            se_list = {i+1}
        new_se = get_se(start, i)
        if new_se + se < best_se:
            best_se = new_se + se
            list = {start} + se_list
    return (best_se, list)
\\ this computes the sum of squared errors of the points from start to end
get_se(start, end):
    sum = 0
    for i from start to end:
        sum = sum + numbers[i]
    mean = sum/(end-start+1)
    se = 0
    for i from start to end:
        se = se + (numbers[i]-mean)*(numbers[i]-mean)
    return se
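For anyone who wants to run the idea, here is a minimal Python sketch of the same brute-force recursion. It is a rendering under assumptions rather than a line-by-line copy of the pseudocode: indices are 0-based and inclusive, `k` is taken to be the number of clusters (so `k == 1` means a single cluster), and `numbers` is passed as an argument instead of being global.

```python
def get_se(numbers, start, end):
    """Sum of squared deviations from the mean of numbers[start..end] (inclusive)."""
    segment = numbers[start:end + 1]
    mean = sum(segment) / len(segment)
    return sum((x - mean) ** 2 for x in segment)


def find_cluster(numbers, start, end, k):
    """Best split of sorted numbers[start..end] into k contiguous clusters.

    Returns (total squared error, list of cluster start indices).
    Assumes k >= 1 and that the range holds at least k points.
    """
    if k == 1:
        return get_se(numbers, start, end), [start]
    best_se, best_starts = float("inf"), []
    # i is the last index of the first cluster; the remaining k-1 clusters
    # need at least one point each, so i stops at end-(k-1).
    for i in range(start, end - k + 2):
        rest_se, rest_starts = find_cluster(numbers, i + 1, end, k - 1)
        total = get_se(numbers, start, i) + rest_se
        if total < best_se:
            best_se, best_starts = total, [start] + rest_starts
    return best_se, best_starts


points = sorted([1.0, 1.2, 0.8, 10.0, 10.5, 20.0])
se, starts = find_cluster(points, 0, len(points) - 1, 3)
print(se, starts)
```

The returned list holds the index at which each cluster begins in the sorted array.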
As far as my analysis goes, this algorithm takes time O(n^(2k)), where k = ci, but I am not sure if I am correct. I first proved it to myself as
T(n) = n*(n + T(n-1)) = n^2 + n*(n-1)*((n-1) + T(n-2)) + ...
so it tends to n^k, but there are also smaller arguments, which are certainly smaller than n^k, so the overall time is O(n^(2k)). But I am not very experienced, so what is the actual running time?
Details: the original question for which I made this algorithm is "there are n points in 1 dimension (real numbers) which you need to cluster as k-means in polynomial time". K (I called it ci in this algorithm) is the number of clusters.
Details 2: By "smaller arguments" I meant the lower-order terms: when we start expanding the parentheses in T(n) = n^2 + n*(n-1)*((n-1) + T(n-2)), one term n^k appears along with a lot of smaller terms. Although n^k is the largest power in this equation, if I take O(n^k) < O(n^(2k)), then this bound also accounts for the smaller terms (just because), but I don't think that "just because" is a valid argument, especially if a constant factor should do. Still, the question stands: is it O(n^(2k))?
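One way to sanity-check a guessed bound is to evaluate the recurrence numerically and see how fast it grows. The sketch below takes the recurrence exactly as written above, T(n) = n*(n + T(n-1)), and adds two assumptions that are not stated in the question: the recursion is unrolled k levels deep, and the base level costs n (one pass of get_se). The names `recurrence_cost` and `empirical_exponent` are made up for this illustration.

```python
import math


def recurrence_cost(n, k):
    """Evaluate T_k(n) = n * (n + T_{k-1}(n-1)) with assumed base case T_0(n) = n."""
    if n <= 0:
        return 0
    if k == 0:
        return n
    return n * (n + recurrence_cost(n - 1, k - 1))


def empirical_exponent(n, k):
    """Estimate p in T_k(n) ~ c * n^p from how the cost changes when n doubles."""
    return math.log2(recurrence_cost(2 * n, k) / recurrence_cost(n, k))


k = 3
for n in (50, 100, 200, 400):
    # Compare the printed exponent with the candidates k and 2k.
    print(n, recurrence_cost(n, k), round(empirical_exponent(n, k), 3))
```

This only measures the recurrence, not the algorithm itself, so it is as trustworthy as the recurrence and the assumed base case.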
NB. I need to check the links in the comments, but if I understood this well enough, there wouldn't be a need to ask this question, right? So I do not think that referring me to other materials is right, unless they cite exactly the same algorithm. So I am asking someone with a good understanding of algorithms to either verify my claim (preferably with a small proof) or just say I'm plain wrong and possibly explain why.