Questions tagged [clustering]

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval etc.

Cluster analysis is the task of grouping objects into subsets (called clusters) so that observations in the same cluster are similar in some sense, while observations in different clusters are dissimilar.

Clustering is a method of unsupervised learning used to discover hidden structure in unlabeled data, and is commonly used in exploratory data analysis. Popular algorithms include expectation maximization (EM), spectral clustering, correlation clustering and hierarchical clustering.

Related topics: pattern-recognition, knowledge discovery, taxonomy. Not to be confused with cluster computing.

1396 questions

15 votes, 2 answers

Clustering unique visitors by useragent, ip, session_id

Given website access data in the form session_id, ip, user_agent, and optionally timestamp, following the conditions below, how would you best cluster the sessions into unique visitors? session_id: is an id given to every new visitor. It does not…
AdrianBR

7 votes, 3 answers

How to evaluate clustering success in a completely unsupervised system?

The algorithm in question is Kohonen's SOM, but the question could also apply to PCA and some others. When the U-matrix (or the codebook?) is examined, is there a way to tell how successful the clustering was? And would it be a good idea to apply GAs to…
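
One common internal index for this kind of question is the silhouette coefficient, which needs no ground-truth labels. The sketch below is not specific to SOM: it assumes each sample has already been given a hard cluster label (for a SOM, for example, the index of its best-matching unit) and that Euclidean distance is meaningful for the data. The function name and the toy points are illustrative only.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Average distance from point i to the members of a cluster (excluding i).
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_dist(i, own)  # cohesion: mean distance inside the own cluster
        b = min(mean_dist(i, m) for other, m in clusters.items() if other != lab)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two well-separated pairs of points.
pts = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(pts, [0, 0, 1, 1])  # labels follow the visible groups
bad = silhouette_score(pts, [0, 1, 0, 1])   # labels cut across the groups
```

Values near 1 suggest compact, well-separated clusters, while values near 0 or below suggest overlapping or misassigned points. It is only one of several internal indices and favours convex clusters.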
AkKoh

7 votes, 4 answers

Why does OPTICS use the core-distance as a minimum for the reachability distance?

The OPTICS clustering algorithm defines $$\text{core-dist}_{\varepsilon,MinPts}(p)=\begin{cases}\text{UNDEFINED} & \text{if } |N_\varepsilon(p)| < MinPts\\ MinPts\text{-th smallest distance to } N_\varepsilon(p) &…
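
To make the two quantities in this question concrete, here is a minimal sketch of core distance and reachability distance in plain Python. It assumes Euclidean distance and that the ε-neighbourhood of a point includes the point itself (conventions differ between texts and implementations); `None` stands in for UNDEFINED, and the function names are illustrative.

```python
import math


def core_distance(points, p, eps, min_pts):
    """MinPts-th smallest distance from points[p] within its eps-neighbourhood.

    Returns None (UNDEFINED) if fewer than min_pts points lie within eps.
    Assumption: the neighbourhood includes the point itself.
    """
    neighbours = sorted(
        d for d in (math.dist(points[p], q) for q in points) if d <= eps
    )
    if len(neighbours) < min_pts:
        return None
    return neighbours[min_pts - 1]


def reachability_distance(points, o, p, eps, min_pts):
    """Reachability distance of points[o] with respect to points[p]."""
    core = core_distance(points, p, eps, min_pts)
    if core is None:
        return None  # p is not a core point, so the value is UNDEFINED
    # The core distance acts as a lower bound on the reachability distance.
    return max(core, math.dist(points[p], points[o]))


# Toy data (hypothetical): three nearby points on a line and one far away.
pts = [(0, 0), (1, 0), (2, 0), (10, 0)]
core = core_distance(pts, 0, eps=3, min_pts=3)                 # 2.0
near = reachability_distance(pts, 1, 0, eps=3, min_pts=3)      # 2.0, not 1.0
isolated = core_distance(pts, 3, eps=3, min_pts=3)             # None
```

In the toy data, point 1 is at distance 1 from point 0 but gets reachability 2, the core distance of point 0. Every point inside the core neighbourhood therefore receives the same reachability value, which smooths out small random differences in distance within dense regions.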
Martin Thoma

6 votes, 1 answer

Clustering based on categorical variables?

I am working on a project and am currently experimenting with cluster analysis. The dataset consists mainly of categorical variables and discrete numbers. Please pardon my poor programming skills as I am not very familiar with MathJax, but I will try to summarize…
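
For categorical data, Euclidean distance and means are not meaningful, so k-modes-style methods replace them with a simple matching dissimilarity and per-attribute modes. The sketch below shows only those two building blocks, not a full clustering algorithm; the records and function names are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical records disagree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(records):
    """Per-attribute most frequent value, the categorical analogue of a centroid."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))


# Hypothetical records: (colour, size, discrete count).
records = [
    ("red", "small", 1),
    ("red", "large", 1),
    ("blue", "small", 1),
]
mode = cluster_mode(records)  # ("red", "small", 1)
distance = matching_dissimilarity(("red", "small", 1), ("blue", "small", 2))  # 2
```

Treating discrete numbers as categories, as done here, discards their ordering. Whether that is acceptable depends on the data.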
Jing

6 votes, 2 answers

Is this cluster analysis / prediction?

I have a series of seemingly random data dripping in one value at a time through time. Although it appears to be random, the data forms clusters when certain attributes are analysed which the charts show. I'm trying to avoid the fallacy of seeing…
user3791372

5 votes, 1 answer

Interpret clustering results after variable transformation

For some time I have had a question for which I have not yet found a proper answer. It concerns the interpretation of the results of a clustering algorithm run on features to which a log transformation was applied. Specifically, let's…
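
One point that often helps with interpretation, assuming a centroid-based algorithm such as k-means was run on the log-transformed feature: exponentiating a centroid coordinate gives the geometric mean of the original values in that cluster, not their arithmetic mean. A minimal sketch with made-up numbers:

```python
import math


def back_transformed_centroid(values):
    """exp of the mean of log(values), i.e. the geometric mean of the values."""
    logs = [math.log(v) for v in values]
    return math.exp(sum(logs) / len(logs))


# Hypothetical cluster members on the original (untransformed) scale.
cluster_values = [1.0, 10.0, 100.0]
geometric = back_transformed_centroid(cluster_values)       # approximately 10
arithmetic = sum(cluster_values) / len(cluster_values)      # 37.0
```

On the original scale the back-transformed centroid is pulled far less by large values than the arithmetic mean is.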
Seymour

5 votes, 2 answers

When is centering and scaling needed before doing hierarchical clustering?

I am working on a clustering project where we have collected protein data from over 100 patient samples. This data is normalized and log-transformed. The goal is to cluster samples based on their similarities. I am using hierarchical clustering and…
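
As a concrete reference for what "centering and scaling" does, here is a minimal z-score sketch in plain Python. It assumes samples are rows and features are columns, and uses the population standard deviation; whether to standardize at all depends on whether the features' original scales carry meaning.

```python
import statistics


def standardize_columns(rows):
    """Center each column to mean 0 and scale it to unit (population) std dev."""
    columns = list(zip(*rows))
    means = [statistics.fmean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [
        # Constant columns (std dev 0) are mapped to 0.0 to avoid division by zero.
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stdevs)]
        for row in rows
    ]


# Hypothetical data: the second feature is on a 100x larger scale.
rows = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
scaled = standardize_columns(rows)
```

After standardization both columns hold the same values, so neither dominates a Euclidean distance. Without it, the second feature would account for almost all of the distance between samples.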
Mdhale

4 votes, 2 answers

Is it possible to run clustering methods by only knowing the distance between pair of points?

When each data point's coordinates are known, it is easy to apply clustering methods such as k-means. But if we only know the distances between each pair of data points, without knowing the coordinates of every data…
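
Several methods need only pairwise distances, for example hierarchical clustering, k-medoids and DBSCAN. As a minimal sketch, the function below performs single-linkage clustering cut at a distance threshold, which is equivalent to finding the connected components of the graph that links every pair closer than the threshold. The matrix and the threshold are made-up values.

```python
def single_linkage_clusters(dist, threshold):
    """Cluster labels from a symmetric distance matrix, no coordinates needed.

    Two points share a cluster if a chain of pairs, each at distance
    <= threshold, connects them (single linkage cut at `threshold`).
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]


# Hypothetical distance matrix: items 0 and 1 are close, as are items 2 and 3.
dist = [
    [0, 1, 9, 9],
    [1, 0, 9, 9],
    [9, 9, 0, 1],
    [9, 9, 1, 0],
]
labels = single_linkage_clusters(dist, threshold=2)  # [0, 0, 1, 1]
```

k-means itself does not fit this setting directly, because it needs coordinates to compute cluster means.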
piratesailor

4 votes, 1 answer

Finding clusters in multidimensional data

I have a set of data from 3,000 records. There are 5 attributes per individual (labelled A - E). I can use Kendall's W (coefficient of concordance) to determine the concordance between any two records. What I require is a way to discern any…
Carl

4 votes, 2 answers

What is Spectral clustering?

What is spectral clustering? I have little background in statistics. I have tried to search for notes online but they assume quite a lot of knowledge. Would be good if you are able to find some notes online which teach the basics and the math…
listener

4 votes, 1 answer

Can someone explain how PCA is relevant in extracting parameters of Gaussian Mixture Models?

I am having some difficulty seeing the connection between PCA on the second-order moment matrix and estimating the parameters of Gaussian Mixture Models. Can anyone explain how the two are connected?
tejaswi

4 votes, 1 answer

Using SVD for clustering

The dataset that I am experimenting with is in the form of a table with columns userid and itemid. If there is a row for a given user and a given item, that means the user accessed the item (like in an online store). I am trying to cluster similar…
rbk

3 votes, 3 answers

Clustering a data set based on the similarity of tree structure

I have a data set (>5000 records). Each individual record is structured as a multilevel n-ary tree (>200 nodes). The tree node identifiers are unique within a tree, but the same identifiers are used to represent the same type of node across the…
user3691191

3 votes, 3 answers

Expected number of points in initial clustering for LSH

I have a very skewed, 10-dimensional data set. I need approximate nearest neighbours for my use case and I was looking into locality-sensitive hashing. However, after scaling and randomly generating hyperplanes through the origin and coding the data…
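
The setup described here (random hyperplanes through the origin, one bit per hyperplane) can be sketched as follows. This is an illustrative reconstruction, not the asker's code: the Gaussian normals, the seed, and all function names are assumptions.

```python
import random


def make_hyperplanes(n_planes, dim, seed=0):
    """Random hyperplanes through the origin, each given by a Gaussian normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]


def hash_point(point, planes):
    """One bit per hyperplane: 1 if the point lies on the non-negative side."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, point)) >= 0) for plane in planes
    )


def build_buckets(points, planes):
    """Group point indices by their bit signature."""
    buckets = {}
    for idx, point in enumerate(points):
        buckets.setdefault(hash_point(point, planes), []).append(idx)
    return buckets


# Hypothetical 10-dimensional data hashed with 8 hyperplanes.
planes = make_hyperplanes(n_planes=8, dim=10, seed=1)
data_rng = random.Random(2)
points = [[data_rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(200)]
buckets = build_buckets(points, planes)
sizes = sorted((len(members) for members in buckets.values()), reverse=True)
```

With 8 hyperplanes there are at most 256 buckets. Because the signature depends only on direction from the origin, data that is skewed or not centred tends to concentrate in a few buckets, so inspecting `sizes` is a quick check of how balanced the hashing is.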
Jan van der Vegt

3 votes, 3 answers

Clustering based on features of varied importance

Suppose I have a dataset that includes the following features {HairColor, EyeColor, EducationLevel, Income}. I would like to perform clustering to separate the dataset into smaller datasets that you would expect to behave similarly. The difficulty…