You can use IsolationForest to find anomalies in the data: a prediction of -1 indicates that an anomaly has occurred. IsolationForest looks for noise in the data; you can then use a threshold to remove that data as a subset to clean up the classifications. You can use the elbow method and PCA to find the number of cluster groups, and SVM for classification.
# Demo: flag anomalous rows with IsolationForest (predict() returns -1 for
# outliers, +1 for inliers, according to the fitted model).
data = """ID EyeColor HairColor EducationLevel Income
1 1 1 1 1
2 1 1 2 2
3 2 2 1 1"""
# BUG FIX: the inline data is space-separated, but sep='\t' was passed, which
# parses each row into a single column and breaks the column selections below.
# Split on any run of whitespace instead.
df = pd.read_csv(io.StringIO(data), sep=r'\s+')
print(df.head())

X = df[["EyeColor", "HairColor", "EducationLevel"]]
y = df["Income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Key hyperparameters: n_estimators, max_samples, max_features.
# (The earlier bare IsolationForest() construction was dead code — it was
# immediately overwritten here — so it has been removed.)
clf = IsolationForest(max_samples=2, n_estimators=10, random_state=10)
clf.fit(X_train)
y_pred_test = clf.predict(X_test)

# NOTE(review): y_test holds Income labels (1/2) while y_pred_test holds
# anomaly flags (-1/+1) — the confusion matrix mixes two label spaces;
# confirm this is intentional before interpreting it.
cm = confusion_matrix(y_test, y_pred_test)
# sns.heatmap(cm)
def plot_detected_anomalies(X, true_labels, predicted_anomalies):
    """Show two side-by-side scatter plots of X[:, 0] vs X[:, 1]:
    the left panel colored by the true labels, the right panel colored
    by the Isolation Forest predictions. Both axes are fixed to [-11, 11].
    """
    panels = [
        (true_labels, 'Clean data and added noise - TRUE'),
        (predicted_anomalies, 'Noise detected via Isolation Forest'),
    ]
    plt.figure(figsize=(12, 6))
    for position, (colors, title) in enumerate(panels, start=1):
        plt.subplot(1, 2, position)
        plt.scatter(X[:, 0], X[:, 1], c=colors)
        plt.title(title)
        plt.xlim([-11, 11])
        plt.ylim([-11, 11])
    plt.show()
# Visualize the held-out rows using EyeColor and EducationLevel as the two
# plot axes: true Income labels on the left, -1/+1 anomaly flags on the right.
plot_detected_anomalies(np.array(X_test[["EyeColor","EducationLevel"]]), y_test, y_pred_test)