Questions tagged [classification]

An instance of supervised learning that identifies the category or categories to which a new instance in a dataset belongs.

In machine learning and statistics, classification refers to the problem of predicting category memberships based on a set of pre-labeled examples. It is thus a type of supervised learning.

Some of the most important classification algorithms are support vector machines, logistic regression, naive Bayes, random forests, and artificial neural networks.

When we wish to associate inputs with continuous values in a supervised framework, the problem is instead known as regression. The unsupervised counterpart to classification is known as clustering (or cluster analysis), and involves grouping data into categories based on some measure of inherent similarity.

3281 questions
84
votes
6 answers

Cosine similarity versus dot product as distance metrics

It looks like the cosine similarity of two features is just their dot product scaled by the product of their magnitudes. When does cosine similarity make a better distance metric than the dot product? I.e. do the dot product and cosine similarity…
ahoffer
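A minimal sketch of the relationship the excerpt describes: cosine similarity is the dot product divided by the product of the two vector norms, so it ignores vector length while the raw dot product does not. The vectors `u`, `v`, and `w` are made-up illustrations, not data from the question.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Dot product divided by the product of the two Euclidean norms."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    return dot(u, v) / (norm_u * norm_v)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length
w = [3.0, 2.0, 1.0]  # same length as u, different direction

# The dot product rewards length: u.v is larger than u.w mostly because v is longer.
print(dot(u, v), dot(u, w))
# Cosine similarity only reflects the angle between the vectors.
print(cosine_similarity(u, v), cosine_similarity(u, w))
```

Scaling either input leaves the cosine similarity unchanged but multiplies the dot product, which is usually the deciding factor when vector magnitude is not meaningful for the task.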
11
votes
2 answers

Finding optimal threshold in multi-class classification task

In a binary classification problem, it is easy to find the optimal threshold (F1) by setting different thresholds, evaluating them and picking the one with the highest F1. Similarly, is there a proper way to find optimal thresholds for all the…
saiRegrefree
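The binary procedure mentioned in the excerpt (try several thresholds, keep the one with the highest F1) could be sketched roughly as follows. The labels, scores, and candidate grid are hypothetical, and this covers only the binary case, not the multi-class extension the question asks about.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores, candidates):
    """Evaluate each candidate threshold and return the (threshold, F1) pair with the highest F1."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        y_pred = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, y_pred)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Hypothetical validation labels and predicted probabilities for the positive class.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

threshold, f1 = best_threshold(y_true, scores, candidates)
print(threshold, f1)
```

The threshold should be chosen on held-out validation data rather than the test set, otherwise the reported F1 is optimistically biased.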
11
votes
3 answers

How can I classify text considering word order, instead of just using a bag-of-words approach?

I've made a Naive Bayes classifier that uses the bag-of-words technique to classify spam posts on a message board. It works, but I think I could get much better results if my models considered the word orderings and phrases. (ex: 'girls' and 'live'…
Yerk
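One common way to bring word order into a bag-of-words model is to count word n-grams (for example bigrams) alongside single words. The sketch below is an illustration under that assumption, with a hypothetical helper `ngram_counts` and invented example texts; it is not the asker's classifier.

```python
from collections import Counter

def ngram_counts(text, n_max=2):
    """Count all word n-grams of length 1..n_max in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# Two texts with identical single-word counts but different word order.
a = ngram_counts("claim your free prize")
b = ngram_counts("your prize claim free")

print(sorted(k for k in a if " " in k))  # bigrams of the first text
print(sorted(k for k in b if " " in k))  # bigrams of the second text
```

With unigrams alone the two texts are indistinguishable; the bigram features separate them, at the cost of a larger and sparser feature space.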
10
votes
3 answers

AUC-ROC of a random classifier

Why is the area under the ROC curve for a random classifier equal to 0.5, and why does the curve have a diagonal shape? For me a random classifier would have 25% of TP, TN, FP, FN and therefore it would only be a single point on the ROC curve.
Victor
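A small simulation may help here, assuming a "random classifier" means one that assigns scores independently of the true label. Sweeping the decision threshold then moves the (FPR, TPR) point along the diagonal, because both rates are roughly the fraction of scores above the threshold. The sample size and seed are arbitrary choices.

```python
import random

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    tp = sum(1 for t, s in zip(y_true, scores) if t == 1 and s >= threshold)
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return fp / n_neg, tp / n_pos

rng = random.Random(0)
y_true = [rng.randint(0, 1) for _ in range(2000)]
scores = [rng.random() for _ in y_true]  # scores carry no information about the label

for threshold in (0.2, 0.5, 0.8):
    fpr, tpr = roc_point(y_true, scores, threshold)
    print(f"threshold={threshold}: FPR={fpr:.2f}, TPR={tpr:.2f}")
print(f"AUC={auc(y_true, scores):.3f}")
```

The single point with 25% each of TP, TN, FP, FN corresponds to one threshold (0.5 with balanced classes), which gives FPR = TPR = 0.5; the other thresholds supply the rest of the diagonal.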
7
votes
1 answer

F1 score vs accuracy, which metric is more important?

I have two multiclass classification models for making predictions (the number of classes is three, to be precise). One is a Keras neural network, the other is a gradient boosted classifier from the scikit-learn library. I have noticed that after training on the same…
Ach113
7
votes
4 answers

Measuring the uncertainty of predictions

Given a multiclass classification model, with n features, how can I measure the uncertainty of the model for that particular classification? Let's say that for some class the model accuracy is amazing, but for another it's not. I would like to find…
Latent
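One simple proxy for per-prediction uncertainty, assuming the model outputs class probabilities, is the entropy of the predicted distribution. This is only one of several possible measures, it depends on how well calibrated the probabilities are, and the probability vectors below are invented for illustration.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution; higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.97, 0.02, 0.01]  # hypothetical prediction dominated by one class
uncertain = [0.40, 0.35, 0.25]  # hypothetical prediction spread across classes

print(predictive_entropy(confident))
print(predictive_entropy(uncertain))
print(math.log(3))  # maximum possible entropy for three classes
```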
7
votes
2 answers

The meaning of multi-class classification rules

Example: I have two classification rules (Refund is a predictor and Cheat is a binary response): (Refund, No) → (Cheat, No), Support = 0.4, Confidence = 0.57; (Refund, No) → (Cheat, Yes), Support = 0.3,…
Xuan Dung
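For reference, support and confidence of a rule A → B are usually computed as the fraction of all records matching both A and B, and the fraction of records matching A that also match B. The sketch below uses a hypothetical ten-row table constructed only so that the visible figures in the excerpt (0.4, 0.57, 0.3) come out; it is not the asker's data, and the rows with Refund = Yes are an arbitrary filler.

```python
# Hypothetical records chosen to reproduce the support values quoted in the excerpt.
records = (
    [{"Refund": "No", "Cheat": "No"}] * 4
    + [{"Refund": "No", "Cheat": "Yes"}] * 3
    + [{"Refund": "Yes", "Cheat": "No"}] * 3
)

def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    a_key, a_val = antecedent
    c_key, c_val = consequent
    n_antecedent = sum(1 for r in records if r[a_key] == a_val)
    n_both = sum(1 for r in records if r[a_key] == a_val and r[c_key] == c_val)
    support = n_both / len(records)      # P(antecedent and consequent)
    confidence = n_both / n_antecedent   # P(consequent | antecedent)
    return support, confidence

s_no, c_no = support_and_confidence(records, ("Refund", "No"), ("Cheat", "No"))
s_yes, c_yes = support_and_confidence(records, ("Refund", "No"), ("Cheat", "Yes"))
print(round(s_no, 2), round(c_no, 2))
print(round(s_yes, 2), round(c_yes, 2))
```

Because both rules share the same antecedent and the response is binary, their confidences sum to one in this toy table.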
6
votes
2 answers

Significance test and sample size estimation for classifiers

What is the test to tell if e.g. an F1 score of 0.69 for classifier A and 0.72 for classifier B is truly different and not just by chance? (for mean-values one would use a "t-test" and obtain a "p-value"). I have access to the underlying data and…
lordy
5
votes
1 answer

Large Scale Personalization - Per User vs Global Models

I'm currently working on a project that would benefit from personalized predictions. Given an input document, a set of output documents, and a history of user behavior, I'd like to predict which of the output documents are clicked. In short, I'm…
Madison May
5
votes
2 answers

F1 maximization with a convolutional neural net for an imbalanced dataset

I'm dealing with an imbalanced dataset for binary classification (about 70% to 30%). I was wondering what is the best way to optimize the F1 score for such a task when using a convolutional neural net. As of now, I'm sampling the dataset in order to…
Rimbaud_
4
votes
1 answer

How are the features selected for a decision tree in CART?

Suppose I want to use CART as a classification tree (I want a categorical response). I have the training set, and I split it using observation labels. Now, to build the decision tree (classification tree), how are the features selected to decide which…
gc5
4
votes
3 answers

Which non-training classification methods are available?

I am trying to find which classification methods that do not use a training phase are available. The scenario is gene expression based classification, in which you have a matrix of gene expression of m genes (features) and n samples…
gc5
4
votes
1 answer

K nearest neighbour

Is the k-nearest neighbour algorithm a discriminative or a generative classifier? My first thought on this was that it was generative, because it actually uses Bayes' theorem to compute the posterior. Searching further, it seems like it is a…
101
4
votes
1 answer

Deal with overlapping classes in classification modeling

I am currently working with a dataset comprising information about crop insurance for soybeans. My ultimate goal with this dataset is to create a classification model capable of predicting whether insurance for soybeans will be activated based on…
EduMinsky
4
votes
4 answers

How to classify using incomplete features

Assume we have some features: pressure, volume, temperature, intensity, mass, size, ... The problem is that I do not always have a complete set of this information. I cannot put zero for the unknown features because zero has a meaning. For example if I do…