Questions tagged [feature-engineering]

the process of using domain knowledge of the data to create features that improve machine learning algorithms

655 questions
27
votes
3 answers

Encoding categorical variables using likelihood estimation

I am trying to understand how I can encode categorical variables using likelihood estimation, but have had little success so far. Any suggestions would be greatly appreciated.
small dwarf
  • 271
  • 1
  • 3
  • 4
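
A minimal sketch of one common reading of "likelihood encoding" (often called target or mean encoding), assuming a binary target: each category is replaced by a smoothed estimate of the target mean within that category. The function name `likelihood_encode`, the smoothing weight `m`, and the toy data are hypothetical and not taken from the question.

```python
from collections import defaultdict


def likelihood_encode(categories, targets, m=1.0):
    """Map each category to a smoothed estimate of mean(target | category).

    m is a hypothetical smoothing weight: larger values pull rare
    categories more strongly toward the global target mean.
    """
    prior = sum(targets) / len(targets)  # global mean of the target
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {c: (sums[c] + m * prior) / (counts[c] + m) for c in counts}


# Toy data (assumed for illustration only).
cities = ["a", "a", "b", "b", "b", "c"]
clicked = [1, 0, 1, 1, 1, 0]

encoding = likelihood_encode(cities, clicked, m=1.0)
encoded_column = [encoding[c] for c in cities]
```

In practice the mapping is usually fitted on training folds only, since computing it on the rows it is later applied to leaks the target into the feature.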
4
votes
2 answers

2D matrix for labelbinarizer

There is one behavior of LabelBinarizer:
    import numpy as np
    from sklearn import preprocessing
    lb = preprocessing.LabelBinarizer()
    lb.fit(np.array([[0, 1, 1], [1, 0, 0]]))
    lb.classes_
The output is array([0, 1, 2]). Why is there a 2 there?
william007
  • 775
  • 1
  • 10
  • 20
3
votes
1 answer

Effect of Skewness and data range in machine learning

I have a feature for machine learning, as follows, that is skewed to the left and only takes values in a certain range (here 0-2000). Will the skewness and the range of the values affect the learning? If yes, what should I do?
user29151
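
A rough sketch of two transformations often tried in this situation, assuming the values are non-negative with a long tail toward large values (the excerpt does not show the actual distribution, so this is an assumption): a `log1p` transform to reduce skew, followed by min-max scaling to [0, 1]. The helper names and the sample values are hypothetical.

```python
import math


def skewness(xs):
    """Population skewness: third central moment over variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def min_max_scale(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


# Hypothetical sample in the 0-2000 range with a long tail of large values.
values = [0, 5, 10, 20, 40, 80, 150, 400, 900, 2000]

logged = [math.log1p(v) for v in values]  # log(1 + v) is defined at v = 0
scaled = min_max_scale(logged)
```

Whether this helps depends on the model: distance-based and gradient-based methods tend to be sensitive to scale, while tree-based methods are largely unaffected by monotonic transformations of a single feature.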
3
votes
2 answers

numerical or categorical data

I have a feature for machine learning (using methods like SVM, naive Bayes, neural networks and random forests) called member duration, as follows: Should I treat it as numerical or categorical data?
william007
  • 775
  • 1
  • 10
  • 20
2
votes
1 answer

When is it appropriate to split a dataset on a categorical value and generate $n$ models instead?

When doing regression or classification, when faced with a categorical attribute with $n$ possible values, there are two options: (1) Feed this attribute directly into your model. (2) Partition your data into $n$ pieces based on the categorical attribute and…
orlp
  • 121
  • 2
1
vote
1 answer

Problem of finding the best combination of features when the desired feature is some_feature_A/some_feature_B

The problem is stated as follows: we have a giant CSV file with one target column, and the rest are inputs. We don't know how these features impact the target, but we would like to use an algorithm that, besides using linear and non-linear transformations, will also take into account…
quester
  • 295
  • 1
  • 3
  • 8
1
vote
1 answer

How can I deal with circular features like hours?

Assume I want to predict if I'm fit in the morning. One feature is the last time I was online. Now this feature is tricky: If I take the hour, then a classifier might have a difficult time with it because 23 is numerically closer to 20 than to 0,…
Martin Thoma
  • 18,880
  • 35
  • 95
  • 169
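
One widely used way to handle this, sketched here under the assumption that the feature is an hour on a 24-hour clock: map each hour to a point on the unit circle with sine and cosine, so that 23 and 0 end up close together. The function names `encode_hour` and `circular_distance` are hypothetical.

```python
import math


def encode_hour(hour, period=24):
    """Map an hour to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)


def circular_distance(h1, h2):
    """Euclidean distance between two hours in the encoded (sin, cos) space."""
    (s1, c1), (s2, c2) = encode_hour(h1), encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)


# In the encoded space 23 is closer to 0 than to 20, unlike the raw numbers.
near_midnight = circular_distance(23, 0)
three_hours = circular_distance(23, 20)
```

Both coordinates are needed: sine or cosine alone maps two different hours to the same value.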
1
vote
0 answers

Should original features be retained in the model after using them to engineer new features?

BACKGROUND: I have a dataset that includes Race (e.g., White, Black) and Ethnicity (e.g., Hispanic, Non-Hispanic) as observed variables. The dataset also includes Race_Ethnicity (e.g., Hispanic White, Non-Hispanic Black) as an engineered variable,…
1
vote
1 answer

An efficient way of calculating/estimating frequency spectrum for an event

This is rather a practical question. I'm looking for an efficient way of calculating the frequency of an event for a large number of samples. Here's a more concrete example. Let's say that I have a system with millions of users. Each user has so…
Mehran
  • 277
  • 1
  • 2
  • 12
0
votes
1 answer

Cyclic dependency between feature and predictor class

I have a feature which has specific categorical values (e.g., Technology, Hardware, Software, Marketing, Events, etc.). Based on this and some other features, I am trying to classify the dataset into two categories, IsSoftwareSystem or NotSoftwareSystem. In…
0
votes
1 answer

How to use feature group?

Let's say I have a data set like the following:
    file    group_a_co_1  group_a_co_2  group_b_co_1  group_b_co_2
    file_1  0.8           0.2           0.3           0.7
    file_2  0.1           0.9           0.2           0.8
    file_3  0.5           0.5           0.7           0.3
    ...
I wonder whether there are ways/tricks to tell the…
dgg32
  • 113
  • 4
0
votes
1 answer

To One-Hot-Encode or not to One-Hot-Encode?

I have been struggling to find proof for that, but I couldn't. Every time I prepare a dataset I face the same issue when a column is a classification, such as CountryCode or TaskType in this dataset: TaskType CountryCode Target 1 61 …
asmgx
  • 549
  • 2
  • 18
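
For reference, a minimal sketch of what one-hot encoding does to a column such as CountryCode, written in plain Python so the mechanics are visible. The function name `one_hot` and the sample codes are hypothetical and are not the question's data.

```python
def one_hot(values):
    """Return the sorted categories and one indicator row per input value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # exactly one position is set per row
        rows.append(row)
    return categories, rows


# Hypothetical country codes; the integers are labels, not quantities.
country_codes = [61, 44, 61, 1]
categories, encoded = one_hot(country_codes)
```

The point of the encoding is that a model no longer sees code 61 as "larger" than code 44; the cost is one extra column per distinct category.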
0
votes
2 answers

What is a good approach for a lifespan?

Let's say I want to predict the lifespan of an ad in a listing. I know a bunch of things about the ad, such as the title, the price, the location, etc. The target value is the duration of the ad in the listing before it is removed (item has been…
0
votes
0 answers

Features derived using retrocausality

I have been experimenting with features derived using retrocausality (not to be confused with data leakage) in training models. Are there any examples of prior work in the literature where this form of feature engineering has yielded success?
0
votes
1 answer

Are there any search algorithms for feature optimization similar to RFE, but which consider all possible combinations?

Does anyone know any good search algorithms for feature optimization that search through every possible combination to find the optimal combination of features for maximum predictive power? (Permutations are not important). So far I have been using…
PlatinumMaths
  • 81
  • 2
  • 11
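
A brute-force search over every feature combination can be sketched with `itertools.combinations`. This is only an illustration under stated assumptions: `best_subset` is a hypothetical helper, and `toy_score` stands in for whatever validation metric would be used in practice (for example a cross-validated score of a model trained on the subset).

```python
from itertools import combinations


def best_subset(features, score):
    """Evaluate every non-empty combination and return the best-scoring one."""
    best, best_score = (), float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):  # order is ignored
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score


def toy_score(subset):
    """Hypothetical scoring function: summed usefulness minus a size penalty."""
    useful = {"age": 0.3, "income": 0.5, "noise": 0.0}
    return sum(useful[f] for f in subset) - 0.1 * len(subset)


best, value = best_subset(["age", "income", "noise"], toy_score)
```

With n features there are 2^n - 1 non-empty subsets, so this is only feasible for small n; that cost is why greedy procedures such as RFE are commonly used instead.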