The process of using domain knowledge of the data to create features that improve machine learning algorithms.
Questions tagged [feature-engineering]
655 questions
27
votes
3 answers
Encoding categorical variables using likelihood estimation
I am trying to understand how I can encode categorical variables using likelihood estimation, but have had little success so far.
Any suggestions would be greatly appreciated.
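One common reading of "likelihood encoding" is to replace each category with a smoothed estimate of the target rate observed for that category. The sketch below assumes a binary target; the function name `likelihood_encode`, the `prior_weight` parameter, and the toy data are hypothetical and only meant to illustrate the idea.

```python
from collections import defaultdict

def likelihood_encode(categories, targets, prior_weight=1.0):
    """Map each category to a smoothed estimate of P(target = 1 | category)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    # Shrink rare categories toward the global mean (additive smoothing).
    table = {
        c: (sums[c] + prior_weight * global_mean) / (counts[c] + prior_weight)
        for c in counts
    }
    return [table[c] for c in categories], table

# Hypothetical toy data: a categorical column and a binary target.
cities = ["a", "a", "b", "b", "b", "c"]
clicked = [1, 0, 1, 1, 1, 0]
encoded, table = likelihood_encode(cities, clicked)
```

In practice the table would usually be estimated on training folds only, so that a row's own target value does not leak into its encoded feature.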

small dwarf
4
votes
2 answers
2D matrix for labelbinarizer
Here is one behavior of LabelBinarizer:
import numpy as np
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(np.array([[0, 1, 1], [1, 0, 0]]))
lb.classes_
The output is array([0, 1, 2]). Why is there a 2 there?
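A likely explanation, sketched here rather than stated authoritatively: a 2D array containing only 0s and 1s is interpreted as a multilabel indicator matrix, so each of the three columns counts as one class and `classes_` holds the column indices. Flattening the array first (an assumption about what was intended) gives the two distinct labels instead.

```python
import numpy as np
from sklearn import preprocessing

y = np.array([[0, 1, 1], [1, 0, 0]])

# 2D input of 0s and 1s: treated as a multilabel indicator matrix,
# so classes_ lists the column indices 0, 1, 2.
lb = preprocessing.LabelBinarizer()
lb.fit(y)
indicator_classes = lb.classes_

# 1D input: classes_ lists the distinct label values that occur.
flat = preprocessing.LabelBinarizer()
flat.fit(y.ravel())
flat_classes = flat.classes_
```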

william007
3
votes
1 answer
Effect of Skewness and data range in machine learning
I have a feature for machine learning, as follows, that is skewed to the left and only takes values in a certain range (here 0-2000). Will the skewness and the range of values affect learning? If so, what should I do?
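One way to explore this is to measure the skewness, apply a transform, and rescale. The sketch below uses hypothetical values in the 0-2000 range and assumes the long tail is on the high side, where a log transform helps; for a tail on the low side, the values could be reflected first. The helper names are illustrative only.

```python
import math

def skewness(xs):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

def min_max_scale(xs):
    """Rescale values linearly to the interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

# Hypothetical feature values between 0 and 2000 with a long upper tail.
values = [1, 2, 3, 5, 8, 20, 50, 200, 900, 2000]

logged = [math.log1p(v) for v in values]  # log(1 + x) compresses the tail
scaled = min_max_scale(logged)            # brings the range to [0, 1]

skew_raw = skewness(values)
skew_log = skewness(logged)
```

Whether this matters depends on the model: tree-based methods are largely insensitive to monotonic transforms, while distance-based and gradient-based methods tend to be more sensitive to scale.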
user29151
3
votes
2 answers
numerical or categorical data
I have a feature for machine learning (using methods like SVM, naive Bayes, neural networks and random forests) called member duration as follows:
Should I treat it as numerical or categorical data?

william007
2
votes
1 answer
When is it appropriate to split a dataset on a categorical value and generate $n$ models instead?
When doing regression or classification with a categorical attribute that has $n$ possible values, there are two options:
Feed this attribute directly into your model.
Partition your data into $n$ pieces based on the categorical attribute and…
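The second option can be sketched as follows: group the rows by the categorical attribute and fit one model per group. The example assumes a single numeric input and a simple least-squares line as the per-group model; the data and function names are hypothetical.

```python
from collections import defaultdict

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def fit_per_category(rows):
    """Partition (category, x, y) rows by category and fit one line per part."""
    groups = defaultdict(list)
    for category, x, y in rows:
        groups[category].append((x, y))
    return {c: fit_line(*zip(*points)) for c, points in groups.items()}

# Hypothetical data in which the two categories follow opposite trends.
rows = [
    ("red", 0.0, 1.0), ("red", 1.0, 3.0), ("red", 2.0, 5.0),
    ("blue", 0.0, 4.0), ("blue", 1.0, 3.0), ("blue", 2.0, 2.0),
]
models = fit_per_category(rows)
```

Splitting tends to pay off when the relationship between inputs and target differs strongly across categories, at the cost of fewer training rows per model.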

orlp
1
vote
1 answer
Problem of finding the best combination of features when the desired feature is some_feature_A/some_feature_B
The problem is stated as follows: we have a giant CSV file with one target column, and the rest are inputs. We don't know how these features impact the target, but we would like to use an algorithm that, besides using linear and non-linear transformations, will also take into account…
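A brute-force baseline for ratio features is to build every ordered pair A/B and rank the candidates by how strongly they correlate with the target. This is only a sketch under stated assumptions: the column names and values are hypothetical, the target is constructed to equal f1/f2, and absolute Pearson correlation stands in for whatever relevance score is actually appropriate.

```python
import math
from itertools import permutations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_ratio_features(columns, target):
    """Score every ratio a/b by |correlation| with the target, best first."""
    scores = {}
    for a, b in permutations(columns, 2):
        if any(v == 0 for v in columns[b]):
            continue  # skip denominators that contain zeros
        ratio = [x / y for x, y in zip(columns[a], columns[b])]
        scores[f"{a}/{b}"] = abs(pearson(ratio, target))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical columns; the target equals f1/f2 by construction.
columns = {
    "f1": [2.0, 9.0, 4.0, 10.0, 3.0],
    "f2": [1.0, 3.0, 4.0, 2.0, 6.0],
    "f3": [5.0, 1.0, 2.0, 7.0, 3.0],
}
target = [a / b for a, b in zip(columns["f1"], columns["f2"])]
ranking = rank_ratio_features(columns, target)
```

The number of candidate ratios grows quadratically with the number of columns, so for a very wide file some pre-filtering would probably be needed.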

quester
1
vote
1 answer
How can I deal with circular features like hours?
Assume I want to predict if I'm fit in the morning. One feature is the last time I was online. Now this feature is tricky: If I take the hour, then a classifier might have a difficult time with it because 23 is numerically closer to 20 than to 0,…
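A widely used trick for cyclic quantities is to map them onto the unit circle with sine and cosine, so that 23 and 0 end up next to each other. A minimal sketch, with hypothetical function names:

```python
import math

def encode_hour(hour, period=24):
    """Map an hour to a point (sin, cos) on the unit circle."""
    angle = 2 * math.pi * (hour % period) / period
    return math.sin(angle), math.cos(angle)

def encoded_distance(h1, h2):
    """Euclidean distance between two hours in the encoded space."""
    (s1, c1), (s2, c2) = encode_hour(h1), encode_hour(h2)
    return math.hypot(s1 - s2, c1 - c2)

near = encoded_distance(23, 0)   # one hour apart across midnight
far = encoded_distance(23, 20)   # three hours apart
```

With this encoding, `near` is smaller than `far`, which matches the intuition that 23:00 is closer to midnight than to 20:00.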

Martin Thoma
1
vote
0 answers
Should original features be retained in the model after using them to engineer new features?
BACKGROUND: I have a dataset that includes Race (e.g., White, Black) and Ethnicity (e.g., Hispanic, Non-Hispanic) as observed variables. The dataset also includes Race_Ethnicity (e.g., Hispanic White, Non-Hispanic Black) as an engineered variable,…

Snehal Patel
1
vote
1 answer
An efficient way of calculating/estimating frequency spectrum for an event
This is rather a practical question. I'm looking for an efficient way of calculating the frequency of an event for a large number of samples. Here's a more concrete example.
Let's say that I have a system with millions of users. Each user has so…

Mehran
0
votes
1 answer
Cyclic dependency between feature and predictor class
I have a feature which has specific categorical values (e.g., Technology, Hardware, Software, Marketing, Events, etc.). Based on this and some other features, I am trying to classify the dataset into 2 categories: IsSoftwareSystem or NotSoftwareSystem. In…
0
votes
1 answer
How to use feature group?
Let's say I have a data set like the following:
file    group_a_co_1  group_a_co_2  group_b_co_1  group_b_co_2
file_1  0.8           0.2           0.3           0.7
file_2  0.1           0.9           0.2           0.8
file_3  0.5           0.5           0.7           0.3
...
I wonder, whether there are ways/tricks to tell the…

dgg32
0
votes
1 answer
To One-Hot-Encode or not to One-Hot-Encode?
I have been struggling to find proof for this, but I couldn't.
Every time I prepare a dataset I face the same issue
when a column is a classification, such as CountryCode or TaskType in this dataset:
TaskType CountryCode Target
1 61 …
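For reference, one-hot encoding replaces a code column with one 0/1 column per distinct value, so the model does not read an order or distance into the codes. A minimal sketch in plain Python; the `country_codes` values are hypothetical and not taken from the dataset above.

```python
def one_hot_encode(values):
    """Return the sorted categories and one 0/1 row per input value."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1  # mark the column that matches this value
        rows.append(row)
    return categories, rows

# Hypothetical CountryCode column.
country_codes = [61, 44, 61, 1]
categories, rows = one_hot_encode(country_codes)
```

Tree-based models can often cope with integer codes directly, whereas linear models and neural networks generally benefit from the one-hot form, at the cost of many columns when the number of categories is large.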

asmgx
0
votes
2 answers
What is a good approach for a lifespan?
Let's say I want to predict the lifespan of an ad in a listing.
I know a bunch of things about the ad, like:
the title
the price
the location
etc
The target value is the duration of the ad in the listing before it is removed (item has been…

Benjamin Toueg
0
votes
0 answers
Features derived using retrocausality
I have been experimenting with features derived using retrocausality (not to be confused with data leakage) in training models. Are there any examples of prior work in the literature where this form of feature engineering has yielded success?
0
votes
1 answer
Are there any search algorithms for feature optimization similar to RFE, but which consider all possible combinations?
Does anyone know any good search algorithms for feature optimization that search through every possible combination to find the optimal combination of features for maximum predictive power? (Permutations are not important).
So far I have been using…
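An exhaustive search over feature subsets can be written directly with `itertools.combinations`. The sketch below is an illustration only: `toy_score` is a hypothetical stand-in for a cross-validated model score, and the feature names and weights are made up.

```python
from itertools import combinations

def best_subset(features, score):
    """Evaluate every non-empty subset and return the highest-scoring one."""
    best, best_score = (), float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):
            s = score(subset)
            if s > best_score:
                best, best_score = subset, s
    return best, best_score

# Hypothetical per-feature contributions with a small penalty per feature.
useful = {"a": 0.5, "b": 0.3, "c": -0.2, "d": 0.0}

def toy_score(subset):
    return sum(useful[f] for f in subset) - 0.01 * len(subset)

best, best_value = best_subset(list(useful), toy_score)
```

With $n$ features there are $2^n - 1$ non-empty subsets, so this is only feasible for small $n$; beyond that, greedy or stochastic searches are the usual compromise.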

PlatinumMaths