3

How do I remove outliers from my data? Should I use RobustScaler? I am aware I can use DecisionTree but I want to use XGBoost...
Could you please help me? This is a bit urgent and I am not sure how to do it. I have researched and looked at previous questions, but the approaches there did not work well and were not helpful.
Thank you

Cheers

omkaartg

1 Answer

3
  • First of all, you don't need to remove outliers, because decision-tree-based algorithms like XGBoost can handle them: tree splits depend on the ordering of feature values, not on their magnitude.

  • Secondly, you can use Tukey's method (Tukey, J. W., 1977):

    from collections import Counter

    import numpy as np

    def detect_outliers(df, n, features):
        """Return indices of rows that are Tukey outliers in more than n of the given columns."""
        outlier_indices = []
        # iterate over features (columns)
        for col in features:
            # 1st quartile (25%)
            Q1 = np.percentile(df[col], 25)
            # 3rd quartile (75%)
            Q3 = np.percentile(df[col], 75)
            # Interquartile range (IQR)
            IQR = Q3 - Q1
            # outlier step
            outlier_step = 1.5 * IQR
            # Determine a list of indices of outliers for feature col
            outlier_list_col = df[(df[col] < Q1 - outlier_step) | (df[col] > Q3 + outlier_step)].index
            # append the found outlier indices for col to the list of outlier indices
            outlier_indices.extend(outlier_list_col)
        # once every column has been processed, count how often each row was flagged
        outlier_indices = Counter(outlier_indices)
        # select observations that are outliers in more than n columns
        multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
        return multiple_outliers

    # data is assumed to be an existing pandas DataFrame.
    # n must be smaller than the number of columns checked:
    # with two columns, n=2 would never flag a row, so n=1 is used here.
    Outliers_to_drop = detect_outliers(data, 1, ["col1", "col2"])
    data.loc[Outliers_to_drop]  # Show the outlier rows
    # Drop outliers
    data = data.drop(Outliers_to_drop, axis=0).reset_index(drop=True)
    

https://www.kaggle.com/yassineghouzam/titanic-top-4-with-ensemble-modeling
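
To see what the rule does on a single column, here is a minimal sketch using only the standard library. The `tukey_fences` helper and the sample values are made up for illustration, and `statistics.quantiles` with `method="inclusive"` is used as a stand-in for `np.percentile` (both interpolate linearly between data points, so the quartiles should agree).

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) Tukey fences for a list of numbers."""
    # quantiles(n=4) returns the three quartile cut points [Q1, median, Q3].
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample: eight ordinary readings and one extreme value.
sample = [10, 12, 12, 13, 14, 15, 15, 16, 95]
lower, upper = tukey_fences(sample)
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)  # 7.5 19.5 [95]
```

Tukey's rule itself stops here, at one column. Counting how many columns flag the same row (the `v > n` test in the function above) is an extra filter layered on top of it, so that only rows that are extreme in several features get dropped.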

  • And thirdly, I suggest you try discretizing (binning) continuous variables instead of removing outliers for XGBoost.
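
As a rough sketch of the binning idea, again with only the standard library: the `quantile_bins` helper, the choice of four bins, and the fare values are assumptions for illustration, not something XGBoost requires.

```python
import bisect
import statistics

def quantile_bins(values, n_bins=4):
    """Map each value to a quantile-based bin index in 0..n_bins-1."""
    # Cut points that split the data into n_bins roughly equal-sized groups.
    edges = statistics.quantiles(values, n=n_bins, method="inclusive")
    # bisect_left puts a value that equals an edge into the lower bin.
    return [bisect.bisect_left(edges, v) for v in values]

# Hypothetical column with one extreme value at the end.
fares = [7.25, 8.05, 13.0, 26.0, 31.0, 53.1, 71.3, 512.3]
bins = quantile_bins(fares)
print(bins)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The extreme value 512.3 ends up in the same top bin as 71.3, so its magnitude no longer matters to the model. On a DataFrame column, pandas' `pd.qcut` performs the same kind of quantile binning.
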
parvij
  • I do not understand why the line above (multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )) is necessary. Can you point me to where the Tukey method says this? @parvij – Maths12 Dec 08 '20 at 12:29