Questions tagged [scikit-learn]

What is scikit-learn?

scikit-learn is a popular machine learning package for Python that provides simple and efficient tools for predictive data analysis. Topics include classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. It is built on NumPy, SciPy, and matplotlib and is open source under the BSD License. It is part of the scientific computing ecosystem and is suitable for both individual and commercial use.


New to scikit-learn?

Various resources, including books, tutorials, and workshops, are available for those learning to use scikit-learn.

A popular introductory tutorial is:

SciPy 2018 Conference Tutorial:

A popular introductory book is:

Introduction to Machine Learning with Python, by Andreas C. Müller and Sarah Guido.


Tag usage

When posting questions about scikit-learn, please take the following into consideration:

  • Use the scikit-learn tag rather than sklearn. The latter is marked as a synonym, so questions tagged with it will automatically be retagged.

  • Questions that are purely about programming are more suitable for Stack Overflow and should not be posted on Data Science Stack Exchange.

  • Questions should be clear and detailed enough for others to help with the problem at hand. This includes linking to the underlying data, providing the code used to build the model, highlighting relevant outputs, etc.
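As an illustration of the last point, a question can often be accompanied by a small self-contained example. The sketch below is only one possible shape for such an example: the synthetic dataset, the choice of LogisticRegression, and the variable names are assumptions made for illustration and are not prescribed by the tag guidance.

```python
import sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data stands in for the real dataset so that anyone can run the example.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The model construction that the question is about.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Report the library version and the output relevant to the question.
accuracy = model.score(X_test, y_test)
print("scikit-learn version:", sklearn.__version__)
print("test accuracy:", accuracy)
```

Fixing random_state makes the behaviour reproducible for whoever tries to answer.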


External Resources

scikit-learn: Documentation page

scikit-learn: GitHub page


2308 questions
6 votes, 4 answers

Poisson regression options in python

I want to predict count data. In my understanding both standard classification and regression are not well suited for this. A poisson or binomial regression algorithm seems to do the trick. I am used to doing most of my ML tasks in sklearn. But on…
El Burro
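For readers landing on this question: scikit-learn versions from 0.23 onward include a Poisson GLM in sklearn.linear_model. The following is a minimal sketch on simulated counts; the data-generating coefficients and the alpha value are assumptions chosen for illustration, not taken from the question.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(500, 2))
# Simulated counts whose log-mean is linear in the features.
y = rng.poisson(np.exp(0.5 + 1.0 * X[:, 0] - 0.5 * X[:, 1]))

# A small L2 penalty; the estimator's default is alpha=1.0.
model = PoissonRegressor(alpha=1e-3)
model.fit(X, y)

# The log link guarantees strictly positive predicted means.
pred = model.predict(X[:5])
```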
4 votes, 1 answer

Scikit learn: which regressors natively support multi-target regression?

The docs on sklearn.multioutput.MultiOutputRegressor state that it implements a strategy for extending regressors that do not natively support multi-target regression. I'm interested to know: which ones do natively support multi-target regression ?…
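As a hedged illustration of what native support looks like, LinearRegression is one estimator that accepts a two-dimensional target directly. This sketch shows only that single case and is not a list of all such regressors; the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Two targets per sample, stacked as the columns of a 2D array.
Y = np.column_stack([X @ [1.0, 2.0, 3.0], X @ [-1.0, 0.5, 0.0]])

# No MultiOutputRegressor wrapper is needed here.
model = LinearRegression().fit(X, Y)
pred = model.predict(X[:4])  # one column of predictions per target
```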
3 votes, 1 answer

Expected 2D array, got scalar array instead: array=11

import numpy as np import matplotlib.pyplot as plt import pandas as pd from sklearn.metrics import r2_score # load the data veriler = pd.read_csv(r'C:\Users\k\Desktop\maaslar_yeni.csv') # here x is the independent variable and y is the dependent variable. x =…
user86600
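The excerpt is truncated before the failing call, so the following is only a sketch of the usual cause, assuming a bare scalar was passed to predict. scikit-learn estimators expect input of shape (n_samples, n_features); the toy data below is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # shape (n_samples, n_features)
y = np.array([2.0, 4.0, 6.0, 8.0])
model = LinearRegression().fit(X, y)

# model.predict(11) raises ValueError because 11 is a scalar, not a 2D array.
# Wrapping the value as one sample with one feature gives shape (1, 1).
pred = model.predict(np.array([[11]]))
```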
3 votes, 1 answer

Sci-kit Pipeline and GridsearchCV returns indexError: too many indices for array

I'm trying to get to grips with sci-kit learn for some simple machine learning projects but I'm coming unstuck with Pipelines and wonder what I've done wrong... I'm trying to work through a tutorial on Kaggle Here's my code: import pandas as…
elksie5000
3 votes, 1 answer

Scikitlearn - TfidfVectorizer - how to use a custom analyzer AND still use token_pattern

The docs state that token_pattern is only used if analyzer == 'word': token_pattern : string Regular expression denoting what constitutes a “token”, only used if analyzer == 'word'. The default regexp selects tokens of 2 or more …
aweeeezy
3 votes, 1 answer

sklearn CountVectorizer token_pattern -- skip token if pattern match

I apologize if this question is misplaced -- I'm not sure if this is more of a re question or a CountVectorizer question. I'm trying to exclude any would be token that has one or more numbers in it. >>> from sklearn.feature_extraction.text import…
aweeeezy
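One way to approach this, sketched under the assumption that a token should be dropped whenever it contains any digit, is to replace the digit-permitting \w in the default pattern with a character class that excludes digits. The sample document is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["abc 123 abc123 4def hello world"]

# [^\W\d] matches a word character that is not a digit; the \b anchors
# force the whole token to consist of such characters.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[^\W\d]{2,}\b")
vectorizer.fit(docs)

tokens = sorted(vectorizer.vocabulary_)
print(tokens)  # ['abc', 'hello', 'world']
```

Note that underscores still count as word characters under this pattern.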
3 votes, 2 answers

Large sparse dataset in Catboost

I have a large sparse data matrix (bag of words, over large number of entries). I can easily treat it as a sparse matrix in sklearn models such as RandomForest. But, if I want to use Catboost, I need to turn it into a dense matrix. I was wondering…
3 votes, 1 answer

How does scikit-learn decision function method work?

The scikit-learn docs say it is the signed distance of that sample to the hyperplane. I've taken the sum of the weights and their corresponding coefficient and added the intercept to that sum but this does not return the value given by the…
berrypy
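For a linear classifier, the value returned by decision_function is the raw score w·x + b; the geometric distance to the hyperplane is that score divided by the norm of w. The sketch below checks this with LogisticRegression on synthetic data, both of which are assumptions since the excerpt does not name the estimator.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Raw score: dot product of each sample with the weights, plus the intercept.
manual = X @ clf.coef_.ravel() + clf.intercept_[0]
scores = clf.decision_function(X)

# Dividing by the norm of the weight vector gives the geometric distance.
distances = scores / np.linalg.norm(clf.coef_)
```

The dot product uses the feature values of each sample, not the sum of the weights alone.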
2 votes, 1 answer

How many features do you generally use for your ML Model?

I am working on a certain kaggle competition and users there say that they are using >5000 features and training a XGBoost or Random Forest on it. The mentioned post is here:…
Rahul Agarwal
2 votes, 1 answer

Criteria used to create and select leaf nodes in sklearn

I just want to know the details of what (and how) is the criteria used by sklearn.tree.DecisionTreeClassifier to create leaf nodes. I know that the parameters criterion{“gini”, “entropy”}, default=”gini” and splitter{“best”, “random”},…
Ivan
2 votes, 2 answers

Is there a documentation where it is explained why scikit-learn does not provide p-values?

Is there a documentation, paper etc. where it is explained why scikit-learn does not provide p-values/confidence levels (1, 2, 3, 4)? Note: I'm not asking about opinions, but about documentation. For example the R package lme4 does not provide…
Qaswed
2 votes, 2 answers

Why don't all feature selection methods in sklearn allow specifying desired variance explained?

Why don't all feature selection methods in sklearn allow specifying desired variance explained? sklearn.decomposition.PCA does allow inputting a percentage of variance that one wants to be explained in place of n_components. However other methods…
mavavilj
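The PCA behaviour mentioned in the excerpt can be sketched as follows. The iris dataset and the 0.95 threshold are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data

# A float in (0, 1) asks PCA to keep the fewest components whose cumulative
# explained variance ratio exceeds that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X)

explained = pca.explained_variance_ratio_.sum()
print(pca.n_components_, explained)
```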
2 votes, 1 answer

Module 'sklearn' has no attribute 'datasets'?

Isn't scikit-learn version 1.0.2 supposed to have an attribute datasets? If so, why am I getting an error? Python 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] on linux Type "help", "copyright", "credits" or "license" for more…
Tfovid
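The excerpt is cut off before the failing statement, so the cause cannot be confirmed from it. A common cause of this error is accessing sklearn.datasets after only running import sklearn; under that assumption, importing the submodule explicitly is a minimal sketch of the fix.

```python
# The submodule has to be imported explicitly; a bare "import sklearn"
# does not necessarily load it.
from sklearn import datasets

iris = datasets.load_iris()
n_samples, n_features = iris.data.shape
```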
2 votes, 0 answers

Multidimensional scaling (MDS) fails on a simple example

I want to apply multi-dimensional scaling (MDS) on specific objects; using the Euclidean distance does not make sense for such objects; using another distance metric, I can compute their dissimilarity matrix $D$. Then I compute the embeddings of the…
user11634
1 vote, 1 answer

Can I add new features in an existing dataset using function transformers in scikit-learn

I have written code that adds 3 new columns to a NumPy array using a function transformer (the 1st column is element-wise +, the 2nd is element-wise *, the 3rd is element-wise /). I just need to know if I can add new features this way to an existing…
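The asker's code is not shown, so the following is only a sketch of the idea under stated assumptions: the helper add_features is hypothetical, and it derives the three new columns from the first two columns of the input.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_features(X):
    """Append the element-wise sum, product, and ratio of the first two columns."""
    a, b = X[:, 0], X[:, 1]
    return np.column_stack([X, a + b, a * b, a / b])

transformer = FunctionTransformer(add_features)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_new = transformer.fit_transform(X)  # original 2 columns plus 3 derived ones
```

Because FunctionTransformer follows the transformer API, an object like this can also be used as a step in a Pipeline.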