5

Sometimes you might want to check your ideas on multiple datasets. There are several places that host collections of datasets.

Question: Please share some Python scripts showing how to download multiple datasets from these (or other) dataset collections.

Ideally one should be able to: 1) get a list of datasets, 2) select the desired ones by some conditions, 3) download those selected. But if you have something different, please share anyway.

For the "openml" database I have a script - see my own answer. But I do not have one for other collections: Kaggle, UCI, ...


Here are some examples of dataset collections:

https://www.openml.org/

https://archive.ics.uci.edu/ml/index.php

https://ieee-dataport.org/datasets

Kaggle contains lots of datasets. There are also more specific collections: for graph datasets see the list here https://mathoverflow.net/a/359449/10446 , and much biological data is available here: https://www.ncbi.nlm.nih.gov/gds

  • Some other datasets lists: https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#512c8a29b54d https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research https://datascience.stackexchange.com/questions/155/publicly-available-datasets?rq=1 – Alexander Chervov Oct 22 '20 at 09:13
  • Note that there is a whole stack exchange dedicated to data : https://opendata.stackexchange.com/ it might be of interest for a specific problem. – Lucas Morin Oct 29 '20 at 13:50

3 Answers

5

How to fetch Kaggle data from Python code?

  1. Install the kaggle package: C:\Users\TalgatHafiz> pip install kaggle

  2. Log in to your Kaggle account, click on the icon in the upper right corner -> My Account, scroll down to the API section and click "Create New API Token". A "kaggle.json" file is created and saved locally.

  3. Create a ".kaggle" directory: C:\Users\TalgatHafiz>mkdir .kaggle and move "kaggle.json" into that directory.

  4. See all active competitions by running the following command: C:\Users\TalgatHafiz>kaggle competitions list

  5. Select one of the competitions that you signed up for, e.g. https://www.kaggle.com/c/contradictory-my-dear-watson/data# and scroll down. Right before the "Data Explorer" section there should be an API line: "kaggle competitions download -c contradictory-my-dear-watson" - copy it.

  6. Run these commands from the notebook: import kaggle and then !kaggle competitions download -c contradictory-my-dear-watson

  7. The zipped data file is downloaded into the same directory where your notebook is, e.g. C:\Users\TalgatHafiz\conda\contradictory-my-dear-watson.zip, so now you can unzip it and start using the data.

If you still have questions please read https://medium.com/@jeff.daniel77/accessing-the-kaggle-com-api-with-jupyter-notebook-on-windows-d6f330bc6953
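
If you prefer to stay entirely in Python rather than the shell, the kaggle package also exposes a Python client. Below is a minimal sketch assuming the kaggle.json credentials from steps 2-3 are already in place; the client and method names (KaggleApi, dataset_list, dataset_download_files, competition_download_files) may differ slightly between package versions, so check the package documentation if a call fails.

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the kaggle.json created in steps 2-3

# 1) get a list of datasets matching a search term ("titanic" is just an example)
datasets = api.dataset_list(search="titanic")
for d in datasets[:5]:
    print(d.ref)  # references look like "owner/dataset-name"

# 2) pick one (here simply the first hit) and 3) download + unzip it
if datasets:
    api.dataset_download_files(datasets[0].ref, path="kaggle_data", unzip=True)

# competition files work similarly (rules must be accepted on the website first)
api.competition_download_files("contradictory-my-dear-watson", path="kaggle_data")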

3

Here is a script for the "openml" collection of datasets. Hopefully someone can provide something similar for other databases.

# see docs: https://docs.openml.org/Python-guide/
!pip install openml

import openml
import numpy as np
import pandas as pd
import time

Get information on the whole collection of OpenML datasets:

datalist = openml.datasets.list_datasets(output_format="dataframe")

Select datasets by some conditions (just pandas filtering) - we will get just 4 such datasets:

datasets_selected = datalist[
    (datalist.NumberOfInstances < 2550) &
    (datalist.NumberOfInstances > 300) &
    (datalist.NumberOfFeatures > 10000) &
    (datalist.NumberOfFeatures < 40000) &
    (datalist.NumberOfFeatures != 10937)
].sort_values(["NumberOfInstances"], ascending=False)  # .head(n=20)
print(datasets_selected.shape)

Load all selected datasets and print short info:

for i in range(len(datasets_selected)):
    nm = datasets_selected['name'].iloc[i]
    print(nm, i)
    did = int(datasets_selected['did'].iloc[i])  # did - dataset id
    t0 = time.time()
    data = openml.datasets.get_dataset(did)
    X, y, categorical_indicator, attribute_names = data.get_data(
        dataset_format="array", target=data.default_target_attribute
    )
    print(X.shape, y.shape, time.time() - t0, 'secs passed')
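
The loop above returns numpy arrays; get_data can also return pandas DataFrames (dataset_format="dataframe"), which makes it easy to keep a local copy of each selected dataset. A small sketch of that variant (the output file names are arbitrary):

for i in range(len(datasets_selected)):
    did = int(datasets_selected['did'].iloc[i])
    data = openml.datasets.get_dataset(did)
    df, y, _, _ = data.get_data(dataset_format="dataframe",
                                target=data.default_target_attribute)
    df.to_csv(f"openml_{did}_X.csv", index=False)        # feature table
    pd.Series(y).to_csv(f"openml_{did}_y.csv", index=False)  # target column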


Here is an even simpler example for the sklearn built-in datasets:

import numpy as np 
from sklearn import  datasets 
import time
list_id =  ['load_boston', 'load_iris', 'load_diabetes', 'load_digits', 'load_linnerud', 'load_wine' , 'load_breast_cancer'] + \
 ['fetch_california_housing', 'fetch_covtype',  'fetch_lfw_people', 'fetch_20newsgroups_vectorized','fetch_olivetti_faces' ]
# 'fetch_rcv1', - too long 
# 'fetch_lfw_pairs' - TypeError fetch_lfw_pairs() got an unexpected keyword argument 'return_X_y
# 'fetch_kddcup99' - sometimes problem happens
for id in list_id:
  print(id)
  t0 = time.time()
  func_load  = getattr(datasets, id )
  X,y = func_load(return_X_y = True)
  print(id, X.shape, time.time()-t0, 'secs passed')
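
A related option: sklearn can also fetch individual OpenML datasets by name or id via fetch_openml. A minimal sketch ('mnist_784' is just an example name; any OpenML dataset name or id works):

from sklearn.datasets import fetch_openml

# downloads (and caches) the dataset from openml.org
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
print(X.shape, y.shape)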

3

OpenML has a gallery of different use case examples, including browsing and downloading datasets through python, and running benchmarks: https://openml.github.io/openml-python/master/examples/index.html

When you want to benchmark new algorithms, this is the gist:

import openml
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

suite = openml.study.get_suite('OpenML-CC18')  # get benchmark suite
tasks = np.random.choice(suite.tasks, size=10, replace=False)  # sample 10 tasks randomly
clf = make_pipeline(SimpleImputer(), RandomForestClassifier())  # simple pipeline
for task_id in tasks:
    task = openml.tasks.get_task(task_id)
    print("Running on task", task.get_dataset().name)
    run = openml.runs.run_model_on_task(clf, task)
    print(run.get_metric_fn(accuracy_score))

Output (these are 10-fold CV tasks):

Running on task credit-approval
[0.928 0.884 0.841 0.768 0.913 0.884 0.884 0.841 0.899 0.884]
Running on task pc1
[0.955 0.919 0.946 0.955 0.937 0.973 0.919 0.928 0.919 0.918]

You can also choose to directly share the result on OpenML with run.publish()

Disclaimer: I am one of the core developers of OpenML

  • Thank you for sharing! Actually I also described an OpenML example in my own answer, but your code provides another look. Do I understand correctly that OpenML has its own servers and one can run a task on their side, not locally? – Alexander Chervov Oct 20 '20 at 21:41
  • @AlexanderChervov Currently, OpenML doesn't allow running experiments server-side. It does make experiments portable so that anyone can run them on any hardware, and compare/share the results with others. – Joaquin Vanschoren Oct 20 '20 at 23:41