5

Sometimes you might want to check your ideas on multiple datasets. There are several places that host collections of datasets.

Question: Please share some Python scripts showing how to download multiple datasets from these (or other) dataset collections.

Ideally one should be able to: 1) get a list of datasets, 2) select the desired ones by some conditions, 3) download those selected. But if you have something different, please share anyway.

For the "openml" database I have a script - see my own answer. But I do not have one for other collections: Kaggle, UCI, ...


Here are some examples of dataset collections:

https://www.openml.org/

https://archive.ics.uci.edu/ml/index.php

https://ieee-dataport.org/datasets

Kaggle contains lots of datasets. There are also more specific collections: for graph datasets see the list here https://mathoverflow.net/a/359449/10446 , and much biological data is available here: https://www.ncbi.nlm.nih.gov/gds

  • Some other datasets lists: https://www.forbes.com/sites/bernardmarr/2016/02/12/big-data-35-brilliant-and-free-data-sources-for-2016/#512c8a29b54d https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research https://datascience.stackexchange.com/questions/155/publicly-available-datasets?rq=1 – Alexander Chervov Oct 22 '20 at 09:13
  • Note that there is a whole stack exchange dedicated to data : https://opendata.stackexchange.com/ it might be of interest for a specific problem. – Lucas Morin Oct 29 '20 at 13:50

3 Answers

5

How to fetch Kaggle data from Python code?

  1. Install the kaggle package: C:\Users\TalgatHafiz> pip install kaggle

  2. Log in to your Kaggle account, click on the icon in the upper right corner -> My Account, scroll down to the API section and click "Create New API Token". A "kaggle.json" file is created and saved locally.

  3. Create a ".kaggle" directory: C:\Users\TalgatHafiz>mkdir .kaggle and move "kaggle.json" into that directory.

  4. See all active competitions by running the following command: C:\Users\TalgatHafiz>kaggle competitions list

  5. Select one of the competitions that you signed up for, e.g. https://www.kaggle.com/c/contradictory-my-dear-watson/data# and scroll down. Right before the "Data Explorer" section there should be an API line: "kaggle competitions download -c contradictory-my-dear-watson" - copy it.

  6. Run these commands from the notebook: import kaggle and then !kaggle competitions download -c contradictory-my-dear-watson

  7. The zipped data file is downloaded into the same directory where your notebook is, e.g. C:\Users\TalgatHafiz\conda\contradictory-my-dear-watson.zip, so now you can unzip it and start using the data.

If you still have questions please read https://medium.com/@jeff.daniel77/accessing-the-kaggle-com-api-with-jupyter-notebook-on-windows-d6f330bc6953
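
If you prefer to stay entirely in Python rather than the shell, the kaggle package also exposes a Python client. Below is a minimal sketch assuming the kaggle.json credentials from steps 2-3 are already in place; the client and method names (KaggleApi, dataset_list, dataset_download_files, competition_download_files) may differ slightly between package versions, so check the package documentation if a call fails.

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads the kaggle.json created in steps 2-3

# 1) get a list of datasets matching a search term ("titanic" is just an example)
datasets = api.dataset_list(search="titanic")
for d in datasets[:5]:
    print(d.ref)  # references look like "owner/dataset-name"

# 2) pick one (here simply the first hit) and 3) download + unzip it
if datasets:
    api.dataset_download_files(datasets[0].ref, path="kaggle_data", unzip=True)

# competition files work similarly (rules must be accepted on the website first)
api.competition_download_files("contradictory-my-dear-watson", path="kaggle_data")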

3

Here is a script for the "openml" collection of datasets. Hopefully someone can provide something similar for other databases.

# see docs: https://docs.openml.org/Python-guide/
!pip install openml

import openml
import numpy as np
import pandas as pd
import time

Get information on the whole collection of OpenML datasets:

datalist = openml.datasets.list_datasets(output_format="dataframe")

Select datasets by some conditions (just pandas filtering) - we will get just 4 such datasets:

datasets_selected = datalist[
    (datalist.NumberOfInstances < 2550) &
    (datalist.NumberOfInstances > 300) &
    (datalist.NumberOfFeatures > 10000) &
    (datalist.NumberOfFeatures < 40000) &
    (datalist.NumberOfFeatures != 10937)
].sort_values(["NumberOfInstances"], ascending=False)  # .head(n=20)
print(datasets_selected.shape)

Load all selected datasets and print short info:

for i in range(len(datasets_selected)):
    nm = datasets_selected['name'].iloc[i]
    print(nm, i)
    did = int(datasets_selected['did'].iloc[i])  # did - dataset id
    t0 = time.time()
    data = openml.datasets.get_dataset(did)
    X, y, categorical_indicator, attribute_names = data.get_data(
        dataset_format="array", target=data.default_target_attribute
    )
    print(X.shape, y.shape, time.time() - t0, 'secs passed')
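
The loop above returns numpy arrays; get_data can also return pandas DataFrames (dataset_format="dataframe"), which makes it easy to keep a local copy of each selected dataset. A small sketch of that variant (the output file names are arbitrary):

for i in range(len(datasets_selected)):
    did = int(datasets_selected['did'].iloc[i])
    data = openml.datasets.get_dataset(did)
    df, y, _, _ = data.get_data(dataset_format="dataframe",
                                target=data.default_target_attribute)
    df.to_csv(f"openml_{did}_X.csv", index=False)        # feature table
    pd.Series(y).to_csv(f"openml_{did}_y.csv", index=False)  # target column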


Here is an even simpler example for the sklearn built-in datasets:

import numpy as np 
from sklearn import  datasets 
import time
list_id =  ['load_boston', 'load_iris', 'load_diabetes', 'load_digits', 'load_linnerud', 'load_wine' , 'load_breast_cancer'] + \
 ['fetch_california_housing', 'fetch_covtype',  'fetch_lfw_people', 'fetch_20newsgroups_vectorized','fetch_olivetti_faces' ]
# 'fetch_rcv1', - too long 
# 'fetch_lfw_pairs' - TypeError fetch_lfw_pairs() got an unexpected keyword argument 'return_X_y
# 'fetch_kddcup99' - sometimes problem happens
for id in list_id:
  print(id)
  t0 = time.time()
  func_load  = getattr(datasets, id )
  X,y = func_load(return_X_y = True)
  print(id, X.shape, time.time()-t0, 'secs passed')
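
A related option: sklearn can also fetch individual OpenML datasets by name or id via fetch_openml. A minimal sketch ('mnist_784' is just an example name; any OpenML dataset name or id works):

from sklearn.datasets import fetch_openml

# downloads (and caches) the dataset from openml.org
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
print(X.shape, y.shape)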

3

OpenML has a gallery of different use case examples, including browsing and downloading datasets through python, and running benchmarks: https://openml.github.io/openml-python/master/examples/index.html

When you want to benchmark new algorithms, this is the gist:

import openml
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

suite = openml.study.get_suite('OpenML-CC18')  # get benchmark suite
tasks = np.random.choice(suite.tasks, size=10, replace=False)  # sample 10 tasks randomly
clf = make_pipeline(SimpleImputer(), RandomForestClassifier())  # simple pipeline
for task_id in tasks:
    task = openml.tasks.get_task(task_id)
    print("Running on task", task.get_dataset().name)
    run = openml.runs.run_model_on_task(clf, task)
    print(run.get_metric_fn(accuracy_score))

Output (these are 10-fold CV tasks):

Running on task credit-approval
[0.928 0.884 0.841 0.768 0.913 0.884 0.884 0.841 0.899 0.884]
Running on task pc1
[0.955 0.919 0.946 0.955 0.937 0.973 0.919 0.928 0.919 0.918]

You can also choose to directly share the result on OpenML with run.publish()

Disclaimer: I am one of the core developers of OpenML

  • Thank you for sharing! Actually I also described an OpenML example in my own answer, but your code provides another look. Do I understand correctly that OpenML has its own servers and one can run a task on their side, not locally? – Alexander Chervov Oct 20 '20 at 21:41
  • @AlexanderChervov Currently, OpenML doesn't allow running experiments server-side. It does make experiments portable so that anyone can run them on any hardware, and compare/share the results with others. – Joaquin Vanschoren Oct 20 '20 at 23:41