I have a dataset with a large number of features (it comes from a document-oriented database). 80% of the features are empty for 80% of the records and are filled in only under specific conditions. Here is an example with an animals dataset:

  • Number_of_paws is filled for every animal where living_place = ground,
  • Coat_color is filled for every animal where aspect = coat,
  • Depth_in_water is filled only for fish, ...

How can I determine which subset of the data a new, unknown feature is related to? For example, imagine a feature %something_unknown that is empty 98% of the time; I want to discover that it is filled in only when Animal_color = Red and Animal_type = fish.

I would say this is related to subset analysis. How should one proceed to solve this problem?

Brian Spiering
Rusoiba
    One can do the inverse procedure and determine the complementary features (co-features) of each known feature. For example, the feature number_of_paws has the co-feature living_place=ground, and so on. Then, once you have a map of features to co-features, you can do the inverse process. – Nikos M. Feb 20 '21 at 17:33

1 Answer

A contingency table might be useful. It displays the multivariate frequency distribution of categorical variables.
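
One possible way to apply this to the question is sketched below with pandas. It cross-tabulates a "filled / empty" indicator for the sparse feature against candidate categorical features. The toy rows are invented for illustration, the column names are borrowed from the question's example, and `status`, `table`, and `fill_rate` are hypothetical names; treat it as a minimal sketch rather than a complete method.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "Animal_type": ["fish", "fish", "fish", "bird", "dog", "fish"],
    "Animal_color": ["Red", "Red", "Blue", "Red", "Brown", "Red"],
    "something_unknown": [1.5, 2.0, None, None, None, 0.7],
})

# Indicator: is the sparse feature filled in on this row?
status = (
    df["something_unknown"]
    .notna()
    .map({True: "filled", False: "empty"})
    .rename("status")
)

# Contingency table: row counts per (Animal_type, Animal_color) subset,
# split by whether the sparse feature is filled or empty.
table = pd.crosstab([df["Animal_type"], df["Animal_color"]], status)

# Same table normalized per row: share of each subset that is filled.
fill_rate = pd.crosstab(
    [df["Animal_type"], df["Animal_color"]], status, normalize="index"
)

print(table)
print(fill_rate["filled"].sort_values(ascending=False))
```

Subsets whose fill rate is close to 1 while every other subset is close to 0 are candidates for the condition under which the feature is populated. With many candidate features, the same table can be built one feature (or one pair of features) at a time.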

Brian Spiering