Find subsets of data that are related in a large, sparse dataset

Question

I have a dataset with a large number of features (it comes from a document-oriented database). About 80% of the features are empty in 80% of the records and are filled in only under specific conditions. Here is an example with an animals dataset:

Number_of_paws is filled for every animal where living_place = ground,
Coat_color is filled for every animal where aspect = coat,
Depth_in_water is filled only for fish, ...

How can I determine which subset of the data a new, unknown feature is related to? Imagine a feature %something_unknown that is empty 98% of the time; I want to discover that it is filled in only when Animal_color = Red and Animal_type = fish.

I would say that this is related to subset analysis. How should one proceed to solve this problem?

One can run the inverse procedure: determine the complementary features (co-features) of each feature. For example, the feature number_of_paws has the co-feature living_place = ground, and so on. Once you have a map from features to co-features, you can apply the inverse process. - Nikos M., Feb 20 '21 at 17:33
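The mapping described in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data, the helper name `co_features`, and the strict rule "always filled inside the condition, never filled outside it" are all hypothetical choices, not something given in the thread. Real data would likely need a tolerance instead of exact 1.0 and 0.0 fill rates.

```python
import pandas as pd

# Toy data; column names follow the question, values are invented.
df = pd.DataFrame({
    "living_place": ["ground", "ground", "water", "air", "water"],
    "aspect": ["coat", "scales", "scales", "feathers", "coat"],
    "number_of_paws": [4, 4, None, None, None],
    "coat_color": ["brown", None, None, None, "grey"],
})

condition_cols = ["living_place", "aspect"]    # assumed to be always filled
sparse_cols = ["number_of_paws", "coat_color"]  # mostly empty features


def co_features(df, sparse_cols, condition_cols):
    """Map each sparse feature to the (column, value) pairs under which
    it is always filled and outside of which it is never filled."""
    mapping = {}
    for feature in sparse_cols:
        filled = df[feature].notna()
        pairs = []
        for col in condition_cols:
            # Fill rate of the sparse feature within each value of `col`.
            rate = filled.groupby(df[col]).mean()
            for value, r in rate.items():
                inside = df[col] == value
                if r == 1.0 and not filled[~inside].any():
                    pairs.append((col, value))
        mapping[feature] = pairs
    return mapping


mapping = co_features(df, sparse_cols, condition_cols)

# Inverse map: from a condition to the features it switches on.
inverse = {}
for feature, pairs in mapping.items():
    for pair in pairs:
        inverse.setdefault(pair, []).append(feature)

print(mapping)
print(inverse)
```

This version only checks single-column conditions; a condition that combines several columns (such as color and type together) would need the same check on grouped column combinations.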

score 0 · Answer 1 · answered Feb 20 '21 at 18:09

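A contingency table might be useful. It displays the multivariate frequency distribution of categorical variables. For this problem, one could cross-tabulate an indicator of whether the new feature is filled against the candidate categorical features; the combinations where the filled rows concentrate point to the related subset.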
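As a minimal sketch of how a contingency table could be applied here, the snippet below uses `pandas.crosstab` on an indicator of whether the sparse feature is filled. The toy values are invented for illustration, and the column name `something_unknown` stands in for the question's `%something_unknown`; only `Animal_color` and `Animal_type` come from the question itself.

```python
import pandas as pd

# Toy data; values are invented for illustration only.
df = pd.DataFrame({
    "Animal_type":  ["fish", "fish", "fish", "bird", "bird", "dog", "dog", "fish"],
    "Animal_color": ["Red", "Red", "Blue", "Red", "Blue", "Red", "Blue", "Blue"],
    "something_unknown": [1.5, 2.0, None, None, None, None, None, None],
})

# Indicator column: is the sparse feature filled in on this row?
status = (
    df["something_unknown"]
    .notna()
    .map({True: "filled", False: "empty"})
    .rename("status")
)

# Multi-way contingency table: one row per (type, color) combination,
# one column per status, cells hold row counts.
table = pd.crosstab([df["Animal_type"], df["Animal_color"]], status)

# Same table normalized per row: share of filled rows in each combination.
fill_rate = pd.crosstab(
    [df["Animal_type"], df["Animal_color"]], status, normalize="index"
)

print(table)
print(fill_rate)
```

Combinations whose `filled` share is close to 1 while all others are close to 0 are candidates for the condition that triggers the feature. With many candidate columns the number of combinations grows quickly, so in practice one would probably restrict the table to a few low-cardinality columns at a time.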

Brian Spiering
