
I have a data set with 5 binary variables:

a b c d e
1 0 0 1 0
0 1 0 1 1
0 1 1 0 0
0 0 0 1 0
1 1 1 0 0
0 1 1 0 1
1 0 1 0 0
1 0 0 1 1
0 1 0 1 1
0 0 1 1 0

I am only interested in the fraction of rows in which each variable occurs:

a  | b  | c  | d  | e
.4 | .5 | .5 | .6 | .4
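For reference, these fractions can be reproduced with a short sketch in plain Python. The names `rows`, `names` and `occurrence` are illustrative and not part of the original post.

```python
# The ten rows from the question, columns a..e.
rows = [
    (1, 0, 0, 1, 0),
    (0, 1, 0, 1, 1),
    (0, 1, 1, 0, 0),
    (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0),
    (0, 1, 1, 0, 1),
    (1, 0, 1, 0, 0),
    (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1),
    (0, 0, 1, 1, 0),
]
names = "abcde"

# zip(*rows) transposes the data, so each `col` holds one variable's values.
# The fraction of ones in a column is that variable's occurrence.
occurrence = {name: sum(col) / len(rows) for name, col in zip(names, zip(*rows))}

print(occurrence)
```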

However, I would like to visualize the data in such a way that I can see how much the different groups overlap (or do not overlap).

Any ideas?

myradio

3 Answers


If you have richer data (i.e. more than 10 rows), you will want an upset plot. Upset plots present set overlaps in an intuitive way, like a Venn diagram, but are more useful for 4+ categories.
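An upset plot draws one bar per exact combination of variables. As a minimal sketch of the counts behind those bars (not an upset plot itself), the question's data can be tallied with the standard library. The helper `label` and the other names are illustrative assumptions.

```python
from collections import Counter

# The ten rows from the question, columns a..e.
rows = [
    (1, 0, 0, 1, 0),
    (0, 1, 0, 1, 1),
    (0, 1, 1, 0, 0),
    (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0),
    (0, 1, 1, 0, 1),
    (1, 0, 1, 0, 0),
    (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1),
    (0, 0, 1, 1, 0),
]
names = "abcde"


def label(row):
    """Name the variables that are 1 in this row, e.g. (1, 0, 0, 1, 0) -> 'ad'."""
    return "".join(n for n, x in zip(names, row) if x) or "(none)"


# One entry per exact combination that appears in the data.
combo_counts = Counter(label(r) for r in rows)

# Crude text bar chart, most frequent combination first.
for combo, count in combo_counts.most_common():
    print(f"{combo:>6} {'#' * count} {count}")
```

With only ten rows almost every combination appears once (only `bde` appears twice), which is why this kind of plot becomes informative mainly with more data.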

Some references which may give you some ideas and implementation in R:

Timothy Chan

Since the possible combinations are known, we can treat each row as a binary number and use that to come up with a frequency plot.

Basically: convert each row's binary string to an integer and draw a frequency plot of the integer values.

import numpy as np
import pandas as pd
from itertools import product
import matplotlib.pyplot as plt

# test data: one row for each of the 32 possible combinations
# (list(...) is needed in Python 3, where product() returns an iterator)
combs = np.array(list(product([0, 1], repeat=5)))

# store in a dataframe
df = pd.DataFrame(data={'a': combs[:, 0], 'b': combs[:, 1], 'c': combs[:, 2], 'd': combs[:, 3], 'e': combs[:, 4]})

# concatenate the binary sequences to strings
df['concatenate'] = df[list('abcde')].astype(str).apply(''.join, axis=1)

# convert binary strings to integers
def int2(x):
    return int(x, 2)

# every combination has a unique value
df['unique_values'] = df['concatenate'].apply(int2)

# prepare labels for the frequency plot
variables = list('abcde')
labels = []
for combination in df.concatenate:
    tmp = ''.join([variables[i] for i, x in enumerate(combination) if x != '0'])
    labels.append(tmp)

fig, ax = plt.subplots()
counts, bins, patches = ax.hist(df.unique_values, bins=32, rwidth=0.8)

# turn off the default x-axis tick labels
plt.tick_params(
    axis='x',           # changes apply to the x-axis
    which='both',       # both major and minor ticks are affected
    top=False,          # ticks along the top edge are off
    labelbottom=False)  # labels along the bottom edge are off

# calculate the bin centers
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
ax.set_xticks(bin_centers)
for label, x in zip(labels, bin_centers):
    # replace integer mapping with the labels
    ax.annotate(str(label), xy=(x, 0), xycoords=('data', 'axes fraction'),
                xytext=(0, -5), textcoords='offset points', va='top', ha='center', rotation=30)

plt.show()
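The test data above has exactly one row per combination, so the per-row labels line up with the bins. With real data, such as the ten rows in the question, some combinations repeat and others never occur, so it may be safer to derive each label from its integer code. A minimal sketch of that idea, using only the standard library (the helpers `encode` and `decode` are illustrative names, not part of the code above):

```python
from collections import Counter

# The ten rows from the question, columns a..e.
rows = [
    (1, 0, 0, 1, 0),
    (0, 1, 0, 1, 1),
    (0, 1, 1, 0, 0),
    (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0),
    (0, 1, 1, 0, 1),
    (1, 0, 1, 0, 0),
    (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1),
    (0, 0, 1, 1, 0),
]
names = "abcde"


def encode(row):
    """Read the row as a binary number; the leftmost column is the most significant bit."""
    return int("".join(str(x) for x in row), 2)


def decode(code):
    """Turn an integer code back into a label such as 'ad'."""
    bits = format(code, f"0{len(names)}b")  # zero-padded binary string, e.g. '10010'
    return "".join(n for n, b in zip(names, bits) if b == "1")


# Count how often each integer code occurs, then label the codes themselves.
code_counts = Counter(encode(r) for r in rows)
labelled = {decode(code): count for code, count in sorted(code_counts.items())}

print(labelled)
```

The keys of `labelled` can then be used as tick labels for a bar chart of the values, so that labels and bars stay matched even when combinations are missing or repeated.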


sai

With Wolfram Language you may use AbsoluteCorrelation.

With

t = {
     {1, 0, 0, 1, 0}, {0, 1, 0, 1, 1}, 
     {0, 1, 1, 0, 0}, {0, 0, 0, 1, 0}, 
     {1, 1, 1, 0, 0}, {0, 1, 1, 0, 1}, 
     {1, 0, 1, 0, 0}, {1, 0, 0, 1, 1}, 
     {0, 1, 0, 1, 1}, {0, 0, 1, 1, 0}
    }

Then

MatrixForm[ac = AbsoluteCorrelation[t]] 


Here the diagonal entries are the marginal column frequencies and the off-diagonal entries are the joint frequencies. For example, ac[[1,1]] shows that variable a occurs with frequency 0.4, and ac[[1,2]] (row 1, column 2) shows that variables a and b occur jointly with frequency 0.1.
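For readers without Mathematica, a matrix with the same interpretation can be sketched in Python, assuming NumPy is available. This follows the description above for 0/1 data (marginal frequencies on the diagonal, joint frequencies elsewhere) and is not meant as a general reimplementation of AbsoluteCorrelation. The names `X` and `joint` are illustrative.

```python
import numpy as np

# The ten rows from the question, columns a..e.
X = np.array([
    [1, 0, 0, 1, 0], [0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0], [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 0], [0, 1, 1, 0, 1],
    [1, 0, 1, 0, 0], [1, 0, 0, 1, 1],
    [0, 1, 0, 1, 1], [0, 0, 1, 1, 0],
])

n_rows = X.shape[0]

# (X.T @ X)[i, j] counts the rows where columns i and j are both 1;
# dividing by the number of rows turns the counts into frequencies.
joint = X.T @ X / n_rows

print(joint)
```

Note that Python indexes from 0, so `joint[0, 0]` corresponds to ac[[1,1]] and `joint[0, 1]` to ac[[1,2]].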

This can be visualised with MatrixPlot or ArrayPlot.

MatrixPlot[
 ac 
 , FrameTicks -> {Transpose@{Range@5, CharacterRange["a", "e"]}}
 , PlotLegends -> Automatic]


Hope this helps.

Edmund
  • But isn't this a pairwise correlation? So, the matrix has 25 elements which are pair combinations of occurrence but has no information beyond pairwise. – myradio Oct 07 '20 at 06:18
  • @myradio Correct. I took your "overlap" to mean joint frequency. You only need the upper or lower triangle of the matrix since it is symmetric. – Edmund Oct 07 '20 at 09:56
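
On the point raised in the comments: joint frequencies are not limited to pairs. As a rough sketch in plain Python, the same idea extends to any subset of variables by counting the rows in which all of them are 1. The helper `joint_frequency` is a hypothetical name introduced here for illustration.

```python
from itertools import combinations

# The ten rows from the question, columns a..e.
rows = [
    (1, 0, 0, 1, 0),
    (0, 1, 0, 1, 1),
    (0, 1, 1, 0, 0),
    (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0),
    (0, 1, 1, 0, 1),
    (1, 0, 1, 0, 0),
    (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1),
    (0, 0, 1, 1, 0),
]
names = "abcde"


def joint_frequency(subset):
    """Fraction of rows in which every variable in `subset` equals 1."""
    idx = [names.index(n) for n in subset]
    return sum(all(row[i] for i in idx) for row in rows) / len(rows)


# Joint frequency for every triple of variables.
triples = {"".join(s): joint_frequency(s) for s in combinations(names, 3)}

# Print only the triples that actually occur together, most frequent first.
for combo, freq in sorted(triples.items(), key=lambda kv: -kv[1]):
    if freq > 0:
        print(combo, freq)
```

Changing the `3` in `combinations(names, 3)` gives the frequencies for subsets of any other size.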