Visualize frequency of 5 Boolean variables together

Question

I have a data set with 5 variables,

a b c d e
1 0 0 1 0
0 1 0 1 1
0 1 1 0 0
0 0 0 1 0
1 1 1 0 0
0 1 1 0 1
1 0 1 0 0
1 0 0 1 1
0 1 0 1 1
0 0 1 1 0

I am only interested in the percentages of occurrence,

| a  | b  | c  | d  | e  |
| .4 | .5 | .5 | .6 | .4 |

However, I would like to visualize the data in such a way that I can see the overlap (or lack of it) among all the different groups.

Any idea?

so something like a frequency graph for all possible combinations? like a, b, c, d, e, ab, ac.....? — sai, Oct 01 '20 at 16:39
Well, yes. But that's the problem, how to visualize 32 combinations. — myradio, Oct 01 '20 at 16:47

score 3 · Accepted Answer · answered Oct 06 '20 at 05:19

If you have richer data (i.e., more than 10 rows), you will want an UpSet plot. Like a Venn diagram, an UpSet plot shows set overlaps in an intuitive way, but it is more useful for 4+ categories.

Some references that may give you ideas, with implementations in R:

https://cran.r-project.org/web/packages/UpSetR/vignettes/basic.usage.html (attached image from r-project.org).
https://www.littlemissdata.com/blog/set-analysis
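
The core of an UpSet plot is a count of rows for each exact combination of variables. As a rough sketch of that table (not the plot itself), the snippet below tallies the ten rows from the question using only the Python standard library and prints a text version of the dot matrix. The names `rows`, `names`, and `intersections` are illustrative choices, not part of any UpSet library.

```python
from collections import Counter

# The ten rows from the question, columns a..e.
rows = [
    (1, 0, 0, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0), (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0), (0, 1, 1, 0, 1), (1, 0, 1, 0, 0), (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1), (0, 0, 1, 1, 0),
]
names = "abcde"

# Count each exact combination of 0/1 values (the "intersections" of an UpSet plot).
intersections = Counter(rows)

# Header, then one line per observed combination, most frequent first:
# 'x' marks a variable that is 1, '.' one that is 0, followed by the row count.
print(" ".join(names), " count")
for combo, count in intersections.most_common():
    marks = " ".join("x" if bit else "." for bit in combo)
    print(marks, f" {count}  " + "#" * count)
```

With only ten rows almost every combination is unique, which is why the answer suggests this kind of plot mainly for richer data.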

score 1 · Answer 2 · answered Oct 01 '20 at 18:10

Since the combinations are known, we can use some knowledge of binary numbers to come up with a frequency plot.

Basically: convert each binary string to an integer and build a frequency plot from the integer values.

import numpy as np
import pandas as pd
from itertools import product
import matplotlib.pyplot as plt

# test data, 1 of every 32 combinations
# (product() returns an iterator, so materialize it with list() before building the array)
combs = np.array(list(product([0, 1], repeat=5)))
# store in dataframe
df = pd.DataFrame(data={'a': combs[:, 0], 'b': combs[:, 1], 'c': combs[:, 2], 'd': combs[:, 3], 'e': combs[:, 4]})
# concatenate the binary sequences to strings
df['concatenate'] = df[list('abcde')].astype(str).apply(''.join, axis=1)

# to convert binary strings to integers
def int2(x):
    return int(x, 2)

# every combination has a unique value
df['unique_values'] = df['concatenate'].apply(int2)
# prepare labels for the frequency plot
variables = list('abcde')
labels = []
for combination in df.concatenate:
    tmp = ''.join([variables[i] for i, x in enumerate(combination) if x != '0'])
    labels.append(tmp)
fig, ax = plt.subplots()
counts, bins, patches = ax.hist(df.unique_values, bins=32, rwidth=0.8)
# turn off the top ticks and the default bottom tick labels
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    top=False,         # ticks along the top edge are off
    labelbottom=False)
# calculate the bin centers
bin_centers = 0.5 * np.diff(bins) + bins[:-1]
ax.set_xticks(bin_centers)
for label, x in zip(labels, bin_centers):
    # replace integer mapping with the labels
    ax.annotate(str(label), xy=(x, 0), xycoords=('data', 'axes fraction'),
        xytext=(0, -5), textcoords='offset points', va='top', ha='center', rotation=30)
plt.show()
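
The code above runs on test data with one row per combination. As a small sketch of the same binary-to-integer idea applied to the ten rows from the question, the following uses only the standard library. The helper names `encode` and `decode` are hypothetical and do not appear in the answer.

```python
from collections import Counter

# The ten rows from the question, columns a..e.
rows = [
    (1, 0, 0, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0), (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0), (0, 1, 1, 0, 1), (1, 0, 1, 0, 0), (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1), (0, 0, 1, 1, 0),
]
names = "abcde"

def encode(row):
    """Read the row as a binary number, e.g. (1, 0, 0, 1, 0) -> '10010' -> 18."""
    return int("".join(str(bit) for bit in row), 2)

def decode(value, names=names):
    """Turn the integer back into a label listing the variables that are 1."""
    bits = format(value, f"0{len(names)}b")
    return "".join(n for n, b in zip(names, bits) if b == "1")

# Frequency of each encoded combination, printed in increasing integer order.
freq = Counter(encode(row) for row in rows)
for value in sorted(freq):
    print(value, decode(value), freq[value])
```

Because each combination maps to a distinct integer between 0 and 31, the counts in `freq` are the bar heights of the frequency plot, and `decode` recovers the axis labels.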

+1 Indeed this is a possibility but I was looking for something simpler to visualize. — myradio, Oct 03 '20 at 16:14

score 1 · Answer 3 · answered Oct 07 '20 at 02:31

With Wolfram Language you may use AbsoluteCorrelation.

With

t = {
     {1, 0, 0, 1, 0}, {0, 1, 0, 1, 1}, 
     {0, 1, 1, 0, 0}, {0, 0, 0, 1, 0}, 
     {1, 1, 1, 0, 0}, {0, 1, 1, 0, 1}, 
     {1, 0, 1, 0, 0}, {1, 0, 0, 1, 1}, 
     {0, 1, 0, 1, 1}, {0, 0, 1, 1, 0}
    }

Then

MatrixForm[ac = AbsoluteCorrelation[t]]

Here the diagonal entries are the marginal column frequencies and the off-diagonal entries are the joint frequencies. That is, ac[[1,1]] shows that variable a occurs with frequency 0.4, and ac[[1,2]] (row 1, column 2) shows that variable a occurs jointly with variable b with frequency 0.1.

This can be visualised with MatrixPlot or ArrayPlot.

MatrixPlot[
 ac 
 , FrameTicks -> {Transpose@{Range@5, CharacterRange["a", "e"]}}
 , PlotLegends -> Automatic]
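
For readers without Wolfram Language, the joint-frequency matrix described above can be sketched in plain Python. This is only an approximation of the idea, assuming the matrix entry for columns i and j is the mean of their element-wise product, which for 0/1 data is the fraction of rows where both are 1. The function name `joint_frequency_matrix` is hypothetical.

```python
# The ten rows from the question, columns a..e.
rows = [
    (1, 0, 0, 1, 0), (0, 1, 0, 1, 1), (0, 1, 1, 0, 0), (0, 0, 0, 1, 0),
    (1, 1, 1, 0, 0), (0, 1, 1, 0, 1), (1, 0, 1, 0, 0), (1, 0, 0, 1, 1),
    (0, 1, 0, 1, 1), (0, 0, 1, 1, 0),
]
names = "abcde"

def joint_frequency_matrix(rows):
    """Return M where M[i][j] is the fraction of rows with columns i and j both 1."""
    n = len(rows)
    k = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(k)]
            for i in range(k)]

# Python indices start at 0, so ac[0][1] corresponds to ac[[1,2]] in the answer.
ac = joint_frequency_matrix(rows)

# Print the matrix with row and column labels.
print("  " + "   ".join(names))
for name, line in zip(names, ac):
    print(name, " ".join(f"{v:.1f}" for v in line))
```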

Hope this helps.

edited Oct 07 '20 at 02:43

answered Oct 07 '20 at 02:31

Edmund


But isn't this a pairwise correlation? So, the matrix has 25 elements which are pair combinations of occurrence but has no information beyond pairwise. – myradio Oct 07 '20 at 06:18
@myradio Correct. I took your "overlap" to mean joint frequency. You only need the upper or lower triangle of the matrix since it is symmetric. – Edmund Oct 07 '20 at 09:56
