Calculating correlation between two time variables

Question

I have a dataset that looks like this:

    UserID  App_open_Time(Hour_ofDay)   Email_Open_Time(hour_ofDay)
       1        1                         1
       2        1                         1
       3        1                         1
       4        1                         1
       5        1                         1
       6        2                         1
       7        2                         1
       8        3                         1
       9        3                         1
      10        3                         1

I would like to know whether App_open_Time is correlated with Email_Open_Time. How can I do this analysis in Python? I was planning to compute a Pearson correlation using NumPy. Is this the best approach?

score 2 · Answer 1 · edited Apr 09 '18 at 13:41

I don't think you can use Pearson's correlation, because it is meant for continuous variables. Your variables are ordinal, so a test like Spearman's would be more appropriate. However, I don't think tests for ordinal variables are appropriate either, because your variables are also cyclical: Hour_ofDay=23 and Hour_ofDay=1 are really just 2 hours apart, but Spearman's test would treat them as 22 hours apart.

I think in this case it would be more appropriate to look at the distribution of the distance (measured in hours) between the two variables. The appropriate distance metric is defined as follows (it was originally defined in the accepted answer to this other question):

import numpy as np
# a1, a2: NumPy arrays (or pandas Series) holding the hour of day, 0 to 23
distance = np.sign(a1 - a2) * (12 - np.abs(np.abs(a1 - a2) - 12))

Where a1 and a2 are your App and Email opening time variables. Note that the variables need to be in the range 0 to 23 for this distance to work.
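As a minimal sketch (the function names are illustrative, and the arrays are copied from the question's table), the formula can be wrapped in a function. One thing to be aware of: the sign is taken from the raw difference a1 - a2, so a pair that straddles midnight (app at hour 0, email at hour 23) comes out as -1 even though the email was opened one hour earlier. The second function is a modular variant, which is an assumption of this sketch and not part of the cited answer; it assigns +1 in that case. The two agree whenever the raw difference is smaller than 12 hours.

```python
import numpy as np

def answer_distance(a1, a2):
    """Formula from the answer: shorter way around the clock, signed by a1 - a2."""
    diff = np.asarray(a1, dtype=float) - np.asarray(a2, dtype=float)
    return np.sign(diff) * (12 - np.abs(np.abs(diff) - 12))

def modular_distance(a1, a2):
    """Assumed alternative: signed difference wrapped into the range [-12, 12)."""
    diff = np.asarray(a1, dtype=float) - np.asarray(a2, dtype=float)
    return (diff + 12) % 24 - 12

# Columns from the question's table (hour of day).
app_open = np.array([1, 1, 1, 1, 1, 2, 2, 3, 3, 3])
email_open = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

print(answer_distance(app_open, email_open))
print(modular_distance(app_open, email_open))

# A pair straddling midnight: app at hour 0, email at hour 23.
print(answer_distance(0, 23), modular_distance(0, 23))
```

If your data rarely crosses midnight the choice hardly matters; if it does, the modular variant keeps "email before app" on the positive side throughout.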

Compute this distance for each row, add it as a column to your DataFrame and plot it as a histogram. This histogram will tell you quite a lot about the "correlation" between the two variables.

For example:
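A rough sketch of these steps with pandas, using the rows from the question. The column names app_hour, email_hour and distance are placeholders. To stay free of plotting dependencies it tabulates the histogram counts with np.histogram; with matplotlib installed, df["distance"].plot.hist(bins=edges) should draw the same bins.

```python
import numpy as np
import pandas as pd

# Rows from the question's table; column names are placeholders.
df = pd.DataFrame({
    "app_hour":   [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    "email_hour": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

# Distance formula from the answer, stored as a new column.
diff = df["app_hour"] - df["email_hour"]
df["distance"] = np.sign(diff) * (12 - (diff.abs() - 12).abs())

# One-hour bins centered on the integers -12 ... 12.
edges = np.arange(-12.5, 13.5, 1.0)
counts, _ = np.histogram(df["distance"], bins=edges)
centers = (edges[:-1] + edges[1:]) / 2
histogram = pd.Series(counts, index=centers.astype(int))

# Show only the non-empty bins.
print(histogram[histogram > 0])
```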

No correlation: the histogram will be uniform between -12 and 12
Instantaneous correlation, i.e. user opens email and app at the same time: the histogram will have a peak at 0
Anticipated delay, i.e. user opens email 1 hour before app: the histogram will have a peak at 1
Delayed correlation, i.e. user opens email 1 hour after app: the histogram will have a peak at -1

This visualization will allow you to draw rich conclusions about the relation between app and email opening times.

Note: if your variables also include minutes and seconds in a date format, you have to convert them to numerical values first. E.g. 01:30 (hour 1 and 30 minutes) becomes 1.5. Also pay attention to the date format in case your times are expressed in the 12-hour clock (e.g. 6 PM, 1 AM).
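One possible way to do this conversion with the standard library is sketched below. The helper name to_decimal_hour and the format strings are assumptions about how your times are written, so adjust them to your data; parsing AM/PM with %p also depends on the locale.

```python
from datetime import datetime

def to_decimal_hour(text, fmt):
    """Parse a time string and return it as a fractional hour in [0, 24)."""
    t = datetime.strptime(text, fmt)
    return t.hour + t.minute / 60 + t.second / 3600

# 24-hour clock with minutes.
print(to_decimal_hour("01:30", "%H:%M"))

# 12-hour clock: %I reads the hour, %p reads AM/PM.
print(to_decimal_hour("6 PM", "%I %p"))
print(to_decimal_hour("1 AM", "%I %p"))
```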

score 0 · Answer 2 · answered Feb 17 '18 at 16:52

You can use the following code snippet:

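import pandas as pd
import matplotlib.pyplot as plt
cmap = plt.get_cmap('gnuplot')
# YOUR_TRAINING_DATA: DataFrame of features; YOUR_LABELS_OF_TRAINING: one label per row, used to color the points
scatter = pd.plotting.scatter_matrix(YOUR_TRAINING_DATA, c=YOUR_LABELS_OF_TRAINING, marker='o', s=40, hist_kwds={'bins': 15}, figsize=(12, 12), cmap=cmap)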

It draws a scatter plot for every pair of features, plus a histogram of each feature on its own along the diagonal. It is like a visual version of the correlation matrix. You can take a look here.

score 0 · Answer 3 · answered May 09 '18 at 16:29

Use pandas to handle tables efficiently in Python. Pandas has a tool to calculate the correlation between two Series, or between two columns of a DataFrame. Assuming you have your data in a CSV file, you can read it and calculate the correlation this way:

import pandas as pd
data = pd.read_csv("my_file.csv")
correlation = data["col1"].corr(data["col2"], method="pearson")

You can also choose the method used to calculate the correlation from among these:

-pearson

-kendall

-spearman
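One caveat when applying this to the sample in the question: Email_Open_Time is 1 in every row, and Pearson's coefficient is undefined for a column with zero variance, so pandas returns NaN there. A small sketch, with made-up column names and a second, purely hypothetical data set added for contrast:

```python
import pandas as pd

# Rows from the question's table; column names are placeholders.
question = pd.DataFrame({
    "app_hour":   [1, 1, 1, 1, 1, 2, 2, 3, 3, 3],
    "email_hour": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})
# email_hour never varies, so the coefficient is undefined (NaN).
r_question = question["app_hour"].corr(question["email_hour"], method="pearson")

# Hypothetical data (not from the question): email opened one hour after the app.
hypothetical = pd.DataFrame({
    "app_hour":   [8, 9, 12, 18, 21],
    "email_hour": [9, 10, 13, 19, 22],
})
r_hypothetical = hypothetical["app_hour"].corr(hypothetical["email_hour"], method="pearson")

print(r_question, r_hypothetical)
```

So before reading anything into the coefficient, it is worth checking that both columns actually vary.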
