0

I have a scenario where I have to identify employees, who when take sick or any other paid leave his/her colleague (any other employee) gets overtime.

My data set is as follows:

employee_code----period---Leave_hours--- Overtime hours

How to identify the relation between overtime hours and leave_hours between employees?

Which machine learning model can be used to get employee wise relation...?

Stephen Rauch
  • 1,783
  • 11
  • 22
  • 34
Zaka
  • 9
  • 1
  • 1
    You could start by aggregating overtime and leave hours by period and doing a linear regression. This would only find relations for all employes combined. If you had more detail information (e.g. which employes sit together), you could define a graph and try to do linear regression one employee vs her neighbours in the graph. – Valentas Dec 10 '18 at 12:52
  • 1
    I'm down-voting this question. Please don't post homework assignments :-) – I_Play_With_Data Jan 09 '19 at 13:17

3 Answers3

1

I don't see why you need a model at all. This seems like a simple counting problem. Here is SQL that finds pairs where one employee takes leave and another gets overtime:

select leave.employee_code, overtime.employee_code, count(*) as occurrences
from time_off leave
join time_off overtime
on leave.employee_id = overtime.employee_id 
   and leave.period = overtime.period
where leave.leave_hours > 0
  and overtime.overtime_hours > 0
group by leave.employee_code, overtime.employee_code
order by occurrences desc

In dplyr:

time_off %>%
  filter(leave_hours > 0) %>%
  select(employee_code) %>%
  inner_join(time_off %>% 
               filter(overtime_hours) %>%
               select(employee_code), 
             by=c("employee_code", "period"),
             suffix=c("_leave","_overtime")) %>%
  group_by(employee_code_leave, employee_code_overtime) %>%
  summarize(n=length(employee_code_leave)) %>%
  arrange(desc(n))

Note that I haven't run this code, so it might have typos or bugs, but it should be close to correct.

To explore the relationship between leave hours and overtime hours, you don't even need the employee pairs, you can just aggregate by period and make a scatter plot.

John Rauser
  • 221
  • 1
  • 4
  • Note that if you want triples or 4-tuples, ... you're just solving the frequent item-set problem, with a filter applied at the end. See: https://en.wikipedia.org/wiki/Association_rule_learning for this more general problem. – John Rauser Jan 09 '19 at 18:10
0

You should start with linear regression model. Do a bi-variate analysis to check the relationship between overtime hours and leave_hours between employees.

Better to draw a heat map for more info. Check dispersion for variance.

Sunil
  • 234
  • 2
  • 8
  • I think you missed that the question was about overtime of one employee vs overtime of another employee. – Valentas Dec 10 '18 at 12:46
0

Can you simply look at all correlations at once?

from scipy.cluster.vq import kmeans,vq
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np

# get only numeric features from DF
df = df.sample(frac=0.1, replace=True, random_state=1)
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
newdf = df.select_dtypes(include=numerics)

for col in newdf.columns: 
    print(col) 

# Compute the correlation matrix
# no statistically significant correlations between any numeric features...
corrmat = newdf.corr()
top_corr_features = corrmat.index
plt.figure(figsize=(10,10))
#plot heat map
g=sns.heatmap(newdf[top_corr_features].corr(),annot=True,cmap="RdYlGn")
ASH
  • 615
  • 3
  • 9