Given website access data in the form session_id, ip, user_agent
, and optionally timestamp, following the conditions below, how would you best cluster the sessions into unique visitors?
session_id
: is an id given to every new visitor. It does not expire, however if the user doesn't accept cookies/clears cookies/changes browser/changes device, he will not be recognised anymore
IP
can be shared between different users (Imagine a free wi-fi cafe, or your ISP reassigning IPs), and they will often have at least 2, home and work.
User_agent
is the browser+OS version, allowing to distinguish between devices. For example a user is likely to use both phone and laptop, but is unlikely to use windows+apple laptops. It is unlikely that the same session id has multiple useragents.
Data might look as the fiddle here: http://sqlfiddle.com/#!2/c4de40/1
Of course, we are talking about assumptions, but it's about getting as close to reality as possible. For example, if we encounter the same ip and useragent in a limited time frame with a different session_id, it would be a fair assumption that it's the same user, with some edge case exceptions.
Edit: Language in which the problem is solved is irellevant, it's mostly about logic and not implementation. Pseudocode is fine.
Edit: due to the slow nature of the fiddle, you can alternatively read/run the mysql:
select session_id, floor(rand()*256*256*256*256) as ip_num , floor(rand()*1000) as user_agent_id
from
(select 1+a.nr+10*b.nr as session_id, ceil(rand()*3) as nr
from
(select 1 as nr union all select 2 union all select 3 union all select 4 union all select 5
union all select 6 union all select 7 union all select 8 union all select 9 union all select 0)a
join
(select 1 as nr union all select 2 union all select 3 union all select 4 union all select 5
union all select 6 union all select 7 union all select 8 union all select 9 union all select 0)b
order by 1
)d
inner join
(select 1 as nr union all select 2 union all select 3 union all select 4 union all select 5
union all select 6 union all select 7 union all select 8 union all select 9 )e
on d.nr>=e.nr