
I run a fairly small website, and I'd like to have some analytics about the traffic it gets from regular users while continuing to allow bots to scrape it. I don't care enough to invest in detecting bad-acting bots masquerading as real users (nor can I think of a reason my site would be the target of such bots).

My quick-and-dirty approach has been to look at all of the user agents I've received, eyeball which ones seem like bots, and put them in a manually maintained list. This seems error-prone, and I have to update the list whenever a bot changes its string or a new one discovers my site.
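
For concreteness, here's roughly what that approach looks like (a simplified sketch in Python; the bot substrings and the log parsing are made up for illustration, not my real setup):

    # Hand-maintained list of substrings that mark a user agent as a bot.
    # These particular entries are just examples, not my actual list.
    BOT_SUBSTRINGS = ["Googlebot", "bingbot", "DuckDuckBot", "Baiduspider"]

    def is_bot(user_agent: str) -> bool:
        """Return True if the user agent matches the hand-maintained list."""
        ua = user_agent.lower()
        return any(s.lower() in ua for s in BOT_SUBSTRINGS)

    def count_human_hits(log_lines):
        """Count requests whose user agent doesn't look like a known bot."""
        human = 0
        for line in log_lines:
            # Naively take the last quoted field as the user agent, as in the
            # common "combined" log format; real parsing needs more care.
            ua = line.rsplit('"', 2)[-2] if line.count('"') >= 2 else ""
            if not is_bot(ua):
                human += 1
        return human

Every new bot, or any change to an existing bot's string, means editing that list by hand, which is exactly the upkeep I'd like to avoid.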

Is there a better approach, in terms of upkeep and error-proneness?

I guess this also gets into a broader question: if there isn't a standard way for a well-behaved bot to identify itself as such, why not? Have there been proposals to make this part of some spec?

Joe K
  • You could use a user agent list, like this. ... Though, I'm under the impression that many analytics packages and services maintain their own lists and already filter bots out for you. E.g., Google analytics ... – svidgen Sep 20 '19 at 00:19
  • 1
    Malicious bots are rarely targeting "important" sites, they mostly want to find vulnerable sites to exploit. This means that your small website will most likely receive proportionally less " legitimate" crawler traffic than big, well-referenced sites. – Hans-Martin Mosner Sep 20 '19 at 05:22
  • It's probably easier to make a whitelist of commonly used web browsers than to make a blacklist of bots. If you look only at the name and not the whole user agent string, top 10 will probably make up 95% of your bot traffic. – Michał Kosmulski Sep 20 '19 at 12:32
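
A minimal sketch of the browser-whitelist idea from the last comment (the token list and function are assumptions for illustration, not something from the comment itself):

    # Whitelist of product tokens that common browsers include in their user
    # agent strings. Caveat: some crawlers (e.g. Googlebot's smartphone
    # variant) also embed "Chrome/... Safari/...", so a small bot blacklist
    # in front of this check is still useful.
    KNOWN_BROWSER_TOKENS = ("Firefox/", "Chrome/", "Safari/", "Edg/", "OPR/")

    def looks_like_browser(user_agent: str) -> bool:
        """Count a request as human if its user agent names a common browser."""
        return any(token in user_agent for token in KNOWN_BROWSER_TOKENS)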
