Have you ever found yourself in the situation where you had no or
anonymized logs and still wanted to figure out where your traffic was
coming from?
Or you have multiple upstreams and are looking to see if you can save
fees by getting into peering agreements with some other party?
Or your site is getting heavy load but you can't pinpoint it on a
single IP and you suspect some amoral corporation is training their
degenerate AI on your content with a bot army?
(You might be onto something there.)
If that rings a bell, read on.
TL;DR:
... or just skip the cruft and install asncounter:
pip install asncounter
Also available in Debian 14 or later, or possibly in Debian 13
backports (soon to be released) if people are interested:
apt install asncounter
Then count whoever is hitting your network with:
awk '{print $2}' /var/log/apache2/*access*.log | asncounter
or:
tail -F /var/log/apache2/*access*.log | awk '{print $2}' | asncounter
or:
tcpdump -q -n | asncounter --input-format=tcpdump --repl
or:
tcpdump -q -i eth0 -n -Q in "tcp and tcp[tcpflags] & tcp-syn != 0 and (port 80 or port 443)" | asncounter --input-format=tcpdump --repl
Read on for why this matters, and why I wrote yet another weird tool
(almost) from scratch.
Background and manual work
This is a tool I've been dreaming of for a long, long time. Back in
2006, at Koumbit, a colleague had set up TAS ("Traffic
Accounting System", "Система учета трафика" in Russian, apparently), a
collection of Perl scripts that would do per-IP accounting. It was
pretty cool: it would count bytes per IP address and, from that, you
could do analysis. But the project died, and it was kind of bespoke.
Fast forward twenty years, and I find myself fighting off bots at the
Tor Project (the irony...), with our GitLab suffering pretty bad
slowdowns (see issue tpo/tpa/team#41677 for the latest public
issue; the juicier one is confidential, unfortunately).
(We did have some issues caused by overloads in CI, as we host, after
all, a fork of Firefox, which is a massive repository, but the
applications team did sustained, awesome work to fix issues on that
side, again and again (see tpo/applications/tor-browser#43121 for
the latest, and tpo/applications/tor-browser#43121 for some
pretty impressive correlation work; I work with really skilled
people). But those issues, I believe, were fixed.)
So I had the feeling it was our turn to get hammered by the AI
bots. But how do we tell? I could tell something was hammering at
the costly /commit/ and (especially costly) /blame/ endpoints. So
at first, I pulled out the trusted awk, sort | uniq -c | sort -n | tail
pipeline I am sure others have worked out before:
awk '{print $1}' /var/log/nginx/*.log | sort | uniq -c | sort -n | tail -10
For people new to this, that pulls the first field out of web server
log files, sorts the list, counts the number of occurrences of each
unique entry, and sorts that numerically so the most common entries
(or IPs) end up last, then shows the last 10, which are the top 10.
That, in other words, answers the question of "which IP address visits
this web server the most?" Based on this, I found a couple of IP
addresses that looked like Alibaba. I had already addressed an abuse
complaint to them (tpo/tpa/team#42152) but never got a response,
so I just blocked their entire network blocks, rather violently:
for cidr in 47.240.0.0/14 47.246.0.0/16 47.244.0.0/15 47.235.0.0/16 47.236.0.0/14; do
iptables-legacy -I INPUT -s $cidr -j REJECT
done
That made Ali Baba and his forty thieves (specifically their
AL-3 network) go away, but our load was still high, and I was
still seeing various IPs crawling the costly endpoints. And this time,
it was hard to tell who they were: you'll notice all the Alibaba IPs
are inside the same 47.0.0.0/8 prefix. Although it's not a /8
itself, it's all inside the same prefix, so it's visually easy to
pick apart, especially for a brain like mine that's stared too long
at logs flowing by too fast for its own mental health.
What I had then was different, and I was tired of doing the stupid
thing I had been doing for decades at this point. I had stumbled
upon pyasn recently (in January, according to my notes) and somehow
found it again, and thought "I bet I could write a quick
script that loops over IPs and counts IPs per ASN".
(Obviously, there are lots of other tools out there for that kind of
monitoring. Argos, for example, presumably does this, but it's kind
of a huge stack. You can also get into netflows, but there are serious
privacy implications with those. There are also lots of per-IP
counters like promacct, but that doesn't scale.
Or maybe someone already solved this problem and I just wasted a
week of my life, who knows. Someone will let me know, I hope, either
way.)
ASNs and networks
A quick aside, for people not familiar with how the internet
works. People who know about ASNs, BGP announcements and so on can
skip.
The internet is the network of networks. It's made of multiple
networks that talk to each other. The way this works is that there is a
Border Gateway Protocol (BGP), a relatively simple TCP-based protocol,
that the edge routers of those networks use to announce to each other
which networks they manage. Each of those networks is called an
Autonomous System (AS) and has an AS number (ASN) to uniquely identify
it. Just like IP addresses, ASNs are allocated by IANA and local
registries; they're pretty cheap and useful, so if you like running your
own routers, get one.
When you have an ASN, you'll use it to, say, announce to your BGP
neighbors "I have 198.51.100.0/24 over here" and the others might
say "okay, and I have 216.90.108.31/19 over here, and I know of this
other ASN over there that has 192.0.2.1/24 too!" And gradually, those
announcements flood the entire network, and you end up with each BGP
router having a routing table of the global internet, with a map of which
network block, or "prefix", is announced by which ASN.
It's how the internet works, and it's a useful thing to know, because
it's what, ultimately, makes an organisation responsible for an IP
address. There are "looking glass" tools like the one provided by
routeviews.org which allow you to effectively run "trace routes"
(not the same as traceroute, which actively sends probes from
your location): type an IP address in that form to fiddle with it. You
will end up with an "AS path", the way to get from the looking glass
to the announced network. But I digress, and that's kind of out of
scope.
Point is, the internet is made of networks, networks are autonomous
systems (AS) and they have numbers (ASNs), and they announce IP
prefixes (or "network blocks") that ultimately tell you who is
responsible for traffic on the internet.
Introducing asncounter
So my goal was to get from "lots of IP addresses" to "list of ASNs",
possibly also the list of prefixes (because why not). Turns out pyasn
makes that really easy. I managed to build a prototype in probably
less than an hour: just look at the first version, it's 44 lines
(sloccount) of Python, and it works, provided you have already
downloaded the required datafiles from routeviews.org. (Obviously, the
latest version is longer, at close to 1000 lines, but it downloads the
data files automatically, and has many more features.)
The way the first prototype (and later versions too, mostly) works is
that you feed it a list of IP addresses on standard input; it looks up
the ASN and prefix associated with each IP, increments a counter
for those, then prints the result.
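For illustration, here's a minimal sketch of that loop (not the actual
asncounter code), assuming pyasn and an already-downloaded routeviews
datafile, whose name here is made up:
import sys
from collections import Counter
import pyasn

asndb = pyasn.pyasn("ipasn_20250523.dat")  # pre-downloaded datafile
asns, prefixes = Counter(), Counter()
for line in sys.stdin:
    try:
        # lookup() returns an (asn, prefix) tuple, (None, None) if unknown
        asn, prefix = asndb.lookup(line.strip())
    except ValueError:
        continue  # not a valid IP address, skip it
    asns[asn] += 1
    prefixes[prefix] += 1
for asn, count in asns.most_common(10):
    print(count, asn)
for prefix, count in prefixes.most_common(10):
    print(count, prefix)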
That showed me something like this:
root@gitlab-02:~/anarcat-scripts# tcpdump -q -i eth0 -n -Q in "(udp or tcp)" | ./asncounter.py --tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
INFO: collecting IPs from stdin, using datfile ipasn_20250523.1600.dat.gz
INFO: loading datfile /root/.cache/pyasn/ipasn_20250523.1600.dat.gz...
INFO: loading /root/.cache/pyasn/asnames.json
ASN count AS
136907 7811 HWCLOUDS-AS-AP HUAWEI CLOUDS, HK
[----] 359 [REDACTED]
[----] 313 [REDACTED]
8075 254 MICROSOFT-CORP-MSN-AS-BLOCK, US
[---] 164 [REDACTED]
[----] 136 [REDACTED]
24940 114 HETZNER-AS, DE
[----] 98 [REDACTED]
14618 82 AMAZON-AES, US
[----] 79 [REDACTED]
prefix count
166.108.192.0/20 1294
188.239.32.0/20 1056
166.108.224.0/20 970
111.119.192.0/20 951
124.243.128.0/18 667
94.74.80.0/20 651
111.119.224.0/20 622
111.119.240.0/20 566
111.119.208.0/20 538
[REDACTED] 313
Even without ratios and a total count (which would come later), it was
quite clear that Huawei was doing something big on the server. At that
point, it was responsible for a quarter to half of the traffic on our
GitLab server, or about 5-10 queries per second.
But just looking at the logs, or at per-IP hit counts, it was really hard
to tell: that traffic is really well distributed. If you look more
closely at the output above, you'll notice I redacted a couple of
entries, except for major providers, for privacy reasons. But you'll also
notice almost nothing is redacted in the prefix list. Why? Because
all of those networks are Huawei! Their announcements are kind of
bonkers: they have hundreds of such prefixes.
Now, clever people in the know will say "of course they do, it's a
hyperscaler; ASN14618 (AMAZON-AES) alone has way more
announcements, they have 1416 prefixes!" Yes, of course, but they are
not generating half of my traffic (at least, not yet). But even then:
this also applies to Amazon! This way of counting traffic is way
more useful for large-scale operations like this, because you group by
organisation instead of by server or individual endpoint.
And, ultimately, this is why asncounter matters: it allows you to
group your traffic by organisation, the place you can actually
negotiate with.
Now, of course, that assumes those are entities you can talk with. I
have written to both Alibaba and Huawei, and have yet to receive a
response. I assume I never will. In their defence, I wrote in English;
perhaps I should have made the effort of translating my message into
Chinese, but then again English is the lingua franca of the
Internet, and I doubt that's actually the issue.
The Huawei and Facebook blocks
Another aside, because this is my blog and I am not looking for a
Pulitzer here.
So I blocked Huawei from our GitLab server (and before you tear your
shirt open: only our GitLab server, everything else is still
accessible to them, including our email server, to respond to my
complaint). I did so 24h after emailing them, and after examining
their user agent (UA) headers. Boy, that was fun. In a sample of 268
requests I analyzed, they churned out 246 different UAs.
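(Counting those is a one-liner; a sketch, assuming the standard
combined log format, where the UA is the sixth double-quote-delimited
field:)
awk -F'"' '{print $6}' /var/log/apache2/*access*.log | sort -u | wc -l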
At first glance, they looked legit, like:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Safari on a Mac, so far so good. But when you start digging, you
notice some strange things, like here's Safari running on Linux:
Mozilla/5.0 (X11; U; Linux i686; en-US) AppleWebKit/534.3 (KHTML, like Gecko) Chrome/6.0.457.0 Safari/534.3
Was Safari ported to Linux? I guess that's... possible?
But here is Safari running on a 15-year-old Ubuntu release (10.10):
Mozilla/5.0 (X11; Linux i686) AppleWebKit/534.24 (KHTML, like Gecko) Ubuntu/10.10 Chromium/12.0.702.0 Chrome/12.0.702.0 Safari/534.24
Speaking of old, here's Safari again, but this time running on Windows
NT 5.1, AKA Windows XP, released 2001, EOL since 2019:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-CA) AppleWebKit/534.13 (KHTML like Gecko) Chrome/9.0.597.98 Safari/534.13
Really?
Here's Firefox 3.6, released 15 years ago; there were quite a lot of
those:
Mozilla/5.0 (Windows; U; Windows NT 6.1; lt; rv:1.9.2) Gecko/20100115 Firefox/3.6
I remember running those old Firefox releases, those were the days.
But to me, those look like entirely fake UAs, deliberately rotated to
make it look like legitimate traffic.
In comparison, Facebook seemed a bit more legit, in the sense that
they don't fake it. Most hits are from:
meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)
which, according to their documentation:
crawls the web for use cases such as training AI models or improving products by indexing content directly
From what I could tell, it was even respecting our rather liberal
robots.txt rules, in that it wasn't crawling the sprawling /blame/
or /commit/ endpoints, which are explicitly forbidden by robots.txt.
So I've blocked the Facebook bot in robots.txt and, amazingly, it
just went away. Good job, Facebook: as much as I think you've given the
empire to neo-nazis and caused depression and genocide, you know how to
run a crawler, thanks.
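For the record, the rule itself is tiny; a sketch, assuming Meta's
documented user agent token:
User-agent: meta-externalagent
Disallow: /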
Huawei was blocked at the web server level, with a friendly 429 status
code telling people to contact us (over email) if they need help. And
they don't care: they're still hammering the server, from what I can
tell, but then again, I didn't block the entire ASN just yet, just the
blocks I found crawling the server over a couple of hours.
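For the curious, here's a minimal sketch of what such a block can look
like in Nginx, using the geo module; the prefixes are samples from the
tables above, and the configuration is an assumption, not our actual one:
# in the http context: map client addresses to a flag
geo $blocked {
    default 0;
    94.74.80.0/20    1;  # sample prefixes from the tables above
    111.119.192.0/20 1;
}
server {
    # ... the rest of the server block ...
    if ($blocked) {
        # friendly 429, telling people to contact us
        return 429 "too many requests, contact us over email for help\n";
    }
}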
A full asncounter run
So what does a day in asncounter look like? Well, you start with a
problem, say you're getting too much traffic and want to see where
it's from. First you need to sample it. Typically, you'd do that with
tcpdump or by tailing a log file:
tail -F /var/log/apache2/*access*.log | awk '{print $2}' | asncounter
If you have lots of traffic or care about your users' privacy, you're
not going to log IP addresses, so tcpdump is likely a good option
instead:
tcpdump -q -n | asncounter --input-format=tcpdump --repl
If you really get a lot of traffic, you might want to take a subset
of it to avoid overwhelming asncounter (it's not fast enough to do
multiple gigabits per second, I bet), so here's only incoming IPv4 SYN
packets:
tcpdump -q -n -Q in "tcp and tcp[tcpflags] & tcp-syn != 0 and (port 80 or port 443)" | asncounter --input-format=tcpdump --repl
In any case, at this point you're staring at a process, just sitting
there. If you passed the --repl or --manhole arguments, you're
lucky: you have a Python shell inside the program. Otherwise, send
SIGHUP to the thing to have it dump the nice tables out:
pkill -HUP asncounter
Here's an example run:
> awk '{print $2}' /var/log/apache2/*access*.log | asncounter
INFO: using datfile ipasn_20250527.1600.dat.gz
INFO: collecting addresses from <stdin>
INFO: loading datfile /home/anarcat/.cache/pyasn/ipasn_20250527.1600.dat.gz...
INFO: finished reading data
INFO: loading /home/anarcat/.cache/pyasn/asnames.json
count percent ASN AS
12779 69.33 66496 SAMPLE, CA
3361 18.23 None None
366 1.99 66497 EXAMPLE, FR
337 1.83 16276 OVH, FR
321 1.74 8075 MICROSOFT-CORP-MSN-AS-BLOCK, US
309 1.68 14061 DIGITALOCEAN-ASN, US
128 0.69 16509 AMAZON-02, US
77 0.42 48090 DMZHOST, GB
56 0.3 136907 HWCLOUDS-AS-AP HUAWEI CLOUDS, HK
53 0.29 17621 CNCGROUP-SH China Unicom Shanghai network, CN
total: 18433
count percent prefix ASN AS
12779 69.33 192.0.2.0/24 66496 SAMPLE, CA
3361 18.23 None
298 1.62 178.128.208.0/20 14061 DIGITALOCEAN-ASN, US
289 1.57 51.222.0.0/16 16276 OVH, FR
272 1.48 2001:DB8::/48 66497 EXAMPLE, FR
235 1.27 172.160.0.0/11 8075 MICROSOFT-CORP-MSN-AS-BLOCK, US
94 0.51 2001:DB8:1::/48 66497 EXAMPLE, FR
72 0.39 47.128.0.0/14 16509 AMAZON-02, US
69 0.37 93.123.109.0/24 48090 DMZHOST, GB
53 0.29 27.115.124.0/24 17621 CNCGROUP-SH China Unicom Shanghai network, CN
Those numbers are actually from my home network, not GitLab. Over
there, the battle still rages on, but at least the vampire bots are
banging their heads against the solid Nginx wall instead of eating the
fragile heart of GitLab. We had a significant improvement in latency
thanks to the Facebook and Huawei blocks... Here are the "workhorse
request duration stats" for various time ranges, 20h after the block:
range | mean  | max   | stdev
------|-------|-------|------
20h   | 449ms | 958ms | 39ms
7d    | 1.78s | 5m    | 14.9s
30d   | 2.08s | 3.86m | 8.86s
6m    | 901ms | 27.3s | 2.43s
We went from a two-second mean to 500ms! And look at that standard
deviation! 39ms! It was ten seconds before! I doubt we'll keep it that
way very long, but for now, it feels like I won a battle, and I didn't
even have to set up anubis or go-away, although I suspect that will
unfortunately come.
Note that asncounter also supports exporting Prometheus metrics, but
you should be careful with this, as it can lead to cardinality
explosion, especially if you track by prefix (which can be disabled
with --no-prefixes).
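If you do scrape it, a minimal Prometheus job could look like this
sketch (the listen port here is entirely hypothetical, check the
manual for the actual flags):
scrape_configs:
  - job_name: asncounter
    static_configs:
      # hypothetical port, not necessarily asncounter's default
      - targets: ["localhost:8000"]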
Folks interested in more details should read the fine manual for
more examples, usage, and discussion. It shows, among other things,
how to effectively block lots of networks from Nginx, aggregate
multiple prefixes, block entire ASNs, and more!
So there you have it: I now have the tool I wish I had 20 years
ago. Hopefully it will stay useful for another 20 years, although I'm
not sure we'll still have an internet in 20 years.
I welcome constructive feedback, "oh no you rewrote X", Grafana
dashboards, bug reports, pull requests, and "hell yeah"
comments. Hacker News, let it rip, I know you can give me another
juicy quote for my blog.
This work was done as part of my paid work for the Tor Project,
currently in a fundraising drive, give us money if you like what you
read.
Blocking comment spammers on an Ikiwiki blog
Despite comments on my ikiwiki blog being fully moderated, spammers have been increasingly posting link spam comments on my blog. While I used to use the blogspam plugin, the underlying service was likely retired circa 2017 and its public repositories are all archived.
It turns out that there is a relatively simple way to drastically reduce the amount of spam submitted to the moderation queue: ban the datacentre IP addresses that spammers are using.
Looking up AS numbers
It all starts by looking at the IP address of a submitted comment.
From there, we can look it up using whois. The important bit here is
the line that refers to Autonomous System 207408, owned by a hosting
company in Germany called Servinga.
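Here's a sketch of that lookup, with a placeholder IP from the
documentation range standing in for the actual address:
$ whois 203.0.113.99 | grep -i origin
origin:         AS207408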
Alternatively, you can use this WHOIS server with much better output:
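One such server is Team Cymru's IP-to-ASN service (my pick for the
illustration; the IP is again a placeholder):
whois -h whois.cymru.com " -v 203.0.113.99"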
Looking up IP blocks
Autonomous Systems are essentially organizations to which IPv4 and IPv6 blocks have been allocated.
These allocations can be looked up easily on the command line, either
using a third-party service or a local database downloaded from IPtoASN.
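For instance, IPtoASN also exposes a small HTTP API; a sketch,
assuming the api.iptoasn.com endpoint and a placeholder IP:
curl -s https://api.iptoasn.com/v1/as/ip/203.0.113.99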
This is what I ended up with in the case of Servinga:
Preventing comment submission
While I do want to eliminate this source of spam, I don't want to block these datacentre IP addresses outright since legitimate users could be using these servers as VPN endpoints or crawlers.
I therefore added the following to my Apache config to restrict the
CGI endpoint (used only for write operations such as commenting), and
then put the list of banned networks in /etc/apache2/spammers.include.
Finally, I can restart the website and commit my changes.
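Here is a minimal sketch of what that setup can look like, assuming
Apache 2.4 syntax, an ikiwiki.cgi endpoint, and placeholder network
blocks:
# in the virtual host: gate the CGI endpoint on the spammers list
<Location "/ikiwiki.cgi">
    <RequireAll>
        Require all granted
        Include /etc/apache2/spammers.include
    </RequireAll>
</Location>
with /etc/apache2/spammers.include containing one line per banned
network:
# placeholder blocks, not the actual Servinga allocations
Require not ip 192.0.2.0/24
Require not ip 198.51.100.0/24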
Future improvements
I will likely automate this process in the future, but at the moment my blog can go for a week without a single spam message (down from dozens every day). It's possible that I've already cut off the worst offenders.
I have published the list I am currently using.