Find spammers with the help of a database and Google Charts

Sniffing out Spammers

Article from Issue 100/2009

Author(s): Michael Schilli

To identify the geographic regions from which link spam originated, a database locates IP addresses and the Google Charts service puts them onto a world map.

Sometimes I imagine how satisfying it would be to track down a spammer's or telemarketer's office, like the man in the Snickers commercial [1] who arrives at the office and gets his revenge. Unfortunately, legal and logistical reasons often prevent this. Additionally, the requests often come from botnets rather than from the spammers' own machines. Still, it would be interesting to create a graph that pinpoints the geographic regions in which most spam activities originate.

The Internet is the ideal platform for anonymous trickery, but the perpetrator's deeds actually leave a trail – each incoming request on a website includes the sender's IP address (see Figure 1).

Of course, the address could be spoofed, but this is not so simple and just too much trouble for most link spammers.

DNS, which resolves hostnames into IP addresses, can often do the same thing in reverse. A DNS reverse lookup expects an IP address, and if the spammer's service provider has set up everything as it should be, the script in Listing 1, revlookup, will return a hostname from which you usually can identify the provider. Figure 2 shows that the address I caught spamming, IP 69.162.110.146, belongs to an ISP called lstn.net. A friendly email to the ISP's webmaster, stating the IP and the time (which is important because these IPs are often assigned dynamically), might just be the ticket to stop the spammer's illegal activity for good.

Listing 1

revlookup

The inet_aton() function in Perl's Socket module accepts an IP address in string notation ("x.x.x.x") and returns a data structure for a subsequent call to Perl's gethostbyaddr() function. When called with the AF_INET parameter, as shown in line 17 in revlookup, the function performs the DNS reverse lookup in the IPv4 address space and, if successful, returns a string with the hostname or returns undef if an error occurs. Depending on how busy the DNS server you call on is at the time and how many of its peers it needs to consult to answer your request, this process can take a couple of seconds.
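The same reverse lookup can be sketched in Python for comparison. This is not the revlookup listing; it is a minimal sketch that assumes the standard library's socket.gethostbyaddr() is an acceptable stand-in for Perl's gethostbyaddr(). The function name reverse_lookup and its resolver parameter are made up for illustration; the parameter only exists so that the lookup can be exercised without a live DNS server.

```python
import socket


def reverse_lookup(ip, resolver=socket.gethostbyaddr):
    """Return the hostname registered for an IPv4 address, or None."""
    try:
        # Reject strings that inet_aton() cannot parse as an IPv4 address.
        socket.inet_aton(ip)
    except OSError:
        return None
    try:
        # gethostbyaddr() returns (hostname, alias list, address list).
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # No reverse entry exists, or the DNS query failed.
        return None
    return hostname
```

Calling reverse_lookup("69.162.110.146") would query DNS for the address mentioned above; like the Perl version, the call can block for a few seconds while the resolver waits for an answer.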

As another option, the whois command-line utility doesn't just work with domains; it also accepts IP addresses as arguments. Figure 3 shows that the provider, Limestone Networks, has registered everything correctly and even provides an email address that spammed webmasters can contact with their complaints. The lookup can be automated in Perl with the CPAN Net::Whois::Raw module, for example; however, it puts significant load on the servers hosted by Network Solutions, which will block access if you perform 100 lookups in quick succession. In other words, searching a complete access log with this module is impossible, even if you cache queries you have already made.

Many spammers use IP addresses without a reverse lookup entry in DNS. But even then you can still locate the culprit: IP addresses are assigned to service providers in blocks, and you can download databases with the information necessary to discover the approximate geographic position of any given IP address. MaxMind offers a database file [2] that is free for non-commercial use. The licensing conditions are available in the same directory as the database itself. The CPAN IP::Country::MaxMind module provides an API to match, thus avoiding the need to mess around with data blobs. The IP mappings stored in the database change very slowly; updating once every couple of months should be fine.

Besides this module, you will also need another CPAN module, Geo::IP::PurePerl. The MaxMind module's open() constructor loads the local database that you specify, and the inet_atocc() function returns a country code for any IP address (for example, DE for Germany).
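To illustrate how a block-based database can map an address to a country, here is a small Python sketch. It does not read the MaxMind file format or use the module's API. The table SAMPLE_BLOCKS is hypothetical: the address ranges are reserved documentation networks, and the country codes attached to them are invented for the example. The function names build_index and country_for are likewise assumptions.

```python
import bisect
import ipaddress

# Hypothetical sample data: (first address, last address, country code).
# The ranges are reserved documentation networks; the codes are made up.
SAMPLE_BLOCKS = [
    ("192.0.2.0", "192.0.2.255", "DE"),
    ("198.51.100.0", "198.51.100.255", "US"),
    ("203.0.113.0", "203.0.113.255", "JP"),
]


def build_index(blocks):
    """Convert the blocks to integer ranges sorted by start address."""
    index = sorted(
        (int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last)), cc)
        for first, last, cc in blocks
    )
    starts = [start for start, _end, _cc in index]
    return starts, index


def country_for(ip, starts, index):
    """Return the country code for an IPv4 address, or None if unknown."""
    value = int(ipaddress.IPv4Address(ip))
    # Binary search for the last block that starts at or below the address.
    pos = bisect.bisect_right(starts, value) - 1
    if pos < 0:
        return None
    _start, end, cc = index[pos]
    # The address only belongs to the block if it is not past its end.
    return cc if value <= end else None
```

Because the blocks are sorted once and searched with bisect, each lookup takes logarithmic time, which matters when every line of a large access log triggers one.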

The Google Charts API [3] gives you a useful option for plotting these codes on a world map. If you pass in pairs of values to the Google server, it will respond with a PNG-formatted image file. The data format for the pairs of values is slightly unusual in that you need to squash largish volumes of data into the very restricted space offered by a URL and its query parameters.

Simplicity Itself

The API's aptly named Simple Encoding data format will only allow values between 0 and 61, encoded as A-Z (0-25), a-z (26-51), and 0-9 (52-61).

If you assign a value of 23 to Germany, 3 to the USA, and 60 to Japan, you can encode the country codes in the chld URL parameter as "DEUSJP" (DE, US, and JP, concatenated without blanks), and the values as "s:XD8" (s = simple encoding, X = 23, D = 3, and 8 = 60) in chd.
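The encoding rule is compact enough to sketch in a few lines of Python. The names SYMBOLS, simple_encode, and map_params are chosen for this illustration and are not part of the Google API; the sketch only reproduces the character table and the example values given above.

```python
import string

# 62 symbols: A-Z (0-25), a-z (26-51), 0-9 (52-61).
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def simple_encode(values):
    """Encode integers in the range 0-61 as a Simple Encoding string."""
    values = list(values)
    for value in values:
        if not 0 <= value <= 61:
            raise ValueError(f"value out of range 0-61: {value}")
    return "s:" + "".join(SYMBOLS[value] for value in values)


def map_params(counts):
    """Return the (chld, chd) strings for a dict of country code -> value."""
    codes = list(counts)
    chld = "".join(codes)
    chd = simple_encode(counts[code] for code in codes)
    return chld, chd
```

With the values from the example, map_params({"DE": 23, "US": 3, "JP": 60}) yields the pair ("DEUSJP", "s:XD8"). The country codes and the values are generated from the same list of keys so that both strings stay in the same order.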

The script in Listing 2, spam2geo, implements the steps I have identified thus far; it analyzes the access.log file from an Apache server under heavy fire from link spammers. The CPAN ApacheLog::Parser module provides a parse_line_to_hash function, which understands the access.log format and returns the individual fields of each log entry as a hash. The client entry includes the spammer's IP address in each case, and a call to the inet_atocc method in line 32 returns the two-letter country code, assuming the database knows it.

Listing 2

spam2geo

If successful, line 36 increments the hash entry for the country, and the program moves on to the next line in the logfile. Because you are not interested in all the URLs – just the ones generated by spammers – line 28 filters out all entries whose path (file hash key) does not match the regular expression posting. The regex should only match URLs used by spammers to post on the forums you are monitoring, so you must modify it to match your local conditions.
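The filter-and-count loop described above might look like the following in Python. This is a sketch rather than a port of spam2geo: the regular expression LOG_LINE is a simplified assumption about the Apache common/combined log format, and the lookup argument is a placeholder for whatever function turns an IP address into a country code.

```python
import re
from collections import Counter

# Simplified pattern for Apache's common/combined log format (an assumption):
# client identity user [time] "method path protocol" status size ...
LOG_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_by_country(lines, lookup, pattern=r"posting"):
    """Count requests per country for paths matching the spam pattern."""
    spam_path = re.compile(pattern)
    by_country = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that do not parse
        if not spam_path.search(match.group("path")):
            continue  # not a URL used by the spammers
        country = lookup(match.group("client"))
        if country is not None:
            by_country[country] += 1  # only count addresses the database knows
    return by_country
```

As in the script, the pattern defaults to "posting" and needs to be adapted to the URLs that spammers actually hit on your site.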

Normalization and conversion of the data to the Google format starts in line 42. Because the counts for each country in the %by_country hash can take arbitrary values rather than falling in the range 0 to 61, spam2geo first determines the limits of the range with min and max from List::Util. It then subtracts $min and divides by $max to squash the values into the range between 0 and 1, and multiplies the result by the number of encoding characters minus 1. Thus, $norm contains a floating-point number, which can be converted to an integer and used as an index into the @SYMBOLS array, mapping the whole range of values to elements of the array.
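A Python sketch of this scaling step follows. Note one deliberate difference from the description above: the sketch divides by the span between the maximum and the minimum (plain min-max scaling) instead of by the maximum alone, so that the largest count always lands on the last symbol. The function name normalize is an assumption for this example.

```python
import string

# Same 62-character table as the Simple Encoding: A-Z, a-z, 0-9.
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def normalize(by_country):
    """Scale arbitrary counts to integer indexes 0-61 via min-max scaling."""
    if not by_country:
        return {}
    low = min(by_country.values())
    high = max(by_country.values())
    span = high - low
    top = len(SYMBOLS) - 1  # highest valid index, 61
    scaled = {}
    for country, count in by_country.items():
        # Map the count to 0..1; if all counts are equal, use 0 for each.
        norm = (count - low) / span if span else 0.0
        # Truncate to an integer index into SYMBOLS.
        scaled[country] = int(norm * top)
    return scaled
```

For example, counts of 500, 250, and 0 map to the indexes 61, 30, and 0, which correspond to the symbols 9, e, and A.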

Lines 68 and 70 then concatenate the calculated symbols to give strings without separating blanks, for passing in with the chld (country codes) and chd (values) URL parameters. From the programmer's point of view, the order in which the keys and values functions return results is arbitrary, but consistent within the Perl script, and irrelevant to the Google service.

Communications with the Google server are handled by LWP::UserAgent via HTTP. The URL parameters are set by the query_form() method, which also performs any URL encoding required. The cht parameter specifies the chart type used by the Google Charts service and is set to "t" (topological) for a world map. You can optionally restrict the view to individual continents; however, you need to set the chtm parameter to "world" for a world map.

The chs parameter sets the dimensions of the resulting image to 440x220 pixels. Google Charts uses the colors white, yellow, and red specified as hex RGB values in chco to shade the countries, thus reflecting minimum, medium, and maximum values. So, the settings in Listing 2 leave countries with normalized spam counts around 0 white, values of around 20 yellow, and values of 60 or more red. The "bg,s,EAF7FE" string for the chf parameter stands for background, solid, and the hex value for light blue to color the world's oceans.

All told, the URL will look something like this:

http://chart.apis.google.com/chart?cht=t&chs=440x220&chtm=world&chd=s%3ABFAABAHGQAAA8BAAAAAAAaBAA&chco=ffffff%2Cf4ed28%2Cf11414&chld=GBNLHKEELVKRRUSAPAMDCASECNDEPKITPLINMEBRCZUSUAESFR&chf=bg%2Cs%2CEAF7FE
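How such a URL comes together can be sketched in Python with urllib.parse.urlencode(), which plays the role that query_form() plays in the Perl script. The function name chart_url is an assumption; the parameter values are the ones described in the text. The sketch only builds the URL and does not send a request.

```python
from urllib.parse import urlencode

BASE_URL = "http://chart.apis.google.com/chart"


def chart_url(chld, chd):
    """Build the map chart URL from country codes and encoded values."""
    params = {
        "cht": "t",                      # chart type: map
        "chs": "440x220",                # image size in pixels
        "chtm": "world",                 # show the whole world
        "chd": chd,                      # values, e.g. "s:XD8"
        "chco": "ffffff,f4ed28,f11414",  # white, yellow, red
        "chld": chld,                    # country codes, e.g. "DEUSJP"
        "chf": "bg,s,EAF7FE",            # solid light blue background
    }
    # urlencode() percent-encodes ":" and "," in the parameter values.
    return BASE_URL + "?" + urlencode(params)
```

Calling chart_url("DEUSJP", "s:XD8") produces a URL of the same shape as the one above, with the colon in the chd value encoded as %3A and the commas as %2C.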

Google takes just a couple of seconds to render and deliver this as the graph shown in Figure 4. If you comment out lines 27-29 in spam2geo, the graph will give you a geographic distribution of all incoming URLs instead (Figure 5).

Although most spam requests originate in China and the US, most of the website's bona fide customers come from Germany. To view the PNG file that the script retrieves from Google, open it in the Eye of Gnome utility with the eog file.png command.

Installation

After downloading the MaxMind GeoIP.dat.gz database [2], unpack the GeoIP.dat file and place the spam2geo script into your current working directory. The CPAN IP::Country::MaxMind, Geo::IP::PurePerl, List::Util, and ApacheLog::Parser modules and all their dependencies are best installed from a CPAN shell. To use the Google API, you do not need to register. You just need to modify line 28 in spam2geo to match your local conditions by changing the /posting/ pattern to match URLs used only by spammers to clutter your discussion groups with parasitic entries.

For more detailed analysis including, for example, the number of forum requests compared with other activities or the preferred browser type used by the spammers (at least what they say they're using), check out the enormous choice provided by the Google Charts API [3], which gives you an easy approach to render any statistical information elegantly in polished chart form.

Infos

[1] Snickers Cruncher – Telemarketer: http://www.youtube.com/watch?v=R6QATC2C0h8
[2] Free MaxMind GeoIP database download: http://www.maxmind.com/download/geoip/database/
[3] "Maps" charts by the Google Charts web service: http://code.google.com/apis/chart/types.html#maps
[4] Listings for this article: http://www.linux-magazine.com/resources/article_code
