Detecting spam users automatically with a neural network
Circuit Training
Training proceeds iteratively. During one pass (an epoch), the training script feeds the complete data set through the neural network once and computes the loss. The data is not processed all at once but is divided into smaller portions (batches) for performance reasons.
In each subsequent epoch, the training script adjusts the network's parameters via the optimization process so that the loss shrinks. Ideally, the loss converges toward a minimum. The network can then be regarded as trained, and the training script stores the weights and threshold values that produced the minimum loss.
Figure 3 shows a simplified view of how gradient descent finds the minimum for an individual weight w. The x-axis is the weight, and the y-axis is the value of the loss function for that weight. In each iteration of the gradient method, the training script computes the derivative of the loss curve – and thus its slope – at the current weight and then takes a step against the slope, scaled by the learning rate (eta).
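To illustrate the update rule, the following is a minimal sketch of gradient descent for a single weight; the quadratic loss, learning rate, and step count are purely illustrative assumptions, not values from the article.

def gradient_step(w, loss_grad, eta=0.01):
    """Take one step against the slope of the loss, scaled by the learning rate eta."""
    return w - eta * loss_grad(w)

# Purely illustrative quadratic loss with its minimum at w = 3.
loss_grad = lambda w: 2.0 * (w - 3.0)

w = 10.0
for _ in range(1000):          # each step plays the role of one update during an epoch
    w = gradient_step(w, loss_grad)

print(w)                       # approaches 3.0, the weight with the minimum loss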
Interesting Properties
The fields of the data set that serve as inputs for the network are known as properties (features). The neural network works with real numbers, which means names and IP addresses cannot be fed in directly as strings.
Our experience shows that spammers often use very cryptic usernames. We were able to derive the following properties to help identify spammers: the length, the number of hyphens, the number of numerals, the diversity of the characters, the number of vowels, the number of non-letters, and the occurrence of certain keywords (e.g., credits, 100mg, taler).
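A sketch of how such username properties could be turned into numbers might look like the following; the function name, the diversity measure, and the keyword list are illustrative assumptions based on the examples above.

SPAM_KEYWORDS = ("credits", "100mg", "taler")   # example keywords from the article
VOWELS = set("aeiou")

def username_features(name):
    """Derive numeric properties from a username (illustrative sketch)."""
    lower = name.lower()
    return [
        len(name),                                 # length
        name.count("-"),                           # number of hyphens
        sum(c.isdigit() for c in name),            # number of numerals
        len(set(lower)) / max(len(name), 1),       # diversity of the characters
        sum(c in VOWELS for c in lower),           # number of vowels
        sum(not c.isalpha() for c in name),        # number of non-letters
        sum(kw in lower for kw in SPAM_KEYWORDS),  # occurrence of keywords
    ]

print(username_features("cheap-credits-100mg-4711"))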
A geolocation database maps an IP address to its country of origin and its ISP. The existing data then reveals how often an ISP shows up as a spammer's provider, how frequently a particular combination of country of origin and chosen language appears among website builders, and which countries send an especially large amount of spam.
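The article does not say which geolocation database is used; as one possible approach, the lookup step could be implemented with the geoip2 package and MaxMind's GeoLite2 databases, as in this sketch (the database file names are assumptions).

import geoip2.database

# Assumes the GeoLite2 database files have been downloaded locally.
country_reader = geoip2.database.Reader("GeoLite2-Country.mmdb")
asn_reader = geoip2.database.Reader("GeoLite2-ASN.mmdb")

def geo_lookup(ip):
    """Map an IP address to its country code and ISP (ASN organization)."""
    country = country_reader.country(ip).country.iso_code
    isp = asn_reader.asn(ip).autonomous_system_organization
    return country, isp

print(geo_lookup("93.184.216.34"))

The frequencies described above (how often an ISP or country appears among spammers) would then be computed from these lookups over the whole data set.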
The next step is to discard properties that do not correlate strongly with the class and thus contribute little to the outcome: A smaller network can be trained more quickly and needs fewer resources. A correlation matrix reveals how well suited the individual properties are for spam detection.
Listing 1 shows a Python script for setting up the correlation matrix. The script reads a CSV file with the data, computes the correlation matrix using the np.corrcoef() function, and finally generates a PNG file with a density plot of the matrix. The script ignores the first column (the username in the sample data). If the CSV file contains other values that are not real numbers, you will have to modify the read_file() function accordingly. The class, which distinguishes spammers from legitimate website builders, is expected in the last column.
Listing 1
correlation.py
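A minimal sketch along the lines of the description above; the actual Listing 1 may differ in details such as the read_file() implementation and the plotting style.

import csv
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

def read_file(path):
    """Read the CSV file, skipping the first column (the username)."""
    rows = []
    with open(path) as f:
        for row in csv.reader(f):
            rows.append([float(v) for v in row[1:]])  # adjust if other columns are not numeric
    return np.array(rows)

data = read_file("users.csv")                  # class expected in the last column
corr = np.corrcoef(data, rowvar=False)         # correlate columns (properties), not rows

# Absolute values so that strong negative correlations also appear light.
plt.imshow(np.abs(corr), cmap="viridis")
plt.colorbar()
plt.savefig("correlation.png")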
The density plot (Figure 4) gives an overview of which properties are particularly suitable. Each row and each column corresponds to a property. The lighter the field, the higher the correlation between the row property and the column property. The last row and the last column show the correlation with the class; the lighter a field there, the better suited the corresponding property is to the classification.
The correlation matrix also reveals whether two properties are so similar that one of them would suffice. Properties 6 (the number of numerals in the username) and 10 (the number of non-letters) are an example: The white field indicates a strong relationship between these two variables, so it is sufficient to take property 6 into account, because property 10 provides no additional information.
Hyperparameters
The lion's share of the work with neural networks goes into determining the structure, or configuration, of the network with the aid of hyperparameters. Developers usually do this manually, training each network configuration individually and comparing the results until they end up with a good configuration. Hyperparameters include the number and size of the layers, the activation functions for the layers, the number of epochs, the size of the data batches, the optimization process, and the learning rate.
TFLearn offers a variety of activation functions in the tflearn.activations package. Figure 5 depicts the most important of them. The simplest is the identity, or linear, activation function, which returns the input value unaltered. The sigmoid function is non-linear and is therefore more interesting as an activation function than its linear counterpart; it is bounded and only produces values between 0 and 1.
Tanh is comparable to sigmoid, except that it returns values between -1 and 1. A further activation function is the rectified linear unit, or ReLU, which you can think of as a linear function with a threshold: It returns zero for negative inputs and the input itself otherwise. Networks with ReLU converge very quickly, and ReLU is the recommended function at this time; we also use it for our network.
Another important activation function goes by the name of softmax. It relates the value of a neuron to the values of the other neurons in its layer; its special characteristic is that all output values of the layer add up to 1. This function is often used for the output layer of networks whose purpose is classification, because the network's output can then be interpreted as probabilities for the individual classes.
In addition to the layers and their activation functions, you also pick an optimization process. As an alternative to the classic gradient method, developers often use Adam, an algorithm that generally delivers good results. Adam also needs a learning rate; the preset value of 0.001 is a suitable starting point, although it can be reduced later to achieve an even better outcome. All the optimization techniques supplied with TFLearn reside in the tflearn.optimizers package.
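Putting these choices together, a TFLearn network with ReLU hidden layers, a softmax output layer, and the Adam optimizer could be defined as in the following sketch; the layer sizes, the number of input properties, and the training parameters are assumptions for illustration, not the article's exact configuration.

import tflearn

n_features = 12  # assumed number of input properties

net = tflearn.input_data(shape=[None, n_features])
net = tflearn.fully_connected(net, 32, activation='relu')    # hidden layer with ReLU
net = tflearn.fully_connected(net, 32, activation='relu')
net = tflearn.fully_connected(net, 2, activation='softmax')  # spammer vs. legitimate user
net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy')

model = tflearn.DNN(net)
# With feature matrix X and one-hot labels Y prepared elsewhere:
# model.fit(X, Y, n_epoch=50, batch_size=64, show_metric=True)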
With the hyperparameters you have found, you train the network, measure its accuracy, and then vary the parameters. You repeat this process until the accuracy can no longer be increased significantly.
Figure 6 shows the loss during training in graph form. The graph indicates whether the learning rate is too high or too low. Ideally, the loss curve resembles a falling exponential curve (blue). If the learning rate is too high, the loss initially drops quickly but may converge prematurely (red), which indicates that you have not yet found the optimum. If the learning rate is too low, the loss drops only very slowly, but it is more likely that you will actually find the global optimum (yellow). If the loss curve is too noisy, you can increase the batch size.