Detecting spam users automatically with a neural network

Spam Stopper

© Lead Image © Kirsty Pargeter,

© Lead Image © Kirsty Pargeter,

Article from Issue 195/2017

Build a neural network that uncovers spam websites.

Website builders – online hosting services that provide tools for non-technical users to build their own websites  – are frequently exploited by spammers looking for a convenient launching pad. Checking thousands, or sometimes millions, of web pages manually to look for evidence of a spammer is both tedious and inefficient.

In this article, I show how to build a suitable spam-searching neural network with help from Google's TensorFlow machine learning library [2] [3] and TFLearn [4], a library with a high-level API for TensorFlow. Even if you don't spend your days searching for spammers, the techniques described in this article will give you some insights on how to harness the power of neural networks for other complex problems.

Training Day

The neural network needs both positive and negative samples in order to learn. This solution starts with a manually compiled list of sample users divided into spammers and legitimate users, taking care to distribute both types in equal numbers. Alongside this classification (spammer or not spammer), the data set contained the user's name or the website that belongs to the user, the IP address with which the site is registered, and the language version associated with the site.

As a result of the solution described here, the neural network now automatically recognizes new spammers as they register. The next step is to combine this automatic check with a manual check. A Python script automatically blocks sites that the network classifies with a very high probability of being spam, and an employee manually checks the sites that are deemed high probability.

Sound Network

Neural networks are mathematical models that can approximate any function. A neural network is guided by networked neurons similar to those in the human brain, such as in the visual cortex. What makes these networks special is that you do not have to model their behavior explicitly; instead, you train the network using sample data.

Neural networks help out when it is difficult to model functions manually, and they are often used in image and speech recognition. You need to provide the neural network with training data that has already been classified, and it will then attempt to classify new data in a similar way.

A single artificial neuron comprises several weighted inputs and an activation function, which is usually non-linear and helps to determine the output value of the neuron. There is also a threshold value or bias, which complements the weighted inputs, thus influencing the activation function. The mathematical formula behind this concept is as follows:

The formula uses the w vector to weight the input vector x and calculate the sum of both. It then adds the bias b, using the activation function phi. When developers skillfully combine several neurons, they can compute more complex functions (see the box titled "Solving Problems with Neural Networks").

Solving Problems with Neural Networks

A single neuron can already solve linearly separable problems. The binary OR function is an example of such a linearly separable problem. If you enter the possible inputs in a coordinate system, the two output values can be separated with a straight line (the top right neuron in Figure 1).

Few problems, however, are so easy to solve. A single neuron is typically not enough to make a classification. Full networks composed of neurons are used in practice, because more complex challenges can largely be split into separable sub-problems, which individual neurons can then solve.

Figure 1 shows the binary exclusive OR, which proves not to be linearly separable. A single line is not sufficient to separate the two ones from the zeros. As the propositional logic is aware, the XOR function consists of a combination of two conjunctions:

Both the two conjunctions and the disjunction can in turn be separated linearly. It is therefore possible to model the binary exclusive OR with three neurons, with one of them receiving the outputs of the two others. This combination of neurons forms a small, two-layered neural network.

As the small example demonstrates, deep learning experts can calculate complex functions with ease by combining multiple neurons. The strength of a neural network increases with the number of layers used. The layers allow experts to compute more functions.

Figure 1: A binary XOR is not linearly separable per se, although it can be split into linearly separable sub-problems.

Networked Learning

Several layers of interconnected neurons form a neural network (Figure 2). These layers consist of at least an input layer, which receives the input values, and an output layer, on which the data arrives after passing through several hidden layers. All the neurons on a particular layer generally use the same activation function.

Figure 2: A neural network with two hidden layers.

Neural networks learn through an optimization process that determines the parameters of the network, the weightings of the connections, and the bias of all the neurons, then refines these values step by step. The process that determines parameters is one of the optimization problems. This process involves the use of many traditional numerical analysis methods (e.g., the gradient method  [5]).

The script first initializes the network using random parameters. Next, the script applies the training data set to the neural network and determines the difference between the network's results and the correct results from the training data. The gap between these results is the loss, which the script attempts to minimize in the course of the optimization process.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • FAQ

    Welcome our new artificial intelligence overlords by tinkering with their gray matter.

  • Neural Networks

    3, 4, 8, 11… ? A neural network can complete this series without knowledge of the underlying algorithm – by a kind of virtual gut feeling. We’ll show you how neural networks solve problems by simulating the behavior of a human brain.

  • Programming Snapshot – Mileage AI

    On the basis of training data in the form of daily car mileage, Mike Schilli's AI program tries to identify patterns in driving behavior and make forecasts.

  • TensorFlow AI on the Pi

    You don't need a powerful computer system to use AI. We show what it takes to benefit from AI on the Raspberry Pi and what tasks the small computer can handle.

  • Neural networks learn from mistakes and remember successes

    The well-known Monty Hall game show problem can be a rewarding maiden voyage for prospective statisticians. But is it possible to teach a neural network to choose between goats and cars with a few practice sessions?

comments powered by Disqus