Developing a mailbot script
Address Catcher

A Python script that captures email addresses will help you understand how bots analyze and extract data from the web.
Bots crawl around constantly on the Internet, capturing information from public websites for later processing. Although the science of bot design has become quite advanced, the basic steps for capturing data from an HTML page are simple. This article describes an example script that extracts email addresses. The script even provides the option to extend the search to the URLs found on the target page. Rolling your own bot will help you build a deeper understanding of privacy defense and cybersecurity.
Setting Up the Environment
I recommend setting up an integrated development environment (IDE), such as Visual Studio (VS) Code, for Python programming; you should also have a basic understanding of the language. You can download VS Code from the VS Code website [1]. On Ubuntu, an easy way to install the application is to download the .deb package, right-click the file, and select the Install option. Alternatively, you can search for "vscode" in the App Center and click the Install button. If you prefer using the terminal, the VS Code website [2] provides detailed instructions for any Linux distribution. I also suggest adding Python development extensions, including Pylance and the Python Debugger.
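For example, assuming the downloaded package is called code_latest_amd64.deb (the file name is a placeholder for whatever version you actually downloaded), one way to install it from an Ubuntu terminal is:

sudo apt install ./code_latest_amd64.deb

Alternatively, the editor is also available as a classic snap (sudo snap install code --classic).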
The Script
The full text of the mailbot.py script is available on the Linux Magazine website [3]. Listing 1 shows the beginning of the script, where I import the modules I will need to manage communications via the HTTP protocol, search for string patterns using regular expressions, implement asynchronous functions, manage script input arguments, and show a progress bar to track the process's advancement. The alive_progress module is not part of the standard library, so I have to install it with the following command:
pip install alive-progress
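If you have not used alive_progress before, the following minimal sketch (independent of the mailbot script, with a sleep standing in for real work) shows the pattern the script relies on: alive_bar() is opened as a context manager with the total number of steps, and every call to bar() advances the indicator by one step.

from alive_progress import alive_bar
import time

# Toy example: 50 steps of simulated work
with alive_bar(50) as bar:
    for _ in range(50):
        time.sleep(0.05)   # stand-in for real processing
        bar()              # advance the progress bar by one step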
Listing 1
Importing Modules
import urllib
import urllib.request
import re
import asyncio
import argparse
from alive_progress import alive_bar
import sys
Listing 2 first defines the asynchronous function ExtractURLs(myUrl), which takes the content of the page to be scanned as its parameter and returns a list of the URLs it contains. If the user has requested recursive scanning, the code later analyzes each of these entries to search for email addresses. The elements to be added to the list are selected and extracted using a regular expression, defined in the variable regex and passed as a parameter to the findall function. The findall function returns all occurrences that match the format I have set. I store these occurrences in the list t, which is finally returned as the result of ExtractURLs(myUrl).
Listing 2
Defining Asynchronous Functions
async def ExtractURLs(myUrl):
    # Return every URL found in the string passed in; on error, return the (possibly empty) list collected so far
    try:
        t=[]
        regex="(?P<url>https?://[^\\s'\"]+)"
        t=re.findall(regex, myUrl)
    except:
        pass
    finally:
        return t

async def ExtractMails(myUrl):
    # Return every email address found in the string passed in
    try:
        u=[]
        regex="[\\w\\.-]+@[\\w\\.-]+\\.\\w+"
        u=re.findall(regex, myUrl)
    except:
        pass
    finally:
        return u
Similarly, I then define the asynchronous function ExtractMails(myUrl), which returns a list of the email addresses found in the string myUrl, based on another regular expression that I have defined. Both functions are enclosed in a try, except, finally construct. In case of an error, the script does not perform any operation, in order to avoid premature termination or a final output in a nonstandard, and thus unusable, format. Regardless of the outcome, both functions return a list, containing web addresses and email addresses, respectively. The regular expressions used for extracting URLs and email addresses have been adapted from discussions on the Stack Overflow forum [4, 5]. I could refine ExtractURLs(myUrl) and ExtractMails(myUrl) to ensure they return a list strictly populated with valid values; however, I prefer to prioritize sensitivity over specificity in the extraction process. In other words, I choose to return a broader list of addresses that includes all available ones, rather than a shorter list of likely valid addresses that may exclude some equally valid entries. The rationale behind this choice is that sending emails is a quick and low-cost operation, so this approach maximizes the number of users reached at the cost of a small percentage of undelivered emails.
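To see how permissive these patterns are, you can try both functions on a short test string before pointing the script at a real site. The following sketch assumes the imports from Listing 1 and the functions from Listing 2 are already in place; the HTML fragment and the addresses in it are made up for illustration:

# Hypothetical test fragment
sample = '<a href="https://www.example.org/contact">Write to info@example.org or webmaster@example.net</a>'

print(asyncio.run(ExtractURLs(sample)))
# ['https://www.example.org/contact']
print(asyncio.run(ExtractMails(sample)))
# ['info@example.org', 'webmaster@example.net']

Any string that merely looks like an address will also be reported, which is exactly the trade-off described above.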
In Listing 3, I process the script parameters, two of which are mandatory: the web address to analyze, url, and the name of the output file, output. The third parameter, -r, is optional and extends the search for email addresses to the URLs contained within the web page defined by the first parameter. I implement only one level of recursion, because the number of addresses involved in the search (and consequently the required resources) would grow exponentially with each iteration. Next, I open a connection to the web page specified in the url argument and assign its source code to the variable webContent.
Listing 3
Processing Script Parameters
try:
    parser=argparse.ArgumentParser()
    parser.add_argument("url", help="Url to analyze", type=str)
    parser.add_argument("output", help="Output file", type=str)
    parser.add_argument("-r", "--recursive", help="Recursive search", action="store_true")
    args=parser.parse_args()

    response = urllib.request.urlopen(args.url)
    webContent = response.read().decode('UTF-8')
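One thing to keep in mind: some web servers reject requests that arrive with urllib's default user agent and answer with an HTTP 403 error. If that happens, a possible variation (not part of the original script) is to wrap the address in a Request object with an explicit User-Agent header and to pass a timeout to urlopen:

# Hypothetical variation: send a custom User-Agent and wait at most 30 seconds
req = urllib.request.Request(args.url, headers={"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"})
response = urllib.request.urlopen(req, timeout=30)
webContent = response.read().decode('UTF-8')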
Listing 4 scans for email addresses in the webContent string I just obtained, using the ExtractMails(myUrl) function. At this point, I check whether the user has requested a recursive search. If not, the operation is complete, and I print the list variable z (where I have stored the found email addresses) to the output file. Otherwise, I perform a search for all URLs contained in the webContent string and, for each of them, call the ExtractMails(myUrl) function again. I perform the searches using asynchronous functions to ensure that the email extraction process occurs in an orderly fashion and only after the URL search has been completed. Specifically, the list v contains all the URLs extracted from webContent. I loop through v, calling ExtractMails(myUrl) for each element. The variable k contains the email addresses extracted from the current element of the v list. If the length of k is greater than zero (i.e., if at least one address is found), I add its contents to the list z, which holds all the email addresses extracted so far. At the same time, I provide the user with a visual progress indicator: a real-time updating progress bar that runs from zero to the length of the v list, which corresponds to the number of URLs to analyze. The current value of the bar is equal to i, a counter that tracks the iterations through the URL list, and the bar is updated at each iteration by calling the bar() function.
Listing 4
Scanning Email Addresses
    z=asyncio.run(ExtractMails(webContent))

    if(args.recursive==True):
        v=asyncio.run(ExtractURLs(webContent))
        with alive_bar(len(v)) as bar:
            for i in range(len(v)):
                k=asyncio.run(ExtractMails(v[i]))
                if(len(k)!=0):
                    z.extend(k)
                bar()

    if(len(z)>0):
        z[-1] = z[-1] + ";"
        with open(args.output, "w") as f:
            print(*z, sep="; ", file=f)
        print("Found " + str(len(z)) + " mail addresses.")
    else:
        print("No mail addresses were found.")
except Exception as e:
    print(e)
I check if any email addresses have been extracted and, if not, notify the user with the message "No mail addresses were found" (Figure 1). If the check is positive, I append a semicolon to the last element of the z list. Next, I open the file specified in the args.output argument and write all the elements of the z list to it, each separated by a semicolon and a space. Finally, I inform the user about the number of email addresses found. The output file will thus contain a sequence of addresses already properly formatted for use as a recipient list for an email.
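For example, with purely hypothetical addresses, a run of the script could produce an output file with contents like the following:

info@example.org; sales@example.org; webmaster@example.net;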

Use Case Example
The script should be executed with the following syntax:
python mailbot.py website_name output_file [-r]
Instead of website_name, you should insert the full address of the page, including the "http://" or "https://" prefix. output_file refers to the text file where you want to save the list of found email addresses. Finally, the optional parameter -r enables a recursive search. Therefore, to recursively extract email addresses from www.mysite.org and save the corresponding list to output.txt, you just need to type the following:
python mailbot.py https://www.mysite.org output.txt -r
A progress bar indicates the real-time progress of the search process. Figure 2 shows the appearance of the terminal while the operation is in progress. If the -r parameter is not specified, the search is performed only on the page provided as the first parameter of the script. The output file will contain a list of extracted email addresses, separated by a semicolon and a space. This way, the resulting string can be used directly as a recipient list without further processing.
