Scraping highly dynamic websites
Programming Snapshot – chromedp
Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.
Gone are the days when hobbyists could simply download websites quickly with a curl command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.
For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.
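To make the problem concrete, here is a minimal sketch using only Go's standard library. It assumes a hypothetical page (dynamicPage) whose visible text is filled in by JavaScript, serves it from a local test server, and fetches it the way curl would. The helper names staticDivText() and newDynamicServer() are made up for this illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"regexp"
)

// dynamicPage is a hypothetical page whose visible text is inserted by
// JavaScript after loading, as on many modern sites.
const dynamicPage = `<html><body>
<div id="content"></div>
<script>
document.getElementById("content").textContent = "Hello from JavaScript";
</script>
</body></html>`

var contentDiv = regexp.MustCompile(`<div id="content">(.*?)</div>`)

// staticDivText fetches url without executing any JavaScript and returns
// whatever text sits inside the content div of the raw HTML.
func staticDivText(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	m := contentDiv.FindSubmatch(body)
	if m == nil {
		return "", fmt.Errorf("content div not found")
	}
	return string(m[1]), nil
}

// newDynamicServer starts a local test server that serves dynamicPage.
func newDynamicServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, dynamicPage)
		}))
}

func main() {
	srv := newDynamicServer()
	defer srv.Close()

	text, err := staticDivText(srv.URL)
	if err != nil {
		panic(err)
	}
	// The div is still empty, because nobody ran the script.
	fmt.Printf("div text seen by a static fetch: %q\n", text)
}
```

The static client receives the script but never runs it, so the div stays empty. A real browser would execute the script and display the text, which is why the scraper needs to ask a browser for the rendered content.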
For years, developers have been using the Java Selenium suite for fully automated unit tests for Web user interfaces (UIs). The tool speaks the Selenium protocol, which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] defines a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have licenses that prohibit screen scraping. See the site's permission page and consult the applicable laws for your jurisdiction.
Directing Chrome
The Go program in Listing 1 [4] launches the Chrome browser, points it at the Linux Magazine web page, and then takes a screenshot of the retrieved content.
Listing 1
screenshot.go
01 package main
02
03 import (
04   "context"
05   emu "github.com/chromedp/cdproto/emulation"
06   "github.com/chromedp/cdproto/page"
07   cdp "github.com/chromedp/chromedp"
08   "io/ioutil"
09 )
10
11 func main() {
12   ctx, cancel :=
13     cdp.NewContext(context.Background())
14   defer cancel()
15
16   var buf []byte
17   tasks := cdp.Tasks{
18     cdp.Navigate(
19       "http://linux-magazine.com"),
20     cdp.ActionFunc(
21       func(ctx context.Context) error {
22         _, _, contentSize, err :=
23           page.GetLayoutMetrics().Do(ctx)
24         if err != nil {
25           panic(err)
26         }
27
28         w, h := contentSize.Width,
29           contentSize.Height
30
31         viewPortFix(ctx, int64(w), int64(h))
32
33         buf, err = page.CaptureScreenshot().
34           WithQuality(90).
35           WithClip(&page.Viewport{
36             X:      contentSize.X,
37             Y:      contentSize.Y,
38             Width:  w,
39             Height: h,
40             Scale:  1,
41           }).Do(ctx)
42         if err != nil {
43           panic(err)
44         }
45         return nil
46       })}
47
48   err := cdp.Run(ctx, tasks)
49   if err != nil {
50     panic(err)
51   }
52
53   err = ioutil.WriteFile("screenshot.png",
54     buf, 0644)
55   if err != nil {
56     panic(err)
57   }
58 }
59
60 func viewPortFix(
61   ctx context.Context, w, h int64) {
62   err := emu.SetDeviceMetricsOverride(
63     w, h, 1, false).
64     WithScreenOrientation(
65       &emu.ScreenOrientation{
66         Type:
67           emu.OrientationTypePortraitPrimary,
68         Angle: 0,
69       }).
70     Do(ctx)
71
72   if err != nil {
73     panic(err)
74   }
75 }
The whole thing runs at the command line if you type go build screenshot.go followed by ./screenshot. The user will not see a browser pop up, because chromedp normally runs in headless (i.e., invisible) mode, unless otherwise configured. The following command gets the required library code from GitHub and also compiles and installs it:
$ go get -u github.com/chromedp/chromedp
It takes the compiled program in Listing 1 a few seconds to retrieve the page, depending on your Internet connection and the current server speed; then it saves an image file in PNG format named screenshot.png to the hard disk as a result. Since the Linux Magazine homepage fills several browser pages in terms of length, giving users a reason to scroll down and explore, the screenshot in Figure 1 is almost 3000 pixels tall.
Listing 1 creates a new chromedp context in line 13 and gives the constructor a standard Go background context, which is an auxiliary construct for controlling goroutines and subroutines. Along with the context, the constructor returns a cancel() function. This function can be called by the main program later to signal to another (maybe deeply) nested part of the program that it is time to clean up, because doors are being closed.
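The cancel mechanism can be tried out without a browser. The following sketch uses only the standard context package; the worker() and runAndCancel() functions are hypothetical stand-ins for a nested part of a program that waits for the signal to clean up.

```go
package main

import (
	"context"
	"fmt"
)

// worker blocks until ctx is cancelled, then reports why it stopped.
func worker(ctx context.Context, done chan<- error) {
	<-ctx.Done()
	done <- ctx.Err()
}

// runAndCancel starts a worker in a goroutine, cancels its context, and
// returns the reason the worker reported for stopping.
func runAndCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan error)

	go worker(ctx, done)
	cancel() // signal: time to clean up

	return <-done
}

func main() {
	fmt.Println("worker stopped:", runAndCancel())
}
```

The worker never sees the cancel() function itself. It only holds the context, which is why a context can be handed down through many layers of a program while the right to end the operation stays with the caller.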
The Tasks structure starting on line 17 defines a set of actions that you want the connected Chrome browser to perform, using the DevTools protocol. The Navigate task starting on line 18 directs the browser to the Linux Magazine website. The second task, starting in line 20, is created by the ActionFunc() function, a tool for structuring new customized tasks in chromedp. In this case, the task creates a screenshot of the web page displayed in the remote browser using the function CaptureScreenshot() in line 33.
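The idea behind tasks and function-based actions can be modeled in a few lines of plain Go. The sketch below is a simplified stand-in, not chromedp's source code: the lowercase names action, actionFunc, and tasks are hypothetical, and the steps only record their names instead of talking to a browser. Unlike Listing 1, the custom step returns its error instead of calling panic(), so the caller can decide what to do.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// action is anything that can be executed with a context.
type action interface {
	Do(ctx context.Context) error
}

// actionFunc adapts an ordinary function to the action interface.
type actionFunc func(ctx context.Context) error

func (f actionFunc) Do(ctx context.Context) error { return f(ctx) }

// tasks is a list of actions that is itself an action: it runs its
// members in order and stops at the first error.
type tasks []action

func (t tasks) Do(ctx context.Context) error {
	for _, a := range t {
		if err := a.Do(ctx); err != nil {
			return err
		}
	}
	return nil
}

// runDemo executes three named steps and returns the names of the steps
// that actually ran. If failSecond is set, the second step fails.
func runDemo(failSecond bool) ([]string, error) {
	var trace []string
	step := func(name string, err error) action {
		return actionFunc(func(ctx context.Context) error {
			trace = append(trace, name)
			return err
		})
	}

	var second error
	if failSecond {
		second = errors.New("screenshot failed")
	}
	t := tasks{
		step("navigate", nil),
		step("screenshot", second),
		step("save", nil),
	}
	err := t.Do(context.Background())
	return trace, err
}

func main() {
	trace, err := runDemo(true)
	fmt.Println(trace, err)
}
```

Because the list stops at the first failing step, the "save" step is never reached when the screenshot step reports an error.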
Wide Open Spaces
Now the question is how far to open the virtual browser, because this setting determines what you see in the screenshot. Is only a fraction of the web page visible or all of it, including the parts that can only be reached by scrolling? If it's the latter, the screenshot needs to capture everything that the user would see if they had an infinitely tall screen with the browser fully extended.
To capture it all, the GetLayoutMetrics() function calculates the layout dimensions of the displayed page, and the viewPortFix() function (called in line 31 and defined in line 60) uses SetDeviceMetricsOverride() to adjust the dimensions of the invisible browser. The buf image buffer returned by the CaptureScreenshot() function in line 33 is written to disk in PNG format by WriteFile() in line 53. The sequence of actions, starting with navigating to the page and followed by taking the screenshot, is processed by the Run() function starting in line 48.
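One way to check that the whole page ended up in the image is to read the dimensions from the PNG header. The following sketch uses only the standard library. The pngSize() helper is hypothetical, and fakeScreenshot() generates an in-memory image with made-up dimensions as a stand-in for the file that Listing 1 writes.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
)

// pngSize decodes only the PNG header in data and returns the image's
// width and height in pixels.
func pngSize(data []byte) (int, int, error) {
	cfg, err := png.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

// fakeScreenshot builds an empty in-memory PNG of the given size. It
// stands in for a real screenshot so the example runs without a browser.
func fakeScreenshot(w, h int) ([]byte, error) {
	var buf bytes.Buffer
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	if err := png.Encode(&buf, img); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := fakeScreenshot(1200, 2900)
	if err != nil {
		panic(err)
	}
	w, h, err := pngSize(data)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d x %d pixels\n", w, h)
}
```

For a real run, the bytes would come from reading screenshot.png from disk (for example with os.ReadFile()) instead of from fakeScreenshot().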
The technique of creating screenshots of automatically fetched web pages opens up a number of unheard-of possibilities when testing newly developed web UIs. For example, image recognition can later determine whether the site's various graphic elements are in the right place with different browser sizes, without human test personnel having to click their way through the flow with every release. It could also be used to implement a neat system for archiving websites; in the next century, historians would surely be amused by the advertisements placed on the Linux Magazine homepage in 2020.
Complicating Easy Things
For test purposes, it would be quite useful at times to start the remote browser visibly in the foreground instead of hidden in the background. Developers of scraping applications can thus determine if the browser is stepping through or if it gets stuck at some point. Paradoxically, however, setting up foreground mode has become quite complicated since chromedp introduced background mode as the default some time ago: Using NewContext() to create a new browser context configures the browser to run in background mode deep down in the library's engine compartment, which is inaccessible from outside.
This is why Listing 2 creates a new browser controller in the form of NewExecAllocator() and passes it only the NoFirstRun option. Because the headless option that chromedp sets by default is not among the options passed in, the browser runs in the foreground. Back comes a context, but, alas, not a context compatible with the context object that chromedp uses and gives to Run() in line 24 of the executing function. Therefore, line 12 creates a compatible context via NewContext() and passes it the previously created Exec context as a parent context. The new chromedp context also has a cancel() function, and the defer statements in lines 13 and 14 are both triggered at the end of the program to neatly collapse the remote-controlled browser.
Listing 2
foreground.go
01 package main
02
03 import (
04   "context"
05   cdp "github.com/chromedp/chromedp"
06   "time"
07 )
08
09 func main() {
10   pctx, pcancel := cdp.NewExecAllocator(
11     context.Background(), cdp.NoFirstRun)
12   ctx, cancel := cdp.NewContext(pctx)
13   defer cancel()
14   defer pcancel()
15
16   tasks := cdp.Tasks{
17     cdp.Navigate(
18       "https://linux-magazin.de"),
19     cdp.Navigate(
20       "http://linux-magazine.com"),
21     cdp.Sleep(5 * time.Second),
22   }
23
24   err := cdp.Run(ctx, tasks)
25   if err != nil {
26     panic(err)
27   }
28 }
Listing 2 only accesses the homepages of the German and English versions of Linux Magazine for this test; it then Sleep()s for five seconds and terminates.
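The parent and child relationship between the two contexts in Listing 2, and the order in which its two deferred cancel functions fire, can be explored with the standard library alone. The sketch below makes no claims about chromedp internals: it uses plain context.WithCancel() contexts as stand-ins, and the function names childFollowsParent() and deferOrder() are hypothetical.

```go
package main

import (
	"context"
	"fmt"
)

// childFollowsParent derives a child context from a parent, cancels only
// the parent, and returns the child's resulting error state.
func childFollowsParent() error {
	parent, pcancel := context.WithCancel(context.Background())
	child, cancel := context.WithCancel(parent)
	defer cancel()

	pcancel()      // cancel the parent only
	<-child.Done() // the child is cancelled as a consequence
	return child.Err()
}

// deferOrder registers two deferred calls in the same order as lines 13
// and 14 of Listing 2 and records the order in which they actually run.
func deferOrder() (order []string) {
	defer func() { order = append(order, "cancel") }()
	defer func() { order = append(order, "pcancel") }()
	return nil
}

func main() {
	fmt.Println("child state:", childFollowsParent())
	fmt.Println("deferred calls ran as:", deferOrder())
}
```

Go runs deferred calls in last-in, first-out order, so with the ordering used in Listing 2, pcancel() fires before cancel(). Cancelling a parent context also cancels every context derived from it.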