Common Crawl
FAQ
Download the entire web to kick-start a data science empire.
Q Is this some new swimming stroke that's all the rage?
A Is that really the best guess you can come up with? The Common Crawl project [1] scrapes the web, sucking up as much information as possible, and makes this data available for anyone who wants to use it. Data is released approximately every month and goes back to 2007.
Q They scrape the web for pages that are accessible to the public and make this data available to the public? What exactly is this meant to achieve?
[...]
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Two New Distros Adopt Enlightenment
MX Moksha and AV Linux 25 join ranks with Bodhi Linux and embrace the Enlightenment desktop.
-
Solus Linux 4.8 Removes Python 2
Solus Linux 4.8 has been released with the latest Linux kernel, updated desktops, and a key removal.
-
Zorin OS 18 Hits over a Million Downloads
If you doubt Linux isn't gaining popularity, you only have to look at Zorin OS's download numbers.
-
TUXEDO Computers Scraps Snapdragon X1E-Based Laptop
Due to issues with a Snapdragon CPU, TUXEDO Computers has cancelled its plans to release a laptop based on this elite hardware.
-
Debian Unleashes Debian Libre Live
Debian Libre Live keeps your machine free of proprietary software.
-
Valve Announces Pending Release of Steam Machine
Shout it to the heavens: Steam Machine, powered by Linux, is set to arrive in 2026.
-
Happy Birthday, ADMIN Magazine!
ADMIN is celebrating its 15th anniversary with issue #90.
-
Another Linux Malware Discovered
Russian hackers use Hyper-V to hide malware within Linux virtual machines.
-
TUXEDO Computers Announces a New InfinityBook
TUXEDO Computers is at it again with a new InfinityBook that will meet your professional and gaming needs.
-
SUSE Dives into the Agentic AI Pool
SUSE becomes the first open source company to adopt agentic AI with SUSE Enterprise Linux 16.

