Common Crawl


Article from Issue 200/2017

Download the entire web to kick-start a data science empire.

Q Is this some new swimming stroke that's all the rage?

A Is that really the best guess you can come up with? The Common Crawl project [1] scrapes the web, sucking up as much information as possible, and makes this data available for anyone who wants to use it. Data is released approximately every month and goes back to 2007.
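Those monthly releases can be explored with nothing more than the standard library. Here is a minimal sketch that builds the URL of a crawl's gzipped WARC path listing and fetches the first few entries; the crawl label (`CC-MAIN-2017-04`) and the public HTTPS mirror (`data.commoncrawl.org`) are assumptions you should verify against the project's current documentation:

```python
import gzip
import urllib.request

# Assumed layout: each monthly crawl publishes a gzipped list of its
# WARC file paths under crawl-data/<CRAWL>/warc.paths.gz. The crawl
# label and mirror hostname below are illustrative, not guaranteed.
CRAWL = "CC-MAIN-2017-04"
BASE = "https://data.commoncrawl.org/"

def warc_paths_url(crawl: str) -> str:
    """Build the URL of the gzipped WARC path listing for one crawl."""
    return f"{BASE}crawl-data/{crawl}/warc.paths.gz"

def list_warc_files(crawl: str, limit: int = 5) -> list:
    """Download the path listing and return the first few WARC paths."""
    with urllib.request.urlopen(warc_paths_url(crawl)) as resp:
        data = gzip.decompress(resp.read())
    return data.decode("utf-8").splitlines()[:limit]

if __name__ == "__main__":
    for path in list_warc_files(CRAWL):
        print(BASE + path)
```

Each crawl consists of tens of thousands of multi-gigabyte WARC archives, so in practice you pick a handful of paths from the listing and download only those rather than mirroring a whole release.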

Q They scrape the web for pages that are accessible to the public and make this data available to the public? What exactly is this meant to achieve?



