Building a Web Crawler with Scrapy


These days, crawling data from a website is not as complicated as it used to be. In fact, you can build a simple crawler in an hour or less with a scripting language such as PHP, or even JavaScript (on the server side). In this post, I will introduce another approach: using a Python framework, Scrapy, to build a web crawler. Okay, let’s dive in.

0. Why Scrapy?

You might ask why I chose a Python framework instead of some other “faster” solution. Well, when I first found Scrapy, I needed a framework that was quick to deploy rather than one tuned for raw performance. I didn’t even know Python at the time, and Scrapy doesn’t demand much of it. However, when I fired up my first crawler, its performance amazed me: it’s insanely fast!

1. Installing Scrapy

Before installing Scrapy, make sure you have already installed Python (the latest version). What Python is and how to install it are beyond the scope of this post. However, if you are Vietnamese, ViThon is a great place to start; otherwise, please visit http://www.python.org/ to catch a glimpse of what it is.

After installing Python, there are several options for you:

  • Install Scrapy from source (if that rings a bell, you are awesome!).
  • Install Scrapy from the Ubuntu packages (if and only if you are using Ubuntu, Kubuntu and so on).
  • Install Scrapy via Python’s pip.

The third option is by far the most general method, and it works on Mac OS X, Linux and Windows. Okay, first, follow the instructions on this page to install pip: http://www.pip-installer.org/en/latest/installing.html.

Once you have installed pip, run the command below:
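
    pip install scrapy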

If there are errors relating to permissions, add sudo to the beginning of the command. At this point, everything is ready for the meal.

2. Creating a new crawling project

Open the terminal (or whatever you may call it in your OS) and change to the folder you want your project to live in. Then execute the following command:
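
    scrapy startproject <your_project_name>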

Our example today is to crawl all the post titles on http://code.tutsplus.com/ (great, huh?), so let’s name the project “tutsplus”:
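
    scrapy startproject tutsplus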

After creating the project, we have a folder which stores almost everything we want:

Folder organization - Scrapy
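
Roughly, the generated project looks like this (the exact files may vary slightly between Scrapy versions):

    tutsplus/
        scrapy.cfg            # deploy configuration file
        tutsplus/             # the project's Python module
            __init__.py
            items.py          # item definitions live here
            pipelines.py
            settings.py
            spiders/          # our spiders will live here
                __init__.py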

There are several files and one folder named “spiders”. I don’t want to make a mess here, so I just want you to concentrate on two things: the items.py file and the spiders folder.

3. Define your crawl object

When working with Scrapy, you must specify what you want to get after crawling, which is called an item (another term that might be familiar to you is “model”). To do this, open the items.py file and have a look at it:
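
Freshly generated, it looks more or less like this:

    import scrapy


    class TutsplusItem(scrapy.Item):
        # define the fields for your item here like:
        # name = scrapy.Field()
        pass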

The TutsplusItem class, which is derived from the Item class, stores the data you want to get. Each field in this class must be an instance of the Field class, as above. In our example, we want to get only the title of a post, so there is only one field:
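
    import scrapy


    class TutsplusItem(scrapy.Item):
        title = scrapy.Field()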

I removed the pass statement because it is meaningless here.

4. The Spider

Okay, we now have a model; all we have to do is create a “spider”, which controls the “crawl” action. Open the spiders folder and create a file named tutspider.py (you can name it whatever you want). First, we must import some needed classes:
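
With Scrapy 1.1 they can be written like this:

    from scrapy.spiders import Spider
    from tutsplus.items import TutsplusItem
    from scrapy.http import Request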

Spider is the basic crawling class. TutsplusItem is the model we’ve just created. The Request class enables us to crawl pages recursively.

The next thing is the heart of our crawler: the spider class. It is a class derived from Spider with three fields and one method:

  • name: the name of your spider, which is used to launch it (more on that later). You can name it whatever you want; according to my taste, I will call it “nettuts”.
  • allowed_domains: a list of domains the crawler is allowed to crawl. Any domain not in this list is not available for crawling. This field is optional.
  • start_urls: a list of URLs which will be the roots of later crawls.
  • parse(self, response): the main method, which is invoked by the Spider class and contains the main logic of our crawler.

An example of filling in the first three fields:
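
Here I simply start from the site’s front page (the exact start URL is up to you):

    class TutsplusSpider(Spider):
        name = "nettuts"
        allowed_domains = ["code.tutsplus.com"]
        start_urls = ["http://code.tutsplus.com/"]  # root page(s) to start crawling from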

That is half of a spider; the next important thing is implementing the parse method of TutsplusSpider. First of all, you must figure out what data you want to get from the website. In our case, it is the post title, which can easily be found with “Inspect Element”, as follows:

Inspect element - Scrapy

Obviously, the title is the text of an h1 tag nested inside an anchor tag. So we need a way to extract that structure, and the answer is an XPath selector. For those who haven’t heard of XPath before: it is a common syntax for traversing XML documents in general, and HTML in our case. How to construct an XPath expression is beyond the scope of this post; please visit w3schools to get more information. So, for the above requirement, we have this XPath:
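
    //a[@class="posts__post-title "]/h1/text()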

Notice the space after the class name: there is a space right after posts__post-title in the class attribute of the anchor tag, so you have to include it in the XPath as well. Another way to handle this is to use contains():
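
    //a[contains(@class, "posts__post-title")]/h1/text()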

And here is how we use the XPath in our spider (this is also the parse method):
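
A minimal version, using the contains() form, looks like this:

    def parse(self, response):
        # Grab every post title on the page and wrap each one in an item.
        for title in response.xpath("//a[contains(@class, 'posts__post-title')]/h1/text()").extract():
            item = TutsplusItem()
            item["title"] = title
            yield item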

The full code of our spider:
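
    from scrapy.spiders import Spider
    from tutsplus.items import TutsplusItem
    from scrapy.http import Request  # not used yet, but we will need it for recursive crawling


    class TutsplusSpider(Spider):
        name = "nettuts"
        allowed_domains = ["code.tutsplus.com"]
        start_urls = ["http://code.tutsplus.com/"]

        def parse(self, response):
            # Grab every post title on the page and wrap each one in an item.
            for title in response.xpath("//a[contains(@class, 'posts__post-title')]/h1/text()").extract():
                item = TutsplusItem()
                item["title"] = title
                yield item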

5. Launch the spider

Okay, we can now launch it. Execute the following command (make sure you are in your project folder, which in our case is tutsplus):
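
    scrapy crawl nettuts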

Ah yeah, you will see the results printed in your terminal like this:

Command results - Scrapy

“Man! This is too messy! I only want tidy things!” Please calm down: if you want to write the output to a well-formatted file, there are options for you. Say we want a CSV file with the above data; execute the following command:
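
Here output.csv is just a file name I picked; call it whatever you like:

    scrapy crawl nettuts -o output.csv -t csv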

It will then wrap your results in a very neat CSV file. The -o option is followed by the output file name, and -t sets the output format. Besides CSV, you can also output XML or JSON. It’s too damn convenient!

“But wait… is there something wrong? The output contains only a few posts!” Yes, I know; please read on.

6. Recursive crawling

The previous spider only investigates the root page, without adding new links to crawl. In order to crawl the site recursively, we must extract the href attribute of every anchor tag in the HTML and yield it for the spider to process (I know! I know, geeks! There is more to it than that, but I want to keep it simple, okay?). Here is what I changed in the previous spider’s source code:
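
Roughly, the modified spider looks like this. I keep the already-scheduled links in a set called crawledLinks, and the long URL-validating regex is the one discussed in the comments below; treat the exact names and pattern as one possible way to write it:

    import re

    from scrapy.spiders import Spider
    from tutsplus.items import TutsplusItem
    from scrapy.http import Request


    class TutsplusSpider(Spider):
        name = "nettuts"
        allowed_domains = ["code.tutsplus.com"]
        start_urls = ["http://code.tutsplus.com/"]

        # Links that have already been scheduled, so we don't crawl a page twice.
        crawledLinks = set()

        # Accept only absolute ftp/http/https URLs; relative links won't match.
        linkPattern = re.compile(r"^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$")

        def parse(self, response):
            # Follow every valid link that we haven't visited yet.
            for href in response.xpath("//a/@href").extract():
                if self.linkPattern.match(href) and href not in self.crawledLinks:
                    self.crawledLinks.add(href)
                    yield Request(href, callback=self.parse)

            # Collect the post titles on the current page.
            for title in response.xpath("//a[contains(@class, 'posts__post-title')]/h1/text()").extract():
                item = TutsplusItem()
                item["title"] = title
                yield item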

Don’t rush! First, I added the re module to check whether each link I get is valid or not; this keeps out relative links. Second, I extracted all the links on the page and started checking: if a link is valid and hasn’t been put into crawledLinks yet, I yield a new Request to crawl that link. Finally, I extracted all the titles on the current page and yielded them as TutsplusItem objects.

It is easy and fast, huh? However, if you seriously want to crawl Tuts+, you will have to change the code, because the website prevents you from accessing too many pages in a short amount of time. Consider it an exercise. Have a nice day, geeks!

References

  1. http://scrapy.org/
  2. http://www.w3schools.com/xsl/xpath_intro.asp
  3. http://www.python.org/

Updated for Scrapy 1.1.2 on August 25, 2016 by Hong Nguyen





  • tttoan

    Nice article, glad to catch someone on the same path as me! I have used Scrapy for a while, together with some learning tools. It’s a great combination for extracting links and retrieving data from pages automatically :)

    • Hoài Thương

      Thank you for your attention. We’re looking forward to hearing about the other tools from you. Cheers!

  • Jerry Wu

    I am new to scrapy and reading your tutorial. Everything is nice for me except the regex part. I googled for regex but still cannot understand your re.compile part. Could you explain a bit?

    Thanks.

    • minhdanh

      Hi Jerry,

      First of all, “re” is a Python module for Regular Expression (regex) operations. The “re.compile” function compiles a regular expression into an object so that it can be used later (for matching a string). I guess the confusing thing here is the value (the regex) which this function receives:
      “^(?:ftp|http|https):\/\/(?:[\w\.\-\+]+:{0,1}[\w\.\-\+]*@)?(?:[a-z0-9\-\.]+)(?::[0-9]+)?(?:\/|\/(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+)|\?(?:[\w#!:\.\?\+=&%@!\-\/\(\)]+))?$”
      I admit that this regex is rather complex to explain. Its purpose is to check whether a string is a proper URL (so that the crawler will crawl that URL). It checks whether the string begins with “ftp”, “http” or “https”, followed by the kinds of characters that may appear in a URL. For a new user like you, I recommend using a regex explainer tool: http://rick.measham.id.au/paste/explain.pl or http://regex101.com/ (just paste the regex and see the explanation).

      • Jerry Wu

        Thanks for your tips. I rewrote the regex part to be much more straightforward. Since the URL I want is like http://code.tutsplus.com/posts?page=2, I want the pattern to match that URL. Mine is ^(?:ftp|http|https):\/\/(?:code)\.(?:tutsplus)\.(?:com)\/(?:posts)\?(?:page)=\d+$ and it seems good when I try it on regex101.com.

        I crawl the links directly from the bottom navigation part, and my regex seems to work well for that. However, when I send them back with the Request function, something seems to be wrong, because the spider doesn’t go to those pages to crawl. Could you take a look for me?

        Here is my spider code: http://pastebin.com/RHrvwfxK

        Thanks.

        • Hi,

          I’ve had a look at your code. The reason why your spider did not work is that it included the following line:
          allowed_domains = ["http://code.tutsplus.com/"]

          This makes the spider crawl URLs that match “http://code.tutsplus.com/” in their HOST NAME only, so in the crawler’s output you should get this message: DEBUG: Filtered offsite request to 'code.tutsplus.com':

          The correct directive should be like this:
          allowed_domains = ["code.tutsplus.com"]

          • Jerry Wu

            Thanks for your patience. My code works well after being revised. But I still have two questions.

            1. Does “http://code.tutsplus.com/posts?page=2” have a different HOST NAME than “http://code.tutsplus.com/”? I googled and found the same problem others met and their advice, but there seems to be no answer as to why the HOST NAMEs would be different.

            2. In line 31 of my code, I yield a Request. Will this interrupt the code below it (lines 33-37)? What I know about yield is that it is like a return, but it also works as an iterator. What I am thinking is that after calling yield Request, the link will be sent to the downloader and lines 33-37 will run right after that; the self.parse function will not be called for that link until its response has been downloaded, after parse (lines 33-37) finishes on the current page. Am I right?

          • 1. The answer is “no”, they’re not different. “http://code.tutsplus.com/posts…” is a URL. And the host name of the URL is “code.tutsplus.com”. So it’s the same for both of your URLs. More information: http://en.wikipedia.org/wiki/Hostname

            2. This is a good question. As far as I know, line 31 will not interrupt the code below it; it simply creates a new Request object and then keeps moving to lines 33, 34… According to Scrapy’s documentation, the newly created Request will then be processed by Scrapy, which is why I think you’re right.
            http://doc.scrapy.org/en/latest/topics/spiders.html#spiders

  • Really awesome article! I’m new to scrapy and reading your tutorial. After I installed scrapy and ran your code, the error below was printed:

    File "D:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\utils\misc.py", line 42, in load_object
      raise ImportError("Error loading object '%s': %s" % (path, e))
    ImportError: Error loading object 'scrapy.core.downloader.handlers.s3.S3DownloadHandler': DLL load failed: %1 is not a valid Win32 application.

  • Hi Hoài, wonderful tutorial! I added this post to my list of Scrapy web crawler tutorials. Thank you for the great resource.


  • Andrius Žilėnas

    Thanks for the good tutorial, but the images disappeared

    • Hi Andrius,

      Thanks for letting us know. We have fixed the images and updated the content for the latest Scrapy.

  • Ubald Kuijpers

    Thanks, I just started with Scrapy and this is the first spider that worked well

  • Gustavo Rangel

    Amazing guide, I’m new at all this. When I run the first code, without the recursive part, I get the error “no module named items”, related to “tutsplus.items” in the second line. I tried to use another name instead of “tutsplus”, but I don’t think that’s the problem.
    I tried a lot of things, but I can’t get past this error. If anyone can help.


  • Vaibs

    I have been using phpcrawl; its performance is very poor, but the integration with web technologies is good. I will give Scrapy a try and, if it works out, will integrate it with the front end. Thanks. My site with an online web crawler: iseebug.com

  • HeyItsMe!

    Please help me select two elements in one result, for example Title and Description.

  • J2

    Thanks for this, found it really useful! I have c. 50+ items to scrape. In items.py I have added them all like this: item_name = scrapy.Field(), and then within the parse function like this: item["item_name"] = item_name. Will I need to do a for loop for each one of these, or is there a way around that?

    • J2

      Can I just do this? item["item_name"] = response.xpath('//h1/text()').extract()
