Your First Scrapy Spider

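Learning objectives:

Install Scrapy
Use Scrapy to scrape content from specific pages and save the results as JSON or CSV files
Use the Scrapy shell
Use CSS selectors and XPath
Follow the "next page" link while crawling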

Installing Scrapy on Windows

Installing by hand is tedious: you first need to install lxml and pywin32, and only then install Scrapy.

pywin32 download: https://sourceforge.net/projects/pywin32/files/pywin32/Build%20221/
lxml download: https://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml

If you want to install under Python 3.x, see here.
For a video walkthrough, see here.

For a one-step install, first install Anaconda, then run CMD as administrator and enter the following command:

conda install scrapy

Once the installation finishes, open "Start > Programs > Anaconda" and launch "Anaconda Prompt" from the Anaconda program group.

1. Creating a project

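Before you can scrape anything with Scrapy, you need to create a project. Change to the directory where you want to keep your Scrapy code and run: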

scrapy startproject tutorial

This command creates a folder named tutorial.

Its structure looks like this:

tutorial/
    scrapy.cfg            # deploy configuration file

    tutorial/             # project's Python module, you'll import your code from here
        __init__.py

        items.py          # project items definition file

        pipelines.py      # project pipelines file

        settings.py       # project settings file

        spiders/          # a directory where you'll later put your spiders
            __init__.py

2. Writing your first spider

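A spider must subclass scrapy.Spider. It can also define how to follow links on a page and how to parse the page content.

Using PyCharm, create a file named quotes_spider.py and save it in the tutorial/spiders folder of the project you created earlier. Make sure it goes in the right place, otherwise the spider cannot be run. The code for quotes_spider.py is as follows: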

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

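name: the name of the spider. It must be unique within a project.

start_requests(): must return an iterable of Requests (a list or a generator) from which the spider starts crawling.

parse(): the method called to handle the response downloaded for each request. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.

parse() usually parses the response, extracts the scraped data as dicts, finds new URLs to follow, and creates new requests from them.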
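The filename logic in parse() relies on the URL ending with a slash, so the page number is the second-to-last piece after splitting on "/". A small standalone sketch in plain Python (no Scrapy needed; the helper name page_filename is hypothetical and only used for illustration):

```python
def page_filename(url):
    """Build the output filename the same way parse() does."""
    # 'http://quotes.toscrape.com/page/1/'.split('/') ends with [..., 'page', '1', '']
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/2/"))  # quotes-2.html
```

If a URL had no trailing slash, the index -2 would pick a different piece, so this trick is tied to the URL layout of this particular site.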

Tip:

Running the command below makes Scrapy generate a spider file for you automatically.

scrapy genspider itcast "http://www.itcast.cn"

3. Running the spider

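Inside the project you can use the scrapy command to manage and control it.

Run "scrapy <command> -h" to see how a particular command is used, or run scrapy -h to list all commands.

Go to the project's root directory and run: scrapy crawl quotes. This runs the spider whose name is "quotes".

In the root directory you will now find two new files: quotes-1.html and quotes-2.html.

Open them to see the result.

You may say this is not really what you want. If so, read on.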

4. A shortcut for the start URLs

Instead of implementing start_requests(), you can define a start_urls class attribute:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

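Even though we never told Scrapy explicitly what to do, parse() is called to handle the response for each of these URLs. This is because parse() is Scrapy's default callback: when a request has no callback assigned, parse() is used automatically.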

5. Extracting data

The best way to learn how to extract data is to try selectors in the Scrapy shell.

On Windows, run:

scrapy shell "http://quotes.toscrape.com/page/1/"

Note the double quotes. On other systems, use single quotes.

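In this shell you can use CSS to select specific elements of the response. For example, enter
response.css('title') to select the page's <title> element.

The result of running response.css('title') is a list-like object called a SelectorList, which lets you extract data further.
For example, to get the content of the title, use response.css('title::text').extract().

Two things are worth noting here. First, we added "::text", which means we only want the text inside the <title> element. Without
"::text" we would get the full element, tags included: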

response.css('title').extract()
['<title>Quotes to Scrape</title>']

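Second, .extract() returns a list. If you only want its first element, use
response.css('title::text').extract_first()
or:
response.css('title::text')[0].extract()

Unlike indexing with [0], .extract_first() avoids an IndexError and returns None when no element is found.

There is a lesson here: when writing scraping code, you usually want it to tolerate missing elements, so that even if some parts fail to be scraped, you still get the rest of the data.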

Besides extract() and extract_first(), you can also use the .re() method to extract data with regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']

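6. A brief introduction to XPath

Besides CSS, Scrapy also supports XPath expressions for selecting data.

XPath is more powerful than CSS. For instance, it can select the link that contains the text "Next Page", which is very handy when crawling. So even if you already know how to extract data with CSS, I still recommend learning XPath, because it makes scraping much easier.

The command to exit the shell is: exit()

In addition, you can count how many elements were matched with a command like the following: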

len(response.xpath(".//div[@class='f-list-item']/dl/dd[1]/a/text()").extract())

Below are some examples of XPath expressions and what they mean:

To learn more about XPath, see:

the Scrapy documentation on using XPath with Scrapy Selectors

a tutorial that teaches XPath through examples

a tutorial on "how to think in XPath"

a Chinese-language XPath tutorial

7. Extracting quotes and authors

Each quote on the page is marked up roughly like this:

<div class="quote">
    <span class="text">“The world as we have created it is a process of our
    thinking. It cannot be changed without changing our thinking.”</span>
    <span>
        by <small class="author">Albert Einstein</small>
        <a href="/author/Albert-Einstein">(about)</a>
    </span>
    <div class="tags">
        Tags:
        <a class="tag" href="/tag/change/page/1/">change</a>
        <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>
        <a class="tag" href="/tag/thinking/page/1/">thinking</a>
        <a class="tag" href="/tag/world/page/1/">world</a>
    </div>
</div>

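On Windows, open the shell from a CMD window with: scrapy shell "http://quotes.toscrape.com"

Entering response.css("div.quote") returns a list of selectors, and quote = response.css("div.quote")[0] gives you the first one.

Now extract the text, author and tags from that quote.

The tags are a list of strings, so we can use .extract() to get all of them.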

Now let's iterate over all the quotes and put each one into a Python dict.

>>> for quote in response.css("div.quote"):
...     text = quote.css("span.text::text").extract_first()
...     author = quote.css("small.author::text").extract_first()
...     tags = quote.css("div.tags a.tag::text").extract()
...     print(dict(text=text, author=author, tags=tags))
{'tags': ['change', 'deep-thoughts', 'thinking', 'world'], 'author': 'Albert Einstein', 'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'}
{'tags': ['abilities', 'choices'], 'author': 'J.K. Rowling', 'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”'}
    ... a few more of these, omitted for brevity
>>>

8. Extracting data in the spider

In the spider, we use yield to return the extracted data.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

The log output looks like this:

2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['life', 'love'], 'author': 'André Gide', 'text': '“It is better to be hated for what you are than to be loved for what you are not.”'}
2016-09-19 18:57:19 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'tags': ['edison', 'failure', 'inspirational', 'paraphrased'], 'author': 'Thomas A. Edison', 'text': "“I have not failed. I've just found 10,000 ways that won't work.”"}

9. Storing the scraped data
The simplest way is to use Feed exports, with the following command:

scrapy crawl quotes -o quotes.json

This generates a quotes.json file containing all the scraped items, serialized as JSON.

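For historical reasons, Scrapy appends to an existing file instead of overwriting it. If you run this command twice without deleting the file first, you end up with a broken JSON file.
To avoid this problem, you can use JSON Lines instead: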

scrapy crawl quotes -o quotes.jl
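Why appending breaks a JSON file but not a JSON Lines file can be shown without Scrapy. The sketch below uses only the standard library and simulates two runs by concatenating their output; the two sample items are made up for illustration.

```python
import json

batch = [{"author": "Albert Einstein"}, {"author": "J.K. Rowling"}]

# Two runs appended to one .json file: two JSON arrays back to back.
json_file = json.dumps(batch) + "\n" + json.dumps(batch)
try:
    json.loads(json_file)
    json_ok = True
except json.JSONDecodeError:
    json_ok = False

# Two runs appended to one .jl file: one independent JSON object per line.
jl_run = "".join(json.dumps(item) + "\n" for item in batch)
jl_file = jl_run + jl_run
items = [json.loads(line) for line in jl_file.splitlines()]

print(json_ok)     # False
print(len(items))  # 4
```

Because every line of a JSON Lines file is a complete JSON document on its own, new records can be appended at any time and the file can be processed line by line.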

Likewise, running:

scrapy crawl quotes -o quotes.csv

generates a CSV file.

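Note:
If Chinese text looks garbled when you open the CSV file, you can convert the file's encoding in a text editor such as Notepad and save it as CSV again, after which it displays correctly.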
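The same conversion can be scripted. This is a sketch under an assumption: that the garbling comes from the spreadsheet program not recognising UTF-8 without a byte order mark (BOM). The helper add_utf8_bom and the sample file contents are hypothetical and only serve the demonstration.

```python
import os
import tempfile


def add_utf8_bom(src_path, dst_path):
    """Rewrite a UTF-8 text file so that it starts with a byte order mark."""
    # 'utf-8-sig' on reading tolerates input that already carries a BOM.
    with open(src_path, "r", encoding="utf-8-sig", newline="") as src:
        content = src.read()
    # 'utf-8-sig' on writing emits the BOM once at the start of the file.
    with open(dst_path, "w", encoding="utf-8-sig", newline="") as dst:
        dst.write(content)


# Demonstration on a throwaway file standing in for quotes.csv.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "quotes.csv")
dst = os.path.join(workdir, "quotes_bom.csv")
with open(src, "w", encoding="utf-8", newline="") as f:
    f.write("author,text\r\n鲁迅,示例\r\n")

add_utf8_bom(src, dst)
with open(dst, "rb") as f:
    print(f.read()[:3])  # b'\xef\xbb\xbf'
```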

For small projects this is enough. For more complex processing of scraped items, you can write an Item Pipeline.

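10. Following links
Suppose we want to scrape every page of a website rather than just two pages. Let's see how to follow links found on a page.
First we need to extract the link. Looking at the page source, we can see that the link to the next page is marked up like this: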

<ul class="pager">
    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
    </li>
</ul>

We can try extracting it in the shell:

>>> response.css('li.next a').extract_first()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

To get the value of the href attribute, use:

>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'

Now look at the spider code:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

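After extracting the data, parse() looks for the link to the next page, uses urljoin() to build an absolute URL from it, and yields a new request for the next page, with itself as the callback. This repeats until no next-page link is found.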

For the original text and more examples, see:

https://doc.scrapy.org/en/latest/intro/tutorial.html#storing-the-scraped-data

Chinese version of the tutorial

11. Logging in with Scrapy
