
《Python网络爬虫实战》 Notes (XPath)


I. Python naming rules

II. XPath usage:


Note that XPath subscripts start at 1, not 0.
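
A quick demo of the 1-based indexing (my own snippet, assuming lxml):

from lxml import etree

dom = etree.HTML('<ul><li>a</li><li>b</li></ul>')
print(dom.xpath('//li[1]/text()'))  # ['a'] -- li[1] is the FIRST item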

Scraping images:

Tip:
What if you run into a "]"?

links = dom_tree.xpath("//a[@class='download']")  # dom_tree: an lxml tree parsed earlier; xpath() returns a list of matching nodes
for index in range(len(links)):
    # links[index] is an lxml Element, not a dict
    if (index % 2) == 0:
        print(links[index].tag)
        print(links[index].attrib)
        print(links[index].text)

For example, suppose the url is a magnet link. Then

print(links[index].tag)     # the tag name: a
print(links[index].attrib)  # the a tag's attributes: href and class
print(links[index].text)    # the a tag's text content

print the following, respectively:

a
{'href': 'magnet:?xt=urn:btih:7502edea0dfe9c2774f95118db3208a108fe10ca', 'class': 'download'}
磁力链接

Reference: https://www.cnblogs.com/z-x-y/p/8260213.html

Extracting hyperlinks

//a[@class="text--link"]/@href 

//span[@class='l fl']/a/@href   # extract the hyperlink
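
For instance, a minimal runnable demo of @href extraction (my own example, assuming lxml; the markup is made up):

from lxml import etree

dom = etree.HTML('<span class="l fl"><a href="/movie/123">link</a></span>')
print(dom.xpath("//span[@class='l fl']/a/@href"))  # ['/movie/123']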

Extracting multiple tags
For example, when an article body mixes h2, h3, and p tags, you can match them all with the expression below (note that each alternative in the union needs its own leading //). When I tried it, though, it gathered all the h2 elements first and then all the p elements, so the original document order was lost...
So in the end I fell back on a regular expression.

xpath('//*[@id="xxx"]/h2 | //*[@id="xxx"]/h3 | //*[@id="xxx"]/p')
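
Incidentally, selecting the children in a single pass keeps them in document order; a sketch of that alternative (my own, assuming lxml; dom_tree is the tree from the earlier snippet and the id is hypothetical):

nodes = dom_tree.xpath('//*[@id="xxx"]/*[self::h2 or self::h3 or self::p]')
for node in nodes:
    print(node.tag, node.text)  # h2/h3/p come back interleaved, in page order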

Special fields

ip.xpath('string(td[5])')[0].extract().strip()  # all the text inside the 5th table cell

ip.xpath('td[8]/div[@class="bar"]/@title').re('\d{0,2}\.\d{0,}')[0]  # pull the number out of <div class="bar" title="0.0885秒">

If the image urls sit under different parent nodes, join the alternative paths with the "|" operator in a single XPath expression.
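
For example (the class names here are hypothetical):

//div[@class="pic"]/img/@src | //div[@class="photo"]/img/@src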

Common problems:
(1) If q_urls = response.xpath('//div[@class="line content"]') shows matches in Chrome's XPath plugin but returns nothing in your code, use the following instead:

import requests
from lxml import etree

def get_detail(url):
    html = requests.get(url, headers=headers)  # headers: a browser-like dict defined elsewhere
    response = etree.HTML(html.content)
    q_urls = response.xpath('//div[@class="line content"]')
    result = q_urls[0].xpath('string(.)').strip()
    return result

(2) Inspecting an element

content = selector.xpath('//div[@class="metarial"]')[0] 

Reference:
https://www.cnblogs.com/just-do/p/9778941.html

(3) Garbled characters
If the Chinese text fetched with XPath comes out garbled, the following fixes it:

content=etree.tostring(content,encoding="utf-8").decode('utf-8')
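
Putting (2) and (3) together (my reconstruction; selector is an lxml tree parsed from the page):

content = selector.xpath('//div[@class="metarial"]')[0]  # an Element
content = etree.tostring(content, encoding="utf-8").decode('utf-8')  # its HTML as a utf-8 str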

Reference:
https://www.cnblogs.com/Rhythm-/p/11374832.html

Chapter 5: The Scrapy crawler framework
1. The __init__.py file is an empty file; it turns its parent directory into a package that Python can import.
2. items.py decides what to scrape, wuhanmoviespider.py decides how to crawl, settings.py decides who handles the scraped content, and pipelines.py decides how that content is processed.
3.

<h3>武汉<font color="#0066cc">今天</font>天气</h3>

The right selector is h3//text() rather than h3/text(): the double slash also picks up the text inside the nested font element, while the single slash only returns the h3's direct text children.
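
A minimal demo of the difference (my own snippet, assuming lxml):

from lxml import etree

h = etree.HTML('<h3>武汉<font color="#0066cc">今天</font>天气</h3>')
print(h.xpath('//h3/text()'))   # ['武汉', '天气'] -- direct text children only
print(h.xpath('//h3//text()'))  # ['武汉', '今天', '天气'] -- includes the <font> text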

4. Outputting JSON
The ITEM_PIPELINES entry in settings.py is a dictionary, and a dictionary can take new elements, so you can perfectly well write a Python file of your own and add it in.
(1) Create pipelines2json.py:

import time
import json
import codecs

class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.json'
        with codecs.open(fileName, 'a', encoding='utf8') as fp:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            fp.write(line)
        return item

(2) Edit settings.py and add the pipelines2json pipeline to ITEM_PIPELINES:

ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
}

5. Database login and table commands

I actually covered this earlier. There the pipeline itself was modified directly, whereas here a new pipelines2mysql.py does the inserting.

This section mainly records the database commands.

# Create the database scrapyDB, encoded as utf8; every statement ends with ';'
CREATE DATABASE scrapyDB CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';

# Switch to the database we just created:
use scrapyDB;

# Create the fields we need; they must correspond one-to-one with the fields in our code, so the SQL statements are easy to write:
CREATE TABLE weather(
id INT AUTO_INCREMENT,
date char(24),
week char(24),
img char(128),
temperature char(24),
weather char(24),
wind char(24),
PRIMARY KEY(id) 
)ENGINE=InnoDB DEFAULT CHARSET='utf8';

Take a look at the structure of the weather table:

show columns from weather;  -- or: desc weather;
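
My notes don't reproduce the book's pipelines2mysql.py, so here is a minimal sketch of what it could look like, assuming pymysql, the scrapyDB database, and the weather table above (credentials are placeholders):

import pymysql

class Pipeline2MySQL(object):
    def open_spider(self, spider):
        # placeholder credentials -- replace with your own
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='your_password', db='scrapyDB',
                                    charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = ('INSERT INTO weather(date, week, img, temperature, weather, wind) '
               'VALUES (%s, %s, %s, %s, %s, %s)')
        self.cursor.execute(sql, (item['date'], item['week'], item['img'],
                                  item['temperature'], item['weather'], item['wind']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()

Register it in ITEM_PIPELINES the same way as pipelines2json above.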

6. Adding a User-Agent

Scrapy does have default headers, but they differ from a real browser's. Some sites inspect the headers, so Scrapy needs to be given a browser-like set.

Of course, the following approach also works:

from getProxy import userAgents
 
BOT_NAME='getProxy'
 
SPIDER_MODULES=['getProxy.spiders']
 
NEWSPIDER_MODULE='getProxy.spiders'
 
USER_AGENT=userAgents.pcUserAgent.get('Firefox 4.0.1 – Windows').split(':', 1)[1]  # strip the 'User-Agent:' prefix stored in the dict values
 
ITEM_PIPELINES={'getProxy.pipelines.GetProxyPipeline':300}

All it takes is adding a USER_AGENT entry to settings.py.

This changes USER_AGENT and imports the userAgents module; the module's code is given below:

pcUserAgent = {
"safari 5.1 – MAC":"User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"safari 5.1 – Windows":"User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
"IE 9.0":"User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0);",
"IE 8.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
"IE 7.0":"User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
"IE 6.0":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
"Firefox 4.0.1 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Firefox 4.0.1 – Windows":"User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
"Opera 11.11 – MAC":"User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
"Opera 11.11 – Windows":"User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
"Chrome 17.0 – MAC":"User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
"Maxthon":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
"Tencent TT":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
"The World 2.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
"The World 3.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
"sogou 1.x":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"360":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Avant":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
"Green Browser":"User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}
 
mobileUserAgent = {
"iOS 4.33 – iPhone":"User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPod Touch":"User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"iOS 4.33 – iPad":"User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
"Android N1":"User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android QQ":"User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
"Android Opera ":"User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
"Android Pad Moto Xoom":"User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
"BlackBerry":"User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
"WebOS HP Touchpad":"User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
"Nokia N97":"User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
"Windows Phone Mango":"User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
"UC":"User-Agent: UCWEB7.0.2.37/28/999",
"UC standard":"User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999",
"UCOpenwave":"User-Agent: Openwave/ UCWEB7.0.2.37/28/999",
"UC Opera":"User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}

7. Beating User-Agent blocking
The method is actually simple:
(1) Create a middlewares directory with an __init__.py, a resource file resource.py, and a middleware file customUserAgent.py.
(2) In customUserAgent.py, randomly pick a user-agent out of resource.py and use it as Scrapy's user-agent (see the sketch after this list).
(3) In settings.py, add RandomUserAgent to DOWNLOADER_MIDDLEWARES.
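
A minimal sketch of what customUserAgent.py might contain (my reconstruction, assuming resource.py in the same package defines a UserAgents list like the one in section 6):

import random

from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware

from .resource import UserAgents

class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # give every outgoing request a randomly chosen browser User-Agent
        request.headers.setdefault('User-Agent', random.choice(UserAgents))

And in settings.py (the project name myproject is a placeholder):

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.customUserAgent.RandomUserAgent': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default
}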

8. Beating IP blocking
Much the same as above:
(1) Add PROXIES to resource.py.
(2) Create customProxy.py so that Scrapy uses a random proxy from the IP pool when crawling (a sketch follows below).
(3) Update settings.py.

Reference: https://www.cnblogs.com/hqutcy/p/7341212.html
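
A matching sketch for customProxy.py (again my reconstruction; it assumes PROXIES in resource.py is a list of 'host:port' strings like the one in Chapter 6):

import random

from .resource import PROXIES

class RandomProxy(object):
    def process_request(self, request, spider):
        # route each request through a randomly chosen http proxy
        request.meta['proxy'] = 'http://' + random.choice(PROXIES)

Then add 'myproject.middlewares.customProxy.RandomProxy' to DOWNLOADER_MIDDLEWARES in settings.py.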

Chapter 6: A hand-written crawler template

Project file layout (figure):

Code:
getTrendsMV.py

from bs4 import BeautifulSoup
import urllib.request
import time
from mylog import MyLog as mylog
import resource
import random


class Item(object):
    top_num = None  # chart position
    score = None  # score
    mvname = None  # MV title
    singer = None  # artist
    releasetime = None  # release date


class GetMvList(object):
    """ the all data from www.yinyuetai.com
    所有的数据都来自www.yinyuetai.com
    """
    def __init__(self):
        self.urlbase = 'http://vchart.yinyuetai.com/vchart/trends?'
        self.areasDic = {
                         'ALL': '总榜',  # overall chart
                         'ML': '内地篇',  # Mainland China
                         'HT': '港台篇',  # Hong Kong / Taiwan
                         'US': '欧美篇',  # US & Europe
                         'KR': '韩国篇',  # Korea
                         'JP': '日本篇',  # Japan
                         }
        self.log = mylog()
        self.geturls()

    def geturls(self):
        # Build the URL pool
        areas = [i for i in self.areasDic.keys()]
        pages = [str(i) for i in range(1, 4)]
        for area in areas:
            urls = []
            for page in pages:
                urlEnd = 'area=' + area + '&page=' + page
                url = self.urlbase + urlEnd
                urls.append(url)
                self.log.info('Added URL {} to the URL pool'.format(url))
            self.spider(area, urls)

    def getResponseContent(self, url):
        """从页面返回数据"""
        fakeHeaders = {"User-Agent": self.getRandomHeaders()}
        request = urllib.request.Request(url, headers=fakeHeaders)
        proxy = urllib.request.ProxyHandler({'http': 'http://' + self.getRandomProxy()})
        opener = urllib.request.build_opener(proxy)
        urllib.request.install_opener(opener)
        try:
            response = urllib.request.urlopen(request)
            html = response.read().decode('utf8')
            time.sleep(1)
        except Exception as e:
            self.log.error('Failed to fetch data from URL: {}'.format(url))
            return ''
        else:
            self.log.info('Fetched data from URL: {}'.format(url))
            return html

    def getRandomProxy(self):
        # Pick a random proxy address
        return random.choice(resource.PROXIES)

    def getRandomHeaders(self):
        # Pick a random User-Agent header
        return random.choice(resource.UserAgents)

    def spider(self, area, urls):
        items = []
        for url in urls:
            responseContent = self.getResponseContent(url)
            if not responseContent:
                continue
            soup = BeautifulSoup(responseContent, 'lxml')
            tags = soup.find_all('li', attrs={'name': 'dmvLi'})
            for tag in tags:
                item = Item()
                item.top_num = tag.find('div', attrs={'class': 'top_num'}).get_text()
                if tag.find('h3', attrs={'class': 'desc_score'}):
                    item.score = tag.find('h3', attrs={'class': 'desc_score'}).get_text()
                else:
                    item.score = tag.find('h3', attrs={'class': 'asc_score'}).get_text()
                item.mvname = tag.find('a', attrs={'class': 'mvname'}).get_text()
                item.singer = tag.find('a', attrs={'class': 'special'}).get_text()
                item.releasetime = tag.find('p', attrs={'class': 'c9'}).get_text()
                items.append(item)
                self.log.info('Added data for MV {}'.format(item.mvname))
        self.pipelines(items, area)

    def pipelines(self, items, area):
        filename = '音悦台V榜-榜单.txt'
        nowtime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
        with open(filename, 'a', encoding='utf8') as f:
            f.write('{} --------- {}\r\n'.format(self.areasDic.get(area), nowtime))
            for item in items:
                f.write("{} {} \t {} \t {} \t {}\r\n".format(item.top_num,
                                                             item.score,
                                                             item.releasetime,
                                                             item.mvname,
                                                             item.singer
                                                             ))
                self.log.info('Wrote MV {} to {}...'.format(item.mvname, filename))
            f.write('\r\n'*4)


if __name__ == '__main__':
    GetMvList()

mylog.py

#!/usr/bin/env python
# coding: utf-8
import logging
import getpass
import sys


# Define the MyLog class
class MyLog(object):
    def __init__(self):
        self.user = getpass.getuser()  # the current user name
        self.logger = logging.getLogger(self.user)
        self.logger.setLevel(logging.DEBUG)

        # Log file name, derived dynamically from the name of the calling script
        self.logfile = sys.argv[0][0:-3] + '.log'
        self.formatter = logging.Formatter('%(asctime)-12s %(levelname)-8s %(message)-12s\r\n')

        # Send log records both to the screen and to the log file
        self.logHand = logging.FileHandler(self.logfile, encoding='utf-8')
        self.logHand.setFormatter(self.formatter)
        self.logHand.setLevel(logging.DEBUG)

        self.logHandSt = logging.StreamHandler()
        self.logHandSt.setFormatter(self.formatter)
        self.logHandSt.setLevel(logging.DEBUG)

        self.logger.addHandler(self.logHand)
        self.logger.addHandler(self.logHandSt)

    # One method for each of the five log levels
    def debug(self, msg):
        self.logger.debug(msg)

    def info(self, msg):
        self.logger.info(msg)

    def warn(self, msg):
        self.logger.warn(msg)

    def error(self, msg):
        self.logger.error(msg)

    def critical(self, msg):
        self.logger.critical(msg)


if __name__ == '__main__':
    mylog = MyLog()
    mylog.debug(u"I'm debug 中文测试")
    mylog.info(u"I'm info 中文测试")
    mylog.warn(u"I'm warn 中文测试")
    mylog.error(u"I'm error 中文测试")
    mylog.critical(u"I'm critical 中文测试")

resource.py

#!/usr/bin/env python
# coding: utf-8
UserAgents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
]

# Proxy IP addresses; if these no longer work, find a few free ones online
# These are all http proxies
PROXIES = [
    "120.83.102.255:808",
    "111.177.106.196:9999",
]
