I. Python naming rules

II. XPath usage:


Note: XPath subscripts start from 1, not 0.
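For example, a quick sketch with lxml:
from lxml import etree

doc = etree.HTML('<ul><li>a</li><li>b</li></ul>')
print(doc.xpath('//li[1]/text()'))  # ['a'] -- li[1] is the first li, not the second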


Scraping images:

Tip:
What to do if you run into ]?
links = dom_tree.xpath("//a[@class='download']")  # locate nodes in the document; returns a list
for index in range(len(links)):
    # links[index] is an lxml Element, not a dict (though its .attrib behaves like one)
    if (index % 2) == 0:
        print(links[index].tag)
        print(links[index].attrib)
        print(links[index].text)
For example, if the URL is a magnet link, then
print(links[index].tag)     # the tag name: a
print(links[index].attrib)  # the tag's attributes: href and class
print(links[index].text)    # the tag's text content
print, respectively:
a
{'href': 'magnet:?xt=urn:btih:7502edea0dfe9c2774f95118db3208a108fe10ca', 'class': 'download'}
磁力链接
Reference: https://www.cnblogs.com/z-x-y/p/8260213.html
Extracting hyperlinks
//a[@class="text--link"]/@href
//span[@class='l fl']/a/@href  # extract the link's href
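Note that an attribute path such as /@href returns plain strings rather than Elements, so no .attrib lookup is needed afterwards (a quick sketch, reusing dom_tree from the snippet above):
hrefs = dom_tree.xpath("//a[@class='download']/@href")
print(hrefs)  # a list of strings, e.g. ['magnet:?xt=urn:btih:...']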
Extracting multiple tags
For example, an article body may contain h2, h3, and p tags all at once; you can grab them with the expression below. In my test, though, it first collected all the h2 elements and then all the p elements, so the original document order was lost, and in the end I fell back to regular expressions. (A workaround that keeps document order is sketched after the expression.)
xpath('//*[@id="xxx"]/h2 | //*[@id="xxx"]/h3 | //*[@id="xxx"]/p')
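If you want to keep document order without falling back to regular expressions, one workaround is to walk the container's descendants in order and keep only the wanted tags. A sketch, assuming lxml and that page_source holds the raw HTML:
from lxml import etree

html = etree.HTML(page_source)  # page_source: the raw HTML string (assumed)
container = html.xpath('//*[@id="xxx"]')[0]
# iter() yields matching descendants in document order, so h2/h3/p stay interleaved
parts = [el.xpath('string(.)') for el in container.iter('h2', 'h3', 'p')]
print('\n'.join(parts))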
Special fields
ip.xpath('string(td[5])')[0].extract().strip()  # all the text inside the 5th table cell
ip.xpath('td[8]/div[@class="bar"]/@title').re(r'\d{0,2}\.\d{0,}')[0]  # pull the number out of <div class="bar" title="0.0885秒">
If the image URLs can live in more than one place in the DOM, join the alternatives with the "|" operator in one XPath expression.
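For example (the class names here are made up):
img_urls = response.xpath('//img[@class="lazy"]/@data-original | //img[@class="pic"]/@src').extract()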

Common problems:
(1) Sometimes q_urls = response.xpath('//div[@class="line content"]') shows values in Chrome's XPath plugin but returns nothing in your code. In that case, fetch and parse the page yourself:
import requests
from lxml import etree

def get_detail(url):
    html = requests.get(url, headers=headers)  # headers: a browser-like header dict defined elsewhere
    response = etree.HTML(html.content)
    q_urls = response.xpath('//div[@class="line content"]')
    result = q_urls[0].xpath('string(.)').strip()
    return result
(2) Inspecting an element
content = selector.xpath('//div[@class="metarial"]')[0]
Reference:
https://www.cnblogs.com/just-do/p/9778941.html
(3) Garbled text
If Chinese text extracted with XPath comes out garbled, the following usually fixes it:
content = etree.tostring(content, encoding="utf-8").decode('utf-8')
Reference:
https://www.cnblogs.com/Rhythm-/p/11374832.html
Chapter 5: The Scrapy crawler framework
1. __init__.py is an empty file; its job is to turn its parent directory into a package that Python can import.
2. items.py defines which fields to scrape, wuhanmoviespider.py defines how to crawl them, settings.py decides who handles the scraped content, and pipelines.py defines what happens to the content after it is scraped.
3. Given HTML like
<h3>武汉<font color="#0066cc">今天</font>天气</h3>
select the text with h3//text() rather than h3/text(): /text() returns only the direct text children, while //text() returns all descendant text nodes.
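A quick sketch of the difference, using lxml:
from lxml import etree

h = etree.HTML('<h3>武汉<font color="#0066cc">今天</font>天气</h3>')
print(h.xpath('//h3/text()'))   # ['武汉', '天气'] -- only the direct text children
print(h.xpath('//h3//text()'))  # ['武汉', '今天', '天气'] -- all descendant text nodes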
4. Outputting JSON
ITEM_PIPELINES in settings.py is a dict, and entries can be added to a dict, so you can write your own pipeline module and register it there.
(1) Create a pipelines2json.py file:
import time
import json
import codecs

class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.json'
        with codecs.open(fileName, 'a', encoding='utf8') as fp:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            fp.write(line)
        return item
(2) Modify settings.py, adding the pipelines2json pipeline to ITEM_PIPELINES:
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
}
5. Database and table commands
This was covered earlier, but there we modified the existing pipeline directly, whereas here a new pipelines2mysql.py handles the database inserts (a sketch follows the SQL below).
This section is mainly to note down the database commands.
# Create the scrapyDB database with utf8 encoding; each statement ends with ';'
CREATE DATABASE scrapyDB CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';
# Switch to the database we just created:
use scrapyDB;
# Create the fields we need; they must match the fields in our code one-to-one so the SQL statements are easy to write
CREATE TABLE weather(
    id INT AUTO_INCREMENT,
    date char(24),
    week char(24),
    img char(128),
    temperature char(24),
    weather char(24),
    wind char(24),
    PRIMARY KEY(id)
) ENGINE=InnoDB DEFAULT CHARSET='utf8';
To see what the weather table looks like:
show columns from weather; or: desc weather;
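A minimal sketch of what pipelines2mysql.py might look like, assuming pymysql, placeholder connection parameters, and the weather table created above (the class name is hypothetical):
import pymysql

class Pipeline2MySQL(object):
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='localhost', user='root', password='***',
                                    db='scrapyDB', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = ('INSERT INTO weather(date, week, img, temperature, weather, wind) '
               'VALUES (%s, %s, %s, %s, %s, %s)')
        self.cursor.execute(sql, (item['date'], item['week'], item['img'],
                                  item['temperature'], item['weather'], item['wind']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
Register it in ITEM_PIPELINES the same way pipelines2json was registered above.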
6. Adding a USER_AGENT
Scrapy does ship with default headers, but they differ from a real browser's. Some sites inspect the headers, so Scrapy needs to be given browser-like headers.

Alternatively, you can use the following method:
from getProxy import userAgents

BOT_NAME = 'getProxy'
SPIDER_MODULES = ['getProxy.spiders']
NEWSPIDER_MODULE = 'getProxy.spiders'
# the dict values carry a 'User-Agent:' prefix, so strip it off before use
USER_AGENT = userAgents.pcUserAgent.get('Firefox 4.0.1 – Windows').split(':', 1)[1]
ITEM_PIPELINES = {'getProxy.pipelines.GetProxyPipeline': 300}
All it takes is adding a USER_AGENT entry to settings.py.
Here USER_AGENT is set by importing the userAgents module; its code is given below:
pcUserAgent = {
    "safari 5.1 – MAC": "User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "safari 5.1 – Windows": "User-Agent:Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "IE 9.0": "User-Agent:Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "IE 8.0": "User-Agent:Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "IE 7.0": "User-Agent:Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "IE 6.0": "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Firefox 4.0.1 – MAC": "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Firefox 4.0.1 – Windows": "User-Agent:Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera 11.11 – MAC": "User-Agent:Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera 11.11 – Windows": "User-Agent:Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Chrome 17.0 – MAC": "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Maxthon": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Tencent TT": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "The World 2.x": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "The World 3.x": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "sogou 1.x": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "360": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Avant": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Green Browser": "User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}
mobileUserAgent = {
    "iOS 4.33 – iPhone": "User-Agent:Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33 – iPod Touch": "User-Agent:Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33 – iPad": "User-Agent:Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Android N1": "User-Agent: Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android QQ": "User-Agent: MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android Opera ": "User-Agent: Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Android Pad Moto Xoom": "User-Agent: Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "BlackBerry": "User-Agent: Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "WebOS HP Touchpad": "User-Agent: Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Nokia N97": "User-Agent: Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Windows Phone Mango": "User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UC": "User-Agent: UCWEB7.0.2.37/28/999",
    "UC standard": "User-Agent: NOKIA5700/ UCWEB7.0.2.37/28/999",
    "UCOpenwave": "User-Agent: Openwave/ UCWEB7.0.2.37/28/999",
    "UC Opera": "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}
7. Defeating user-agent blocking
The method is simple (a middleware sketch follows this list):
(1) Create a middlewares directory containing an __init__.py, a resource file resource.py, and a middleware file customUserAgent.py.
(2) In customUserAgent.py, pick a user-agent at random from resource.py and use it as Scrapy's user-agent.
(3) In settings.py, add RandomUserAgent to DOWNLOADER_MIDDLEWARES.
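A sketch of what customUserAgent.py could look like; RandomUserAgent and the project path in settings.py are assumptions based on the steps above, and UserAgents is the list from resource.py:
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from .resource import UserAgents

class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # give every outgoing request a random browser user-agent
        request.headers.setdefault('User-Agent', random.choice(UserAgents))
And in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'getProxy.middlewares.customUserAgent.RandomUserAgent': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default
}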

8. Defeating IP blocking
The approach is much the same (a sketch follows this list):
(1) Add a PROXIES list to resource.py.
(2) Create customProxy.py so that Scrapy uses a random proxy from the pool when crawling.
(3) Modify settings.py.
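A matching sketch for customProxy.py (RandomProxy is an assumed name; PROXIES is the list from resource.py):
import random
from .resource import PROXIES

class RandomProxy(object):
    def process_request(self, request, spider):
        # route each request through a random proxy from the pool
        request.meta['proxy'] = 'http://' + random.choice(PROXIES)
Then add RandomProxy to DOWNLOADER_MIDDLEWARES in settings.py, just like RandomUserAgent above.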

See also: https://www.cnblogs.com/hqutcy/p/7341212.html
Chapter 6: A self-built crawler template
Project file layout:

Code:
getTrendsMV.py
from bs4 import BeautifulSoup
import urllib.request
import time
from mylog import MyLog as mylog
import resource  # the local resource.py (UserAgents / PROXIES lists below)
import random


class Item(object):
    top_num = None      # rank
    score = None        # score
    mvname = None       # MV title
    singer = None       # performer
    releasetime = None  # release date


class GetMvList(object):
    """All of the data comes from www.yinyuetai.com."""
    def __init__(self):
        self.urlbase = 'http://vchart.yinyuetai.com/vchart/trends?'
        # chart names (kept in Chinese) used as section labels in the output file
        self.areasDic = {
            'ALL': '总榜',
            'ML': '内地篇',
            'HT': '港台篇',
            'US': '欧美篇',
            'KR': '韩国篇',
            'JP': '日本篇',
        }
        self.log = mylog()
        self.geturls()

    def geturls(self):
        # build the URL pool
        areas = [i for i in self.areasDic.keys()]
        pages = [str(i) for i in range(1, 4)]
        for area in areas:
            urls = []
            for page in pages:
                urlEnd = 'area=' + area + '&page=' + page
                url = self.urlbase + urlEnd
                urls.append(url)
                self.log.info('added URL {} to the pool'.format(url))
            self.spider(area, urls)

    def getResponseContent(self, url):
        """Fetch a page and return its HTML."""
        fakeHeaders = {"User-Agent": self.getRandomHeaders()}
        request = urllib.request.Request(url, headers=fakeHeaders)
        proxy = urllib.request.ProxyHandler({'http': 'http://' + self.getRandomProxy()})
        opener = urllib.request.build_opener(proxy)
        urllib.request.install_opener(opener)
        try:
            response = urllib.request.urlopen(request)
            html = response.read().decode('utf8')
            time.sleep(1)
        except Exception as e:
            self.log.error('failed to fetch data from URL: {}'.format(url))
            return ''
        else:
            self.log.info('fetched data from URL: {}'.format(url))
            return html

    def getRandomProxy(self):
        # pick a random proxy address
        return random.choice(resource.PROXIES)

    def getRandomHeaders(self):
        # pick a random User-Agent string
        return random.choice(resource.UserAgents)

    def spider(self, area, urls):
        items = []
        for url in urls:
            responseContent = self.getResponseContent(url)
            if not responseContent:
                continue
            soup = BeautifulSoup(responseContent, 'lxml')
            tags = soup.find_all('li', attrs={'name': 'dmvLi'})
            for tag in tags:
                item = Item()
                item.top_num = tag.find('div', attrs={'class': 'top_num'}).get_text()
                if tag.find('h3', attrs={'class': 'desc_score'}):
                    item.score = tag.find('h3', attrs={'class': 'desc_score'}).get_text()
                else:
                    item.score = tag.find('h3', attrs={'class': 'asc_score'}).get_text()
                item.mvname = tag.find('a', attrs={'class': 'mvname'}).get_text()
                item.singer = tag.find('a', attrs={'class': 'special'}).get_text()
                item.releasetime = tag.find('p', attrs={'class': 'c9'}).get_text()
                items.append(item)
                self.log.info('added data for MV {}'.format(item.mvname))
        self.pipelines(items, area)

    def pipelines(self, items, area):
        filename = '音悦台V榜-榜单.txt'
        nowtime = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())
        with open(filename, 'a', encoding='utf8') as f:
            f.write('{} --------- {}\r\n'.format(self.areasDic.get(area), nowtime))
            for item in items:
                f.write("{} {} \t {} \t {} \t {}\r\n".format(item.top_num,
                                                             item.score,
                                                             item.releasetime,
                                                             item.mvname,
                                                             item.singer))
                self.log.info('wrote MV {} to {}'.format(item.mvname, filename))
            f.write('\r\n' * 4)


if __name__ == '__main__':
    GetMvList()
mylog.py
#!/usr/bin/env python
# coding: utf-8
import logging
import getpass
import sys


class MyLog(object):
    """Write log messages both to the screen and to a log file."""
    def __init__(self):
        self.user = getpass.getuser()  # current user name
        self.logger = logging.getLogger(self.user)
        self.logger.setLevel(logging.DEBUG)
        # log file name, derived from the name of the calling script
        self.logfile = sys.argv[0][0:-3] + '.log'
        self.formatter = logging.Formatter('%(asctime)-12s %(levelname)-8s %(message)-12s\r\n')
        # one handler writes to the log file ...
        self.logHand = logging.FileHandler(self.logfile, encoding='utf-8')
        self.logHand.setFormatter(self.formatter)
        self.logHand.setLevel(logging.DEBUG)
        # ... and one echoes to the screen
        self.logHandSt = logging.StreamHandler()
        self.logHandSt.setFormatter(self.formatter)
        self.logHandSt.setLevel(logging.DEBUG)
        self.logger.addHandler(self.logHand)
        self.logger.addHandler(self.logHandSt)

    # the five log levels map to the five methods below
    def debug(self, msg):
        self.logger.debug(msg)

    def info(self, msg):
        self.logger.info(msg)

    def warn(self, msg):
        self.logger.warning(msg)

    def error(self, msg):
        self.logger.error(msg)

    def critical(self, msg):
        self.logger.critical(msg)


if __name__ == '__main__':
    mylog = MyLog()
    mylog.debug(u"I'm debug 中文测试")
    mylog.info(u"I'm info 中文测试")
    mylog.warn(u"I'm warn 中文测试")
    mylog.error(u"I'm error 中文测试")
    mylog.critical(u"I'm critical 中文测试")
resource.py
#!/usr/bin/env python
# coding: utf-8
UserAgents = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "Opera/9.80 (Windows NT 6.1; U; zh-cn) Presto/2.9.168 Version/11.50",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
]
# Proxy IP addresses; if these stop working, grab a few free ones online
# these are all HTTP proxies
PROXIES = [
    "120.83.102.255:808",
    "111.177.106.196:9999",
]