Project requirement:
I have thousands of domains and want to find out which of them Baidu has indexed, and how many pages are indexed for each. I took some code found online and adapted it; it basically meets the need. The program looks like this when running:

The domains that Baidu has indexed are saved to a file:

The code I first found online queried Baidu via "http://www.baidu.com/s?wd=", but that approach is only about 90% accurate: some sites that a site: search shows as not indexed get misreported as indexed, and there is no way to get the index count at all. I later switched to scraping the site: results page, which returns the index count reliably and makes the data precise.
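The index count on the site: results page is just text in the HTML, so besides a fixed XPath, a regex over the banner text can also pull it out. A minimal sketch; `parse_index_count` is a hypothetical helper, and the banner wording ("找到相关结果数约N个") is an assumption about Baidu's current results page that may change:

```python
import re


def parse_index_count(html):
    """Pull the indexed-page count out of a Baidu site: results page.

    Assumes the page contains banner text like
    "百度为您找到相关结果数约1,230个" -- the wording is an assumption
    and may change without notice.
    """
    m = re.search(r'找到相关结果数?约?([\d,]+)个', html)
    if not m:
        return 0
    # Strip thousands separators before converting
    return int(m.group(1).replace(',', ''))
```

Either way, the extraction is brittle: whenever Baidu changes the page, the pattern (or XPath) has to be updated.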
On the first run I set the sleep to 3 seconds; it ran all night without a problem, then froze around 10 a.m. the next morning. On the second overnight run it froze again after a little over two hours. My guess is that the queries were too frequent and Baidu blocked the IP.
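A fixed sleep interval is easy for a rate limiter to fingerprint. One way to make the crawl less regular is to add random jitter to each delay, and to back off exponentially after a failure. A sketch; `jittered_delay` and `backoff_delay` are hypothetical helpers, not part of the original script:

```python
import random
import time


def jittered_delay(base=3.0, spread=4.0):
    """A delay drawn uniformly from [base, base + spread) seconds,
    so the request rhythm is harder to fingerprint."""
    return base + random.uniform(0, spread)


def backoff_delay(attempt, base=3.0, cap=300.0):
    """Exponential backoff with a cap, plus a little jitter; for use
    after a failed or suspiciously empty response."""
    return min(cap, base * (2 ** attempt)) + random.uniform(0, 1)


def polite_sleep():
    """Drop-in replacement for a fixed time.sleep() between queries."""
    time.sleep(jittered_delay())
```

Whether this is enough to avoid a block depends on Baidu's limits; slowing down overall, or rotating through real proxies, may still be necessary.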
Since this is just for my own use, I kept the code for both approaches, so what follows is a bit of a grab bag. If you actually want to use it, taking the check_index_number function and tweaking it slightly is enough; a lot of the code can be trimmed away.
import requests
import time
from random import randint
from lxml import etree

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/45.0.2454.101 Safari/537.36",
    # Random X-Forwarded-For so requests look like they come from
    # different IPs (note: IPv4 octets are dot-separated, not colons)
    "X-Forwarded-For": '%s.%s.%s.%s' % (randint(1, 255), randint(1, 255),
                                        randint(1, 255), randint(1, 255)),
    "Content-Type": "application/x-www-form-urlencoded",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Connection": "keep-alive",
}


def check_index_number(url):
    """Query how many pages of a site Baidu has indexed.

    :param url: the site to query
    :return: the indexed-page count, or 0 if it cannot be found
    """
    url_a = 'https://www.baidu.com/s?wd=site%3A'
    url_b = ('&pn=1&oq=site%3A52pojie.cn&ie=utf-8&usm=1&rsv_idx=1'
             '&rsv_pq=dd6157d100015d1f&rsv_t=9a3eHncH3YeAeoblNqMm1f3'
             '%2FAQsJeSgF03XLXg6VDz6VqSprqUL8lGGO3us')
    joinUrl = url_a + url + url_b  # assembled site: query URL
    html_Doc = requests.get(joinUrl, headers=HEADERS)
    response = etree.HTML(html_Doc.content)
    try:
        # The count sits inside the first result block of the page
        index_number = response.xpath(
            '//*[@id="1"]/div/div[1]/div/p[3]/span/b/text()')[0]
    except IndexError:
        index_number = 0
    return index_number


def getUrl(filepath):
    """Read the URLs to check from a file, one per line."""
    with open(filepath, "r") as f:
        return f.readlines()


def getHtml(url):
    """Fetch a page, or return "" on any request error."""
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""


def isindex(link):
    # Old wd= query: only says indexed / not indexed, ~90% accurate
    url = link.replace("http://", "").replace("/", "%2F")
    url = "http://www.baidu.com/s?wd=" + url
    html = getHtml(url)
    with open("result.txt", 'a') as f:
        # Baidu's "no results" phrasings (must stay in Chinese to match)
        if "很抱歉,没有找到与" in html or "没有找到该URL" in html:
            print(link, "not indexed")
        else:
            print(link, "indexed")
            # Accurate count via the site: results page
            indexed_number = check_index_number(link)
            f.write(link + '\t' + str(indexed_number) + '\n')


def main():
    filepath = "20181105-new.txt"  # URLs to check, one per line
    urls = getUrl(filepath)
    for url in urls:
        url = url.strip()
        try:
            isindex(url)
        except Exception:
            pass
        time.sleep(2)


if __name__ == '__main__':
    main()
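Since a long run can die partway through (as happened above when the IP was blocked), it helps to skip domains already written to result.txt so a restart resumes where it left off. A sketch; `load_done` is a hypothetical helper that reuses the tab-separated result format from the code above:

```python
import os


def load_done(result_path="result.txt"):
    """Return the set of links already recorded in the result file."""
    done = set()
    if os.path.exists(result_path):
        with open(result_path) as f:
            for line in f:
                # Each line is "link<TAB>count"; the link is field 0
                parts = line.rstrip("\n").split("\t")
                if parts and parts[0]:
                    done.add(parts[0])
    return done
```

In main() this would be used to filter the list before the loop, e.g. `urls = [u for u in urls if u.strip() not in load_done()]`, so each domain is queried at most once across restarts.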