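Baidu Search Resource Platform: Daily Site Indexing (Python Version)

Preface

After you register for and activate a Baidu Xiongzhang ID (熊掌 ID; the product was later renamed Mobile Zone, 移动专区), the Baidu Search Resource Platform provides a daily-indexing API for your site. It gives webmasters the chance to have submitted pages indexed within a day, subject to a daily quota, which can speed up indexing of the site considerably. This article provides ready-to-use Python code for daily-indexing submission. It requires a Linux system with Python 3 and curl installed.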

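Python 3 Code

The code below reads the sitemap file of a given domain, then uses the curl command to submit the valid URLs in it (those ending in .html) to the Baidu Search Resource Platform in bulk. Replace the values of the domain, app_id, token, site_map_url, and day_submit_max_lines variables with your own.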

# -*- coding: utf-8 -*-

import re
import os
import logging
import subprocess
from io import StringIO
from urllib import request

# Site domain
domain = 'www.example.com'

# Unique ID issued by the search resource platform
app_id = 'xxxxxxxxxxxxxxxxx'

# Access token issued by the search resource platform for submissions
token = 'xxxxxxxxxxxxxxxxxxxxxx'

# URL of the sitemap
site_map_url = 'https://www.example.com/sitemap.xml'

# Maximum number of URLs to submit per run (daily indexing quota)
day_submit_max_lines = 10

# Submission endpoint
day_submit_url = 'http://data.zz.baidu.com/urls?appid={app_id}&token={token}&type=realtime'.format(app_id=app_id, token=token)

# File holding the URLs to submit
day_submit_urls_file = "/tmp/baidu_xiongzhang_day_submit_url.txt"

# Record file storing the position reached by previous runs
# (a 1-based count of the sitemap URLs already processed)
day_record_file = "/tmp/baidu_xiongzhang_day_record.txt"

# Log file
log_file = "/tmp/baidu_xiongzhang_day.log"


def regexpMatchUrl(content):
    # True if the line contains an http(s) URL
    pattern = re.search(r'(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?', content, re.IGNORECASE)
    return pattern is not None


def regexpMatchWebSite(content):
    # True if the line mentions the configured domain
    # (the domain is escaped so that "." only matches a literal dot)
    return re.search(re.escape(domain), content, re.IGNORECASE) is not None


def getUrl(content):
    # Return the first URL ending in ".html", or '' if the line has none
    pattern = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+\.html', content, re.IGNORECASE)
    return pattern[0] if pattern else ''


def writeRecordFile(record_file_path, content):
    with open(record_file_path, 'w') as record_file:
        record_file.write(content)


def readRecordFile(record_file_path):
    # Return the recorded position as a string; "0" if nothing is recorded yet
    content = "0"
    if os.path.exists(record_file_path):
        with open(record_file_path, 'r') as record_file:
            content = record_file.readline().strip()
    if len(content) == 0:
        content = "0"
    return content


def countWebsiteMapUrl():
    # Count the sitemap lines that contain a URL of this site
    total = 0
    content = request.urlopen(site_map_url).read().decode('utf8')
    for line in StringIO(content):
        if regexpMatchUrl(line) and regexpMatchWebSite(line):
            total = total + 1
    return total


def createUrlFile(url_file_path, max_lines):
    old_index = int(readRecordFile(day_record_file))
    content = request.urlopen(site_map_url).read().decode('utf8')

    # Write the next batch of URLs, skipping lines handled by earlier runs
    index = 0
    number = 0
    with open(url_file_path, 'w') as url_file:
        for line in StringIO(content):
            if regexpMatchUrl(line) and regexpMatchWebSite(line):
                # Count every matching line, including non-.html ones, so the
                # position stays in step with countWebsiteMapUrl()
                index = index + 1
                if index <= old_index:
                    continue
                url = getUrl(line)
                if url != '':
                    number = number + 1
                    url_file.write(url + "\n")
                    if number >= max_lines:
                        break

    # Update the record file; start over once the whole sitemap is covered
    if index >= countWebsiteMapUrl():
        writeRecordFile(day_record_file, str(0))
    else:
        writeRecordFile(day_record_file, str(index))


def submitUrlFile(url, url_file_path, log_file):
    # POST the URL file to the submission endpoint with curl and log its output
    shell_cmd_line = "curl -H 'Content-Type:text/plain' --data-binary @" + url_file_path + " " + '"' + url + '"'
    (status, output) = subprocess.getstatusoutput(shell_cmd_line)
    logging.info(output + "\n")
    # print(shell_cmd_line)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', filename=log_file)
    createUrlFile(day_submit_urls_file, day_submit_max_lines)
    submitUrlFile(day_submit_url, day_submit_urls_file, log_file)
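
The script scans the sitemap one line at a time with regular expressions, which relies on each URL sitting on its own line. As a possible alternative, here is a minimal sketch that parses the sitemap as XML with the standard library's xml.etree.ElementTree. The function name extract_html_urls and the sample sitemap are illustrative assumptions and not part of the original script; the sketch also assumes the sitemap uses the standard sitemaps.org namespace.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemap files (assumed here).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_html_urls(sitemap_xml, domain):
    """Return the <loc> URLs that contain `domain` and end in .html."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if domain in url and url.endswith(".html"):
            urls.append(url)
    return urls


# Hypothetical sitemap used only to demonstrate the function.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/posts/a.html</loc></url>
  <url><loc>https://www.example.com/about/</loc></url>
  <url><loc>https://www.example.com/posts/b.html</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_html_urls(SAMPLE_SITEMAP, "www.example.com"):
        print(url)
```

Parsing the XML keeps working when the sitemap generator emits several tags on one line, at the cost of failing outright on malformed XML, which the line-based approach tolerates.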

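Crontab Scheduled Task

On Linux, combining the Python script with a crontab scheduled task lets you submit links to the Baidu Search Resource Platform automatically on a schedule.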

# Submit links once a day at 03:00
0 3 * * * /usr/bin/python3 /usr/local/baidu-push/baidu_xiongzhang_day.py
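
Because the job runs once a day and the quota caps each run, successive runs have to walk through the sitemap in batches; the record file is what carries the position from one day to the next. The sketch below isolates that idea as a pure function. The name next_batch and the sample URLs are illustrative assumptions, and the in-memory list stands in for the sitemap and the record file.

```python
def next_batch(urls, start, max_lines):
    """Return (batch, next_start) for one daily run.

    `start` is the number of URLs already handled by earlier runs.
    `next_start` wraps back to 0 once the end of the list is reached.
    """
    if start >= len(urls):
        start = 0
    batch = urls[start:start + max_lines]
    next_start = start + len(batch)
    if next_start >= len(urls):
        next_start = 0
    return batch, next_start


if __name__ == "__main__":
    # Five hypothetical URLs and a quota of two per day.
    urls = ["https://www.example.com/%d.html" % i for i in range(5)]
    start = 0
    for day in range(1, 5):
        batch, start = next_batch(urls, start, 2)
        print("day", day, batch, "next start:", start)
```

With five URLs and a quota of two, the third run submits the single remaining URL and resets the position, so the fourth run starts again from the top of the sitemap.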

Log Output from the Script

$ cat /tmp/baidu_xiongzhang_day.log

2020-01-19 21:46:08,072 - www - INFO - % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 507 100 67 100 440 555 3648 --:--:-- --:--:-- --:--:-- 3666
{"remain":5,"success":10,"success_realtime":10,"remain_realtime":5}
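
The last line of the log is the JSON response from the submission endpoint, mixed in with curl's progress output. If you want to act on it, for example to notice when the quota is exhausted, a small parser along these lines could pull it out. This is a sketch only: parse_submit_response is a hypothetical helper, and reading "success" as the number of accepted URLs and "remain" as the remaining quota is an assumption based on the field names in the log above.

```python
import json


def parse_submit_response(output):
    """Return the last JSON object found on its own line in `output`, or None."""
    for line in reversed(output.splitlines()):
        line = line.strip()
        if line.startswith("{") and line.endswith("}"):
            try:
                return json.loads(line)
            except json.JSONDecodeError:
                continue
    return None


# Sample text shaped like the log entry above.
SAMPLE_OUTPUT = """  % Total    % Received % Xferd  Average Speed
100   507  100    67  100   440    555   3648 --:--:-- --:--:-- --:--:--  3666
{"remain":5,"success":10,"success_realtime":10,"remain_realtime":5}"""

if __name__ == "__main__":
    result = parse_submit_response(SAMPLE_OUTPUT)
    if result is None:
        print("no JSON response found")
    else:
        print("submitted:", result["success"], "remaining:", result["remain"])
```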

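One-Step Docker Deployment of the Submission Service

  • The Dockerfile is shown below. After building the Docker image, start a container from it directly with a single command.
  • When starting the container, use the -v option to mount the Python script from the host at /usr/local/python_scripts/baidu_xiongzhang_day.py inside the container.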
FROM augurproject/python2-and-3

MAINTAINER clay<656418510@qq.com>

RUN mkdir -p /tmp/baidu

RUN touch /var/log/cron.log

RUN mkdir -p /usr/local/python_scripts

ENV workpath /usr/local/python_scripts

WORKDIR $workpath

RUN echo "Asia/Shanghai" > /etc/timezone
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

RUN cp /etc/apt/sources.list /etc/apt/backup.sources.list
RUN echo "deb http://mirrors.163.com/debian/ stretch main non-free contrib" > /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/debian/ stretch-updates main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/debian/ stretch-backports main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian/ stretch main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian/ stretch-updates main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian/ stretch-backports main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/debian-security/ stretch/updates main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian-security/ stretch/updates main non-free contrib" >> /etc/apt/sources.list

RUN apt-get -y update && apt-get -y upgrade
RUN apt-get -y install python-rsa python-requests cron rsyslog vim htop net-tools telnet apt-utils tree wget curl git make gcc
RUN apt-get -y autoclean && apt-get -y autoremove

RUN sed -i "s/#cron./cron./g" /etc/rsyslog.conf

RUN echo "0 3 * * * root /usr/bin/python3 /usr/local/python_scripts/baidu_xiongzhang_day.py" >> /etc/crontab

CMD service rsyslog start && service cron start && tail -f -n 20 /var/log/cron.log

If you manage the Docker image with Docker Compose, the YAML configuration file looks like this:

version: '3.5'

services:
  baidu-push:
    image: clay/baidu-push:1.0
    container_name: hexo-baidu-push
    restart: always
    environment:
      TZ: 'Asia/Shanghai'
    volumes:
      - /usr/local/baidu-push/logs:/tmp/baidu
      - /usr/local/baidu-push/baidu_xiongzhang_day.py:/usr/local/python_scripts/baidu_xiongzhang_day.py

Volume mounts:

  • /usr/local/baidu-push/logs: the log directory on the host
  • /usr/local/baidu-push/baidu_xiongzhang_day.py: the path of the Python script file on the host