Baidu Webmaster Link Submission: Active Push (Python Version)

Preface

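To speed up Baidu's indexing of a site, Baidu Webmaster Platform currently offers two ways to submit links: automatic submission and manual submission. Automatic submission comes in three forms: active push, auto push, and sitemap. According to Baidu, active push works best. The platform's console provides sample push code for Curl, PHP, and Ruby, but none for Python. This article gives ready-to-use Python code for active push. It requires Linux as the system environment, plus Python 3 and Curl.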

Python 3 Code

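The code below reads the sitemap file of a given domain, then uses the Curl command to submit the valid URLs in it (those ending in .html) to Baidu Webmaster Platform in one batch. Replace the values of the domain, token, and site_map_url variables with your own.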

# -*- coding: utf-8 -*-

import os
import re
import logging
import subprocess
from io import StringIO
from urllib import request

# Site domain
domain = 'www.example.com'

# Access token for pushing, issued by Baidu Webmaster Platform
token = 'xxxxxxxxxxxxxxxxx'

# URL of the sitemap
site_map_url = 'https://www.example.com/sitemap.xml'

# Maximum number of links to push
push_max_lines = 1000

# File that holds the URLs to push
push_urls_file = "/tmp/baidu_zhanzhang_push_url.txt"

# Push API endpoint
push_url = 'http://data.zz.baidu.com/urls?site={domain}&token={token}'.format(domain=domain, token=token)

# Log file
log_file = "/tmp/baidu/baidu_zhanzhang_push.log"


def regexpMatchUrl(content):
    # True if the line contains something that looks like an http(s) URL
    pattern = re.search(r'(http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?', content, re.IGNORECASE)
    return pattern is not None


def regexpMatchWebSite(content):
    # True if the line mentions the configured domain (escaped so "." is matched literally)
    return re.search(re.escape(domain), content, re.IGNORECASE) is not None


def getUrl(content):
    # Return the first URL on the line that ends in .html, or '' if there is none.
    # The URL stops at whitespace, quotes, or angle brackets, so XML tags are not swallowed.
    match = re.search(r'https?://[^\s<>"]+?\.html', content, re.IGNORECASE)
    return match.group(0) if match else ''


def createUrlFile(url_file_path, max_lines):
    # Download the sitemap and write up to max_lines matching URLs, one per line
    content = request.urlopen(site_map_url).read().decode('utf8')
    index = 0
    with StringIO(content) as website_map_file, open(url_file_path, 'w') as url_file:
        for line in website_map_file:
            if regexpMatchUrl(line) and regexpMatchWebSite(line):
                url = getUrl(line)
                if url != '':
                    index = index + 1
                    url_file.write(url + "\n")
                    if index >= max_lines:
                        break


def pushUrlFile(url, url_file_path, log_file):
    # POST the URL file to the push API with curl and log curl's combined output
    shell_cmd_line = "curl -H 'Content-Type:text/plain' --data-binary @" + url_file_path + " " + '"' + url + '"'
    (status, output) = subprocess.getstatusoutput(shell_cmd_line)
    if status != 0:
        logging.error("curl exited with status %s", status)
    logging.info(output + "\n")


if __name__ == "__main__":
    # Make sure the log directory exists before logging opens the file
    os.makedirs(os.path.dirname(log_file), exist_ok=True)
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(name)s - %(levelname)s - %(message)s', filename=log_file)
    createUrlFile(push_urls_file, push_max_lines)
    pushUrlFile(push_url, push_urls_file, log_file)
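
The script above scans the sitemap line by line with regular expressions, which works best when each <loc> entry sits on its own line. As a possible alternative, here is a minimal sketch that parses the sitemap as XML with the standard library instead. The function name extract_html_urls and its parameters are hypothetical and not part of the original script. It assumes a standard XML sitemap whose URLs live in <loc> elements, and it compares the host name exactly rather than searching for the domain anywhere in the line.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def extract_html_urls(sitemap_xml, domain, max_lines=1000):
    """Collect <loc> values on `domain` whose path ends in .html, up to max_lines.

    Hypothetical helper: `sitemap_xml` is the sitemap document as str or bytes.
    """
    root = ET.fromstring(sitemap_xml)
    urls = []
    for element in root.iter():
        # Tags are namespaced, e.g. "{http://www.sitemaps.org/schemas/sitemap/0.9}loc";
        # compare only the local part so the namespace URI does not matter.
        if element.tag.rsplit("}", 1)[-1] != "loc" or not element.text:
            continue
        url = element.text.strip()
        parts = urlsplit(url)
        # urlsplit lowercases the host name, so lowercase the configured domain too.
        if parts.hostname == domain.lower() and parts.path.lower().endswith(".html"):
            urls.append(url)
            if len(urls) >= max_lines:
                break
    return urls
```

Under these assumptions, calling extract_html_urls(request.urlopen(site_map_url).read(), domain, push_max_lines) should yield a list comparable to what createUrlFile writes to disk, without depending on how the sitemap is broken into lines.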

Crontab Scheduled Task

On Linux, combining the Python script with a Crontab scheduled task is enough to submit links to Baidu Webmaster Platform on a regular schedule.

# Submit links once every two hours
0 */2 * * * /usr/bin/python3 /usr/local/baidu-push/baidu_zhanzhang_push.py

Log Output from the Script

$ cat /tmp/baidu/baidu_zhanzhang_push.log

2019-02-18 23:15:20,985 - www - INFO - % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed

0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
100 6092 100 30 100 6062 82 16671 --:--:-- --:--:-- --:--:-- 16653
{"remain":98069,"success":138}
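
The last line of the log is the JSON reply from the push API, mixed in with curl's progress meter. If you would rather work with that reply directly, the following is a rough sketch that sends the same POST with urllib and decodes the JSON. All three function names are hypothetical and not part of the original script. The remain and success fields are taken from the sample log above, and reading them as "remaining quota" and "URLs accepted" is an assumption, not something the log itself states.

```python
import json
from urllib import request


def build_push_request(push_url, urls):
    """Build the POST that the curl command sends: one URL per line, text/plain."""
    body = "\n".join(urls).encode("utf-8")
    return request.Request(
        push_url,
        data=body,
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def parse_push_response(output):
    """Return the JSON object on the last non-empty line of `output`, or None."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    if not lines:
        return None
    try:
        result = json.loads(lines[-1])
    except ValueError:
        # Not JSON at all, e.g. only curl's progress meter was captured
        return None
    return result if isinstance(result, dict) else None


def push_urls(push_url, urls, timeout=10):
    """Send the URLs and return the parsed reply (performs a real HTTP request)."""
    with request.urlopen(build_push_request(push_url, urls), timeout=timeout) as resp:
        return parse_push_response(resp.read().decode("utf-8"))
```

parse_push_response also accepts the combined curl output that the script logs, so it can be applied to the existing log text without changing how the push is sent.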

One-Click Deployment of the Push Service with Docker

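  • The Dockerfile is shown below. After building the Docker image, start it directly with a command.
  • When starting the Docker image directly, use the -v option to mount the Python script file from the host to /usr/local/python_scripts/baidu_zhanzhang_push.py inside the container.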
FROM augurproject/python2-and-3

MAINTAINER clay<656418510@qq.com>

RUN mkdir -p /tmp/baidu

RUN touch /var/log/cron.log

RUN mkdir -p /usr/local/python_scripts

ENV workpath /usr/local/python_scripts

WORKDIR $workpath

RUN echo "Asia/Shanghai" > /etc/timezone
RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

RUN cp /etc/apt/sources.list /etc/apt/backup.sources.list
RUN echo "deb http://mirrors.163.com/debian/ stretch main non-free contrib" > /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/debian/ stretch-updates main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/debian/ stretch-backports main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian/ stretch main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian/ stretch-updates main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian/ stretch-backports main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb http://mirrors.163.com/debian-security/ stretch/updates main non-free contrib" >> /etc/apt/sources.list
RUN echo "deb-src http://mirrors.163.com/debian-security/ stretch/updates main non-free contrib" >> /etc/apt/sources.list

RUN apt-get -y update && apt-get -y upgrade
RUN apt-get -y install python-rsa python-requests cron rsyslog vim htop net-tools telnet apt-utils tree wget curl git make gcc
RUN apt-get -y autoclean && apt-get -y autoremove

RUN sed -i "s/#cron./cron./g" /etc/rsyslog.conf

RUN echo "0 */2 * * * root /usr/bin/python3 /usr/local/python_scripts/baidu_zhanzhang_push.py" >> /etc/crontab

CMD service rsyslog start && service cron start && tail -f -n 20 /var/log/cron.log

If you manage the Docker image with Docker-Compose, the YML configuration file looks like this:

version: '3.5'

services:
  baidu-push:
    image: clay/baidu-push:1.0
    container_name: hexo-baidu-push
    restart: always
    environment:
      TZ: 'Asia/Shanghai'
    volumes:
      - /usr/local/baidu-push/logs:/tmp/baidu
      - /usr/local/baidu-push/baidu_zhanzhang_push.py:/usr/local/python_scripts/baidu_zhanzhang_push.py

Volume mounts:

  • /usr/local/baidu-push/logs: the log directory on the host
  • /usr/local/baidu-push/baidu_zhanzhang_push.py: the path of the Python script file on the host