Writing a Simple WordPress Site Scraper in Python

Published 2022.03.26


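When it comes to site scraping, most people have probably heard of the well-known 火车头 (LocoySpider) collector. But it is paid software, many features are locked unless you pay, and you still have to click by hand to start each collection run, which is a hassle. A lazy person like me certainly doesn't want to click every day. Python comes preinstalled on most Linux servers, so not putting it to use would be a real waste, and hardly in keeping with the thrift our country advocates. Below I use a news scraping site I threw together as an example and briefly share the scraper program.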

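First, to avoid collecting the same article twice, we need a MySQL database. That is no extra trouble, because the website environment needs it installed anyway. The code targets Python 3.X. On systems such as CentOS 7 the default is 2.7, so if the code throws errors, upgrade your Python. Tutorial reference: "Upgrading the default Python to 3.X on Centos7.X and installing the pip3 package manager".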
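The deduplication idea is simple enough to sketch on its own: keep a table of URLs that have already been collected, and only fetch a URL that is not in it yet. The snippet below is only an illustration. It uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL so that it runs without a server, and the table name seen_urls and the function collect_once are made up for this example.

```python
import sqlite3
import time


def collect_once(db, url):
    """Return True if url is new (and record it), False if it was seen before."""
    row = db.execute("SELECT 1 FROM seen_urls WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return False
    now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    db.execute("INSERT INTO seen_urls(url, insert_time) VALUES (?, ?)", (url, now))
    db.commit()
    return True


# In-memory SQLite database, standing in for the MySQL table (assumption)
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE seen_urls ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT, insert_time TEXT)"
)

print(collect_once(db, "/xinxianshi/1.html"))  # first call: the URL is new
print(collect_once(db, "/xinxianshi/1.html"))  # second call: already recorded
```

The real program follows the same check-then-insert pattern, only against MySQL.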

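The program below needs the requests, BeautifulSoup and pymysql packages (on PyPI: requests, beautifulsoup4, pymysql). If you don't know how to install them, see my earlier posts or search for how to install Python packages.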

Database structure


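I wrote this code quickly, so I didn't include anything to create the table. Create it by hand; with phpMyAdmin's visual interface it shouldn't be hard. The one thing to watch is that the id field needs the A_I checkbox ticked, so that it auto-increments.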
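For anyone who prefers a script to clicking through phpMyAdmin, here is a rough sketch of creating the table from Python. The column names url and insert_time come from the INSERT statement in the main program below, and cj_5afxw is the table name used in this post. The column types, the function names and the idea of passing in an already opened connection are my assumptions, so adjust them to your own setup.

```python
def build_create_table_sql(table="cj_5afxw"):
    """Build the CREATE TABLE statement for the collected-URL table."""
    # Column types are assumptions; only the column names come from the post.
    # The table name is pasted into the SQL text, so pass trusted names only.
    return (
        "CREATE TABLE IF NOT EXISTS `{0}` ("
        "`id` INT UNSIGNED NOT NULL AUTO_INCREMENT, "
        "`url` VARCHAR(255) NOT NULL, "
        "`insert_time` DATETIME NOT NULL, "
        "PRIMARY KEY (`id`)"
        ") DEFAULT CHARSET=utf8"
    ).format(table)


def create_table(connection, table="cj_5afxw"):
    """Run the statement on an open DB-API connection, e.g. from pymysql.connect()."""
    cursor = connection.cursor()
    try:
        cursor.execute(build_create_table_sql(table))
        connection.commit()
    finally:
        cursor.close()


print(build_create_table_sql())
```

AUTO_INCREMENT on id is the scripted equivalent of ticking the A_I checkbox.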

Source code

Database operations: sql.py

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import pymysql


class msql:
    """Thin wrapper around a pymysql connection."""

    def __init__(self, host, database, user, pwd):
        self.host = host
        self.database = database
        self.user = user
        self.pwd = pwd
        self.db = None

    def conn(self):
        # Open the connection. It is kept in self.db so that the
        # attribute does not shadow this method.
        self.db = pymysql.connect(host=self.host, user=self.user,
                                  password=self.pwd, database=self.database,
                                  charset="utf8")

    def insertmany(self, sql, data):
        cursor = self.db.cursor()
        try:
            # Run the INSERT statement once per row in data
            cursor.executemany(sql, data)
            # Commit the transaction
            self.db.commit()
        except Exception as e:
            # On error, report it and roll back the transaction
            print(e)
            self.db.rollback()
        finally:
            cursor.close()

    def insert(self, sql, args=None):
        cursor = self.db.cursor()
        # args are bound to the %s placeholders by the driver
        cursor.execute(sql, args)
        self.db.commit()
        cursor.close()

    def ishave(self, sql, args=None):
        cursor = self.db.cursor()
        # Run the query
        cursor.execute(sql, args)
        # Fetch all matching rows and return how many there are
        ret = cursor.fetchall()
        cursor.close()
        return len(ret)

    def mclose(self):
        self.db.close()

Main program: getdata.py (name it whatever you like, as long as the scheduled task later uses the same name)

#!/usr/bin/python3
# -*- coding: UTF-8 -*-
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

import sql

BASE_URL = 'https://www.xinwentoutiao.net'


def getpage(db):
    pageurl = BASE_URL + '/xinxianshi/'
    gkr = requests.get(pageurl, timeout=30)
    gkr.encoding = 'UTF-8'
    gksoup = BeautifulSoup(gkr.text, "html.parser")
    article = gksoup.find('ul', attrs={'class': 'gv-list'})
    for li in article.find_all('li'):
        singleurl = li.find('div', attrs={'class': 'gv-title'}).find("a").get("href")
        # Skip URLs that are already recorded; %s lets the driver escape the value
        num = db.ishave("SELECT * FROM cj_5afxw WHERE url=%s", (singleurl,))
        if num == 0:
            getsingle(singleurl)
            now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
            db.insert("INSERT INTO cj_5afxw(url,insert_time) VALUES (%s,%s)", (singleurl, now))
            print(singleurl)
        else:
            print(singleurl + " already exists")
    db.mclose()


def getsingle(url):
    # urljoin accepts both site-relative and absolute article links
    gkr = requests.get(urljoin(BASE_URL, url), timeout=30)
    gkr.encoding = 'UTF-8'
    gksoup = BeautifulSoup(gkr.text, "html.parser")
    title = gksoup.find('h2').text
    content = gksoup.find('div', attrs={'id': 'art-body-gl'})
    # Drop the button embedded in the article body, if there is one
    button = content.find('button')
    if button is not None:
        button.decompose()
    #content.find('div',attrs={"id": "toc"}).decompose()
    # Replace with the address of your own collection interface
    posturl = 'http://<your-collection-interface-address>?action=save'
    data = {'post_title': title, 'post_content': content.prettify(), 'post_category': 505}
    print(data)
    r = requests.post(posturl, data=data, timeout=30)
    print(r.text)


if __name__ == '__main__':
    msql = sql.msql('127.0.0.1', 'database_name', 'database_user', 'database_password')
    msql.conn()
    getpage(msql)
    #getsingle("https://www.xinwentoutiao.net/xinxianshi/2164624.html")
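One detail worth checking on your own target site is whether the links on the list page are site-relative (such as /xinxianshi/2164624.html) or absolute. Simply prefixing the site root only works for the first kind. As a small sketch, urllib.parse.urljoin from the standard library resolves both; the helper name absolute_url is hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.xinwentoutiao.net"


def absolute_url(href, base=BASE_URL):
    """Resolve a link taken from the list page against the site root."""
    return urljoin(base, href)


# A site-relative link gets the site root prepended
print(absolute_url("/xinxianshi/2164624.html"))
# An absolute link comes back unchanged
print(absolute_url("https://www.xinwentoutiao.net/xinxianshi/2164624.html"))
```

Both calls print the same full address, so the function can be handed either form.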

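The collection interface file was already shared in my earlier post on writing 火车头 (LocoySpider) publishing rules for WordPress; if you don't have it, download it from there. One thing to note in the code above: change the table name. I used cj_5afxw; replace it with the name of the table you created by hand earlier. The database connection details go without saying: fill in your own.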
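If you would rather not hardcode these values in getdata.py, one option is to read them from environment variables. This is only a sketch: the variable names CJ_DB_HOST, CJ_DB_NAME, CJ_DB_USER, CJ_DB_PASSWORD and CJ_DB_TABLE are invented for this example, and the defaults for host and table mirror the values used in the code above.

```python
import os


def load_db_config(env=None):
    """Collect connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("CJ_DB_HOST", "127.0.0.1"),
        "database": env.get("CJ_DB_NAME", ""),
        "user": env.get("CJ_DB_USER", ""),
        "pwd": env.get("CJ_DB_PASSWORD", ""),
        "table": env.get("CJ_DB_TABLE", "cj_5afxw"),
    }


# Example with an explicit mapping instead of the real environment
cfg = load_db_config({"CJ_DB_NAME": "wordpress", "CJ_DB_USER": "wp"})
print(cfg["host"], cfg["database"], cfg["table"])
```

The first four values line up with the arguments of sql.msql(host, database, user, pwd). Keep in mind that a scheduled task only sees variables that are set in its own shell command or script.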

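Then put the two files in the same folder and add a scheduled task in the 宝塔 (BT) panel: choose the shell type, enter python3 /XXX/XXX/getdata.py as the content, and set it to run at a fixed time every day.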


The task log shows what was collected on each run.
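That log is presumably just whatever the script prints. If you want each line to carry a timestamp and a level, the print calls could be swapped for the standard logging module. A minimal sketch, with the logger name collector and the helper make_logger chosen only for illustration:

```python
import logging
import sys


def make_logger(stream=None):
    """Create a logger that writes timestamped lines to the given stream."""
    logger = logging.getLogger("collector")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate lines if called more than once
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


log = make_logger()
log.info("collected %s", "/xinxianshi/2164624.html")
log.warning("%s already exists", "/xinxianshi/2164624.html")
```

Writing to standard output keeps the lines in the same place the print calls went.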

Category: WordPress tutorials
