【爬虫】数据抓取与存储

发布于 2023-01-26  508 次阅读


一、前言

网络爬虫,即Web Spider,是一个很形象的名字。把互联网比喻成一个蜘蛛网,那么Spider就是在网上爬来爬去的蜘蛛。网络蜘蛛是通过网页的链接地址来寻找网页的。从网站某一个页面(通常是首页)开始,读取网页的内容,找到在网页中的其它链接地址,然后通过这些链接地址寻找下一个网页,这样一直循环下去,直到把这个网站所有的网页都抓取完为止。如果把整个互联网当成一个网站,那么网络蜘蛛就可以用这个原理把互联网上所有的网页都抓取下来。这样看来,网络爬虫就是一个爬行程序,一个抓取网页的程序,简单的流程图如下。

二、爬虫用到的pip模块以及对应的功能。

Pip模块

功能

pip install reqeusts

用于发送http请求

pip install bs4

全称是Beatiful Soup,提供一些python式的函数用来处理导航、搜索、修改分析树等功能。通过解析文档为tiful Soup自动将输入文档转换为Unicode编码,输出文档转换为utf-8编码。

pip install pandas

基于NumPy的一种工具,提供了大量能使我们快速便捷地处理数据的函数和方法

pip install selenium

设置浏览器的参数,浏览器多窗口切换,设置等待时间,文件的上存与下载,Cookies处理以及frame框架操作

pip install sqlalchemy

是Python编程语言下的一款ORM框架,该框架建立在数据库API之上,使用关系对象映射进行数据库操作。

pip install pymongo

Python中用来操作MongoDB的一个库,MongoDB是一个基于分布式文件存储的数据库,旨在为WEB应用提供可扩展的高性能数据存储解决方案。

pip install dateparser

日期解析Python库,支持除用英语编写的日期之外的其它语言

pip install scrapy

Python的一个快速、高层次的屏幕抓取和web抓取框架,用于抓取web站点并从页面中提取结构化的数据。

pip install deepspeed

能够让亿万参数量的模型,能够在自己个人的工作服务器上进行训练推理。

pip install gerapy_auto_extractor

是Gerapy的自动提取器模块,可以使用这个包来区分列表页和详细信息页,我们可以使用它来提取 url从列表页中提取datetime,content,而不使用任何XPath或选择器。在对于中文新闻网站来说,它比其他场景更有效。

pip install gerapy

可以更方便地控制爬虫运行,更直观地查看爬虫状态,更实时地查看爬取结果,更简单地实现项目部署,更统一地实现主机管理

pip install scrapyd

是一个用来部署和运行Scrapy项目的应用,由Scrapy的开发者开发。其可以通过一个简单的Json API来部署(上传)或者控制你的项目。

三、实际操作

一:单网页爬取数据

流程:找到相应的网页(人民网),右键打开检查。查找内容所在HTML的位置,找到div块里的class和a类。打开jupyter notebook,导入requests库,爬取到原始内容,字符可能乱码,我们找到原来的字符编码是GB2312,在mysql里先创建一个database,在建立一个表,修改后更改格式,并建立自增id(表头),就可成功导入my sql,MongoDB类似。

调取所需的库

import requests
from bs4 import BeautifulSoup
import pandas as pd
from urllib import parse
from sqlalchemy import create_engine
import pymongo 
import json

找到目标网站并对其进行解码

url = "http://health.people.com.cn/GB/408568/index.html"
html = requests.get(url)
html.encoding = "GB2312"

用BeautifulSoup对人民网数据进行爬取

soup = BeautifulSoup(html.text,'lxml')
list
data = []
for i in soup.find_all("div",class_="newsItems"):
    title = i.a.text
    date = i.div.text
    url = parse.urljoin(url,i.a["href"])
    print(title,date,url)
    data.append((title,date,url))

把爬取的数据转换成DF格式并存储起来

df = pd.DataFrame(data,columns=["title","date","url"])

写入sql

sql = 'insert into qiushi(title,date,url) values(%s,%s,%s) charset=utf8'
engine = create_engine('mysql+pymysql://root:123456@localhost/test1?charset=utf8')
df.to_sql( 'newlist1', con=engine, if_exists='append') 

写入MOngoDB

client = pymongo.MongoClient('127.0.0.1',27017) #连接mongodb
database = client["NewsData"] #建立数据库
table = database["News"] 
data_ = json.loads(df.T.to_json())

导入sql数据库展示:

导入MongoDB展示:

遇到的问题:


1 识别不出来pymysql
解决办法:cmd命令下pip install pymysql
2 错误代码1364
解决办法:在mysql里设置自增表头
3 导入MongoDB时出现timeout
解决办法:在bin目录下执行mongod.exe查看存储路径,然后在相应位置建立相应文件
然后在bin目录下cmd执行mongod.exe -dbpath 对应路径,重启电脑,再次导入
发现还是timeout
发现是主机端口27017写错了,改正后导入成功

二:多网页的爬取

1.items,middlewares,pipelines,settings如何配置以及对应代码

Items代码:

把初始代码第九行下面的全部删掉,然后定义想要存储的字段都有什么

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
 
import scrapy
 
class NewsdataItem(scrapy.Item):
    title=scrapy.Field() #文章标题
    url=scrapy.Field() #文章链接
    date=scrapy.Field() #发布日期
    content=scrapy.Field() #文章正文
    site=scrapy.Field() #站点
    item=scrapy.Field() #栏目
    student_id=scrapy.Field() #学号

middlewares代码:


我们在初始的基础上添加105行以后的代码
使用中心键调取配置里的信息,随机抽取一个放到USER_AGENT_LIST里

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
from scrapy import signals
 
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
 
 
class NewsdataSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.
 
        # Should return None or raise an exception.
        return None
 
    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.
 
        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i
 
    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.
 
        # Should return either None or an iterable of Request or item objects.
        pass
 
    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.
 
        # Must return only requests (not items).
        for r in start_requests:
            yield r
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
 
 
class NewsdataDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.
 
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
 
    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
 
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None
 
    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
 
        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response
 
    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
 
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass
 
    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
 
 
# 添加Header
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
from scrapy.utils.project import get_project_settings
import  random
settings = get_project_settings()
class RotateUserAgentMiddleware(UserAgentMiddleware):
    def process_request(self, request, spider):
        referer = request.url
        if referer:
            request.headers["referer"] = referer
        USER_AGENT_LIST = settings.get('USER_AGENT_LIST')
        user_agent = random.choice(USER_AGENT_LIST)
        if user_agent:
            request.headers.setdefault('user-Agent', user_agent)
            print(f"user-Agent:{user_agent}")

pipelines代码:


这一步是数据存储,这里我们选着存入MongoDB而不是Mysql,因为Mysql只支持一种数据类型的存储,MongoDB就没有这个限制,在初始代码的基础上,加入第十、十一行,表示导入MongoDB的包,并加载其配置,包括ip地址,用户名,密码,端口号。再把class中的内容全部替换,如果MongoDB没有用户名和密码,第24、25行注释掉,并把原27行的连接方式改成28行。

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
 
 
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
 
# 添加必备包和加载设置
import pymongo
from scrapy.utils.project import get_project_settings
 
settings = get_project_settings()
 
 
class NewsdataPipeline:
    # class中全部替换
    def __init__(self):
        host = settings["MONGODB_HOST"]
        port = settings["MONGODB_PORT"]
        dbname = settings["MONGODB_DATABASE"]
        sheetname = settings["MONGODB_TABLE"]
        #username = settings["MONGODB_USER"]
        #password = settings["MONGODB_PASSWORD"]
        # 创建MONGODB数据库链接
        #client = pymongo.MongoClient(host=host, port=port, username=username, password=password)
        client = pymongo.MongoClient(host=host, port=port)
        # 指定数据库
        mydb = client[dbname]
        # 存放数据的数据库表名
        self.post = mydb[sheetname]
 
    def process_item(self, item, spider):
        data = dict(item)
        # 数据写入
        self.post.insert_one(data)
        return item

settings代码:


把第20行的机器人协议由ture改成false,如果遵守很多网站都爬不了。再把53-55行的注释解开,把54行后面的名字改成中心键代码里的那个名字,其中543这个数字决定了爬取的顺序,数字小的先执行。再把66-69行的注释解开,这个不解开就执行不了pipelines方法类。最后在最底下加入客户端信息(92-111行)和MongoDB的设置(114-118行)

# Scrapy settings for NewsData project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html
 
BOT_NAME = 'NewsData'
 
SPIDER_MODULES = ['NewsData.spiders']
NEWSPIDER_MODULE = 'NewsData.spiders'
 
 
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'NewsData (+http://www.yourdomain.com)'
 
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
 
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
 
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
 
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
 
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
 
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}
 
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'NewsData.middlewares.NewsdataSpiderMiddleware': 543,
#}
 
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'NewsData.middlewares.RotateUserAgentMiddleware': 543,
  #'NewsData.middlewares.NewsdataDownloaderMiddleware': 543,
}
 
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}
 
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'NewsData.pipelines.NewsdataPipeline': 300,
}
 
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
 
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
 
 
USER_AGENT_LIST = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
 
 
# 添加MONGODB数仓设置
MONGODB_HOST = "localhost"  # 数仓IP
MONGODB_PORT = 27017  # 数仓端口号
MONGODB_DATABASE = "NewsData"  # 数仓数据库
MONGODB_TABLE = "News_Process_A"  # 数仓数据表单

2.展示抓取网站的频道列表

网址网站板块
https://www.easyzw.com/html/xieren/作文题材写人
https://www.easyzw.com/html/xiejing/作文题材写景
https://www.easyzw.com/html/xiangxiang/作文题材想象
https://www.easyzw.com/html/dongwuzuowen/作文题材动物
https://www.easyzw.com/html/zhouji/作文题材周记
https://www.easyzw.com/html/yingyuzuowen/作文题材英语
https://www.easyzw.com/html/huanbaozuowen/作文题材环保
https://www.easyzw.com/html/xinqing/作文题材心情
https://www.easyzw.com/html/zhuangwuzuowen/作文题材状物
https://www.easyzw.com/html/riji/作文题材日记

3.描述爬虫启动start_requests、列表解析parse、内容解析parse_detail、以及数据存储的文字描述对应代码。

爬虫启动start_requests部分:

    def start_requests(self):
        for url in self.start_urls: #不止爬一个网站,所以要循环,self是调用自身的意思
            item = NewsdataItem()
            item["site"] = url[1]
            item["item"] = url[2]
            item["student_id"] = "20201905"
            # ['http://www.news.cn/politicspro/', '新华网', '时政']
 
            yield scrapy.Request(url=url[0], meta={"item": item}, callback=self.parse) #yield是返回函数,url是目标网址,callback是函数处理方式

列表解析parse部分:

def parse(self, response):
    item = response.meta["item"]

    site_ = item["site"]
    item_ = item["item"]
    title_list = response.xpath('//li/a/text()').extract() #xpath的解析方式,要从网站的开发者模式查看,extract表示解析成列表的方式
    url_list = response.xpath('//li/a/@href').extract() #双斜杠表示模糊匹配

    for each in range(len(title_list)): #取随便哪一个的长度
        item = NewsdataItem()  #定义字典
        item["title"] = title_list[each] #然后开始向里面填充,注意这里面的名字要和items里的一样
        item["url"] = "https://www.easyzw.com" + str(url_list[each]) #这一步是相对路径改成绝对路径,有些网站不用改
        item["site"] = site_
        item["item"] = item_
        item["student_id"] = "20201905"
        yield scrapy.Request(url=item["url"], meta={"item": item}, callback=self.parse_detail) #这是访问这个循环体内的url,相当于点击操作,处理方法为下面的parse_detail
        #meta表示带着已经填充好的url数据

内容解析parse_detail:

    def parse_detail(self, response):
        # data = extract_detail(response.text)
        item = response.meta["item"]  #重新定义一个字典
        item["date"] = ""
        strs = response.xpath('//div[@class="content"]').extract_first() #和上面一样从开发者模式定位,first是因为列表无法存入字典
        item["content"] = BeautifulSoup(strs, 'lxml').text
        return item #把数据写入MongoDB里

4.抓取的数据截图

三:Gerapy的部署、搭建

初始化:

新建一个文件夹,在里面按住shift点右键,输入gerapy init,出现这个表示成功

然后文件夹里会出现这个文件

我们cd进去这个文件,输入cd .\gerapy
然后输入数据迁移命令:gerapy migrate,执行完后会生成很多表单

接下来要初始化一个管理账号,输入命令gerapy initadmin

这里显示密码和账号都是admin
然后启动服务gerapy runserver 0.0.0.0:8000

然后去浏览器输入127.0.0.1:8000,进入这个界面

这里的账号和密码是之前的admin。

主机管理设置:

点击创建

输入下列信息后点创建

这个时候是不成功的,因为没有启动6800端口服务,我们找到scrapyd.exe所在路径,打开shell脚本,输入.\scrapyd.exe,结果如下,并一直保持该窗口不关闭

这时主机管理显示正常

项目管理:

我们把之前写的爬虫脚本(NewsData)放到新建文件夹 (5)\gerapy\projects里面,这时候项目管理会出现

点击部署,输入下方的描述后,点击打包。打包成功后点击部署,会出现主机1部署成功

任务管理:

点击创建,开始建立定时任务

名称自己填,项目名字是projects文件里的名字,爬虫名字是spiders里的py名称,主机选localhost,时间选择亚洲-香港,调度方式选择周期,开始时间和周期可以自己选

创建成功后有这个界面

点击状态,可以看执行的情况

点击调度可以看脚本是不是在执行

然后等待周期爬取,数据也能导入进来。

遇到的问题:


1 ValueError: Missing scheme in request url
解决办法:这是相对路径和绝对路径的原因,要多进行一步拼接

item["url"] = parse.urljoin(response.url,data[each]["url"])

2 接上条,NameError: name 'parse' is not defined
解决办法:from urllib import parse

3 ZeroDivisionError: division by zero
解决办法:这是没有抓到数据,用自动爬取算法爬取某些网站不可取,需要回网页f12定位

4 在pip install gerapy_auto_extractor时报错
解决办法:去镜像网站下载
pip install gerapy_auto_extractor -i https://pypi.tuna.tsinghua.edu.cn/simple some-package

5 gerapy init初始化显示无法识别gerapy
管理员权限打开cmd卸载重装

届ける言葉を今は育ててる
最后更新于 2023-01-26