My first blog post (scraping 小故事网 with Python and writing the results to MySQL)

Posted: 2021-07-01 10:21:17

import codecs
import queue
import re

import pymysql
import requests as rq
from bs4 import BeautifulSoup

tasks = queue.Queue()          # queue of links waiting to be crawled
tasks_pass = set()             # links that have already been queued
results = {}                   # url -> page title
count = 0                      # number of links processed so far
tasks.put('/index.html')       # seed the queue with the home page
tasks_pass.add('/index.html')  # and mark it as already queued
db = pymysql.connect(host="localhost", user="testuser", password="test123",
                     database="TESTDB", charset='gbk')
dbc = db.cursor()

# Stop after about 100 links, or earlier if the queue runs dry
# (tasks.get() would block forever on an empty queue).
while count <= 100 and not tasks.empty():
    url = 'http://wap.xigushi.com' + tasks.get()  # take the next link
    count += 1
    if re.search('images', url):
        # Image links have no title and cannot be decoded as text, so skip them.
        print('---------- skipping ----------')
        continue
    # The encoding depends on the site; this one declares GB2312.
    web = rq.get(url).content.decode('gbk', errors='replace')
    for u in re.findall('href="(/.*?)"', web):    # find all on-site links
        if u not in tasks_pass:                   # queue links not seen before
            tasks.put(u)
            tasks_pass.add(u)
    # Parse the page that was already downloaded instead of fetching it again.
    bsObj = BeautifulSoup(web, "lxml")
    text = bsObj.title.get_text() if bsObj.title else ''
    print(url + ' ' + text)
    results[url] = text
    sql = "insert into data1(url,title) values(%s,%s);"
    dbc.execute(sql, (url, text))
    db.commit()
db.close()

with codecs.open('results.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(u + ' ' + t for u, t in results.items()))

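Notes:

The original PROJECT used multiple threads and wrote to MongoDB. I am not yet familiar with multithreading, so I went single-threaded, and I chose MySQL because I know it better and prefer it.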

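BUGS:

Encoding problem:
小故事网 is encoded in GB2312, which can be read from the META tag in the page source. GBK is a superset of GB2312, so the pages can be decoded as GBK. Note that MySQL's default character set was latin1 (before version 8.0). When I ran into encoding problems, every answer I found on Google said to use UTF-8. After I changed the MySQL character set to UTF-8, PyCharm read the data correctly but MySQL still displayed garbled text, and I had to switch to GBK to fix it. That matches the charset='gbk' the script passes to pymysql.connect. I had guessed that UTF-8 ought to be compatible with GBK, but the two encodings only agree on the ASCII range. I also had to repeat the character set change several times before I realized that the database, the table, and the column each have their own setting. In the end I changed all of them.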
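The decoding step described above can be sketched roughly as follows. This is a minimal illustration, not part of the original script: the helper names detect_charset and decode_page and the sample HTML are made up for the example, and the fallback to GBK is an assumption that fits this particular site.

```python
import re


def detect_charset(raw: bytes, default: str = "gbk") -> str:
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Only the start of the page is inspected; the META tag normally sits in <head>.
    match = re.search(rb'charset=["\']?([\w-]+)', raw[:2048], re.IGNORECASE)
    return match.group(1).decode("ascii").lower() if match else default


def decode_page(raw: bytes) -> str:
    """Decode a downloaded page, treating GB2312 as GBK."""
    charset = detect_charset(raw)
    if charset == "gb2312":
        charset = "gbk"  # GBK is a superset of GB2312
    return raw.decode(charset, errors="replace")


# Hypothetical sample page, encoded the way the site declares itself.
sample = '<meta charset="gb2312"><title>小故事</title>'.encode("gb2312")
print(detect_charset(sample))
print(decode_page(sample))

# UTF-8 bytes read as GBK do not round-trip: the encodings differ beyond ASCII.
garbled = "故事".encode("utf-8").decode("gbk", errors="replace")
print(garbled == "故事")
```

Using errors="replace" keeps one bad byte from aborting the crawl, at the cost of a replacement character in the output.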
This site explains how to change the MySQL character set very well: http://fatkun.com/2011/05/mysql-alter-charset.html
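To make the three levels concrete, the sketch below builds one ALTER statement for each of the database, the table, and a single column. The helper charset_statements is hypothetical, and the column type VARCHAR(255) is an assumption, because the post does not show the schema of data1. The statements are only printed here. To apply them you would run each one through a cursor, as the script does with dbc.execute.

```python
def charset_statements(database, table, column, column_type, charset="gbk"):
    """Build ALTER statements for the database, table and column levels."""
    return [
        # Default character set for tables created in this database later on.
        f"ALTER DATABASE `{database}` CHARACTER SET {charset};",
        # Converts the table default and its existing text columns.
        f"ALTER TABLE `{table}` CONVERT TO CHARACTER SET {charset};",
        # Changes a single column; the column type must be restated.
        f"ALTER TABLE `{table}` MODIFY `{column}` {column_type} CHARACTER SET {charset};",
    ]


# TESTDB, data1 and title come from the script; VARCHAR(255) is assumed.
statements = charset_statements("TESTDB", "data1", "title", "VARCHAR(255)")
for stmt in statements:
    print(stmt)
```

CONVERT TO already rewrites the table's text columns, so the column-level statement is only needed when one column should differ from the rest of the table.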
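Regular expression matching problem:
The code is mainly meant to scrape links and their titles, but it raised an error whenever it hit an image. I planned to write a regular expression to filter images out, but the one I wrote had no effect, so in the end I just added a check to skip them. My coding still needs work.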
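One way the regex filter could look is sketched below. It is not the author's regex, which the post does not show. The pattern, the helper is_page_link, and the sample paths are all assumptions for illustration. The idea is to drop image links when they are found, so they never enter the queue.

```python
import re

# Matches anything under /images/ or ending in a common image extension.
# The extension list is an assumption; adjust it to what the site serves.
IMAGE_LINK = re.compile(r"/images/|\.(?:jpe?g|png|gif|bmp)(?:\?.*)?$", re.IGNORECASE)


def is_page_link(path: str) -> bool:
    """Return True when the link does not look like an image."""
    return IMAGE_LINK.search(path) is None


# Hypothetical links of the kind re.findall('href="(/.*?)"', web) returns.
links = ["/index.html", "/images/logo.gif", "/aqgs/1234.html", "/upload/pic.JPG"]
pages = [u for u in links if is_page_link(u)]
print(pages)
```

Filtering at this point means an image URL is never downloaded or decoded, which is where the original error came from.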
