
Scrapy: crawling Zhihu questions and answers, with asynchronous writes to a MySQL database


Python version: Python 2.7

The workflow for crawling Zhihu:

一、Analysis. When you visit the Zhihu homepage (https://www.zhihu.com) without being logged in, you are redirected to https://www.zhihu.com/signup?next=%2F.

To crawl Zhihu you therefore have to log in first. While logging in, watch which page the POST (or GET) request is sent to; a packet-capture tool can be used to find the URL to which the username/password form data is submitted.

1. Using the packet-capture tool, the form data turns out to be submitted with a POST request to 'https://www.zhihu.com/api/v3/oauth/sign_in'.

2. Inspecting the content of that request shows that the POST body contains not only the username and password but also form fields such as timestamp, lang, client_id and signature. To understand each field, watch which values change (and how) across repeated logins.

3. After logging in several times and comparing, only timestamp and signature change; every other value stays constant.

4. The site's JavaScript shows that signature is produced by combining and encrypting several fields; the timestamp is the core input, so each new timestamp yields a different signature (a hedged sketch is given after this list).

5. Because computing signature is fairly involved, this article simply copies the timestamp and signature of a successful browser login into the request data and logs in with those.
6. With the form data filled in, the POST request failed with a "missing captcha ticket" error (capsion_ticket). Analysis shows the captcha ticket is a credential issued so that a captcha can be requested, and the packet capture contains two captcha-related requests. The first returns:

{'show_captcha': true}

and the second returns: {'img_base64': 'Rfadausifpoauerfae'}.
7. {'show_captcha': true} is the key signal that a captcha is required. The packet capture also shows that the Set-Cookie header of the first response carries the capsion_ticket captcha ticket.
8. The simulated login then failed again with an 'ERR_xxx_AUTH_TOKEN' error, which appears when the captcha image is requested with the captcha ticket. The capture shows a header of the form Authorization: oauth ce30dasjfsdjhfkiswdnf..., so this Authorization header also has to be added to the request headers.
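For reference, several public write-ups of this login flow report that signature is an HMAC-SHA1 over grant_type + client_id + source + timestamp, keyed with a constant taken from Zhihu's JavaScript. The sketch below is only an assumption based on that reading; the key shown is a commonly quoted placeholder and the scheme may well have changed, so copying the values from the browser (step 5) remains the safe fallback.

# A hedged sketch (assumption): generating timestamp/signature in Python 2
# instead of copying them from the browser. The HMAC key below comes from
# public write-ups of Zhihu's login JS and may be outdated.
import hmac
import time
from hashlib import sha1


def make_signature(client_id='c3cef7c66a1843f8b3a9e6a1e3160e20', source='com.zhihu.web'):
    timestamp = str(int(time.time() * 1000))  # millisecond timestamp, as in the form data
    h = hmac.new('d1b964811afb40118a12068ff74a12f4', digestmod=sha1)
    h.update('password')   # grant_type
    h.update(client_id)
    h.update(source)
    h.update(timestamp)
    return timestamp, h.hexdigest()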
The captcha problem:
- Zhihu uses two kinds of captcha: an English image captcha and a "click the upside-down characters" captcha. When a captcha is required during login, data is sent to one of these two endpoints:
Inverted-character captcha: https://www.zhihu.com/api/v3/oauth/captcha?lang=cn
English image captcha: https://www.zhihu.com/api/v3/oauth/captcha?lang=en
- The English captcha returns four English letters; an online OCR service such as YunDaMa (云打码) can be used to recognise them.
- For the inverted-character captcha, each character occupies a known pixel range. When the captcha is clicked during login, the endpoint https://www.zhihu.com/api/v3/oauth/captcha?lang=cn expects pixel coordinates (x, y); for example, if the inverted characters are the third and fifth ones, any point inside their ranges is accepted, so submitting suitable pixel points is enough to log in (see the short sketch after these notes).
- Only the inverted-character captcha is handled here.
- Only the first page of questions and answers is crawled.
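As a quick illustration of the payload described above, turning "the 3rd and 5th characters are upside down" into the input_points data sent to the captcha endpoint could look like the following sketch; the coordinates are the same approximate positions the spider below keeps in points_list.

# Minimal sketch: map 1-based character positions to the pixel points expected
# by https://www.zhihu.com/api/v3/oauth/captcha?lang=cn
import json

points_list = [[20, 27], [42, 25], [65, 20], [90, 25], [115, 32], [140, 25], [160, 25]]


def build_captcha_payload(positions):
    # positions are 1-based, e.g. [3, 5] for the 3rd and 5th characters
    input_points = [points_list[p - 1] for p in positions]
    return json.dumps({"img_size": [200, 44], "input_points": input_points})

# build_captcha_payload([3, 5]) -> input_points [[65, 20], [115, 32]]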
二、Creating the Scrapy project
  scrapy startproject ZhiHuSpider
  scrapy genspider zhihu zhihu.com
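Before adding any code it is worth touching settings.py. The snippet below is a minimal sketch; the MYSQL_* setting names and the MysqlTwistedPipeline class (sketched at the end of this article) are my own naming and not part of the original project.

# ZhiHuSpider/settings.py -- minimal sketch of the settings this project relies on
BOT_NAME = 'ZhiHuSpider'

SPIDER_MODULES = ['ZhiHuSpider.spiders']
NEWSPIDER_MODULE = 'ZhiHuSpider.spiders'

# Zhihu's robots.txt disallows crawlers, so the default has to be switched off
# or every request will be filtered out.
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'ZhiHuSpider.pipelines.MysqlTwistedPipeline': 300,  # hypothetical pipeline, sketched later
}

# Connection details consumed by the asynchronous MySQL pipeline sketch
MYSQL_HOST = '127.0.0.1'
MYSQL_DBNAME = 'zhihu'
MYSQL_USER = 'root'
MYSQL_PASSWORD = ''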
三、Code
  The code in zhihu.py is as follows:
  
# -*- coding: utf-8 -*-
import base64
import json
import urlparse
import re
from datetime import datetime
import scrapy
from scrapy.loader import ItemLoader
from ..items import ZhiHuQuestionItem, ZhiHuAnswerItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com']
    # JSON API for the answers of a question; formatted with (question_id, offset)
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset={}&sort_by=default"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
        'Referer': 'https://www.zhihu.com',
        'HOST': 'www.zhihu.com',
        'Authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    # Approximate pixel coordinates of the seven character positions in the captcha image
    points_list = [[20, 27], [42, 25], [65, 20], [90, 25], [115, 32], [140, 25], [160, 25]]

    def start_requests(self):
        """
        Override the parent class's start_requests() so the crawl starts from the
        login/captcha URL instead of start_urls.
        :return:
        """
        yield scrapy.Request(
            url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
            callback=self.captcha,
            headers=self.headers,
        )

    def captcha(self, response):
        show_captcha = json.loads(response.body)['show_captcha']
        if show_captcha:
            print 'A captcha is required'
            yield scrapy.Request(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                method='PUT',
                headers=self.headers,
                callback=self.shi_bie
            )
        else:
            print 'No captcha required'
            # Log in directly
            post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            post_data = {
                'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                'grant_type': 'password',
                'timestamp': '1515391742289',
                'source': 'com.zhihu.web',
                'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                'username': '',  # account
                'password': '',  # password
                'captcha': '',
                'lang': 'cn',
                'ref_source': 'homepage',
                'utm_source': ''
            }

            yield scrapy.FormRequest(
                url=post_url,
                headers=self.headers,
                formdata=post_data,
                callback=self.index_page
            )

    def shi_bie(self, response):
        try:
            img = json.loads(response.body)['img_base64']
        except Exception, e:
            print 'Failed to get img_base64: %s' % e
        else:
            print 'Got the base64-encoded captcha image'
            # Decode the base64 image and save it locally
            img = img.encode('utf-8')
            img_data = base64.b64decode(img)
            with open('zhihu_captcha.GIF', 'wb') as f:
                f.write(img_data)

            captcha = raw_input('Enter the positions of the inverted characters (e.g. 35): ')
            if len(captcha) == 2:
                # Two inverted characters
                first_char = int(captcha[0]) - 1   # index of the first character in points_list
                second_char = int(captcha[1]) - 1  # index of the second character in points_list
                captcha = '{"img_size":[200,44],"input_points":[%s,%s]}' % (
                    self.points_list[first_char], self.points_list[second_char])
            else:
                # Only one inverted character
                first_char = int(captcha[0]) - 1
                captcha = '{"img_size":[200,44],"input_points":[%s]}' % (
                    self.points_list[first_char])

            data = {
                'input_text': captcha
            }
            yield scrapy.FormRequest(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                headers=self.headers,
                formdata=data,
                callback=self.get_result
            )

    def get_result(self, response):
        try:
            yan_zheng_result = json.loads(response.body)['success']
        except Exception, e:
            print 'The captcha POST request failed: {}'.format(e)
        else:
            if yan_zheng_result:
                print 'Captcha accepted'
                post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
                post_data = {
                    'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                    'grant_type': 'password',
                    'timestamp': '1515391742289',
                    'source': 'com.zhihu.web',
                    'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                    'username': '',  # account
                    'password': '',  # password
                    'captcha': '',
                    'lang': 'cn',
                    'ref_source': 'homepage',
                    'utm_source': ''
                }
                # The timestamp and signature above are copied from a packet capture

                yield scrapy.FormRequest(
                    url=post_url,
                    headers=self.headers,
                    formdata=post_data,
                    callback=self.index_page
                )
            else:
                print 'Wrong captcha!'

    def index_page(self, response):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers
            )

    def parse(self, response):
        """
        Extract the URLs of all questions on the home page and follow them to crawl
        the detail pages.
        :param response:
        :return:
        """
        # /question/19618276/answer/267334062
        all_urls = response.xpath('//a[@data-za-detail-view-element_name="Title"]/@href').extract()
        all_urls = [urlparse.urljoin(response.url, url) for url in all_urls]
        for url in all_urls:
            # https://www.zhihu.com/question/19618276/answer/267334062
            # Extract both the detail URL and the question ID
            result = re.search(r'(.*zhihu.com/question/(\d+))', url)
            if result:
                detail_url = result.group(1)
                question_id = result.group(2)
                # Hand the detail URL to the downloader
                yield scrapy.Request(
                    url=detail_url,
                    headers=self.headers,
                    callback=self.parse_detail_question,
                    meta={
                        'question_id': question_id,
                    }
                )

                # While requesting the detail page, also request the answers for this
                # question ID. Questions and answers live at independent URLs, and the
                # answers are served by a JSON API that can be requested directly,
                # without going through the question page.
                yield scrapy.Request(
                    # Arguments: question ID and offset; the default offset of 0 starts
                    # from the first answer
                    url=self.start_answer_url.format(question_id, 0),
                    headers=self.headers,
                    callback=self.parse_detail_answer,
                    meta={
                        'question_id': question_id
                    }
                )

                # Only the first question is followed in this simple example
                break

    def parse_detail_question(self, response):
        """
        Parse the question data on the detail page: title, description, view count,
        follower count, and so on.
        :param response:
        :return:
        """
        item_loader = ItemLoader(item=ZhiHuQuestionItem(), response=response)
        item_loader.add_value('question_id', response.meta['question_id'])
        item_loader.add_xpath('question_title', '//div[@class="QuestionHeader"]//h1/text()')
        item_loader.add_xpath('question_topic', '//div[@class="QuestionHeader-topics"]//div[@class="Popover"]/div/text()')
        # The question may not have a description
        item_loader.add_xpath('question_content', '//span[@class="RichText"]/text()')
        item_loader.add_xpath('question_watch_num', '//button[contains(@class, "NumberBoard-item")]//strong/text()')
        item_loader.add_xpath('question_click_num', '//div[@class="NumberBoard-item"]//strong/text()')
        item_loader.add_xpath('question_answer_num', '//h4[@class="List-headerText"]/span/text()')
        item_loader.add_xpath('question_comment_num', '//div[@class="QuestionHeader-Comment"]/button/text()')
        item_loader.add_value('question_url', response.url)
        item_loader.add_value('question_crawl_time', datetime.now())

        question_item = item_loader.load_item()
        yield question_item

    def parse_detail_answer(self, response):
        """
        Parse all answers for a given question ID.
        :param response:
        :return:
        """
        answer_dict = json.loads(response.body)
        is_end = answer_dict['paging']['is_end']
        next_url = answer_dict['paging']['next']

        for answer in answer_dict['data']:
            answer_item = ZhiHuAnswerItem()
            answer_item['answer_id'] = answer['id']
            answer_item['answer_question_id'] = answer['question']['id']
            answer_item['answer_author_id'] = answer['author']['id']
            answer_item['answer_url'] = answer['url']
            answer_item['answer_comment_num'] = answer['comment_count']
            answer_item['answer_praise_num'] = answer['voteup_count']
            answer_item['answer_create_time'] = answer['created_time']
            answer_item['answer_content'] = answer['content']
            answer_item['answer_crawl_time'] = datetime.now()
            answer_item['answer_update_time'] = answer['updated_time']

            yield answer_item

        # If is_end is False there is another page of answers
        if not is_end:
            yield scrapy.Request(
                url=next_url,
                headers=self.headers,
                callback=self.parse_detail_answer
            )
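Once the account, password, timestamp and signature fields are filled in, the spider is started from the project root with the usual command:

  scrapy crawl zhihu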

  The code in items.py:

    

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from datetime import datetime
import scrapy
from utils.common import extract_num


class ZhihuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class ZhiHuQuestionItem(scrapy.Item):
    question_id = scrapy.Field()             # question ID
    question_title = scrapy.Field()          # question title
    question_topic = scrapy.Field()          # question topics
    question_content = scrapy.Field()        # question description
    question_watch_num = scrapy.Field()      # number of followers
    question_click_num = scrapy.Field()      # number of views
    question_answer_num = scrapy.Field()     # total number of answers
    question_comment_num = scrapy.Field()    # number of comments
    question_crawl_time = scrapy.Field()     # crawl time
    question_url = scrapy.Field()            # question detail URL

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_question(question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE question_id=VALUES(question_id),question_title=VALUES(question_title),question_topic=VALUES(question_topic),question_content=VALUES(question_content),question_watch_num=VALUES(question_watch_num),question_click_num=VALUES(question_click_num),question_answer_num=VALUES(question_answer_num),question_comment_num=VALUES(question_comment_num),question_crawl_time=VALUES(question_crawl_time),question_url=VALUES(question_url)"

        # Tidy up the data for each field
        question_id = str(self['question_id'][0])
        question_title = ''.join(self['question_title'])
        question_topic = ",".join(self['question_topic'])

        try:
            question_content = ''.join(self['question_content'])
        except Exception, e:
            question_content = 'question_content is empty'

        question_watch_num = ''.join(self['question_watch_num']).replace(',', '')
        question_watch_num = extract_num(question_watch_num)

        question_click_num = ''.join(self['question_click_num']).replace(',', '')
        question_click_num = extract_num(question_click_num)
        # e.g. '86 回答'
        question_answer_num = ''.join(self['question_answer_num'])
        question_answer_num = extract_num(question_answer_num)
        # e.g. '100 条评论'
        question_comment_num = ''.join(self['question_comment_num'])
        question_comment_num = extract_num(question_comment_num)

        question_crawl_time = self['question_crawl_time'][0]
        question_url = self['question_url'][0]

        args_tuple = (question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url)

        return insert_sql, args_tuple


class ZhiHuAnswerItem(scrapy.Item):
    answer_id = scrapy.Field()                  # answer ID (primary key of the zhihu_answer table)
    answer_question_id = scrapy.Field()         # question ID (primary key of the zhihu_question table)
    answer_author_id = scrapy.Field()           # ID of the answering user
    answer_url = scrapy.Field()                 # URL of the answer
    answer_comment_num = scrapy.Field()         # total comments on the answer
    answer_praise_num = scrapy.Field()          # total upvotes on the answer
    answer_create_time = scrapy.Field()         # creation time of the answer
    answer_content = scrapy.Field()             # content of the answer
    answer_update_time = scrapy.Field()         # update time of the answer

    answer_crawl_time = scrapy.Field()          # crawl time

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_answer(answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE answer_id=VALUES(answer_id),answer_question_id=VALUES(answer_question_id),answer_author_id=VALUES(answer_author_id),answer_url=VALUES(answer_url),answer_comment_num=VALUES(answer_comment_num),answer_praise_num=VALUES(answer_praise_num),answer_create_time=VALUES(answer_create_time),answer_content=VALUES(answer_content),answer_update_time=VALUES(answer_update_time),answer_crawl_time=VALUES(answer_crawl_time)"

        # Prepare the data held in the answer item
        # fromtimestamp(timestamp): convert a Unix timestamp into a datetime
        answer_id = self['answer_id']
        answer_question_id = self['answer_question_id']
        answer_author_id = self['answer_author_id']
        answer_url = self['answer_url']
        answer_comment_num = self['answer_comment_num']
        answer_praise_num = self['answer_praise_num']
        answer_content = self['answer_content']
        answer_create_time = datetime.fromtimestamp(self['answer_create_time'])
        answer_update_time = datetime.fromtimestamp(self['answer_update_time'])
        answer_crawl_time = self['answer_crawl_time']

        args_tuple = (answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time)

        return insert_sql, args_tuple
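items.py imports extract_num from utils.common, which the post does not show. Below is a minimal sketch of what that helper presumably does (pull the leading integer out of text such as '86 回答' or '1,234'); the implementation is assumed, not the original author's.

# utils/common.py -- assumed implementation of the helper imported in items.py
import re


def extract_num(text):
    """Return the first integer found in text (e.g. '86 回答' -> 86), or 0 if none."""
    match = re.search(r'(\d+)', text)
    return int(match.group(1)) if match else 0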

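The title promises asynchronous writes to MySQL, but the pipeline code is cut off in the original post. The get_insert_sql() methods above are clearly meant to be consumed by a pipeline built on Twisted's adbapi connection pool, so a hedged sketch of such a pipelines.py follows; the class name, the MYSQL_* settings and the error handling are my own choices rather than the author's code.

# ZhiHuSpider/pipelines.py -- sketch of an asynchronous MySQL pipeline
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host=settings['MYSQL_HOST'],
            db=settings['MYSQL_DBNAME'],
            user=settings['MYSQL_USER'],
            passwd=settings['MYSQL_PASSWORD'],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # adbapi runs the blocking MySQLdb calls in a thread pool, so the
        # inserts do not block the Twisted reactor that Scrapy runs on.
        dbpool = adbapi.ConnectionPool('MySQLdb', **dbparms)
        return cls(dbpool)

    def process_item(self, item, spider):
        # runInteraction returns a Deferred: the insert happens asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)
        return item

    def do_insert(self, cursor, item):
        # Works for both ZhiHuQuestionItem and ZhiHuAnswerItem, since each item
        # knows how to build its own INSERT statement.
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

    def handle_error(self, failure, item, spider):
        spider.logger.error(failure)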