Python version: Python 2.7
Workflow for crawling Zhihu:
I. Analysis
When you visit the Zhihu homepage (https://www.zhihu.com) without being logged in, you are redirected to https://www.zhihu.com/signup?next=%2F.
Crawling Zhihu therefore starts with logging in. Watch which page the login sends its POST (or GET) request to; a packet-capture tool reveals the address that the username/password form data is submitted to.
1. Using the packet-capture tool, inspect the POST request that submits the username and password form data. The submission address turns out to be https://www.zhihu.com/api/v3/oauth/sign_in.
2. Capturing that login request shows that its body contains more than just the username and password: there are also form fields such as timestamp, lang, client_id and signature. To understand each field, watch how its value changes across repeated logins.
3. After several logins, only timestamp and signature change from attempt to attempt; all the other values stay constant.
4. The site's JavaScript shows that signature is produced by encrypting a combination of several fields, with the timestamp at the core: each new timestamp yields a different signature.
5. Because the signature encryption is fairly involved, the simplest approach is to copy the timestamp and signature from a successful browser login into the request data and log in with those (a hedged sketch of what regenerating the signature might look like follows this list).
6. With the form data filled in, sending the POST request produced a "missing captcha ticket" error (capsion_ticket). The captcha ticket is a credential required to fetch the captcha itself. The packet capture shows two captcha-related requests; the first one returns:
{"show_captcha": true}
and the second one returns something like {"img_base64": "Rfadausifpoauerfae"}, a Base64-encoded image.
7. {"show_captcha": true} is the key signal that a captcha is required, and the Set-Cookie header of that first response carries the capsion_ticket captcha-ticket value.
8. The next simulated login failed with an "ERR_xxx_AUTH_TOKEN" error, which shows up when fetching the captcha image with the captcha ticket. The capture shows a request header of the form Authorization: oauth ce30dasjfsdjhfkiswdnf, so this header has to be added to the spider's request headers.
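For completeness: instead of copying timestamp and signature from the browser (step 5), the signature could in principle be regenerated. The sketch below is only an illustration of that idea and is not taken from this post; the HMAC-SHA1 scheme, the concatenation order and the key placeholder are assumptions that would have to be checked against Zhihu's own login JavaScript.

import hashlib
import hmac
import time

def make_signature(client_id, source, grant_type='password'):
    # Assumption: signature = HMAC-SHA1 over grant_type + client_id + source + timestamp,
    # keyed with a constant found in Zhihu's login JS (placeholder below, not a real key).
    timestamp = str(int(time.time() * 1000))   # millisecond timestamp, like the captured form field
    key = 'REPLACE_WITH_KEY_FROM_ZHIHU_JS'     # hypothetical placeholder
    mac = hmac.new(key, digestmod=hashlib.sha1)
    mac.update(grant_type + client_id + source + timestamp)
    return timestamp, mac.hexdigest()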
The captcha problem:
- Zhihu serves two kinds of captcha: an English image captcha, and one where you click the upside-down Chinese characters. When a login requires a captcha, data is sent to one of these two endpoints:
  Upside-down-character captcha: https://www.zhihu.com/api/v3/oauth/captcha?lang=cn
  English image captcha: https://www.zhihu.com/api/v3/oauth/captcha?lang=en
- The English captcha yields four letters, which can be recognized with an online OCR service such as YunDaMa (云打码).
- For the upside-down-character captcha, each character sits inside a fixed range of the image. When you click the captcha during login, https://www.zhihu.com/api/v3/oauth/captcha?lang=cn is sent the pixel coordinates (x, y) of the clicks; if, say, the upside-down characters are the third and the fifth, any point inside their respective ranges is accepted, so submitting suitable pixel coordinates is enough to pass (a sketch of the payload follows this list).
- This spider only handles the upside-down-character captcha.
- It only crawls the questions and answers on the first page.
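To make the coordinate format concrete, here is a minimal sketch of the payload the spider later builds in shi_bie(): given the 1-based positions of the upside-down characters, it looks up an approximate pixel point for each character slot and serialises everything into the input_text form field. points_list comes from the spider code below; build_input_text is a hypothetical helper name used only for this illustration.

import json

# Approximate (x, y) centre of each of the 7 character slots in the 200x44 captcha image.
points_list = [[20, 27], [42, 25], [65, 20], [90, 25], [115, 32], [140, 25], [160, 25]]

def build_input_text(positions):
    # positions is a string such as '35', meaning the 3rd and 5th characters are upside down.
    points = [points_list[int(p) - 1] for p in positions]
    return json.dumps({"img_size": [200, 44], "input_points": points})

# build_input_text('35') -> '{"img_size": [200, 44], "input_points": [[65, 20], [115, 32]]}'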
II. Create the Scrapy project
scrapy startproject ZhiHuSpider
scrapy genspider zhihu zhihu.com
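The post does not show settings.py. When reproducing the project, the generated settings usually need at least robots.txt checking turned off, otherwise Scrapy will filter out the API requests that Zhihu's robots.txt disallows; the values below are a typical assumption, not part of the original post.

# settings.py (partial, assumed)
ROBOTSTXT_OBEY = False   # Zhihu's robots.txt would otherwise block the API endpoints
DOWNLOAD_DELAY = 1       # optional: slow down requests to be polite
COOKIES_ENABLED = True   # the login flow relies on cookies such as capsion_ticket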
III. Code
The code in zhihu.py:
# -*- coding: utf-8 -*-
import base64
import json
import re
import urlparse
from datetime import datetime

import scrapy
from scrapy.loader import ItemLoader

from ..items import ZhiHuQuestionItem, ZhiHuAnswerItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['https://www.zhihu.com']
    # Answer-list API; formatted with the question ID and an offset.
    start_answer_url = "https://www.zhihu.com/api/v4/questions/{}/answers?include=data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cupvoted_followees%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%3F%28type%3Dbest_answerer%29%5D.topics&limit=20&offset={}&sort_by=default"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
        'Referer': 'https://www.zhihu.com',
        'HOST': 'www.zhihu.com',
        'Authorization': 'oauth c3cef7c66a1843f8b3a9e6a1e3160e20'
    }
    # Approximate (x, y) centre of each of the 7 character slots in the captcha image.
    points_list = [[20, 27], [42, 25], [65, 20], [90, 25], [115, 32], [140, 25], [160, 25]]

    def start_requests(self):
        """
        Override the parent class's start_requests() so the spider starts from the
        captcha/login URL instead of start_urls.
        :return:
        """
        yield scrapy.Request(
            url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
            callback=self.captcha,
            headers=self.headers,
        )

    def captcha(self, response):
        show_captcha = json.loads(response.body)['show_captcha']
        if show_captcha:
            print u'A captcha is required'
            # A PUT request to the same URL returns the Base64-encoded captcha image.
            yield scrapy.Request(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                method='PUT',
                headers=self.headers,
                callback=self.shi_bie
            )
        else:
            print u'No captcha required'
            # Log in directly.
            post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
            # The timestamp and signature below must be copied from a captured browser login.
            post_data = {
                'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                'grant_type': 'password',
                'timestamp': '1515391742289',
                'source': 'com.zhihu.web',
                'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                'username': '',  # account
                'password': '',  # password
                'captcha': '',
                'lang': 'cn',
                'ref_source': 'homepage',
                'utm_source': ''
            }

            yield scrapy.FormRequest(
                url=post_url,
                headers=self.headers,
                formdata=post_data,
                callback=self.index_page
            )

    def shi_bie(self, response):
        try:
            img = json.loads(response.body)['img_base64']
        except Exception as e:
            print 'Failed to read img_base64, reason: %s' % e
        else:
            print 'Got the Base64-encoded captcha image'
            # Decode the image and save it locally.
            img = img.encode('utf-8')
            img_data = base64.b64decode(img)
            with open('zhihu_captcha.GIF', 'wb') as f:
                f.write(img_data)

            captcha = raw_input('Enter the positions of the upside-down characters: ')
            if len(captcha) == 2:
                # Two upside-down characters.
                first_char = int(captcha[0]) - 1   # index of the first character in points_list
                second_char = int(captcha[1]) - 1  # index of the second character in points_list
                captcha = '{"img_size":[200,44],"input_points":[%s,%s]}' % (
                    self.points_list[first_char], self.points_list[second_char])
            else:
                # Only one upside-down character.
                first_char = int(captcha[0]) - 1
                captcha = '{"img_size":[200,44],"input_points":[%s]}' % (
                    self.points_list[first_char])

            data = {
                'input_text': captcha
            }
            yield scrapy.FormRequest(
                url='https://www.zhihu.com/api/v3/oauth/captcha?lang=cn',
                headers=self.headers,
                formdata=data,
                callback=self.get_result
            )

    def get_result(self, response):
        try:
            yan_zheng_result = json.loads(response.body)['success']
        except Exception as e:
            print 'The captcha POST request failed, reason: {}'.format(e)
        else:
            if yan_zheng_result:
                print u'Captcha accepted'
                post_url = 'https://www.zhihu.com/api/v3/oauth/sign_in'
                # The data below must be obtained from a captured browser login.
                post_data = {
                    'client_id': 'c3cef7c66a1843f8b3a9e6a1e3160e20',
                    'grant_type': 'password',
                    'timestamp': '1515391742289',
                    'source': 'com.zhihu.web',
                    'signature': '6d1d179e50a06d1c17d6e8b5c89f77db34f406ac',
                    'username': '',  # account
                    'password': '',  # password
                    'captcha': '',
                    'lang': 'cn',
                    'ref_source': 'homepage',
                    'utm_source': ''
                }

                yield scrapy.FormRequest(
                    url=post_url,
                    headers=self.headers,
                    formdata=post_data,
                    callback=self.index_page
                )
            else:
                print u'Wrong captcha!'

    def index_page(self, response):
        # Once logged in, request the start urls; their responses are handled by parse().
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers
            )

    def parse(self, response):
        """
        Extract all question urls from the homepage and follow them to crawl the detail pages.
        :param response:
        :return:
        """
        # e.g. /question/19618276/answer/267334062
        all_urls = response.xpath('//a[@data-za-detail-view-element_name="Title"]/@href').extract()
        all_urls = [urlparse.urljoin(response.url, url) for url in all_urls]
        for url in all_urls:
            # e.g. https://www.zhihu.com/question/19618276/answer/267334062
            # Extract both the question detail url and the question ID.
            result = re.search('(.*zhihu.com/question/(\d+))', url)
            if result:
                detail_url = result.group(1)
                question_id = result.group(2)
                # Hand the detail url to the downloader.
                yield scrapy.Request(
                    url=detail_url,
                    headers=self.headers,
                    callback=self.parse_detail_question,
                    meta={
                        'question_id': question_id,
                    }
                )

                # While requesting the detail url, also request the answers for this question ID.
                # Questions and answers live at independent urls; the answers are a JSON API
                # that can be requested directly, without going through the question page.
                yield scrapy.Request(
                    # Arguments: question ID and offset. Offset 0 starts from the first answer.
                    url=self.start_answer_url.format(question_id, 0),
                    headers=self.headers,
                    callback=self.parse_detail_answer,
                    meta={
                        'question_id': question_id
                    }
                )

                break

    def parse_detail_question(self, response):
        """
        Parse the question data on the detail page: title, description, view count, follower count, etc.
        :param response:
        :return:
        """
        item_loader = ItemLoader(item=ZhiHuQuestionItem(), response=response)
        item_loader.add_value('question_id', response.meta['question_id'])
        item_loader.add_xpath('question_title', '//div[@class="QuestionHeader"]//h1/text()')
        item_loader.add_xpath('question_topic', '//div[@class="QuestionHeader-topics"]//div[@class="Popover"]/div/text()')
        # Some questions have no description.
        item_loader.add_xpath('question_content', '//span[@class="RichText"]/text()')
        item_loader.add_xpath('question_watch_num', '//button[contains(@class, "NumberBoard-item")]//strong/text()')
        item_loader.add_xpath('question_click_num', '//div[@class="NumberBoard-item"]//strong/text()')
        item_loader.add_xpath('question_answer_num', '//h4[@class="List-headerText"]/span/text()')
        item_loader.add_xpath('question_comment_num', '//div[@class="QuestionHeader-Comment"]/button/text()')
        item_loader.add_value('question_url', response.url)
        item_loader.add_value('question_crawl_time', datetime.now())

        question_item = item_loader.load_item()
        yield question_item

    def parse_detail_answer(self, response):
        """
        Parse all answers for a given question ID.
        :param response:
        :return:
        """
        answer_dict = json.loads(response.body)
        is_end = answer_dict['paging']['is_end']
        next_url = answer_dict['paging']['next']

        for answer in answer_dict['data']:
            answer_item = ZhiHuAnswerItem()
            answer_item['answer_id'] = answer['id']
            answer_item['answer_question_id'] = answer['question']['id']
            answer_item['answer_author_id'] = answer['author']['id']
            answer_item['answer_url'] = answer['url']
            answer_item['answer_comment_num'] = answer['comment_count']
            answer_item['answer_praise_num'] = answer['voteup_count']
            answer_item['answer_create_time'] = answer['created_time']
            answer_item['answer_content'] = answer['content']
            answer_item['answer_crawl_time'] = datetime.now()
            answer_item['answer_update_time'] = answer['updated_time']

            yield answer_item

        # If is_end is False, there is another page of answers.
        if not is_end:
            yield scrapy.Request(
                url=next_url,
                headers=self.headers,
                callback=self.parse_detail_answer
            )
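With the account, password and the captured timestamp/signature filled in, the spider is started from the project root in the usual Scrapy way:

scrapy crawl zhihu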
The code in items.py:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

from datetime import datetime

import scrapy

from utils.common import extract_num


class ZhihuspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass


class ZhiHuQuestionItem(scrapy.Item):
    question_id = scrapy.Field()           # question ID
    question_title = scrapy.Field()        # question title
    question_topic = scrapy.Field()        # question topics
    question_content = scrapy.Field()      # question description
    question_watch_num = scrapy.Field()    # number of followers
    question_click_num = scrapy.Field()    # number of views
    question_answer_num = scrapy.Field()   # total number of answers
    question_comment_num = scrapy.Field()  # number of comments
    question_crawl_time = scrapy.Field()   # crawl time
    question_url = scrapy.Field()          # question detail url

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_question(question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE question_id=VALUES(question_id),question_title=VALUES(question_title),question_topic=VALUES(question_topic),question_content=VALUES(question_content),question_watch_num=VALUES(question_watch_num),question_click_num=VALUES(question_click_num),question_answer_num=VALUES(question_answer_num),question_comment_num=VALUES(question_comment_num),question_crawl_time=VALUES(question_crawl_time),question_url=VALUES(question_url)"

        # Normalise the field values for the SQL arguments.
        question_id = str(self['question_id'][0])
        question_title = ''.join(self['question_title'])
        question_topic = ",".join(self['question_topic'])

        try:
            question_content = ''.join(self['question_content'])
        except Exception:
            # Some questions have no description.
            question_content = 'question_content is empty'

        question_watch_num = ''.join(self['question_watch_num']).replace(',', '')
        question_watch_num = extract_num(question_watch_num)

        question_click_num = ''.join(self['question_click_num']).replace(',', '')
        question_click_num = extract_num(question_click_num)
        # e.g. '86 回答'
        question_answer_num = ''.join(self['question_answer_num'])
        question_answer_num = extract_num(question_answer_num)
        # e.g. '100 条评论'
        question_comment_num = ''.join(self['question_comment_num'])
        question_comment_num = extract_num(question_comment_num)

        question_crawl_time = self['question_crawl_time'][0]
        question_url = self['question_url'][0]

        args_tuple = (question_id, question_title, question_topic, question_content, question_watch_num, question_click_num, question_answer_num, question_comment_num, question_crawl_time, question_url)

        return insert_sql, args_tuple


class ZhiHuAnswerItem(scrapy.Item):
    answer_id = scrapy.Field()            # answer ID (primary key of the zhihu_answer table)
    answer_question_id = scrapy.Field()   # question ID (primary key of the zhihu_question table)
    answer_author_id = scrapy.Field()     # ID of the answering user
    answer_url = scrapy.Field()           # answer url
    answer_comment_num = scrapy.Field()   # total number of comments on the answer
    answer_praise_num = scrapy.Field()    # total number of upvotes on the answer
    answer_create_time = scrapy.Field()   # answer creation time
    answer_content = scrapy.Field()       # answer content
    answer_update_time = scrapy.Field()   # answer update time

    answer_crawl_time = scrapy.Field()    # crawl time

    def get_insert_sql(self):
        insert_sql = "insert into zhihu_answer(answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON DUPLICATE KEY UPDATE answer_id=VALUES(answer_id),answer_question_id=VALUES(answer_question_id),answer_author_id=VALUES(answer_author_id),answer_url=VALUES(answer_url),answer_comment_num=VALUES(answer_comment_num),answer_praise_num=VALUES(answer_praise_num),answer_create_time=VALUES(answer_create_time),answer_content=VALUES(answer_content),answer_update_time=VALUES(answer_update_time),answer_crawl_time=VALUES(answer_crawl_time)"

        # Prepare the values from the answer item.
        # fromtimestamp(timestamp) converts a Unix timestamp into a datetime.
        answer_id = self['answer_id']
        answer_question_id = self['answer_question_id']
        answer_author_id = self['answer_author_id']
        answer_url = self['answer_url']
        answer_comment_num = self['answer_comment_num']
        answer_praise_num = self['answer_praise_num']
        answer_content = self['answer_content']
        answer_create_time = datetime.fromtimestamp(self['answer_create_time'])
        answer_update_time = datetime.fromtimestamp(self['answer_update_time'])
        answer_crawl_time = self['answer_crawl_time']

        args_tuple = (answer_id, answer_question_id, answer_author_id, answer_url, answer_comment_num, answer_praise_num, answer_create_time, answer_content, answer_update_time, answer_crawl_time)

        return insert_sql, args_tuple
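items.py imports extract_num from utils.common, but that helper is not included in the post. Below is a minimal sketch of what it presumably does (pull the first integer out of strings such as '86 回答' or '100 条评论'); the actual implementation in the original project may differ.

# utils/common.py (assumed implementation, not from the original post)
# -*- coding: utf-8 -*-
import re

def extract_num(text):
    # Return the first run of digits in the text as an int, or 0 if none is found.
    match = re.search(r'(\d+)', text)
    return int(match.group(1)) if match else 0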