Date: 2021-07-01 10:21:17
Here is the program code:
    import re
    import sys
    import urllib2

    from bs4 import BeautifulSoup

    # Python 2 only: make utf-8 the default encoding for implicit conversions.
    reload(sys)
    sys.setdefaultencoding('utf-8')

    # The start address is the URL of the first article. From that article's
    # page, BeautifulSoup is used to find the URL of the previous article.
    baseurl = "http://www.gxlcms.com/dont-worry.html"
    outfile = open(r"E:\123.txt", "a")

    # Image links containing a name of the form QQ...png
    badimg_pattern = re.compile(r'QQ\S*[0-9]*png')
    # Article URLs on the site
    pageurl_pattern = re.compile(r'http://www\.gxlcms\.com/\S*html')


    def pageloop(url, soup):
        """Log every image on the page whose link matches badimg_pattern."""
        img = soup.findAll('img')
        if not img:
            print "No images on the current page"
            return
        for myimg in img:
            link = myimg.get('src')
            if not link:
                continue
            print link
            if badimg_pattern.search(link):
                print url
                outfile.write(link + "\n")
                outfile.write(url + "\n")


    def getthenextpage(url):
        """Check one article, then follow its previous-article link."""
        soup = BeautifulSoup(urllib2.urlopen(url), "html.parser")
        pageloop(url, soup)
        for spanclass in soup.findAll(attrs={"class": "article-nav-prev"}):
            for i in pageurl_pattern.findall(str(spanclass)):
                getthenextpage(i)


    getthenextpage(baseurl)
    print "the end!"
    outfile.close()
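The script above recurses once per article, so a long chain of posts could hit Python's recursion limit, and a cycle of links would never terminate. Below is a minimal Python 3 sketch of the same idea written as a loop. It is not the author's code: it swaps BeautifulSoup for the standard library's `html.parser.HTMLParser` so that it runs without third-party packages, and the names `PageScanner`, `scan`, `crawl`, and the `fetch` parameter are hypothetical, introduced only for illustration. It assumes, as the script does, that the previous-article link sits inside an element with class `article-nav-prev`.

```python
import re
from html.parser import HTMLParser

# The script's two patterns (dots in the URL escaped).
BAD_IMG = re.compile(r"QQ\S*[0-9]*png")
PAGE_URL = re.compile(r"http://www\.gxlcms\.com/\S*html")


class PageScanner(HTMLParser):
    """Collects <img> src values and the first article URL found
    inside an element with class "article-nav-prev" (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.img_links = []
        self.prev_url = None
        self._prev_tag = None  # tag name of the article-nav-prev element
        self._nest = 0         # > 0 while the parser is inside that element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.img_links.append(attrs["src"])
        is_prev = "article-nav-prev" in (attrs.get("class") or "").split()
        if self._nest and tag == self._prev_tag:
            self._nest += 1
        elif not self._nest and is_prev:
            self._prev_tag, self._nest = tag, 1
        if self._nest and self.prev_url is None:
            match = PAGE_URL.search(attrs.get("href") or "")
            if match:
                self.prev_url = match.group(0)

    def handle_endtag(self, tag):
        if self._nest and tag == self._prev_tag:
            self._nest -= 1


def scan(html):
    """Return (badly named image links, previous-article URL or None)."""
    scanner = PageScanner()
    scanner.feed(html)
    bad = [src for src in scanner.img_links if BAD_IMG.search(src)]
    return bad, scanner.prev_url


def crawl(start_url, fetch):
    """Walk previous-article links from start_url, yielding (url, bad_links).

    fetch is any callable that maps a URL to the page's HTML text.
    The visited set stops the walk if the links ever form a cycle.
    """
    seen = set()
    url = start_url
    while url and url not in seen:
        seen.add(url)
        bad, next_url = scan(fetch(url))
        if bad:
            yield url, bad
        url = next_url
```

For a live run, `fetch` could be a small wrapper around `urllib.request.urlopen` that returns the decoded page text, and each yielded pair could be appended to the log file. Passing `fetch` in as a parameter also lets the walk be tried against a dictionary of saved pages before touching the site.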
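Finally, a word to anyone who, like me back then, is just starting to build a website: name your images with numbers, English words, or pinyin. Otherwise, changing them later becomes a real hassle. It is best to build the habit from the very beginning and use a consistent naming convention for both articles and images, which will save a lot of trouble.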