
.NET: Collecting All Videos and Their Information from a Youku Album (VB.NET Code)

Date: 2021-07-01 10:21:17

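The method extracts the album ID from a Youku album URL, then requests the album's video pages one index at a time and pulls the title tag and the VID out of each page's source. Nothing technically sophisticated.
The pages are fetched with the .NET HttpWebRequest and HttpWebResponse classes, and the page source is parsed with regular expressions.
The code is not very efficient: handling one page takes between 0.5 and 2 seconds, so it is not suited to large-scale collection. Porting it to JavaScript might make it somewhat faster.
That is as far as I have taken it for now, so I am posting the code as-is to share with everyone.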
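The extraction steps described above can be tried without any network access. The following is a minimal sketch, not the author's exact code: the module name ExtractionSketch, the three helper functions, and the sample HTML fragment are hypothetical and exist only for illustration. The regular expressions and the URL template are the ones used in the full listing.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ExtractionSketch
    ' Returns the numeric album ID from a playlist URL, or Nothing if it is absent.
    Function ExtractAlbumId(ByVal listUrl As String) As String
        Dim m As Match = Regex.Match(listUrl, "playlist_show/id_(\d+).*\.html")
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    ' Builds the page URL for one video of the album.
    ' "0" is the sort-order value that the full listing passes.
    Function BuildPageUrl(ByVal albumId As String, ByVal index As Integer) As String
        Return String.Format("http://v.youku.com/v_playlist/f{0}o{1}p{2}.html", albumId, "0", index)
    End Function

    ' Returns the numeric videoId declared in the page script, or Nothing if it is absent.
    Function ExtractVideoId(ByVal html As String) As String
        Dim m As Match = Regex.Match(html, "var\s+videoId\s*=\s*""(\d+)""\s*;")
        Return If(m.Success, m.Groups(1).Value, Nothing)
    End Function

    Sub Main()
        Dim albumId As String = ExtractAlbumId("http://www.youku.com/playlist_show/id_2350764.html")
        Console.WriteLine(albumId)                  ' prints 2350764
        Console.WriteLine(BuildPageUrl(albumId, 0)) ' prints http://v.youku.com/v_playlist/f2350764o0p0.html
        ' Made-up page fragment, for illustration only.
        Console.WriteLine(ExtractVideoId("<script>var videoId = ""12345"";</script>")) ' prints 12345
    End Sub
End Module
```

Keeping the parsing in small functions like these makes it easy to check the patterns against saved page source before running the slow page-by-page download.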

The code is VB.NET. Create a new form named frmMain, add one TextBox, one ListBox, and two Buttons, then paste in the code below:

The code is as follows:
Imports System.Net
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions

Public Class frmMain

    ' One collected video: index within the album, title, and both video IDs.
    Structure VList
        Dim id As Integer
        Dim title As String
        Dim vid1 As String
        Dim vid2 As String

        Public Overrides Function ToString() As String
            Return String.Format("{0}:<{1}> [{2}]", id, title, vid1)
        End Function
    End Structure

    Dim myList As New List(Of VList)

    ' Downloads a page and returns its text, decoded with the character set the server reports.
    ' The Using blocks close the response and the reader even if reading fails.
    Private Function FetchPage(ByVal url As String) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream(), Encoding.GetEncoding(response.CharacterSet))
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Private Sub Button1_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button1.Click
        ' Declared once, outside the loop, to avoid re-creating the variables
        Dim ret As String
        Dim reg As Match
        Dim preVid As String = ""   ' VID of the previous page
        Dim nowid As Integer = 0    ' index of the current video within the album
        Dim listUrl As String = TextBox1.Text ' album URL, e.g. http://www.youku.com/playlist_show/id_2350764.html
        Dim tarUrl As String = "http://v.youku.com/v_playlist/f{0}" ' {0} = list ID

        reg = Regex.Match(listUrl, "playlist_show/id_(\d+).*\.html")
        If Not reg.Success Then
            MsgBox("Failed to extract the album list ID!")
            Exit Sub
        End If
        tarUrl = String.Format(tarUrl, reg.Groups(1).Value) & "o{1}p{0}.html" ' {0} = video index, {1} = sort order

        ' Read the album name from the album page
        ret = FetchPage(listUrl)
        reg = Regex.Match(ret, "<title>(.+) - 专辑 - 优酷视频</title>")
        If Not reg.Success Then
            MsgBox("Failed to extract the album name!")
        Else
            MsgBox("Album: " & reg.Groups(1).Value)
        End If

        Do
            ' Fetch the page text from the web stream
            ret = FetchPage(String.Format(tarUrl, nowid, "0")) ' look up videos in reverse order

            ' Create a temporary entry for this video
            Dim nlist As New VList
            nlist.id = nowid

            ' Get videoId
            reg = Regex.Match(ret, "var\s+videoId\s*=\s*""(\d+)""\s*;")
            If Not reg.Success Then Exit Do
            ' Stop when the VID is the same as the previous one
            If reg.Groups(1).Value = preVid Then Exit Do
            nlist.vid1 = reg.Groups(1).Value

            ' Get videoId2
            reg = Regex.Match(ret, "var\s+videoId2\s*=\s*""((\w|=)+)""\s*;")
            If Not reg.Success Then Exit Do
            nlist.vid2 = reg.Groups(1).Value

            ' Get the title
            reg = Regex.Match(ret, "<title>(.+) - (.+) - 视频 - 优酷视频 - 在线观看 - </title>")
            If Not reg.Success Then
                nlist.title = "{title not found}"
            Else
                nlist.title = reg.Groups(2).Value
            End If

            ' Wrap up
            myList.Add(nlist)       ' add to the overall list
            preVid = nlist.vid1     ' remember the last VID
            Me.Text = nowid & " : done!"
            nowid += 1
        Loop

        MsgBox(nowid & " videos collected and processed!")
        Button2_Click(sender, e)
    End Sub

    ' Shows the collected videos in the ListBox, then empties the list for the next run.
    Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
        ListBox1.Items.Clear()
        For Each ls As VList In myList
            ListBox1.Items.Add(ls.ToString())
        Next
        myList.Clear()
    End Sub

End Class

Original work by 夜闻香
Blog: http://clso.cnblogs.com
Homepage: http://cleclso.cn
QQ: 315514678 E-mail: clso#qq.com
Technical discussion is welcome!
