Scraping web page data with curl and PhantomJS: help needed

Posted: 2021-07-01 10:21:17

I posted a question here yesterday and got a lot of help from many people. Only one small piece of the data is still not being captured.

Someone suggested using PhantomJS to fetch the HTML. My current script is:

<code>
var page = require('webpage').create();
var url = 'http://www.cbssports.com/mlb/gametracker/live/MLB_20140528_CLE@CHW';

page.open(url, function (status) {
    var js = page.evaluate(function () {
        return document;
    });
    console.log(js.all[0].outerHTML);
    phantom.exit();
});
</code>

It fails: the output is not the correct HTML.
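A likely cause of the failure is that `page.evaluate` can only hand back values that survive JSON serialization, so returning the `document` object itself does not give the outer script a usable DOM. Returning a string such as `document.documentElement.outerHTML` avoids that. Because the missing data is filled in by scripts after the page loads, it may also be necessary to wait before dumping the HTML. The following is a minimal sketch under those assumptions: `waitFor` is a hypothetical helper (not part of PhantomJS), the selector `div.batter-pitcher` is taken from the class name mentioned in this question, and the 10 second timeout and 250 ms interval are arbitrary.

```javascript
// Hypothetical polling helper: calls check() every intervalMs until it
// returns true (then onReady) or timeoutMs has elapsed (then onTimeout).
function waitFor(check, onReady, onTimeout, timeoutMs, intervalMs) {
  var start = Date.now();
  var timer = setInterval(function () {
    if (check()) {
      clearInterval(timer);
      onReady();
    } else if (Date.now() - start >= timeoutMs) {
      clearInterval(timer);
      onTimeout();
    }
  }, intervalMs);
  return timer;
}

// PhantomJS-only part; skipped when the file is loaded by another runtime.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://www.cbssports.com/mlb/gametracker/live/MLB_20140528_CLE@CHW';

  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('failed to load page');
      phantom.exit(1);
      return;
    }
    waitFor(
      function () {
        // Runs inside the page and returns a plain boolean.
        // Assumption: this element only exists once the page scripts have run.
        return page.evaluate(function () {
          return document.querySelector('div.batter-pitcher') !== null;
        });
      },
      function () {
        // Return a string, not a DOM object.
        console.log(page.evaluate(function () {
          return document.documentElement.outerHTML;
        }));
        phantom.exit(0);
      },
      function () {
        console.log('timed out waiting for div.batter-pitcher');
        phantom.exit(1);
      },
      10000,
      250
    );
  });
}
```

Run it as `phantomjs script.js`. If the timeout message appears, the selector or the waiting time is probably wrong for the live page and needs adjusting.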
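Also, PhantomJS is a standalone executable. How can I make it run automatically every second? From PHP? In PHP the only thing I can do at the moment is
exec("start d:\phantomjs script.js ")
so that it writes out a text file which I then parse, but I have not been able to get it to run. Any help is appreciated.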
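One possible alternative to launching the executable once per second is to start PhantomJS a single time, keep the page open, and let the script itself read the rendered DOM every second. The sketch below assumes that the live page keeps updating itself inside PhantomJS the way it does in a browser, which has not been verified here. `formatSnapshot` is a hypothetical helper, and the selector `div.batter-pitcher` again comes from the class name mentioned in this question.

```javascript
// Hypothetical helper: one line of output per poll, as JSON,
// so that whatever reads the output can parse it line by line.
function formatSnapshot(timestampMs, text) {
  return JSON.stringify({ t: timestampMs, text: text });
}

// PhantomJS-only part; skipped when the file is loaded by another runtime.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://www.cbssports.com/mlb/gametracker/live/MLB_20140529_SF@STL';

  page.open(url, function (status) {
    if (status !== 'success') {
      console.log('failed to load page');
      phantom.exit(1);
      return;
    }
    // Read the rendered element once per second; the script never exits
    // on its own, so stop it by ending the process.
    setInterval(function () {
      var text = page.evaluate(function () {
        var el = document.querySelector('div.batter-pitcher');
        return el ? el.textContent : null; // a string or null, both serializable
      });
      console.log(formatSnapshot(Date.now(), text));
    }, 1000);
  });
}
```

With this approach PHP only has to start the process once and read its output line by line, for example through `popen`, instead of parsing a file that is rewritten on every run.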

Update, 2014 05 23

I asked about this on the site before and managed to capture some of the data.
Here is my code:
$url = "http://www.cbssports.com/mlb/gametracker/live/MLB_20140529_SF@STL";
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.22 (KHTML, like Gecko) Chrome/25.0.1364.172 Safari/537.22");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 0);
$data = curl_exec($ch);
preg_match_all('/<span class="teamLocation">(.*?)<\/span>/is', $data, $teamCity);
preg_match_all('/<span class="teamNickname">(.*?)<\/span>/is', $data, $teamName);
……. (the rest is more regular-expression matching)

The part that is not captured is shown below (the text in red is what cannot be extracted; this is only an excerpt):

<code>
<table class="data condensed stacked" width="100%">
<tbody><tr class="row1">
</tr>
<tr>
</tr>
</tbody></table>
</code>

None of the data inside elements such as <div class="batter-pitcher fleft"> and the <table> within it can be captured.

The key point is that some of the data does not appear no matter which browser you use, whether through "Save as" or "View source". As far as I can tell, the data in div class="batter-pitcher fleft" is produced by the JS function batter_ingame_stats, which handles the "game in progress" part.

Another function, function() { CBSi.app.BaseRunners = function(args, handles the "who is on base" data, the ballpark diagram in the lower right corner. These are the only parts I still cannot capture.

Many people have told me to "just grab the JS", but nobody has explained how.

Please point me in the right direction.

This topic is also being discussed at: http://segmentfault.com/q/1010000000522277

Current live game: http://www.cbssports.com/mlb/gametracker/live/MLB_20140529_SF@STL

Replies:

Write it like this:

<code>
var page = require('webpage').create();
page.open('http://segmentfault.com/', function(status) {
    var ua = page.evaluate(function() {
        return document.body.outerHTML;
    });
    console.log(ua);
    phantom.exit();
});
</code>
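As for the "just grab the JS" advice mentioned in the question, one way to approach it is to find out which URLs the page's scripts request after loading, and then fetch those URLs directly with curl. PhantomJS exposes an `onResourceRequested` callback on the page object that receives each outgoing request. The sketch below logs every request that does not look like a static asset. `looksLikeDataRequest` is a hypothetical filter, its extension list is an assumption, and the 5 second wait is arbitrary. Whether the live page actually loads its stats this way has not been confirmed here.

```javascript
// Hypothetical filter: treat anything that is not an image, stylesheet or
// font as a potential data request. Script files are kept on purpose,
// since data is sometimes delivered as JavaScript.
function looksLikeDataRequest(url) {
  var path = url.split('?')[0].toLowerCase();
  return !/\.(css|png|jpe?g|gif|ico|svg|woff2?|ttf)$/.test(path);
}

// PhantomJS-only part; skipped when the file is loaded by another runtime.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var url = 'http://www.cbssports.com/mlb/gametracker/live/MLB_20140529_SF@STL';

  // Called by PhantomJS for every resource the page asks for.
  page.onResourceRequested = function (requestData) {
    if (looksLikeDataRequest(requestData.url)) {
      console.log(requestData.url);
    }
  };

  page.open(url, function (status) {
    // Give the page scripts a few seconds to issue their own requests.
    setTimeout(function () {
      phantom.exit(status === 'success' ? 0 : 1);
    }, 5000);
  });
}
```

Any URL in the output that returns the missing stats can then be requested from PHP with the same curl setup used for the main page.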