forum.enarion.net
Author Topic: Crawler regex switch  (Read 676 times)
Xorlev
« on: July 03, 2008, 12:45:00 AM »

This patch simplifies the regex and makes the crawler follow <link> tags as well, which will grab RSS links and other links. I changed it because my site uses JS buttons to navigate between pages rather than <a> links, but I include two <link> tags in the head that point to the next and previous page of a document.

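Even if you remove the link part, you can get rid of all the ridiculous [Aa] stuff; the /i switch will work just fine. Just note that the (a|link) group moves the captured URL from $urls[1] to $urls[2], so if you drop that group, keep reading from $urls[1].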
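
To illustrate, here is a small standalone sketch that runs the old and the new pattern against the same markup. The sample HTML and the variable names $old and $new are made up for the demo and are not part of Crawler.class.php; only the two regexes come from the patch.

```php
<?php
// Hypothetical sample markup, only for this demo (not real crawler input).
$res = '<html><head>'
     . '<LINK rel="next" HREF="page3.html">'
     . '<link rel="prev" href=\'page1.html\'>'
     . '</head><body><A class="nav" Href="index.html">Home</A></body></html>';

// Old pattern: case handled by hand with character classes, <a> tags only.
// The URL is captured in group 1.
preg_match_all("/<[Aa][^>]*[Hh][Rr][Ee][Ff]=['\"]([^\"'>]+)[^>]*>/", $res, $old);

// New pattern: the /i modifier handles case, and (a|link) also matches
// <link> tags. The extra group pushes the URL into group 2.
preg_match_all("/<(a|link)[^>]*href=['\"]([^\"'>]+)[^>]*>/i", $res, $new);

print_r($old[1]); // index.html
print_r($new[2]); // page3.html, page1.html, index.html
```

With this input the old pattern only finds the <a> link, while the new one also picks up both <link> tags from the head.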

Code:
--- Crawler.class.php   2006-01-29 21:19:00.000000000 -0600
+++ ../../../inc/classes/Crawler.class.php      2008-07-02 17:39:09.000000000 -0500
@@ -207,17 +207,16 @@
                }

                // contribution by vvkov
-//             preg_match_all("/<[Aa][ \r\n\t]{1}[^>]*[Hh][Rr][Ee][Ff][^=]*=[ '\"\n\r\t]*([^ \"'>]+)[^>]*>/",$res ,$urls);
-               preg_match_all("/<[Aa][^>]*[Hh][Rr][Ee][Ff]=['\"]([^\"'>]+)[^>]*>/",$res ,$urls); // update by TK, 2005-07-27
-       $urls_count = count( $urls[1] );
-
+               preg_match_all("/<(a|link)[^>]*href=['\"]([^\"'>]+)[^>]*>/i",$res ,$urls); // update by TK, 2005-07-27
+               $urls_count = count( $urls[2] );
+
                if (preg_match("/<[Bb][Aa][Ss][Ee][^>]*[Hh][Rr][Ee][Ff]=['\"]([^\"'>]+)[^>]*>/", $res, $matches)) {
                        $this->base = $matches[1];
                }

        $ts_begin = $this->microtime_float();
        while ((($ts_middle = ($this->microtime_float()-$ts_begin)) < PSNG_CRAWLER_MAX_GETFILE_TIME) && $urls_count > 0 ) {
-               $thisurl =  trim(str_replace('&amp;', '&', $urls[1][--$urls_count]));
+               $thisurl =  trim(str_replace('&amp;', '&', $urls[2][--$urls_count]));
                        if ($thisurl == '' || (strcasecmp(substr($thisurl, 0, strlen('javascript:')), 'javascript:') == 0))     continue;
                        // filter out links to fragment ids (same resource) - added mk/2005-11-13
                        if ('#' == $thisurl{0}) continue;