Tuesday, 8 April 2014

Web crawler

Here is a web crawler; 'sites contains all pages already read:

REBOL [
    Title: "REBOL Web Crawler"
    Date: 16-Sep-1999
    File: %webcrawler.r
    Author: "Bohdan Lechnowsky"
    Purpose: {
        To crawl the web starting from any site.   Does not record
        duplicate visits.   Saves all links found in 'newlinks.
    }
    Email: bo@rebol.com
    Comments: {
        Based on my previous script, %rebol-web-miner.r
    }
    library: [
        level: 'advanced
        platform: none
        type: 'tool
        domain: [web other-net]
        tested-under: none
        support: none
        license: none
        see-also: none
    ]
]
find-links: func [
    "Finds 'href' links and outputs them as a block"
    url [url!] "The site currently being checked"
    html [string!] "The HTML text to parse"
    /local links site link
][
    links: make block! 0
    ; derive the base URL: walk back from the tail to the last "/" and clear what follows
    site: tail form url
    while [(copy/part site: back site 1) <> "/"][]
    site: to-url head clear next site
    ; scan the HTML for href attributes; extract each value between the "=" and the ">"
    while [html: find html "href"] [
        link: (trim (copy/part (next (find html "=")) (html: find html ">")))
        if not found? find link "mailto:" [
            link: trim/with link {"}
            ; resolve relative links against the base URL computed above
            if (copy/part form link 7) <> "http://" [
                link: head clean-path join site link
            ]
            append links to-url link
        ]
    ]
    return links
]
urls: [http://www.rebol.com/]   ; starting point of the crawl
newlinks: make block! 0         ; every link found so far; feeds the next pass
sites: make block! 0            ; every page already visited
while [true] [
    foreach url urls [
        either find sites url [
            print [url "already visited"]
        ][
            print ["     READING" url]
            append newlinks url
            append sites url
            ; read the page; on failure just report it and move on
            either not error? try [read-url: read url] [
                foreach link find-links url read-url [
                    if none? find newlinks link [
                        append newlinks link
                    ]
                ]
            ][
                print ["     Error reading" url]
            ]
        ]
    ]
    ; next pass: crawl everything collected so far; visited pages are skipped above
    urls: newlinks
]
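
The densest part of find-links is the computation of 'site: it walks backwards from the tail of the URL until it hits a "/", then clears everything after it, leaving the directory the page came from. The same lines can be run in isolation (the URL here is just an example, not taken from the script):

url: http://www.example.com/docs/index.html
site: tail form url
while [(copy/part site: back site 1) <> "/"][]
site: to-url head clear next site
probe site    ; should show http://www.example.com/docs/

Relative links found in the page are then joined to this base and passed through clean-path.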
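
To try find-links by itself, define just the function above (the crawl loop never returns, so don't run the whole script first) and hand it a small page. The URL and HTML here are made up for illustration:

probe find-links http://www.example.com/docs/index.html {
    <a href="page.html">Relative link</a>
    <a href="http://www.rebol.com/">Absolute link</a>
    <a href="mailto:someone@example.com">Skipped</a>
}

This should print something like:

[http://www.example.com/docs/page.html http://www.rebol.com/]

The relative link is resolved against the page's directory, the absolute link is kept as-is, and the mailto: link is dropped.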
