REBOL [
    Title: "REBOL Web Crawler"
    Date: 16-Sep-1999
    File: %webcrawler.r
    Author: "Bohdan Lechnowsky"
    Purpose: {
        To crawl the web starting from any site. Does not record
        duplicate visits. Saves all links found in 'newlinks.
    }
    Email: bo@rebol.com
    Comments: {
        Based on my previous script, %rebol-web-miner.r
    }
    library: [
        level: 'advanced
        platform: none
        type: 'tool
        domain: [web other-net]
        tested-under: none
        support: none
        license: none
        see-also: none
    ]
]
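; To run the crawler (assuming the script is saved as %webcrawler.r,
; as named in the header above), from the REBOL console:
;     do %webcrawler.r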
find-links: func [
    "Finds 'href' links and outputs them as a block"
    url [url!] "The site currently being checked"
    html [string!] "The HTML text to parse"
    /local links site link
][
    links: make block! 0
    ; Find the page's base directory: walk back from the tail of the
    ; URL to the last "/" and truncate just after it.
    site: tail form url
    while [(copy/part site: back site 1) <> "/"] []
    site: to-url head clear next site
    ; Each link runs from the "=" after "href" up to the tag's ">"
    while [html: find html "href"] [
        link: trim copy/part (next find html "=") (html: find html ">")
        if not found? find link "mailto:" [
            link: trim/with link {"}  ; strip the surrounding quotes
            ; Resolve relative links against the page's directory
            if (copy/part form link 7) <> "http://" [
                link: head clean-path join site link
            ]
            append links to-url link
        ]
    ]
    return links
]
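; A quick check of find-links (a hypothetical example, not part of the
; original script; the URL and HTML fragment are made up). Relative
; links are resolved against the page's directory and mailto: links
; are skipped:
;     probe find-links http://www.example.com/docs/index.html {
;         <a href="page.html">Local</a>
;         <a href="http://www.rebol.com/">Absolute</a>
;         <a href="mailto:someone@example.com">Mail</a>}
;     == [http://www.example.com/docs/page.html http://www.rebol.com/]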
; Start crawling from a single seed URL. 'sites records every page
; visited; 'newlinks accumulates every unique link found.
urls: [http://www.rebol.com/]
newlinks: make block! 0
sites: make block! 0
; Runs forever (press Escape in the console to stop): each pass reads
; every not-yet-visited URL, then recrawls the enlarged link list.
while [true] [
    foreach url urls [
        either find sites url [
            print [url "already visited"]
        ][
            print [" READING" url]
            append newlinks url
            append sites url
            either not error? try [read-url: read url] [
                ; Keep only links we haven't recorded yet
                foreach link find-links url read-url [
                    if none? find newlinks link [
                        append newlinks link
                    ]
                ]
            ][
                print [" Error reading" url]
            ]
        ]
    ]
    urls: newlinks
]
This is a blog about Rebol, a fantastic free programming language that makes it easy to create complete software in a few lines of code. It's cross-platform, so a script written on Windows will run on Linux and Mac, and vice versa. You can even produce wonderful GUIs with just three lines of code!
Tuesday, 8 April 2014
Web crawler
The script above is a web crawler; the sites block records every page visited, and newlinks collects every link found.
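The main loop never ends on its own, so stop it with Escape in the REBOL console when you have seen enough. A minimal sketch of what you might do afterwards in the same session (the file name %sites.txt is just an example, not part of the script):

print ["Pages visited:  " length? sites]
print ["Links collected:" length? newlinks]
save %sites.txt sites  ; keep the visited list for a later session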