15th Jun 2010, 12:53 PM #1 (OP, Respected Developer)
[F#][SNIPPET] Basic web crawler example (multi-core)
Code:
open System
open System.Net
open Hyperz.SharpLeech.Engine
open Hyperz.SharpLeech.Engine.Html
open Hyperz.SharpLeech.Engine.Net

// Download a page. On error, return whatever data came back;
// otherwise follow redirects first.
let getData url =
    url
    |> Http.Prepare
    |> Http.Request
    |> fun result ->
        if result.HasError then result.Data
        else Http.HandleRedirects(result, false).Data

// Extract every link from the page and turn it into an absolute URL.
let getUrls html sourceUrl =
    let baseUrl =
        new Uri(sourceUrl)
        |> fun u -> u.Scheme + "://" + u.Host
    let doc = new HtmlDocument()
    doc.LoadHtml(html)
    doc.DocumentNode.SelectNodes("//a")
    |> Seq.map (fun node -> node.GetAttributeValue("href", ""))
    |> Seq.map (fun url -> HttpUtility.HtmlDecode(url).Trim())
    |> Seq.map (fun url ->
        if url.StartsWith("http://") then url
        elif url.StartsWith("https://") then url
        elif url.StartsWith("/") then baseUrl + url
        elif url.StartsWith("#") then ""
        else baseUrl + "/" + url)
    |> Seq.filter (fun url -> url.Length > 0)

// Crawl a URL in its own async workflow, then crawl each new link it contains.
let rec crawl url crawled =
    Async.Start(async {
        let data = getData url
        // Materialise the lazy sequence once so it is not re-evaluated
        // for the count and again for the loop, and drop duplicate links.
        let urls =
            getUrls data url
            |> Seq.filter (fun u -> not (List.exists (fun itm -> itm = u) crawled))
            |> Seq.distinct
            |> Seq.toList
        do printfn "Crawling: %s\nFound: %i URLs" url (List.length urls)
        // Pass all links found on this page down, so sibling pages are not
        // re-crawled by each other. The list is still per-branch, not global,
        // so unrelated branches can visit the same URL.
        let seen = crawled @ urls
        for u in urls do crawl u seen
    })

(* ================================================ *)
(* START CRAWLING                                   *)
(* ================================================ *)

let url = "http://thepiratebay.org/"

let rec memCleaner() =
    (* Clean memory every 10 seconds *)
    System.Threading.Thread.Sleep(10000)
    GC.Collect()
    memCleaner()

ServicePointManager.DefaultConnectionLimit <- 10
Http.MaxRedirects <- 2
Http.Timeout <- 10000
Http.KeepAlive <- true
Http.UseCompression <- true

Console.BufferWidth <- 256
Console.BufferHeight <- 768
Console.Title <- "F# Web Crawler"

(* Start the crawler and mem cleaner *)
Async.Start(async { memCleaner() })
crawl url [url]
stdin.Read() |> ignore
Video showing a slightly modified version:
Ignore the heavy hiphop tune and other nonsense. I was a bit drunk when I was recording it, lol.
Hyperz