Results 1 to 9 of 9
-
15th Jun 2010, 12:53 PM #1OPRespected Developer
[F#][SNIPPET] Basic web crawler example (multi-core)
Code:open System open System.Net open Hyperz.SharpLeech.Engine open Hyperz.SharpLeech.Engine.Html open Hyperz.SharpLeech.Engine.Net let getData url = url |> Http.Prepare |> Http.Request |> fun result -> if result.HasError then result.Data else Http.HandleRedirects(result, false).Data let getUrls html sourceUrl = let baseUrl = new Uri(sourceUrl) |> fun u -> u.Scheme + "://" + u.Host new HtmlDocument() |> fun doc -> doc.LoadHtml(html); doc |> fun doc -> doc.DocumentNode.SelectNodes("//a") |> Seq.map (fun node -> node.GetAttributeValue("href", "")) |> Seq.map (fun url -> HttpUtility.HtmlDecode(url).Trim()) |> Seq.map (fun url -> if url.StartsWith("http://") then url elif url.StartsWith("https://") then url elif url.StartsWith("/") then baseUrl + url elif url.StartsWith("#") then "" else baseUrl + "/" + url) |> Seq.filter (fun url -> url.Length > 0) let rec crawl url crawled = Async.Start(async { let data = getData url let urls = getUrls data url |> Seq.filter (fun u -> not(List.exists (fun itm -> itm = u) (crawled))) do printfn "Crawling: %s\nFound: %i URL's" url (Seq.length urls) for u in urls do crawl u (crawled @ [u]) }) (* ================================================ *) (* START CRAWLING *) (* ================================================ *) let url = "http://thepiratebay.org/" let rec memCleaner() = (* Clean memory every 10 seconds *) System.Threading.Thread.Sleep(10000) GC.Collect() memCleaner() ServicePointManager.DefaultConnectionLimit <- 10 Http.MaxRedirects <- 2 Http.Timeout <- 10000 Http.KeepAlive <- true Http.UseCompression <- true Console.BufferWidth <- 256 Console.BufferHeight <- 768 Console.Title <- "F# Web Crawler" (* Start the crawler and mem cleaner *) Async.Start(async{memCleaner()}) crawl url [url] stdin.Read() |> ignore
Video showing a slightly modified version:
Ignore the heavy hiphop tune and other nonsense. I was a bit drunk when I was recording it, lol.
Hyperz Reviewed by Hyperz on . [F#][SNIPPET] Basic web crawler example (multi-core) open System open System.Net open Hyperz.SharpLeech.Engine open Hyperz.SharpLeech.Engine.Html open Hyperz.SharpLeech.Engine.Net let getData url = url |> Http.Prepare |> Http.Request Rating: 5
-
15th Jun 2010, 01:01 PM #2
-
15th Jun 2010, 01:03 PM #3OPRespected Developer
-
16th Jun 2010, 11:23 PM #4MemberWebsite's:
litewarez.net litewarez.com triniwarez.com:/ makes me wonder how fast the googlebot system is..
Great tut Hyperz, any chance a C# version for comparison of speeds ??Join Litewarez.net today and become apart of the community.
Unique | Clean | Advanced (All with you in mind)
Downloads | Webmasters
Notifications,Forum,Chat,Community all at Litewarez Webmasters
-
16th Jun 2010, 11:28 PM #5ლ(ಠ益ಠლ)Website's:
extremecoderz.comlool! we were talking on teamspeak about this. Personally, and i dont care what you say Hyp, my method pwns urs! Let the battle begin!
i'll post some speed-tests using my little engine.
EDIT: chomps ur CPU a little - why so? What part needs so much calculation?
-
17th Jun 2010, 12:11 AM #6OPRespected Developer
A C# version would have at least 10 times more code to do the same thing.
This doesn't show my method. But sure, bring it on. It's time to kick ass and chew bubblegum
!
It has a queue of 500.000 urls. Each time it crawls a page it extracts 50 urls on average. Each of those 50 urls has to be checked against the entire queue to filter out dupes. And this gets done more than 5 times in a second at moments.
500.000 * 50 * 5 = 125.000.000 string matches per second. Yes, that will indeed keep a quad core busy. I'm surprised it only eats 50%. Just goes to show how good it scales. Keep in mind that filtering dupes is not the only thing that happens a few times a second here.
-
17th Jun 2010, 09:24 AM #7MemberWebsite's:
litewarez.net litewarez.com triniwarez.comI dont think it matters about the amount of code, im looking for a comparison is speeds against F# / C#
But them stats are pretty nice Hyperz !
Actually your using libraries aswell here
Code:open Hyperz.SharpLeech.Engine open Hyperz.SharpLeech.Engine.Html open Hyperz.SharpLeech.Engine.Net
Join Litewarez.net today and become apart of the community.
Unique | Clean | Advanced (All with you in mind)
Downloads | Webmasters
Notifications,Forum,Chat,Community all at Litewarez Webmasters
-
17th Jun 2010, 10:47 AM #8OPRespected Developer
It does matter because I'm not gonna bother with a C# port. But the answer to your question is that F# will be faster. Not because it's a better language per s?, they both are .NET anyway. But rather because it's a functional language. Some stuff will perform better when written in C#, some stuff will run better with F#. This is one of the things that will run better with a functional language because of the asynchronous stuff that's involved.
F# is gonna be better choice over C# for:
- stuff that needs to run in parallel
- complex maths
- AI / neural networks
- situations where a lot of data needs to be handled
- writing a compiler
- ray-tracing
- DSL's
- few other things
For other stuff you really want to use C#. Did you know they wrote the F# 2.0 compiler in F# itself in under 10.000 lines of code?
And yes it uses the SL library. As I stated in the 1st post this was made to test the Http part of it.
-
17th Jun 2010, 10:50 AM #9ლ(ಠ益ಠლ)Website's:
extremecoderz.comIm not even gonna entertain async. im going to throw it all in an array and threadPool it like i showed you on TeamSpeak.
Sponsored Links
Thread Information
Users Browsing this Thread
There are currently 1 users browsing this thread. (0 members and 1 guests)
Similar Threads
-
Snippet of the Day
By SplitIce in forum Web Development AreaReplies: 41Last Post: 26th Aug 2012, 06:09 PM -
Plz Help To Add A Php Snippet Into My DLE Index !
By JoomlaZ in forum Web Development AreaReplies: 0Last Post: 7th Jul 2011, 01:18 PM -
[C#] Tiny Web Server (snippet)
By Hyperz in forum Web Development AreaReplies: 6Last Post: 24th Jun 2010, 01:19 PM -
[F#] Strong random password generator (Multi-core)
By Hyperz in forum Web Development AreaReplies: 2Last Post: 19th Jun 2010, 11:45 AM -
A Snippet from my latest project
By litewarez in forum Tutorials and GuidesReplies: 19Last Post: 21st Jun 2009, 05:17 PM
themaCreator - create posts from...
Version 3.47 released. Open older version (or...