Activity Stream
48,167 MEMBERS
6871 ONLINE
besthostingforums On YouTube Subscribe to our Newsletter besthostingforums On Twitter besthostingforums On Facebook besthostingforums On facebook groups

Results 1 to 9 of 9
  1.     
    #1
    Respected Developer

    Default [F#][SNIPPET] Basic web crawler example (multi-core)

    Code: 
    open System
    open System.Net
    open Hyperz.SharpLeech.Engine
    open Hyperz.SharpLeech.Engine.Html
    open Hyperz.SharpLeech.Engine.Net
    
    let getData url =
        url
        |> Http.Prepare
        |> Http.Request
        |> fun result ->
            if result.HasError then result.Data
            else Http.HandleRedirects(result, false).Data
    
    let getUrls html sourceUrl =
        let baseUrl =
            new Uri(sourceUrl)
            |> fun u -> u.Scheme + "://" + u.Host
        new HtmlDocument()
        |> fun doc -> doc.LoadHtml(html); doc
        |> fun doc -> doc.DocumentNode.SelectNodes("//a")
        |> Seq.map (fun node -> node.GetAttributeValue("href", ""))
        |> Seq.map (fun url -> HttpUtility.HtmlDecode(url).Trim())
        |> Seq.map (fun url ->
            if url.StartsWith("http://") then url
            elif url.StartsWith("https://") then url
            elif url.StartsWith("/") then baseUrl + url
            elif url.StartsWith("#") then ""
            else baseUrl + "/" + url)
        |> Seq.filter (fun url -> url.Length > 0)
    
    let rec crawl url crawled = Async.Start(async {
        let data = getData url
        let urls =
            getUrls data url
            |> Seq.filter (fun u -> not(List.exists (fun itm -> itm = u) (crawled)))
        do printfn "Crawling: %s\nFound:    %i URL's" url (Seq.length urls)
        for u in urls do crawl u (crawled @ [u]) })
    
    
    (* ================================================ *)
    (* START CRAWLING                                   *)
    (* ================================================ *)
    
    let url = "http://thepiratebay.org/"
    let rec memCleaner() =
        (* Clean memory every 10 seconds *)
        System.Threading.Thread.Sleep(10000)
        GC.Collect()
        memCleaner()
    
    ServicePointManager.DefaultConnectionLimit <- 10
    Http.MaxRedirects <- 2
    Http.Timeout <- 10000
    Http.KeepAlive <- true
    Http.UseCompression <- true
    Console.BufferWidth <- 256
    Console.BufferHeight <- 768
    Console.Title <- "F# Web Crawler"
    
    (* Start the crawler and mem cleaner *)
    Async.Start(async{memCleaner()})
    crawl url [url]
    
    stdin.Read() |> ignore
    This is just a little something I made while I was working on SL to test the performance of the Http class. It'll crawl a webpage, extract the links, filter out the dupes (already crawled ones) and add the new links to a queue. It's based on recursion and is "state-less" (no mutable variables are used to store data/the state). That means no matter how many CPU cores you have, it'll scale linear across all of them with very little performance loss. It doesn't have any practical function but might be a handy reference for people who are looking into functional programming or F#.

    Video showing a slightly modified version:

    Ignore the heavy hiphop tune and other nonsense. I was a bit drunk when I was recording it, lol .
    Hyperz Reviewed by Hyperz on . [F#][SNIPPET] Basic web crawler example (multi-core) open System open System.Net open Hyperz.SharpLeech.Engine open Hyperz.SharpLeech.Engine.Html open Hyperz.SharpLeech.Engine.Net let getData url = url |> Http.Prepare |> Http.Request Rating: 5

  2.   Sponsored Links

  3.     
    #2
    Member
    What language is this??

    Just googled F#.Wow I never new that microsoft was developing F#. :0

  4.     
    #3
    Respected Developer

  5.     
    #4
    Member
    Website's:
    litewarez.net litewarez.com triniwarez.com
    :/ makes me wonder how fast the googlebot system is..

    Great tut Hyperz, any chance a C# version for comparison of speeds ??
    Join Litewarez.net today and become apart of the community.
    Unique | Clean | Advanced (All with you in mind)
    Downloads | Webmasters


    Notifications,Forum,Chat,Community all at Litewarez Webmasters


  6.     
    #5
    ლ(ಠ益ಠლ)
    Website's:
    extremecoderz.com
    lool! we were talking on teamspeak about this. Personally, and i dont care what you say Hyp, my method pwns urs! Let the battle begin! i'll post some speed-tests using my little engine.

    EDIT: chomps ur CPU a little - why so? What part needs so much calculation?

  7.     
    #6
    Respected Developer
    Quote Originally Posted by litewarez View Post
    :/ makes me wonder how fast the googlebot system is..

    Great tut Hyperz, any chance a C# version for comparison of speeds ??
    A C# version would have at least 10 times more code to do the same thing.

    Quote Originally Posted by jayfella View Post
    lool! we were talking on teamspeak about this. Personally, and i dont care what you say Hyp, my method pwns urs! Let the battle begin! i'll post some speed-tests using my little engine.

    EDIT: chomps ur CPU a little - why so? What part needs so much calculation?
    This doesn't show my method . But sure, bring it on. It's time to kick ass and chew bubblegum !

    It has a queue of 500.000 urls. Each time it crawls a page it extracts 50 urls on average. Each of those 50 urls has to be checked against the entire queue to filter out dupes. And this gets done more than 5 times in a second at moments.

    500.000 * 50 * 5 = 125.000.000 string matches per second. Yes, that will indeed keep a quad core busy. I'm surprised it only eats 50%. Just goes to show how good it scales . Keep in mind that filtering dupes is not the only thing that happens a few times a second here.

  8.     
    #7
    Member
    Website's:
    litewarez.net litewarez.com triniwarez.com
    I dont think it matters about the amount of code, im looking for a comparison is speeds against F# / C#

    But them stats are pretty nice Hyperz !

    Actually your using libraries aswell here
    Code: 
    open Hyperz.SharpLeech.Engine
    open Hyperz.SharpLeech.Engine.Html
    open Hyperz.SharpLeech.Engine.Net
    Join Litewarez.net today and become apart of the community.
    Unique | Clean | Advanced (All with you in mind)
    Downloads | Webmasters


    Notifications,Forum,Chat,Community all at Litewarez Webmasters


  9.     
    #8
    Respected Developer
    It does matter because I'm not gonna bother with a C# port. But the answer to your question is that F# will be faster. Not because it's a better language per s?, they both are .NET anyway. But rather because it's a functional language. Some stuff will perform better when written in C#, some stuff will run better with F#. This is one of the things that will run better with a functional language because of the asynchronous stuff that's involved.

    F# is gonna be better choice over C# for:
    - stuff that needs to run in parallel
    - complex maths
    - AI / neural networks
    - situations where a lot of data needs to be handled
    - writing a compiler
    - ray-tracing
    - DSL's
    - few other things

    For other stuff you really want to use C#. Did you know they wrote the F# 2.0 compiler in F# itself in under 10.000 lines of code ?

    And yes it uses the SL library. As I stated in the 1st post this was made to test the Http part of it.

  10.     
    #9
    ლ(ಠ益ಠლ)
    Website's:
    extremecoderz.com
    Im not even gonna entertain async. im going to throw it all in an array and threadPool it like i showed you on TeamSpeak.

Thread Information

Users Browsing this Thread

There are currently 1 users browsing this thread. (0 members and 1 guests)

Similar Threads

  1. Snippet of the Day
    By SplitIce in forum Web Development Area
    Replies: 41
    Last Post: 26th Aug 2012, 06:09 PM
  2. Plz Help To Add A Php Snippet Into My DLE Index !
    By JoomlaZ in forum Web Development Area
    Replies: 0
    Last Post: 7th Jul 2011, 01:18 PM
  3. [C#] Tiny Web Server (snippet)
    By Hyperz in forum Web Development Area
    Replies: 6
    Last Post: 24th Jun 2010, 01:19 PM
  4. [F#] Strong random password generator (Multi-core)
    By Hyperz in forum Web Development Area
    Replies: 2
    Last Post: 19th Jun 2010, 11:45 AM
  5. A Snippet from my latest project
    By litewarez in forum Tutorials and Guides
    Replies: 19
    Last Post: 21st Jun 2009, 05:17 PM

Tags for this Thread

BE SOCIAL