Extracting data from a string

**Hyperz** · 5th Jul 2010, 04:06 PM

No, using regex is overkill. I suggest you do some research into how regular expression engines work internally. As I said before, you can use regex only in very (very very) simple situations. As soon as you try to use it for extracting dynamic data from multiple locations in a document, you're screwed with regex. It ain't rocket science.

**Dman** · 5th Jul 2010, 05:24 PM

I know how regex works. The way a DOM parser works is simple, but it uses a lot of memory and is slower. I find it trivial to waste resources for a simple task such as his. He does not need to extract dynamic data. Using a DOM parser is overkill and a waste of resources in this case. It ain't rocket science either.

**Hyperz** · 5th Jul 2010, 06:09 PM

Originally Posted by Dman

I know how regex works.

I don't think you do. If you did, you'd agree with me and everyone else who calls himself a coder.

Have fun: http://swtch.com/~rsc/regexp/regexp2.html

Originally Posted by Dman

The way a DOM parser works is simple, but it uses a lot of memory and is slower.

Sigh. To parse XML/HTML one only needs a simple state machine. Because of that it can be parsed with relatively little code and is much more efficient when compared to the complex finite state machines that make up modern regular expression engines. It requires fewer CPU cycles than running it through one or more regular expressions, and because of the simplicity of the state machine it uses less memory. Once a DOM is parsed it'll use roughly the same amount of memory as the original string that holds the markup (+ a few KB here and there for the objects).

Every developer who needs to work with HTML or XML will parse it with a state machine and not with regular expressions, for said reasons.
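
To make the "simple state machine" idea concrete, here is a rough sketch that is not from the original post. It assumes three states (text, inside a tag, inside a quoted attribute value) and only collects the raw contents of each tag. The names `TagScanner` and `ScanTags` are made up for illustration, and comments, scripts, CDATA and malformed markup are deliberately ignored, so treat it as a toy and not as a real HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, for illustration only.
public static class TagScanner
{
    private enum State { Text, Tag, Quoted }

    // Returns the raw contents of every <...> tag in the input, in order.
    public static List<string> ScanTags(string html)
    {
        var tags = new List<string>();
        var current = new StringBuilder();
        var state = State.Text;
        char quote = '\0';

        foreach (char c in html)
        {
            switch (state)
            {
                case State.Text:
                    // Plain text is skipped until a tag opens.
                    if (c == '<') { current.Clear(); state = State.Tag; }
                    break;

                case State.Tag:
                    if (c == '>')
                    {
                        // End of the tag: store what was collected.
                        tags.Add(current.ToString());
                        state = State.Text;
                    }
                    else
                    {
                        current.Append(c);
                        // A quote starts an attribute value that may contain '>'.
                        if (c == '"' || c == '\'') { quote = c; state = State.Quoted; }
                    }
                    break;

                case State.Quoted:
                    current.Append(c);
                    // Only the matching quote character closes the value.
                    if (c == quote) state = State.Tag;
                    break;
            }
        }
        return tags;
    }
}
```

The quoted state is the part a naive search for `>` gets wrong: a `>` inside an attribute value does not end the tag.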

I'll give you another article: http://www.codinghorror.com/blog/200...hulhu-way.html

Originally Posted by Dman

I find it trivial to waste resources for a simple task such as his. He does not need to extract dynamic data.

Erm, you save resources if you simply parse the HTML the way it should be parsed. You might want to re-read the first post. It's clear that he needs dynamic data.

Originally Posted by Dman

It ain't rocket science either.

Apparently for some it is. You know you're wrong here, why not just say I'm right? It's not gonna make you look stupid or anything. We all learn, including me.

PS: do read those articles.

**Dman** · 6th Jul 2010, 09:47 AM

Well, from the looks of it, it doesn't look like he wants to parse data but extract it.
The regular expression reading was very interesting - thanks for the link.

I am also not clear on what you mean by dynamic data?

**pankaj** · 6th Jul 2010, 10:55 AM

I am not using Hyperz's method or the regex method. Both are new to me. What I require is very simple: just extracting all links with a given start string and end string from a page which I have already stored as a string.

I am using a loop and IndexOf to do it.

**Hyperz** · 6th Jul 2010, 12:11 PM

^That too is how it should not be done. It's virtually the same as regular expressions but even less solid. What is so hard about just parsing the HTML file? Why use some confusing dirty method? Anyway, it's your problem. Just know that your code is flawed.

Originally Posted by Dman

Well, from the looks of it, it doesn't look like he wants to parse data but extract it.
The regular expression reading was very interesting - thanks for the link.

I am also not clear on what you mean by dynamic data?

Lol you just don't give up do you ^^. You need to parse data to extract it correctly. By dynamic data I mean data that changes (coming from a PHP file for example). If you look at the 1st post you see he extracts data from a forum which is about as dynamic as it gets.

**pankaj** · 6th Jul 2010, 12:22 PM

I already completed coding the part for extracting the string. It was not a simple <div> and </div> tag pair. I had to extract data between two definite string patterns.

So, I didn't want to waste another week learning your method.
I'll give it time after I complete my project and learn that too though.

Code:

for (int i = 0; i < 5; i++)
{
    string result = "";
    int iIndexOfBegin = strSource.IndexOf(strBegin);
    if (iIndexOfBegin != -1)
    {
        String tempstring = strSource.Substring(iIndexOfBegin + strBegin.Length);
        int iEnd = tempstring.IndexOf(strEnd);
        if (iEnd != -1)
        {
            result = tempstring.Substring(0, iEnd);

            string next = result;

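The posted snippet is cut off, so the following is only a guess at the general idea, not pankaj's actual code. It is a minimal sketch of a loop that collects every substring between a begin marker and the next end marker, moving the search position forward after each match so the same match is not found twice. `StringExtractor` and `ExtractAllBetween` are hypothetical names; `source`, `begin` and `end` play the role of `strSource`, `strBegin` and `strEnd` from the post.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class StringExtractor
{
    // Returns every substring found between a begin marker and the next end marker.
    public static List<string> ExtractAllBetween(string source, string begin, string end)
    {
        var results = new List<string>();
        if (string.IsNullOrEmpty(source) || string.IsNullOrEmpty(begin) || string.IsNullOrEmpty(end))
            return results;

        int position = 0;
        while (true)
        {
            // Look for the next begin marker at or after the current position.
            int beginIndex = source.IndexOf(begin, position, StringComparison.Ordinal);
            if (beginIndex < 0) break;

            // The wanted text starts right after the begin marker.
            int contentStart = beginIndex + begin.Length;
            int endIndex = source.IndexOf(end, contentStart, StringComparison.Ordinal);
            if (endIndex < 0) break;

            results.Add(source.Substring(contentStart, endIndex - contentStart));

            // Continue after the end marker so each match is reported once.
            position = endIndex + end.Length;
        }
        return results;
    }
}
```

Searching with a start index avoids allocating a new temporary string on every pass, and the loop stops on its own when no further markers are found instead of running a fixed number of times.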
**Hyperz** · 6th Jul 2010, 12:41 PM

What is there to learn about:

Code:

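// assumes the Html Agility Pack library (HtmlAgilityPack.dll) is referenced
using System.Linq;
using HtmlAgilityPack;

var html = new HtmlDocument();

// load the html
html.LoadHtml(yourHtmlHere);

// use XPath to select all "A" elements from the html
// (SelectNodes returns null when nothing matches)
var anchors = html.DocumentNode.SelectNodes("//a");

// keep only the anchors whose href starts with http
var filter = from a in (anchors ?? Enumerable.Empty<HtmlNode>())
             where a.GetAttributeValue("href", "").StartsWith("http")
             select a;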

??

It's just loading a dll and calling a few methods. I don't see what needs to be learned here. What you're doing right there is the wrong way to do it and I wouldn't be surprised if I see you making another topic because suddenly something stopped working or your program crashes.

**jayfella** · 6th Jul 2010, 12:47 PM

Doing what pankaj wants is quickest when using IndexOf - I've been through them all, and that's the conclusion I came to, which is why I created my little "getstringInbetween" class.

**Hyperz** · 6th Jul 2010, 12:51 PM

But it is not reliable with dynamic data. You need access to the DOM. This is not a situation where execution time is important but rather one where the validity of the data is.
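
A small sketch of what "not reliable with dynamic data" can look like in practice (not from the original posts; the `MarkerDemo` class, the `Between` method and the sample markup are invented for illustration). A marker-based lookup works while the markup matches the marker exactly, and silently finds nothing once the page emits the same link with an extra attribute.

```csharp
using System;

// Hypothetical helper, for illustration only.
public static class MarkerDemo
{
    // Returns the text between the first begin marker and the next end marker, or null.
    public static string Between(string source, string begin, string end)
    {
        int start = source.IndexOf(begin, StringComparison.Ordinal);
        if (start < 0) return null;

        start += begin.Length;
        int stop = source.IndexOf(end, start, StringComparison.Ordinal);
        return stop < 0 ? null : source.Substring(start, stop - start);
    }
}
```

With the begin marker `<a href="` and the end marker `"`, the input `<a href="http://a">x</a>` gives back the URL, while `<a class="l" href="http://a">x</a>` gives back null even though the link is still there. A DOM-based lookup of the `href` attribute does not depend on attribute order.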
