Extracting data from a string

**rohansakhale** · 4th Jul 2010, 06:24 AM

Try some logic that finds where the angle brackets close and the text starts without an opening angle bracket, then store each character in a char array using a pointer. I can do it in C; I haven't started C# yet.
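In C# terms, that idea might look roughly like the sketch below. It assumes the input is simple and well formed (no `<` or `>` inside attribute values or text), and `ManualScan` / `TextOutsideTags` are made-up names for illustration. It uses a `StringBuilder` in place of a raw char array and pointer.

```csharp
using System;
using System.Text;

public static class ManualScan
{
    // Collects every character that is not inside a <...> tag.
    // Assumption: brackets only ever appear as tag delimiters.
    public static string TextOutsideTags(string input)
    {
        var sb = new StringBuilder();
        bool insideTag = false;

        foreach (char c in input)
        {
            if (c == '<')
                insideTag = true;       // a tag opens, stop collecting
            else if (c == '>')
                insideTag = false;      // the tag closed, text may follow
            else if (!insideTag)
                sb.Append(c);           // plain text between tags
        }

        return sb.ToString();
    }

    public static void Main()
    {
        // prints "xyz.abc"
        Console.WriteLine(TextOutsideTags("<a tag><c tag>xyz.abc</c></a>"));
    }
}
```

Note that this concatenates all text nodes into one string; if several separate pieces are needed, the buffer would have to be flushed each time a tag opens.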

**Dman** · 4th Jul 2010, 08:50 AM

I think regex should do the trick

**pankaj** · 4th Jul 2010, 11:06 AM

Yeah, thanks Dman. It helped out.

**Jueki** · 4th Jul 2010, 11:23 AM

It's relatively simple actually. You can use regex as already mentioned; here's an example for the "<a tag><c tag>xyz.abc</c></a>" string.

Code:

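// requires: using System.Text.RegularExpressions;
string StringToSearch = "<a tag><c tag>xyz.abc</c></a>";
// verbatim string (@), so "/" needs no escaping; group 1 captures the text between the tags
string StringFound = Regex.Match(StringToSearch, @"<a tag><c tag>(.*)</c></a>").Groups[1].Value;
MessageBox.Show(StringFound);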

You can adapt the code yourself for whatever you need.

If you need all matches, use the Matches function instead of Match, then go through the result with a foreach loop and read the captured group (Groups[1]) from each match.
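A minimal sketch of the Matches variant, assuming the input may contain several "<c tag>" elements. `TagExtractor` and `ExtractAll` are hypothetical names, and the pattern uses a lazy `(.*?)` so each match stops at the first closing tag instead of swallowing everything up to the last one.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagExtractor
{
    // Returns the text of every <c tag>...</c> element found in the input.
    public static List<string> ExtractAll(string input)
    {
        var results = new List<string>();

        // Regex.Matches returns every non-overlapping match, not just the first
        foreach (Match m in Regex.Matches(input, @"<c tag>(.*?)</c>"))
        {
            // Groups[0] is the whole match, Groups[1] is the first capture group
            results.Add(m.Groups[1].Value);
        }

        return results;
    }

    public static void Main()
    {
        var found = ExtractAll("<a tag><c tag>xyz.abc</c><c tag>foo.bar</c></a>");
        foreach (string s in found)
            Console.WriteLine(s);   // prints "xyz.abc" then "foo.bar"
    }
}
```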

**Hyperz** · 4th Jul 2010, 02:22 PM

Never use regex for parsing markup. Download SharpLeech and add a reference to the Engine DLL in your project. Then add these using directives to your code:

Code:

using Hyperz.SharpLeech.Engine.Html;
using Hyperz.SharpLeech.Engine.Net;

Now you can use it like:

Code:

var html = new HtmlDocument();

// load the html
html.LoadHtml("<div class=\"example\">foo</div>");

// use XPath to select the div
var node = html.DocumentNode.SelectSingleNode("//div[@class='example']");
var divContent = HttpUtility.HtmlDecode(node.InnerText);

XPath info: http://www.w3schools.com/xpath/default.asp

**pankaj** · 4th Jul 2010, 03:27 PM

Where do I put this file, Hyperz.SharpLeech.Engine.dll?

I guess your code extracts the word "example" from between the <div> tags.
But how do I extract links that start with http and end with .extension?

**Hyperz** · 4th Jul 2010, 03:36 PM

Can't find what files? You only need Hyperz.SharpLeech.Engine.dll. And nope, the example extracts the word foo. Take a look at XPath via the link I posted.

Regarding the other question:

Code:

var html = new HtmlDocument();

// load the html
html.LoadHtml(yourHtmlHere);

// use XPath to select all "A" elements from the html
var anchors = html.DocumentNode.SelectNodes("//a");

// keep only the anchors whose href starts with http
var filter = from a in anchors
             where a.GetAttributeValue("href", "").StartsWith("http")
             select a;

Just experiment with it.
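For the "ends with .extension" part of the question, here is a rough sketch. It assumes the href values have already been pulled out of the anchors (for instance with a.GetAttributeValue("href", "") as above), so it works on plain strings and does not touch the SharpLeech API at all. `LinkFilter`, `FilterLinks` and the sample URLs are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LinkFilter
{
    // Keeps only the links that start with "http" and end with the given extension.
    public static List<string> FilterLinks(IEnumerable<string> hrefs, string extension)
    {
        return hrefs
            .Where(h => h.StartsWith("http", StringComparison.OrdinalIgnoreCase)
                     && h.EndsWith(extension, StringComparison.OrdinalIgnoreCase))
            .ToList();
    }

    public static void Main()
    {
        var hrefs = new[]
        {
            "http://example.com/file.zip",
            "/relative/page.html",
            "https://example.com/readme.txt"
        };

        foreach (string link in FilterLinks(hrefs, ".zip"))
            Console.WriteLine(link);   // prints "http://example.com/file.zip"
    }
}
```

A link with a query string after the extension (file.zip?id=1) would not pass the EndsWith check, so real pages may need the URL parsed first.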

**Dman** · 5th Jul 2010, 03:36 PM

Originally Posted by Hyperz

Never use regex for parsing markup.

I beg to differ - regex is much cleaner lol

**Hyperz** · 5th Jul 2010, 03:53 PM

Cleaner? That sounds like something a VB6 coder would say. As a coder you should know that you can't use regex for parsing markup. For one, it is much too slow for that. Secondly, your expressions are static: they can't handle changes in the DOM structure without being redone. Then there is the issue of inner HTML, etc.

The only case in which you can use regex is when you need only one simple string from a small HTML document whose contents you know won't change. For anything else it'll turn into a slow, unmanageable mess. I'd be more than happy to put this to the test.

**Dman** · 5th Jul 2010, 04:00 PM

Using a DOM parser is faster than regex? I thought DOM parsers used regex.

Anyway, those parsers use up too much memory. For his case regex is simple, and using a parser is overkill.
