Thursday, June 12, 2014

[ 4GuysFromRolla.com ] Parsing HTML Documents with the Html Agility Pack

The Html Agility Pack contains a number of classes, all in the HtmlAgilityPack namespace. Therefore, start by adding a using statement (or Imports statement if you are using VB) to the top of your code-behind class:

using HtmlAgilityPack;

To download a web page from a remote server, use the HtmlWeb class's Load method, passing in the URL to download.
[ example ] var webGet = new HtmlWeb();
[ example ] var document = webGet.Load(url);
The Load method returns an HtmlDocument object. In the above code snippet we've assigned this returned object to the local variable document. The HtmlDocument class represents a complete HTML document and contains a DocumentNode property, which returns an HtmlNode object that represents the root node of the document.

The HtmlNode class has several properties worth noting. There are properties for traversing the DOM, including:
        - ParentNode
        - ChildNodes
        - NextSibling
        - PreviousSibling

There are properties for determining information about the node itself, such as:
        - Name - gets or sets the node's name. For HTML elements this property returns (or assigns) the name of the tag - "body" for the <body> tag, "p" for a <p> tag, and so on.
        - Attributes - returns the collection of attributes for this element, if any.
        - InnerHtml - gets or sets the HTML content within the node.
        - InnerText - returns the text within the node.
        - NodeType - indicates the type of the node. Can be Document, Element, Comment, or Text.
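
As a rough illustration of the properties listed above, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced, and it parses a made-up HTML string with HtmlDocument.LoadHtml instead of downloading a page, so it can run offline. The NodePropertiesDemo class, the SampleHtml markup, and the DescribeChildren helper are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class NodePropertiesDemo
{
    // Hypothetical sample markup, parsed from a string so the sketch runs offline.
    public const string SampleHtml =
        "<html><body><div id=\"intro\" class=\"lead\">Hello <b>world</b></div></body></html>";

    // Describes each direct child of the given node as "NodeType:Name".
    public static List<string> DescribeChildren(HtmlNode node)
    {
        var result = new List<string>();
        foreach (HtmlNode child in node.ChildNodes)
        {
            result.Add(child.NodeType + ":" + child.Name);
        }
        return result;
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);

        // Grab the <div> so we have a concrete node to inspect.
        HtmlNode div = document.DocumentNode.SelectSingleNode("//div");

        Console.WriteLine(div.Name);            // the tag name
        Console.WriteLine(div.ParentNode.Name); // the name of the enclosing element
        Console.WriteLine(div.InnerHtml);       // the markup inside the node
        Console.WriteLine(div.InnerText);       // the text only, with tags stripped

        // Attributes is a collection of HtmlAttribute objects.
        foreach (HtmlAttribute attribute in div.Attributes)
        {
            Console.WriteLine(attribute.Name + " = " + attribute.Value);
        }

        // ChildNodes includes text nodes as well as element nodes.
        foreach (string line in DescribeChildren(div))
        {
            Console.WriteLine(line);
        }
    }
}
```

Note that ChildNodes returns text nodes alongside elements, so checking NodeType is a reasonable way to skip the ones you are not interested in.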

There are also methods for retrieving particular nodes relative to this one. For instance, the Ancestors method returns a collection of all ancestor nodes. And the SelectNodes method returns a collection of nodes that match a specified XPath expression.
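
To make these two methods a little more concrete, the sketch below walks up from a link using Ancestors and collects link targets using SelectNodes. It again assumes the HtmlAgilityPack NuGet package and an inline HTML string; RelativeNodesDemo, SampleHtml, AncestorNames, and LinkTargets are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class RelativeNodesDemo
{
    // Hypothetical sample markup with two links inside a <div>.
    public const string SampleHtml =
        "<html><body><div id=\"nav\"><a href=\"/one\">One</a> <a href=\"/two\">Two</a></div></body></html>";

    // Returns the names of all ancestors of a node, nearest ancestor first.
    public static List<string> AncestorNames(HtmlNode node)
    {
        var names = new List<string>();
        foreach (HtmlNode ancestor in node.Ancestors())
        {
            names.Add(ancestor.Name);
        }
        return names;
    }

    // Returns the href value of every <a> element that has one.
    public static List<string> LinkTargets(HtmlDocument document)
    {
        var targets = new List<string>();
        HtmlNodeCollection links = document.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return targets; // SelectNodes found nothing
        }
        foreach (HtmlNode link in links)
        {
            targets.Add(link.GetAttributeValue("href", string.Empty));
        }
        return targets;
    }

    public static void Main()
    {
        var document = new HtmlDocument();
        document.LoadHtml(SampleHtml);

        HtmlNode firstLink = document.DocumentNode.SelectSingleNode("//a");
        Console.WriteLine(string.Join(" > ", AncestorNames(firstLink)));
        Console.WriteLine(string.Join(", ", LinkTargets(document)));
    }
}
```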

Selecting Meta Tags from an HTML Page
[ example ] var metaTags = document.DocumentNode.SelectNodes("//meta");
If there are no <meta> tags in the document then, at this point, metaTags will be null. But if there are one or more <meta> tags then metaTags will be a collection of matching HtmlNode objects. We can enumerate these matching nodes and display their attributes.
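
One way the enumeration might look is sketched below. It assumes the HtmlAgilityPack NuGet package and, to stay runnable offline, parses an HTML string with LoadHtml instead of calling HtmlWeb.Load. MetaTagDemo, DescribeMetaTags, and the sample markup are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class MetaTagDemo
{
    // Returns one line per <meta> tag, listing its attributes as name="value" pairs.
    public static List<string> DescribeMetaTags(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var lines = new List<string>();
        HtmlNodeCollection metaTags = document.DocumentNode.SelectNodes("//meta");

        // Guard against the null result described above before enumerating.
        if (metaTags == null)
        {
            return lines;
        }

        foreach (HtmlNode tag in metaTags)
        {
            var parts = new List<string>();
            foreach (HtmlAttribute attribute in tag.Attributes)
            {
                parts.Add(attribute.Name + "=\"" + attribute.Value + "\"");
            }
            lines.Add(string.Join(" ", parts));
        }
        return lines;
    }

    public static void Main()
    {
        // Hypothetical sample page with two <meta> tags.
        string html =
            "<html><head><meta name=\"description\" content=\"Demo page\">" +
            "<meta charset=\"utf-8\"></head><body></body></html>";

        foreach (string line in DescribeMetaTags(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

The null check matters: enumerating metaTags without it would throw a NullReferenceException on any page that has no <meta> tags.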

You can click on this link for more information: Parsing HTML Documents with the Html Agility Pack - 4GuysFromRolla.com

Screen scraping is the process of programmatically accessing and processing information from an external website. For example, a price comparison website might screen scrape a variety of online retailers to build a database of products and what various retailers are selling them for. Typically, screen scraping is performed by mimicking the behavior of a browser - namely, by making an HTTP request from code and then parsing and analyzing the returned HTML.

The .NET Framework offers a variety of classes for accessing data from a remote website, including the WebClient class and the HttpWebRequest class. These classes are useful for making an HTTP request to a remote website and pulling down the markup from a particular URL, but they offer no assistance in parsing the returned HTML. Instead, developers commonly rely on string parsing methods like String.IndexOf and String.Substring, or on regular expressions.

Another option for parsing HTML documents is to use the Html Agility Pack, a free, open-source library designed to simplify reading from and writing to HTML documents. The Html Agility Pack constructs a Document Object Model (DOM) view of the HTML document being parsed. With a few lines of code, developers can walk through the DOM, moving from a node to its children, or vice versa. Also, the Html Agility Pack can return specific nodes in the DOM through the use of XPath expressions. (The Html Agility Pack also includes a class for downloading an HTML document from a remote website; this means you can both download and parse an external web page using the Html Agility Pack.)

This article shows how to get started using the Html Agility Pack and includes a number of real-world examples that illustrate this library's utility. A complete, working demo is available for download at the end of this article. Read on to learn more!
