Home Mail to Joachim Pimiskern Legal notice

MarkupToken

MarkupToken is a bundle of Delphi classes for parsing files written in markup languages such as HTML or XML. The main class, TTagParser, splits a file into tags and non-tags. You can use it, for example, to extract the text from HTML documents or to collect links for spidering the web.
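To illustrate what splitting into tags and non-tags means, here is a minimal, self-contained sketch. It does not use the MarkupToken units and is not how TTagParser is implemented; the procedure SplitIntoTagsAndText is a hypothetical stand-in. It assumes well-formed input and ignores comments, scripts, and '>' characters inside quoted attribute values, all of which a real parser has to handle.

```delphi
program SplitTagsSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

// Hypothetical stand-in, not MarkupToken code: prints each tag and each
// non-tag run of the input on its own line. Assumes well-formed input.
procedure SplitIntoTagsAndText(const Html: string);
var
  i: Integer;
  Piece: string;
begin
  Piece := '';
  for i := 1 to Length(Html) do
  begin
    if Html[i] = '<' then
    begin
      // A tag starts: flush whatever was collected before it.
      if Piece <> '' then
        WriteLn('non-tag: ', Piece);
      Piece := '<';
    end
    else if (Html[i] = '>') and (Piece <> '') and (Piece[1] = '<') then
    begin
      // The current tag is complete.
      WriteLn('tag:     ', Piece, '>');
      Piece := '';
    end
    else
      Piece := Piece + Html[i];
  end;
  // Flush trailing text after the last tag, if any.
  if Piece <> '' then
    WriteLn('non-tag: ', Piece);
end;

begin
  SplitIntoTagsAndText('<p>Hello <b>world</b></p>');
end.
```

For the sample input, the tags are <p>, <b>, </b>, and </p>, and the non-tag runs are "Hello " and "world". Keeping only the non-tag runs is the basis of text extraction.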

Status: Freeware; modify and use it at your own risk. I'd appreciate it if you left a comment in the source files about the original author, Joachim Pimiskern.

Download

Installation

Simply add the units martok and htmlparts to your uses clause. Nothing needs to be installed into the VCL.
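As a rough sketch, a unit that references the two units might look like this. The unit name MyParserUnit is hypothetical, and the sketch assumes the MarkupToken source files are in the project directory or on the compiler's search path.

```delphi
unit MyParserUnit;  // hypothetical unit name

interface

uses
  martok, htmlparts;  // the two MarkupToken units named above

implementation

end.
```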

Demo application

The demo contains a few typical use cases. To reuse one, copy the functions relevant to your purpose into your own program.

Description of the data types used

TCharStream: Reads a text file into memory and provides access to it through the function nextChar(), which yields a stream of integers: 0..255 for a character and -1 for end of file (EOF).
TMarkupTokenType: The kinds of larger tokens into which characters are aggregated. For example, "abc" is seen not as a double quote followed by 'a', 'b', 'c', and a closing double quote, but simply as a tt_string.
TMarkupTokenizer: Takes a TCharStream object as input and produces a stream of TMarkupTokenType tokens, accessible through the public variables TokenType and TokenString. For example, "abc" yields TokenType = tt_string and TokenString = abc.
TTagTokenType: The highest level of abstraction. A file in a markup language consists of tags, non-tags, and comments.
TTagToken: Holds the information for each occurrence of a TTagTokenType. Its key entries are TokenType (of type TTagTokenType), TokenString (the actual value), and Data (a hashtable). Data has the special entries tagname and tagtype. For example, <a href="http://www.google.de"> yields a Data hashtable with "tagname" -> "a", "tagtype" -> "begin", "href" -> "http://www.google.de". End tags have "tagtype" -> "end".
TTagParser: Takes a stream of TMarkupTokenType tokens as input and fills the list property Tokens with elements of type TTagToken. The usual way to work with it is to iterate over the list, access each item Tokens[i], and evaluate its TokenType and Data.
THtmlParts: A class that provides convenient access to parts of an HTML file such as the links, head, body, title, and meta tags.
TFileWithVariables: Reads a file as a TCharStream and translates every variable of the form $identifier. For each variable, the event OnTranslateVariable is triggered so that the application can replace the variable with other text. The function Expand() returns the expanded file.
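The $identifier replacement described for TFileWithVariables can be sketched in a self-contained way as follows. This is an illustration only, not the library's code: the names ExpandVariables, TTranslateFunc, and DemoTranslate are hypothetical. The real OnTranslateVariable event signature is not documented on this page, so a plain function pointer stands in for it, and the sketch assumes that identifiers consist of letters, digits, and underscores.

```delphi
program ExpandVariablesSketch;

{$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
{$APPTYPE CONSOLE}

type
  // Hypothetical callback type standing in for the OnTranslateVariable event.
  TTranslateFunc = function(const VarName: string): string;

// Assumption: identifiers consist of letters, digits, and underscores.
function IsIdentChar(C: Char): Boolean;
begin
  Result := ((C >= 'a') and (C <= 'z')) or ((C >= 'A') and (C <= 'Z')) or
            ((C >= '0') and (C <= '9')) or (C = '_');
end;

// Copies Source to the result, replacing each $identifier with whatever
// the Translate callback returns for that identifier.
function ExpandVariables(const Source: string; Translate: TTranslateFunc): string;
var
  i, Start: Integer;
begin
  Result := '';
  i := 1;
  while i <= Length(Source) do
  begin
    if (Source[i] = '$') and (i < Length(Source)) and
       IsIdentChar(Source[i + 1]) then
    begin
      // Scan to the end of the identifier that follows the '$'.
      Start := i + 1;
      i := Start;
      while (i <= Length(Source)) and IsIdentChar(Source[i]) do
        Inc(i);
      Result := Result + Translate(Copy(Source, Start, i - Start));
    end
    else
    begin
      Result := Result + Source[i];
      Inc(i);
    end;
  end;
end;

// Example callback: knows one variable and leaves all others unchanged.
function DemoTranslate(const VarName: string): string;
begin
  if VarName = 'user' then
    Result := 'world'
  else
    Result := '$' + VarName;
end;

begin
  WriteLn(ExpandVariables('Hello $user, see $page.', DemoTranslate));
end.
```

With this callback, $user is replaced by "world" while the unknown variable $page is passed through unchanged. Leaving the decision to a callback keeps the scanning logic separate from the application's replacement rules, which is the role the event plays in the description above.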
