GrunkLite is the name of a much simpler and smaller configuration format for Grunk, a context-aware parser. The full syntax, called TOSCA (for The One Syntax for the Configuration Analyzer), is a very general, heavyweight language. Needs are often much more modest, and while the full power of TOSCA is sometimes required, for smaller cases it is simply too much. TOSCA is, however, the only grammar that the grammar-understanding kernel can actually process, so behind the scenes an XSL transformation converts grunkLite into TOSCA. You do not need to be aware that this is happening unless you want to write your own transformation to replace grunkLite. In a nutshell, grunkLite supports the following:
Our first example simply splits a file into a sequence of lines. Ideally, grunkLite should follow the dictum that simple things are simple, so here is a complete example (lines.gl):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grunk SYSTEM "http://emerge.ncsa.uiuc.edu/grunk/grunklite.dtd" >
<grunk>
<field define="newLine" startsWith="^" omitMatch="false" exactMatch="false"/>
</grunk>
(Note that the location of the dtd points to the Emerge website.) This will take a file and return a collection of
newLine elements. Running it on
lines.txt
Hi Ho!
Hi Ho!
It's off to work we go!
yields
<root >
<newLine >
Hi Ho!
</newLine>
<newLine >
Hi Ho!
</newLine>
<newLine >
It's off to work we go!
</newLine>
</root>
Two points are worth noting. First, we set the omitMatch
attribute to false because we do not want to omit the start of each line.
Second, we set the exactMatch attribute to false. If it
were true, each token would have to be exactly a beginning-of-line
marker, i.e., just a blank line.
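To make the expected result of lines.gl concrete, here is a minimal Python sketch that produces the same kind of structure: one newLine element per input line under a root element. It is not how Grunk works internally. The function name split_lines and the use of Python's xml.etree.ElementTree module are assumptions made purely for illustration.

```python
import xml.etree.ElementTree as ET


def split_lines(text):
    """Wrap every line of text in a <newLine> element under <root>.

    Hypothetical helper: it only imitates the result of lines.gl,
    not Grunk's actual matching machinery.
    """
    root = ET.Element("root")
    for line in text.splitlines():
        ET.SubElement(root, "newLine").text = line
    return root


# The contents of lines.txt from the example above.
sample = "Hi Ho!\nHi Ho!\nIt's off to work we go!\n"
root = split_lines(sample)
for element in root:
    print(element.tag, repr(element.text))
```

Each line keeps its full text, which corresponds to leaving omitMatch set to false in the configuration.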
One of the more common text formats, often produced by exporting a database, consists of a sequence of lines, each of which represents a record in the database. The fields of each record are separated by a tab character. Building on our previous example, we will split the file up by lines and then split each record up. This is where grunkLite is very convenient, since one need do little more than list the target fields. Keeping with our minimalist approach, we have the following data (in tabdel.txt):
Smith	Bob	01/02/01
Jones	Joe	01/03/01
Schmidt	Georg	03/02/01
We will chop up each line into last name, first name and date fields. Moreover, we will chop up the dates into day, month and year fields. (This is in tabdel.gl.)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grunk SYSTEM "http://emerge.ncsa.uiuc.edu/grunk/grunklite.dtd" >
<grunk>
<field define="newLine" startsWith="^" omitMatch="false" exactMatch="false" split="\t">
<field define="lastName"/>
<field define="firstName"/>
<field define="date" split="/">
<field define="day"/>
<field define="month"/>
<field define="year"/>
</field>
</field>
</grunk>
Note that since we do not specify the order in which fields are to be
taken, the default of strict is assumed. This is correct, since we know how regular
the data is. The split attribute just lists the tab character. Grunk
splits the file into lines, then looks for tabs inside each line. Once those have been
found, the results are distributed to the sub-fields. It is in this way that grunk uncovers the structure
in an input source.
The full output you can see in
tabdel.out.
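The split-then-distribute idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names distribute and parse_tabdel are hypothetical, strict ordering is modeled as assigning the pieces positionally to the sub-fields in their declared order, and rejecting a record with the wrong number of pieces is an extra assumption. The element layout is likewise a guess and may differ from what tabdel.out actually contains.

```python
import xml.etree.ElementTree as ET


def distribute(parent, text, separator, names):
    """Split text on separator and hand the pieces, in order, to child elements.

    Hypothetical helper. Raising on a count mismatch is an assumption about
    what "strict" should mean here, not documented Grunk behavior.
    """
    pieces = text.split(separator)
    if len(pieces) != len(names):
        raise ValueError(f"expected {len(names)} fields, got {len(pieces)}")
    children = {}
    for name, piece in zip(names, pieces):
        child = ET.SubElement(parent, name)
        child.text = piece
        children[name] = child
    return children


def parse_tabdel(text):
    """Split text into lines, each line on tabs, and each date on slashes."""
    root = ET.Element("root")
    for line in text.splitlines():
        record = ET.SubElement(root, "newLine")
        fields = distribute(record, line, "\t", ["lastName", "firstName", "date"])
        date = fields["date"]
        # Move the raw date text into day/month/year sub-fields.
        value, date.text = date.text, None
        distribute(date, value, "/", ["day", "month", "year"])
    return root


# The contents of tabdel.txt, with explicit tab characters.
sample = "Smith\tBob\t01/02/01\nJones\tJoe\t01/03/01\nSchmidt\tGeorg\t03/02/01\n"
root = parse_tabdel(sample)
print(ET.tostring(root, encoding="unicode"))
```

The nesting of the calls mirrors the nesting of the field elements in tabdel.gl: the outer split feeds the inner one.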
We will take one of the examples from elsewhere and amplify it. Our original information is
http://www.abc.com/index.html 12/01/1999
http://www.abc.com/welcome.html 03/05/1999
http://www.abc.com/support/about-us/contact.html 11/11/2001
and we want the urls split. This will be done in two parts to keep it clean: we will get the head (here just
http: but this might vary) and then the rest of it chopped up into fields. This could be
extremely useful for building a database of web pages, directories and such.
Our first pass breaks each line into an entry holding a url and a date. We split the url at the double slash, //, to separate
the head from everything else, which goes into the tail. It is natural to break the tail using a single slash, /. The configuration is then
(url_ex2.gl)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE grunk SYSTEM "http://emerge.ncsa.uiuc.edu/grunk/grunklite.dtd" >
<grunk>
<field define="entry" startsWith="^" childOrder="strict" exactMatch="false" split="\s+">
<field define="url" split="//">
<field define="head"/>
<field define="tail" split="/">
<field define="field"/>
</field>
</field>
<field define="date" split="/" childOrder="strict">
<field define="day"/>
<field define="month"/>
<field define="year"/>
</field>
</field>
</grunk>
The output is in
url_ex2.out. While it is a bit lengthy, it allows very fine
control over the contents, a tremendous improvement over a plain text file.
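As a rough picture of the nesting that url_ex2.gl describes, the following Python sketch applies the same sequence of splits to one line of input: whitespace first, then // for the url, / for the tail, and / for the date. The function name parse_entry is hypothetical, the element layout is an assumption that may not match url_ex2.out, and the sketch says nothing about how Grunk performs the matching itself.

```python
import re
import xml.etree.ElementTree as ET


def parse_entry(line):
    """Build an <entry> element from one 'url date' line.

    Hypothetical helper that mirrors the split attributes in url_ex2.gl.
    """
    # split="\s+" on the entry: separate the url from the date.
    url_text, date_text = re.split(r"\s+", line.strip())
    entry = ET.Element("entry")

    # split="//" on the url: the head (e.g. "http:") and the tail.
    url = ET.SubElement(entry, "url")
    head_text, tail_text = url_text.split("//", 1)
    ET.SubElement(url, "head").text = head_text

    # split="/" on the tail: one <field> per path component.
    tail = ET.SubElement(url, "tail")
    for part in tail_text.split("/"):
        ET.SubElement(tail, "field").text = part

    # split="/" on the date: pieces assigned to day, month, year in order.
    date = ET.SubElement(entry, "date")
    for name, part in zip(("day", "month", "year"), date_text.split("/")):
        ET.SubElement(date, name).text = part
    return entry


entry = parse_entry("http://www.abc.com/support/about-us/contact.html 11/11/2001")
print(ET.tostring(entry, encoding="unicode"))
```

Because the tail repeats a single field definition, a url with more or fewer path components simply yields more or fewer field elements.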