Here is a table of contents for this tutorial.
A Few Short Examples: This section contains the basic list of examples.
The first place to start is with a description of what problems grunk solves. Grunk stands for grammatical understanding kernel which points to the fact that it processes a grammar to parse text and manages the details of the parsing too (hence it is a kernel).
Information is called data and information about data is called metadata. Information about metadata is meta-metadata and so on, until your eyes glaze over. There can be many types of metadata for a document. A text document's formatting requires metadata for a word processor, while its content requires something different if it is to be indexed by a database and maybe something else entirely if it is source code for a computer program. Many different approaches have been tried for managing metadata and how it relates to the data. One of the most natural for text is to use markup, which is just putting the metadata into the data in such a fashion that you can (hopefully) separate the two. For about as long as people have been using computers, they have been marking up their text, sometimes not as well as other times, but the idea has certainly been there.
One other facet that bears mention is a commonly used way to organize text, namely via a hierarchy. This is a series in which each element is graded. People often tend to think hierarchically, putting text into chapters, sections, subsections, paragraphs and sentences. This also happens quite frequently in data, where a record contains fields and those in turn can contain other fields (such as a date containing the day, month and year).
XML (extensible markup language) is a way to mark up text in a very regular fashion. We like XML. Much of its power derives from this regularity, allowing one to transform it from one form to another. The early tremendous success of HTML and its myriad dialects showed how useful the idea was. As with many early markup schemes, the problem with HTML is that it allows far too much latitude in its syntax to be completely regular. In microcosm, many of the problems of dealing with older markup are seen in HTML today. Attempts to regularize the syntax and make it more XML-like founder on the vast amount of legacy HTML that doesn't follow the rules, as well as the fact that many vendors have their own proprietary extensions which they are loath to give up.
The problem that motivated grunk's creation was this: Given some other, not particularly regular, more or less hierarchical markup scheme, can we extract the information and regularize it, say into XML? This would be a tremendous benefit to people who have legacy data. Grunk has some extremely powerful ways to recognize and manipulate text, so it is likely that you can take your text and grunk it. Grunk is not a cure-all and there are things it cannot do. Of course, you say, does the world really need another parser? There are scads of them available. This is true, but they do not treat the hierarchical nature of the text with anything nearing the flexibility or power of grunk. Moreover, being configuration driven, grunk can be re-used in a variety of contexts and situations that other parsers would be hard put to handle. Grunk comes equipped with a complete language for specifying hierarchies of a data source and has built-in Perl 5 compliant regular expressions to help it recognize parts of text.
To briefly delve into the innards of grunk, we note that one common type of parser is an LL parser. This looks from the left-hand side of the input line (this is the first "L") and then only takes the left-most token (the second "L") as the valid one. Consider the data
record! title! I just love to record!
In this case it is clear that this ought to parse as
record
|
+--title(I just love to record!)
Most parsers would not be able to sort out that the second record! delimits nothing.
Since grunk is context aware, it would know enough to ignore the second instance. Grunk always works
from the left-hand side of the line and takes the left-most token, so since it is context-aware, it
should properly be termed a CALL parser, for Context-Aware LL parser.
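To make the idea of context awareness concrete, here is a minimal Python sketch. It is not grunk's algorithm: the function name parse_record and the rule that a title runs to the end of the line are assumptions made purely for illustration. The point is only that once the parser is inside the title, record! is no longer a delimiter it is looking for, so it is kept as content.

```python
def parse_record(line):
    """Parse 'record! title! ...' using the current context to pick valid delimiters.

    Hypothetical illustration only; grunk's real parser is configuration driven.
    """
    # Context 1: at the top level, only 'record!' is a valid delimiter.
    if not line.startswith("record!"):
        raise ValueError("expected 'record!' at the start of the line")
    rest = line[len("record!"):].lstrip()

    # Context 2: inside a record, only 'title!' is a valid delimiter.
    if not rest.startswith("title!"):
        raise ValueError("expected 'title!' inside the record")
    content = rest[len("title!"):].strip()

    # Context 3: inside a title nothing is a delimiter (an assumption),
    # so a later 'record!' is plain content and is not split off.
    return {"record": {"title": content}}


print(parse_record("record! title! I just love to record!"))
```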
A couple of brief examples would probably help to move us into the nitty-gritty of the examples. Text may be marked up in three broad ways. In order of severity of syntax, these are structured text, semi-structured text and free or unstructured text.
One final comment before we move on. This tutorial is graded. This means that each example builds upon the preceding example, so please do them in order. They are not designed to be hard and the whole tutorial should take no more than an hour and a half from start to finish.
TITLE     On the Dynamics of An Asteroid
AUTHOR    Prof. J. Moriarty
          E. Q. Dutton
YEAR      1872
PUBLISHER Underworld Publishers, Inc.
Here each record is set off by a keyword that occupies columns 1 through 10 of the line. This type of field structuring was much beloved in the days when punchcards and mainframes ruled the IT field.
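As a rough illustration of why such fixed-column text is easy for a program to handle, here is a short Python sketch (not grunk itself). The function name parse_fixed_columns is hypothetical, and it assumes that a line whose first 10 columns are blank continues the previous field, which is how the second author is treated below.

```python
SAMPLE = """\
TITLE     On the Dynamics of An Asteroid
AUTHOR    Prof. J. Moriarty
          E. Q. Dutton
YEAR      1872
PUBLISHER Underworld Publishers, Inc.
"""


def parse_fixed_columns(text, width=10):
    """Return a list of (keyword, [values]) pairs from fixed-column text."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        keyword = line[:width].strip()   # columns 1 through 10
        value = line[width:].strip()     # everything after the keyword columns
        if keyword:
            fields.append((keyword, [value]))
        elif fields:
            # Assumption: a blank keyword column continues the previous field.
            fields[-1][1].append(value)
        else:
            raise ValueError("continuation line before any keyword")
    return fields


for keyword, values in parse_fixed_columns(SAMPLE):
    print(keyword, values)
```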
On the Dynamics of An Asteroid.::Prof. J. Moriarty; E. Q. Dutton [Underworld Publishers Inc. (1872)]
If we know that this starts with the title, is followed by the authors and the publication information, it makes sense that we should be able to extract the useful information for this and return it.
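A sketch of that extraction in Python follows. It is only an illustration under stated assumptions: the pattern presumes that ".::" ends the title, that ";" separates authors, and that the publisher and a four-digit year sit inside the trailing square brackets. The names CITATION and parse_citation are made up for this example.

```python
import re

# Assumed layout: Title.::Author; Author [Publisher (Year)]
CITATION = re.compile(
    r"^(?P<title>.+?)\.::"
    r"(?P<authors>.+?)\s*"
    r"\[(?P<publisher>.+?)\s*\((?P<year>\d{4})\)\]$"
)


def parse_citation(line):
    """Split one semi-structured citation line into its parts, or return None."""
    match = CITATION.match(line.strip())
    if match is None:
        return None
    return {
        "title": match.group("title"),
        "authors": [a.strip() for a in match.group("authors").split(";")],
        "publisher": match.group("publisher"),
        "year": match.group("year"),
    }


line = ("On the Dynamics of An Asteroid.::Prof. J. Moriarty; E. Q. Dutton "
        "[Underworld Publishers Inc. (1872)]")
print(parse_citation(line))
```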
This is text that, while it might contain information we as humans can process, does not have structures firm enough for a computer to latch onto. For example:
Ummm, I think -- was it that fellow Moriarty that wrote it? The one who gave Holmes all of those problems? He wrote a book on asteroids, their dynamics, yes, that was it. It was on the dynamics of asteroids. Oh, and he usually worked with Dutton, the one over at Oxford. It was put out by Underworld, I think shortly after the sixth attempt on Queen Victoria's life.
This is amusing, but there is a truffle of truth buried here for text parsing: Humans are often able to pull out the information from a piece of text like this because of external references and associations. These are extrinsic to the text and we can make sense of it because we can put into context many different bits of information. All text parsing (at least at the dawn of the twenty-first century) relies on intrinsic markers, that is, strings and sequences of characters in the text that allow very narrow variation. There are possibly some artificial intelligence expert systems on the horizon that someday might be able to do free text, at least partially, but these will be a long time in being generally usable. Certain expert systems do exist, but these consist of finely tweaked thesauri and apply only to extremely narrow sets of text. The real problem is that these systems must actually understand (no, I don't know what that word really means) the content to parse it. This is a very hard task.
The main steps to writing a parser are pretty standard, these consist of
When this is done, one needs to compile these parts and then run them on the source. There are many tools to help with parts of this, such as unix's yacc or lex. Grunk is a generalized tool for doing all of this in one fell swoop. The clever part of grunk is to realize that hierarchical grammars can be easily expressed in XML, therefore by using XML to write the configuration file, we are actually specifying a grammar for our source. Grunk then does all of the above steps and by default hands back an XML document.
Grunk may be invoked in one of two ways, either from the command line, or as a processing routine from a program. We will be concerned with invoking grunk from the command line here, rather than using it as a parsing library. This allows you to run several examples in quick succession. This is also a really good way to test a configuration file for grunk. You will need to have the latest jar for grunk, called grunk.jar. There are several supporting classes it needs; see the download page for details. You will also need to have Java installed on your system.
Here is a typical command line for grunk.
java ncsa.emerge.grunk.Grunk -c configuration -i input
This assumes that your classpath contains grunk.jar and its supporting classes.
The command line arguments require switches. To see all available switches, invoke grunk with
the argument --help. You should get something like
Here is a quick summary of the options for this program
short form     long form
------------------------------
*a             *about
c              config
i              in
n              noOutput
o              out
s              syntax
xsl            xsl
------------------------------
To get help on a topic, enter >>--help (short or long form)<<
For example, >>--help s<<
Those that have a * next to them contain useful information, but do not affect
the operation of the program.
Here is what the switches do
switch      | argument   | effect
------------+------------+--------------------------------------------------
c, config   | file name  | This is the XML file conforming to grunk.dtd that
            |            | contains the configuration for grunk.
------------+------------+--------------------------------------------------
i, in       | file name  | The text file for grunk to parse.
------------+------------+--------------------------------------------------
n, noOutput | -          | Suppresses all output. This is useful for
            |            | debugging, for example.
------------+------------+--------------------------------------------------
o, out      | file name  | Name of a file that will receive the output. If
            |            | this is not specified, all output goes to the
            |            | console.
------------+------------+--------------------------------------------------
s, syntax   | tosca,     | The type of syntax in the configuration file to
            | grunklite, | expect. "tosca" refers to the full syntax
            | custom     | while "grunklite" is a much simplified version,
            |            | incorporating the most widely-used features only.
            |            | Custom implies the xsl switch.
------------+------------+--------------------------------------------------
xsl         | file name  | The name of a file containing an XSL trans-
            |            | formation. When applied to the file named
            |            | with the -config switch this
            |            | will return a valid TOSCA configuration file.
            |            | Grunk will carry out the transformation on the
            |            | files specified.
------------+------------+--------------------------------------------------
In our example, configuration is the name of the configuration file
and input is the name of the file with text to parse.
You may use the short or long form of the switch, so the command line could also
have been entered as
java ncsa.emerge.grunk.Grunk -config configuration -in input
Order is not important, so something like
java ncsa.emerge.grunk.Grunk -i input -config configuration
is fine too. Invoked from the command line, grunk will print out the entire contents after running and it will look like
<root>
  <fieldname>
    ...
  </fieldname>
</root>
If you have a configuration file in grunklite syntax, say it is named
myGLConfig.gl you would use the -s switch:
java ncsa.emerge.grunk.Grunk -i input -c myGLConfig.gl -s grunklite
If you want to write your own XSL transformation into the full grunk format, you can even
have grunk invoke the transformation with the xsl switch. For example if you
have a transformation myXSLT.xsl to be used on your configuration file
myConfig.xml you would need to issue
java ncsa.emerge.grunk.Grunk -i input -config myConfig.xml -s custom -xsl myXSLT.xsl
If you omit the xsl switch while using the custom option,
grunk will issue a warning and exit.
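If you find yourself typing these command lines often, a small helper can assemble them. The following Python sketch is only a convenience wrapper and is not part of grunk; the function name grunk_command is hypothetical. It uses only the switches described above (-c, -i, -s, -xsl and -o) and mirrors the rule that the custom syntax needs the xsl switch.

```python
def grunk_command(config, input_file, syntax=None, xsl=None, out=None):
    """Build the argument list for a grunk run (hypothetical helper)."""
    cmd = ["java", "ncsa.emerge.grunk.Grunk", "-c", config, "-i", input_file]
    if syntax is not None:
        if syntax not in ("tosca", "grunklite", "custom"):
            raise ValueError("syntax must be tosca, grunklite or custom")
        cmd += ["-s", syntax]
    # The custom syntax implies the xsl switch, so refuse to build without it.
    if syntax == "custom" and xsl is None:
        raise ValueError("the custom syntax requires an XSL transformation file")
    if xsl is not None:
        cmd += ["-xsl", xsl]
    if out is not None:
        cmd += ["-o", out]
    return cmd


print(" ".join(grunk_command("myGLConfig.gl", "input", syntax="grunklite")))
```

The resulting list can be handed to subprocess.run, provided your classpath contains grunk.jar and its supporting classes.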
In the case that there is a problem, grunk will always try to print out whatever it parsed, plus some (hopefully) informative message. Realize that the most frequent source of an abrupt exit is that the configuration file is at variance with the data source. If grunk complains that there are no recognizable structures found, this is possibly because within the context given there were no matches.
To get a feel for what grunk can do, we will make a couple of really simple examples. These are to get you used to making configuration files and seeing where things go rather than letting you grunk the entire contents of the Library of Congress card catalog. All things in good time!
Here is our first bit of text to parse. It is in the example named small.txt.
Record:
Small is beautiful.
This will show us all the basic ingredients for starting and running grunk. Here is the sample configuration file for our example. It is in the example called small1.grk. (A grunklite version is also available.)
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE structure SYSTEM "http://emerge.ncsa.uiuc.edu/grunk/2002/April/dtd/grunk.dtd" >
<structure>
  <children>
    <use name="/elements/record"/>
  </children>
  <definitions name="elements">
    <define name="record">
      <field>
        <delimiters>
          <bookend>
            <marker regex="^Record:$"/>
          </bookend>
        </delimiters>
      </field>
    </define>
  </definitions>
</structure>
To invoke this example, you need to issue a command like
java ncsa.emerge.grunk.Grunk -c small1.grk -i small.txt
It should output the following
<root>
  <record>
    Small is beautiful.
  </record>
</root>
Now for a discussion of the example configuration. Grunk needs to have a list of fields and it needs to know which fields it will encounter first. These are put into the structure element in the configuration. At a high level it looks like
structure
|
+--children
|
+--definitions
We will start by looking at the
definitions first. This is plural because it is a list of
definitions. It quickly becomes difficult to navigate a large
collection of fields, so what we do is name each definitions section
and allow these to nest. In this case we cleverly call our list
elements. All references to anything in the set are to start with
/elements. If you think this looks like the directory
structure on a computer, you are exactly right.
Inside of this is a single definition
for the field. The define requires the name of the thing being
defined, here record. By
default this appears as the tag name (this can be set, but we're
trying to keep it simple). Inside of the definition goes the field.
The field needs to have a list of things that mark it. Here the
contents are preceded by the single line with the text Record:
on it. The regular expression for this would be ^Record:$
(recall that the ^ denotes the start of a line and the $ the
end of a line for regular expressions). Here is one little feature
that you should understand: Grunk is designed to digest pretty messy
input, which can mean that there are several ways of marking a field,
some of which might not be quite compatible with each other. The best
way to get around this is to supply a list of identifiers, which we
call delimiters. The first
type of delimiter is a bookend.
This means that the field will have something that shows where it
starts and something that shows where it ends (one of these might be
trivial). In our case, it starts with Record:
and has no end. So, we define the delimiter list, a bookend and then
inside of that give a marker.
This has the regular expression in it that marks the field. This
accounts for the field definition.
The other element
in the structure, children, consists of a use
statement. This gives the name of the thing to be used. That is all
there is to it. If you have a template, you need not do more than
sketch in the regular expression and add enough fields to cover all
elements in your source.
What if we want to run grunk with the following input (in small2.txt )?
Record:
Small is beautiful.
Record:
Large ain't.
As it stands now, we have only told it to expect one record. This can be easily fixed. In the configuration file, small2.grk, change the line for the use statement to read
<children>
  <use name="/elements/record" maxReps="infinity"/>
</children>
(A grunklite version is also available.)
This tells grunk that it should expect more of these. The attribute
maxReps tells grunk the
maximum number to expect. It may be any number greater than or equal
to 0 (if it's zero grunk will refuse to process anything, of course
and you will get a configuration error if it finds the field) or the
special value of infinity.
This value means that grunk will always be ready to accept a new one.
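To see what a bookend marker and maxReps amount to, here is a small Python emulation. It is a sketch of the idea and not grunk's implementation: the function name read_records and the max_reps parameter are invented for the illustration, and it assumes that lines appearing before the first marker are ignored.

```python
import re


def read_records(lines, marker=r"^Record:$", max_reps=float("inf")):
    """Collect the content that follows each line matching the marker."""
    start = re.compile(marker)
    records = []
    current = None
    for line in lines:
        if start.match(line):
            # A new field starts here; refuse it if the limit is reached.
            if len(records) >= max_reps:
                raise ValueError("found more records than max_reps allows")
            current = []
            records.append(current)
        elif current is not None:
            current.append(line)
        # Assumption: lines before the first marker are silently ignored.
    return [" ".join(content) for content in records]


text = ["Record:", "Small is beautiful.", "Record:", "Large ain't."]
print(read_records(text))
```

With max_reps=1 the second Record: line raises an error, which is the analogue of what happens when small2.txt is run against the first configuration.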
What about having subfields? Let us say we have (see small3.txt for the input file.)
Record:
Text:
Small is beautiful.
Author:
F. Klein
Date:
25.03.01
We would like to have a record with
three subfields in it. We must define the other fields and add them
as children to the Record:
field. Our definitions section looks like this now.
<define name="record">
  <field>
    <delimiters>
      <bookend>
        <marker exactMatch="true" regex="^Record:$"/>
      </bookend>
    </delimiters>
  </field>
</define>
<define name="text">
  <field>
    <delimiters>
      <bookend>
        <marker exactMatch="true" regex="^Text:$"/>
      </bookend>
    </delimiters>
  </field>
</define>
<define name="author">
  <field>
    <delimiters>
      <bookend>
        <marker exactMatch="true" regex="^Author:$"/>
      </bookend>
    </delimiters>
  </field>
</define>
<define name="date">
  <field>
    <delimiters>
      <bookend>
        <marker exactMatch="true" regex="^Date:$"/>
      </bookend>
    </delimiters>
  </field>
</define>
In order to keep this a little cleaner, we will make another
definitions element under our first one, just for all of our
fields. From a higher level we have
<definitions name="elements">
  <definitions name="fields">
    ... all of our definition elements...
  </definitions>
</definitions>
Now references will be prefixed with
/elements/fields, e.g.,
/elements/fields/record.
We could just make a children
element for the record, but this is a great place to see how to use a
group. A group is just a
list of elements that are used. The definition for one follows the
pattern for a field: A define statement that contains it.
Any element that may be used can be included, including
other groups. We will later see that at each level of usage we can
dictate how the elements interact, giving an enormous amount of
flexibility. In our case, we will have
<define name="contents">
  <group>
    <use name="/elements/fields/text"/>
    <use name="/elements/fields/author"/>
    <use name="/elements/fields/date"/>
  </group>
</define>
In the /elements/fields/record field, we just use this as the
children element. The complete definition for this field reads (see
small3.grk
for the complete configuration file; a grunklite version is also
available)
<define name="record">
  <field>
    <delimiters>
      <bookend>
        <marker exactMatch="true" regex="^Record:$"/>
      </bookend>
    </delimiters>
    <children>
      <use name="/elements/groups/contents"/>
    </children>
  </field>
</define>
When we run it, we get
<root>
  <record>
    <text>
      Small is beautiful.
    </text>
    <author>
      F. Klein
    </author>
    <date>
      25.03.01
    </date>
  </record>
</root>
Let's be giddy and run grunk against (see small4.txt)
Record:
Author:
F. Klein
Text:
Small is beautiful.
Date:
25.03.01
Here the author and text fields are swapped. We get an ominous
sounding Format Error. What gives? It can be the case that data
sources are corrupt. As a safeguard (it has several, actually), grunk
will assume that the order given in a group must be slavishly
followed. You must explicitly tell it not to follow order. This is
the meaning of the order
attribute. To let grunk look for any field at the current level
you would merely need to change the group's definition to read (see
small4.grk)
<group order="free">
  <use name="/elements/fields/text"/>
  <use name="/elements/fields/author"/>
  <use name="/elements/fields/date"/>
</group>
(A grunklite version is also available.)
This is the only change from small3.grk to small4.grk. Now it runs fine. This is an important distinction, albeit small, so it bears special mention.
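The difference between the two behaviours can be sketched in a few lines of Python. This is an illustration of the idea only, not how grunk checks order internally; the function name order_ok and its free flag are hypothetical, and the default check assumes that fields may be missing but may never appear out of sequence.

```python
def order_ok(expected, found, free=False):
    """Return True if the field names in 'found' are acceptable for a group."""
    if free:
        # Free order: any field of the group may turn up at any time.
        return all(name in expected for name in found)
    # Default: fields must appear in the order in which the group lists them.
    position = 0
    for name in found:
        while position < len(expected) and expected[position] != name:
            position += 1
        if position == len(expected):
            return False  # this field came too late, or is unknown
        position += 1
    return True


group = ["text", "author", "date"]
print(order_ok(group, ["text", "author", "date"]))             # small3.txt
print(order_ok(group, ["author", "text", "date"]))             # small4.txt, default
print(order_ok(group, ["author", "text", "date"], free=True))  # small4.txt, free
```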
Our previous examples have been pretty mundane. Now we will do one
that lets grunk flex its muscles. Notice that in the previous
example, the date looks a little weird (to American eyes). This is
because the format is German. That means that the format is dd.mm.yy
versus mm/dd/yy. We will have grunk fix this using a
transformation.
What is a transformation? It is a sequence of commands that will
carry out a replacement in the contents of a field, split the field
or call up another transformation. Moreover, these allow for looping
at several levels. This gives grunk an outlandishly powerful text
processing capability. We will do a simple replacement. To make a
transformation, you must put it into the field's definition. Here is
the transformation for the date field.
<transformation>
  <transformationSequence>
    <replace regex="(\d+)\.(\d+)\.(\d+)"
             replaceTemplate="$2/$1/$3"/>
  </transformationSequence>
</transformation>
This is placed in the definition for the date field, right after the
delimiters. See the file small5.grk for the configuration (a grunklite
version is also available). For completeness, we have put the text in
small5.txt even though it is the same as small4.txt.
The regular expression (regex)
says to look for digits separated by periods. The
replaceTemplate states that
the second and first are swapped and the last stays put, all of them
being separated by slashes now. The output is
<root>
  <record>
    <text>
      Small is beautiful.
    </text>
    <author>
      F. Klein
    </author>
    <date>
      03/25/01
    </date>
  </record>
</root>
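If you want to try the replacement outside of grunk, the same swap can be reproduced with Python's re module. This is only an equivalent sketch: Python writes the group references as \2/\1/\3 where the replaceTemplate uses $2/$1/$3, the periods are escaped so that they match literal dots, and the function name german_to_american is made up for the example.

```python
import re


def german_to_american(date_text):
    """Rewrite dd.mm.yy as mm/dd/yy by swapping the first two groups."""
    return re.sub(r"(\d+)\.(\d+)\.(\d+)", r"\2/\1/\3", date_text)


print(german_to_american("25.03.01"))  # 03/25/01
```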
What about cases where we need to split up fields and then split the results again? This is an extremely useful feature. Once one has an XML document, it is straightforward to restructure it using XSL; however, recognizing structures in the data source, especially of a hierarchical nature, can be difficult. Grunk is able to help. For example, suppose we add a set of names in a field for editors to our last example, by appending (see small7.txt for the full text)
Editors:
Thom:01/02/01
Dique:02/01/01
Harri:02/03/01
Here each line is an entry consisting of a name and a date. We would like to split this up into fields consisting of the name and the day, month and year information. This is accomplished in grunk by adding successive transformations to the fields. The resulting fields should look like
editors
|
+--editor
|  |
|  +--name
|  |
|  +--date
|     |
|     +--day
|     |
|     +--month
|     |
|     +--year
|
+--editor (etc)
The steps to do this are pretty simple. The full configuration is in small7.grk.
First, split the Editors: field by using end of line markers (\n).
The result of this should be sent to an editor field.
We need to tell grunk to keep the end of line markers so that we may use them as
field separators. This is accomplished by setting
transformFieldAsSingleLine="true" and
transformFieldAsSingleLineKeepCR="true". The code for the transformation
is therefore
<transformation transformFieldAsSingleLine="true"
                transformFieldAsSingleLineKeepCR="true">
  <transformationSequence>
    <split regex="\n"/>
  </transformationSequence>
  <transformationTarget>
    <use maxReps="infinity" name="/elements/targets/editor"/>
  </transformationTarget>
</transformation>
Next, split each editor field at the colon. This is done in the editor field,
the definition of which is
<define name="editor">
  <field>
    <transformation>
      <transformationSequence>
        <split regex=":"/>
      </transformationSequence>
      <transformationTarget>
        <use name="/elements/targets/editor group"/>
      </transformationTarget>
    </transformation>
  </field>
</define>
Since we have multiple target fields, we put them in a group, called editor group.
This is good practice at organizing things. This group contains the name
and date fields. Only the date field requires more splitting.
Finally, to split the date field into day, month
and year, we use the separator /. The three target
fields we put into the date group. The date field
is defined as
<define name="date">
  <field>
    <transformation>
      <transformationSequence>
        <split regex="/"/>
      </transformationSequence>
      <transformationTarget>
        <use name="/elements/targets/date group"/>
      </transformationTarget>
    </transformation>
  </field>
</define>
In this way the data source may be reshaped to virtually any level of detail. Here is a portion of the output, showing what an editor record looks like:
<editors>
  <editor>
    <name>
      Thom
    </name>
    <date>
      <day>
        02
      </day>
      <month>
        01
      </month>
      <year>
        01
      </year>
    </date>
  </editor>
  ... etc., more editor fields
</editors>
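The chain of three splits can be mimicked in Python to see how the pieces flow from one level to the next. This is a sketch of the idea rather than of grunk itself, and the function name split_editors is hypothetical. The date is returned as a plain list of parts, since which part lands in which subfield depends on how the date group is defined in the configuration.

```python
def split_editors(field_text):
    """Split an Editors: field by line, then by ':', then by '/'."""
    editors = []
    for line in field_text.split("\n"):        # first split: one editor per line
        line = line.strip()
        if not line:
            continue
        name, date = line.split(":", 1)        # second split: name and date
        editors.append({
            "name": name,
            "date": date.split("/"),           # third split: the date parts
        })
    return editors


content = "Thom:01/02/01\nDique:02/01/01\nHarri:02/03/01"
for editor in split_editors(content):
    print(editor)
```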
One situation that occurs is that we have to split up a field. Consider the following text.
//RECORD Authors= Smith, R.; Doe, J.; Schmitt, G. Title= Magnetic Monopoles for Fun and Profit. Abstract=Monopoles made simple. Things to do with your monopole on a rainy day. Care and feeding of monopoles. isbn =1-123-12345-6 //RECORD aUtHoR =Thudpucker, J. ISBN = 0-987-65432-1 TITLE =I was a Doonesbury character in the 1960's. ABSTRACt= Jimmy Thudpuckder explains what it was like to be trapped in a comic strip.
We really want to return this as XML and split up the authors field
into a list. Here are a few points to keep in mind.
First, the data is not clean, i.e., there is some variation between the
records and what they do. This is often the case with older databases
where input was not carefully controlled. Regular expressions will
handily give us matches. (Note that one record is labelled Authors
and another is labelled AUTHOR. Grunk will be used here to
normalize both to authors.) Secondly, we need case
insensitive matches. The simplest way to effect this is by using
character classes in the regular expressions, so to check "title"
one would write [Tt][Ii][Tt][Ll][Ee].
One must be careful, since a regular expression engine may
implement case insensitive matching by enumerating all of the
possible cases (like "title", "Title", "tItle",
...). In this case, with 5 letters this is 2^5 = 32
possible trials. Since we really don't restrict the usage of regular
expressions, it is possible to make grunk jump through hoops.
Finally, we need to tell grunk in its configuration that there might be more than
one delimiter per line. This is done in the opening structure statement:
<structure oneMatchPerLine="false">. The default is
not to recirculate lines through the parser.
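Writing such character classes by hand gets tedious, so here is a small Python helper that generates them, together with the count of case variants mentioned above. The function name case_insensitive_class is hypothetical and the helper is not part of grunk; the resulting pattern can be pasted into a marker's regex attribute.

```python
import re


def case_insensitive_class(word):
    """Turn 'title' into '[Tt][Ii][Tt][Ll][Ee]'."""
    parts = []
    for ch in word:
        if ch.isalpha():
            parts.append("[" + ch.upper() + ch.lower() + "]")
        else:
            parts.append(re.escape(ch))  # keep other characters literal
    return "".join(parts)


pattern = case_insensitive_class("title")
print(pattern)              # [Tt][Ii][Tt][Ll][Ee]
print(2 ** len("title"))    # 32 upper/lower case variants of a 5-letter word
```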
How to split up a field. We need to
define a splitter. This
contains the information needed to identify where the content should
be divided. In the case above, this is precisely where the semicolons
occur. The transformation is
<transformation>
  <transformationSequence>
    <split regex=";"/>
  </transformationSequence>
  <transformationTarget>
    <use maxReps="infinity"
         name="/Definitions/Fields/author"/>
  </transformationTarget>
</transformation>
The complete definition of the field
author is
<define name="author">
  <field/>
</define>
Since there is no list of delimiters,
this field matches every token given to it, and each token becomes
its content. We could, of course, define delimiters and
transformations for these as well, but since there are no solid
structures here to hold onto, this would really do little more than
eat up computer time. The complete configuration file is in
small6.grk and the text is in small6.txt. A grunklite version
is also available. Note that we have
to invoke the recycleLines option.
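For comparison, here is how the same normalization and split might look in a few lines of Python. It is a sketch under simplifying assumptions and not a rendering of small6.grk: it presumes the author label and its content have already been isolated on one line, and the names AUTHOR_LABEL and authors_from_line are invented for the example.

```python
import re

# Matches "Authors=", "aUtHoR =" and similar, ignoring case and stray spaces.
AUTHOR_LABEL = re.compile(r"^\s*authors?\s*=\s*(.*)$", re.IGNORECASE)


def authors_from_line(line):
    """Return the list of authors on a labelled line, or None if it is not one."""
    match = AUTHOR_LABEL.match(line)
    if match is None:
        return None
    # Split the content at the semicolons, as the split transformation does.
    return [name.strip() for name in match.group(1).split(";") if name.strip()]


print(authors_from_line("Authors= Smith, R.; Doe, J.; Schmitt, G."))
print(authors_from_line("aUtHoR =Thudpucker, J."))
```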
The OMIM (Online Mendelian Inheritance in Man) database contains a wealth of information on just about every conceivable aspect of inheritable diseases in man. This is a prime candidate for grunking. Note: While the information in OMIM is completely free and open, distribution is limited to approved clients to prevent bad data from creeping in. Therefore, while we will cheerfully show you how to grunk OMIM, we must be content with having artificial data. Please refer to omim.grk for the configuration file and smallomim.txt for a sample record.
The configuration file holds only one surprise. The format for the
records in OMIM allows for a "mini-MIM" record, which is delimited
by *FIELD* MN. This contains all of the fields that the
full record can have, except for the text field (*FIELD* TX).
Since this is nested, this means that the text field's starting delimiter
functions as a mini-MIM's end delimiter. This is fine, but we should circulate
the delimiter for a text field back. This is the reason there is a flag set
for reuseEndDelimiter="true".
The database for Cancer Net contains journal information for many articles and papers written on cancer and related subjects. We have included a small section on this. Please refer to cnet.grk for the configuration file and smallcnet.txt for a couple of sample records.