To understand basic syntax of LEX specification, built in functions & variable
Problem Statement : To
understand basic syntax of LEX specification, built in functions &
variable
Theory :
Lex helps write programs whose control flow is directed by
instances of regular expressions in the input stream. It is well
suited for editor-script type transformations and for segmenting
input in preparation for a parsing routine.
Lex source is a table
of regular expressions and corresponding program fragments. The table
is translated to a program which reads an input stream, copying it to
an output stream and partitioning the input into strings which match
the given expressions. As each such string is recognized the
corresponding program fragment is executed. The recognition of the
expressions is performed by a deterministic finite automaton
generated by Lex. The program fragments written by the user are
executed in the order in which the corresponding regular expressions
occur in the input stream.
The lexical analysis
programs written with Lex accept ambiguous specifications and choose
the longest match possible at each input point. If necessary,
substantial lookahead is performed on the input, but the input stream
will be backed up to the end of the current partition, so that the
user has general freedom to manipulate it.
Lex Source.
The general format of Lex source is:
{definitions}
%%
{rules}
%%
{user
subroutines}
where the definitions and the user subroutines are often omitted. The
second %% is optional, but the first is required to mark the
beginning of the rules. The absolute minimum Lex program is thus
%%
(no definitions, no rules) which translates into a program which
copies the input to the output unchanged.
In the
outline of Lex programs shown above, the rules represent the user's
control decisions; they are a table, in which the left column
contains regular expressions (see section 3) and the right column
contains actions, program fragments to be executed when the
expressions are recognized. Thus an individual rule might appear
integer
printf("found keyword INT");
to look for the string integer in the input stream and print the
message ``found keyword INT'' whenever it appears. In this example
the host procedural language is C and the C library function printf
is used to print the string. The end of the expression is indicated
by the first blank or tab character. If the action is merely a single
C expression, it can just be given on the right side of the line; if
it is compound, or takes more than a line, it should be enclosed in
braces. As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling. Lex rules
such as
colour
printf("color");
mechanise
printf("mechanize");
petrol
printf("gas");
would be a start. These rules are not quite enough, since the word
petroleum would become gaseum; a way of dealing with this will be
described later.
Lex Source Definitions.
Remember the format of the Lex source:
{definitions}
%%
{rules}
%%
{user
routines}
So far only the rules have been described. The user needs additional
options, though, to define variables for use in his program and for
use by Lex. These can go either in the definitions section or in the
rules section.
Remember
that Lex is turning the rules into a program. Any source not
intercepted by Lex is copied into the generated program. There are
three classes of such things.
1) Any
line which is not part of a Lex rule or action which begins with a
blank or tab is copied into the Lex generated program. Such source
input prior to the first %% delimiter will be external to any
function in the code; if it appears immediately after the first %%,
it appears in an appropriate place for declarations in the function
written by Lex which contains the actions. This material must look
like program fragments, and should precede the first Lex rule. As a
side effect of the above, lines which begin with a blank or tab, and
which contain a comment, are passed through to the generated program.
This can be used to include comments in either the Lex source or the
generated code. The comments should follow the host language
convention.
2)
Anything included between lines containing only %{ and %} is copied
out as above. The delimiters are discarded. This format permits
entering text like preprocessor statements that must begin in column
1, or copying lines that do not look like programs.
3)
Anything after the third %% delimiter, regardless of formats, etc.,
is copied out after the Lex output.
Definitions
intended for Lex are given before the first %% delimiter. Any line in
this section not contained between %{ and %}, and begining in column
1, is assumed to define Lex substitution strings. The format of such
lines is name translation and it causes the string given as a
translation to be associated with the name. The name and translation
must be separated by at least one blank or tab, and the name must
begin with a letter. The translation can then be called out by the
{name} syntax in a rule. Using {D} for the digits and {E} for an
exponent field, for example, might abbreviate rules to recognize
numbers:
D
[0-9]
E
[DEde][-+]?{D}+
%%
{D}+
printf("integer");
{D}+"."{D}*({E})?
|
{D}*"."{D}+({E})?
|
{D}+{E}
Note the first two rules for real numbers; both require a decimal
point and contain an optional exponent field, but the first requires
at least one digit before the decimal point and the second requires
at least one digit after the decimal point. To correctly handle the
problem posed by a Fortran expression such as 35.EQ.I, which does not
contain a real number, a context-sensitive rule such as
[0-9]+/"."EQ
printf("integer");
could be used in addition to the normal rule for integers.
The
definitions section may also contain other commands, including the
selection of a host language, a character set table, a list of start
conditions, or adjustments to the default size of arrays within Lex
itself for larger source programs. These possibilities are discussed
below under ``Summary of Source Format,'' section 12.
Lex Actions.
When
an expression written as above is matched, Lex executes the
corresponding action. This section describes some features of Lex
which aid in writing actions. Note that there is a default action,
which consists of copying the input to the output. This is performed
on all strings not otherwise matched. Thus the Lex user who wishes to
absorb the entire input, without producing any output, must provide
rules to match everything. When Lex is being used with Yacc, this is
the normal situation. One may consider that actions are what is done
instead of copying the input to the output; thus, in general, a rule
which merely copies can be omitted. Also, a character combination
which is omitted from the rules and which appears as input is likely
to be printed on the output, thus calling attention to the gap in the
rules.
One of the simplest
things that can be done is to ignore the input. Specifying a C null
statement, ; as an action causes this result. A frequent rule is
[
\t\n] ;
which
causes the three spacing characters (blank, tab, and newline) to be
ignored.
Another easy way to
avoid writing actions is the action character |, which indicates that
the action for this rule is the action for the next rule. The
previous example could also have been written
"
"
"\t"
"\n"
with
the same result, although in different style. The quotes around \n
and \t are not required.
In more complex
actions, the user will often want to know the actual text that
matched some expression like [a-z]+. Lex leaves this text in an
external character array named yytext. Thus, to print the name found,
a rule like
[a-z]+
printf("%s", yytext);
will
print the string in yytext. The C function printf accepts a format
argument and data to be printed; in this case, the format is ``print
string'' (% indicating data conversion, and s indicating string
type), and the data are the characters in yytext. So this just places
the matched string on the output. This action is so common that it
may be written as ECHO:
[a-z]+
ECHO;
is
the same as the above. Since the default action is just to print the
characters found, one might ask why give a rule, like this one, which
merely specifies the default action? Such rules are often required to
avoid matching some other rule which is not desired. For example, if
there is a rule which matches read it will normally match the
instances of read contained in bread or readjust; to avoid this, a
rule of the form [a-z]+ is needed. This is explained further below.
Sometimes it is more
convenient to know the end of what has been found; hence Lex also
provides a count yyleng of the number of characters matched. To count
both the number of words and the number of characters in words in the
input, the user might write [a-zA-Z]+ {words++; chars += yyleng;}
which accumulates in chars the number of characters in the words
recognized. The last character in the string matched can be accessed
by
yytext[yyleng-1]
Occasionally, a Lex
action may decide that a rule has not recognized the correct span of
characters. Two routines are provided to aid with this situation.
First, yymore() can be called to indicate that the next input
expression recognized is to be tacked on to the end of this input.
Normally, the next input string would overwrite the current entry in
yytext. Second, yyless (n) may be called to indicate that not all the
characters matched by the currently successful expression are wanted
right now. The argument n indicates the number of characters in
yytext to be retained. Further characters previously matched are
returned to the input. This provides the same sort of lookahead
offered by the / operator, but in a different form.
Example: Consider a
language which defines a string as a set of characters between
quotation (") marks, and provides that to include a " in a
string it must be preceded by a \. The regular expression which
matches that is somewhat confusing, so that it might be preferable to
write
\"[^"]*
{
if
(yytext[yyleng-1] == '\\')
yymore();
else
...
normal user processing
}
which
will, when faced with a string such as "abc\"def"
first match the five characters "abc\; then the call to yymore()
will cause the next part of the string, "def, to be tacked on
the end. Note that the final quote terminating the string should be
picked up in the code labeled ``normal processing''.
The function yyless()
might be used to reprocess text in various circumstances. Consider
the C problem of distinguishing the ambiguity of ``=-a''. Suppose it
is desired to treat this as ``=- a'' but print a message. A rule
might be
=-[a-zA-Z]
{
printf("Op
(=-) ambiguous\n");
yyless(yyleng-1);
...
action for =- ...
}
which
prints a message, returns the letter after the operator to the input
stream, and treats the operator as ``=-''. Alternatively it might be
desired to treat this as ``= -a''. To do this, just return the minus
sign as well as the letter to the input:
=-[a-zA-Z]
{
printf("Op
(=-) ambiguous\n");
yyless(yyleng-2);
...
action for = ...
}
will
perform the other interpretation. Note that the expressions for the
two cases might more easily be written
=-/[A-Za-z]
in
the first case and
=/-[A-Za-z]
in
the second; no backup would be required in the rule action. It is not
necessary to recognize the whole identifier to observe the ambiguity.
The possibility of ``=-3'', however, makes
=-/[^
\t\n]
a
still better rule.
In addition to these
routines, Lex also permits access to the I/O routines it uses. They
are:
1) input() which
returns the next input character;
2) output(c) which
writes the character c on the output; and
3) unput(c) pushes the
character c back onto the input stream to be read later by input().
By default these
routines are provided as macro definitions, but the user can override
them and supply private versions. These routines define the
relationship between external files and internal characters, and must
all be retained or modified consistently. They may be redefined, to
cause input or output to be transmitted to or from strange places,
including other programs or internal memory; but the character set
used must be consistent in all routines; a value of zero returned by
input must mean end of file; and the relationship between unput and
input must be retained or the Lex lookahead will not work. Lex does
not look ahead at all if it does not have to, but every rule ending
in + * ? or $ or containing / implies lookahead. Lookahead is also
necessary to match an expression that is a prefix of another
expression. See below for a discussion of the character set used by
Lex. The standard Lex library imposes a 100 character limit on
backup.
Another Lex library
routine that the user will sometimes want to redefine is yywrap()
which is called whenever Lex reaches an end-of-file. If yywrap
returns a 1, Lex continues with the normal wrapup on end of input.
Sometimes, however, it is convenient to arrange for more input to
arrive from a new source. In this case, the user should provide a
yywrap which arranges for new input and returns 0. This instructs Lex
to continue processing. The default yywrap always returns 1.
This routine is also a
convenient place to print tables, summaries, etc. at the end of a
program. Note that it is not possible to write a normal rule which
recognizes end-of-file; the only access to this condition is through
yywrap. In fact, unless a private version of input() is supplied a
file containing nulls cannot be handled, since a value of 0 returned
by input is taken to be end-of-file.
Example.
Consider copying an input file while adding 3 to every positive
number divisible by 7. Here is a suitable Lex source program
%%
int
k;
[0-9]+
{
k
= atoi(yytext);
if
(k%7 == 0)
printf("%d",
k+3);
else
printf("%d",k);
}
to do just that. The rule [0-9]+ recognizes strings of digits; atoi
converts the digits to binary and stores the result in k. The
operator % (remainder) is used to check whether k is divisible by 7;
if it is, it is incremented by 3 as it is written out. It may be
objected that this program will alter such input items as 49.63 or
X7. Furthermore, it increments the absolute value of all negative
numbers divisible by 7. To avoid this, just add a few more rules
after the active one, as here:
%%
int
k;
-?[0-9]+
{
k
= atoi(yytext);
printf("%d",
k%7
== 0 ? k+3 : k);
}
-?[0-9.]+
ECHO;
[A-Za-z][A-Za-z0-9]+
ECHO;
Numerical strings containing a ``.'' or preceded by a letter will be
picked up by one of the last two rules, and not changed. The if-else
has been replaced by a C conditional expression to save space; the
form a?b:c means ``if a then b else c''.
Conclusion :
_________________________________________________________
1.What is the use of LEX?
2.What is the general format of LEX?
3.What are classes of LEX?
4.Write a small program using LEX.
Comments
Post a Comment