TIP 75: Refer to Sub-RegExps Inside 'switch -regexp' Bodies

Login
Author:         Donal K. Fellows <[email protected]>
Author:         János Holányi <[email protected]>
Author:         Salvatore Sanfilippo <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        28-Nov-2001
Post-History:   
Discussions-To: http://purl.org/mini/cgi-bin/chat.cgi
Keywords:       switch,regexp,parentheses
Tcl-Version:    8.5
Tcl-Ticket:     848578

Abstract

Currently, it is necessary to match a regular expression against a string twice in order to get the sub-expressions out of the matched string. This TIP alters that so that those sub-exps can be substituted directly into the body of the script to be executed.

Rationale

Similarly to the

   regexp -- <RE> $string matchvar submatchvar ...

of Tcl and the

   interact -re <RE> {
      set matches "$interact_out(0,string) $interact_out(1,string) ..."
   }

of Tcl/Expect, it would be very helpful and would also make Tcl more consistent if the [switch] command of Tcl would support references to parenthesized REs inside the switch patterns from the bodies associated to each of the patterns. As it is, it is currently necessary to match the regular expression against the string twice to obtain this information.

Specification

The easiest way to get the information is to place it into a variable. All that remains is a way to specify which variable should receive the information. This is done by a new option to the [switch] command: -matchvar. The argument to this optiongives the name of a variable in which will be placed a Tcl list of the matches discovered by the RE engine, such that the part of the string that was matched is given by [lindex $var 0], the first parenthesis by [lindex $var 1], etc. The alternative to this is to use the name of an array, but this is more expensive.

The indices which the match occurred at can also be sometimes useful. Therefore, the new option -indexvar will also be provided which will name a variable into which a list of match indices (each a two item list of values in the same way that [regexp -indices] computes) will be placed. It will be legal for both -matchvar and -indexvar to be specified in the same [switch] command, but only if the matching mode is -regexp. (The other kinds of match modes always match against the whole string anyway.)

Both variables (if specified, of course) will contain the empty list if the default branch is taken.

Example

set string "some long complicated message"
switch -matchvar foo -indexvar bar -regexp -- $string {
   {\w*(e)\w*} {
      puts "matched [lindex $foo 0] with 'e' at [lindex $bar 1 0]"
   }
   default {
      puts "no words containing a letter 'e' at all"
   }
}

Alternatives

Actually, no new syntax is needed to achieve the mentioned ability. The solution could adopt the behavior of [regsub] (description taken from regsub(n)):

If subSpec contains a `&' or `\0', then it is replaced in the substitution with the portion of string that matched exp. If subSpec contains a `\n**, where _n is a digit between 1 and 9, then it is replaced in the substitution with the portion of string that matched the n-th parenthesized subexpression of exp. Additional backslashes may be used in subSpec to prevent special interpretation of `&' or `\0' or `\n' or backslash.

This has the disadvantage of being incompatible with existing code that makes use of the -regexp option to [switch] and which may well have characters matching the above sequences inside already.

Another alternative can be to specify either -submatches, or -subindexes and use three elements for every switch case. The first is the regexp, the second the list of vars like in the [regexp] command, and the last the script to execute.

set string [getSomeComplexProtocolLine]
switch -regexp -submatches -- $string {
    {EHLO (.*)} {match heloarg} {
       puts "Helo $heloarg"
    }
    {MAIL FROM: <(.*)@(.*)>} {match user host} {
       puts "Mail from $user at $host"
    }
    {QUIT} {} {
       exit
    }
    default {} {
       puts "What a strange SMTP command!"
    }
}  

Usually submatches have quite logical names, so it is possible that to refer they by name instead of to use [lindex] can be more comfortable. Another minor advantage of this is that variable names are very near the script, so it shouldn't be hard to follow what the script is doing.

On the other side this changes a well-known fact of switch getting as input two elements for every case; the main proposal of this TIP has the advantage of leaving that feature of the [switch] command as an invariant. This makes the overall implementation of the feature easier, and also makes it easier to tell people how to use. And it allows for trivial obtaining of both the matched string and the range of the input string that matched. Of course, in that case you could just have four values for each entry, but that is getting baroque.

Reference Implementation

http://sf.net/tracker/?func=detail&aid=848578&group\_id=10894&atid=310894

Copyright

This document has been placed in the public domain.