Author:         Donal K. Fellows <[email protected]>
State:          Final
Type:           Project
Vote:           Done
Created:        05-Nov-2001
Post-History:   
Tcl-Version:    8.4

Abstract

This TIP adds the capability to perform computations on values that are (at least) 64 bits wide, even on 32-bit platforms. It also adds support for handling files larger than 2GB on those platforms (where supported by the underlying platform and filing system).

Rationale

There have been a number of requests, from a whole range of application areas, for Tcl to be enhanced to handle 64-bit values even on platforms where that is larger than the native word size. The vast majority of C compilers support a large enough arithmetic type (often called long long, though other names are common on the Windows platform). Such areas include:

large-file support for people working with lots of data. Note that at the moment Tcl cannot even report the file type for a file that is larger than 2GB in size.
large value support for people working with network addresses (this is likely to come up more in the future with IPv6.)

However, a number of existing algorithms assume that integer arithmetic operations wrap at 32 bits. This demonstrates the need for semantic backward compatibility, so termed because a recompile of the C portions of the relevant code will not fix the problem. There are also many existing extensions that assume a particular word size, which requires syntactic backward compatibility, so termed because recompilation will probably cure the problem. Hence any upgrade of Tcl's functionality must be done carefully so as to preserve as much backward compatibility as possible.

Proposed Changes

To resolve this problem, I will introduce:

A new pair of types at the C level to represent signed and unsigned values with a width of at least 64-bits. These types will be called Tcl_WideInt and Tcl_WideUInt respectively. On 64-bit platforms (and 32-bit platforms where there is no compiler support for arithmetic 64-bit types) these will be typedef'ed to long to preserve as much inter-platform compatibility as possible.

The type names are based on the term Wide rather than Long or LongLong. Long would cause a problem with existing Tcl APIs (Tcl_GetLongFromObj, for example), and LongLong is both longer and less mnemonic. Not all Tcl platforms are built with compilers that understand long long in the first place, and the major factor in its favour at the C level was almost certainly that it did not introduce any new reserved words into the C syntax, which would have had a major backward-compatibility impact. We are not bound by such things and can choose to suit ourselves.
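As a rough illustration of how such a pair of types could be selected at compile time, the sketch below picks a 64-bit type from whatever the compiler offers and falls back to long otherwise. The Sketch_ names and the SKETCH_WIDE_IS_LONG macro are hypothetical and exist only for this example; the real definitions in tcl.h may be arranged differently.

```c
#include <assert.h>
#include <limits.h>

#if LONG_MAX > 0x7FFFFFFFL
/* long is already wider than 32 bits, so reuse it directly. */
typedef long Sketch_WideInt;
typedef unsigned long Sketch_WideUInt;
#define SKETCH_WIDE_IS_LONG 1
#elif defined(_MSC_VER)
/* Microsoft compilers spell the 64-bit type __int64. */
typedef __int64 Sketch_WideInt;
typedef unsigned __int64 Sketch_WideUInt;
#define SKETCH_WIDE_IS_LONG 0
#elif defined(LLONG_MAX)
/* C99-style compilers provide long long (at least 64 bits). */
typedef long long Sketch_WideInt;
typedef unsigned long long Sketch_WideUInt;
#define SKETCH_WIDE_IS_LONG 0
#else
/* No 64-bit arithmetic type is available: fall back to long. */
typedef long Sketch_WideInt;
typedef unsigned long Sketch_WideUInt;
#define SKETCH_WIDE_IS_LONG 1
#endif

/* Storage width, in bits, of the wide type on this build. */
int sketch_wide_bits(void) {
    return (int)(sizeof(Sketch_WideInt) * CHAR_BIT);
}

/* Start-up check: the signed and unsigned types must have the same size. */
void sketch_check_wide_types(void) {
    assert(sizeof(Sketch_WideInt) == sizeof(Sketch_WideUInt));
}
```

Code that needs to know whether the wide type is distinct from long can test SKETCH_WIDE_IS_LONG, which mirrors the case described above where the types are simply typedef'ed to long.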
A new field of type Tcl_WideInt in the internalRep union of the Tcl_Obj type. Note that this is 100% backward compatible since the union already contains a field that is a pair of pointers (each of which I assume to be at least 32-bits wide.)
A new object type for 64-bit wide values, together with accessor functions called Tcl_NewWideIntObj, Tcl_SetWideIntObj and Tcl_GetWideIntFromObj to create, modify and retrieve values from objects of that type (on platforms where Tcl_WideInt is not distinct from long, these will all be redirected to the previously existing integer type.)
The [expr] command shall be reworked so that:

* If a constant looks like a signed integer (i.e. it lies between INT_MIN and INT_MAX inclusive) it is treated as such. Otherwise if it looks like an integer of any size, an attempt will be made to treat it like a wide integer, and if that fails or it doesn't look like an integer at all, it will be treated as a double. Note that this will be a source of a potential backwards incompatibility with scripts that include values that are meant to be unsigned integers.

* With arithmetic operations, the output will be a double if at least one of the operands is a double, a wide integer if at least one of the operands is a wide integer, and a normal integer otherwise. (The main exception to this will be the left and right shift operations where the type of the second operand will not affect the type of the result.)

* The int() pseudo-function will always return a non-wide integer (converting by dropping the high bits) and the new pseudo-function wide() will always return a wide integer (converting by sign-extending.) On platforms without a distinct 64-bit type, these operations will behave identically.

* User-defined functions will be able to gain access to the wide integer through an extra wideValue field in the Tcl_Value structure and a TCL_WIDE_INT value (which will be the same as TCL_INT on platforms without a distinct 64-bit type) in the Tcl_ValueType enumeration.
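The result-type rule for arithmetic and the behaviour of int() and wide() described in the points above can be sketched in C as follows. This is only a model of the stated rules, assuming a 32-bit normal integer and a 64-bit wide integer; the NumKind enumeration and all function names are hypothetical and are not part of the Tcl API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tags for the three numeric kinds that [expr] distinguishes. */
typedef enum { KIND_INT, KIND_WIDE, KIND_DOUBLE } NumKind;

/* Ordinary arithmetic: double wins, then wide, otherwise normal integer. */
NumKind arith_result_kind(NumKind a, NumKind b) {
    if (a == KIND_DOUBLE || b == KIND_DOUBLE) return KIND_DOUBLE;
    if (a == KIND_WIDE || b == KIND_WIDE) return KIND_WIDE;
    return KIND_INT;
}

/* Shifts: the shift count (second operand) does not affect the result kind. */
NumKind shift_result_kind(NumKind value, NumKind count) {
    (void)count;
    return value;
}

/* Model of int(): keep the low 32 bits and drop the high bits. */
int32_t sketch_int(int64_t w) {
    uint32_t low = (uint32_t)(uint64_t)w;  /* modular reduction to 32 bits */
    int32_t out;
    memcpy(&out, &low, sizeof out);        /* reinterpret the bits as signed */
    return out;
}

/* Model of wide(): sign-extend a normal integer to 64 bits. */
int64_t sketch_wide(int32_t i) {
    return (int64_t)i;
}

/* A few cases that follow directly from the rules above. */
void sketch_expr_demo(void) {
    assert(arith_result_kind(KIND_INT, KIND_WIDE) == KIND_WIDE);
    assert(arith_result_kind(KIND_WIDE, KIND_DOUBLE) == KIND_DOUBLE);
    assert(shift_result_kind(KIND_INT, KIND_WIDE) == KIND_INT);
    assert(sketch_int(INT64_C(0x100000001)) == 1);
    assert(sketch_wide(-1) == INT64_C(-1));
}
```

On a platform without a distinct 64-bit type the two conversions would collapse into the same operation, as the text notes; the sketch does not model that case.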
The [incr] command will be able to increment variables containing 64-bit values correctly, but will only accept 32-bit values as amounts to increment by.
Tcl_Seek and Tcl_Tell (together with all channel drivers) will be updated to use the new 64-bit type for offsets, which will be reflected at the Tcl level in the [seek] and [tell] commands. A compatibility interface will be maintained for old extensions that do not supply a channel driver, though the size of offset reportable through that interface will naturally be limited.
Tcl_FSStat and Tcl_FSLstat will both be updated to use a stat structure reference that can contain 64-bit wide values. This will enable various [file] subcommands (and [glob] with some options) to work correctly with files over 2GB in size. Note that there is no neat way to do this in a backward compatible way as there is currently no guarantee on which fields will actually be present in the structure, but those functions have never been available outside an alpha...

Because the name of a suitable structure varies considerably between platforms, a new type, Tcl_StatBuf, will be declared as the type of the structure to which the stat-related functions should be passed a pointer. A new function, Tcl_AllocStatBuf, will be provided to allow extensions to allocate a buffer of the correct size whatever the platform.

Note that Tcl_Stat will be written to contain backward-compatibility code so that code that references it will work unchanged.
In the format and scan commands, the l modifier will gain significance for the integer-handling conversion specifiers (d, u, i, o and x): it will tell them to work with 64-bit values (if those are not the default for the platform anyway.)
The binary command will gain new w and W specifiers for its format and scan subcommands. These will operate on 64-bit wide values in a fashion analogous to the existing i and I specifiers (i.e. smallest byte to largest, and largest byte to smallest respectively.)
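The two byte orders described for the w and W specifiers can be sketched in C as below. The function names are hypothetical and the code only models the byte layout of a 64-bit value; it is not the implementation of the binary command.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of "w": smallest (least significant) byte first. */
void pack_wide_w(int64_t value, unsigned char out[8]) {
    uint64_t bits = (uint64_t)value;
    for (int i = 0; i < 8; i++) {
        out[i] = (unsigned char)((bits >> (8 * i)) & 0xFF);
    }
}

/* Model of "W": largest (most significant) byte first. */
void pack_wide_W(int64_t value, unsigned char out[8]) {
    uint64_t bits = (uint64_t)value;
    for (int i = 0; i < 8; i++) {
        out[i] = (unsigned char)((bits >> (8 * (7 - i))) & 0xFF);
    }
}

/* Inverse of pack_wide_w: rebuild the value from smallest-byte-first data. */
int64_t unpack_wide_w(const unsigned char in[8]) {
    uint64_t bits = 0;
    int64_t value;
    for (int i = 7; i >= 0; i--) {
        bits = (bits << 8) | in[i];
    }
    memcpy(&value, &bits, sizeof value);  /* reinterpret the bits as signed */
    return value;
}

/* The same value laid out in both orders, plus a round trip. */
void sketch_binary_demo(void) {
    unsigned char buf[8];
    pack_wide_w(INT64_C(0x0102030405060708), buf);
    assert(buf[0] == 0x08 && buf[7] == 0x01);
    pack_wide_W(INT64_C(0x0102030405060708), buf);
    assert(buf[0] == 0x01 && buf[7] == 0x08);
    pack_wide_w(INT64_C(-2), buf);
    assert(unpack_wide_w(buf) == INT64_C(-2));
}
```

Working on an unsigned copy of the value keeps the shifts well defined for negative inputs, which is why both packers convert to uint64_t first.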
New compatibility functions will also be provided, because not all platforms have convenient equivalent functions to strtoll and strtoull.
Tcl_LinkVar will be extended so that it can link with a wide C variable (via a TCL_LINK_WIDE_INT flag).
The tcl_platform array will gain a new member, wordSize, which will give the native size of machine words on the host platform (actually whatever sizeof(long) returns.)

Summary of Incompatibilities and Fixes

The behaviour of expressions containing constants that appear positive but which have a negative internal representation will change, as these will now usually be interpreted as wide integers. This is always fixable by replacing the constant with int(constant).
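To make this change concrete, the sketch below classifies a constant using the rule given under Proposed Changes: a normal integer if it fits, otherwise a wide integer if it still looks like an integer, otherwise a double. It assumes a 32-bit normal integer and a C99 library that provides strtoll; the LiteralKind type and classify_literal function are hypothetical and are not how the Tcl parser is written.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical result of classifying a numeric constant. */
typedef enum { LIT_INT, LIT_WIDE, LIT_DOUBLE, LIT_INVALID } LiteralKind;

LiteralKind classify_literal(const char *text) {
    char *end;
    long long w;

    /* First try to read the whole string as an integer of any size. */
    errno = 0;
    w = strtoll(text, &end, 0);
    if (end != text && *end == '\0' && errno != ERANGE) {
        /* Inside the 32-bit range it stays a normal integer. */
        return (w >= INT32_MIN && w <= INT32_MAX) ? LIT_INT : LIT_WIDE;
    }

    /* Not an integer, or too large even for a wide one: try a double. */
    (void)strtod(text, &end);
    if (end != text && *end == '\0') {
        return LIT_DOUBLE;
    }
    return LIT_INVALID;
}

/* 0xFFFFFFFF looks positive but has all 32 low bits set. */
void sketch_literal_demo(void) {
    assert(classify_literal("42") == LIT_INT);
    assert(classify_literal("0xFFFFFFFF") == LIT_WIDE);
    assert(classify_literal("1.5") == LIT_DOUBLE);
}
```

A constant such as 0xFFFFFFFF is the kind affected here: it does not fit the signed 32-bit range, so under the new rule it is read as a wide integer unless the script wraps it in int().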

Extensions creating new channel types will need to be altered as different types are now in use in those areas. The change to the declaration of Tcl_FSStat and Tcl_FSLstat (which are the new preferred API in any case) is less serious as no non-alpha releases have been made yet with those API functions.

Scripts that are lax about the use of the l modifier in format and scan will probably need to be rewritten. This should be very uncommon though as previously it had absolutely no effect.

Extensions that create new math functions that take more than one argument will need to be recompiled (the size of Tcl_Value changes), and functions that accept arguments of any type (TCL_EITHER) will need to be rewritten to handle wide integer values. (I do not expect this to affect many extensions at all.)

Why Tcl_WideInt?

I chose the name Tcl_WideInt for the type because it represents a wider-than-normal integer. Alternatives that were considered and rejected were:

Tcl_LongLong: This takes its name from the name of the underlying C type used in many UNIX compilers, but that in turn was chosen because it meant that no new keywords would be added to the language, and not out of any feeling that the type name itself is of any wider merit. Seeing as Tcl is a keyword-less language, there is no particular reason for going down this route (which would lead to things like a longlong() type conversion function added to the [expr] command, which is really very ugly indeed...) It is also not universally the name of the underlying type; the Windows world is different (as usual.)

Tcl_Int64: This name, by contrast, comes more from the Windows world. Its major problem is that it specifies eternally what the size of the type is, whereas at some point in the future (when 64-bit words are the norm) we may want to support something wider still (though I do not yet know what uses we would put 128-bit integers to.) I believe that the name of a type is part of its specification, but that the size of the type is less so. Tcl_Int64 is also ugly when it comes to derivations of the name for things like the type converter in [expr] (again) and the names of variables containing values of the type (internally, as formal parameters, and as fields of structures) and may well clash on systems where the C compiler gives real meaning to int64 by default. By contrast, Tcl_WideInt lends itself well to generating variable names (wideValue, widePtr, etc., and even just plain w in the implementation of the bytecode execution engine) which, for me as the person implementing the changes, was a major consideration.

Copyright