Value Analysis Data Representation

As explained in About the Value Analysis, states are computed at each reachable program point. A state is an over-approximation of all the possible values that the variables can hold at a given program point. You can imagine a state as a mapping between the variables and sets of values (keep in mind though that in reality it is a little more complex than that). The values’ representation is explained below.

Integer Values

A small set of possible integers is represented as an enumeration, for instance { -1, 0, 32, 48, 97, 98, 99 }.

Sets of more than 8 elements [1] are represented as intervals with an optional congruence information. For instance, [0 .. 200] represents the integers between 0 and 200, and [7 .. 107] 7 % 10 represents the integers 7, 17, 27, … 107.

[1]this limit of 8 is user-settable with the -val-ilevel option but changing it from the default of 8 is only useful in special circumstances. In general analyses, the representation of sets of more than 8 elements as intervals is not the only reason for the loss of precision, and raising this limit does not improve the precision.

Floating-point Values

The value of a floating-point expression in the program is represented as a single floating-point value {0.125} or as an interval of possible floating-point values [0.5 .. 2.5].

Address (pointer) Values

A pointer expression in the program is represented as offsets with respect to base addresses. The base addresses correspond to variables in the program, to string literals, to the results of dynamical allocation functions. Offsets with respects to each base are represented with the same representation as integer values (short enumerations or intervals with optional congruence information).

Undefined or Imprecise Values

Garbled Mix

When the analyzer is not certain that meaningful operations are made with addresses (e.g. xoring or adding the representation of addresses), it still keeps track, as a last resort, of the base addresses that were used in the computation. The user should not expect programs that do these operations on purpose to be analyzed very precisely. Programs that do not do these operations on purpose can still do them accidentally, or the analyzer may think that they do these operations as the result of other over-approximations. In this case, the user might as well stop the analysis as soon as such values occur, with the -val-stop-at-nth-garbled option, and fix the problem immediately.

Memory

Memory locations contain values as described above. In a memory location can also be found flags indicating that the memory should not be used because it is, in the C standards’ words, indeterminate. These flags are:

  • UNINITIALIZED: the memory should not be used because it is uninitialized.
  • ESCAPINGADDR: the memory should not be used because it is a dangling pointer.

It is not an error for these values to exist in memory, but it is an error (that the analyzer will flag explicitly) to use them in any computation.

Because the analyzer’s job is to consider multiple program behaviors at once (e.g. executions that can either go through a branch that initializes a memory location or not, pointers that either contain the address of an automatic variable or that of another variable with static lifetime, and arguments of free that can be either one dynamically allocated address or another), a memory location can contain both valid values and one or both of the above flags. In this case, if the memory location appears to be subsequently used, the analyzer will both flag the possible error and continue the analysis with the values that can legally be found in the memory location.

Example:

1
2
3
4
5
int main(int c, char *v[]) {
  int x, y;
  if (c % 2) x = 1;
  if (c % 3) y = x + 1;
}

When analyzed with tis-analyzer -val test.c, the results shows the following lines (among others):

test.c:4:[kernel] warning: accessing uninitialized left-value: assert \initialized(&x);
[value] Values at end of function run_atexit_functions:
[value] Values at end of function __tis_globfini:
[value] Values at end of function main:
  x ∈ {1} or UNINITIALIZED
  y ∈ {2} or UNINITIALIZED

After the first if, the value in memory for x is {1} OR UNINITIALIZED. Next, using x to compute y causes an alarm to be emitted about the fact that x may be uninitialized, and the value that y can have (if it has been computed) is inferred to be 2. The variable y being uninitialized is not an error in itself, and the fact that x is uninitialized isn’t either when y is not computed (for c == 6 for instance). But using x while uninitialized (for c == 2) is an error.

Hash signs in the value representations

Hash signs (#) in the printouts of memory contents by the analyzer indicate that the size or alignment of the value stored in memory does not match that of the container. When they correspond to an actual behavior of the program, there must be a union or some form of pointer conversion in the program to create the memory state that shows hash signs, and the program may suffer of type confusion.

The hash signs can also result from over-approximations of the analyzer, for instance if the program manipulates pointers inside arrays of structures having arrays as members. In this case the way to make them disappear is to eliminate the over-approximations that cause them. Watchpoints can be useful in order to determine where in the program the over-approximation happens.

Here is a program that produces all, or at least most, cases:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
#include <string.h>
#include <stdint.h>

double d[20] = {1.0, 1.0, 2.0, 3.0}; // remaining indices set to 0.0

uint64_t uc;

int main(void) {
  uint64_t longboi = 0x1234567887654321;
  double two = 4.0;
  uint32_t shorty = 0x76543210;
  memcpy((char*)d + 20, &two, 8);
  memcpy(&d[15], &longboi, 8);
  memcpy(&d[17], &shorty, 4);

  memcpy(&uc, (char*)d+4, 8);
}

When analyzed with tis-analyzer -val h.c, the analyzer indicates that the values of d and uc at the end of execution are:

  d[0..1] ∈ {1.}
   [2][bits 0 to 31]# ∈ {2.}%64, bits 0 to 31 
   {[2][bits 32 to 63]#; [3][bits 0 to 31]#} ∈ {4.}
   [3][bits 32 to 63]# ∈ {3.}%64, bits 32 to 63 
   [4..14] ∈ {0}
   [15] ∈ {1311768467139281697}
   [16] ∈ {0}
   [17][bits 0 to 31] ∈ {1985229328}
   {[17][bits 32 to 63]; [18..19]} ∈ {0}
  uc# ∈ {1.} repeated %64, bits 32 to 95 

Here are some line-by-line explanations:

[2][bits 0 to 31]# ∈ {2.}%64, bits 0 to 31

This line indicates that while the 32 bits from the beginning of d[2] contains the right bits in the right place of the value 2.0, the value 2.0 is a 64-bit double and only the first 4 bytes are available.

{[2][bits 32 to 63]#; [3][bits 0 to 31]#} ∈ {4.}

This line indicates that the byte 4 to 7 of d[2] and the bytes 0 to 3 of d[3] have been overwritten with a 64-bit value, and that value is the 64-bit double 4.0.

uc# ∈ {1.} repeated %64, bits 32 to 95

This line indicates that while the entirety of the variable uc was overwritten, it was overwritten with contents that do not align with the variable. Specifically, the bits contained in uc are identical to the bits 32 to 95 of an infinite array of doubles containing only the value 1.0.

JSON format to print states

When the -val-dump-destination json option is set, the tis_dump_each_file builtin dumps a JSON file representing the full state of the Value analysis. The grammar for this JSON file is:

state_file:
   {
     "file": string ,
     "line": integer,
     "args": string,
     "state": list of <binding>
   }
 binding:
   {
     "base": string,
     "values": list of <values>
   }
 values:
   {
     "offset": string (optional),
     "value": string,
     "more_value": string (optional)
   }

The file and line fields give the location of the call to the tis_dump_each_file builtin.

The args fields is a string that represent the current values of the other arguments of the tis_dump_each_file call (see -val-dump-destination usage for examples).

The base field is the name base in the memory model. It may be a variable name, a string literal or the name of an allocated memory chunk.

The offset field is optional. When not present, it is equivalent to a 0 offset. Otherwise, it is the representation of an offset from the base. It may be a sequence of array indexes, or field names.

The value is the textual representation of the values described in the current section (Value Analysis Data Representation). Typically, the single integer 10 is represented as "{10}".

The more_value, when present, gives more information about the misaligned values. See Hash signs in the value representations for more information.