Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

I have some questions. First, do pointers that do not come from specifying a statically sized array also have provenance? That is if I say `char* c; size_t n = get_user_input(); c = malloc(n);` does c have provenance? If after the above I created an array like so `char str[n]` and then assigned such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointer be equal?

Also, in Rust why are pointer to integer casts allowed at all if they lose provenance?

Also, why exactly is one past end of bounds a valid pointer you can calculate?

Also, it seems like you cannot just create a random pointer to an unallocated address, not just that you can’t write to it. Is that correct?

Lastly, is provenance just a fancy way of saying metadata? Is there a reason something like “pointer metadata” or “object metadata” wasn’t used?



>That is if I say `char* c; size_t n = get_user_input(); c = malloc(n);` does c have provenance?

Yes.

>If after the above I created an array like so `char str[n]` and then assigned such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointer be equal?

Yes.

>Also, in Rust why are pointer to integer casts allowed at all if they lose provenance?

It would be a breaking change to remove the ability to cast. Instead Rust added API to extract the provenance when casting from ptr to int and then add provenance back when casting from int to ptr.

>Also, why exactly is one past end of bounds a valid pointer you can calculate?

It's useful to allow it to make `if (ptr < end)` with exclusive `end` possible. The alternative of only allowing `if (ptr <= end)` with inclusive `end` is not as trivial if it needs to handle empty ranges (the exclusive version can just use `end = start` to denote an empty range).

>Lastly, is provenance just a fancy way of saying metadata?

Pointer metadata in Rust at least refers to a different thing, namely the second half of fat pointers. For slice pointers the fat pointer contains data ptr and length, so the length is the metadata. For trait object pointers the fat pointer contains data ptr and vtable ptr, so the vtable ptr is the metadata.


> It's useful to allow it to make `if (ptr < end)` with exclusive `end` possible. The alternative of only allowing `if (ptr <= end)` with inclusive `end` is not as trivial if it needs to handle empty ranges (the exclusive version can just use `end = start` to denote an empty range).

As an aside: I find it fascinating that half-open ranges turn out to be - universally as far as I can tell - the right choice for ranges in software. As well as being able to store an empty range, they also allow you to calculate the length of the range by trivially subtracting end - start.


For the growable array and slice types (C++ std::vector and std::span) at a low level we should actually store ptr and length not start and end, it makes a tiny but measurable performance difference.

This is one of the tiny perf leaks from C++ ABI conservation. IIRC All three popular std::vector implementations got this wrong decades ago and are now ABI frozen so too bad. The design of std::span is correct though for whatever that's worth.


Storing the length instead of the address past the last element does not change anything in the answer to the question posed by the previous poster.

The length of an array is the index of an element beyond the limit of the array, which would be invalid in an indexed expression for accessing an array element.

Exactly like when using pointers the address beyond the last element of the array is useful, when using indices the length, i.e. the index beyond the last element of the array, is useful.

Moreover, regardless how the arrays are handled in a high-level programming language, on CPU architectures that provide SIB (scaled index + base) addressing modes instead of auto-incremented/auto-decremented addressing modes, e.g. on x86-64, there are frequently encountered cases when the optimal implementation of a loop in machine instructions requires a base register initialized to the address of the element beyond the array limit, together with an index register that contains negative indices, which is incremented at each iteration until it becomes non-negative, in order to access the array in increasing order.

(Positive indices that are decremented until becoming negative, together with a base register containing the address of the first array element, are used when the arrays must be accessed in decreasing order.)

In optimal implementations on x86-64 and similar ISAs it does not matter much which value between length and address past the array you choose to store in an array descriptor. When the loop is initialized in machine instructions you may need frequently the third value that is not stored, but that is pretty much irrelevant, because an extra addition or subtraction is usually done in zero time on modern CPUs (i.e. concurrently with the other instructions that would have been executed anyway, even when the extra addition/subtraction would not have been necessary).

Because arrays are more frequently accessed in increasing order, even when that is not necessary, for the purpose of just accessing the array it would be optimal to store in the array descriptor the length together with the address of the element beyond the array. However this would be worse in conjunction with legacy APIs that would expect the address of the first element, which would require an extra subtraction, and also when accessing just an array element, not the whole array, when an extra subtraction would also be needed.

In any case, on x86-64 storing both addresses instead of the first address with the length does not have any measurable performance difference in correct implementations, because the number of machine instructions needed for implementing a loop is the same in the most frequent cases. If there are performance differences, it is likely that the machine code generation is bad.


> I find it fascinating that half-open ranges turn out to be - universally as far as I can tell - the right choice for ranges in software.

There is a famous letter from Dijkstra on this topic. https://www.cs.utexas.edu/~EWD/transcriptions/EWD08xx/EWD831...


> > If after the above I created an array like so `char str[n]` and then assigned such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointer be equal?

> Yes.

Uh, no. This is flatly untrue. You cannot "assign an array such that it points to a region of memory". Arrays are not pointers, they do not point to anything.


I assumed they meant `char (*str)[5]`, ie a pointer to an array, because they wanted to reinterpret the malloc result as an array of the malloc'd length.


> If after the above I created an array like so `char str[n]` and then assigned such that it pointed to the same region of memory that malloc() returned, then would the provenance of both pointer be equal?

I don't think you can perform such assignment.

> Also, in Rust why are pointer to integer casts allowed at all if they lose provenance?

Sometimes you want the numeric address to perform some arithmetic/bitwise operation on it (e.g. for tagged pointers)

> Also, why exactly is one past end of bounds a valid pointer you can calculate?

Because it's increadibly useful for iterating. It is also used for representing an empty tail slice. It's something inherited from C/C++ and likely won't go away.

> Also, it seems like you cannot just create a random pointer to an unallocated address, not just that you can’t write to it. Is that correct?

You can create such pointer, it just won't be valid.

> Lastly, is provenance just a fancy way of saying metadata? Is there a reason something like “pointer metadata” or “object metadata” wasn’t used?

It's not metadata that exists at runtime. It only exist at compile time to allow certain common optimizations.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: