
It's time to design a new instruction set architecture

Uncategorised Posted on Mon, February 08, 2021 17:49:20

I think it's time to design a new hardware architecture that can eventually replace x86 as the dominant instruction set architecture (ISA) for high performance computing. In this post I want to outline my reasoning for why this should happen and why it should happen now.

Historically, x86 has won out for three main reasons: Intel's superior fabs, the scale of the x86 market, and Microsoft's reluctance to support other instruction sets. When someone came up with something better (like Alpha), the market size of x86 and the huge investments made into it would make sure that the advantage didn't last long. Intel, even with a worse instruction set, could simply clock its CPUs so much faster that any instructions-per-cycle advantage became irrelevant. This is no longer true.

A lot of attention has been given to Apple's M1 architecture. Apple has an advantage in using a newer ISA than x86. But the fact that a 30-year-old architecture (Arm) has advantages over a 40-year-old architecture should neither surprise nor impress anyone. (It would however surprise me greatly if Apple makes the investments needed to make their architecture competitive on the high end, given how small they are in that market.) Arm, while newer than x86, is essentially built under the same basic constraint: a limited number of transistors. And while RISC-V has gathered a lot of excitement because of its open nature, its design mirrors old architectures in that it aims to be simple rather than fast.

So why is it time to design a new ISA right now? I think it's time to redesign something when the constraints of the original design are markedly different from the current constraints, and you can see that the new constraints will remain for the foreseeable future. Design decisions were made at the time because of the limitations of the time. Today we are in a very different situation from when x86, Arm, and PowerPC were conceived:

-Single-threaded performance has hit a ceiling. While computers as a whole are getting faster, gaining more cores and special hardware like GPUs, ML units, video en/decoding, and so on, the vast majority of software is single threaded and runs on the CPU. Many problems can't be parallelized effectively. Even when software makes use of multiple cores or the GPU, a single thread acting as a job dispatcher can often be the bottleneck. This means that increasing single-core performance would have an outsized impact on how fast the computer is in practice. A computer with half as many cores but 50% more performance per core would be much more desirable in most cases, even though it has 25% lower theoretical performance.

-Most older designs were bound by transistor count, whereas today we have so many transistors available that spending more of them on a single core has diminishing returns. That's why we go multi-core instead. If we designed an ISA today, we would do so with the assumption that we have a lot of transistors and are likely to get more.

-Frequencies are no longer going up, mostly due to heat dissipation issues, so a design with better instructions-per-cycle would have a more permanent advantage.

-Memory access (especially latency) has become a limiting factor of real-world performance. A design that has memory access designed from the ground up for a non-uniform memory access (NUMA) model, with caches, stacks in SRAM, more/different registers, memory synchronization, and prefetching at its core, would enable many new innovations.

-A good ISA used to be one that was good for humans to write assembler for, but almost no one does that today. A good ISA today is one that a compiler can write better code for. What is clean and simple for a human to make use of is not the same as what is good for a computer to make use of.

-A very large limiting factor is the CPU's ability to reason about out-of-order execution. Currently the ISA provides very little semantic information to aid in this. A new ISA, together with language constructs along the lines of "restrict", could help both compiler and CPU designers reach higher performance (see the sketch after this list).

-So much of the software and infrastructure we use today is open source, therefore a new ISA would very quickly gain a working software stack. One could imagine a working GCC/LLVM compiler and a Linux port fairly quickly. Microsoft has also shown a willingness to support ISAs other than x86, and their modern code base is designed for multiple ISAs.

-x86 carries a lot of old baggage that is currently needed for backwards compatibility (MMX!). Removing this would save transistors and "dark silicon".

-Modern CPUs have advanced branch prediction, pipelining, decoding, and a lot of other hardware designed to turn the existing ISA into something the CPU can use more effectively. The Itanium architecture tried to move a lot of this logic into the ISA. The problem with that is that the ISA then works for only one specific hardware implementation. What we need is the opposite: an ISA that unleashes the creativity of chip designers and gives them the tools they need to innovate further.
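
As a sketch of the kind of semantic information meant above, here is what C99's "restrict" already lets a programmer promise the compiler: that two pointers never alias, so loads and stores can be reordered and vectorized freely (the function and its names are purely illustrative):

void scale(float * restrict dst, const float * restrict src, int n, float k)
{
	int i;
	/* Because dst and src are declared restrict, the compiler may assume the
	   stores to dst never overwrite data still to be read from src, so it can
	   reorder and vectorize the loop freely. */
	for(i = 0; i < n; i++)
		dst[i] = src[i] * k;
}

A new ISA could carry this kind of aliasing and dependency information all the way down to the hardware, instead of forcing the CPU to rediscover it at run time.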

How would this happen?

I would prefer to see an organization set up and funded by the industry, mainly Intel, AMD and Microsoft. They would create a small group of independent engineers (preferably led by an industry heavyweight like Jim Keller) who would go off and design the new ISA. Then each IHV could go off and make their own hardware implementation and compete in the market for the best product. The ISA would be licensed for hardware implementation only to the participating companies for a few years, so that the investing companies could recoup their investment, and then be made freely available.

Eskil Steenberg



How one word broke C

Uncategorised Posted on Mon, March 16, 2020 04:39:59

A lot has been written about the dangers of "undefined behavior" in C. It's an often-cited reason why C is a "dangerous" language that invites hard-to-find bugs and security issues. In my opinion undefined behavior is not inherently bad. C is meant to be implementable on lots of different platforms, and requiring all of them to behave in exactly the same way would be impractical: it would limit hardware development and make C less future proof. Some of the concern around undefined behavior in C comes from the fact that C is a small enough language that all corners of the language actually get explored and matter.
The problem with undefined behavior is the definition of undefined behavior, or more precisely a single word in that definition. Let's have a look at the definition of undefined behavior in the C89 spec:

Undefined behavior — behavior, upon use of a nonportable or
   erroneous program construct, of erroneous data, or of
   indeterminately-valued objects, for which the Standard imposes no
   requirements.  Permissible undefined behavior ranges from ignoring the
   situation completely with unpredictable results, to behaving during
   translation or program execution in a documented manner characteristic
   of the environment (with or without the issuance of a diagnostic
   message), to terminating a translation or execution (with the issuance
   of a diagnostic message).

Sounds good. Now let's have a look at the definition of undefined behavior in the C99 spec:

1   undefined behavior behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
2   NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

Notice any difference? Careful reading will reveal that the word "Permissible" has been exchanged for "Possible". In my opinion this change has led C in a very problematic direction. Let's unpack why it's so problematic.

In C89, undefined behavior is interpreted as: "The C standard doesn't have requirements for the behavior, so you must define what the behavior is in your implementation, and there are a few permissible options". In C99, undefined behavior is interpreted as: "The C standard doesn't have requirements for the behavior, so you can do whatever you want". Everything after the word "Possible" becomes essentially meaningless. It's the difference between telling your kids they have to go to school, and telling them that going to school is an option.

C89 gives the implementation plenty of reasonable options: doing nothing, doing something platform specific that is also documented, or failing. For a long time this was enough. Adoption of the C99 spec was slow (among other reasons because C99 added variable-length arrays, which turned out to be a very bad idea and were made an optional feature in later versions), so the change didn't matter for a long time, since most compilers didn't take advantage of the new definition. Over time, however, compiler engineers have employed more and more aggressive approaches to optimization, and the "do whatever you want" was too good an opportunity to pass up.

"Whatever you want" is a very big possibility space. If, for instance, you have a large codebase like, say, the Linux kernel, and there is a single instance of undefined behavior somewhere in there, the compiler is free to produce a binary that does whatever it wants. It doesn't have to document what it does, it doesn't have to tell the user, it doesn't need to do anything.

This change has led compiler writers to think that if the programmer even approaches anything undefined, they can do whatever they want, completely disregarding whether it makes logical sense, is predictable, or is in any way useful for software development.
Let's have a look at this code:

struct{
	char x;
	char y;
}a;

memset(&a, 0, sizeof(char) * 2);

The C specification says that there may be padding between members and that reading or writing that memory is undefined behavior. So in theory this code could trigger undefined behavior if the platform inserts padding between x and y, and since the standard leaves the padding unspecified, the code as a whole constitutes undefined behavior.

So we get into the weird situation where compilers say "I know X, but since the C standard doesn't specify X, I can pretend that I don't know X and behave as if it is unknowable." It gives the compiler license to optimize the code, without telling the user, into this:
memset(&a, 0, sizeof(char));

If you are horrified by this, know that I'm being charitable towards the compiler designer here. I might as well have written that it's perfectly reasonable for the compiler to produce a program that formats all your drives behind your back, because again, anything goes.

The problem here is that the compiler knows exactly how much padding there is between the two members, since it is not only conforming to the C standard, it is also conforming to an ABI that very clearly needs to define the padding between types. So the compiler states that it conforms to an ABI that clearly defines the padding between x and y, while at the same time claiming that the user has no way of knowing what the padding is.
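
To see how knowable this actually is, here is a small sketch (the struct name "pair" is just for illustration) that asks the compiler to print the very layout information the "unknowable padding" argument pretends doesn't exist:

#include <stddef.h>
#include <stdio.h>

struct pair{
	char x;
	char y;
};

int main(void)
{
	/* The ABI the compiler targets pins down both of these values. */
	printf("offset of y: %u\n", (unsigned int)offsetof(struct pair, y));
	printf("size of struct: %u\n", (unsigned int)sizeof(struct pair));
	return 0;
}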

A compiler is by its very nature a translator that translates from one language into another. It always has to conform to two standards: the one describing the input and the one describing the output. It makes sense for one of the two sides to say "translate me into whatever works best for the other side". In this way the general concept of undefined behavior is valuable.

Let's take a look at signed integer overflow:

Different architectures can handle signed overflow differently depending on how the sign bit is stored. If the behavior was defined by the C spec, it would be incredibly difficult and slow to implement on hardware platforms that didn't handle overflow the same way as the spec. By keeping it undefined, the C standard gives hardware vendors more flexibility to innovate. So far so good.

Saying that something is undefined by the C spec is not the same as saying that it's unknowable. If I use my compiler to compile a program on my machine, the compiler knows that I'm compiling it for the x64 instruction set, so while the C standard doesn't define what happens when a signed integer overflows, the x64 specification certainly does. Consider this code:

#include <stdio.h>

void func(unsigned short a, unsigned short b)
{
	unsigned int x;
	x = a * b;
	if(x > 0x80000000)
		printf("%u is more than %u\n", x, 0x80000000);
	else
		printf("%u is less than or equal to %u\n", x, 0x80000000);
}

If we run this code (on a platform where short is 16 bits and int is 32 bits):

func(65535, 65535);

We get:

4294836225 is less than or equal to 2147483648

This looks crazy! Why does this happen? You would think that since there are no signed variables in the code, overflow would be defined and you wouldn't have problems, but no. What happens is that C allows the promotion of types to other types that can fit the entire range of the original type.

So:

x = a * b;

becomes:

x = (unsigned int)((int)a * (int)b);

Since the product of a and b is a signed int, the compiler deduces that the result can't be more than INT_MAX, and this carries over after the cast to an unsigned int because wrapping is not a factor. Therefore x can never be more than 0x7FFFFFFF, and therefore the if statement can be optimized away at compile time. You can imagine that the vast majority of programmers would have trouble debugging this code and understanding why it behaves like it does.
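
For what it's worth, one way to sidestep this particular trap (a sketch, not the only option) is to force the multiplication to be done in unsigned arithmetic, where wrapping is well defined:

x = (unsigned int)a * (unsigned int)b;

Here a and b are converted to unsigned int before the multiplication, so no signed overflow can occur and the compiler can no longer assume anything about the range of x.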

C compilers have taken the concept of undefined behavior even further, by doing the mental acrobatics of thinking "If undefined behavior happens, I can do what I want, so therefore I can assume that it will never happen". Consider this code:

*p = 0;
if(p == NULL)
	write_error_message_and_exit();

If the compiler thinks that writing to NULL is undefined, it can assume that since you are writing to p before the check, p can't be NULL. If p can't be NULL, the entire if statement can be removed, and after optimization the code looks like this:

*p = 0;

This is very obviously dangerous behavior. In fact, Linus Torvalds has said that this behavior is so broken that the Linux kernel has broken with the C99 standard and now requires that the kernel is built with the -fno-delete-null-pointer-checks option. The fact that the compiler can detect that p is likely to be dereferenced even when it is NULL is great. But it should result in a warning, not be seen as a license to make stupid assumptions.

The point of a compiler is not to show off that whoever implemented it knows more loopholes in the C standard than the user, but to help the programmer write a program that does what the programmer wants. If you are a compiler and think that the if statement above is superfluous, or that the code allows you to write to a null pointer, THEN TELL THE PROGRAMMER! That's information the programmer wants to have!
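
For the record, the way to write the snippet so that the check actually survives is the obvious ordering: test first, dereference only on the non-NULL path (a trivial sketch, assuming write_error_message_and_exit() never returns):

if(p == NULL)
	write_error_message_and_exit(); /* assumed not to return */
*p = 0;

Here nothing is dereferenced before the check, so the compiler has no excuse to delete it.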

It's like a company building a dangerous product that cuts people's fingers off, but instead of fixing it, they put a warning on page 57 of the manual. Yes, you might be following the letter of the law, but your product still sucks for people fond of their fingers.

The thing is that while it is desirable to write code that is portable and has the same behavior on any platform that C can be implemented on, it is also very useful to write C code that takes advantage of a specific platform. Portability is not the only goal a programmer can have. Making assumptions about your hardware is increasingly useful. The reality is that we know a lot more about hardware architecture now than we did when C was invented. If you write code that assumes that int is 32 bits, that struct members are padded to the even size of the members, and that int overflows to INT_MIN, you are going to be hard pressed to find a platform in wide use where this isn't true. I'm even willing to bet that it's going to look the same for decades to come. (Padding may change, given that memory access is the main bottleneck, so packing things closer together to avoid cache misses may be a win over the cost of unpacking misaligned data.) Can I see us using 128-bit pointers in the future? Yes, but even if Moore's law keeps going, that's close to a century out. Worrying that your code won't do the right thing on a platform where a byte has nine bits is insanity, even if the C standard permits such a platform to implement C.
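
If you want to make those assumptions explicit rather than implicit, a handful of compile-time checks will do it. A minimal, C89-compatible sketch (the exact set of checks is of course up to you):

#include <limits.h>

/* Fail loudly at compile time on any platform that violates our assumptions,
   instead of silently misbehaving at run time. */
#if CHAR_BIT != 8
#error "this code assumes 8-bit bytes"
#endif
#if UINT_MAX != 0xFFFFFFFF
#error "this code assumes 32-bit int"
#endif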

Besides, the vast majority of C programs aren't portable because of dependencies, not because of assumptions about the underlying architecture.

My feeling is that if this continues, we will eventually end up with a forked version of C that caters more to engineers who want predictable results in practical applications than to compiler engineers and academics who want to imagine theoretical architectures. In some ways this has already happened with the Linux kernel. Until then I'll probably stick with C89.