Jitter optimization for System.Initialize()

December 19, 2008

In the newsgroups there is a discussion about “Delphi compiler and CPU core usage”, and in a subthread the idea of jitted InitializeRecord, FinalizeRecord and CopyRecord functions was born.

So here are the first performance statistics for an initial (pure Pascal) implementation of such a Jitter for Initialize/InitializeRecord/InitializeArray.

type
  TTest = record
    I1: Integer;
    SA: array[0..1] of string;
    I2: Integer;
    S: string;
    Intf: IInterface;
    A: array of Byte;
  end;
 
var
  m: array[0..1024] of TTest;

First call to Initialize
Original: 0.067819 ms
Jittered: 0.042238 ms (this includes the time for the Jitter itself)
// _InitializeArray(@m, TypeInfo(TTest), Length(m));

Second call to Initialize
Original: 0.057681 ms
Jittered: 0.010007 ms
// _InitializeArray(@m, TypeInfo(TTest), Length(m));

Execution in a tight loop: 10000x
Original: 419.089185 ms
Jittered: 39.533007 ms
// for I := 0 to 10000 - 1 do
//  _InitializeArray(@m, TypeInfo(TTest), Length(m));

It is interesting that the Jitter can generate and execute the code in less time than the RTTI version can execute the initialization. This is because the RTTI version must process all array elements while the Jitter generates code for only one iteration and adds a loop to the code.

Generated code:

00200000 xor edx,edx
; inner loop for  "array[0..1] of string" (begin)
00200002 push eax
00200003 push ecx
00200004 mov ecx,$00000002
; array elements
00200009 mov [eax+$04],edx
; inner loop for  "array[0..1] of string" (end)
0020000C add eax,$04
0020000F dec ecx
00200010 jnz $00200009
00200012 pop ecx
00200013 pop eax
; record fields
00200014 mov [eax+$10],edx
00200017 mov [eax+$14],edx
0020001A mov [eax+$18],edx
; outer loop for array variables
0020001D add eax,$1c
00200020 dec ecx
00200021 jnz $00200002
00200023 ret
; alignment for the next method
00200024 nop
00200025 nop
00200026 nop
00200027 nop

This is only the start. I don’t think that the CopyRecord and FinalizeRecord functions will show the same increase in performance, because the cleanup routines (LStrClr, …) are the real time eaters. But I can only be sure once I have tested it.

But there is also a downside. The Jitter uses a hash table to find an already jitted Initialize function for the TypeInfo. And if the type is a simple type, the original Initialize will outperform the Jitter, because in the end both execute the same code but the original Initialize has less overhead. I’m sure this could be mitigated by optimizing the hash table access and some other tricks.

3 thoughts on “Jitter optimization for System.Initialize()”

  1. Raymond Wilson

    I think this is a good idea. The performance improvement seems quite significant.

    Frankly I’m astonished that this is not already done in the compiler. It makes no sense to use RTTI as a basis for doing something (anything) that is statically known at compile time.

    This sounds like an enhancement CodeGear should get onto for the next update.

    As a side note to the original thread:

    I’m surprised there was so much discussion on how to make Delphi use multiple cores to improve compilation speed. I have > 1.5mloc compiled into a single .EXE of ~10Mb in size. It takes less than 60 seconds for a full compile and link on a ~2GHz dual core laptop. Incremental compilations (which are probably > 90% of the compiles I do) take seconds.

    Think about that for a moment. How much are you really gaining? I’d rather CodeGear spend time making enhancements like those discussed here, or providing extra RTTI as discussed in the referenced thread (and here’s another reality check: Don’t even bother discussing it if it’s a 20-30% increase in binary size – just do it and provide a compilation flag to say whether you want it or not).

    With the current compilation performance of Delphi, I’d go for a compiled code performance improvements over compilation speed improvements any day.

    Enough said (otherwise you might think I was starting to rant 😉 ).

  2. Barry Kelly

    Raymond, doing things at compile time vs runtime – these kinds of things are space / speed tradeoffs. Also, small RTTI + simple function ought to lead to smaller code & less CPU cache usage, assuming that the RTTI is small enough to begin with, and the simple function is well-written.

    Re compilation speed: we (CodeGear) have many millions of lines of code, and it takes many minutes to compile. However, parallelizing the compiler is not an end in itself. It would be something to keep in mind in any larger changes to the codebase.

  3. Eric

    Barry, in terms of performance, CPU instruction cache usage has become much less significant since the days of speculative execution and branch prediction, i.e. way back in the early Pentium days for mainstream processors. And its significance has been going down even more with speculative prefetchers.
    If your code has intrinsic branch misprediction “patterns” built into it (such as when relying on RTTI, or more generally P-code implemented via tests rather than indirect jumps or inlining), you’re going to see low performance, regardless of how much time you will pour into trying to optimize that code.

    The branching performance hit is even worse on new processors like the Atom, whose simplified branch predictor has a higher misprediction rate.
