Sunday, October 7, 2012
C++'s bad things for performance critical applications
One of the most often disputed features of C++ are its exceptions, virtual functions, and runtime type information (RTTI). Many of the arguments around these are subject to bias, and I should note that this article may be subject to my bias as well. But I'd like to take the time to reflect on a few of the common claims about them:
- Virtual functions are slow!
- Exceptions are slow!
- RTTI is slow!
Tuesday, July 31, 2012
Accelerate your SIMD knowledge to accelerate your code: Tricks to blaze through SSE code
Often when dealing with SSE code you find yourself referring to the documentation to understand what an instruction does, which eats into your time. This is part of a mini-series I'm writing, "Accelerate your SIMD knowledge to accelerate your code". Let's begin, shall we?
1) Eat those instruction mnemonics
Learn the instruction mnemonic rules; doing so will greatly improve your speed at hand-writing SSE code.
SSE has five categories of instructions:
- Data movement instructions
- Arithmetic instructions
- Comparison instructions
- Logical instructions
- Shuffle instructions
All data movement instructions start with MOV, followed by one of [U/A/H/L]. Each of these four letters describes how the data moves:
U = unaligned
A = aligned
H = high (64 bits high)
L = low (64 bits low)
There is the combination HL (high to low) and the combination LH (low to high), which move SRC[H/L] -> DST[H/L]; finally, PS is appended to the chosen instruction. Knowing this, you can reason it out easily: "I need to move the upper 64 bits of this register into the lower 64 bits of this other register" becomes MOV (move) H (high) L (low) PS, or [MOVHLPS]. If you need an unaligned move of 128 bits, we know unaligned is U, so MOVUPS is the instruction we need.
2) Order your memory, and fries with that.
SSE greatly benefits from predictable memory patterns; it prefers linear access over arbitrary access. Normalizing vectors stored back-to-back in a flat array, advancing a float pointer four elements at a time, is faster than chasing pointers through a linked list of vectors.
Aliasing memory strangely stalls the pipeline: e.g. bouncing values back and forth between float and __m128 representations is wasteful. However, aliasing memory in a way SSE expects is fine. _mm_set_ps, for example, composes a __m128 from four floats; aliasing an aligned structure of four elements, or an array, to __m128 works just as well.
Example:
/* allowed */
float x = 1, y = 2, z = 3, w = 0;
typedef struct { float x, y, z, w; } vec3 __attribute__((aligned(16)));
vec3 v3 = { x, y, z, w };
__m128 p = *(__m128 *)&v3;
/* same as this */
float x = 1, y = 2, z = 3, w = 0;
__m128 p = _mm_set_ps(w, z, y, x); /* note: _mm_set_ps arguments run high to low */
Aliasing memory like this allows for flexibility.
3) Nuke the runtime selection
Testing CPUID at runtime to determine whether SSE is present is a common trick programs use for CPU compatibility. It is perfectly fine for selecting the correct optimized routines, but many people get it wrong.
Consider the following example:
void (*add)(const vec3*, const vec3*) = NULL;
int main() {
add = (HasSSE()) ? &sse_add : &scalar_add;
}
This code is badly wrong: the function pointer calls are indirect calls, and an indirect call on every tiny operation defeats inlining and pipelining, so this could very well end up slower than the scalar version.
The right way to do this involves two separate translation units, one per version of the function, and selecting the correct version once at runtime instead of paying for an indirect call inside every hot loop.
4) Punch a hole in your type system (C++ only)
When working with C++ and SSE intrinsics you often find yourself writing lots of explicit casts, like reinterpret_cast and const_cast. This is especially an issue with things like _mm_malloc, which allocates a suitably aligned chunk of memory for SSE operations. The problem is that it returns void*, and C++ doesn't allow void* to be implicitly converted to other pointer types. To work around this I've often found myself punching a hole in C++'s type system via a simple structure.
class voidptr_t {
    void *ptr;
public:
    voidptr_t(void *pptr) : ptr(pptr) {}
    template<typename T>
    operator T *() { return (T*)ptr; }
};
As such, you assign the result of _mm_malloc to a voidptr_t via voidptr_t::voidptr_t(void *), and the voidptr_t then implicitly converts itself to whatever pointer type you assign it to. (C++ permits only one user-defined conversion in a sequence, so the two conversions need to happen in separate steps.)
5) Forget about prefetching
Unless you're writing code for CPUs that predate the AMD Athlon XP or the Intel Pentium 4, you're wasting your time: virtually every CPU since includes automatic hardware prefetching, which is free and more accurate.
Sunday, June 24, 2012
EDID Extremely Disappointing Identification
Recently one of my online friends was having an issue getting his display to work as expected. After system updates, his display simply wouldn't go above standard VGA resolutions. Annoyed by this, he came to me for help, and after a while of searching through system logs we finally found the cause.
There it sat inside one of those hidden logs that very few know exist, and even fewer know how to find.
[ 11.695] (WW) NVIDIA(GPU-0): The EDID read for display device CRT-0 is invalid: the
[ 11.695] (WW) NVIDIA(GPU-0): checksum for EDID version 1 is invalid.
[ 11.695] (--) NVIDIA(GPU-0):
[ 11.695] (--) NVIDIA(GPU-0): Raw EDID bytes:
[ 11.695] (--) NVIDIA(GPU-0):
[ 11.695] (--) NVIDIA(GPU-0): 00 ff ff ff ff ff ff 00 5a 63 13 35 01 01 01 01
[ 11.695] (--) NVIDIA(GPU-0): 06 0e 01 03 1d 24 1b be 2a bb b8 a3 52 46 98 24
[ 11.695] (--) NVIDIA(GPU-0): 0f 48 4c ff ff 80 81 99 81 59 71 4f 61 59 a9 4f
[ 11.695] (--) NVIDIA(GPU-0): c1 40 cd 40 d1 40 86 3d 00 c0 51 00 30 40 40 a0
[ 11.695] (--) NVIDIA(GPU-0): 13 00 60 08 11 00 00 1e 00 00 00 ff 00 33 35 44
[ 11.695] (--) NVIDIA(GPU-0): 30 34 30 36 30 30 32 36 31 0a 00 00 00 fd 00 32
[ 11.695] (--) NVIDIA(GPU-0): b4 1e 61 18 00 0a 20 20 20 20 20 20 00 00 00 fc
[ 11.695] (--) NVIDIA(GPU-0): 00 47 39 30 66 2b 0a 20 20 20 20 20 20 20 00 e4
[ 11.695] (--) NVIDIA(GPU-0):
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 11.695] (--) NVIDIA(GPU-0): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
So what does this all mean? After searching through the relevant documents, which are free thanks to VESA (http://read.pudn.com/downloads110/ebook/456020/E-EDID%20Standard.pdf), we can conclude there is already an issue with this data (if it isn't already apparent from the checksum failure message). The trailing zero bytes clearly stand out, so are they causing the problem?
Thankfully the spec states that an EDID block should be exactly 128 bytes, so those trailing zeros are junk. Were they what was causing the checksum failure? Turns out yes. So let's rewrite it, shall we?
The first problem: where do we get a copy of this data to modify? There are plenty of extraction programs and ways to dump this data, but all of them are very specific to the driver, and even more specific to the operating system. So I simply typed each byte into a hex editor. Was I worried about making a mistake while retyping it? Of course, which is why I had to be sure the data I retyped was valid.
The EDID spec states that all 128 bytes of a block must sum to 0 modulo 256; the final byte (offset 127) is a checksum chosen to make that work out. So I did just that: I added up bytes 0 through 126, took the sum modulo 256, and subtracted it from 256 to get 0xE4, exactly the byte at the end of the file. From that I knew the retype was faithful.
So after all was said and done the binary blob I had looked something like this:
00000000 00 ff ff ff ff ff ff 00 5a 63 13 35 01 01 01 01 |........Zc.5....|
00000010 06 0e 01 03 1d 24 1b be 2a bb b8 a3 52 46 98 24 |.....$..*...RF.$|
00000020 0f 48 4c ff ff 80 81 99 81 59 71 4f 61 59 a9 4f |.HL......YqOaY.O|
00000030 c1 40 cd 40 d1 40 86 3d 00 c0 51 00 30 40 40 a0 |.@.@.@.=..Q.0@@.|
00000040 13 00 60 08 11 00 00 1e 00 00 00 ff 00 33 35 44 |..`..........35D|
00000050 30 34 30 36 30 30 32 36 31 0a 00 00 00 fd 00 32 |040600261......2|
00000060 b4 1e 61 18 00 0a 20 20 20 20 20 20 00 00 00 fc |..a... ....|
00000070 00 47 39 30 66 2b 0a 20 20 20 20 20 20 20 00 e4 |.G90f+. ..|
After finishing this I sent the file to my friend, and we hit another issue that I quickly resolved before he even realized I had no clue what I was getting into: how exactly do we apply this EDID file without overwriting the monitor's EEPROM? (There is nothing wrong with the monitor; this is merely a driver issue, which we confirmed since everything worked fine in Windows.) The solution is to override it in the driver through that really annoying xorg.conf file. So where's the nice trusty document for that? There isn't one, at least nothing as official as a nicely written and formatted PDF. Instead, some man page searching, plus some Google searching to offset the pain, turned up a nice Option override to specify your own EDID file.
So after asking him to move the file into /etc/X11 and giving it the name edid.bin I told him to add
Option "CustomEDID" "CRT-0:/etc/X11/edid.bin"
to his xorg.conf file under the "Device" section (as the documentation states). After starting X, everything was promising at first: it came up in a higher resolution by default, though not as high as he expected. After he fiddled inside nvidia-settings, the resolution options were finally available and trivial to change (something he couldn't do before).
After all was said and done, a job well done was expected, but instead I got this:
<hirato> I am so gay for you right now
Thursday, June 21, 2012
Writing a primitive dynamic loader for Linux
The process of dynamically loading libraries is a complex one. In fact, too complex to fully elaborate on in a blog post; thousands of man hours of research have gone into dynamic loading. Things like PIC (position independent code) take time to implement correctly and are often subject to bugs and performance issues. Of course, you're not here to listen to me ramble about it.
What I'm going to show you is the bread and butter of writing your own user-space dynamic loader, getting down to the bits and bytes of it. Before we continue you should have a working toolchain, which includes a compiler (gcc, clang, or pathscale will work) and binutils (objdump and objcopy).
Let's start with a simple library function:
int print_hello_world() {
return printf("Hello World!\n");
}
This is the function we're going to _dynamically_ load from our dynamic loader and execute. Let's begin by compiling it into an object file with gcc.
# gcc -c hello.c -o hello.o
I assume you know assembly (you're going to need it); if not, here's a small crash course. Assembly is the human-readable form of the machine's instruction set: each mnemonic corresponds directly to a byte encoding of an instruction that pushes and pops the stack, moves values between registers and memory, or performs basic arithmetic. Writing those raw bytes by hand is nearly impossible, which is why assemblers exist: they take an input assembly file (the human-readable version of the bytes) and assemble it into the machine-readable encoding.
The first thing we want to do is look at the disassembly of our hello.o object file. To do this we use objdump:
# objdump -d hello.o
0000000000000000 <print_hello_world>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: bf 00 00 00 00 mov $0x0,%edi
9: e8 00 00 00 00 callq e <print_hello_world+0xe>
e: 5d pop %rbp
f: c3 retq
Note: the output may vary depending on your architecture.
Let's get a few things figured out first. As you can see, we have some basic stack manipulation in this function in accordance with the SysV ABI. But the thing worth noting is the `callq`. The printf function doesn't exist at this point; at link time the linker would patch this with a call into the C library's printf. We don't want that. Why? Because printf itself is a library function living in a dynamically loaded library, so we can't use it. So how do we perform a printf without printf?
Welcome to the world of system calls, the mechanism for invoking the kernel. Every major OS has system calls of some sort, but Linux has a very specific way of performing one: in the 32-bit ABI, the syscall number goes in EAX and the arguments go in EBX, ECX, EDX, and so on. Once the registers contain the required values, executing int 128 (int $0x80) performs the system call.
So which system call does printf use? None directly; instead there is write, which writes to a file descriptor. The write system call expects the fd to write to, the data, and the size of the data. So let's rewrite our hello.c file to accommodate that.
int print_hello_world() {
const char hello[] = "Hello World!\n";
const int size = sizeof(hello);
int ret;
__asm__ __volatile__ (
"movl $4, %%eax\n\t"
"movl $1, %%ebx\n\t"
"movl %1, %%ecx\n\t"
"movl %2, %%edx\n\t"
"int $0x80" :
"=a"(ret) : "g"(hello), "g"(size) :
"%ebx", "%ecx", "%edx", "%esi", "%edi"
);
return ret;
}
Looks fancy, doesn't it? Let's work out this assembly first:
- the movl of 4 loads the syscall number for write
- the movl of 1 loads the file descriptor for stdout
This code will output "Hello World!\n" to stdout. Pretty awesome eh?
Of course now we need to take a look at the assembly of this.
0000000000000000 <print_hello_world>:
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 41 54 push %r12
6: 53 push %rbx
7: c7 45 d0 48 65 6c 6c movl $0x6c6c6548,-0x30(%rbp)
e: c7 45 d4 6f 20 57 6f movl $0x6f57206f,-0x2c(%rbp)
15: c7 45 d8 72 6c 64 21 movl $0x21646c72,-0x28(%rbp)
1c: 66 c7 45 dc 0a 00 movw $0xa,-0x24(%rbp)
22: c7 45 ec 0e 00 00 00 movl $0xe,-0x14(%rbp)
29: 48 8d 45 d0 lea -0x30(%rbp),%rax
2d: 48 89 45 c8 mov %rax,-0x38(%rbp)
31: b8 04 00 00 00 mov $0x4,%eax
36: bb 01 00 00 00 mov $0x1,%ebx
3b: 8b 4d c8 mov -0x38(%rbp),%ecx
3e: 8b 55 ec mov -0x14(%rbp),%edx
41: cd 80 int $0x80
43: 41 89 c4 mov %eax,%r12d
46: 44 89 65 e8 mov %r12d,-0x18(%rbp)
4a: 8b 45 e8 mov -0x18(%rbp),%eax
4d: 5b pop %rbx
4e: 41 5c pop %r12
50: 5d pop %rbp
51: c3 retq
Pay special attention to the encoding, not the instructions: the bytes on the left are what the CPU actually understands and executes, and each encoding carries the meaning shown on the right.
Normally there would be a problem with this code. This particular function happens to be position independent, but it might not be (for example, if it relied on data outside the function). Note the three successive movl's at the top: they encode the "Hello World" string and its size inline, but they could just as easily have been addresses. Blindly loading code that reads memory through absolute addresses and executing it would cause a segmentation fault, since we haven't mapped that memory. We need a surefire way to make sure the code is position independent (even if it already is). Doing that properly requires a lot of work: relocation tables, and patching of all those addresses. Instead, we can pass the information in through registers from the dynamic loader (this isn't very useful in general, but this tutorial only aims to explain and implement a primitive dynamic loader).
So let's make this code PIC by rewriting it to something like this:
int print_hello_world(const char *hello, int size, int *ret) {
__asm__ __volatile__ (
"movl $4, %%eax\n\t"
"movl $1, %%ebx\n\t"
"movl %1, %%ecx\n\t"
"movl %2, %%edx\n\t"
"int $0x80" :
"=a"(*ret) : "g"(hello), "g"(size) :
"%ebx", "%ecx", "%edx", "%esi", "%edi"
);
return *ret;
}
As you can see, no data lives outside the function: everything arrives in registers, so the code is trivially PIC. Of course you cannot just assume this; let's check the disassembly.
# objdump -d hello.o
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
4: 53 push %rbx
5: 48 89 7d f0 mov %rdi,-0x10(%rbp)
9: 89 75 ec mov %esi,-0x14(%rbp)
c: 48 89 55 e0 mov %rdx,-0x20(%rbp)
10: b8 04 00 00 00 mov $0x4,%eax
15: bb 01 00 00 00 mov $0x1,%ebx
1a: 8b 4d f0 mov -0x10(%rbp),%ecx
1d: 8b 55 ec mov -0x14(%rbp),%edx
20: cd 80 int $0x80
22: 41 89 c0 mov %eax,%r8d
25: 48 8b 45 e0 mov -0x20(%rbp),%rax
29: 44 89 00 mov %r8d,(%rax)
2c: 48 8b 45 e0 mov -0x20(%rbp),%rax
30: 8b 00 mov (%rax),%eax
32: 5b pop %rbx
33: 5d pop %rbp
34: c3 retq
Hey, look! Not bad: everything is happening in registers. This is PIC!
Now we need to break this down further. The object files we're working with are in ELF format and not exactly easy to load. What if we just retyped the bytes on the left side of that disassembly into a file manually? We'd have a flat binary, right? Wishful thinking; that would take too long. This is what objcopy is for.
# objcopy --only-section=.text --output-target binary hello.o hello.bin
We specify that we only want the .text section, i.e. the code (the other sections are useless here). We can then validate the bytes against the objdump output using hexdump:
# hexdump -C hello.bin
00000000 55 48 89 e5 53 48 89 7d f0 89 75 ec 48 89 55 e0 |UH..SH.}..u.H.U.|
00000010 b8 04 00 00 00 bb 01 00 00 00 8b 4d f0 8b 55 ec |...........M..U.|
00000020 cd 80 41 89 c0 48 8b 45 e0 44 89 00 48 8b 45 e0 |..A..H.E.D..H.E.|
00000030 8b 00 5b 5d c3 |..[].|
00000035
As we can see, the encoding matches the objdump output. We now have loadable code that we can EXECUTE! Let's write our dynamic executor:
#include <stdio.h>
#include <stdlib.h>
unsigned char *mem;
int i;
long size;
int ret;
int main() {
    FILE *fp = fopen("hello.bin", "rb");
    if (!fp) {
        printf("[ld] Failed to load hello.bin\n");
        return -1;
    }
    /* find the size of the flat binary */
    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    fseek(fp, 0, SEEK_SET);
    printf("[ld] Loading %ld byte program\n", size);
    if (!(mem = malloc(size))) {
        fclose(fp);
        printf("[ld] Failed to allocate memory for code\n");
        return -1;
    }
    printf("[ld] Reserved %ld bytes for program\n", size);
    if (fread(mem, 1, size, fp) != (size_t)size) {
        fclose(fp);
        free(mem);
        printf("[ld] Failed to read hello.bin\n");
        return -1;
    }
    printf("[ld] Loaded program\n");
    fclose(fp);
    printf("    ");
    for (i = 1; i < size + 1; i++)
        printf((i % 8 == 0) ? "0x%02X\n    " : "0x%02X, ", mem[i - 1]);
    printf("\n[ld] Executing ...\n");
    /* treat the loaded bytes as our function; the cast bridges the
       object-pointer/function-pointer mismatch */
    int (*call)(const char *, int, int *) = (int (*)(const char *, int, int *))mem;
    call("Hello World\n", sizeof("Hello World\n"), &ret);
    free(mem);
    printf("[ld] Called function; returned %d strlen(\"Hello World\")\n", ret);
    return ret;
}
Now here's the problem if we compile and execute this code:
# gcc run.c -o run
# ./run
[ld] Loading 53 byte program
[ld] Reserved 53 bytes for program
[ld] Loaded program
0x55, 0x48, 0x89, 0xE5, 0x53, 0x48, 0x89, 0x7D
0xF0, 0x89, 0x75, 0xEC, 0x48, 0x89, 0x55, 0xE0
0xB8, 0x04, 0x00, 0x00, 0x00, 0xBB, 0x01, 0x00
0x00, 0x00, 0x8B, 0x4D, 0xF0, 0x8B, 0x55, 0xEC
0xCD, 0x80, 0x41, 0x89, 0xC0, 0x48, 0x8B, 0x45
0xE0, 0x44, 0x89, 0x00, 0x48, 0x8B, 0x45, 0xE0
0x8B, 0x00, 0x5B, 0x5D, 0xC3,
[ld] Executing ...
Segmentation fault
Wait, what? Segmentation fault? Something is preventing us from executing this code. This is a security feature, and it's simple to work around: first we tell GCC not to build with stack protection, then we pass the execstack flag to ld so we can execute code from the stack. Let's try again:
# gcc -fno-stack-protector -z execstack run.c -o run
# ./run
[ld] Loading 53 byte program
[ld] Reserved 53 bytes for program
[ld] Loaded program
0x55, 0x48, 0x89, 0xE5, 0x53, 0x48, 0x89, 0x7D
0xF0, 0x89, 0x75, 0xEC, 0x48, 0x89, 0x55, 0xE0
0xB8, 0x04, 0x00, 0x00, 0x00, 0xBB, 0x01, 0x00
0x00, 0x00, 0x8B, 0x4D, 0xF0, 0x8B, 0x55, 0xEC
0xCD, 0x80, 0x41, 0x89, 0xC0, 0x48, 0x8B, 0x45
0xE0, 0x44, 0x89, 0x00, 0x48, 0x8B, 0x45, 0xE0
0x8B, 0x00, 0x5B, 0x5D, 0xC3,
[ld] Executing ...
Hello World
[ld] Called function; returned 13 strlen("Hello World")
Look at that, we printed Hello World! I should note this isn't the same as dynamic loading, but it's very close in principle. I hope you've learned how all of this works and can adapt it if you ever implement a dynamic loader or even a linker! I should disclaim here that real dynamic linkers are a lot more complex: they handle relocation tables to relocate data, and do PIC via binary patching and other involved methods. But this is a good start!
Some other subtle notes that aren't covered here:
1. The instruction cache should be flushed along with the data cache, since there is no guarantee the two will be in sync after writing code as data.
2. The memory for the code should be allocated with mmap and marked executable with mprotect (instead of using GCC trickery like I did).
Friday, April 6, 2012
Addition the bit hack way
Disclaimer: I would ask that none of this code actually be used in real software; it offers nothing over the hardware. If your CPU has an ADD instruction, use it. This is just a teaching aid.
Addition is a basic mathematical operation which can be emulated with some bitwise operations. Let's look at a canonical implementation:
uint8_t add(uint8_t x, uint8_t y) {
uint8_t a;
uint8_t b;
do {
a = x & y;
b = x ^ y;
x = a << 1;
y = b;
} while (a);
return b;
}
To understand what is going on, let's follow a basic example:
Let in:x = 00000011 (3)
Let in:y = 00000100 (4)
When we invoke this add function, the steps that take place are:
- AND -- a = (x&y)
- XOR -- b = (x^y)
- LSHFT -- x = (a<<1)
Resulting in the following:
- x = 00000011
- y = 00000100
- a = 00000000
- b = 00000111
- x = 00000000
- y = 00000111 -- the XOR gave us 7, and since a is zero the loop exits and b is returned
This canonical implementation works, but it's slow. We can make it slightly faster; to see why, let's take a look at the assembly this code translates to (note: all code is x86_64, unoptimized):
add:
pushq %rbp
movq %rsp, %rbp
movl %edi, %edx
movl %esi, %eax
movb %dl, -20(%rbp) ; y
movb %al, -24(%rbp) ; x
.loop:
movzbl -24(%rbp), %eax
movzbl -20(%rbp), %edx
andl %edx, %eax
movb %al, -1(%rbp)
movzbl -24(%rbp), %eax
movzbl -20(%rbp), %edx
xorl %edx, %eax
movb %al, -2(%rbp)
movzbl -1(%rbp), %eax
addl %eax, %eax
movb %al, -20(%rbp)
movzbl -2(%rbp), %eax
movb %al, -24(%rbp)
cmpb $0x0, -1(%rbp)
jne .loop
movzbl -2(%rbp), %eax
popq %rbp
ret
There are 9+(15*iterations) instructions here. Assuming the best-case scenario, that is 24 instructions. We can refine our implementation to be slightly faster.
void add(uint8_t *x, uint8_t *y) {
do {
*y = (*x^*y);*x = (*x&(*x^*y)) << 1;
} while (*x);
}
In this version we've removed the return and replaced our arguments with pointers, storing the result of the addition in *y and using no temporaries. This is (best case) 14 instructions fewer:
add:
movzbl (%rsi), %eax
.loop:
movzbl (%rdi), %ecx
movl %eax, %edx
andl %ecx, %edx
xorl %ecx, %eax
leal (%rdx,%rdx), %ecx
testb %dl, %dl
movb %cl, (%rdi)
movb %al, (%rsi)
jne .loop
rep
ret
That concludes addition by bit hacking. I've played around with many other methods, but this is 3+(9*iterations) instructions, or best case 11 instructions. I challenge anyone to beat this for the best case. Must be C code (no assembly); post your solution in the comments. Hint: abuse the stack.
Of course, since this function is small we can most likely make it inline (which could allow for better optimization). I'd personally prefer a macro over the inline keyword.
#define add(x,y) do{*y=(*x^*y),*x=(*x&(*x^*y))<<1;} while(*x)
Thursday, April 5, 2012
Checking if divisible by 10 or 5 the bitwise way.
My problem: check if something was divisible by 10. The rules were that I couldn't use the modulus, division, or multiplication operators; the check had to be a single expression, with no branches or loop constructs; pointer arithmetic wasn't allowed, and neither was the use of floating-point types. The code had to work on an x86 CPU, and had to be ANSI C.
After spending a few hours I've come up with the following solution:
((x<<0x1F)+(x<<0x1E)+(x<<0x1B)+(x<<0x1A)+
(x<<0x17)+(x<<0x16)+(x<<0x13)+(x<<0x12)+
(x<<0x0F)+(x<<0x0E)+(x<<0x0B)+(x<<0x0A)+
(x<<0x07)+(x<<0x06)+(x<<0x03)+(x<<0x02)+x<0x33333334)&&!(x&1)
Now, this isn't exactly a fast solution, especially on a P4, which lacks a barrel shifter. However, this solution led me to a faster one. The rules said I couldn't use the multiplication operator, but after removing those shifted adds and replacing them with a multiply by the constant 0xCCCCCCCD, I had the following:
(x*0xCCCCCCCD) < 0x33333334 &&!(x&1)
Much to my surprise, this is faster than using modulus (x%10) as well as the division operator. MUL on x86 is actually the fastest way to do division by NPOT (non-power-of-two) constants. This can be proved with:
(x * 5^-1) < 0x33333334
Explaining how this works is quite simple: the constant 0xCCCCCCCD is the multiplicative inverse of 5 modulo 2^32, so multiplying by it maps multiples of 5 into the small range the comparison checks. Of course I couldn't stop here. I took it a bit further and came up with the following code:
!((0x19999999 * x + (x >> 1) + (x >> 3)) >> 28) [
"\x0\x1\x2\x2\x3\x3\x4\x5\x5\x6\x7\x7\x8\x8\x9\x0"
]
This is a lot more complicated to prove, mainly because it's simply odd code. To get the most obvious oddity sorted out: C allows both array[idx] and idx[array], because both decay to *(idx+array) and *(array+idx), which are equivalent. Now for the explanation:
The table here contains the following: 0,1,2,2,3,3,4,5,5,6,7,7,8,8,9,0
This maps the multiples of 5 to either 0 or 8. Better expressed as: (8/5 * x) % 16 (the error term is never too big). The real equivalency being:
((8/5 + 0x0.00000002/5) * x - delta) % 16 for delta in [0, 0.625/2^28]
This one uses a lot more operations to do the same thing: 3 shifts, 2 adds, and one MUL, yet it is still a lot faster on a Pentium D than the former. Of course, I took it two more steps further.
The rules said it had to be ANSI C, but I sort of broke that bit by dropping to some x86 assembly (which yielded the best solutions).
MUL x, 0xCCCCCCCD, x
ROR x, 1
CMP x, 0x19999999
JLE hit_table ; same table {0,1,2,2,3,3,4,5,5,6,7,7,8,8,9,0}
This is the same to the CPU as the one above, but expressed in a simpler fashion. Essentially the observation is: x * 0xCCCCCCCD is even exactly when x is even (because 0xCCCCCCCD is odd), so rotating right by one to move the last bit to the first and comparing against 0x19999999 gives the same answer, provided you have ROR. So I rewrote it another way for ROR-less systems.
This method does the same as the first, which I find a lot nicer; too many MOVs though, and the real power comes from the x86 LEA instruction:
MOV %edx, 0x33333334
MOV %ebx, %eax
MUL %edx
AND %edx, x
LEA %edx, [%edx+%edx*4]
The reason this works is that the x86 MUL instruction puts its double-width result in two registers; the sad part is that you cannot write this in C.
But in the end the winner for performance ended up being:
(x * 0xCCCCCCCD) < 0x33333334 && !(x&1)