I got this article from 29A, a virus e-magazine, so it's NOT written by me, so DON'T nag me for anything...
BTW, add any other ways of optimization if you know some...
Cheers
belgther
┌─ Optimization of 32bit code ─┐
│              by              │
└──────── Benny / 29A ─────────┘
┌───────────────┐
│ 1. Disclaimer │
└───────────────┘
The followin' document is for educational purposes only. The author isn't
responsible for any misuse of the things written in this document.
┌─────────────┐
│ 2. Foreword │
└─────────────┘
Eeeh, why da (filtered) did I write this article ? There r many documents about
optimization. Yes, that's true, and there r many very gewd and kewl tutes
[* Billy, your tute rox! *]. But as u can see, not every tute keeps in
mind that the term "optimize" doesn't only mean that your code will be
small. There r many aspects of optimization and I wanna discuss them here and
give u a complete view of the thing.
When I started to write this article, I was really drunk and totally under the
drugs (hehe, no lie X-D), so if u feel I made any mistake, or u think things
written here aren't true, or u simply wanna give me some credits (do it
please X-D), u can find me on IRC UnderNet, channels #vir and/or #virus, or
mail benny@post.cz. Thanx for all positive (and also negative) comments.
┌─────────────────┐
│ 3. Introduction │
└─────────────────┘
As I said some seconds before, optimization has many aspects.
Generally, we can optimize our code so that:
- code will be smaller
- code will be faster
- code will be smaller and faster
Well, it gives us some new space for thinkin'. In practice, when we optimize
our code, we end up with one of these:
- code will be smaller, but also slower
- code will be bigger, but faster
- code will be smaller and faster
We should find a compromise between the first and the second point (if we
can't reach the third one). I'm sure u don't wanna alert the user by slowin'
down system performance due to:
- huge and unoptimized code
- small, but slow code
or alert the user by rapidly decreasin' space on the disk.
It's up to us which way we choose. Here we have a clue:
- if our code (or block of code, e.g. thread procedure) is
small, we should optimize it for speed
- if our code (or block of code) is big, we should optimize
it for size and speed (find a compromise, prefer speed)
Ideally, we optimize our code by both decreasin' its size and increasin'
its speed, but u know how difficult that is.
Is it clear ? I think u already knew this. But still, there r many more
aspects of optimization. We have for example two instructions (or short
sequences) that do the same thing, but:
- one is bigger
- one is slower
- one changes other registers
- one writes to memory
- one changes flags
- one is faster on one processor, but slower on
another one
Example:   LODSB              MOV AL, [ESI] + INC ESI
-------------------------------------------------------------------
size:      smaller            bigger
speed:     faster on 80386    faster on 80486, on Pentium only 1 cycle
flags:     preserved          changed
LODSB wins on 80386, so why does the MOV + INC pair take only 1 cycle on
Pentium ? Pentium is a superscalar processor with two pipelines (pipes), so
it can execute a pair of some integer instructions simultaneously, one in
each pipe. Two instructions that can be executed simultaneously r called
"pairable instructions".
Hehe, don't worry, this article won't be about Pentium processor
architecture, so u can forget the words I said about pipes. Maybe l8r, if I
write another article about Pentium processor optimization, I will explain
in more detail terms such as pipes, V-pipe, U-pipe, pairin' and so on. For
now, u can forget them. Just remember what the word "pairin'" means.
Now, I will discuss every optimization technique step by step.
┌────────────────────────┐
│ 4. Optimizin' our code │
└────────────────────────┘
Well, let's go optimize. I will start from the easiest operation.
Beginners, hold on...
4.1. Zero register
────────────────────
I don't wanna see this anymore:
1) mov eax, 00000000h ;5 bytes
This is the worst instruction I've ever seen. Well, it seems
logical to move zero to the register, but u can do it
in a more optimized way, like this:
2) sub eax, eax ;2 bytes
or
3) xor eax, eax ;2 bytes
3 bytes saved on one instruction, great ! X-D But what's better
to use, SUB or XOR ? Both r 2 bytes and both change flags. I prefer XOR,
coz Micro$oft prefers SUB and I know that Windozes r slooooow, hehe.
Noo, that's not the true reason. What do u think is easier (for u):
to subtract two numbers, or to say "where there's 1 and 1, write 0" ?
So u know why I prefer XOR (as I hate mathematix X-D).
4.2. Test if register is zero
───────────────────────────────
Hmmm, let's see the brightest solution:
1) cmp eax, 00000000h ;5 bytes
je _label_ ;2/6 bytes (short/near)
[* NOTE: Many arithmetic instructions have a shorter form for register EAX,
so code usin' the EAX register will be faster and smaller.
Example: CMP EAX, 12345678h (5 bytes). If I used another register
instead of EAX, the CMP instruction would have 6 bytes. The 5 bytes above
assume the full 32bit immediate; an assembler that picks the sign-extended
8bit immediate form emits only 3 bytes for CMP EAX, 0 *]
Argh! Who normal would do this ? That's 7 or 11(!) bytes for a simple
comparison. No, no, no, don't do it and try this:
2) or eax, eax ;2 bytes
je _label_ ;2/6 (short/near)
or
3) test eax, eax ;2 bytes
je _label_ ;2/6 (short/near)
Hmm, much better, 4/8 bytes is really better than 7/11 bytes. So,
again, what's better, OR or TEST ? Micro$oft prefers OR, so again, I
prefer TEST |-). Now seriously, TEST doesn't write to the register (OR
does), so there will be better pairin' => faster code. I hope u still
remember what the word "pairin'" means... If not, read the
Introduction section again.
Now, the biggest magic. If u don't care of ECX register or u don't
care, where will be stored content of registers (EAX and ECX), u can
do it this way:
4) xchg eax, ecx ;1 byte
jecxz _label_ ;2 bytes
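[* NOTE: XCHG has a short form for the EAX register, so if one operand of
XCHG is EAX, it will be 1 byte long, otherwise 2 bytes. JECXZ exists only
as a SHORT jump *]
Great! That's 3 bytes, so we saved 4 bytes against the first version
(and 1 byte against the TEST/JE version).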
4.3. Test if register is 0FFFFFFFFh
─────────────────────────────────────
Many APIs return -1 when the function fails, so it is important to
test for this value. I'm always astonished when I see how some
coders test for this value, like this:
1) cmp eax, 0ffffffffh ;5 bytes
je _label_ ;2/6 bytes
I hate this. And now look, how can it be optimized:
2) inc eax ;1 byte
je _label_ ;2/6 bytes
dec eax ;1 byte
Yes, yes, yes, we saved 3 bytes and made the code faster. DEC restores
EAX when the jump isn't taken; when it is taken, EAX holds 0 instead of -1.
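Why the INC trick works can be checked with a tiny model. This is only a sketch in Python, not part of the original article; the names MASK32, inc32 and is_minus_one r made up for illustration. It models INC on a 32bit register and shows that the zero flag comes out set only for 0FFFFFFFFh.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def inc32(value):
    """Model of INC on a 32bit register: returns (result, zero_flag)."""
    result = (value + 1) & MASK32  # wraps around like the real register
    return result, result == 0


def is_minus_one(eax):
    """INC EAX / JE: the jump is taken only when EAX was 0FFFFFFFFh."""
    _, zero_flag = inc32(eax)
    return zero_flag


print(is_minus_one(0xFFFFFFFF))  # the "API returned -1" case
print(is_minus_one(0x00000001))  # any other value
```

The wrap-around is the whole point: only 0FFFFFFFFh + 1 overflows to zero.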
4.4. Move 0FFFFFFFFh to register
──────────────────────────────────
Some APIs need as parameter -1 value. Let's see, how can we set it:
Least optimized:
1) mov eax, 0ffffffffh ;5 bytes
More optimized:
2) xor eax, eax / sub eax, eax ;2 bytes
dec eax ;1 byte
Or this with same result (by Super/29A):
3) stc ;1 byte
sbb eax, eax ;2 bytes
This code is very useful in some cases, e.g. when the carry flag is
already set at that point, so STC isn't needed:
jnc _label_
sbb eax, eax ;2 bytes only!
_label_: ...
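A minimal sketch of what SBB EAX, EAX computes, written as a Python model (the names MASK32 and sbb_same_register r hypothetical, not from the article). SBB subtracts the source and the carry flag from the destination, so with both operands equal the result depends on the carry flag only.

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def sbb_same_register(carry_flag):
    """Model of SBB EAX, EAX: EAX - EAX - CF, truncated to 32 bits."""
    return (0 - int(carry_flag)) & MASK32


print(hex(sbb_same_register(True)))   # carry set
print(hex(sbb_same_register(False)))  # carry clear
```

The old content of EAX never matters, which is why no XOR is needed first.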
4.5. Zero register and move something to LSW
──────────────────────────────────────────────
Example of unoptimized code:
1) xor eax, eax ;2 bytes
mov ax, word ptr [esi+xx] ;4 bytes
386+ supports an instruction called MOVZX (MOVe with Zero eXtension).
[* NOTE: MOVZX is faster on 386, on 486+ it is slower *] Example of
optimized code, where we can save 2 bytes:
2) movzx eax, word ptr [esi+xx] ;4 bytes
Next example of "ugly code":
3) xor eax, eax ;2 bytes
mov al, byte ptr [esi+xx] ;3 bytes
Now we can save valuable 1 byte X-D:
4) movzx eax, byte ptr [esi+xx] ;4 bytes
This is very effective, when u r readin' bytes/words from PE header.
Becoz u need to work with bytes/words/dwords altogether, MOVZX is
the best for this case.
And last example:
5) xor eax, eax ;2 bytes
mov ax, bx ;3 bytes
Better use this formula, which discards 2 bytes:
6) movzx eax, bx ;3 bytes
I use MOVZX every time I can. It is small and it isn't as slow
as some other instructions.
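To see that MOVZX really replaces the XOR + MOV pair, here is a small Python model (a sketch only; movzx32 and xor_then_mov_ax r illustrative names, not from the article). Both paths must leave the same value in EAX, no matter what EAX held before.

```python
def movzx32(source, source_bits):
    """Model of MOVZX r32, r/m8 or r/m16: keep the low bits, zero the rest."""
    return source & ((1 << source_bits) - 1)


def xor_then_mov_ax(old_eax, word):
    """Model of XOR EAX, EAX followed by MOV AX, word."""
    eax = old_eax ^ old_eax                     # xor eax, eax -> 0
    eax = (eax & 0xFFFF0000) | (word & 0xFFFF)  # mov ax, word (upper half kept)
    return eax


print(hex(movzx32(0xBEEF, 16)))
print(hex(xor_then_mov_ax(0x12345678, 0xBEEF)))
```

Without the XOR, MOV AX would leave the old upper 16 bits of EAX in place; MOVZX clears them in the same instruction.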
4.6. Push shit
────────────────
Tell me, how will u store 50h to EAX...
----------------------------------------
Badly:
1) mov eax, 50h ;5 bytes
Better:
2) push 50h ;2 bytes
pop eax ;1 byte
Usin' PUSH and POP is a little slower, but smaller too. When the operand
fits in a signed byte (-128..127), PUSH takes 2 bytes. Otherwise it takes
5 bytes.
Let's try another thing. Push 7x 0 to stack...
-----------------------------------------------
Unoptimizely:
3) push 0 ;2 bytes
push 0 ;2 bytes
push 0 ;2 bytes
push 0 ;2 bytes
push 0 ;2 bytes
push 0 ;2 bytes
push 0 ;2 bytes
Optimizely, but still biggy X-D:
4) xor eax, eax ;2 bytes
push eax ;1 byte
push eax ;1 byte
push eax ;1 byte
push eax ;1 byte
push eax ;1 byte
push eax ;1 byte
push eax ;1 byte
Compactly, but slower:
5) push 7 ;2 bytes
pop ecx ;1 byte
_label_: push 0 ;2 bytes
loop _label_ ;2 bytes
Wow, without any pain, we saved 7 bytes
And now, life story... U wanna move something from one variable
into another variable. All registers must be preserved.
U probably do this:
----------------------------------------------------------------
6) push eax ;1 byte
mov eax, [ebp + xxxx] ;6 bytes
mov [ebp + xxxx], eax ;6 bytes
pop eax ;1 byte
And now, usin' only stack, no registers:
7) push dword ptr [ebp + xxxx] ;6 bytes
pop dword ptr [ebp + xxxx] ;6 bytes
This is useful, when u haven't any register free to use. I use it,
when I wanna save old entrypoint to another variable...
8) push dword ptr [ebp + header.epoint] ;6 bytes
pop dword ptr [ebp + originalEP] ;6 bytes
This saves wonderful 2 bytes |-). Though it is a little slower than
normal manipulation via EAX (without savin' it), it still comes in handy
when u don't wanna (or can't) use any register.
4.7. Multiply fun
───────────────────
Tell me, how u will calculate offset of last section, when u have
in EAX number_of_sections-1 ?
Badly:
1) mov ecx, 28h ;5 bytes
mul ecx ;2 bytes
Better:
2) push 28h ;2 bytes
pop ecx ;1 byte
mul ecx ;2 bytes
Much better:
3) imul eax, eax, 28h ;3 bytes
What does IMUL do ? This form of IMUL multiplies the second operand by the
third operand (a constant) and stores the result in the first register
(EAX here). So u can multiply EBX by 28h and store the result in EAX
like this:
4) imul eax, ebx, 28h
Simple and effective (in size and in speed). I don't wanna imagine how
u would do this with the MUL instruction... X-D
4.8. Stringz in action
────────────────────────
I wanna jump into the wall when I see unoptimized string operations.
Here u have some hints, how can u optimize your code usin' string
instructions. Do it please, or I will really do it ! X-D
Startin' from the scratch, how can u load a byte ?
---------------------------------------------------
Faster:
1) mov al, [esi] ;2 bytes
inc esi ;1 byte
Smaller:
2) lodsb ;1 byte
I recommend usin' the *Smaller* version. It is a one byte instruction
that does the same job as the *Faster* version (with the direction flag
clear). LODSB is faster on 80386, but much slower on 80486+. On Pentium,
*Faster* takes one cycle due to pairin'. However, I think the best to use
is still the *Smaller* version.
And how can u load a word ? Ehrm, DO NOT load words, it's too
slow in a 32bit environment such as Win32. But if u seriously wanna
load it, here is the clue...
-----------------------------------------------------------------
Faster:
3) mov ax, [esi] ;3 bytes
add esi, 2 ;3 bytes
Smaller:
4) lodsw ;2 bytes
Whata 'bout speed and size ? See previous description (LODSB).
Aaaah, loadin' dwords is always funny. Look at this:
-----------------------------------------------------
Faster:
5) mov eax, [esi] ;2 bytes
add esi, 4 ;3 bytes
Smaller:
6) lodsd ;1 byte
See description of LODSB.
And next very useful thing... Movin' something from somewhere
to somewhere. It's in fact LODSB/LODSW/LODSD + STOSB/STOSW/STOSD.
Here is the example of MOVSD:
------------------------------------------------------------------
Faster:
7) mov eax, [esi] ;2 bytes
add esi, 4 ;3 bytes
mov [edi], eax ;2 bytes
add edi, 4 ;3 bytes
Smaller:
8) movsd ;1 byte
*Faster* is faster on 486+, *Smaller* is smaller (and doesn't touch EAX).
Finally, I would like to say that u should always load dwords instead of
bytes or words, coz u run a 386+ processor, which is 32bit. I.e. your
processor worx with 32 bits, so if u wanna work with one byte, the
processor must load a dword and then truncate it. Aaaa, too much work,
so if it's not necessary to use bytes/words, don't use them.
Next fun... how can u get the end of string ?
----------------------------------------------
Here is the JQwerty's method:
9) lea esi, [ebp + asciiz] ;6 bytes
s_check: lodsb ;1 byte
test al, al ;2 bytes
jne s_check ;2 bytes
And Super's method:
10) lea edi, [ebp + asciiz] ;6 bytes
xor al, al ;2 bytes
s_check: scasb ;1 byte
jne s_check ;2 byte
Now, which is the best one ? Hmmm, hard to say the truth... X-D
On 80386+ Super's method is faster, but on Pentiums, JQwerty's method
is faster due to pairin'. Hehe, both methods have the same size (11 bytes),
so choose which one u would like to use... |-)
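Both loops can be modelled in a few lines to check that they stop at the same place. This is a sketch in Python, not the authors' code; end_via_lodsb and end_via_scasb r made-up names, memory is a plain bytes object and the returned number plays the role of ESI/EDI after the loop.

```python
def end_via_lodsb(memory, esi):
    """Model of the LODSB / TEST AL, AL / JNE loop."""
    while True:
        al = memory[esi]  # lodsb: load the byte at ESI ...
        esi += 1          # ... and advance ESI
        if al == 0:       # test al, al / jne s_check
            return esi


def end_via_scasb(memory, edi):
    """Model of the XOR AL, AL / SCASB / JNE loop."""
    al = 0                           # xor al, al
    while True:
        matched = memory[edi] == al  # scasb: compare AL with the byte at EDI ...
        edi += 1                     # ... and advance EDI
        if matched:                  # jne s_check
            return edi


data = b"abc\x00xyz"
print(end_via_lodsb(data, 0), end_via_scasb(data, 0))
```

In both cases the pointer ends up one byte past the terminatin' zero, so subtract 1 if u need the address of the zero itself.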
4.9. Complex aritmetix
────────────────────────
Now my favourite stuff. It's a pity that this great technique hasn't
found usage among VX coderz. However, the instructions I wanna talk about
r WELL KNOWN (heh, but still, noone knows how smoothly they run and
what more they can do), VERY SMALL and VERY FAST on every processor.
Imagine u have a table of DWORDs. Pointer to the table is stored in the EBX
register, index into the table is in ECX. U wanna increment the ECX-th
dword in the table, i.e. the one at address EBX+(4*ECX). U don't want to
modify any register.
U can do it this way (everybody does it):
1) pushad ;1 byte
imul ecx, ecx, 4 ;3 bytes
add ebx, ecx ;2 bytes
inc dword ptr [ebx] ;2 bytes
popad ;1 byte
Or do it better (nobody does it):
2) inc dword ptr [ebx+4*ecx] ;3 bytes
This really rox !!! U saved processor time (this is very fast), space
in memory (very small, as u can see) and make better readable your
source code !!! U saved 6 bytes by simple ONE INSTRUCTION !!!
That's not all (not all for the INC instruction). Imagine another
situation: EBX - pointer to memory, ECX - index into the table, and the
table starts 4096 bytes past EBX, so u wanna increment the dword at
EBX+(4*ECX)+1000h. Yeah, and u wanna preserve all registers. U can do it
unoptimized like this:
3) pushad ;1 byte
imul ecx, ecx, 4 ;3 bytes
add ebx, ecx ;2 bytes
add ebx, 1000h ;6 bytes
inc dword ptr [ebx] ;2 bytes
popad ;1 byte
Or very optimizely...
4) inc dword ptr [ebx+4*ecx+1000h] ;7 bytes
Yahoooooo, we saved 8 bytes by one instruction (and we used IMUL
instead of MUL), great !
This magic can do EVERY aritmetical instructions, not only INC.
Imagine, how much space will u save, when u will use this in
instructions such as ADD, SUB, ADC, SBB, INC, DEC, OR, XOR, AND, etc.
The biggest magic is comin' now. Hey guy, tell me what the LEA
instruction does. U probably know that it's the instruction we use for
manipulatin' variables in a virus. But only some ppl know how to
use this instruction really effectively.
LEA stands for Load Effective Address. This name
is a little misleadin'. Let's have a look at what LEA really does.
Try to hardcode this:
lea eax, [12345678h]
What do u think, what will be in EAX after execution this opcode ?
Rite answer is 12345678h.
Another example (EBP = 1):
lea eax, [ebp + 12345678h]
What will be in register EAX ? Right answer is 12345679h. Yes, on the
least significant digit is 9h. So let's translate this instruction
to "normal" language:
lea eax, [ebp + 12345678h] ;6 bytes
==========================
mov eax, 12345678h ;5 bytes
add eax, ebp ;2 bytes
As u can see, LEA doesn't touch memory at all. It only takes the
address expression in its second operand, calculates it, and stores the
result into the first operand (EAX in our example). Now look at the sizes.
Weird, it does the same thing (well, not exactly, LEA preserves
flags), but it is shorter. Let's show the whole magic...
5) Look at this unoptimized stuff:
mov eax, 12345678h ;5 bytes
add eax, ebp ;2 bytes
imul ecx, 4 ;3 bytes
add eax, ecx ;2 bytes
6) Open your mouth and look at this:
lea eax, [ebp+ecx*4+12345678h] ;7 bytes
Close your mouth now. LEA is shorter, faster (much faster) and
preserves flags (and ECX, which the IMUL version destroys). Look at it
once again, we saved 5 bytes and processor time with one instruction
(LEA is much faster on every processor).
I won't explain every arithmetic instruction here; I think it
wouldn't make sense, coz they all have the same syntax. U saw everything
important, now u can use it. If u wanna use this technique, the only
thing u have to keep in mind is the syntax:
OPCODE <SIZE PTR> [BASE + INDEX*SCALE + DISPLACEMENT]
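The address expression in that syntax can be modelled directly. The followin' Python sketch (effective_address and MASK32 r illustrative names, not from the article) computes what LEA would put into its destination register. One fact the syntax line doesn't show: the CPU accepts only 1, 2, 4 or 8 as SCALE.

```python
MASK32 = 0xFFFFFFFF  # addresses are 32 bits wide


def effective_address(base=0, index=0, scale=1, displacement=0):
    """Model of [BASE + INDEX*SCALE + DISPLACEMENT], truncated to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + displacement) & MASK32


# lea eax, [ebp + 12345678h] with EBP = 1
print(hex(effective_address(base=1, displacement=0x12345678)))
# address used by inc dword ptr [ebx+4*ecx+1000h] with EBX = 1000h, ECX = 3
print(hex(effective_address(base=0x1000, index=3, scale=4, displacement=0x1000)))
```

LEA stores this number in a register; INC, ADD and the others use the same number as the address of their memory operand.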
4.10. Delta offset optimization
─────────────────────────────────
Naaah, u probably think, I'm mad. If u, as a reader of this e-paper,
aren't beginner, u must know, what da (filtered) delta offset is. However,
I saw at many VX coderz, that they don't use delta offset effectively.
If u have a look on my first viruses, u will see, I also (filtered) the
space it takes. And I wasn't alone. Let's see it in details..
[* Ehrm, let's have a pause. I think, u have to be tired from this
BIIIIIG paper. I will tell ya something... Before some minutes, I
went out to buy new cig-box (uuuh, to many drugs in my body now X-D).
Hot, sunny weather changed before some moments to hot, windy weather,
darky, total STOOOORM, but without any rain, I can see big lightenings,
I like it. It's the best weather to have a minute for thinkin' about
some things - girls, VX, friends, politix, ... I'm back now. I'm
plug-inin' some very kewl CD with very kewl music, czech music. Now I
can hear one very gewd song from one very gewd czech rock-group. Hehe,
90% of their songs were written when they were totally doped. But wait,
they r very gewd. Many things u can understand, only when u r doped.
They r singin' (rite now X-D) about Earth. It's very slow song, it's
like Indian music (but they also play hard rock, so hard, that Billy
would like it. Hehe, I will bring this CD sometimes, when we will be
on some meetin', somewhere, maybe. Billy, u will 100% like it, my
friend ! X-D). Hmmm, I will tell ya know some lyrix... Very gewd lyrix,
I hope, u will understand it, I will translate it for ya X-D...
She defence on and on,
there r ages, when someone like her,
Both of nice and cruel,
U can touch, she will give it also to u,
Now it is waitin' for that step,
which makes walk a fly,
And when then, when, if not now ?????
Politix can invent only atomic shit,
let's kick it back to them,
And when then, when, if not now ?????
She defence on and on, ....
Ooooh, my god, whata hell I'm doin' now ? Hehe, if u think, I'm mad,
be sure it's truth X-DDD. Ok, ok, back to reality... *]
So, again, let's look at that stuff.
This is the way, how is standardly delta offset handled...
1) call gdelta
gdelta: pop ebp
sub ebp, offset gdelta
That's the normal way (but less efficient). Let's look how we can work
with it...
2) lea eax, [ebp + variable]
Hmmm, if u look at it under some debugger, u will see the followin' line:
3) lea eax, [ebp + 401000h] ;6 bytes
In the first generation of the virus, the EBP register will be zero.
Ok, but let's look, what happens, if u code this:
4) lea eax, [ebp + 10h] ;3 bytes
Hmmm, weird. Sometimes it's 6 bytes, next time it's 3 bytes. It's
normal. Many instructions r optimized for SHORT (one byte long) values,
e.g. SUB EBX, 3 will be 3 bytes long too. If u code SUB EBX, 1234h, it
will have 6 bytes. Not only SUB instruction, also many other
instructions.
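The 3-or-6-byte rule is easy to write down. Here is a sketch in Python that covers only the two encodings mentioned above; the function names r made up, and the values r taken as signed numbers. The short form exists when the constant fits in a signed byte, which the CPU then sign-extends to 32 bits.

```python
def fits_in_signed_byte(value):
    """True when the value can be stored as a sign-extended 8bit constant."""
    return -128 <= value <= 127


def sub_ebx_imm_size(immediate):
    """Size of SUB EBX, imm: 3 bytes with an 8bit immediate, else 6 bytes."""
    return 3 if fits_in_signed_byte(immediate) else 6


def lea_ebp_disp_size(displacement):
    """Size of LEA r32, [EBP + disp]: 3 bytes with disp8, else 6 bytes."""
    return 3 if fits_in_signed_byte(displacement) else 6


for value in (3, 0x10, 0x1234, 0x401000):
    print(hex(value), sub_ebx_imm_size(value), lea_ebp_disp_size(value))
```

So anything within -128..127 gets the short form, and anything farther away costs 3 bytes more per instruction.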
Look, what happens, if we will use "another" way, how to get delta
offset...
5) call gdelta
gdelta: pop ebp
That's all ! As I said, with the previous version of gdelta, EBP will be
zero in the first generation of the virus and the variable will be e.g.
401000h. That's not good. What do u say: we keep the big value in EBP and
the added value will be only the small distance of the variable from
gdelta ? Thanx to our new version of gdelta, we can use the SHORT version
of LEA and so save 3 bytes on every variable addressin'. Here is the sample...
6) lea eax, [ebp + variable - gdelta] ;3 bytes
We got it. The next thing we should do is put all initialized
variables around the gdelta call. This makes it work (no more
6 byte, but 3 byte instructions) - THIS IS REALLY IMPORTANT. If u
don't do it, the variable would be somewhere FAR (ehrm, I wanted say
NEAR X-D) from gdelta, farther than a signed byte can reach, so the SHORT
version of LEA wouldn't be used.
Heh, u probably think that there is some trick, that it has some
limitation or something like that, coz if this worked, everybody
would use it. Don't worry, apart from that distance there isn't any limitation.
And why da (filtered) does noone use it ? It's hard to answer. I can say
that I don't know. Really don't know.
[* Let me say my feelings. U probably know Super/29A. He is the best
optimizer I and the VX world know. It's a fact. U probably also know
JQwerty/29A. He is also a VERY GOOD optimizer, but noone says "Super and
JQwerty r the best optimizers". I don't know why. I saw this delta
offset handlin' first in his code, noone used it before him (I think).
And it is soooo easy to use. If u look at Win32.Cabanas u will
see MANY and MANY features. And it's only 2999 bytes !!! Who else than
Super or JQwerty could code it ? I don't know. I only wanna say that
"someone" forgot about the other kewl guy. *]
My new virus uses this delta offset handlin' too, and I saved TONS of
bytes. So why don't u use it too ?
4.11. Misc optimizations
──────────────────────────
Here r included those optimization techniques, that I couldn't sort
to groups above... Just read it, something can be useful...
Zero EDX register, if EAX is less than 80000000h:
--------------------------------------------------
1) xor edx, edx ;2 bytes, but faster
2) cdq ;1 byte, but slower
I always use CDQ instead XOR. Why ? Why not ? X-D
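The condition "EAX is less than 80000000h" matters coz CDQ doesn't zero EDX, it sign-extends EAX into it. A minimal Python model (cdq is just an illustrative function name):

```python
def cdq(eax):
    """Model of CDQ: EDX is filled with copies of bit 31 of EAX."""
    return 0xFFFFFFFF if eax & 0x80000000 else 0


print(hex(cdq(0x7FFFFFFF)))  # highest value that still gives EDX = 0
print(hex(cdq(0x80000000)))  # bit 31 set, so EDX is all ones
```

So CDQ is a safe replacement for XOR EDX, EDX only when u know that the top bit of EAX is clear.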
Save space by usin' all registers, instead of EBP and ESP:
-----------------------------------------------------------
1) mov eax, [ebp] ;3 bytes
2) mov eax, [esp] ;3 bytes
3) mov eax, [ebx] ;2 bytes
Wanna have mirror effect of register content ? Try BSWAP.
---------------------------------------------------------
Example:
mov eax, 12345678h ;5 bytes
bswap eax ;2 bytes
;eax = 78563412h now
I haven't ever found this instruction useful for any viral work.
However, someone maybe will X-D.
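For the curious, the byte reversal done by BSWAP can be modelled like this (a Python sketch; bswap32 is a made-up name). It is the same operation as convertin' a dword between little-endian and big-endian byte order.

```python
def bswap32(value):
    """Model of BSWAP r32: reverse the order of the four bytes."""
    raw = (value & 0xFFFFFFFF).to_bytes(4, "little")
    return int.from_bytes(raw, "big")


print(hex(bswap32(0x12345678)))
```

Applyin' it twice gives back the original value.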
Wanna save some bytes replacin' CALL ?
---------------------------------------
1) call _label_ ;5 bytes
ret ;1 byte
2) jmp _label_ ;2/5 (SHORT/NEAR)
Huh, we saved 4 bytes and processor time. Always replace call/ret with
jmp instruction, if call doesn't want any parameters on the stack...
Wanna save time while comparin' reg/mem ?
------------------------------------------
1) cmp reg, [mem] ;slower
2) cmp [mem], reg ;1 cycle faster
Wanna save space and CPU time while dividin'/multiplyin' by
power of 2 ?
------------------------------------------------------------
Dividin':
1) mov eax, 1000h
mov ecx, 4 ;5 bytes
xor edx, edx ;2 bytes
div ecx ;2 bytes
2) shr eax, 2 ;3 bytes (4 = 2^2)
Multiplyin':
3) mov ecx, 4 ;5 bytes
mul ecx ;2 bytes
4) shl eax, 2 ;3 bytes (4 = 2^2)
No comment...
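OK, one comment: the shift count is the exponent, not the divisor, so dividin' or multiplyin' by 4 means shiftin' by 2. A small Python model to check it (a sketch; shr32 and shl32 r illustrative names and the values r treated as unsigned):

```python
MASK32 = 0xFFFFFFFF  # registers are 32 bits wide


def shr32(value, count):
    """Model of SHR r32, count: unsigned division by 2**count."""
    return (value & MASK32) >> count


def shl32(value, count):
    """Model of SHL r32, count: multiplication by 2**count, low 32 bits kept."""
    return (value << count) & MASK32


print(hex(shr32(0x1000, 2)), hex(0x1000 // 4))
print(hex(shl32(0x1000, 2)), hex(0x1000 * 4))
```

One difference to remember: MUL leaves the upper half of the product in EDX, while SHL simply throws the shifted-out bits away.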
Loops, loops and loops:
------------------------
1) dec ecx ;1 byte
jne _label_ ;2/6 bytes (SHORT/NEAR)
2) loop _label_ ;2 bytes
Next example:
3) je $+5 ;2 bytes
dec ecx ;1 byte
jne _label_ ;2 bytes
4) loopXX _label_ (XX = E, NE, Z or NZ) ;2 bytes
LOOP is smaller, but slower on 486+.
And next unforgetable thing. Noone normal can code this:
---------------------------------------------------------
1) push eax ;1 byte
push ebx ;1 byte
pop eax ;1 byte
pop ebx ;1 byte
Do this and only this. Nothing other than this:
2) xchg eax, ebx ;1 byte
And again, if XCHG's operand is EAX, it takes 1 byte otherwise
it takes 2 bytes. So when u wanna exchange ECX with EDX, XCHG will
be 2 bytes long:
3) xchg ecx, edx ;2 bytes
If u only want to move content of one register to another one, use
simple MOV instruction. It has better pairin' on Pentium and takes
less CPU time than XCHG without EAX register as operand:
4) mov ecx, edx ;2 bytes
Discard repeated code (and procedure code):
--------------------------------------------
1) Unoptimized:
lbl1: mov al, 5 ;2 bytes
stosb ;1 byte
mov eax, [ebx] ;2 bytes
stosb ;1 byte
ret ;1 byte
lbl2: mov al, 6 ;2 bytes
stosb ;1 byte
mov eax, [ebx] ;2 bytes
stosb ;1 byte
ret ;1 byte
---------
;14 bytes
2) Optimized:
lbl1: mov al, 5 ;2 bytes
lbl: stosb ;1 byte
mov eax, [ebx] ;2 bytes
stosb ;1 byte
ret ;1 byte
lbl2: mov al, 6 ;2 bytes
jmp lbl ;2 bytes
---------
;11 bytes
Remember, if u h

