Ways Of Optimization
belgther
Well, this article is quite old, as you can see from the date, but I hope it gives some idea of how to optimize programs so that they run faster and/or take up less space...
I got this article from 29A, a virus e-magazine, so it's NOT written by me; DON'T nag me about anything in it...
BTW, add any other ways of optimization if you know some...

Cheers

belgther

QUOTE

                                                      ▄█████▄ ▄█████▄ ▄█████▄
      ┌─ Optimization of 32bit code ─┐                ███ ███ ███ ███ ███ ███
      │              by              │                ▄▄▄██▀ ▀██████ ███████
      └──────── Benny / 29A ─────────┘                ███▄▄▄▄ ▄▄▄▄███ ███ ███
                                                      ███████ ██████▀ ███ ███




┌───────────────┐
│ 1. Disclaimer │
└───────────────┘

The followin' document is for educational purposes only. The author isn't
responsible for any misuse of the things written in this document.


┌─────────────┐
│ 2. Foreword │
└─────────────┘

Eeeh, why da (filtered) did I write this article ? There r many documents about
optimization. Yes, that's true, and there r many very gewd and kewl tutes
[* Billy, your tute rox! *]. But as u can see, not every tute keeps in
mind that the term "optimize" doesn't only mean makin' your code
smaller. There r many aspects of optimization and I wanna discuss them here
and give u a comprehensive view of the thing.
When I started to write this article, I was really drunk and totally under the
drugs (hehe, no lie X-D), so if u feel I made any mistake, or u think things
written here aren't true, or u simply wanna give me some credits (do it
please X-D), u can find me on IRC UnderNet, channels #vir and/or #virus, or
mail benny@post.cz. Thanx for all positive (and also negative) comments.


┌─────────────────┐
│ 3. Introduction │
└─────────────────┘

As I said some seconds before, optimization has many aspects.
Generally, we can optimize our code so that:
        -      code will be smaller
        -      code will be faster
        -      code will be smaller and faster


Well, that gives us something to think about. In practice, if we optimize
our code:
        -      code will be smaller, but also slower
        -      code will be bigger, but faster
        -      code will be smaller and faster


We should find a compromise between the first and the second point (if we
can't reach the third one). I'm sure u don't wanna alert the user by slowin'
down system performance due to:
        -      huge and unoptimized code
        -      small, but slow code

or alert the user by rapidly decreasin' free space on the disk.


It's up to us which way we choose. Here we have a clue:
        -      if our code (or block of code, e.g. thread procedure) is
                small, we should optimize it for speed
        -      if our code (or block of code) is big, we should optimize
                it for size and speed (find a compromise, prefer speed)

Ideally, we should optimize our code by decreasin' its size and increasin'
its speed at once, but u know how difficult that is.


Is it clear ? I think u already knew this. But there r still many more
aspects of optimization. For example, we can have two instructions (or short
sequences) that do the same thing, but:
        -      one is bigger
        -      one is slower
        -      one changes other registers
        -      one writes to memory
        -      one changes flags
        -      one is faster on one processor, but slower on
                another one


Example:        LODSB               MOV AL, [ESI] + INC ESI
                ---------------------------------------------------------
    size:       smaller             bigger
    speed:      faster on 80386     faster on 80486, on Pentium only 1 cycle
    flags:      preserved           changed



And why is LODSB faster on 80386, and why does MOV + INC take only 1 cycle on
Pentium ? Pentium is a superscalar processor supportin' pipelinin', so it can
execute a pair of certain integer instructions in its two PIPEs, i.e. it can
execute those instructions simultaneously. Two instructions that can be
executed simultaneously r called "pairable instructions".

Hehe, don't worry, this article won't be about Pentium processor
architecture, so u can forget the words I said about pipes. Maybe l8r, if I
write another article about Pentium processor optimization, I will explain
terms such as pipes, V-pipe, U-pipe, pairin' and so on in more detail. For
now, u can forget them. Just remember what the word "pairin'" means.


Now I will discuss every optimization technique step by step.


┌─────────────────────────┐
│ 4. Optimizin' our code  │
└─────────────────────────┘

Well, let's go optimize. I will start from the easiest operation.
Beginners, hold on...


4.1. Zero register
────────────────────

        I don't wanna see this anymore:

        1)      mov eax, 00000000h              ;5 bytes

        This is the worst instruction I've ever seen. Well, it seems
        logical to move zero to the register, but u can do it
        in a more optimized way, like this:

        2)      sub eax, eax                    ;2 bytes

                    or

        3)      xor eax, eax                    ;2 bytes

        3 bytes saved on one instruction, great ! X-D But what's better
        to use, SUB or XOR ? I prefer XOR, coz Micro$oft prefers SUB and I
        know that Windozes r slooooow, hehe. Noo, that's not the true reason.
        What do u think is easier (for u): to subtract two numbers, or to say
        "where 1 meets 1, write 0" ? So u know why I prefer XOR (as I hate
        mathematix X-D).
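
        The byte counts above can be checked by hand. The following is a
        minimal sketch in Python, not part of the original article. It
        assumes the usual 32-bit encodings (B8 + imm32, 29 /r, 31 /r); an
        assembler may emit the equivalent 2B /r or 33 /r forms instead, which
        have the same length.

```python
# Machine-code bytes for the three ways of zeroing EAX in 32-bit mode.
# The encodings are an assumption of this sketch (standard IA-32 forms).
ZERO_EAX = {
    "mov eax, 00000000h": bytes.fromhex("B800000000"),  # opcode B8 + imm32
    "sub eax, eax": bytes.fromhex("29C0"),              # opcode 29 + ModRM
    "xor eax, eax": bytes.fromhex("31C0"),              # opcode 31 + ModRM
}


def sizes(table):
    """Return {instruction text: encoded length in bytes}."""
    return {text: len(code) for text, code in table.items()}


if __name__ == "__main__":
    for text, size in sizes(ZERO_EAX).items():
        print(f"{text:<20} {size} bytes")
```

        Unlike MOV, both SUB and XOR also change the flags, which matters if
        a later instruction depends on them.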


4.2. Test if register is zero
───────────────────────────────

        Hmmm, let's see the "brightest" solution:

        1)      cmp eax, 00000000h              ;5 bytes
                je _label_                      ;2/6 bytes (short/near)

        [* NOTE: Many arithmetical instructions r optimized for the EAX
        register, so code usin' EAX will be faster and smaller.
        Example: CMP EAX, 12345678h (5 bytes). If I used another register
        instead of EAX, the CMP instruction would have 6 bytes *]

        Argh! Who normal would do this ? That's 7 or 11(!) bytes for a simple
        comparison. No, no, no, don't do it and try this:

        2)      or eax, eax                     ;2 bytes
                je _label_                      ;2/6 bytes (short/near)

                    or

        3)      test eax, eax                   ;2 bytes
                je _label_                      ;2/6 bytes (short/near)

        Hmm, much better, 4/8 bytes is really better than 7/11 bytes. So,
        again, what's better, OR or TEST ? Micro$oft prefers OR, so again, I
        prefer TEST |-). Now seriously, TEST doesn't write to the register (OR
        does), so there will be better pairin' => faster code. I hope u still
        remember what the word "pairin'" means... If not, read the
        Introduction section again.

        Now, the biggest magic. If u don't care about the ECX register, or u
        don't care in which register the content ends up (EAX and ECX get
        swapped), u can do it this way:

        4)      xchg eax, ecx                   ;1 byte
                jecxz _label_                   ;2 bytes

        [* NOTE: XCHG is optimized for the EAX register, so if one operand
        of XCHG is EAX, it will be 1 byte long, otherwise 2 bytes. JECXZ
        exists only in the SHORT form *]

        Great! We optimized our code, so we saved 4 bytes (3 bytes instead
        of 7).
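
        To double-check the totals quoted in this section, here is a small
        Python tally (a sketch, not from the original article). The encodings
        are assumed standard IA-32 forms with zero-filled jump displacements;
        the 5-byte CMP is the imm32 form the article counts with.

```python
# Encodings for section 4.2 (32-bit mode). Jump displacements are
# zero-filled placeholders; only the lengths matter here.
CMP_EAX_0    = bytes.fromhex("3D00000000")    # cmp eax, 00000000h (imm32 form)
OR_EAX_EAX   = bytes.fromhex("09C0")          # or eax, eax
TEST_EAX_EAX = bytes.fromhex("85C0")          # test eax, eax
XCHG_EAX_ECX = bytes.fromhex("91")            # xchg eax, ecx
JE_SHORT     = bytes.fromhex("7400")          # je rel8
JE_NEAR      = bytes.fromhex("0F8400000000")  # je rel32
JECXZ        = bytes.fromhex("E300")          # jecxz rel8 (short form only)


def total(*instructions):
    """Total encoded size in bytes of a sequence of instructions."""
    return sum(len(code) for code in instructions)


if __name__ == "__main__":
    print("cmp + je  :", total(CMP_EAX_0, JE_SHORT), "/", total(CMP_EAX_0, JE_NEAR))
    print("test + je :", total(TEST_EAX_EAX, JE_SHORT), "/", total(TEST_EAX_EAX, JE_NEAR))
    print("xchg+jecxz:", total(XCHG_EAX_ECX, JECXZ))
```

        The CMP form comes to 7 bytes with a short jump and 11 with a near
        one, the OR/TEST forms to 4 and 8, and the XCHG/JECXZ pair to 3.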


4.3. Test if register is 0FFFFFFFFh
─────────────────────────────────────

        Many APIs return -1 when the function fails, so it is important to
        test for this value. I'm always astonished when I see how some
        coders test for this value, like this:

        1)      cmp eax, 0ffffffffh             ;5 bytes
                je _label_                      ;2/6 bytes

        I hate this. And now look how it can be optimized:

        2)      inc eax                         ;1 byte
                je _label_                      ;2/6 bytes
                dec eax                         ;1 byte

        Yes, yes, yes, we saved 3 bytes and made the code faster ;-)
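
        Why the INC trick works: adding 1 to a 32-bit register wraps to zero
        only when the register held 0FFFFFFFFh, so the zero flag after INC
        answers the question. A minimal Python model of this (a sketch; the
        function names are made up for illustration):

```python
MASK32 = 0xFFFFFFFF


def inc32(value):
    """Model INC on a 32-bit register: return (result, zero_flag)."""
    result = (value + 1) & MASK32
    return result, result == 0


def is_minus_one(eax):
    """True exactly when 'inc eax / je' would take the jump."""
    _, zero_flag = inc32(eax)
    return zero_flag


if __name__ == "__main__":
    for value in (0x00000000, 0x7FFFFFFF, 0xFFFFFFFE, 0xFFFFFFFF):
        print(f"{value:08X} -> jump taken: {is_minus_one(value)}")
```

        Note that when the jump is taken, EAX holds 0 rather than -1; the
        DEC only restores the original value on the fall-through path.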


4.4. Move 0FFFFFFFFh to register
──────────────────────────────────

        Some APIs need a -1 value as a parameter. Let's see how we can set it:

        Least optimized:

        1)      mov eax, 0ffffffffh             ;5 bytes

        More optimized:

        2)      xor eax, eax / sub eax, eax     ;2 bytes
                dec eax                         ;1 byte

        Or this, with the same result (by Super/29A):

        3)      stc                             ;1 byte
                sbb eax, eax                    ;2 bytes

        This code is very useful in some cases, such as when the carry flag
        is already known to be set:

                jnc _label_
                sbb eax, eax                    ;2 bytes only!
      _label_:  ...
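
        SBB computes destination - source - CF, so with the same register on
        both sides the result depends only on the carry flag. A short Python
        model (a sketch under that definition, not code from the article):

```python
MASK32 = 0xFFFFFFFF


def sbb32(dst, src, carry):
    """Model SBB dst, src: dst - src - CF, truncated to 32 bits."""
    return (dst - src - (1 if carry else 0)) & MASK32


if __name__ == "__main__":
    eax = 0x12345678  # any starting value gives the same answer
    print(f"CF=1: {sbb32(eax, eax, True):08X}")
    print(f"CF=0: {sbb32(eax, eax, False):08X}")
```

        With CF set the register becomes 0FFFFFFFFh whatever it held before;
        with CF clear it becomes 0.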


4.5. Zero register and move something to LSW
──────────────────────────────────────────────

        Example of unoptimized code:

        1)      xor eax, eax                    ;2 bytes
                mov ax, word ptr [esi+xx]       ;4 bytes

        386+ supports a new instruction called MOVZX (MOVe with Zero
        eXtension). [* NOTE: MOVZX is faster on 386, on 486+ it is slower *]
        Example of optimized code, where we can save 2 bytes:

        2)      movzx eax, word ptr [esi+xx]    ;4 bytes

        Next example of "ugly code":

        3)      xor eax, eax                    ;2 bytes
                mov al, byte ptr [esi+xx]       ;3 bytes

        Now we can save 1 valuable byte X-D:

        4)      movzx eax, byte ptr [esi+xx]    ;4 bytes

        This is very effective when u r readin' bytes/words from the PE
        header. Becoz u need to work with bytes/words/dwords altogether,
        MOVZX is the best for this case.

        And the last example:

        5)      xor eax, eax                    ;2 bytes
                mov ax, bx                      ;3 bytes

        Better use this form, which discards 2 bytes:

        6)      movzx eax, bx                   ;3 bytes

        I use MOVZX everytime I can. It is small and it isn't as slow
        as other instructions.


4.6. Push shit
────────────────

        Tell me, how will u store 50h to EAX...
        ----------------------------------------

        Badly:

        1)      mov eax, 50h                    ;5 bytes

        Better:

        2)      push 50h                        ;2 bytes
                pop eax                         ;1 byte

        Usin' PUSH and POP is a little slower, but smaller too. When the
        operand is short (fits in 1 signed byte), PUSH takes 2 bytes.
        Otherwise it takes 5 bytes.

        Let's try another thing. Push 7x 0 to stack...
        -----------------------------------------------

        Unoptimized:

        3)      push 0                          ;2 bytes
                push 0                          ;2 bytes
                push 0                          ;2 bytes
                push 0                          ;2 bytes
                push 0                          ;2 bytes
                push 0                          ;2 bytes
                push 0                          ;2 bytes

        Optimized, but still biggy X-D:

        4)      xor eax, eax                    ;2 bytes
                push eax                        ;1 byte
                push eax                        ;1 byte
                push eax                        ;1 byte
                push eax                        ;1 byte
                push eax                        ;1 byte
                push eax                        ;1 byte
                push eax                        ;1 byte

        Compact, but slower:

        5)      push 7                          ;2 bytes
                pop ecx                         ;1 byte
      _label_:  push 0                          ;2 bytes
                loop _label_                    ;2 bytes

        Wow, without any pain, we saved 7 bytes ;-)

        And now, a story from life... U wanna move something from one
        variable into another variable. All registers must be preserved.
        U probably do this:
        ----------------------------------------------------------------

        6)      push eax                        ;1 byte
                mov eax, [ebp + xxxx]           ;6 bytes
                mov [ebp + xxxx], eax           ;6 bytes
                pop eax                         ;1 byte

        And now, usin' only the stack, no registers:

        7)      push dword ptr [ebp + xxxx]     ;6 bytes
                pop dword ptr [ebp + xxxx]      ;6 bytes

        This is useful when u don't have any register free to use. I use it
        when I wanna save the old entrypoint to another variable...

        8)      push dword ptr [ebp + header.epoint]    ;6 bytes
                pop dword ptr [ebp + originalEP]        ;6 bytes

        This saves 2 wonderful bytes |-). Though it is a little slower than
        normal manipulation through EAX (without savin' it), it still comes
        in handy when u don't wanna (or can't) use any register.


4.7. Multiply fun
───────────────────

        Tell me, how will u calculate the offset of the last section, when u
        have number_of_sections-1 in EAX ?

        Badly:

        1)      mov ecx, 28h                    ;5 bytes
                mul ecx                         ;2 bytes

        Better:

        2)      push 28h                        ;2 bytes
                pop ecx                         ;1 byte
                mul ecx                         ;2 bytes

        Much better:

        3)      imul eax, eax, 28h              ;3 bytes

        What does IMUL do ? This form of IMUL multiplies the second operand
        (a register) by the third operand (an immediate) and stores the
        result in the first register (EAX). So u can multiply EBX by 28h
        and store the result to EAX like this:

        4)      imul eax, ebx, 28h

        Simple and effective (in size as well as in speed). I don't wanna
        imagine how u would do this with the MUL instruction... X-D
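
        Besides the size, the two instructions differ in what they overwrite:
        MUL always produces a 64-bit result in EDX:EAX, while the
        three-operand IMUL writes only its destination register. A minimal
        Python model of both (a sketch; the function names are illustrative
        and the 28h multiplier is the one used in the article):

```python
MASK32 = 0xFFFFFFFF


def mul32(eax, operand):
    """Model MUL operand: unsigned EDX:EAX = EAX * operand. Returns (edx, eax)."""
    product = (eax & MASK32) * (operand & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def imul32_imm(src, imm):
    """Model IMUL dst, src, imm: the low 32 bits of the product go to dst."""
    return (src * imm) & MASK32


if __name__ == "__main__":
    index = 3  # number_of_sections - 1, chosen arbitrarily for the demo
    print(f"imul: eax = {imul32_imm(index, 0x28):X}h")
    edx, eax = mul32(index, 0x28)
    print(f"mul : edx = {edx:X}h, eax = {eax:X}h (EDX is overwritten)")
```

        For small indexes both give the same EAX, but the MUL variants also
        destroy ECX (used to hold 28h) and EDX.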


4.8. Stringz in action
────────────────────────

        I wanna jump into the wall when I see unoptimized string operations.
        Here u have some hints on how u can optimize your code usin' string
        instructions. Do it please, or I will really do it ! X-D

        Startin' from scratch, how can u load a byte ?
        ---------------------------------------------------

        Faster:

        1)      mov al, [esi]                   ;2 bytes
                inc esi                         ;1 byte

        Smaller:

        2)      lodsb                           ;1 byte

        I recommend usin' the *Smaller* version. This is a one-byte
        instruction that does the same thing as the *Faster* version (with
        the direction flag cleared). It's faster on 80386, but much slower on
        80486+. On Pentium, *Faster* takes one cycle due to pairin'. However,
        I think the best to use is still the *Smaller* version.

        And how can u load a word ? Ehrm, DO NOT load words, it's too
        slow in a 32bit environment such as Win32. But if u seriously wanna
        load one, here is the clue...
        -----------------------------------------------------------------

        Faster:

        3)      mov ax, [esi]                   ;3 bytes
                add esi, 2                      ;3 bytes

        Smaller:

        4)      lodsw                           ;2 bytes

        What 'bout speed and size ? See the previous description (LODSB).

        Aaaah, loadin' dwords is always funny. Look at this:
        -----------------------------------------------------

        Faster:

        5)      mov eax, [esi]                  ;2 bytes
                add esi, 4                      ;3 bytes

        Smaller:

        6)      lodsd                           ;1 byte

        See the description of LODSB.

        And the next very useful thing... Movin' something from somewhere
        to somewhere. It's in fact LODSB/LODSW/LODSD + STOSB/STOSW/STOSD.
        Here is the example of MOVSD:
        ------------------------------------------------------------------

        Faster:

        7)      mov eax, [esi]                  ;2 bytes
                add esi, 4                      ;3 bytes
                mov [edi], eax                  ;2 bytes
                add edi, 4                      ;3 bytes

        Smaller:

        8)      movsd                           ;1 byte

        *Faster* is faster on 486+, *Smaller* is smaller ;-).

        Finally, I would like to say that u should always load dwords instead
        of bytes or words, coz u run a 386+ processor, which is 32bit. I.e.
        your processor worx with 32 bits, so if u wanna work with one byte,
        the processor must load a dword and then truncate it. Aaaa, too much
        work, so if it's not necessary to use bytes/words, don't use them.

        Next fun... how can u get to the end of a string ?
        ----------------------------------------------

        Here is JQwerty's method:

        9)      lea esi, [ebp + asciiz]         ;6 bytes
      s_check:  lodsb                           ;1 byte
                test al, al                     ;2 bytes
                jne s_check                     ;2 bytes

        And Super's method:

        10)     lea edi, [ebp + asciiz]         ;6 bytes
                xor al, al                      ;2 bytes
      s_check:  scasb                           ;1 byte
                jne s_check                     ;2 bytes

        Now, which is the best one ? Hmmm, hard to say truth...X-D
        On 80386+ Super's method is faster, but on Pentiums JQwerty's method
        is faster due to pairin'. Hehe, both methods have the same size,
        so choose which one u like to use... |-)
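
        Both loops stop with the pointer register one byte past the
        terminating zero. The Python model below is only a sketch of that
        behaviour: it assumes the direction flag is clear, treats memory as a
        bytes object, and uses made-up function names.

```python
def end_by_lodsb(memory, esi):
    """Model 'lodsb / test al, al / jne': return ESI after the loop."""
    while True:
        al = memory[esi]  # lodsb loads AL from [ESI] ...
        esi += 1          # ... and then advances ESI
        if al == 0:       # test al, al / jne s_check
            return esi


def end_by_scasb(memory, edi):
    """Model 'xor al, al / scasb / jne': return EDI after the loop."""
    al = 0                           # xor al, al
    while True:
        equal = memory[edi] == al    # scasb compares AL with [EDI] ...
        edi += 1                     # ... and then advances EDI
        if equal:                    # jne s_check falls through on a match
            return edi


if __name__ == "__main__":
    buffer = b"abc\x00xyz"
    print(end_by_lodsb(buffer, 0), end_by_scasb(buffer, 0))
```

        If the address of the zero byte itself is needed, subtract one from
        the returned pointer.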


4.9. Complex aritmetix
────────────────────────

        Now my favourite stuff. It's a pity that this great technique hasn't
        found usage among VX coderz. However, the instructions I wanna talk
        about r WELL KNOWN (heh, but still, noone knows how they really work
        and what more they can do), VERY SMALL and VERY FAST on every
        processor.

        Imagine u have a table of DWORDs. The pointer to the table is stored
        in the EBX register, the index into the table is in ECX. U wanna
        increment the ECX-th dword in the table, i.e. the one at EBX+(4*ECX).
        U don't want to modify any register.
        U can do it this way (everybody does it):

        1)      pushad                          ;1 byte
                imul ecx, ecx, 4                ;3 bytes
                add ebx, ecx                    ;2 bytes
                inc dword ptr [ebx]             ;2 bytes
                popad                           ;1 byte

        Or do it better (nobody does it):

        2)      inc dword ptr [ebx+4*ecx]       ;3 bytes

        This really rox !!! U save processor time (this is very fast) and
        space in memory (very small, as u can see), and u make your source
        code more readable !!! U saved 6 bytes by simple ONE INSTRUCTION !!!

        That's not all (not all for the INC instruction). Imagine another
        situation: EBX - pointer to memory, ECX - index into table, u wanna
        increment the ECX-th dword + 4096 bytes, so this: EBX+(4*ECX)+1000h.
        Yeah, and u wanna preserve all registers. U can do it unoptimized
        like this:

        3)      pushad                          ;1 byte
                imul ecx, ecx, 4                ;3 bytes
                add ebx, ecx                    ;2 bytes
                add ebx, 1000h                  ;6 bytes
                inc dword ptr [ebx]             ;2 bytes
                popad                           ;1 byte

        Or very optimized...

        4)      inc dword ptr [ebx+4*ecx+1000h] ;7 bytes

        Yahoooooo, we saved 8 bytes by one instruction (even though we used
        IMUL instead of MUL), great !

        This magic works with EVERY arithmetical instruction, not only INC.
        Imagine how much space u will save when u use this in
        instructions such as ADD, SUB, ADC, SBB, INC, DEC, OR, XOR, AND, etc.

        The biggest magic is comin' now. Hey guy, tell me, what does the LEA
        instruction do ? U probably know that it's the instruction we use for
        manipulatin' variables in a virus. But only some ppl know how to
        use this instruction really effectively.

        LEA stands for Load Effective Address. This name is a little
        misleadin'. Let's have a look at what LEA really does.

        Try to hardcode this:

                lea eax, [12345678h]

        What do u think will be in EAX after execution of this opcode ?
        The rite answer is 12345678h.

        Another example (EBP = 1):

                lea eax, [ebp + 12345678h]

        What will be in register EAX ? The right answer is 12345679h. Yes,
        the least significant digit is 9h. So let's translate this
        instruction to "normal" language:

                lea eax, [ebp + 12345678h]      ;6 bytes
                ==========================
                mov eax, 12345678h              ;5 bytes
                add eax, ebp                    ;2 bytes

        As u can see, LEA doesn't touch memory at all. It only worx with its
        operands, makes some operations with them, and then stores the result
        into the first operand (EAX in our example). Now look at the sizes.
        Weird, it does the same thing (well, not exactly, LEA preserves
        flags), but it is shorter. Let's show the whole magic...

        5) Look at this unoptimized stuff:

                mov eax, 12345678h              ;5 bytes
                add eax, ebp                    ;2 bytes
                imul ecx, 4                     ;3 bytes
                add eax, ecx                    ;2 bytes

        6) Open your mouth and look at this:

                lea eax, [ebp+ecx*4+12345678h]  ;7 bytes

        Close your mouth now. LEA is shorter, faster (much faster) and
        preserves flags. Look at it once again, by one instruction we saved
        5 bytes and processor time (LEA is much faster on every processor).

        I won't explain every arithmetical instruction here, I think it
        wouldn't make sense, coz they all have the same syntax. U saw
        everything important, now u can use it. If u wanna use this
        technique, the only thing u have to keep in mind is the syntax
        (SCALE can be 1, 2, 4 or 8):

                OPCODE <SIZE PTR> [BASE + INDEX*SCALE + DISPLACEMENT]
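
        The address arithmetic behind this syntax is plain 32-bit addition.
        The Python sketch below (not from the article; function names are
        illustrative) computes BASE + INDEX*SCALE + DISPLACEMENT and checks
        it against the long MOV/ADD/IMUL/ADD sequence from example 5.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base=0, index=0, scale=1, disp=0):
    """Value LEA would store: BASE + INDEX*SCALE + DISPLACEMENT (mod 2^32)."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    return (base + index * scale + disp) & MASK32


def long_way(ebp, ecx, disp):
    """Step-by-step model of example 5 (it also overwrites ECX)."""
    eax = disp & MASK32          # mov eax, disp
    eax = (eax + ebp) & MASK32   # add eax, ebp
    ecx = (ecx * 4) & MASK32     # imul ecx, 4
    eax = (eax + ecx) & MASK32   # add eax, ecx
    return eax


if __name__ == "__main__":
    print(f"{effective_address(base=1, disp=0x12345678):08X}")
    print(f"{effective_address(base=1, index=3, scale=4, disp=0x12345678):08X}")
```

        The single LEA gives the same EAX as the four-instruction sequence,
        while leaving ECX and the flags untouched.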


4.10. Delta offset optimization
─────────────────────────────────

        Naaah, u probably think I'm mad. If u, as a reader of this e-paper,
        aren't a beginner, u must know what da (filtered) delta offset is. However,
        I saw that many VX coderz don't use the delta offset effectively.
        If u have a look at my first viruses, u will see I also (filtered) the
        space it takes. And I wasn't alone. Let's see it in detail...


        [* Ehrm, let's have a pause. I think, u have to be tired from this
        BIIIIIG paper. I will tell ya something... Before some minutes, I
        went out to buy new cig-box (uuuh, to many drugs in my body now X-D).
        Hot, sunny weather changed before some moments to hot, windy weather,
        darky, total STOOOORM, but without any rain, I can see big lightenings,
        I like it. It's the best weather to have a minute for thinkin' about
        some things - girls, VX, friends, politix, ... I'm back now. I'm
        plug-inin' some very kewl CD with very kewl music, czech music. Now I
        can hear one very gewd song from one very gewd czech rock-group. Hehe,
        90% of their songs were written when they were totally doped. But wait,
        they r very gewd. Many things u can understand, only when u r doped.
        They r singin' (rite now X-D) about Earth. It's very slow song, it's
        like Indian music (but they also play hard rock, so hard, that Billy
        would like it. Hehe, I will bring this CD sometimes, when we will be
        on some meetin', somewhere, maybe. Billy, u will 100% like it, my
        friend ! X-D). Hmmm, I will tell ya know some lyrix... Very gewd lyrix,
        I hope, u will understand it, I will translate it for ya X-D...

                She defence on and on,
                there r ages, when someone like her,
                Both of nice and cruel,
                U can touch, she will give it also to u,
                Now it is waitin' for that step,
                which makes walk a fly,
                And when then, when, if not now ?????
                Politix can invent only atomic shit,
                let's kick it back to them,
                And when then, when, if not now ?????
                She defence on and on, ....

        Ooooh, my god, whata hell I'm doin' now ? Hehe, if u think, I'm mad,
        be sure it's truth X-DDD. Ok, ok, back to reality... *]


        So, again, let's look at that stuff.
        This is the standard way the delta offset is handled...

        1)      call gdelta
        gdelta: pop ebp
                sub ebp, offset gdelta

        That's the normal way (but less efficient). Let's look how we can
        work with it...

        2)      lea eax, [ebp + variable]

        Hmmm, if u look at it under some debugger, u will see the followin'
        line:

        3)      lea eax, [ebp + 401000h]        ;6 bytes

        In the first generation of the virus, the EBP register will be zero.
        Ok, but let's look what happens if u code this:

        4)      lea eax, [ebp + 10h]            ;3 bytes

        Hmmm, weird. Sometimes it's 6 bytes, next time it's 3 bytes. It's
        normal. Many instructions r optimized for SHORT (one byte long)
        values, e.g. SUB EBX, 3 will be 3 bytes long too. If u code
        SUB EBX, 1234h, it will have 6 bytes. Not only the SUB instruction,
        but many other instructions as well.

        Look what happens if we use "another" way to get the delta
        offset...

        5)      call gdelta
        gdelta: pop ebp

        That's all ! As I said, in the first generation of the virus, EBP
        will be zero (with the previous version of gdelta) and the variable
        will be e.g. 401000h. That's not good. What do u say, let's have the
        401000h value in EBP and only the distance of the variable as the
        increment ? Thanx to our new version of gdelta, we can use the SHORT
        version of LEA and so save 3 bytes on each variable addressin'. Here
        is the sample...

        6)      lea eax, [ebp + variable - gdelta]      ;3 bytes

        We got it. The next thing we should do is place all initialized
        variables around the gdelta call. That's what makes it work (no more
        6 byte, but 3 byte instructions) - THIS IS REALLY IMPORTANT. If u
        don't do it, the variable will be somewhere FAR (ehrm, I wanted to
        say NEAR X-D) from gdelta, so the SHORT version of LEA won't be used.
        Heh, u probably think that there is some trick, that it has some
        limitation or something like that, coz if this worked, everybody
        would use it. Don't worry, there isn't any limitation.
        And why da (filtered) does noone use it ? It's not easy to answer. I can say
        that I don't know. Really don't know.

        [* Let me say my feelings. U probably know Super/29A. He is the best
        optimizer I and the VX world know. It's a fact. U probably also know
        JQwerty/29A. He is also a VERY GOOD optimizer, but noone says "Super
        and JQwerty r the best optimizers". I don't know why. I saw this
        delta offset handlin' first in his code, noone used it before him
        (I think). And it is soooo easy to use. If u look at Win32.Cabanas u
        will see MANY and MANY features. And it's only 2999 bytes !!! Who
        else than Super or JQwerty could code it ? I don't know. I wanna only
        say that "someone" forgot about the other kewl guy. *]

        My new virus uses this delta offset handlin' too, and I saved TONS of
        bytes. So why don't u use it too ?


4.11. Misc optimizations
───────────────────────────

        Here r those optimization techniques that I couldn't sort
        into the groups above... Just read it, something can be useful...


        Zero EDX register, if EAX is less than 80000000h:
        --------------------------------------------------

        1)      xor edx, edx                    ;2 bytes, but faster

        2)      cdq                             ;1 byte, but slower

        I always use CDQ instead of XOR. Why ? Why not ? X-D
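
        The condition on EAX matters because CDQ copies the sign bit of EAX
        into every bit of EDX. A tiny Python model of that rule (a sketch
        with an illustrative function name):

```python
def cdq(eax):
    """Model CDQ: return the EDX value produced for a given 32-bit EAX."""
    return 0xFFFFFFFF if eax & 0x80000000 else 0x00000000


if __name__ == "__main__":
    for eax in (0x00000000, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF):
        print(f"EAX={eax:08X} -> EDX={cdq(eax):08X}")
```

        So CDQ can stand in for XOR EDX, EDX only while bit 31 of EAX is
        clear; otherwise EDX ends up as 0FFFFFFFFh.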


        Save space by addressin' through any register other than EBP and ESP:
        ----------------------------------------------------------------------

        1)      mov eax, [ebp]                  ;3 bytes
        2)      mov eax, [esp]                  ;3 bytes

        3)      mov eax, [ebx]                  ;2 bytes


        Wanna have a mirror effect (reversed byte order) on register
        content ? Try BSWAP.
        ---------------------------------------------------------

        Example:

                mov eax, 12345678h              ;5 bytes

                bswap eax                       ;2 bytes

                ;eax = 78563412h now

        I haven't ever found this instruction useful for any viral work.
        However, someone maybe will X-D.
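
        BSWAP reverses the four bytes of the register, not the individual
        bits or hex digits. A one-function Python model (a sketch, not from
        the article):

```python
def bswap32(value):
    """Model BSWAP on a 32-bit register: reverse the order of its 4 bytes."""
    return int.from_bytes((value & 0xFFFFFFFF).to_bytes(4, "big"), "little")


if __name__ == "__main__":
    print(f"{bswap32(0x12345678):08X}")
```

        Applying it twice gives back the original value.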


        Wanna save some bytes replacin' CALL ?
        ---------------------------------------

        1)      call _label_                    ;5 bytes
                ret                             ;1 byte

        2)      jmp _label_                     ;2/5 bytes (SHORT/NEAR)

        Huh, we saved up to 4 bytes and processor time. Always replace
        call/ret with a jmp instruction, if the called code doesn't want any
        parameters on the stack...


        Wanna save time while comparin' reg/mem ?
        ------------------------------------------

        1)      cmp reg, [mem]                        ;slower

        2)      cmp [mem], reg                        ;1 cycle faster (on 386)


        Wanna save space and CPU time while dividin'/multiplyin' by
        power of 2 ?
        ------------------------------------------------------------

        Dividin' by 4:

        1)      mov eax, 1000h
                mov ecx, 4                            ;5 bytes
                xor edx, edx                          ;2 bytes
                div ecx                              ;2 bytes

        2)      shr eax, 2                            ;3 bytes (4 = 2^2)

        Multiplyin' by 4:

        3)      mov ecx, 4                            ;5 bytes
                mul ecx                              ;2 bytes

        4)      shl eax, 2                            ;3 bytes (4 = 2^2)

        No comment...
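
        One thing to watch: the shift count is the exponent, not the divisor.
        Dividin' by 4 means a count of 2, a count of 4 means 16. A C sketch
        of the unsigned case (div_pow2, mul_pow2 and pow2_check r made-up
        names for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned divide by 2^k, as SHR EAX, k does (the remainder is discarded). */
uint32_t div_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);   /* shifting a 32-bit value by 32 or more is undefined in C */
    return x >> k;
}

/* Multiply by 2^k, as SHL EAX, k does (bits shifted past bit 31 are lost). */
uint32_t mul_pow2(uint32_t x, unsigned k)
{
    assert(k < 32);
    return x << k;
}

/* Hypothetical self-check with the value from the listing (1000h). */
void pow2_check(void)
{
    assert(div_pow2(0x1000u, 2) == 0x1000u / 4u);    /* 2^2 = 4  */
    assert(mul_pow2(0x1000u, 2) == 0x1000u * 4u);
    assert(div_pow2(0x1000u, 4) == 0x1000u / 16u);   /* 2^4 = 16 */
}
```

        Unlike DIV and MUL, the shifts leave no remainder in EDX and no high
        half of the product. For signed values SAR is needed instead of SHR,
        and SAR rounds toward minus infinity while IDIV truncates toward zero.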


        Loops, loops and loops:
        ------------------------

        1)      dec ecx                              ;1 byte
                jne _label_                          ;2/6 bytes (SHORT/NEAR)

        2)      loop _label_                          ;2 bytes

        Next example:

        3)      je $+5                                ;2 bytes
                dec ecx                              ;1 byte
                jne _label_                          ;2 bytes

        4)      loopXX _label_ (XX = E, NE, Z or NZ)  ;2 bytes

        Example 3 corresponds to LOOPNE/LOOPNZ; for LOOPE/LOOPZ change the
        JE to JNE. LOOP is smaller, but slower on 486+.
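
        The short forms r not bit-for-bit replacements: LOOP and LOOPXX
        always decrement ECX and never touch the flags, while DEC rewrites
        ZF and the JE in example 3 skips the decrement. A C model of the
        difference, as a sketch only (struct cpu and the function names r
        made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* The only state these instructions care about: ECX and the zero flag. */
struct cpu { uint32_t ecx; int zf; };

/* LOOPNE/LOOPNZ: ECX = ECX - 1, jump if ECX != 0 and ZF == 0.
   ZF itself is left unchanged. Returns 1 if the jump is taken. */
int loopne_taken(struct cpu *c)
{
    c->ecx--;
    return c->ecx != 0 && c->zf == 0;
}

/* je $+5 / dec ecx / jne _label_: DEC is skipped when ZF is set,
   and DEC overwrites ZF. Returns 1 if the jump is taken. */
int je_dec_jne_taken(struct cpu *c)
{
    if (c->zf)
        return 0;              /* JE jumps over DEC and JNE */
    c->ecx--;
    c->zf = (c->ecx == 0);     /* DEC sets ZF from its result */
    return !c->zf;
}
```

        Both take the jump in the same cases, but when ZF is set on entry
        LOOPNE still decrements ECX and the long sequence does not, so code
        that reads ECX or the flags after the loop must be checked before
        swappin' one for the other.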


        And next unforgettable thing. No one normal can code this:
        -----------------------------------------------------------

        1)      push eax                              ;1 byte
                push ebx                              ;1 byte
                pop eax                              ;1 byte
                pop ebx                              ;1 byte
     
        Do this and only this. Nothing other than this:

        2)      xchg eax, ebx                        ;1 byte

        And again, if one of XCHG's operands is EAX, it takes 1 byte,
        otherwise it takes 2 bytes. So when u wanna exchange ECX with EDX,
        XCHG will be 2 bytes long:

        3)      xchg ecx, edx                        ;2 bytes

        If u only want to move content of one register to another one, use
        simple MOV instruction. It has better pairin' on Pentium and takes
        less CPU time than XCHG without EAX register as operand:

        4)      mov ecx, edx                          ;2 bytes


        Discard repeated code (and procedure code):
        --------------------------------------------

        1) Unoptimized:

        lbl1:  mov al, 5                            ;2 bytes
                stosb                                ;1 byte
                mov eax, [ebx]                        ;2 bytes
                stosb                                ;1 byte
                ret                                  ;1 byte
        lbl2:  mov al, 6                            ;2 bytes
                stosb                                ;1 byte
                mov eax, [ebx]                        ;2 bytes
                stosb                                ;1 byte
                ret                                  ;1 byte
                                                      ---------
                                                      ;14 bytes
        2) Optimized:

        lbl1:  mov al, 5                            ;2 bytes
        lbl:    stosb                                ;1 byte
                mov eax, [ebx]                        ;2 bytes
                stosb                                ;1 byte
                ret                                  ;1 byte
        lbl2:  mov al, 6                            ;2 bytes
                jmp lbl                              ;2 bytes
                                                      ---------
                                                      ;11 bytes
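
        The same idea in a high-level language: keep the part that differs
        in each entry point and send both into one shared tail. A rough C
        sketch (emit_tail, emit5 and emit6 r made-up names, dst plays EDI
        and src plays EBX):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared tail (lbl): store AL, then store the low byte read from [EBX].
   Returns the number of bytes written. */
size_t emit_tail(uint8_t *dst, uint8_t al, const uint8_t *src)
{
    dst[0] = al;        /* stosb */
    dst[1] = src[0];    /* mov eax, [ebx] / stosb stores the low byte */
    return 2;
}

/* lbl1: only the constant differs, the rest is shared. */
size_t emit5(uint8_t *dst, const uint8_t *src) { return emit_tail(dst, 5, src); }

/* lbl2: same tail, different constant. */
size_t emit6(uint8_t *dst, const uint8_t *src) { return emit_tail(dst, 6, src); }
```

        In the asm version the shared tail costs only the 2-byte jmp, so it
        pays off as soon as the duplicated part is longer than that.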

        Remember, if u h
nicolas9510
well i read this article
it seems pretty complicated
im gonna look into it a bit cause im trying and beginning to learn x86 asm :)
it took a while to read cause its really long
but in the long term i think it may actually be of some use
maybe i might spend more time optimising my code rather than making it work :)
well looks good anyway, thanks a bunch, nice article
plinius
thanks.
nice read.
B.t.w. If you want the "bible" of optimizing in x86 , here it is:
CODE
http://www.agner.org/assem/pentopt.pdf
belgther
since this article is quite old, MMX, FPU and SSE methods are not mentioned at all....
That was the article which gave me the idea of using MMX and FPU registers for general purposes... Although these methods are quite old and known, some programs are still not optimized, even some of Microsoft's command-line tools which i won't mention here.
belgther
some other optimization methods from me:
use the apis with an "Ex" at the end. SleepEx, CreateWindowEx, MessageBoxEx, for example. Why? because i detected that the apis without "Ex" push the extra parameters themselves, do a lot of other unnecessary things which i won't discuss here, and then call the apis with "Ex". and that's how the job is really done. pre-pushing the extra parameter speeds up your program by skipping these unnecessary instructions.
plinius
a nice one is here:
CODE
http://www.mark.masmcode.com/


which DOES make use of xmm and mmx registers
tibbar
QUOTE(belgther @ Mar 14 2005, 04:44 PM)
some other optimization methods from me:
use the apis with an "Ex" at the end. SleepEx, CreateWindowEx, MessageBoxEx, for example. Why? because i detected that the apis without "Ex" push the extra parameters themselves, do a lot of other unnecessary things which i won't discuss here, and then call the apis with "Ex". and that's how the job is really done. pre-pushing the extra parameter speeds up your program by skipping these unnecessary instructions.
*



those XXXEx apis are simply the newer versions, which you should be using in any new development to ensure future compatibility. but beware that the XXXEx apis are unsupported in earlier versions of windows, so you will find your program fails on, say, Win95.
belgther
but there are not so many people left who still use win95... i only know my brother as a win95 user... and today's programs are hardly written for win95 anymore. What about games, for example? since almost all games use 3d acceleration for fast AGP / PCI Express cards, games won't be programmed for Win95.
Maybe coders should write a Win95 version and a version for later operating systems... Something similar is already done by GoldWave and the tools of oxid.it, forcing the users to use these operating systems.