Saturday, October 16, 2010

General

Today I am just posting whatever comes to my mind; there is no specific topic. These are just general optimization guidelines which can be followed.

What not to optimize

There are many things which will not yield much even after tedious optimization. There are two chief reasons for this:
  1. It may not need optimizing: optimizing parts of the program which hardly take any of the program's time will not improve performance much. Only those parts which take up a lot of the processor's time should be optimized.
  2. It might have been automatically optimized: many compilers align memory, optimize loops and use shifts instead of division where applicable. Optimizing these things by hand will prove futile, as they are already optimized.

General Optimization Guidelines

  1. Avoid branches, as they force the processor to speculate, which results in significant performance losses if the prediction is wrong.
  2. Optimize recursive functions and loops.
  3. Make 1-2 line functions inline, although many compilers do this themselves.
  4. Avoid fancy features like operator overloading and virtual functions; they are less efficient. They can, however, be used in areas of the program which do not execute frequently, to make things more readable.
  5. Try not to continually allocate and free memory.
  6. At the same time, watch out for memory leaks. For this, see the Performance tab of the Task Manager window and watch for a continual increase in memory usage.
  7. Do not compromise modularity for speed except in cases where it is really important.
  8. Concise does not always mean fast.
  9. Do not use recursion where an iterative algorithm will do.
  10. Do not pass unneeded arguments to functions.

Thursday, October 7, 2010

Optimal Arithmetic Operations

While we write programs, we usually think in terms of the decimal number system. It may seem natural to us, but the computer prefers binary. So using arithmetic operations that are more binary-friendly will help optimize your program.

It is better to multiply and divide by a power of two rather than some other number. Obviously, this should only be done when it does not affect the program's correctness. The optimization comes from the compiler replacing the division instruction with a shift instruction, which many processors can complete in a single clock cycle or less. This is similar to dividing or multiplying a number by 10 in the decimal system.

More optimized code can also be obtained by using an int (4 bytes / 1 word on typical 32-bit targets) rather than a char or a byte, even if your data fits into 1 byte. This is because many compilers first promote a char or byte to an int when doing arithmetic operations and then convert the result back again, which slows program execution.

Saturday, September 4, 2010

Hello, I have downloaded the Intel® 64 and IA-32 Architectures Optimization Reference Manual, and I got to know some nice tricks to optimize code. So I will be posting them one by one.

This is the first trick that I got from this manual:

Try to keep your loops under 16 instructions long. If you do, the whole loop can be stored in the instruction queue itself, avoiding many penalties. This also allows the instruction fetch and pre-decode units of the processor to remain idle, reducing power consumption.
However, the manual also says that loop unrolling can in many cases be faster than trying to fit the whole loop in the instruction queue. So to make the right decision, profile your code and see which works faster.

Sunday, August 22, 2010

Very Very Quick Sort!!

Welcome back!!
This is a cool algorithm for super fast sorting. It sorts an array of numbers, nos, whose values range from 0 to n; the result is stored in res. The size of nos is x.

Declare array arr of size n+1
Initialise arr to 0

Loop from 0 to x-1 using counter c
    arr[nos[c]]++
End loop

Set i to 0
Loop from 0 to n using counter c
    Loop from 0 to arr[c]-1 using counter a
        res[i] = c
        i = i + 1
    End loop
End loop
This algorithm trades memory usage for speed, and it is impractical to use if the range of values to be sorted is large. Memory can be saved if the removal of repetitions is not an issue. For example, 1, 9, 5, 9, 4, 4, 6, 7 can be sorted as 1, 4, 5, 6, 7, 9, where the repetitions of 4 and 9 are removed. In this case one bit for each value is enough, to indicate whether a value is present or not.

Saturday, August 21, 2010

Hotspots

In this post we'll discuss the detection of hotspots in a program.
A hotspot is a place in the program where it spends a lot of its time. Optimizing hotspots will therefore yield more than optimizing other areas of the program.

Profilers can be used to detect hotspots. One such profiler is AMD CodeAnalyst, free software provided by AMD for profiling programs on AMD machines. The instructions on how to use it are quite clear, and the download is about 50MB.

I have also heard a lot about Intel's VTune, and its features sound impressive, but I have not tested it because of its high cost. VTune even lets you go inside a machine instruction and see what is happening there; the maximum resolution of CodeAnalyst, however, is one machine instruction.

Wednesday, July 21, 2010

Code Optimization Techniques

Code optimization is a much-neglected activity these days because of ever-growing computing speed. But if done properly, code optimization can be fast, effective and satisfying. I will discuss some methods of code optimization here:


Loop Unrolling:
Loop unrolling involves copying the loop body a few times, so that the number of iterations is decreased. To understand this, let us look at a program which adds up a 400-item array.

Normal loop:
for(int i = 0; i < 400; i++){
    x += arr[i];
}

Unrolled loop:
for(int i = 0; i < 400; i += 4){
    x += arr[i];
    x += arr[i+1];
    x += arr[i+2];
    x += arr[i+3];
}

The second (unrolled) loop will run faster than the first, because there are fewer jump and compare instructions. The unroll factor (here 4) should usually be a small power of two such as 2 or 4; unrolling further bloats the code and can hurt the instruction cache.


Moving Stuff Out Of Loops
Generally, loops and recursive functions are the most time-consuming parts of a program, so their bodies need to be kept short. For example, the following program snippet makes n copies of a string s and stores them in an array arr:



char **arr = new char*[n];

for(int i = 0; i < n; i++){
    arr[i] = new char[strlen(s)+1];
    strcpy(arr[i], s);
}

You might have noticed that we are calling the strlen function inside the loop, because of which it gets executed n times. This is unnecessary, as the length of the string s does not change, so a simple modification can make the program run much faster. Here is the modified code:


char **arr = new char*[n];
int len = strlen(s) + 1;

for(int i = 0; i < n; i++){
    arr[i] = new char[len];
    strcpy(arr[i], s);
}

It is important to note that some compilers are smart enough to notice this inefficiency and move the strlen call out of the loop body themselves. In such a case, you will not notice a performance increase, because the program is already optimized!




Optimizing Sequential Memory Reads
There is a vast scope for optimization while reading memory sequentially. The following unoptimized program snippet finds the maximum of 10000 numbers (chars):


char *arr; // arr points to a 10000-byte array
char max = arr[0];
for(int i = 1; i < 10000; i++){
    if(arr[i] > max) max = arr[i];
}


The following program does that much more efficiently:

char *arr; // arr points to a 10000-byte array
char max = arr[0];
// First pass: touch one byte per 64-byte cache line, so the whole
// array gets pulled into the cache
for(int i = 1; i < 10000; i += 64){
    if(arr[i] > max) max = arr[i];
}
// Second pass: reads every element; the strided elements are read
// a second time, but that overhead is small
for(int i = 1; i < 10000; i++){
    if(arr[i] > max) max = arr[i];
}


Explaining exactly why this is faster is beyond the scope of this blog. But if you read about how memory (RAM) is structured, you will see that it can handle four or more sequential memory requests in parallel. The strided first pass lets the memory system load the array into the cache, after which the second pass reads it with hardly any delay.