Using Iprobe to optimize Apache on Alpha/Linux
OR
How I Learned to Love the Alpha and Hate the Scheduler

Phillip Ezolt
DATE

Abstract

This paper describes how Iprobe, an Alpha/Linux performance tool, was used to find system performance bottlenecks while running the Apache web server on Alpha/Linux. It describes the kernel bottlenecks that were uncovered and possible solutions.

1 First things first: Is Linux performance important?

Traditionally, Linux has had a small niche market. It grew up with hobbyists and students who were more interested in functionality and extendability than in performance. Linux's feature set was catered to those who were creating it. It was cheap, friendly to low-end machines, and had reasonable performance.

Recently, Linux has begun to grow beyond its academic base into the corporate world. Buying decisions are often made based on the performance numbers of a given product. Knowing the importance of performance, competing operating system vendors have begun to shout that their commercial solutions are better than Linux. To stop the stampede of users to Linux, commercial vendors have stepped up to show how their products beat Linux at certain benchmarks (http://www.mindcraft.com/whitepapers/nts4rhlinux.html).

This result is not surprising. The commercial products have had years of performance analysis and tuning, while Linux has had none. In order to stay competitive, Linux must increase its performance to match or beat the competition. The thought of Linux being outperformed should not be discouraging to the Linux community. Any competitor's result is a target that Linux can eventually achieve.

To solve system performance problems, it is necessary to determine where the bottlenecks occur. Until recently, Linux had no accurate way of determining where CPU cycles were being spent. It had no accurate performance profiling tool. That is... until Iprobe.

2 The Iprobe Suite

2.1 What is it?

Iprobe is a suite of programs that uses the Alpha event counters to pinpoint, at a function level, where a program is spending its time. Iprobe has a low overhead (approximately 5% in its highest-overhead mode), and can find bottlenecks in both kernel and user space.

The Alpha event counters can measure a multitude of events simultaneously. Cycles are the most commonly measured event, but others, such as bcache misses, floating point operations, loads, stores, and branch mispredicts, can be measured as well. The exact list of supported events depends on the Alpha chip implementation.

Iprobe has a shining history of success. It has been used to track down problems in various major commercial software products, including three operating systems and four databases. In addition, the information that Iprobe provided was used to influence Alpha chip design.

2.2 What is its structure?

The Iprobe suite consists of three parts:

1. a driver
2. a library
3. user space programs

The driver interacts directly with the kernel and handles performance interrupts. Events are counted by the Alpha, and after a certain number of them, a performance interrupt occurs. The PID, PC, and PS of the currently running process are then written to a kernel buffer. The user space program, iprobe, then uses the library, which provides an operating system independent API, to retrieve these samples and write them to disk. Later, using ipreduce and rep, the samples are combined with PC mappings from the original application to produce a performance report. The performance report allows one to determine accurately where the kernel and application bottlenecks are.

(Figure: iprobe_picture.eps)
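To make this flow concrete, here is a minimal sketch of what the driver does on each performance interrupt. This is not the actual Iprobe driver source; all of the names are invented for illustration. The drop-on-full behavior matches the description of the sample buffers in section 6.2.

#define SAMPLE_BUFFER_SIZE 1024

struct sample {
        int pid;                /* process that was running */
        unsigned long pc;       /* program counter at the interrupt */
        unsigned long ps;       /* processor status */
};

struct sample_buffer {
        struct sample samples[SAMPLE_BUFFER_SIZE];
        unsigned int count;
        unsigned long dropped;  /* samples lost because the buffer was full */
};

/* Called from the performance-counter interrupt handler. */
static void record_sample(struct sample_buffer *buf, int pid,
                          unsigned long pc, unsigned long ps)
{
        if (buf->count == SAMPLE_BUFFER_SIZE) {
                buf->dropped++; /* full: drop the sample, but remember it */
                return;
        }
        buf->samples[buf->count].pid = pid;
        buf->samples[buf->count].pc  = pc;
        buf->samples[buf->count].ps  = ps;
        buf->count++;
}

The user space side periodically drains these buffers to disk, which is why a larger buffer count (see section 6.2) lowers the chance of dropped samples.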
2.3 Porting to Alpha/Linux

Iprobe, which was already cross-platform, was ported to Alpha/Linux to give the Linux community a powerful performance tool. It uses the Alpha on-chip counters, and is therefore only available for Alpha/Linux. It gives performance-hungry Alpha users the opportunity to tune their code to the fullest extent. However, it also allows one to find and fix bottlenecks which plague the entire Linux community. After a year and a half of part-time work, in January of 1999, the entire Iprobe suite was released to the Linux community under the GPL.

3 Apache

Apache is an open source web server that has captured 54% of the web server market, with over 2 million sites using it (http://www.netcraft.com/Survey). Apache has a high appeal in the ISP market, and it would be advantageous if Apache on Linux could match or outperform its competitors.

When running, apache has one parent process and many children. The children accept connections and serve pages, while the parent spawns new children and replaces dead ones. All children would normally listen for incoming requests with the accept() function. However, every time a web request comes in, all of the processes awaken. Only one process services the connection, and the rest go back to sleep. This can cause a large number of processes to change from sleeping to running in a short amount of time. The phenomenon has been nicknamed the ``thundering herd.''

Apache attempts to avoid the thundering herd by guaranteeing that only one process is listening with accept() at a time. Children are only allowed to listen with accept() when they have acquired a file lock with the flock() function. (HERE: picture)
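The serialization looks roughly like the following loop, run by every apache child. This is a simplified sketch, not Apache's actual source; the file descriptors and the serve_request() helper are invented for illustration.

#include <sys/file.h>
#include <sys/socket.h>
#include <unistd.h>

extern int lock_fd;     /* shared lock file, opened before forking */
extern int listen_fd;   /* listening socket shared by all children */

extern void serve_request(int conn);    /* read request, send response */

void child_main_loop(void)
{
        int conn;

        for (;;) {
                flock(lock_fd, LOCK_EX);        /* only the lock holder may accept() */
                conn = accept(listen_fd, NULL, NULL);
                flock(lock_fd, LOCK_UN);        /* release before serving the page */

                if (conn >= 0) {
                        serve_request(conn);
                        close(conn);
                }
        }
}

Note that releasing the lock wakes every child sleeping in flock(), so the herd has been moved rather than eliminated; this becomes important in section 8.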
4 SPECweb96

SPEC, known initially for its CPU benchmark suites, has created a web benchmark suite, SPECweb96, that can be used to determine web server performance. SPECweb96 provides an objective way to measure the effectiveness of performance tuning. Although other web benchmarks exist, SPECweb96 is the only industry standard.

The SPECweb96 test harness simulates a large number of web users. Although a standard fileset must be served, any web server can be measured in a SPECweb96 testbed. A successful SPECweb96 run measures the maximum ops/second a web server can handle, as well as the response time for each op. Ops are requests for static pages; none of the pages are dynamically generated. Before a SPECweb96 run, the user must give the test harness an approximate ops/sec goal. The web server is then pulsed ten times, starting at a low value and slowly climbing to the projected maximum ops/sec.

5 The test itself

5.1 The goal

Iprobe will run during a SPECweb96 test. The goal of the test is to determine where the Linux performance bottlenecks exist, and possibly to propose solutions to the problems that they reveal.

5.2 The hardware

5.2.1 Server

When running a SPECweb96 test, a large amount of hardware is required to simulate a significant number of users. In order to guarantee a reasonably sized testbed, an underpowered server is used. This server, while not capable of world class performance, is a microcosm of a larger, higher powered server.

* Digital AlphaStation 500 w/266MHz Alpha 21164
* 256MB memory
* 2 2.1GB RZ28B drives
* 1 Qlogic ISP 1020 SCSI controller
* 1 Alteon ACEnic Gigabit Ethernet card

5.2.2 Client

* Digital Personal Workstation w/600MHz Alpha 21164
* 192MB memory
* DEGPA Gigabit Ethernet card

5.3 The server software

At the time of the test, the latest Linux software was used to guarantee that all of the latest performance patches had been integrated.

* Linux 2.2.5
* egcs-1.1.1
* Redhat 5.2
* apache-1.3.4
* iprobe-4.1

5.4 The Config (Kernel/Apache)

The kernel used was a stock v2.2.5. Apache was configured with all extra features turned off. In addition, the maximum number of running servers was raised. (HERE) (config files are appended)

6 Running Iprobe for the Apache/Linux test

6.1 Preparing the system

In order to get Iprobe up and running on an Alpha/Linux system, the machine must have a v2.2.0 or greater kernel. The system must also be booted from the Digital Unix SRM console using aboot. The MILO PALcode currently does not have a working wrperfmon instruction, which Iprobe needs.

If one wants an accurate analysis of where time is spent in the kernel, /usr/src/linux/System.map must pertain to the kernel running during the test. In addition, the test program should be compiled with symbols. This can be done by using the ``-g'' flag in gcc. Although Iprobe can work with symbol-less images, it will not be able to map the PCs generated by the event counters back to the original program's functions.

Finally, the iprobe kernel module must be inserted into the kernel. This provides the iprobe kernel interface for the iprobe library. A binary driver is provided with the suite in order to ease use. Although one can recompile the driver against each new kernel, it is not really necessary, because the performance counter interface has not changed for the entire v2.2 kernel series. Insmod will complain that the module was compiled for a different kernel version, but by using the ``-f'' flag, one can insert the module with the following command:

/sbin/insmod -f iprobe_suite-4.1-driver.o.gz

6.2 Collecting sampling information

Once the machine is set up, the test can be run and the sampling information collected. First, iprobe can be started with the following command:

iprobe -meth samp -o strobe300_single -buffer_count 50

This command tells iprobe to collect sampled data (-meth samp) and to put it in the ``strobe300_single'' sample file. It also tells iprobe to use a larger number of sample buffers than the default of 3. These sample buffers are the temporary holding space in kernel memory for the samples generated by the Alpha event counters. If all of the sample buffers are full and the driver receives an interrupt, it will simply drop the sample and record that it did so. A large number of missed samples reduces the accuracy of the information provided; a large number of internal buffers reduces the likelihood of missed samples.

Immediately after starting iprobe, the test must be started. In this case, the SPECweb96 client is started. As the test runs, iprobe collects samples and stores them in the sample file. When the test is done, pressing ``control-c'' in the iprobe window will stop the sample collection.

In order to have iprobe start and stop automatically, one could instead use iprobe's -command ``program'' switch. This will start the sampling when the program starts, and finish the sampling when the program ends.
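For instance, a single program might be profiled with an invocation like the following. This is illustrative only, built from the switches described above; the program and sample file names are placeholders.

iprobe -meth samp -command "mpg123 test.mp3" -o mpg123_samples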
This method works wonderfully for single process programs such as ``quake'' or ``mpg123'', but it is less useful for apache, because apache never exits.

After iprobe has run, a sample file exists which contains all of the information needed for analysis. However, to make the output more useful, address ranges must be extracted from the application.

6.3 Rep: Extracting the Symbols

The user level program, rep, extracts the mapping from PC address to function name. Rep behaves much like the binutils ``nm'' or ``objdump'' programs. The information that rep provides will later be combined with the iprobe samples to produce a human readable report.

Rep can analyze a process in two ways. First, if a process is already running, rep can attach to its pid and determine all of the symbols from the process's /proc information. This is the most reliable way of gathering the symbols:

rep -p 1288 (pid of a running apache)

However, rep can also try to start the image itself and extract the symbols in a similar way, using the following command:

rep /usr/local/apache/bin/apache

If a program is short lived, though, rep may not be able to extract all of the symbols before the image exits.

When rep runs, it extracts program symbols from the specified application, and kernel symbols from the /usr/src/linux/System.map file. To guarantee accurate analysis, both of these files must pertain to the application and kernel used during the test.

6.4 Analysis with Ipreduce

Now that iprobe has produced the sample file and rep has produced the mapping from PC range to function, the interesting work can begin. The following command generates a report which shows the system bottlenecks:

ipreduce -in strobe300 -ou strobe300.rpt -pthresh 1 -mode ku

The ``-mode ku'' tells ipreduce to use only samples gathered in kernel and user mode. All samples gathered while the machine was idle are ignored. In addition, ``-pthresh 1'' tells ipreduce to display only those functions that account for >=1% of the spent cycles. The ``-in strobe300'' tells ipreduce where to find the sample file, and the ``-ou strobe300.rpt'' gives ipreduce a name for the report file. When ipreduce is done, the file ``strobe300.rpt'' contains the following information:

Begin            End                                              Sample   Image  Total
Address          Address          Name                            Count    Pct    Pct
-------          -------          ----                            -----    ---    ---
0000000000000000-00000000000029FC /usr/bin/httpd                  127463          18.5
00000001200419A0-000000012004339F ap_vformatter                    15061   11.8    2.2
FFFFFC0000300000-00000000FFFFFFFF vmlinux                         482385          70.1
FFFFFC00003103E0-FFFFFC000031045F entInt                            7848    1.6    1.1
FFFFFC0000315E40-FFFFFC0000315F7F do_entInt                        48487   10.1    7.0
FFFFFC0000327A40-FFFFFC0000327D7F schedule                        124815   25.9   18.1
FFFFFC000033FAA0-FFFFFC000033FCDF kfree                             7876    1.6    1.1
FFFFFC00003A9960-FFFFFC00003A9EBF ip_queue_xmit                     8616    1.8    1.3
FFFFFC00003B9440-FFFFFC00003B983F tcp_v4_rcv                       11131    2.3    1.6
FFFFFC0000441CA0-FFFFFC000044207F do_csum_partial_copy_from_user   43112    8.9    6.3

This report shows in which routines the events occurred. It gives percentages of events relative both to the entire run and to the particular image. For example, the schedule function in vmlinux is responsible for 18.1% of the total event count, and for 25.9% of the vmlinux image count.
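Conceptually, ipreduce attributes each sampled PC to a function by searching the address ranges that rep extracted. The following is a sketch of that lookup under the assumption that the ranges are sorted and non-overlapping; the names are invented, and the real implementation may differ.

#include <stddef.h>

/* One address range per function, as extracted by rep. */
struct symbol {
        unsigned long start, end;       /* [start, end) PC range */
        const char *name;
        unsigned long count;            /* samples attributed so far */
};

/* Binary search over ranges sorted by start address.
 * Returns NULL if the PC falls outside every known range. */
struct symbol *attribute(struct symbol *syms, int nsyms, unsigned long pc)
{
        int lo = 0, hi = nsyms - 1;

        while (lo <= hi) {
                int mid = (lo + hi) / 2;
                if (pc < syms[mid].start)
                        hi = mid - 1;
                else if (pc >= syms[mid].end)
                        lo = mid + 1;
                else {
                        syms[mid].count++;
                        return &syms[mid];
                }
        }
        return NULL;
}

The report's percentages are then simply each function's count divided by its image's total and by the overall total.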
6.5 What are these routines?

Although ipreduce has shown more routines than the following, these appear to be where the majority of CPU cycles are spent.

1. apache: ap_vformatter - manipulates the strings for apache's printf implementation.
2. vmlinux: do_entInt - when an interrupt occurs, selects the proper interrupt handler.
3. vmlinux: schedule - decides which process will run next.
4. vmlinux: do_csum_partial_copy_from_user - calculates the IP checksum for packets while copying data from user space.

Two of these functions, ``ap_vformatter'' and ``schedule'', are particularly interesting. ``ap_vformatter'' is the main bottleneck of apache, while ``schedule'' is the main bottleneck of the kernel. It would be ideal to widen the performance bottlenecks in apache, because it is usually easier for a user to upgrade user space code than kernel code.

6.6 What more can Ipreduce tell us?

Although the information provided by ipreduce is already interesting, the data can be analyzed in many other ways. Ipreduce can show where within a particular routine time is being spent. When ``-routine routine_name'' is specified, ipreduce will only report on the given routine. The ``-display_options pc'' option tells ipreduce to show percentages at the PC level rather than the function level. This allows the user to determine whether a certain routine has any particular hot spots.

Since it would be desirable to fix the user space program, ipreduce will first look for hot spots in apache's ap_vformatter routine, with the following command:

ipreduce -in strobe300 -pthresh 1 -mode u -routine ap_vformatter -report_type statistics -display_options pc

00000001200419A0 ap_vformatter /usr/bin/httpd
00000001200419F8  0058   196 ( 1.3) *
0000000120041A00  0060   191 ( 1.3) *
0000000120041A10  0070   322 ( 2.1) **
0000000120041A1C  007C   290 ( 1.9) *
0000000120041A50  00B0   180 ( 1.2) *
0000000120041AF0  0150   242 ( 1.6) *
0000000120041B40  01A0   193 ( 1.3) *
0000000120041B5C  01BC   159 ( 1.1) *
0000000120041D84  03E4   151 ( 1.0) *
0000000120042068  06C8   197 ( 1.3) *
000000012004316C  17CC   598 ( 4.0) ***
0000000120043170  17D0   268 ( 1.8) *
0000000120043180  17E0   449 ( 3.0) **
0000000120043184  17E4   167 ( 1.1) *
00000001200431F0  1850   220 ( 1.5) *

The cycles spent in this function are spread out. No single PC location presents itself as a hot spot. It will not be easy to increase the performance of this function without rethinking its algorithm.

Now ipreduce will be asked to provide the same information about schedule:

ipreduce -in strobe300 -pthresh 1 -mode k -routine schedule -report_type statistics -display_options pc

FFFFFC0000327A40 schedule vmlinux
FFFFFC0000327C1C  01DC    2160 (  1.7)
FFFFFC0000327C34  01F4   28515 ( 22.8)
FFFFFC0000327C60  0220    1547 (  1.2)
FFFFFC0000327C64  0224   26432 ( 21.2)
FFFFFC0000327C74  0234   36470 ( 29.2)
FFFFFC0000327C9C  025C   24858 ( 19.9)

This is exactly the kind of profile that a performance engineer likes to see. Schedule has four hot spots which are closely grouped together. If these few lines of code can be optimized, there could be a significant increase in overall performance.

6.7 What iprobe found

Examining the output of ipreduce, it is clear that 70% of the time is being spent in kernel code. In addition, apache does not appear to have any significantly hot routines; the one routine that appears hot has a very large amount of code behind it. Iprobe provides this information in a quick and easy fashion. It eliminates the need for guesswork, and provides an objective determination of the performance bottlenecks in the system.

7 Something's rotten in schedule

7.1 Most of the time is spent in a few places

As shown by Iprobe, most of the time is spent in schedule. Of that time, the vast majority is spent in a small amount of code.
Since many of the functions that the scheduler calls are inlined, the exact location of the bottleneck is uncertain. However, a quick glance at the C code reveals one loop which, depending on the size of the runqueue, could take a significant amount of time:

while (p != &init_task) {
        if (can_schedule(p)) {
                int weight = goodness(p, prev, this_cpu);
                if (weight > c)
                        c = weight, next = p;
        }
        p = p->next_run;
}

At each call to schedule, the entire list of running processes is scanned to find the next process ``good'' enough to run. If the runqueue is significantly long, this could take a large amount of time. This is, in fact, the case. While the test is running, vmstat shows a large number of running processes (~145) as well as a large number of context switches (>10,000). Not only does a single call to schedule take a long time because of the large runqueue, but schedule is called a significant number of times because of the high number of context switches! At roughly 145 runnable processes per scan and more than 10,000 scans, the scheduler is evaluating goodness() over a million times in the same interval.

In order to prove beyond a shadow of a doubt that the O(n) runtime of schedule is the problem, it must be removed, and performance must be evaluated with an alternate solution that takes O(1) time. The following process selection code was provided by Greg Lindahl. It allows a process to be chosen quickly, but unfairly: process selection occurs in constant time, and every call to schedule will quickly select a next process regardless of run queue length.

while (p != &init_task) {
        if (can_schedule(p)) {
                int weight = goodness(p, prev, this_cpu);
                if (weight >= 0) {
                        c = weight, next = p;
                        break;
                }
        }
        p = p->next_run;
}

When running a SPECweb96 strobe test with the O(1) process selection patch, the amount of time spent in the scheduler drops significantly. Here are ipreduce's results:

Begin            End                                              Sample   Image  Total
Address          Address          Name                            Count    Pct    Pct
-------          -------          ----                            -----    ---    ---
0000000000000000-0000000120006F2F /usr/bin/httpd                  124718          24.5
000000012000F500-000000012000F9BF ap_bwrite                         5455    4.4    1.1
00000001200419A0-000000012004339F ap_vformatter                    14705   11.8    2.9
0000020000590000-0000020000772FFF /lib/libc-2.0.7.so               73230          14.4
00000200005F4E20-00000200005F4F7F memcpy                            6409    8.8    1.3
FFFFFC0000300000-00000000FFFFFFFF vmlinux                         299958          59.0
FFFFFC00003103E0-FFFFFC000031045F entInt                            5309    1.8    1.0
FFFFFC0000315E40-FFFFFC0000315F7F do_entInt                        31072   10.4    6.1
FFFFFC0000327A40-FFFFFC0000327D7F schedule                          6363    2.1    1.3
FFFFFC00003B9440-FFFFFC00003B983F tcp_v4_rcv                       10052    3.4    2.0
FFFFFC00003DA900-FFFFFC00003DAFBF make_request                      5823    1.9    1.1
FFFFFC0000441CA0-FFFFFC000044207F do_csum_partial_copy_from_user   32257   10.8    6.3 ***
FFFFFC0000442900-FFFFFC0000442AD3 __copy_user                      12138    4.0    2.4

Instead of consuming 18.1% of the CPU time, schedule now takes up only 1.3% of the total runtime. In addition, the maximum SPECweb96 ops/sec increased by 15%. The linear search is the performance bottleneck.

8 How can the scheduling performance bottleneck be widened?

Now that iprobe has pinpointed the performance bottleneck, the important question is: ``how can it be fixed?''

One possible solution is to remove any unnecessary calls to schedule. Since a call to schedule can take a large amount of time, reducing the number of calls to schedule also reduces the amount of time spent in it. This change can be made to a relatively small amount of kernel code, and could probably be integrated into the 2.2.X stable kernel. In addition, it would eliminate extra work done by the CPU. However, the number of legitimate calls to schedule depends on the job mix.
Reducing the number of extraneous calls may help, but some job mixes might cause the problem to reappear.

Another solution is to remove the ``thundering herd'' problem. The ``thundering herd'' causes a large number of processes to be on the run queue for a small amount of time. When the ``thundering herd'' problem happens many times, the run queue stays long on average. Although apache attempts to work around the accept() thundering herd with flock(), flock itself has a thundering herd problem. By removing this problem from either or both of these functions, runqueues would not become unnecessarily long. Once again, this fix could easily be integrated into the development kernel. However, it still does not solve the underlying problem, because some job mixes legitimately require long run queues.

In order to be truly scalable, Linux must rewrite its scheduler. A scheduler that can select a process in constant time would solve the scalability problem once and for all. However, this requires significant changes to the kernel. It will not be able to guarantee the same behavior as the current Linux scheduler, and it is a change that will not be available until the 2.3.X kernels.

8.1 Create a fixed priority scheduler

One possible alternative to the current Linux scheduler, which has a linked list runqueue, is an array of linked lists. Each list represents a priority level, and processes within a priority level are selected in a round robin fashion. To allow insertions and removals in O(1) time, the current highest priority level is stored in a variable and updated at each insertion and deletion. In addition, each element in the array has head and tail pointers to allow O(1) insertion at either the head or the tail of the list. To guarantee that interactive processes get a high priority, processes first added to the runqueue are given an inflated priority, and eventually settle to their natural Linux priority. (HERE: picture?) A sketch of this data structure follows.
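The following is a minimal sketch of such a run queue, under the assumptions above. All of the names are invented for illustration; the real patch must also deal with SMP locking and with the priority inflation and decay just described.

#define NUM_PRIORITIES 64

struct task {
        int priority;           /* 0 = lowest, NUM_PRIORITIES-1 = highest */
        struct task *next;
};

struct run_queue {
        struct task *head[NUM_PRIORITIES];
        struct task *tail[NUM_PRIORITIES];
        int highest;            /* highest non-empty level, -1 if empty */
};

/* O(1): append at the tail of the task's priority level,
 * which gives round robin selection within the level. */
void enqueue(struct run_queue *rq, struct task *t)
{
        t->next = 0;
        if (rq->tail[t->priority])
                rq->tail[t->priority]->next = t;
        else
                rq->head[t->priority] = t;
        rq->tail[t->priority] = t;
        if (t->priority > rq->highest)
                rq->highest = t->priority;
}

/* Take the first task from the highest occupied level.  The downward
 * scan for the next non-empty level is bounded by the fixed number of
 * priorities, not by the number of runnable processes, so selection
 * time no longer grows with run queue length. */
struct task *dequeue(struct run_queue *rq)
{
        struct task *t;

        if (rq->highest < 0)
                return 0;

        t = rq->head[rq->highest];
        rq->head[rq->highest] = t->next;
        if (!t->next)
                rq->tail[rq->highest] = 0;

        while (rq->highest >= 0 && !rq->head[rq->highest])
                rq->highest--;
        return t;
}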
Preliminary tests have shown that under a high load the system is more responsive than when using the normal Linux scheduler. However, the patch is pre-beta, and some SMP issues still need to be resolved. This is an area of development for the 2.3.X kernels.

9 Inspirational Section

Iprobe quickly and accurately located the performance bottlenecks. It pinpointed the locations where time was spent. In order to bring Linux's performance up to par with the commercial operating systems, this kind of analysis must be done on all important benchmarks. By finding and widening bottlenecks, Linux can become not only the most flexible operating system, but the best performer as well.

9.1 What did we learn

Although iprobe has been demonstrated here, it has many more features. For more information, check out the webpage at http://www.alphalinux.org/iprobe Included in the iprobe distribution is a tutorial which complements this paper. It gives more detail about the setup of iprobe, rather than the analysis of results.

10 To Infinity and Beyond

10.1 Special Thanks

* (HERE)
* Greg Gaertner (SPEC High Performance Computing member) - Helping to understand the current scheduler/Suggesting the new scheduling algorithm
* Jim Goddard (Iprobe cross platform development) - Helping with porting Iprobe/Hashing out the new scheduling algorithm
* Greg Tarsa (Manager) - Encouraging work on Alpha/Linux when the benefits were only long-term
* Carol Astone (Performance Engineer) - Helping set up the SPECweb96 test bed
* Paula Smith (SPECweb committee member/SPEC OSG board chair) - Helping to configure and tune the testbed, webserver and operating system

10.2 Contact Information

Any questions about Linux Iprobe can be directed to Phillip.Ezolt@compaq.com or ezolt@perf.zko.dec.com. Check the web page at http://www.alphalinux.org/iprobe for news and information.