Using Iprobe to optimize Apache on Alpha/Linux
OR
How I Learned to Love the Alpha and Hate the Scheduler

Phillip Ezolt
DATE

Abstract

This paper describes how Iprobe, an Alpha/Linux performance tool, was used to find system performance bottlenecks while running the Apache web server on Alpha/Linux. It describes the kernel bottlenecks that were uncovered and possible solutions.

1 First things first: Is Linux performance important?

Traditionally, Linux has had a small niche market. It grew up with hobbyists and students who were more interested in functionality and extendability than in performance. Linux's feature set was catered to those who were creating it. It was cheap, friendly to low-end machines, and had reasonable performance.

Recently, Linux has begun to grow beyond its academic base into the corporate world. Buying decisions are often made based on the performance numbers of a given product. Knowing the importance of performance, competing operating system vendors have begun to shout that their commercial solutions are better than Linux. To stop the stampede of users to Linux, commercial vendors have stepped up to show how their products beat Linux at certain benchmarks (http://www.mindcraft.com/whitepapers/nts4rhlinux.html).

This result is not surprising. The commercial products have had years of performance analysis and tuning, while Linux has had none. In order to stay competitive, Linux must increase its performance to match or beat the competition. The thought of Linux being outperformed should not be discouraging to the Linux community. Any competitor's result is a target that Linux can eventually achieve.

To solve system performance problems, it is necessary to determine where the bottlenecks occur. Until recently, Linux had no accurate way of determining where CPU cycles were being spent. It had no accurate performance profiling tool. That is... until Iprobe.

2 The Iprobe Suite

2.1 What is it?

Iprobe is a suite of programs that uses the Alpha event counters to pinpoint, at a function level, where a program is spending its time. Iprobe has a low overhead (approximately 5% in its highest-overhead mode), and can find bottlenecks in both kernel and user space.

The Alpha event counters can measure a multitude of events simultaneously. Cycles are the most commonly measured event, but others, such as bcache misses, floating point operations, loads, stores, and branch mispredicts, can be measured as well. The exact list of supported events depends on the Alpha chip implementation.

Iprobe has a shining history of success. It has been used to track down problems in various major commercial software products, including three operating systems and four databases. In addition, the information that Iprobe provided was used to influence Alpha chip design.

2.2 What is its structure?

The Iprobe suite consists of three parts:

1. a driver
2. a library
3. user space programs

The driver interacts directly with the kernel and handles performance interrupts. Events are counted by the Alpha, and after a certain number of them, a performance interrupt occurs. The PID, PC, and PS of the currently running process are then written to a kernel buffer. The user space program, iprobe, then uses the library, which provides an operating system independent API, to retrieve these samples and write them to disk. Later, using ipreduce and rep, the samples are combined with PC mappings from the original application to produce a performance report. The performance report allows one to determine accurately where the kernel and application bottlenecks are.

(Figure: iprobe_picture.eps)
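To make this flow concrete, here is a minimal sketch of what the driver does on each performance interrupt. This is not the actual Iprobe driver source; all of the names are invented for illustration. The drop-on-full behavior matches the description of the sample buffers in section 6.2.

#define SAMPLE_BUFFER_SIZE 1024

struct sample {
        int pid;                /* process that was running */
        unsigned long pc;       /* program counter at the interrupt */
        unsigned long ps;       /* processor status */
};

struct sample_buffer {
        struct sample samples[SAMPLE_BUFFER_SIZE];
        unsigned int count;
        unsigned long dropped;  /* samples lost because the buffer was full */
};

/* Called from the performance-counter interrupt handler. */
static void record_sample(struct sample_buffer *buf, int pid,
                          unsigned long pc, unsigned long ps)
{
        if (buf->count == SAMPLE_BUFFER_SIZE) {
                buf->dropped++; /* full: drop the sample, but remember it */
                return;
        }
        buf->samples[buf->count].pid = pid;
        buf->samples[buf->count].pc  = pc;
        buf->samples[buf->count].ps  = ps;
        buf->count++;
}

The user space side periodically drains these buffers to disk, which is why a larger buffer count (see section 6.2) lowers the chance of dropped samples.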
2.3 Porting to Alpha/Linux

Iprobe, which was already cross-platform, was ported to Alpha/Linux to give the Linux community a powerful performance tool. It uses the Alpha on-chip counters, and is therefore only available for Alpha/Linux. It gives performance-hungry Alpha users the opportunity to tune their code to the fullest extent. However, it also allows one to find and fix bottlenecks which plague the entire Linux community. After a year and a half of part-time work, in January of 1999, the entire Iprobe suite was released to the Linux community under the GPL.

3 Apache

Apache is an open source web server that has captured 54% of the web server market, with over 2 million sites using it (http://www.netcraft.com/Survey). Apache has a high appeal in the ISP market, and it would be advantageous if Apache on Linux could match or outperform its competitors.

When running, apache has one parent process and many children. The children accept connections and serve pages, while the parent spawns new children and replaces dead ones. All children would normally listen for incoming requests with the accept() function. However, every time a web request comes in, all of the processes awaken. Only one process services the connection, and the rest go back to sleep. This can cause a large number of processes to change from sleeping to running in a short amount of time. The phenomenon has been nicknamed the ``thundering herd.''

Apache attempts to avoid the thundering herd by guaranteeing that only one process is listening with accept() at a time. Children are only allowed to listen with accept() when they have acquired a file lock with the flock() function. (HERE: picture)
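The serialization looks roughly like the following loop, run by every apache child. This is a simplified sketch, not Apache's actual source; the file descriptors and the serve_request() helper are invented for illustration.

#include <sys/file.h>
#include <sys/socket.h>
#include <unistd.h>

extern int lock_fd;     /* shared lock file, opened before forking */
extern int listen_fd;   /* listening socket shared by all children */

extern void serve_request(int conn);    /* read request, send response */

void child_main_loop(void)
{
        int conn;

        for (;;) {
                flock(lock_fd, LOCK_EX);        /* only the lock holder may accept() */
                conn = accept(listen_fd, NULL, NULL);
                flock(lock_fd, LOCK_UN);        /* release before serving the page */

                if (conn >= 0) {
                        serve_request(conn);
                        close(conn);
                }
        }
}

Note that releasing the lock wakes every child sleeping in flock(), so the herd has been moved rather than eliminated; this becomes important in section 8.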
4 SPECweb96

SPEC, known initially for its CPU benchmark suites, has created a web benchmark suite, SPECweb96, that can be used to determine web server performance. SPECweb96 provides an objective way to measure the effectiveness of performance tuning. Although other web benchmarks exist, SPECweb96 is the only industry standard.

The SPECweb96 test harness simulates a large number of web users. Although a standard fileset must be served, any web server can be measured in a SPECweb96 testbed. A successful SPECweb96 run measures the maximum ops/second a web server can handle, as well as the response time for each op. Ops are requests for static pages; none of the pages are dynamically generated. Before a SPECweb96 run, the user must give the test harness an approximate ops/sec goal. The web server is then pulsed ten times, starting at a low value and slowly climbing to the projected maximum ops/sec.

5 The test itself

5.1 The goal

Iprobe will run during a SPECweb96 test. The goal of the test is to determine where the Linux performance bottlenecks exist, and possibly to propose solutions to the problems that they reveal.

5.2 The hardware

5.2.1 Server

When running a SPECweb96 test, a large amount of hardware is required to simulate a significant number of users. In order to guarantee a reasonably sized testbed, an underpowered server is used. This server, while not capable of world class performance, is a microcosm of a larger, higher powered server.

* Digital AlphaStation 500 w/266MHz Alpha 21164
* 256MB memory
* 2 2.1GB RZ28B drives
* 1 Qlogic ISP 1020 SCSI controller
* 1 Alteon ACEnic Gigabit Ethernet card

5.2.2 Client

* Digital Personal Workstation w/600MHz Alpha 21164
* 192MB memory
* DEGPA Gigabit Ethernet card

5.3 The server software

At the time of the test, the latest Linux software was used to guarantee that all of the latest performance patches had been integrated.

* Linux 2.2.5
* egcs-1.1.1
* Redhat 5.2
* apache-1.3.4
* iprobe-4.1

5.4 The Config (Kernel/Apache)

The kernel used was a stock v2.2.5. Apache was configured with all extra features turned off. In addition, the maximum number of running servers was raised. (HERE) (config files are appended)

6 Running Iprobe for the Apache/Linux test

6.1 Preparing the system

In order to get Iprobe up and running on an Alpha/Linux system, the machine must have a v2.2.0 or greater kernel. The system must also be booted from the Digital Unix SRM console using aboot. The MILO PALcode currently does not have a working wrperfmon instruction, which Iprobe needs.

If one wants an accurate analysis of where time is spent in the kernel, /usr/src/linux/System.map must pertain to the kernel running during the test. In addition, the test program should be compiled with symbols. This can be done by using the ``-g'' flag in gcc. Although Iprobe can work with symbol-less images, it will not be able to map the PCs generated by the event counters back to the original program's functions.

Finally, the iprobe kernel module must be inserted into the kernel. This provides the iprobe kernel interface for the iprobe library. A binary driver is provided with the suite in order to ease use. Although one can recompile the driver against each new kernel, it is not really necessary, because the performance counter interface has not changed for the entire v2.2 kernel series. Insmod will complain that the module was compiled for a different kernel version, but by using the ``-f'' flag, one can insert the module with the following command:

/sbin/insmod -f iprobe_suite-4.1-driver.o.gz

6.2 Collecting sampling information

Once the machine is set up, the test can be run and the sampling information collected. First, iprobe can be started with the following command:

iprobe -meth samp -o strobe300_single -buffer_count 50

This command tells iprobe to collect sampled data (-meth samp) and to put it in the ``strobe300_single'' sample file. It also tells iprobe to use a larger number of sample buffers than the default of 3. These sample buffers are the temporary holding space in kernel memory for the samples generated by the Alpha event counters. If all of the sample buffers are full and the driver receives an interrupt, it will simply drop the sample and record that it did so. A large number of missed samples reduces the accuracy of the information provided; a large number of internal buffers reduces the likelihood of missed samples.

Immediately after starting iprobe, the test must be started. In this case, the SPECweb96 client is started. As the test runs, iprobe collects samples and stores them in the sample file. When the test is done, pressing ``control-c'' in the iprobe window will stop the sample collection.

In order to have iprobe start and stop automatically, one could instead use iprobe's -command ``program'' switch. This will start the sampling when the program starts, and finish the sampling when the program ends.
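For instance, a single program might be profiled with an invocation like the following. This is illustrative only, built from the switches described above; the program and sample file names are placeholders.

iprobe -meth samp -command "mpg123 test.mp3" -o mpg123_samples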
This method works wonderfully for single process programs such as ``quake'' or ``mpg123'', but it is less useful for apache, because apache never exits.

After iprobe has run, a sample file exists which contains all of the information needed for analysis. However, to make the output more useful, address ranges must be extracted from the application.

6.3 Rep: Extracting the Symbols

The user level program, rep, extracts the mapping from PC address to function name. Rep behaves much like the binutils ``nm'' or ``objdump'' programs. The information that rep provides will later be combined with the iprobe samples to produce a human readable report.

Rep can analyze a process in two ways. First, if a process is already running, rep can attach to its pid and determine all of the symbols from the process's /proc information. This is the most reliable way of gathering the symbols:

rep -p 1288 (pid of a running apache)

However, rep can also try to start the image itself and extract the symbols in a similar way, using the following command:

rep /usr/local/apache/bin/apache

If a program is short lived, though, rep may not be able to extract all of the symbols before the image exits.

When rep runs, it extracts program symbols from the specified application, and kernel symbols from the /usr/src/linux/System.map file. To guarantee accurate analysis, both of these files must pertain to the application and kernel used during the test.

6.4 Analysis with Ipreduce

Now that iprobe has produced the sample file and rep has produced the mapping from PC range to function, the interesting work can begin. The following command generates a report which shows the system bottlenecks:

ipreduce -in strobe300 -ou strobe300.rpt -pthresh 1 -mode ku

The ``-mode ku'' tells ipreduce to use only samples gathered in kernel and user mode. All samples gathered while the machine was idle are ignored. In addition, ``-pthresh 1'' tells ipreduce to display only those functions that account for >=1% of the spent cycles. The ``-in strobe300'' tells ipreduce where to find the sample file, and the ``-ou strobe300.rpt'' gives ipreduce a name for the report file. When ipreduce is done, the file ``strobe300.rpt'' contains the following information:

Begin            End                                              Sample   Image  Total
Address          Address          Name                            Count    Pct    Pct
-------          -------          ----                            -----    ---    ---
0000000000000000-00000000000029FC /usr/bin/httpd                  127463          18.5
00000001200419A0-000000012004339F ap_vformatter                    15061   11.8    2.2
FFFFFC0000300000-00000000FFFFFFFF vmlinux                         482385          70.1
FFFFFC00003103E0-FFFFFC000031045F entInt                            7848    1.6    1.1
FFFFFC0000315E40-FFFFFC0000315F7F do_entInt                        48487   10.1    7.0
FFFFFC0000327A40-FFFFFC0000327D7F schedule                        124815   25.9   18.1
FFFFFC000033FAA0-FFFFFC000033FCDF kfree                             7876    1.6    1.1
FFFFFC00003A9960-FFFFFC00003A9EBF ip_queue_xmit                     8616    1.8    1.3
FFFFFC00003B9440-FFFFFC00003B983F tcp_v4_rcv                       11131    2.3    1.6
FFFFFC0000441CA0-FFFFFC000044207F do_csum_partial_copy_from_user   43112    8.9    6.3

This report shows in which routines the events occurred. It gives percentages of events relative both to the entire run and to the particular image. For example, the schedule function in vmlinux is responsible for 18.1% of the total event count, and for 25.9% of the vmlinux image count.
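Conceptually, ipreduce attributes each sampled PC to a function by searching the address ranges that rep extracted. The following is a sketch of that lookup under the assumption that the ranges are sorted and non-overlapping; the names are invented, and the real implementation may differ.

#include <stddef.h>

/* One address range per function, as extracted by rep. */
struct symbol {
        unsigned long start, end;       /* [start, end) PC range */
        const char *name;
        unsigned long count;            /* samples attributed so far */
};

/* Binary search over ranges sorted by start address.
 * Returns NULL if the PC falls outside every known range. */
struct symbol *attribute(struct symbol *syms, int nsyms, unsigned long pc)
{
        int lo = 0, hi = nsyms - 1;

        while (lo <= hi) {
                int mid = (lo + hi) / 2;
                if (pc < syms[mid].start)
                        hi = mid - 1;
                else if (pc >= syms[mid].end)
                        lo = mid + 1;
                else {
                        syms[mid].count++;
                        return &syms[mid];
                }
        }
        return NULL;
}

The report's percentages are then simply each function's count divided by its image's total and by the overall total.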
6.5 What are these routines?

Although ipreduce has shown more routines than the following, these appear to be where the majority of CPU cycles are spent.

1. apache: ap_vformatter - manipulates the strings for apache's printf implementation.
2. vmlinux: do_entInt - when an interrupt occurs, selects the proper interrupt handler.
3. vmlinux: schedule - decides which process will run next.
4. vmlinux: do_csum_partial_copy_from_user - calculates the IP checksum for packets while copying data from user space.

Two of these functions, ``ap_vformatter'' and ``schedule'', are particularly interesting. ``ap_vformatter'' is the main bottleneck of apache, while ``schedule'' is the main bottleneck of the kernel. It would be ideal to widen the performance bottlenecks in apache, because it is usually easier for a user to upgrade user space code than kernel code.

6.6 What more can Ipreduce tell us?

Although the information provided by ipreduce is already interesting, the data can be analyzed in many other ways. Ipreduce can show where within a particular routine time is being spent. When ``-routine routine_name'' is specified, ipreduce will only report on the given routine. The ``-display_options pc'' option tells ipreduce to show percentages at the PC level rather than the function level. This allows the user to determine whether a certain routine has any particular hot spots.

Since it would be desirable to fix the user space program, ipreduce will first look for hot spots in apache's ap_vformatter routine, with the following command:

ipreduce -in strobe300 -pthresh 1 -mode u -routine ap_vformatter -report_type statistics -display_options pc

00000001200419A0 ap_vformatter /usr/bin/httpd
00000001200419F8  0058   196 ( 1.3) *
0000000120041A00  0060   191 ( 1.3) *
0000000120041A10  0070   322 ( 2.1) **
0000000120041A1C  007C   290 ( 1.9) *
0000000120041A50  00B0   180 ( 1.2) *
0000000120041AF0  0150   242 ( 1.6) *
0000000120041B40  01A0   193 ( 1.3) *
0000000120041B5C  01BC   159 ( 1.1) *
0000000120041D84  03E4   151 ( 1.0) *
0000000120042068  06C8   197 ( 1.3) *
000000012004316C  17CC   598 ( 4.0) ***
0000000120043170  17D0   268 ( 1.8) *
0000000120043180  17E0   449 ( 3.0) **
0000000120043184  17E4   167 ( 1.1) *
00000001200431F0  1850   220 ( 1.5) *

The cycles spent in this function are spread out. No single PC location presents itself as a hot spot. It will not be easy to increase the performance of this function without rethinking its algorithm.

Now ipreduce will be asked to provide the same information about schedule:

ipreduce -in strobe300 -pthresh 1 -mode k -routine schedule -report_type statistics -display_options pc

FFFFFC0000327A40 schedule vmlinux
FFFFFC0000327C1C  01DC    2160 (  1.7)
FFFFFC0000327C34  01F4   28515 ( 22.8)
FFFFFC0000327C60  0220    1547 (  1.2)
FFFFFC0000327C64  0224   26432 ( 21.2)
FFFFFC0000327C74  0234   36470 ( 29.2)
FFFFFC0000327C9C  025C   24858 ( 19.9)

This is exactly the kind of profile that a performance engineer likes to see. Schedule has four hot spots which are closely grouped together. If these few lines of code can be optimized, there could be a significant increase in overall performance.

6.7 What iprobe found

Examining the output of ipreduce, it is clear that 70% of the time is being spent in kernel code. In addition, apache does not appear to have any significantly hot routines; the one routine that appears hot has a very large amount of code behind it. Iprobe provides this information in a quick and easy fashion. It eliminates the need for guesswork, and provides an objective determination of the performance bottlenecks in the system.

7 Something's rotten in schedule

7.1 Most of the time is spent in a few places

As shown by Iprobe, most of the time is spent in schedule. Of that time, the vast majority is spent in a small amount of code.
Since many of the functions that the scheduler calls are inlined, the exact location of the bottleneck is uncertain. However, a quick glance at the C code reveals one loop which, depending on the size of the runqueue, could take a significant amount of time:

while (p != &init_task) {
        if (can_schedule(p)) {
                int weight = goodness(p, prev, this_cpu);
                if (weight > c)
                        c = weight, next = p;
        }
        p = p->next_run;
}

At each call to schedule, the entire list of running processes is scanned to find the next process ``good'' enough to run. If the runqueue is significantly long, this could take a large amount of time. This is, in fact, the case. While the test is running, vmstat shows a large number of running processes (~145) as well as a large number of context switches (>10,000). Not only does a single call to schedule take a long time because of the large runqueue, but schedule is called a significant number of times because of the high number of context switches! At roughly 145 runnable processes per scan and more than 10,000 scans, the scheduler is evaluating goodness() over a million times in the same interval.

In order to prove beyond a shadow of a doubt that the O(n) runtime of schedule is the problem, it must be removed, and performance must be evaluated with an alternate solution that takes O(1) time. The following process selection code was provided by Greg Lindahl. It allows a process to be chosen quickly, but unfairly: process selection occurs in constant time, and every call to schedule will quickly select a next process regardless of run queue length.

while (p != &init_task) {
        if (can_schedule(p)) {
                int weight = goodness(p, prev, this_cpu);
                if (weight >= 0) {
                        c = weight, next = p;
                        break;
                }
        }
        p = p->next_run;
}

When running a SPECweb96 strobe test with the O(1) process selection patch, the amount of time spent in the scheduler drops significantly. Here are ipreduce's results:

Begin            End                                              Sample   Image  Total
Address          Address          Name                            Count    Pct    Pct
-------          -------          ----                            -----    ---    ---
0000000000000000-0000000120006F2F /usr/bin/httpd                  124718          24.5
000000012000F500-000000012000F9BF ap_bwrite                         5455    4.4    1.1
00000001200419A0-000000012004339F ap_vformatter                    14705   11.8    2.9
0000020000590000-0000020000772FFF /lib/libc-2.0.7.so               73230          14.4
00000200005F4E20-00000200005F4F7F memcpy                            6409    8.8    1.3
FFFFFC0000300000-00000000FFFFFFFF vmlinux                         299958          59.0
FFFFFC00003103E0-FFFFFC000031045F entInt                            5309    1.8    1.0
FFFFFC0000315E40-FFFFFC0000315F7F do_entInt                        31072   10.4    6.1
FFFFFC0000327A40-FFFFFC0000327D7F schedule                          6363    2.1    1.3
FFFFFC00003B9440-FFFFFC00003B983F tcp_v4_rcv                       10052    3.4    2.0
FFFFFC00003DA900-FFFFFC00003DAFBF make_request                      5823    1.9    1.1
FFFFFC0000441CA0-FFFFFC000044207F do_csum_partial_copy_from_user   32257   10.8    6.3 ***
FFFFFC0000442900-FFFFFC0000442AD3 __copy_user                      12138    4.0    2.4

Instead of consuming 18.1% of the CPU time, schedule now takes up only 1.3% of the total runtime. In addition, the maximum SPECweb96 ops/sec increased by 15%. The linear search is the performance bottleneck.

8 How can the scheduling performance bottleneck be widened?

Now that iprobe has pinpointed the performance bottleneck, the important question is: ``how can it be fixed?''

One possible solution is to remove any unnecessary calls to schedule. Since a call to schedule can take a large amount of time, reducing the number of calls to schedule also reduces the amount of time spent in it. This change can be made to a relatively small amount of kernel code, and could probably be integrated into the 2.2.X stable kernel. In addition, it would eliminate extra work done by the CPU. However, the number of legitimate calls to schedule depends on the job mix.
Reducing the number of extraneous calls may help, but some job mixes might cause the problem to reappear.

Another solution is to remove the ``thundering herd'' problem. The ``thundering herd'' causes a large number of processes to be on the run queue for a small amount of time. When the ``thundering herd'' problem happens many times, the run queue stays long on average. Although apache attempts to work around the accept() thundering herd with flock(), flock itself has a thundering herd problem. By removing this problem from either or both of these functions, runqueues would not become unnecessarily long. Once again, this fix could easily be integrated into the development kernel. However, it still does not solve the underlying problem, because some job mixes legitimately require long run queues.

In order to be truly scalable, Linux must rewrite its scheduler. A scheduler that can select a process in constant time would solve the scalability problem once and for all. However, this requires significant changes to the kernel. It will not be able to guarantee the same behavior as the current Linux scheduler, and it is a change that will not be available until the 2.3.X kernels.

8.1 Create a fixed priority scheduler

One possible alternative to the current Linux scheduler, which has a linked list runqueue, is an array of linked lists. Each list represents a priority level, and processes within a priority level are selected in a round robin fashion. To allow insertions and removals in O(1) time, the current highest priority level is stored in a variable and updated at each insertion and deletion. In addition, each element in the array has head and tail pointers to allow O(1) insertion at either the head or the tail of the list. To guarantee that interactive processes get a high priority, processes first added to the runqueue are given an inflated priority, and eventually settle to their natural Linux priority. (HERE: picture?) A sketch of this data structure follows.
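The following is a minimal sketch of such a run queue, under the assumptions above. All of the names are invented for illustration; the real patch must also deal with SMP locking and with the priority inflation and decay just described.

#define NUM_PRIORITIES 64

struct task {
        int priority;           /* 0 = lowest, NUM_PRIORITIES-1 = highest */
        struct task *next;
};

struct run_queue {
        struct task *head[NUM_PRIORITIES];
        struct task *tail[NUM_PRIORITIES];
        int highest;            /* highest non-empty level, -1 if empty */
};

/* O(1): append at the tail of the task's priority level,
 * which gives round robin selection within the level. */
void enqueue(struct run_queue *rq, struct task *t)
{
        t->next = 0;
        if (rq->tail[t->priority])
                rq->tail[t->priority]->next = t;
        else
                rq->head[t->priority] = t;
        rq->tail[t->priority] = t;
        if (t->priority > rq->highest)
                rq->highest = t->priority;
}

/* Take the first task from the highest occupied level.  The downward
 * scan for the next non-empty level is bounded by the fixed number of
 * priorities, not by the number of runnable processes, so selection
 * time no longer grows with run queue length. */
struct task *dequeue(struct run_queue *rq)
{
        struct task *t;

        if (rq->highest < 0)
                return 0;

        t = rq->head[rq->highest];
        rq->head[rq->highest] = t->next;
        if (!t->next)
                rq->tail[rq->highest] = 0;

        while (rq->highest >= 0 && !rq->head[rq->highest])
                rq->highest--;
        return t;
}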
Preliminary tests have shown that under a high load the system is more responsive than when using the normal Linux scheduler. However, the patch is pre-beta, and some SMP issues still need to be resolved. This is an area of development for the 2.3.X kernels.

9 Inspirational Section

Iprobe quickly and accurately located the performance bottlenecks. It pinpointed the locations where time was spent. In order to bring Linux's performance up to par with the commercial operating systems, this kind of analysis must be done on all important benchmarks. By finding and widening bottlenecks, Linux can become not only the most flexible operating system, but the best performer as well.

9.1 What did we learn

Although iprobe has been demonstrated here, it has many more features. For more information, check out the webpage at http://www.alphalinux.org/iprobe Included in the iprobe distribution is a tutorial which complements this paper. It gives more detail about the setup of iprobe, rather than the analysis of results.

10 To Infinity and Beyond

10.1 Special Thanks

* (HERE)
* Greg Gaertner (SPEC High Performance Computing member) - Helping to understand the current scheduler/Suggesting the new scheduling algorithm
* Jim Goddard (Iprobe cross platform development) - Helping with porting Iprobe/Hashing out the new scheduling algorithm
* Greg Tarsa (Manager) - Encouraging work on Alpha/Linux when the benefits were only long-term
* Carol Astone (Performance Engineer) - Helping set up the SPECweb96 test bed
* Paula Smith (SPECweb committee member/SPEC OSG board chair) - Helping to configure and tune the testbed, webserver and operating system

10.2 Contact Information

Any questions about Linux Iprobe can be directed to Phillip.Ezolt@compaq.com or ezolt@perf.zko.dec.com. Check the web page at http://www.alphalinux.org/iprobe for news and information.