The Design and Evolution of Communication in PODOS

Sudharshan Vazhkudai

Tobin Maginnis

Department of Computer and Information Science

University of Mississippi

chucha@john.cs.olemiss.edu

ptm@pix.cs.olemiss.edu

http://www.cs.olemiss.edu/~chucha/PODOS

Abstract

Distributed Operating Systems (DOSs) have always attracted a plethora of researchers for decades that wish to make many computers appear to be one. Inspite of this goal, improvement in aggregate system performance has always been secondary to resource sharing or reliability. With performance as our goal, we are designing a Performance Oriented Distributed Operating System (PODOS). PODOS is the interaction of two or more monolithic Linux machines. The PODOS design has a number of key performance benefits. A few of these are:

PODOS builds upon a highly robust and performance oriented monolithic Linux kernel.
PODOS adds just four new components to the existing Linux operating system to make it distributed. These components are a Communication Manager (CM), a Network Manager (NM), a Resource Manager (RM), and Global Interprocess Communication (GIPC).
Each of these components is designed to achieve high performance. For example, the CM uses a custom high-speed protocol and the NM is tightly glued to Linux's filesystem to speed up remote file fetches.

In this paper we describe the design, implementation and evolution of the communication mechanism in PODOS.

Communication is the essence of a distributed system and in PODOS communication is handled by the CM. The CM is at the bottom of the DOS layer hierarchy (other higher-level layers being the NM, RM and GIPC). The CM has the following components:

The PODOS-packet protocol is a custom communication protocol designed with the performance doctrine in mind. The PODOS-packet lies at the very bottom of the communication subsystem in PODOS. The CM makes use of the PODOS-packet protocol for transmission and reception of packets. The CM interacts directly with what would be referred to as the datalink layer of a traditional network protocol stack.

PODOS-packets have separate protocol id at the datalink layer that differentiates PODOS-packets from all other network packets such as IP, ICMP, ARP and IGMP. The CM further installs a packet filter in the datalink layer to filter out PODOS-packets. PODOS-packet protocol design and its initial implementation are explained later.

The Communication Descriptor Table (CDT) is an interface provided by the CM for getting packets across the network. These packets come from higher-level PODOS layers namely the Network Manager (NM), the Resource Manager (RM), and the Global Interprocess communication (GIPC). The CDT design evolution and virtual circuit implementation is described later.

Transmission-Groups are a suite of algorithms that multiplex packets across multiple network interfaces. Transmission-Groups exploit parallel networks connected among distributed computers and thereby match the external aggregate network bandwidth with internal memory bandwidth.

In accomplishing all this PODOS exploits the inherent robustness of the Linux operating system.

1 Background

A distributed operating system (DOS) is basically the cooperation among a group of machines interconnected by a network such that the group of machines appear to the user as a single operating system. With distributed operating systems, users are not aware where their files are stored; nor are they aware that their programs may be executed by remote machines. All resources within the network are managed in a global fashion using global mechanisms rather than local mechanisms [1] .

A group of machines could cooperate for a variety of reasons. A few of them are (1) Resource Sharing, (2) Performance Enhancement, (3) Reliability, (4) Fault Tolerance and (5) Transparency. Tens of distributed operating systems have been designed and implemented with various goals. Most distributed system designs are willing to compromise on performance. On the other hand, systems that are designed to be performance oriented make no attempt to provide a single system image. They provide a high-performance computing environment (e.g., Beowulf, Condor, etc).

High-performance computing environments are designed to solve one class of problems, whereas a PODOS is designed as a general high-performance computing solution.

With these issues in mind, we are designing a distributed operating system that is performance oriented (PODOS) [2]. As a secondary design goal, PODOS provides a resource sharing environment.

1.1 The PODOS Cluster

PODOS is an experimental Linux cluster being developed at the University of Mississippi. The primary project objective is to explore the performance capabilities of a distributed system, but at the same time incorporate the basic features of any distributed operating system. Furthermore we try to minimize the additions to the basic operating system [3].

Each node in the PODOS cluster is a monolithic Linux kernel. (We have chosen a monolithic kernel against a microkernel because the monolithic paradigm suits our goal much better.) Our design involves adding four components to the basic operating system to support distributed features. These include

Communication Manager

Resource Manager

Global IPC

Network Manager

Let us look at these components briefly.

1.1.1 Communication Manager

The Communication Manager (CM) handles remote communication in the PODOS. Responsibility of the CM is to interact with peer CMs in the cluster. The CM transmits and receives frames to/from other hosts through a Local Area Network (LAN). The CMs implement a simple send and receive protocol to the Network Manager (NM), Resource Manager (RM) and Global Inter Process Communication (GIPC). The CM accepts remote host addresses and messages from higher-level processes (e.g., local-to-remote GIPC, NM-to-NM and RM-to-RM communication). The CM will combine these addresses, messages, and RM status information with an error-free protocol to send the message to its remote counterpart on the target machine [3].

1.1.2 Resource Manager

The Resource Manager maintains global system state information. Each RM in the cluster maintains information about every other node. Although resource management implies decision making algorithms (processor allocation algorithms, etc.), the RM is designed to be as simple as possible for performance. Thus, the RM provides the primitives required to build unique resource management strategies. Typically, a program loader will query the RM concerning a remote machine. The purpose of this query is to discover the number of processes, willingness to accept remote queries, etc. and based upon these queries, assign processes to processors. The RM will make use of the CM to transmit and receive system information in a broadcast or a piggybacked fashion among its peers [3].

1.1.3 Global Interprocess Communication

The GIPC provides a mechanism with which processes can communicate in PODOS. It allocates a global pid (GPID) for every process in PODOS so that processes can be uniquely identified. The GPID of each process is the pid (given to the process in the local machine) and the IP address of the machine (to record the fact that the process originated from this particular machine). GIPC further provides communication primitives for processes to communicate among themselves [3].

1.1.4 Network Manager

The Network Manager will interface with two Linux basic operating system (OS) components: the file manager and the process manager. These basic OS concepts will be extended to support the notion of a PODOS. File and process management will be able to recognize non-local file names and invoke the NM. NM local-to-remote requests will be carried out by simply invoking the CM [3].

This paper describes the design, implementation and evolution of communication in PODOS.

2 Communication in PODOS

Communication in PODOS is handled by the CM. Higher-level DOS layers (RM, NM, GIPC, etc.) rely on the CM for packet transmission and reception. The CM in each node uses a specialized protocol to talk to its neighbors. The CM comprises of the following components:

PODOS-packet protocol

Communication Descriptor Table (CDT)

Transmission-Groups

Figure 1 demonstrates the relationship among these subsystems. To the left in Figure 1 is the traditional network protocol stack, the Open Systems Interconnection (OSI) model of networking. It comprises of seven layers namely the Application layer, the Presentation layer, the Session layer, the Transport layer, the Network layer, the Datalink layer and the Physical layer. Each layer has its own set of headers and error checking protocol. All of this adds to the reliability and robustness of communication. In modern operating system implementations of the OSI model, the Transport and Network layers are together referred to as TCP/IP. To the right, in Figure 1, are the PODOS components. The communication subsystem is depicted in more detail. The CM comprises of three components namely the CDT, the Transmission-Groups and the PODOS-packet protocol. The CDT is an interface for higher-level PODOS layers. Higher-level layers typically make entries with the CDT. A Transmission-Group algorithm is applied to each entry in the CDT. And finally the PODOS-packet protocol transmits the packet using the datalink layer of the OSI model. The RM, NM and GIPC use the CM to communicate with peer entities in the PODOS cluster.

Figure 1

We will now look at the design and implementation of various components of the CM (Figure 1).

3 PODOS-packet Protocol

The PODOS-packet protocol is at the very bottom of the CM. It provides primitives for transmitting and receiving PODOS-packets. The PODOS packet protocol has evolved from a very rudimentary structure. In this section we will describe the primary design and the initial implementation of the protocol.

3.1 The Design

Since our primary goal is performance and the typical DOS bottleneck is network bandwidth, we needed an efficient communication mechanism that could speed up packet transmission and reception. We needed something other than the traditional networking protocol (left side of Figure 1). The traditional protocol consists of several layers and each layer has its own headers and error checking. However, traditional protocols also contain a lot of detail to accomodate many types of network configurations. Thus, we designed a distributed operating system packet (PODOS-packet), which bypasses the traditional network protocol stack [4].

In Figure 1, we can see how the CM resides above the datalink layer (Ethernet driver). From Figure 1 it is also evident how the CM has moved away from the traditional network layers and has less overhead.

Our approach here is to have the CM interact with the datalink layer (Ethernet driver) to transmit and receive packets. In short, the CM will have to transmit and receive packets that bypassed the network protocol stack. This would mean that the CM packets would have to have a separate protocol id, one that is different from IP, ICMP etc.

Now let us look at the implementation.

3.2 Implementation of PODOS-packet Protocol

When the PODOS project was initiated in August of 1998, the CM only existed in an embryonic stage.A mechanism was required to transmit and receive packets that bypassed the network protocol stack. Thus, we needed a testbed to send and receive PODOS-packets between local and remote user programs. This required the development of a user level program that directly invoked the Ethernet driver in the kernel to send and receive PODOS-packets [4].

3.2.1 Linux ioctl()

In Linux, user programs can query network interfaces using the input/output control (ioctl) system service. A typical ioctl call in a user program looks like this:

ioctl( int fd, int command, ( char * )argstruct );

The file descriptor fd is a socket descriptor returned by the socket() system call. The command can be subdivided into a number of categories depending on what aspect of driver control needs to be dealt with. For example:

Changing the routing table ( SIOCADDRT, SIOCDELRT )

Reading/updating the ARP/RARP caches ( SIOCDARP, SIOCSRARP )

Generic functions related to network interfaces ( SIOCGIFNAME, SIOCSIFADDR )

The argstruct depends on the ioctl command type. In some cases, new ioctl commands may be needed. For this purpose Linux has built-in hooks in Ethernet drivers. Linux provides a SIOCDEVPRIVATE ioctl command. The implementation is device-driver dependant [5]. A user program invokes a device specific ioctl command such as:

ioctl( sockid, SIOCDEVPRIVATE, ( char * )&ifr );

where the argument ifr is defined as an instance of struct ifreq. The user program fills ifr.ifr_name with the interface name associated with the device, e.g., eth0. Usually, the user program will also need to exchange command arguments and results with the kernel and this is done through the ifr.ifr_data field.

For the SIOCDEVPRIVATE command, the kernel checks to see if a device-specific handler has been set up in a device structure (every network device has a device structure). The addres of the device-specific handler is maintained as a function pointer in the do_ioctl field of the device structure. If the handler has been initialized by ifconfig, then the kernel invokes the ioctl handler inside the driver. The device-specific handler typically appears like this in the device structure:

Void ( * do_ioctl ) ( struct device *dev, struct ifreq *ifr, int cmd ); [5]

Implementation of ioctl commands are carried out with switch statements in the driver. Once in the handler, the selected switch case block is based upon the least significant bits of the command.

We implemented a new ioctl command by adding another case element to the ioctl handler in the Ethernet driver so that the user program could invoke the interface. In our example, the value of the case element would be SIOCDEVPRIVATE + n, where n is the next available number. We can have 16 such cases with codes ranging from 0x89F0 through 0x89FF. The Linux tulip driver employs the switch statement. Some drivers do not have a device-specific ioctl handler. For such drivers, we added a device-specific handler by pointing the do_ioctl field in the device structure to our handler. An example of such a driver is the epic100 [4].

Inside the ioctl command we built a PODOS-packet and transmitted it.

3.2.2 Transmitting a Packet

Our implementation of the ioctl command performs the following steps: [4]

The ioctl command copies the data from the user space to kernel space [6] and builds a PODOS-packet out of the data. (We will look at the structure of a PODOS-packet in more detail later.)

The ioctl command allocates a network socket buffer [7] called sk_buff. The sk_buffer pool is used by all network interfaces and protocols. Our implementation of the ioctl command also stuffs the PODOS-packet into the data portion of the network buffer [4].

The ioctl command builds an Ethernet packet [8] using the source Ethernet address, destination Ethernet address and a unique Ethernet protocol id (0x1975) by invoking the Ethernet device's header building function (also maintained as a function pointer in the device structure). The header routine is a typical network stack function in that it stuffs the Ethernet header in front the data (PODOS-packet) in the network sk_buffer [4]. Upon header routine completion the PODOS-packet has been encapsulated with an Ethernet packet as shown in Table 1.

The Ethernet Packet

62 bits	Preamble
2 bits	Start of Frame Delimiter
6 bytes	Destination Ethernet Address
6 bytes	Source Ethernet Address
2 bytes	Packet Type (For PODOS-packet id=0x1975)
46 - 1500 bytes	Data (The PODOS-packet structure)
4 bytes	Frame Check Sequence

Table 1

The ioctl command invokes the device's transmit function with the network sk_buffer and requests for immediate transfer of the packet. Upon successful transmission, sk_buffer is deallocated.

3.2.3 Receiving a Packet

There is no receive method in a network device, because it is the device that invokes processing of such events. With a typical network device, an interrupt notifies the handler that a completed packet is ready for reception. The network device allocates an sk_buffer of suitable size and places the bytes from the hardware into the buffer. Next, the device driver analyses the frame to decide the packet type. Finally, the device driver passes the buffer up to the protocol layer [7].

We added a packet filter in the datalink layer (Ethernet driver) to filter out our PODOS-packets before the packet is given to the network protocol stack. We do this by simply checking if the protocol id of the packet is 0x1975 (that of a PODOS packet) and pass it to the CM [4].

The following Figure illustrates the above described implementation.

4 The Communication Descriptor Table

The CDT is the CM's interface to the other PODOS layers. The CM shields the higher-level PODOS layers by performing all the intricate details involved with packet transmission. The higher-level PODOS protocols register their packets with the CDT. The CM picks up packets from the CDT and transmits the packets. The CM also uses the CDT to construct a virtual circuit, which helps in streamlining communication between peer components [2]. The CM also maintains a simple timeout mechanism by which it can keep track of errors and retransmissions. Thus, the CM maintains a simple and elegant protocol for packet delivery [3].

4.1 The CDT Structure

A fully functional communication descriptor table has marked a milestone in the evolution of the CM. The higher-level PODOS layers fill in a CDT entry and invoke the CM. The CM picks up the PODOS-packet from the CDT and transmits it. A typical entry in a CDT looks like this: [2]

Struct comm_desc_tab {
struct PODOS_pkt pkt;	The PODOS-packet structure is embedded in the CDT and will be discussed later.
char *to_pid[8];	Higher-level layers specify the process id in the remote machine to whom the packet is sent. This could be a group of processes to support group communication.
int to_pid_cnt;	The count of the processes referencing this CDT entry.
int need_reply;	Whether the CDT entry should be held for a longer time. For example, RM requests are typically short lived as compared to NM or GIPC requests.
char *hostname[20];	The name of the machine to whom the packet is being sent. Could be list of machine names (for group communication).
int host_cnt;	The count of machine names.
struct interface if;	The interface structure (will be discussed later).
};

Table 2

Now let us look at the PODOS-packet structure.

4.2 The PODOS-packet Structure

Table 3 describes the PODOS-packet structure and also illustrates the importance of each field [2].

struct PODOS_pkt {
char from_pid[8];	The global pid of the process from whom the packet is originating.
char to_pid[8];	The global pid of the process to whom the packet is being sent.
char ctrl_info;	This a one byte field that is used by the CM to differentiate NM, RM and IPC packets. It uses the least significant 3 bits. Higher-level protocols use the most significant 5 bits. For example, the NM would use it to differentiate open, read, write, close, etc.
int cdt_active_ind;	The CDT index at the active end (the end that originated the connection).
int cdt_passive_ind;	The CDT index at the passive end (the end that is accepting the connection).
int wake_active_passive;	Since the CM code is the same for active and passive ends, it needs to know which process to wake up.
int length;	The length of the data being sent.
char data[1400];	The data.
};

Table 3

5 Evolution of communication in PODOS

The PODOS-packet began as a simple Ethernet packet, but has come a long way over the last year. Currently, we have a fully functional CM (built into the kernel) that maintains the CDT. The CM achieves further performance gains by multiplexing PODOS-packets across multiple Ethernet interfaces. This feature is referred to as "Transmission-Groups" [9].

In this section we will look at how a higher-level PODOS protocol uses the CDT and the CM to get task done. Let us consider an NM request. The NM is responsible for remote file fetches. When a user process requests a remote file, the local NM would fill appropriate entries in the PODOS-packet and then make an entry in the CDT. It then invokes the CM [2]. The CM picks out the PODOS-packet from the CDT entry and decides on an interface. The CM employs "Transmission-Groups" to transmit packets [9].

5.1 Transmission-Groups

Each node in PODOS has been configured with multiple Ethernet interfaces. This increases the LAN bandwidth and effectively utilizes the communication media. Once the higher-level PODOS layers register their packets with the CDT, the CM decides which interface the packet should go through. This decision is based on a simple round-robin algorithm [9].

Table 4 shows the interface structure [2] of the CDT which is filled in by the Transmission-Group algorithm and is eventually used by the CM for transmitting packets.

struct interface if {
char active_interface[6];	The hardware address of the interface at the active end.
char passive_interface[6];	The hardware address of the interface at the passive end.
int (elect_interface)(struct device );	Transmission-Group algorithm held as a function pointer. This is the function that round-robin's packets.
};

Table 4

Further research is being conducted to explore load-based algorithms in Transmission-Groups. Even though the round-robin algorithm uniformly distributes packets across interfaces, it might inadvertently pump packets into an interface which is already loaded. In such cases, load-based algorithms would avoid overloading such interfaces and thereby make intelligent decisions on interface selection.

5.2 State of the testbed

The initial testbed was constructed for checking PODOS-packet delivery and it is now being used for load generation purposes. For the load-based algorithms described above, we need to be able to see some network traffic. So, test programs are used to artificially simulate load across interfaces. The (ioctl) testbed previously developed is being used along with simple network (socket) programs to simulate PODOS packet and IP load across interfaces [9].

5.3 Virtual Circuits

The NM file fetch request we are considering is characterized by an open call, a sequence of read or write calls in any order and then a close call. We call such requests "Virtual Circuits". Initially when the NM makes an open call, the first packet of the virtual circuit, the CM applies the Transmission-Group algorithm to decide on an interface for the virtual circuit and stores it in the interface structure of the CDT structure. The packet is transmitted to the destination where the CM (at the passive end) makes a CDT entry and gives the packet to the NM for further processing. Subsequent NM read or write calls to the remote file will go through the initial CDT entry. Now, when the CM has to transmit the PODOS-packet, it will just look at the interface structure of the CDT entry and send it to the passive_interface address from the active_interface address. When a close call is made by the NM the CDT entries at both the ends are released and the virtual circuit is disconnected. Thus peer to peer CM's interact to establish a virtual circuit [2].

6 Conclusion

In this paper we have dealt with the communication aspect in PODOS. We have explained the design, implementation and the evolution of communication in PODOS. We analyzed the main communication design which uses a custom protocol for packet transmission and reception in the LAN. We further examined the implementation of the PODOS-packet protocol and how it evolved from a simple ioctl testbed. We also shed some light on the evolution of the communication descriptor table and how it provides an interface for other DOS layers such as NM, RM and GIPC. And finally, we illustrate communication in PODOS with a simple virtual circuit established by the CM for a NM request.

The CM is undergoing a major metamorphosis currently with the addition of a suite of Transmission-Group algorithms and simple error checking mechanism in the CDT. These features would make the CM more robust and reliable. Actually it would be more apt to say that the whole PODOS system is taking shape as other components like the NM, RM and GIPC are evolving simultaneously.

References

[1] Goscinski, Distributed Operating Systems The Logical Design, Addison-Wesley Publishing Company, 1991.

[2] Sudharshan Vazhkudai, Tobin Maginnis, Performance Oriented Distributed Operating System (PODOS), Technical report, Computer Science Department, University of Mississippi, May 99.

[3] P.T. Maginnis, Design Considerations for the Transformation of MINIX into a Distributed Operating System, Proceedings of the 1988 ACM 16^th Annual Computer Science Conference.

[4] Sudharshan Vazhkudai, Tobin Maginnis, Distributed Linux: Evolutionary Steps, Technical Report, Computer Science Department, University of Mississippi, December 98.

[5] Vipul Gupta, Linux ioctl Primer, http://anchor.cs.binghamton.edu/courses/cs628/ioctl.html, 1996.

[6] Beck, Bohme, Dziadzka, Kunitz, Magnus and Verworner, Linux Kernel Internals, Addison Wesley, 1997.

[7] Michael K. Johnson, The Linux Kernel Hackers� Guide, http://khg.redhat.com, 1996.

[8] The Linux Documentation Project, http://sunsite.unc.edu/LDP/, 1998.

[9] Sudharshan Vazhkudai, Tobin Maginnis, Transmission-Group based communication mechanism for a clustered Linux, Proceedings of the LinuxWorldExpo Conference, San Jose, August 99.

Sudharshan Vazhkudai	Tobin Maginnis
Department of Computer and Information Science
University of Mississippi
chucha@john.cs.olemiss.edu
ptm@pix.cs.olemiss.edu
http://www.cs.olemiss.edu/~chucha/PODOS