The largest Linux machine in the world is a big hunk of iron. Mainframe-style
big iron is back in fashion as IBM and independent developers alike have
brought the Linux platform to the S/390 mainframe running within virtual
machines. In his article "S/390: The Linux Dream Machine," Scott Courtney
(see Resources) introduced the S/390 port of Linux and hinted at its
potential as a high-end system to support large clusters of independent
virtual servers for running Linux applications.
If you think that the mainframe is a dead architecture with limited
potential, you should know that IBM and other vendors now sell more
mainframes than ever before. In fact, the mainframe has flourished alongside
the growth of the Internet as large vendors strive to put their information
systems online and need to expand the capabilities of their mainframe systems
to support the added demand.
So what is this thing called the S/390? What is VM/ESA and LPAR? Where did
such a port come from? For those unfamiliar with the S/390 system but
interested in hearing about this Linux port, this three-part series will
explain:
- The S/390 architecture and its origins
- The virtual machine (VM) hypervisor, an important part of the fledgling Linux-for-S/390 community
- The Logical Partitioning facilities (LPAR)
- The two organizations that were concurrently working on ports to S/390
- The I/O architecture, which is the biggest difference between the S/390 Linux port and other ports
- Why you might want to run Linux on S/390
- The applications the community has started running
- A vision, dubbed multiple virtual Linuxes or virtual penguin power, for using Linux on S/390
Introduction to S/390
The S/390 (System/390) architecture evolved from the S/360 (System/360) of
the 1960s. IBM, and Thomas Watson, Jr., in particular, risked the family
jewels in undertaking the development of the S/360. It was the largest
private venture in American history, with $5 billion spent on five new plants
and 60,000 additional employees. S/360 was first to employ instruction
microprogramming to facilitate derivative designs and create the concept of a
family architecture. The family originally consisted of six computers that
could each use the same software and peripherals. The system also popularized
remote computing, with terminals communicating with the host via phone lines
(see Resources, Data General, 1997).
Since that time there have been radical changes and enhancements, but a
programmer from that era would recognize many of the facilities of S/390.
S/360, S/370, and S/390 (or ESA/390 as it is now known) are upwardly
compatible. The S/360 was originally designed to allow programs written for earlier
IBM hardware to migrate to the new platform. This required that S/360 use the
IBM EBCDIC character set rather than the standard ASCII system. Above all
else, this feature is what has differentiated and separated the S/360 from
the rest of the computing world.
Currently, the three leading vendors that offer mainframes are IBM, Hitachi
Data Systems (HDS), and Amdahl, with IBM leading sales by a good margin. The
S/390 uses a custom processor with 31-bit addressing, compared to the more
common 32-bit systems. The 31-bit figure applies to memory addressing only and
not to the general processor architecture. A 64-bit version of the hardware is
rumored to be in the works for release in the near future.
Basic architecture
S/390 uses 31 bits to address 2 GB of physical memory. Like many other
processor platforms (e.g., i386, PowerPC), the S/390 uses a two-tier paging
scheme (segments and pages) as opposed to the three-tier
mechanism defined in Linux. The good news is that Linux's three-tier mechanism
has already been folded onto two-tier hardware for these other environments,
which eases some of the porting tasks.
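As a rough illustration of the two-tier scheme, the sketch below splits a 31-bit virtual address into a segment index, a page index, and a byte offset, then walks a toy segment table. The field widths (11, 8, and 12 bits) are an assumption based on the ESA/390 address translation format and are not stated in this article; the function names and the dictionary-based tables are hypothetical and exist only for illustration.

```python
# Assumed field widths (not given in the article): 11-bit segment index,
# 8-bit page index, 12-bit byte offset, which together make 31 bits.
SEGMENT_BITS = 11
PAGE_BITS = 8
OFFSET_BITS = 12
ADDRESS_BITS = SEGMENT_BITS + PAGE_BITS + OFFSET_BITS


def split_address(vaddr):
    """Split a 31-bit virtual address into (segment, page, offset)."""
    if not 0 <= vaddr < (1 << ADDRESS_BITS):
        raise ValueError("address does not fit in 31 bits")
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    page = (vaddr >> OFFSET_BITS) & ((1 << PAGE_BITS) - 1)
    segment = vaddr >> (OFFSET_BITS + PAGE_BITS)
    return segment, page, offset


def translate(vaddr, segment_table):
    """Two-tier lookup: segment table -> page table -> page frame.

    segment_table is a hypothetical dict mapping a segment index to a
    page table, itself a dict mapping a page index to a frame number.
    A missing entry raises KeyError, standing in for a translation fault.
    """
    segment, page, offset = split_address(vaddr)
    page_table = segment_table[segment]
    frame = page_table[page]
    return (frame << OFFSET_BITS) | offset


# 31 address bits reach exactly 2 GB.
addressable_bytes = 1 << ADDRESS_BITS
```

For example, `translate(0x102045, {1: {2: 0x123}})` looks up segment 1, page 2, and keeps the byte offset 0x045 within the mapped frame.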
In addition, ESA/390 allows for multiple address spaces of 2 GB each and
multiple translation lookaside buffers (TLBs) for mapping each separate address
space to the physical memory. Theoretically, up to 16 terabytes of address
spaces can be controlled by the hardware. We exploited this feature in the
Linux for S/390 port, simplifying complex memory processes like
copy_to_user() to a couple of instructions.
SMP support
The ESA/390 architecture is implemented on processors that range from a card
that slips into your laptop to a 16-way SMP configuration not much larger
than a refrigerator that sits in a corner of the machine room. IBM's largest
model is a 12-way SMP system. HDS currently ships a 13-way and has a 16-way
system on the way. Amdahl already offers a 16-way model.
Why Linux on VM/ESA?
Top ten reasons why running the Linux operating system as a guest of VM/ESA is a smart choice (as seen on IBM's VM/ESA Website):
- Resources can be shared among multiple Linux images running on the same VM/ESA system. These resources include CPU cycles, memory, storage devices, and network adapters.
- Server hardware consolidation. Running tens or hundreds of Linux systems on a single S/390 server offers customers savings in terms of the space and personnel required to manage real hardware.
- Virtualization. The virtual machine environment is flexible and adaptable. New Linux guests can be added to a VM/ESA system quickly and easily without dedicated resources. This is useful for replicating servers in addition to giving users a flexible test environment.
- Running Linux on VM/ESA means Linux guest(s) can transparently take advantage of VM's support for S/390 hardware architecture and RAS features.
- VM/ESA provides high-performance communication among virtual machines running Linux and other operating systems on the same processor. The underlying technologies enabling high-speed TCP/IP connections are virtual channel-to-channel (CTC) adapter support and VM's IUCV (interuser communication vehicle).
- Linux on S/390 includes a minidisk device driver that can access all DASD types supported by VM/ESA.
- Data-in-memory performance boosts are offered by VM's exploitation of the S/390 architecture.
- Debugging. VM/ESA offers a functionally rich debug environment that is particularly valuable for diagnosing problems in the Linux kernel and device drivers.
- Control and automation. VM's longstanding support for scheduling, automation, performance monitoring and reporting, and virtual machine management is available for Linux virtual machines, as well.
- Horizontal growth. An effective way to expand your Linux workload capacity is to add more Linux guests to a VM/ESA system.
In addition, emulators like Hercules and Flex will allow your PC to run any
S/390 operating system and application.
Processor partitioning
Processor partitioning goes by various names according to manufacturer:
Amdahl calls it Multiple Domain Facility (MDF); Hitachi calls it Multiple
Logical Partition Feature (MLPF); and IBM calls it Logical Partitioning
(LPAR). Whatever the name, the intent is the same: It divides a single
machine into multiple virtual systems or images, each of which
appears to the operating system running in it as a complete and isolated processor.
Partitioning allows you to share all processing resources selectively. The
number of partitions you can create depends on the manufacturer and the
machine type, but typically the maximum is in the range of 10 to 15 images.
Partitioning can also be achieved using the hypervisor VM/ESA, which I'll
discuss in greater detail in the next part of this series. It provides a
processor with virtual machines for which the limit is measured in the range
of hundreds to tens of thousands.
I/O subsystem
One of the distinguishing features of S/390 is its channel subsystem. S/390
defines a unified means of accessing its I/O subsystem. It does this by
defining a channel subsystem that is, in effect, a collection of
sophisticated independent outboard processing systems that take complete
responsibility for performing I/O operations from the CPU. A System/390
operating system has only to issue a single instruction to get an I/O
operation initiated. The channel subsystem and the I/O devices will perform
all the support actions, such as memory access, path selection, and
connection, and handle conditions such as RPS miss, caching, and error
recovery.
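The division of labor described above can be pictured with a small toy model. In this sketch the operating system hands a whole list of transfer commands to the channel subsystem with one call, and the subsystem carries out every step on its own before posting a single completion status. All class, method, and command names here are hypothetical. Real channel programs are built from channel command words, and this model deliberately ignores path selection, caching, and error recovery.

```python
from collections import deque


class ChannelSubsystem:
    """Toy model of an outboard I/O processor (illustrative only)."""

    def __init__(self, devices):
        self.devices = devices      # device number -> bytearray of storage
        self.pending = deque()      # programs started but not yet executed
        self.interrupts = deque()   # completion status waiting for the CPU

    def start(self, device, program, memory):
        # The single operation the operating system issues. It only queues
        # the work and returns, leaving the CPU free for other tasks.
        self.pending.append((device, program, memory))

    def run(self):
        # Performed by the channel subsystem, not by the CPU: execute every
        # command in each queued program, then post one completion status.
        while self.pending:
            device, program, memory = self.pending.popleft()
            storage = self.devices[device]
            for op, dev_off, mem_off, length in program:
                if op == "read":
                    memory[mem_off:mem_off + length] = \
                        storage[dev_off:dev_off + length]
                elif op == "write":
                    storage[dev_off:dev_off + length] = \
                        memory[mem_off:mem_off + length]
                else:
                    raise ValueError("unknown command: " + op)
            self.interrupts.append((device, "complete"))


# Usage: one start() call moves two separate extents into main memory.
css = ChannelSubsystem({0x100: bytearray(b"hello world")})
main_memory = bytearray(16)
css.start(0x100, [("read", 0, 0, 5), ("read", 6, 5, 5)], main_memory)
css.run()
```

The point of the sketch is the shape of the interface: the CPU's involvement is one call to `start()` and, later, one entry to collect from `interrupts`, however many transfers the program contains.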
Computers are often rated for speed in terms of MIPS, sometimes (correctly)
referred to as "meaningless indicators of processor speed." This is
especially true of S/390. Any true estimate of MIPS must include the work
performed by the channel subsystem. Each component of the subsystem may have
considerable processing power that is equivalent to a standalone server. Bear
this in mind when you see comparisons of CPU performance.
A more detailed explanation of the I/O subsystem as it affects the
implementation of Linux on S/390 will appear in part two of this series.
Early operating systems
In the early days, computing was batch oriented, and the operating systems
first used on the S/390 architecture reflected this. They had names like
Basic Operating System, Tape Operating System, Disk Operating System, and (my
favorite acronym) PCP (Primary Control Program).
These evolved into the predecessors of the OS/390 and VSE/ESA that are
available today. As they evolved, significant and robust timesharing and
realtime transaction processing capabilities were added.
A brief history of IBM, S/360, and Unix
In her treatise "VM, Past, Present, and Future," Melinda Varian (see
Resources) of Princeton University describes some interesting machinations
involving the development of System/360, MIT, timesharing, and Unix. This
passage is reproduced here with permission.
At the time IBM was embarking on its "make-or-break" development
of System/360 (the grandfather of S/390), MIT was committed to timesharing
and was providing timesharing services to several other New England
universities as well as to its own users. At MIT, it was "no longer a
question of the feasibility of a timesharing system, but rather a question of
how useful a system [could] be produced". The IBMers in the MIT Liaison
Office and the Cambridge Branch Office, being well aware of what was
happening at MIT, had become strong proponents of timesharing and were making
sure that the System/360 designers knew about the work that was being done at
MIT. They arranged for several of the leading System/360 architects to visit
MIT and talk with the faculty. However, inside IBM at that time there was a
strong belief that timesharing would never amount to anything and that what
the world needed was faster batch processing. MIT and other leading-edge
customers were dismayed, and even angered, on April 7, 1964, when IBM
announced System/360 without address relocation capability.
The previous fall, MIT had founded Project MAC to design and build an even
more useful timesharing system based on the CTSS prototype. Within Project
MAC, MIT were to draw on the lessons they had learned from CTSS to build the
Multics system. The basic goal of the Multics project "was to develop a
working prototype for a computer utility embracing the whole complex of
hardware, software, and users that would provide a desirable, as well as
feasible, model for other system designers to study." At the outset, Project
MAC purchased a second modified 7094 on which to run CTSS while developing
Multics. It then requested bids for the processor on which Multics would run.
One of the first jobs for the staff of the new center was to put together
IBM's proposal to Project MAC. In the process, they brought in many of IBM's
finest engineers to work with them to specify a machine that would meet
Project MAC's requirements, including address translation. They were
delighted to discover that one of the lead S/360 designers, Gerry Blaauw, had
already done a preliminary design for address translation on System/360.
Address translation had not been incorporated into the basic System/360
design, however, because it was considered to add too much risk to what was
already a very risky undertaking. It must be remembered that IBM was placing
the entire future of its business on the line with System/360.
The machine that IBM proposed to Project MAC was a System/360 that had been
modified to include the "Blaauw Box." This machine was also bid to Bell Labs
at about the same time. It was never built, however, because both MIT and
Bell Labs chose another vendor. MIT's stated reason for rejecting IBM's bid
was that it wanted a processor that was a mainline product, so that others
could readily acquire a machine on which to run Multics. It was generally
believed, however, that displeasure with IBM's attitude toward timesharing
was a factor in Project MAC's decision.
Losing Project MAC and Bell Labs had important consequences for IBM. Seldom
after that would IBM processors be the machines of choice for leading-edge
academic computer science research. Project MAC would go on to implement
Multics on a GE 645 and would have it in general use at MIT by October 1969.
Also in 1969, the system that was to become Unix would be begun at Bell Labs
as an offshoot and elegant simplification of both CTSS and Multics, and that
project, too, would not make use of IBM processors.
So started a period of long estrangement between System/360 and its
descendants and the world of Unix. How different things might have been!
In the late '80s and early '90s, IBM made attempts to get back into the Unix
game on its mainframes with the introduction of AIX/370 and AIX/ESA.
Unfortunately, these birds would not fly, and they were quickly retired to the
operating system graveyard. Fortunately for IBM, AIX on the RT and RS/6000
platforms did take off and has been a great line of business for the company.
The proliferation of business applications that were appearing in the Unix
world prompted IBM to try a different approach to making the Unix APIs
available to System/390 programmers. This time IBM came up with OpenEdition
for OS/390 (later called Unix System Services, or USS) and VM/ESA. The
premise behind these offerings was to provide a set of APIs to the base that
would allow vendors to port their Unix applications to System/390 without
rewriting the programs.
Both USS and OpenEdition still have an important, and even growing, role to
play within an enterprise as a result of the advent of Linux for S/390. Their
chief problem is that they are both EBCDIC implementations. The beauty of
Linux for S/390 for software vendors is that it is an ASCII implementation
that should look, feel, and act the same in all important respects as any
other port of Linux.
Enter VM
So where did VM come from and why was it created? Again, Melinda Varian's
history of VM is the canonical source for this material:
In the fall of 1964, the folks in Cambridge suddenly found
themselves in the position of having to cast about for something to do next.
A few months earlier, before Project MAC was lost to GE, they had been
expecting to be in the center of IBM's timesharing activities. Now, inside
IBM, "timesharing" meant TSS, and that was being developed in New York State.
However, Norm Rasmussen (who had headed IBM's bid for Project MAC) was very
dubious about the prospects for TSS and knew that IBM must have a credible
timesharing system for the S/360. He decided to go ahead with his plan to
build a timesharing system, with Bob Creasy leading what became known as the
CP-40 Project.
The official objectives of the CP-40 Project were the following:
- The development of means for obtaining data on the operational characteristics of both systems and application programs;
- The analysis of this data with a view toward more efficient machine structures and programming techniques, particularly for use in interactive systems;
- The provision of a multiple-console computer system for the center's computing requirements; and
- The investigation of the use of associative memories in the control of multiuser systems.
The project's real purpose was to build a timesharing system, but the other
objectives were genuine, too, and they were always emphasized in order to
disguise the project's "counter-strategic" aspects.
Bob Creasy and Les Comeau spent the last week of 1964 joyfully brainstorming
the design of CP-40, a new kind of operating system, a system that would
provide not only virtual memory, but also virtual machines. They had seen
that the cleanest way to protect users from one another (and to preserve
compatibility as the new System/360 design evolved) was to use the System/360
Principles of Operations manual to describe the user's interface to the
Control Program. Each user would have a complete System/360 virtual machine
(which at first was called a "pseudo-machine"). (The term virtual
machine has been attributed to Dave Sayre at IBM Research.)
This skunk-works project (which seems to be paralleled 30 years later by the
Linux for S/390 effort) resulted in CP-40, which became CP-67, VM/370, VM/SP,
and VM/XA and had been transformed by the early '90s into VM/ESA. The
internals are probably unrecognizable to the original developers but the
underlying principles remain the same.
Virtual machines
Virtual machines have found renewed interest in things like VMware and Java
Virtual Machines. A VM/ESA virtual machine can run anything that could be run on the bare
iron, including a copy of VM/ESA itself (and a copy running in that copy, and
so on). Virtual machines provide a "padded-cell environment" that
isolates one user from another while also allowing all users access to both the
real resources of the machine and the virtual resources of the VM operating
system. You can, for example, define multiple virtual CPUs when more or fewer
real ones exist, or virtual disks that may or may not correspond to real
hardware.
So, why virtual machines? R. P. Goldberg, in the March 1973 Proceedings of
ACM SIGARCH-SIGOPS Workshop on Virtual Computer Systems, describes the
rationale:
The development of interest in virtual computer systems can be
traced to a number of causes. First, there has been a gradual understanding
by the technical community of certain limitations inherent in conventional
timeshared multiprogramming operating systems. While these systems have
proved valuable and quite flexible for most ordinary programming activities,
they have been totally inadequate for system programming tasks. Virtual
machine systems have been developed to extend the benefits of modern
operating system environments to system programmers. This has greatly
expedited operating system debugging and has also simplified the transporting
of system software. Because of the complexity of evolving systems, this is
destined to be an even more significant benefit in the future.
As a second point, a number of independent researchers have begun to propose
architectures that are designed to directly support virtual machines, i.e.,
virtualizable architectures. These architectures trace their origins to an
accumulated body of experience with earlier virtual machines, plus a set of
principles taken from other areas of operating system analysis. They also
depend upon a number of technical developments, such as the availability of
low-cost associative memories and very large control stores, which now make
proposals of innovative architectures feasible.
A third reason for the widespread current interest in virtual machines stems
from its proposed use in attacking some important new problems and
applications such as software reliability and system privacy/security. A
final point is that IBM has recently announced the availability of VM/370 as
a fully supported software product on System/370. With this action, IBM has
officially endorsed the virtual machine concept and transformed what had been
regarded as an academic curiosity into a major commercial product.
VM/ESA is a hypervisor, that is, it provides an interface definition to the
entities running on it that is the same as the interface definition provided
by the real hardware. What this means is the logical entities we call virtual
machines are idealized simulations of a computer. The Control Program (CP)
component of VM/ESA operates the real machine hardware and multiplexes the
physical resources of the computing system to the virtual machines.
The System/390 architecture allows VM to do this because it separates its
instruction set into privileged (aka Supervisor State) and nonprivileged (aka
Problem State) groups. In the Supervisor State, all instructions are valid.
In the Problem State, only those instructions are valid that provide
meaningful information to the problem program and that cannot affect system
integrity; such instructions are called unprivileged instructions. The
instructions that are never valid in the Problem State are called privileged
instructions. When a CPU in the Problem State attempts to execute a
privileged instruction, a privileged-operation exception is recognized. A CPU
executes another group of instructions, called semiprivileged instructions,
in the Problem State only if specific authority tests are met; otherwise, a
privileged-operation exception or a special-operation exception is
recognized.
An operating system uses these privileged operations to schedule resources
between competing applications that are running under it. CP will dispatch a
virtual machine running the operating system in nonprivileged mode and then
trap any privileged operations performed by the virtual machines. When it
traps these operations it can:
- Determine whether it is a valid thing for the virtual machine to have done
- Determine whether the resource the virtual machine is trying to use is accessible to that virtual machine
- Map any I/O operations to a virtual device or a real or emulated device
- Allow the virtual machine to continue processing from the point of the trap
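The trap-and-simulate cycle listed above can be sketched in a few lines. This is a minimal model and not how CP is actually written: the instruction name `start_io`, the `VirtualMachine` class, and the device map are all hypothetical. Guest instructions run in a simulated problem state; a privileged one raises an exception, the control program checks it against the guest's own resources, simulates it or rejects it, and then resumes the guest at the next instruction.

```python
PRIVILEGED = {"start_io"}   # hypothetical privileged instruction name


class PrivilegedOperation(Exception):
    """Stands in for the privileged-operation exception."""


class VirtualMachine:
    def __init__(self, name, device_map):
        self.name = name
        self.device_map = device_map   # virtual device -> real device
        self.next_instruction = 0      # where the guest resumes
        self.log = []                  # record of what happened, for clarity


def run_in_problem_state(instruction):
    # Unprivileged instructions would execute directly on the hardware;
    # privileged ones trap to the control program.
    opcode, _operand = instruction
    if opcode in PRIVILEGED:
        raise PrivilegedOperation(opcode)


def simulate(vm, instruction):
    opcode, virtual_device = instruction
    # Is this an operation the control program knows how to simulate?
    if opcode != "start_io":
        return ("rejected", opcode, virtual_device)
    # Is the resource accessible to this particular virtual machine?
    if virtual_device not in vm.device_map:
        return ("rejected", opcode, virtual_device)
    # Map the virtual device to the real (or emulated) one.
    return ("simulated", opcode, vm.device_map[virtual_device])


def dispatch(vm, program):
    while vm.next_instruction < len(program):
        instruction = program[vm.next_instruction]
        try:
            run_in_problem_state(instruction)
            vm.log.append(("ran", instruction[0]))
        except PrivilegedOperation:
            vm.log.append(simulate(vm, instruction))
        # Let the guest continue from the point of the trap.
        vm.next_instruction += 1


# Usage: the guest owns virtual device 0x200 but not 0x999.
guest = VirtualMachine("LINUX1", {0x200: 0x1A0})
dispatch(guest, [("add", None), ("start_io", 0x200),
                 ("start_io", 0x999), ("load", None)])
```

After the run, `guest.log` records the two unprivileged instructions as executed directly, the first I/O request as simulated against real device 0x1A0, and the second as rejected because the guest has no such device.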
Similarly, when interrupts occur on the real machine, CP will determine if
the interrupt needs to be reflected to a particular virtual machine, such as
when an I/O operation that had been initiated by a Linux virtual machine has
just completed.
Much of the workload for intercepting and simulating instructions and
interrupts for a virtual machine has been lifted from CP by the inclusion of
hardware assist functions built into the processor complexes. These hardware
assists provide significant performance boosts for the virtual machine.
VM and open source
VM started out within IBM but was soon adopted by the user community, which
began providing new functions, enhancements, and fixes to the
operating system. The code was a licensed program product of IBM but was free
of charge and came with complete source. The philosophy of the development
team is best described in the words of one of the chief architects, Bob
Creasy:
"The design of CP/CMS by a small and varied software research and
development group for its own use and support was, in retrospect, a very
important consideration. It was to provide a system for the new IBM
System/360 hardware. It was for experimenting with timesharing system design.
It was not part of a formal product development. Schedules and budgets, plans
and performance goals did not have to be met. It drew heavily on past
experience. New features were not suggested before old ones were completed or
understood. It was not supposed to be all things to all people. We did what
we thought was best within reasonable bounds. We also expected to redo the
system at least once after we got it going. For most of the group, it was
meant to be a learning experience. Efficiency was specifically excluded as a
software design goal, although it was always considered. We did not know if
the system would be of practical use to us, let alone anyone else. In January
1965, after starting work on the system, it became apparent from
presentations to outside groups that the system would be controversial. This
is still true today." (Varian, p. 97)
However, gradually what had been public started to become more and more
private. On February 8, 1983, IBM announced its Object Code Only (OCO)
policy. The VM community made an enormous effort to convince IBM's management
that the OCO policy was a mistake. Many people contributed to the effort in
SHARE (an IBM user group) and in the other user groups.
In February 1985, the SHARE VM Group presented IBM with a White Paper that
concluded with the sentence, "We hope that IBM will decide not to kill the
goose that lays the golden eggs." IBM chose not to reply to it.
A few months after the announcement of the OCO policy, IBM released the first
OCO version of VM, VM/PC. VM/PC had a number of problems, including poor
performance and incorrect, missing, or incompatible functions. Without
source, users were unable to correct or compensate for these problems, so
nobody was surprised when VM/PC fell flat.
IBM continued throughout the decade to divert much of its energy to closing
up its systems, not noticing until too late that the rest of the industry
(and many of its customers) were moving rapidly toward open systems. By 1991,
around the time Linus Torvalds began releasing his first Linux efforts, IBM
had made major parts of VM Object Code Only (OCO: no source) and Object Code
Maintained (OCM: source available but fixes are object files only). IBM was
doing the exact opposite of what Richard Stallman was advocating with regard
to open source.
This is a salutary lesson for devotees of open source software: The price of
open source is eternal vigilance.
VM has always been the bastard child of IBM. It is extremely efficient, which
means that you do not need as much hardware to run it. This does not please
those who sell hardware. Every so often IBM attempts to kill it off, but it
has proven resilient:
"Throughout 1967 and very early 1968, IBM's Systems Development
Division, the guys who brought you TSS/360 and OS/360, continued its effort
to have CP-67 killed, sometimes with the help of some IBM Research staff.
Substantial amounts of Norm Rasmussen's, John Harmon's, and my time was spent
participating in technical audits which attempted to prove we were leading
IBM's customers down the wrong path and that for their (the customers'!)
good, all work on CP-67 should be stopped and IBM's support of existing
installations withdrawn." (R. U. Bayles quoted in Varian, p. 97).
Now with Linux for S/390, VM is again coming into its own. VM has a lot to
offer Linux in the S/390 environment. Think of it as a highly intelligent
BIOS that relieves Linux of distractions such as dynamic sparing and hardware
recovery, as well as supporting the concurrent operation of thousands of
virtual machines.
Finally, after the idea spent years working its way through the beast that is
the IBM bureaucracy (and with the bottom line starting to hurt), IBM
rediscovered open source.
Resources