CIS 4307: Performance Evaluation
Introduction
Engineers do it. Cooks do it. But seldom programmers do it. What is it?
It is good Performance Evaluation of either an existing system or of 
its design.
If an engineer estimates the need for steel in a building as a hundred 
tons when  two hundred are called for, he will be laughed at (or worse). 
If a cook doubles the correct amount of spices in a dish, he is likely 
to lose clients. But if a programmer makes estimates that are orders of 
magnitude off, he feels that nothing extraordinary has happened and 
many other programmers are likely to feel the same.
 
Yet it is not too difficult to come up with performance estimates 
of times and sizes that are 
fairly accurate (from 50% off to only a few % off). Thoughout this course I
do simple computations, often not more complicated than additions and 
multiplications, to estimate some performance factor of a system. 
A common method for performance evaluation is discrete-event simulation. 
Other methods of performance evaluation are 
analytical, some involving sophisticated math like Markov models, other
are simple, as we shall now see.
Performance Metrics
First of all, before we can evaluate performance, we need 
to know what are the 
metrics that are significant. If I want to see the power of a car I check
horsepower and torque, if I want the see your strenght I may check
the weight you lift, etc. In the case of systems very common metrics, 
characteristics that we consider to evaluate the performance of a system,
are:
- Utilization: The ratio between the time a server is 
in use and the total time. So utilization will be between 0 and 1.
- Response time: The time from when a job request is 
made to the time the job has completed.
- Throughput: Number of jobs completed per unit of 
time. If jobs cannot be executed in parallel then Response time and
Throughput are inverses of each other.
- Service time: The time a server takes to complete a 
job. It includes only the time the server is busy on the job, not any 
delay.
- Latency: It is a form of delay.
It represents the time elapsed from when a job request is 
started to when the request is received/recognised at the server.
- Queue size: when jobs arrive at a server they are 
delayed if the server is busy. They are kept in a queue. We are
interested in the size of that queue.
Saturation and BottleNecks
We say that a server is saturated, or that it has reached
saturation when its utilization is 100% (i.e. it is busy
all the time).
For example, if a disk takes 10msec to service a request,
it will saturate when it receives 100 requests per second. 
[More in general, people say that a server is saturated when its service
becomes inacceptable, and that may happen at a 80% utilization. However
for our purposes in this note we will stick to the 100% figure.]
If a system
has many servers, we can then talk of its bottleneck, that is, of
the server that will saturate first as the load on the system increases.
Once the bottleneck saturates, also the system will saturate, i.e. it will
not be able to handle any greater load. The bottleneck of the system determines
the throughput (number of jobs completed per unit of time) of the
system. [This discussion 
assumes that we have jobs that have all the same characteristics].
Example 1: In the following system
 
we can execute jobs. Each job first does READ, then moves to PROCESS, then
moves to PRINT. These three phases are independent of each other in the 
sense that a job can be reading while another is processing and another is 
printing. Assume that a job takes 1 second to read, 4 seconds to process, and
2 seconds to print. Question: What is the (best possible) response time 
[response time is the time interval from the moment that a job arrives to 
the moment it exits]? Answer: 7 seconds. Question: What is the 
bottleneck in this system? Answer: the PROCESS phase. 
Question: How many jobs can be completed per minute (at most)?
Answer: 15. 
This was an example of servers placed in sequence. In a serial or sequential
arrangement the throughput of the sequence is equal to the smallest saturation
throughput of a server in the series.
Example 2: In the following system
 
jobs require a total of 100msec of compute time and require 20
operations on the disk whose service time is 10msec. 
Question: What is the minimum response time? Answer: 300msec.
Question: What is the
bottleneck in this system? Answer: The disk. Question: What is the throughput 
of the system (you can assume that a job can be computing while another 
is waiting for IO)? Answer: 5 jobs per second. 
Example 3: In the following system
 
jobs require a total of 100msec of compute time, require 20 IO operations
on Disk 1 (whose service time is 10msec), and 50 IO operations on Disk 2
(whose service time is 5msec). 
Question: What is the minimum response time? Answer: 550msec.
Questions: What is the bottleneck? Answer: 
Disk 2. Question: What is the throughput? Answer: 4 jobs per second.
When servers are placed in parallel and a job can use either server, then
the maximum throughput of the system becomes equal to the sum of the maximum throughputs
of the individual servers. For example in the following system
 
assume that S1 has a throughput of 3 jobs per second and S2 a throughput of
2 jobs per second. Then the system has a total maximum thoughput of 5 jobs
per second. The minimal response time will be 1/3 of a second. The 
average response time is harder to compute (we assume no queueing): 
in one second 3 jobs will have a response time of 1/3sec and 2 jobs
a response time of 1/2sec; thus the average response time is 
(3/5*1/3+2/5*1/2), 
i.e. 2/5sec. In the case of servers in parallel it is not evident what
represents the bottleneck of the system: though in our example S2 will 
saturated before S1, the system as a whole does not saturate at that time.
In the case of parallel servers the question of saturation requires
additional assumptions about the use and cost of the servers.
Little's Law
Little's Law applies to the single server queueing system (a server and 
its queue). It is the equation:
N = a * T
where:
- N is the number of jobs in the queueing system (queue plus server)
- a is the arrival rate (the number of jobs that arrive to the system per
unit of time), and
- T is the Response or TurnAround Time for a job (total time from arrival to
departure from the system).
For example, if at the supermarket I notice that a teller has always 4
customer and that customers arrive every 2 minutes, then I know that 
customers, on average will be in line for 8 minutes.
Little's Law has an intuitive foundation and it can be proven under
general assumptions as to the arrival rate of jobs, as to the service rate, and
as to the scheduling discipline. One can use it reliably.
Notice that if we are given the average time spent by
a job on the server, i.e. the service time Ts, 
and the average inter-arrival time Ta,
then we 
can compute the utilization of the server: Ts/Ta.
If we let inter-arrival rate = a = 1/Ta then we have the relation
utilization = a*Ts, which is called the Utilization Law
.
Response Time in the M/M/1 Queueing System
One can do a mathematical analysis of a Single Server
Queueing System where arrival times are Poisson distributed and 
service times are exponential. These systems are called 
M|M|1 systems. We say that the arrivals are Poisson
distributed if the probability of K arrivals in time t
is
	Pr[X=k, t] = (a*t)k * e-a*t / k!
It corresponds to the situation where all arrivals are independent of each
other and each arrival may occur with equal probability in any interval
t..t+dt. In this formula 'a' represents the average arrival rate.
We say that the service times are exponential when the probability
that the service time is less than x is given by:
	Pr[X<=x] = 1 - e-a*x
An interesting relation between Poisson Distribution and Exponential
Distribution is that the the latter can be obtained from the former.
Consider the probability of at least one arrival in the interval t.
This probability is 
	1 - Pr[X=0, t] = 1 - e -a*t
It is important to realize that in a single server system the arrival rate
must be smaller than the service rate (i.e. a must be smaller than b) 
otherwise jobs will arrive faster into the system than the server can take care of them.
For the M|M|1 system we can determine the average response time.
It is given by the formula
	T = ServiceTime/(1-Utilization) =
          = ServiceTime/(1 - ServiceTime/InterArrivalTime) =
          = ServiceTime * InterArrivalTime / (InterArrivalTime - ServiceTime) =
          = 1 / (ServiceRate - ArrivalRate)
In other words, as the utilization reaches 1, i.e. as the server saturates,
the response time goes to infinity. It practice things are not usually
this bad. 
This result should give you a feeling
of how dangerous it is to operate in a system where some component is
saturated. In essence the saturated component (the bottleneck) forces 
the serialization of each job, that is, each job through the bottleneck
has to wait for the completion of the job that preceded it. If we add
one more arrival in the unit of time, then the response time of each job
is increased by the service time for the bottleneck of this additional job.
The increment would have been less if the bottleneck were not saturated.
The utilization represents the average number of jobs on the server (at times 
the server is idle, the number of jobs is 0, at times the server is busy, 
the number of jobs is 1; the average is the Utilization).
The average number of jobs in the queue is easily determined:
JobsInQueue = JobsInSystem - JobsOnServer =
            = N - Utilization = a*T - Utilization =
	    = a * (Ts/(1 - Utilization)) - Utilization =
	    = (Ts/Ta)/(1 - Utilization) - Utilization =
	    = Utilization/(1 - Utilization) - Utilization =
	    = Utilization(1/(1 -Utilization) - 1) =
	    = Utilization**2/(1-Utilization)
As to the time spent by a job in the queue we have:
JobTimeInQueue = JobTimeInSystem - JobTimeOnServer =
	       = Ts/(1 - Utilization) - Ts =
	       = Utilization * Ts/(1 - Utilization) =
	       = Utilization * JobTimeInSystem =
	       = Utilization * T
This discussion allows us to arrive at a couple of insights:
- What happens to a system if we double arrival rates and service rates?
May be one would think that Response Time and Average Queue size will not
change. In reality as you can see from the formula, the response time is 
halved. As for the queue size, using Little's Law, we see that
it remains unchanged (the arrival rate has doubled but the response time
has halved).
- What happens if we double the power of the server? The response time
is better than halved. Suppose for example that
the current utilization is 80%. Then the old
response time was
        Ts
    -------- = 5*Ts
     1 - 0.8
and the new response time will be 
    0.5*Ts
    -------  ~ 0.8*Ts
    1 - 0.4
with an improvement that is by more than a factor of 5.
Here is a numerical example. Suppose we are given a disk where the 
service time is 10ms and where requests arrive every 12 ms. 
Question: what is the average queue size?
Utilization = 10/12 = 0.83.
T = Ts/(1 - Utilization) = 10/(1 - 0.83) = 10/0.17 = 60
Nqueue = N - Utilization = a*T - Utilization = 1/12*60 - 0.83 = 4.17
Response Time in Round Robin Scheduling
We have seen that RR scheduling involves a quantum q. A running
process, after q seconds, is pre-empted and 
placed at the end of the ready queue.
Let's assume that a process, upon completion of a quantum, has probability
r of having to run again (thus, probability 1-r of being terminated).
r is often called a Routing Probability ("route" as in which direction 
to go).
How long, in total, will this program run? We can represent this system 
with a picture:
 
If we define g(i) as the probability that a job will go i times through the 
server, we have:
g(i) = (1-r)*ri-1
and the expected service time for the process, E(S), will be weighted average
of the number of quanta used at the server, say i, times the probability of
that many iterations, g(i). That is, E(S) is the sum for i from 1 to infinity 
of i*q*g(i) which is:
 E(S) = q/(1-r) 
This assumes that there is no waiting time in the queue, i.e. that there 
is no competition for the services of the server. A more difficult question, 
left unanswered, is how long will it 
take in this system for a job to terminate in the case that m jobs 
arrive simultaneously? 
Interestingly, the study of transmission in a lossy channel, where the
the transmission takes time q and the probability of loss is r (hence the
message needs to be retransmitted), results in a total
transmission time that is again q/(1-r).
The Five-Minutes Rule
The Five-Minutes Rule is used to evaluate when to cache in memory a
frequently accessed object stored on disk. The idea is to compare the 
cost of memory required to store the information, and the cost of
accessing the information from disk.
Suppose I am give a disk that costs $100 and where I can perform
100 accesses per second. That means that the ability to access disk
once per second costs me $1. Assume that a page on disk is accessed
every n seconds, thus the cost of accessing that page is 1/n dollars.
Assume that a megabyte consists of, say, 256 pages
(i.e. 4KB pages), and that 1 MB of memory costs $1. Thus to keep a
page in memory costs 1/256 Dollars. When the cost of keeping the
page in memory is smaller than the cost of accessing the page on
disk, we should cache the page. This happens when, in our example
	1/256 < 1/n
that is
	n < 256
that is if we access the page at least every 256 seconds, or about 5
minutes. Notice that as the price of memory goes down, the time becomes greater,
i.e. it becomes more and more advantageous to use bigger memories and to reduce  
access to disk.
This same type of reasoning can be applied in a variety of
situations, to trade communication costs and storage (replication) costs.
We can use this information also to predict the future of technology (!!!). Let D be the 
cost of the disk, n be the number of possible accesses per second to disk (this is about 
100 and fairly constant over time), t be the time between successive accesses to a page.
Then D/(n*t) is the cost of storing the page on disk. If c is the cost of a main memory 
page, then it pays to keep the page in memory when c < D/(n*t) or t < D/(n*c). Since n is 
almost a constant, t is going to be bound by D/c. The cost of disk is remaining 
fairly constant as the capacity of disks increases, while the cost of of the individual memory
page is diminishing because for the same dollars now we get many more pages. So the ratio 
D/(n*c) is getting larger, thus t becomes larger and it becomes more and more advantageous to 
add memory instead of going to disk.
By the way, already the cost of main memory has gone below $100 per gigabyte. Thus
t is now about 2500 i.e. more than 40 minutes. The introduction of new technology, for example flash memory, 
changes the balance rapidly. For flash memory the cost of a drive is ~$500 and the access time is ~01.ms thus the 
equation becomes (assuming 1GB = $100 and 4KB pages) t < 500/(10000*0.1/256), i.e. t < 125 i.e. the page must be 
accessed every two minutes or more often.
Computational Speedup
Suppose we run a program using a single processor and it takes time T(1). We then run the program using n processors
(we assume the program is written to take advantage of the available number of processors) and it takes time T(n).
What is the relationship between T(1), T(n) and n? For example, if the program takes 100sec on one processor, how
long should it take on four processors? May be 25sec? in fact T(n) at best should be T(1)/n. [Cases where T(n) is less
than T(1)/n, i.e. in our example less than 25 seconds, are called cases of superlinearity. Such cases can be explained
and analyzed separately in terms of OS or computational characteristics. For example, a large program split into 4 parts
running independently may have in the parts a smaller, more easily accomodated working set.] 
We call Computational Speedup s the ratio: 
s = T(1)/T(n)
and Efficiency e the ratio: 
e = s/n
Amdhal has suggested that any computation can be analyzed in terms of a portion that must be executed sequentially,
Ts, and a portion that can be executed in parallel, Tp. Then 
T(n) = Ts + Tp/n
This is known as Amdhal's Law. It says that we can improve only the parallel part of a computation. Going back to
our example, if the computation requires 40 seconds of sequential computation and 60 of parallel computation, then the
best that T(n) will ever be is 40sec, even for n growing to infinity. In practice there are many problems where the
sequential components are minimal and thus these problems can be effectively parallelized. In general though speedup
due to parallelism tends to be not too great. Some people suggest that the speedup grows with the square root of n, i.e,
if we quadruple the number of processors, we only double the performance. [Notice that in this case efficiency will
change as 1 over the square root of n, i.e. the more the processors, the less efficiently they will be used.] The maximum
speedup assuming Amdhal's Law is for n going to infinity. That is 
s = (Ts + Tp)/Ts = 1 + Tp/Ts
For example, if we are given a program that takes 40 minutes to complete on
a single processor and we are told that for this program Ts is 10m and Tp is 30m then,
if we have 5 processors the speedup will be 40/16 = 2.5 and the efficiency 0.5.
The maximum speedup possible (when we use an infinite number of processors)
is 4. Of course in that case the efficiency becomes 0.
Note that in the speedup we see an application of the Law of 
Diminishing Returns. In the case above if we use 2 processors, the time
required becomes 25m, with a percentual gain due to the additional processor
of 15/40, i.e. 37%. If we go from 2 to 3 processors, now the time becomes 20
minutes, with an improvement from 25 of 20%, and if we go to 4 processors
we go to 17.5 minutes, with an improvement of 12.5%, i.e. the 
percentual of improvement for each additional machine keeps getting
smaller and smaller.
From a Berkeley report on the future of parallelism:
"As Amdahl observed 40 years ago, the less parallel portion of a program can limit
performance on a parallel computer [Amdahl 1967]. Hence, one reason to have different
“sized” processors in a manycore architecture is to improve parallel speedup by reducing
the time it takes to run the less parallel code. For example, assume 10% of the time a
program gets no speed up on a 100-processor computer. Suppose to run the sequential
code twice as fast, a single processor would need 10 times as many resources as a simple
core runs due to bigger power budget, larger caches, a bigger multiplier, and so on. Could
it be worthwhile? Using Amdahl’s Law [Hennessy and Patterson 2007], the comparative
speedups of a homogeneous 100 simple processor design and a heterogeneous 91-
processor design relative to a single simple processor are:
         SpeedupHomogeneous = 1 / (0.1 + 0.9/100) = 9.2 times faster
         SpeedupHeterogeneous = 1 / (0.1/2 + 0.9/90) = 16.7 times faster
In this example, even if a single larger processor needed 10 times as many resources to
run twice as fast, it would be much more valuable than 10 smaller processors it replaces."
It has been seen experimentally that Amhdal's Law works well for certain kinds
of computations only, namely in cases of control parallelism, i.e.
where the algorithm can be seen as divided in a number of different
activities that can be done in parallel. It does not work as well in the
case of data parallelism, i.e. in cases where the same operations 
can be applied in parallel to different data. In the latter case Amhdal's
predictions are very pessimistic. 
For this case has been proposed the 
Gustafson-Barsis Law. It is based on the observation
that the serial component diminishes as problems become larger. Then 
Take the time with n processors as the unit time
      T(n) = Ts + Tp = 1
where Ts and Tp are respectively the serial and parallel time in T(n).
Then on one processor
      T(1) = Ts + n * Tp
that is the speedup is
      T(1)/T(n) = (Ts + n*Tp)/1 = Ts + n*Tp = Ts + n*(1 - Ts)
Here if Ts, the serial fraction of the work, goes to zero as the size increases,
the speedup becomes n.
Operational Analysis
We can analyze the behavior of Queueing Networks (i.e. networks of queues
and their servers) using fairly simple means. We will examine this topic by
example.
Example 1
 
We have the 
following queueing network which represents a computer system with a CPU 
and a disk. When a job enters the system it goes to the CPU, whence it goes 
either to disk (an IO request) or terminates. Upon completion of the IO 
operation the job goes back to the CPU and things are repeated.
 
X0 represents the number of jobs entering the system per unit of time;
X1 represents the number of jobs exiting the system per unit of time;
X2 represents the number of jobs going to disk per unit of time;
X3 represents the number of jobs going to the CPU per unit of time.
At steady state (i.e. when the system displays the same behavior for long
periods of time) we can assume that X0=X1. By observation we determine that
in the system are present in total N jobs at a time. The service times
at the CPU and at the disk are respectively Sc and Sd.
Here are the relations that will hold in this system:
- X0+X2 = X3, because of continuity at the fork going to disk and to the exit
- X2    = r*X3, because of definition of r, X2 and X3
- N     = X0*R, where R is the Response Time for this system, because 
of Little's Law.
We can use these relations any way we want to compute, given the values
of some variables, the values of the remaining variables.
Example 2
Example 2 is very similar to example 1. But now the users of the systems are
all sitting at terminals. Each user request constitutes a job.
 
We assume:
- Each user request generates v = 20 disk requests
- Disk utilization u is 50%
- Service Time at the disk is Sd = 25ms
- There are M terminals, M = 25
- When a user receives a response, thinks for Z = 18 seconds before 
submitting the next request
Question: What is the response time R? We use the equations:
- M = (R+Z)*X0, by Little's Law
- X2 = v*X0, by definition of v, X0, X2
- u = X2 / (1/Sd), by definition of u, X1, Sd
Then we get easily our response time:
- X2 = u/Sd = 0.5/0.025 = 20
- X0 = X2/v = 20/20 = 1
- R = M/X0 - Z = 25/1 - 18 = 7
Notice that it is easy to determine the maximum possible value of X2, the
saturation value, that is reached when the disk is fully utilized.
Since the service time of the disk is 25ms, the maximum value of X2 is 40
[since in a second we can have at most 40 disk operations].
Back of the Envelope Calculations
As you have seen from these examples, simple performance evaluation is
very feasible, using intuition and simple math. Most of what we do in 
performance evaluation is back of the envelope calculations.
In a talk at Stanford, Jeff Dean, a Google Fellow, has stressed this point
and suggested some numbers that people may find useful in their back of 
the envelope computations:
Numbers everyone should know (circa 2007):
- L1 cache reference 0.5 ns
- Branch mispredict 5 ns
- L2 cache reference 7 ns
- Main memory reference 100 ns
- Mutex lock/unlock 100 ns
- Compress 1K bytes with Zippy 10,000 ns
- Send 2K bytes over 1 Gbps network 20,000 ns
- Read 1 MB sequentially from memory 250,000 ns
- Read 1 MB sequentially from network 10,000,000 ns
- Read 1 MB sequentially from disk 30,000,000 ns
- Round trip within same datacenter 500,000 ns
- Disk seek 10,000,000 ns
- Send packet CA->Netherlands->CA 150,000,000 ns 
Dean gave the following as an example of back of the envelope computation 
that helps to make 
good design choices: A server needs to generate 
result pages containing ~30 thumbnails each consisting of ~256KB. One can 
use one of two designs:
- Design 1: Read serially the thumbnails
- 30 seeks * 10ms/seek + 30 * 256K/ 30MB/s = 560ms
- Design2: Read in parallel the thumbnails
- 10ms/seek + 256K/30MB/s = 18ms
Of course these numbers do not take into account a lot of variability and 
special circumstances, but they give an indication of the 
expected performance of 
the two designs. And do not think that "obviously parallel is better than 
serial". Parallel has additional costs in its implementation. Only if we 
can estimate the performance gain we can determine if the additional 
complexity is worthwhile.
ingargio@joda.cis.temple.edu