Parallel Programming week one notes

2016-09-30

Parallel Programming

Week one lecture notes about Parallel Programming on Coursera.

Introduction

What is parallel programming

Parallel Programming is a type of computation in which many calculations are performed at the same time.

It is based on the principle that a computation can be divided into smaller subproblems, each of which can be solved simultaneously.

It also assumes some parallel hardware, which is capable of executing these computations in parallel.

However, parallel programming is much harder than sequential programming, and there are several reasons for this:

  1. Separating sequential computations into parallel subcomputations can be challenging, or even impossible.
  2. Ensuring parallel program correctness is more difficult, due to new types of errors.

Speedup is the only reason we bother paying for the added complexity of parallel programming.

Parallel programming vs. concurrent programming

Parallelism

A parallel program is a program that uses parallel hardware to execute computations more quickly. Parallel programming is therefore mainly concerned with efficiency: how to divide a problem into subproblems and how to make optimal use of the parallel hardware.

Examples are matrix multiplication, data processing, and computer graphics rendering.

Concurrency

A concurrent program is a program that may or may not execute multiple computations at the same time. Concurrent programming is mainly concerned with modularity, responsiveness, and maintainability, focusing on questions such as: when can a specific execution start, and how can information be exchanged?

Examples are asynchronous applications such as web servers, user interfaces, and databases.

These two programming disciplines often overlap. Parallel programming may rely on insights from concurrent programming and vice versa, and concurrent programming may be used to solve parallel programming problems. However, neither discipline is a superset of the other.

Parallelism granularity

Parallelism manifests itself at different granularity levels:

  • bit-level parallelism – processing multiple bits of data in parallel.
  • instruction-level parallelism – executing different instructions from the same instruction stream in parallel.
  • task-level parallelism – executing separate instruction streams in parallel.

Note that bit-level and instruction-level parallelism are exploited by the underlying parallel hardware; in other words, they are implemented inside the processor itself. Task-level parallelism, on the other hand, is achieved through software.

Classes of parallel computers

  • multi-core processors
  • symmetric multiprocessors
  • graphics processing units
  • field-programmable gate arrays
  • computer clusters

Parallelism on the JVM

Threads

We mainly depend on threads to achieve parallelism on the JVM. Each JVM process starts with a main thread, which executes the main method of the Java/Scala program. In a normal sequential program, the main thread alone executes the program. In a parallel program, however, we must start several additional threads to parallelize our computation. The operating system then assigns our threads to the available CPUs.

To start additional threads, we can:

  1. Declare a custom Thread subclass which defines the code that the new thread must execute.
  2. Instantiate a new Thread object which tracks the execution of the thread.
  3. Call start on that Thread object to start the actual thread execution.

Note that the same Thread subclass can be used to create and start many thread instances. Also, we can use the synchronized keyword to achieve synchronization between threads; be aware that careless nesting of synchronized blocks can lead to deadlock.
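
As a minimal sketch of these steps (the class and message here are illustrative, not from the lecture), a custom thread can be defined and started as follows:

class HelloThread extends Thread {
    // The run method contains the code that the new thread executes.
    override def run(): Unit = {
        println("Hello from a new thread!")
    }
}

val t = new HelloThread  // instantiate the Thread object
t.start()                // start the actual thread execution
t.join()                 // wait until the thread completes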

Atomicity

An operation is atomic if, from the point of view of other threads, it appears to occur instantaneously. The JVM provides several synchronization primitives, such as the synchronized block, to enforce atomicity.
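
As an illustrative sketch (this Account class is not from the lecture), a synchronized block can make a compound read-modify-write operation atomic:

class Account(private var amount: Int = 0) {
    // Without synchronized, two threads depositing concurrently could both
    // read the same old amount, and one of the updates would be lost.
    def deposit(x: Int): Unit = this.synchronized {
        amount = amount + x
    }

    def balance: Int = this.synchronized { amount }
}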

Memory model

A memory model is a set of rules that describes how threads interact when accessing shared memory.

The Java Memory Model is the memory model for the JVM runtime. Two important rules are the following:

  1. Two threads writing to separate locations in memory do not need synchronization.
  2. A thread X that calls join on another thread Y is guaranteed to observe all the writes by thread Y after join returns.
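
As a small sketch of the second rule (the variable and value are hypothetical), the write performed by the worker thread is guaranteed to be visible to the thread that called join:

var result = 0  // shared variable, no explicit synchronization

val worker = new Thread {
    override def run(): Unit = {
        result = 42  // write performed by thread Y
    }
}
worker.start()
worker.join()    // after join returns, the write above is visible
println(result)  // guaranteed to print 42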

Running Computations in Parallel

To transform a sequential program into a parallel one, we introduce constructs that express which computations can be performed in parallel.

Basic parallel construct

Given expressions $e_1$ and $e_2$, parallel($e_1$, $e_2$) computes them in parallel and returns the pair of results. The signature of parallel is the following:

def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
    ...
}
  • From the point of view of the value computed, parallel(a, b) returns the same value as (a, b); parallel behaves like an identity function.
  • Benefit: parallel(a, b) can be faster than (a, b).
  • It takes its arguments by name, as indicated by => A and => B. If the arguments were passed by value, the two computations would be evaluated sequentially before parallel is even called, and there would be no parallelism.

Note that this is not the standard library implementation; it is shown for educational purposes only.
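
As a usage sketch (sumSegment is a hypothetical helper, not from the lecture), parallel can split the summation of an array into two halves computed on separate cores:

def sumSegment(a: Array[Int], from: Int, until: Int): Int = {
    var i = from
    var sum = 0
    while (i < until) {
        sum += a(i)
        i += 1
    }
    sum
}

def sumArray(a: Array[Int]): Int = {
    val mid = a.length / 2
    // Both arguments are passed by name, so neither half is evaluated
    // before parallel gets the chance to run them on separate threads.
    val (leftSum, rightSum) = parallel(sumSegment(a, 0, mid),
                                       sumSegment(a, mid, a.length))
    leftSum + rightSum
}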

Underlying hardware architecture affects performance

For computations over large arrays, it is sometimes difficult to obtain a speedup even with parallel computations, because the computation is bound by memory bandwidth. Whether we have one core or several cores working, they all access the array elements through the same memory unit, so the computation cannot take less time than it takes to fetch the entire array from memory into the processor. Therefore, when considering parallel computations for speedup, we must take into account not only the number of cores, but also the parallelism available in any other shared resources we need in order to perform the computation.

Combining computations of different length with parallel

When calling parallel with two expressions $e_1$ and $e_2$ that take different lengths of time, the minimum running time of parallel($e_1$, $e_2$) is the maximum of the running times of $e_1$ and $e_2$. For example, if $e_1$ takes 10 ms and $e_2$ takes 30 ms, parallel($e_1$, $e_2$) takes at least 30 ms.

First-Class Tasks

More flexible construct for parallel computation

task is an alternative construct for parallel programming. Consider the parallel computation:

val (v1, v2) = parallel(e1, e2)

we can write alternatively using the task construct:

val t1 = task(e1)
val t2 = task(e2)
val v1 = t1.join
val v2 = t2.join

t = task(e) starts computation $e$ “in the background”:

  • t is a task, which performs computation of e
  • current computation proceeds in parallel with t
  • to obtain the result of e, use t.join
  • t.join blocks and waits until the result is computed
  • subsequent t.join calls quickly return the same result

Task interface

Here is a minimal interface for tasks:

def task[A](c: => A): Task[A]

trait Task[A] {
    def join: A
}

task and join establish a correspondence between computations and tasks. In terms of the value computed, the equation task(e).join == e holds.

We can omit writing .join if we also define an implicit conversion:

implicit def getJoin[T](x: Task[T]): T = x.join
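
As a small usage sketch (the expressions are illustrative), with this conversion in scope a Task[T] can be used directly where a T is expected:

val t = task { 21 * 2 }
val v: Int = t  // getJoin is applied implicitly; equivalent to t.join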

Implement parallel using task

It is possible to implement parallel using the task construct:

def parallel[A, B](cA: => A, cB: => B): (A, B) = {
    val tB: Task[B] = task { cB }
    val tA: A = cA
    (tA, tB.join)
}
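
Here cB is started as a background task while cA is evaluated on the calling thread, so the calling thread does useful work instead of idling; tB.join then blocks only if cB has not finished yet.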

Measurement of Parallel Programming

Asymptotic analysis is one way of analyzing the performance of a parallel program, but it depends on the available parallel resources. Below we introduce two measures for a parallel program.

Work and depth

Work W(e)

the number of steps e would take if there were no parallelism

  • this is simply the sequential execution time
  • treat all parallel($e_1$, $e_2$) as ($e_1$, $e_2$)

Depth D(e)

also known as span, this is the number of steps e would take if we had unbounded parallelism

  • we take maximum of running times for arguments of parallel

Rules for work and depth (span)

Key rules are:

  • $W(parallel(e_1, e_2)) = W(e_1) + W(e_2) + c_2$
  • $D(parallel(e_1, e_2)) = max(D(e_1),D(e_2)) + c_1$

If we divide the work into equal parts, the depth counts a part only once! For parts of the code where we do not use parallel explicitly, we must add up the costs. For a function call or operation $f(e_1, …, e_n)$:

  • $W(f(e_1,…, e_n)) = W(e_1)+ … +W(e_n) + W(f)(v_1,… ,v_n)$
  • $D(f(e_1,…, e_n)) = D(e_1)+ … +D(e_n) + D(f)(v_1,… ,v_n)$

Here $v_i$ denotes the value of $e_i$. If $f$ is a primitive operation on integers, then $W(f)$ and $D(f)$ are constant functions, regardless of $v_i$.
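
As an example, the recurrences can be read off a sketch of a recursive parallel sum (the function name and threshold are illustrative, not from the lecture):

def segmentSum(a: Array[Int], from: Int, until: Int): Int =
    if (until - from < 1000) {
        // sequential base case
        var i = from
        var s = 0
        while (i < until) { s += a(i); i += 1 }
        s
    } else {
        val mid = from + (until - from) / 2
        val (l, r) = parallel(segmentSum(a, from, mid),
                              segmentSum(a, mid, until))
        l + r
    }

// With n = until - from:
//   W(n) = 2 W(n/2) + c             =>  W(n) = O(n)
//   D(n) = max(D(n/2), D(n/2)) + c  =>  D(n) = O(log n)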

Computing time bound for given parallelism

Suppose we know $W(e)$ and $D(e)$, and our platform has $P$ parallel threads.

Regardless of $P$, the computation cannot finish sooner than $D(e)$, because of dependencies.

Regardless of $D(e)$, the computation cannot finish sooner than $\frac{W(e)}{P}$, because every piece of work needs to be done.

We conclude that it is reasonable to use the following estimate for the running time of a parallel program when we have $P$ parallel processing units:

$$D(e) + \frac{W(e)}{P}$$

That is, the depth of the expression plus the work divided by $P$.

Given $W$ and $D$, we can estimate how programs behave for different $P$

  • If $P$ is constant but the inputs grow, parallel programs have the same asymptotic time complexity as sequential ones.
  • Even if we have infinite resources ($P \to \infty$), we still have a non-zero running time given by $D(e)$.
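
As a small worked example (the numbers are illustrative), if $W(e) = 1000$ steps, $D(e) = 10$ steps, and $P = 4$, the estimate gives $D(e) + \frac{W(e)}{P} = 10 + 250 = 260$ steps; as $P \to \infty$, the estimate approaches $D(e) = 10$ steps.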

Parallelism and Amdahl’s Law

Amdahl’s law states the following:

Suppose a sequential computation consists of two parts:

  • part1 takes a fraction $f$ of the computation time (this is the part that cannot be sped up and needs to be done first)
  • part2 takes the remaining fraction $1 - f$ of the time, and we can speed it up

If we make part2 $P$ times faster, the speedup is:

$$\text{speedup} = \frac{1}{f + \frac{1 - f}{P}}$$
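
As a worked example (the numbers are illustrative), with $f = 0.25$ and $P = 4$ the speedup is $1 / (0.25 + 0.75/4) = 1 / 0.4375 \approx 2.29$, well below the $4\times$ one might hope for; even as $P \to \infty$, the speedup is bounded by $1/f = 4$.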

Benchmarking Parallel Programs

Benchmarking is used to evaluate various performance metrics of a program. A performance metric could be running time, memory footprint, network traffic, disk usage, or latency. Note that such a value is actually a random variable: it varies from run to run.

Difference between testing and benchmarking

  • testing – ensures that parts of the program behave as intended
  • benchmarking – computes performance metrics for parts of the program

Typically, testing yields a binary output: a program or its part is either correct or it is not. Benchmarking usually yields a continuous value, which denotes how well the program performs.

Note that performance is the main benefit of parallel programs, so it is especially important to benchmark them.

Performance factors

Performance (specifically, running time) is subject to many factors:

  • processor speed
  • number of processors
  • memory access latency and throughput (affects contention)
  • cache behavior (e.g. false sharing, associativity effects)
  • runtime behavior (e.g. garbage collection, JIT compilation, thread scheduling)

From the above, we learn that measuring performance is difficult. More specifically, to measure performance reliably we have to do the following:

  • multiple repetitions
  • statistical treatment – computing mean and variance
  • eliminating outliers
  • ensuring steady state (warm-up)
  • preventing anomalies (GC, JIT compilation, aggressive optimizations)

ScalaMeter

ScalaMeter is a benchmarking and performance regression testing framework for the JVM.

Note that performance regression testing and benchmarking are different: performance regression testing compares the performance of the current program run against known previous runs, while benchmarking measures the performance of the current program (or a part of it).

See ScalaMeter for more info.
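
As a small usage sketch (based on ScalaMeter's documented API; the measured expression is arbitrary), the measure block returns the running time of the enclosed code, and a warmer can be used to reach a steady state first:

import org.scalameter._

object MapBenchmark extends App {
    // Plain measurement: no warm-up, so JIT compilation and GC may distort it.
    val coldTime = measure {
        (0 until 1000000).map(_ + 1)
    }
    println(s"cold running time: $coldTime")

    // Measurement after default warm-up runs, closer to steady-state performance.
    val warmTime = withWarmer(new Warmer.Default) measure {
        (0 until 1000000).map(_ + 1)
    }
    println(s"steady-state running time: $warmTime")
}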

Summary

This week, we learned about parallel programming. First, we covered the basics of parallelism; then we learned two basic constructs for task-parallel programming: parallel and task.