
Intro to Parallel Computing - MPI - 1

Hi, welcome to the MPI workshop presented by the NSF XSEDE program. This was originally a two-day event with hands-on exercises, which you can now view at your own pace. To get access to the exercises and slide content, look at the link below.

This gives me the opportunity to jump into the overview. It's about the only collection of buzzwords without programming that we'll do over the next couple of days, but to some extent, understanding the buzzwords and jargon is a necessary evil here. So I'll cover it all now, and it will give you a good orientation as to where MPI fits in among all the other parallel computing options that you may or may not be aware of; you certainly will be by the time we're done here. That way, when we talk about things in practical terms as we jump into the actual programming and hands-on work, I'll be able to refer back to these ideas and they'll make sense. As I said, this is the 50,000-foot view. We'll also have an outro talk at the end, a shorter version of this one, where we'll compare and contrast the various parallel programming approaches in a little more detail, because at that point you'll know how to do MPI; you'll be real MPI programmers, and that's when we'll really dive into the software comparison. This talk will be a little more hardware oriented; that one will be a little more software oriented.

So the first theme here is: why do we need MPI? Why do we need large-scale computing, exascale computing (a term I'll define shortly)? What's the point of all of this? Many of you are coming in with particular applications in mind; you know why you need MPI. Some of you may not. But it's helpful for all of you to understand the demand, the applications that drive the development of this stuff, because that's why it exists, it's how you're able to utilize it, and it gives you some idea of what the roadmap for the future will be. Is your investment in MPI a good thing? Is it merely trendy? Is it liable to be displaced by something else in a year or two? Computing is certainly faddish; such things are not unknown. As a matter of fact, they're the norm, really.

So let's look at the applications that drive this stuff at the highest end, because that ultimately filters down to the rest of us. At the big end, if you look at the largest machines in the world, and we will look at some of them in some depth, the problems they run are ones considered to be of such great strategic importance that they justify building supercomputers costing $200 million or more these days. And those machines exist because they have a lot of flops.

"Flops" is a term we'll use a number of times over the next couple of days, and one you'll certainly come across if you stick around scientific computing at all. It stands for floating-point operations per second, and it's a nice way to characterize the horsepower of a computing platform: how powerful is it, how many floating-point operations per second can it do? It's also an important way to characterize the demands of your code. If you've got a numerical application, a question someone might ask of you is: how many flops does it take to run your application?

So, these are 64-bit flops, by the way, I should point out. It has become standardized in scientific computing that, by default, when somebody says a floating-point operation, it's a 64-bit floating-point operation. That's not always necessary; as a matter of fact, I often come across people using 64-bit precision needlessly. At any rate, it's the default when we talk about how many flops a machine does. It's funny, because a lot of codes can benefit greatly, for memory bandwidth and other reasons, by going to 32 bits if they don't need 64. And in the machine learning world today we're finding that 16-bit or even lower precision is actually desirable for many things. Nevertheless, 64 bits is often required in scientific computing, and so it's the default.
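To make the 64-bit versus 32-bit trade-off above concrete, here is a minimal sketch (plain Python, standard library only, my own illustration) of what dropping to 32-bit floats costs in accuracy and buys in storage:

```python
import struct

x = 0.1  # Python floats are 64-bit (IEEE 754 double precision)

# Round-trip x through a 32-bit float and see what survives.
x32 = struct.unpack('f', struct.pack('f', x))[0]

print(f"64-bit value: {x:.17g}")    # good to ~16 significant digits
print(f"32-bit value: {x32:.17g}")  # only ~7 significant digits survive

# The 32-bit round trip changes the value slightly...
assert x32 != x
# ...but each value now needs half the memory and memory bandwidth:
assert struct.calcsize('d') == 8 and struct.calcsize('f') == 4
```

For codes that don't need the extra digits, that factor of two in memory traffic is exactly the bandwidth benefit mentioned above.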

So how many flops do you need to run a large climate model? Well, it turns out you need a lot, and we'll look at the definition of "a lot." I like a climate model as an example, and it's one we'll come back to in various forms over the couple of days, because the problem is easy to understand: we're trying to do weather modeling, in essence. In climate modeling, you're trying to do weather modeling except over the entire globe, and over decades, not for the next day or a 48-hour forecast. So it's not hard to imagine that it takes an extreme amount of computing power. It also takes very large amounts of memory, which is something else MPI will give you access to. For any of you that have memory-bound problems, or will find that you have memory-bound problems, where you don't have enough memory, MPI is a way to get past that: you can lump a lot of memory together. So these very large problems like climate modeling demand it.

I've got a slide here that gives you a pretty good idea, with these large climate modeling problems, of how important this is, and you can see what the grid size really does. In the early-to-mid 2000s, the grid size for a large climate problem would be something like a 200-kilometer grid: a 200-kilometer pixel of weather, if you will, or voxel, since it's really a 3D problem. Now, 200 kilometers might seem reasonable on a map of the globe, but if you look out your window, there's a lot going on within 200 kilometers of weather; that's not particularly fine detail. And indeed, as you go up three orders of magnitude in computing power, so that you can go to a 25-kilometer grid, all of a sudden more science emerges; fluid dynamics is like that, and a lot of things show up that weren't captured previously. But still, at 25 kilometers: look out your window again, and there's a lot going on within 25 kilometers whose physics you're not capturing. Consequently, we need to go up another three orders of magnitude to get down to even just a couple of kilometers. So this is a good example of how insatiable the computing demand is for an application whose importance I think we all appreciate. But sometimes we don't realize that the requirements of smaller-scale problems can be just as great.
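The jump from a 200 km grid to a 25 km grid is worth doing as arithmetic. A rough rule of thumb (my own back-of-the-envelope, not from the slides) is that refining a 3D grid multiplies the work by the refinement factor raised to the fourth power: once per spatial dimension, and once more because stable time steps must shrink along with the grid spacing:

```python
def cost_factor(old_km, new_km, spatial_dims=3, time_scaled=True):
    """Relative compute cost of refining a uniform grid from old_km to new_km.

    Assumes cost scales as refinement^spatial_dims for the mesh, times one
    extra factor of refinement for the shrinking time step (a CFL-style
    constraint). Illustrative only.
    """
    r = old_km / new_km
    exponent = spatial_dims + (1 if time_scaled else 0)
    return r ** exponent

# 200 km -> 25 km is an 8x refinement per dimension:
print(cost_factor(200, 25))  # 8**4 = 4096, i.e. ~3.6 orders of magnitude

# 25 km -> 2 km is a 12.5x refinement:
print(cost_factor(25, 2))    # ~24000, roughly another 4 orders of magnitude
```

This lines up with the "three orders of magnitude per resolution jump" figure quoted in the talk.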

Take the case of, say, another fluid-flow problem (that's why I picked this one, because it's also fluid flow): combustion. Now we're not talking about the Earth; we're talking about something that might be an engine-piston-sized problem, or, in a lot of the supercomputing world, things like steam turbines, things that fit in a room, a large room, but a room nonetheless. As you try to model combustion and flow in these problems, with all the multiphysics going on, it also requires incredibly large amounts of computing power, to the point where it requires the largest, $200 million computing platforms. So it's not just a few large problems, like modeling the cosmos or modeling the globe; supercomputers are required to model things at microscopic scale accurately as well.

I'll give one last example that's in a very different direction but is equally important on very large platforms: modeling brains, essentially. There are several brain initiatives currently in the United States, in Europe, and in China, being funded at large scale, and this group here, the Modha group at IBM, is maybe the best known. They've certainly been persistent; they've been at it long enough that you can see the evolution of what they're doing very well, as they went from a mouse brain, with 16 million neurons and 128 billion synapses, toward the human brain. They're not quite at the human brain yet, which is 22 billion neurons, but they've been gradually, steadily working their way there. This is a problem that requires the very largest machines, to the point where this modeling right here, the human brain, is probably going to require an exascale computer. "Exascale" is a term I've mentioned and should now define, because you'll come across it, you'll trip over it, all the time as you look at current discussions of computing.

That's because we've all been racing, converging, toward a machine that is three orders of magnitude faster, in terms of flops, than the petascale machines, the petaflop-capable machines, that we're working on today. Today's large machines work in the petaflop range: 10^15 floating-point operations per second, which is by many measures an incredibly impressive amount of computing power. It certainly takes a lot of hardware, a lot of money, and a lot of electrical power. But that's the range we're at at the moment.
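To put these prefixes in one place, here is the arithmetic (the laptop figure is my own illustrative assumption, not from the slides):

```python
# Orders of magnitude in HPC, as flops.
giga, tera, peta, exa = 1e9, 1e12, 1e15, 1e18

# An exascale machine is 1000x a petascale machine:
assert exa / peta == 1000

# Illustrative only: if a laptop sustains roughly 100 gigaflops,
# an exaflop machine is ten million laptops' worth of compute.
laptop = 100 * giga
print(exa / laptop)  # 10,000,000.0
```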

Because of Moore's law and other things that keep saying computing will increase geometrically, we've been fixated on that next generation of machines, the exascale machines, for about the past seven or eight years, anticipating getting there. At one point in time it looked like we might get there by about 2020; now we're thinking maybe a little past that. But everybody's racing toward it. What do I mean by everybody? Any strategically thoughtful organization or country or multinational body these days recognizes how important supercomputing is, and so the United States and the EU and China and Japan all have exascale computing initiatives to build a machine that can do an exaflop of computing: that's 10^18 floating-point operations per second. That's the level at which the human brain model I showed you starts to become viable; that's the level at which a lot of problems start to become more tractable, in climate modeling and many, many other domains which I don't have the luxury of going into. I could spend hours talking about important problems that exascale will enable. It will allow the science to become really transformative and much more applicable to real-world problems.

So everybody's racing toward this exascale thing, and those developments filter down to the rest of us whether you need to be there or not. Some of you, I'm sure more than a few of you out there today, will actually be using these exascale machines as soon as you can get your hands on them. But for the rest of you, even if you think departmental resources are the level you want to work at, everything filters down from that leading edge of the computing world fairly rapidly; we're not far behind, and I'll give you some examples of how we're not far behind. Yesterday's supercomputers are on the desktop today. But you're not getting to that level without going very, very parallel. That's the main theme of this talk: you can only use these machines, or even your desktop machine, effectively if you're doing parallel programming. Serial programming is hopeless anymore in terms of getting any kind of reasonable performance out of any hardware platform, including your smartphone.

Here's a good illustration of that: a fairly typical, generic benchmark, the SPEC benchmarks, a couple of generations of them over time. There are lots of benchmarks applying to lots of different domains, and we could pick different ones, but pretty much any benchmark in the center of mass of generic numerical applications, or video games, or rendering, is going to show this kind of curve. Somewhere around the year 2004, in almost any graph of this nature, we lost what had been decades of continual growth, where every 18 months to two years computing power doubled. That happened, again, for decades, since the early 1960s, and somewhere around 2004 it quit happening. As a matter of fact, today we're already seven years or so behind where we would have been if it had kept happening. It definitely stopped; things really leveled out. And this is quite visible in lots of different ways besides just a benchmark; it's visible in the clock rates of computers and lots of other things.

Now, this right here is the baseline proof that serial programming is dead, in the sense of: "If I just wait until next year, the computers will be faster and more powerful and will run my code faster, so I'll write this really slow code in whatever programming language is convenient, and it's okay, it'll just run faster next year." That was actually true for many decades; you could count on things running twice as fast soon enough. It hasn't been true for a while now. And this is not because something called Moore's law, which everybody likes to attach themselves to, is dead. To put it explicitly: Moore's law is simply an observation Gordon Moore made back in the mid-1960s that transistor density, how many transistors you can cram into any given area of a chip, would double about every 18 months to two years, somewhere in that time frame.

Moore's law is not dead just yet. It's running into a wall pretty soon, but it's not dead yet, and it certainly hasn't been dead over the past ten years while things have slowed down; it's not responsible for that. Moore's law has actually still been pretty healthy. The engineers have done a heroic job, over four-plus, five-plus decades now, of keeping it going through multiple revolutions in fabrication technologies and the like, so Moore's law keeps giving us more and more transistors all the time. We can't blame Moore's law for being dead, although, again, it's really starting to finally hit the limit. As a matter of fact, if you look into things carefully, you can make an argument that Moore's law did kind of die six or seven years ago: they still manage to keep cramming transistors in there, with things like 3D packaging and whatnot, but more and more of the transistors are lost to error correction and other issues that come with working at that scale, so the cost per transistor is actually going up for the first time ever. At any rate, Moore's law may be defunct soon, but it's not responsible for the plateau. What is responsible is something else we'll look at on the next slide, but first I'll point out another way to see that Moore's law has continued, that the engineers haven't disappointed us in giving us more transistors.

Here's a bunch of different common, popular processors, and you can see that over however many decades you want to look back, the transistor counts on the chips have kept doubling every couple of years. Today's processors really do have a lot more transistors than processors from just a couple of years ago; the transistors have been delivered by the engineers. So why have the speeds stalled? Why has serial computing come to such a halt?

This is the reason. Of all the material we're going to cover over the next couple of days, this is probably the most important slide you could show somebody to demonstrate the importance of parallel programming and MPI, because this is the reason right here that parallel programming has taken over all of computing, and the reason that clock speeds leveled off. Here are clock speeds over the past three decades, and right around 2004, clock speeds leveled off. Clock speed is another crude measure of computing performance, like whatever benchmark you might pick, and it leveled off around 2004. Why? This is the reason right here, the reason parallel computing exists as such a dominant thing in computing today: the chips are running on the verge of melting, and that little bit of physics dominates modern computer design.

Today's chips run on the order of a hundred watts per square centimeter. They run hot, and that's not a difficult number to understand even if you're not a physicist. A hundred watts of power, a hundred-watt light bulb, is something you wouldn't want to put your hand on; it would burn you immediately. It's a lot of power, and that's how much heat we're having to dissipate through a square centimeter, less than a postage stamp of area, of a modern computing chip running full-out. That's an impressive amount of heat to dissipate, especially if you don't want to go to exotic liquid-cooling technology, if you just want to blow a fan over things because that's convenient, or the chip needs to sit in your lap in a laptop or be in your cell phone. So the fact that modern chips run at that power density should impress you. We're well past how hot a hot plate is; this is much hotter than the surface of a hot plate; we're coming up on nuclear reactors here. So this right here is the physical limitation we ran into in 2004: as they crammed more and more transistors into a small area, just running the current through them and switching them fast enough became the problem. Indeed, for any electrical engineers out there, think about it: running 100 watts at less than a volt through a one-square-centimeter area is basically 100 amperes of effective current running around that square centimeter. So it's impressive in multiple respects that these chips don't melt down as it is.
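That 100-ampere figure follows from nothing more than P = V * I; here is a quick sanity check (the supply voltage of roughly 0.9 V is my own assumption for illustration):

```python
# Power density quoted in the talk, and a typical modern core voltage
# (the ~0.9 V figure is an illustrative assumption).
power_watts = 100.0   # dissipated per square centimeter
voltage = 0.9         # volts

current_amps = power_watts / voltage  # P = V * I  =>  I = P / V
print(round(current_amps))            # ~111 A through 1 cm^2 of silicon
```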

So, again, we could resort to more exotic solutions like liquid cooling. Here's a picture of a Cray-2 from the mid-80s. This is not a new problem, by the way; in supercomputing, we've had to deal with heat dissipation issues for a long time. The Cray-2 supercomputer in the mid-eighties had this very artistic-looking heat exchanger in front of it, full of Fluorinert bubbling away while the thing was running, to keep it from melting. So we could do that, but it's just not practical, right? You don't really want to worry about liquid cooling on your laptop.

So instead, that's why we're in the parallel world we are in today. The engineers said: we've got more transistors, but we can't keep running them at ever-faster clock rates, because we're cramming more of them into a small area than we have in the past. So what can we do to still get a performance benefit, given that we can't deal with the thermodynamics of this without doing something infeasible? What they did was recognize a revelation that wasn't new in 2004 (parallel computing has been around for a long, long time): that parallel computing had to make its way into the commodity world, the desktop world, your cell-phone world. They recognized that, with all these transistors, we could do what we've done for decades, which is make a bigger, faster single core to run your serial code better, using all the tricks that more transistors buy us, adding more cache, all the things we've been doing. Or, instead of doing that, what if we broke that big core up into some smaller cores, using the same amount of power? If we've got a hundred watts running into this single big core, what if instead we made four simpler cores? They're not going to be as clever, they're not going to have as much cache, they're not going to be able to use as many tricks, but these four dumber cores will, in net, have more performance than our one big core. If you actually want to look into the details, single-core performance scales roughly as the square root of the area; there are a lot of different metrics you can get into if you're interested in this, and the finer details are not inaccessible. But the bottom line is that instead of one big core, we'll have four or more smaller cores working in parallel.
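That square-root-of-area relationship (often called Pollack's rule) is worth a quick worked example; this sketch is my own illustration, not from the slides:

```python
import math

def core_perf(area):
    """Pollack's rule: single-core performance ~ sqrt(die area)."""
    return math.sqrt(area)

# One big core using 4 units of die area, versus four 1-unit cores:
big = core_perf(4.0)        # perf 2.0
small = 4 * core_perf(1.0)  # total perf 4.0

print(big, small)  # the four smaller cores win, if the work parallelizes
assert small > big
```

The catch, of course, is the "if the work parallelizes" part, which is exactly what this workshop is about.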

Here's the kind of math that happened in the real world back when everything went from single-core to dual-core, around the turn of the century. We said: instead of having one core running at this voltage and this frequency, what if we take two cores and run them at 15% less voltage and a 15% lower clock rate? Because of the way these things scale, that's about the same amount of power, but now we've got 1.8 times the performance. This is the math right here that makes parallel computing so compelling to the engineers. This is the reason it's not a fad, it's not a fashion, it's not something computer scientists interested in concurrent programming foisted on the world, and it's not something that will change anytime soon: it's the fundamental physics of computing.
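We can check that dual-core arithmetic with the standard dynamic-power model, P ∝ cores * V^2 * f (my own sketch; exact constants don't matter, only ratios). Note the two-core total comes out near, not exactly at, the original power; the classic version of this slide rounds it to "about the same":

```python
def dynamic_power(v, f, cores=1):
    """Dynamic CMOS power scales as (number of cores) * V^2 * f."""
    return cores * v**2 * f

def throughput(f, cores=1):
    """Idealized throughput: cores * frequency (perfect parallel speedup)."""
    return cores * f

base_power = dynamic_power(1.0, 1.0)
base_perf = throughput(1.0)

# Two cores at 85% voltage and 85% frequency:
dual_power = dynamic_power(0.85, 0.85, cores=2)
dual_perf = throughput(0.85, cores=2)

print(f"power: {dual_power / base_power:.2f}x")  # ~1.23x
print(f"perf:  {dual_perf / base_perf:.2f}x")    # 1.70x
```

With slightly more aggressive voltage scaling, this is the "same power, ~1.8x performance" trade quoted in the talk, and either way the performance per watt clearly improves.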

Here's a good example of how this has made its appearance on generic commodity chips. If we look at a processor from around the turn of the century, it could deliver around 3 floating-point operations for every clock tick. If we look at a processor today, a Skylake processor, the kind that's in my laptop right here, nothing that exotic, it delivers close to 2,700 floating-point operations for every tick of the clock. So this is an example of how incredibly parallel a generic modern processor is compared to a processor from just 20 years ago. This is where the parallelism has crept in, and if you can't take advantage of it, you can't use a modern processor well at all.
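Where do thousands of flops per clock come from? Roughly, from multiplying the parallel dimensions inside the chip. The breakdown below uses my own illustrative Skylake-class assumptions (28 cores, 512-bit vectors, fused multiply-add, two vector units per core); counting single precision or multiple sockets lands in the same ballpark as the ~2,700 figure, and the exact count depends on the part:

```python
# Back-of-the-envelope flops-per-clock for a Skylake-class server chip.
# All numbers are illustrative assumptions, not from the workshop slides.
cores = 28           # cores per chip
vector_bits = 512    # AVX-512 vector registers
fma = 2              # a fused multiply-add counts as 2 flops
vector_units = 2     # FMA-capable vector units per core

lanes_dp = vector_bits // 64  # 8 double-precision lanes per vector
flops_per_clock = cores * lanes_dp * fma * vector_units

print(flops_per_clock)  # 896 double-precision flops per clock tick
# Single precision doubles the lanes, and a second socket doubles it
# again; either way it is order a thousand flops per clock, versus ~3
# around the year 2000.
```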

So if you're writing some simple-minded program that's inherently serial, you can see you're not going to get more than a small fraction of the performance capability of that processor. Now, the way this parallelism has actually made its way into a modern processor is through a couple of different things. There are multiple cores: a modern processor like that Skylake has got 28 cores, so a lot of cores. Each one of those cores also has a lot of vector-type instructions behind the scenes; these are instructions that work on a lot of data simultaneously. You might hope that the compiler can do a lot of that vector magic for you, and sometimes that's true, with really simple loops; sometimes it needs a little bit of help. At any rate, you might hope that this stuff stays hidden. But if you want to use multiple cores, you're definitely going to have to become a parallel programmer. In any case, you can see that parallel computing is not just some supercomputing kind of thing out there: it's your laptop, if you want to use it well, or your smartphone. Again, when modern smartphones come out, Apple's phones with their six cores versus Samsung's phones, you can see that even people worried about how good their smartphone is are worried about how many cores their phone has.

Everything is parallel now. So let's put all this stuff that I've talked about together in one place to give you an idea. The transistors kept coming; that's Moore's law, so Moore's law hasn't abandoned us. But single-thread performance has really plateaued, so being a serial programmer is a dead end. You can see that in terms of the clock rate, too, if you want a very simple metric: computers nowadays run at a couple of gigahertz, plus or minus, the same speed they were running ten years ago. Clock rates haven't gone up, and serial performance is the same, because things would simply melt if they kept cranking more power through them. So instead, what has happened is that the number of cores per processor keeps climbing, and this is where your MPI programming is going to turn you into somebody who, instead of being dismayed by these facts, is happy, because you can take advantage of them; you can exploit this.

So what does parallel computing look like? How do we actually do this stuff in parallel? This takes us back to the fundamental problem of parallel computing: how to break things up into pieces so that they can be done in parallel. This is the ongoing discussion about how accessible parallel programming is, how likely you are to have success applying it to your particular problem. Here's an example from way back, from The Mythical Man-Month, which is a classic; any of you that do serious software development and software engineering have probably read it, or certainly should. Brooks's law from that book basically asks: if a woman can make a baby in nine months, can nine women make a baby in one month? That's absurd, right? But nine women can make nine babies in nine months. This exposes the idea in parallel programming that things can be done in parallel, sometimes. Sometimes it's not quite so obvious, and sometimes it takes a little bit of different thinking. This is the notion we're going to get comfortable with over the next days as we apply MPI to different problems.

So let's look at this from the perspective of a weather problem again. I like weather because it's intuitive; no matter what background you're coming from, you can kind of get the idea. If we wrote a weather model back in the 1970s, on a serial computer, we would basically do the same thing you do with many, many scientific problems: you'd break your weather map up into a grid. So we break our map of the United States up into a grid, and each one of the mesh points on the grid is going to have some wind velocities and temperatures and pressures, the various things you want to use to compute your weather. What your CPU would do is go over this whole grid and update it for the next time step, based on the way the wind is blowing and the atmosphere is flowing, so to speak. This would be the classic serial way to write this code.
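The serial sweep just described, visit every grid point and update it from its neighbors, can be sketched in a few lines. This is a generic toy stencil update (simple neighbor averaging standing in for real atmospheric physics), my own illustration:

```python
def step(grid):
    """One serial time step: each interior cell is updated from its
    four neighbors (a toy stand-in for real weather physics)."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                grid[i][j-1] + grid[i][j+1])
    return new

# A 5x5 "weather map" with a hot spot in the middle spreading outward.
grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 100.0
grid = step(grid)
print(grid[2][1])  # 25.0: the disturbance has reached the neighbors
```

The key property to notice is the one the talk keeps returning to: each point needs only its immediate neighbors, which is what makes the problem splittable.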

Well, it so happens that a guy named Richardson, in 1917, figured out a way we might do this in parallel. Richardson had this idea that if you had a bunch of meteorologists in a room, in his case he envisioned something like 64,000 meteorologists, they could break the weather map down into a bunch of small patches, and each one of those meteorologists would do the calculations, in his day with pencil and paper, for their own patch of the map. You might wonder why he was worried about parallel computation in 1917; well, his notion was exactly that they would do it with pencil and paper. With pencil and paper on the map, each would do the calculations required to figure out the weather in their local neighborhood, based on what's happening in the neighboring patches, because your weather only immediately depends on the weather in proximity to you. If we're sitting here in our corner of Pennsylvania, we only care immediately about the weather that has swung over the border from Ohio; that's the weather that's coming at us. The weather happening out in Indiana will get here eventually, but for my immediate calculations I only need to know about my immediate neighbors. So this is Richardson's idea: 64,000 meteorologists, each working on a little patch of the map, and after every calculation they exchange with their immediate neighbors the information needed to advance the calculation one step. This was Richardson's idea of how to compute the weather in parallel.

Now, today we realize this is very naive, in that pencil and paper isn't going to cut it: you need a much finer grid and much more detailed modeling than you're going to manage with hand calculations. But his paradigm is actually very sound, and it is indeed how we do modern weather modeling.

So here is how we would do modern weather modeling on something like your laptop with multiple cores. If we had a laptop with four cores, for example, which many of you out there do have, and we put the weather map on it, the way we might do this so that we get some speedup from those cores is to break the weather map up into four pieces. Each one of those cores would be responsible for running, essentially, the serial approach on its own section of the map; it would do the old-school serial code over its corner of the map, and the other three neighbors would do the same thing. Then, at the end of each time step, they exchange the information they need to exchange, which is just along the borders: which weather is flowing from here to here, and which weather is flowing the other way. They exchange that information along the border, and then they continue on with their own calculations. The way we might do this with four cores on a laptop, the simplest way at least, would be using a programming model called OpenMP, which is for doing multi-threaded, multi-core programming. It basically amounts to taking your old serial code and introducing some hints to the compiler, things called directives, which allow it to break up these simple loops amongst the four cores. That approach would work well with four cores on your laptop, and indeed it might speed things up by about four times; 3.8 times, close to four, would not be unreasonable. This is like having four meteorologists in a room, four of the meteorologists Richardson envisioned, each working on one corner of the map.
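The decomposition just described can be sketched in plain Python rather than OpenMP/C (my own illustration, using two subdomains instead of four to keep it short): each "core" owns a strip of rows plus a borrowed "halo" row from its neighbor, exchanged every time step.

```python
def step_rows(rows):
    """Update each interior row from its vertical neighbors (toy physics)."""
    return [
        [0.5 * (rows[i - 1][j] + rows[i + 1][j]) for j in range(len(rows[i]))]
        for i in range(1, len(rows) - 1)
    ]

# Global 6-row weather map with a warm band in the middle.
world = [[0.0] * 4, [0.0] * 4, [10.0] * 4, [10.0] * 4, [0.0] * 4, [0.0] * 4]
top, bottom = world[:3], world[3:]  # each "core" owns three rows

# Halo exchange: each half is padded with its own outer edge and with the
# boundary row received from the other half.
top_ext = [world[0]] + top + [bottom[0]]
bot_ext = [top[-1]] + bottom + [world[-1]]

# Each core updates only its own rows; together they rebuild the map.
parallel = step_rows(top_ext) + step_rows(bot_ext)

# The decomposed update matches a serial sweep over the whole map.
serial = step_rows([world[0]] + world + [world[-1]])
assert parallel == serial
```

The assertion at the end is the whole point: with the right border exchange, the parallel decomposition computes exactly the same answer as the serial code. MPI turns this sketch into real message passing between separate processes.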

A different approach you might take is using GPUs. For any of you that are paying attention, GPU programming has become incredibly important in modern computing, because GPUs offer a lot of flops for the money, and for the power; those are the two good justifications. GPUs have a whole lot of flops: if you look at how many flops you get out of a GPU, an accelerator card you can buy off Amazon and slap into your workstation, the flops per dollar are very competitive versus buying a better CPU. And if you're concerned about building a big machine room, you get a lot of flops per watt, too; if you care about how much power it takes, because you need to buy the power and cool the room, those cards are incredibly efficient. So GPUs have become really important, and in some areas they've taken over; in some of machine learning, for example, they're pretty much all anyone focuses on anymore.

GPU programming is a little bit different way of approaching the problem. It says: let's take the serial version of our code, knowing that the GPU over here has a lot of flops. One thing about GPUs, which we don't have time to get into here but which I'm not misrepresenting, is that GPU cores are very simple-minded; they're synchronized together, and to a large degree they all need to do the same operations in lockstep. And our weather map has to be put onto the GPU over some kind of bus: usually, when you plug a card into your workstation, that's a PCI bus. That's actually a fairly slow connection; you might think it's fast because it's right there on the motherboard, but compared to normal memory speeds it's slow. So we have to cram the data over that PCI bus onto the GPU; the GPU can do only a limited subset of operations on that weather map, though we hope it can do as much as possible (and with something like a weather problem it might be able to do quite a lot); and then eventually the results have to make their way back to the CPU, again over this relatively slow PCI bus. So this is how GPU computing would approach the problem. It's like having one meteorologist on the CPU who really knows what they're doing and can do anything, coordinating with a thousand math savants in another room, over a slow, tin-cans-and-a-string kind of connection. But the savants in the other room can really number-crunch a whole lot if you give them a simple enough problem.

This is GPU programming in a nutshell, and it's the kind of thing we do in our OpenACC workshop, with the same problems that you're going to use here in the MPI workshop. By the way, the way we can do this could either be the classic approach, using something called CUDA, which is very low level, not standard at all, not portable, and a maintenance nightmare, because CUDA gets updates every 15 months or so and it really requires a lot of upkeep; or you could use a directive-based approach like OpenACC, the GPU counterpart of OpenMP, which means you give some hints to the compiler and you hope it can do a lot of the smarts to make this stuff happen magically on the machine. That's not an unreasonable expectation for a lot of scientific types of algorithms. But

what we're going to learn instead is how to do this so that we can really scale this thing up, infinitely in a sense; we can scale it up as large as we want. Because four cores on your laptop, breaking the map up into four pieces, probably isn't going to be enough for your National Weather Service; that's not going to get you an accurate forecast and hurricane predictions. A single GPU might be faster than your CPU, but you're still not going to run a serious weather model on it. If you sit down and do the math, how many flops do I need in order to get my prediction out tomorrow instead of five years from now, the answer is probably going to be that you need thousands and thousands, or tens of thousands, of cores' worth of flops. You need a lot of flops to do a modern weather model, and so MPI is going to give us that capability. MPI

says: instead, let's break up the problem across a bunch of physically independent machines. A cluster of machines might be a good place to start, and you can certainly build your own cluster. You buy a bunch of, I call them white boxes, right, a term for buying a bunch of generic workstations, stick them in a closet, hook them together with Ethernet, and you've got yourself a cluster. So you could build a cluster like that, or you can find yourself a supercomputer; we'll find out that MPI makes all these things look pretty much alike from the programming perspective. So you build yourself a cluster, and now you can get as many computing flops as you need in one place, as much as you can afford; if you get more money, you buy twice as many machines. As much computing power as you need, you can keep scaling it up. But our computing model now requires us to break this map up into a lot of small independent pieces that live on each one of these separate machines, in their own separate memory on those machines, and they only communicate between themselves over an actual visible network, an Ethernet network, for example, that you're using to connect them together.

So this is where MPI comes into play. MPI says, OK, we can break this map up into 50 different pieces if we want, one for each state, and each one of those states is going to have its own workstation, so it only has to compute for one small state. But it's got to talk and communicate that weather information with the other states, because your weather is affected by your neighbor's, to some degree, right? The weather does come across the border. So we do need to exchange some information, and that's going to have to happen over the network. MPI allows us to do this through this model.

So this is like having 50 meteorologists, but they're no longer in the same room looking at the same map, staring over each other's shoulders; now they have to explicitly communicate. It's like having 50 meteorologists using telegraphs. This means a little more attention we've got to pay to detail, and it's going to require more effort on our part to make it work, but it will scale as large as we can afford, basically. This is MPI, and I won't go into it further here because we're about to jump right into these details. So here is how the pieces fit together. If you understand this right here, you understand the vast majority of the actual, real computing landscape. There are lots and lots, and we'll talk about this tomorrow in the outro talk, lots of experimental and, you know, hyped ways to do parallel programming, and we'll dive into them tomorrow; I don't want to confuse you now, before you know the one that actually works, MPI. After we do MPI, tomorrow we'll look at a lot of the others. You might hear somebody say, well, if you do functional programming it makes all this stuff work out automatically, just use Haskell and it will happen; you may have heard things like that, and we'll look at those claims tomorrow. But I can tell you right now that the reality at the moment is that 99-plus percent of computing done on supercomputers is done with the pieces that we just covered, and I'll show you how they fit together.

That's the reality. If you go to the Supercomputing 18 conference, it takes place next month in Dallas this year, and you walk the large show floor looking at the thousands of projects people are doing, 99-plus percent of the ones that run on supercomputers use MPI, GPU programming, and OpenMP. So this is the reality; the possibilities for the future are a different story, and we'll talk about those tomorrow, but the reality is that the pieces look like this today. If you buy a processor, it's going to have multiple cores; you can't buy a serial processor any more, so even a simple ARM processor to power your smartwatch has got multiple cores. Any processor you buy today has multiple cores, and if you want to program those multiple cores, you need multi-threaded programming; OpenMP is by far the biggest standard for doing that. So OpenMP allows you to use multiple cores. If you buy a GPU to plug into whatever you've got, your workstation, say we have a workstation that's computing and we decide we want to buy some more flops, you go on Amazon, you buy yourself a GPU, you plug it in, and all of a sudden you've got teraflops of performance, 100 teraflops if you're willing to go to low precision, that you just bought and plugged in. If you want to program that, you need some kind of GPU programming approach, like OpenACC. But if that's not sufficient, if that one box, no matter how much money you put into it, isn't sufficient, then you have no

other choice. And again, this is not a matter of my preference or portability preference; it's a matter of reality. If you look at all the large codes out there that run on tens of thousands of cores, all of them use the MPI you're about to learn. So if we want to go beyond a single box, a single node, we need MPI to put those pieces together, and MPI allows us to scale things up to, you know, infinite scale in a sense; there really isn't a ceiling there, and what it does come down to is as much money as you have. We can see this in the trajectory over the years: we're up to tens of millions of cores now, and I'll show you that on some of the bigger machines. They're all MPI; there's no other way to program the big machines, and for the foreseeable future there are no alternatives that are viable at the moment, lots of discussion, lots of hopes and dreams, but all the codes that are running on the big machines right now are MPI codes. OK, so

let's recap some of the levels of parallelism we have going on. We can start, as I mentioned, at the lowest level inside the processor, where we have these vector instructions that we won't get into, because we hope that the compiler takes care of them for us, and that's not an unreasonable hope. It's good to know optimization, to talk with somebody who knows optimization, and maybe verify you're getting good performance by running profilers on your code, but you might hope that the compiler does a good job of using this stuff. At the lowest level, the instruction-level parallelism in modern processors is incredibly complex and sophisticated; this, by the way, is where all these Spectre and Meltdown problems come up, all the security things that you may or may not have heard of in the past year. They're difficult to eliminate because modern processors have a whole bag of tricks that are quite effective; that's why it's not an easy problem for Intel to solve, because getting rid of all of this means giving up a lot of performance. But these things all happen in the background. There are all these clever tricks for rearranging the order of instructions, renaming registers, and predicting what's likely to happen in your code, but this happens at some level between the hardware and the compiler. Again, we hope as programmers this stuff doesn't really have to be visible to us; we hope that's all the compiler's job and not your problem. Again, that's not always true: sometimes it's helpful to be able to jump in and give the compiler some hints so it might do better, and sometimes that can be quite effective. It's always good to at least have a handle on how well you're doing here, which is why running a profiler on your code is a nice thing to do. But it's not unreasonable to hope that this stuff could remain invisible

to us. On the other hand, once you start talking about using more than one core, because your processor, whatever it is, whether it's in the laptop in front of you or some other machine, has got multiple cores in it, we know that to use multiple cores, the compiler can't do it effectively; we've got to program it. So we could use OpenMP and worry about programming multiple cores there. If we plug in an accelerator like a GPU, and there are some alternatives to GPUs, but GPUs are pretty much the center of mass of accelerator programming, we could program that in some way. But if we want to go beyond that at all, to do anything larger than a single node, then we're in the MPI world. Now, there are a few other things that are important in parallel programming that I won't go into; we don't have time to cover everything in the next couple of days, but they're important. Parallel I/O, for example, is incredibly important for these large machines. If you build yourself a big cluster with thousands of nodes or more, doing I/O, reading your data in for your problem and writing it out as you do your computations, is not trivial at that point, and you hope that happens in parallel as well; it's essential that it does, or, you know, you ran the problem 1,000 times faster but it takes you two days to read your data. So I/O is a whole thing that happens in parallel; MPI is a good way to deal with that issue too, and we'll touch upon it briefly. MPI is a great solution for this problem, and parallel I/O is important. Also, for some of you out there, there are these more or less custom, exotic types of devices that can be very effective for certain classes of problems. They're either straight-up custom chips, like ASICs, or field-programmable gate arrays that you can reprogram to do your thing, or digital signal processors that are very good at particular types of problems. Some of you out there will undoubtedly come across these kinds of devices and how to program them; they also fit into this scheme in various ways, but I can't digress into the particulars here. These are other areas of parallel programming you'll come across, and they're important at scale. But let's look at how MPI applies as we

work our way up the hierarchy of machines here. If you look at the motherboard that you bought for your workstation, or the cheap motherboard you've got in your laptop or in your typical home PC, even if you built it yourself, it probably just has one chip in it, one processor. That processor has multiple cores, but it's only got one processor. If you look at a data center, on the other hand, they typically want to cram more performance into each blade that they stick in the cabinets in their data centers, and so they typically have multi-socket motherboards: you can stick dual or even quad processors on a single motherboard and get a lot more cores on a board that way. You can start to stick a lot of this stuff together and build a larger machine. You can even try to build a shared memory machine that's very large, where all the cores share the same memory, and this is desirable because it allows us to use that OpenMP programming model I briefly mentioned earlier, the one I said was good for breaking our weather map up across, say, four cores. Well, that programming model is definitely easier than MPI, what you're about to learn. Some of you have undoubtedly been told how intimidatingly difficult MPI is, and I'll dispel that; we're going to turn you into MPI programmers over the next day, and you'll walk away saying, well, that wasn't that hard to do. But it is a little bit harder than something like OpenMP. So the desire to use the OpenMP programming model, the just-throw-a-few-directives-at-loops kind of approach, which can be effective, the desire to retain that programming model on bigger machines beyond a single node, is such that occasionally companies try to build very large shared memory machines. And so there are some very large shared memory machines out there; we have on Bridges some of these 12-terabyte nodes, they've got 12 terabytes and around 260 cores, and that's about as large as it gets with shared memory; you can't push the envelope much beyond that. So shared memory machines can be more than four or 20 cores, they can be larger, but they're very

expensive, and even then, when you really spend millions of dollars, that's it, that's the limit; you're still stuck at several hundred cores. You're never going to go up to the scale of, well, a cluster that you might build yourself. So there are lots of clusters out there that people have built over the decades; cluster computing became popular back in the mid '90s, because if you want to build your own parallel computer, nobody can stop you from just sticking a lot of nodes in one place and hooking them together with a network. You might put them in nice cabinets as blades, or, as I implied earlier, it might be an unused room somewhere where you threw a bunch of white boxes on the floor and hooked them together, but at the end of the day it's the same idea: a conventional computing box hooked up through a network connection to a bunch more conventional computing boxes. There are lots of big clusters out there, all of which are programmed with MPI. And if you've got the money to have somebody do the slickest solution for you,

they can start to make custom networks to cram things even closer together. Physical proximity within networks can be helpful for speed purposes, the speed of light actually becomes important, and cooling becomes tougher, so all these things that come into play when you want to build a custom machine are the reasons why people do buy machines from Cray, for example, or another big vendor. If you do that, you can build yourself an MPP, a massively parallel processor, at whatever scale you can afford. At the moment, the biggest machine in the world is Summit at Oak Ridge National Lab; it's 122 petaflops, and since we've been talking about the race to exascale, an exaflop, well, this is about 1/8 of the way there. It's got a bunch of pieces, and we won't go through the pieces, but it's got 2.2 million cores. So this is where you as an MPI programmer are good to go, this is awesome, millions of cores to use, as compared to somebody who's not an MPI programmer, to whom this is completely inaccessible. The second-place machine, which was in the number one spot for the past couple of years, is a Chinese machine that's got 10 million cores, with a custom interconnect and a lot of custom stuff going on in it. OK, before we move on, I

want to clarify some terminology here. It's important because we're going to be bandying this stuff about over the next day or so, and it's easy to get confused; as a matter of fact, it's easy because people who use this terminology often use it in a confusing and inconsistent manner, and I'm guilty of it as well. So what do these terms, cores and nodes and processors, and one that we haven't used yet, called PEs, actually mean? Well, a node refers to an actual physical board that you've got, either in your cabinet or in the workstation sitting on the floor of your closet, a physical board that has a network connection to it. It's the node at the end of a network: anything that's got its own network connection, no matter how much computing power is lumped in there. It might have multiple sockets in there with a bunch of cores, or it might just be one socket with 12 cores in one processor. So a node is basically anything that's got its own network endpoint. And again, in a modern data center that you'd walk into, a big warehouse full of cabinet after cabinet after cabinet, the node amounts to one of those blades that you'll pull out of a cabinet. On that blade you will find a physical chip; that's a processor. So processor should always refer to a physical chip, basically something you can order off Amazon: I want to buy a faster version of the processor for my workstation, and you get a chip and you plug it into the socket; that is a processor. Now, processors today always have multiple cores, so the core is the actual independent little CPU within that processor that can run its own program.

So a core is basically this thing that can independently run its own program and, theoretically, doesn't care what the other cores are doing; it can run a serial program off on its own. That's the core. And last, because these terms are thrown about and used interchangeably where they shouldn't be, people who do a lot of parallel programming long ago came up with the term processing element, or PE as it's often abbreviated, and we'll actually use this in our codes. A PE basically says: OK, however you put together this machine you built, however many processors and nodes and everything else it has, each one of the pieces of that machine that can run its own separate serial thread, let's call that a PE, a processing element. So PE is a nice term that gets rid of the ambiguity here, and the misuse of processor and core in particular; we'll call things PEs as long as they can run their own separate program. So if we looked at a big MPP in a data center somewhere, we walk in the door and say, oh my god, this thing has got 200 cabinets in here, each one of those cabinets has a dozen blades in it, if we pull out one of those blades it's got two processors on it, and if we look at each one of those processors it's got a dozen cores, well, then our PE would be each one of those cores inside there. If we do the math, all the multiplication I just described, we can say, oh, we've got 57,600 cores in this room, and so 57,600 PEs in this room. So PEs are what we'll think of, ultimately, as the thing that we can separately have running its own piece of our weather map, or anything else that we break our problem up into.

I may be guilty of misusing the terminology here and there, but through context you can usually tell what I mean, or what anybody else means; that's why people do abuse the terminology so much. The network, we're going to find out, is largely hidden from us with MPI; we're not going to have to worry about it, because a wonderful thing about MPI is that it makes the network, from the programming perspective, mostly go away. We could delve into it if we wanted, but in general you want your codes to be pretty portable and you want to move them back and forth. Of course, the actual network does have performance implications; just because our software abstracts it away doesn't mean it's not really there affecting things. So you should understand a couple of simple things about the networks that connect things together, and these couple of simple things I'll point out will suffice to give you a pretty good understanding of how they perform. We could dive into lower levels and lots more detail, but unless you're building the machine and the network, you probably don't care. To understand the general performance of a network, you only need to understand the latency, the bandwidth, and the shape, or topology, of the network. The latency of a network is basically the time lag to send a very small message between any node and another node; you can define it as the time for a zero-byte message, the smallest message you can send from one node to another. That's the latency, and if you've got a code that's sending lots of small messages, latency might be the dominant factor that determines its performance; if the network's got poor latency, the code is going to spend a lot of time waiting for these little messages to come across. On the other hand, let's say you've got a code that doesn't send many messages, but they're very large; each one of the messages has a lot of data. In that case, you're more concerned about the bandwidth. The bandwidth is the speed, megabytes per second or gigabytes per second, at which your messages can be sent between any two nodes in the network. If we graph the time it takes to send a message, we can see it's got this latency attached to it, and then, depending on its size, the bandwidth determines how long it's going to take. Some applications send a lot of small messages, some applications send large messages, so latency or bandwidth can each be very important for you. The actual connections, the shape of the connections between these things, is also very important, because that will ultimately determine how much the messages more or less interfere with each other.

Let me give you a quick example of how these things have evolved into the sophisticated pieces they are. If you build your own network with Ethernet, that's kind of a baseline way to throw together a quick cluster, and probably any of you working in computer labs have Ethernet connecting things together. You basically take your Ethernet and daisy-chain everything together, and you've got one wire connecting everything. That's convenient, because if I get another workstation, I can just plop it down and run the Ethernet to it, and everybody's happy. So this is a nice way to put together a computer cluster, and it works fine as long as one person here is browsing the Internet, another person is doing a little bit of programming, and another person is watching a movie; we're all doing different things at different times, and this network can keep us all happy. However, most parallel programs don't have that kind of asynchronous, random behavior. Instead, things like our weather map tend to have the exact opposite behavior: we tend to do a lot of computing on our weather map, compute the weather, and then at the end of our time step, so we just computed what the weather is like at, say, 1:03 a.m. on this day, and now we want to move on to 1:04 a.m. for the next time step, at that point we all want to exchange information with our immediate neighbors. So at that point we're all trying to use the network at the same time, and something like Ethernet gets overwhelmed and becomes a bottleneck. Networks being bottlenecked by parallel codes is the number one problem we have to worry about when designing a network, and it will happen routinely with something like Ethernet, because it's not designed for this kind of pattern where everybody wants to communicate at once. So instead

wants to communicate at once so instead

we build a network like this we'd have

each one of our our pcs connected to

each one of the other ones independently

and then if these two want to

communicate these two can communicate

without interfering and that's great and

nobody can argue the complete

connectivity is a fantastic network it

just turns out to be completely

impractical to build at any kind of

large scale because the way the number

of connection scales so you know

basically N squared so if we had a you

know a thousand nodes here we need half

a million connections between it so you

can't build any reasonable size cluster

with complete connectivity but it is the

ideal so it's only the ideal you know

the

things independently connected you can

try to fake it if you build a small

cluster you'll buy something typically

called a crossbar and a crossbar tries

to fake complete connectivity it's the

central point that tries to look

maintain the appearance that every node

is independently connected every other

node but the reality is there's still

some saturation limit in here and it can

happen in various ways that ends up

putting that the that that the lied to

the test whenever again in a scientific

code all the nodes want to communicate

at once and so you hit that saturation

point and then you find out that a

crossbar isn't actually completely

connected network so what do engineers

do to try to get around this that is

they build tree networks

The simplest tree network you could build would be a binary tree. A binary tree would basically have our compute nodes here at the bottom, with network routers above to connect the nodes, and then any node can talk with any other node on this network, which is a nice feature. It's also nice that if we had a cluster of four nodes and we wanted to make it bigger, we'd just buy four more nodes and some more networking hardware, plop it down next to it, connect it up, and we've got a bigger cluster, right? So trees are nice in that they grow in an incremental fashion. So far so good. The problem is that we end up with this top point in the network becoming oversaturated with communications, because it has to support the communications between this entire half of the machine and that entire half of the machine. As a matter of fact, this problem of communicating across the middle of the machine as we scale things up has a term to characterize it, the bisection bandwidth. As the term suggests, it basically gives you an idea, if we bisect the machine, of what bandwidth we have available to us, because that's really the concern in building a big tree network: do I have enough bandwidth at the bisection point, the worst point in my network, to support all the communication that takes place between this half of the machine and that half of the machine? If the network looks like this, the answer is going to be no; the larger we build this network, the more this top point gets stressed out, without relief. So instead we build a

fat tree, and a fat tree is where we add in more connections as we go further up the network, to relieve that congestion. Here, all these little boxes at the bottom are the compute nodes, and these circles are network connection points, and in this case, as we go further up in the network, we have more and more connections to try to relieve the congestion at the top. This is a fat tree, and it's a very successful way of dealing with this problem. The formula for how many more connections you should have toward the top is a parameter: how many connections you have going up and down, and how many cross connections you have, is something you can vary. And so if we look at the topologies of these various fat trees, we'll find different ones out there in the wild. Here are some machines that have been somewhat well-known large machines over the years; if we graph their topologies out, you can see that they have different patterns and different performance characteristics. Some of them are good at things like broadcasting, and some of them are good at things like gathering data, all things we'll find out MPI allows us to do. So network topologies can be very sophisticated, and they're constantly evolving; it's not a solved problem by any means, it depends on the balance you want in your machine, and there are new ideas and designs coming out all the time. Things like dragonfly networks are becoming quite trendy now. At any rate, this is an ongoing area of research and hardware evolution. So what about the very crude 3D torus, which is also

the very crude 3d torus which also is

the other alternative to fat tree so

basically if we look at large networks

are going to be in that tree or they're

going to be a torus and a torus which is

a 3d torus is again not sophisticated at

all compared to a fat tree it is just a

3d grid now the nice thing about a 3d

grid is that it tends to map pretty well

to a lot of scientific problems which is

a 3d problem you've got like our weather

map right our weather is actually it's

actually 3d we've been looking at is too

deep it's 3d problem are we going to 3d

problem we break it up into a bunch of

pieces it's going to map pretty well in

this network so a 3d torus is a nice way

to map our problem right onto the

network and make sure if the data

communications is going to be efficient

our nearest neighbor is Pennsylvanian

communicates with ohio it's got one

network connection it's not interfering

with down here with some with

Pennsylvania on the other side

communicating with Jersey so we've got a

lot of nice characteristics of mapping

and 3d tours and our problem

The torus term corresponds to the fact that it's nice not to have to tunnel data back through the middle of the network, so we actually wrap the ends around. I haven't drawn that connection here, but A and B actually have this wraparound connection on the outside. That's what makes it a torus rather than just a 3D grid, and that's a nice characteristic, so you'll find it in all of these machines.
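As an aside that is not in the talk itself, the bookkeeping behind that wraparound is simple enough to sketch. This is a minimal illustration, assuming a row-major rank layout; the function names are mine, not MPI's, and in real MPI code MPI_Cart_create with periodic dimensions plus MPI_Cart_shift do this for you.

```python
# Sketch (not from the talk): how wraparound ("torus") connections turn
# into neighbor ranks, assuming a row-major layout of ranks on the grid.

def coords_to_rank(x, y, z, ny, nz):
    """Row-major mapping from 3D grid coordinates to a linear rank."""
    return (x * ny + y) * nz + z

def torus_neighbors(x, y, z, nx, ny, nz):
    """Ranks of the six nearest neighbors; the modulo wraps the edges."""
    return {
        "-x": coords_to_rank((x - 1) % nx, y, z, ny, nz),
        "+x": coords_to_rank((x + 1) % nx, y, z, ny, nz),
        "-y": coords_to_rank(x, (y - 1) % ny, z, ny, nz),
        "+y": coords_to_rank(x, (y + 1) % ny, z, ny, nz),
        "-z": coords_to_rank(x, y, (z - 1) % nz, ny, nz),
        "+z": coords_to_rank(x, y, (z + 1) % nz, ny, nz),
    }

# On a 4x4x4 torus, a node on the x=0 face reaches its -x neighbor by
# wrapping around to x=3 instead of tunneling back through the middle.
print(torus_neighbors(0, 2, 1, 4, 4, 4)["-x"])  # rank of (3, 2, 1) -> 57
```

The modulo arithmetic is exactly the wraparound connection described above; without it, the nodes on the faces would have no outward neighbor at all.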

So 3D toruses and fat trees represent the networks you're going to find at large scale. If you come across clusters, or build your own cluster, it will maybe have something like a crossbar network in it, though people today are increasingly getting sophisticated and building fat trees even on relatively small clusters. So these are the networks you're going to find when running MPI codes. We're going to find out that MPI hides most of this from us, but you should at least be aware that the shape of the network, and the bandwidth and latency of the network, are going to have an impact on performance, depending on what your MPI code is asking of the network.

GPU architectures, as I mentioned briefly, are different. I won't go into them here; I just figured I'd at least throw the slides up so you can see them. The many thousands of cores in a modern GPU, which will have 4,000 or more cores, even 6,000 cores, are extremely simple. They're not full-blown CPUs and they can't run independent threads. So we won't dive into that here, nor into Intel's approach to it.

Instead, let's look at the top 10 systems. There's a Top 500 list that comes out twice a year, and the last version of it was in June; these are the top 10 machines in the world. It's based on a benchmark that you may or may not think is applicable to you, the LINPACK benchmark, but at any rate you need some benchmark in common to rate these machines. If we look at these top 10 machines in the world right now, you'll see that they all have hundreds of thousands of cores, some of them 10 million cores. All of this implies, even though nobody explicitly says it, that these are MPI machines. You program these machines with MPI. If you don't use MPI on these machines, you can't run things at scale, and you probably shouldn't be on these machines at all. About half of them also have accelerators of some sort or other.

If we look, you have some NVIDIA hardware stuck in here, and here's the Intel Xeon Phi, Intel's version of an accelerator. So there are a lot of accelerators in the mix too, and that's an important thing to note: if you do want to use one of these top machines, being able to program an accelerator, a GPU, is important. And I think that's all we'll go into here, other than to point out the fact that if you're an MPI programmer you can use these machines, and if you're not an MPI programmer you probably aren't allowed on these machines. Parallel I/O, again, we won't go into in any more detail here.

The last thing I'll talk about is where this stuff is headed, so you have some idea of the roadmap in the immediate future, the part I can pretty much guarantee. I'm not going to go into a lot of speculation here; instead I'll show you where either the momentum or the outright funding and roadmaps are in place to guarantee what things will look like for the next couple of years. The roadmap for the next couple of years is very definite, and maybe five or six years out you can make some pretty good guarantees as well, and MPI is a big part of all of that.

So today petaflops computing is everywhere; there are well over 270 machines that have breached the petaflop barrier. And everybody's got an exascale project going on; every substantial region right now is worrying about getting to exascale. To put exascale in a different perspective: the world's biggest machine in 2004 was Cray Red Storm, and an exaflop is equal to 23,000 of those machines. Or take a middle-of-the-road GPU of the kind you're liable to find around your department, an NVIDIA K40 at 1.2 teraflops: an exaflop is equivalent to eight hundred thirty-three thousand of those. So these are big machines.
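Those comparisons are just arithmetic on peak flop rates, and they're easy to check. The 1.2-teraflop figure for the K40 is the one quoted above; exact peak numbers vary by precision.

```python
# Back-of-the-envelope check of the exascale comparisons quoted above.
EXAFLOP = 1e18       # flop/s in one exaflop
K40_FLOPS = 1.2e12   # ~1.2 teraflops for an NVIDIA K40, as quoted

gpus_per_exaflop = EXAFLOP / K40_FLOPS
print(round(gpus_per_exaflop))   # -> 833333 GPUs per exaflop

# Working backwards from "an exaflop = 23,000 Red Storms" gives the
# approximate size of that 2004 machine in teraflops:
red_storm_tflops = EXAFLOP / 23_000 / 1e12
print(red_storm_tflops)          # roughly 43 teraflops
```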

The reason, as I showed you briefly in an earlier graphic, that we were pretty sure we'd get there soon is that we've got the history behind us showing steady incremental progress, making it not too hard to extrapolate out a couple of years, and at one point in time we were pretty sure we were going to get there. On this chart, the line on the bottom is the slowest machine on the Top 500 list, this is the fastest machine on the list, and this is the cumulative sum of all 500 machines. We were pretty sure we were going to get to an exaflop right around 2020, but if you read into this a little, if you do a bit of regression on this data, you can see that we're tailing off somewhat, and now we're guessing more like 2021 or 2022 for that first exascale machine to show up. But it is very competitive, and there could be some surprises along the way.

I'll skip this slide, but I will point out some of the obstacles ahead that are affecting this roadmap. The groups that draw up these exascale roadmaps, the ones trying to deploy exascale machines, have made public their priorities and concerns, the obstacles that have to be surmounted before they're going to get there, and these are some of them right here.

Energy efficiency: energy and cooling and everything else is extremely important and difficult at this scale; we're talking about machines that use at least 20 megawatts of power. There are also lots of problems with reliability. You've got millions of processors in one place, and keeping them all going is, well, impossible actually, if you think about the mean time between failure. If I told you a processor only fails once every 10 years, you might think that's not bad at all, but if you put 10 million of them in one place, it means they're not all going to keep running for even a couple of hours at a time. So building a big machine with 10 million cores means fault tolerance and reliability can no longer be afterthoughts if you want to use the whole machine. So there are lots of problems like this.
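That mean-time-between-failure arithmetic is worth doing once. A quick sketch, using the numbers above (10-year per-processor MTBF, 10 million processors) and the standard assumption that failures are independent:

```python
# If failures are independent, the expected time to the first failure
# anywhere in the system is the per-unit MTBF divided by the unit count.
SECONDS_PER_YEAR = 365 * 24 * 3600

per_processor_mtbf = 10 * SECONDS_PER_YEAR   # one failure per decade
n_processors = 10_000_000

system_mtbf = per_processor_mtbf / n_processors
print(system_mtbf)   # -> 31.536, i.e. a failure somewhere every ~32 seconds
```

Under these assumptions the full machine loses a processor somewhere roughly every half minute, which is why fault tolerance can't be an afterthought.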

this this is this is a list put forward

by the the advanced scientific computer

visor committee for example so for those

For those of you interested, you can stare at this list and ask which problems you might want to contribute to. If you're interested in being on the leading edge of this stuff, there's a lot of opportunity to help solve these problems, and they're still outstanding problems. It's not as if exascale machines are some inevitable evolution that's going to keep cranking away; these problems stand between us and making those machines effective. The roadmap to get there has had a couple of quantum jumps, one of which has already happened and one of which we're in the midst of.

The first one that happened was the boost from accelerators. If you look closely at the Top 500 list over time, it's had a couple of jumps. The first one came back when accelerators became popular, and that really kept the momentum going: serial performance petered out around 2004, and yet the Top 500 data over the past decades kept marching forward somehow. The first thing that really kept things on track was the boost from accelerators, because GPUs helped bring a lot of flops to these machines.

The second boost, which is taking place pretty much right now, is the move to 3D circuitry in electronics. This has happened industry-wide, in commodity devices as well: stacked electronics. It has become really important because Moore's law gives us the density on a single chip, but stacking those chips on top of each other is not a trivial thing. We're not talking about stacking at some package level; we're talking about stacking silicon dies on silicon dies. Stacking them in 3D has bought us continued performance improvements that have hidden Moore's law issues for the moment. So the increased bandwidth and other things that have kept chugging along, a lot of that is due to the 3D packaging that's happening now.

that's happening now the third boost

which may be needed to get us the next

scale machine is moving to silicon

photonics which is basically saying that

copper you know wires to connect

everything together have a lot of

drawbacks to them join things fiber

optics is you know obviously superior in

many many ways right we know that our

our big networks are wide scale networks

are all fiber optic now many of you have

fibre coming into your home well fiber

optic connections at the network level

on these big machines are becoming

commonplace but maybe maybe to the

integrated circuit level are really

important to get those benefits if we

want to get the network efficiencies

that we need to build these exascale

machines so these are these are

important quantum jumps in technology

that show up at the bleeding edge and

then filter their way back down to your

desk top one one really interesting

One really interesting thing, a little bit wonkish, a little bit technically specific, but I think actually really nifty, is that MPI people get a benefit from what for everybody else is a huge problem. Look at where the power is actually being spent in modern computing. It used to be that almost all the power in scientific computing went into actually doing the number crunching, getting the data into the registers and getting the answer out. That's where all the power in a processor went, where all the power in your machine went: doing the flops, the number crunching.

Today we've hit the point where most of the power is spent moving data around the machine. The actual number-crunching part of computing takes the least power, whereas moving data between all the memories, because you've got all these memory hierarchies and network connections, has now become the dominant consumer of power. Data movement is now the biggest consumer of power, as of this year. And MPI, as you're about to learn, gives us control over where we move data. Most people are fearful of this future in which controlling data movement becomes really important to getting good performance out of these machines, or even making it possible. As an MPI programmer you have that capability; whether you want it or not, it's a responsibility in MPI programming to control the data movement. So as the world becomes more fixated on moving data around efficiently, instead of just doing the number crunching in registers, you as MPI programmers are really in a good position to deal with that problem. We'll come back to that briefly once you know what MPI is, which will be shortly.

Another way of looking at it: if the floating-point operations become essentially free at some point, because they take so little power, and all of your time is spent moving data around the machine, then eventually we're going to worry about optimizing data movement more than anything else. Not eventually, in fact: it's already happening. And MPI is well positioned to do that.
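To make that concrete with a sketch that is not from the talk: the energy figures below are assumed, illustrative orders of magnitude (commonly cited estimates put a double-precision flop at tens of picojoules and an off-chip DRAM access at a nanojoule or more), but the conclusion is insensitive to the exact values.

```python
# Illustrative only: assumed order-of-magnitude energy costs, not from
# the talk. The point is the ratio, not the exact numbers.
PJ_PER_FLOP = 20          # assumed: ~tens of pJ per double-precision flop
PJ_PER_DRAM_BYTE = 150    # assumed: ~1.2 nJ per 8-byte off-chip access

# A streaming update like y[i] += a * x[i] does 2 flops per element but
# moves 24 bytes (read x, read y, write y) to and from memory.
flop_energy = 2 * PJ_PER_FLOP
data_energy = 24 * PJ_PER_DRAM_BYTE

print(data_energy / flop_energy)   # -> 90.0: data movement dominates
```

Even if the assumed numbers are off by a factor of a few, the energy goes into moving the operands, not into the arithmetic, which is exactly the shift described above.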

OK, I won't belabor these points. This is another way of looking at it: the old constraints that used to be the main problems in programming were things like how fast your clock is and how many flops you're getting. Now, in modern machines, it's power, as I mentioned; data movement, as I just mentioned; and concurrency, using all of these things in parallel, that are the most important things in programming, and that's where, with MPI, you're well positioned. Memory scaling: computing capability has grown way faster than memory bandwidth, so that's related to data movement, to having your data in the right place, and MPI is well positioned to deal with that.

Locality: where is your data? In a modern machine you've got, going out from the CPU, registers and then multiple levels of cache to get the data into before you hit regular memory, and now, behind the regular memory, we've got non-volatile kinds of memory, flash memories, SSDs, things like that, and then ultimately maybe some spinning disks for long-term storage. Where is your data sitting in that hierarchy, and where should it be? These are things MPI is particularly well developed to deal with: data locality. Heterogeneity: as I showed you in the top 10 list, these machines have processors and accelerators, multiple processors in each node and multiple cores on each processor, so the machine is not some regular, simple building block; it's got a lot of pieces and moving parts. These are all things that MPI is well placed to deal with. The last one, the reliability issue I mentioned, is still very much an outstanding problem. OK.

One other last thing I'll mention, since we're talking about architectures: if you're keeping up with this stuff, or if you're going to dive into this field, you'll quickly find out that people are interested in architectures that are substantially different from what we're doing today, not just a rearrangement of the pieces we're using now. Today we're doing things with standard silicon electronics, CMOS electronics if you will; that's the silicon fabrication technique almost everything today is based on. And our computers look pretty much the same as the first computers we built in the 1940s: a von Neumann architecture, where basically you've got registers and you've got memory and you move things back and forth between the two. That's where we are today. But maybe the answer to the end of Moore's law is instead something drastically different. What if we go beyond silicon transistors?

So we keep the architectures of today, the registers, the kind of computers we've got now, but we build them on something with higher performance that doesn't have the same thermodynamic issues. Graphene, for example: people are trying to make transistors out of graphene. That is an obviously desirable thing; it preserves a lot of the technology and techniques we have but allows us to continue to move forward, to at least reset Moore's law in some other domain. So there's a lot of hope for that, but those things are still pretty much in the laboratory. When you read about graphene transistors, they're sitting on a lab bench somewhere; they're not in fabrication.

How about if we abandon the von Neumann architecture and go for something radically different in design, like quantum computing? This has certainly got an awful lot of mindshare in the world of computing. Here we'd be doing something that's certainly not silicon-based electronics and certainly not a von Neumann architecture; it's very different, but maybe it gets us a whole different world of capability. Quantum computing is a fascinating, interesting area.

Maybe one of the most interesting things about it is that the people who are most expert in this field have very diverse opinions on when near-term practicality is going to materialize, on how soon any of this is going to be real, how close we are to actual devices and real applications. It's unusual in that respect: usually, as you get toward the experts and knowledgeable people, a consensus forms. That is not the case here. You'll find that people deeply involved in this field have very differing views about whether we're a couple of years away from practical quantum computing, at least in some narrow domains, or whether we're 15 years away, and you'll hear both of those opinions from people who are well informed, not just from somebody getting secondhand information. So it's a very interesting and rapidly developing area, and a lot of the literature is accessible for those of you interested in it, but it's hard to predict.

The last alternative worth talking about, the one closest to coming to bear on any practical reality, is something that uses our modern electronics techniques, CMOS silicon-based electronics, but with a very different non-von-Neumann design. Neuromorphic computing is the practical example: building a computer that looks, in this case, like a neuron. Machine learning and deep learning have become wildly successful in many different areas, and they're based on building, in software, basically, and using GPUs, architectures that resemble biological neural nets. The idea that we could instead implement that directly in silicon, and thereby remove the need for this translation, is not only irresistible but also practical. It turns out there are more than a few companies, from IBM on out, developing neuromorphic computing devices, which have had varying degrees of real-world effectiveness. So here's a different type of computing that's built on silicon electronics, so how to fabricate it is not an unknown; that part isn't iffy, they can definitely do it. How much success they'll have in various applications is still an open question, but it certainly has had some early successes.

So Moore's law ending is not necessarily the end, nor should we be freaked out that Moore's law is coming to an end, because it wouldn't be the first paradigm shift in computing. We've gotten very spoiled by this integrated-circuit era. Computing came out of mechanical devices, Hollerith cards for doing census surveys and things; the first computers were built with relay-type electronics, then vacuum tubes, then discrete transistors. So computing went through a lot of upheaval and a lot of revolutions over the years; it's just that we've been settled since the late 60s into this integrated-circuit era, and we think that's all there is to computing. In that sense it's kind of overdue for a paradigm shift, and it wouldn't be the first.

There's also now, finally, a big appreciation of something else, since Moore's law is no longer giving us the ability to say that however poor your programming or your approach is, it'll run faster next year. "Computing time is cheap compared to developer time" was a mantra that became quite popular over the past twenty-some years, as we took it for granted that computing power was cheap and boundlessly growing. Over the past five or six years there's been more realization that maybe we need to go back to knowing how to program, because things aren't just automatically speeding up anymore. Saying "my code's fast enough" even though it's using two percent of the capability, that's OK on your laptop. But if you're going to run something in the cloud, which means somebody's data center somewhere, a data center taking megawatts to run, then saying it's running at two percent of its potential speed because you didn't have time to do it right is incredibly wasteful.

So there's that shift too, and you as MPI programmers are well positioned to take advantage of it. I'd like to sign off here with a fact that puts it all back in perspective. We talk about machines that take 20-megawatts-plus, these exascale machines, to model the human brain; it's important to note the human brain takes about 20 watts to run. So there's an awful lot of room for improvement right there: we're hoping to be able to run a human brain in real time with 20 megawatts. So there's an awful lot of room for improvement and development, and even though Moore's law and other things are approaching these depressing asymptotes, the world will remain exciting and there will be lots of development and evolution to come. I hope you're motivated now. We're about to jump into the actual programming, to get our hands dirty and start writing code. Fear not: I said this was all overview and buzzwords, but I'm going to refer back to a lot of this stuff over the next day and a half, so I wanted to put it all in one place.

Parallel computing is no fad. This is not something optional; we've been forced into it by physics, no getting around thermodynamics in particular. We have to go parallel, and that's why everything is parallel. And if you jump on board with the right approach to this stuff, which I guarantee you is MPI, you're going to get great utility out of it, not just now but for the indefinite future. In every roadmap for the big machines being built, all the exascale machines, the programming model for them is MPI. People hope some other things might come on board, but for every machine being funded right now to be built in the next five years, the baseline programming model is MPI; that's what they're assuming.

And again, the pieces fit together like this: you might program a single processor with OpenMP and do multi-threaded programming, and you might plug in a GPU and program it with CUDA, OpenACC, or OpenCL, but the second you go beyond that to multiple nodes, it's MPI.