Hi, welcome to the MPI workshop presented by the NSF XSEDE program. This was originally a two-day event with hands-on exercises that you can now view at your own pace; to get access to the exercises and the slide content, look at the link below.

That gives me the opportunity to jump into the overview. It's about the only collection of buzzwords without programming that we'll do over the next couple of days, but to some extent understanding the buzzwords and jargon is a necessary evil here, so I'll cover it all now, and it will give you a good orientation as to where MPI fits in among all the other parallel computing options that you may or may not be aware of; you certainly will be by the time we're done. That way, when we talk about things in practical terms as we jump into the actual programming and hands-on work, I'll be able to refer back to these concepts and they'll make sense. As I said, this is the 50,000-foot view. We'll also have an outro talk at the end, a shorter version of this, where we'll compare and contrast the various parallel programming approaches in a little more detail, because at that point you will know how to do MPI, you'll be real MPI programmers, and that's when we'll really dive into the software comparison. This talk is a bit more hardware oriented; that one will be a bit more software oriented.
So the first theme here is: why do we need MPI? Why do we need large-scale computing, exascale computing (a term I'll define shortly)? What's the point of all of this? Many of you are coming in here with particular applications in mind and know why you need MPI; some of you may not. But it's helpful for all of you to understand the demand, the applications that drive the development of this stuff, because that's why it exists, it's how you're able to make use of it, and it gives you some idea of what the roadmap for the future will be. Is your investment in MPI a good thing? Is it trendy? Is it liable to be displaced by something else in a year or two? In computing, faddish things are certainly not unknown; as a matter of fact, they're the norm. So let's look at the applications that drive this stuff at the highest end, because that ultimately filters down to us.
At the big end, if you look at the largest machines in the world, and we will look at some of them in some depth, the problems they run are ones considered to be of such strategic importance that they justify building supercomputers costing 200 million dollars or more these days. Those machines exist because they have a lot of flops. Flops is a term we'll use a number of times over the next couple of days, and you'll certainly come across it if you stick around scientific computing at all. It stands for floating-point operations per second, and it's a nice way to characterize the horsepower of a computing platform: how powerful is it, how many floating-point operations per second can it do? It's also an important way to characterize the demands of your code: if you've got a numerical application, a question someone might ask of you is how many flops it takes to run it. These are 64-bit flops, by the way, I should point out; that has become standardized in scientific computing, so by default, when somebody says a floating-point operation, it's a 64-bit floating-point operation. That's not always necessary; as a matter of fact, I often come across people using 64-bit precision needlessly. But at any rate, it's the default when we talk about how many flops a machine does. It's funny that a lot of codes can benefit greatly, for memory bandwidth reasons among others, by going to 32 bits if they don't need the extra precision, and in the machine learning world today we're finding that 16-bit or even lower precision is actually desirable for many things. Nevertheless, 64 bits is often required in scientific computing, and so it's the default.
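As a rough rule of thumb, and this is my own back-of-the-envelope formula rather than anything from the slides, the peak flops of a machine is just the product of how much hardware it has and how fast that hardware ticks:

    \text{peak flops} \approx (\text{cores}) \times (\text{flops per clock cycle per core}) \times (\text{clock rate})

Sustained performance on a real code is usually well below that peak, which is part of why parallel programming done well matters so much.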
So how many flops do you need to run a large climate model? Well, it turns out you need a lot, and we'll look at what "a lot" means. I like the climate model as an example; it's one we'll come back to in various forms over the next couple of days, because the problem is easy to understand. In climate modeling you're essentially doing weather modeling, except over the entire globe, and you're trying to do it over decades, not for the next day or a 48-hour forecast. So it's not hard to imagine that it takes an extreme amount of computing power, and also very large amounts of memory, which is something else MPI gives you access to. Any of you who have memory-bound problems, or will find that you have them, where you simply don't have enough memory, will find that MPI is a way to get past that by lumping a lot of memory together. These very large problems, like climate modeling, demand it.

I've got a slide here that gives you a pretty good idea of how this plays out for large climate modeling problems, and you can see that the grid size really is the issue. In the early-to-mid 2000s, the grid size for a large climate problem would be something like a 200-kilometer grid: one pixel of weather, if you will, or really a voxel, since it's a 3D problem. 200 kilometers might seem reasonable on a map of the globe, but if you look out your window, there's a lot going on in 200 kilometers of weather; that's not particularly fine detail. And indeed, as you go up three orders of magnitude in computing power so that you can get to a 25-kilometer grid, all of a sudden more science emerges; fluid dynamics is like that, and there's a lot of physics captured that wasn't captured before. But even at 25 kilometers, if you look out your window, there's still a lot going on that you're not capturing, and consequently we need to go up yet another three orders of magnitude to get down to even a couple of kilometers. So this is a good example of how insatiable the computing demand is for an application whose importance I think we all appreciate.
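To make that "three orders of magnitude" a little more concrete, here is a rough scaling argument of my own (assuming the work grows with the two horizontal dimensions and that the time step has to shrink along with the grid spacing):

    \text{cost} \propto \left( \frac{\Delta x_{\text{old}}}{\Delta x_{\text{new}}} \right)^{3}, \qquad \left( \frac{200\ \text{km}}{25\ \text{km}} \right)^{3} = 8^{3} = 512 \approx 10^{3}

Refining the vertical levels or adding more physics on top of that only pushes the requirement higher.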
But sometimes we don't appreciate that the requirements of smaller-scale problems can be just as great. Take another fluid flow problem, which is why I picked this one, since it's also fluid flow: combustion. Now we're not talking about the Earth; we're talking about something that might be an engine-piston-sized problem, or, for a lot of the supercomputing world, things like steam turbines, things that fit in a room, a large room, but a room nonetheless. When you're trying to model combustion and flow in these problems, with all the multi-physics going on, it also requires incredibly large amounts of computing power, to the point where it requires the largest 200-million-dollar computing platforms. So it's not just a few large problems like modeling the cosmos or modeling the globe; supercomputers are required to model things accurately at the microscopic scale as well.
I'll give you one last example that goes in a rather different direction but is equally dependent on very large platforms: modeling brains. There are several brain initiatives, if you will, currently being funded at large scale in the United States, in Europe, and in China, and this group here, the Modha group at IBM, is maybe the best known. They've certainly been persistent; they've been at it long enough that you can see the evolution of what they're doing very well, as they went from a mouse brain, with 16 million neurons and 128 billion synapses, toward the human brain. They're not quite there yet, at 22 billion neurons, but they've been steadily working their way up. This is a problem that requires very large machines, to the point where this modeling right here, the human brain, is probably going to require an exascale computer.

Exascale is a term I've mentioned and should now define, because you'll trip over it all the time as you look at current discussions of computing. That's because we've all been racing toward, converging on, a machine that is three orders of magnitude faster, in terms of flops, than the petascale machines, the petaflop-capable machines, that we're working on today.
Today's large machines work in the petaflop range, 10 to the 15th floating-point operations per second, which is by many measures an incredibly impressive amount of computing power; it certainly takes a lot of hardware, a lot of money, and a lot of electrical power to do, but that's the range we're at at the moment. Because of Moore's law and other things that keep telling us performance will increase geometrically, we've been fixated on the next generation of machines, the exascale machines, for about the past seven or eight years, anticipating getting there. At one point it looked like we might arrive by about 2020; now we're thinking maybe a little past that, but everybody's racing toward it. What do I mean by everybody? Any strategically thoughtful organization or country or multinational body these days recognizes how important supercomputing is, and so the United States, the EU, China, and Japan all have exascale computing initiatives to build a machine that can do an exaflop: ten to the eighteenth floating-point operations per second. That's the level at which the human brain modeling I showed you starts to become viable; that's the level at which a lot of problems start to become more tractable, in climate modeling and many other domains I don't have the luxury of going into. I could spend hours talking about important problems that exascale will enable; it will allow the science to become really transformative and much more applicable to real-world problems.

So everybody's racing toward this exascale goal, and those developments filter down to the rest of us whether you need to be there or not. Some of you out there today, more than a few I'm sure, actually will be using these exascale machines as quickly as you can get your hands on them. But for the rest of you, even if you think departmental resources are the level you want to work at, everything filters down from that leading edge fairly rapidly in the computing world; we're not far behind, and I'll give you some examples of how yesterday's supercomputers are on the desktop today.
You're not getting to that level, though, without going very, very parallel. The main theme of this talk is that you can only use these machines, or even your desktop machine today, effectively if you're doing parallel programming. Serial programming is hopeless anymore in terms of getting any kind of reasonable performance out of any hardware platform, including your smartphone for that matter.

Here's a good illustration of that: a fairly typical generic benchmark, the SPEC benchmarks, a couple of generations of them over time. There are lots of benchmarks applied to lots of different domains, so we could pick different ones, but pretty much any benchmark that sits in the center of mass of generic numerical applications, or video games, or web rendering, is going to show this kind of curve, where somewhere around the year 2004, in most any graph of this nature, we lost what had been decades of continual growth, where every 18 months to two years computing power doubled. That happened for decades, since the early 60s, and somewhere around 2004 it quit happening. As a matter of fact, today we're already seven years or so behind where we would have been if it had kept going. It definitely stopped; things really leveled out, and this is quite visible in lots of different ways besides a benchmark; it's visible in the clock rates of computers and lots of other measures.

This right here is the baseline proof that serial programming is dead, in the sense of "if I just wait, next year the computers will be faster and more powerful and will run my code faster, so I'm going to write this really slow code in whatever programming language is convenient, but it's okay, it'll just run faster next year." That actually was true for many decades; you could count on things running twice as fast soon enough. It hasn't been true for a while now.
This is not because something called Moore's law, which everybody likes to attach themselves to, is dead. To put it explicitly, Moore's law is simply an observation Gordon Moore made back in the mid-60s that transistor density, how many transistors you can cram into a given area of a chip, would double about every 18 months to two years, somewhere in that time frame. Moore's law is not dead just yet; it's running into a wall pretty soon, but it certainly hasn't been dead over the past ten years while things have slowed down, so it's not responsible for that. As we just discussed, Moore's law has still been pretty healthy; the engineers have done a heroic job over five-plus decades now of keeping it going through multiple revolutions in fabrication technologies and the like. They've kept it going, so Moore's law keeps giving us more and more transistors all the time. We can't blame Moore's law for being dead, although, again, it really is finally starting to hit the limit. As a matter of fact, if you look into things carefully, you can make an argument that Moore's law did kind of die six or seven years ago: they still manage to keep cramming transistors in there, with things like 3D packaging and whatnot, but more and more of those transistors are lost to error correction and other issues that come with working at that scale, so the cost per transistor is actually going up for the first time ever. At any rate, Moore's law may be defunct soon, but it's not responsible for the plateau; what is responsible is something else we'll look at on the next slide.

First, though, I'll point out another way to see that Moore's law has continued and the engineers haven't disappointed us in giving us more transistors. Here are a bunch of different common, popular processors, and you can see that over however many decades you want to look back, the number of transistors crammed onto the chips keeps doubling every couple of years. Today's processors really do have a lot more transistors than processors from just a couple of years ago; the transistors have been delivered by the engineers.
So why have the speeds leveled off? Why has serial computing come to such a halt? This is the reason. Of the two days of material we're going to cover, this is probably the most important slide you could show somebody to demonstrate the importance of parallel programming and MPI, because this is the reason, right here, that parallel programming has taken over all of computing, and the reason that clock speeds stalled. Here are those clock speeds over the past three decades, and right around 2004 they leveled off; clock speed is another crude measure of computing performance you could pick instead of a benchmark, and it leveled off around 2004 too. Why? This is the reason, the reason parallel computing exists as such a dominant thing in computing today: the chips are running on the verge of melting, and that little bit of physics dominates modern computer design.

Today's chips run on the order of a hundred watts per square centimeter, and that's a hot way to run. It's not a difficult number to understand even if you're not a physicist: a hundred-watt light bulb is something you wouldn't want to put your hand on, right? It would burn you immediately. That's a lot of power, and that's how much heat we're having to dissipate through a square centimeter, less than a postage stamp of area, of a modern computing chip when it's running full-out. That's an impressive amount of heat to dissipate, especially if you don't want to go to exotic liquid cooling technology and just want to blow a fan over things, because that's convenient, or because the chip needs to sit in your lap in a laptop or live in your cell phone. The fact that modern chips run at that power density should impress you: we're well past how hot a hot plate is, this is much hotter than the surface of a hot plate, and we're coming up on nuclear reactors here.

So this right here is the physical limitation we ran into in 2004: as they crammed more and more transistors into a small area, just running the current through them and switching them fast enough became the problem. If you think about it, for any electrical engineers out there, running 100 watts at less than a volt through a one-square-centimeter area is basically 100 amperes of effective current running around that square centimeter, so it's impressive in multiple respects that these chips don't melt down as it is.
Again, we could resort to more exotic solutions like liquid cooling. Here's a picture of a Cray-2 from the mid-80s; this is not a new problem in supercomputing, by the way, we've had to deal with heat dissipation issues for a long time. The Cray-2 supercomputer in the mid-eighties had this really very artistic looking heat exchanger in front of it, full of Fluorinert bubbling away when the thing was running, to keep it from melting. So we could do that, but it's just not practical, right? You don't really want to worry about liquid cooling on your laptop.

So instead, that's why we're in the parallel world we are in today. The engineers said: we've got more transistors, but we can't keep running them at the same, faster clock rates, because we're cramming more of them into a small area than we have in the past. So what can we do to get performance out of them? How can we still get a benefit when we can't deal with the thermodynamics without doing something infeasible? What they did was recognize, and this revelation wasn't new in 2004, parallel computing has been around for a long, long time, that parallel computing had to make its way into the commodity world, the desktop world, your cell phone world. They recognized that, with all these transistors, we could do what we've done for decades, which is build a bigger, faster single core to run your serial code better, using all the tricks that more transistors buy us, adding more cache, all the things we've been doing. Or, instead of doing that, what if we broke this big core up into some smaller cores using the same amount of power? If we've got a hundred watts running into this single big core, what if instead we made four simpler cores? They're not going to be as clever, they won't have as much cache, they won't be able to use as many tricks, but these four dumber cores will, in net, have more performance than our one big core. If you actually want to look into the details, single-core performance scales roughly as the square root of the die area, and there are a lot of different metrics you can get into if you're interested; the finer details are not inaccessible. But the bottom line is that instead of one big core we'll have four or more smaller cores working in parallel.

Here's the kind of math that happened in the real world back when everything went from single core to dual core, around the turn of the century. Instead of having one core running at a given voltage and frequency, what if we take two cores and run them at 15% less voltage and a 15% lower clock rate? Because of the way these things scale, that's about the same amount of power, but now we've got roughly 1.8 times the performance. This is the math, right here, that makes parallel computing so compelling to the engineers. This is the reason it's not a fad, it's not a fashion, it's not something computer scientists interested in concurrent programming foisted on the world, and it's not something that's going to change anytime soon; it's the fundamental physics of computing.
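The back-of-the-envelope version of that trade, assuming the usual dynamic-power relation and performance roughly proportional to clock rate (the slide's exact "same power, 1.8x performance" figures come from a slightly more generous rule of thumb, but the conclusion is the same):

    P_{\text{dyn}} \propto C V^{2} f, \qquad \text{performance} \propto f

    \text{one core at } (V, f):\ \text{power} = 1,\ \text{performance} = 1
    \text{two cores at } (0.85V,\ 0.85f):\ \text{power} \approx 2 \times 0.85^{3} \approx 1.2,\ \text{performance} \approx 2 \times 0.85 = 1.7

Either way you run the numbers, trading clock rate for cores buys far more throughput per watt than pushing one core harder.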
Here's a good example of how this has made its appearance in generic commodity chips. If we look at a processor from around the turn of the century, it could deliver around 3 floating-point operations for every clock tick. If we look at a processor today, a Skylake processor, the kind that's in my laptop right here, nothing that exotic, it delivers close to 2,700 floating-point operations for every tick of the clock. This is an example of how incredibly parallel a generic modern processor is compared to a processor from just 20 years ago. This is where the parallelism has crept in, and if you can't take advantage of it, you can't use a modern processor well at all. If you're writing some simple-minded program that's inherently serial, you can see that you're only going to get a tiny fraction of the performance capability of that processor.

Now, the way that parallelism has actually made its way into a modern processor is through a couple of different things. There are multiple cores, certainly; a modern processor like that Skylake has got 28 cores, so a lot of cores going on. Each one of those cores also has a lot of vector-type instructions behind the scenes; these are instructions that work on a lot of data simultaneously. You might hope that the compiler can do a lot of this magic for you, and sometimes that hope is justified, with really simple loops; sometimes it needs a little bit of help. At any rate, you might hope that this stuff stays hidden, but if you want to use multiple cores, you're definitely going to have to start to become a parallel programmer.
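As a minimal sketch of the kind of "really simple loop" a compiler can usually vectorize on its own (this is illustrative C of my own, not one of the workshop exercises):

    /* Independent iterations and simple memory access: a vectorizing
       compiler (for example gcc or icc at -O3 with the native architecture
       enabled) will typically turn this into vector instructions by itself. */
    void axpy(int n, double a, const double *x, double *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }

Once a loop has branches, dependencies between iterations, or scattered memory accesses, that free lunch tends to disappear, and that's when the compiler needs help.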
But at any rate, you can see that parallel computing is not just some supercomputing kind of thing out there; it's your laptop, if you want to use it well, or your smartphone. When modern smartphones come out, Apple's phones have got what, six cores, and the same goes for Samsung's phones. You can see that even people worried about how good their smartphone is are worried about how many cores their phone has now. Everything is parallel.
So let's put all of this stuff I've talked about together in one place to give you the picture. The transistors have kept coming; that's Moore's law, so Moore's law hasn't abandoned us. Single-thread performance, though, has really plateaued, so being a serial programmer is a dead end. You can see the same thing in the clock rate, if you want a very simple metric: computers nowadays run at a couple of gigahertz, plus or minus, which is the same speed they were running ten years ago. Clock rates haven't gone up, and serial performance per processor is the same, because the chips can't take it; they would melt if we kept cranking more power through them. Instead, what has happened is that the number of cores per processor keeps climbing, and this is where your MPI programming is going to turn you into somebody who, instead of being dismayed by these facts, is happy, because you can take advantage of it; you can exploit this.

So what does parallel computing look like? How do we actually do this stuff in parallel? That takes us back to the fundamental problem of parallel computing: how to break things up into pieces so that they can be done in parallel. This is the ongoing discussion about how accessible parallel programming is, and how likely you are to have success applying it to your particular problem and application. Here's an example from way back, from The Mythical Man-Month, which is a classic; any of you who do serious software development and software engineering have probably read it, or certainly should. Brooks's law from that book basically asked: if a woman can make a baby in nine months, can nine women make a baby in one month? That's absurd, right? But nine women can make nine babies in nine months. This exposes the idea in parallel programming that things can be done in parallel; sometimes it's obvious, sometimes it's not quite so obvious, and sometimes it takes a little bit of different thinking. That's the notion we're going to get comfortable with over the next days as we apply MPI to different problems.
So let's look at this from the perspective of a weather problem again. I like weather because it's intuitive; everybody, no matter what background you're coming from, can kind of get the idea. If we wrote a weather model back in the 1970s on a serial computer, we would basically do the same thing you do with many, many scientific problems: you'd break your weather map up into a grid. So we break our map of the United States here up into a grid, and each one of the mesh points on the grid is going to have some wind velocities, some temperatures, pressures, and the various other things you want to use to compute your weather. What your CPU would do is go over this whole grid and update it for the next step in time, based on the way the wind is blowing and the atmosphere is flowing, so to speak. That would be the classic serial way to write this code.

Well, it so happens that a fellow named Richardson figured out, in 1917, a way we might do this in parallel. Richardson had this idea that if we had a bunch of meteorologists in a room, in his case he envisioned something like 64,000 meteorologists, they would be able to break the weather map down into a bunch of small patches, and each one of those meteorologists would do the calculations, in his day with pencil and paper, for their patch of the map. You might wonder why he was worried about parallelism in 1917; it's because the work would be done with pencil and paper, and one person alone could never keep up. Each meteorologist would do the calculations required to figure out the weather in their local neighborhood based on what's happening in the neighboring patches, because your weather only immediately depends on the weather in proximity to you. If we're sitting here in our corner of Pennsylvania, we only care about the weather that swung over the border from Ohio; that's the weather that's immediately coming at us. The weather that's happening out in Indiana will get here eventually, but for my immediate calculation I only need to know about my immediate neighbors. So this was Richardson's idea: 64,000 meteorologists, each assigned a little patch of the map, and after every calculation they would exchange with their immediate neighbors the information needed to advance the calculation one step. That was Richardson's idea of how to compute the weather in parallel. Today we realize it was very naive, in that pencil and paper isn't going to cut it; you need a much finer grid and much more detailed modeling than you can ever do with hand calculations. But his paradigm is actually very sound, and that indeed is how we do modern weather modeling.
So here is how we would do modern weather modeling on something like your laptop with multiple cores. If we had a laptop with four cores, for example, which many of you out there do have, and we put the weather map on it, the way we might do this so that we get some speedup from those cores is to break the weather map up into four pieces. Each one of those cores would be responsible for running essentially the serial approach on its own section of the map: it would run the old-school serial code over its corner of the map, and the other three neighbors would do the same thing. Then, at the end of each time step, they exchange the information they need to exchange, which is just along the borders: which weather is flowing from here to there, and which weather is flowing from there to here. They exchange that information along the borders and then continue on with their own calculations. The simplest way we might do this with four cores on a laptop would be to use a programming model called OpenMP, which is for multi-threaded, multi-core programming. It basically amounts to taking your old serial code and introducing some hints to the compiler, things called directives, which allow it to break up these simple loops among the four cores. That's an approach that would work well with four cores on your laptop, and indeed we might hope to speed things up about four times; it would not be unreasonable to speed things up something like 3.8 times, close to four. This is like having four of the meteorologists Richardson envisioned in a room, each working on one corner of the map.
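Here's a minimal sketch of what that directive approach looks like, on a hypothetical stencil update rather than the actual workshop code; the single pragma is the "hint" that lets the compiler split the loop across the cores:

    /* Update every interior point of the grid from its four neighbors.
       The pragma asks OpenMP to divide the rows of the loop among the cores. */
    void update_grid(int nx, int ny, double *new_t, const double *old_t)
    {
        #pragma omp parallel for
        for (int i = 1; i < nx - 1; i++)
            for (int j = 1; j < ny - 1; j++)
                new_t[i*ny + j] = 0.25 * (old_t[(i-1)*ny + j] + old_t[(i+1)*ny + j] +
                                          old_t[i*ny + j - 1] + old_t[i*ny + j + 1]);
    }

Compile with the OpenMP flag (for example -fopenmp with gcc) and it uses all the cores; compile without it and the pragma is simply ignored and the same source runs serially.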
A different approach you might take is to use GPUs. For any of you who have been paying attention, GPU programming has become incredibly important in modern computing, because GPUs offer a lot of flops for the money, and for the power; those are the two good justifications. GPUs have a whole lot of flops: if you look at how many flops you get out of a GPU, an accelerator card you can buy off Amazon and slap into your workstation, per dollar, it's very competitive versus buying a better CPU, and if you're concerned about building a big machine room, you get a lot of flops per watt too. If we're concerned about how much power it takes, because we need to buy the power and cool the room, then those cards are incredibly efficient. So GPUs have become really important, and in some areas they've taken over; machine learning, for example, is pretty much all GPUs anymore.

GPU programming is a little bit different way of approaching the problem. It says: let's take our serial code, and we know that the GPU over there has a lot of flops. But one thing about GPUs, which we don't have time to get into although I'm not misrepresenting it, is that GPU cores are very simple-minded; they're synchronized together, and they all need to do the same operations in lockstep to a large degree. So our weather map has to be put onto the GPU over some kind of bus, usually the PCI bus that you plug your card into in your workstation, and that's actually a fairly slow connection; you might think it's fast because it's right there on the motherboard, but it's slow compared to normal memory speeds. So we have to cram the data over that PCI bus and put it on the GPU; the GPU can only do a limited subset of operations on that weather map, and we hope it can do as much as possible; with something like a weather problem it might be able to do a whole lot; and then eventually the results have to make their way back to the CPU, again over that relatively slow PCI bus. This is how GPU computing would approach the problem. It's like having one meteorologist on the CPU who really knows what they're doing, who can do anything, coordinating with a thousand math savants in another room; you've got to use a slow connection, a tin-cans-and-a-string kind of connection, but the savants in the other room can really crunch numbers if you give them a simple enough problem. That's GPU programming in a nutshell, and it's the kind of thing we do in our OpenACC workshop, with the same problems that you're going to use here in the MPI workshop.
Oh, and by the way, the way we can actually do this could either be the typical classic approach, something called CUDA, which is very low level, not standard at all, not portable, and a maintenance nightmare, because CUDA gets updates every 15 months or so and really requires a lot of upkeep; or you could use a directive-based approach, OpenACC, which is the GPU analogue of OpenMP: you give some hints to the compiler and you hope it has enough smarts to make this stuff happen magically, and that's not an unreasonable expectation for a lot of scientific types of algorithms.
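For comparison, here is a minimal sketch of that directive-based style on the same hypothetical stencil as before; the OpenACC pragmas ask the compiler to move the arrays across the PCI bus and run the loops on the accelerator:

    /* copy/copyin describe the traffic over the (slow) bus; parallel loop
       hands the iterations to the GPU's many simple, lockstep cores. */
    void update_grid_acc(int nx, int ny, double *restrict new_t, const double *restrict old_t)
    {
        #pragma acc parallel loop copyin(old_t[0:nx*ny]) copy(new_t[0:nx*ny])
        for (int i = 1; i < nx - 1; i++) {
            #pragma acc loop
            for (int j = 1; j < ny - 1; j++)
                new_t[i*ny + j] = 0.25 * (old_t[(i-1)*ny + j] + old_t[(i+1)*ny + j] +
                                          old_t[i*ny + j - 1] + old_t[i*ny + j + 1]);
        }
    }

The alternative is writing the kernel by hand in CUDA, which buys more control at the price of the portability and maintenance issues I just mentioned.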
But what we're going to learn instead is how to do this so that we can really scale it up, infinitely in a sense; we can scale this thing up as large as we want. Four cores on your laptop, breaking the map up into four pieces, probably isn't going to be enough for your National Weather Service; that's not going to get you an accurate forecast and hurricane predictions. A single GPU might be faster than your CPU, but you're still not going to run a serious weather model on it. You sit down and do the math of how many flops you need in order to get your prediction out tomorrow instead of five years from now, and the answer is probably going to be that you need thousands, or tens of thousands, of cores' worth of flops. You need a lot of flops to run a modern weather model, and MPI is going to give us that capability.

MPI says: instead, let's break the problem up across a bunch of different, physically independent machines. A cluster of machines might be a good place to start, and you can certainly build your own cluster: you buy a bunch of what I call white boxes, a term for generic workstations, stick them in a closet, hook them together with Ethernet, and you've got yourself a cluster. You could build a cluster like that, or you can find yourself a supercomputer; we'll find out that MPI makes all these things look pretty much alike from the programming perspective. So you build yourself a cluster, and now you can get as many computing flops as you need in one place, as much as you can afford; if you get more money, you buy twice as many machines, so you can keep scaling it up. But our computing model now requires us to break this map up into a lot of small independent pieces that live on each of these separate machines, in their own separate memory on those machines, and they only communicate among themselves over an actual, visible network: an Ethernet network, for example, or whatever you're using to connect them together.
So this is where MPI comes into play. MPI says, OK, we can break this map up into 50 different pieces if we want, one for each state, and each one of those states is going to have its own workstation, so it only has to compute the weather for one small state. But it has to talk, to communicate that weather information with the other states, because your weather is affected by your neighbors to some degree, right? The weather does come across the border, so we do need to exchange some information, and that's going to have to happen over the network. MPI allows us to use this model. This is like having 50 meteorologists, but they're no longer in the same room, looking at the same map, staring over each other's shoulders; now they have to communicate explicitly. It's like having 50 meteorologists using a telegraph. So there's a little more attention we've got to pay to detail, and it's going to require more effort on our part to make this work, but it will scale basically as large as we can afford. This is MPI, and I won't go further into it here, because we're about to jump right into those details.
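To give you a taste before we get there, here is a minimal, hypothetical sketch of that border exchange in MPI (the actual workshop exercises will build this up properly): each rank owns its own patch of the map and explicitly trades edge data with its neighbors every time step.

    /* Each rank trades one edge (NCOLS values) with its left and right
       neighbors; ranks on the ends talk to MPI_PROC_NULL, which is a no-op. */
    #include <mpi.h>
    #define NCOLS 1000

    void exchange_borders(double *my_left_edge,   double *my_right_edge,
                          double *halo_from_left, double *halo_from_right,
                          int rank, int nranks)
    {
        int left  = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int right = (rank < nranks - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Status status;

        /* send my right edge to the right neighbor, receive its left edge */
        MPI_Sendrecv(my_right_edge,   NCOLS, MPI_DOUBLE, right, 0,
                     halo_from_right, NCOLS, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, &status);

        /* send my left edge to the left neighbor, receive its right edge */
        MPI_Sendrecv(my_left_edge,    NCOLS, MPI_DOUBLE, left,  0,
                     halo_from_left,  NCOLS, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, &status);
    }

That is really most of what "explicit communication" means in practice: you decide what data your neighbor needs, and you send it.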
Here is how the pieces fit together, and if you understand this right here, you understand the vast majority of the actual, real computing landscape. There are lots and lots of experimental and, frankly, hyped ways to do parallel programming, and we'll talk about them tomorrow in the outro talk; we'll dive into them then. I don't want to confuse you now, before you know the one that actually works, MPI. After we do MPI, tomorrow we'll look at a lot of the others. You might hear somebody say that if you do functional programming, all this stuff works out automatically, just use Haskell and it will happen; you may have heard claims like that, and we'll look at those claims tomorrow. But I can tell you right now that the reality at the moment is that 99-plus percent of the computing done on supercomputers is done with the pieces we just covered, and I'll show you how they fit together. That's the reality. If you go to the Supercomputing 18 conference, which takes place next month in Dallas this year, and you walk the large show floor looking at the thousands of projects people are doing, 99-plus percent of the ones running on supercomputers use MPI, GPU programming, and OpenMP. So this is the reality; the possibilities for the future are a different story, and we'll talk about those tomorrow, but the reality is that the pieces fit together like this.

Today, if you buy a processor, it's going to have multiple cores; you can't buy a serial processor anymore. Even a simple ARM processor to power your smartwatch has got multiple cores. So you buy a processor today, it's got multiple cores, and if you want to program those multiple cores, you need multi-threaded programming; OpenMP is by far the biggest standard for doing that, so OpenMP allows you to use multiple cores. If you buy a GPU to plug into whatever you've got, say your workstation: we've got a workstation that's computing, we decide we want to buy some more flops, we go on Amazon, buy ourselves a GPU, and plug it in, and all of a sudden we've got teraflops of performance, 100 teraflops if you're willing to go to low precision, performance you just bought and plugged in. If you want to program that, you need some kind of GPU programming approach, like OpenACC. But if that's not sufficient, if that one box, no matter how much money you put into it, isn't sufficient, then you have no other choice, and again, this is not a matter of my personal preference, it's a matter of reality: if you look at all the large codes out there that run on tens of thousands of cores, all of them use MPI, which you're about to learn. If we want to go beyond a single box, a single node, we need MPI to put those pieces together. MPI allows us to scale things up to, in a sense, infinite scale; there really isn't a ceiling there, and what it does come down to is as much money as you have. We can see this in the trajectory over the years: we're up to tens of millions of cores now, and I'll show you that on some of the bigger machines. They're all MPI; there's no other way to program the big machines, and for the foreseeable future there are no alternatives that are viable. There's lots of discussion, lots of hopes and dreams, but all the codes that are running on the big machines right now are MPI codes.
OK, so let's recap some of the levels of parallelism we have going on. We can start, as I mentioned, at the lowest level of the processor: we have these vector instructions, which we don't get into because we hope the compiler takes care of them for us, and that's not an unreasonable hope. It's good to know about optimization, to talk with somebody who knows optimization, maybe verify you're getting good performance, and to run profilers on your code, but you might hope that the compiler does a good job of using this stuff. At that lowest level, the instruction-level parallelism in modern processors is incredibly complex and sophisticated; this, by the way, is where all the Spectre and Meltdown problems come up, all the security issues you may have heard of in the past year. They're difficult to eliminate because modern processors have a bag of tricks that is quite effective; that's why it's not an easy problem for Intel and the others to solve, since getting rid of these behaviors means giving up a lot of performance. But these things all happen in the background; there are all these clever tricks for rearranging the order of instructions, renaming registers, and predicting what's likely to happen next in your code, and it all happens at some level between the hardware and the compiler. So again, we hope as programmers that this stuff doesn't really have to be visible to us; we hope it's all the compiler's business and not our problem. That's not always true; sometimes it's helpful to be able to jump in and give the compiler some hints as to how it might do better, and sometimes that can be quite effective. It's always good to at least have a handle on how well you're doing here, which is why running a profiler on your code is a nice thing to do. But it's not unreasonable to hope that this level can remain invisible to us.

On the other hand, once you start talking about using more than one core, because your processor, whether it's in the laptop in front of you or some other machine, has got multiple cores in it, we know that to use multiple cores the compiler can't do it effectively; we've got to program it. So we could use OpenMP and worry about programming multiple cores there. If we plug in an accelerator like a GPU, and there are some alternatives to GPUs, but GPUs are pretty much the center of mass of accelerator programming, we could program that in some way. But if we want to go beyond that at all, to do anything larger than a single node, then we're in the MPI world.
Now, there are a few other things that are important in parallel programming that I won't go into; we don't have time to cover everything in the next couple of days, but they are important. Parallel I/O, for example, is incredibly important for these large machines. If you build yourself a big cluster with thousands of nodes or more, doing I/O, reading your data in for your problem and writing it out as you do your computations, is not trivial at that point, and you hope that happens in parallel as well. It's essential that it does, or you've run the problem 1,000 times faster and then it takes you two days to read your data. So I/O is a whole thing that happens in parallel; we'll touch upon it briefly, because MPI is a good way to deal with that issue too; it's a great solution for the parallel I/O problem. So parallel I/O is important. Also, for some of you out there, there are these more or less custom, exotic types of devices that can be very effective for certain classes of problems: they're either straight-up custom chips, like ASICs, or field-programmable gate arrays (FPGAs), which you can kind of reprogram to do your thing, or digital signal processors, which are very good at particular types of problems. Some of you out there will undoubtedly come across these kinds of devices and have to program them; they also fit into this scheme in various ways, but I can't digress into the particulars here. These are other areas of parallel programming you may come across that are important at scale.
But let's look at how MPI applies as we work our way up the hierarchy of machines. If you buy a motherboard, if you look at the motherboard you bought for your workstation, or the cheaper motherboard you have in your laptop or your typical home PC, even one you build yourself, it probably has just one chip on it, one processor. That processor has multiple cores, but it's only one processor. If you look at a data center, on the other hand, they typically want to cram more performance into each blade they stick in the cabinets of their data centers, and so they typically have multi-socket motherboards: you can stick dual or even quad processors on a single motherboard and get a lot more cores on a board that way. You can start to stick a lot of this stuff together and build a larger machine; you can even try to build a very large shared memory machine, where all the cores share the same pool of memory. That's desirable because it allows us to use that OpenMP programming model I briefly mentioned earlier, the one I said was good for breaking our weather map up across four cores. That programming model is definitely easier than MPI, which you're about to learn. Some of you have undoubtedly been told how intimidating and difficult MPI is, and I'll dispel that; we're going to turn you into MPI programmers over the next day, and you'll walk away saying that wasn't all that hard to do. But it is a little harder than something like OpenMP, the programming model where you just throw a few directives at loops, which can be effective. The desire to retain that programming model on bigger machines, beyond a single node, is such that companies occasionally try to build very large shared memory machines, and so there are some very large shared memory machines out there. We have some of these 12-terabyte nodes on Bridges; they've got 12 terabytes of shared memory across roughly 260 cores, and that's about as large as it gets with shared memory, even when you push the envelope. So shared memory machines can have more than four cores, or 20 cores; they can be larger, but they're very expensive, and even when you really spend millions of dollars, that's the limit: you're still stuck at several hundred cores. You're never going to get up to the scale of even a cluster you might build yourself.
There are lots of clusters out there that people have built over the decades. Cluster computing became popular back in the mid-90s, because if you want to build your own parallel computer, nobody can stop you from sticking a lot of nodes in one place and hooking them together with a network. You might put them in nice cabinets as blades, or, as I implied earlier, in an unused room somewhere where you throw a bunch of white boxes on the floor and cable them together; at the end of the day it's the same thing, the same idea: a conventional computing box hooked up through a network connection to a bunch more conventional computing boxes. There are lots of big clusters out there, all of which are programmed with MPI. And if you've got the money to have somebody do the slickest solution for you, they can start to make custom networks to cram things even closer together. Physical proximity helps the network, since the speed of light actually becomes important for speed purposes, and cooling becomes tougher, so all of these things come into play when you want to build a custom machine; they're the reasons people do buy machines from Cray, for example, or another big vendor. If you do that, you can build yourself an MPP, a massively parallel processor, at whatever scale you can afford.

At the moment, the biggest machines in the world include, for example, Summit at Oak Ridge National Lab, at 122 petaflops; we've been talking about the race to exascale, to an exaflop, and this is about 1/8 of the way there. It's got a bunch of pieces, and we won't go through the pieces, but it has about 2.2 million cores. This is where you, as an MPI programmer, are good to go: that's awesome, millions of cores to use, as compared to somebody who is not an MPI programmer, to whom this machine is completely inaccessible. The second-place machine, which was in the number one spot for the past couple of years, is a Chinese machine (Sunway TaihuLight) that's got 10 million cores, a custom interconnect, and a lot of other custom things going on in it.
OK, before we move on, I want to clarify some terminology. It's important because we're going to be bandying these terms about over the next day or so, and it's easy to get confused; it's easy because the people who use this terminology often use it in a confusing and inconsistent manner, and I'm guilty of that as well. The terms are cores, nodes, processors, and one we haven't used yet, called PEs. What do these terms mean?

A node refers to an actual physical board, whether it's in a cabinet or it's the workstation sitting on the floor of your closet, that has a network connection to it; it's the node at the end of a network. Anything that's got its own network connection, no matter how much computing power is lumped in there, is a node; it might have multiple sockets in there with a bunch of cores, or it might be just one socket with 12 cores in one processor. A node is basically anything that's got its own network endpoint. In a modern data center, the kind of big warehouse full of cabinet after cabinet after cabinet you might walk into, the node amounts to one of those blades you can pull out of a cabinet.

On that blade you will find a physical chip; that's a processor. Processor should always refer to a physical chip, basically something you can order off Amazon: you want to buy a faster version of the processor for your workstation, you get a chip and you plug it into the socket; that is a processor.

Now, processors today always have multiple cores, and the core is the actual independent little CPU within that processor that can run its own program. So a core is basically the thing that can independently run its own program, that doesn't care what the other cores are doing, at least theoretically, and that can run a serial program off on its own.

And last, because these terms get thrown around and used interchangeably where they shouldn't be, people who do a lot of parallel programming long ago came up with the term processing element, or PE as it's often abbreviated, and we'll actually use this in our codes. A PE basically says: however you put together this machine you built, however many processors and nodes and everything else it has, each piece of that machine that can run its own separate serial thread, let's call that a PE, a processing element. PE is a nice term that gets rid of the ambiguity, and of the misuse of processor and core in particular, so we'll call something a PE as long as it can run its own separate program. If we looked at a big MPP in a data center somewhere, we'd walk in the door and say: my goodness, this thing has 200 cabinets in here; each one of those cabinets has a dozen blades in it; if we pull out one of those blades, it's got two processors on it; and if we look at each one of those processors, it's got a dozen cores. Our PE then would be each one of those cores, and if we do the math and multiply all of that out, we can say we've got some 57,600 cores in this room, and so 57,600 PEs. PEs are what we'll think of, ultimately, as the things that can each separately run their own piece of our weather map, or whatever else we break our problem up into. I may be guilty of misusing the terminology here and there, but through context you can usually tell what I, or anybody else, means; that's why people get away with abusing the terminology so much.
The network, we're going to find out, is largely hidden from us by MPI; we're not going to have to worry about it much, and that's a wonderful thing about MPI: it makes the network, from the programming perspective, mostly go away. We could delve into it if we wanted to, but in general you want your codes to be pretty portable, and you want to be able to move them back and forth between machines. Of course, the actual network does have performance implications; just because our software abstracts it away doesn't mean it's not really there affecting things. So you should understand a couple of simple things about the networks that connect everything together, and the couple of simple things I'll point out will suffice to give you a pretty good understanding of how they perform. We could dive to lower levels and lots more detail, but unless you're building the machine and the network, you probably don't care.

To understand the general performance of a network, we only need a couple of things: the latency, the bandwidth, and the shape, or topology, of the network. The latency of a network is basically the time lag to send a very small message from any one node to another; you can define it with a zero-byte message, one so small that all you're measuring is the time to get from one node to another. That's the latency, and if you've got a code that sends lots of small messages, the latency might be the dominant factor that determines its performance; if the network has poor latency, the code is going to spend a lot of time waiting for those little messages to come across. On the other hand, say you've got a code that doesn't send many messages, but each one is huge, with a lot of data in it; in that case you're more concerned about the bandwidth. The bandwidth is the speed, in megabytes or gigabytes per second, at which your messages can be sent between any two nodes in the network. If we graph the time it takes to send a message, we can see it's got this latency attached to it, and then, depending on the size of the message, the bandwidth determines how long the rest takes. Some applications send a lot of small messages, some send large messages, so either latency or bandwidth can be very important for you. The actual connections, the shape of the connections between these things, are also very important, because that will ultimately determine how much the messages interfere with each other.
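The graph behind that description is just the usual first-order model of sending a message, a sketch rather than anything exact:

    T(n) \approx \lambda + \frac{n}{B}

where \lambda is the latency, B is the bandwidth, and n is the message size; small messages are dominated by the \lambda term, and large ones by the n/B term.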
To give you a quick example of how these networks have evolved into the sophisticated pieces they are: if you build your own network with Ethernet, that's kind of the baseline way to throw together a quick cluster, and probably any of you working in computer labs have Ethernet connecting things together. You basically take your Ethernet and daisy-chain everything together, so you've got one wire connecting everything, and that's convenient, because if I get another workstation, I can just plop it down, run the Ethernet to it, and everybody's happy. So this is a nice way to put together a computer cluster, and it works fine as long as one person here is browsing the Internet, another person is doing a little bit of programming, another is watching a movie: we're all doing different things at different times, and this network can keep us all happy. However, most parallel programs don't have that kind of asynchronous, random behavior. Instead, things like our weather map tend to have exactly the opposite behavior: we do a lot of computing, compute the weather, and then at the end of our time step, say we've just computed what the weather is like at 1:03 a.m. on some day and we want to move on to 1:04 a.m. for the next time step, at that point we all want to exchange information with our immediate neighbors. At that point we're all trying to use the network at the same time, and something like Ethernet gets overwhelmed and becomes a bottleneck. Networks can easily be bottlenecked by parallel codes; it's the number one problem we have to worry about when designing a network, and it will happen routinely with something like Ethernet, because it's just not designed for this kind of pattern where everybody wants to communicate at once.
we build a network like this we'd have
each one of our our pcs connected to
each one of the other ones independently
and then if these two want to
communicate these two can communicate
without interfering and that's great and
Nobody can argue with it: complete connectivity is a fantastic network. It just turns out to be completely impractical to build at any kind of large scale, because of the way the number of connections scales, basically as N squared: if we had a thousand nodes here, we'd need about half a million connections between them. You can't build a cluster of any reasonable size with complete connectivity, but it is the ideal.
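To put a number on that scaling (just arithmetic):
\[ \text{links} = \binom{N}{2} = \frac{N(N-1)}{2}, \qquad N = 1000 \;\Rightarrow\; 499{,}500 \text{ links,} \]
and every node would need 999 network ports of its own, which is why nobody builds it.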
Since it's only the ideal, everything independently connected, you can try to fake it. If you build a small
cluster, you'll typically buy something called a crossbar, and a crossbar tries to fake complete connectivity: it's a central point that tries to maintain the appearance that every node is independently connected to every other node. The reality is that there's still some saturation limit in there, and it can be reached in various ways, which puts that appearance to the test whenever, as happens in a scientific code, all the nodes want to communicate at once. You hit that saturation point and you find out that a crossbar isn't actually a completely connected network. So what do engineers do to try to get around this? They build tree networks.
The simplest tree network you could build would be a binary tree. A binary tree would basically have our compute nodes here at the bottom, with network routers above to connect the nodes, and then any node can talk with any other node on this network, which is a nice feature. It's also nice that if we had a cluster of four nodes and wanted to make it bigger, we'd just buy four more nodes and some more networking hardware, plop it down next to it, connect it up, and we'd have a bigger cluster. So trees are nice in that they grow in an incremental fashion. So far so good. The problem is that we end up with this top point in the network becoming oversaturated with communications, because it has to support all the communication between this entire half of the machine and that entire half of the machine.
As a matter of fact, this problem of communicating across the middle of the machine as we scale things up has a term to characterize it: the bisection bandwidth. It basically tells you, as the name suggests, if we bisect the machine, how much bandwidth we have available across that cut. That's really the concern in building a big tree network: do I have enough bandwidth at the bisection point, the worst point in my network, to support all the communication that takes place between this half of the machine and that half of the machine? If the network looks like this, the answer is going to be no: the larger we build this network, the more this top point gets stressed out, without relief.
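A rough back-of-the-envelope way to see it (my numbers, not the slide's): with \(N\) compute nodes and per-link bandwidth \(B\), everything crossing the middle of a plain binary tree funnels through a single link near the root, so
\[ \text{bisection bandwidth} \approx B, \qquad \text{per-node share across the cut} \approx \frac{B}{N/2}, \]
which keeps shrinking as the machine grows. A full fat tree aims to keep the bisection bandwidth closer to \((N/2)\,B\), so that per-node share stays roughly constant.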
So instead we build a fat tree. A fat tree is one where we add in more connections as we go further up the network, to relieve that congestion. Here all these little boxes at the bottom are the compute nodes and the circles are network connection points, and in this case, as we go further up in the network, we have more and more connections to try to relieve the congestion at the top. That's a fat tree, and it's a very successful way of dealing with this problem. But the formula for how many more connections to add as you go up is a parameter: how many connections you have going up and down, and how many cross connections you have, is something you can vary.
So if we look at the topologies of these various fat trees, we'll find different ones out there in the wild. Here are some machines that have been fairly well-known large machines over the years; if we graph their topologies, you can see that they have different patterns, and they have different performance characteristics. Some of them are good at things like broadcasting, and some are good at things like gathering data, all things we'll find out MPI allows us to do. So network topologies can be very sophisticated, and they're constantly evolving; it's not a solved problem by any means. It depends on the balance you want in your machine, and there are new ideas and designs coming out all the time;
things like dragonfly networks, for instance, are becoming quite trendy. At any rate, this is an ongoing area of research and hardware evolution. So what about the much cruder 3D torus, which is the other main alternative to a fat tree? Basically, if we look at large networks, they're either going to be a fat tree or they're going to be a torus, and a 3D torus is not sophisticated at all compared to a fat tree: it is just a 3D grid. Now, the nice thing about a 3D grid is that it tends to map pretty well onto a lot of scientific problems, which are 3D problems. Take our weather map: our weather is actually 3D, even though we've been looking at it in 2D. It's a 3D problem, and when we break it up into a bunch of pieces, it's going to map pretty well onto this network. So a 3D torus is a nice way to map our problem right onto the network and make sure the data communication is going to be efficient.
Our nearest-neighbor communication, Pennsylvania talking to Ohio, gets its own network connection and isn't interfering, down here, with Pennsylvania on the other side talking to Jersey. So mapping our problem onto a 3D torus has a lot of nice characteristics.
The torus term corresponds to the fact that it's nice not to have to tunnel data back through the middle of the network, so we actually wrap the ends around. I haven't drawn that connection here, but A and B actually have a wraparound connection on the outside; that's what makes it a torus rather than just a 3D grid, and it's a nice characteristic, so you'll find it in all of these machines.
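Just to foreshadow how little of this you'll have to manage by hand: MPI lets you ask for exactly this kind of periodic 3D mapping through its Cartesian communicator routines. Here's a minimal sketch of the idea, with the boundary-value exchange made up purely for illustration:

```c
/* Hypothetical sketch: express a weather-map-style 3D decomposition as a
 * periodic Cartesian communicator. A real code would ship whole halo planes,
 * not a single value. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nranks;
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Let MPI pick a balanced 3D decomposition of however many ranks we have. */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(nranks, 3, dims);

    /* periods = 1 in every dimension is the wraparound that makes it a torus. */
    int periods[3] = {1, 1, 1};
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1 /* allow rank reorder */, &cart);

    int rank;
    MPI_Comm_rank(cart, &rank);

    /* Who are my neighbors along dimension 0? (Think Ohio on one side, Jersey on the other.) */
    int west, east;
    MPI_Cart_shift(cart, 0, 1, &west, &east);

    /* One "time step": swap a boundary value with each neighbor. */
    double mine = (double)rank, from_west = 0.0, from_east = 0.0;
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, east, 0,
                 &from_west, 1, MPI_DOUBLE, west, 0, cart, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&mine, 1, MPI_DOUBLE, west, 1,
                 &from_east, 1, MPI_DOUBLE, east, 1, cart, MPI_STATUS_IGNORE);

    printf("rank %d: neighbors %d (west) and %d (east)\n", rank, west, east);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```

The periods set to 1 are the wraparound that makes it a torus rather than a plain grid, and MPI_Cart_shift hands you your neighbors, your Ohio and your Jersey, without you ever touching the physical topology.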
So 3D toruses and fat trees represent the networks you're going to find at large scale. If you come across smaller clusters, or build your own cluster, they will maybe have something like a crossbar network in them, though people today are increasingly getting sophisticated and building trees even on relatively small clusters. These are the networks you're going to find underneath MPI codes. We're going to find out that MPI hides most of this from us, but you should at least be aware that the shape of the network, and its bandwidth and latency, are going to have an impact on performance, depending on what your MPI code is asking of the network. I mentioned briefly that GPU
architectures are different; I won't go into them here, I just figured I'd at least throw the slides up so you can see them. In a GPU architecture, the many thousands of cores in a modern GPU, which will have 4,000 or even 6,000 cores, are extremely simple: they're not real full-blown CPUs and they can't run independent threads. So we won't dive into that here, nor into Intel's approach to it. So instead,
let's look at the top 10 systems. There's a top 10 list that comes out twice a year, and the last version of it was in June; these are the top 10 machines in the world. It's based on a benchmark that you may or may not think is applicable to you, the Linpack benchmark, but at any rate you need some benchmark in common to rank these machines. If we look at these top 10 machines in the world right now, you'll see that they all have hundreds of thousands of cores, some of them 10 million cores. All of this implies, even though nobody says it explicitly, that these are MPI machines: you program these machines with MPI, and if you don't use MPI on them you can't run things at scale, in which case you probably shouldn't be on these machines at all. They also, well, not all of them, but about half, have accelerators of some sort or another: if we look, you have some NVIDIA devices stuck in here, and here's the Intel Xeon Phi, Intel's version of an accelerator. So there are a lot of accelerators in the mix too, and that's an important thing to note: if you do want to use machines of this sort, being able to program an accelerator, a GPU, is important. I think that's all we'll go into here, other than to point out the fact that if you're an MPI programmer you can use these machines, and if you're not an MPI programmer you probably aren't allowed on them. Parallel I/O, again, we won't go into in any more detail here. The last thing
I'll talk about here is where this stuff is headed, so that you have some idea of the roadmap for the immediate future, the part I can pretty much guarantee. I'm not going to go into a lot of speculation; instead I'll show you where either the momentum or the outright funding and roadmaps are in place to guarantee what things will look like for the next couple of years. That roadmap for the next couple of years is very definite, and maybe five or six years out you can make some pretty good guarantees as well, and MPI is a big part of all of it.
So today, petaflops computing is everywhere: there are well over 270 machines that have breached the petaflop barrier. And everybody's got an exascale project going on; every substantial region right now is worrying about getting to exascale. So,
again, here's exascale put in a different perspective. The world's biggest machine in 2004 was Cray Red Storm; an exascale machine is equal to 23,000 of those. Or take a middle-of-the-road, middling kind of GPU you're liable to find around your department today, an NVIDIA K40 at about 1.2 teraflops: an exascale machine is equivalent to eight hundred thirty-three thousand of those. So these are big machines.
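Just to check that arithmetic (an exaflop is \(10^{18}\) floating-point operations per second):
\[ \frac{10^{18}}{1.2\times 10^{12}} \approx 833{,}000 \text{ GPUs,} \]
and the 23,000 figure for Red Storm falls out of the same kind of division, implying a machine of roughly forty-some teraflops.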
The reason, again (I showed you this graphic briefly earlier), that we're pretty sure we're getting there soon is that we've got the history behind us, showing a kind of incremental progress that makes it not too hard to extrapolate out a couple of years, and at one point in time we were pretty sure we were going to get there. In this graph, the line on the bottom is the slowest machine on the Top 500 list, this one is the fastest machine on the list, and this one is the cumulative sum of all 500 machines. We were pretty sure we were going to get to an exaflop right around 2020, but if you read into this a little bit, if you do a little bit of regression on this data, you can see that we're tailing off a little, so now the guess is more like 2021 or 2022 for that first exascale machine to show up. It's very competitive, though, and there could be some surprises along the way.
I'll skip this slide, but I will point out some of the obstacles that are affecting this roadmap. The groups that put together these exascale roadmaps, the ones trying to deploy exascale machines, have come forward with and made public their priorities and concerns, the obstacles that have to be surmounted before they're going to get there, and these are some of them right here. Energy efficiency, actually energy and cooling and everything else, is extremely important and difficult at that scale: we're talking about machines that use at least 20 megawatts of power.
There are also lots of problems with reliability: you've got millions of processors in one place, and keeping them all going is, well, actually impossible if you think about the mean time between failures. If I told you a processor only fails once every 10 years, you might think that's not bad at all, but if you put 10 million of them in one place, it means they're not all going to keep going for even a couple of hours at a time.
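A quick back-of-the-envelope version of that argument, assuming independent failures (my arithmetic, not a number from the slides):
\[ \text{MTBF}_{\text{machine}} \approx \frac{\text{MTBF}_{\text{part}}}{N} = \frac{10\ \text{years}}{10^{7}} \approx \frac{3.15\times 10^{8}\ \text{s}}{10^{7}} \approx 30\ \text{seconds} \]
between failures somewhere in the machine, far short of even a couple of hours.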
So building a big machine with 10 million cores means fault tolerance and reliability can never be afterthoughts if you want to use the whole machine. There are lots of problems like this; this is a list put forward by the Advanced Scientific Computing Advisory Committee, for example. So for those
of you who are interested, you can stare at this list and ask which problems you might want to contribute to. If you're interested in being on the leading edge of this stuff, there's a lot of opportunity to help solve these problems, and they are still outstanding problems; it's not as if exascale machines are some inevitable evolution that's just going to keep cranking away. These problems stand between us and making those machines effective. The roadmap to get there has had a couple of quantum jumps in it, one of which has happened and one of which we're in the midst of.
The first one that happened was the boost from accelerators. If we look at the Top 500 list going up over time, and look at it closely, it's had a couple of jumps in it. The first one was back when accelerators became popular; that really kept the momentum going. Serial performance petered out around 2004, you know, but the Top 500 data over the past decades kept marching forward somehow, and the first thing that really kept things on track was the boost from accelerators, because GPUs really helped to bring a lot of flops to these machines.
The second boost, which is taking place pretty much right now, is the move to 3D circuitry in electronics. This has happened industry-wide, in commodity devices as well: stacked electronics. It has become really important because Moore's law gives us the density on a single chip, but stacking those chips on top of each other is not a trivial thing, and we're not talking about stacking at the package level, we're talking about stacking silicon dies on silicon dies. Stacking them in 3D has bought us continued performance improvements that have hidden the Moore's law issues for the moment; the increased bandwidth and the other things that have kept chugging along owe a lot to the 3D packaging that's happening now. The third boost,
which may be needed to get us to the exascale machine, is the move to silicon photonics, which is basically saying that copper wires for connecting everything together have a lot of drawbacks compared to fiber optics, which is obviously superior in many, many ways. We know that our big wide-area networks are all fiber optic now, and many of you have fiber coming into your home. Fiber optic connections at the network level on these big machines are becoming commonplace, but pushing them down to the integrated circuit level is really important to get those benefits, if we want the network efficiencies we need to build these exascale machines. So these are important quantum jumps in technology that show up at the bleeding edge and then filter their way back down to your desktop. One really interesting
thing, a little bit wonkish, a little bit technically specific, but I think actually really nifty, is that MPI people get a benefit from what for everybody else is a huge problem.
And it is this: if we look at where the power is actually being spent in modern computing, it used to be that almost all the power in scientific computing was spent actually doing the number crunching. Getting the data into the registers and getting the answer out, doing the number crunching, is where all the power in a processor went, where all the power in your machine went: into the units doing the flops. Today we've hit the point where most of the power is spent moving data around on the machine. The actual number-crunching part of computing takes the least amount of power, whereas moving data between all the memories, because you've got all these memory hierarchies and network connections involved, has become the dominant consumer of power. Data movement is now the biggest consumer of power, as of this year.
And MPI, as you're about to learn (we haven't learned it yet, but we're about to), gives us control over where we move data. Most people are fearful of this future in which controlling data movement becomes really important to getting good performance out of these machines, or even to making it possible at all. As an MPI programmer, you actually have that capability; whether you want it or not, it's a responsibility of MPI programming to control the data movement. So as the world becomes more fixated on moving data around efficiently, instead of just computing, just doing the number crunching in registers, you as MPI programmers are really in a good position to deal with that problem. We'll come back to that briefly once you know what MPI is, which will be shortly if you don't already. Another way of looking at
that is this: if the flops, the floating-point operations, become effectively free at some point because they take so little power, and all of your time is spent moving data around on the machine, then eventually we're going to worry about optimizing data movement more than anything else. Not eventually: it's already happening. And MPI is well positioned to do that. OK,
I won't belabor these points; this is another way of looking at how the old constraints that used to be the main problems in programming, things like how fast your clock is and how many flops you're getting, have given way in modern machines to new ones. Power, as I mentioned. Data movement, as I just mentioned. Concurrency, using these things in parallel, is now the most important thing in programming, and that's where you're well positioned with MPI. Memory scaling: computing capability has grown way faster than memory bandwidth, and that's related to data movement, to having your data in the right place, which MPI is well positioned to deal with.
Locality: where is your data? In a modern machine, from the CPU you've got registers, and you've got these multiple caches to get the data into before you hit regular memory; and now, behind the regular memory, we've got non-volatile types of memory, flash memories, things like SSDs, and then ultimately maybe some spinning disks for the long-term stuff. Where is your data sitting in that hierarchy, and where should it be? These are things MPI is particularly well developed to deal with: data locality. Heterogeneity: as I showed you in the top 10 list, these machines have processors, they've got accelerators in them, they've got multiple processors in each node and multiple cores on each processor, so the machine is not some regular, simple building block; it's got a lot of pieces and moving parts to it. These are all things that MPI is well placed to deal with. The last one, the reliability issue I mentioned, is still very much an outstanding problem. Okay,
one more thing I'll mention, since we're talking about architectures: if you're keeping up with this stuff, or you're going to dive into this field, you'll quickly find that people are interested in architectures that are different from what we're using today, substantially different, not just a rearrangement of the pieces we're using now. Today we're doing things with standard silicon electronics, CMOS electronics if you will; that's the silicon fabrication technique that almost everything today is based on, and our computers look pretty much the same as the first computers we built in the 1940s: a von Neumann architecture, where basically you've got registers and you've got memory and you move things back and forth between the two. That's where we are today. But maybe the answer to the end of Moore's law is instead something drastically
different. What if we go beyond silicon transistors? We'd keep the architectures we have today, our registers, the kind of computers we've got now, but we'd build them on something that's higher performance and doesn't have the thermodynamic issues and whatnot. There are a lot of efforts like graphene, for example: people are trying to make transistors out of graphene, and that's an obviously desirable thing, because it preserves a lot of the technology and techniques that we have while letting us continue to move forward without Moore's law, or at least resetting Moore's law in some different domain. So there's a lot of hope for that, but those things are still pretty much in the laboratory; when you read about graphene transistors, they're sitting on a lab bench somewhere, not in fabrication. How about if we
abandon the von Neumann architecture and go for a radically different design, something like quantum computing? That has certainly got an awful lot of mindshare in the world of computing. Here we would be doing something that is certainly not silicon-based electronics and certainly not a von Neumann architecture; it's very different, but maybe it gets us a whole different world of capability. Quantum
computing is a fascinating, interesting area, and maybe one of the interesting things about it is that the people who are most expert in this field have very diverse opinions on when its near-term practicality is going to materialize, on how soon any of this is going to be real, how close we are to actual devices and real applications. It is unusual in that respect: usually, as you get towards the experts and knowledgeable people, a consensus forms. That is not the case here. You will find that people who are deeply involved in this field have very differing views about whether we're a couple of years away from some practical quantum computing, at least in some narrow domains, or whether we're 15 years away, and you'll hear both of those opinions from people who are well informed, not just from somebody getting secondhand information. So it's a very interesting and rapidly developing area, and a lot of the literature is accessible to those of you interested in it, but it's hard to predict. The last
alternative that's worth talking about, and that has some chance of coming to practical reality, is something that uses our modern electronics techniques, CMOS, silicon-based electronics, but with a very different, non-von Neumann design. Neuromorphic computing is the practical example: building a computer that looks, in this case, like a neuron. Machine learning and deep learning have become wildly successful in many different areas, and they're based on building, in software, basically, and using GPUs, architectures that resemble something like biological neural nets. The idea that we could instead implement that directly in silicon, and thereby remove the need for that layer of translation, is not only irresistible but also practical, and it turns out that there are more than a few companies, from IBM on out, that are developing neuromorphic computing devices, which have had varying degrees of actual real-world effectiveness. So here is a different type of computing that is built on silicon electronics, so how to fabricate it is not an unknown; that part isn't iffy, they can definitely do it. How much success it will have in various applications is still an open question, but it certainly has had some early successes.
So Moore's law ending is not necessarily the end, and nor should we be freaked out that Moore's law is coming to an end, because it wouldn't be the first paradigm shift in computing. We've gotten very spoiled by this integrated-circuit era. Computing came out of mechanical devices with Hollerith cards used for census surveys and the like; the first computers were built with relay-type electronics, then vacuum tubes, then discrete transistors. So computing went through a lot of upheaval and a lot of revolutions over the years; it's just that we've been stuck since the late 60s in this integrated-circuit era, and we think that's all there is to computing. In that sense it's kind of overdue for a paradigm shift; it wouldn't be the first.
There's also now, finally, a big appreciation of the fact that Moore's law is no longer giving us the ability to say: however poor your programming or your approach is, it'll run faster next year. "Computing time is cheap compared to developer time or productivity time" was a mantra that became quite popular over the past twenty-some years, as we took it for granted that computing power was cheap and just boundlessly growing. Over the past five or six years there has been more of a realization that maybe we need to actually go back to knowing how to program, because things aren't just automatically speeding up anymore. Saying "my code's fast enough, even though it's using two percent of the capability" is okay on your laptop, but if you're going to run something in the cloud, which means somebody's data center somewhere, something that takes megawatts to run, then saying it's running at two percent of its potential speed because you didn't have time to do it right is incredibly wasteful.
All right, so there's that shift too, and you as MPI programmers are well positioned to take advantage of it. I feel I should sign off here with something that puts it all back in perspective: we can talk about these exascale machines that take 20 megawatts plus to model the human brain, but it's important to note that the human brain takes about 20 watts to run. So there's an awful lot of room for improvement right there: we're hoping to be able to run something like a human brain in real time with 20 megawatts. There's an awful lot of room for improvement and development, and so even though Moore's law and other things are coming to these depressing asymptotes, the world will remain exciting and there will be lots of development and evolution to come. I hope you're very motivated now:
we're about to jump into the actual programming, we're going to get our hands dirty and start writing code. Fear not: I said this was all overview and buzzwords, but I'm going to refer back to a lot of this stuff over the next day and a half, so I wanted to put it all in one place. Parallel computing is no fad; this is not something that's optional. We can see that we've been forced into it by physics, no getting around thermodynamics in particular, so we have to go parallel, and that's why everything is parallel. And if you jump on board with the right approach to this stuff, which I guarantee you is MPI, you're going to get great utility out of it, not just now but for the indefinite future. Every roadmap for the big machines being built, all the exascale machines, has MPI as the programming model. People hope some other things might come on board, but for every machine that's being funded right now to be built in the next five years, the baseline programming model
is MPI; that's what they're assuming. And again, the pieces fit together like this: you might program a single processor with OpenMP and do multi-threaded programming, and you might plug in a GPU and program it with CUDA, OpenACC, or OpenCL, but the second you go beyond that to multiple nodes, it's MPI.
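To make that layering concrete, here's a minimal hybrid sketch, my own illustration of the idea rather than anything from the workshop exercises: MPI spans the nodes, OpenMP fills the cores within each node. (The compile flag varies by compiler; treat something like "mpicc -fopenmp hybrid.c" as an assumption.)

```c
/* Minimal hybrid sketch: MPI across nodes, OpenMP threads within each node. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    /* Ask for an MPI library that tolerates threads (only the main thread
     * makes MPI calls here). */
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* MPI handles the distributed-memory part: typically one rank per node. */
    #pragma omp parallel
    {
        /* OpenMP handles the shared-memory cores within the node. */
        printf("MPI rank %d of %d, OpenMP thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```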