Archive for the Parallel Talk Category

It has just occurred to me …

Posted in Parallel Talk, Erlang, Concurrency, ODROID-C1, Chapel, Parallella on May 19, 2017 by asteriondaedalus

… what a coincidence.

The Parallella and the ODROID-C1 PCBs are (almost) the same size.  Is that Feng Shui, synchronicity, or serendipity?

Pity the mounting holes don’t line up.

Tax return is in

Posted in Chapel, ODROID is wonderful, ODROID-C1, ODROID-XU4, Parallel Talk on April 30, 2017 by asteriondaedalus

Well, late really: the last two years, after the tax department started writing letters.

Serves them right, as I got quite a hefty lump sum out of it.

So, after agonising over what to do about a GPU-based system, I decided I am only dabbling, so I opted to simply get an ODROID-XU4 so that I can at least run OpenCL under Python.

Mind you, after toying with Chapel on my quad-core ODROID-C1, I will be interested in the octa-core XU4.  I did splurge and get a 32GB eMMC 5.0 module with Linux.

I went this way because, even now that the Jetson TK1 is available in Australia (as it’s now obsolete), I am rather more interested in Chapel.  Not to mention that Chapel appears to be a better approach than CUDA.  That is the appeal: you can take your Chapel code and just change the underlying engine.

So, my cluster will be one ODROID-XU4 and four ODROID-C1s: 24 cores all up (8 + 4 × 4), not taking the general-purpose GPU on the XU4 into account.

Oh, and the 18 cores the Parallella adds.  You do know there is an Erlang for the Parallella, don’t you?

I am in parallel programming heaven.

The thing that makes it all possible is a 9-port PoE switch and a gaggle of active PoE splitters (48V down to 5V), so you don’t need n power packs on your wall.

Don’t forget MTPuTTY (from TTY Plus) or some other multiple-session tool.

It is all built into the guts of an old gaming PC case my mate gave me, so the wife will never know.

If I need more grunt, maybe an NVIDIA card later, in an 8-core PC.

However, the PhD submission draft is due this weekend, so dabbling is all that I will have time for.

I am in so much trouble …

Posted in Linux, Nostalgia, Parallel Talk, The after market on January 18, 2017 by asteriondaedalus

… the wife will kill me!

Look what I bid on!


So, a 4-CPU server with 70-plus gig of memory.  Maybe dual cores (so eight all up), maybe quad cores (so 16 all up).  I won’t know until I grab it (if I win the auction).

I will have to install a hard drive.

Turns out Debian installs just fine, based on web searches at least.

Why?

Chapel box.

And cheaper than a Parallella.

I should be able to stuff at least one CUDA board in it; I just need to find a good second-hand one.

Go figure: it powers up, but that’s all I know.

I am so dead.

Mind you, both the wife and the mother-in-law are now actually addicted to the auction website.

So I feel vindicated … but very, very afraid.

So why do I want to buy the second one they have?

Spares?

Death wish?

Is Charles Bronson channeling through me?


Parallel versus Concurrent

Posted in Parallel Talk, Sensing on December 26, 2016 by asteriondaedalus

So, I fell upon Chapel, a parallel programming language.  It makes sense for things that Erlang and Elixir are not good at, like image processing.

It isn’t Google’s,  it’s actually Cray’s.

So, fun facts:

  1. When working in Canada, I came across a coffee mug from Cray in a second-hand store.
  2. You can cram a replica of a Cray into an FPGA.

Now I have Erlang up and running on my ODROID-C1 home server.  I will get Elixir and Phoenix running, as the wife will likely want something less geeky than Node-RED for her interface to the home automation.

In the meantime, for fun, Chapel is at this very moment compiling on the ODROID-C1.  Why not?  It is quad-core.  Apparently, for non-Cray monsters, you simply need the UDP communication module compiled, plus a lot of fiddling, and you can have two or more nodes (locales) running.

In any event, the parallel hello-world found all four cores on my ODROID-C1.


Black Magic

There was a configuration script to run, but other than that I did not have to tell it how many cores there were, so some buried smarts do the job.  Otherwise, the code to run the print on all four cores is as straightforward as:

// maxTaskPar reports how many tasks this locale can usefully run in parallel
config const numTasks = here.maxTaskPar;
// coforall spawns one task per index; 0..#numTasks counts numTasks values from 0
coforall tid in 0..#numTasks do
   writeln("Hello, world! (from task " + tid + " of " + numTasks + ")");

There is also an image processing tutorial using Chapel.

There are even lecture notes around.

Accelerate 3

Posted in Parallel Talk, Python RULES! on April 21, 2014 by asteriondaedalus

Just ran a couple of examples from a pyOpenCL tutorial site.

I tried running the optimised matrix multiply, but it clapped out.  I had to change blocksize = 32 to blocksize = 16 before it would work on my GPU; I suspect a 32×32 work-group blew the device’s work-group size limit.  Even then it didn’t offer any impressive numbers.
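A minimal way to check that suspicion, using nothing beyond pyOpenCL’s standard device queries (this snippet is mine, not the tutorial’s): a 32×32 block is 1,024 work-items per work-group, which is more than many GPUs of this vintage will accept.

import pyopencl as cl

# Print each device's work-group limits; if max_work_group_size is 256,
# a 32x32 (= 1024 work-item) block can never launch on that device.
for platform in cl.get_platforms():
    for dev in platform.get_devices():
        print(dev.name,
              "| max work-group size:", dev.max_work_group_size,
              "| max work-item sizes:", dev.max_work_item_sizes)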

I then tried the Mandelbrot example (having spent some time years ago with Mandelbrots in FORTH and on my Amiga).

The grunting GPU snapped out the finished Mandelbrot in 5.1 seconds.  The serial version took about 190 seconds.  The NumPy version, however, took only 5.3 seconds???

When running the CL code but selecting the CPU instead of the GPU, it takes 1.8 to 3.2 seconds???
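For what it’s worth, switching between the two is just a matter of which device you hand to the context.  A hedged sketch (the kernel here is a made-up toy, not the tutorial’s code):

import numpy as np
import pyopencl as cl

def make_context(device_type):
    # Walk the installed platforms (Intel runtime for the CPU, the vendor
    # driver for the Radeon) and take the first device of the requested kind.
    for platform in cl.get_platforms():
        devices = platform.get_devices(device_type=device_type)
        if devices:
            return cl.Context(devices=devices[:1])
    raise RuntimeError("no OpenCL device of that type found")

ctx = make_context(cl.device_type.CPU)  # or cl.device_type.GPU
queue = cl.CommandQueue(ctx)

# Trivial kernel, just to show the same source runs on either device.
prg = cl.Program(ctx, """
__kernel void twice(__global float *a) {
    int i = get_global_id(0);
    a[i] *= 2.0f;
}""").build()

a = np.arange(16, dtype=np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)
prg.twice(queue, a.shape, None, buf)
cl.enqueue_copy(queue, a, buf)
print(a)  # doubled, whichever device ran it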

These examples are great for showing how to get code running via OpenCL and Python.  They also quickly start raising questions about algorithm design, data-transfer overhead, and so on.

So, summary.

Running OpenCL on the GPU seems to take about as long as running native NumPy.

Running vanilla Python (non-OpenCL, non-NumPy) on the CPU takes about 3 minutes versus the 5 seconds on the GPU.

Running the OpenCL code on the CPU appears to run about twice as fast as on the GPU.  This may be any one, or a combination, of:

  1. the Intel OpenCL runtime driving an AMD device (who knows what sneaky tricks get played to make AMD look bad);
  2. the array being too small, relative to the data transfers, to bring out the speed advantage of the GPU;
  3. the algorithm not being optimised for, or suited to, a large speed-up through the GPU.

There must be some rules or heuristics around for selecting problems that best suit GPU speed-up.
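One crude heuristic you can actually measure: turn on OpenCL event profiling and compare how long the copies take against how long the kernel takes.  If the transfers dominate, the problem is too small (or too transfer-heavy) for the GPU to win.  A sketch, assuming nothing beyond stock pyOpenCL (the square kernel is just a stand-in):

import numpy as np
import pyopencl as cl

ctx = cl.create_some_context()  # pick CPU or GPU when prompted
queue = cl.CommandQueue(
    ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

n = 1000000
a = np.random.rand(n).astype(np.float32)
out = np.empty_like(a)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY, a.nbytes)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

prg = cl.Program(ctx, """
__kernel void square(__global const float *a, __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] * a[i];
}""").build()

def seconds(evt):
    # Profiling timestamps are in nanoseconds from queue to completion.
    evt.wait()
    return (evt.profile.end - evt.profile.start) * 1e-9

up = cl.enqueue_copy(queue, a_buf, a)           # host -> device
run = prg.square(queue, a.shape, None, a_buf, out_buf)
down = cl.enqueue_copy(queue, out, out_buf)     # device -> host

print("upload %.6fs  kernel %.6fs  download %.6fs"
      % (seconds(up), seconds(run), seconds(down)))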


Accelerate 2

Posted in Development, Parallel Talk, Python RULES! on April 20, 2014 by asteriondaedalus

Okay, so a half day of fiddling to get pyOpenCL working.

Turns out it was easy.

I have an Intel Core i5 CPU, and for that you just need to install the Intel OpenCL runtime.  The SDK won’t install unless you have something other than VS Express, although there appears to be a cheat to install it anyway; either way, it isn’t needed for pyOpenCL.

Then install the relevant exe for pyOpenCL and run that.  I installed the 32-bit version, because I had opted for 32-bit Python given the raft of libraries I wanted to use.  I did try installing the AMD OpenCL SDK, thinking I would need it for the Radeon, but it turned out that wasn’t needed at all.  Experimenting with code examples from various places, it looks good.
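Before trying any tutorial code, a quick sanity check that the Intel runtime (and whatever else) actually registered is to list every platform and device pyOpenCL can see, a few lines of my own along these lines:

import pyopencl as cl

# If the Intel runtime installed correctly, an Intel platform with the
# Core i5 as a CPU device should show up here.
for platform in cl.get_platforms():
    print("Platform:", platform.name, "(", platform.vendor, ")")
    for dev in platform.get_devices():
        print("  ", cl.device_type.to_string(dev.type), "-", dev.name,
              "-", dev.global_mem_size // (1024 ** 2), "MB global memory")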

I have copies of a couple of OpenCL books so I can dabble in the background. This is primarily for prototyping and learning the ropes of OpenCL.

I am holding off on my ARM cluster.  I was about to jump in and get three ODROID-U3s, but I can wait until they bring out the U4, hopefully with an RK3288.  The ODROID-XU has OpenCL (via its octa-core chipset), but it is too expensive.


Nvidia SUX

Posted in Parallel Talk on April 6, 2014 by asteriondaedalus

Not really, but I do hope they sort out their anti-free-trade thinking of not selling things like the Shield and the TK1 in Australia.

Now, if a country can be forced under free trade to export commodities, why are companies on the Internet not policed for failing to sell into countries that have free-trade agreements with the source of the technology?

All the politics aside, the TK1 sorta kinda wraps it up for Parallella and for GreenArrays, don’t you think?

Neat, though, is the latest from XMOS (XCORE-EA), which I think might still find it hard to get traction.

XCORE-EA is not targeting the same market as the TK1 or the Parallella, but it might go happily against the GA144, I suppose.  Because the XCORE-EA is a little more “familiar” to people (and they are avoiding saying “transputer”), they might have a chance.