To: soltz@llnl.gov
cc: thomas@beamdump.phys.unm.edu
Subject: Re: parallel computing 
Date: Thu, 21 Mar 2002 08:42:14 -0700
From: "Timothy L. Thomas" 

> Hi Tim,
> 
> Can you tell me briefly how you farmed out your events over MPI?  We
> have larger clusters which we have not yet used because the queuing
> emphases parallel jobs.  Thanks.
> 
> -Ron


Here are some details.  I've been meaning to write this up, so you've triggered
a partial core dump.  Sorry it's not so brief.  For an executive summary, read
from here to "[...END OF EXECUTIVE SUMMARY!]", then read the first paragraph after
"[SEMAPHORES:]".

=====================================================================================

My AHPCC system has three layers - *per MPI job*, there are three main scripts:

  Master     - submit this to pbs.  It's a csh script with embedded pbs
                commands.  It uses 'mpirun' to start the subjobs ("tasks").
                The number of tasks per job in my system is a free parameter,
                typically set to 64.  So is the number of events per task;
                I leave this at 100 for various reasons.

  Bureaucrat - another csh script: one of these is run (by mpirun) for each
                task.  This layer exists primarily to do some task-specific
                unpacking (from a common NFS-mounted user area "task queue")
                and to funnel top level log info to a master output log, so
                I can keep track of what the whole system is doing at a
                course-grained level.

  Slave      - Another csh script, called by Bureaucrat, it does the real work:
                copies, unpacks, and installs from a 'HIJING + PISA system commons'
                tar file (including shared libs associated with the BNL version
                of the gcc compiler, which happened to be different from Los Lobos'
                version; I just stole HIJING and PISA the binaries from BNL.)
                Slave runs the task, directed by the task-specific command files
                grabbed from the "task queue" by the Bureaucrat.  I don't know which
                CPU does what; it's first come first served.  All processing is done
                on node-local scratch disk space (which pbs purges at job's end.)
                I run HIJING, then process the output through PISA, discard the
                HIJING output (probably should have saved it), and copy the other
                stuff back to a common NFS-mounted "data" disk.

                (From there, I used bbftp - the BaBar thing - to copy the output
                to BNL, where I put it in the HPSS using a modified version
                of my "sinker" script from my DPM days in the counting house.)

                The slave also writes a separate log file per task.  I save all
                output files from the PISA step, with a naming scheme that adds
                the pbs job number, master and slave machine names, and the a
                task number (an index to the stuff in my "task queue.")

The random seeds, etc. are managed at the "back end" when I create the entries
(files) in the "task queue".  I do this work (and keep track of all of it)
with the hacked tools I built while at BNL as DPM.

[...END OF EXECUTIVE SUMMARY!]

I saved the files to HPSS with those names; see, for example...

  https://www.phenix.bnl.gov/phenix/WWW/publish/tlthomas/HPSS-inventory/phnxreco/montecarlo/UNM_MiscIons/SiSi/cent/pisa/.INVENTORY

Someday, I might rename those HPSS files to the PHENIX standard...
which ought to be the well-thought-out one that I invented
a long time ago, but not until a consensus is reached about
about simulation "run number spaces" that I posed a long
time ago.  See the first two bullets under my stale DPM page:

  https://www.phenix.bnl.gov/phenix/WWW/publish/tlthomas/ana-links.html


[SEMAPHORES:]

All the file copying to/from common (NFS) disk space is regulated
via a perl-script-based semaphore system that I invented.  It's a hack
but it works:  it's based upon the fact that file locking under perl
(well... period) doesn't work over NFS, so you need to create semaphore
files on some local disk of some machine.  That machine becomes your
'semaphore server', and you call my semaphore routines via ssh (which
authenticates via rhosts - i.e., without the need for a password - under
pbs on Los Lobos) to grab or release semaphores.  A process that asks for
one when one is not available blocks in the ssh step until some other
task frees a semaphore.


Basically, my semaphore system is like those short-duration traffic lights
at freeway on-ramps:  smooth things out and you get better flow.  All this
technology might not be necessary if you have a really fat pipe to your
common disks, but I had to implement it at AHPCC because with 512 CPUs,
there was a significant bottleneck to the common disk, and the occurrence
of a large number of processes all trying to copy back their 100-event PISA
output files all at roughly the same time would create various problems
(including slowing the system down so much that I would run out of my
wall-clock job allocation, at which time pbs mercilessly kills all jobs
and all their tasks, and purges the local scratch disk spaces.)


Here's what an almost full (7*32+27=251 "nodes" ==> 502 out of 512 CPUs)
Los Lobos looked like to the pbs qstat command:

-----------------------------------------------------------------------------------
  ll03.alliance.unm.edu: 
                                                            Req'd  Req'd   Elap
  Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
  --------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
  1586.ll03.allia tlthomas ll-main  Pisa-R001    7699  32  --    --  48:00 R 37:25
  1587.ll03.allia tlthomas ll-main  Pisa-R002   31216  32  --    --  48:00 R 36:53
  1588.ll03.allia tlthomas ll-main  Pisa-R003    3568  32  --    --  48:00 R 36:26
  1589.ll03.allia tlthomas ll-main  Pisa-R004   16866  32  --    --  48:00 R 36:00
  1590.ll03.allia tlthomas ll-main  Pisa-R005   20645  32  --    --  48:00 R 35:28
  1591.ll03.allia tlthomas ll-main  Pisa-R006   14263  32  --    --  48:00 R 34:59
  1592.ll03.allia tlthomas ll-main  Pisa-R007    9015  32  --    --  48:00 R 34:33
  1593.ll03.allia tlthomas ll-main  Pisa-R008    8185  27  --    --  48:00 R 34:05
-----------------------------------------------------------------------------------


These were all 100-event central gold-gold jobs.  R001, for example, ended after
45 of the 48 hours, which implies 27 CPU minutes per event on these dual 750 MHz
P-IIIs.  (This is a little slower than I think I told you on the phone.)

In total, I used about 100,000 hours... which at of order $1 per hour, which is
what they quote, is a pretty good grant from UNM!  This is what was done:

-------------------------------------------------------------------------------
                 SUMMARY OF EVENT COUNTS, AS OF 10/15/01

  Au+Au Central:                 ~40,000   (captured 33,340)

   D+Au Minbias:               1,201,372   ( 23,103 central Au+Au equiv)

   D+Au Central:               2,057,527   ( 82,301    "      "     "  )
  Au+D  Central (reversed):    1,407,827   ( 42,661    "      "     "  )

  Si+Si Minbias:                 340,557   ( 13,098    "      "     "  )

  Si+Si Central:                 187,331   ( 20,814    "      "     "  )

  Cu+Cu Minbias:                 198,755   ( 15,289    "      "     "  )

          TOTAL:               5,433,369   (237,266 central Au+Au equiv)
-------------------------------------------------------------------------------

(The failure to "capture" 17% of my 'Au+Au Central' runs, which were early ones,
was due to a single-node failure (overheated!)... pbs is not kind about
this:  it took out the other 63 nodes in the job before any of the tasks
were done!  This is one of the dangers of running in this pbs/MPI-based mode.)


I am hoping to get another allocation here at UNM.  I would like to install
the post-PISA technology here, too, but I'm waiting to see how Vanderbilt's
system works out.  I may try to piggyback remotely off their Objectivity mirror.

I should also mention:  I may administer (part of) the next round of work
in a completely different fashion:  via Globus, Condor-G, and Condor
Glide-ins.  These are Grid technologies, the Condor part of which I've
already been experimenting with for several years.  AHPCC has had Globus
installed for a while.  Most recently [spring 2002], I they have upgraded
to Globus 2.0.

- Tim