Systems Engineering Wiki

SE Cluster

This is a short tutorial on how to use secluster-02. The cluster consists of:

  • 1 head node (secluster-02) - this is where you log in
  • 20 nodes (compute-0-0 to compute-0-19)
  • 40 dual-core CPUs at 2800 MHz (2 CPUs per node, so 80 cores in total)
  • 4 GB of RAM per node
  • 1 shared file system

The cluster works with queues (just like the ones we like to simulate so much :)). Jobs in the queue are distributed over the nodes as evenly as their parameters allow. This example uses 1 CPU core per job, so 80 jobs can run simultaneously.
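The number of simultaneous jobs follows directly from the hardware list. As a quick check in Python (the variable names are only illustrative):

```python
# Hardware figures taken from the cluster description
nodes = 20            # compute-0-0 to compute-0-19
cpus_per_node = 2     # two dual-core CPUs per machine
cores_per_cpu = 2

total_cores = nodes * cpus_per_node * cores_per_cpu
jobs_per_node = cpus_per_node * cores_per_cpu

# With 1 core per job, every core can run one job at a time
print(total_cores, "jobs can run simultaneously,", jobs_per_node, "per node")
```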

Each simulation has to be submitted to the queue separately. The easiest way to do this is to let a Python script submit the jobs. Detailed documentation on the commands to use (qsub, qdel, etc.) can be found in the Rocks 4.3 user guide, section 2.2.
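As a rough idea of what such a submission looks like, the sketch below only builds a qsub call as an argument list and prints it. The helper `build_qsub_command` is hypothetical, the file naming is simplified, and the options are the ones used by the script further down this page; they have not been checked here against the Rocks guide.

```python
def build_qsub_command(job_script, job_name, queue="workq",
                       resources="nodes=1:ppn=1,walltime=24:00:00"):
    """Return the qsub call as an argument list (hypothetical helper)."""
    return [
        "qsub",
        "-N", job_name,               # name of the job
        "-e", job_script + ".err",    # file for error output
        "-o", job_script + ".out",    # file for screen output
        "-q", queue,                  # queue to submit to
        "-l", resources,              # requested resources
        job_script,                   # the job script to execute
    ]

cmd = build_qsub_command("jobs/job.1-10.1", "job.1-10.1")
print(" ".join(cmd))
```

On the head node the list could then be handed to `subprocess.call(cmd)`. Passing a list instead of one long string avoids shell quoting problems in file names.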

Warning: Each compute node has 2 dual-core CPUs, so four jobs can run on one machine simultaneously. The stochastic functions in chi take their seed from the moment the simulation starts, so two or more simulations started at the same instant get the same seed and produce exactly the same result! To avoid this, a short delay has to be built into each job.
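The effect is easy to reproduce outside chi. The sketch below uses Python's own `random` module as a stand-in. It is not chi's generator, and seeding from the start time in whole seconds is an assumption made purely for illustration.

```python
import random

def simulate(start_time, n=5):
    """Stand-in for a stochastic simulation seeded from its start time."""
    rng = random.Random(int(start_time))   # seed = start time in whole seconds
    return [rng.random() for _ in range(n)]

t = 1160550000.0            # arbitrary example timestamp
run_a = simulate(t)         # two jobs started at the same instant ...
run_b = simulate(t + 0.2)   # ... fall within the same second
run_c = simulate(t + 1.5)   # a job started after a short delay

print("a and b identical:", run_a == run_b)
print("a and c identical:", run_a == run_c)
```

Runs a and b share a seed and therefore a result; run c starts in a later second and gets a different seed.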

An alternative solution would be to give each simulation a unique start seed using the (I think) -s option. I don't know how to use this option; a short delay with ping localhost worked fine too [EvdRijt].
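If the simulation does accept a start seed, each job needs a different one. A minimal sketch of handing out one seed per (x, y, simulation number) combination follows, using the parameter ranges of the example further down. The helper `unique_seeds` and the base value 1000 are hypothetical, and how the seed is passed to the simulation is deliberately left out, since the option mentioned above is unverified.

```python
def unique_seeds(x_values, y_values, n_sims, base=1000):
    """Map every (x, y, simnr) combination to its own seed (hypothetical helper)."""
    seeds = {}
    seed = base
    for x in x_values:
        for y in y_values:
            for simnr in range(1, n_sims + 1):
                seeds[(x, y, simnr)] = seed
                seed += 1          # a running counter never repeats
    return seeds

seeds = unique_seeds(range(1, 11), range(10, 21, 2), n_sims=3)
print(len(seeds), "jobs,", len(set(seeds.values())), "distinct seeds")
```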

Python script

…. (more to come soon) …

In this example a simulation called chisim is started with parameters x and y. Here x runs from 1 to 10 in steps of 1 and y from 10 to 20 in steps of 2. Since a simulation can only be submitted to the queue as a Linux script text file, a job script file must be generated for each simulation in the directory called jobs. Once the job scripts and the simulation output directories have been made, the jobs are submitted to the queue.

# usage: clus.py [# sims]
#
#   Starts simulating [# sims] identical jobs
#   on cluster with parameters X and Y
#
#   Warning:
#   Use with care!
#
#   Created:  Oct   11, 2006  V1 Emiel van de Rijt
#   Modified:
#             Oct   12, 2006  V2 Comments added
 
import time, os, sys
 
# The process which submits the job to the queue
def smJob(intI, sX, sY):
    # The name of the job
    myJobName = "job." + sX + "-" + sY
    # the command to submit the job to the queue
    myCmd = "qsub -N " + myJobName + "." + str(intI) + \
     " -e jobs/" + myJobName + ".err." + str(intI) + \
     " -o jobs/" + myJobName + ".out." + str(intI) + \
     " -q workq" \
     " -l nodes=1:ppn=1,walltime=24:00:00" \
     " jobs/" + myJobName + "." + str(intI)
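    # -N: name of the job
    # -e: file to write the job's error output to
    # -o: file to write the job's screen output to
    # -q: name of the queue to submit to
    # -l: requested resources (1 node, 1 processor per node, 24 h wall time), see the manual
    # last argument: the actual job script file to execute
    # send the command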
    os.system(myCmd)
 
# ====================================================================================================
 
# Process to make the job scripts
def mkJobs(intStart, intStop, sX, sY, outputdir):
  # read the path from the global variable
  global myPath
  # filename for the script file in the jobs folder
  myFOut = "job"
  # construct a name for the job
  myJobName = sX + "-" + sY
  for simnr in range(intStart, intStop):
    # create/open file for writing
    fOut = open(myPath + "jobs/" + myFOut + '.' + myJobName + '.' + str(simnr), 'w')
    # write the job script
    myCmd = "./chisim -e 1e3 " + \
      sX + " " + \
      sY + " " + \
      outputdir + "/out." + sX + "-" + sY + "." + str(simnr) + ".txt"
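    # myCmd above consists of:
    #   ./chisim -e 1e3 : simulation to run with end time
    #   sX              : param 1 for the chi file
    #   sY              : param 2 for the chi file
    #   output file     : param 3 for the chi file
 
    # the job script below does the following:
    #   go to the work directory
    #   wait a moment (ping) so the job does not start at the same instant as another
    #   print on screen the cmd to run
    #   run the command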
    myTxt = "cd " + myPath + "\n" + \
      "ping -c 2 127.0.0.1\n" + \
      "echo ran: " + myCmd + "\n" + \
      myCmd + "\n"
    fOut.write(myTxt)
    fOut.close()
 
# ====================================================================================================
 
# get current path
myPath = os.getcwd() + "/"
#print myPath
 
# run x from 1 to 10 with steps of 1
xb = 1    # begin
xe = 10   # end
xs = 1    # step size
 
# run y from 10 to 20 with steps of 2
yb = 10   # begin
ye = 20   # end
ys = 2    # step size
 
# Read the number of identical simulations
# to do from the prompt
if len(sys.argv) < 2:
  sys.exit("usage: clus.py [# sims]")
intStart = 1
# intStop is exclusive: simnr runs from intStart to intStart + [# sims] - 1
intStop = intStart + int(sys.argv[1])
 
# measure process time and wall time
# set start time
t0_process_time = time.clock()
t0_wall_time = time.time()
 
# make directory for job script files
# in the current directory
os.system("mkdir jobs")
 
# do for all x
for x in range(xb, xe+1, xs):
 
  # do for all y
  for y in range(yb, ye+1, ys):
 
    # convert to string
    sX = str(x)
    sY = str(y)
 
    # make output dir string
    outputdir = str("sim-" + sX + "-" + sY)
 
    # make the output directory
    # for a simulation output
    os.system("mkdir " + outputdir)
 
    # make the linux script files
    mkJobs(intStart, intStop,  sX, sY, outputdir)
 
    # start submitting the jobs to the queue
    for simnr in range(intStart, intStop):
      smJob(simnr, sX, sY)
 
 
# measure process time and wall time
# calculate time
print("%s seconds process time" % (time.clock() - t0_process_time))
print("%s seconds wall time" % (time.time() - t0_wall_time))
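
To check what the script above will submit before touching the queue, the sweep can be enumerated without calling qsub. The sketch below is a dry run and not part of the original script: `planned_jobs` is a hypothetical helper, it is written for Python 3, and the 80-core figure comes from the cluster description at the top of this page.

```python
import math

def planned_jobs(xb, xe, xs, yb, ye, ys, n_sims):
    """List the job script names the sweep would create (hypothetical helper)."""
    names = []
    for x in range(xb, xe + 1, xs):
        for y in range(yb, ye + 1, ys):
            for simnr in range(1, 1 + n_sims):
                names.append("jobs/job.%d-%d.%d" % (x, y, simnr))
    return names

# Same ranges as the script: x = 1..10 step 1, y = 10..20 step 2
jobs = planned_jobs(1, 10, 1, 10, 20, 2, n_sims=4)
print(len(jobs), "jobs, first:", jobs[0], "last:", jobs[-1])

# With 1 core per job and 80 cores, the queue needs at least this many rounds
print(math.ceil(len(jobs) / 80), "rounds of at most 80 jobs")
```

Each of the 10 x 6 parameter combinations is repeated `n_sims` times, so the number of submitted jobs grows quickly with the argument given to clus.py.
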
wiki/secluster.txt · Last modified: Monday, 29 September 2008 : 10:23:53 (external edit)