Research:condor cluster


Running on the Condor cluster

  • An example working directory is in ~chona/amoeba/softcore_vdw3 on lela.tacc.utexas.edu.
  • Prior to submitting the job, make sure that the shared libraries can be found by the dynamic linker. This can be done by redefining the environment variable LD_LIBRARY_PATH. In bash,
 lela% export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH

will prepend the current directory to the list of paths searched by ld or ld.so. Then run ldd on the executable

 lela% ldd sander
       libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00365000)
       libguide.so => ./libguide.so (0x00f27000)
       libm.so.6 => /lib/tls/libm.so.6 (0x0024f000)
       libcxa.so.5 => ./libcxa.so.5 (0x00963000)
       libunwind.so.5 => ./libunwind.so.5 (0x00b77000)
       libc.so.6 => /lib/tls/libc.so.6 (0x0011c000)
       libdl.so.2 => /lib/libdl.so.2 (0x00249000)
       /lib/ld-linux.so.2 (0x00103000)

and confirm that there are no "not found" errors.
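The check for "not found" entries can also be scripted. The following is a minimal sketch and not part of the original setup; the function name missing_libs is hypothetical. It reads ldd output on standard input and prints the name of every library the dynamic linker could not resolve.

```shell
# missing_libs (hypothetical helper): read `ldd` output on stdin and print
# the name of each library reported as "not found".
# Exit status is 0 if every library resolved, 1 if at least one is missing.
missing_libs() {
    awk '/not found/ { print $1; missing = 1 } END { exit missing + 0 }'
}

# Example use on the submission machine (assumes sander is in the
# current directory):
#   ldd ./sander | missing_libs && echo "all libraries resolved"
```

A non-zero exit status means at least one library still needs to be copied into the working directory or added to LD_LIBRARY_PATH.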

  • Alternatively, you can add "-Wl,-rpath,." to the link flags in the sander Makefile. If this is done, then LD_LIBRARY_PATH does not have to be redefined.
  • The file "job" is the Condor job script:
 lela% cat job
 getenv=true
 initialdir = /home/chona/amoeba/softcore_vdw2
 Arguments=-i vdw1.mdin -c vdw1.inpcrd -p vdw1.prmtop -o 06.mdout -r 06.restrt
 universe = vanilla
 should_transfer_files = TRUE
 when_to_transfer_output = ON_EXIT
 transfer_output_files = 06.mdout,06.restrt
 transfer_executable = true
 executable=sander
 transfer_input = true
 transfer_input_files = vdw1.mdin,vdw1.inpcrd,vdw1.prmtop,libsvml.so,libvml.so,libguide.so,libcxa.so.5,libunwind.so.5,bnd.rst
 requirements = (OpSys == "LINUX") && (Arch != "dummy")
 error = log/sander.err
 log = log/sander.log
 queue 1

getenv tells Condor to run the executable with the environment inherited from the submission machine, including LD_LIBRARY_PATH as set above.

initialdir is the directory on the submission machine relative to which all file paths (i.e., input files, executable, log and error files) are interpreted.

arguments are the command-line arguments passed to the executable.

transfer_output_files is a list of files that will be transferred back to the submission machine when the job terminates.

executable is the program or script to be run. In this case it is sander, because that is what we run on the backend. Alternatively, this could be a wrapper script.

transfer_input_files is a list of files that will be staged to the execution machine. This is necessary because there is no shared file system between the frontend (lela) and the computational backend. Note that the shared libraries are transferred as well, because sander is a dynamically linked executable.

error is the file that will contain stderr from the job.

log is the job log file.

queue 1 indicates that we will run sander once, on one CPU or core.
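If a wrapper script is used as the executable, it could set LD_LIBRARY_PATH on the execution machine itself rather than relying on the inherited environment. The following is a minimal sketch under that assumption and is not taken from the original working directory; the function name run_with_local_libs is hypothetical.

```shell
# run_with_local_libs (hypothetical helper): run a command with the current
# directory prepended to LD_LIBRARY_PATH, so that shared libraries staged
# next to the executable are found by the dynamic linker.
run_with_local_libs() {
    # ${VAR:+...} avoids a trailing ":" when LD_LIBRARY_PATH is unset or empty.
    env LD_LIBRARY_PATH=".${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}

# In a wrapper script, the last line would then be something like:
#   run_with_local_libs ./sander "$@"
```

With this approach the wrapper would be named in executable, and sander itself would have to be added to transfer_input_files so that it is staged alongside the libraries.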

  • Please read the condor_submit man page for more detailed information on other Condor job script keywords.
 lela% man condor_submit
  • Submit the job.
 lela% condor_submit job
  • Monitor the job.
 lela% condor_q <jobID>

where <jobID> is the numeric ID returned by condor_submit after the job is successfully submitted.
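To avoid copying the ID by hand, it can be extracted from the condor_submit output. This is a minimal sketch; the function name cluster_id is hypothetical, and it assumes that condor_submit prints a line of the form "1 job(s) submitted to cluster 42.", which may differ between Condor versions.

```shell
# cluster_id (hypothetical helper): read condor_submit output on stdin and
# print the cluster number.
# Assumption: the output contains a line such as
#   "1 job(s) submitted to cluster 42."
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\).*/\1/p'
}

# Example use:
#   jobid=$(condor_submit job | cluster_id)
```

If the expected line is absent, the function prints nothing, so an empty jobid indicates that the submission message was not recognized.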

  • If the job alternates between the Run and Idle states, then there is something wrong with the job script. This is usually due to files not being copied, dynamic libraries not being found, etc. You can further diagnose the problem using
 lela% condor_q -analyze <jobID>

or by reading the job log and error files.
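One common cause, a file listed in transfer_input_files that does not exist on the submission machine, can be checked before submitting. The following is a minimal sketch and not part of the original setup; the function name check_transfer_inputs is hypothetical. It assumes that the keyword is written in lowercase on a single line, as in the job script above, and that the file names are relative paths without spaces or wildcard characters.

```shell
# check_transfer_inputs (hypothetical helper): print every file named on the
# transfer_input_files line of a job script that does not exist.
#   $1 = job script
#   $2 = directory the file names are relative to (default: current directory)
# Exit status is 0 if all files exist, 1 otherwise.
check_transfer_inputs() {
    jobfile=$1
    dir=${2:-.}
    status=0
    # Keep only the text after "=", then turn the commas into spaces so the
    # shell splits the list into one word per file name.
    for f in $(sed -n 's/^[[:space:]]*transfer_input_files[[:space:]]*=//p' "$jobfile" | tr ',' ' '); do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            status=1
        fi
    done
    return $status
}

# Example use, with the directory given as initialdir in the job script:
#   check_transfer_inputs job /home/chona/amoeba/softcore_vdw2
```

The second argument should match initialdir, since that is the directory to which the paths in the job script are relative.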