Research:condor cluster
Running on the Condor cluster
- An example working directory is in ~chona/amoeba/softcore_vdw3 on lela.tacc.utexas.edu.
- Before submitting the job, make sure that the dynamic linker can find the shared libraries. This can be done by redefining the environment variable LD_LIBRARY_PATH. In bash,
lela% export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
will prepend the current directory to the list of paths searched by ld or ld.so. Then run ldd
lela% ldd sander
libpthread.so.0 => /lib/tls/libpthread.so.0 (0x00365000)
libguide.so => ./libguide.so (0x00f27000)
libm.so.6 => /lib/tls/libm.so.6 (0x0024f000)
libcxa.so.5 => ./libcxa.so.5 (0x00963000)
libunwind.so.5 => ./libunwind.so.5 (0x00b77000)
libc.so.6 => /lib/tls/libc.so.6 (0x0011c000)
libdl.so.2 => /lib/libdl.so.2 (0x00249000)
/lib/ld-linux.so.2 (0x00103000)
and confirm that there are no "not found" errors.
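The visual check above can be automated. The following is a minimal sketch, not part of the original setup: missing_libs is a hypothetical helper name, and it assumes ldd marks unresolved libraries with the text "not found", as it does in glibc's ldd.

```shell
# Hypothetical helper: read ldd output on stdin and print the name of
# every library that ldd reports as "not found".
# Exit status is 1 if at least one library is missing, 0 otherwise.
missing_libs() {
    awk 'BEGIN { missing = 0 }
         /not found/ { print $1; missing = 1 }
         END { exit (missing ? 1 : 0) }'
}
```

A possible use is
lela% ldd sander | missing_libs
which prints nothing and returns 0 when every library is resolved.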
- Alternatively, you can add "-Wl,-rpath=." to the link flags in the sander Makefile. If this is done, then LD_LIBRARY_PATH does not have to be redefined.
- The file "job" is the Condor job script:
lela% cat job
getenv=true
initialdir = /home/chona/amoeba/softcore_vdw2
Arguments=-i vdw1.mdin -c vdw1.inpcrd -p vdw1.prmtop -o 06.mdout -r 06.restrt
universe = vanilla
should_transfer_files = TRUE
when_to_transfer_output = ON_EXIT
transfer_output_files = 06.mdout,06.restrt
transfer_executable = true
executable=sander
transfer_input = true
transfer_input_files = vdw1.mdin,vdw1.inpcrd,vdw1.prmtop,libsvml.so,libvml.so,libguide.so,libcxa.so.5,libunwind.so.5,bnd.rst
requirements = (OpSys == "LINUX") && (Arch != "dummy")
error = log/sander.err
log = log/sander.log
queue 1
getenv tells Condor to run the executable in the environment inherited from the submission machine
initialdir is the directory on the submission machine to which all file paths (i.e., input files, executables, log and error files) are relative
arguments contains the runtime arguments for the executable
transfer_output_files is a list of files that will be transferred back to the submission machine when the job terminates
executable is the program or script to be run. In this case it is sander, because that is what we run on the backend. Alternatively, this could be a wrapper script.
transfer_input_files is a list of files that will be staged to the execution machine. This is necessary because there is no shared file system between the frontend (lela) and the computational backend. Note that we are transferring shared libraries because sander is a dynamically-linked executable.
error will contain stderr from the job
log is the job log file
queue 1 indicates that we will run sander once on one CPU or core
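As an illustration of the wrapper-script option mentioned for the executable keyword, here is a minimal sketch. The helper name run_with_local_libs and the script name run_sander.sh are hypothetical and do not exist in the example directory. The helper prepends the current directory to LD_LIBRARY_PATH for one command only, so the transferred shared libraries are found on the execution machine.

```shell
# Hypothetical helper: run the given command with "." prepended to
# LD_LIBRARY_PATH. The ${VAR:+...} form avoids a trailing ":" when
# LD_LIBRARY_PATH is unset or empty.
run_with_local_libs() {
    LD_LIBRARY_PATH=".${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" "$@"
}
```

A wrapper script such as run_sander.sh would define this function and end with the line
run_with_local_libs ./sander "$@"
In that case, executable would presumably name the wrapper and sander would be added to transfer_input_files; both changes are assumptions and have not been tested on lela.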
- Please read the condor_submit man page for more detailed information on other Condor job script keywords.
lela% man condor_submit
- Submit the job.
lela% condor_submit job
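Because every input file has to be staged to the execution machine, it can help to confirm before submission that each file named in transfer_input_files actually exists. The sketch below is an assumption-laden helper, not a Condor feature: missing_inputs is a hypothetical name, and it assumes the file list sits on a single comma-separated line, as in the job file above.

```shell
# Hypothetical pre-submission check.
# $1 = Condor job file, $2 = directory the paths are relative to
#      (the value of initialdir; defaults to the current directory).
# Prints each file listed in transfer_input_files that does not exist.
missing_inputs() {
    jobfile=$1
    dir=${2:-.}
    # Keyword matching is case-insensitive; split the value on commas
    # and trim surrounding blanks from each file name.
    awk -F= 'tolower($1) ~ /^[ \t]*transfer_input_files[ \t]*$/ {
        sub(/^[^=]*=/, "")
        n = split($0, files, ",")
        for (i = 1; i <= n; i++) {
            gsub(/^[ \t]+|[ \t]+$/, "", files[i])
            if (files[i] != "") print files[i]
        }
    }' "$jobfile" |
    while IFS= read -r f; do
        if [ ! -e "$dir/$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

For the example job this could be run as
lela% missing_inputs job /home/chona/amoeba/softcore_vdw2
and an empty result means that all listed inputs are present.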
- Monitor the job.
lela% condor_q <jobID>
where <jobID> is the numeric ID returned by condor_submit after the job is successfully submitted.
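To avoid copying the ID by hand, it can be captured from the condor_submit output. This sketch assumes the usual confirmation line of the form "1 job(s) submitted to cluster 42."; cluster_id is a hypothetical helper name, and the pattern may need adjusting if the installed Condor version prints something different.

```shell
# Hypothetical helper: read condor_submit output on stdin and print the
# numeric cluster ID from the "submitted to cluster N." line.
cluster_id() {
    sed -n 's/.*submitted to cluster \([0-9][0-9]*\).*/\1/p'
}
```

For example,
lela% jobid=$(condor_submit job | cluster_id)
stores the ID in the shell variable jobid for use with the monitoring commands.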
- If the job alternates between the Run and Idle states, then there is something wrong with the job script. Usually this is due to files not being copied, dynamic libraries not being found, etc. You can further diagnose the problem using
lela% condor_q -analyze <jobID>
or by reading the job log and error files.
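When the cause is a shared library that was not transferred, the dynamic loader normally writes an "error while loading shared libraries" message to stderr, which ends up in the file named by the error keyword. The following sketch extracts the library names from such messages; unresolved_libs is a hypothetical helper name, and the message format assumed is the one printed by glibc's ld.so.

```shell
# Hypothetical helper: scan a job's stderr file ($1) for dynamic loader
# failures and print each missing library name once.
unresolved_libs() {
    sed -n 's/.*error while loading shared libraries: \([^:]*\):.*/\1/p' "$1" |
        sort -u
}
```

For the example job this could be run as
lela% unresolved_libs log/sander.err
and any library it prints is a candidate for adding to transfer_input_files.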