Sunday 3 February 2013

How do I run large jobs in ABAQUS?


Running Long Programs

To run long jobs, just log in to ts-access and run them. Some familiarity with running things from the Unix/MacOS command line is essential (see A short Unix crib). Familiarity with shell scripts is an advantage. You can run the programs as you normally would from the Linux/MacOS command line, but when running long programs note that:
  • You probably won't be able to give your program interactive input from the keyboard once it's running in the background.
  • If you put & at the end of your command, you can run your program "in the background", freeing up the command line to do other things.
  • If you put nohup at the start of your command, the program won't be killed when you log out.
So once you're logged into a Linux/Unix/MacOS machine at CUED, typing
  slogin ts-access
...
nohup my/program &
will start running my/program and continue running it after you log out of ts-access. Output that would normally appear onscreen goes into a file called nohup.out.
But before you run your program in the background, first check that it starts ok when run normally.
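If you run several jobs, the shared nohup.out file can become confusing. The sketch below shows one possible variation: it sends the output to a named log file and saves the process ID so the job can be checked later. The file names myjob.log and myjob.pid are made up for illustration, and the sh -c '...' command is only a stand-in for your own program.

```shell
# Stand-in for a long job (hypothetical); replace the sh -c '...' part
# with your own command.
logfile="myjob.log"
nohup sh -c 'echo started; sleep 1; echo finished' > "$logfile" 2>&1 < /dev/null &

# $! holds the process ID of the most recent background command.
jobpid=$!
echo "$jobpid" > myjob.pid

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$jobpid" 2>/dev/null; then
    echo "job $jobpid is still running"
else
    echo "job $jobpid has finished"
fi
```

After logging back in, typing tail myjob.log shows the most recent output, and the number in myjob.pid can be looked up in top.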

Matlab

If you run a Matlab job, remember to exit from Matlab at the end of the script or function, because Matlab won't exit automatically. If, for example, you have a file in your home directory called roll2dice.m containing
function answer=roll2dice
answer= randi(6) + randi(6)
exit
you could log into ts-access, type
  nohup matlab -r roll2dice &
then log out of ts-access. Soon in your home directory on CUED's central system you'll have a file called nohup.out containing the output of your program. Matlab will no longer be running on the ts-access machine.

Preparing your code

Try to write your code so that it saves results periodically, and the program can re-start by loading in those results, carrying on from that stage. In this way you can still make progress even if your programs are interrupted by power-cuts, etc.
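As a rough illustration of the save-and-restart idea, here is a minimal shell sketch. The file name checkpoint.txt, the function name run_job and the trivial running sum are all invented for the example; a real program would save whatever state it needs in whatever format suits it.

```shell
# Hypothetical checkpoint file name; any writable path would do.
checkpoint="checkpoint.txt"
total_steps=10

run_job() {
    # Resume from the saved state if a checkpoint exists, otherwise start fresh.
    if [ -f "$checkpoint" ]; then
        read -r step sum < "$checkpoint"
    else
        step=0
        sum=0
    fi

    while [ "$step" -lt "$total_steps" ]; do
        step=$((step + 1))
        sum=$((sum + step))      # stand-in for the real computation
        # Write to a temporary file first, then rename it, so that an
        # interruption never leaves a half-written checkpoint behind.
        echo "$step $sum" > "$checkpoint.tmp"
        mv "$checkpoint.tmp" "$checkpoint"
    done
    echo "final sum: $sum"
}

run_job
```

If the job is killed part-way through, running it again picks up from the last completed step rather than from the beginning.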
Many programs will run much faster if a little thought is given to optimising the code. Once programs run for days, even an improvement of a few percent becomes significant. See
for ways of speeding your programs up.
If your program requires interaction you may need to rewrite it so that interaction isn't required. See the Command line options section for help.
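If rewriting isn't practical, one common workaround (assuming the program reads its answers from standard input) is to put the answers in a file and redirect that file into the program. In the sketch below, ask_and_run is a made-up stand-in for an interactive program, and answers.txt is a hypothetical file name.

```shell
# The answers the program would normally ask for at the keyboard, one per line.
printf '%s\n' 20 0.5 > answers.txt

# Stand-in for an interactive program (hypothetical): reads two values from
# standard input.
ask_and_run() {
    read -r iterations
    read -r tolerance
    echo "running $iterations iterations with tolerance $tolerance"
}

# Redirecting the file into the program replaces the keyboard input.
ask_and_run < answers.txt
```

The same redirection can be combined with nohup and &, so the program never waits for a keyboard that is no longer there.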

Troubleshooting

Your program may fail for several reasons:
  • Using too much CPU time - the system should be set up so that there's no limit on your CPU time. Confirm that by typing
    ulimit -t
    You should get the reply "unlimited". (In most shells, plain ulimit with no option reports the file-size limit rather than the CPU limit.) If you type
    ulimit -a
    you'll get a list of all the limits.
  • Using too much memory - maybe you have a "memory leak". Each time your program goes round a loop it may ask for more memory, until finally there's none left. You can use the "top" program to monitor memory usage. See the Big Processes - Memory issues page for details.
  • The machine was rebooted. For details about when the machine was last rebooted, type
    uptime
  • There's a bug in your code that's only triggered after a certain number of iterations or when arrays reach a certain size (because of an unexpected divide-by-zero, or a variable value that becomes bigger than can fit in a variable of that type, etc)
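When a job dies after several days, it helps to know what limits were in force when it started. The following sketch records a few of them in a file before the job is launched. The function name report_limits and the file name limits.log are made up for illustration, and it assumes a shell whose ulimit accepts the -t, -f and -v options (bash and dash do).

```shell
# Print a few of the limits that most often affect long jobs.
report_limits() {
    echo "cpu time (seconds): $(ulimit -t)"
    echo "max file size (blocks): $(ulimit -f)"
    echo "virtual memory (kbytes): $(ulimit -v)"
}

# Keep a copy alongside the job's other output, then show it.
report_limits > limits.log
cat limits.log
```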
Signals are messages that are sent to processes. Typing
    man 7 signal
will show you a list of them. If your process receives a "SIGSEGV" signal, for example, that generally means it has tried to access a piece of memory it's not allowed to, which typically indicates a bug in the code (most frequently an attempt to dereference a null pointer). Some signals (e.g. "SIGINT") can be caught or ignored if you choose to do so, but "SIGKILL" and "SIGSTOP" can't: SIGKILL will always terminate your program and SIGSTOP will always suspend it. It's possible to add a signal handler to your code to deal with signals. Even if you can't protect your program from being stopped, you might be able to record why it stopped. The Unix Signals and Forking page has some information for C/C++ users.
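For jobs driven by a shell script, the shell's trap command offers a simple kind of signal handler. The sketch below is only an illustration: the job is a stand-in loop, the log file name signal.log is invented, and the script sends the signal itself so that the effect can be seen.

```shell
logfile="signal.log"
rm -f "$logfile"

# A stand-in job (hypothetical) that installs handlers and then loops.
sh -c '
    # On SIGTERM, record what happened and exit with the usual 128+15 status.
    trap "echo caught SIGTERM >> signal.log; exit 143" TERM
    # An empty action means the signal is ignored.
    trap "" HUP
    while :; do sleep 1; done
' &
jobpid=$!

sleep 1                  # give the job time to install its handlers
kill -TERM "$jobpid"     # simulate the job being stopped from outside

status=0
wait "$jobpid" || status=$?
echo "job exited with status $status"
cat "$logfile"
```

A handler like this could also save a checkpoint before exiting. It offers no protection against SIGKILL, which cannot be trapped.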
By tl136 and js138
