Cluster guide
Welcome to the jupiter cluster guide!
This is the documentation for our cluster and its resources.
Jump to
- Logging in and working in the cluster
- Transferring files
- Debugging errors and failed jobs
- A first look into the Linux world
- Software available
- Other resources
- Contributing
Basic usage
Ask Prof. Giovanni for a user account. Once one is granted, you can log in to the cluster using IP 150.162.31.2 and port 2222 as follows:
$ ssh your_user@150.162.31.2 -p 2222
When logging in to the cluster for the first time, please change your password to a strong one.
If you don't want to remember the port and the IP every time you log in, you can register this remote location in your ~/.ssh/config file (you can replace jupiter with any name of your choice):
Host jupiter
HostName 150.162.31.2
User your_user
Port 2222
Now you can log in with:
$ ssh jupiter
You can also set up a passwordless connection by creating an SSH key and making it available on jupiter.
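A minimal sketch of the key setup, using the standard OpenSSH tools ssh-keygen and ssh-copy-id. The function name setup_jupiter_key, the ed25519 key type and the default key path are assumptions made for illustration, and the host name jupiter assumes the ~/.ssh/config entry shown above.

```shell
# setup_jupiter_key [KEYFILE] [HOST]: hypothetical helper, not a cluster tool.
# Creates a key pair if KEYFILE does not exist yet, then installs the public
# key on HOST so that later logins do not ask for a password.
setup_jupiter_key() {
    key="${1:-$HOME/.ssh/id_ed25519}"   # assumed default key location
    host="${2:-jupiter}"                # alias from ~/.ssh/config

    # Generate the key pair only once; ssh-keygen asks for an optional passphrase.
    if [ ! -f "$key" ]; then
        ssh-keygen -t ed25519 -f "$key" || return 1
    fi

    # Append the public key to ~/.ssh/authorized_keys on the remote host.
    ssh-copy-id -i "$key.pub" "$host"
}

# Usage (asks for your cluster password one last time):
#   setup_jupiter_key
```

After this, ssh jupiter should log you in without prompting for the account password.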
Software
ORCA
Usage:
In your $HOME/bin folder on jupiter there should already be a job script that you can use directly on your .inp files.
Your .inp file should specify both the nprocs and maxcore keywords, e.g. for an h2o_opt.inp file:
! HF def2-svp Opt
%pal
nprocs 8
end
%maxcore 2000
*xyzfile h2o.xyz 0 1
If you have any doubts about the input, please check the ORCA Input Library.
To submit this job, run:
$ job h2o_opt.inp 1 8
Here 1 is the number of machines you're using and 8 is the number of processes. More generally:
$ job file.inp $nmachines $nprocs
Use the default values (1 machine and 8 procs) and you'll be fine.
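Before submitting, it can help to confirm that the input really contains both keywords and that its nprocs value agrees with the number passed to job. The sketch below is a hypothetical helper (check_inp is not part of the cluster's job script), and it assumes that the two numbers are meant to match.

```shell
# check_inp FILE NPROCS: hypothetical pre-submission check.
# Returns non-zero if maxcore or nprocs is missing from FILE, or if the
# nprocs value in FILE differs from NPROCS.
check_inp() {
    inp="$1"
    want="$2"

    # ORCA keywords are case-insensitive, hence grep -i.
    if ! grep -qi 'maxcore' "$inp"; then
        echo "$inp: missing %maxcore" >&2
        return 1
    fi

    # Print the field that follows the first "nprocs" keyword.
    got=$(awk '{ for (i = 1; i < NF; i++) if (tolower($i) == "nprocs") { print $(i + 1); exit } }' "$inp")
    if [ -z "$got" ]; then
        echo "$inp: missing nprocs" >&2
        return 1
    fi

    if [ "$got" != "$want" ]; then
        echo "$inp: nprocs is $got but $want was requested" >&2
        return 1
    fi
    echo "$inp: ok ($got procs)"
}

# Usage on the cluster:
#   check_inp h2o_opt.inp 8 && job h2o_opt.inp 1 8
```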
Checking a job status
$ qstat
Or if you want information only on your submitted jobs:
$ qstat -u $USER
Where $USER is your cluster username.
These commands just print the queue usage, without some information that is useful for debugging failed jobs. For more information (for instance, the node your job is running on) run:
$ qstat -as
Transferring files
The most common way of transferring files between local and remote locations (and the easiest if you're just starting) is the scp command.
For instance, if you want to send an opt.inp file located at ~/geem/projects to a /home/your_user/projects/ folder located on jupiter:
$ scp ~/geem/projects/opt.inp jupiter:/home/your_user/projects/
This is shorthand (using the jupiter entry in ~/.ssh/config) for:
$ scp -P 2222 ~/geem/projects/opt.inp your_user@150.162.31.2:/home/your_user/projects/
Or the other way around, from jupiter to your local folder:
$ scp jupiter:/home/your_user/projects/opt.out ~/geem/projects/
There are other options (for instance, rclone) that can also be used to sync files between your local machine and jupiter.
If you're using a Windows machine, a good graphical interface that allows you to drag and drop files is MobaXterm.
Vinicius also has a script that syncs files and submits jobs from your local machine; if you have any questions, talk to him.
Debugging errors and jobs not running
Sometimes your jobs will fail (I know, it sucks!). If you ever find yourself wondering why that happened, this may help you:
The first thing to do is to check whether your job failed because of an issue related to ORCA. To do that, open the output file in your favorite text editor and look at the last lines. A quick alternative is the tail command, which shows you the last lines of a file.
If the issue is related to ORCA itself (for example, an error in the SCF procedure), you should see an error message at the very bottom of the output file. This error should at least clarify what went wrong in your calculation.
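The check described above can be sketched as a small shell function. The name orca_status is hypothetical; the sketch relies on the "ORCA TERMINATED NORMALLY" banner that ORCA prints near the end of a successful run, and treats its absence from the last five lines as a sign that something went wrong.

```shell
# orca_status FILE: hypothetical helper that reports how an ORCA run ended.
orca_status() {
    out="$1"
    # A successful run ends with the banner followed by the total run time,
    # so looking at the last 5 lines is assumed to be enough.
    if tail -n 5 "$out" | grep -q 'ORCA TERMINATED NORMALLY'; then
        echo "$out: finished normally"
    else
        # No banner: show the last 20 lines, where the error message should be.
        echo "$out: no normal-termination banner; last lines follow" >&2
        tail -n 20 "$out" >&2
        return 1
    fi
}

# Usage:
#   orca_status h2o_opt.out
```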
If the output file doesn't contain any error at the end, the issue might be hardware related. In that case, try the following:
- Resubmit the job and check which node it is running on (issue the qstat -as command; once the job starts, you can see which node is running it), for example:
                                                            Req'd  Req'd   Elap
Job ID          Username Queue    Jobname    SessID NDS TSK Memory Time  S Time
--------------- -------- -------- ---------- ------ --- --- ------ ----- - -----
5090.ufsc       giliand* ufsc     tetrazol_*   7964   1   8   10gb 120:0 R 00:59
    Job run at Mon Apr 25 at 10:47 on (himalia:ncpus=8:mem=10485760kb)
5119.ufsc       giliand* ufsc     14_hess.i* 306745   1   8   10gb 120:0 R 32:23
    Job run at Tue Apr 26 at 22:13 on (io:ncpus=8:mem=10485760kb)
In this situation, the job 5090.ufsc is running on himalia and the job 5119.ufsc is running on io.
- Connect through ssh to that node while logged in to jupiter (ssh himalia, for example) and check for temperature issues by running sensors. If you see the temperature of the CPU cores reaching 100 degrees, please contact Matheus or Vinicius.
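The node lookup in the first step can also be scripted. The sketch below defines a hypothetical job_nodes filter that reads qstat -as output in the layout shown above and prints one "job ID, node" pair per running job; it assumes the node name is the text between "(" and the first ":" on the "Job run at" line.

```shell
# job_nodes: hypothetical filter; reads `qstat -as` output on stdin and
# prints "<job id> <node>" for every job that has a "Job run at" line.
job_nodes() {
    awk '
        # Job lines start with an ID such as 5090.ufsc; remember it.
        $1 ~ /^[0-9]+\./ { id = $1; next }
        # The comment line names the node as "(node:ncpus=...)".
        /Job run at/ && match($0, /\([^:]+:/) {
            print id, substr($0, RSTART + 1, RLENGTH - 2)
        }
    '
}

# Usage on the cluster:
#   qstat -as | job_nodes
```

With the example output above, this would list 5090.ufsc next to himalia and 5119.ufsc next to io, so you know which node to ssh into.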
Asking for help
If you ever can't understand what the hell is going on, or if you identify some kind of issue in the cluster, please open an issue in our cluster repository. This way we can keep track of what is happening and also keep an archive of solutions in case the problems happen again.
A first look into the Linux world
You may soon realise that you're now entering a field where we're generally looking at a command line. Don't worry, you'll soon get very used to it and learn to love it. 😄
There are good reasons to use a terminal instead of GUIs (graphical user interfaces), but I will assume that if you're here you already have a slight idea about that.
For now I will only leave this quote here:
I once heard an author say that when you are a child you use a computer by looking at the pictures. When you grow up, you learn to read and write.
Also, here are some articles explaining the basics of using a Unix terminal, if you're new to this world:
- https://ubuntu.com/tutorials/command-line-for-beginners#4-creating-folders-and-files
- https://www.digitalocean.com/community/tutorials/an-introduction-to-the-linux-terminal
- https://maker.pro/linux/tutorial/basic-linux-commands-for-beginners
Other resources
Right now we've only got access to jupiter and Tarsus (there will be a tutorial on using Tarsus too, don't worry), but soon we will have access to CESUP and Santos-Dumont.
- CESUP
- Tarsus (Franca-SP)
- Santos-Dumont
Contributing
This site is open source and made by your dearest lab colleagues. If you feel there should be more information, tutorials or guides here, feel free to contribute. 😄