This wiki page is based on the slides made by Di for the WLCG Collaboration Workshop (January 2007). Thanks to Di :-)
Important places and files
- ${INSTALL_ROOT}/glite/etc/config/glite-*.cfg.xml: check these to verify that the conversion with the YAIM2gLiteConverter tool went fine. This tool transforms YAIM configuration values into gLite XML configuration files:
- /opt/glite/etc/config/glite-wms.cfg.xml: configuration file for the WMS.
- /opt/glite/etc/config/glite-lb.cfg.xml: configuration file for the LB.
- Some other important configuration files:
- /opt/glite/etc/glite_wms.conf: general configuration file.
- /opt/glite/etc/glite_wms_wmproxy_httpd.conf and /opt/glite/etc/glite_wms_wmproxy.gacl: configuration files for the WMproxy.
- /opt/condor-c/etc/condor_config: global configuration file for Condor.
- grid-mapfile related files:
- To map or to blacklist a user: /opt/edg/etc/grid-mapfile-local
What is wrong with my gLite WMS
Error while calling the "NSClient::multi" native api AuthentificationException: Failed to establish security context...
- Check if your DN is in the /etc/grid-security/grid-mapfile file on the WMS.
- If that does not help, restart the network server.
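A quick check on the WMS; the DN and the init script name below are examples and may differ on your installation:
# look for your DN (placeholder shown) in the grid-mapfile:
grep "/DC=ch/DC=cern/CN=Some User" /etc/grid-security/grid-mapfile
# restart the network server if needed (script name assumed):
/opt/glite/etc/init.d/glite-wms-ns restart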
Many jobs stay in running or ready status forever
- Probably the log monitor daemon died and could not be restarted.
- Increase the LogLevel value in the LogMonitor section of the /opt/glite/etc/glite_wms.conf file, as sketched after this list.
- Restart the log monitor daemon to check which log file causes it to crash.
- Remove the corrupted log file from the /var/glite/logmonitor/CondorG.log directory.
- There may be a problem with the script /opt/lcg/sbin/grid_monitor.sh.
- The interlogd daemon got stuck, so it should be restarted.
- A workaround patch is available.
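A sketch of the LogMonitor edit in /opt/glite/etc/glite_wms.conf, assuming its usual ClassAd-style layout (LogLevel 6 is just an example value):
LogMonitor = [
    // keep the existing attributes of the section and raise the verbosity:
    LogLevel = 6;
];
Then restart the daemon (init script name assumed):
/opt/glite/etc/init.d/glite-wms-lm restart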
glite-job-logging-info shows error message "Cannot take token!"
- Check that the edg-gridftp-clients or glite-gridftp-clients package is installed on the WNs.
- Check whether it is possible to globus-url-copy files to and from the WMS from the WNs, as in the test after this list.
- The proxy expired before the job executed and could not be renewed.
- The job was submitted to the batch system, but BLAH did not receive the submission information, so it submitted the job again; the token was then removed by the first running instance.
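A minimal transfer test from a WN; wms.example.org is a placeholder for your WMS host:
# copy a small file to the WMS and back again:
echo hello > /tmp/gftp.test
globus-url-copy file:///tmp/gftp.test gsiftp://wms.example.org/tmp/gftp.test
globus-url-copy gsiftp://wms.example.org/tmp/gftp.test file:///tmp/gftp.test.back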
"Got a job held event, reason: Spooling input data files"
- It may fail with "Globus error 7: authentication with the remote server failed".
- Race condition between the gridmanager on machine A querying the status of the job on machine B and the schedd on machine B releasing the job after file stage-in; fixed in a later version of Condor (which version?).
glite-lb-bkserverd: "Database call failed (the table long_fields is full)" in /var/log/messages
- The LB database reached the 4 GB MySQL table size limit.
- This may cause incomplete log events.
- Increase the per-table limits by executing the following commands in MySQL:
alter table short_fields max_rows=1000000000;
alter table long_fields max_rows=55000000;
alter table states max_rows=9500000;
alter table events max_rows=175000000;
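To check that the new limits are in place (assuming the LB database has its default name lbserver20), look at the Max_data_length column of:
use lbserver20;
show table status like 'long_fields';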
glite-job-logging-info shows: "Cannot read JobWrapper output, both from Condor and from Maradona"
- Similar to the LCG workload management system. More information can be found here.
glite-job-logging-info shows: "Got a job held event reason: The PeriodicHold expression 'Matched = TRUE && CurrentTime > QDate + 900' evaluated to TRUE"
- Condor could not submit the job to the CE within 900 seconds.
- Probably Condor-C on the CE could not be started:
- because the authentication failed.
- Previous launcher jobs failed but are still in the Condor queue; remove them with condor_rm, as in the example after this list.
- The IP address in /etc/hosts is incorrect.
- A firewall may be blocking the connection.
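To inspect and clean up the Condor queue, for example (123.0 is a placeholder job id):
# list held jobs together with their hold reasons:
condor_q -hold
# remove a stale launcher job by its cluster.proc id:
condor_rm 123.0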
glite-job-logging-info shows "Got a job held event, reason: Error connecting to schedd..."
- Condor hit a timeout when connecting to the schedd on the gLite CE.
- Possibly because of an unstable network, or because a disk filled up somewhere.
glite-job-logging-info shows "Got a job held event, reason: Attempts to submit failed"
- The job could not be handed over to the batch system by the non-privileged user resulting from the GRAM/LCMAPS mapping.
- For example, the BLParser is not running on the batch system head node.
Failed to load GSI credential: edg_wl_gss_acquire_cred_gsi() failed
- The locallogger writes log events under /tmp (/var/glite/tmp in version 3.1), but the partition is full; see the check below.
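A quick check of both free space and free inodes (assuming the 3.1 location):
df -h /var/glite/tmp
df -i /var/glite/tmp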
Unable to delegate the credential to the endpoint: https://&lt;WMS host&gt;:7443/glite_wms_wmproxy_server
- The VOMS extension of your proxy is not listed in /opt/glite/etc/glite_wms_wmproxy.gacl; see the example entry below.
- The VOMS extension is missing from your proxy and the WMProxy is configured to support only VOMS proxies.
- WMproxy.log and lcmaps.log can give some clues.
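An illustrative GACL entry allowing a VOMS FQAN to use the WMProxy; the FQAN is a placeholder and the exact entries in your gacl file may differ:
<gacl>
  <entry>
    <voms>
      <fqan>/myvo/Role=NULL/Capability=NULL</fqan>
    </voms>
    <allow><exec/></allow>
  </entry>
</gacl>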
Unable to register the job to the service: https://&lt;WMS host&gt;:7443/glite_wms_wmproxy_server
- The LB or LBProxy is too busy; increase the timeout (see "Increase the timeout" below).
- The LB is in a bad state; restart it as shown below.
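Assuming the standard init script location:
/opt/glite/etc/init.d/glite-lb-bkserverd restart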
Lots of dglogd files in the /var/glite/log directory
- Restart the glite-lb-locallogger service.
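For example (the dglogd file name pattern is an assumption):
# count the pending event files, then restart the locallogger:
ls /var/glite/log/dglogd.log* | wc -l
/opt/glite/etc/init.d/glite-lb-locallogger restart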
Performance and stability improvement on the LB and WMS nodes
See the work log related to the gLite WMS 3.1 here.
Split /var/glite across several hard disk partitions
- Use the script WMS-post-install, found in /root on the WMS node.
- This script migrates the middleware log files, the sandbox, and the MySQL database to the /data01, /data02, and /data03 partitions.
Deploy standalone LB
- One standalone LB (like rb201) can serve several WMSes.
- Create a file /opt/glite/etc/LB-super-users containing the DNs of the allowed WMSes. You can add the DN of your own certificate there in order to debug and retrieve logging information about other users' jobs.
- Add the option "--super-users-file /opt/glite/etc/LB-super-users" to the startup line of the bkserverd daemon, i.e. in the file /opt/glite/etc/init.d/glite-lb-bkserverd (see the variable "super" in the script).
- Add the following line in the WorkloadManagerProxy section of the file /opt/glite/etc/glite_wms.conf:
LBServer = "rb201.cern.ch:9000";
- Restart the glite-wms-wmproxy service.
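A sketch of the whole procedure; the DN and host names are placeholders:
# on the LB node (e.g. rb201): authorize a WMS host DN, then restart the bkserver:
echo "/DC=ch/DC=cern/OU=computers/CN=wms101.cern.ch" >> /opt/glite/etc/LB-super-users
/opt/glite/etc/init.d/glite-lb-bkserverd restart
# on each WMS, after editing glite_wms.conf:
/opt/glite/etc/init.d/glite-wms-wmproxy restart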
Reduce the number of planners per DAG job
- Useful in case there are too many DAG jobs.
- For example, add "DagmanMaxPre = 2" in the JobController section of the file /opt/glite/etc/glite_wms.conf to reduce it to 2, as sketched below.
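A sketch, assuming the usual ClassAd-style layout of glite_wms.conf:
JobController = [
    // keep the existing attributes of the section and add:
    DagmanMaxPre = 2;
];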
Increase the timeout
- Set the GLITE_WMS_QUERY_TIMEOUT and GLITE_PR_TIMEOUT variables in /etc/glite/profile.d/glite-setenv.* and add the line "PassEnv GLITE_PR_TIMEOUT" to /opt/glite/etc/glite_wms_wmproxy_httpd.conf, as in the example below.
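A sketch; the 300-second values are placeholders:
# in /etc/glite/profile.d/glite-setenv.sh (use setenv in the .csh variant):
export GLITE_WMS_QUERY_TIMEOUT=300
export GLITE_PR_TIMEOUT=300
# in /opt/glite/etc/glite_wms_wmproxy_httpd.conf:
PassEnv GLITE_PR_TIMEOUT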
Remove the pending export of job registrations into JP
- If this service is not required by the experiments, you can disable this option by removing the string $maildir from the /opt/glite/etc/init.d/glite-lb-bkserverd startup script and then restarting the service. This prevents the "no more inodes available in /tmp" error message.
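For example, to locate the string before editing it out and then restart:
grep maildir /opt/glite/etc/init.d/glite-lb-bkserverd
# after removing $maildir from the startup line:
/opt/glite/etc/init.d/glite-lb-bkserverd restart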
--
YvanCalas - 09 Mar 2007