---+ The LCG Troubleshooting Guide

<big> *%RED%WARNING:%ENDCOLOR%* many of these entries contain *%RED%OBSOLETE%ENDCOLOR%* information. Please consult the [[https://wiki.egi.eu/wiki/Tools/Manuals/SiteProblemsFollowUp][<b>EGI Wiki</b>]] instead. </big>

%TOC%

---++ YAIM solutions

---+++ Log messages appear twice

*%MAROON%Error%ENDCOLOR%*

Sometimes when running the yaim command, log messages appear twice on the screen.

*%GREEN%Solution%ENDCOLOR%* %Y%

This is because yaim prints its output messages through a 'tail' command (a workaround for some improperly daemonized software). Look for 'tail' processes in your process tree and kill the old ones. This will solve the problem.

---+++ No configuration target has been found.

*%MAROON%Error%ENDCOLOR%*
<verbatim>
ERROR: The node-info for service myservice not found in /opt/glite/yaim/bin/../node-info.d
nor in /opt/glite/yaim/bin/../defaults/node-info.def
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

You can use =yaim -a= to list the available configuration targets. Probably you don't have the corresponding yaim module installed for your configuration target.

---++ Authentication solutions

---+++ 7 authentication failed

*%MAROON%Error%ENDCOLOR%*

This error message can be seen in the job logging information obtained with =edg-job-get-logging-info=. It looks like the following:
<verbatim>
- reason = 7 authentication failed: GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
init.c:497: globus_gss_assist_init_sec_context_async: Error during context initialization
init_sec_context
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%
   * Please refer to the =530 530 No local mapping for Globus ID= entry in this Troubleshooting Guide.
   * To get more information, try to list the server's files with gridftp, if possible:
<verbatim>
edg-gridftp-ls gsiftp://<hostname>/tmp
</verbatim>
   * Please check that your CRLs are up to date (the file date must be very recent - less than 6 hours old).
   * Please check that your host certificate is still valid:
<verbatim>
openssl x509 -in /etc/grid-security/hostcert.pem -noout -enddate
</verbatim>
   * Please check that your grid-mapfile is up to date.
   * If you get this error when submitting a =globus-job-run <ce-name> /bin/hostname= to the affected CE:
<verbatim>
GRAM Job submission failed because authentication failed:
GSS Major Status: Unexpected Gatekeeper or Service Name
GSS Minor Status Error Chain:
init.c:499: globus_gss_assist_init_sec_context_async: Error during context initialization
init_sec_context.c:251: gss_init_sec_context: Mutual authentication failed: The target name (/C=IT/O=ORG/OU=Host/L=INST/CN=server02.domain.net) in the context, and the target name (/CN=host/server01.domain.net) passed to the function do not match (error code 7)
</verbatim>
     then the reverse resolution of the host IP address (server01.domain.net) is not equivalent to what is found in the host certificate (server02.domain.net).
   * Check for the reverse lookup problem in =/etc/hosts= on the client side, or in the DNS configuration (see the sketch below).
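A minimal way to check the reverse-lookup mismatch described above, assuming =192.0.2.10= is the IP address of the affected host (a placeholder): compare the name returned by the DNS with the subject of the host certificate.
<verbatim>
# reverse lookup of the host's IP address (placeholder IP)
host 192.0.2.10

# CN in the host certificate, on the server itself
openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject
</verbatim>
If the two names differ, fix the PTR record in the DNS, or the corresponding entry in =/etc/hosts=.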
---+++ 530 530 No local mapping for Globus ID

*%MAROON%Error%ENDCOLOR%*

Possible causes are the following:
   * If it occurred during job submission, it could be a credential problem.
   * A problem in =/etc/grid-security/grid-mapfile=.
   * A problem with =/opt/edg/etc/edg-mkgridmap.conf=.
   * A problem with the pool accounts.
   * A problem with =/etc/grid-security/gridmapdir=.
   * No files for the pool accounts in =/etc/grid-security/gridmapdir=.
   * The GRIDMAPDIR variable is not set correctly. The gatekeeper and the gridFTP daemon need it in order to be able to use pool accounts. In that case there are no error messages when starting up the gatekeeper; what's more, it even works fine with local accounts (like dteamsgm)!
   * All pool accounts were taken.
   * If the error occurred during job submission, it might be related to the =/opt/edg/etc/lcas/lcas.db= or =/opt/edg/etc/lcmaps/lcmaps.db= files.

*%GREEN%Solution%ENDCOLOR%* %Y%
   * Check if
<verbatim>
globus-url-copy -dbg <from_file> <to_file>
</verbatim>
     complains about CRLs in its long output. If it does, see the topic: Invalid CRL: The available CRL has expired.
   * Check that the grid-mapfile
      * exists and is updated via a cron job:
<verbatim>
30 1,7,13,19 * * * /opt/edg/sbin/edg-mkgridmap --output=/etc/grid-security/grid-mapfile --safe
</verbatim>
      * contains the right values (entries like: ="/C=CH/O=CERN/OU=GRID/CN=Piotr Nyczyk 9654" .dteam=). You should copy a grid-mapfile from a service node on the Grid that you can trust to be configured properly, and compare your node's file with that one.
   * Check that =/opt/edg/etc/edg-mkgridmap.conf= contains correct URLs for the VOs (like: =ldap://lcg-vo.cern.ch/ou=lcg1,o=dteam,dc=lcg,dc=org .dteam=).
   * Check that the pool accounts exist for each supported VO (like: =dteam001=, ... , =dteam050=).
   * Check that the =gridmapdir= directory on the CE/SE has these permissions:
<verbatim>
drwxrwxr-x 2 root root 8192 Nov 29 15:08 gridmapdir
</verbatim>
     and on the Resource Broker:
<verbatim>
drwxrwxr-T 2 root edguser 8192 Nov 29 15:08 gridmapdir
</verbatim>
     (instead of 'T' it can be 't' or 'x')
   * Touch a file in =/etc/grid-security/gridmapdir/= for each pool account, like:
<verbatim>
touch /etc/grid-security/gridmapdir/dteam001
...
touch /etc/grid-security/gridmapdir/dteam050
</verbatim>
   * Set the variable in =/etc/sysconfig/edg= to the following:
<verbatim>
GRIDMAPDIR=/etc/grid-security/gridmapdir/
</verbatim>
   * In =/etc/grid-security/gridmapdir/= there are hard links (with strange names like %2fc%3dch%2fo%3dcern%2fou%3dgrid%2fcn%3dpiotr%20nyczyk%209654) for each pool account that is taken. They have the same inode number ( =ls -li FILENAME= ) as the pool account file they point to (see the sketch below). If there is no free pool account file left, run
<verbatim>
/opt/edg/sbin/lcg-expiregridmapdir.pl
</verbatim>
     and check that the following crontab entry exists on the CE:
<verbatim>
0 5 * * * /opt/edg/sbin/lcg-expiregridmapdir.pl -v 1>>/var/log/lcg-expiregridmapdir.log 2>&1
</verbatim>
   * Example files
      * /opt/edg/etc/lcas/lcas.db
<verbatim>
# LCAS database/plugin list
#
# Format of each line:
# pluginname="<name/path of plugin>", pluginargs="<arguments>"
#
# pluginname=lcas_userallow.mod,pluginargs=allowed_users.db
pluginname=lcas_userban.mod,pluginargs=ban_users.db
pluginname=lcas_timeslots.mod,pluginargs=timeslots.db
pluginname=lcas_plugin_example.mod,pluginargs=arguments
</verbatim>
      * /opt/edg/etc/lcmaps/lcmaps.db
<verbatim>
# LCMAPS policyfile generated by LCFG::lcmaps - DO NOT EDIT
# @(#)/opt/edg/etc/lcmaps/lcmaps.db
#
# where to look for modules
path = /opt/edg/lib/lcmaps/modules

# module definitions
localaccount = "lcmaps_localaccount.mod -gridmapfile /etc/grid-security/grid-mapfile"
poolaccount = "lcmaps_poolaccount.mod -override_inconsistency -gridmapfile /etc/grid-security/grid-mapfile -gridmapdir /etc/grid-security/gridmapdir/"
posixenf = "lcmaps_posix_enf.mod -maxuid 1 -maxpgid 1 -maxsgid 32 "

# policies
standard:
localaccount -> posixenf | poolaccount
poolaccount -> posixenf
</verbatim>
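A quick way to see which pool accounts are currently leased, relying only on the hard-link mechanism described above: a taken pool account file has more than one hard link, a free one has exactly one.
<verbatim>
cd /etc/grid-security/gridmapdir

# pool account files that are currently in use (extra hard link from a DN)
find . -type f -links +1

# number of pool account files still free
find . -type f -links 1 | wc -l
</verbatim>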
---+++ Proxy expired

*%MAROON%Error%ENDCOLOR%*

The (remaining) lifetime of the proxy is less than 30 minutes. After extending it with myproxy-init, edg-job-status returns an error for previously submitted jobs, while new job submissions result in
<verbatim>
**** Error: UI_PROXY_EXPIRED ****
Proxy certificate validity expired
</verbatim>
In the Resource Broker log file (=/var/log/messages=):
<verbatim>
Apr 6 13:14:45 <rb name> edg-wl-renewd[2567]: Proxy lifetime exceeded value of the Condor limit!
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%
   * Check if both proxies are expired (see also the sketch below):
<verbatim>
grid-proxy-info -text
myproxy-info
</verbatim>
   * How much time was left before issuing myproxy-init? If there is less than 30 minutes left on your proxy when executing myproxy-init, the Workload Management System (WMS) will NOT renew your proxy.
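A minimal check of the 30-minute condition above, using the =-timeleft= option of =grid-proxy-info=:
<verbatim>
# seconds of validity left on the local proxy;
# this must be well above 1800 before running myproxy-init
grid-proxy-info -timeleft
</verbatim>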
---+++ 501 501-FTPD GSSAPI error: GSS Major Status: General failure

*%MAROON%Error%ENDCOLOR%*

One gets the following when using =edg-gridftp-ls=:
<verbatim>
Error the server sent an error response: 501
501-FTPD GSSAPI error: GSS Major Status: General failure
501-FTPD GSSAPI error: GSS Minor Status Error Chain:
501-FTPD GSSAPI error:
501-FTPD GSSAPI error: acquire_cred.c:125: gss_acquire_cred: Error with GSI credential
...
501-FTPD GSSAPI error: The host key could not be found in:
501-FTPD GSSAPI error: 1) env. var. X509_USER_KEY=/etc/grid-security/hostkey.pem
501-FTPD GSSAPI error: 2) /etc/grid-security/hostkey.pem
501-FTPD GSSAPI error: 3) /opt/globus/etc/hostkey.pem
501-FTPD GSSAPI error: 4) /root/.globus/hostkey.pem
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%
   * Verify the validity of the host certificate; contact the CA if the certificate has expired.
   * Check that the host certificate permissions are set correctly (644); if not, set them to 644.

---+++ Invalid CRL: The available CRL has expired

*%MAROON%Error%ENDCOLOR%*

One of the possible error messages (returned by an edg-replica-manager command) looks like:
<verbatim>
GridFTP: exist operation failed. the server sent an error response: 535
535-FTPD GSSAPI error: GSS Major Status: Authentication Failed
535-FTPD GSSAPI error: GSS Minor Status Error Chain:
535-FTPD GSSAPI error:
535-FTPD GSSAPI error: accept_sec_context.c:170: gss_accept_sec_context: SSLv3 handshake problems
535-FTPD GSSAPI error: globus_i_gsi_gss_utils.c:881: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
535-FTPD GSSAPI error: globus_i_gsi_gss_utils.c:854: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake
535-FTPD GSSAPI error: OpenSSL Error: s3_srvr.c:1816: in library: SSL routines, function SSL3_GET_CLIENT_CERTIFICATE: no certificate returned
535-FTPD GSSAPI error: globus_gsi_callback.c:351: globus_i_gsi_callback_handshake_callback: Could not verify credential
535-FTPD GSSAPI error: globus_gsi_callback.c:477: globus_i_gsi_callback_cred_verify: Could not verify credential
535-FTPD GSSAPI error: globus_gsi_callback.c:769: globus_i_gsi_callback_check_revoked: Invalid CRL: The available CRL has expired
535 FTPD GSSAPI error: accepting context
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%
   * The certificates in =/etc/grid-security/certificates/= are outdated. Make sure that the CA RPMs (called ca_<SITENAME>, like ca_CERN) are installed and updated to the latest CA release: http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html
   * The periodic update failed. A way to check this is to compare the sizes of the files in =/etc/grid-security/certificates/= with =edg-gridftp-ls= between the node and a server that surely has the right credentials. Run the =edg-fetch-crl= command manually and see if it produces any error messages (see also the sketch below). Make sure that the following crontab entry exists:
<verbatim>
30 1,7,13,19 * * * /opt/edg/etc/cron/edg-fetch-crl-cron
</verbatim>
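A minimal sketch for spotting expired CRLs directly, assuming they are installed in the standard location as =*.r0= files:
<verbatim>
# print the nextUpdate date of every installed CRL;
# any date in the past means that CRL has expired
for crl in /etc/grid-security/certificates/*.r0; do
    echo -n "$crl: "
    openssl crl -in "$crl" -noout -nextupdate
done
</verbatim>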
---+++ Certificate proxy not yet valid

*%MAROON%Error%ENDCOLOR%*

The following error occurred when using the globus-url-copy command:
<verbatim>
error: the server sent an error response: 535
535 Authentication failed: GSSException: Defective credential detected
[Root error message: Certificate C=CH,O=CERN,OU=GRID,CN=Judit Novak 0973,CN=proxy not yet valid.]
[Root exception is org.globus.gsi.proxy.ProxyPathValidatorException: Certificate C=CH,O=CERN,OU=GRID,CN=Judit Novak 0973,CN=proxy not yet valid.]
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

The source and destination nodes were not synchronized in time. Synchronize the nodes!

---+++ 'Bad certificate' returned instead of 'Unknown CA'

*%MAROON%Error%ENDCOLOR%*

Couldn't verify the remote certificate! In SSL, the 'unknown CA' error obtained by the SSL server during the handshake gets translated (by the ssl3_alert_code call) into a generic 'bad certificate' error:
<verbatim>
case SSL_AD_UNKNOWN_CA: return(SSL3_AD_BAD_CERTIFICATE);
</verbatim>
This is sent as an alert to the SSL client during the SSL handshake. The Globus GSI handshake callback (globus_i_gsi_gss_handshake) always casts a 'bad certificate' error, no matter how it was obtained, into a =GLOBUS_GSI_GSSAPI_ERROR_REMOTE_CERT_VERIFY_FAILED=:
<verbatim>
839 /* checks for ssl alert 42 */
840 if (ERR_peek_error() ==
841     ERR_PACK(ERR_LIB_SSL,SSL_F_SSL3_READ_BYTES,
842              SSL_R_SSLV3_ALERT_BAD_CERTIFICATE))
843 {
844     GLOBUS_GSI_GSSAPI_OPENSSL_ERROR_RESULT(
845         minor_status,
846         GLOBUS_GSI_GSSAPI_ERROR_REMOTE_CERT_VERIFY_FAILED,
847         ("Couldn't verify the remote certificate"));
848 }
</verbatim>
So, the error "Couldn't verify the remote certificate" can also mean (among other things, including its literal meaning) "the SSL client certificate was found by the remote SSL server to be issued by an unknown CA". This is quite misleading.

*%GREEN%Solution%ENDCOLOR%* %Y%

The Certification Authority files for the unknown CA are missing in =/etc/grid-security/certificates= or in the directory pointed to by the =X509_CERT_DIR= environment variable. Instructions on how to install the CA files for the Certification Authorities accepted by LCG/EGEE can be found here: http://grid-deployment.web.cern.ch/grid-deployment/lcg2CAlist.html

---++ DPM and LFC solutions

<!-- ***************************************************************************************** -->

---+++ Cannot map principal to local user

*%MAROON%Error%ENDCOLOR%*

You get this error: =cannot map principal to local user=

*%GREEN%Solution%ENDCOLOR%* %Y%

The =/etc/grid-security/gridmapdir= directory should be writable by =lfcmgr= or =dpmmgr=. If you are using another directory, it also has to be writable, and it should be specified in the =/etc/sysconfig/SERVICE_NAME= files.

<!-- ***************************************************************************************** -->

---+++ Problem with MySQL 4.1

*%MAROON%Error%ENDCOLOR%*

When using MySQL 4.1 with either the LFC or the DPM, you get the following error (here in =/var/log/dpns/log=):

=09/23 12:19:41 26938 Cns_opendb: CONNECT error: Client does not support authentication protocol requested by server; consider upgrading Mysql client=

*%GREEN%Solution%ENDCOLOR%* %Y%

According to the MySQL documentation, paragraph A.2.3, there is a very simple solution to this problem: use the OLD_PASSWORD() function instead of the PASSWORD() function when creating the DB account.

<!-- ***************************************************************************************** -->

---+++ service lfcdaemon stop : No valid credential found

*%MAROON%Error%ENDCOLOR%*

You get this:
   * =service lfcdaemon start= is OK,
   * but =service lfcdaemon stop= doesn't work:
<verbatim>
$ service lfcdaemon stop
Stopping lfcdaemon: send2nsd: NS002 - send error : No valid credential found
nsshutdown: Could not establish context
</verbatim>
And trying to create =/grid= as root doesn't work either:
<verbatim>
$ lfc-mkdir /grid
send2nsd: NS002 - send error : No valid credential found
cannot create /grid: Could not establish context
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

Check that:
   * you have a valid host certificate and key,
   * you have copied and renamed them to =/etc/grid-security/lfcmgr= (see the sketch below):
<verbatim>
$ ll /etc/grid-security/ | grep host
-rw-r--r-- 1 root root 5423 May 27 12:35 hostcert.pem
-r-------- 1 root root 1675 May 27 12:35 hostkey.pem
</verbatim>
   * *IMPORTANT : the host certificate and key have to be kept at their original place !!!*
<verbatim>
$ ll /etc/grid-security/lfcmgr | grep lfc
-rw-r--r-- 1 lfcmgr lfcmgr 5423 May 30 13:58 lfccert.pem
-r-------- 1 lfcmgr lfcmgr 1675 May 30 13:58 lfckey.pem
</verbatim>
Check that the CA certificates are present:
<verbatim>
ls /etc/grid-security/certificates/
01621954.0 01621954.crl_url 01621954.info 01621954.r0 01621954.signing_policy
03aa0ecb.0 03aa0ecb.crl_url 03aa0ecb.info 03aa0ecb.r0 03aa0ecb.signing_policy
...
</verbatim>
Get more information with *export CSEC_TRACE=1*:
<verbatim>
$ export CSEC_TRACE=1
$ lfc-mkdir /grid
</verbatim>

*%BLUE%Further help%ENDCOLOR%* %H%

If it still doesn't help, send the =/var/log/lfc/log= file to support@ggus.org (remove the NOSPAM!), together with the output of: =$ cat /proc/lfc_master_pid/environ=
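A minimal sketch of the copy-and-rename step above for the LFC; for the DPM, replace =lfcmgr= and =lfc*.pem= with =dpmmgr= and =dpm*.pem=. The target ownership and permissions match the listings shown above.
<verbatim>
mkdir -p /etc/grid-security/lfcmgr
cp -p /etc/grid-security/hostcert.pem /etc/grid-security/lfcmgr/lfccert.pem
cp -p /etc/grid-security/hostkey.pem /etc/grid-security/lfcmgr/lfckey.pem
chown lfcmgr:lfcmgr /etc/grid-security/lfcmgr/lfccert.pem /etc/grid-security/lfcmgr/lfckey.pem
chmod 644 /etc/grid-security/lfcmgr/lfccert.pem
chmod 400 /etc/grid-security/lfcmgr/lfckey.pem
</verbatim>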
<!-- ***************************************************************************************** -->

---+++ sendrep: NS003 - illegal function 12

*%MAROON%Error%ENDCOLOR%*

You get this:
<verbatim>
$ tail -f /var/log/lfc/log
...
11/23 09:37:13 12001,0 sendrep: NS003 - illegal function 12
...
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

It means you are calling a method that is not allowed after another call has failed. For instance, if an =lfc_opendirg= fails, you cannot call =lfc_closedirg= afterwards. (In LFC/DPM 1.4.1 this is fixed, and the =lfc_closedirg= is automatically ignored.) *The solution is*: check for possible failures in your code, so that =lfc_closedirg= isn't called if =lfc_opendirg= has failed!

<!-- ***************************************************************************************** -->

---+++ No user mapping

*%MAROON%Error%ENDCOLOR%*

You get this error:
<verbatim>
Could not get virtual id: No user mapping !
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

Check this:
   * the permissions/ownership on =/etc/grid-security/gridmapdir=?
   * does the user appear in =/etc/grid-security/grid-mapfile=?
   * aren't all the pool accounts in use?
   * do all the pool accounts exist in /etc/passwd?
   * does /opt/lcg/etc/lcgdm-mapfile exist? If yes, does it contain the user that seems to be missing?

*%BLUE%Further help%ENDCOLOR%* %H%

If the problem still appears, contact support@ggus.org (remove the NOSPAM!), giving:
   * the answers to the previous questions,
   * the version of the LFC/DPM server,
   * the version of the LFC/DPM client,
   * the appropriate logs.

---+++ How to make srmcopy work

Here is a recipe from James Casey (James.Casey@cern.ch) on how to make =srmcopy= work with the DPM:
   * use srmcp to download a file from castor2 to local storage,
   * upload that file from local storage to a DPM,
   * copy it from castor2 to the DPM, in 'push mode',
   * download the file from the DPM to local storage.
------------------------------
<verbatim>
$/opt/d-cache/srm/bin/srmcp srm://castorgridsc:8443/castor/cern.ch/grid/dteam/storage/transfer-test/castor2/s00/file-test.dat file:////tmp/foo
$ls -l /tmp/foo
-rw-r--r-- 1 jamesc zg 2364 Sep 27 16:56 /tmp/foo
</verbatim>
------------------------------
<verbatim>
$/opt/d-cache/srm/bin/srmcp file:////tmp/foo srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo
$dpns-ls -l /dpm/cern.ch/home/dteam/jamesc-foo
-rw-rw-r-- 1 dteam002 cg 2364 Sep 27 17:01 /dpm/cern.ch/home/dteam/jamesc-foo
</verbatim>
-----------------------------------
<verbatim>
$/opt/d-cache/srm/bin/srmcp --debug --pushmode=true srm://castorgridsc.cern.ch:8443/castor/cern.ch/grid/dteam/storage/transfer-test/castor2/s00/file-test.dat srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp
Storage Resource Manager (SRM) CP Client version 1.16
Copyright (c) 2002-2005 Fermi National Accelerator Laborarory
SRM Configuration:
        debug=true
        gsissl=true
        help=false
        pushmode=true
        userproxy=true
        buffer_size=2048
        tcp_buffer_size=0
        stream_num=10
        config_file=/afs/cern.ch/user/j/jamesc/.srmconfig/config.xml
        glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map
        webservice_path=srm/managerv1.wsdl
        webservice_protocol=https
        gsiftpclinet=globus-url-copy
        protocols_list=gsiftp
        save_config_file=null
        srmcphome=/opt/d-cache/srm
        urlcopy=bin/urlcopy.sh
        x509_user_cert=/afs/cern.ch/user/j/jamesc/.globus/usercert.pem
        x509_user_key=/afs/cern.ch/user/j/jamesc/.globus/userkey.pem
        x509_user_proxy=/tmp/x509up_u4290
        x509_user_trusted_certificates=/afs/cern.ch/user/j/jamesc/.globus/certificates
        retry_num=20
        retry_timeout=10000
        wsdl_url=null
        use_urlcopy_script=false
        connect_to_wsdl=false
        delegate=true
        full_delegation=true
        from[0]=srm://castorgridsc.cern.ch:8443/castor/cern.ch/grid/dteam/storage/transfer-test/castor2/s00/file-test.dat
        to=srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp
Tue Sep 27 17:04:35 CEST 2005: starting SRMCopyPushClient
Tue Sep 27 17:04:35 CEST 2005: SRMClient(https,srm/managerv1.wsdl,true)
Tue Sep 27 17:04:35 CEST 2005: connecting to server
Tue Sep 27 17:04:35 CEST 2005: connected to server, obtaining proxy
SRMClientV1 : connecting to srm at httpg://oplapro58.cern.ch:8443/srm/managerv1
Tue Sep 27 17:04:37 CEST 2005: got proxy of type class org.dcache.srm.client.SRMClientV1
Tue Sep 27 17:04:37 CEST 2005: copying srm://castorgridsc.cern.ch:8443/castor/cern.ch/grid/dteam/storage/transfer-test/castor2/s00/file-test.dat into srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp
SRMClientV1 : copy, srcSURLS[0]="srm://castorgridsc.cern.ch:8443/castor/cern.ch/grid/dteam/storage/transfer-test/castor2/s00/file-test.dat"
SRMClientV1 : copy, destSURLS[0]="srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp"
SRMClientV1 : copy, contacting service httpg://oplapro58.cern.ch:8443/srm/managerv1
Tue Sep 27 17:04:40 CEST 2005: srm returned requestId = 618988755
Tue Sep 27 17:04:40 CEST 2005: sleeping 1 seconds ...
Tue Sep 27 17:04:42 CEST 2005: sleeping 1 seconds ...
Tue Sep 27 17:04:44 CEST 2005: sleeping 1 seconds ...
Tue Sep 27 17:04:45 CEST 2005: sleeping 1 seconds ...
Tue Sep 27 17:04:46 CEST 2005: FileRequestStatus fileID = 0 is Done => copying of srm://castorgridsc.cern.ch:8443/castor/cern.ch/grid/dteam/storage/transfer-test/castor2/s00/file-test.dat is complete
</verbatim>
--------------------------------------
<verbatim>
$/opt/d-cache/srm/bin/srmcp --debug srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp file:////tmp/foo2
Storage Resource Manager (SRM) CP Client version 1.16
Copyright (c) 2002-2005 Fermi National Accelerator Laborarory
SRM Configuration:
        debug=true
        gsissl=true
        help=false
        pushmode=false
        userproxy=true
        buffer_size=2048
        tcp_buffer_size=0
        stream_num=10
        config_file=/afs/cern.ch/user/j/jamesc/.srmconfig/config.xml
        glue_mapfile=/opt/d-cache/srm/conf/SRMServerV1.map
        webservice_path=srm/managerv1.wsdl
        webservice_protocol=https
        gsiftpclinet=globus-url-copy
        protocols_list=gsiftp
        save_config_file=null
        srmcphome=/opt/d-cache/srm
        urlcopy=bin/urlcopy.sh
        x509_user_cert=/afs/cern.ch/user/j/jamesc/.globus/usercert.pem
        x509_user_key=/afs/cern.ch/user/j/jamesc/.globus/userkey.pem
        x509_user_proxy=/tmp/x509up_u4290
        x509_user_trusted_certificates=/afs/cern.ch/user/j/jamesc/.globus/certificates
        retry_num=20
        retry_timeout=10000
        wsdl_url=null
        use_urlcopy_script=false
        connect_to_wsdl=false
        delegate=true
        full_delegation=true
        from[0]=srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp
        to=file:////tmp/foo2
Tue Sep 27 18:02:00 CEST 2005: starting SRMGetClient
Tue Sep 27 18:02:00 CEST 2005: SRMClient(https,srm/managerv1.wsdl,true)
Tue Sep 27 18:02:00 CEST 2005: connecting to server
Tue Sep 27 18:02:00 CEST 2005: connected to server, obtaining proxy
SRMClientV1 : connecting to srm at httpg://lxfsrm528.cern.ch:8443/srm/managerv1
Tue Sep 27 18:02:01 CEST 2005: got proxy of type class org.dcache.srm.client.SRMClientV1
SRMClientV1 : get: surls[0]="srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp"
SRMClientV1 : get: protocols[0]="http"
SRMClientV1 : get: protocols[1]="dcap"
SRMClientV1 : get: protocols[2]="gsiftp"
SRMClientV1 : get, contacting service httpg://lxfsrm528.cern.ch:8443/srm/managerv1
doneAddingJobs is false
copy_jobs is empty
Tue Sep 27 18:02:09 CEST 2005: srm returned requestId = 27373
Tue Sep 27 18:02:09 CEST 2005: sleeping 1 seconds ...
Tue Sep 27 18:02:11 CEST 2005: FileRequestStatus with SURL=srm://lxfsrm528:8443/dpm/cern.ch/home/dteam/jamesc-foo-srmcp is Ready
Tue Sep 27 18:02:11 CEST 2005: received TURL=gsiftp://lxfsrm528.cern.ch/lxfsrm528:/shift/lxfsrm528/data01/cg/2005-09-27/jamesc-foo-srmcp.27372.0
doneAddingJobs is false
copy_jobs is not empty
Tue Sep 27 18:02:11 CEST 2005: fileIDs is empty, breaking the loop
copying CopyJob, source = gsiftp://lxfsrm528.cern.ch/lxfsrm528:/shift/lxfsrm528/data01/cg/2005-09-27/jamesc-foo-srmcp.27372.0 destination = file:////tmp/foo2
GridftpClient: memory buffer size is set to 2048
GridftpClient: connecting to lxfsrm528.cern.ch on port 2811
GridftpClient: gridFTPClient tcp buffer size is set to 0
GridftpClient: gridFTPRead started
GridftpClient: parallelism: 10
GridftpClient: waiting for completion of transfer
GridftpClient: gridFtpWrite: starting the transfer in emode from lxfsrm528:/shift/lxfsrm528/data01/cg/2005-09-27/jamesc-foo-srmcp.27372.0
GridftpClient: DiskDataSink.close() called
GridftpClient: gridFTPWrite() wrote 2364bytes
GridftpClient: closing client : org.dcache.srm.util.GridftpClient$FnalGridFTPClient@4be2cc
GridftpClient: closed client
execution of CopyJob, source = gsiftp://lxfsrm528.cern.ch/lxfsrm528:/shift/lxfsrm528/data01/cg/2005-09-27/jamesc-foo-srmcp.27372.0 destination = file:////tmp/foo2 completed
setting file request 0 status to Done
doneAddingJobs is true
copy_jobs is empty
stopping copier
$ls -l /tmp/foo2
-rw-r--r-- 1 jamesc zg 2364 Sep 27 18:02 /tmp/foo2
</verbatim>

<!-- ***************************************************************************************** -->

---+++ No space left on device

*%MAROON%Error%ENDCOLOR%*

You get this with *srmcp*:
<verbatim>
$ srmcp -debug=true file://localhost//tmp/hello srm://dpm01.pic.es:8443/dpm/pic.es/home/dteam/testdir2/test-srmcp
Exception in thread "main" java.io.IOException: rs.state = Failed rs.error = No space left on device
        at gov.fnal.srm.util.SRMPutClient.start(SRMPutClient.java:331)
        at gov.fnal.srm.util.SRMCopy.work(SRMCopy.java:409)
        at gov.fnal.srm.util.SRMCopy.main(SRMCopy.java:242)
Tue Oct 18 15:59:17 CEST 2005: setting all remaining file statuses to "Done"
Tue Oct 18 15:59:17 CEST 2005: setting file request 0 status to Done
SRMClientV1 : getRequestStatus: try #0 failed with error
SRMClientV1 : Invalid state
java.lang.RuntimeException: Invalid state
        at org.dcache.srm.client.SRMClientV1.setFileStatus(SRMClientV1.java:1097)
        at gov.fnal.srm.util.SRMPutClient.run(SRMPutClient.java:362)
        at java.lang.Thread.run(Thread.java:534)
</verbatim>
Or a similar error with *globus-url-copy*, or another utility.

*%GREEN%Solution%ENDCOLOR%* %Y%

The problem is that some utilities request the Permanent file type by default and others the Volatile one, and the transfer fails when the target pool does not accept that type. For instance:
   * srmcp doesn't work if your pool is of *volatile* type,
   * globus-url-copy can fail in the same way when the pool type doesn't match the file type it requests.
You have two possibilities:
   * modify the type of the pool to "-" (this type allows both Volatile and Permanent files):
<verbatim>
dpm-modifypool --poolname <my_pool> --s_type "-"
</verbatim>
   * create two pools, one Volatile and one Permanent.
See the sketch below for how to inspect the current pool types.

*%BLUE%Further help%ENDCOLOR%* %H%

If it still doesn't help, send the relevant DPM log files to support@ggus.org (remove the NOSPAM!).
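The current configuration of the pools, including their space type, can be dumped with =dpm-qryconf= (the command is also mentioned in the migration recipe later in this guide); a minimal sketch, assuming =DPM_HOST= points at your DPM server:
<verbatim>
export DPM_HOST=<my_dpm_server>
# list the pools and their attributes; check the space type of each pool
dpm-qryconf
</verbatim>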
<!-- ******************************************************************************************** -->

---+++ globus-url-copy : Connection closed by remote end

*%MAROON%Error%ENDCOLOR%*

=globus-url-copy file:/etc/group gsiftp://DPM_POOL_NODE/dpm/cern.ch/home/dteam/tests.sophie.shift.conf2=

=error: the server sent an error response: 553 553 /dpm/cern.ch/home/dteam/tests.sophie.shift.conf2: Connection closed by remote end.=

*%MAROON%Is this really what you want to be doing ?%ENDCOLOR%*

The same command with the =DPM_SERVER= instead of the =DPM_POOL_NODE= will work... So, this error only occurs if you try to contact a pool node directly. This is not necessarily what you want to be doing, as it can involve an unnecessary copy if the file finally ends up on another pool node than the one contacted. So, doing this adds load on the DPM setup.

*%GREEN%Solution%ENDCOLOR%* %Y%

If you still want to do this, add this line to =/etc/shift.conf= on the DPM server:

=RFIOD TRUST DPM_server_short_name DPM_server_long_name disk_server1_short_name disk_server1_long_name...=

<!-- **************************************************************************************************** -->

---+++ gLite I/O and DPM

Here is Jean-Philippe's explanation: all physical files on disk belong to a special user "dpmmgr" and are only accessible by this user. RFIOD and gsiFTP, which are launched as root, have been modified to check with the DPNS (DPM Name Server) whether the client is authorized to open (or delete, or ...). RFIOD or gsiFTP then does the open on behalf of the user and returns a handle that can be used in rfio_read/rfio_write ... The disk server must be trusted by the DPNS, using entries in shift.conf of the form:

=DPNS TRUST disk_server1 disk_server2 ...=

The users are mapped using the standard grid-mapfile. If the gliteIO daemon runs with a host/service certificate and is modified to be DPM-aware, i.e. to contact the DPNS, everything is OK. If you do not want to modify the gliteIO daemon, and gliteIO runs as the client, you may still access data on other disk servers using RFIO, but you cannot access the data residing on the same machine as the gliteIO daemon, because in this case the file is seen as local and RFIO does not use RFIOD. One solution which was explained to Gavin and his successors was: it is possible to modify RFIO to use RFIOD even if the file is local. The cost is an extra copy operation between the RFIOD and gliteIO servers. The modification is not very difficult, but it is not very high on our list of priorities either. Please note that you will encounter the same problem with CASTOR as soon as the secure version of CASTOR is released.

<!-- **************************************************************************************************** -->

---+++ How to restrict a pool to a VO

*%MAROON%How to create a pool dedicated to a VO ?%ENDCOLOR%*

It is possible to have one pool dedicated to a given VO, with all the corresponding authorization behind it, using the =dpm-addpool= or =dpm-modifypool= commands. For instance:

=dpm-addpool --poolname VOpool --def_filesize 200M --gid the_VO_gid=

=dpm-addpool --poolname VOpool --def_filesize 200M --group the_VO_group_name=

*%BLUE%Comment%ENDCOLOR%*

If you define:
   * one pool dedicated to =group1= / =VO1=,
   * one pool open to all groups / VOs,
then the *dedicated pool will be used until it is full*. When the dedicated pool is full, the open pool will then be used.
<!-- ************************************************************************************** -->

---+++ globus-url-copy : Permission denied (error 13 on XXX)

*%MAROON%Error%ENDCOLOR%*

You get this:
<verbatim>
$globus-url-copy file:///tmp/hello gsiftp://<dpm_server>/dpm/<domain.name>/home/dteam/testdir2/test
error: the server sent an error response: 553 553 /dpm/<domain.name>/home/dteam/testdir2/test: Permission denied (error 13 on <disk_server>).
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

You might want to check that:
   * the DPM server and the disk server are not on different subnets. If they are, you should create the =/etc/shift.localhosts= file on the DPM server, containing the disk server subnet (as an IP address). For instance:
<verbatim>
$cat /etc/shift.localhosts
212.189.153
</verbatim>
   * the =dpmmgr= user has the same uid/gid on each machine (DPM server and disk server). *Important:* if you change the =dpmmgr= uid/gid, restart all the daemons afterwards.
   * the permissions on the =/dpm/domain.name/home/dteam/testdir= hierarchy are correct.
   * =/etc/shift.conf= on the DPM server contains:
<verbatim>
DPM TRUST <disk_server1_short_name> <disk_server1_long_name> <disk_server2_short_name> <disk_server2_long_name>
DPNS TRUST <disk_server1_short_name> <disk_server1_long_name> <disk_server2_short_name> <disk_server2_long_name>
RFIOD TRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD WTRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD RTRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD XTRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD FTRUST <dpm_server_short_name> <dpm_server_long_name>
</verbatim>
   * =/etc/shift.conf= on the disk server contains:
<verbatim>
RFIOD TRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD WTRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD RTRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD XTRUST <dpm_server_short_name> <dpm_server_long_name>
RFIOD FTRUST <dpm_server_short_name> <dpm_server_long_name>
</verbatim>
   * the permissions of the file system on the disk server are correct: the directory and its subdirectories should have
<verbatim>
ls -lad /data01
drwxrwx--- 365 dpmmgr dpmmgr 8192 Sep 29 09:58 /data01
</verbatim>

*%BLUE%Further help%ENDCOLOR%* %H%

If it still doesn't help, send the =/var/log/rfiod/log= file to support@ggus.org (remove the NOSPAM!).

<!-- ******************************************************************************************** -->

---+++ rfdir : Permission denied (error 13 on XXX)

*%MAROON%Error%ENDCOLOR%*

You get this:
<verbatim>
$ rfdir <my_dpm_host>:/storage
opendir(): <my_dpm_host>:/storage: Permission denied (error 13 on <my_dpm_host>)
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

To use =rfdir= with the DPM, the recipe is:
<verbatim>
$ export DPNS_HOST=<my_dpns_host>
$ rfdir /dpm/cern.ch/home/dteam/
</verbatim>

*%BLUE%Comment%ENDCOLOR%* %P%

To use =rfrm=, you need to set both =DPM_HOST= and =DPNS_HOST=:
<verbatim>
$ export DPNS_HOST=<my_dpns_host>
$ export DPM_HOST=<my_dpm_host>
$ rfrm -r /dpm/cern.ch/home/dteam/tests_sophie
</verbatim>

*%BLUE%Further help%ENDCOLOR%* %H%

If it still doesn't help, send the =/var/log/rfiod/log= file to support@ggus.org (remove the NOSPAM!).

<!-- ********************************************************************************** -->
---+++ 426 426 Data connection. tmp file_open failed

*%MAROON%Error%ENDCOLOR%*

You get this:
<verbatim>
$ lcg-cp -v --vo dteam lfn:essai_node08_3 file:/home/cleroy/node08_node02
Source URL:lfn:essai_node08_3
File size: 202
VO name: dteam
Source URL for copy: gsiftp://MY_DISK_SERVER.cern.ch/MY_DISK_SERVER:/storage/dteam/2005-11-10/file11e39190-5c5a-4a64-bf39-07ef7616186f.171.0
Destination URL: file:/home/cleroy/node08_node02
# streams: 1
# set timeout to 0 (seconds)
0 bytes 0.00 KB/sec avg 0.00 KB/sec instthe server sent an error response: 426 426 Data connection. tmp file_open failed
lcg_cp: Transport endpoint is not connected
</verbatim>
Or this:
<verbatim>
$ globus-url-copy gsiftp://MY_DPM.cern.ch/MY_DPM:/storage/cg/2005-11-14/file356ff811-f30b-412e-bd13-bfb6f0a95634.1.0 file:/tmp/sophie
error: the server sent an error response: 426 426 Data connection. tmp file_open failed
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

It seems that the permissions on =/tmp= are wrong. They should look like this (see the sketch below):
<verbatim>
$ ll -ld /tmp
drwxrwxrwt 14 root root 4096 Nov 14 17:21 /tmp
</verbatim>

*%BLUE%Further help%ENDCOLOR%* %H%

If it still doesn't help, send the =/var/log/messages= file to support@ggus.org (remove the NOSPAM!).

<!-- ****************************************************************************************************** -->
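A minimal fix for the =/tmp= permissions above; mode =1777= gives the world-writable, sticky-bit layout shown in the listing (=drwxrwxrwt=):
<verbatim>
chmod 1777 /tmp
# verify: the mode should now read drwxrwxrwt
ls -ld /tmp
</verbatim>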
---+++ Going from a Classic SE to the DPM

Turning your Classic SE into a DPM is easy: it does not require moving the data in any way. You only need to make the DPM server aware of the files that are present on your Storage Element. In other words, this is only a metadata operation, and no actual file movement is required at all.

*%MAROON% How long will it take ?%ENDCOLOR%*

To give a time estimate, the tests we have performed at CERN took:
   * 4 hours 23 minutes 17 seconds
   * for 236546 files
This gives an average of 14.97 files migrated per second.

*%MAROON% Possible scenarios%ENDCOLOR%*

There are two possibilities:
   * install the DPM servers on the Classic SE, and consider the Classic SE as a pool node as well,
   * install the DPM servers on a different machine, and turn the Classic SE into a DPM pool node.

*%MAROON% Preliminary steps %ENDCOLOR%*

You have to install the DPM servers on a given machine (it can be the Classic SE itself); see the [[DpmAdminGuide][DPM Admin Guide]]. If installed on a different machine, the Classic SE will act as a pool node (= disk server) of the DPM.

*%MAROON%Permissions %ENDCOLOR%*

*Important:* Make sure that the VO group ids and the pool account uids/gids are the same on the Classic SE and on the DPM server. Otherwise, the ownership will not be correctly migrated to the DPM Name Server.

*%MAROON% Get the script %ENDCOLOR%*

To perform the migration, the IT-GD group provides a migration script. You can find it in the CERN central CVS service (repository [[http://isscvs.cern.ch:8180/cgi-bin/cvsweb.cgi/migration-classicSE-DPM/?cvsroot=lcgware&hideattic=0][lcgware/migration-classicSE-DPM]]). You can also download the following tarball: [[%ATTACHURL%/migration-classicSE-DPM.tar.gz][migration-classicSE-DPM.tar.gz]] (_last update: 2005-10-11_).

*Note that a new version of this script is currently being written, in order to manage problems encountered during the migration (for example, migrating the entries to a DPM server that already has entries).*

*%MAROON% Configuration %ENDCOLOR%*

*%MAROON% - on the classic SE %ENDCOLOR%*
   * Stop the =GridFTP= server:
<verbatim>
service globus-gridftp stop
chkconfig globus-gridftp off
</verbatim>
   * Install the DPM-client package.
   * Set the environment variable DPNS_HOST to the DPNS hostname: =export DPNS_HOST=DPNS_HOSTNAME=
   * Put the following lines in =/etc/shift.conf=:
<verbatim>
RFIOD RTRUST SHORT_DPNS_HOSTNAME LONG_DPNS_HOSTNAME
RFIOD WTRUST SHORT_DPNS_HOSTNAME LONG_DPNS_HOSTNAME
RFIOD XTRUST SHORT_DPNS_HOSTNAME LONG_DPNS_HOSTNAME
RFIOD FTRUST SHORT_DPNS_HOSTNAME LONG_DPNS_HOSTNAME
</verbatim>
   * Compile the migration.c file using the =Makefile=: =make all=

*%MAROON% - on the DPNS server %ENDCOLOR%*
   * Put the following line in =/etc/shift.conf=: =DPNS TRUST SHORT_CLASSIC_SE_HOSTNAME LONG_CLASSIC_SE_HOSTNAME=

*%MAROON% Migration %ENDCOLOR%*

Run the following command on the classic SE host:

=./migration classicSE_hostname classicSE_directory dpm_hostname dpm_directory dpm_poolname=

where:
   * =classicSE_hostname= is the (short) name of the classic SE (i.e. without the domain name),
   * =classicSE_directory= is the name of the directory where all the files are stored (for example /storage),
   * =dpm_hostname= is the (short) name of the DPM (i.e. without the domain name),
   * =dpm_directory= is the name of the directory where all the files will be stored (for example /dpm/DOMAIN_NAME/home),
   * =dpm_poolname= is the name of the pool (obtained by using dpm-qryconf) on the DPM.
*Important:* Note that you have to use short hostnames (i.e. do not add the domain name) on the command line.

*%MAROON% Post migration steps %ENDCOLOR%*

If the Classic SE is a separate machine, make sure you turn it into a DPM pool node:

*%MAROON% - on the Classic SE :%ENDCOLOR%*

*Attention: before doing this, make sure that the entries appear in the DPM Name Server as expected!*

Configure the Classic SE to be a pool node:
   * remove the =CASTOR-client= RPM
   * install the =DPM-client=, =DPM-rfio-server= and =DPM-gsiftp-server= RPMs
   * configure security (globus, grid-mapfile, gridmapdir, pool accounts)
   * create the =dpmmgr= user/group (with the same uid/gid as on the DPM server)
   * =chown root:dpmmgr /etc/grid-security/gridmapdir=
   * create =/etc/grid-security/dpmmgr=
   * =chown dpmmgr:dpmmgr /etc/grid-security/dpmmgr=
   * =cp -p /etc/grid-security/hostcert.pem /etc/grid-security/dpmmgr/dpmcert.pem=
   * =cp -p /etc/grid-security/hostkey.pem /etc/grid-security/dpmmgr/dpmkey.pem=
   * =service rfiod start=
   * =service dpm-gsiftp start=
*VERY IMPORTANT:* Change the ownership of all the Classic SE files/directories. *WARNING: before changing the permissions, make sure that all the files have been properly migrated in the DPNS. Once the permissions have been changed, you cannot get the old ones back...*
   * =chown -R dpmmgr:dpmmgr /YOUR_PARTITION=
   * =chmod -R 660 /YOUR_PARTITION=
   * =find /storage -type d -exec chmod 770 {} \;= to have the correct permissions on directories

*%MAROON% - on the DPM server : %ENDCOLOR%*

Create the pool and add the Classic SE file system to it:
   * =export DPM_HOST=YOUR_DPM_SERVER=
   * =dpm-addpool --poolname POOL_NAME --def_filesize 200M= (if the pool doesn't exist yet!)
   * =dpm-addfs --poolname POOL_NAME --server CLASSIC_SE_SHORT_NAME --fs CLASSIC_SE_FILE_SYSTEM=
For more details, refer to the [[DpmAdminGuide][DPM Admin Guide]].

*%MAROON% Catalog %ENDCOLOR%*

The entries that already exist in a catalog (RLS or LFC) won't be migrated. The corresponding entries can still be accessed in the same way as before the migration. For instance:

=lcg-cp --vo dteam=
=sfn://ClassicSE_hostname/storage/dteam/generated/2005-03-29/filef70996ba-ba4e-42dc-9bae-03a3d7e7ac31=
=file:/tmp/test.classic.se.migration.1=

*%MAROON%Information System %ENDCOLOR%*

You have to publish the DPM as an SRM in the Information System. There is no need to publish the Classic SE as such in the Information System.

*%BLUE%Further help %ENDCOLOR%* %H%

Please send all your questions/comments to hep-service-dpm@cern.ch (remove the NOSPAM!) or to [[mailto:yvan.calas@cern.ch?subject=migration-classicSE-DPM][yvan.calas@cern.ch]].

---+++ lcg-cr: Permission denied

*%MAROON% Error %ENDCOLOR%*

You get this *when targeting a DPM Storage Element*:
<verbatim>
$ lcg-cr -v --vo dteam -d se.polgrid.pl -l lfn:/grid/dteam/apadee/test-file-polgrid.pl.1 file:///etc/group
Using grid catalog type: lfc
Using grid catalog : lfc-dteam.cern.ch
lcg_cr: Permission denied
</verbatim>

*%GREEN% Solution %ENDCOLOR%*

It can be that one of the partitions on one disk server is not properly configured. The permissions on *all partitions* should be (see the sketch below):
<verbatim>
$ ll -ld /storage
drwxrwx--- 3 dpmmgr dpmmgr 4096 Nov 14 17:21 /storage
</verbatim>
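A minimal sketch for fixing the partition permissions above on a disk server, assuming =/storage= is the affected partition:
<verbatim>
# the partition must belong to the DPM manager account
chown dpmmgr:dpmmgr /storage
chmod 770 /storage
# verify: should read drwxrwx--- dpmmgr dpmmgr
ls -ld /storage
</verbatim>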
---+++ CGSI-gSOAP: Error reading token data: Success

*%MAROON%Error%ENDCOLOR%*

You get this error:
<verbatim>
CGSI-gSOAP: Error reading token data: Success
</verbatim>
This means that the SRM server has dropped the connection.

*%GREEN%Solution%ENDCOLOR%*

Try to restart the SRM server:
<verbatim>
service srmv1 restart
</verbatim>
If it doesn't help, the reasons can be:
   * a security handshake problem,
   * a =grid-mapfile= or =gridmapdir= problem,
   * one of the server threads crashed (but this has never been seen in production...).
Check:
   * the =/var/log/srmv1/log= and =/var/log/srmv2/log= log files,
   * the permissions/contents of =grid-mapfile= and =gridmapdir=,
   * that all the DPM ports are open.
Set the following environment variables:
<verbatim>
$CGSI_TRACE=1
$CGSI_TRACEFILE=/tmp/tracefile
</verbatim>
and see if the error messages contained in =/tmp/tracefile= help.

---+++ Error response 550: 550 - not a plain file

*%MAROON%Error%ENDCOLOR%*

For instance, you get this:
<verbatim>
$ lcg-cp srm://grid05.lal.in2p3.fr:8443/dpm/lal.in2p3.fr/home/atlas/dq2/file.11 /tmp/test --vo dteam
the server sent an error response: 550 550 grid07.lal.in2p3.fr:/dpmpart/part1/atlas/2006-04-29/file.11.29648.0: not a plain file.
lcg_cp: Invalid argument
</verbatim>
But the file exists in the DPM Name Server:
<verbatim>
$ dpns-ls -l /dpm/lal.in2p3.fr/home/atlas/dq2/csc11.root.11
-rw-rw-r-- 1 19478 20008 28472534 Apr 29 23:23 /dpm/lal.in2p3.fr/home/atlas/dq2/csc11.root.11
</verbatim>

*%GREEN%Solution 1%ENDCOLOR%*

Although it appears in the DPM namespace, the file doesn't *physically* exist on disk anymore. You should unregister the file from the namespace, to avoid this inconsistency.

*%GREEN%Solution 2%ENDCOLOR%*

Check that *on all disk servers* you are actually running:
   * the DPM RFIO server, and not the CASTOR one,
   * the DPM GRIDFTP server, and not the Classic SE GRIDFTP one:
<verbatim>
$ ps -ef|grep rfio
root 20313 1 0 Sep19 ? 00:00:10 /opt/lcg/bin/rfiod -sl -f /var/log/rfio/log
$ ps -ef|grep ftp
root 20291 1 0 Sep19 ? 00:00:03 /opt/lcg/sbin/dpm.ftpd -i -X -L -l -S -p 2811 -u 002 -o -a -Z /var/log/dpm-gsiftp/dpm-gsiftp.log
</verbatim>
Also check that:
   * the =dpmmgr= user has been created before =rfiod= and =dpm-gsiftp= were started,
   * the =dpmmgr= user has the same uid and gid *on all disk servers*.

---+++ LFC daemon crashes with Oracle database 10gR2

*%MAROON%Error%ENDCOLOR%*

The LFC daemon crashes regularly with an Oracle 10gR2 database backend. What can I do?

*%GREEN%Solution%ENDCOLOR%*

You have to use the 10gR2 Oracle Instant Client instead of the 10gR1 one. Remember to change =$ORACLE_HOME= in =/etc/sysconfig/lfcdaemon= to point to the right directory, and restart the service:
<verbatim>
$ service lfcdaemon restart
</verbatim>
For further help: get a core dump, by uncommenting the following line in =/etc/sysconfig/lfcdaemon=:
<verbatim>
#ALLOW_COREDUMP="yes"
</verbatim>
and restarting the service:
<verbatim>
$ service lfcdaemon restart
</verbatim>
The core dump will appear under =/home/lfcmgr/lfc=. Put the core dump in a public location, and send this location to helpdesk@ggus.org (remove the NOSPAM!): your ROC will help you, and contact the appropriate experts if needed.

---+++ File exists

*%MAROON%Error%ENDCOLOR%*

You get this error:
<verbatim>
lfc-rm /grid/atlas/tests/file1
/grid/atlas/tests/file1: File exists
</verbatim>
or this:
<verbatim>
dpns-rm /dpm/in2p3.fr/home/auvergrid/tests/file1
/dpm/in2p3.fr/home/auvergrid/tests/file1: File exists
</verbatim>

*%GREEN%Solution%ENDCOLOR%*

=lfc-rm= and =dpns-rm= remove the entry in the Name Server only, not the physical file itself. The =File exists= error means that there are still physical replicas attached to the Name Server entry. To remove both the physical and the logical files, you can:
   * use =lcg_util=,
   * use =rfrm= (in the DPM case).

---+++ VOMS signature error

*%MAROON%Error%ENDCOLOR%*

You get this error in =/var/log/lfc/log= or =/var/log/dpns/log=:
<verbatim>
05/19 12:05:13 16051,0 Cns_serv: Could not establish security context: _Csec_get_voms_creds: VOMS Signature error (failure)!
</verbatim>

*%GREEN%Solution%ENDCOLOR%*

On the LFC/DPNS machine, the host certificate of your VO's VOMS server is missing in =/etc/grid-security/vomsdir= (see the sketch below). For instance:
<verbatim>
$ ls /etc/grid-security/vomsdir | sort
cclcgvomsli01.in2p3.fr.43
lcg-voms.cern.ch.1265
voms.cern.ch.1877
voms.cern.ch.963
</verbatim>
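A minimal sketch for installing a missing VOMS server host certificate on the LFC/DPNS machine; the certificate file name is a placeholder - obtain the actual file from the VO or from the VOMS server administrators:
<verbatim>
mkdir -p /etc/grid-security/vomsdir
# placeholder file name; use the certificate of your VO's VOMS server
cp /path/to/voms.example.org.pem /etc/grid-security/vomsdir/
ls /etc/grid-security/vomsdir
</verbatim>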
---+++ grid-proxy-init OK, but voms-proxy-init NOT OK

*%MAROON%Problem%ENDCOLOR%*

For a given user, usage of the LFC/DPM with:
   * grid-proxy-init or a simple voms-proxy-init works fine,
   * voms-proxy-init -voms doesn't work.

*%GREEN%Solutions%ENDCOLOR%*

Wrong VOMS setup. Check the VOMS setup on:
   * the UI,
   * the LFC / DPM server.
On the LFC and the UI, /etc/grid-security/vomsdir must contain the VO VOMS server certificates:
<verbatim>
$ ls -ld /etc/grid-security/vomsdir/
drwxr-xr-x 2 root root 4096 Jun 8 15:07 /etc/grid-security/vomsdir/
$ ls /etc/grid-security/vomsdir
cclcgvomsli01.in2p3.fr.43
lcg-voms.cern.ch.1265
</verbatim>
On the UI (client), /opt/glite/etc/vomses should contain:
<verbatim>
$ ls /opt/glite/etc/vomses
alice-lcg-voms.cern.ch
alice-voms.cern.ch
</verbatim>
The user may also be using several different VOMS roles. For details, see LFC and DPM internal virtual ids: the same user with two different VOMS roles will be mapped to two different internal virtual gids. To grant privileges to other VOMS roles on given directories/files, use lfc-setacl (see man lfc-setacl).

---+++ lcg_utils : "Invalid Argument" error

*%MAROON%Error%ENDCOLOR%*

An =lcg_util= command returns the =Invalid Argument= error.

*%GREEN%Solution%ENDCOLOR%*

It usually means that there is a problem with the information published by the Information System, either:
   * for the LFC, or
   * for the Storage Element.

---+++ "Could not establish security context: Connection dropped by remote end !"

*%MAROON%Error%ENDCOLOR%*

This error appears in the LFC/DPM log file:
<verbatim>
07/28 10:08:22 18550,0 Cns_serv: Could not establish security context: _Csec_recv_token: Connection dropped by remote end !
</verbatim>

*%GREEN%Explanation%ENDCOLOR%*

This is not a problem. This warning only means that the LFC/DPM client dropped the connection itself. For instance, it appears in the server log file if a user doesn't have a valid proxy:
<verbatim>
$ lfc-ls /
send2nsd: NS002 - send error : No valid credential found
/: Bad credentials
</verbatim>

---+++ What to do if the DN of a user changes ?

*%MAROON%Problem%ENDCOLOR%*

The DN of a user changes. What does the LFC/DPM admin have to do, so that the user can still access her files?

*%MAROON%Problem%ENDCOLOR%*

The name of a group/VO changes. What does the LFC/DPM admin have to do, so that the permissions remain correct?

*%GREEN%Solution%ENDCOLOR%*

Use the =lfc-modifyusrmap= or =lfc-modifygrpmap= commands. See =man lfc-modifyusrmap= and =man lfc-modifygrpmap=.

---+++ What to do if the host certificate expired or is going to be changed

*%MAROON%Problem%ENDCOLOR%*

The LFC or DPM server host certificate will expire soon.

*%GREEN%Solution%ENDCOLOR%*

Replace the old host certificate and key:
<verbatim>
$ ll /etc/grid-security/ | grep host
-rw-r--r-- 1 root root 5423 May 27 12:35 hostcert.pem
-r-------- 1 root root 1675 May 27 12:35 hostkey.pem
</verbatim>
At the same time, a renamed copy of them has to be put under:
<verbatim>
$ ll /etc/grid-security/lfcmgr | grep lfc
-rw-r--r-- 1 lfcmgr lfcmgr 5423 May 30 13:58 lfccert.pem
-r-------- 1 lfcmgr lfcmgr 1675 May 30 13:58 lfckey.pem
</verbatim>
*You don't need to restart any of the services then.*

Note: replace =lfcmgr= with =dpmmgr= for the DPM.

---+++ How do ACLs work ?

*%BLUE%Question%ENDCOLOR%*

How do ACLs work in the LFC or DPM Name Server?

*%GREEN%Answer%ENDCOLOR%*

ACLs are standard POSIX ACLs. For details, see =man lfc-setacl= or =man dpns-setacl=. If the same file has several Logical File Names (LFNs), this file has:
   * a primary LFN,
   * secondary LFNs: they are implemented as symlinks, and have dummy =777= permissions.
When an LFN (primary or secondary) is accessed, the permissions/ACLs on the primary LFN are checked. (See the sketch below.)
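As an illustration, a hypothetical =lfc-setacl= invocation granting a group read and execute access on a directory; the entry syntax is modeled on the POSIX =setfacl= one, so check =man lfc-setacl= for the exact form supported by your version:
<verbatim>
# grant virtual gid 105 read+execute on the directory, with a matching mask
lfc-setacl -m g:105:rx,m:rx /grid/dteam/mydir
# inspect the resulting ACL
lfc-getacl /grid/dteam/mydir
</verbatim>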
---+++ How to know all the files residing on a given SE ?

*%BLUE%Question%ENDCOLOR%*

How can I know all the replicas stored on a given Storage Element?

*%GREEN%Answer%ENDCOLOR%*

The "lfc_listreplicax" method allows one to do this: it lists all the replica entries stored in the LFC for a given server. It is available in:
   * the LFC C API,
   * the LFC Python interface,
   * the LFC Perl interface.
See =man lfc_listreplicax=.

*%BLUE%Warning%ENDCOLOR%*

This method is based on the =host= field in the =Cns_file_replica= table. Be aware that *some VOs don't store the actual server machine name in the* =host= *field*! For instance, in its LFC central server, LHCb stores =CERN_Castor= instead of =castorsrm.cern.ch=... In the future, =srmLs= can be used too, but it has to be implemented for all Storage Element types first.

---++ R-GMA solutions

---+++ General, very simple R-GMA test

*%MAROON%Question%ENDCOLOR%*

How can I test if I've set up R-GMA correctly?

*%GREEN%Answer%ENDCOLOR%* %Y%

The R-GMA developers provide 2 scripts for testing the installation:
<verbatim>
/opt/edg/bin/rgma-client-check
/opt/edg/bin/rgma-server-check
</verbatim>

---+++ Which logs should I back up for accounting purposes?

*%MAROON%Question%ENDCOLOR%*

I need to know which logs to back up for accounting purposes.

*%GREEN%Answer%ENDCOLOR%* %Y%

This question is answered on the Accounting FAQ page at the UK GOC; the list, in short, comprises:
   * Gatekeeper logs: /var/log/globus-gatekeeper.log.*
   * Job Manager logs: /var/spool/pbs/server_priv/accounting/*
   * System logs: /var/log/messages*

*%BLUE%Note%ENDCOLOR%* %P%

Note that there may be other logs that it is necessary to retain for security audit reasons.

---+++ Failed to get list of tables from the Schema

*%MAROON% Error %ENDCOLOR%*

Something like this one:
<verbatim>
================================================================
You are connected to the following R-GMA Schema service:
https://lcgic01.gridpp.rl.ac.uk:8443/R-GMA/SchemaServlet
WARNING: failed to get list of tables from the Schema
==============================================================
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

Generally this error message appears when one tries to connect to a secure R-GMA server a) without a user proxy, or b) with a user proxy, but with the =X509_USER_PROXY= environment variable not pointing to the proxy.

*%BLUE%Comment%ENDCOLOR%* %P%

Note that =grid-proxy-init= does not set the value of the =X509_USER_PROXY= variable (see the sketch below).
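A minimal sketch of the fix: make the variable point at the proxy that =grid-proxy-init= created, assuming the default location =/tmp/x509up_u<uid>=:
<verbatim>
grid-proxy-init
# grid-proxy-init does not set this variable itself
export X509_USER_PROXY=/tmp/x509up_u$(id -u)
</verbatim>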
---+++ Problems with =rgma-client-check=

---++++ Unable to source /opt/edg/etc/profile.d/edg-rgma-env.sh

*%MAROON%Error%ENDCOLOR%*

Running the R-GMA client checking script:
<verbatim>
/opt/edg/sbin/test/edg-rgma-run-examples
Unable to source /opt/edg/etc/profile.d/edg-rgma-env.sh
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

R-GMA has not been configured. Configure R-GMA.

---++++ RGMA_HOME is not set

*%MAROON%Error%ENDCOLOR%*

Running the R-GMA client checking script:
<verbatim>
/opt/edg/bin/rgma-client-check
RGMA_HOME is not set
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

R-GMA is not configured. Configure R-GMA, or set the environment variable RGMA_HOME.

---++++ No C++ compiler found

*%MAROON%Error%ENDCOLOR%*

Running =rgma-client-check= gives:
<verbatim>
/opt/edg/sbin/test/edg-rgma-run-examples
Configuring...
No C++ compiler found
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

This testing script requires a C++ compiler to complete successfully. Install both the =gcc-c++= and =openssl-devel= packages for the operating system.

---++++ Cannot declareTable: table description not defined in the Schema

*%MAROON%Error%ENDCOLOR%*

Running =rgma-client-check= gives:
<verbatim>
/opt/edg/bin/rgma-client-check
*** Running R-GMA client tests on cmsfarmbl12.lnl.infn.it ***
Checking C API: Failed to declare table.
Failure
Checking C++ API: R-GMA application error in PrimaryProducer.
Cannot declareTable: table description not defined in the Schema
Success
Checking Python API: RGMA Error
StreamProducer__declareTable_StringString:Cannot declareTable: table description not defined in the Schema
Failure
Checking Java API: R-GMA application error in PrimaryProducer.
org.glite.rgma.RGMAException: Unknown RGMA Exception: Cannot declareTable: table description not defined in the Schema
        at org.glite.rgma.stubs.PrimaryProducerStub.declareTable(Unknown Source)
        at PrimaryProducerExample.main(Unknown Source)
Failure
Checking for safe arrival of tuples, please wait...
ERROR: Failed to instantiate Consumer
There should be 4 tuples, there was only:
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

The Registry servlet has a hosts allow file, and the site R-GMA server machine is not registered in this file. Running:
<verbatim>
wget http://lcgic01.gridpp.rl.ac.uk:8080/R-GMA/SchemaServlet
cat SchemaServlet
<?xml version = '1.0' encoding='UTF-8' standalone='no'?>
<edg:XMLResponse xmlns:edg='http://www.edg.org'>
<XMLException type="SchemaException" source="Servlet" isRecoverable="false">
<message>cannot service request, client hostname is currently being blocked</message>
</XMLException>
</edg:XMLResponse>
</verbatim>
shows that the host you are running this command on is currently blocked. Send a mail to lcg-support@gridpp.rl.ac.uk for the allow list to be extended with the machine running the R-GMA server. In the email, specify the full machine name as well as the full domain. For instance:
<verbatim>
Hi,
Please could you add MY-SITE to the R-GMA Registry.
R-GMA Server : mon.my-site.my-domain
Domain : my-domain
</verbatim>

---++++ libgcj-java-placeholder.sh

*%MAROON%Error%ENDCOLOR%*

Running =/opt/edg/bin/rgma-client-check= gives:
<verbatim>
/opt/edg/bin/rgma-client-check
Checking C API: Done.
Success
Checking C++ API: Success
Checking Python API: Success
Checking Java API: libgcj-java-placeholder.sh
This script is a placeholder for the /usr/bin/java and /usr/bin/javac master links required by jpackage.org conventions. libgcj's rmiregistry, rmic and jar tools are now slave symlinks to these masters, and are managed by the alternatives(8) system. This change was necessary because the rmiregistry, rmic and jar tools installed by previous versions of libgcj conflicted with symlinks installed by jpackage.org JVM packages.
Success
Checking for safe arrival of tuples, please wait...
There should be 4 tuples, there was only:
| C producer |
| C++ producer |
| Python producer |
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%

The default installation of Linux puts a placeholder for the java command, and this is being picked up instead of the proper java command. Make sure that Java has been installed and that the java command is found in the path before the placeholder.

---++++ Connection refused

*%MAROON%Error%ENDCOLOR%*

Running =/opt/edg/bin/rgma-client-check= gives:
<verbatim>
*** Running R-GMA client tests on alifarm19.ct.infn.it ***
Checking C API: Failed to create producer.
Failure
Checking C++ API: R-GMA application error in PrimaryProducer.
Cannot open connection to servlet: Connection refused
Success
Checking Python API: RGMA Error
Failed to instantiate StreamProducer
Failure
Checking Java API: Failed to contact PrimaryProducer service.
org.glite.rgma.RemoteException
        at org.glite.rgma.stubs.ProducerFactoryStub.createPrimaryProducer(Unknown Source)
        at PrimaryProducerExample.main(Unknown Source)
Failure
Checking for safe arrival of tuples, please wait...
ERROR: Failed to instantiate Consumer
There should be 4 tuples, there was only:
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% Tomcat and the servlets are not up and running. Restart Tomcat and check the Tomcat logs for errors. As root do the following:
<verbatim>
/etc/rc.d/init.d/tomcat5 stop    (use Ctrl-C if this hangs.)
su - tomcat4 -c 'killall -9 java'
rm -f /var/log/tomcat5/catalina.out
/etc/rc.d/init.d/tomcat5 start
tail -f /var/log/tomcat5/catalina.out
</verbatim>

*%BLUE%Note%ENDCOLOR%* %P% Note: tomcat5 runs as user tomcat4 !!!

---++++ HTML returned instead of XML

*%MAROON%Error%ENDCOLOR%* Running =/opt/edg/bin/rgma-client-check= gives:
<verbatim>
/opt/edg/bin/rgma-client-check
*** Running R-GMA client tests on node064.lancs.pygrid ***
Checking C API: Failed to create producer. Failure
Checking C++ API: R-GMA application error in PrimaryProducer.
HTML returned instead of XML.
This usually means either there is a problem with the proxy cache,
e.g. it is unable to find the R-GMA server; or an unhandled exception
in the R-GMA servlet. The title of the HTML document is:
ERROR: The requested URL could not be retrieved
Success
Checking Python API: RGMA Error
Failed to instantiate StreamProducer
Failure
Checking Java API: Failed to contact PrimaryProducer service.
org.glite.rgma.RemoteException
at org.glite.rgma.stubs.ProducerFactoryStub.createPrimaryProducer(Unknown Source)
at PrimaryProducerExample.main(Unknown Source)
Failure
Checking for safe arrival of tuples, please wait...
ERROR: Failed to instantiate Consumer
There should be 4 tuples, there was only:
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% A previous configuration script for R-GMA removed some jar files that were deployed in the Tomcat RPM. Checking the RPM shows the error:
<verbatim>
rpm -V tomcat4
......GT c /etc/tomcat4/server.xml
SM5..U.T c /etc/tomcat4/tomcat-users.xml
S.5....T c /etc/tomcat4/tomcat4.conf
missing /var/tomcat4/common/endorsed/jaxp_parser_impl.jar
missing /var/tomcat4/common/endorsed/xml-commons-apis.jar
</verbatim>
Re-install tomcat4!

---++++ No tuples returned

*%MAROON%Error%ENDCOLOR%* Running =/opt/edg/bin/rgma-client-check= gives:
<verbatim>
/opt/edg/bin/rgma-client-check
*** Running R-GMA client tests on bf35.tier2.hep.man.ac.uk ***
Checking C API: Done. Success
Checking C++ API: Success
Checking Python API: Success
Checking Java API: Success
Checking for safe arrival of tuples, please wait...
There should be 4 tuples, there was only:
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y%
   * The clocks could be out and the producers are probably being cleaned up as soon as they have been created. Check that the time is correct. NTP needs to be running on all nodes.
   * Port 8088 could be blocked by a firewall. Run rgma-server-check on the R-GMA server and open port 8088 in the firewall if it reports that it is blocked.

---++++ Object has been closed: 1949004681

*%MAROON%Error%ENDCOLOR%* Running =/opt/edg/bin/rgma-client-check= gives:
<verbatim>
+ /opt/edg/bin/rgma-client-check
*** Running R-GMA client tests on egeewn14.ifca.org.es ***
Checking C API: Done. Success
Checking C++ API: Success
Checking Python API: Success
Checking Java API: Success
Checking for safe arrival of tuples, please wait...
ERROR: Consumer__isExecuting:Servlet not accessible, API has been closed
Caused by: Object has been closed: 1949004681
There should be 4 tuples, there was only:
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% The clocks could be out and the producers are probably being cleaned up as soon as they have been created. Check that the time is correct. NTP needs to be running on all nodes, including the R-GMA servlet box.
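A quick way to check the clocks, assuming the standard NTP client tools are installed (an assumption; adapt to your setup):
<verbatim>
# Compare the local clock against UTC on each node involved
date -u
# List the NTP peers; a large "offset" value indicates clock drift
ntpq -p
</verbatim>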
---++++ Unable to locate an available Registry Service

*%MAROON%Error%ENDCOLOR%* Running =/opt/edg/bin/rgma-client-check= gives:
<verbatim>
/opt/edg/bin/rgma-client-check
*** Running R-GMA client tests on PAKWN1.pakgrid.org.pk ***
Checking C API: Failed to create producer. Failure
Checking C++ API: R-GMA application error in PrimaryProducer.
Unable to locate an available Registry Service
Success
Checking Python API: RGMA Error
Failed to instantiate StreamProducer
Failure
Checking Java API: R-GMA application error in PrimaryProducer.
org.glite.rgma.RGMAException: Unable to locate an available Registry Service
at org.glite.rgma.stubs.ProducerFactoryStub.createPrimaryProducer(Unknown Source)
at PrimaryProducerExample.main(Unknown Source)
Failure
Checking for safe arrival of tuples, please wait...
ERROR: Failed to instantiate Consumer
There should be 4 tuples, there was only:
*** R-GMA client test failed ***
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% The configuration on the R-GMA server is incorrect. Using the R-GMA browser on the R-GMA server and looking at "Table Sets" should show an error message:
<verbatim>
Cannot connect to servlet:
</verbatim>
Correctly configure the R-GMA server to point to the correct Registry and Schema.

---++++ cannot remove `/tmp/cmds.sql': Operation not permitted

*%MAROON%Error%ENDCOLOR%* Running =/opt/edg/bin/rgma-client-check= gives:
<verbatim>
Checking for safe arrival of tuples, please wait...
/opt/edg/bin/rgma-client-check: line 99: /tmp/cmds.sql: Permission denied
There should be 4 tuples, there was only:
rm: cannot remove `/tmp/cmds.sql': Operation not permitted
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% The file has probably been created when the client check script was run as root or as another pool account, so the current pool account is unable to delete it. Delete the file. A fix is included in the latest version of R-GMA and will arrive with the next deployment.

---++ Information System (and EGEE.BDII) solutions

---+++ General considerations

LCG uses an LDAP based information system. The LCG information system consists of four distinct parts: the Generic Information Provider (GIP), the MDS GRIS, the site EGEE.BDII and the top level EGEE.BDII. All the information is produced by the information provider; everything else is the transport mechanism. If there are any problems with the information then the information provider will need to be investigated.

Each site should produce the following information:
   * One =SiteInfo= entry.
   * One =GlueCluster= and =GlueSubCluster= entry per cluster.
   * One =GlueCE=, =GlueCESEBind= and =GlueCESEBindGroup= entry per queue.
   * One =GlueSE= and =GlueSL= entry per Storage Element.
   * One =GlueSA= entry per VO.

If the correct information for the site is in the top level EGEE.BDII then there is usually no problem. For this reason we can take a top-down approach for troubleshooting. See the following 4 entries in this topic.

---+++ Check that the information is in the top level EGEE.BDII

The following query can be used to extract the information about the site from the top level EGEE.BDII. Replace bdii-host.invalid with the EGEE.BDII host and domain.invalid with the domain name of the site. The query assumes that the sysAdminContact mail address contains the domain name of the site.
<verbatim>
ldapsearch -LLL -x -h bdii-host.invalid -p 2170 -b o=grid\
 '(|(GlueChunkKey=*domain.invalid)(GlueForeignKey=*domain.invalid)(GlueInformationServiceURL=*domain.invalid*)\
(GlueCESEBindSEUniqueID=*.domain.invalid)\
(GlueCESEBindGroupSEUniqueID=*domain.invalid)(sysAdminContact=*domain.invalid))'
</verbatim>
Adding
<verbatim>
dn | grep dn | cut -d "," -f 1
</verbatim>
to the end of the command will show just the entries.
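For illustration, the complete entry-listing variant would then look like this (a sketch: the filter is abbreviated here; use the full filter from the query above):
<verbatim>
ldapsearch -LLL -x -h bdii-host.invalid -p 2170 -b o=grid \
  '(|(GlueChunkKey=*domain.invalid)(GlueForeignKey=*domain.invalid))' \
  dn | grep dn | cut -d "," -f 1
</verbatim>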
---+++ Check that the information is in the site level EGEE.BDII

To check that the information for the site is in the site BDII, do the following ldapsearch, replacing site-bdii.invalid with the hostname of the machine running the site EGEE.BDII.
<verbatim>
ldapsearch -x -h site-bdii.invalid -p 2170 -b o=grid
</verbatim>

---+++ Check that the information is in the GRIS

To check that the information is in a GRIS, do the following ldapsearch, replacing gris-host.invalid with the hostname of the machine running the GRIS.
<verbatim>
ldapsearch -x -h gris-host.invalid -p 2135 -b mds-vo-name=local,o=grid
</verbatim>

---+++ Check that the information is returned by the information provider

Run the following command to check the output of the information provider.
<verbatim>
/opt/lcg/libexec/lcg-info-wrapper
</verbatim>

---+++ No information found in EGEE.BDII

If there is no information returned, then there is a problem with either the URL used to obtain the information or the information source itself. The URLs are found in the file /opt/lcg/var/bdii/lcg-bdii-update.conf. Find the URL in the file and transform it into an ldapsearch.
<verbatim>
NAME ldap://host.invalid:port/bind
ldapsearch -x -h host.invalid -p port -b bind
</verbatim>

---+++ Entries missing in the EGEE.BDII

If invalid LDIF is produced, then the entry will be rejected when it is being inserted into the LDAP database. To see if any entries are being rejected, run the EGEE.BDII update script.
<verbatim>
/opt/lcg/libexec/lcg-bdii-update /opt/lcg/var/bdii/lcg-bdii.conf
</verbatim>
The DN of any rejected entries will be shown along with the error. This will also show any problems with the LDAP URLs.

---+++ Problems updating the EGEE.BDII configuration file from the web

Check that the attribute =BDII_AUTO_UPDATE= in the configuration file =/opt/lcg/var/bdii/lcg-bdii.conf= is set to "yes". If this value is set to "no" the EGEE.BDII will not attempt to update the configuration file from the web. Next check that the value of the attribute =BDII_HTTP_URL= points to an existing web page and that this web page is the file that contains the URLs that you want to use for the EGEE.BDII.

---+++ Cannot connect to the GRIS

Check the status of the GRIS.
<verbatim>
/etc/rc.d/init.d/globus-mds status
</verbatim>
If the GRIS failed to start, try to restart it.
<verbatim>
/etc/rc.d/init.d/globus-mds restart
</verbatim>
Repeat this command a few times. If it fails on stopping the GRIS then it usually means that it failed to start.

---+++ The GRIS fails to start

The GRIS sometimes fails to start due to stale slapd processes being left around. Try to remove all of these.
<verbatim>
killall -9 slapd
</verbatim>
Note that if the EGEE.BDII is on the same machine it will now need to be restarted. Try restarting the GRIS a few times.
<verbatim>
/etc/rc.d/init.d/globus-mds restart
</verbatim>
If it fails on stopping the GRIS then it usually means that it failed to start. Try starting the GRIS by hand with debugging turned on. This should show up any errors.
<verbatim>
/opt/globus/libexec/slapd -h ldap://localhost:2135 -f /opt/globus/etc/grid-info-slapd.conf -d 255 -u edginfo
</verbatim>
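Once the GRIS appears to start, a minimal probe against the same port can confirm that slapd is actually answering (a sketch; run it on the GRIS host itself):
<verbatim>
# Any immediate answer, even an empty result, means slapd is listening
ldapsearch -x -h localhost -p 2135 -b mds-vo-name=local,o=grid -s base
# If old data is still served after a restart, look for stale slapd processes
ps -ef | grep [s]lapd
</verbatim>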
---+++ No information returned by the GRIS

If no information is returned, then either the information provider is not working or there is a problem with the GRIS configuration.

---+++ There is a problem with the GRIS configuration

Check that the entry for the information provider is in the GRIS configuration file /opt/globus/etc/grid-info-resource-ldif.conf. This file is automatically created from the globus-mds init.d script. It uses the file /opt/edg/var/info/edg-globus.ldif to get the entry.

---+++ No information was produced by the information provider

Check that the static LDIF file has been created. The static LDIF file location is defined in the file =/opt/lcg/var/lcg-info-generic.conf= and by default is =/opt/lcg/var/lcg-info-static.ldif=. If this file does not exist try to re-run the configuration to create it.
<verbatim>
/opt/lcg/sbin/lcg-info-generic-config /opt/lcg/var/lcg-info-generic.conf
</verbatim>
If this does not create the LDIF file, check the contents of the file =/opt/lcg/var/lcg-info-generic.conf=. There should be at least one template and one dn specified in this file.

---+++ Default values shown instead of dynamic values

Either the dynamic plug-in has a problem or there is a mismatch between the DNs. The command used to run the dynamic plug-in is in the file =/opt/lcg/var/lcg-info-generic.conf=. Copy and paste the command onto the command line and execute it. This should show up any errors. Check that the DNs produced by the dynamic plug-in are the same as in the static LDIF file.

---+++ New values not shown in GRIS

This can occur because a stale slapd process is left around and is still serving the data even after a restart. This error can usually be detected by doing =globus-mds stop=: the command will fail and you should still be able to do a query. The solution is to kill all the slapd processes and restart the GRIS.
<verbatim>
killall -9 slapd
</verbatim>
Note that if the EGEE.BDII is on the same machine it will now need to be restarted.

---+++ How to set up a DNS load-balanced EGEE.BDII service

*%MAROON%Question%ENDCOLOR%* How can several BDIIs be used with load sharing?

*%GREEN%Solution%ENDCOLOR%* %Y% Multiple BDIIs can be used behind a "round robin" DNS alias to provide a load-balanced EGEE.BDII service.

---+++ No such object (32): error message

*%MAROON%Error%ENDCOLOR%* The Gstat BDIIUpdate check gives the following error:
<verbatim>
No such object (32)
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% The BDIIUpdate check tries to update the BDII database by contacting each GIIS listed at:
<verbatim>
http://grid-deployment.web.cern.ch/grid-deployment/gis/lcg2-bdii/dteam/lcg2-all-sites.conf
</verbatim>
If your site has this error, you should try to query the contact string listed in the BDII configuration above and verify that it is functioning properly. If the contact string is incorrect please email the ROLLOUT list to request a change. A search example:
<verbatim>
ldapsearch -x -H ldap://<giis host>:2170 -b mds-vo-name=<sitename>,o=grid
</verbatim>

---+++ How to close the site so it won't receive any more jobs from the RBs

*%MAROON%Question%ENDCOLOR%* How to close the site so it won't receive any more jobs from the RBs

If you want to stop the RB from sending you jobs (for example because you want to do some update on your CE), an attribute exists in the LDIF schema which is consulted by the RB to check the availability of your site.
This entry explains how to publish a closed status for your farm; it's about the information system.

*The right place*

The attribute =GlueCEStateStatus= can take the following values, for which the RB will look:
   * =Queueing=: the queue can accept job submissions, but can't be served by the scheduler
   * =Production=: the queue can accept job submissions and is served by a scheduler
   * =Closed=: the queue can't accept job submissions and can't be served by a scheduler
   * =Draining=: the queue can't accept job submissions, but can be served by a scheduler
This attribute is published under the DN =GlueCEUniqueId\=hostname=..., and such a DN exists for each queue.

*%GREEN%Answer%ENDCOLOR%* %Y% Now we are going to change the value of this attribute. Edit =/opt/lcg/var/gip/lcg-info-generic.conf= and find the line with the right =dn=. If the attribute doesn't already exist, add the line:
<verbatim>
GlueCEStateStatus: Closed
</verbatim>
to close your site; otherwise, you only have to change the value of this attribute. Be careful to remove any space at the end of the line. Do this for each queue you have to change; you should find a dn for each of these queues. To activate the changes use the command:
<verbatim>
/opt/lcg/sbin/lcg-info-generic-config /opt/lcg/var/gip/lcg-info-generic.conf
</verbatim>
Don't forget that, if you're using an EGEE.BDII as GIIS, you have to wait until the EGEE.BDII refreshes itself or refresh it manually.

*%BLUE%Note%ENDCOLOR%* %P% If you want to remove the closed status of your site, simply remove the line you added or change the value at will.

---++ Job submission solutions

---+++ 10 data transfer to the server failed

*%MAROON%Error%ENDCOLOR%* The Globus job manager on the CE cannot call back the RB (or the UI in tests).

*%GREEN%Solution%ENDCOLOR%* %Y%
   * Check if the account to which the DN is mapped has a writable home directory. A globus-job-run (instead of edg-job-get-logging-info) may report this error:
<verbatim>
GRAM Job submission failed because cannot access cache files in ~/.globus/.gass_cache, check permissions, quota, and disk space (error code 76)
</verbatim>
   * Check the contents of the $GLOBUS_LOCATION/etc/grid-services/jobmanager-* files.
   * Check the contents of $GLOBUS_LOCATION/etc/globus-job-manager.conf.
   * Ensure /etc/grid-security is world-readable (only hostkey.pem must be protected).
   * Ensure outgoing connections are allowed from the CE to the GLOBUS_TCP_PORT_RANGE on the RB (or UI).

---++ SAM solutions

---++ VOMS solutions

---+++ Wrong host certificate subject in the vomses file

It is possible that after renewing a host certificate, the host certificate subject changes and the vomses file containing the VOMS server information is not updated accordingly. The client side message is like in the following example:
<verbatim>
bash-2.05b$ voms-proxy-init -voms mysql_vo1 -userconf ~/vomses
Your identity: /C=CH/O=CERN/OU=GRID/CN=Maria Alandes Pradillo 5561
Enter GRID pass phrase:
Creating temporary proxy ....................................... Done
Contacting lxb0769.cern.ch:15001 [/C=CH/O=CERN/OU=GRID/CN=lxb0769.cern.ch] "mysql_vo1" Failed

Error: Could not establish authenticated connection with the server.
GSS Major Status: Unexpected Gatekeeper or Service Name
GSS Minor Status Error Chain:
an unknown error occurred

Failed to contact servers for mysql_vo1.
</verbatim>
The server log file contains the following lines:
<verbatim>
Wed Aug 16 11:04:48 2006:lxb0769.cern.ch:vomsd(4341):ERROR:REQUEST:AcceptGSIAuthentication home/glbuild/GLITE_3_0_0_final/org.glite.security.voms/src/socklib/Server.cpp:259):Failed to establish security context (accept):.GSS Major Status: General failure.GSS Minor Status Error Chain:..accept_sec_context.c:305:gss_accept_sec_context: Error during delegation: Delegation protocol violation
</verbatim>
In this case, check whether the vomses file contains the correct host certificate subject. To check your VOMS host certificate subject, run the following command:
<verbatim>
[root@lxb0769 root]# openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject
subject= /C=CH/O=CERN/OU=GRID/CN=host/lxb0769.cern.ch
</verbatim>
And check in the vomses file that the certificate subject is correct:
<verbatim>
bash-2.05b$ more vomses
...
"mysql_vo1" "lxb0769.cern.ch" "15001" "/C=CH/O=CERN/OU=GRID/CN=host/lxb0769.cern.ch" "mysql_vo1"
...
</verbatim>

---+++ Database initialization error with MySQL

When installing VOMS MySQL, sometimes the following error appears just after starting the VOMS server: Database initialization error. This is usually caused by not executing the following commands before configuring the server:
<verbatim>
/usr/bin/mysqladmin -u root password 'yourPassword'
/usr/bin/mysqladmin -u root -h yourHostname password 'yourPassword'
</verbatim>
When installing VOMS MySQL it is extremely important to execute the mentioned commands before configuring VOMS. Although this is specified in the Installation Guide that can be found [[http://glite.web.cern.ch/glite/packages/R3.0/R20060502/doc/installation_guide_3.0-2.html][here]], many people don't read it. It is also mentioned when the VOMS MySQL RPMs are installed using APT. However, since many messages and warnings appear, it is easy to miss the message that warns about the need to execute the above-mentioned commands.

---+++ =WARNING: Unable to verify signature!=

*%MAROON%Error%ENDCOLOR%* Running =voms-proxy-info= gives the following error:
<verbatim>
error = 5025
WARNING: Unable to verify signature!
subject : /O=GermanGrid/OU=LMU/CN=John Kennedy/CN=proxy
...
..
</verbatim>
While =voms-proxy-init= is OK:
<verbatim>
voms-proxy-init -voms atlas
Your identity: /O=GermanGrid/OU=LMU/CN=John Kennedy
Enter GRID pass phrase:
Creating temporary proxy .............................................. Done
Contacting voms.cern.ch:15001 [/C=CH/O=CERN/OU=GRID/CN=host/voms.cern.ch] "atlas" Error: VERR_NOSOCKET
Failed.
Trying next server for atlas.
Creating temporary proxy ............................................. Done
Contacting lcg-voms.cern.ch:15001 [/C=CH/O=CERN/OU=GRID/CN=host/lcg-voms.cern.ch] "atlas"
Creating proxy ................................................... Done
Your proxy is valid until Mon Jul 17 13:36:56 2006
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% It just means that you don't have the VOMS server host certificate (or at least v-p-i can't find it), so the code can't verify that the VO signature is valid. It doesn't matter if you just want to see the info.

---++ APT solutions

---+++ =apt-get update=: W: Release file did not contain checksum information for ...
*%MAROON%Error%ENDCOLOR%* Running =apt-get update= gives a message similar to this one:
<verbatim>
W: Release file did not contain checksum information for http://grid-deployment.web.cern.ch/grid-deployment/gis/apt/LCG-2_7_0/sl3/en/i386/base/pkglist.lcg_sl3
W: Release file did not contain checksum information for http://grid-deployment.web.cern.ch/grid-deployment/gis/apt/LCG-2_7_0/sl3/en/i386/base/release.lcg_sl3
W: Release file did not contain checksum information for http://grid-deployment.web.cern.ch/grid-deployment/gis/apt/LCG-2_7_0/sl3/en/i386/base/pkglist.lcg_sl3.security
W: Release file did not contain checksum information for http://grid-deployment.web.cern.ch/grid-deployment/gis/apt/LCG-2_7_0/sl3/en/i386/base/release.lcg_sl3.security
W: You may want to run apt-get update to correct these problems
</verbatim>

*%GREEN%Solution%ENDCOLOR%* %Y% There is a problem on the server side, so please send an e-mail to lcg-rollout@listserv.cclrc.ac.uk including the error message.

---++ FTS Solutions

#NotAuthorizedToSubmit
---+++ I tried to submit a job and it said: =submit: You are not authorised to submit jobs to this service=

The user is not authorised to submit jobs to the FTS service. In order to authorize him/her, you have to add his/her DN to the =submit-mapfile= on the FTS server. You can have a look at [[LCG.FtsServerInstall13][FtsServerInstall112]] in the =Mapfile= section and at [[FtsServerSubmitMapfile13][FtsServerSubmitMapfile13]].

However, due to a bug in the FTS ([[http://savannah.cern.ch/bugs/?func=detailitem&item_id=10362][#10362]]), if the user has a doubly (or more) delegated proxy (i.e. the DN ends with =/CN=proxy/CN=proxy=), a parsing error will cause the authorization to be denied. This bug has been solved in FTS version 1.4 and in the latest !QuickFix for 1.3.

If the user is still not authorized to submit requests, check that his/her DN is not in the =veto-mapfile=.

#MonoDirectionalChannel
---+++ I submitted a job from site X to Y but it didn't work. The channel Y-X exists and has a share for my VO!

From version 1.3 onwards the channel definitions are mono-directional. You have to create another channel in the opposite direction (=glite-transfer-channel-add=), set the share for the VO interested in using the channel (=glite-transfer-channel-setvoshare=) and install a !Channel !Agent that will manage it.

#WhichSurlFormat
---+++ Which format should I use for the SURLs?

Starting from gLite 1.4.1, the FTA implements the enhancement request [[http://savannah.cern.ch/bugs/?func=detailitem&item_id=8364][#8364]], which allows a user to specify any format he prefers: the agent will then convert each SURL, before transferring it or registering it into the catalog, to either a fully qualified format
<verbatim>
srm://<host>:<port>/srm/managerv1?SFN=<file_path>
</verbatim>
or a compact one
<verbatim>
srm://<host>/<file_path>
</verbatim>
depending on the configuration. By default it uses the compact format.
In case you want to change this parameter, you have to set the related !ChannelAgent configuration parameter =transfer-agent-channel-actions.SurlNormalization= to one of the following values:
   * =compact=: all the SURLs will be converted to the format:
<verbatim>
srm://<host>/<file_path>
</verbatim>
   * =compact-with-port=: all the SURLs will be converted to the format:
<verbatim>
srm://<host>:<port>/<file_path>
</verbatim>
   * =fully-qualified=: all the SURLs will be converted to the format:
<verbatim>
srm://<host>:<port>/srm/managerv1?SFN=<file_path>
</verbatim>
   * =disabled=: no SURL conversion will be performed

If you're using a previous version, for interoperability reasons we suggest using fully qualified SURLs, i.e. in the format
<verbatim>
srm://<srm_host>:<srm_port>/srm/managerv1/?SFN=<file_path>
</verbatim>
If you know the type of the SRM that would be involved in the transfer, you can also specify one of the supported compact formats. For !Castor, for example, you can use
<verbatim>
srm://<castorsrm>:8443/srm/managerv1?SFN=<file_path>
srm://<castorsrm>:8443//srm/managerv1?SFN=<file_path>
srm://<castorsrm>:8443/?SFN=<file_path>
srm://<castorsrm>:8443/<file_path>
srm://<castorsrm>/<file_path>
</verbatim>
In case the transfer is processed by a channel configured to use =srmcopy=, the fully qualified format may not work. Please have a look [[#SrmCopyMalformedUrl][here]] for a workaround.

#InvalidEndpoint
---+++ I've tried to submit a job but I get back an error saying: SOAP-ENV:Server.userException - org.xml.sax.SAXException

Usually this issue is related to an endpoint pointing to the wrong server (typically =ChannelManagement= instead of =FileTransfer=): when you observe an error similar to
<verbatim>
submit: SOAP fault: SOAP-ENV:Server.userException - org.xml.sax.SAXException: Deserializing parameter 'job': could not find deserializer for type {http://transfer.data.glite.org}TransferJob
</verbatim>
please ask the user to look at the command he just submitted and to check that the specified endpoint is correct; all the CLI commands that start with =glite-transfer-channel-*= require the =ChannelManagement= interface, while the ones that start with =glite-transfer-*= require the =FileTransfer= interface. In order to check if the endpoint is correct, the user can also re-run the command with the =-v= option and check whether the line =Using Endpoint= ends with =FileTransfer= or =ChannelManagement=.

#NoMatch
---+++ I've tried to submit a job but I get back an error saying: No match

When the user submits a transfer job, he usually specifies SURLs that may contain a question mark (=?=). In some shells this character has to be escaped by simply quoting it (='?'=): for example, if the SURLs are
<verbatim>
srm://castorgridsc.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/grid/dteam/src_file
srm://castorgridsc.cern.ch:8443/srm/managerv1?SFN=/castor/cern.ch/grid/dteam/dst_file
</verbatim>
please make sure you run =glite-transfer-submit= in this way
<verbatim>
glite-transfer-submit \
 srm://castorgridsc.cern.ch:8443/srm/managerv1'?'SFN=/castor/cern.ch/grid/dteam/src_file \
 srm://castorgridsc.cern.ch:8443/srm/managerv1'?'SFN=/castor/cern.ch/grid/dteam/dst_file
</verbatim>

#NotAuthorizedToGetChannel
---+++ I was able to list the channels but I cannot get the channel details

Listing channels is open to any user as long as he/she is not in the veto mapfile - you only get the channel name from this call. However, getting the details of a channel - source, destination, bandwidth, etc. - is restricted.
For this you need to be:
   * an admin
   * manager of the channel being queried
   * manager of any VO on the given FTS
You can check your roles on a given FTS by running =glite-transfer-getroles=. Information on channel and VO managers can be managed by a service admin or other managers by using the appropriate client tools. Information on service ADMINs is stored inside the admin-mapfile.

#NotDedicatedChannel
---+++ How do I setup a non-dedicated Channel?

Non-dedicated channels (a.k.a. "catch-all" channels) are a special channel configuration that allows matching any site as source or destination, and is therefore not coupled to the underlying network. Using "catch-all" channels allows you to limit the number of channels you need to manage, but it also limits the degree of control you have over what is coming into your site (although it still provides the other advantages like queueing, policy enforcement and error recovery). The usage of these channels is mainly recommended at a Tier1 for providing full connectivity to all other sites, where the suggested channel definitions are:
   * Dedicated channels from any other Tier1 to the T1
   * Non-dedicated channels to each of the related Tier2s
   * A non-dedicated channel to the T1

You can setup a non-dedicated channel that will manage all the transfers from any site to your site by issuing a =glite-transfer-channel-add= using =*= as the source site name, like:

=glite-transfer-channel-add -f NUM_OF_FILES -S CHANNEL_STATE [...] CHANNEL_NAME "*" YOUR_SITE=

Of course, you then have to issue a =glite-transfer-channel-setvoshare= for each !VO that should be authorized to use the channel, and then configure a !ChannelAgent for that channel (a worked example is sketched at the end of this section).

Please note that if a !VO is not authorized to use a channel between site =A= and =B= but has privileges on a =*-B= channel, transfer requests for that !VO from site =A= to =B= are denied, since the non-dedicated channel is evaluated _after_ all the dedicated ones.

In addition, please also note that the default !ChannelAgent configuration for that channel requires that all the SRMs that would be involved in the managed transfers be listed in the information system. In case a !VO needs to relax this constraint, for example in order to transfer files to/from !Classic !SEs not included in the information system, the following parameters should be added to the !VOAgent configuration:
   * =transfer-agent-vo-actions.EnableUnknownSource= should be set to =true= if !SEs not known to the !InfoSys should be allowed as valid sources (these would be matched by the =*-Site= catch-all channels)
   * =transfer-agent-vo-actions.EnableUnknownDest= should be set to =true= if !SEs not known to the !InfoSys should be allowed as valid destinations (these would be matched by the =Site-*= catch-all channels)
In case a !VO needs these parameters, it would be better to turn off the [[#WhichSurlFormat][SURL Normalization]], or at least set it to =fully-qualified=, for all the !ChannelAgents associated with non-dedicated channels, since it would be impossible to resolve the correct endpoint for SRMs not listed in the !InformationSystem. It is also worth recommending that users use fully qualified SURLs for transfers that should be processed through these channels.

*Use of the =*-*= 'catch everything' channel is not recommended for production grids*.
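Putting the pieces together, a catch-all channel setup might look like the following sketch (channel name, VO, file count and share are placeholder values; check the exact argument order of =glite-transfer-channel-setvoshare= with =--help=):
<verbatim>
# Create a non-dedicated channel accepting transfers from any site towards MYSITE
glite-transfer-channel-add -f 30 -S Active STAR-MYSITE "*" MYSITE
# Authorize the dteam VO on the new channel with a share of 100
glite-transfer-channel-setvoshare STAR-MYSITE dteam 100
</verbatim>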
---+++ After upgrading to FTS 1.5 I got "No Channel found or VO not authorized ..." error

*%MAROON%Symptom:%ENDCOLOR%* After upgrading to FTS 1.5 I got a "No Channel found or VO not authorized ..." error.

Running the FTS service we encountered many inconsistencies in the way the information was published in the EGEE.BDII, especially related to the case used to publish the site name. This is not a problem when the EGEE.BDII is used directly, since it is case insensitive, but it creates some interoperability issues when the information is used via ServiceDiscovery (which is case sensitive). We therefore decided to apply a convention, within the FTS boundaries, in order to have all the site names uppercase in the channel definitions. Starting from version 1.5, the FTS WebService forces the case when you create a new channel, but when upgrading from previous versions this convention may conflict with already defined channels. In order to fix this, we have provided an admin pack that allows changing the channel definitions. The instructions on how to use these tools are available here. Therefore, if you hit this problem, download the glite-data-transfer-scripts RPM and follow the instructions reported above in order to replace all the site names that contain lowercase letters in all the channel definitions (you may need the support of your DBA).

*%BLUE%Note:%ENDCOLOR%* If this RPM is not yet available in the repository, please contact fts-support

---++ FTA Solutions

#AlwaysSubmitted
---+++ Job always in Submitted state

The first action that is executed on a transfer request is the !Allocation, performed by the !VO agent associated with the VO of the submitter. This action checks the source and destination !SURLs of the job request, finds the sites of the involved SEs using !ServiceDiscovery and then looks for a match among the registered channels. When this operation succeeds, the job is moved to !Pending and the =channel_name= property is filled with the name of the channel found.

Due to a bug in FTA 1.3 and 1.4 ([[http://savannah.cern.ch/bugs/?func=detailitem&item_id=10076][#10076]]) a job stays in the !Submitted state instead of going to !Failed in one of the following cases:
   * The channel doesn't exist, but the source and destination SE are registered in !ServiceDiscovery or the !VO is configured to accept unknown sources and destinations
   * The VO of the user who submitted the job has no valid share on the channel
   * The channel is !Stopped, !Drain or !Halted (actually, when the channel status is !Halted, a job should go to !Pending and not to !Failed)

Usually this problem is due to a configuration error. The first thing to do is to retrieve the status of the channel that should be involved in the transfer:

=glite-transfer-channel-list CHANNEL_NAME=

Check the channel state, that the !VO has a share and that the names of the source and destination sites match the ones retrieved using !ServiceDiscovery: in case the file plugin is used, look at the =site= element of the SRM services reported in the =services.xml= file
<verbatim>
<service name='CERNSC3-SRM'>
  <parameters>
    <endpoint>httpg://castorgridsc.cern.ch:8443/srm/managerv1</endpoint>
    <type>SRM</type>
    <version>1.1.0</version>
    <site>CERN-SC</site>
    <param name='SEMountPoint'>/castor/cern.ch/grid/dteam/storage</param>
  </parameters>
</service>
</verbatim>
and compare them with the values returned by =glite-transfer-channel-list=.
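As a concrete first pass, the checks above might look like this (channel and host names are placeholders):
<verbatim>
# Inspect the channel state, the VO shares and the site names
glite-transfer-channel-list CHANNEL_NAME
# Resolve the site name ServiceDiscovery returns for each SRM host and compare
glite-sd-query -t SRM --host srm-host.invalid
</verbatim>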
In case this doesn't fix the problem, check that a !VO agent is configured and running for that !VO. Do

=glite-transfer-status --verbose JOB_ID=

and check that the value of the =VOName= property is correct; in case it is not, it's a problem with the !FTS =glite-data-transfer-submit-mapfile=: edit that file manually or regenerate it following the procedures reported in [[FtsServerSubmitMapfile13][FtsServerSubmitMapfile13]], cancel the job, wait until the file is reloaded by the !FTS and ask the user to resubmit the request.

In case the !VO is set correctly, check on the agents node that an agent is configured:
   * if you're using gLite 1.3, please have a look at =/opt/glite/etc/config/glite-data-transfer-agents-oracle.cfg.xml= and see if there is an instance for the VO:
<verbatim>
<instance name="YOUR_VO-fts">
  <parameters>
    <transfer-vo-agent.Name value="YOUR_VO"/>
    <!-- Other parameters -->
    <!-- ... -->
  </parameters>
</instance>
</verbatim>
   * if you're using gLite 1.4, open the file =/opt/glite/etc/config/glite-file-transfer-agents-oracle.cfg.xml= and look for an instance:
<verbatim>
<instance name="YOUR_VO" service="transfer-vo-agent-fts"/>
</verbatim>
If the instance is missing, or the naming convention is not correct, edit the appropriate file and rerun the configuration script. If the instance is there, check if it's running, using the command

=/opt/glite/etc/init.d/glite-data-transfer-agents --instance glite-transfer-vo-agent-YOUR_VO status=

or

=service glite-data-transfer-agents --instance glite-transfer-vo-agent-YOUR_VO status=

If the job is still !Submitted, follow the procedure reported [[#LastHope][here]].

#AlwaysPending
---+++ Job always in Pending state

After a transfer request is allocated to a channel, its status is moved to !Pending. The !ChannelAgent will then process this request based on its internal inter-VO scheduling. In case the job state remains !Pending forever, you have to check the following things:
   * The related !ChannelAgent daemon should be running
   * The !Channel state should be set to !Active
   * The VO should have a share on the channel that is greater than 0
In order to check if the agent is running, use the command

=/opt/glite/etc/init.d/glite-data-transfer-agents --instance glite-transfer-channel-agent-CHANNEL_NAME status=

or

=service glite-data-transfer-agents --instance glite-transfer-channel-agent-CHANNEL_NAME status=

You can check the !Channel state and VO share using the command:

=glite-transfer-channel-list CHANNEL_NAME=

If the job is still !Pending, follow the procedure reported [[#LastHope][here]].

#SecurityError
---+++ All my transfers fail with a SECURITY_ERROR

This issue is usually due to a problem in the interaction between an !FTA and the !MyProxy server. This mainly happens in the following cases:
   * The user mistyped the !MyProxy passphrase when submitting the job
   * The user has an invalid or expired certificate in !MyProxy
   * The agent is not an authorized retriever for !MyProxy
   * There is an authentication problem (expired certificate or CRL)
In the first two cases, all the transfers of this user should fail while the ones of other users succeed; in the other cases all the transfers would fail, independently of the user. Usually, you can detect the type of the error by having a look at the agent log file in =/opt/log/glite/glite-transfer-channel-agent-CHANNEL_NAME.log= or =/opt/log/glite/glite-transfer-vo-agent-VO_NAME.log=
   * If the problem is due to a wrong passphrase, you'll see
<verbatim>
2005-08-26 07:25:52,281 ERROR transfer-agent-myproxy - Failed to get the proxy from the !MyProxyServer.
Reason is: Reason is Error in bind()
ERROR from server: invalid pass phrase
</verbatim>
Then ask the user to resubmit his/her file, possibly using the =-p= option of =glite-transfer-submit=. In case the problem persists, maybe the user forgot the passphrase, so ask him/her to restore the credential in !MyProxy using =myproxy-init -s MYPROXY_SERVER -d=
   * In case the agent is not an authorized retriever, you'll see a similar entry:
<verbatim>
2005-08-26 07:25:52,281 ERROR transfer-agent-myproxy - Failed to get the proxy from the MyProxyServer.
Reason is: ERROR from server: "<anonymous>" not authorized by server's authorized_retriever policy
</verbatim>
If that is the case, you have to contact the !MyProxy server administrator and ask him to add the DN of the certificate of the account used to run the agent. If it still doesn't work, please also check that the agent is running with a valid certificate, following what is described [[#CannotGetAgentDN][here]]
   * In case the entry is similar to
<verbatim>
2005-08-26 07:25:52,281 ERROR transfer-agent-myproxy - Failed to get the proxy from the MyProxyServer.
Reason is: Error authenticating: GSS Major Status: Authentication Failed GSS Minor Status Error Chain: (null)
</verbatim>
the problem is usually due to an expired certificate or to an expired certificate revocation list (CRL). Please check the validity of the certificates and update the CRLs on both the agent and !MyProxy nodes
   * In the other cases, ask the user to store his/her certificate in !MyProxy again, running the command =myproxy-init -s MYPROXY_SERVER -d=
Please note that the =-d= option is required in order to associate the credentials with the DN of the user instead of the account name. If you need to know which !MyProxy server is used, have a look [[#WhichMyProxy][here]].

#WhichMyProxy
---+++ Which !MyProxy Server is used?

When an agent has to perform an operation on behalf of the user, it retrieves the user's delegated credentials from the configured !MyProxy server, caches them in the local file system and then impersonates the user by setting the environment variable X509_USER_PROXY. The operations where this is required are:
   * Retrieving service endpoints and information from !ServiceDiscovery
   * Performing the transfer (unless the property =transfer.vo-agent.DisableDelegationForTransfers= is set to true)
   * Contacting the catalog to retrieve the list of replicas and register the new ones when the transfer is finished (only in the case of the FPS VO Agent)
The endpoint of the !MyProxy server is usually retrieved using !ServiceDiscovery, so in the case of the file plugin you need to have an entry in =/opt/glite/etc/services.xml= like
<verbatim>
<service name='MyProxy'>
  <parameters>
    <endpoint>myproxy://myproxy.cern.ch</endpoint>
    <type>MyProxy</type>
    <version>1.14</version>
  </parameters>
</service>
</verbatim>
You can query the !InfoSys using the command =glite-sd-query -t !MyProxy=

In order to resolve which !MyProxy server should be used, the !FileTransferAgent looks into the associated services of the !FileTransferService that received the user's request (available from gLite 1.3 QF23) or, if not found, takes the first !MyProxy server returned by the !InformationSystem; you can also force the agent to use a specific instance by setting the agent configuration property =transfer-agent-myproxy.Server=. In case this property is not set and there is no !MyProxy entry registered in the !InfoSys, the environment variable $MYPROXY_SERVER is used.
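A rough way to walk through that lookup order by hand (a sketch; the paths are the ones used elsewhere in this guide, adapt as needed):
<verbatim>
# 1. Is a specific server forced in the agent configuration?
grep -r "transfer-agent-myproxy.Server" /opt/glite/etc/config/
# 2. Otherwise, which MyProxy services does ServiceDiscovery know about?
glite-sd-query -t MyProxy
# 3. Finally, the environment variable fallback
echo $MYPROXY_SERVER
</verbatim>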
Starting from version gLite 1.3 QF23, the user is also allowed to specify the !MyProxy server he wants to use by providing the option =-m myproxy_hostname= in the =glite-transfer-submit= command line.

#CannotGetAgentDN
---+++ I've noticed a warning "Cannot Get Agent DN" in the agent log files

You can see this entry in case the agent doesn't run with a valid certificate. When an !FTA starts, it logs the DN of the certificate the agent will use. This certificate is used to perform the following actions:
   * Retrieve the user's delegated credentials from !MyProxy using the passphrase provided by the user. This happens on both the !Channel and the VO Agents
   * Perform the transfer if the =transfer.vo-agent.DisableDelegationForTransfers= property is set to =true=. This happens only in the VO Agent and it's the default behavior of the FPS configuration
If the agent doesn't have a valid certificate, it's likely that these operations will fail. In order to fix this problem, first check that the user running the agents has a valid certificate: usually these certificates are installed in =$HOME/.globus/usercert.pem= and =$HOME/.globus/userkey.pem= and should be owned by that user. In case the certificate is installed in a different place, the environment variables X509_USER_CERT and X509_USER_KEY should be set accordingly. You should also check that the certificate is not expired, by running:

=openssl x509 -text -in ~/.globus/usercert.pem=

or

=openssl x509 -text -in $X509_USER_CERT=

In case the certificate is valid but the agent still reports the warning, check if there is an expired proxy certificate in =/tmp/x509up_uUSER_ID= (where =USER_ID= is the user id of the account used to run the agent) and delete it.

#SrmCopyMalformedUrl
---+++ My srmcopy transfers fail with a dCache !MalformedUrl exception

You may notice this error when a user is transferring files to a dCache SE using a channel configured to perform =srmcopy= transfers. This is due to a bug in dCache version <= 1.6.5 in parsing the URL.
You have to ask the user to resubmit his/her requests using the following conventions:
   * In case the destination SE is dCache, and the source is !Castor or DPM
      * !Source SURL can be
<verbatim>
srm://<castorsrm>:<port>//srm/managerv1?SFN=<path>
srm://<castorsrm>:<port>/?SFN=<path>
srm://<castorsrm>/<path>
</verbatim>
      * !Destination SURL should be
<verbatim>
srm://<dcachesrm>:<port>/srm/managerv1?SFN=<path>
srm://<dcachesrm>/<path>
</verbatim>
   * In case the source SE is dCache and the destination one is !Castor or DPM
      * !Source SURL should be
<verbatim>
srm://<dcachesrm>:<port>/srm/managerv1?SFN=<path>
srm://<dcachesrm>/<path>
</verbatim>
      * !Destination SURL can be
<verbatim>
srm://<castorsrm>:<port>/srm/managerv1?SFN=<path>
srm://<castorsrm>:<port>//srm/managerv1?SFN=<path>
srm://<castorsrm>:<port>/?SFN=<path>
srm://<castorsrm>:<port>/<path>
srm://<castorsrm>/<path>
</verbatim>
   * In case both the source and destination SE are dCache
      * !Source SURL should be
<verbatim>
srm://<dcachesrm>:<port>//srm/managerv1?SFN=<path>
srm://<dcachesrm>/<path>
</verbatim>
      * !Destination SURL should be
<verbatim>
srm://<dcachesrm>:<port>/srm/managerv1?SFN=<path>
srm://<dcachesrm>/<path>
</verbatim>
This problem is fixed in dCache v1.6.6; however, this new version doesn't seem to accept the compact SURL format
<verbatim>
srm://<srmhost>/<path>
</verbatim>
If the destination SE is dCache and its version is 1.6.6, we suggest using for both source and destination SURLs either:
<verbatim>
srm://<srmhost>:<port>/<path>
</verbatim>
or the fully qualified one:
<verbatim>
srm://<srmhost>:<port>/srm/managerv1?SFN=<path>
</verbatim>

#DCacheSrmCopyUrl
---+++ I've upgraded to 1.4.1 but srmcopy doesn't seem to work

Starting from version 1.3 QF23, the !FileTransferAgent normalizes the SURLs before executing all the SRM get, put and copy requests, and the default normalization is to convert them into the compact format
<verbatim>
srm://<srmhost>/<path>
</verbatim>
As illustrated [[#SrmCopyMalformedUrl][here]], we observed a problem with dCache srmcopy in version 1.6.6 not working with this format: after ~30 minutes the error returned is
<verbatim>
number of retries exceeded:org.dcache.srm.scheduler.NonFatalJobFailure: java.io.IOException: both from and to url are not local srm
</verbatim>
In order to work around this problem, you have to change the configuration of the !FileTransferAgent normalization to use a different format, by setting the !ChannelAgent configuration property =transfer-agent-channel-actions.SurlNormalization= to either =compact-with-port= for converting to the format
<verbatim>
srm://<srmhost>:<port>/<path>
</verbatim>
or =fully-qualified= for the format
<verbatim>
srm://<srmhost>:<port>/srm/managerv1?SFN=<path>
</verbatim>
Please note that this is not a bug in FTS, but a problem in dCache; you might have observed it after upgrading to 1.4.1 because this version of FTS was released more or less at the same time as dCache 1.6.6.

#PingNull
---+++ I've upgraded to 1.4.1 but the transfer failed with Error in srm__ping: NULL

Starting from version 1.4.1, FTS retrieves the SRM endpoint from the information system, instead of parsing the SURL and, in case one of the compact formats is used, using the default port (8443) and service path (srm/managerv1). In case your transfers start failing after the upgrade with an error:
<verbatim>
Cannot Contact SRM Service.
Error in srm__ping: NULL
</verbatim>
then probably the entry in the information system is not correct: in fact, a common error that has been observed is that the SRM endpoint is stored as
<verbatim>
srm://<srmhost>:<port>/srm/managerv1
</verbatim>
instead of
<verbatim>
httpg://<srmhost>:<port>/srm/managerv1
</verbatim>
You can also check by looking into the transfer log files (located in =/var/tmp/glite-transfer-url-copy-UID/CHANNEL_NAMEfailed= on the related !ChannelAgent box) and checking the endpoint that is used for the SRM calls.

#NoSiteFoundForHost
---+++ The transfer failed with the error: No site found for host ...

During the allocation phase the !VOAgent needs to resolve the sites that will be involved in the transfer. In order to do that, the agent will look up in the information system the site names of the source and destination SRMs, querying by the hostname retrieved from the provided SURLs. In case the user gets an error like:
<verbatim>
Failed to Get Channel Name: No site found for host ...
</verbatim>
you have to look at the following things:
   * The entries concerning the SRM services should be listed in the information system
   * The SD library plugins are defined and configured properly (environment variables, files, etc.)
   * If the file-based plugin is chosen, the =/opt/glite/etc/services.xml= file is properly formatted
In order to detect errors, it's useful to run the command:
<verbatim>
su - ACCOUNT_USED_TO_RUN_THE_VOAGENT -c '/opt/glite/bin/glite-sd-query -t SRM --host SRM_HOSTNAME'
</verbatim>
and check the result (this command executes the same query as the agent). If the problem still persists, it may be worth having a look at the /proc table and seeing if
<verbatim>
/proc/VOAGENT_PROCESS_ID/environ
</verbatim>
contains the correct values for the =GLITE_LOCATION= and =GLITE_SD_*= environment variables. In case the !StorageElement should not be listed in the information system, you may want to have a look [[#NotDedicatedChannel][here]].

#WhichServiceTypes
---+++ Which Service Types are used?

The File Transfer Agent needs to interact with external services in order to accomplish its tasks and uses the gLite !ServiceDiscovery API in order to discover their properties. The involved services are:
   * !MyProxy: used to retrieve the clients' delegated credentials
   * !SRM & !GridFtp: the site information is used to allocate a transfer job to a channel
   * !FileCatalog: used by the vo-agent in FPS mode in order to retrieve the source replicas to be used for a transfer and to register the new replicas when the transfer is finished
In order to discover this information the File Transfer Agent uses the service types listed in [[http://infnforge.cnaf.infn.it/glueinfomodel/index.php/V12/ServiceType][Glue Service Types]]. As reported in bug [[http://savannah.cern.ch/bugs/?func=detailitem&item_id=12961][#12961]], however, the service type for a !GridFtp server is set to =GridFTP= instead of =gsiftp=, and a backward compatible fix is foreseen for a future release. As a temporary workaround you could follow the comments reported on the bug.

#LastHope
---+++ I've tried everything, and it still doesn't seem to work

In case your problem is listed in this page, but none of the proposed solutions seems to work, you can generate verbose log files and send them to [[mailto:fts-support@cern.ch][fts-support]].
In order to generate these files, please follow this procedure. For each agent involved (the VO one, responsible for allocating a transfer to a channel and retrying failed transfers; and the Channel one, responsible for transferring the files and monitoring the status), edit the file =glite-transfer-vo-agent-VO_NAME.log-properties= (in case of a !VO !FTA) or =glite-transfer-channel-agent-CHANNEL_NAME.log-properties= (in case of a !Channel !FTA) and replace the line

=log4j.rootCategory=INFO, file=

with

=log4j.rootCategory=DEBUG, file=

and

=log4j.appender.file.fileName=/var/log/glite/glite-transfer-channel-agent-CHANNEL_NAME.log= or =log4j.appender.file.fileName=/var/log/glite/glite-transfer-vo-agent-VO_NAME.log=

with

=log4j.appender.file.fileName=/var/log/glite/glite-transfer-channel-agent-CHANNEL_NAME.debug.log= or =log4j.appender.file.fileName=/var/log/glite/glite-transfer-vo-agent-VO_NAME.debug.log=

Restart the agents and let them run for ~1 minute; then stop the agents, restore the original values in the modified files, start the agents again and mail the =/var/log/glite/*.debug.log= files to [[mailto:fts-support@cern.ch][fts-support]].
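As an illustration, the switch to DEBUG logging for a channel agent could be scripted roughly as follows (a sketch: the location of the =.log-properties= file depends on the installation, and the saved copy is used to restore the original settings afterwards):
<verbatim>
F=glite-transfer-channel-agent-CHANNEL_NAME.log-properties
cp $F $F.orig
# Raise the log level and write to a separate .debug.log file
sed -i -e 's/=INFO, file/=DEBUG, file/' \
       -e 's|\(appender\.file\.fileName=.*\)\.log$|\1.debug.log|' $F
</verbatim>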
---++ FTS Channel Administration solutions

#NFilesForVO
---+++ How do I set the number of files transferred per VO instead of per channel?

In the FTS Channel Agent you have three parameters you can act on in order to tune the inter-VO scheduling: the channel VO share, the number of files that the channel can process concurrently and the =transfer-channel-agent.VOShareType= configuration property. The purpose of this configuration parameter is to define a policy for how the VO share should be interpreted for a channel; you can add it to the instance that corresponds to the related channel agent in the =glite-file-transfer-agents.cfg.xml= configuration file. The allowed values are:
   * *normalized*: the share is the value of the channel =voshare= property for the given VO, normalized to the sum of all the shares of all the VOs on the same channel. This option could be used when channel administrators want to guarantee slots for certain VOs, in order to implement some sort of !QoS, accepting to eventually penalize the total throughput (transfer slots would be reserved for a VO even if that VO has no jobs to process)
   * *absolute*: the share is the value of the channel =voshare= property expressed as a percentage. No normalization is performed, which means that the sum of all the shares on the same channel can exceed 100%. This option could be used when channel administrators want to balance the share between the VOs without allowing a single VO to fully allocate the channel, while minimizing the risk of allocating slots to VOs that don't have any jobs to process. This option implies some tuning of the VO share values based on experience, but it allows a compromise between throughput and !QoS
   * *normalized-on-active*: the share is the value of the channel =voshare= property for the given VO, normalized to the sum of the shares of all the VOs on the same channel that have at least one job that can be processed by the Channel Agent (job state Active, Pending or Canceling). This option is the default and should be used when the channel administrators want to optimize the throughput of the channel (the channel can be fully allocated even by one VO), but with a lower !QoS

As an example, supposing you have a channel that has 30 files and 3 VOs, you could have:

| | | *Normalized* | *Absolute* | *Normalized-on-active** |
| *VO* | *Share* | *Max Files* | *Max Files* | *Max Files* |
| VO_1 | 50 | 15 | 15 | 0 |
| VO_2 | 30 | 9 | 9 | 18 |
| VO_3 | 20 | 6 | 6 | 12 |
(* supposing VO_1 has no jobs to submit)

As you can notice, in case the sum of the VO shares is 100, there's no difference between the "normalized" and "absolute" setups. But if this constraint is not respected, you can have:

| | | *Normalized* | *Absolute* | *Normalized-on-active** |
| *VO* | *Share* | *Max Files* | *Max Files* | *Max Files* |
| VO_1 | 70 | 14 | 21 | 0 |
| VO_2 | 50 | 10 | 15 | 19 |
| VO_3 | 30 | 6 | 9 | 11 |
(* supposing VO_1 has no jobs to submit)

Please note that the value of the column "Max Files" corresponds to the maximum number of files a VO is authorized to submit at the same time. In any case the constraint imposed by the "files" channel property is always respected.

If you want to start with two VOs, setting each of them to be able to perform up to 15 transfers concurrently: set =transfer-channel-agent.VOShareType= to _normalized_ (or _absolute_), with the VO shares set to 50 and the channel files set to 30. You'll then allow up to 30 parallel transfers on the channel, but each VO will not be able to submit more than 15 at the same time. In case you have to support other VOs, you'll need to adjust these percentages.

---++ General problems

---+++ How to replace host certificates on service nodes

*%MAROON%Problem%ENDCOLOR%* The host certificate has expired or is going to be changed.

*%GREEN%Solution%ENDCOLOR%* Node-specific instructions follow; a generic verification snippet is sketched after this list.
   * On *DPM* and *LFC* machines: see the corresponding section in the 'DPM and LFC' section of this troubleshooting guide: [[https://uimon.cern.ch/twiki/bin/view/LCG/TheLCGTroubleshootingGuide#What_to_do_if_the_host_certifica][What to do if host certificate expired or going to be changed]]
   * On a *dCache* node:
      * copy the new certs to =/etc/grid-security/=
      * run the following line
<verbatim>
/opt/d-cache/bin/dcache-core restart
</verbatim>
The connections will be interrupted; this is unfortunately unavoidable at present. It can be minimized by restarting the individual domains, e.g.
<verbatim>
/opt/d-cache/jobs/gsidcapdoor stop
/opt/d-cache/jobs/gsidcapdoor start
</verbatim>
for all of the following domains
<verbatim>
gPlazma
gridftpdoor
srm
xrootdDoor
gsidcapdoor
</verbatim>
   * On an *FTS* node: the new host certificate has to be put in the usual place (=/etc/grid-security=). All FTS daemons need to be reconfigured (with YAIM) to copy the host certificates to where the (non-root) user running each daemon can see them. You should restart all the daemons using the standard procedure for this (which gives no user-visible downtime).
   * On a *VOMS* node: copy the new host certificate to =/etc/grid-security=, and restart the service:
<verbatim>
/etc/init.d/gLite restart
</verbatim>
Pay attention that on all nodes that refer to this VOMS server, the server host certificate has to be changed as well, in the
<verbatim>
/etc/grid-security/vomses
</verbatim>
directory. Furthermore the entries under
<verbatim>
~/.glite/vomses/
/opt/glite/etc/vomses/
/opt/edg/etc/vomses
</verbatim>
have to be changed correspondingly.
   * On an *lcg-CE* node: put the new certificates under
<verbatim>
/etc/grid-security/
</verbatim>
and restart the services.
   * On a *glite-CE* node: put the new certificates under
<verbatim>
/etc/grid-security/
</verbatim>
copy them also to =/home/glite/.certs= and restart the services.
   * On an *lcg-RB* node: put the new certificates under
<verbatim>
/etc/grid-security/
</verbatim>
and restart the services.
   * On a *glite-RB (WMS)* node: put the new certificates under
<verbatim>
/etc/grid-security/
</verbatim>
copy them also to =/home/glite/.certs= and restart the services.
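Whatever the node type, it may be worth verifying the new certificate before restarting the services, for example:
<verbatim>
# Check the subject and expiry date of the newly installed host certificate
openssl x509 -in /etc/grid-security/hostcert.pem -noout -subject -enddate
# The private key must remain readable by root only
ls -l /etc/grid-security/hostkey.pem
</verbatim>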
---+++ Where can I find the log files?

   * On a *DPM* node:
      * =/var/log/dpns/log=
      * =/var/log/dpm/log=
      * =/var/log/dpm-gsiftp/dpm-gsiftp.log=
      * =/var/log/rfio/log=
      * =/var/log/srmv1/log=
      * =/var/log/srmv2/log=
      * =/var/log/srmv2.2/log=
      * =/var/log/lcgdm-mkgridmap.log=
   * On an *LFC* node:
      * =/var/log/dli/log=
      * =/var/log/lfc/log=
      * =/var/log/lcgdm-mkgridmap.log=
   * On a *BDII* node:
      * =/opt/bdii/var/bdii-fwd.log=
      * =/opt/bdii/var/bdii.log=

-----
Last edit: %SEARCH{".*" nosearch="on" regex="on" scope="title" nototal="no" topic="DMFtsSupport" format="$wikiusername on $date"}%

Maintainer: Gergely Debreczeni
-----