CRAB3-Panda Operations
This twiki is intended to track all the info and tips needed for debugging and operating the system. The main components we'd like to document are the REST APIs, the TaskWorker, the CMSAdder Component, ASO, APF, and the CRAB Client.
CRAB Client
Ops documentation is here
CRAB REST APIs
Ops documentation is here
Task Worker
Ops documentation is here
ASO
The full AsyncStageOut management wiki page can be found here:
https://svnweb.cern.ch/trac/CMSDMWM/wiki/AsyncStageOutManagement
Service host
- The ASO service is set up on devaso1 and runs as the vosusr01 user. To operate the service you need to log into the machine with the vosusr01 account (log in first on lxplus, then ssh to devaso1), as sketched below.
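For reference, the login chain looks like this (a sketch; replace <your_user> with your own lxplus username):
ssh <your_user>@lxplus.cern.ch
ssh vosusr01@devaso1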
Cert used by service
- To contact the FTS/DBS/CS REST interfaces, a proxy is required. The service proxy is renewed via a crontab. The log of the proxy renewal can be found in /tmp/vomsrenew.log, while the script that renews it is /home/vosusr01/proxy/vomsrenew.sh. The script generates a valid proxy and writes it to /home/vosusr01/gridcert/proxy.cert.
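To verify that the proxy the service uses is still valid, something like this should work (assuming voms-proxy-info is available in the environment):
voms-proxy-info -all -file /home/vosusr01/gridcert/proxy.cert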
Configuration files
- The ASO relies on a single configuration file for all components. The config file is located at /data/ASO/async_install_common_pre16/current/config/asyncstageout/config.py. Modifying a component's config section requires restarting that component. There is also a config database in couch, located at http://devaso1.cern.ch:5184/_utils/database.html?asynctransfer_config
. Each component has an associated document that you can modify; such a modification does not require restarting the components.
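A component's couch config document can also be fetched directly over HTTP (a sketch; <component_doc_id> is a hypothetical placeholder for the document id of the component you are interested in):
curl http://devaso1.cern.ch:5184/asynctransfer_config/<component_doc_id>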
Component operations
All of the following operations must be run from /data/ASO/async_install_common_pre16/current:
- To start/stop the tool: ./config/asyncstageout/manage stop-asyncstageout
- To start/stop a single ASO component: ./config/asyncstageout/manage execute-asyncstageout wmcoreD --shutdown/--start --component AsyncTransfer/DBSPublisher/Analytics (pick one action and one component; see the example below)
- To start/stop the database: ./config/asyncstageout/manage stop-services
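For example, to restart only the AsyncTransfer component (picking one variant of the command above):
./config/asyncstageout/manage execute-asyncstageout wmcoreD --shutdown --component AsyncTransfer
./config/asyncstageout/manage execute-asyncstageout wmcoreD --start --component AsyncTransfer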
Logs and tips
- The log of all ASO components can be found here: /data/ASO/async_install_common_pre16/current/install/asyncstageout/
- The locations of FTS log files are printed in the AsyncTransfer component log file: /data/ASO/async_install_common_pre16/current/install/asyncstageout/AsyncTransfer/stderr.log.
- The active users are printed in the AsyncTransfer log file:
[vosusr01@devaso1 current]$ tail -f install/asyncstageout/AsyncTransfer/stderr.log
DEBUG:PhEDEx:getData:
url: tfc
data: {'node': u'T2_US_Nebraska'}
DEBUG:PhEDEx:Data is from the cache
DEBUG:PhEDEx:getData:
url: tfc
data: {'node': u'T3_IT_Perugia'}
DEBUG:PhEDEx:Data is from the cache
DEBUG:root:kicking off pool
DEBUG:root:current_running [[u'santocch', u'', u'']]
- To check that the system is transferring correctly and not getting stuck, you need to check the FTS log files:
[vosusr01@devaso1 current]$ tail -1000 install/asyncstageout/AsyncTransfer/stderr.log|grep ftslog
DEBUG:AsyncTransfer-Worker-santocch:log file created: /data/ASO/async_install_common_pre16/v01/install/asyncstageout/AsyncTransfer/logs/7/2013/santocch/T2_CH_CERN-T3_IT_Perugia_1374574670.02.ftslog
DEBUG:AsyncTransfer-Worker-santocch:executing command: ftscp -copyjobfile=/tmp/tmpzjxzwy -server=https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer -mode=single in log: /data/ASO/async_install_common_pre16/v01/install/asyncstageout/AsyncTransfer/logs/7/2013/santocch/T2_CH_CERN-T3_IT_Perugia_1374574670.02.ftslog at: Tue, 23 Jul 2013 12:17:50 for: /C=IT/O=INFN/OU=Personal Certificate/L=Perugia/CN=Attilio Santocchia
DEBUG:AsyncTransfer-Worker-santocch:log file created: /data/ASO/async_install_common_pre16/v01/install/asyncstageout/AsyncTransfer/logs/7/2013/santocch/T1_US_FNAL-T3_IT_Perugia_1374574670.06.ftslog
DEBUG:AsyncTransfer-Worker-santocch:executing command: ftscp -copyjobfile=/tmp/tmpoKVPWF -server=https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer -mode=single in log: /data/ASO/async_install_common_pre16/v01/install/asyncstageout/AsyncTransfer/logs/7/2013/santocch/T1_US_FNAL-T3_IT_Perugia_1374574670.06.ftslog at: Tue, 23 Jul 2013 12:17:50 for: /C=IT/O=INFN/OU=Personal Certificate/L=Perugia/CN=Attilio Santocchia
A correctly ongoing transfer should print something like the following in the log:
[vosusr01@devaso1 current]$ tail -f /data/ASO/async_install_common_pre16/v01/install/asyncstageout/AsyncTransfer/logs/7/2013/santocch/T2_CH_CERN-T3_IT_Perugia_1374574670.02.ftslog
+ glite-transfer-status -s https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer 213cf55e-f381-11e2-9991-b28fbfac286f
213cf55e-f381-11e2-9991-b28fbfac286f status is Active
+ glite-transfer-status -s https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer 213cf55e-f381-11e2-9991-b28fbfac286f
213cf55e-f381-11e2-9991-b28fbfac286f status is Active
+ glite-transfer-status -s https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer 213cf55e-f381-11e2-9991-b28fbfac286f
213cf55e-f381-11e2-9991-b28fbfac286f status is Active
+ glite-transfer-status -s https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer 213cf55e-f381-11e2-9991-b28fbfac286f
213cf55e-f381-11e2-9991-b28fbfac286f status is Active
+ glite-transfer-status -s https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer 213cf55e-f381-11e2-9991-b28fbfac286f
213cf55e-f381-11e2-9991-b28fbfac286f status is Active
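You can also query the FTS server directly for a given transfer, using the job id taken from the ftslog (a sketch; the -l flag, which lists per-file states, is assumed to be available in your glite client version):
glite-transfer-status -l -s https://fts.cr.cnaf.infn.it:8443/glite-data-transfer-fts/services/FileTransfer 213cf55e-f381-11e2-9991-b28fbfac286f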
APF
Auto Pilot Factory. It checks the status of the Panda queues and submits pilot wrappers according to the status of the queue and to the APF queues configuration.
Service host
- vocms60: serves the poc3test setup
- vocms73: serves the c3p1 setup
Both machines run the APF under the root account. Log into the machine with the vosusr01 account (log in first on lxplus, then ssh to the node you want).
Cert used by service
The APF uses a pool of certificates for the pilot (wrapper) submission.
Configuration is in /etc/apf/proxy.conf. Certs are in /data/vosusr01/proxy.
Configuration files
There are two main config files: /etc/apf/queues.conf and /etc/apf/factory.conf. Site-specific information is in queues.conf: there are a few common parameters and then a section for each queue.
There is also the wrapper script, which is not really a configuration file but may require changes, fixes and/or adjustments. It is in
- /data/apf/runpilot3-wrapper-cms-new_1.sh
If you open it, you can see what it does. The main thing to know is that this is where the actual pilot framework is defined. (Currently we are downloading it from
http://cmsdoc.cern.ch/~spiga/pilot.tar.gz
)
Component operations
There are two services which run on the APF node: the APF itself and condor. You may need to start and stop both. Log in as root and run:
- service condor start/stop
- service factory start/stop
A condor restart should not normally be needed, and the APF also rarely requires a restart.
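To check whether both daemons are currently up, the standard status action should work (assuming these SysV init scripts support it, as most do):
service condor status
service factory status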
Logs and tips
There are two kinds of logs: the APF log and the condor job logs. Both are under /data/apf/log/:
- APF log: /data/apf/log/apf
- condor job logs: /data/apf/log/
Job logs are kept for three days; this retention is a configurable parameter.
Kill all the jobs of a given queue
for i in `condor_q -wide | grep "Rome" | awk '{print $1}'`; do echo $i; condor_rm $i ; done
Monitor job status
List all the wrappers submitted (irrespective of status / queue)
- condor_q -wide
List all the wrappers submitted to a site (irrespective of status)
- condor_q -wide | grep Pisa
List all the wrappers in status Idle submitted to a specific queue
- condor_q -wide | grep " I " | grep Pisa
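Along the same lines you can count wrappers per status for a queue (a sketch; the " R " pattern for Running is an assumption mirroring the " I " Idle pattern above):
condor_q -wide | grep Pisa | grep -c " I "
condor_q -wide | grep Pisa | grep -c " R "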
CMSAdder Component
- The CMSAdder code is located in /data/atlpan/srv/lib/python2.6/site-packages/pandaserver/dataservice/AdderCmsPlugin.py. To edit it, you need sudo privileges since this plugin runs as the atlpan user.
- The CMSAdder component retrieves the jobs/outputs metadata from the panda database and:
- uploads the file metadata into the crabserver DB
- inserts a document for each file into the ASO database
- sends back a status code to the panda server: 0 (interactions succeeded), 1 (interactions failed due to a temporary error) and 2 (interactions failed due to a fatal error)
- The crabserver interface and ASO database are specified in the config file.
Service host
- The CMS Panda Component runs on the same host as the panda server, voatlas294. To log into this machine, you first need to log into one machine of the lxvoadm.cern.ch cluster with your lxplus username and password, and then from there log into voatlas294, again with your lxplus username and password. For access to the lxvoadm.cern.ch cluster you need to contact jorge.amando.molina-perez@cern.ch, and for voatlas294 you need someone from Atlas to open a ticket in the "ATLAS Central Services Operations" project asking for sudo access for you on that machine.
Cert used by service
- To contact the CS REST interface, a proxy is required. The service proxy is renewed via a crontab. The log of the proxy renewal can be found in /tmp/vomsrenew.log, while the script that renews it is /data/atlpan/proxy/vomsrenew.sh. The script generates a valid proxy and writes it to /data/atlpan/proxy/proxy.cert.
Config file description
- The Adder component has a single config file, located at /data/atlpan/srv/etc/panda/auth_aso_plugin.txt. It is a json file where ASO_DB_URL, PROXY and ASO_CACHE are specified. To edit it you need sudo privileges.
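A minimal sketch of what the file might contain (the values shown are assumptions pieced together from the paths and URLs elsewhere on this page, not the real production contents):
{
  "ASO_DB_URL": "http://devaso1.cern.ch:5184/asynctransfer",
  "PROXY": "/data/atlpan/proxy/proxy.cert",
  "ASO_CACHE": "<cache URL>"
}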
Component operations
- The component is called by the panda server, so there is no command to start or stop it: the component starts and stops with the panda server.
Logs and tips
- The log of this component can be found here: /data/atlpan/srv/var/log/panda/ASOPlugin.log
- To ensure that the component is uploading files correctly into the CS database, look for lines like these in the log:
< CMS-Server-Time: D=461573 t=1371549244778066^M
< ^M
{ [data not shown]
^M100 576 100 16 100 560 29 1026 --:--:-- --:--:-- --:--:-- 1029
* Connection #0 to host c3p1.cern.ch left intact
* Closing connection #0
* SSLv3, TLS alert, Client hello (1):
} [data not shown]
, 0
while for the ASO interactions, you need to see something like the following in the log:
2013-06-18 14:27:11,947 DEBUG Trying to commit {'dn': '/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=mcinquil/CN=660800/CN=Mattia Cinquilli', 'lfn': '/store/temp/user/mcinquil/GenericTTbar/mytest_0001/8a54bf516d1c71c89f531efe0772a562/1000/outfile_141_O5HV95.root', 'checksums': {'adler32': 'ad:8c12962d'}, 'failure_reason': [], 'size': 44350880L, 'group': '', 'destination': 'T2_IT_Bari', 'publish': 0, 'last_update': 1371565631, 'source': 'T2_ES_CIEMAT', 'state': 'new', 'role': '', 'type': 'output', 'dbSource_url': 'Panda', 'inputdataset': '/GenericTTbar/HC-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO', 'workflow': '130618_140931_mcinquil_crab_20130618_160928', 'start_time': '2013-06-18 14:27:11.730120', 'job_end_time': '2013-06-18 14:27:11', 'dbs_url': 'http://cmsdbsprod.cern.ch/cms_dbs_prod_global/servlet/DBSServlet', 'publication_retry_count': [], 'user': 'mcinquil', 'publication_state': 'not_published', 'jobid': 1849451512, 'retry_count': [], 'end_time': '', '_id': '03373286b16ec1bdf9f601cf89f8d1957e62a1a43725cf1fba38edd9', 'publish_dbs_url': 'https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet'}
2013-06-18 14:27:11,978 INFO ASOPlugin ends.
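A quick way to confirm recent successful plugin runs is to grep for that end marker, e.g.:
grep "ASOPlugin ends." /data/atlpan/srv/var/log/panda/ASOPlugin.log | tail -5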
- In general, a failure of the Adder to contact the REST (you do not see the message above) can be caused by:
- Expired proxy: in this case ensure that the proxy /data/atlpan/proxy/proxy.cert is valid and check /tmp/vomsrenew.log to make sure that the proxy renewal is being done correctly
- Troubles with the CS REST/database: you can try to contact the CS REST interface manually using curl (see the sketch below). If the CS REST interface does not reply correctly to your request, you need to go to the CS REST machine and see what is happening there.
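A manual check could look like the following (a sketch: c3p1.cern.ch is the CS host seen in the log excerpt above, but the exact REST endpoint path is an assumption and should be taken from the plugin config):
curl -v -k --cert /data/atlpan/proxy/proxy.cert --key /data/atlpan/proxy/proxy.cert "https://c3p1.cern.ch/<rest_endpoint>"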
- In general, a failure of the Adder to contact the ASO (you do not see the message above) is caused by troubles with the ASO couchdb service: try to open the ASO database index page http://devaso1.cern.ch:5184/_utils/index.html
to make sure the ASO database in couch is up and running correctly. If the ASO database does not respond, you need to go to the ASO instance machine and see what is happening.
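The same check can be done from the command line: a plain GET on the couch root should return its welcome banner if the service is alive:
curl http://devaso1.cern.ch:5184/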