Difference: DanielaRemenskaWork (1 vs. 41)

Revision 31 (2013-06-06) - DanielaRemenska

Line: 1 to 1
 
Line: 51 to 51
  ~ $ dirac-dms-check-fc2bkk --Prod 22760 --FixIt
Changed:
  • If many jobs are completed but cannot upload to the LogSE (volhcb15), check the number of open connections (socket count) and restart the StorageElement service.
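    A minimal sketch of that check, to be run on volhcb15; the runit component name DataManagement_StorageElement under /opt/dirac/startup is an assumption (use whatever name the startup link actually has):
       netstat -tnp 2>/dev/null | grep ESTABLISHED | wc -l              # rough count of open sockets
       runsvctrl t /opt/dirac/startup/DataManagement_StorageElement     # restart the StorageElement service (assumed component name)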
 
  • Mario's script for shifter report:
      $ssh lxplus
Line: 87 to 87
 time . /cvmfs/lhcb.cern.ch/lib/LbLogin.sh
 time SetupProject --debug --use="AppConfig v3r158" --use="SQLDDDB v7r9" --use="ProdConf" Brunel v43r2p3 gfal lfc dpm --use-grid
Added:
python -c "from hashlib import md5"

  • Check if the CVMFS cache is fresh enough on worker nodes. From an lxplus node:
       [dremensk@lxplus0158 ~]$ /usr/bin/attr -q -g revision /cvmfs/lhcb.cern.ch/
       12659
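    A minimal sketch for turning this into a freshness check on a given node, assuming the revision printed on lxplus (12659 above) is used as the reference value:
       REQUIRED=12659                                            # reference revision seen on lxplus
       REV=$(/usr/bin/attr -q -g revision /cvmfs/lhcb.cern.ch/)  # revision on the node being checked
       [ "$REV" -ge "$REQUIRED" ] && echo "CVMFS fresh (rev $REV)" || echo "CVMFS stale (rev $REV < $REQUIRED)"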
 

  • get the production environment
Line: 151 to 159
  dirac-dms-show-se-status | grep SARA
Added:
  • Launch a replication of RAW files (in case they're lost). Registers them in the LFC as well.
        dirac-dms-add-replication --Term --Plugin ReplicateDataset --Destination SARA-RAW --Start
       

 
  • Get the file access protocols for a site:
        dirac-admin-get-site-protocols --Site=LCG.SARA.nl

Revision 30 (2013-05-24) - DanielaRemenska

Line: 1 to 1
 
Line: 46 to 46
 
Added:
  • Checking inconsistencies between the FC and the BK: files that are in the FC but have replica flag NO in the BK will be fixed; files that are in the FC but not in the BK at all will be removed!
        ~ $ dirac-dms-check-fc2bkk --Prod 22760 --FixIt
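        A hedged sketch for doing this over several productions, assuming the script only reports (without repairing) when --FixIt is omitted; the extra production IDs are made-up examples:
        for prod in 22760 22761 22762; do            # hypothetical production IDs
            dirac-dms-check-fc2bkk --Prod $prod      # report only, inspect the output first
        done
        dirac-dms-check-fc2bkk --Prod 22760 --FixIt  # then repair the production you checked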
        
 
  • Mario's script for shifter report:
      $ssh lxplus
Line: 172 to 178
 
     dirac-bookkeeping-get-file-descendants /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81609/081609_0000000035.raw 9
   
Changed:
  • Get RAW ancestors of the lost FULL.DST files:
        dirac-bookkeeping-get-file-ancestors /lhcb/LHCb/Collision12/FULL.DST/00020526/0003/00020526_00031385_1.full.dst
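        A small sketch for running this over a whole list of lost FULL.DST files, assuming they are collected one LFN per line in a text file (the file name is just an example):
        while read lfn; do
            dirac-bookkeeping-get-file-ancestors "$lfn"
        done < lost_fulldst.txt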
       
 
  • Productions with some Unused files whose run number = zero in the production DB (a fix):
        dirac-transformation-debug 16771 --Status Unused

Revision 28 (2013-04-23) - DanielaRemenska

Line: 1 to 1
 
Line: 201 to 201
 ~ $ dirac-dms-check-inputdata 47384426,47384216
Added:
  • VERY USEFUL: Get info on how a file is produced (run number, descendants, processing pass...)
     $ dirac-transformation-debug 20392 --LFN /lhcb/LHCb/Collision12/FULL.DST/00020391/0009/00020391_00093978_1.full.dst --Info alltasks
      

  • Check for IDLE/REGISTERED jobs at a particular CE:
    $ glite-ce-job-status -L0 -a -e lcgce02.gridpp.rl.ac.uk --to '2013-04-11 00:00:00' -s IDLE:WAITING:REGISTERED |grep -c IDLE
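    A sketch for counting IDLE jobs on several CEs in one go, using the same command; the CE list and date are example values:
    for ce in lcgce02.gridpp.rl.ac.uk lcgce01.gridpp.rl.ac.uk; do    # example CE names
        echo -n "$ce: "
        glite-ce-job-status -L0 -a -e $ce --to '2013-04-11 00:00:00' -s IDLE:WAITING:REGISTERED | grep -c IDLE
    done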
       

  • Browse the LFC:
     $ lfc-ls /grid/lhcb/LHCb/Collision12/LOG/ 
       
 

Dirac Storage Management System overview Work in progress, under construction

here.

Revision 26 (2013-03-14) - DanielaRemenska

Line: 1 to 1
 
Line: 180 to 180
 
   danielar@herault dremensk $ srm-bring-online -debug srm://storm-fe-lhcb.cr.cnaf.infn.it/t1d0/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/125977/125977_0000000019.raw
   
Added:
  • Prestage files using gfal in python:
    # initialise a gfal request: one SURL on the CNAF StoRM endpoint, LHCb-Tape space token,
    # 24h desired pin time, 30s timeout
    import gfal
    print 'GFAL version', gfal.gfal_version()
    gfalDict = {'srmv2_spacetokendesc': 'LHCb-Tape', 'no_bdii_check': 1, 'srmv2_desiredpintime': 86400, 'defaultsetype': 'srmv2', 'timeout': 30, 'nbfiles': 1, 'surls': ['srm://storm-fe-lhcb.cr.cnaf.infn.it:8444/srm/managerv2?SFN=/t1d0/lhcb/archive/lhcb/MC/MC10/ALLSTREAMS.DST/00009779/0000/00009779_00001506_1.allstreams.dst'], 'protocols': ['file', 'dcap', 'gsidcap', 'xroot', 'root', 'rfio']}
    errCode, gfalObject, errMessage = gfal.gfal_init( gfalDict )
    print 'gfal.gfal_init:', errCode, errMessage

    # submit the prestage (bring-online) request
    errCode, gfalObject, errMessage = gfal.gfal_prestage( gfalObject )
    print 'gfal.gfal_prestage:', errCode, errMessage

    # fetch the per-SURL results (status of each bring-online request)
    numberOfResults, gfalObject, listOfResults = gfal.gfal_get_results( gfalObject )
    for result in listOfResults:
      print 'result per surl', result
       
 

Dirac Storage Management System overview Work in progress, under construction

here.

Revision 19 (2012-12-12) - DanielaRemenska

Line: 1 to 1
 
Line: 92 to 92
  dirac-admin-ban-se -w CNAF-USER
Added:
  • Allowing a site after its downtime is over. First list all of its SEs:
     $ dirac-admin-site-info LCG.IN2P3.fr
    {'CE': 'cccreamceli05.in2p3.fr, cccreamceli06.in2p3.fr',
     'Coordinates': '4.8655:45.7825',
     'Mail': 'grid.admin@cc.in2p3.fr',
     'MoUTierLevel': '1',
     'Name': 'IN2P3-CC',
     'SE': 'IN2P3-RAW, IN2P3-DST, IN2P3_M-DST, IN2P3-USER, IN2P3-FAILOVER, IN2P3-RDST, IN2P3_MC_M-DST, IN2P3_MC-DST, IN2P3-ARCHIVE, IN2P3-BUFFER'}
       
    Now let's unban:
       $ dirac-admin-allow-site LCG.IN2P3.fr "Downtime finished"
       $ dirac-admin-allow-se IN2P3-RAW, IN2P3-DST, IN2P3_M-DST, IN2P3-USER, IN2P3-FAILOVER, IN2P3-RDST, IN2P3_MC_M-DST, IN2P3_MC-DST, IN2P3-ARCHIVE, IN2P3-BUFFER "Downtime finished"
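       A sketch for unbanning all of the site's SEs in one loop, assuming dirac-admin-allow-se also accepts a single SE per call (the SE names are those listed by dirac-admin-site-info above):
       for se in IN2P3-RAW IN2P3-DST IN2P3_M-DST IN2P3-USER IN2P3-FAILOVER IN2P3-RDST \
                 IN2P3_MC_M-DST IN2P3_MC-DST IN2P3-ARCHIVE IN2P3-BUFFER; do
           dirac-admin-allow-se $se "Downtime finished"
       done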
       
 
  • You want to investigate on which nodes jobs failed at LCG.Dortmund.de:
       dirac-wms-jobs-select-output-search --Site=LCG.Dortmund.de --Status='Failed' --Date=2008-09-19 'running on '

Revision 18 (2012-12-12) - DanielaRemenska

Line: 1 to 1
 
Line: 42 to 42
 
Added:
 
  • Mario's script for shifter report:
      $ssh lxplus
Line: 142 to 143
 
GlueCEPolicyMaxWallClockTime
4320
Added:
  • Get file descendants to check whether the file was indeed processed:
         dirac-bookkeeping-get-file-descendants /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81609/081609_0000000035.raw 9
       

  • Productions with some Unused files whose run number = zero in the production DB (a fix):
        dirac-transformation-debug 16771 --Status Unused
       

 

Dirac Storage Management System overview Work in progress, under construction

here.

Revision 17 (2012-12-12) - DanielaRemenska

Line: 1 to 1
 
Line: 41 to 41
 
Changed:
 
  • Mario's script for shifter report:
      $ssh lxplus
Line: 126 to 126
  CERN-CASTORBUFFER file, xroot, root, dcap, gsidcap, rfio
Added:
  • Get BDII site info on MaxWallclockTimes
       $ dirac-admin-site-info LCG.RAL.uk
        {'CE': 'lcgce05.gridpp.rl.ac.uk, lcgce04.gridpp.rl.ac.uk',
          'Coordinates': '-1.32:51.57',
          'Mail': 'lcg-support@gridpp.rl.ac.uk',
          'Name': 'RAL-LCG2',
    
        $ dirac-admin-bdii-ce-state lcgce04.gridpp.rl.ac.uk | grep MaxWallClockTime
        GlueCEPolicyMaxWallClockTime: 120
        GlueCEPolicyMaxWallClockTime: 4320
        GlueCEPolicyMaxWallClockTime: 4320
        GlueCEPolicyMaxWallClockTime: 4320
       GlueCEPolicyMaxWallClockTime: 4320
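        A sketch for checking MaxWallClockTime on every CE of the site at once (the CE names are those from the dirac-admin-site-info output above):
        for ce in lcgce04.gridpp.rl.ac.uk lcgce05.gridpp.rl.ac.uk; do
            echo "== $ce =="
            dirac-admin-bdii-ce-state $ce | grep MaxWallClockTime
        done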
        
 

Dirac Storage Management System overview Work in progress, under construction

here.

Revision 14 (2012-12-11) - DanielaRemenska

Line: 1 to 1
 
Line: 95 to 95
  dirac-wms-jobs-select-output-search --Site=LCG.Dortmund.de --Status='Failed' --Date=2008-09-19 'running on '

Added:
  • Web portal stuck: how to restart it
         runsvctrl t runit/Web/Paster
        
    If this is not enough, find all Paster processes ('ps faux | grep -i web_paster') and kill -9 them.
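    A hedged sketch of that fallback; it assumes the PIDs can be picked out of the second column of ps faux, and that runsvctrl u on the same runit path brings the portal back afterwards:
         ps faux | grep -i web_paster | grep -v grep                               # inspect first
         ps faux | grep -i web_paster | grep -v grep | awk '{print $2}' | xargs -r kill -9
         runsvctrl u runit/Web/Paster                                              # assumed restart step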

  • An example for checking whether the WMS Job Manager service is up:
        dirac-framework-ping-service WorkloadManagement JobManager
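    The same command should work for other System/Service pairs, e.g. the Storage Manager service used elsewhere on this page (assumed to answer pings like any DISET service):
        dirac-framework-ping-service StorageManagement StorageManagerHandler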
        

  • List of currently banned sites:
         dirac-admin-get-banned-sites
        

  • Banning a site:
     dirac-admin-ban-site LCG.CERN.ch --comment="All jobs failing with Application not Found error" 
 

Dirac Storage Management System overview Work in progress, under construction

here.

Revision 13 (2012-12-10) - DanielaRemenska

Line: 1 to 1
 
Line: 40 to 40
 
Added:
 
  • Mario's script for shifter report:
      $ssh lxplus
Line: 73 to 74
 SetupProject.sh --debug --use="AppConfig v3r151" --use="SQLDDDB v7r9" --use="ProdConf" Brunel v43r2p2 gfal CASTOR lfc oracle dcache_client --use-grid
Added:
  • get the production environment
        (on lxplus)
        SetupProject LHCbDIRAC
        lhcb-proxy-init -g lhcb_prod
       

  • Banning an SE if a site is in downtime or full
         dirac-admin-ban-se -c LCG.RAL.uk 
        Example to ban one SE at RAL
          dirac-admin-ban-se RAL-DST
       Example to ban one SE in writing at CNAF
          dirac-admin-ban-se -w CNAF-USER
       

  • You want to investigate on which nodes jobs failed at LCG.Dortmund.de:
       dirac-wms-jobs-select-output-search --Site=LCG.Dortmund.de --Status='Failed' --Date=2008-09-19 'running on '
       

 

Dirac Storage Management System overview Work in progress, under construction

here.

Revision 7 (2012-09-16) - DanielaRemenska

Line: 1 to 1
 
Line: 16 to 16
 
Added:

Shifter's 101

  • For files in MaxReset, see which jobs attempted to run on them (NB: there is a dirac-admin script for this also!)
    [volhcb22] /home/dirac > ./jobs4file.sh 19995 /lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/127979/127979_0000000005.raw
       
 

Dirac Storage Management System overview Work in progress, under construction

here.

Revision 5 (2011-08-11) - DanielaRemenska

Line: 1 to 1
 
Line: 13 to 13
 
Changed:
 

Dirac Storage Management System overview Work in progress, under construction

here.
Line: 188 to 189
 

To see your jobs

https://lhcbweb.pic.es/DIRAC/LHCb-Development/lhcb_user/jobs/JobMonitor/display
Added:

Some useful Git/GitHub commands:

Fixing committed mistakes: git revert HEAD

create new branch: git branch experimental

to switch to the new branch: git checkout experimental

commit all changes: git commit -a

copying changes from the local branch to the remote one: git push origin fixes-sms:refs/heads/fixes-sms

discard the last (not yet published) commit: git reset --hard HEAD~1 (for an already-published change, prefer git revert)

Graphical overview with all branches/commits: gitk &

Checking out an existing git repo: git clone git@github.com:remenska/DIRAC.git

cd DIRAC/

git checkout -b fixes-sms origin/fixes-sms
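Putting the commands above together, a typical end-to-end flow for a fix on the fixes-sms branch (the commit message is just an example):

git clone git@github.com:remenska/DIRAC.git
cd DIRAC/
git checkout -b fixes-sms origin/fixes-sms
# ... edit the code ...
git commit -a -m "fix for the SMS"
git push origin fixes-sms:refs/heads/fixes-sms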

Life saver for testing any DIRAC code on the fly:

on volhcb22
# use the server certificate so that direct DB calls are authorised
from DIRAC.Core.Base import Script
Script.addDefaultOptionValue( '/DIRAC/Security/UseServerCertificate', 'yes' )
Script.parseCommandLine( ignoreErrors = False )
from DIRAC.StorageManagementSystem.DB.StorageManagementDB import StorageManagementDB
storageDB = StorageManagementDB()
# e.g. list all cache replicas currently in status StageSubmitted
res = storageDB.getCacheReplicas( {'Status':'StageSubmitted'} )
print res
 -- DanielaRemenska - 04-Apr-2011

Revision 1 (2011-04-04) - DanielaRemenska

Line: 1 to 1
Added:

Useful links for LHCb

How to's.

How to access volhcb12.

From any machine except lxplus, you should login first to lxvoadm.cern.ch and then to volhcb12.

ssh dremensk@lxvoadm.cern.ch

sudo su dirac

mysql -p -uDirac

mysql> show databases;

Useful to count the running processes of an agent

ps -ef | grep RequestFinalizationAgent | wc -l

path: /opt/dirac/pro/DIRAC/

Submit a job

dirac-wms-job-submit Simple.jdl

Simple.jdl:

JobName = "Simple_Job";
Executable = "/bin/ls";
Arguments = "-ltr";
StdOutput = "StdOut";
StdError = "StdErr";
OutputSandbox = {"StdOut","StdErr"};
InputData =
        {
             "LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw"
        };

BannedSites =
        {
            "LCG.CERN.ch"
        };
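A short usage sketch to go with this JDL; dirac-wms-job-status and dirac-wms-job-get-output are assumed to be available alongside dirac-wms-job-submit, and the job ID 12345 is just a placeholder:

dirac-wms-job-submit Simple.jdl        # prints the assigned JobID
dirac-wms-job-status 12345             # follow the job through Waiting/Running/Done
dirac-wms-job-get-output 12345         # retrieve the OutputSandbox (StdOut, StdErr)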

Script to restart all agents/services

runsvctrl d /opt/dirac/startup/StorageManagement_StorageManagerHandler
runsvctrl d /opt/dirac/startup/StorageManagement_RequestPreparationAgent
runsvctrl d /opt/dirac/startup/StorageManagement_StageRequestAgent
runsvctrl d /opt/dirac/startup/StorageManagement_StageMonitorAgent
runsvctrl d /opt/dirac/startup/StorageManagement_RequestFinalizationAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StorageManagerHandler
runsvctrl u /opt/dirac/startup/StorageManagement_RequestPreparationAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StageRequestAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StageMonitorAgent
runsvctrl u /opt/dirac/startup/StorageManagement_RequestFinalizationAgent
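
The same restart can be written as a loop over the component names (purely a compact form of the script above):

for c in StorageManagerHandler RequestPreparationAgent StageRequestAgent StageMonitorAgent RequestFinalizationAgent; do
  runsvctrl d /opt/dirac/startup/StorageManagement_$c
done
for c in StorageManagerHandler RequestPreparationAgent StageRequestAgent StageMonitorAgent RequestFinalizationAgent; do
  runsvctrl u /opt/dirac/startup/StorageManagement_$c
done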

Script to immediately check all logs on volhcb12

tail -150 /opt/dirac/runit/StorageManagement/RequestPreparationAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StageRequestAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StageMonitorAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/RequestFinalizationAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StorageManagerHandler/log/current

To clear content of logs

 echo -n > current 

A list of LFNs for testing the staging procedure

(under /project/bfys/dremensk/cmtdev/InputFiles.txt)

LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/69924/069924_0000000001.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71493/071493_0000000057.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71479/071479_0000000001.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/70171/070171_0000000001.raw

How to check the SPACE TOKEN(s) for a file

 $ dirac-dms-lfn-replicas LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw
2011-03-15 16:55:13 UTC dirac-dms-lfn-replicas/DiracAPI  INFO: Replica Lookup Time: 0.23 seconds
{'Failed': {},
'Successful': {'/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw': {'CERN-RAW': 'srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw'}}}

Setting up a manual request from python directly

from DIRAC.Core.Base.Script import parseCommandLine
parseCommandLine()
from DIRAC.Core.DISET.RPCClient import RPCClient
s = RPCClient("StorageManagement/StorageManagerHandler")
s.getWaitingReplicas()
s.getTasksWithStatus('Done')
---------------
s = RPCClient("WorkloadManagement/JobMonitoring")
s.getJobTypes()
{'OK': True, 'rpcStub': (('WorkloadManagement/JobMonitoring', {'skipCACheck': False, 'delegatedGroup': 'lhcb_user', 'delegatedDN': '/O=dutchgrid/O=users/O=nikhef/CN=Daniela Remenska', 'timeout': 600}), 'getJobTypes', ()), 'Value': ['DataReconstruction', 'DataStripping', 'MCSimulation', 'Merge', 'SAM', 'User']}
---------------
>>> s.getStates()
{'OK': True, 'rpcStub': (('WorkloadManagement/JobMonitoring', {'skipCACheck': False, 'delegatedGroup': 'lhcb_user', 'delegatedDN': '/O=dutchgrid/O=users/O=nikhef/CN=Daniela Remenska', 'timeout': 600}), 'getStates', ()), 'Value': ['Checking', 'Completed', 'Done', 'Failed', 'Killed', 'Matched', 'Received', 'Rescheduled', 'Running', 'Stalled', 'Waiting']}
--------------
s.getStageRequests({'StageStatus':'Staged'})
>>> s.getStageRequests({'StageStatus':'Staged'})['Value'][1874854]
{'PinExpiryTime': datetime.datetime(2011, 3, 4, 11, 7, 26), 'StageRequestCompletedTime': datetime.datetime(2011, 3, 3, 11, 47, 26), 'StageStatus': 'Staged', 'RequestID': '140117334', 'StageRequestSubmitTime': datetime.datetime(2011, 3, 3, 11, 46, 53), 'PinLength': 86400L}
--------------

s = RPCClient("StorageManagement/StorageManagerHandler")
s.setRequest({'CERN-RDST':'/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw'},'DanielaTest','method@DanielaTest/TestHandler',999999)
-----------------

Procedure to deploy new code

  1. cd /project/bfys/dremensk/cmtdev/LHCbDirac_v5r11p3
  2. svn update
  3. Stop the agents and services in the SMS:

runsvctrl d /opt/dirac/startup/StorageManagement_RequestPreparationAgent

  4. Check if it is in fact disabled:

ps -ef | grep RequestPreparationAgent

  5. Set the logging level to debug:

emacs /opt/dirac/runit/StorageManagement/RequestPreparationAgent/run    (set LogLevel=DEBUG)

  6. The modified code needs to be copied to the appropriate path on volhcb12:

cd /opt/dirac/pro/DIRAC/StorageManagementSystem/Agent
scp danielar@login.nikhef.nl:/project/bfys/dremensk/cmtdev/LHCbDirac_v5r8/DIRAC/StorageManagementSystem/Agent/RequestPreparationAgent.py .

  7a. For an Agent:
(ONLY if new, not an update) dirac-install-agent StorageManagement RequestPreparationAgent, then start the agent:

runsvctrl u /opt/dirac/startup/StorageManagement_RequestPreparationAgent

  7b. For a Service:

cd /opt/dirac/pro
(if new) ./scripts/install_service.sh DataManagement testDMS

cd /opt/dirac/startup

ln -s /opt/dirac/pro/runit/DataManagement/testDMS DataManagement_testDMS

Once this link has been created, then the service will automatically start.

  8. Check the log to see if your modifications are visible:

cat /opt/dirac/runit/StorageManagement/RequestPreparationAgent/log/current

  9. If all is OK, SET BACK THE LOG LEVELS TO INFO:

emacs /opt/dirac/runit/StorageManagement/RequestPreparationAgent/run

To browse SRM

srmls srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/

To see your jobs

https://lhcbweb.pic.es/DIRAC/LHCb-Development/lhcb_user/jobs/JobMonitor/display

-- DanielaRemenska - 04-Apr-2011

 