AAA Operations Troubleshooting Guide

Federation Problems

  1. Case 1: a user complains that a file cannot be accessed via Xrootd ("No servers are available to read the file." (errno=3011))
    • For example, suppose the file is: /store/mc/Phys14DR/QCD_Pt-170to300_Tune4C_13TeV_pythia8/ALCARECO/TkAlMinBias-PU20bx25_trkalmb_castor_PHYS14_25_V1-v2/40000/6C2A7B06-41B1-E411-BA33-0025901D4C74.root
    1. Check where the file resides physically:
    2. from the dataset name, get the sites where the dataset exists
    3. identify if there is a site offering Xrootd access
    4. if all the previous checks are positive, the file SHOULD be accessible via Xrootd
    5. try and access the file:
      1. open an lxplus session
      2. run "voms-proxy-init -voms cms"
      3. run "xrdcp -d 3 -f root://cms-xrd-global.cern.ch//store/mc/Phys14DR/QCD_Pt-170to300_Tune4C_13TeV_pythia8/ALCARECO/TkAlMinBias-PU20bx25_trkalmb_castor_PHYS14_25_V1-v2/40000/6C2A7B06-41B1-E411-BA33-0025901D4C74.root /dev/null"
      4. if you get the same error, check the error message
      5. if you see errors like "cannot connect to ….", the host named in the message is most probably the problem
    6. try and locate the file
      1. open an lxplus session
      2. run "voms-proxy-init -voms cms"
      3. run "xrd cms-xrd-global.cern.ch locateall /store/mc/Phys14DR/QCD_Pt-170to300_Tune4C_13TeV_pythia8/ALCARECO/TkAlMinBias-PU20bx25_trkalmb_castor_PHYS14_25_V1-v2/40000/6C2A7B06-41B1-E411-BA33-0025901D4C74.root"
      4. if no server is returned, the file is apparently not accessible
      5. if servers are returned, the file should be accessible, and the error must be in one of the returned servers
      6. if you cannot connect to cms-xrd-global.cern.ch, that is the problem
      7. if the above fail, try to access the file via the regional redirectors. Use the same xrdcp command, replacing the global redirector with one of:
        • xrootd-cms.infn.it
        • cmsxrootd.fnal.gov
      8. if either of the two works, the problem is in the global redirector
      9. if neither works, chances are the site(s) providing the file have problems on their xrootd servers
    7. if the previous workflow points to a problem in one of
      • the regional redirector
      • the global redirector
      • a site
    8. then open a GGUS ticket to the site, category CMS WAN Access - AAA
      • for the global redirector, the site is CERN
      • for a regional redirector, do not select a site; assign the ticket only to CMS WAN Access - AAA
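
The redirector checks in steps 6.7 to 6.9 can be sketched in a few lines of Python. This is only an illustration, not an official AAA tool: the names xrdcp_command and diagnose are hypothetical, and the outcome of each copy attempt is supplied by hand instead of being measured. Only the redirector hosts and the xrdcp options are taken from the steps above.

```python
# Hypothetical helper names; hosts and xrdcp flags are the ones used in this guide.
GLOBAL_REDIRECTOR = "cms-xrd-global.cern.ch"
REGIONAL_REDIRECTORS = ["xrootd-cms.infn.it", "cmsxrootd.fnal.gov"]


def xrdcp_command(redirector, lfn, debug_level=3):
    """Build the xrdcp test command from the guide as an argument list."""
    # lfn starts with "/", so the URL ends up as root://host//store/...
    url = "root://{}/{}".format(redirector, lfn)
    return ["xrdcp", "-d", str(debug_level), "-f", url, "/dev/null"]


def diagnose(global_ok, regional_ok):
    """Map copy results to the component the guide would suspect.

    global_ok   -- True if the copy via the global redirector worked
    regional_ok -- dict mapping regional redirector host to True/False
    """
    if global_ok:
        return "no federation problem seen"
    if any(regional_ok.values()):
        return "global redirector"
    return "site xrootd servers"


if __name__ == "__main__":
    lfn = "/store/example/file.root"  # placeholder LFN, not a real file
    for host in [GLOBAL_REDIRECTOR] + REGIONAL_REDIRECTORS:
        print(" ".join(xrdcp_command(host, lfn)))
```

To run the copies for real, each argument list could be passed to subprocess.run on a machine with a valid CMS proxy, and the return codes fed into diagnose.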

  2. Case 2: reading tests at a site are failing
    1. Attempt to locate test path using xrd:
      • xrd locateall /store/test/xrootd//store
    2. If the above returns "No matching files found", attempt to access a file on the site via xrdcp
      • Find a file that should be on the site using DAS and attempt to xrdcp it. For example:
      • xrdcp -d 1 -f root://cmsxrootd.fnal.gov//store/test/xrootd/T2_US_Nebraska//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
    3. If xrdcp fails as well, open a GGUS ticket to the site, category CMS WAN Access - AAA. Provide the following information for the site admins, so they can start troubleshooting:

Storage Problems

EOS Problems

dCache Problems

--02.05.2017-- We encountered the following error at T2_IT_Rome:

[b] XrdCl::File::Open(name='root://cms-xrd-global.cern.ch//store/test/xrootd/T2_IT_Rome/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3012] Failed to open file (General problem: No free port within range [666])

A solution was provided by Ulf Tigerstedt:

dCache has settings for:

dcache.net.wan.port.min = 20000
dcache.net.wan.port.max = 25000
dcache.net.lan.port.min = 20000
dcache.net.lan.port.max = 25000

Here .wan. applies to WAN-safe protocols; xrootd is considered unsafe, so it uses the .lan. range. If the port range of a pool (max minus min) is smaller than the number of movers allowed, the pool will run out of ports (and it will give a nice error message).

It will eventually recover when the other transfers end and ports are freed.
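
The port arithmetic behind this error can be checked quickly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, each active mover is assumed to hold one port, both ends of the range are assumed usable, and the mover counts in the usage lines are invented for illustration.

```python
def available_ports(port_min, port_max):
    """Number of ports in the range, assuming both ends are usable."""
    return port_max - port_min + 1


def pool_can_run_out(port_min, port_max, max_movers):
    """True if more movers are allowed than there are ports in the range."""
    return available_ports(port_min, port_max) < max_movers


if __name__ == "__main__":
    # Range from the settings shown above: 20000 to 25000.
    print(available_ports(20000, 25000))
    # Illustrative mover limits only; real values come from the pool configuration.
    print(pool_can_run_out(20000, 25000, 6000))
    print(pool_can_run_out(20000, 25000, 1000))
```

If the check reports that the pool can run out, either the .lan. port range can be widened or the number of allowed movers reduced.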

DPM Problems

HDFS Problems

Others

  • If cmsd is stuck at a site, check how many threads xrootd is using

Check if site is subscribed

SAM xrootd-access

  • Quick link to ETF monitoring for particular xrootd-access test for Nebraska site
  • or, on the CMS ETF monitoring page, type "s:xrootd-access h:unl.edu" in the quicksearch field and press Enter

SAM xrootd-fallback

Launching Scale Test

storage.xml vs site-local-config.xml

Sites refer to SITECONF/PhEDEx/storage.xml in their SITECONF/JobConfig/site-local-config.xml. If you suspect that a protocol is wrongly defined in storage.xml, the PhEDEx TestCatalogue tool from PHEDEX/Utilities may come in handy for debugging. It checks the storage.xml syntax and tries the lfn-pfn-lfn conversion for a given protocol. For example:

-bash-4.1$ source /cvmfs/cms.cern.ch/phedex/slc6_amd64_gcc493/cms/PHEDEX/4.2.1/etc/profile.d/init.sh
-bash-4.1$ TestCatalogue -c /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml -p local-xrootd  -L /store/data/
Testing file name mappings in /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml using protocol local-xrootd
LFN: /store/data/
PFN: root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/data/
TKN:
Re-LFN:  *** ERROR: result different from /store/data/ ()
-bash-4.1$ TestCatalogue -c /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml -p xrootd  -L /store/data/
Testing file name mappings in /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml using protocol xrootd
LFN: /store/data/
PFN: root://xrootd-cms.infn.it//store/data/
TKN:
Re-LFN:  *** ERROR: result different from /store/data/ () 

The output means that TestCatalogue could do the lfn-to-pfn conversion for the local-xrootd protocol successfully, but the reverse conversion (pfn-to-lfn) failed because the result did not match the original lfn. If you do not need the pfn-to-lfn conversion, that is OK. Otherwise you can correct your pfn-to-lfn rule; for example, this small catalogue works both ways:

-bash-4.1$  cat test2.xml
<storage-mapping>
  <lfn-to-pfn protocol="local-xrootd" destination-match=".*" path-match="/+store/(.*)" result="root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/$1"/>
  <pfn-to-lfn protocol="local-xrootd" destination-match=".*" path-match="root:/+.*brunel.ac.uk/+dpm/brunel.ac.uk/home/cms/(.*)" result="/$1"/>
</storage-mapping>
-bash-4.1$ TestCatalogue -p local-xrootd -L /store/data/ -c test2.xml
Testing file name mappings in test2.xml using protocol local-xrootd
LFN: /store/data/
PFN: root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/data/
TKN:
Re-LFN: /store/data/
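
The round trip that TestCatalogue performs with test2.xml can be imitated with ordinary regular expressions, which helps when drafting a rule before testing it with the real tool. This is only a sketch and not how PhEDEx is implemented: the helper names apply_rule and round_trip are hypothetical, the pattern is assumed to be applied to the whole name, and only the "$1" capture group is handled. The two rules are copied from test2.xml.

```python
import re

# (path-match, result) pairs copied from test2.xml; "$1" is the first capture group.
LFN_TO_PFN = (r"/+store/(.*)",
              "root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/$1")
PFN_TO_LFN = (r"root:/+.*brunel.ac.uk/+dpm/brunel.ac.uk/home/cms/(.*)", "/$1")


def apply_rule(rule, name):
    """Return the mapped name, or None if path-match does not match."""
    path_match, result = rule
    m = re.fullmatch(path_match, name)  # assumption: whole-string match
    if m is None:
        return None
    return result.replace("$1", m.group(1))


def round_trip(lfn):
    """Convert lfn to pfn and back, returning (pfn, re_lfn)."""
    pfn = apply_rule(LFN_TO_PFN, lfn)
    re_lfn = apply_rule(PFN_TO_LFN, pfn) if pfn is not None else None
    return pfn, re_lfn


if __name__ == "__main__":
    pfn, re_lfn = round_trip("/store/data/")
    print("PFN:", pfn)
    print("Re-LFN:", re_lfn)
```

A rule pair is consistent for a given LFN when the second element returned by round_trip equals the input.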
Topic revision: r13 - 2018-07-26 - DonataMielaikaite
 