AAA Operations Troubleshooting Guide
Federation Problems
- Case 1: a user complains that “a file cannot be accessed via Xrootd” ('No servers are available to read the file.' (errno=3011))
- for example, say the file is: /store/mc/Phys14DR/QCD_Pt-170to300_Tune4C_13TeV_pythia8/ALCARECO/TkAlMinBias-PU20bx25_trkalmb_castor_PHYS14_25_V1-v2/40000/6C2A7B06-41B1-E411-BA33-0025901D4C74.root
- Check where the file resides physically:
- from the dataset name, get the sites where dataset exists
- identify if there is a site offering Xrootd access
- if all the previous checks are positive, the file SHOULD be accessible via Xrootd
- try and access the file:
- log in to lxplus
- do “voms-proxy-init -voms cms”
- do “xrdcp -d 3 -f root://cms-xrd-global.cern.ch//store/mc/Phys14DR/QCD_Pt-170to300_Tune4C_13TeV_pythia8/ALCARECO/TkAlMinBias-PU20bx25_trkalmb_castor_PHYS14_25_V1-v2/40000/6C2A7B06-41B1-E411-BA33-0025901D4C74.root /dev/null”
- if you get the same error, check the error message
- if you see errors like “cannot connect to ….”, most probably the server named in the message is the problem
- try and locate the file
- log in to lxplus
- do “voms-proxy-init -voms cms”
- do “xrd cms-xrd-global.cern.ch locateall /store/mc/Phys14DR/QCD_Pt-170to300_Tune4C_13TeV_pythia8/ALCARECO/TkAlMinBias-PU20bx25_trkalmb_castor_PHYS14_25_V1-v2/40000/6C2A7B06-41B1-E411-BA33-0025901D4C74.root”
- if you do not get any server, the file is apparently not accessible
- if you do, the file should be accessible, and the error must be in the servers returned to you
- if you cannot connect to cms-xrd-global.cern.ch, that is the problem
- if the above fails, try to access the file via the regional redirectors. Use the same xrdcp commands but with these servers:
- xrootd-cms.infn.it
- cmsxrootd.fnal.gov
- if either of the two works, the problem is in the global redirector
- if neither works, chances are the site(s) providing the files have problems on their xrootd servers
- if in the previous workflow you identified the problem in
- the regional redirector
- the global redirector
- a site
- please open a GGUS ticket to the site, category CMS WAN Access - AAA
- site for global redirector = CERN
- sites for regional redirectors = do not select a site; assign the ticket only to CMS WAN Access - AAA
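The Case 1 workflow above can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function names (xrootd_url, try_read, diagnose, check_lfn) are made up for illustration, a valid proxy (voms-proxy-init -voms cms) and the xrdcp client are assumed to be available, and the xrdcp exit code is taken as the only success criterion.

```shell
#!/usr/bin/env bash
# Sketch of the Case 1 decision tree; helper names are illustrative only.

GLOBAL_REDIRECTOR="cms-xrd-global.cern.ch"
REGIONAL_REDIRECTORS="xrootd-cms.infn.it cmsxrootd.fnal.gov"

# Build the root:// URL for an LFN (starting with /store) behind a redirector.
xrootd_url() {
    local redirector="$1" lfn="$2"
    printf 'root://%s/%s\n' "$redirector" "$lfn"
}

# Try to copy the file to /dev/null; exit status 0 means the read worked.
try_read() {
    xrdcp -d 1 -f "$(xrootd_url "$1" "$2")" /dev/null >/dev/null 2>&1
}

# Map the outcomes onto the conclusions listed above.
# $1 = "ok" if the global redirector worked
# $2 = "ok" if at least one regional redirector worked
diagnose() {
    local global_ok="$1" regional_ok="$2"
    if [ "$global_ok" = "ok" ]; then
        echo "file readable via the global redirector"
    elif [ "$regional_ok" = "ok" ]; then
        echo "problem is in the global redirector"
    else
        echo "problem is likely at the site(s) hosting the file"
    fi
}

# Run the whole check for one LFN.
check_lfn() {
    local lfn="$1" global="fail" regional="fail" r
    try_read "$GLOBAL_REDIRECTOR" "$lfn" && global="ok"
    for r in $REGIONAL_REDIRECTORS; do
        try_read "$r" "$lfn" && regional="ok"
    done
    diagnose "$global" "$regional"
}
```

Calling check_lfn with the LFN of the problematic file prints one of the three conclusions. It does not replace reading the xrdcp debug output, which is still needed to see which server refused the connection.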
- Reading tests at a site are failing
- Attempt to locate test path using xrd:
- xrd locateall /store/test/xrootd//store
- If the above returns "No matching files found", attempt to access a file on the site via xrdcp
- Find a file that should be on the site using DAS and attempt to xrdcp it. For example:
- xrdcp -d 1 -f root://cmsxrootd.fnal.gov//store/test/xrootd/T2_US_Nebraska//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root /dev/null
- If xrdcp fails as well, open a GGUS ticket to the site, category CMS WAN Access - AAA. Provide the following information for the site admins, so they can start troubleshooting:
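The site-specific test path used in the xrdcp example above follows a fixed pattern: the site name is inserted after /store/test/xrootd/ and the usual LFN follows. A minimal sketch of building that URL is shown here; the helper names site_test_url and site_read_test are hypothetical, and the pattern is inferred only from the Nebraska example.

```shell
#!/usr/bin/env bash
# Sketch: build and try the per-site test URL; helper names are illustrative.

# $1 = redirector, $2 = CMS site name, $3 = LFN starting with /store
site_test_url() {
    local redirector="$1" site="$2" lfn="$3"
    printf 'root://%s//store/test/xrootd/%s/%s\n' "$redirector" "$site" "$lfn"
}

# Attempt the copy to /dev/null; exit status 0 means the read test passed.
# Assumes a valid proxy and the xrdcp client.
site_read_test() {
    xrdcp -d 1 -f "$(site_test_url "$1" "$2" "$3")" /dev/null
}
```

For instance, site_test_url cmsxrootd.fnal.gov T2_US_Nebraska followed by the GenericTTbar LFN reproduces the URL used in the example above.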
Storage Problems
EOS Problems
- EOS/CERN is more important than a typical site, since prompt reco “loosely” depends on it
- if you get a warning from the T0 team saying they cannot access remote files from T0
- check that Xrootd access at CERN is ok; if yes, EOS is fine
- if not, check if the European redirector is ok; you can do that by checking whether European sites have a green xrootd-access test on
- if they are mostly green, it is NOT a problem with the EU redirector; it must be CERN
- open a GGUS ticket to cern-prod
- if they are mostly orange/red, the problem must be in the EU redirector
- open a GGUS ticket to CMS WAN Access - AAA
- you can also try and access the file via
- log in to lxplus
- do “voms-proxy-init -voms cms”
- do “xrdcp -d 3 -f root://cms-xrd-global.cern.ch// /dev/null”
- if it works for you, report back to T0 team
- if not, open the ticket
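The EOS triage above boils down to a short decision function. The sketch below is an illustration only: eos_triage is a made-up name, the inputs have to be collected by hand from the monitoring pages, and "mostly green" is assumed here to mean more than half of the European sites checked.

```shell
#!/usr/bin/env bash
# Sketch of the EOS triage; the "more than half" threshold is an assumption.

# $1 = "ok" if Xrootd access at CERN works
# $2 = number of European sites with a green xrootd-access test
# $3 = total number of European sites checked
eos_triage() {
    local cern="$1" green="$2" total="$3"
    if [ "$cern" = "ok" ]; then
        echo "EOS is fine"
    elif [ $((green * 2)) -gt "$total" ]; then
        # EU redirector looks healthy, so the problem must be at CERN
        echo "open a GGUS ticket to cern-prod"
    else
        # most EU sites fail, so the EU redirector is the suspect
        echo "open a GGUS ticket to CMS WAN Access - AAA"
    fi
}
```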
dCache Problems
--02.05.2017--
We encountered the following error at T2_IT_Rome:
XrdCl::File::Open(name='root://cms-xrd-global.cern.ch//store/test/xrootd/T2_IT_Rome/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3012] Failed to open file (General problem: No free port within range [666])
A solution was provided by Ulf Tigerstedt:
dCache has settings for:
dcache.net.wan.port.min = 20000
dcache.net.wan.port.max = 25000
dcache.net.lan.port.min = 20000
dcache.net.lan.port.max = 25000
where .wan. is for WAN-safe protocols, and xrootd, being considered unsafe, uses .lan.
If max - min for a pool is smaller than the number of movers allowed, the pool will run out of ports (and it will give a clear error message).
It will eventually recover when the other transfers end and ports are freed.
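The port-range condition described above is simple arithmetic and can be checked before the pool runs out. The following is a rough sketch, not a dCache tool: get_prop and port_range_ok are hypothetical helpers, the properties are read from standard input because the location of the configuration file is not given here, and the number of allowed movers must be supplied by the site admin.

```shell
#!/usr/bin/env bash
# Sketch: compare the configured port range with the allowed mover count.

# Print the value of one property from "key = value" lines on stdin.
get_prop() {
    awk -F'=' -v key="$1" '
        { k = $1; gsub(/[ \t]/, "", k) }
        k == key { v = $2; gsub(/[ \t]/, "", v); print v; exit }
    '
}

# Succeeds if max - min is at least the number of movers allowed.
# $1 = port min, $2 = port max, $3 = movers allowed on the pool
port_range_ok() {
    local min="$1" max="$2" movers="$3"
    [ $((max - min)) -ge "$movers" ]
}
```

With the values quoted above (20000 and 25000) the range is 5000, so port_range_ok 20000 25000 followed by the mover limit succeeds as long as that limit does not exceed 5000.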
DPM Problems
HDFS Problems
Others
- if cmsd is stuck for a site, try to check how many threads xrootd is using
Check if site is subscribed
- Shows all sites subscribed to production federation (http://dashb-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=224):
- xrdmapc --list all cms-xrd-global.cern.ch:1094
- Shows all sites subscribed to transition federation (https://dashb-ssb.cern.ch/dashboard/request.py/siteviewhistory?columnid=219):
- xrdmapc --list all cms-xrd-transit.cern.ch:1094
- If the site is subscribed to any federation, we need to test the subscriptions manually:
- First, check whether we can access the file directly through the site redirector
- If we can access the file directly, replace the site redirector in the URL with the production/transitional redirector
- If things don't work via the DNS alias, one can also test each redirector behind the DNS alias
- If the DNS alias is cms-xrd-transit.cern.ch, we can dive deeper and test via vocms031 and vocms032, which are behind that alias
- The same can be done with xrdmapc --list all <host | DNS alias>:1094
- If that doesn't work, please notify the site to investigate; best would be to ask them to look at cmsd.log for any clue and to suggest restarting the service.
- If a core dump was generated, ask them to send it to us for further debugging, and also recommend checking the system tweaks to apply:
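The subscription check above can be scripted around the same xrdmapc calls. This is a sketch under assumptions: listed_in and check_subscription are made-up names, and a site is treated as subscribed if a fragment of its host name appears anywhere in the xrdmapc output, without relying on the exact output layout.

```shell
#!/usr/bin/env bash
# Sketch: look for a site host in the xrdmapc listing of each federation.

FEDERATION_PRODUCTION="cms-xrd-global.cern.ch:1094"
FEDERATION_TRANSIT="cms-xrd-transit.cern.ch:1094"

# Succeeds if the given host-name fragment appears in the text on stdin.
listed_in() {
    grep -qiF -- "$1"
}

# Report in which federation the host fragment shows up.
check_subscription() {
    local pattern="$1" fed
    for fed in "$FEDERATION_PRODUCTION" "$FEDERATION_TRANSIT"; do
        if xrdmapc --list all "$fed" 2>/dev/null | listed_in "$pattern"; then
            echo "$pattern: found via $fed"
        else
            echo "$pattern: not found via $fed"
        fi
    done
}
```

To test an individual redirector behind a DNS alias, the same pipeline can be run by hand with that host and port in place of the alias.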
SAM xrootd-access
- Quick link to ETF monitoring for particular xrootd-access test for Nebraska site
- or, in CMS ETF monitoring page just write in quicksearch field s:xrootd-access h:unl.edu and hit enter
SAM xrootd-fallback
Launching Scale Test
storage.xml vs site-local-config.xml
Sites refer to SITECONF/PhEDEx/storage.xml in their SITECONF/JobConfig/site-local-config.xml. If you suspect a protocol is wrongly defined in storage.xml, the PhEDEx TestCatalogue tool from PHEDEX/Utilities may come in handy for debugging. It checks the storage.xml syntax and tries the lfn-pfn-lfn conversion for a given protocol. For example:
-bash-4.1$ source /cvmfs/cms.cern.ch/phedex/slc6_amd64_gcc493/cms/PHEDEX/4.2.1/etc/profile.d/init.sh
-bash-4.1$ TestCatalogue -c /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml -p local-xrootd -L /store/data/
Testing file name mappings in /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml using protocol local-xrootd
LFN: /store/data/
PFN: root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/data/
TKN:
Re-LFN: *** ERROR: result different from /store/data/ ()
-bash-4.1$ TestCatalogue -c /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml -p xrootd -L /store/data/
Testing file name mappings in /cvmfs/cms.cern.ch/SITECONF/T2_UK_London_Brunel/PhEDEx/storage.xml using protocol xrootd
LFN: /store/data/
PFN: root://xrootd-cms.infn.it//store/data/
TKN:
Re-LFN: *** ERROR: result different from /store/data/ ()
The output means that TestCatalogue could do the lfn-to-pfn conversion for the local-xrootd protocol successfully, but the reverse conversion (pfn-to-lfn) failed, as the result did not match the original lfn.
If you do not need the pfn-to-lfn conversion, that is OK. Otherwise you can correct your pfn-to-lfn rule, e.g. this little example works both ways:
-bash-4.1$ cat test2.xml
<storage-mapping>
<lfn-to-pfn protocol="local-xrootd" destination-match=".*" path-match="/+store/(.*)" result="root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/$1"/>
<pfn-to-lfn protocol="local-xrootd" destination-match=".*" path-match="root:/+.*brunel.ac.uk/+dpm/brunel.ac.uk/home/cms/(.*)" result="/$1"/>
</storage-mapping>
-bash-4.1$ TestCatalogue -p local-xrootd -L /store/data/ -c test2.xml
Testing file name mappings in test2.xml using protocol local-xrootd
LFN: /store/data/
PFN: root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/data/
TKN:
Re-LFN: /store/data/
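To see why the two rules in test2.xml round-trip, the same regular expressions can be emulated with sed. This is only an illustration: lfn_to_pfn, pfn_to_lfn and roundtrip_ok are hypothetical helpers, and sed -E stands in for whatever regex engine TestCatalogue itself uses, so corner cases may differ.

```shell
#!/usr/bin/env bash
# Sketch: emulate the test2.xml rules with sed to check the lfn-pfn-lfn round trip.

# lfn-to-pfn rule: path-match "/+store/(.*)"
lfn_to_pfn() {
    printf '%s\n' "$1" |
        sed -E 's#^/+store/(.*)$#root://dc2-grid-64.brunel.ac.uk//dpm/brunel.ac.uk/home/cms/store/\1#'
}

# pfn-to-lfn rule: path-match "root:/+.*brunel.ac.uk/+dpm/brunel.ac.uk/home/cms/(.*)"
pfn_to_lfn() {
    printf '%s\n' "$1" |
        sed -E 's#^root:/+.*brunel.ac.uk/+dpm/brunel.ac.uk/home/cms/(.*)$#/\1#'
}

# Succeeds if converting to a PFN and back returns the original LFN.
roundtrip_ok() {
    [ "$(pfn_to_lfn "$(lfn_to_pfn "$1")")" = "$1" ]
}
```

The reverse rule works because its capture group starts right after home/cms/, so the captured text begins with store/ and the result "/$1" restores the leading slash of the LFN.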