June 2010 Reports


30th June 2010 (Wednesday)

  • Experiment activities:
  • Data taking since last night. Reconstruction of the new data taken over the weekend seems to have a problem: these jobs consume more memory, and the LRMSes then kill them. This has to be further investigated on the LHCb application side.
GGUS (or RT) tickets:
  • other: 1
  • T0: 1
  • T1: 3
  • T2: 0

Issues at the sites and services

  • GOCDB v4: some instability in querying the PI (GGUS:59560)
  • T0 site issues:
    • Many pilots aborted against the ce201.cern.ch CREAM CE (GGUS:59559)
    • Affected by yesterday's AFS outage, like all other VOs, both in transferring data and in our shared areas. The volumes hosting the LHCb shared area have now been recovered and SAM jobs verified that the software is properly available to grid jobs.
    • SAM critical tests failing for several days at CERN (GGUS:59452). Any progress?
    • No progress on the GGUS ticket (GGUS:59422) asking CERN to investigate some transfers from UK sites (and others) failing there with the error below (this could provide some hints for the investigation of the UK sites upload issue reported elsewhere):
       globus_xio: System error in writev: Broken pipe
    • Both gLite WMS instances for LHCb have been upgraded with the latest patch (https://savannah.cern.ch/patch/index.php?3621), which will be ready in a few days to be picked up by all sites supporting LHCb (we warmly suggest they upgrade).

  • T1 site issues:
    • IN2P3: issue with almost all files in a run whose locality was not being properly reported, preventing reco jobs from being submitted (GGUS:59558)
    • CNAF: jobs do not manage to grab computing resources, preventing a couple of productions from running to completion. It looks like a share issue: very few jobs running and 90 jobs queued locally in the LRMS (GGUS:59562)
    • CNAF: similarly to the CERN ce201 issue, many jobs are aborting against ce07 with a proxy expiration problem (GGUS:59568)

    • SARA: confirmation that the two disk servers are back to life
  • T2 sites issues:
    • none

29th June 2010 (Tuesday)

  • Experiment activities:
  • No MC productions running. Reconstruction and reprocessing jobs (and merging) running to completion. Mainly waiting for new data.
GGUS (or RT) tickets:
  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • SAM critical tests failing for several days at CERN (GGUS:59452). Any progress?
    • No progress on the GGUS ticket (GGUS:59422) asking CERN to investigate some transfers from UK sites (and others) failing there with the error below (this could provide some hints for the investigation of the UK sites upload issue reported elsewhere):
       globus_xio: System error in writev: Broken pipe
    • Both gLite WMS instances for LHCb have been upgraded with the latest patch (https://savannah.cern.ch/patch/index.php?3621), which will be ready in a few days to be picked up by all sites supporting LHCb (we warmly suggest they upgrade).

  • T1 site issues:
    • SARA: as of today, the two-disk-server issue reported last week is the main thing preventing LHCb from carrying the reco-stripping to completion (GGUS:59447). Is there any forecast for when we can have these disk servers back?
    • RAL/SARA: investigations of the observed degradation in the SARA-RAL transfer channel are continuing. Many thanks to the people involved at both sites (GGUS:59397)
  • T2 sites issues:
    • none

28th June 2010 (Monday)

  • Experiment activities:
  • LHCb Week in St. Petersburg.
  • Received some data last night before the cryogenic issue. Reconstruction of this new data + reprocessing of old data. Due to a misconfiguration in DIRAC, many jobs ended up in queues (mainly at GridKA and IN2P3) too short to hold them and failed when killed by the LRMS. Problem fixed.
  • Reconstruction jobs over the new data are taking a bit too long on events whose number of tracks (and hence the number of combinations) is very high. Some events are real killers (up to 30 seconds per event); see the sketch after this list.
  • No MC productions running.
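
  To see how a tail of such slow events can push a reconstruction job past a batch-queue CPU limit, here is a back-of-the-envelope sketch. Only the 30 s/event figure comes from the report above; the events-per-file count, average event time, and queue limit are illustrative assumptions:

        # Sketch (Python): rough wall-time estimate for a reconstruction job,
        # illustrating how a small fraction of very slow ("killer") events
        # blows up the total time. Only the 30 s/event figure is from the
        # report; all other numbers are assumptions.

        def job_walltime(n_events, avg_s, killer_frac, killer_s):
            """Estimated CPU seconds to process one input file."""
            return n_events * ((1 - killer_frac) * avg_s + killer_frac * killer_s)

        events_per_file = 50_000        # assumed events per RAW file
        queue_limit_s = 18 * 3600       # assumed queue CPU limit (18 h)

        for frac in (0.00, 0.01, 0.05):
            t = job_walltime(events_per_file, 1.0, frac, 30.0)
            verdict = "OK" if t < queue_limit_s else "exceeds queue limit"
            print(f"killer fraction {frac:.0%}: {t / 3600:5.1f} h ({verdict})")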
GGUS (or RT) tickets:
  • T0: 2
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • SAM critical tests failing for several days at CERN (GGUS:59452)
    • Opened a GGUS ticket (GGUS:59422) to CERN to investigate some transfers from UK sites (and others) failing there with the error below (this could provide some hints for the investigation of the UK sites upload issue reported elsewhere):
       globus_xio: System error in writev: Broken pipe
    • The request to be notified when a job runs out of CPU time at CERN is on hold (RT690202)
  • T1 site issues:
    • SARA: another disk server seems to have been down over the weekend, with some data unavailable (GGUS:59447). This is not the same disk server reported by Onno on Friday
    • RAL: observed some degradation in the SARA-RAL transfer channel. Replication takes very long, and the channel is particularly slow compared with any other channel to RAL set up at RAL (or at CERN) (GGUS:59397). Any progress?
    • IN2P3: some users report problems fetching data from there (GGUS:59412). The issue is related to a wrongly installed French CA on the AFS UI.
  • T2 sites issues:
    • none

25th June 2010 (Friday)

  • Experiment activities:
  • Received a run (10 files in total) last night.
  • Reco-stripping-05 is running close to completion, while reco-stripping-04 is a bit behind.
  • MC productions still ongoing, with some problems on the application side.
GGUS (or RT) tickets:
  • T0: 1
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • 3D Streams replication to PIC failing since yesterday (the processes got stuck on one big transaction last night; the reason is unknown). This meant no new ConditionDB data replicated to the Iberian T1. Today at 10:22 it also affected the capture process, so for one hour no new data were replicated to any T1. The problem has since been fixed: replication to the T1s is back on track and all data gaps have been recovered. A GGUS ticket was opened for traceability (GGUS:59404), considering that during a data-taking period this problem would have compromised data reprocessing at the T1s (making all jobs crash) and would have been important enough to raise an ALARM. The reason the processes got stuck has to be further investigated.
    • wms203 in drain to apply the patch that cures issues against CREAM CEs
    • Upgraded the two remaining CREAM CEs, ce202 and ce203, to 3.2.6
    • The FAILOVER space token that filled up yesterday has been cleaned by the data manager. Crash files are uploaded to the debug area (served by the same service class)
  • T1 site issues:
    • PIC: the MC_M-DST space token is completely full, making all transfers there fail. PIC has provided its full quota to LHCb, so it is now up to LHCb to steer the upload of MC output data elsewhere.
    • RAL: observed some degradation in the SARA-RAL transfer channel. Replication takes very long, and the channel is particularly slow compared with any other channel to RAL set up at RAL (or at CERN) (GGUS:59397).
    • IN2P3: some users report problems fetching data from there (GGUS:59412)
    • SARA: disk server bee56 is down, again with InfiniBand problems. A list of affected files was provided.
  • T2 sites issues:
    • none

24th June 2010 (Thursday)

  • Experiment activities:
  • A lot of MC and reco-stripping (05) activities. Reco-stripping is running close to completion:
                                                 Processed(%)   Running/Waiting   Done/Files
    3.5 Magnet Down
    6979  no prescaling, runs up to 71816        100%           1/0               3375/3376
    6981  prescaling,    runs from 71817         94.5%          119/2             2272/2403

    3.5 Magnet Up
    6980  no prescaling, runs up to 71530        99.9%          0/0               2608/2611
    6982  prescaling,    runs from 71531         99.8%          2/1               3445/3453

  • All SAM jobs testing the LHCb applications failed due to bad options on the LHCb side.
GGUS (or RT) tickets:
  • T0: 1
  • T1: 0
  • T2: 2

Issues at the sites and services

  • VOMS: GGUS:59337 Inconsistencies observed between VOMRS and VOMS
  • T0 site issues:
    • The FAILOVER space token filled up (as reported by SLS and the service managers). The LHCb data manager has to clean it up.
  • T1 site issues:
    • SARA: dcap access issue reported at GGUS:59252 has been confirmed to be solved.
    • CNAF (and Legnaro): LSF returning information inconsistent with the BDII (GGUS:59274). The discussion is proceeding, and it looks like the site managers believe this is not a matter to be exposed to experiments. However, the problem is real: the information returned by LSF clients on the WNs is used to run pilots in filling mode.
  • T2 sites issues:
    • RO-07-NIPNE: problem with shared objects (GGUS:59339)
    • GRISU-CYBERSAR-CAGLIARI: shared area issue (GGUS:59334)

23rd June 2010 (Wednesday)

  • Experiment activities:
  • A lot of MC and reco-stripping (05) activities. Reco-stripping is having problems running jobs at CNAF, which slows progress a bit.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 2
  • T2: 4

Issues at the sites and services

  • T0 site issues:
    • The issue of LFC-RO not appearing in the SAMDB is still open (GGUS:59193)
    • Discrepancy between VOMRS and VOMS (GGUS:59337)
  • T1 site issues:
    • CNAF: LSF does not properly report the normalization factor used by the CPU-time-left utility to run in filling mode. The value is ~60 times larger than expected if the reference CPU is the machine advertised in the BDII (2373 2KSI) (GGUS:59274); see the sketch after this list.
    • CNAF: permission problem (GGUS:59316)
    • SARA: the file was stored on a server that was unavailable for a few minutes (GGUS:59252)
  • T2 sites issues:
    • Legnaro: problem similar to CNAF's in the value of the CPU factor reported by LSF (GGUS:59272)
    • MILANO-INFN: SQLite DB problem (GGUS:59315)
    • IL-TAU-HEP: shared area issue (GGUS:59301)
    • GR-04-FORTH-ICS: cannot open shared object (GGUS:59300)
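
  To illustrate how a mis-reported normalization factor breaks pilots in filling mode, here is a minimal sketch of the kind of time-left computation involved. This is not the actual DIRAC/LSF code, and all numbers are illustrative:

        # Sketch (Python): a pilot's "CPU time left" check. The queue limit is
        # expressed in normalized (reference-CPU) seconds; raw seconds consumed
        # on the worker node are scaled by the node's power factor.

        def remaining_normalized_time(limit_norm_s, used_raw_s, cpu_factor):
            return limit_norm_s - used_raw_s * cpu_factor

        # With a sane factor the pilot still sees plenty of time left:
        print(remaining_normalized_time(100_000, 3_600, 2.4))       # ~91,000 s

        # With a factor ~60x too large it believes no time is left and aborts
        # immediately, matching the symptom reported at CNAF:
        print(remaining_normalized_time(100_000, 3_600, 2.4 * 60))  # negative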

22nd June 2010 (Tuesday)

  • Experiment activities:
  • A lot of MC and reco-stripping (05) activities. Reco-stripping is having problems running jobs at CNAF, which slows progress a bit.
  • We need to coordinate an upgrade of all gLite WMS instances used by LHCb at the T1s to fix the annoying problem of jobs stuck in running status as reported by glite-wms-job-status. Tests in past weeks on the pilot gswms01 with the patch (https://savannah.cern.ch/patch/index.php?3621) installed showed a net improvement of this problem.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • LSF seems to report wrong information on the used CPU Time (GGUS:59247)
    • LFC-RO disappeared from the topology in the SAMDB, no SAM test results available any longer (GGUS:59193). Any news?
  • T1 site issues:
    • CNAF: LSF does not properly report the time left, so all pilots abort immediately believing there is no time left
    • CNAF: one CREAM CE had problems submitting pilots. The failed status of the jobs was then wrongly reported by our gLite WMS (GGUS:59253)
    • SARA: problem accessing data via dcap port (GGUS:59252)
  • T2 sites issues:

21st June 2010 (Monday)

  • Experiment activities:
  • Fairly busy weekend, with a lot of MC and reco-stripping (05) activities. No major issues to report.
  • During the weekend a critical SAM test (SWDIR) was changed, and at some sites it was failing because of wrong permissions set on some directories. This might well be a software deployment issue and have nothing to do with the shared area service itself. We downgraded the result to WARNING for this case, but some sites (CNAF/GridKA) might show red for some time. A sketch of the kind of check involved follows below.
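
  A minimal sketch of the kind of permission check such a test performs; the path and the required permission bits are illustrative assumptions, not the actual SAM test code:

        # Sketch (Python): flag shared-software-area directories that grid
        # jobs (running under pool accounts) cannot read or traverse.
        import os
        import stat

        def check_sw_area(root):
            bad = []
            for dirpath, _dirs, _files in os.walk(root):
                mode = os.stat(dirpath).st_mode
                # Directories must be world-readable and world-executable,
                # otherwise jobs from other accounts cannot use the software.
                if not (mode & stat.S_IROTH and mode & stat.S_IXOTH):
                    bad.append((dirpath, oct(mode & 0o777)))
            return bad

        # Illustrative location; the real shared-area path is site-dependent.
        for path, mode in check_sw_area("/opt/exp_soft/lhcb"):
            print(f"WARNING: {path} has mode {mode}")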

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • Found several lxplus machines freezing; the problem was diagnosed as an AFS table of callback-returns reaching capacity.
    • The 2nd queue at CERN, supposed to be switched off last week, was still appearing in the site BDII today with jobs being submitted through it. It had been removed only on some CEs, not on all.
    • Transparent low-risk intervention to change the order of the space token lookup as currently done by SRM, reordering as described here. This fixes transfer problems reported last week due to contention on the access of some files.
    • Dashboard SSB: does not update the information (input files for the feeders are properly updated but the information does not get propagated, GGUS:59231)
    • LFC-RO disappeared from the topology in the SAMDB, no SAM test results available any longer (GGUS:59193)
  • T1 site issues:
    • CNAF transfer issue last week understood: a GridFTP server was not properly connected to the new disk servers added last week.
  • T2 sites issues:

18th June 2010 (Friday)

GGUS (or RT) tickets:

  • T0: 2
  • T1: 2
  • T2: 2

Issues at the sites and services

  • T0 site issues: CERN :
    • LFC-RO: "could not secure connection" (GGUS:59155), converted into an ALARM ticket (GGUS:59174)
    • LFC-RO disappeared from the topology in the SAMDB, no SAM test results available any longer (GGUS:59193)
  • T1 site issues:
    • CNAF: still problems transferring into CNAF from all T1s, despite the site claiming the issue is fixed (GGUS:59038)
    • GOCDB problem, cannot retrieve information (GGUS:59172)
  • T2 sites issues:

17th June 2010 (Thursday)

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • grid_2nd_lhcb removed from the BDII.
    • CREAM CE upgraded to the latest release.
    • CE attached to the SLC5 subcluster.
    • Intermittent LFC problem (GGUS:59155)
  • T1 site issues: CNAF FTS transfer problem (GGUS:59143)

  • T2 sites issues:

16th June 2010 (Wednesday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
  • T2 sites issues:

15th June 2010 (Tuesday)

GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 4

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • CNAF : Problems transferring to CNAF with FTS (GGUS:59038)
  • T2 sites issues:
    • 4 sites with too many jobs aborting there. Details in the GGUS tickets.

14th June 2010 (Monday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 2

Issues at the sites and services

  • T0 site issues:
  • T1 site issues:
    • CNAF: no shared area variable defined (GGUS:58985)
    • IN2P3: SRM endpoint not available on Saturday; SAM tests confirm this outage (GGUS:58994)
  • T2 sites issues:

11th June 2010 (Friday)

  • Experiment activities:
  • Running several MC productions at low profile (<5K jobs concurrently).
  • Merging production.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Asked LSF support to keep the short grid queue for SGM and the long queue for the rest.
  • T1 site issues:
    • IN2P3: finishing an intervention on AFS to cure the software area problem
  • T2 sites issues:

10th June 2010 (Thursday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 5

Issues at the sites and services

9th June 2010 (Wednesday)

  • Experiment activities:
  • Running several MC productions at low profile (<5K jobs concurrently).
  • Merging production.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN Castor disk servers unavailable (GGUS:58886)
  • T1 site issues:
    • IN2P3: one CE in Lyon not publishing information correctly. Fixed now.
  • T2 sites issues:

8th June 2010 (Tuesday)

  • Experiment activities:
  • Running several MC productions at low profile (<5K jobs concurrently).
  • Merging production.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN: jobs killed by the batch system.
  • T1 site issues:
    • CNAF: one file on the RAL-CNAF channel systematically failing. Under investigation by RAL and CNAF people (GGUS:58821)
    • SARA: problem fixed; a dcap port had to be restarted. SARA people will put more robust monitoring tools in place (GGUS:58838)
    • IN2P3: still a problem with some pilots landing on the long queue instead of the very-long one. Understood: the LHCb CPU time request is not demanding enough.
  • T2 sites issues:
    • INFN-T2: one CE publishing wrong information (GGUS:58850)

7th June 2010 (Monday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • CERN: jobs killed by the batch system. Remedy ticket opened; under investigation.
  • T1 site issues:
    • CNAF StoRM: user reporting problems accessing data on GPFS (GGUS:58794). The problem was a static gridmap file not recognizing a particular LHCb FQAN
    • CNAF: one file on the RAL-CNAF channel systematically failing; it was a network glitch (GGUS:58821)
    • IN2P3: many jobs killed after ending up in a wrong (too short) queue. Under investigation. The same is true at other sites as well (RAL)
    • SARA: maybe a dcap port has to be restarted (GGUS:58838)
  • T2 sites issues:
    • none

4th June 2010 (Friday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 4
  • T2: 3

Issues at the sites and services

  • T0 site issues:
    • LFC (RW) is reporting alarms
  • T1 site issues:
    • CNAF: CORAL access issue against Oracle (GGUS:58770). Solution: due to a problem with the private network interface of one of the two machines of the LHCb Oracle cluster, clients were losing the connection to the Conditions database at CNAF. Fixed.
    • CNAF StoRM: user reporting problems accessing data on GPFS (GGUS:58794)
    • IN2P3: CORAL access issue against Oracle (GGUS:58766). It does not seem related to the PSU (IN2P3 did not apply it); the problem concerns the memory allocation of the System Global Area (SGA), which was too small (1.5 GB). Increased.
    • PIC: CREAM CE, anomalous number of jobs failing (GGUS:58797).
  • T2 sites issues:
    • Jobs aborting at INFN-NAPOLI-CMS and GRISU-CYBERSAR-CAGLIARI.
    • Shared area issue at NGI_HR

3rd June 2010 (Thursday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 site issues:
    • The xroot protocol adopted for CERN analysis jobs cured the problem reported in previous days.
  • T1 site issues:
    • PIC: issue with FTS (GGUS:58743). All attempts to submit FTS jobs to fts.pic.es were timing out with no response from the server
  • T2 sites issues:

2nd June 2010 (Wednesday)

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 site issues:
    • Load on lhcbmdst despite the 11 new disk servers added on Friday. This is because the data reside exclusively on the old servers, so the load is not spread. Moved to xrootd, which does not trigger an LSF job for reading files already staged on disk
  • T1 site issues:
    • IN2P3: jobs killed for exceeding the memory limit (2 GB) while they are expected to consume 1.5 GB (as per the VO ID card). Maybe a memory leak in the DaVinci application; see the sketch after this list.
  • T2 sites issues:
    • T2 UK sites upload issue: this has to be escalated to the T1 coordination meeting, being by now a long-standing problem, and has to be addressed systematically in a coordinated way.
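
  A minimal sketch of the kind of memory watch one could run alongside a job to catch such a leak before the LRMS kill. The 2 GB and 1.5 GB figures are from the report above; the polling approach and interval are illustrative assumptions (Linux-only, reads /proc):

        # Sketch (Python): poll a job's resident memory. A steady climb past
        # the VO ID card figure (1.5 GB) suggests a leak well before the
        # batch system kills the job at the 2 GB hard limit.
        import time

        LIMIT_KB = 2 * 1024 * 1024           # 2 GB batch-system limit
        EXPECTED_KB = 3 * 1024 * 1024 // 2   # 1.5 GB from the VO ID card

        def rss_kb(pid):
            with open(f"/proc/{pid}/status") as f:
                for line in f:
                    if line.startswith("VmRSS:"):
                        return int(line.split()[1])  # reported in kB
            return 0

        def watch(pid, interval_s=60):
            while True:
                rss = rss_kb(pid)
                if rss > LIMIT_KB:
                    print("job would be killed by the LRMS here")
                    break
                if rss > EXPECTED_KB:
                    print(f"warning: RSS {rss} kB above the VO ID card figure")
                time.sleep(interval_s)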

1st June 2010 (Tuesday)

  • Experiment activities:
  • MC production ongoing
  • Tested and verified the GGUS ALARM workflow at CERN for the LHCb critical services

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0 site issues:
    • none
  • T1 site issues:
    • Issue at SARA with SAM tests. Confirmed that it was one disk server with an expired CERN CA. Some other failures were observed due to one of the disk servers being in maintenance (GGUS:58647).
    • dCache sites in general (input for the T1 coordination meeting): details available in GGUS:58650
  • T2 sites issues:
    • GRISU-SPACI-LECCE: shared area issues
    • UK T2 sites upload issue: GGUS ticket open for the middleware (GGUS:58605)

-- RobertoSantinel - 29-Jan-2010
