March 2011 Reports

31 March 2011 (Thursday)

Experiment activities:

  • EXPRESS and FULL validation of work flows for Collision11

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • SRM CASTOR problem with bringonline (GGUS:69165) was fixed very quickly. The new version was deployed yesterday evening and fixes the problem.
    • Occasional BDII timeouts from the CERN top-level BDII have been experienced by Dirac agents since last Wednesday (SNOW:INC026625).
    • Webafs was down this morning, so pilot jobs could not download and install themselves properly. The server is up again; it remains to be confirmed that the problem is solved (SNOW:INC027185).
  • T1
    • GRIDKA: Pilots are starting and subsequently dying very fast. The problem could be related to the webafs problem - to be confirmed.

  • T2 site issues:
    • NTR

30 March 2011 (Wednesday)

Experiment activities:

  • EXPRESS and FULL validation of work flows for Collision11

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • SRM castor problem with bringonline: (GGUS:69165)
  • T1
    • IN2P3: Cannot reproduce the shared area problem at IN2P3.
    • RAL: can the EGEE broadcasts be identified with a correct message? The unscheduled downtime message was the same as the scheduled one, and when the unscheduled message was sent it was not clear that the CASTOR intervention was not over.
    • CNAF: STORM intervention.

  • T2 site issues:
    • NTR

29 March 2011 (Tuesday)

Experiment activities:

  • EXPRESS and FULL validation of work flows for Collision11

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
    • SRM castor intervention
  • T1
    • Recovery of the lost data is done for all sites. Marco Cattaneo, our computing coordinator, sent a message to our T1 contacts: "I would like to warmly thank all those involved in this recovery for the considerable effort that went into this. This saves LHCb from having to reprocess the full 2010 dataset ahead of the next stripping campaign"
    • IN2P3: problem with shared area under investigation.

  • T2 site issues:
    • NTR

28 March 2011 (Monday)

Experiment activities:

  • EXPRESS and FULL validation of work flows for Collision11

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Oracle intervention this morning on the LHCb Bookkeeping. SRM CASTOR intervention tomorrow.
    • We did not see any major problem with the settings of CERNVMFS.
  • T1
    • Recovery of the lost data still in progress for RAL, GRIDKA and IN2P3.
  • T2 site issues:
    • NTR

25 March 2011 (Friday)

Experiment activities:

  • MC productions and validation of work flows for Collision11

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CVMFS: put back in production with a 5 GB cache
  • T1
  • T2 site issues:
    • NTR

24 March 2011 (Thursday)

Experiment activities:

  • MC productions and validation of work flows for Collision11

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • HPSS and d-cache masters are working to recover as much as possible of the lost data at IN2P3. Thanks for the effort (for traceability GGUS:68889)
  • T2 site issues:
    • NTR

23 March 2011 (Wednesday)

Experiment activities:

  • MC productions and validation of work flows for Collision11
  • LHCbDIRAC week ongoing at CERN

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • HPSS and d-cache masters are working to recover as much as possible of the lost data at IN2P3. Thanks for the effort (for traceability GGUS:68889)
  • T2 site issues:
    • NTR

22 March 2011 (Tuesday)

Experiment activities:

  • MC productions running at full steam (35K jobs in the last 24 hours) + validation of work flows for Collision11
  • LHCbDIRAC week ongoing at CERN

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Validation work flow for EXPRESS was failing at CERN, apart from a few jobs running on pre-production nodes (CVMFS). The AFS installation was corrupted; the software has been reinstalled and it is now OK.
  • T1
    • GridKA: recovered the SDST files lost last week and put them on the VOBOX at GridKA. Many thanks to the GridKA people. Our T1 VOBOX responsible will re-register them on dcache.
    • HPSS and d-cache masters are working to recover as much as possible of the lost data at IN2P3. Thanks for the effort.
  • T2 site issues:
    • NTR

21 March 2011 (Monday)

Experiment activities:

  • MC productions running at lower pace
  • Validation of work flows for Collision11 data taking ongoing.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Recovered the 4225 files on tape lost last week. Many thanks to all CASTOR people involved

  • T1
    • IN2P3: why are we not able to recover the data at IN2P3 when it has been done for all the other sites?

  • T2 site issues:
    • NTR

18 March 2011 (Friday)

Experiment activities:

  • MC productions running at lower pace
  • Validation of work flows for Collision11 data taking ongoing.
  • By mistake, 17K SDST files (output of the reconstruction and input for any eventual restripping campaign) were deleted at 5 sites. We asked the affected sites to do their best to recover these data from their tape systems. It is not urgent (we have no clear idea when the next reprocessing will happen), but it is vital to recover as much of these data as possible. PIC has already done so. CERN, RAL and GridKA will look into it; IN2P3 reports that there is no way to recover them.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Asked IT-PES for a way to distinguish, at job submission, WNs mounting AFS from those mounting CVMFS, in order to steer SAM software-installation jobs to the right shared area flavor.

  • T1
    • NTR
  • T2 site issues:
    • NTR

17 March 2011 (Thursday)

Experiment activities:

  • MC productions running without major problems
  • Validation of work flows for Collision11 data taking ongoing.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • SAM test jobs on CVMFS seem to have a problem finding files. This seems related to a cache issue, and Steve is working on a patch.
    • Intervention on SRM-lhcb: the system is drained. Proposal to have the intervention tomorrow at 12:00.
  • T1
    • NTR
  • T2 site issues:
    • NTR

16 March 2011 (Wednesday)

Experiment activities:

  • MC productions running without major problems
  • Validation of work flows for Collision11 data taking ongoing.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • SAM test jobs on CVMFS seem to have a problem finding files. This seems related to a cache issue, and Steve is working on a patch.
    • Intervention on SRM-lhcb: not before next Monday (we have to drain a backlog of data from ONLINE to CASTOR). In any case it will depend on the LHC schedule.
  • T1
    • NTR
  • T2 site issues:
    • NTR

15 March 2011 (Tuesday)

Experiment activities:

  • MC productions running without problems

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • DaVinci problems have been understood: they were caused by a mismatch in Python versions. A new configuration fixing the problem has been deployed on Grid sites.
    • cvmfs: after successful tests during the weekend on a subset of migrated nodes, cvmfs usage at CERN has been extended to all nodes currently equipped with cvmfs.
  • T1
    • NTR
  • T2 site issues:
    • Chasing up problems with T2 sites on MC productions

14 March 2011 (Monday)

Experiment activities:

  • MC productions running at full speed.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • DaVinci jobs failing for some productions at CERN. Application developers are looking into it.
    • cvmfs: installed on batch nodes at the moment. After the whole weekend it has proven to be OK for LHCb, so we can go ahead and enable LHCb to use cvmfs on the rest of the configured nodes.
  • T1
    • RAL: network intervention tomorrow.
  • T2 site issues:
    • 3 T2 sites failed for MC productions; GGUS tickets have been opened for these sites

11 March 2011 (Friday)

Experiment activities:

  • MC productions running at full speed.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • cvmfs is being installed on batch nodes at the moment. This is going to be tested over the weekend on a subset of nodes.
  • T1
    • NTR
  • T2 site issues:
    • 3 T2 sites failed for MC productions; GGUS tickets have been opened for these sites

10 March 2011 (Thursday)

Experiment activities:

  • MC productions ongoing without major problems.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • volhcb29, the machine hosting the SAM suite, was rebooted this morning for a quick patch and SAM tests may have been affected. Back to normal now.
  • T1
    • RAL: a "whole node" queue has been set up at RAL for testing jobs with a guaranteed number of cores. Currently the maximum number of jobs is 5.
  • T2 site issues:
    • NTR

9 March 2011 (Wednesday)

Experiment activities:

  • Restarted a few MC productions after the LHCb Jamboree. Nothing to report.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • RAL: downtime for CASTOR upgrade. SEs banned.
  • T2 site issues:
    • NTR

8 March 2011 (Tuesday)

Experiment activities:

New GGUS (or RT) tickets:
  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • NTR
  • T2 site issues:
    • NTR

7 March 2011 (Monday)

Experiment activities:

New GGUS (or RT) tickets:
  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • NTR
  • T2 site issues:
    • NTR

4 March 2011 (Friday)

Experiment activities:

  • New Dirac version was installed this morning, system ramping up again
    • After the Dirac upgrade, SAM tests submitted by Dirac started failing; the problem is understood

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • File access problems with DCAP at GridKA; the issue is currently under investigation (GGUS:68252)
  • T2 site issues:
    • NTR

3 March 2011 (Thursday)

Experiment activities:

  • MC production running smoothly.
  • New Dirac version to be installed this afternoon

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • NTR
  • T2 site issues:
    • NTR

2 March 2011 (Wednesday)

Experiment activities:

  • MC production running smoothly.
  • Certification for the next Dirac release is ongoing, release being planned for tomorrow

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
    • Problem with access to lhcb-srm GGUS:68180
    • Problem accessing RAW files that are on tape but not staged. GGUS:68131
  • T1
    • NTR
  • T2 site issues:
    • NTR

1 March 2011 (Tuesday)

Experiment activities:

  • MC production running smoothly.
  • Certification for the next Dirac release is ongoing

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
    • Problem accessing RAW files that are on tape but not staged. GGUS:68131
  • T1
  • T2 site issues:
    • Some T2 sites are running CREAM CE 1.6.4 and LHCb jobs fail because of (BUG:78565), e.g. LCG.BHAM-HEP.uk, LCG.ITWM.de, LCG.JINR.ru, LCG.KIAE.ru, LCG.Krakow.pl. We'll submit GGUS tickets for each one of them.

-- RobertoSantinel - 02-Dec-2010

Topic revision: r2 - 2011-04-01 - JoelClosier