December 2010 Reports


17th December 2010 (Friday)

Experiment activities:

  • Reprocessing almost completely finished and its output merged. MC production ready to start after the final tests. CERN IT coordination is needed for LHCb intervention

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:

16th December 2010 (Thursday)

Experiment activities:

  • Reprocessing almost completely finished and its output merged. MC production ready to start after the final tests.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:

15th December 2010 (Wednesday)

Experiment activities:

  • Reprocessing almost completely finished and its output merged.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • IN2P3: users affected when accessing files in a disk pool that was disabled yesterday (GGUS:65290)
    • SARA: request to restore the original share, which was exceptionally increased to cope with the merging backlog. For the long term, NIKHEF/SARA should adjust the share internally (and transparently) in a more reasonable way (50% vs 50%, for example).
    • RAL corrupted files: LHCb will look at the list provided and re-transfer them.

14th December 2010 (Tuesday)

Experiment activities:

  • Fixed the internal problem in DIRAC and submitted the remaining jobs for reprocessing. It is no longer clear whether there will be MC production during the Xmas break.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1 site issues:
    • RAL: found 38 corrupted files (checksum mismatch) due to a bug in CASTOR related to incompletely transferred files.
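
A checksum mismatch of the kind reported at RAL can be detected by recomputing each file's checksum and comparing it with the catalogued value. The sketch below is illustrative only: the report does not say which checksum algorithm was involved, so Adler-32 is an assumption, and the names `adler32_hex`, `find_mismatches` and the `catalogue` mapping are hypothetical.

```python
import zlib


def adler32_hex(path, chunk_size=1 << 20):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running value back in so large files are read in pieces.
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def find_mismatches(catalogue):
    """Return the paths whose on-disk checksum differs from the expected one.

    `catalogue` (hypothetical) maps a local file path to the expected
    checksum as a hex string.
    """
    return [path for path, expected in catalogue.items()
            if adler32_hex(path) != expected.lower()]
```

An incompletely transferred file would normally show up in `find_mismatches`, since its truncated content yields a different checksum from the catalogued one.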

13th December 2010 (Monday)

Experiment activities:

  • Xmas plans: run huge MC productions for physics studies. This is pending the removal of old MC data at all centers (decided to archive them in a T1D0 service class at CERN).
  • Reprocessing: at 98.4%. The remaining jobs are only due to an internal DIRAC problem: jobs are not recreated for unprocessed files.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • NTR
  • T1 site issues:
    • NTR

10th December 2010 (Friday)

Experiment activities: 3K jobs remaining in total to complete the reprocessing. Proceeding smoothly.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • NTR
  • T1 site issues:
    • Flooding NIKHEF with direct CREAM submission due to a problem on the DIRAC side.

9th December 2010 (Thursday)

Experiment activities: 6K jobs remaining to complete the reprocessing (5500 of them at NL-T1, still in the system but draining very smoothly). User dark area clean-up ongoing. Missing CERN.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Requested the exhaustive list of user files (GGUS:65036). Without it, the User Dark Area clean-up cannot be finished in due time.
    • Received the SIR from CASTOR developers about the shortage of the lhcbdst service class.

  • T1 site issues:
    • NTR

8th December 2010 (Wednesday)

Experiment activities: Reprocessing now mainly running at SARA (backlog formed because of the fair-share (FS) issue)

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Merging is over at CERN.

  • T1 site issues:
    • SARA/NIKHEF: reprocessing jobs proceeding smoothly but still 10K remaining. Jobs were competing with MC simulation because of an internal configuration in DIRAC. Addressed.

7th December 2010 (Tuesday)

Experiment activities: Reprocessing now mainly running at SARA (backlog formed because of the fair-share (FS) issue)

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • Found the reason for the failures with LHCBDST: the key problem is a mismatch between the resource scheduling and polling protocols. Idle requests are left in LSF and prevent further requests from getting a slot. This is triggered when the disk server gets overloaded by too many requests, which exhausts the slots. The only viable solution would need more hardware than is reasonable, and too many slots, to compensate for the "unused" ones. I think a detailed SIR is needed.

  • T1 site issues:
    • SARA: problem grabbing computing resources (GGUS:65407). It was a fair-share issue (the share is very small for our largest T1) that prevented LHCb from running constantly at the expected load. We also added NIKHEF to the abstract definition of NL-T1 to drain the 15K-job backlog that had formed, but there was also a PBS glitch over there (4 hours last night). Ron agreed to tune the FS on the SARA batch farm to speed up the backlog drain.
    • RAL outage

6th December 2010 (Monday)

Experiment activities: Reprocessing going at full steam, more than 80% done. CNAF and CERN almost finished. IN2P3 and RAL about to finish. Merging is running smoothly in parallel almost everywhere.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • Despite the 2 new disk servers added last week, LHCbDST was killed again this morning. We banned the SE for writing until the number of active transfers decreased. The root cause is not clear, but it looks like no SRM abort is issued internally in CASTOR when gfal gives up a transfer. The transfer request then stays active in the LSF queue and, once scheduled, takes 3 minutes to realize that the client has gone before it eventually aborts. This feeds the snowball effect. Sebastien is looking into it.
  • T1 site issues:
    • SARA: problem grabbing computing resources (GGUS:65407). The site rank becomes strongly unattractive, then all of a sudden the site starts running jobs, but then (fair share) the site rejects any further jobs, which pile up in the local queue to be scheduled the next day. We need to run almost 12K reconstruction jobs there and at this pace it will take 2 weeks.
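
The SARA estimate above (almost 12K jobs, about 2 weeks) implies a throughput of roughly 857 jobs per day. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the daily rate is derived from the two figures in the report rather than stated in it.

```python
import math


def implied_daily_rate(jobs, days):
    """Average number of jobs per day implied by finishing `jobs` in `days`."""
    return jobs / days


def days_to_drain(backlog_jobs, jobs_per_day):
    """Whole days needed to drain a backlog at a constant daily rate."""
    if jobs_per_day <= 0:
        raise ValueError("jobs_per_day must be positive")
    return math.ceil(backlog_jobs / jobs_per_day)


# Figures from the report: almost 12K jobs, about 2 weeks at the current pace.
rate = implied_daily_rate(12_000, 14)  # about 857 jobs/day
```

Calling `days_to_drain(12_000, jobs_per_day)` with a larger, purely hypothetical rate shows how much a fair-share adjustment would need to raise throughput to finish sooner.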

3rd December 2010 (Friday)

Experiment activities: Reprocessing going at full steam. Merging is running.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • 2 new disk servers added to LHCb_DST
      • Also found a bug in CASTOR. Workaround: decrease the LSF timeout from 30 minutes to 30 s so that newly arriving requests are picked up in time.
    • LHCb_RDST has to be converted to T1D0 (basically: switch ON the garbage collector). Done.
  • T1 site issues:
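
Why the shorter LSF timeout helps can be illustrated with Little's law: the average number of slots held by abandoned requests equals their arrival rate multiplied by the time each one is held. This is a simplified sketch, not a model of CASTOR or LSF internals. It assumes every abandoned request occupies one slot for the full timeout, and the slot count and abandon rate in the usage note are hypothetical; only the 30-minute and 30-second timeouts come from the report.

```python
def stale_slots(abandoned_per_minute, timeout_seconds):
    """Mean number of slots held by abandoned requests (Little's law)."""
    return abandoned_per_minute * (timeout_seconds / 60.0)


def free_slots(total_slots, abandoned_per_minute, timeout_seconds):
    """Mean number of slots left for live requests, never below zero."""
    held = stale_slots(abandoned_per_minute, timeout_seconds)
    return max(0.0, total_slots - held)
```

With a hypothetical 100 slots and 2 abandoned requests per minute, `free_slots(100, 2, 30 * 60)` leaves 40 slots on average, while `free_slots(100, 2, 30)` leaves 99.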

2nd December 2010 (Thursday)

Experiment activities: Reprocessing going at full steam. Merging is running smoothly

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0
    • Quickly updated the FTS instances to take on board the new Data Manager roles.
    • LHCb_RDST has to be converted to T1D0 (basically: switch ON the garbage collector)
  • T1 site issues:
    • Opened 6 GGUS tickets yesterday asking that the new Data Manager in LHCb inherit the same capabilities (GGUS:64829 to GGUS:64835). All sites reacted in less than 24 hours.
    • GRIDKA: instabilities observed with their dCache system (GGUS:64772) seem to have gone. Check this URL

1st December 2010 (Wednesday)

Experiment activities: Reprocessing going at full steam. Merging > testing jobs running

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 7
  • T2: 0

Issues at the sites and services

  • T0
    • Opened GGUS:64826 to change the VOAdmin of our FTS servers at CERN
  • T1 site issues:
    • Opened 6 GGUS tickets asking that the new Data Manager in LHCb inherit the same capabilities (GGUS:64829 to GGUS:64835)
    • GRIDKA: still some instabilities with the overall Storage system (no matter the space token) affecting our reconstruction jobs (GGUS:64772)

-- RobertoSantinel - 29-Jan-2010

Topic revision: r2 - 2011-01-05 - unknown