June 2011 Reports


30th June 2011 (Thursday)

Experiment activities: Quiet day. Nothing to report. LHCb week ongoing.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1
    • NTR

29th June 2011 (Wednesday)

Experiment activities: Almost all sites are behaving well for real data processing and user activities.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1
    • RAL: Keeping these points open to learn how the site intends to address them.
      • Tape recall performance issues (under discussion at RAL)
      • Bug in the LRU policy: files that were staged some time ago but have fresh access requests may be garbage collected instead of being pinned in the cache (a small sketch of this follows this list). Shaun is considering a patch for this bug, but it would take a while to get deployed.
    • GRIDKA: Problem with the pilot jobs submitted through CREAM CEs aborting (GGUS:71952). Fixed!
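
The LRU garbage-collection bug reported for RAL above can be sketched as follows; this is a minimal, hypothetical illustration in Python (the field names such as `staged_at` and `last_access` are assumptions, not CASTOR's actual garbage collector): if eviction candidates are selected by staging time rather than by last access, a file staged long ago but requested again recently gets garbage collected instead of staying pinned.

```python
from dataclasses import dataclass

@dataclass
class CachedFile:
    name: str
    staged_at: float      # time the file was brought onto the disk cache
    last_access: float    # time of the most recent access request
    pinned: bool = False  # files with fresh access requests should stay pinned

def gc_candidates_buggy(files, now, max_age):
    # Sketch of the reported behaviour: eviction keyed on staging time only,
    # so a file staged long ago is a candidate even if it was just requested.
    return [f for f in files if now - f.staged_at > max_age]

def gc_candidates_lru(files, now, max_age):
    # Intended least-recently-used behaviour: rank by last access and keep
    # pinned files out of the candidate list altogether.
    return [f for f in files if not f.pinned and now - f.last_access > max_age]

# A file staged long ago but read just now is wrongly listed by the buggy
# selection and correctly kept by the LRU selection.
f = CachedFile("run90000.raw", staged_at=0.0, last_access=9900.0)
print(gc_candidates_buggy([f], now=10000.0, max_age=3600.0))  # [f] -> evicted
print(gc_candidates_lru([f], now=10000.0, max_age=3600.0))    # []  -> kept
```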

28th June 2011 (Tuesday)

Experiment activities: Very good fill since yesterday (took 6.6 pb-1). One issue with DIRAC-ONLINE: transfer requests to CASTOR are piling up in the buffer. The issue is not understood, but as a workaround the pending requests are being kicked one by one to catch up with the backlog (a rough sketch of this workaround follows below).
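
A rough illustration of that workaround, under the assumption that the online system exposes something like `list_pending_requests()` and `resubmit()` (hypothetical names, not the actual DIRAC-ONLINE API): resubmit the stuck CASTOR transfer requests one at a time, pausing between them so the buffer drains gradually.

```python
import time

def kick_pending_transfers(list_pending_requests, resubmit, pause=1.0):
    """Resubmit stuck CASTOR transfer requests one by one.

    Both callables are placeholders for whatever the online system provides;
    the point is only to drain the backlog request by request rather than in
    one bulk operation.
    """
    for request in list_pending_requests():
        if not resubmit(request):
            print("resubmission failed for %s, will retry later" % request)
        time.sleep(pause)  # give the buffer time to catch up between kicks
```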

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services
  • T0
    • NTR
  • T1
    • RAL: Many jobs failing in input data resolution. Suffering from two different problems there:
      • Tape recall performance issues (under discussion at RAL)
      • Bug in the LRU policy: files that were staged some time ago but have fresh access requests may be garbage collected instead of being pinned in the cache (same issue as noted under 29th June). Shaun is considering a patch for this bug, but it would take a while to get deployed.
    • GRIDKA: The huge backlog of the previous days has been almost totally drained; the site has been performing very well over the last few days.
    • GRIDKA: Problem with the pilot jobs submitted through CREAM CEs aborting (GGUS:71952). The problem is still there: we only have pilot jobs submitted via the LCG-CEs.
    • NIKHEF: All pilot jobs submitted through CREAM CEs are aborting (GGUS:71955). It was a site reconfiguration issue. Fixed.
    • SARA: Spikes of jobs failing to upload output data every night at the same time; it looks like some regular script runs at SARA (a periodicity check of this kind is sketched after this list). In touch via private mail with Ron.
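
A small sketch of the check behind the SARA suspicion above, assuming the upload-failure timestamps are available (the ISO-8601 string format below is an assumption): binning the failures by hour of day makes a nightly, cron-like pattern stand out.

```python
from collections import Counter
from datetime import datetime

def failures_by_hour(timestamps):
    """Count upload failures per hour of day.

    `timestamps` is assumed to be an iterable of strings such as
    "2011-06-28T02:03:15"; a strong peak in one hour every night points at a
    regularly scheduled job on the site side.
    """
    counts = Counter(datetime.strptime(t, "%Y-%m-%dT%H:%M:%S").hour
                     for t in timestamps)
    return sorted(counts.items())

# Example: two of the three failures fall in the 02:00 hour.
print(failures_by_hour(["2011-06-27T02:05:00",
                        "2011-06-28T02:07:30",
                        "2011-06-28T13:45:10"]))   # [(2, 2), (13, 1)]
```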

27th June 2011 (Monday)

Experiment activities: Staging problem (DIRAC side) fixed on Friday evening by releasing an emergency patch of the stager system. Things seem to be back to normal. Since Friday we took 18TB of files spread over 50 runs.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 3
  • T2: 0
Issues at the sites and services
  • T0
    • 3D Streams: Notified that one apply process had aborted in the LFC streams replication. One-hour stoppage of the service to fix the problem.
  • T1
    • RAL: On Sunday we received a notification from RAL: due to problems with the Oracle databases behind the CASTOR service, all CASTOR instances were put in downtime. This morning we discovered many timeouts accessing TURLs (SRM issue). This is also reflected in the huge number of jobs pending the transfer of their output to the RAL SEs. Opened an internal ticket (84699).
    • GRIDKA: Still a backlog (4K waiting tasks, due to the storage issues reported last week), but it is recovering quickly and the situation is getting much better now.
    • GRIDKA: Problem with the pilot jobs submitted through CREAM CEs aborting (GGUS:71952)
    • NIKHEF: All pilot jobs submitted through CREAM CEs are aborting (GGUS:71955)

24th June 2011 (Friday)

Experiment activities: Further problems with pre-staging of files within DIRAC. These are being handled by hand at present while the experts continue debugging the problem.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T1
    • GRIDKA: GGUS:71572. Job success rate is fine now. However, we need to know whether we can ramp up the level of jobs running there: currently we have ~200 jobs running with >6100 waiting jobs for GridKa. This is now critical for us and needs to be addressed soon. About 100TB of free space has also been transferred into LHCb-DISK from LHCb-DST. However, the space token migration done a few weeks ago has not really been completed.
    • RAL: We believe the job failures due to "input data resolution" are now understood: large latency between pre-staging and the actual running of the jobs.

23rd June 2011 (Thursday)

Experiment activities: Many fixes and updates to DIRAC over the last 24 hours. A bug which affected users was hotfixed. The overnight problem with the DIRAC component doing staging of files at Tier-1s was fixed.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T1
    • GRIDKA: GGUS:71572. Situation slightly better in the last 24 hours. Very low number (~200) of running jobs. Following up with GridKa about the space token migration done a few weeks ago; there is a worry that it was not completed.
    • RAL: Slow staging rates; need to wait and see the effects of the changes made yesterday. A significant fraction of jobs is still failing with "input data resolution", indicating continuing problems with garbage collection.

22nd June 2011 (Wednesday)

Experiment activities:

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T1
    • GRIDKA: GGUS:71572. Continuing data access problems at GridKa. LHCb jobs are still failing due both to problems with the availability of data and to access to the available data. The site remains flaky even with a much lower load than before. We are concerned that this is related to the LHCb space token migration earlier this month, with the disk servers not being re-assigned properly as needed.
    • RAL: Slow staging rates.

21st June 2011 (Tuesday)

Experiment activities:

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1 (shared area problem at tbit01.nipne.ro)
Issues at the sites and services

  • T1
    • GRIDKA: GGUS:71572. Continuing data access problems at GridKa. LHCb jobs are still failing due both to problems with the availability of data and to access to the available data.

20th June 2011 (Monday)

Experiment activities: Problem with one DIRAC central service running out of space early this morning.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CERN: Investigation ongoing into the low number of LSF job slots (GGUS:71608). The problem seems to have been understood and pinned down to the algorithm used by LSF for shares (a generic illustration is sketched after this list).

  • T1
    • GRIDKA: GGUS:71572. Ticket escalated to ALARM as SRM became completely unresponsive and the site unusable.
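
As a purely illustrative sketch of the CERN share issue above (not LSF's actual fairshare implementation; the factor names and values below are assumptions): a dynamic priority computed as configured shares divided by weighted accounted usage will dispatch very few jobs for a group whose usage terms are weighted too heavily, which is consistent with the low slot count reported here.

```python
def dynamic_priority(shares, cpu_time, run_time, running_slots,
                     cpu_factor=1.0, run_factor=1.0, slot_factor=1.0):
    """Toy share-based priority: configured shares over weighted usage.

    The factors are illustrative, not real LSF parameters; the larger the
    weighted usage, the lower the priority, so a mis-tuned weight can keep a
    group's dispatch rate far below what its shares suggest.
    """
    usage = (cpu_time * cpu_factor
             + run_time * run_factor
             + (1 + running_slots) * slot_factor)
    return shares / usage

# Example: with the slot term weighted 50x it dominates the accounted usage,
# so the priority drops well below what the shares alone would give.
print(dynamic_priority(shares=1000, cpu_time=5e3, run_time=1e4,
                       running_slots=800, slot_factor=50.0))
```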

17th June 2011 (Friday)

Experiment activities: Smooth data taking (order of 10pb-1 recorded last night and this morning).

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CERN: Investigation ongoing into the low number of LSF job slots. The slots for ALICE have been lowered; maybe this will help. GGUS:71608
  • T1
    • GRIDKA: GGUS:71572. The situation is not good yet. We temporarily banned the GridKa SE to let the system drain the backlog of SRM requests. They put new hardware in place in the read-only pools in order to alleviate this issue.

16th June 2011 (Thursday)

Experiment activities: Not much data taken last night; beams restarted only this morning.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0
Issues at the sites and services

  • T0
    • CERN: Ticket opened against LSF support: still too few jobs running in the system (at most 800) after the increase of the limit to 8000. Yesterday, when the same issue was reported, a ramp-up (from 300 to 800) was noticed, so it was thought the problem was no longer there. GGUS:71608
  • T1
    • GRIDKA: More than half of the user and production jobs accessing data via the dcap protocol are failing because they are not consuming CPU (a check of this kind is sketched after this list). Preliminary investigations from Doris seem to pin it down to a load problem with the (non-tokenized) disk pools in front of the tape system (GGUS:71572). We are not entirely convinced by this argument, as user jobs do not use the tape system.
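
A minimal sketch of the kind of check that flags such jobs (illustrative only; not the actual DIRAC watchdog, and the thresholds are assumptions): compare consumed CPU time with elapsed wall-clock time and flag jobs whose CPU fraction stays near zero while they sit on blocked dcap reads.

```python
def is_stalled(cpu_seconds, wall_seconds, min_wall=1800, min_cpu_fraction=0.05):
    """Flag a job as stalled: after a grace period it has consumed almost no
    CPU relative to wall-clock time (e.g. blocked on dcap reads).

    The grace period and the 5% threshold are illustrative values only.
    """
    if wall_seconds < min_wall:
        return False  # too early to judge
    return cpu_seconds / wall_seconds < min_cpu_fraction

# Example: two hours of wall time with only 60 s of CPU is flagged as stalled.
print(is_stalled(cpu_seconds=60, wall_seconds=7200))  # True
```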

15th June 2011 (Wednesday)

Experiment activities: Last night more than 10TB were collected, with a luminosity spike at 3.7x10^32. Launched the latest version of the stripping (Stripping14) on 2010 data.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Running very few jobs compared to the number of waiting jobs targeted at CERN.
    • Problem with 3D replication. It appeared during the restoration of historical data for one of the LHCb offline applications. Now it seems OK.
  • T1
    • CNAF: Investigating a problem with 38 files not properly copied over to the site, despite being registered in the LHCb Bookkeeping. It has been understood: a flaw in the DIRAC DMS clients for certain types of operations.

14th June 2011 (Tuesday)

Experiment activities: More than 25 TB of data collected in the last 3 days. Stripping of 2011 data is nearly complete, and re-stripping of 2010 data will start today.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0
Issues at the sites and services

  • T0
    • The limit of jobs per DN has now been relaxed, allowing more than 2.5k jobs per DN.
  • T1
    • GRIDKA: Spikes of analysis jobs failing to access data (GGUS:71412). Fixing the problem with the dcap ports resolved the issue. Ticket closed.
    • CNAF: Investigating a problem with 38 files not properly copied over to the site, despite being registered in the LHCb Bookkeeping.

10th June 2011 (Friday)

Experiment activities: Took 11pb-1 of data last night; everything was transmitted to CASTOR and the final destinations with no problems. Expecting quite a hot weekend in terms of new physics.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0
Issues at the sites and services

  • T0
    • Observed problems with CVMFS on some lxplus nodes. Stefan reports a too-aggressive caching mechanism in the current CVMFS (0.68) preventing new software from being picked up.
  • T1
    • GRIDKA: Spikes of analysis jobs failing to access data (GGUS:71412). The problem with the dcap ports restarted overnight.
    • PIC: Received a GGUS ticket (GGUS:71399) about "abusive use of the scratch area" on the WNs. It boiled down to a problem with CVMFS there (rather than an LHCb fault): the software is not picked up by the CVMFS clients but installed locally on the WN (~10GB for each job).

9th June 2011 (Thursday)

Experiment activities:

Took some data last night (not so much). In the last week GridKa, whose share had been increased too much, was flooded: almost all RAW data ended up there, leaving the other centers (RAL and CNAF) almost idle.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • Changed one field in the back-end of LFC (hostSE)
  • T1
    • PIC: Coming back from the DT yesterday, a problem with SRM was spotted. Now fixed. Opened a GGUS ticket for a CE that was de-commissioned after the DT (GGUS:71383).

8th June 2011 (Wednesday)

Experiment activities:

Data processing/reprocessing activities are proceeding almost smoothly everywhere, with a failure rate of less than 10%, mostly due to input data resolution.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • NTR
  • T1
    • NL-T1: A rolling intervention to update CVMFS brought reduced capacity, with the number of running jobs falling to 500. Now improving. There is a stripping backlog there that has not yet been fully recovered.
    • PIC: In DT today. Over the last days they got a few too many jobs compared to the real capacity of the site (they were behaving well in the past). For this reason a backlog of activities has formed there too.

7th June 2011 (Tuesday)

Experiment activities:

Main activities: stripping of reprocessed data + reconstruction and stripping of new taken data. MC activities.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CERN: Tape system with a huge backlog, which is the reason no migration has taken place since yesterday afternoon (GGUS:71268).
  • T1
    • SARA: Moved to direct CREAM submission. Part of the waiting jobs have been redirected to NIKHEF. There is still a backlog to be drained. The problem with staging files has been fixed.
    • RAL: Moved to the LRU policy for garbage collection.

6th June 2011 (Monday)

Experiment activities:

  • Took 23pb-1 in 24 hours.
  • Overlap of many activities going on concurrently in the system.
    • Reprocessing (Reconstruction + Stripping + Merging) of all data taken before May's TS (~80 pb-1 of data) using new application software and conditions (~80% completed)
    • First pass reconstruction of new data + Stripping + Merging
    • Tails of old reconstruction productions using the previous version of the application (Reco09) almost completed.
    • MC productions at various non-T1 centers.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CERN: The additional disk servers in the LHCb-Tape token boosted the stripping jobs, which are now running smoothly at CERN.
  • T1
    • SARA: Still a huge backlog of stripping jobs due to the low rate of submission of pilot jobs. Decided to move to direct submission, bypassing the gLite WMS and the ranking expression.
    • SARA: Reported some issues staging files, most likely a tape driver problem. The local contact person is in touch with Ron and co.
    • RAL: Backlog drained by moving to direct submission. Many jobs are exiting immediately because their data was garbage collected. Asked the RAL folks to use an LRU (Least Recently Used) policy.

  • T2

1st June 2011 (Wednesday)

Experiment activities:

  • No data. Processing and reprocessing are running. Problems with Stripping jobs.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0
Issues at the sites and services

  • T0
    • CERN: Some stripping jobs consume more CPU time than usual, with a high "sleeping %" on the 48-core batch nodes.
  • T1
    • GRIDKA: CREAM CE (GGUS:70835) (On Hold)
    • IN2P3: LFC RO mirror (Solved)
    • SARA, RAL: Huge backlog of waiting stripping jobs.

  • T2

-- RobertoSantinel - 02-Dec-2010
