October 2010 Reports

29th October 2010 (Friday)

Experiment activities: Data taking. Reconstruction proceeding. A few MC productions.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • GGUS:63514 (opened Wednesday). Disk server down; ticket put "on hold". Files are "invisible" to the user, who can use the replicas instead. Awaiting an update.
  • T1 site issues:
    • IN2P3: A number of jobs seg-faulting (GGUS:63573), probably related to GGUS:62732.
    • RAL: GGUS:63468 (opened 3 days ago). SRM reporting wrong TURLs. The SRM instance went into a half-hour downtime to apply a patch; the problem is hopefully solved.

28th October 2010 (Thursday)

Experiment activities: Data taking. Reconstruction proceeding. No MC.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • GGUS:63514 (opened yesterday). Disk server down; ticket put "on hold". Files re-replicated manually.
  • T1 site issues:
    • IN2P3: Very low pilot efficiency on the CREAM CE; the site has been asked to investigate (GGUS:63559).
    • CNAF: GGUS:63506 (opened yesterday). A number of files are corrupted: the space token filled up and the files were corrupted during the replication process. They will be re-replicated manually.
    • RAL: GGUS:63515 (opened yesterday). A file could not be staged; quickly fixed and closed.
    • RAL: GGUS:63468 (opened 2 days ago). SRM reporting wrong TURLs. The cause is not clear; for the moment we have asked our users not to use RAL.

27th October 2010 (Wednesday)

Experiment activities: No new data since yesterday; data reconstruction is proceeding without problems.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0
    • Files on CASTOR not available, possibly due to a faulty disk server (GGUS:63514).
  • T1 site issues:
    • CNAF: Problems seen on the FTS channel CNAF-IN2P3; the ticket was first opened against IN2P3 but now points to a file not available at CNAF (GGUS:63506).
    • RAL: A file cannot be staged (GGUS:63515).
    • RAL: SRM reporting wrong TURLs (GGUS:63468), affecting ~10% of the files; the cause is not clear.

26th October 2010 (Tuesday)

Experiment activities: Data taking. In three days, ~30% of the total 2010 statistics was collected.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • NTR
  • T1 site issues:
    • GRIDKA: (GGUS:63353). LHCb_DST space token running out of space. Waiting for a list of dark files in the tokenless space: space that could eventually be recuperated and allocated to the DST space token.
    • IN2P3: Issue with some files whose tURL is not retrievable (GGUS:63462).
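The GRIDKA item above mentions "dark files": files that occupy storage but are unknown to the file catalogue, so their space can be reclaimed. As a minimal sketch (all paths and the dump/catalogue inputs are illustrative, not GRIDKA's actual tooling), a dark-data scan is a set difference between a storage-element dump and the catalogue's replica list:

```python
# Sketch of a "dark data" scan: storage paths with no matching catalogue
# replica. In reality the two inputs come from an SE dump and an LFC
# query; here they are plain lists of hypothetical paths.

def find_dark_files(se_dump, catalogue_entries):
    """Return storage paths that are absent from the catalogue."""
    catalogued = set(catalogue_entries)
    return sorted(path for path in se_dump if path not in catalogued)

# Toy example: one file on disk is unknown to the catalogue.
se_dump = ["/lhcb/data/a.dst", "/lhcb/data/b.dst", "/lhcb/tmp/orphan.dst"]
catalogue = ["/lhcb/data/a.dst", "/lhcb/data/b.dst"]
dark = find_dark_files(se_dump, catalogue)
```

Files reported by such a scan can then be reviewed and deleted to recuperate the space for the DST token.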

25th October 2010 (Monday)

Experiment activities: No Activities.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • NTR
  • T1 site issues:
    • GRIDKA: (GGUS:63353). LHCb_DST space token running out of space. Not all 2010 pledged resources have been allocated.
    • IN2P3: Shared-area issues: GGUS:59880, opened on 8 July and marked very urgent, and GGUS:62800, opened on 6 October and marked top priority. Both are still unsolved; it is time for escalation.

22nd October 2010 (Friday)

Experiment activities: Some MC activity; charm stripping launched yesterday; validation of the new reprocessing ongoing. No major issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0
    • A large number of jobs (but not all) in several productions seemed to be failing with "SetupProject.sh Execution Failed" (an AFS-like issue). The problem seems to have resolved itself; it was transitory, between 8:30 and 9:30 UTC.
  • T1 site issues:
    • GRIDKA: (GGUS:63353). LHCb_DST space token running out of space. Not all 2010 pledged resources have been allocated.
    • IN2p3: (GGUS:63024). Disk server back in production; files are being set visible in the catalogue so the remaining reconstruction and merging can run.
    • IN2p3: (GGUS:63234, verified yesterday). Another shared-area issue, with more jobs timing out while setting up the environment; this definitely has to be addressed.

21st October 2010 (Thursday)

Experiment activities: No MC, no reconstruction; validation of the new reprocessing ongoing. No major issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • none

20th October 2010 (Wednesday)

Experiment activities: An impressive amount of MC jobs (65K run in the last 24 hours) with a very tiny failure rate (~1%) (see plot).

Last24MC.png

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0
    • isscvs.cern.ch is no longer accessible from outside CERN (CT719539). The issue was handled immediately and understood to be related to the CERN firewall setup for the CVS servers.
  • T1 site issues:
    • GRIDKA: Instabilities observed on the SRM endpoint, with SAM failures and real activity perfectly correlated in time (GGUS:63253).
    • RAL: One disk server of the lhcbMdst service class was unavailable yesterday (GGUS:63230).
    • IN2p3: Any news from SUN (GGUS:63024)?

19th October 2010 (Tuesday)

Experiment activities: No data received last night. Reconstruction running to completion, with some delay related to a problem at IN2p3 (RAW data unavailable because of the disk server outage) and a problem at RAL (stalling jobs).

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • Trapped two data inconsistencies on CASTOR.
    • LHCb SSB: the feeders have to be restarted
  • T1 site issues:
    • IN2p3: Shared-area issue preventing installation of the latest version of the LHCb application (GGUS:63234). This shared-area problem at Lyon must be escalated.
    • IN2p3: LHCb wants a clear estimate of when the disk server will be back in service, in order to decide how to handle the few remaining data to reconstruct.
    • RAL: The stalled-jobs problem was due to a bug introduced in DIRAC that corrupted the CPU-time estimation.
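The RAL item above traces the spurious job kills to a bad CPU-time estimate. As a hedged sketch of the kind of check a pilot watchdog performs (the threshold and function names are illustrative, not DIRAC's actual code), a job is flagged as stalled when its consumed CPU barely advances between two samples, so a mis-scaled CPU reading makes a busy job look idle:

```python
# Minimal sketch of a stalled-job heuristic: a job is flagged as stalled
# when its CPU time grows by less than a threshold between two samples.
# The threshold value is an illustrative assumption.

STALL_CPU_DELTA = 5.0  # seconds of CPU that must accrue per sample interval

def is_stalled(cpu_prev, cpu_now):
    """Flag the job if consumed CPU grew by less than the threshold."""
    return (cpu_now - cpu_prev) < STALL_CPU_DELTA

# A healthy job accrues plenty of CPU between samples...
busy = is_stalled(100.0, 160.0)
# ...while a mis-scaled estimate (e.g. a wrong normalisation factor, as in
# the DIRAC bug described above) makes the same job appear stalled.
mis_scaled = is_stalled(100.0, 100.4)
```

With estimates corrupted this way, either the watchdog kills the job directly or the job idles until the LRMS wall-clock limit fires, matching the symptoms reported on 18 October.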

18th October 2010 (Monday)

Experiment activities: Mostly user jobs and reconstruction of data taken the previous days. No reprocessing. MC productions running pretty much smoothly.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • The data-consistency suite running on CASTOR at CERN is trapping files with an incorrectly computed checksum (due to a bug also hitting the RAL installation). The same suite also reports files damaged by problems during upload; the reason for this last point has to be investigated further.
    • LHCb SSB: the feeders have to be restarted
  • T1 site issues:
    • CNAF: A user reported jobs with problems accessing data; a couple of WNs were not properly configured and did not have GPFS mounted correctly.
    • CNAF: Many jobs stalled: an LSF configuration issue.
    • IN2p3: Received the list of files on the faulty disk server; the data manager removed them at the catalogue level (GGUS:63024).
    • RAL: Half of the jobs are stalling (and then eventually being killed by either the DIRAC watchdog or the LRMS for exceeding the wall-clock time). Investigations are ongoing to understand whether this is a site issue.
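The T0 item above describes a consistency suite that traps files whose computed checksum disagrees with the catalogue. A minimal sketch of that comparison, assuming adler32 checksums stored as 8 hex digits (the file name and the dict-based catalogue lookup are illustrative; real suites query the storage and file catalogues):

```python
import zlib

def adler32_hex(data):
    """Adler-32 of a byte string, rendered as 8 hex digits."""
    return "%08x" % (zlib.adler32(data) & 0xFFFFFFFF)

# Hypothetical catalogued checksum for one file.
catalogued = {"raw/run1234.dat": adler32_hex(b"good payload")}

def is_consistent(name, data):
    """Recompute the checksum and compare with the catalogued value."""
    return catalogued.get(name) == adler32_hex(data)
```

A file damaged during upload, or whose checksum was mis-computed by the buggy CASTOR/RAL code, fails this comparison and gets flagged for re-replication.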

15th October 2010 (Friday)

Experiment activities: Mostly user jobs in the morning, reconstruction of new data taken yesterday was not problematic.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • RAL: Staging of input files was failing this morning (GGUS:63140); everything seems OK now. The RAL share was first reduced to 0 and then restored to its original value.

14th October 2010 (Thursday)

Experiment activities: Data taking. Mostly user jobs in the morning, reconstruction of new data to start.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • IN2P3: Disk server still down at IN2P3, affecting merge and user jobs. A list of the affected files has been requested.

13th October 2010 (Wednesday)

Experiment activities: Mostly user and MC jobs

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • IN2P3: Files unavailable (GGUS:63024): disk server down?
    • IN2P3: Problems accessing LHCb_RDST and LHCb_RAW (GGUS:63008); possibly related to the other ticket.
    • IN2P3: A few jobs landed on WNs with the buggy kernel and seg-faulted. The ticket was closed a few days ago; reported internally.
    • RAL: The share has been put back to its original value after the HammerCloud stress test.
    • SARA: ce-sft-job SAM test failures observed (CondorG queue error); seems OK now.

12th October 2010 (Tuesday)

Experiment activities: During last night ~2 pb-1 of data were taken, but some datasets were not sent to offline. Production validation has not started yet due to a bug in the DaVinci option files; it is expected to start in the afternoon.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • SARA: Request to open port 1521 of the Oracle ConditionDB listener to the WN subnets at all T1s, as per the VO ID Card (GGUS:62971). For the time being the SARA DB is banned; all jobs at SARA/NIKHEF will use the CERN DB.
    • RAL: Found many files reporting a 0 checksum and also not accessible via ROOT (GGUS:61532). The problem will be fixed in the afternoon. LHCb needs to understand the root cause of this inconsistency in size and locality. In the meantime, in order to re-enable RAL in the production mask, a validation stress test is being set up with HammerCloud.
    • RAL: SAM tests for SRM (both DIRAC and gfal unit tests) failing (GGUS:62893). It was an overload issue due to a concurrent database backup.
    • IN2P3: Still issues with the shared area (GGUS:59880 and GGUS:62800).
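The SARA request above boils down to a firewall change: worker nodes must be able to open a TCP connection to the Oracle listener on port 1521. A hedged sketch of the connectivity probe such a request implies (the host name is a placeholder, not a real SARA endpoint):

```python
import socket

# Can this worker node reach the ConditionDB listener? A simple TCP
# connect on the Oracle listener port (1521) is enough to verify that
# the firewall rule is in place; it does not authenticate to Oracle.

def can_reach(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Illustrative call against a placeholder host:
# can_reach("conddb.example.org", 1521)
```

Until the port is open from the T1 WN subnets, jobs fail exactly as the SAM tests showed, which is why the SARA DB is banned and CERN's instance is used instead.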

11th October 2010 (Monday)

Experiment activities: Mostly User jobs, no new data. Starting now a validating activity.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 3

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • SARA: 2 separate issues with ConditionDB:
      • systematically failing SAM tests and affecting production jobs. (GGUS:62896)
      • Database TAG not updated.
    • RAL: SAM tests for SRM (both DIRAC and gfal unit tests) are failing; under investigation (GGUS:62893).
    • IN2P3: Seg faults (GGUS:62732) understood: due to newer WNs with a buggy kernel.
    • RAL: Found many files reporting a 0 checksum and also not accessible via ROOT (GGUS:61532).

8th October 2010 (Friday)

Experiment activities: Mostly User jobs. Reconstruction jobs delayed due to a bug in the new release.

New GGUS (or RT) tickets:

  • T0: 1
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • CE: The CEs seem to be publishing the wrong cluster architecture into the BDII (RT717022).
  • T1 site issues:
    • IN2P3: Seg faults still present; we would like a post-mortem of the main problem on Sunday (GGUS:62732).
    • RAL: SAM tests for SRM (both DIRAC and gfal unit tests) are failing.

7th October 2010 (Thursday)

Experiment activities: General remark on SAM availability for the T1 sites:

  • IN2p3: The DIRAC unit test was failing systematically due to a connectivity problem. The clients were moved to a dedicated LHCb machine; it seems fixed.
  • PIC, CNAF and NIKHEF: The DIRAC unit test was failing because the space was full. A preamble was added to this test (it was already in the gfal-based unit test) to take this into account. Fixed.
  • NIKHEF, PIC and RAL: A test (CondDB-conn) checking connectivity from a site WN against all 7 instances of the CondDB was failing because of an issue at SARA. In fact, this test should not have been in the "LHCb Critical Availability" list but rather in the "LHCb Desired Availability" list in the dashboard. Fixed.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0
    • sam201 and sam202 (used to publish SAM test results) became unresponsive (overloaded) and many test results had not been published for a long time. Moved to a dedicated machine (samnag003).
  • T1 site issues:
    • RAL: FTS transfers to the RAL SRM finished, yet the destination files have 0 size (GGUS:62829).
    • IN2P3: SAM software installation failed again (GGUS:62800).
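The RAL ticket above (transfers reported as Finished with zero-size destinations) is the classic case that post-transfer validation catches: accept a transfer only if the destination size matches the source and is non-zero. A minimal sketch, assuming the sizes have already been obtained from storage queries (the function name and integer inputs are illustrative):

```python
# Sketch of a post-transfer size check of the kind that catches
# "Finished but 0 size" failures: the destination must match the
# source size exactly and must not be empty.

def transfer_ok(src_size, dst_size):
    """Accept a transfer only if destination size is non-zero and equal
    to the source size; reject zero-size or truncated files."""
    return dst_size == src_size and dst_size > 0

good = transfer_ok(1048576, 1048576)  # complete 1 MiB transfer
bad = transfer_ok(1048576, 0)         # "Finished" but empty destination
```

Transfers failing this check would be flagged for retry instead of being registered in the catalogue.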

6th October 2010 (Wednesday)

Experiment activities: none reported.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

5th October 2010 (Tuesday)

Experiment activities: none reported.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 6
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • RAL: LHCb files with status "NONE" at the RAL SRM (GGUS:62744).
    • RAL: CondDB: part of the problem understood (see GGUS:62667).
    • IN2P3: SAM tests failing against LHCb_USER at IN2p3 (GGUS:62739).
    • GRIDKA: All jobs aborted at cream-2-fzk.gridka.de and cream-3-fzk.gridka.de (GGUS:62758).
    • GRIDKA: Dump of the LHCb_USER space token requested (GGUS:62763).
    • SARA: Most pilots aborted at creamce2.gina.sara.nl and creamce.gina.sara.nl (GGUS:62759).

4th October 2010 (Monday)

Experiment activities: Analysis; a lot of jobs stuck in our pre-stager (DIRAC). We will ask the T1s for a dump of the LHCb_USER space in order to clean up all the orphaned user files.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:
    • IN2P3: Jobs failing with segmentation faults (did something go wrong on Sunday at IN2P3? Jobs are running fine now) (GGUS:62732).
    • RAL: Testing a WN with the previous configuration.

1st October 2010 (Friday)

Experiment activities: Analysis; no particular issues.

New GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 0

Issues at the sites and services

  • T0
    • none
  • T1 site issues:

-- RobertoSantinel - 29-Jan-2010

Topic revision: r2 - 2010-11-04 - unknown
 