March 2010 Reports

31st March 2010 (Wednesday)

Experiment activities:

  • The DSTs of yesterday's data are available in the book-keeping under: Lhcb -> Collision10 -> Beam3500GeV-VeloOpen-MagDown -> Real Data + RecoDST-2010-01 -> DST. These are complete DSTs without any stripping. The stripping will be deployed in the coming days, once some crashes in DaVinci are fixed.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 sites issues:
    • FTS export service: ALARM ticket issued (56880). It appears that they simply swapped the old endpoint (fts-t0.export.cern.ch) with the new one (fts22-t0-export.cern.ch, previously the pilot) by changing the DNS alias (Migration to FTS22 Procedure). This "false" problem also offered the chance to exercise the alarm procedures for best-effort supported services like FTS, which proved to be OK. A simple check of where the alias points is sketched after this list.
  • T1 sites issues:
    • pic: back from yesterday's downtime; it was re-integrated into the LHCb production mask with some latency this morning because the CIC portal did not send the notification of the end of the downtime. The CIC people report that their portal is overloaded due to a migration to GOC 4.
    • NIKHEF-SARA: issue with dCache when accessing data (56909).
  • T2 sites issues:
    • CBPF: Shared area issue
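
The alias swap mentioned under the T0 issues can be verified from any client by resolving the alias and looking at the canonical host it currently points to. A minimal sketch, assuming standard DNS resolution; the comment about the expected result is illustrative, not taken from the report:

    import socket

    # Resolve the production FTS export alias; the first element of the tuple
    # is the canonical host name the alias currently points to.
    canonical, aliases, addresses = socket.gethostbyname_ex("fts-t0.export.cern.ch")
    print("alias fts-t0.export.cern.ch ->", canonical)
    # Expected after the swap (illustrative): the canonical host is the machine
    # that previously served the pilot alias fts22-t0-export.cern.ch.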

30th March 2010 (Tuesday)

Experiment activities:

  • Started to receive COLLISION10-type data, promptly migrated to tape, checksum-verified and registered in the LFC (a sketch of such a checksum comparison follows this list).
  • Actively working on commissioning the real-data workflows, which are affected by some problems at the application level (a patch is going to be rolled out today by the core developers).
  • xrootd tests at CERN continuing: no extra environment variable had to be defined to access files via xrootd, which now appears to be properly configured.
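
The checksum verification mentioned in the first bullet is typically an adler32 comparison between the value computed on the migrated copy and the value stored in the catalogue. A minimal sketch of the local computation; the file path and the reference value below are hypothetical:

    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        """Compute the adler32 checksum of a file, reading it in chunks."""
        checksum = 1  # adler32 seed value
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                checksum = zlib.adler32(chunk, checksum)
        return checksum & 0xFFFFFFFF  # force an unsigned 32-bit result

    # Hypothetical local copy and catalogue value, for illustration only.
    local = adler32_of_file("/tmp/run_0001.raw")
    catalogue_value = 0x1A2B3C4D
    print("OK" if local == catalogue_value else "MISMATCH: %08x" % local)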

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • pic: announced downtime causing some major perturbation of user activities, since some of the critical DIRAC services are hosted there.
    • CNAF: ConditionDB intervention (until 13:00 CET)
  • T2 sites issues:
    • INFN-PADOVA: Shared area issue

29th March 2010 (Monday)

Experiment activities:

  • First xrootd tests at CERN. 100 user jobs, and that is it for today.

GGUS (or RT) tickets:

  • T0: 2
  • T1: 1
  • T2: 3

Issues at the sites and services

  • T0 sites issues:
    • CASTOR: the LHCb data written to the lhcbraw service class has not been migrated for several days (GGUS: 56795).
    • A VOBOX (volhcb26) was not reachable over the weekend. Remedy ticket opened with vobox-support (Remedy Ticket CT0000000671634).
    • Jan (following a request on Friday) has announced that the castorlhcb head nodes are now running the latest version of SSL-enabled xrootd. LHCb wants to stress that SRM should provide a consistent xroot tURL for accessing data in the various service classes. Preliminary tests show the problem is still there (the file is opened but then cannot be read) and close interaction with the xrootd developers has been triggered; a minimal reproduction of the open-then-read test is sketched after this list.
  • T1 sites issues:
    • SARA: the LFC read-only replica had been unresponsive for a while (it also disappeared from SAMDB); this was at the origin of the root group-id problem reported last week at NIKHEF. An Oracle RAC issue was the cause of the problem.
    • CNAF: ConditionDB intervention agreed for tomorrow, 9:00-13:00 CET.
  • T2 sites issues:
    • egee.fesb.hr: shared area issue
    • ITPA-LCG2: library missing
    • IN2P3-LPC: SAM voms tests failing
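
The open-then-read failure mentioned under the T0 issues can be exercised with a trivial PyROOT read over the xroot protocol. A minimal sketch; the tURL below is purely hypothetical and would in practice be the one returned by SRM for the service class under test:

    import ROOT

    # Hypothetical xroot tURL, for illustration only.
    turl = "root://castorlhcb.cern.ch//castor/cern.ch/grid/lhcb/data/example.dst"

    f = ROOT.TFile.Open(turl)
    if not f or f.IsZombie():
        raise RuntimeError("open failed: %s" % turl)

    # The reported failure mode is that the open succeeds but reading does not,
    # so actually read something back from the file.
    print("opened", f.GetName(), "with", f.GetNkeys(), "keys")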

26th March 2010 (Friday)

Experiment activities:

  • No production activity, just the usual few thousand user jobs per day. First xrootd test at CERN.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 sites issues:
    • It looks like reading a file with xrootd at CERN is OK from lxplus but does not work from an lxbatch node (the file is opened but cannot be read). Could there be a firewall issue blocking the data channel between the WN and the xrootd server? A simple connectivity check that can be run from both node types is sketched after this list.
  • T1 sites issues:
    • CNAF: can go ahead with the ConditionDB intervention on Monday. Suggestion to prepare for the intervention in advance so as to minimize the downtime period.
  • T2 sites issues:
    • CSCS-LCG2 pilots stalled.
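
A quick way to test the firewall hypothesis above is to check whether the xrootd data port is reachable from both an lxplus node and an lxbatch WN. A minimal sketch, assuming the standard xrootd port 1094 and a hypothetical server name:

    import socket

    def port_reachable(host, port, timeout=5):
        """Return True if a TCP connection to host:port can be established."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # Hypothetical server name; run the same check from lxplus and from an
    # lxbatch worker node and compare the results.
    print(port_reachable("castorlhcb.cern.ch", 1094))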

25th March 2010 (Thursday)

Experiment activities:

  • No production activity, just the usual few thousand user jobs per day.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • CNAF: StoRM: an LCMAPS problem causing some users' jobs to fail when uploading data, and SAM test jobs as well. This is a known LCMAPS bug (triggered when its log files grow bigger than 2 GB).
    • CNAF: CREAM CE: problems submitting with Role=pilot due to too few pool accounts set up there; fixed by increasing the pool accounts for the pilot role to 50.
    • NIKHEF: problem with the group ID passed to LFC clients being zero (root). Under investigation.
    • pic is still banned because of a problem in the software installation module, which does not handle some environment variables properly. SAM tests are failing systematically.

24th March 2010 (Wednesday)

Experiment activities:

  • No production activity; mainly activities devoted to setting up a suite to evaluate xrootd for the long-standing file access problem that LHCb is facing at sites. Direct CREAM CE submission: a first prototype has been set up.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 1

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • NIKHEF: all jobs systematically fail to upload data to any destination. It sounds like the jobs use GID = 0, which is associated with root. GGUS ticket open.
    • CREAM CE: problems submitting with Role=pilot due to too few pool accounts set up there.
  • T2 sites issues:
    • GRISU-CYBERSAR-CAGLIARI: pilots aborting

22nd March 2010 (Monday)

Experiment activities:

  • LHCb T1 Jamboree in Amsterdam.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • NIKHEF: all jobs systematically fail to upload data to any destination. It sounds like the jobs use GID = 0, which is associated with root. GGUS ticket open.

19th March 2010 (Friday)

Experiment activities:

  • The clean-up of old DC06 data is ongoing. Re-running the FULL stream reconstruction with the new DaVinci application (which fixes a known bug); this is the workflow that will process 2010 real data. Running, at a low profile, four different small MC productions and the corresponding merging (fewer than 2K jobs in total, including users).

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • CNAF: testing a first DIRAC prototype for direct submission to CREAM; found mapping problems at the remote CE for the user and pilot FQANs.
    • pic: still investigating with the local contact people the issue of the SL5 64-bit application failing to install.
    • NIKHEF: reported a problem uploading output data from production jobs, regardless of destination. A GGUS ticket is about to be submitted. The jobs report (see the sketch after this list):
        LcgFileCatalogClient:__getRoleFromGID: Failed to get role from GID 0 (Invalid argument)
  • T2 sites issue:
    • Shared area issue at INFN-TORINO and at IPSL-IPGP-LCG2
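
The error quoted above shows the jobs presenting GID 0 to the LFC client; on any standard Linux box GID 0 is the root group, from which no VO role can be derived. A trivial local check of what a GID maps to, for illustration only:

    import grp
    import os

    # GID 0 maps to the root group on a standard Linux system; production jobs
    # should instead run under a VO-specific pool-account group.
    print(grp.getgrgid(0).gr_name)           # -> 'root'
    gid = os.getgid()
    print(gid, grp.getgrgid(gid).gr_name)    # what this process actually runs as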

18th March 2010 (Thursday)

Experiment activities:

  • The clean-up of old DC06 data is ongoing.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • The SVN slowness reported yesterday has been fixed with a workaround.
    • The SVN central service was completely unavailable last night from about 23:00 (as also reported by SLS, see the plot below). A Remedy ticket was submitted for this issue (CT669210). The problem, now fixed, was that a third SVN server machine had been installed to help with the performance problems previously reported by LHCb; due to an error it became available in the load balancer while still in the "maintenance" state, so login was not possible.
[SLS availability plot: SVN.png]

  • T1 sites issues:
    • PIC: there is a problem with the installation of one of the LHCb applications for SL5 64-bit. The reason is not clear yet; no ticket open. As a precaution, the site has been banned.

  • T2 sites issue:
    • ECDF: found a problem on Red Hat Enterprise Linux 5 when setting the LHCb environment. A fix will be provided by the LHCb core people (the g++ issues recently reported as affecting numerous sites).

17th March 2010 (Wednesday)

Experiment activities:

  • Re-ran the EXPRESS and FULL streams to test the 2010 workflows (using 2009 real data). The EXPRESS stream was OK, but for the FULL stream a known issue at the application level was encountered.
  • Now launching a campaign to clean up old DC06 data across all storage elements.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • SVN reported to be extremely slow today. This problem was already reported via RT: https://cern.ch/helpdesk/problem/CT662489&email=marco.clemencic@cern.ch
  • T1 sites issues:
    • none
  • T2 sites issue:
    • Disk quota exceeded at INFN-PISA; a problem importing a module at UKI-SCOTGRID-ECDF.

16th March 2010 (Tuesday)

Experiment activities:

  • The two stripping productions launched on Friday (to fix the problem of large input files) are almost complete. Another small stripping is to be launched, and a few million MC events have been requested. Not much activity otherwise: mainly waiting for the first data in the afternoon and checking that all pieces of code work for processing them.
  • Testing Squid at pic for offloading the central web servers serving the application and DIRAC software repositories (a client-side sketch follows this list).
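
Offloading with Squid means pointing the site's clients at a local caching proxy instead of fetching every file from the central web servers. A minimal client-side sketch; the proxy host/port and the repository URL are hypothetical placeholders, not the actual pic configuration:

    import urllib.request

    # Hypothetical site proxy and repository URL, for illustration only.
    proxy = urllib.request.ProxyHandler({"http": "http://squid.example.pic.es:3128"})
    opener = urllib.request.build_opener(proxy)

    # Repeated fetches of the same URL should be served from the local Squid
    # cache instead of hitting the central web server every time.
    with opener.open("http://central-webserver.example.cern.ch/sw/tarball.tar.gz") as resp:
        data = resp.read()
    print(len(data), "bytes fetched via the site proxy")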

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • A glitch with the AFS shared area prevented writing the lock files spawned by SAM (also used for SW installation). Immediately fixed.
  • T1 sites issues:
    • none
  • T2 sites issue:
    • none

15th March 2010 (Monday)

Experiment activities:

  • Stripping of bbar and ccbar over previous MC productions running at a very low profile. Some jobs (10%) were killed by the watchdog, which declared them stalled; something related to the application.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • none
  • T2 sites issue:
    • ITEP shared area issue; g++ not found in BIFI

12th March 2010 (Friday)

Experiment activities:

  • Relaunched the stripping of bbar and ccbar over the MC production of the 10th of February. Testing the new code for the SRM unit test (temporarily set to "non-critical").

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • none

11th March 2010 (Thursday)

Experiment activities:

  • No activity at all.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • LFC replication tests failing occasionally at all T1s. The reason is to be investigated (it could be due to the master at CERN).
  • T1 sites issues:
    • SAM suite still failing at CNAF: some protection was put into the test code to shield it from unexpected responses given by StoRM.

10th March 2010 (Wednesday)

Experiment activities:

  • No activity yesterday, system pretty idle.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • Tomorrow morning a 90-minute downtime on all central DIRAC machines for an intervention on the network service f513-c-ip169-shpyl-10.
  • T1 sites issues:
    • CNAF: after yesterday's StoRM upgrade, the SAM unit test for SRM keeps failing, affecting the whole site availability. The reason is not clear yet.
  • T2 sites issues:
    • Continuing the investigation of the data-upload problem from several UK sites to CERN, in close collaboration with the Sheffield and Glasgow people.

9th March 2010 (Tuesday)

Experiment activities:

  • Small productions running at a very low rate; occasional failures mostly due to application-related issues. Last week's production finished quite smoothly; just the merging of smallish files was causing some trouble in the DIRAC logic.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • T0 sites issues:
    • Some merging jobs failed at CERN on Saturday and Sunday because input data was temporarily unavailable. By Monday the problem had disappeared.
  • T1 sites issues:
    • RAL: the apply process of the Streams replication to LFC@RAL was failing yesterday. This seems to be due to a spurious entry introduced manually in the (read-only) RAL DB, something that should not happen at all.
  • T2 sites issues:
    • Continuing the investigation of the data-upload problem from several UK sites to CERN, in close collaboration with the Sheffield and Glasgow people.

8th March 2010 (Monday)

Experiment activities:

  • Last week's MC productions are almost over today (LHCb week). Not much activity to report.

GGUS (or RT) tickets:

  • T0: 1
  • T1: 0
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • The CASTOR upgrade to 2.1.9-4 plus the SRM upgrade to 2.8-6 went through without problems to report.
    • Yesterday some problems with the merging jobs at CERN because input data was not available (RT ticket open). Today it became available again.
  • T1 sites issues:
    • Noticed that the CNAF CEs failed the critical job-submission SAM test on Sunday morning (all CEs affected, a problem with local LRMS submission); IN2P3 also had the critical SRM unit test failing on Sunday.
  • T2 sites issues:
    • Shared area issue at IN2P3-LPC; jobs failing at Pisa.

5th March 2010 (Friday)

Experiment activities:

  • MC productions running at a sustained rate without major problems (9-10K concurrent jobs).
  • Organizing the first LHCb T1 Jamboree meeting in Amsterdam (22nd and 23rd of March). All site representatives/site-service managers at the T1s and CERN are invited. There are already confirmations from the T1s; we need a few names of people from CERN. The agenda is still to be finalized.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 1

Issues at the sites and services

  • T0 sites issues:
    • CASTOR upgrade to 2.1.9-4 + SRM upgrade to 2.8-6 on Monday the 8th (9:00-11:00 am CET).
  • T1 sites issues:
    • none
  • T2 sites issues:
    • Shared area issue at INFN-MILANO

4th March 2010 (Thursday)

Experiment activities:

  • MC productions running at a sustained rate without major problems.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 2

Issues at the sites and services

  • T0 sites issues:
    • Proposed time slot for the CASTOR upgrade: the sooner the better, or at the next LHC stop. Monday morning could be suitable (provided the LHC is not running).
    • The Offline CondDB schema was corrupted yesterday. The schema needs to be restored to the previous configuration, but the apply process failed against 5 out of 6 T1s (all but CNAF).
  • T1 sites issues:
    • SARA: the failure of the critical FileAccess SAM test has been understood by the core application developers: some libraries (libgsitunnel) for SLC5 platforms were not properly deployed in the AA.
    • NIKHEF: issue with HOME not set on some WNs yesterday afternoon. The issue was hit by some jobs during the (very) short period in which the local watchdog, which restarts the nscd daemon when it dies, was not running.
  • T2 sites issues:
    • Shared area issue at INFN-NAPOLI and an uploading problem at UKI-LT2-Brunel. This is the fourth UK site (counting the ones reported yesterday) banned in LHCb because of that problem; under investigation.
    • The "flickering" BDII issue was due to a bug in the DIRAC script that queries the BDII and fills the DIRAC information system (the kind of query involved is sketched after this list).
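
The DIRAC script in question essentially performs an LDAP query against the BDII and copies the result into the DIRAC configuration. A minimal sketch of such a query with python-ldap, using the standard BDII port and Glue 1.3 schema; the endpoint and filter are illustrative and not the actual DIRAC code:

    import ldap

    # Standard top-level BDII endpoint and Glue 1.3 base DN (illustrative; not
    # the configuration actually used by the DIRAC script).
    conn = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
    results = conn.search_s(
        "mds-vo-name=local,o=grid",
        ldap.SCOPE_SUBTREE,
        "(&(objectClass=GlueCE)(GlueCEAccessControlBaseRule=VO:lhcb))",
        ["GlueCEUniqueID", "GlueCEStateStatus"],
    )
    for dn, attrs in results:
        print(attrs.get("GlueCEUniqueID"), attrs.get("GlueCEStateStatus"))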

3rd March 2010 (Wednesday)

Experiment activities:

  • MC production running at the pace of 8K concurrent jobs.
  • Discussion on xroot testing in LHCb. There are some inconsistencies in the way the xroot and root protocols are handled by the different storage technologies. RAL first has to upgrade its CASTOR before it can support xrootd; IN2P3, SARA and GridKA do support it; pic is working to provide xroot; CNAF does not support it; CERN is OK.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 1
  • T2: 4

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • SARA (after yesterday's LRMS intervention): users (and also SAM) report problems getting their jobs running. ATLAS is running at full steam there (1700 jobs).
    • GridKA: SQLite problems due to the usual nfslock mechanism getting stuck. The NFS server was restarted.
  • T2 sites issues:
    • INFN-CAGLIARI (aborting jobs), UKI-NORTHGRID-SHEF-HEP (uploading problem), UKI-LT2-UCL-CENTRAL (shared area), UKI-SCOTGRID-GLASGOW (uploading problems)

2nd March 2010 (Tuesday)

Experiment activities:

  • 40 million MC events to be produced, scattered across 20 different production requests. Running smoothly so far.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Issues at the sites and services

  • Gridview: issue with historical SAM information: the 1st of March was skipped.
  • T0 sites issues:
    • none
  • T1 sites issues:
    • SARA: issue with the critical file-access test, under investigation.
  • T2 sites issues:
    • UK sites: BDII information flickering; pilots cannot be submitted because no available resources are found. Could this be related to what was reported yesterday?

1st March 2010 (Monday)

Experiment activities:

  • No large activity in the system right now (a few MC productions created). Drained the remaining stripping productions to allow the new release of the DaVinci application (based on the version of ROOT that fixes the compatibility problem with the dcap libraries) to be deployed. As a consequence, all dCache sites have been reintegrated in the production mask. SAM tests also confirm that the problem preventing data access is now fixed at almost all dCache sites.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 2
  • T2: 3

Issues at the sites and services

  • T0 sites issues:
    • none
  • T1 sites issues:
    • SARA: issue with the critical file-access test (after the test moved to the right version of ROOT, fixing the compatibility issue).
    • RAL: issue last week with the voms-cert tests.
    • RAL: issue on Friday with one disk server (gdss160), taken temporarily out of production.
  • T2 sites issues:
    • GRIF and UKI-LT2-QMUL also failing the critical voms-cert SAM test.
    • Software installation issues at UKI-SOUTHGRID-BHAM-HEP.
    • User jobs stalling at RAL-HEP, most likely due to a C/A issue.

-- RobertoSantinel - 29-Jan-2010
