November 2009 Reports

30th November 2009 (Monday)

Experiment activities:
  • System empty
  • CREAM CEs: many of them are still not publishing GlueCEStateStatus=Production, causing the SAM suite to fail with the usual "BrokerHelper: not compatible resources" error
GGUS tickets:
  • 1 new ticket since Friday

T0 sites issues:

  • none
T1 sites issues:
  • none

27th November 2009 (Friday)

Experiment activities:
  • System almost empty (70 user jobs running as of 10am this morning)
  • CREAM CEs: many of them are still not publishing GlueCEStateStatus=Production, causing the SAM suite to fail with the usual "BrokerHelper: not compatible resources" error
GGUS tickets:
  • 0 new tickets since yesterday

T0 sites issues:

  • none
T1 sites issues:
  • SARA: substantially upgraded their storage resources, fulfilling the 2009 LHCb requirements. The VOMS-unawareness issue reported at some previous meetings also seems to have gone away after restarting the gsidcap server. Perhaps not a bug as originally reported?

26th November 2009 (Thursday)

Experiment activities:
  • MC09 L0HLT1 stripping of b-inclusive and c-inclusive launched; some files are already available in the bookkeeping
  • User analysis is the most important activity
GGUS tickets:
  • 0 new tickets since yesterday

T0 sites issues:

  • none
T1 sites issues:
  • none

25th November 2009 (Wednesday)

Experiment activities:

  • Setting up a special MC production (a few hundred jobs) with the last days' data-taking conditions, just for comparative studies.
  • 4-5K user analysis jobs per day, as reported by the SSB.

GGUS tickets:
  • 1 new ticket since yesterday

T0 sites issues:

  • none
T1 sites issues:
  • NIKHEF: no more users complaining about the "ROOT figuring out the user HOME" issue, even after the WNs were put back in production.
  • IN2P3: the file-access issue reported about ten days ago seems to be understood: they added a variable to the job environment to force the dcap client into active mode. In this way, the client connects happily to the pool for the data stream.
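
The IN2P3 fix above amounts to injecting one variable into the job environment before the dcap client opens the data connection. A minimal sketch; the variable name `DCACHE_CLIENT_ACTIVE` is an assumption, since the report does not say which variable was actually used:

```python
import os

def force_dcap_active(env=None):
    """Return a copy of the job environment with the dcap client forced
    into active mode, so it opens the data connection to the pool itself
    instead of waiting for the pool to connect back through a firewall.
    The variable name DCACHE_CLIENT_ACTIVE is an assumption, not taken
    from the report."""
    env = dict(os.environ if env is None else env)
    env["DCACHE_CLIENT_ACTIVE"] = "1"
    return env

# A job wrapper would pass this environment on to the payload process.
job_env = force_dcap_active({})
```

Copying rather than mutating the environment keeps the tweak local to the job, which is what a site-level workaround like this needs.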
T2 sites issues:
  • Shared area issue

24th November 2009 (Tuesday)

Experiment activities:

  • Just 56 minutes after the first p-p collisions were seen at Point 8, the data had been reconstructed on the Grid and was available in the LHCb Bookkeeping. This is a very impressive achievement considering that within this time the data was shipped from the pit to CASTOR, migrated to tape, registered in the catalog, and shipped to the T1s; reconstruction jobs were created, submitted, and run; and their output was distributed to all other sites, fully automatically. More information: LHCbOffline Twitter
  • LHCB_HISTO space token going to be set up at CERN, and 100TB more of disk ready. LHCb still has to decide how to spread it across our service classes.
  • Updated the LHCb VO ID Card to increase the quota for the shared area at sites from 50GB to 150GB. The new requirement reflects the fact that software has to be deployed for a larger number of architectures, and guarantees a more comfortable margin; cleaning up obsolete software versions can only recover a few GB of space.
  • No MC productions, but huge analysis activity seen in the SSB (maybe users attracted by fresh real data?).

GGUS tickets:
  • 2 new tickets since yesterday

T0 sites issues:

  • ConditionDB and LFC stream replication discovered not to be working, except towards CNAF. ALARM ticket submitted; promptly handled by the DB people (to whom the problem was forwarded): 1) a service request was opened with Oracle (this is a bug); 2) monitoring was improved to detect this problem; 3) streams were re-established for the conditions DB (either via export/import or by recovering the stream replication); 4) LFC 3D was left unsynchronized towards RAL and GridKA to preserve the problem for Oracle support.
  • ALARM tickets and critical-service coverage: as with AFS last week, the list of covered services definitely has to be re-negotiated.

T1 sites issues:

  • NIKHEF: the ROOT/HOME issue was still present on Saturday; all affected jobs ran on the new WNs recently put into production at the site (and removed yesterday). Under investigation.
T2 sites issues:
  • Shared area issue

23rd November 2009 (Monday)

Experiment activities:

  • The first LHC proton (splash) events have been recorded (three files), fully reconstructed at CERN and redistributed across all T1s for physics analysis: a great success for LHCb offline computing. These are the details of the first real data for LHCb. More information: LHCbOffline Twitter
/lhcb/data/2009/DST/00005616/0000/00005616_00000002_1.dst
/lhcb/data/2009/DST/00005616/0000/00005616_00000003_1.dst
/lhcb/data/2009/DST/00005617/0000/00005617_00000001_1.dst

And they have been distributed as follows, ensuring we have two T1D1 copies and five T0D1 copies of each file (although currently without shares implemented):

Site                      Files
CERN_M-DST                    3
CNAF-DST                      3
GRIDKA-DST                    1
GRIDKA_M-DST                  2
IN2P3-DST                     3
NIKHEF-DST                    2
NIKHEF_M-DST                  1
PIC-DST                       3
RAL-DST                       3
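
The replica counts in the table above can be cross-checked against the stated policy. A small sketch, assuming the `_M-DST` space tokens hold the tape-backed (T1D1) copies and the plain `-DST` tokens the disk-only (T0D1) copies:

```python
# Replica counts per space token, copied from the table above.
replicas = {
    "CERN_M-DST": 3, "CNAF-DST": 3, "GRIDKA-DST": 1, "GRIDKA_M-DST": 2,
    "IN2P3-DST": 3, "NIKHEF-DST": 2, "NIKHEF_M-DST": 1, "PIC-DST": 3,
    "RAL-DST": 3,
}
n_files = 3  # the three DST files listed above

# Assumption: "_M-DST" tokens are the T1D1 (tape-backed) copies,
# plain "-DST" tokens the T0D1 (disk-only) copies.
t1d1 = sum(n for token, n in replicas.items() if token.endswith("_M-DST"))
t0d1 = sum(n for token, n in replicas.items() if not token.endswith("_M-DST"))

assert t1d1 == 2 * n_files  # two tape-backed copies per file
assert t0d1 == 5 * n_files  # five disk-only copies per file
```

Both assertions hold for the table: 6 tape-backed and 15 disk-only replicas for the 3 files.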

GGUS tickets:
  • 4 new tickets since Friday

T0 sites issues:

  • none

T1 sites issues:

  • SARA: yesterday experienced problems (authorization failures) with Andrew's jobs accessing data in the LHCb HammerCloud. It seems to be an issue with dCache 1.9.5-8, which no longer supports VOMS.
  • PIC: shared area exceeding quota, making it impossible to install the version of Brunel needed for reconstruction of real data.
  • NIKHEF: users still affected by the HOME issue on NIKHEF's WNs (nscd daemons stuck).
  • CNAF: StoRM was down on Friday evening.

20th November 2009 (Friday)

Experiment activities:

  • No activities going on in the system apart from 1K user distributed-analysis jobs (Tier-1s)
  • First beam data expected this evening. Only a few jobs running the reconstruction application at CERN (just for demonstration, and to produce some DSTs for users).
  • DC06 DST redistribution still going to RAL, CNAF and SARA

GGUS tickets:
  • 2 new tickets since yesterday

T0 sites issues:

  • none

T1 sites issues:

  • SARA: disk-space shortage on the MC_(M)-DST space tokens.
  • SARA: yesterday experienced problems with some users accessing data: a misconfiguration of the SRM published dcap as a valid protocol, while SARA notably supports only gsidcap.
  • CNAF: reported 6 files in CASTOR unrecoverable and definitely lost; to be removed from all catalogs.
  • RAL: SAM jobs failing to access an SQLite file in the shared area.

19th November 2009 (Thursday)

Experiment activities:

  • The two MC productions managed to simulate all the requested events, and the remaining jobs on the sites have been killed. This explains why the SSB now suddenly reports all sites red.
  • DC06 DST redistribution across the T1s (started last Thursday) is still proceeding smoothly. Done for IN2P3, GridKA and PIC; again some disk-space issues, promptly addressed by the SARA people, who are working very hard to bring their new storage capacity online.
  • Beam data expected this weekend (Friday evening). Only a few jobs running the reconstruction application at CERN (just for demonstration, and to produce some DSTs for users).

GGUS tickets

T0 sites issues:

  • The potential problem reported yesterday is not a problem for LHCb, which as part of production validation also checks file sizes across the various catalogs and storages.

T1 sites issues:

  • SARA: disk space shortage on MC_(M)-DST space tokens.

18th November 2009 (Wednesday)

Experiment activities:

  • Two MC productions running at full steam (~12-13K jobs concurrently running for three days now...)
  • DC06 DST redistribution across the T1s (started last Thursday) is again proceeding smoothly. Almost done for IN2P3.
  • Set up a view of the Site Status Board that sites can consult to check how they are behaving with respect to the jobs they run (and their type). Just follow this LHCb Dashboard.

GGUS tickets

T0 sites issues:

  • Reported a potential problem with a version of Globus installed on the SL4 nodes at CERN that might affect data registration in the LFC (wrong file size). Verified that LHCb validates each production, including possible size mismatches between catalogs, so LHCb is shielded from this problem.

T1 sites issues:

  • SARA: reported failures accessing data yesterday (before today's major intervention) due to dCache permissions on the LHCb_USER space. Hopefully the new dCache installation fixes this issue too. (GGUS ticket open)

17th November 2009 (Tuesday)

Experiment activities:

  • Two MC productions running at full steam (~12K jobs concurrently running in the system)
  • Stripping jobs running concurrently at the T1s, plus user jobs
  • DC06 DST redistribution across the T1s (started last Thursday) is again proceeding smoothly (at an average pace of 50MB/s)

GGUS tickets

T0 sites issues:

T1 sites issues:

  • IN2P3: maintenance on the batch system to change its configuration, followed by a problem after the scheduled maintenance
  • SARA: MC_DST space token running close to shortage; extra space promptly added

T2 sites issues:

16th November 2009 (Monday)

Experiment activities:

  • Two MC productions running at full steam (~10K jobs)
  • Stripping jobs running concurrently at the T1s, plus user jobs
  • DC06 DST redistribution across the T1s (started last Thursday) is proceeding without major issues (at an average pace of 50MB/s)

GGUS tickets

T0 sites issues:

  • Some users reporting problems with central DIRAC services (Sandbox server)

T1 sites issues:

  • IN2p3: Issue accessing some data files in dCache (GGUS ticket open)

T2 sites issues:

  • NIPNE: missing library
  • Shared area issue at Bulgarian site

13th November 2009 (Friday)

Experiment activities:

  • The two MC productions announced yesterday have been submitted. We expect a fairly busy weekend.
  • Stripping jobs at SARA (still failing with application error code 134)
  • DC06 DST redistribution across the T1s: about 75TB of data distributed with equal weight across all T1s outside CERN. No major problems observed in the first round of transfers (which all finished at the same time), apart from some issues at CERN (logged in the T0 section of this report).

(transfer plot: xfers.png)

GGUS tickets

T0 sites issues:

  • Some AFS volumes were not accessible from 17:30 yesterday until this morning, affecting LHCb users of the SL5 cluster. The problem disappeared at around 10:00. Discussions about the criticality of AFS and its coverage.
  • Reported some contention on CASTOR, caused by activity coming from the LHCb pit machines (outside the CERN LAN).
  • Problems transferring to SARA at some point last night, most likely due to a disk server at the source. The problem disappeared this morning (no GGUS ticket opened). Sample errors:
SOURCE error during TRANSFER phase: [GRIDFTP_ERROR] globus_xio: Unable to connect to lxfsrb4204.cern.ch:20203 globus_xio: System error in connect: Connection refused globus_xio: A system call failed: Connection refused

SOURCE error during TRANSFER_PREPARATION phase: [REQUEST_TIMEOUT] failed to prepare source file in 180 seconds

SOURCE error during TRANSFER_PREPARATION phase: [HTTP_TIMEOUT] service timeout during [srm2__srmPrepareToGet]

T1 sites issues:

  • GridKA: issue with shared area yesterday.

12th November 2009 (Thursday)

Experiment activities:

  • Just 500 jobs in the system, split between user, SAM, and a few remaining stripping jobs.
  • A couple of large productions coming soon (50M events)

GGUS tickets

T0 sites issues:

  • volhcb02 and volhcb06 issue: agreed to move them today (14:00)

T1 sites issues:

  • RAL: shared-area issue under investigation (more information requested this morning)
  • SARA: exit code 134 (jobs killed abruptly by the system); the cause is under investigation on the LHCb application side

11th November 2009 (Wednesday)

Experiment activities:

  • Only user jobs running in the system, mainly at the T1s (~3K concurrently). No relevant production activities running right now.

GGUS tickets

T0 sites issues:

  • volhcb02 and volhcb06 issue: since FIO needs to install new hardware (arriving soon) in the racks where these machines are currently placed, they need to move them quickly to a temporary location until their services are migrated. LHCb is looking for a possible slot...

T1 sites issues:

  • RAL: observed some perturbation with timeouts accessing software on the shared area. Aware of the at-risk intervention at RAL, but the problem occurred outside the scheduled intervention (last night).
  • NIKHEF: yesterday still issues with the ROOT application resolving HOME, apparently still related to the nscd daemon being stuck on some WNs following the OS kernel upgrade.
  • SARA: once the CE was unbanned, jobs (L0HLT stripping) again started to fail with a specific application error (134). The problem affects jobs there, but also (with a different application) a few other, mainly Russian, sites.

10th November 2009 (Tuesday)

Experiment activities:

  • A few stripping jobs still to run at NL-T1 (down last week), plus a bit more L0HLT physics-stripping activity: a very low level of activity overall.
  • Clean-up of old MC productions ongoing.
  • Introducing into the production mask a few CREAM CEs that proved to run OK in previous tests.

GGUS tickets

T0 sites issues:

  • volhcb02 and volhcb06 targeted for retirement on the 15th of November. The announcement was (perhaps) overlooked; now running in emergency mode and needing two new equivalent machines.

T1 sites issues:

  • CNAF: problems deleting some remote directories on CASTOR
  • GridKA: problems deleting three directories on the dCache instance

T2 sites issues:

  • Shared area issue at Milan site

9th November 2009 (Monday)

Experiment activities:

  • All remaining productions ran over the weekend. Not much activity in the system now; the only notable item is one production whose jobs are stalled almost everywhere.
  • No new activities envisaged this week (maybe some FEST reconstruction at a very low level). This should be a quiet week.
  • Last week's stripping exercise (50 input files per job) ran very smoothly, with just one job stalled. NL-T1 was out of the mask because it was in downtime, but in general the jobs ran happily at all the dCache sites where in the past they were failing and being killed by the watchdog mechanism.

GGUS tickets

T0 sites issues:

  • The CASTORLHCB issue on Friday evening was understood and fixed: LSF had a confused view of which services were mapped to which disk pools, which made CASTOR unresponsive.

T1 sites issues:

  • NIKHEF: some WNs (following the recent kernel upgrade) were causing the ROOT application to fail to determine the user HOME. Promptly fixed on Saturday.
  • IN2P3: down for a dCache upgrade
  • CNAF: issue listing directories (very slow) under investigation

T2 sites issues:

6th November 2009 (Friday)

Experiment activities:

  • L0HLT stripping activity launched yesterday over all 25K merged files (from the 1-billion-event minimum-bias MC production).
  • MC09 production running at a pace of 3K jobs in the system now

GGUS tickets

T0 sites issues:

  • VOBOX machines at CERN rebooted this morning because of the kernel patch (security vulnerability)
  • CASTORLHCB issue (see attached plot: lhcb_castor_0_0_PEND_RUNSTACKEDP_1.gif)

T1 sites issues:

  • CNAF: StoRM problems listing a directory with gfal_ls
  • SARA: many jobs failing with the Gauss 134 error, which seems to be an application problem but occurs only at some sites. Under investigation.

T2 sites issues:

5th November 2009 (Thursday)

Experiment activities:

  • This afternoon a few test stripping jobs over the data; awaiting the physics groups' decision on how to proceed.
  • EXPRESS-stream FEST reconstruction jobs running at a low level: just a CERN exercise.
  • MC09 production running at a pace of 5K jobs in the system now

GGUS tickets

T0 sites issues:

  • Ran the transparent intervention on CASTORLHCB this morning to upgrade to 2.1.8-15

T1 sites issues:

  • SARA: all transfers failing there; tURL resolution timing out. Problem fixed by disabling the new tape-protection feature.

T2 sites issues:

4th November 2009 (Wednesday)

Experiment activities:

  • FEST week. About 20 runs processed so far, but running at a very low level because of real detector activities currently going on at the pit.
  • MC production activities running at a low profile.
  • A lot of sites complained that the SAM test checking VOMS certificates was failing. The instance lcg-voms.cern.ch was not in the VO ID card for LHCb (now added).

GGUS tickets

T0 sites issues:

  • Agreed to run the transparent intervention on CASTORLHCB to upgrade to 2.1.8-15
  • The LSF server node was overloaded by the huge number of polling requests from the "time left" utility used by pilot jobs to check the remaining time and ceilings. This has been temporarily stopped on the DIRAC side, awaiting improved logic.
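
One way the improved logic could avoid hammering the batch system is a short-lived cache around the expensive "time left" query, so each pilot triggers at most one real query per time window. A hypothetical sketch, not DIRAC's actual code:

```python
import time

class CachedTimeLeft:
    """Throttle queries to the batch system: re-ask at most once per
    `ttl` seconds, and in between return the cached answer decremented
    by the wall-clock time elapsed since it was fetched.
    Hypothetical illustration, not DIRAC code."""

    def __init__(self, query_func, ttl=300.0):
        self.query_func = query_func  # expensive call, e.g. to the LSF master
        self.ttl = ttl                # seconds to reuse the cached value
        self._value = None
        self._stamp = None

    def get(self):
        now = time.time()
        if self._stamp is None or now - self._stamp > self.ttl:
            self._value = self.query_func()  # one real batch-system query
            self._stamp = now
        # subtract the time elapsed since the cached query was made
        return self._value - (time.time() - self._stamp)
```

A pilot polling its ceiling every few seconds would then cause at most one real batch-system query per `ttl` window, instead of one per poll.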

T1 sites issues:

  • SARA: all transfers failing there; tURL resolution timing out.
  • Lyon long-jobs-on-the-short-queue issue: understood to be (most likely) a problem in the "time left" utility used by DIRAC, which over-estimated the remaining time on the queue and therefore pulled too-long payloads. A DIRAC patch has been in production since the 27th. Also resolved the issue of memory vs. queue length for all queues in Lyon (except the short one).
  • NIKHEF batch farm not available because of a network router configuration issue

T2 sites issues:

  • A lot of jobs aborting with an internal application error (Gauss 134) that has to be investigated closely. This is particularly important at sites recently upgraded to SL5.

2nd November 2009 (Monday)

Experiment activities:

  • Production activities low over the weekend. Starting up again today.

GGUS tickets

  • GGUS was not working this morning; tickets were not being updated with comments.
  • 2 new tickets since yesterday
    • T0: 0
    • T1: 0
    • T2: 1
  • List of all open tickets

T0 sites issues:

T1 sites issues:

T2 sites issues:

  • Shared area problems (as always)
  • Jobs aborted at IN2P3-LPC.

-- RobertoSantinel - 01-Dec-2009

Topic revision: r3 - 2010-01-29 - unknown
 