January 2010 Reports

29th January 2010 (Friday)

Experiment activities:

  • Stripping on MC09 b and c inclusive events + normal user activity: fewer than 1000 jobs.

GGUS (or RT) tickets:

  • T0: 0
  • T1: 0
  • T2: 0

Problems at the sites

  • T0 sites issues:
    • Noticed yesterday at around 14:00 that all SAM tests against our LFC instances at the T1's were failing systematically. This is almost certainly related to the scheduled intervention for the Oracle security patch on the LHCBR offline database (13:30-15:00). Although announced as rolling, the DBAs did foresee some perturbation of connections during the 90 minutes of the intervention.
    • LHCb has no problem with the CASTOR intervention for the Oracle patch next week. Raised the question of whether jobs already running in the batch system will be frozen or simply left to die/hang (input for a more general point about procedures for the Thursday meeting).
    • VOBOX and extra LFC node: looking for a status update.
  • T1 sites issues:
    • IN2p3: the announced stress test against the new gsidcap client has to be postponed (the test suite from DIRAC is not ready yet). It is unlikely to be ready next week either, due to the software week.
    • RAL: reported that some users (last week) had their jobs killed by the LRMS for exceeding the memory limit.
    • RAL: intervention extended beyond the weekend. Any news on that?
  • T2 sites issues:

Problems with services: we see a lot of non-LHCb-specific tests/probes polluting our production services (LFC and SEs). The list below is the content of a directory under the global LHCb space in the LFC. We kindly ask the responsible parties (the Nagios managers) to agree with LHCb on the convention used, so that everything goes under a test area also available at the SEs. A minimal detection sketch follows the list of entries below.

POSIX-TEST-bohr3226.tier2.hep.manchester.ac.uk-2009-12-02 

POSIX-TEST-cmsdcache.pi.infn.it-2009-12-02 
POSIX-TEST-prod-se-01.pd.infn.it-2009-12-02 
POSIX-TEST-prod-se-02.pd.infn.it-2009-12-02 
POSIX-TEST-se.reef.man.poznan.pl-2009-12-02 
POSIX-TEST-t2-se-00.to.infn.it-2009-12-02
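
A minimal sketch of how such stray entries could be spotted, assuming the standard LFC client command lfc-ls is available, LFC_HOST points at the LHCb catalog and /grid/lhcb is the global LHCb namespace; the POSIX-TEST prefix is taken from the entries above, while the "agreed test area" name is purely illustrative:

    #!/usr/bin/env python
    # Hypothetical check: list entries under the global LHCb LFC namespace and
    # flag probe directories that are not inside an agreed test area.
    import subprocess

    LFC_ROOT = "/grid/lhcb"        # assumed global LHCb namespace in the LFC
    TEST_AREA = "test"             # illustrative agreed test directory name

    entries = subprocess.check_output(["lfc-ls", LFC_ROOT]).decode().split()
    strays = [e for e in entries if e.startswith("POSIX-TEST-") and e != TEST_AREA]
    for e in strays:
        print("stray probe entry: %s/%s" % (LFC_ROOT, e))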

28th January 2010 (Thursday)

Experiment activities:

  • Relaunched the L0HLT1 stripping for b and c inclusive MC09 events with a "corrected" workflow description (fewer input files, to avoid the overly large output files reported yesterday)
GGUS (or RT) tickets:
  • T0: 0
  • T1: 0
  • T2: 3

Problems at the sites

  • T0 sites issues:
    • Looking for news from Steve about RT CT643684 (vobox replacing old Log SE service) and CT654872 (extra FE behind LFC-RO)
    • Verified that the 75 GB files reported yesterday by Jan are not to be considered "normal" but rather the result of a mistake in a stripping production definition, and are therefore to be scrapped.
  • T1 sites issues
    • RAL : Downtime.
    • SARA: several users apparently affected by the storage-node reboot scheduled this morning at SARA, with jobs failing to resolve tURLs.
  • T2 sites issues:
    • shared area problem at AUVERGRID, Milano and Pisa

27th January 2010 (Wednesday)

Experiment activities:

  • System is empty apart from user jobs.
GGUS (or RT) tickets:
  • T0: 0
  • T1: 0
  • T2: 0

Problems at the sites

  • T0 sites issues:
    • Asking about the second machine behind LFC-RO.
    • Status of the request concerning the volhcb06 SE logs replacement (the 5 TB disk machine)?
  • T1 sites issues
    • RAL : Downtime
    • CNAF: CREAM CE problems still observed when using the sgm account (GGUS ticket opened on Monday).
  • T2 sites issues:
    • none

26th January 2010 (Tuesday)

Experiment activities:

  • No scheduled activities running in the system now.
GGUS (or RT) tickets:
  • T0: 0
  • T1: 1
  • T2: 2

Problems at the sites

  • T0 sites issues:
    • CASTOR upgrade this morning.
    • VOBOXes: forced reboot this afternoon for a kernel upgrade.
    • volhcb15 (LHCb log SE service): delivered, but the machine does not seem to have the right partitioning (just one partition for "/" plus the usual OS ones, despite a different partitioning having been explicitly requested).
  • T1 sites issues
    • RAL : Downtime
    • IN2p3: Downtime.
    • PIC: the SE was banned because many users were affected by a dCache pool that had a network problem. The content of the problematic pool has been migrated to a new one.
    • CNAF: CREAM CE problems when using the sgm account.
  • T2 sites issues:
    • Shared area issues both at UKI-SOUTHGRID-RALPP and AUVERGRID

25th January 2010 (Monday)

Experiment activities:

  • Only bb and cc inclusive MC09 stripping in the system now (very few jobs in total at the T1's). Launched an MC simulation, but an application problem was found.
GGUS (or RT) tickets:
  • T0: 0
  • T1: 0
  • T2: 1

Problems at the sites

  • T0 sites issues:
    • nothing
  • T1 sites issues
    • RAL : Downtime
    • IN2p3: Downtime.
    • PIC: an issue with one file for some user analysis jobs. Under investigation by our local contact person.
    • CNAF: migration to TSM is now over.

22nd January 2010 (Friday)

Experiment activities:

  • Only bb and cc inclusive MC09 stripping in the system now (very few jobs in total at the T1's). A few hundred user jobs.
GGUS (or RT) tickets: 1 new ticket since yesterday
  • T0: 0
  • T1: 0
  • T2: 2

Problems at the sites

  • T0 sites issues:
    • nothing
  • T1 sites issues
    • CNAF: still in the middle of a migration (LFC catalog to be updated accordingly).
    • IN2p3: preferred to postpone the test of the new gsidcap client to next week.
  • T2 sites:
    • Shared area and SQLite issues

21st January 2010 (Thursday)

Experiment activities:

  • Reprocessing of 2009 450 GeV data launched. L0HLT MB MC09 stripping finished and a summary is available here: CNAF and GRIDKA (as reported last week) were the most problematic sites. The bb and cc inclusive stripping is in the system now.
GGUS (or RT) tickets: 1 new ticket since yesterday
  • T0: 0
  • T1: 0
  • T2: 1

Problems at the sites

  • T0 sites issues:
    • nothing
  • T1 sites issues
    • CNAF: registering data of the new T1Dx endpoint in LFC.
    • PIC: the low-efficiency jobs observed yesterday were, as suspected, user jobs: their output sandbox upload to CASTOR at RAL was hanging. DIRAC has every possible timeout in place (a sketch of such a timeout guard follows this list). Since the jobs were eventually killed by the LRMS, their stacks are no longer available and no further investigation is possible.
    • RAL: similarly, another user job seems to have consumed 49 seconds of CPU over a wall-clock time of 55 hours.
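
A minimal sketch of the kind of timeout guard mentioned above. This is not the actual DIRAC code: upload_sandbox is a placeholder for the blocking transfer call, the 600-second limit is illustrative, and signal.alarm is POSIX-only.

    import signal

    class TransferTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise TransferTimeout("output sandbox upload exceeded the allowed time")

    def guarded_upload(upload_sandbox, timeout_seconds=600):
        """Run a blocking upload call, aborting it after timeout_seconds."""
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(timeout_seconds)      # arm the watchdog
        try:
            return upload_sandbox()        # placeholder for the real transfer
        finally:
            signal.alarm(0)                # always disarm the watchdog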

20th January 2010 (Wednesday)

Experiment activities:

  • No large activity to report apart from users.
GGUS (or RT) tickets: 2 new tickets since yesterday
  • T0: 1
  • T1: 0
  • T2: 1

Problems at the sites

  • T0 sites issues:
    • LFC 3D replication had a problem yesterday (trapped by our SAM and SLS tests and confirmed by Eva, who received an alarm). In detail, the problem was: "At 16:47 we got an alarm from our monitoring system: the capture process was delayed for 90 minutes. There was a problem receiving the archived log files from the source database on the downstream system." A GGUS team ticket has been submitted.
    • Unless there is a clear signal from the management to the contrary, LHCb agrees to the intervention for the CASTORLHCb upgrade to 2.1.9 on the 26th.
  • T1 sites issues
    • CNAF: Storm successfully upgraded (1.4 --> 1.5) and SAM confirms it is back to life. Data migrated to TSM. Now changing the configuration service and adding these replicas in the LFC. The TxD1 endpoints have been reintegrated.
    • IN2p3: setting up a stress suite to test the patch from dCache for the issue reported with gsidcap at SARA and IN2p3 (which caused these sites to be banned).
  • T2 sites issues:

19th January 2010 (Tuesday)

Experiment activities:

  • Very low-level activity. Re-ran stripping on some of the MC09 minimum-bias events and reconstructed a few COLLISION09 events at 1.08 TeV (9 jobs).
  • Ongoing activities in certifying the new DIRAC production system (based on SVN) over the new h/w delivered for central boxes.
GGUS (or RT) tickets: 2 new tickets since yesterday
  • T0: 0
  • T1: 1
  • T2: 1

Problem at the sites

  • T0 sites issues:
    • volhcb13: the DIRAC partition filled up and the job logging info database got corrupted, causing problems for some users. An "ad hoc" Lemon sensor is in place but was not switched on (a minimal usage-check sketch follows this list).
  • T1 sites issues
    • GRIDKA: confirmed that the CREAM CEs are now mapping sgm users correctly.
    • CNAF: Storm is being upgraded to the recent version that supports TSM.
    • RAL: SRM SAM jobs were failing. This seems related to the new code of the LHCb unit test, which is now being debugged. Rolled back to the old stable code for the critical unit test.
    • RAL: There is another failed diskserver at RAL in the lhcbDst space token.
  • T2 sites issues:
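
A minimal usage-check sketch in the spirit of the "ad hoc" sensor mentioned for volhcb13. This is not the Lemon sensor itself: the mount point and threshold are illustrative, and shutil.disk_usage requires Python 3.3+.

    import shutil

    def partition_nearly_full(path="/opt/dirac", threshold=0.90):
        """Return True when the given partition is above the usage threshold."""
        usage = shutil.disk_usage(path)                    # Python 3.3+
        used_fraction = (usage.total - usage.free) / usage.total
        return used_fraction >= threshold

    if __name__ == "__main__":
        if partition_nearly_full():
            print("WARNING: DIRAC partition nearly full - job logging DB at risk")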

18th January 2010 (Monday)

Experiment activities:

  • There are 4 new productions requested, all for full reconstruction of COLLISION09 data. Only user jobs during the weekend, and now fewer than 1000 jobs in the system.
  • Issue with the SRM unit test. Working on a solution.

GGUS (or RT) tickets: 4 new tickets since Friday

  • T0: 0
  • T1: 0
  • T2: 3

Problem at the sites

  • T0 sites issues:
    • Any news for a second box behind LFC-RO?
  • T1 sites issues
    • CNAF: intervention to migrate CASTOR to TSM (6 TB in total, ~2K files). Everything is ready to proceed; the Storm and CASTOR endpoints at CNAF have been banned for the intervention.
  • T2 sites issues:
    • Imperial College and Manchester shared area issues.
  • Others:
    • Problem with the GGUS portal when submitting a TEAM ticket. A normal GGUS ticket has been opened against GGUS support.

15th January 2010 (Friday)

Experiment activities:

  • There are fewer than a hundred jobs running for the stripping production launched two days ago: these are jobs that did not manage to run at GridKA and CNAF yesterday. A few hundred more jobs in total, counting user jobs.

GGUS (or RT) tickets: 3 new tickets since yesterday

  • T0: 0
  • T1: 1
  • T2: 2

Problem at the sites

  • T0 sites issues:
    • LFC read-only problem: any follow-up on the request to add at least one more machine behind LFC-RO at CERN? DIRAC is now equipped to use the geographically associated T1 LFC read-only instances for catalog queries.
    • Any update on a possible slot to intervene on the LHCb CASTOR stagers and bring in the new privileges schema?
  • T1 sites issues
    • GRIDKA: CREAMCE mapping. Any news?
    • GRIDKA: noticed yesterday that GridKA too was stalling all jobs: either a problem with the dcap server (the tURL was returned but data could not be read) or with the shared area. This morning resubmitted jobs were running, some through to completion. Is anything known to have happened on the site side yesterday afternoon?
    • GRIDKA: one GGUS ticket about the shared area being unreachable (confirmed by SAM).
    • CNAF: confirmed that yesterday's problem was due to GPFS being unavailable. Problem fixed.
  • T2 sites issues:
    • SAM failing at INFN-NAPOLI-CMS and jobs aborting at UKI-SOUTHGRID-BHAM-HEP

14th January 2010 (Thursday)

Experiment activities:

  • A few thousand user jobs (among them the users killing the RO-LFC at CERN).
  • One stripping production active, running at a pace of a few hundred jobs.

GGUS (or RT) tickets: 0 new tickets since yesterday

  • T0: 0
  • T1: 0
  • T2: 0

Problem at the sites

  • T0 sites issues:
    • The LFC read-only problem has been traced to the CORAL interface used by some users' applications.
      • Possible solution: update the LFC interface of CORAL, plus increase the hardware and number of threads behind the lfc-lhcb-ro instance at CERN. DIRAC will also start to use all the T1 instances, so far unused: until now we did not feel confident in using our mirrors because of the incoherent replication status often observed; the 3D people vouching for the integrity of the mirrors is one more reason to start using them (a failover sketch follows this list).
      • Before Christmas we were left with the idea of intervening on our CASTOR stager to implement the ACLs successfully tested in November. Is there any scheduled intervention in the pipeline whose slot could also be used to bring this new permission schema into production?
  • T1 sites issues
    • GRIDKA: the lcgadmin VOMS role is not correctly mapped to the *sgm account via the CREAM CE at GRIDKA. This mapping works elsewhere.
    • CNAF: we had all stripping jobs failing there. It looks like a load problem with Storm: apparently the jobs managed to retrieve the tURL of the file (GPFS file://) but then got stuck accessing the data, with no further updates. Looking for more information; a GGUS ticket will then be submitted.
  • T2 sites issues:
    • none
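
A minimal sketch of the catalog failover idea described above. This is not the actual DIRAC implementation: query_catalog is a placeholder for a real LFC query, the CERN host name is assumed from the lfc-lhcb-ro instance mentioned in this report, and the mirror host name is a pure placeholder.

    # Try the CERN read-only LFC first, then fall back to the T1 mirrors.
    CATALOG_HOSTS = [
        "lfc-lhcb-ro.cern.ch",            # CERN read-only instance (assumed name)
        "t1-lfc-mirror.example.org",      # placeholder for a Tier-1 mirror
    ]

    def query_with_failover(query_catalog, lfn):
        """Return the first successful catalog reply, trying hosts in order."""
        last_error = None
        for host in CATALOG_HOSTS:
            try:
                return query_catalog(host, lfn)   # may raise on timeout/error
            except Exception as exc:
                last_error = exc                  # remember and try the next host
        raise last_error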

13th January 2010 (Wednesday)

Experiment activities:

  • There are about 2K user jobs running in the system and 6K more are waiting to be picked up by pilots
  • Just one stripping production is active now, but no jobs have been created yet

GGUS (or RT) tickets: 2 new tickets since yesterday

  • T0: 1
  • T1: 1
  • T2: 0

Problem at the sites

  • T0 sites issues:
    • The LFC read-only instance shows problems, with many connections simply timing out. The problem is intermittent, however. GGUS ticket opened.
  • T1 sites issues:
    • GRIDKA: the lcgadmin VOMS role is not correctly mapped to the *sgm account via the CREAM CE at GRIDKA.
  • T2 sites issues:
    • none

12th January 2010 (Tuesday)

Experiment activities:

  • No major activity to report. CREAM CEs passing SAM tests are now included in the production mask, contributing to production activities.
  • DPM tests for user analysis at two large sites.

GGUS (or RT) tickets: 0 new tickets since yesterday

  • T0: 0
  • T1: 0
  • T2: 0

Problem at the sites

  • T0 sites issues:
    • none
  • T1 sites issues:
    • none
  • T2 sites issues:
    • none

11th January 2010 (Monday)

Experiment activities:

  • no major activity to be reported

GGUS (or RT) tickets: 1 new ticket since Friday

  • T0: 0
  • T1: 0
  • T2: 1

Problem at the sites

  • T0 sites issues:
    • none
  • T1 sites issues:
    • PIC: both problematic WMSes (rb01 and rb03) have been promptly replaced by two new instances that are happily passing SAM tests for lhcb.
    • RAL: one of the two WMSes was no longer submitting CondorG jobs because of too many jobs piled up in the ICE queue. This is the same issue we already ran into at CERN before Christmas, for which a patch has already been released. Looking at SAM afterwards, the problem had disappeared.
    • SARA: ICE queue emptied and the WMS is back to life.
    • IN2p3: the problem reported before Christmas, due to a third-party library used with gsidcap, has been fixed by the developers but needs some more testing. In the meantime Lyon moved to the dcap protocol and the SE can be unbanned in the LHCb production mask.
  • T2 sites issues:
    • Shared area issue

8th January 2010 (Friday)

Experiment activities:

  • ~1.5K jobs waiting/running in the system, exclusively for user analysis. The 4 MC productions run yesterday completed without major issues, apart from the usual 20% failure rate due to the known application problem (whose fix is in SVN).

GGUS (or RT) tickets: 4 new tickets since yesterday

  • T0: 1
  • T1: 3
  • T2: 0

Problem at the sites

  • T0 sites issues:
    • wms203.cern.ch turns out to be unavailable (draining); from the monitoring page this appears to be due to a high number of scheduled ICE jobs.
  • T1 sites issues:
    • PIC: both WMSes are no longer submitting CondorG jobs because of too many jobs piled up in the ICE queue (1500). This is an issue we already ran into at CERN before Christmas, for which a patch has already been released.
    • GridKA: one of the WMSes there is unavailable because it is overloaded.
    • SARA: the WMS is no longer submitting CondorG jobs because of too many jobs piled up in the ICE queue (1500) (see PIC).
  • T2 sites issues: none reported

7th January 2010 (Thursday)

Experiment activities:

  • Because of a wrong date format (a problem discovered by Pablo on the 31st of December), the LHCb SSB was not publishing fresh results for any of its views until yesterday. The problem has been sorted out.
  • ~2.5K jobs waiting/running in the system between user analysis and 4 new MC productions launched this morning

GGUS (or RT) tickets: 0 new tickets since yesterday

  • T0: 0
  • T1: 0
  • T2: 0

Problem at the sites

  • T0 sites issues: none
  • T1 sites issues:
    • RAL: it looks like there was a second LHCb disk server where FSPROBE picked up an error over the Christmas holiday (part of the LHCb_M-DST space token at RAL). RAL people would like to validate the checksums and asked us to provide our values for the files on that server.
    • CNAF: discussions/evaluation about migrating CASTOR to TSM. It concerns about 6 TB of data (one night to copy everything).
  • T2 sites issues:

6th January 2010 (Wednesday)

Experiment activities:


  • There are slightly more than 1K jobs running in the system from user analysis. Very high efficiency (success/total in the last 24h/7 days); no user complaints.

GGUS (or RT) tickets: 0 new tickets since yesterday

  • T0: 0
  • T1: 0
  • T2: 0

Problem at the sites

  • T0 sites issues: none
  • T1 sites issues:
    • RAL sent the list of file checksums on the 25th of December. Any mismatch found should be notified to LHCb.
  • T2 sites issues:

5th January 2010 (Tuesday)

Experiment activities:

  • System currently almost empty (250 user jobs)
  • Breakdown of Activities over Xmas (17/12/2009-05/01/2010, from DIRAC accounting).
    • System almost unattended:
    • MC Simulation: 8705 jobs, with the usual 25% finishing with application errors (cause found by the Core Application developers; the fix will be in the next release)
    • Reconstruction/Reprocessing/(MC)Stripping/Merging: ran 119 jobs, 2 failed with application problems
    • User: 62760 jobs, ~10% failed with: Input Data resolution (2K), Application Problem (1.8K), Job stalled (1.5K), Match making failure (1.2K) and Input Data not available (600) (a quick cross-check follows this list)
  • GGUS tickets: 7 tickets since December the 17th
    • 3 solved properly (all about shared area issues at T2)
    • 1 ticket in progress since the 28th at RAL (54212, about files not being available)
    • 1 ticket assigned to CNAF-T2 (54364, although it should have been followed by the LHCb support unit; a mistake of the TPM)
    • 1 ticket in progress at a Russian site (54372, shared object not available)
    • 1 ticket recently assigned to Bologna for a shared area issue
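
As a quick arithmetic cross-check of the user-job figures quoted above (the numbers are taken directly from the breakdown in this report; the listed categories add up to roughly 11%, consistent with the quoted ~10%):

    total_user_jobs = 62760
    failures = {
        "Input Data resolution": 2000,
        "Application Problem": 1800,
        "Job stalled": 1500,
        "Match making failure": 1200,
        "Input Data not available": 600,
    }
    failed = sum(failures.values())   # 7100 jobs
    print("failed fraction: %.1f%%" % (100.0 * failed / total_user_jobs))  # ~11.3%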

-- RobertoSantinel - 29-Jan-2010
