May 2009 Reports

29th May (Friday)

Experiment activities

  • LHCb week this week.
  • LHCb merging job failures at dCache sites now understood. Testing ongoing.

Top 3 site-related comments of the day: 1. NL-T1 dCache causing transfers to fail.
2. Jobs unable to access the SQLite DB on the NL-T1 NFS shared area; all jobs failing (see the sketch after this list).
3. Data access problems at IN2P3. Being investigated.
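
A minimal reproducer for item 2 is sketched below. It is a hypothetical check (not part of the LHCb/DIRAC tooling) that opens an SQLite file on the NFS-mounted shared area and takes a write lock, the operation that depends on working NFS file locking; the database path is a placeholder, not the real DB file.

    #!/usr/bin/env python
    # Hypothetical reproducer: open an SQLite file on the NFS shared area and
    # take a write lock; broken NFS locking makes this fail for every job.
    import sqlite3

    db_path = "/path/on/nfs/shared/area/test.db"   # placeholder path

    try:
        conn = sqlite3.connect(db_path, timeout=10)
        conn.execute("BEGIN IMMEDIATE")   # acquires a write lock via file locking
        conn.rollback()
        conn.close()
        print("SQLite open and lock on the shared area OK")
    except sqlite3.OperationalError as err:
        print("SQLite access failed (typical of broken NFS locking): %s" % err)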

GGUS (or Remedy) tickets

T0 sites issues:


T1 sites issues:

  • [OPEN - since 29 May 09 ] NL-T1 dCache causing FTS transfers to fail.

T2 sites issues:


28th May (Thursday)

Experiment activities

  • MC09 production continuing at T0,1,2 sites.
  • LHCb week this week.

Top 3 site-related comments of the day: 1. Data access problems at IN2P3. Being investigated.
2. NIKHEF-T1 went into unscheduled downtime
3. Pilots aborting at some T2s

GGUS (or Remedy) tickets

T0 sites issues:


T1 sites issues:


T2 sites issues:


27th May (Wednesday)

Experiment activities

  • MC09 production continuing at T0,1,2 sites.
  • Merging of data files at T1s.

Top 3 site-related comments of the day: 1. Data access problems at IN2P3. Being investigated.
2. Shared area problems at some T2s (see the sketch after this list)
3.
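
For the shared-area problems reported at T2s, a minimal worker-node sanity check of the kind sketched below can help sites verify the area. It is a hypothetical probe, not an official SAM test; it only assumes the standard VO_LHCB_SW_DIR environment variable.

    #!/usr/bin/env python
    # Hypothetical worker-node probe: verify that the shared software area is
    # present, listable and not full.
    import os

    sw_dir = os.environ.get("VO_LHCB_SW_DIR")
    if not sw_dir or not os.path.isdir(sw_dir):
        raise SystemExit("VO_LHCB_SW_DIR not set or not a directory: %r" % sw_dir)

    entries = os.listdir(sw_dir)              # catches permissions / stale NFS mounts
    stat = os.statvfs(sw_dir)
    free_gb = stat.f_bavail * stat.f_frsize / float(2 ** 30)

    print("%s: %d entries, %.1f GB free" % (sw_dir, len(entries), free_gb))
    if free_gb < 1:
        print("WARNING: shared software area is nearly full")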

GGUS (or Remedy) tickets

T0 sites issues:


T1 sites issues:

  • Ongoing problem of accessing data on dCache from LHCb applications.

T2 sites issues:


26th May (Tuesday)

Experiment activities

  • MC09 production continuing at T0,1,2 sites.

Top 3 site-related comments of the day: 1. Data access problems at IN2P3. Being investigated.
2. CRL installed on the CERN UI not properly updated this morning
3. Shared area problems at some T2s

GGUS (or Remedy) tickets

T0 sites issues:

  • CRL on the AFS UI not properly installed under X509_CERT_DIR (a quick check is sketched below).
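
The sketch below is a hypothetical helper (not part of the LHCb tooling) that lists the CRLs under X509_CERT_DIR whose nextUpdate has already passed, i.e. the expired CRLs that break grid authentication on the UI.

    #!/usr/bin/env python
    # Hypothetical helper: report expired CRLs under X509_CERT_DIR.
    import calendar
    import os
    import subprocess
    import time

    cert_dir = os.environ.get("X509_CERT_DIR", "/etc/grid-security/certificates")

    for name in sorted(os.listdir(cert_dir)):
        if not name.endswith(".r0"):          # CRLs are installed as <hash>.r0
            continue
        path = os.path.join(cert_dir, name)
        # openssl prints e.g. "nextUpdate=May 30 12:00:00 2009 GMT"
        out = subprocess.Popen(
            ["openssl", "crl", "-in", path, "-noout", "-nextupdate"],
            stdout=subprocess.PIPE).communicate()[0].decode()
        stamp = out.strip().split("=", 1)[1]
        expiry = calendar.timegm(time.strptime(stamp, "%b %d %H:%M:%S %Y %Z"))
        if expiry < time.time():
            print("EXPIRED CRL: %s (nextUpdate %s)" % (name, stamp))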

T1 sites issues:

  • Ongoing problem of accessing data on dCache from LHCb applications.

T2 sites issues:


25th May (Monday)

Experiment activities

  • MC09 production continuing at T0,1,2 sites.

Top 3 site-related comments of the day: 1. CNAF disk server has expired CRLs: transfers failing
2. IN2P3 going into unscheduled downtime (now resolved)
3. Jobs are failing at NIKHEF due to a problem accessing shared area

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - since 24 May 09 ] CNAF disk server has expired CRLs. GGUS ticket not actioned.
  • [OPEN - since 25 May 09 ] Jobs are failing at NIKHEF due to a problem accessing shared area.

T2 sites issues:

  • [OPEN - since 25 May 09 ] Pilots aborting at INFN-BARI and IN2P3-LPC
  • [OPEN - since 21/2 May 09 ] Jobs fail to get VOMS proxy extension at some T2s (see the sketch below)
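
The sketch below is a hypothetical worker-node check (not the actual pilot code) that inspects the proxy delivered to a job for its lhcb VOMS attributes, which is what the failing jobs are missing.

    #!/usr/bin/env python
    # Hypothetical check: confirm that the job's proxy still carries lhcb VOMS
    # attributes; a bare proxy with no attributes reproduces the failure.
    import os
    import subprocess

    proxy = os.environ.get("X509_USER_PROXY", "/tmp/x509up_u%d" % os.getuid())
    out = subprocess.Popen(["voms-proxy-info", "-all", "-file", proxy],
                           stdout=subprocess.PIPE).communicate()[0].decode()

    print(out)
    if "/lhcb" not in out:
        print("ERROR: no lhcb VOMS attribute found in proxy %s" % proxy)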

22nd May (Friday)

Experiment activities

  • MC09 production continuing at T0,1,2 sites.

Top 3 site-related comments of the day: 1. CNAF disk server has expired CRLs: transfers failing
2. Jobs fail to get voms proxy extension at some T2s
3.

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - since 20 May 09 ] CNAF disk server has expired CRLs. GGUS ticket not closed yet.

T2 sites issues:

  • [OPEN - since 21/2 May 09 ] Jobs fail to get voms proxy extension at some T2s
  • [OPEN - since 20 May 09 ] Black hole node found at BMEGrid
  • [OPEN - since 20 May 09 ] No space left at PEARL-AMU
  • [OPEN - since 20 May 09 ] Shared software area problems at PISA
  • [OPEN - since 20 May 09 ] Some other minor T2 issues

20th May (Wednesday)

Experiment activities


  • MC09 production continuing at T0,1,2 sites.

Top 3 site-related comments of the day: 1. CNAF disk server has expired CRLs: transfers failing
2. RAL CASTOR is down for maintenance which is limiting our MC09 activity (now resolved)
3. Black hole node detected at BMEGrid

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - since 20 May 09 ] CNAF disk server has expired CRLs.
  • RAL CASTOR is down for maintenance which is limiting our MC09 activity. Site is now back and jobs have been rescheduled.

T2 sites issues:

  • [OPEN - since 20 May 09 ] Black hole node found at BMEGrid
  • [OPEN - since 20 May 09 ] No space left at PEARL-AMU
  • [OPEN - since 20 May 09 ] Shared software area problems at PISA
  • [OPEN - since 20 May 09 ] Some other minor T2 issues

19 May (Tuesday)

Notes: Agreed on the CASTOR intervention next Wednesday. CASTOR LHCb will be unavailable for about 4 hours.

Experiment activities

  • User analysis ongoing at CERN and T1s. No production activities ongoing. A small MC09 sample was run last night and went through absolutely smoothly. MC09 will restart today.

Top 3 site-related comments of the day: 1. Aborted pilot at Gazon (Nikhef)
2.
3.

GGUS (or Remedy) tickets

T0 sites issues:


T1 sites issues:

  • [CLOSED - 18th of May ] IN2P3: many files not available because a disk server was down. SRM should be able to handle this situation.

T2 sites issues:

  • Shared area issues and VOMS extension extraction problems at some sites

18 May (Monday)


Previous week summary

Notes:

GGUS problems prevented access to the portal for submitting tickets. Mailing helpdesk@ggus.org also did not work.

Experiment activities

  • User analysis ongoing at CERN and T1s. No production activities ongoing. MC09 production pending physics group approval.


Top 3 site-related comments of the day: 1. GGUS major OUTAGE preventing ticket submission
2. Files unavailable at IN2p3
3.

GGUS (or Remedy) tickets

T0 sites issues:


T1 sites issues:

  • [OPEN - 18th of May ] IN2P3: many files not available. The LHCb contact person at Lyon is looking into it.

T2 sites issues:

  • Shared area issues and VOMS extension extraction problems at some sites

15 May (Friday)

Experiment activities

  • FEST full-stream production has run after the green light from the Data Quality team. We mainly have problems (to be investigated with application experts) at NIKHEF and IN2P3: the application (Brunel) crashed there when trying to access RAW files via gsidcap. Everywhere else OK.

Top 3 site-related comments of the day: 1. Brunel crashes at IN2p3
2. Brunel crashes at NIKHEF
3. File access problems at GridKA

GGUS (or Remedy) tickets

T0 sites issues:

  • [CLOSED - 15th of May ] LFC exhausted its server threads yesterday at about 22:00 CET, due again to the Persistency LFC interface, which has to be re-engineered.

T1 sites issues:

  • [OPEN - 15th of May ] dCache at GridKA problems
  • [OPEN - 12th of May ] WMS at RAL problems
  • [OPEN - 12th of May ] WMS at pic problems
  • [CLOSED - 14th of May ] Issue with pic yesterday afternoon due to the cooling system power being switched off. Some DIRAC services hosted there suffered from this.

T2 sites issues:


14 May (Thursday)

Experiment activities

  • FEST is stuck between ONLINE and CASTOR. It looks like a problem with the LHCb Bookkeeping service. No big production activity going on.

Remark:

  1. The stability of the WMS observed at the T1s is far from what would be desirable
  2. RAL: over the last days we suffered somewhat from the network intervention there, with some user jobs stalling (no heartbeat sent back)

Top 3 site-related comments of the day: 1. WMS issue at RAL
2. WMS issues at pic
3.

GGUS (or Remedy) tickets

T0 sites issues:


T1 sites issues:

  • [OPEN - 12th of May ] WMS at RAL problems
  • [OPEN - 12th of May ] WMS at pic problems

T2 sites issues:


13 May (Wednesday)

Experiment activities

  • This week is a FEST week (full experiment exercise) and reconstruction jobs expected soon at T1.
  • MC09 production (testing new versions of LHCb software)

Remark:

  1. SRM upgrade at CERN yesterday caused problems retrieving multiple TURLs in one request. SRM was downgraded and a Savannah bug submitted to the developers (the attached plot fromCERN2.png shows the behaviour of SRM between the upgrade and the downgrade).
  2. The stability of the WMS observed at the T1s is far from what would be desirable
  3. We found (and reported to Linux support) an issue with the LDAP compat library for SLC5. This potentially affects all SL5 installations outside CERN

Top 3 site-related comments of the day: 1. SRM upgrade at CERN yesterday to 2.7-17 not transparent
2.
3.

GGUS (or Remedy) tickets

T0 sites issues:

  • [OPEN - 7th of May ] CERN UI has to be updated
  • [OPEN - 13th of May ] srm-lhcb downgraded in the morning. Multiple TURLs were available again after the downgrade. Savannah ticket opened to the developers.

T1 sites issues:

  • [OPEN - 12th of May ] WMS at SARA problems
  • [OPEN - 12th of May ] WMS at RAL problems
  • [OPEN - 12th of May ] WMS at pic problems

T2 sites issues:
Shared area problems, pilots aborting, SQLite DB access problems.

12 May (Tuesday)

Experiment activities

  • This week is a FEST week (full experiment exercise) and reconstruction jobs expected soon at T1.
  • SAM suite and chaotic analysis activity running at this time
  • While verifying the gLExec installation at Lyon we spotted some configuration problems there. Working on it.

Remark:

  1. The stability of the WMS observed at the T1s is far from what would be desirable
  2. We found (and reported to Linux support) an issue with the LDAP compat library for SLC5. This potentially affects all SL5 installations outside CERN

Top 3 site-related comments of the day: 1. WMS issue at SARA
2. WMS issues at RAL
3. WMS issue at pic

GGUS (or Remedy) tickets

T0 sites issues:

  • [OPEN - 7th of May ] CERN UI has to be updated

T1 sites issues:

  • [OPEN - 12th of May ] WMS at SARA problems
  • [OPEN - 12th of May ] WMS at RAL problems
  • [OPEN - 12th of May ] WMS at pic problems

T2 sites issues:
Shared area problems, pilots aborting, SQLite DB access problems.

11 May (Monday)

Experiment activities

  • No Production Activity on going. This week is a FEST week (full experiment exercise).
  • SAM suite and chaotic analysis activity running at this time
  • One well-controlled user analysis is run regularly three times per week to test file access at the T1s. This aims to reproduce the analysis use case in a controlled way.

Top 3 site-related comments of the day: 1. CVS Problem at CERN again
2.
3.

GGUS (or Remedy) tickets

T0 sites issues:

  • [OPEN - 7th of May ] CERN UI has to be updated
  • [OPEN - 11th of May ] CERN CVS problems again

T1 sites issues:

  • Issue with SARA SRM, apparently due to some disk pools not working (see the attached plot).

T2 sites issues:
Shared area problems, pilots aborting, SQLite DB access problems.


8th May (Friday)

Experiment activities

  • MC09 and the merging activity are running to completion (a few pending jobs at GridKA and CNAF, mainly due to problems over the last days). Data access problems at CNAF have been fixed.

Top 3 site-related comments of the day: 1. Some GridKA WNs have ulimit set to 4 GB, which prevents writing large (~5 GB) output; many jobs are failing there
2. WMS at the T1s failing with different messages.
3. The production UI at CERN must be upgraded to correct the problem with the client hanging when a SOAP message is returned

GGUS (or Remedy) tickets

T0 sites issues:

  • [OPEN - 7th of May ] CERN UI has to be updated

T1 sites issues:

  • [OPEN - 8th of May ] GridKA WNs with ulimit set to 4 GB (see the sketch after this list)
  • [OPEN - 8th of April ] WMS at pic and GridKA problems
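
A minimal probe of the kind sketched below (a hypothetical check, not an official test) shows the file-size limit a job would see on such a worker node and warns when a ~5 GB output file could not be written.

    #!/usr/bin/env python
    # Hypothetical worker-node probe: report RLIMIT_FSIZE and warn if it would
    # truncate a ~5 GB merged output file.
    import resource

    soft, hard = resource.getrlimit(resource.RLIMIT_FSIZE)

    def fmt(limit):
        if limit == resource.RLIM_INFINITY:
            return "unlimited"
        return "%.1f GB" % (limit / float(2 ** 30))

    print("RLIMIT_FSIZE: soft=%s hard=%s" % (fmt(soft), fmt(hard)))

    needed = 5 * 2 ** 30   # a merged output file is about 5 GB
    if soft != resource.RLIM_INFINITY and soft < needed:
        print("WARNING: writing a ~5 GB output file would fail on this node")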

T2 sites issues:

Issues at several sites retrieving user voms extensions

7th May (Thursday)

Experiment activities

  • We have the pleasure to announce that a good fraction of the first batch of minimum-bias MC09 productions has been produced and merged into files of about 5 GB, and the completion of 10 million events per sample is ongoing, relatively quickly. We shall wait for a green light from the PPG on the quality of the data (in particular its usability for the studies to be done with it) before continuing with the "NoTruth" production up to 10^9 events. The attached plot jobs_MC09.bmp shows the distribution of the number of jobs.

Top 3 site-related comments of the day: 1. ACL issues with StoRM at CNAF are penalizing the merging activity, which cannot proceed
2. dCache at GridKA unstable
3.

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 4th of May ] StoRM endpoint returns TURLs that are not accessible. Reopened as a new ticket.
  • [OPEN - 28th of April ] dCache system becoming unresponsive; file access is failing there. Under investigation.

T2 sites issues:

6th May (Wednesday)

Experiment activities

  • MC09 and the merging activity are running to completion (currently ~1K jobs running in total). Data access problems at CNAF are blocking production there.

Top 3 site-related comments of the day: 1. GPFS space at CNAF serving StoRM endpoint seems to become unresponsive
2. SARA data access problem (UNSCHEDULED DOWNTIME)
3. dCache at GridKA unstable

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 4th of May ] StoRM endpoint returns TURLs that are not accessible. Reopened as a new ticket.
  • [OPEN - 28th of April ] dCache system becoming unresponsive. Under investigation.

T2 sites issues:


5th May (Tuesday)

Experiment activities

  • MC09 and merging activity is running (with and without output data saving)

Top 3 site-related comments of the day: 1. GPFS space at CNAF serving StoRM endpoint seems to become unresponsive
2. dCache at GridKA unstable
3. IN2P3 still down

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 4th of May ] StoRM endpoint returns TURLs that are not accessible. Reopened as a new ticket.
  • [OPEN - 28th of April ] dCache system becoming unresponsive. Under investigation.
  • [OPEN - 3rd of May ] IN2P3 down due to cooling problems since Sunday. Not yet fully operational.

T2 sites issues:

  • Pilots aborting at small sites; a few sites with shared area problems

4th May (Monday)

Experiment activities

  • Testing the MC09 and merging activity

Important news

As announced earlier, the exercise of a systematic user analysis (which pairs with the chaotic real one) has also been performed to check the Tier-1 sites and data file access. The results of this exercise can be found at

https://twiki.cern.ch/twiki/bin/viewfile/LHCb/ProductionOperationsWLCGdailyReports?rev=1;filename=Analysis_at_Tier1s.pdf

Top 3 site-related comments of the day: 1. GPFS space at CNAF serving StoRM endpoint seems to become unresponsive
2. dCache at GridKA unstable

GGUS tickets

T0 sites issues:

T1 sites issues:

  • [OPEN - 4th of May ] StoRM endpoint returns TURLs that are not accessible. Reopened as a new ticket.
  • [OPEN - 28th of April ] dCache system becoming unresponsive. Under investigation.

T2 sites issues: Pilots aborting at small sites; a few sites with shared area problems

-- RobertoSantinel - 18 Jun 2009
