July 2009 Reports

31st July (Friday)


Experiment activities

LHCb is at rest.
Productions are 'on hold' pending the installation of the requested disk space at CERN, so that the backlog of already produced data can be cleared before resuming.
This backlog of requests is also holding up the output data validation activity for productions that otherwise look to be complete.

GGUS (or Remedy) tickets since yesterday:


T0 sites issues:

CERN: pending the installation of more disk capacity.

T1 sites issues:

IN2p3 is publishing in the BDII in a way that prevents the LHCb agents from seeing it properly (the GlueClusterUniqueID differs from the name of the CE host).
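
A minimal sketch (assuming python-ldap and a reachable top-level BDII; the endpoint below is a placeholder, not an official LHCb tool) of how such a mismatch between the GlueClusterUniqueID and the CE host can be spotted:

    # Sketch only: flag CEs whose published GlueClusterUniqueID differs from the CE host.
    import ldap  # python-ldap

    BDII = "ldap://lcg-bdii.example.org:2170"   # placeholder top-level BDII endpoint
    BASE = "o=grid"

    con = ldap.initialize(BDII)
    for dn, entry in con.search_s(BASE, ldap.SCOPE_SUBTREE, "(objectClass=GlueCE)",
                                  ["GlueCEUniqueID", "GlueForeignKey"]):
        # GlueCEUniqueID looks like "host:port/jobmanager-...", so take the host part
        ce_host = entry["GlueCEUniqueID"][0].decode().split(":")[0]
        for fk in entry.get("GlueForeignKey", []):
            fk = fk.decode()
            if fk.startswith("GlueClusterUniqueID="):
                cluster_id = fk.split("=", 1)[1]
                if cluster_id != ce_host:
                    print("Mismatch: CE %s publishes GlueClusterUniqueID=%s" % (ce_host, cluster_id))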

T2 sites issues:

A couple of sites in the UK have a large number of jobs failing.

30th July (Thursday)


Experiment activities

Productions, which in the previous days were running at full regime, have now been stopped because the failover is also getting full at all T1's, as a result of the lhcbdata space at CERN being full.
CERN disk capacity is indeed today the main problem (also according to SLS) and has become a show stopper. LHCb would like to hear from FIO what the plans are to provide the extra 100TB agreed on Monday.

GGUS (or Remedy) tickets since yesterday:


T0 sites issues:
LFC: no problem on the LFC server side (everything there is under-used). Most likely a problem with the client host, which is particularly overloaded, or with the network in between.

RDST: intermittent problem due to low disk capacity and an SRM that is not shielded enough, whose threads are busy waiting for the stager to service the PUT requests.

T1 sites issues:

Working on the SQLite issue with CNAF people.

29th July (Wednesday)


Experiment activities
There are currently 16K jobs running concurrently for 6 different physics MC productions. The picture gives a snapshot of the jobs running over the last 24 hours.



GGUS (or Remedy) tickets since yesterday:


T0 sites issues:


Intermittent timeouts retrieving the tURL from SRM at CERN (verified for the RDST space; most probably a general issue).
Observed LFC access timeouts this morning. SLS does not seem to indicate problems.

T1 sites issues:

Issue with the SQLite DB file in the shared area: provided CNAF people with all the suggestions to fix this problem as done at GridKA. Waiting for them.

28th July (Tuesday)


Experiment activities
Many different production activities are proceeding smoothly in general; even when the "lhcbdata" disk pool was full, all files went into failover and were recovered. Agreed to increase the disk capacity at CERN by 100TB, since the 2009 pledged resources are not yet available on site. Running a very sustained rate of concurrent jobs (>16K).

CNAF is still banned due to standard NFS problems. The operator reports that the GGUS ticket has been closed, with the advice not to use the shared area for SQLite DB files. Unfortunately, from our side the fix does not apply to all currently used software versions, so we are somewhat stuck. Most other sites have in fact solved this problem.

Furthermore, please note that:



GGUS (or Remedy) tickets since yesterday:


T0 sites issues:

The lhcbdata pool got full and more disk capacity has been requested. A disk server has now been moved there to allow draining the failover, which is full of requests to transfer to CERN.
T1 sites issues:

21st July (Tuesday)


Experiment activities

  • Further signal productions submitted.
  • The post mortem on our FEST exercise (last week) can be found at RAL, CERN, PIC, GRIDKA, CNAF, IN2P3.


GGUS (or Remedy) tickets since Friday:


T0 sites issues:


T1 sites issues:


20th July (Monday)


Experiment activities

  • Here is the link to last week's report from the shifter.
  • The 1-billion-event MC production is now running its last 500 jobs. A few further signal productions were submitted, plus a long tail of reprocessing activity last week (500 jobs at NIKHEF).



GGUS (or Remedy) tickets since Friday:


T0 sites issues:


T1 sites issues:


17th July (Friday)


Experiment activities

  • This page shows the progress of the staging activity. IN2p3 finished transferring the (accidentally scrapped) data yesterday; the data migrated to MSS this morning and will be ready this afternoon to be staged, so reprocessing can run at Lyon too.
  • The reprocessing activity will last until this Saturday.
  • For the problem reported yesterday at RAL and CNAF about the limited number of disk-server connections available, it has been decided to temporarily copy the file to the WN and access it from there.
  • This activity, on top of the MC09, user analysis and SAM jobs, brings a new astonishing record: 20K jobs running concurrently in the system.
(Plot: Running_jobs.png - concurrently running jobs)



GGUS (or Remedy) tickets since Friday:


T0 sites issues:


T1 sites issues:

- RAL and CNAF problem: the temporary workaround is to copy the file locally to the WN (see the sketch after this list). The disk servers do indeed have many open connections (i.e. many rootd servers spawned) but are rather unloaded, and downloading to the local disk is the right approach for now. Next week CNAF people will tune the disk servers serving the cache in front of the tapes for T1D0 according to the numbers that were put forward to the scrutiny group.

- IN2p3: data was transferred there again yesterday; this morning the cache will be cleaned up as soon as the data gets into the MSS.
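
A minimal sketch of the copy-to-WN workaround mentioned above, assuming a generic copy tool is available on the worker node (the xrdcp command and the tURL shown are placeholder assumptions, not the actual LHCb/DIRAC machinery):

    # Sketch of the "download to the WN, then open locally" workaround.
    import os
    import shutil
    import subprocess
    import tempfile

    def fetch_to_local(turl, copy_cmd=("xrdcp",)):
        """Copy the file to local scratch on the WN and return the local path."""
        scratch = tempfile.mkdtemp(prefix="lhcb_wn_")
        local_path = os.path.join(scratch, os.path.basename(turl))
        subprocess.check_call(list(copy_cmd) + [turl, local_path])
        return local_path

    # Usage (placeholder tURL): the application then reads the local copy, and the
    # scratch directory is removed afterwards.
    # local = fetch_to_local("root://diskserver.example//castor/example/file.dst")
    # ... process local ...
    # shutil.rmtree(os.path.dirname(local))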

T2 sites issues:

16th July (Thursday)


Experiment activities

  • This page shows the progress of the staging activity.
  • Also started the staging and the subsequent reprocessing; 2K jobs are now running in the system at the T1's.
  • MC09 is proceeding at a sustained rate (8K jobs running concurrently), plus the usual ~3K concurrent user jobs: 13K jobs in the system.

Top 3 site-related comments of the day:

1. IN2p3 removed their data completely (also from the MSS).
2. RAL and CNAF reprocessing jobs hanging while accessing data.
3. Transfers to GridKA failing.

GGUS (or Remedy) tickets since Friday:


T0 sites issues:


T1 sites issues:
- GridKA: problems transferring some data out of GridKA for the last two days.
- IN2p3: data removed completely from tape.
- CNAF: the wrong BDII publication at CNAF seems fixed.
- IN2p3: transfers to MC_M-DST failing because it was full. Increased the quota.
- RAL: >1K jobs stalled accessing data. It appears to be due to the limited number of connections available from the WNs to the CASTOR disk servers.
- CNAF: also 2K jobs stalled accessing data files; a wrong CASTOR setup on the WNs is also suspected.

T2 sites issues:

15th July (Wednesday)


Experiment activities

  • Staging monitoring is available on this page; the subsequent reprocessing will start in the afternoon. Each T1 is being tested beforehand with a small sample of jobs.
  • MC09 is proceeding at a sustained rate (8K jobs running concurrently), plus the usual ~1K concurrent user jobs.

GGUS (or Remedy) tickets since Friday:


T0 sites issues:


T1 sites issues:
- Transfers to CASTOR at CNAF failing during the weekend; there was a problem with the CASTOR stager (probably related to the Oracle DB). Files now transfer properly (Solved).
- DST transfers to the MC-DST space at pic failing. Disk full (Solved).
- Wrong BDII publication at CNAF driving pilots to abort.
T2 sites issues:

14th July (Tuesday)


Experiment activities

  • After the intervention on CASTOR at CERN tomorrow afternoon, LHCb will restart the reprocessing. Special care is needed for NIKHEF, where we transferred again the data that had accidentally been removed, also from tape. We would therefore like to ask SARA people to go through the cache again and remove the data there (this time keeping the data on the MSS ;-)).
  • MC09 is proceeding at full regime with approximately 11K jobs running concurrently, which, paired with another 1K concurrent user jobs, gives a sustained 12K concurrent jobs in the system for days.

GGUS (or Remedy) tickets since Friday:


T0 sites issues:


T1 sites issues:
- Transfers to CASTOR at CNAF failing during the weekend; there was a problem with the CASTOR stager (probably related to the Oracle DB) (REOPENED).
- DST transfers to the MC-DST space at pic failing.

T2 sites issues:
Smallish sites are being flooded by a too-aggressive pilot submission policy. To be investigated.

13th July (Monday)


Experiment activities

  • Currently running 10.5K jobs between private user analysis (~1K) and MC09 production.
  • The reprocessing activity announced last week is still waiting for all sites (notified via GGUS by the shifter) to remove the data from the cache. There are ~15K jobs pending the green light to be submitted to sites for testing the staging and file access at the T1's.
  • Here is the weekly report from the shifter.

GGUS (or Remedy) tickets since Friday:


T0 sites issues:


T1 sites issues:
Transfers to CASTOR at CNAF failing during the weekend; there was a problem with the CASTOR stager (probably related to the Oracle DB) (SOLVED).
FZK: shared area with problems on some of the WNs (NFS auto-mounter problem) (SOLVED).
Some users not recognized on the StoRM SE at CNAF (OPEN).

T2 sites issues:
Wrong BDII publication altering the computation of the rank, shared-area issues, and too many pilots aborting (batch system configuration).

10th July (Friday)


Experiment activities

  • MC09 + FEST
  • Important: decided to run a re-processing activity 'a la' STEP09 using the data accumulated at the T1's during the previous FEST week, thereby providing a greater challenge. This means that sites have to be notified via GGUS ticket (already sent) to clean up their caches in front of the MSS so that the re-processing workflow, including the staging from tape, can run. All data under /lhcb/data/2009/RAW/FULL/FEST/ should be removed from the disk cache (but not from tape).
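
A heavily hedged sketch of what the cache clean-up amounts to, assuming a site-specific "release disk copy" command (the command name below is a placeholder; the actual tool differs between CASTOR, dCache and other storage systems, so each site should use its own procedure):

    # Sketch only: drop the disk-resident copy of each listed FEST RAW file while
    # leaving the tape copy untouched. RELEASE_CMD is a placeholder assumption.
    import subprocess

    RELEASE_CMD = ["site_specific_release_disk_copy"]
    FEST_PREFIX = "/lhcb/data/2009/RAW/FULL/FEST/"

    def release_disk_copies(file_list_path):
        """Release the disk copy of each file listed (one path per line)."""
        with open(file_list_path) as flist:
            for path in (line.strip() for line in flist):
                if not path.startswith(FEST_PREFIX):
                    continue              # touch only the FEST data set
                subprocess.check_call(RELEASE_CMD + [path])

    # release_disk_copies("fest_raw_files.txt")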

GGUS (or Remedy) tickets since yesterday:


T0 sites issues:


T1 sites issues:

Sites have been asked to clean up their caches (and only the caches) to allow LHCb to rerun a reprocessing exercise.

T2 sites issues:

9th July (Thursday)


Experiment activities

  • MC09 + FEST. New record number of concurrent jobs: 12245.

GGUS (or Remedy) tickets since yesterday:


T0 sites issues:

The LHCb_RDST and LHCb_RAW space token problem was fixed by reconfiguring the new disk servers, which have been redeployed in production again after some trivial tests confirmed they were OK.

T1 sites issues:

RAL: jobs stuck in staging status. Found to be a problem with the tape drivers.

CNAF and pic WMS issue: all requests timing out.

T2 sites issues:

Wrong configuration of some sites (mainly for the LCMAPS mapping and for misleading publication in the BDII).

8th July (Wednesday)


Experiment activities

  • MC09 + FEST.

GGUS (or Remedy) tickets since yesterday:

Service issues:

GGUS down this morning

T0 sites issues:

The LHCb_RDST and LHCb_RAW space tokens have a problem that systematically prevents uploading data. It seems related to the new disk servers deployed on these spaces, which have now been pulled out of production.
T1 sites issues:

RAL: jobs stuck in staging status. The requests are sent properly to the SRM but the files never come online.


T2 sites issues:


7th July (Tuesday)


Experiment activities

  • MC09 for 10^9 events resumed after being paused during the weekend for a major issue with the LogSE service, which had exhausted its disk space.
  • Preparing for the FEST production tomorrow, although there is an issue with the software-installation SAM jobs failing everywhere, preventing the requested version of the LHCb application software from being installed.


GGUS (or Remedy) tickets since last time:

T0 sites issues:
T1 sites issues:

One of the two gLite WMSes at pic is not responding.


T2 sites issues:

Wrong mapping of the Role=lcgadmin (SGM users) at many non-T1 sites.

2nd July (Thursday)


Experiment activities

  • MC09 for 10^9 events resumed after being paused due to a major problem with DIRAC yesterday. All duplicated (wrong) data has been cleaned up. The merging production has also been restarted.
  • gLExec tests confirm that the installations at GridKA and Lancaster are OK.


GGUS (or Remedy) tickets

T0 sites issues:
T1 sites issues:

The gLite WMS at SARA is still failing submissions with timeouts. Even though the ICE queue was removed there, the service is still not responding properly.

GridKA: a file owned by root:root in the shared area is causing jobs to crash over there.
T2 sites issues:

1st July (Wednesday)


Experiment activities

  • MC09 for 10^9 events is running. This morning 10K concurrent jobs are running.
  • LHCb-specific SAM jobs have been failing over the last days due to a bug introduced in DIRAC. Site reliability is not affected by this, but many SAM tests displayed in the Dashboard appear reddish.

Top 3 site-related comments of the day:

1. CIC portal: VO-ID card information got lost (now recovered by CIC people looking into the bug).
2. GridKA: root-owned file in the shared area causing jobs to fail.
3. WMS issues at SARA (timeouts).


GGUS (or Remedy) tickets

m/w issues:

Discovered a bug in the SWIG code for SL5 (bug #52476), fixed by Remi. It was introducing a memory leak in the Python binding for gfal.
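
For context only (this is not the actual reproducer for bug #52476), a generic sketch of how a leak in a Python binding shows up: call the suspect binding function in a loop and watch the resident set size grow; the gfal call in the usage note is a placeholder.

    # Generic leak check: a steadily growing max RSS across iterations points to a
    # leak in the binding layer rather than in the Python code itself.
    import resource

    def rss_kb():
        # ru_maxrss is reported in kilobytes on Linux
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

    def leak_check(call, iterations=10000, report_every=1000):
        for i in range(1, iterations + 1):
            call()
            if i % report_every == 0:
                print("iter %6d  max RSS %8d kB" % (i, rss_kb()))

    # Usage with a placeholder binding call:
    # leak_check(lambda: some_gfal_call("srm://example/path"))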

T0 sites issues:
T1 sites issues:

The gLite WMS at SARA is still failing submissions with timeouts. Even though the ICE queue was removed there, the service is still not responding properly.

GridKA: a file owned by root:root in the shared area is causing jobs to crash over there.
T2 sites issues:


-- RobertoSantinel - 03 Aug 2009

 