LHCb CCRC08 information

Planned tasks

  • Raw data distribution from pit to T0 centre
    • Use of rfcp into CASTOR from pit - T1D0 (see the sketch after this list)
  • Raw data distribution from T0 to T1 centres
  • Reconstruction of raw data at CERN & T1 centres
    • Production of rDST data - T1D0
    • Use of SRM 2.2
  • Stripping of data at CERN & T1 centres
    • Input data: RAW & rDST - T1D0
    • Output data: DST - T1D1
    • Use SRM 2.2
  • Distribution of DST data to all other centres
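
The first step above, copying RAW files from the pit into CASTOR at the T0, is driven by the LHCb data management agents; the minimal sketch below only illustrates the underlying rfcp copy itself. The file and CASTOR paths, and the way the T1D0 service class is selected, are illustrative assumptions rather than the actual configuration.

    # Minimal sketch (not the production data management agent): copy one RAW
    # file from the pit online buffer into CASTOR at the T0 with rfcp, as in
    # the "pit to T0" step above. Paths are hypothetical.
    import subprocess

    ONLINE_FILE = "/daqarea/lhcb/ccrc08_0001.raw"                              # hypothetical
    CASTOR_PATH = "/castor/cern.ch/grid/lhcb/data/CCRC08/RAW/ccrc08_0001.raw"  # hypothetical

    def copy_raw_to_castor(src, dst):
        # rfcp <source> <destination> is the basic CASTOR remote copy; the
        # T1D0 service class is normally selected via the target directory
        # or the STAGE_SVCCLASS environment variable (assumed here).
        subprocess.run(["rfcp", src, dst], check=True)

    if __name__ == "__main__":
        copy_raw_to_castor(ONLINE_FILE, CASTOR_PATH)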

More details were given at the October pre-GDB meeting (see slides).

Updated (more detailed) resource requirements are given here (latest information: 28th October 2007), with a breakdown by site for February given here (latest update: 13th November 2007).

The above documents were updated following the January 2008 face-to-face LCG CCRC08 meeting; feedback from some sites can be found here (22nd January 2008).

High level planning for the CCRC08 is given below:

  • Status as of 16th February is given here.
  • Status as of 25th February is given here.

The presentation to the CCRC08 F2F meeting (April 2008) can be found here. It gives the current resource estimates for the May 2008 phase of CCRC08 and the services expected at the T1 sites. Since that presentation, LHCb have identified space requirements under the LHCb_USER space token; the requirements for CCRC08 (and for data taking) are given here.

Daily Meetings

Monday 11:00 operation meeting - 1st item should be CCRC08 activities

Tuesday 9:30 PASTE meeting - 1st item should be CCRC08 activities

Wednesday 11:00 CCRC08 specific meeting

Thursday 11:00 operation meeting - 1st item should be CCRC08 activities

Friday 11:00 CCRC08 specific meeting

If you need the phone details, contact Nick Brook.

Critical service list

| Rank | Definition | Max downtime (hrs) | Comment |
| 10 | Critical | 0.5 | |
| 7 | Serious disruption | 8 | |
| 5 | Major reduction in effectiveness | 8 | |
| 3 | Reduced effectiveness | 24 | |
| 1 | Not critical | 72 | |

| Service | Rank | Comment |
| CERN VO boxes | 10 | |
| CERN LFC service | 10 | |
| T0 SE | 10 | |
| VOMS proxy service | 7 | |
| T1 VO boxes | 3 | |
| CERN network (campus) and AFS | 10 | |
| FTS | 7 | both CERN to/from T1 and inter-T1 |
| WN misconfiguration | 7 | |
| CE access | 7 | |
| Conditions DB | 7 | |
| LHCb Bookkeeping service | 7 | |
| Oracle streaming from CERN | 7 | |
| SAM service | 7 | should we rely on this to declare a site OK? |
| LHCb gLite WMS | 5 | at CERN (rank 3 at T1) |
| T1 LFC service | 3 | |
| Dashboard | 3 | |

The table above is a first sketch presented at the beginning of 2008. A detailed document describing all LHCb critical services, their metrics, monitoring tests and criticality is available here.

Planning Document

The latest version (3rd April 2008) of the LHCb milestone document can be found here.

Storage requirements and space management

Each site should aim to provide a configuration like the one described in LHCb Space Management, and the middleware should be able to cope with those requirements. An updated document describing the space token connectivity can be found here: LHCbSpaces.xls. A detailed description of the Storage Class requirements is available here.
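
As a quick illustration of how the storage classes in the plan above map onto space tokens, the sketch below lists one possible mapping. Only LHCb_USER is named explicitly on this page; the other token names and the storage class assumed for user data are illustrative assumptions, and the linked Storage Class requirements document is authoritative.

    # Illustrative mapping of CCRC08 data types to storage classes and space
    # tokens (RAW and rDST on T1D0, DST on T1D1, as in the planned tasks above).
    # Token names other than LHCb_USER are assumptions.
    SPACE_TOKENS = {
        "LHCb_RAW":  {"storage_class": "T1D0", "data": "RAW"},
        "LHCb_RDST": {"storage_class": "T1D0", "data": "rDST"},
        "LHCb_DST":  {"storage_class": "T1D1", "data": "DST"},
        "LHCb_USER": {"storage_class": "T0D1", "data": "user data"},  # class assumed
    }

    def tokens_for_class(storage_class):
        """Return the space tokens that use a given storage class."""
        return [name for name, info in SPACE_TOKENS.items()
                if info["storage_class"] == storage_class]

    print(tokens_for_class("T1D0"))  # ['LHCb_RAW', 'LHCb_RDST']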

Logical Namespace
Information on the LHCb namespace can be found here. Only the real data configuration will be used in February. For CCRC, the proposed name for <year> is CCRC08.
If the SAPATH terminates with "/lhcb", the leading "/lhcb" of the LFN should not be repeated, otherwise it would appear twice in the PFN. If a site needs a different SAPATH for different spaces, that is not a major concern (this only affects what comes before the LFN in the PFN). Note, however, that this will cause problems with space migration, though these are unimportant for CCRC'08.
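
The rule above (PFN = SAPATH followed by the LFN, without duplicating "/lhcb") can be summarised in a few lines. This is a minimal sketch only, and the example paths are hypothetical.

    def build_pfn(sapath, lfn):
        """Concatenate the site SAPATH and the LFN without repeating '/lhcb'."""
        sapath = sapath.rstrip("/")
        if sapath.endswith("/lhcb") and lfn.startswith("/lhcb/"):
            lfn = lfn[len("/lhcb"):]   # drop the duplicated prefix
        return sapath + lfn

    # Example (hypothetical SAPATH and LFN):
    # build_pfn("/castor/cern.ch/grid/lhcb", "/lhcb/data/CCRC08/RAW/0001.raw")
    #   -> "/castor/cern.ch/grid/lhcb/data/CCRC08/RAW/0001.raw"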

Storage Requirements
Information on the LHCb space requirements can be found in the following document: PostCCRC_storage.pdf

Space tokens deployment and monitoring
Information on the LHCb space token deployment status at the Tier-1 sites.

A collection of hardware descriptions from the sites is available at T1 Storage System setup (updated on 4th September).

E-logbook

Issues experienced during CCRC08 are recorded in the following e-log.

What To Do When... (WTDW)

This TWiki page should give you some guidance on solving some of the problems observed during CCRC08.

February Phase summary

15-17th February

Exercise of the pit-T0-T1 machinery at 1 file/minute. An issue with several data management agents polling volhcb03 meant that a reboot was necessary. In general everything seemed to proceed successfully.

19th February A small-scale production with a low number of events ran successfully. Some jobs were observed to have gsidcap issues with lost connections at IN2P3. There were issues around a corrupted software area at GridKa.

20th February Transfers recommenced at a rate of 1 file per minute. It was observed that file removal from CERN using SRM only removed the entries from the namespace and not the files from the cache, leading to the transfer issues to CERN; this was a consequence of the CASTOR software deployed for LHCb. Automatic job submission was started. Transfers (pit-T0-T1) are running constantly.

21st February Issues with SRM reporting the space full at IN2P3; this was associated with cleaning the SE via SRM not reclaiming space. There were issues surrounding the use of role=pilot that meant jobs ran as sgm in the test production; role=production will be used for CCRC08. Another test production was submitted.

22nd February Issues were observed with uploading/creating directories at IN2P3 and RAL (associated with dirac_directory creation). Transfers are now automatically running 6 hours on, 6 hours off to mimic LHC operations (see the sketch below).
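
The 6-hours-on / 6-hours-off cycle can be pictured with the toy sketch below; submit_one_file() is a hypothetical stand-in for the real pit-T0-T1 injection, not the actual production code.

    import time

    CYCLE = 6 * 3600      # 6 hours "on", then 6 hours "off"
    FILE_INTERVAL = 60    # 1 file per minute, as in the exercise above

    def submit_one_file():
        """Placeholder for injecting one RAW file into the pit-T0-T1 chain."""
        pass

    def run_duty_cycle(n_cycles=1):
        for _ in range(n_cycles):
            start = time.time()
            while time.time() - start < CYCLE:   # transfers running
                submit_one_file()
                time.sleep(FILE_INTERVAL)
            time.sleep(CYCLE)                    # transfers paused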

23rd February Recurrence of the IN2P3 SRM space problem. Transfers to CNAF failed, associated with not enough LSF slots on the disk servers (increased to 200 per server).

Analysis of weekend jobs:

General

A Brunel application problem associated with TransportSvc. Proxy retrieval issues associated with relying on the VOMS server to extend the proxy lifetime. An incorrect error flag on reaching EOF for the MDF file.

CERN

There was a temporary software area hitch. A failure to copy and register output files; this was associated with files greater than ~2 GB and the subsequent checking of the file size after the data was uploaded.
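
A rough sketch of the post-upload check involved is given below. It is illustrative only: remote_size() is a hypothetical helper, and the suggestion that a signed 32-bit size field (which wraps above 2**31 bytes, about 2.1 GB) is behind the failure is an assumption, not something stated in this report.

    import os

    def remote_size(surl):
        """Hypothetical helper: size reported by the SE for an uploaded file
        (in the real system this would come from an SRM metadata query)."""
        raise NotImplementedError

    def verify_upload(local_path, surl):
        # Compare local and remote sizes after upload, before registering the
        # replica; a size carried in a signed 32-bit field would wrap above
        # 2**31 bytes and make this comparison fail for large files.
        return os.stat(local_path).st_size == remote_size(surl)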

CNAF

A failure to copy and register output files; the same issue as at CERN. Jobs hung waiting for data from CASTOR; this is associated with the LSF job slot problems observed in the transfers.

RAL

A failure to copy and register output files; the same issue as at CERN and CNAF. There were timeouts associated with the bookkeeping.

IN2P3

There was a gPlazma authorisation issue that occurred rarely. A major issue with gsidcap door failures; this is due to a 2-hour timeout for any connection, set by the site. There were timeouts associated with the bookkeeping.

NIKHEF

As the number of jobs increased at NIKHEF, all jobs failed trying to access data through dCache due to a failure to create a control line. This seemed to be a load issue. NIKHEF will move the gsidcap server off the SRM server; this is not yet scheduled and the issue is still open. There were timeouts associated with the bookkeeping.

PIC

All seemed fine.

GridKa

Some stalled jobs associated with dCache file access problems.

27th February

Transfer issues to NIKHEF; problems with one of the pools.

29th February

The bookkeeping problems observed at IN2P3, NIKHEF and RAL were solved after the appropriate firewall ports were opened.

