LHCb CCRC08 information
Planned tasks
- Raw data distribution from pit to T0 centre
  - Use of rfcp into CASTOR from pit - T1D0
- Raw data distribution from T0 to T1 centres
- Reconstruction of raw data at CERN & T1 centres
  - Production of rDST data - T1D0
  - Use of SRM 2.2
- Stripping of data at CERN & T1 centres
  - Input data: RAW & rDST - T1D0
  - Output data: DST - T1D1
  - Use of SRM 2.2
- Distribution of DST data to all other centres
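As an illustration of the first steps of this flow, the sketch below shows a pit-to-CASTOR copy with rfcp followed by a T0-to-T1 replication submitted to FTS against SRM 2.2 endpoints. It is a minimal sketch only: the file names, CASTOR path, SURLs and FTS endpoint are hypothetical placeholders, not the actual CCRC08 values.

```python
# Minimal sketch of the pit -> T0 -> T1 raw data flow described above.
# All paths, SURLs and the FTS endpoint are hypothetical placeholders.
import subprocess

RAW_FILE = "/daqarea/run001234_0001.raw"                          # hypothetical pit-side file
CASTOR_PFN = "/castor/cern.ch/grid/lhcb/RAW/run001234_0001.raw"   # hypothetical T0 CASTOR path
FTS_SERVICE = "https://fts.example.ch:8443/glite-data-transfer-fts/services/FileTransfer"
SRC_SURL = "srm://srm-lhcb.cern.ch" + CASTOR_PFN                  # hypothetical SRM 2.2 SURLs
DST_SURL = "srm://srm-lhcb.example-t1.org/lhcb/RAW/run001234_0001.raw"

# 1. Copy the raw file from the pit into CASTOR (T1D0) with rfcp.
subprocess.check_call(["rfcp", RAW_FILE, CASTOR_PFN])

# 2. Replicate it from the T0 to a T1 by submitting an FTS job.
job_id = subprocess.check_output(
    ["glite-transfer-submit", "-s", FTS_SERVICE, SRC_SURL, DST_SURL]
).decode().strip()

# 3. Check the state of the transfer job.
subprocess.check_call(["glite-transfer-status", "-s", FTS_SERVICE, job_id])
```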
More details were given at the October preGDB meeting (see
slides)
Updated (more detailed) resource requirements are given
here (latest information 28th October 2007), with a breakdown by site for February given
here (latest update 13th November 2007).
The above documents were updated following the January'08 face-to-face LCG CCRC08 meeting, and feedback from some sites can be found
here (22nd January 2008)
High level planning for the CCRC08 is given below:
- status of 16th Feb is given here.
- status of 25th Feb is given here.
The presentation to the CCRC08
F2F meeting (in April 08) can be found
here. It gives the current resource estimates for the May'08 phase of CCRC08 and the services expected at the T1 sites. Since the presentation to the
F2F meeting, LHCb have identified space requirements under the LHCb_USER space token. The requirements for CCRC08 (and for data taking) are given
here.
Daily Meetings
- Monday 11:00 operations meeting - 1st item should be CCRC08 activities
- Tuesday 9:30 PASTE meeting - 1st item should be CCRC08 activities
- Wednesday 11:00 CCRC08-specific meeting
- Thursday 11:00 operations meeting - 1st item should be CCRC08 activities
- Friday 11:00 CCRC08-specific meeting
If you need the phone details, contact Nick Brook.
Critical service list
| Service | Rank | Comment |
| CERN VO boxes | 10 | |
| CERN LFC service | 10 | |
| T0 SE | 10 | |
| VOMS proxy service | 7 | |
| T1 VO boxes | 3 | |
| CERN Campus network and AFS | 10 | |
| FTS | 7 | both CERN to/from T1 & inter-T1 |
| WN misconfiguration | 7 | |
| CE access | 7 | |
| Conditions DB | 7 | |
| LHCb Bookkeeping service | 7 | |
| Oracle streaming from CERN | 7 | |
| SAM service | 7 | We should rely on this to OK a site? |
| LHCb gLite WMS | 5 | at CERN (rank 3 at T1) |
| T1 LFC service | 3 | |
| Dashboard | 3 | |
The table above represents a first sketch presented at the beginning of 2008. A detailed document describing all LHCb critical services, metrics, monitoring tests and criticality is available
here
Planning Document
The latest version (3rd April 2008) of the LHCb milestone document can be found
here
Storage requirements and space management
Each site should aim to provide a configuration like that described in
LHCb Space Management, and the middleware should be able to cope with these requirements. An updated document describing the space token connectivity can be found here:
LHCbSpaces.xls. A detailed description of the Storage Class requirements is available
here.
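To make the storage-class requirements concrete, the sketch below encodes a token-to-storage-class mapping of the kind a site would configure. Apart from LHCb_USER (mentioned above) and the T1D0/T1D1 classes for RAW, rDST and DST listed in the planned tasks, the token names and the class shown for LHCb_USER are assumptions for illustration, not the official LHCb list.

```python
# Illustrative mapping of space tokens to storage classes, based on the
# planned tasks above (RAW & rDST on T1D0, DST on T1D1).  Token names other
# than LHCb_USER, and the class given for LHCb_USER, are assumptions.
SPACE_TOKENS = {
    "LHCb_RAW":  {"storage_class": "T1D0", "content": "RAW data from the pit / T0 export"},
    "LHCb_rDST": {"storage_class": "T1D0", "content": "rDST output of reconstruction"},
    "LHCb_DST":  {"storage_class": "T1D1", "content": "DST output of stripping"},
    "LHCb_USER": {"storage_class": "T0D1", "content": "user analysis output"},
}

def storage_class(token):
    """Return the tape/disk class a site should configure for a given token."""
    return SPACE_TOKENS[token]["storage_class"]

print(storage_class("LHCb_RAW"))   # -> T1D0
```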
Logical Namespace
Information on the LHCb namespace can be found
here. Only the real data configuration will be used in February. For CCRC, the proposed name for <year> is CCRC08.
If the SAPATH terminates with "/lhcb", the "/lhcb" must not be repeated, otherwise it will be duplicated in the PFN. If a site needs a different SAPATH for different spaces, this is not a major concern (it only affects what appears before the LFN in the PFN). Note, however, that there will be problems with space migration, but these are unimportant for CCRC'08.
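The rule above can be summarised in a short sketch of how a PFN is assembled from a site's SAPATH and an LHCb LFN; the SRM endpoint, SAPATH and LFN used here are hypothetical examples.

```python
# Minimal sketch of PFN construction from SAPATH + LFN, avoiding a
# duplicated "/lhcb" when the SAPATH already ends with it.
def build_pfn(srm_endpoint, sapath, lfn):
    """Concatenate the SAPATH and the LFN without repeating '/lhcb'."""
    if sapath.endswith("/lhcb") and lfn.startswith("/lhcb/"):
        lfn = lfn[len("/lhcb"):]   # drop the duplicate prefix
    return srm_endpoint + sapath + lfn

# Hypothetical dCache-style SAPATH ending in "/lhcb" and a CCRC08 LFN:
print(build_pfn("srm://srm-lhcb.example-t1.org",
                "/pnfs/example-t1.org/data/lhcb",
                "/lhcb/data/CCRC08/RAW/run001234_0001.raw"))
# -> srm://srm-lhcb.example-t1.org/pnfs/example-t1.org/data/lhcb/data/CCRC08/RAW/run001234_0001.raw
```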
Storage Requirements
Information on the LHCb space requirements can be found in the following document:
PostCCRC_storage.pdf
Space tokens deployment and monitoring
Information on the LHCb
Space tokens deployment status at Tier1 sites
Collection from the sites of their hardware description:
T1 Storage System setup (updated 4th September)
E-logbook
Issues experienced during CCRC are available at the following
e-log
What To Do When... (WTDW)
This
Twiki page should give you some guidance on solving some of the problems observed during CCRC08.
February Phase summary
15-17th February
Exercise of the pit-T0-T1 machinery at 1 file/minute. There was an issue with several data management agents polling volhcb03, with the consequence that a reboot was necessary. In general everything seemed to proceed successfully.
19th February
A small-scale production with a low number of events ran successfully. Some of the jobs were observed to have gsidcap issues, with lost connections at
IN2P3. There were issues around a corrupted software area at
GridKa.
20th February
Transfers recommenced at a rate of 1 file per minute. It was observed that file removal from CERN using SRM only removed the entries from the namespace and not the files from the cache, leading to the transfer issue to CERN. This was a consequence of the CASTOR software deployed for LHCb. Automatic job submission was started. Transfers (pit-T0-T1) were running constantly.
21st February
Issues with SRM reporting the space as full at
IN2P3. This was associated with cleaning the SE, with SRM not reclaiming the space. There were issues surrounding the use of role=pilot that meant jobs ran as sgm in the test production; role=production will be used for CCRC08. Another test production was submitted.
22nd February
Issues were observed with uploading to/creating directories at
IN2P3 and RAL (associated with dirac_directory creation). Transfers are now automatically running 6 hours on, 6 hours off, to mimic LHC operations.
23rd February
Recurrence of the
IN2P3 SRM space problem. Transfers to CNAF failed; this was associated with not enough LSF slots on the disk servers (increased to 200 per server).
Analysis of weekend jobs:
General
Brunel application problem associated with
TransportSvc. Proxy retrieval issues were associated with relying on the VOMS server to extend the proxy lifetime. There was an incorrect error flag on reaching EOF for the
MDF file.
CERN
There was a temporary software area hitch, and a failure to copy and register output files; the latter was associated with files greater than ~2 GB and the subsequent check of the file size after the data was uploaded.
CNAF
A failure to copy and register output files; the same issue as seen at CERN. Jobs hung waiting for data from CASTOR; this is associated with the LSF job slot problems observed in the transfers.
RAL
A failure to copy and register output files; the same issue as seen at CERN and CNAF. There were timeouts associated with the bookkeeping.
IN2P3
There was a gPlazma authorisation issue that occurred rarely. There was a major issue with gsidcap door failures; this is due to a 2-hour timeout for any connection, set by the site. There were timeouts associated with the bookkeeping.
NIKHEF
As the number of jobs increased at NIKHEF, all jobs failed trying to access data through dCache due to a failure to create a control line. This seemed to be a load issue. NIKHEF will move the gsidcap server off the SRM server; this is not yet scheduled and the issue remains open. There were timeouts associated with the bookkeeping.
PIC
All seemed fine.
GridKa
Some stalled jobs associated with dCache file access problems.
27th February
Transfer issues to NIKHEF; problems with one of the pools.
29th February
The bookkeeping problems observed at
IN2P3, NIKHEF and RAL were solved after the appropriate firewall ports were opened.