-- HarryRenshall - 16 Jan 2008
Week of 080114
Open Actions from last week:
Monday:
See the weekly phone conference in Indico.
Tuesday:
Experiment reports:
ALICE (PM) met with the Castor team this morning to discuss which
Castor release will include a new xrootd plugin that they need. More
info tomorrow. Since RAL is staying with Castor 2.1.4 (this and 2.1.6
are supported for bug fixes), ALICE will send raw data to tape at RAL
in February but not read it back.
ATLAS (SC) software to handle space tokens is ready and will be used as
soon as sites configure them.
CMS (AS) nothing to report.
LHCb (RS) Nick Brook will coordinate solving the LHCb Castor/rfio
problem at RAL and CNAF. There will be an LFC bulk data modification
to the master LFC.
SMOD report: (M.C-S)
FIO will upgrade all CERN LFC to 1.6.7 next week. The 1.6.8 version,
which runs on slc4, is not yet available, though ATLAS plan to use it
for bulk operations in February.
DBMOD report: (JDS)
We now know that the physics databases client on worker nodes needs
much of the applications area software stack. It also needs Oracle
libraries and these are in fact freely available but without technical
support. Migration to new hardware in the integration RAC will be done
for CMS this week and ATLAS next week.
Monitoring / dashboard report:
We are looking at how best to handle the critical services.
Release update:
Nothing new to report.
Questions from sites:
AOB:
D.Bonacorsi of CMS asked whether Castor sites are free to choose between
running 2.1.4 and 2.1.6. (M.C-S) Yes, both are supported for bug fixes.
Can they also choose gridftp1 or 2? (M.C-S) Castor 2.1.6 will require
gridftp2.
(JS) the approved middleware versions should be ready to push out to
sites next week.
(NT) the SL4 VO-box is ready to go into production. Are there any
objections if we release it? (There were none.)
(JS) Note that reports for this meeting can be sent to the mailing list
wlcg-scod@cern.ch
Wednesday
Experiment reports:
ATLAS (SC) and LHCb (RS) do not need site installation of the Oracle Instant Client as it is included in the experiment software suites.
ALICE (PM) will have another meeting with the Castor team to decide on using a new xrootd plugin. It might also be available for Castor 1.6.4.
Core services (CERN) report:
DB services (CERN) report (MG):
CMS databases were moved to a 64-bit OS machine, which exposed one of their Frontier applications using a hard-coded server name in a connection string. The user has been advised.
Monitoring / dashboard report (JC):
CCRC08 elog-gers have been deployed, accessible from the Twiki. They will be used to document interventions, problems and also general observations. In addition there is one intended to link in with MoU response times. Write access requires registration (from the elog entry page).
There are already LHCb and ATLAS detector elog-gers but we could also host experiments under CCRC08.
Release update (NT):
A DPM patch release is about to be made; FTS will be released next Monday, as will gfal and lcg-utils. SC reported that LHCb had found a bug in list-replica. Since this is the baseline version supporting SRM 2 it should be built into the repository but not (yet) installed. The slc4 VO-box will be put in the middleware repository tomorrow. JDS hoped that the CCRC08 baseline middleware versions would be ready for sites to start installing by next Monday's operations meeting.
Questions from sites:
AOB:
Thursday
Experiment report(s):
LHCb want to move files from the pit to the Castor lhcbdata pool which they cannot currently see. MCS asked them to send a request to castor.support. CMS reported that the tomcat server in front of the SAM database was down - a known problem.
Core services (CERN) report (MCS):
At 06.00 the CMS Castor instance could not send its heartbeat to the central service. Being looked at with network experts. As soon as the new xrootd plugin has been tested it can be deployed - there are no Castor dependencies. CNAF and RAL should deploy the current one, then all sites should redeploy the new one together.
DB services (CERN) report: (JDS on behalf of MG): The problem with the CMS integration RAC INT9R observed yesterday by Frontier, following the migration to 64-bit hardware, was due to the Frontier Tomcat server not refreshing its cache of IP addresses, which led the application to keep connecting to the old (no longer existing) hardware. CMS is aware of this problem and will fix it.
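The stale-address failure described above can be illustrated with a minimal sketch of a name cache that expires its entries. This is not the Frontier or Tomcat implementation; the class name `TtlResolver`, the 60 second lifetime and the injectable `lookup` and `clock` parameters are all assumptions made for illustration.

```python
import socket
import time


class TtlResolver:
    """Cache hostname lookups, re-resolving once an entry is older than ttl seconds.

    Hypothetical helper: a cache with no expiry would keep returning the
    address of the old hardware after a migration.
    """

    def __init__(self, ttl=60.0, lookup=socket.gethostbyname, clock=time.monotonic):
        self.ttl = ttl
        self._lookup = lookup   # function mapping hostname -> IP string
        self._clock = clock     # function returning the current time in seconds
        self._cache = {}        # hostname -> (ip, time it was resolved)

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and now - entry[1] < self.ttl:
            return entry[0]     # entry is still fresh, reuse it
        ip = self._lookup(hostname)
        self._cache[hostname] = (ip, now)
        return ip
```

With a finite lifetime, a client picks up a server's new address at most `ttl` seconds after the name is repointed, without needing a restart.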
At the request of LHCb, we have applied today a rolling patch (already tested by LHCb on their integration RAC) which fixes an Oracle bug affecting updates of two CLOB columns in the same query, which appears in the LHCb COOL use case.
Monitoring / dashboard report:
Release update:
There is some confusion over the valid dpm patches - some of those reported were in fact stale. MCS reported that the dpm in production is good enough for what we need. NT said there would anyway be a new dpm today; fts, gfal and lcg-utils would go to pps next Monday for rapid cycling. The gfal get_replica (calling list_replicas) bug is understood and a fix is available.
Questions from sites:
AOB:
Friday
Experiment report(s):
- SAM: issue with OSG sites understood and (hopefully) soon to be fixed
- ALICE: Uni-Mexico, Russia and RAL to start with gLite 3.1 VOBox
- ATLAS: need to know when at least one site will be ready with space tokens
- LHCb: rfio issue & RAL - changing CASTOR config with 1 LSF per disk server seems to solve problem -> roll-out to other sites to be configured.
Core services (CERN) report:
DB services (CERN) report:
We have applied a procedure to the ATLAS online PVSS data that compacts and compresses the archived data. The measured outcome is a reduction of 50% of the allocated space (from 1100 GB to 500 GB) and, as a direct consequence, a two-fold speed-up of PVSS queries. This procedure had been developed with ATLAS and CO in Q4 2007 following the discovery of a bug that caused Oracle blocks to be only partially filled.
Unfortunately the compression has triggered a Streams bug: the capture processes abort when mining redo log or archive log files that contain information for the compressed tables, and replication has been blocked for the ATLAS setup since Tuesday 12.01. A Service Request was opened at priority 1 and a patch is being developed.
We are in close contact with Oracle Support. A patch already exists and is with Oracle development for final validation.
Oracle tells us that this patch should arrive before Monday morning and we will apply it immediately.
We have also extended the retention period for the log files so that we will not need to re-import.
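As a quick check of the PVSS compaction figures quoted above, the following sketch computes the space reduction and size ratio from the two reported sizes. The function name is hypothetical. Relating the size ratio to the query speed-up assumes that query time scales with the volume of data scanned, which the report does not state.

```python
def space_reduction(before_gb, after_gb):
    """Return the fraction of allocated space freed by compaction."""
    return (before_gb - after_gb) / before_gb


before_gb, after_gb = 1100, 500  # sizes quoted in the report

reduction = space_reduction(before_gb, after_gb)
size_ratio = before_gb / after_gb

print(f"space reduction: {reduction:.1%}")   # 54.5%
print(f"size ratio: {size_ratio:.1f}x")      # 2.2x
```

The quoted sizes give a reduction slightly above the rounded 50%, and a size ratio in line with the reported two-fold speed-up.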
Monitoring / dashboard report:
- Metrics still being defined - will take ~1 more week
- Gridmap for service providers - still need experiment input
- Experiment tests in SAM - info collected and will be distributed
Release update:
- Release to production: VO box, gfal, lcg-utils, DPM ready -> make available 09:30 UTC+1
- LFC 1.6.8 in cert - expected Mon/Tue -> ATLAS T0 LFC
- FTS: pre-production smoke-test Monday -> pilot Monday
- Fix for list-replica bug in gfal still to be provided
Questions from sites:
- WN tar ball availability? Same as above
AOB: