--
HarryRenshall - 15 Apr 2008
Week of 080414
Open Actions from last week:
Daily CCRC'08 Call details
To join the call, at 15.00 CET Tuesday to Friday inclusive (usually in CERN bat 28-R-006), do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
Monday:
See the weekly joint operations meeting minutes.
Additional Material:
Tuesday:
elog review: nothing new
Experiments round table:
ALICE (PM): are sending the last data required for their commissioning exercise to begin on 5 May. Currently CNAF is down for a Castor upgrade and the NDGF SE is down. They are testing migrating their VO-box software to run on 64-bit Linux.
Sites round table:
NL-T1 (JT): ATLAS are having difficulties with their SAM tests to judge the availability of the joint NIKHEF-SARA Dutch Tier 1. A runaway LHCb program wrote 120 GB of logs bringing down some worker nodes.
Core services (CERN) report:
DB services (CERN) report: During tests of the new RAC hardware some 10% of the storage controllers have failed, and as yet there is no explanation from the vendor. For this reason they propose to use Oracle's Data Guard software to maintain an asynchronous failover copy of the physics databases on the old hardware after migration to the new. They will put this plan before the MB today. The plan is to migrate LHCb and CMS today, ATLAS tomorrow, then WLCG on Thursday (which means a 2-hour FTS downtime and 4 hours down for the local ATLAS LFC).
Monitoring / dashboard report: CMS want to start using the Condor glide-in facility to submit about 100 jobs/day, and this will need modifications in the dashboard to track them. They also want to start looking at the CPU efficiency of their various applications.
Release update:
AOB: Registration for the WLCG collaboration workshop closes tomorrow.
Wednesday:
elog review: New item from PIC for LHCb. Currently the space token is determined from the file name path, but for the May CCRC this was supposed to be different - what is the status?
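For illustration only, a minimal sketch of the path-based behaviour the elog item refers to; the path prefixes, token names and helper function here are invented examples, not the real LHCb or PIC configuration.

# Illustrative only: derive a space token from the file name path.
# Prefixes and token names below are made up.
SPACE_TOKEN_BY_PREFIX = {
    "/lhcb/user/":       "LHCb_USER",
    "/lhcb/production/": "LHCb_MDST",
    "/lhcb/data/raw/":   "LHCb_RAW",
}

def space_token_for(path, default="LHCb_FAILOVER"):
    """Return the space token implied by the file path (longest prefix wins)."""
    for prefix in sorted(SPACE_TOKEN_BY_PREFIX, key=len, reverse=True):
        if path.startswith(prefix):
            return SPACE_TOKEN_BY_PREFIX[prefix]
    return default

print(space_token_for("/lhcb/user/j/jdoe/ntuple.root"))  # -> LHCb_USER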
Experiments round table:
LHCb (RS): asking sites if they have yet deployed the LHCBUSER space. For NL-T1, JT confirmed that this space (3 TB there) must come out of the LHCb MoU space envelope. All 7 Tier 1s are now replicating the LHCb LFC using Oracle Streams.
CMS (DB): Not a lot of export activity now as they are busy preparing for the May run.
Sites round table:
Core services (CERN) report: After this meeting the CERN production FTS will be switched to the 'RAL' experiment-shares model, where each experiment gets a guaranteed total bandwidth share based on the number of files and within which it can set sub-shares. This does stop experiments profiting from lack of use by another experiment. DB of CMS asked to be informed when other FTS reconfigurations are made.
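For illustration only, a minimal sketch of the share model just described: each experiment gets a guaranteed slice of the total transfer capacity, can split its own slice into sub-shares, and unused capacity is not redistributed to other experiments. The VO names, share values and function names are invented, not the actual FTS configuration.

# Illustrative only: guaranteed per-VO shares with optional sub-shares.
def allocate_slots(total_slots, shares):
    """Split total_slots among VOs in proportion to their configured share."""
    total_share = sum(shares.values())
    return {vo: int(total_slots * s / total_share) for vo, s in shares.items()}

def allocate_subshares(vo_slots, subshares):
    """A VO can further split its own guaranteed allocation into sub-shares."""
    total = sum(subshares.values())
    return {activity: int(vo_slots * s / total) for activity, s in subshares.items()}

if __name__ == "__main__":
    shares = {"atlas": 40, "cms": 30, "lhcb": 20, "alice": 10}  # hypothetical values
    per_vo = allocate_slots(100, shares)
    print(per_vo)
    # Within its own guaranteed slots, a VO might define sub-shares per activity:
    print(allocate_subshares(per_vo["cms"], {"t0export": 2, "debug": 1}))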
DB services (CERN) report: The migrations of the CMS and LHCb RACs to new hardware were carried out successfully yesterday afternoon and finished within the scheduled window of intervention. A standby (using Oracle Data Guard) is kept on the old hardware as a fail-over in case of further problems with the new hardware (which has recently shown several controller and disk failures).
We are also investigating the capture process of the LHCb T0-to-T1 replication, which has failed several times on the new RAC and is showing a few hours of latency. We will fix it as soon as possible.
The migration of the ATLAS RAC is going on now (scheduled from 14:00 to 16:00). The list of affected services is as follows (it can also be found on the IT status board):
atlas_rac,atlas_muon_ec_align,atlas_authdb,atlas_coolprod,atlas_prodsys,atlas_dd,atlas_muoncert,atlas_tags,atlas_da,atlas_t0
atlas_muonprod,atlas_muon,atlas_muon_rpc,atlas_integration,atlas_muoncsc,atlas_largus,atlas_larcalib,atlas_trt
atlas_largfr,atlas_muonmic,atlas_htmldb,atlas_atlog,atlas_config,atlas_oksprod,atlas_pvssprod,atlas_dashboard,atlas_dcs
atlas_pvssconf_dcs,atlas_coolwrite,atlas_dq2_location,atlas_dq2,atlas_mdt_dcs,atlas_mda,atlas_coca,atlas_oks
streams online replication
atlr_backup,atlas_tagsprod,atlas_tags_writer
The migration of the LCG RAC to the new hardware is scheduled for tomorrow, Thursday 17 April, from 15:00 to 17:00.
The list of affected services is (also on the IT status board):
lcg_FCR
lcg_fts
lcg_fts_monitor
lcg_fts_t2
lcg_fts_t2_w
lcg_gridview
lcg_lfc
lcg_same
lcg_sam
lcg_sam_portal
lcg_voms
A transparent hardware upgrade was performed at ASGC affecting the 3D, FTS, LFC, CASTOR and SRM databases.
Monitoring / dashboard report:
Release update:
AOB: DB of CMS said the Castor team have migrated ATLAS to version 2.1.7 but now want to wait 2 weeks to see how it settles in; what would this imply for a CMS upgrade, and would it be important for them? JS said we had always wanted to exercise such an upgrade during data taking and that he would find out whether this release brings any benefits for CMS.
Thursday:
elog review: Gridview stopped reporting transfers at 17.00 UTC yesterday.
Experiments round table:
ALICE (PM): Getting ready for the 3rd commissioning exercise beginning on the 5th of May. Successful integration of the gLite 3.1 VO-box distribution on 64-bit nodes. A first 64-bit node will enter production at CERN this week.
ATLAS (SC): A post mortem of the ATLAS Functional Tests will be discussed at today's ATLAS operations meeting. The only activity going on this week is T1-T1 transfer tests for the IT cloud (ongoing; results will be shown at the beginning of next week).
Tomorrow ATLAS plans to run a 3-day functional test to validate a new DDM release (after an important bug fix) and to test the new FTS pilot service at CERN.
I am not aware of any trouble after the database migration yesterday afternoon.
Tonight, after the intervention on the WLCG database RAC, the SRMv1 and classic SE interfaces to CASTOR at CERN will be decommissioned. RIP. The only Grid access to CASTOR at CERN will be via SRMv2, i.e. srm-atlas.cern.ch.
CMS (AS): Are adding CMS services to SLS, starting with the dataset bookkeeping system (DBS) and the distributed calibration database system (FroNTier); also adapting current tools to generate fake analysis jobs to exercise CMS sites in view of CCRC'08.
LHCb (RS): LHCb has started to test its stripping and reconstruction workflows against all T1s and CERN as a preparatory phase for CCRC. In the meantime they have also started to clean up all phase-1 data on all SEs and in the LFC. The dcap server was down at SARA - we alerted them. We are preparing to migrate disk-only data at CNAF from CASTOR to StoRM. They are also connecting sensors to the CERN SLS system.
Sites round table:
NL-T1 (JT): He will check whether dcap failures are properly alarmed. No LHCb jobs were running at NIKHEF; RS thought they should be, so will check.
RAL: It turns out LHCb require the LFC to be published in the BDII for their SAM tests. We were not doing this for the latest server. Normal LHCb use was OK, we believe. Now fixed!
Core services (CERN) report: The new FTS shares mechanism is working so far but does not yet carry much load.
DB services (CERN) report (EdF): The migration of the ATLAS RAC to the new hardware was carried out successfully yesterday afternoon and finished within the scheduled window of intervention. A standby (using Oracle Data Guard) is kept on the old hardware as a fail-over in case of further problems with the new hardware (as reported yesterday for the CMS and LHCb RACs).
The migration of the ATLAS downstream capture took longer than expected because of misbehavior of some Streams processes (still being investigated). The Tier 1 replicas were synchronized during the night.
The problem observed with the capture process of the LHCb T0-to-T1 replication was fixed yesterday after adjusting some memory parameters in the new database.
The migration of the LCG RAC is going on now (scheduled from 14:00 to 16:00). The list of affected services is as follows:
lcg_FCR
lcg_fts
lcg_fts_monitor
lcg_fts_t2
lcg_fts_t2_w
lcg_gridview
lcg_lfc
lcg_same
lcg_sam
lcg_sam_portal
lcg_voms
Monitoring / dashboard report: The failure of Gridview to display traffic since 17.00 yesterday is thought to have been due to an Oracle server overload caused by a backlog of data being processed. No data appears to have been lost, however.
Release update:
AOB:
Friday:
elog review: nothing new
Experiments round table:
ATLAS (KB): this is a week with many changes. We are making a lot of modifications to our storage setup at CERN. We are moving large amounts of data around to get them into the pools that are meant to hold them, but then we also have to update the catalogs. Moreover, we have now completely changed to SRMv2, which in turn also required the catalogs to be modified. Last but not least, we got rid of the last classic SE for ATLAS at CERN, and again we had to change the catalogs. We have also changed the hardware for the databases, done some software upgrades and applied various patches. I noticed that many Tier sites also decided to profit from this standstill and do some upgrades. I believe tomorrow is the first day without major interventions, and we will try to set up a limited functional test to try some of the latest DQ2 upgrades during the weekend. I know that Bologna is doing T1-T1 tests, but we have not heard from any of the other Tier-1s which still have to do this. I repeat, this is something we cannot do centrally; it has to be done by the Tier-1s. We have a good wiki with instructions on how to do it, though.
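For illustration only, a minimal sketch of the kind of catalog fix-up mentioned above, where replica entries pointing at the retired classic SE are rewritten to the SRMv2 endpoint. The old hostname, path layout and dump format are invented examples, not the actual ATLAS DQ2 or LFC tooling.

# Illustrative only: rewrite classic-SE replica SURLs to the SRMv2 endpoint.
OLD_PREFIX = "gsiftp://old-classic-se.cern.ch/castor/cern.ch/grid/atlas/"  # hypothetical
NEW_PREFIX = "srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/"          # hypothetical path layout

def rewrite_replica(surl):
    """Map a classic-SE replica SURL onto the SRMv2 endpoint, or leave it alone."""
    if surl.startswith(OLD_PREFIX):
        return NEW_PREFIX + surl[len(OLD_PREFIX):]
    return surl

def rewrite_catalog_dump(lines):
    """Each line of the (hypothetical) dump is '<guid> <surl>'; rewrite the SURLs."""
    for line in lines:
        guid, surl = line.split()
        yield "%s %s" % (guid, rewrite_replica(surl))

if __name__ == "__main__":
    dump = ["abc-123 gsiftp://old-classic-se.cern.ch/castor/cern.ch/grid/atlas/raw/file1"]
    for entry in rewrite_catalog_dump(dump):
        print(entry)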
Sites round table:
Core services (CERN) report:
DB services (CERN) report:
Monitoring / dashboard report:
Release update:
AOB: