-- HarryRenshall - 15 Apr 2008

Week of 080414

Open Actions from last week:

Daily CCRC'08 Call details

To join the call, at 15.00 CET Tuesday to Friday inclusive (usually in CERN building 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

Monday:

See the weekly joint operations meeting minutes

Additional Material:

Tuesday:

elog review: nothing new

Experiments round table:

ALICE (PM): sending the last data required for their commissioning exercise, which begins on 5 May. Currently CNAF is down for a CASTOR upgrade and the NDGF SE is down. They are testing migration of their VO-box software to 64-bit Linux.

Sites round table:

NL-T1 (JT): ATLAS are having difficulties with the SAM tests used to judge the availability of the joint NIKHEF-SARA Dutch Tier 1. A runaway LHCb program wrote 120 GB of logs, bringing down some worker nodes.

Core services (CERN) report:

DB services (CERN) report: During tests of new RAC hardware some 10% of the storage controllers failed, and as yet there is no explanation from the vendor. For this reason they propose to use Oracle Data Guard to maintain an asynchronous failover copy of the physics databases on the old hardware after migration to the new. They will put this plan to the MB today. The plan is to migrate LHCb and CMS today, ATLAS tomorrow, then LCG on Thursday (which means a 2-hour FTS downtime and 4 hours down for the local ATLAS LFC).
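As an illustration only, not the agreed procedure: the health of such an asynchronous Data Guard standby can be watched by querying the standard Oracle 10g view v$dataguard_stats. The account and connect string below are hypothetical placeholders.

import cx_Oracle

# Connect to the standby kept on the old hardware (hypothetical DSN).
conn = cx_Oracle.connect("monitor", "secret", "standby-db.example.ch/PHYSDB")
cur = conn.cursor()
# 'transport lag' and 'apply lag' show how far the standby trails production.
cur.execute("SELECT name, value FROM v$dataguard_stats "
            "WHERE name IN ('transport lag', 'apply lag')")
for name, value in cur:
    print("%s: %s" % (name, value))  # e.g. "apply lag: +00 00:05:23"
conn.close()

Since the copy is asynchronous, a non-zero apply lag is expected; the point of monitoring is to keep it bounded.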

Monitoring / dashboard report: CMS want to start using the Condor glide-in facility to submit about 100 jobs/day, and this will need modifications to the dashboard to track them. They also want to start looking at the CPU efficiency of their various applications.

Release update:

AOB: Registration for the WLCG collaboration workshop closes tomorrow.

Wednesday:

elog review: New item from PIC for LHCb: currently the space token is determined from the file name path, but for the May CCRC run this was supposed to change - what is the status? (A sketch of such a path-based mapping is given below.)
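For readers unfamiliar with the current behaviour PIC describes, here is a toy sketch of deriving a space token from a file path. The path prefixes and token names are hypothetical examples, not the actual LHCb mapping.

# Hypothetical prefix-to-token table; the real LHCb mapping differs.
PATH_TO_TOKEN = {
    "/lhcb/data/RAW": "LHCb_RAW",
    "/lhcb/data/RDST": "LHCb_RDST",
    "/lhcb/user": "LHCb_USER",
}

def space_token_for(path):
    """Return the space token whose path prefix matches, or None."""
    for prefix, token in PATH_TO_TOKEN.items():
        if path.startswith(prefix):
            return token
    return None

print(space_token_for("/lhcb/data/RAW/run1234/file.raw"))  # -> LHCb_RAW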

Experiments round table:

LHCb (RS): asking sites if they have yet deployed the LHCBUSER space. For NL-T1, JT confirmed that this space (3 TB there) must come out of the LHCb MoU space envelope. All 7 Tier-1 sites are now replicating the LHCb LFC using Oracle Streams.

CMS (DB): Not a lot of export activity now as they are busy preparing for the May run.

Sites round table:

Core services (CERN) report: After this meeting the CERN production FTS will be switched to the 'RAL' experiment-shares model, where each experiment gets a guaranteed total bandwidth share, based on the number of files, within which it can set sub-shares (illustrated in the sketch below). This does, however, stop experiments from profiting from lack of use by another experiment. DB of CMS asked to be informed when other FTS reconfigurations are made.
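A conceptual sketch of the shares model as described above, not the real FTS configuration: each experiment gets a guaranteed fraction of a channel's file slots and may split it into sub-shares, and unused capacity is not redistributed to other experiments. All numbers are hypothetical.

CHANNEL_SLOTS = 100  # hypothetical number of concurrent-file slots

shares = {  # hypothetical guaranteed top-level shares
    "atlas": 0.4,
    "cms": 0.3,
    "lhcb": 0.2,
    "alice": 0.1,
}

sub_shares = {  # hypothetical sub-shares within the ATLAS share
    "atlas": {"production": 0.8, "user": 0.2},
}

def slots_for(vo, activity=None):
    """Slots guaranteed to a VO (and optionally one of its activities)."""
    slots = CHANNEL_SLOTS * shares[vo]
    if activity is not None:
        slots *= sub_shares[vo][activity]
    return int(slots)

# ATLAS production gets 32 slots even if the other experiments submit nothing;
# equally, ATLAS cannot exceed its share when the others are idle.
print(slots_for("atlas", "production"))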

DB services (CERN) report: The migrations of the CMS and LHCb RACs to new hardware were carried out successfully yesterday afternoon and finished within the scheduled intervention window. A standby (using Oracle Data Guard) is kept on the old hardware as a fail-over in case of further problems with the new hardware (which has recently shown several controller and disk failures). We are also investigating the Streams capture process of the LHCb T0-to-T1 replication, which has failed several times on the new RAC and is running with a few hours of latency. We will fix it as soon as possible.

The migration of the ATLAS RAC is going on now (scheduled from 14:00 to 16:00). The list of affected services is as follows (it can also be found on the IT status board):

atlas_rac, atlas_muon_ec_align, atlas_authdb, atlas_coolprod, atlas_prodsys, atlas_dd, atlas_muoncert, atlas_tags, atlas_da, atlas_t0, atlas_muonprod, atlas_muon, atlas_muon_rpc, atlas_integration, atlas_muoncsc, atlas_largus, atlas_larcalib, atlas_trt, atlas_largfr, atlas_muonmic, atlas_htmldb, atlas_atlog, atlas_config, atlas_oksprod, atlas_pvssprod, atlas_dashboard, atlas_dcs, atlas_pvssconf_dcs, atlas_coolwrite, atlas_dq2_location, atlas_dq2, atlas_mdt_dcs, atlas_mda, atlas_coca, atlas_oks, streams online replication, atlr_backup, atlas_tagsprod, atlas_tags_writer

The migration of the LCG RAC to the new hardware is scheduled for tomorrow, 17 April, from 15:00 to 17:00. The list of affected services is (also on the IT status board):

lcg_FCR, lcg_fts, lcg_fts_monitor, lcg_fts_t2, lcg_fts_t2_w, lcg_gridview, lcg_lfc, lcg_same, lcg_sam, lcg_sam_portal, lcg_voms

A transparent hardware upgrade was performed at ASGC affecting the 3D, FTS, LFC, CASTOR and SRM databases.

Monitoring / dashboard report:

Release update:

AOB: DB of CMS said the CASTOR team have migrated ATLAS to version 2.1.7 but now want to wait 2 weeks to see how it settles in; what would this imply for a CMS upgrade, and would it be important for them? JS said we always wanted to exercise such an upgrade during data taking and that he would find out whether this release brings any benefits for CMS.

Thursday:

elog review: Gridview stopped reporting transfers at 17.00 UTC yesterday.

Experiments round table:

ALICE (PM): Getting ready for the 3rd commissioning exercise beginning the 5th of May. Successful integration of the gLite 3.1 VOBOX distribution on 64-bit nodes. A first 64-bit node will enter production at CERN this week.

ATLAS (SC): A post mortem of the ATLAS Functional Tests will be discussed at today's ATLAS operations meeting. The only activity going on this week is T1-T1 transfer tests for the IT cloud (ongoing; results will be shown at the beginning of next week). Tomorrow ATLAS plans to run a 3-day functional test to validate a new DDM release (after an important bug fix) and to test the new FTS pilot service at CERN. I am not aware of any trouble after the database migration yesterday afternoon. Tonight, after the intervention on the LCG database RAC, the SRMv1 and classic SE interfaces to CASTOR at CERN will be decommissioned. RIP. The only Grid access to CASTOR at CERN will be via SRMv2, i.e. srm-atlas.cern.ch.

CMS (AS): Adding CMS services to SLS, starting with the dataset bookkeeping system (DBS) and the distributed calibration database system (FroNTier); also adapting current tools to generate fake analysis jobs to exercise CMS sites in view of CCRC'08.

LHCb (RS): LHCb has started to test its stripping and reconstruction workflows against all Tier-1s and CERN as a preparatory phase for CCRC. In the meantime they have also started to clean up all phase-1 data on all SEs and in the LFC. The dcap server was down at SARA; we alerted them. We are preparing to migrate disk-only data at CNAF from CASTOR to STORM. They are also connecting sensors to the CERN SLS system.

Sites round table:

NL-T1 (JT): He will check whether dcap failures are properly alarmed. No LHCb jobs were running at NIKHEF; RS thought they should be, so he will check.

RAL: It turns out LHCb require the LFC to be published in the BDII for their SAM tests, and we weren't doing this for the latest server. Normal LHCb use was OK, we believe. Now fixed! (A sketch of the kind of query involved is given below.)
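An illustrative sketch of checking what a SAM test would see: querying a top-level BDII for published LFC services with python-ldap. The BDII host is a placeholder, and the GlueServiceType value shown is the one commonly used for LFC; treat the details as assumptions.

import ldap

bdii = ldap.initialize("ldap://lcg-bdii.example.ch:2170")  # hypothetical host
results = bdii.search_s(
    "o=grid",  # standard GLUE information-system suffix
    ldap.SCOPE_SUBTREE,
    "(&(objectClass=GlueService)(GlueServiceType=lcg-file-catalog))",
    ["GlueServiceEndpoint"],
)
for dn, attrs in results:
    # An LFC missing from this output would make the SAM test fail.
    print(dn, attrs.get("GlueServiceEndpoint"))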

Core services (CERN) report: The new FTS shares mechanism is working so far but does not yet carry much load.

DB services (CERN) report (EdF): The migration of the ATLAS RAC to the new hardware was carried out successfully yesterday afternoon and finished within the scheduled intervention window. A standby (using Oracle Data Guard) is kept on the old hardware as a fail-over in case of further problems with the new hardware (as reported yesterday for the CMS and LHCb RACs). The migration of the ATLAS downstream capture took longer than expected because of misbehavior of some Streams processes (still being investigated). The Tier-1 replicas were synchronized during the night. The problem observed with the capture process of the LHCb T0-to-T1 replication was fixed yesterday after adjusting some memory parameters in the new database (a sketch of this kind of adjustment follows this report). The migration of the LCG RAC is going on now (scheduled from 14:00 to 16:00). The list of affected services is as follows: lcg_FCR, lcg_fts, lcg_fts_monitor, lcg_fts_t2, lcg_fts_t2_w, lcg_gridview, lcg_lfc, lcg_same, lcg_sam, lcg_sam_portal, lcg_voms
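A minimal sketch, not the actual fix applied: Streams capture throughput in Oracle 10g depends on memory parameters such as streams_pool_size. The parameter choice, the 512M value and the connection details below are all hypothetical.

import cx_Oracle

conn = cx_Oracle.connect("sys", "secret", "lhcb-rac.example.ch/LHCBR",
                         mode=cx_Oracle.SYSDBA)  # hypothetical DSN
cur = conn.cursor()
# Enlarge the Streams pool (hypothetical value, guessed parameter):
cur.execute("ALTER SYSTEM SET streams_pool_size = 512M SCOPE = BOTH")
# Then watch capture latency via the standard v$streams_capture view:
cur.execute("SELECT capture_name,"
            " (SYSDATE - capture_message_create_time) * 86400"
            " FROM v$streams_capture")
for name, latency_s in cur:
    print("%s lagging %.0f s" % (name, latency_s))
conn.close()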

Monitoring / dashboard report: The failure of Gridview to display traffic since 17.00 yesterday is thought to have been caused by an Oracle server overload due to a backlog of data being processed. No data appears to have been lost, however.

Release update:

AOB:

Friday:

elog review: nothing new

Experiments round table:

ATLAS (KB): this is a week with many changes. We are making a lot of modifications to our storage setup at CERN: we are moving large amounts of data around to get them into the pools that are meant to hold them, and then we also have to update the catalogs. Moreover, we have now completely changed to SRMv2, which in turn also required the catalogs to be modified. Last but not least, we got rid of the last classic SE for ATLAS at CERN, and again we had to change the catalogs. We have also changed the hardware for the databases, done some software upgrades and applied various patches. I noticed that many Tier-1 sites also decided to profit from this standstill and do some upgrades.

I believe tomorrow is the first day without major interventions, and we will try to set up a limited functional test to try some of the latest DQ2 upgrades during the weekend. I know that Bologna is doing T1-T1 tests, but we have not heard from any of the other Tier-1s, which also still have to do this. I repeat: this is something we cannot do centrally; it has to be done from the Tier-1s. We have a good wiki with instructions on how to do it, though.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:
