-- HarryRenshall - 04 Apr 2008

Week of 080407

Open Actions from last week:

Daily CCRC'08 Call details

To join the call at 15.00 CET, Tuesday to Friday inclusive (usually held in CERN building 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

Monday:

See the weekly joint operations meeting minutes

Additional Material: There will be LCG OPN tests Wednesday 9 April from 15.00 to 19.00 CET. The plan of the test is here: https://twiki.cern.ch/twiki/bin/view/LHCOPN/BackupTest
RAL will be unreachable for 15-20 minutes between 16:45 and 17:15.
PIC will be unreachable for 15-20 minutes between 17:15 and 17:45.

The goal of the maintenance is to verify that all the backup solutions work as expected. The T1s with a backup link should be up all the time, but at the moment we cannot guarantee that this will be the case, and there may be outages at any time for any Tier 1.

Tuesday:

elog review: Nothing new

Experiments round table:

ATLAS (from K.Bos): we started our Functional Test (FT) around 3:30 pm today. We had decided to run at a rate as if we were taking data with the detector at 200 Hz for 10 hours per day. In that case we run at roughly 40% of the maximum required throughput, because we have 24 hours to export the data we produce in 10. This will also be the mode we run in during May, unless we are taking cosmic ray data, in which case the rate is determined by the TDAQ. We intend to stop the production on Friday morning and remove the left-over FT subscriptions on Sunday night, leaving the system 2.5 days to transfer all the data. On Monday morning we will analyze the results.
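
As a quick cross-check of the "roughly 40%" figure, here is a minimal sketch in Python using only the duty cycle quoted above (the 200 Hz trigger rate and the event size cancel out of the ratio):

    # Data are taken at 200 Hz for 10 hours per day, but can be exported
    # continuously over the full 24 hours, so the required export rate is
    # the peak (data-taking) rate scaled by the duty cycle 10/24.
    taking_hours = 10.0   # hours of data taking per day
    export_hours = 24.0   # hours available to export that day's data

    fraction_of_peak = taking_hours / export_hours
    print(f"required throughput ~ {fraction_of_peak:.0%} of the data-taking rate")
    # prints ~42%, i.e. the "roughly 40%" quoted above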

CMS: the recent SAM SRM test failures, due to information disappearing from the BDII, have been fixed.

LHCb: Preparing for infrastructure testing to start on 18 April. The slow throughput recently seen on CERN to CNAF traffic has been corrected (the cause is not known). The LHCb LFC is now replicated to NL-T1, and replication to PIC is scheduled to start next week.

Sites round table: RAL would like information on Tier 1 resource requirements for the May CCRC from CMS and ALICE as soon as possible. NL-T1 confirmed they expect new (2007 pledges) hardware resources to start being published during the last 2 weeks of May.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Wednesday:

elog review: Nothing new.

Experiments round table:

ATLAS (SC): This is a quick summary of the present status for ATLAS activities:

1) Calo data distribution has been very successful. In less than 24 hours the whole data sample was distributed according to the ATLAS computing model, with more than 90% of the data distributed in less than 6 hours. The overall data distribution comprised 412 dataset subscriptions, of which only 8 have not yet completed (all at NDGF, due to a space token configuration problem).

2) The Functional Test ran until about 8 PM yesterday evening, when an AFS problem (a hardware failure on a file server) prevented the machinery from continuing and manual intervention was needed. The machinery was restarted at 9 AM today and the activity is currently ongoing.

3) A problem with space tokens at NDGF was fixed but they now have a problem with their OPN network switch.

In addition, as a reminder, a major intervention is needed on the LFC at CERN to migrate SRMv1 entries to SRMv2 entries; the volume is more than 8M entries. This could be done dynamically, without stopping the service, but would take more than 20 hours and some degradation of the service would be observed. The alternative is a short update of about 1 hour, but this would require a service stoppage. This is being tested now and, if successful, could be scheduled for next Monday.

LHCb (RS): for the May CCRC they are requesting a new space token (LHCBUSER) of type 'custodial online' to be deployed at CERN and the Tier 1s. They would require 3.3 TB of such space at CERN, about 3 TB at NL-T1 and less at the other Tier 1s. When asked, JT said that NL-T1 should be able to provide this space provided it is within the LHCb MoU envelope.

Sites round table:

NL-T1 (JT) reported that ATLAS jobs disappeared rapidly at about 21.00 last night. K.Bos will follow up on what happened.

GRIF (MJ) plan to install the new version of DPM, which will correct ACL support for ATLAS. S.Jezeqel is going to provide a script to correct the ACLs of files already in the GRIF ATLAS DPM.

Core services (CERN) report: Experiments have been invited to join a production scale test of the SL4 version of the WMS.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB: The OPN backup link tests announced on Monday should now be starting.

Thursday:

elog review: Nothing new

Experiments round table:

ATLAS (SC): Here are the two main points for today:

1) The Functional Test is ongoing as scheduled. During the night an increase in the CASTOR tape migration queue was observed, caused by too many small files being produced at the T0. This was changed this morning: datasets now contain a factor of 10 fewer, but larger, files. The test will stop tomorrow.

Observations:

- RAL is in scheduled downtime. No RAW data have been assigned to RAL since yesterday (but AOD and ESD have; this will also be a test of recovery after the downtime).

- SARA is having dCache problems. Below is the explanation from Ron Trompert:

"Yesterday we upgraded dCache to 1.8.0-14 and Java 1.6. As a result we now see a number of problems: hanging processes and high loads on our dCache head node. We have already moved the dCache head node back to Java 1.5, which seemed to have reduced the load.

We are now in unscheduled maintenance to see if this is enough to solve the matter. If not, we will move back to Java 1.5 altogether."

- CNAF has problems at the tape endpoint. Since the end of the intervention on Tuesday they have been showing efficiencies between 30% and 60% for tape. The CNAF site managers believe this is "left over" from the intervention and are looking into it.

2) The intervention on the LFC for the srm.cern.ch -> srm-atlas.cern.ch migration has been postponed to next week. The procedure used last week for 1.5 M entries was too slow for the 8.5 M entries to be migrated (it did not finish in 24 hours). A new (faster) procedure is therefore being put in place, but this will imply some downtime (estimated at 2 hours). We are trying to pack everything into the hardware migration intervention for the ATLAS and WLCG databases next week. Maria Girone is collecting the various pieces of information and will present a proposal.
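
As a rough illustration of why the old procedure does not scale, here is a minimal back-of-the-envelope sketch in Python using only the figures quoted above (since the 1.5 M-entry run did not actually finish in 24 hours, the estimate is optimistic):

    # Extrapolate last week's migration rate to the 8.5 M entries still to do.
    entries_attempted = 1.5e6   # entries in last week's run
    hours_spent = 24.0          # the run did not finish in this time, so the
                                # true rate is even lower than estimated here
    entries_remaining = 8.5e6   # entries still to be migrated

    rate_per_hour = entries_attempted / hours_spent    # < 62,500 entries/hour
    hours_needed = entries_remaining / rate_per_hour   # > ~136 hours

    print(f"old procedure would need > {hours_needed:.0f} hours for 8.5 M entries")
    # hence the new bulk procedure, at the cost of an estimated 2 h downtime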

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday:

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report: The test migration to the new quad-core hardware was successfully performed last week on all the LHC offline Oracle databases concerned. The final migrations of the LHC RAC production databases to the new hardware will go ahead as planned next week, as follows:

CMSR & LHCBR: Tuesday 15.04
ATLR: Wednesday 16.04
LCGR: Thursday 17.04

These major migrations, which include a move from 32-bit to 64-bit, will be performed with a downtime of only two hours, thanks to the use of Oracle Data Guard.

The downstream databases for the ATLAS and LHCb Streams setups will be migrated to the new hardware next week, on Tuesday 15th April and Wednesday 16th April.

The LFC deployment team has proposed a clean-up of the ATLAS LFC local catalog deployed on the LCG RAC, which will follow the hardware upgrade. The ATLAS local LFC will not be available for about two hours.

A blocking issue with the monitoring of the storage arrays on RAC5 and RAC6 has been solved with the help of IT-FIO-TSI. Many thanks.

The new Infortrend storage has shown a few more problems with the controllers. So far we have found 5 controllers out of the 60 new arrays (3 last week and 2 this week) with issues that required vendor intervention.

CNAF finished the computer center move last Monday, 7th April, after 2 weeks of downtime. LFC and LHCb replication was synchronized in less than 1 hour. ATLAS replication is still pending because CNAF is now migrating the production servers for ATLAS; the CNAF ATLAS database is scheduled to be ready next Monday, 14th April.

BNL successfully completed an intervention to upgrade the ATLAS production servers from 32-bit to 64-bit. Even though the intervention was extended by one more day due to some complications, there was no impact on the ATLAS replication environment.

Monitoring / dashboard report:

Release update:

AOB:
