-- HarryRenshall - 28 Mar 2008

Week of 080331

Open Actions from last week:

Daily CCRC'08 Call details

To join the call at 15.00 CET, Tuesday to Friday inclusive (usually in CERN building 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

Monday:

See the weekly joint operations meeting minutes

Additional Material:

ATLAS: 1) The T0 throughput test finished over the weekend. On Saturday a problem with tape migration into CASTOR occurred; it was fixed on Sunday and needs to be followed up with the CASTOR people (Simone). As a consequence, the writing rate into CASTOR had to be reduced, otherwise the disk would have filled up. Before the migration problem the write rate into CASTOR was approximately 1 GB/s, with data exported according to the computing model. After the problem the rate was reduced to 250 MB/s, but RAW and ESD were oversubscribed to multiple T1s to keep exports at 100% of the nominal rate (a rough sketch of this arithmetic follows below). Most T1s ran out of free space: at the end of Sunday, when the exercise stopped, only RAL, TRIUMF and BNL had free space left. RAL managed to cope with the rate, TRIUMF ended up with a backlog, and BNL only came into the game on Saturday (a problem with site services first, and with SRM space-token mapping later).
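
A rough sketch of the oversubscription arithmetic, in Python. Only the 250 MB/s reduced write rate and the ~650 MB/s nominal export rate (quoted in the Friday plan below) come from these minutes; the function and its name are illustrative, not part of the actual DDM tooling.

    def replicas_for_nominal(nominal_export_mb_s, write_rate_mb_s):
        # Average number of T1 subscriptions per file needed so that the
        # aggregate export rate stays at nominal when the CASTOR write
        # rate is reduced: each extra subscription re-exports the same data.
        return nominal_export_mb_s / write_rate_mb_s

    # ~2.6, i.e. oversubscribe RAW/ESD to roughly 3 T1s on average
    print(replicas_for_nominal(650.0, 250.0))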

2) The dashboard has problems, maybe an effect of the change to summer time on Saturday night? BNL callbacks are still not being displayed, and File Done events are not appearing. The dashboard people are investigating.

3) Today is cleanup day for the TT files. BNL and NDGF will have to take care of the cleanup themselves (the centralized tool supports only the LFC catalog). LYON and FZK suffered from dCache problems during massive centralized deletions via SRM, and will therefore run the cleanup locally (a sketch of a gentler, batched deletion follows below). All other sites will be cleaned up centrally. Performance will be measured and the results discussed in the following days.
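
A minimal sketch of the kind of throttled, batched deletion that avoids overloading a dCache SRM the way the massive centralized deletions did at LYON and FZK. This is an illustrative assumption, not the actual cleanup tool: the lcg-del invocation reflects the gLite clients of the time, and the batch size and pause are arbitrary.

    import subprocess
    import time

    BATCH_SIZE = 50       # delete in small batches instead of one massive request
    PAUSE_SECONDS = 10    # give the SRM time to drain its request queue

    def delete_in_batches(lfns):
        for i in range(0, len(lfns), BATCH_SIZE):
            for lfn in lfns[i:i + BATCH_SIZE]:
                # -a removes all replicas and the catalog entry for this LFN
                subprocess.run(["lcg-del", "-a", "--vo", "atlas", lfn], check=False)
            time.sleep(PAUSE_SECONDS)

    # Hypothetical usage, with a made-up input file of LFNs to clean:
    # delete_in_batches(open("tt_files_to_clean.txt").read().split())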

4) Today the T1-T1 test should start for NDGF and RAL. I contacted people in the two clouds, pointing them to Stephane's twiki with the instructions and reminding them to use the elog. For NDGF, Birger first had to update to 0.6; he did so in the morning.

Tuesday:

elog review:

Experiment report(s):

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

Questions/comments from sites/experiments:

AOB: No meeting today as we have a full-day CCRC Face-to-Face meeting.

Wednesday:

elog review:

Experiment report(s):

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

Questions/comments from sites/experiments:

AOB: No meeting today as we have a full-day Grid Deployment Board meeting.

Thursday:

elog review:

Experiment report(s):

ATLAS: 1) T1-T1 transfer tests are ongoing for RAL and NDGF. Very slow progress. There are problems most likely at the DDM site service level. Experts are investigating.

2) Migration to SRM2 for CASTOR at CERN: srm-durable-atlas was shut down this morning (Jan). The migration of LFC entries is done for the central LFC catalog and is ongoing for the local LFC catalog (Sophie); it will take a while (more than 1M entries). The DDM location catalog will be updated by Vincent Garonne once the LFC intervention is finished. In fact, the LFC intervention was completed at about 14.00.

3) Discussion with the CASTOR team (Olof) about how to protect CASTOR disk pools from "internal" migration (disk-to-disk and tape-to-disk). This needs a new release of CASTOR, which is available; the CASTOR team will schedule an upgrade for next week.

ALICE: T1 transfers are ongoing, but ALICE will not get busy again until 15 April, when a new commissioning exercise starts.

LHCb: Cleaning up old data, planning to finish by 18 April. Also preparing the new DIRAC and GANGA for 18 April. Testing the conditions database from the pit to the CC, and also streams replication to the Tier 1s (except CNAF, which is still down).

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

Questions/comments from sites/experiments: CNAF should be back in prod. by 8 April.

AOB: J.Shiers said it is strange that some sites use MySQL for the ATLAS conditions DB and Oracle for that of LHCb. There is a request from network support to stress the T0 to T1 links next Tuesday, 8 April; CERN could push some dteam traffic.

Friday:

elog review: Nothing new

Experiment report(s):

ATLAS: Summary of this week's status:

1) The T1-T1 test for RAL and NDGF spotted problems at the ATLAS DDM level. A patch will be available today; it will be installed on the ATLAS VOBOX serving T0 exports and tested in next week's functional tests.

2) srm-durable-atlas.cern.ch (SRMv1) has been decommissioned. The migration was smooth; thanks to all (FIO, DDM developers). Next week I will schedule the same exercise for srm.cern.ch; an email describing the request, together with tickets in Remedy, will come on Monday. Over the weekend some cleanup of the pool 'atldata' (served until now by srm-durable-atlas) will be done. There is a good chance that castorgrid.cern.ch can also be decommissioned for ATLAS next week; Kors is following this up with the users. If so, it will be another intervention to be scheduled (possibly together with the one for srm.cern.ch).

Next Week:

1) Functional test: 3 days (Tuesday to Thursday) of nominal-rate exports (~650 MB/s out of CERN). Monday will be dedicated to testing the machinery. Metrics: a T1 passes the test if by the end of the week (Sunday) 95% of the files are at the T1. In addition, 95% of the datasets must be complete at the T1 (note this is not the same thing).

From 95% to 100%: full success
From 90% to 95%: partial success
Less than 90%: failure

In case of discrepancy between the percentage of files and the percentage of datasets, the worse of the two is considered (a sketch of the verdict computation follows below). Data will be exported following MoU shares and according to the computing model.
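
A short illustration of how the verdict above could be computed per T1. The function name and the example numbers are made up; the real accounting lives in the dashboard.

    def verdict(files_done, files_total, datasets_complete, datasets_total):
        # A T1 is judged on file completeness AND dataset completeness;
        # in case of discrepancy the worse of the two percentages counts.
        pct_files = 100.0 * files_done / files_total
        pct_datasets = 100.0 * datasets_complete / datasets_total
        worst = min(pct_files, pct_datasets)
        if worst >= 95:
            return "full success"
        if worst >= 90:
            return "partial success"
        return "failure"

    # Made-up example: 96% of files but only 93% of datasets complete,
    # so the dataset figure dominates and the T1 gets "partial success".
    print(verdict(960, 1000, 93, 100))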

2) Export of cosmic data for a given sub-detector (Muons?). Data exported following MoU shares and according to the computing model (there is still a question mark on this; it is being discussed within ATLAS management).

3) Decommissioning of SRMv1 at CERN (see above)

Activities 1) and 2) will use the new plugin for T0->T1 distribution.

Overall ... a busy week.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

Questions/comments from sites/experiments: J.Templon asked if it was correct that LHCb have stopped running jobs at NL-T1. The answer was given later by Roberto: effectively there are only three jobs running (from the DIRAC2 system, normal production). The reason is that LHCb are now concentrating on the preparation of CCRC and are simply finalizing some pending stripping productions (commissioned more than a year ago), for which the site has most likely already exhausted its share (meaning there is no further data to be processed), and there are no jobs targeted for NIKHEF in the pipeline.

AOB: The CS group plan to test the LHCOPN backup links next Wednesday, 9 April, in the afternoon. Traffic to PIC and RAL, which have no backup, will be disturbed. The plan of the test is here: https://twiki.cern.ch/twiki/bin/view/LHCOPN/BackupTest

RAL will be unreachable for 15-20 minutes between 16:45 and 17:15.
PIC will be unreachable for 15-20 minutes between 17:15 and 17:45.

The goal of the maintenance is to verify that all the backup solutions work as expected. The T1s with a backup link should be up all the time, but at the moment I cannot guarantee that this will be the case. You should expect outages at any time for any Tier 1.
