Week of 080512

Open Actions from last week:

Daily CCRC'08 Call details

To join the call at 15.00 CET, Tuesday to Friday inclusive (usually held in CERN bat. 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

Monday:

See the weekly joint operations meeting minutes

Additional Material:

Tuesday:

Attendance: local(Jamie, JT, Miguel, Ricardo, Patricia, Roberto, Simone, Andrea, James, Daniele, 70569, Stephen);remote(Gonzalo, Kors, Vincenzo Spinoso/Bari).

elog review:

Experiments round table:

  • ATLAS: Simone - started T1-T1 tests at ~10:00 as planned. 620 datasets subscribed (~10% at each) across all sites (BNL 25%, SARA 15%, etc.). Datasets replicated to all other T1s; monitored in the dashboard. First 4 hours showed PIC, NDGF, ASGC keeping up with 350-400MB/s - performing very well; 1/3 of the sample done already. Will complete in < 1 day (target ~2 days). ATLAS contact points at sites to follow up. T1s not performing so efficiently - please investigate! Q on monitoring - nothing visible in GridView; transfers use the FTS at the Tier1s, and dCache-dCache transfers don't show - no info available. Mine the Globus gridftp logs (see the sketch after this list) - done for internal logging but not generally available. James to follow up. Agreed with CMS to overlap. How can we see what's going on? Kors - the 3 best sites are normally 'weakest' in the other direction (import vs export) - is this understood? Kors - results remarkably good - impressed? Post-mortem on last week by end of this week. Basically all sites made the target except NDGF, for known reasons. Throughput to SARA 4MB/s per file - would expect ~20MB/s - being followed up. Non-OPN transfers working better! Looking into OPN. JT: 1) anything special BNL-NIKHEF? Looks like BNL pulling files off the NIKHEF DPM. A: aggregation of data in order to produce bytestream for FDR-2, being done at BNL; RDO files. (OK.) 2) Is it possible from FTS to understand the underlying gridftp errors? Publishing of FTS logs?

  • CMS: Daniele - outcome of the last days posted to the observation section of the elogs. CMS T1-T1 testing also this week (last night). Stable transfers amongst some of the T1s. Need help/support checking FTS servers and monitoring info. Hourly peaks > 500MB/s, on average 300-350MB/s of aggregate fake traffic. T1s still struggling to export pre-production data to CERN. Some of the outbound WAN traffic also goes to CERN. Good if we can superimpose some days with ATLAS... Monitoring important!

  • ALICE: Patricia - still figuring out how to put AliEn 2.15 in place for the commissioning exercise. 20 sites still have the old VO box version. GGUS ticket sent to each ROC; only 2 ROCs have responded. These sites will not enter the exercise! (CERN, UK and DE ok - all others not!) In Italy and France many Tier2s plus the Tier1 are not upgraded. Sites should upgrade VO boxes to SLC4 plus the gLite 3.1 version.

  • LHCb: Roberto - during the long w/e, ramp-up to the target rate for CCRC without GridKA and NIKHEF (some staging issues). Ran a few thousand reconstruction jobs concurrently as per the breakdown from Nick. Not running stripping - the online people have not activated the book-keeping workflow which is required. No major problems in raw data distribution (pit - Tier0 - Tier1). Problem with CASTOR this morning - problem registering transfer requests into Oracle - see GGUS ticket. Problem fixed? (No longer occurs...) No known outstanding T1-T1 issues, but pending the start of stripping. Had to ban GridKA as lcg-ls is very slow; the stager agent is affected -> affects other SRM interactions. GGUS ticket. Some issues with files -> IN2P3 (no space available); affects only a few sites. Issue at SARA with the gsidcap server, restarted yesterday, thanks to JT. Q from this morning's PASTE meeting: is it good practice to run the srm and gsidcap server on the same box? About 60 concurrent jobs at PIC - in line with expectation.
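
The ATLAS report above notes that dCache-dCache transfers are invisible in GridView and suggests mining the Globus gridftp logs instead. Below is a minimal Python sketch of that idea, assuming the common globus-gridftp-server transfer-log layout of space-separated KEY=VALUE fields (DATE, START, NBYTES, TYPE, CODE); the field names, timestamp format and log path are assumptions, not details confirmed in the minutes.

    #!/usr/bin/env python
    """Rough per-file throughput from globus-gridftp-server transfer logs.

    Assumptions (not confirmed in the minutes): the log uses space-separated
    KEY=VALUE pairs with DATE (transfer end), START, NBYTES, TYPE and CODE
    fields, and timestamps formatted as YYYYMMDDhhmmss.ffffff.
    """
    import sys
    from datetime import datetime

    def parse_ts(ts):
        # e.g. 20080513140000.123456
        return datetime.strptime(ts, "%Y%m%d%H%M%S.%f")

    def transfers(path):
        with open(path) as log:
            for line in log:
                fields = dict(p.split("=", 1) for p in line.split() if "=" in p)
                if fields.get("CODE") != "226":      # keep only completed transfers
                    continue
                try:
                    start = parse_ts(fields["START"])
                    end = parse_ts(fields["DATE"])
                    nbytes = int(fields["NBYTES"])
                except (KeyError, ValueError):
                    continue
                secs = max((end - start).total_seconds(), 1e-3)
                yield fields.get("FILE", "?"), fields.get("TYPE", "?"), nbytes, nbytes / secs / 1e6

    def main(path):
        total_bytes = 0
        for name, ttype, nbytes, rate in transfers(path):
            total_bytes += nbytes
            print("%-4s %8.1f MB/s  %s" % (ttype, rate, name))
        print("total volume: %.1f GB" % (total_bytes / 1e9))

    if __name__ == "__main__":
        main(sys.argv[1])

Per-file rates well below the ~20MB/s expected (as reported for SARA) could then be flagged for follow-up.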

Sites round table:

Core services (CERN) report:

  • Extreme locking issue with SRM_ATLAS plus deadlocks in SRM_SHARED, plus a huge number of sessions. Ticket from the end of last week about some failures accessing SRM_ATLAS. A: many file deletions, with multiple attempts to remove the same file. If the file does not exist, srmRm can take a long time. Why is this happening? Not done multiple times, but can occur as a consistency check (single delete). -> Follow-up. (Synchronous request to delete 10 files every 10s.) Is this the same problem? (A client-side sketch of the existence-check idea is below.)
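
The item above attributes part of the SRM_ATLAS load to repeated removals of files that no longer exist, a path on which srmRm is slow. The sketch below illustrates one possible client-side mitigation, checking existence before deleting; it assumes the gLite lcg_utils CLI (lcg-ls, lcg-del) is available and omits site-specific options, and it is an illustration rather than an agreed fix from the meeting.

    #!/usr/bin/env python
    """Sketch: avoid hammering the SRM with removals of non-existent files.

    Assumptions: the gLite lcg_utils CLI (lcg-ls, lcg-del) is on the path and
    SURLs come from some consistency-check list; real invocations would add
    site-specific options (BDII, SRM version) that are omitted here.
    """
    import subprocess
    import sys

    def exists(surl):
        """Treat a zero exit code from lcg-ls as 'the SRM knows this file'."""
        return subprocess.call(["lcg-ls", surl],
                               stdout=subprocess.DEVNULL,
                               stderr=subprocess.DEVNULL) == 0

    def cleanup(surls):
        """Delete only files that are actually there, skipping the slow
        'remove a file that does not exist' path seen on SRM_ATLAS."""
        for surl in surls:
            if exists(surl):
                subprocess.check_call(["lcg-del", surl])
            else:
                print("skipping (already gone): %s" % surl)

    if __name__ == "__main__":
        cleanup(line.strip() for line in sys.stdin if line.strip())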

DB services (CERN) report:

  • Streams online-to-offline replication for ATLAS down since this morning due to a data consistency issue. ATLAS people are investigating. Streams monitoring is back up. Cleanup of the repository performed.

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

No meeting due to GDB

Thursday

Attendance: local(Gavin, Jan);remote(Jamie, Jeremy Coles (GridPP), Daniele Bonacorsi, Stephen Gowdy, Derek Ross, Andreas/FZK, Michel Jouvin).

elog review:

Experiments round table:

  • ALICE:

  • ATLAS:

  • LHCb:

  • CMS:

Sites round table:

  • NIKHEF: In case you have not heard via other channels: all worker nodes of the Nikhef cluster have been turned off. Other grid services remain up and running (like our BDII, RB, DPM-SE, etc.).
    We had a failure of a cooling unit yesterday. A provisional repair is in place, but this does not give us full capacity.
    A new (additional) cooling unit was already in the process of being installed; however, it is not completely ready yet.
    The WNs will most likely remain off for the entire weekend.

  • BNL (update from Tuesday evening): We observed a very low incoming rate following the start of the ATLAS Tier-1 - Tier-1 transfers (<30 MB/s, while outgoing was 400-500 MB/s) and found the FTS channels pointing from the ATLAS Tier-1 centers to BNL were clogged with transfers related to Raw Data Object replication in preparation for FDR-2 in June (RDOs are produced in all ATLAS clouds and are needed for bytestream/mixing production at BNL). The FTS channel priority had to be adjusted from 3 to 5 at CERN and from 5 to 3 at BNL, respectively, to increase the incoming rate of the CCRC stream into BNL to a decent level. In summary, CCRC transfers are competing with ATLAS mc08 dataset replication; changing the FTS priority to 5 for the CCRC-related transfers gave them priority over MC replication. Other ATLAS Tier-1 centers do not observe this effect because their incoming MC-related transfer rate is small compared to what we observe at BNL. Following these adjustments the CCRC-related rate went up around 5pm (CDT) to 180-200 MB/s and has been maintained at this level for more than 6 hours (continuing).
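
To illustrate the effect described above: with both streams queued on the same channel, giving the CCRC jobs the higher priority lets them drain ahead of the priority-3 MC replication. The toy queue below is purely illustrative and is not the FTS scheduler; the job labels are taken from the report only as examples.

    import heapq

    class TransferQueue:
        """Toy priority queue mimicking 'higher FTS priority goes first'.
        Purely illustrative: not the real FTS channel scheduler."""

        def __init__(self):
            self._heap = []
            self._seq = 0          # preserves FIFO order within a priority

        def submit(self, job, priority):
            # heapq is a min-heap, so negate the priority (5 beats 3).
            heapq.heappush(self._heap, (-priority, self._seq, job))
            self._seq += 1

        def next_job(self):
            return heapq.heappop(self._heap)[2]

    q = TransferQueue()
    q.submit("mc08 RDO replication -> BNL", priority=3)   # pre-FDR-2 MC traffic
    q.submit("CCRC T1-T1 dataset -> BNL", priority=5)     # after the adjustment
    q.submit("mc08 RDO replication -> BNL", priority=3)

    while q._heap:
        print(q.next_job())
    # The CCRC job prints first, matching the observed effect of the change.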

Core services (CERN) report:

  • Castor ALICE instance upgraded to 2.1.7.
  • Garbage collection on the t1transfer pool (for T0 export) had a problem yesterday - it was deleting files before they had been exported. The problem was fixed during the night and this morning (a sketch of the intended safeguard follows after this list).
  • Various SRM2-related Castor issues. See: https://prod-grid-logger.cern.ch/elog/Data+Operations/2
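
The garbage-collection item above describes files on the t1transfer pool being removed before they had been exported. The sketch below shows the kind of safeguard implied, a hypothetical eligibility check that only considers already-exported files; the data model and policy are invented for illustration and are not the CASTOR garbage collector.

    from dataclasses import dataclass

    @dataclass
    class PoolFile:
        name: str
        size: int            # bytes
        exported: bool       # has the T0 export (e.g. copy to a T1) completed?

    def gc_candidates(files, bytes_needed):
        """Pick files to delete, never touching anything not yet exported.
        Hypothetical policy sketch, not the CASTOR garbage collector."""
        freed, victims = 0, []
        for f in sorted(files, key=lambda f: f.size, reverse=True):
            if not f.exported:
                continue             # the missing check behind yesterday's problem
            victims.append(f)
            freed += f.size
            if freed >= bytes_needed:
                break
        return victims

    pool = [PoolFile("run1.raw", 2_000_000_000, exported=True),
            PoolFile("run2.raw", 2_000_000_000, exported=False),
            PoolFile("run3.raw", 1_500_000_000, exported=True)]
    for f in gc_candidates(pool, bytes_needed=3_000_000_000):
        print("would delete", f.name)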

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Topic attachments:
  • ATLAS-T1-T1overview.ppt (PowerPoint, 1013.5 K, 2008-05-15 14:57, JamieShiers)