Week of 080512

Open Actions from last week:

Daily CCRC'08 Call details

To join the call at 15:00 CET, Tuesday to Friday inclusive (usually held in CERN bat 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

Monday:

See the weekly joint operations meeting minutes

Additional Material:

Tuesday:

Attendance: local(Jamie, JT, Miguel, Ricardo, Patricia, Roberto, Simone, Andrea, James, Daniele, 70569, Stephen); remote(Gonzalo, Kors, Vincenzo Spinoso/Bari).

elog review:

Experiments round table:

  • ATLAS: Simone - started T1-T1 tests at ~10:00 as planned. 620 datasets subscribed across all sites (~10% at each, with BNL 25%, SARA 15%, etc.); datasets replicated to all other T1s. Monitored in the dashboard. The first 4 hours showed PIC, NDGF and ASGC keeping up with 350-400MB/s - performing very well. 1/3 of the sample done already; will complete in < 1 day (target ~2 days). ATLAS contact points at sites to follow up. T1s not performing so efficiently, please investigate! Question on monitoring - nothing visible in GridView for FTS transfers at Tier1s with dCache: dCache transfers don't show up, so no info available. Could mine the Globus gridftp logs (a sketch of this follows at the end of this round table) - done for internal logging but not generally available. James to follow up. Agreed with CMS to overlap. How can we see what's going on? Kors - the 3 best sites are normally the 'weakest' in the other direction (import vs export) - is this understood? Kors - results remarkably good, impressed. Post-mortem on last week by end of this week. Basically all sites made the target except NDGF, for known reasons. Throughput to SARA 4MB/s per file - would expect ~20MB/s - being followed up. Non-OPN transfers working better! Looking into the OPN. JT: 1) anything special BNL-NIKHEF? Looks like BNL pulling files off the NIKHEF DPM. A: aggregation of data in order to produce bytestream for FDR-2, being done at BNL (RDO files). (OK). 2) Is it possible from FTS to understand the underlying gridftp errors? Publishing of FTS logs?

  • CMS: Daniele - outcome of the last days posted to the observation section of the elogs. CMS T1-T1 testing also this week (started last night). Stable transfers among some of the T1s. Need help/support checking FTS servers and monitoring info. Hourly peaks > 500MB/s, on average 300-350MB/s of aggregate fake traffic. T1s still struggling to export pre-production data to CERN. Some of the outbound traffic to the WAN is also to CERN. Good if we can superimpose some days with ATLAS... Monitoring important!

  • ALICE: Patricia - still figuring out how to deploy AliEn 2.15 for the commissioning exercise. 20 sites still have the old VO box version. A GGUS ticket was sent to each ROC; only 2 ROCs have responded. These sites will not enter the exercise! (CERN, UK and DE OK - all others not!) Italy and France have many Tier2s plus the Tier1 not upgraded. Sites should upgrade VO boxes to SLC4 plus the gLite 3.1 version.

  • LHCb: Roberto - during the long w/e, ramp-up to the target rate for CCRC without GridKA and NIKHEF (some staging issues). Ran a few thousand reconstruction jobs concurrently, as per the breakdown from Nick. Not running stripping - the online people have not activated the book-keeping workflow which is required. No major problems in raw data distribution (pit - Tier0 - Tier1). Problem with CASTOR this morning - a problem registering transfer requests into Oracle - see GGUS ticket. Problem fixed? (No longer occurs...) No known outstanding T1-T1 issues, but pending the start of stripping. Had to ban GridKA as lcg-ls is very slow; the stager agent is affected -> affects other SRM interactions. GGUS ticket. Some issues with files -> IN2P3 (no space available); affects only a few sites. Issue at SARA with the gsidcap server, restarted yesterday, thanks to JT. Question from this morning's PASTE meeting: is it good practice to run the SRM and gsidcap server on the same box? About 60 concurrent jobs at PIC - in line with expectations.
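
Illustration (hedged): the ATLAS point above about mining the Globus gridftp logs could in principle be done with a small script like the sketch below, which summarises completed transfers from a gridftp transfer log. The log path and the key=value field layout (DATE, START, NBYTES, TYPE, CODE) follow the common globus-gridftp-server transfer-log format, but both are assumptions here and would need adjusting to the actual dCache door configuration at each site.

    # Hedged sketch: per-file and aggregate rates from a globus-gridftp-server
    # transfer log, as a stop-gap where dCache transfers are invisible in GridView.
    # LOGFILE location and field layout are assumptions, not a site's real config.
    import re
    from datetime import datetime

    LOGFILE = "/var/log/gridftp-transfer.log"      # assumed location
    KV = re.compile(r'(\S+)=(\[[^\]]*\]|\S+)')     # key=value pairs, incl. DEST=[ip]

    def ts(value):
        # timestamps look like 20080513153045.123456
        return datetime.strptime(value.split('.')[0], "%Y%m%d%H%M%S")

    total_bytes = total_secs = 0.0
    for line in open(LOGFILE):
        rec = dict(KV.findall(line))
        if rec.get("CODE") != "226":               # keep only completed transfers
            continue
        nbytes = float(rec.get("NBYTES", 0))
        secs = (ts(rec["DATE"]) - ts(rec["START"])).total_seconds()
        if secs <= 0:
            continue
        total_bytes += nbytes
        total_secs += secs
        print("%-4s %7.1f MB/s  %s" % (rec.get("TYPE", "?"), nbytes / secs / 1e6,
                                       rec.get("FILE", "?")))
    if total_secs:
        print("aggregate mean per-file rate: %.1f MB/s" % (total_bytes / total_secs / 1e6))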

Sites round table:

Core services (CERN) report:

  • Extreme locking issue with SRM_ATLAS plus deadlocks in SRM_SHARED, plus a huge number of sessions. Ticket from the end of last week about some failures accessing SRM_ATLAS. A: many file deletions, with multiple attempts to remove the same file. If the file does not exist, srmrm can take a long time. Why is this happening? Not done multiple times deliberately, but a delete of a non-existent file can occur as a consistency check (single delete). -> Follow-up. (Synchronous requests to delete 10 files every 10s.) Is this the same problem?
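
Illustration (hedged): one possible client-side mitigation for the slow-srmrm-on-missing-files pattern above is to check for existence once and never re-issue a delete for a file already known to be absent. The sketch below only illustrates that idea; srm_exists() and srm_remove() are hypothetical placeholders for whatever SRM client calls the experiment framework actually uses.

    # Hedged sketch: delete each SURL at most once and skip files already
    # confirmed absent, so a consistency check does not trigger repeated,
    # slow srmrm calls on non-existent files. srm_exists/srm_remove are
    # hypothetical placeholders, not a real client API.
    def srm_exists(surl):
        raise NotImplementedError("wrap the site's SRM client (e.g. an srmls call)")

    def srm_remove(surl):
        raise NotImplementedError("wrap the site's SRM client (e.g. an srmrm call)")

    def delete_batch(surls, known_absent):
        """Remove each SURL once; record absences so they are never retried."""
        for surl in surls:
            if surl in known_absent:
                continue                      # already confirmed gone, no srmrm
            if not srm_exists(surl):
                known_absent.add(surl)        # consistency satisfied without srmrm
                continue
            srm_remove(surl)
            known_absent.add(surl)

    # usage: keep one shared set across the periodic (e.g. every 10s) batches
    # absent = set(); delete_batch(next_ten_surls, absent)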

DB services (CERN) report:

  • Streams replication online -> offline for ATLAS down since this morning due to a data consistency issue; ATLAS people investigating. Streams monitoring is back up. Cleanup of the repository performed.

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

No meeting due to GDB

Thursday

Attendance: local(Gavin, Jan, Simone); remote(Jamie, Jeremy Coles (GridPP), Daniele Bonacorsi, Stephen Gowdy, Derek Ross, Andreas/FZK, Michel Jouvin).

elog review:

Experiments round table:

  • ALICE: No report

  • ATLAS: Please look at the slide summary from Simone (attached) describing the outcome of the T1-T1 tests. Summary: 3 days of the T1-T1 full matrix tested all together, 2GB/s cumulative peak on the 1st day, 700MB/s on days 2 and 3. Most channels look good. INFN had an issue importing data and SARA had issues exporting data. Discussion will happen in the ATLAS meeting and then more at the June post-mortem. FTS channel settings: there is a large difference (and no common agreement) in the way sites set the FTS parameters (e.g. default retries, number of concurrent transfers, number of TCP streams) - this needs to be reviewed properly in the post-mortem (a small sketch for comparing per-channel settings follows this round table). A DDM bug was noted that prevented some datasets from being copied properly. ATLAS appreciates the efforts made to install FTM everywhere. Next week: Monday cosmics. Tuesday: T0-export at 200% of the computing-model rate (about 1GB/s out of CERN (?)), will run all week, possibly into next weekend. Space needed at T1 sites is noted on the slides - please check you have it available. Lots of sites have been using the eLog to report issues - this is much appreciated!

  • LHCb (eLogger from https://prod-grid-logger.cern.ch/elog/CCRC'08+Observations/92): As per the discussion we had this morning, we can summarise that CCRC is going fairly smoothly everywhere, apart from a few minor issues (still under investigation and logged with the related GGUS tickets in this elog). Tier-0 RAW data export is running extremely well (the first plot attached to this report shows the throughput achieved and the second plot the data transfer quality). The second plot shows some issues towards IN2P3: a few files were left in an inconsistent state in the IN2P3 dCache (see report https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/283), which the DIRAC Data Transfer Agent retries indefinitely until the data reach their final destination.
    • Reconstruction: we are currently running (smoothly) 1500 jobs in the system (see the third plot, showing the cumulative number of jobs run on the system) and we did not observe any outstanding issue with file access at any site. We managed to run at all sites (within +/- 10%) the foreseen number of reconstruction jobs, according to the shares and numbers presented at the beginning of CCRC at the April WLCG workshop. Each site got its share and all job slots (allocated for CCRC at sites) were filled!
    • We had a few issues yesterday with all job gsidcap connections stuck at IN2P3 (see the fourth plot, on CPU efficiency). Most likely this was due to the restart of the dCache system after the network outage experienced at the CC-IN2P3 centre yesterday. We are in direct contact with the local site managers to better understand the reason and debug it if needed (no GGUS ticket has been submitted, though).
    • Stager agent activity: a few issues (now resolved) at IN2P3 and GridKA (see the elog entries and related GGUS tickets): https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/281 and https://prod-grid-logger.cern.ch/elog/CCRC'08+Logbook/279. Apart from that (as visible from the very last plot, wrongly labelled "Transfer Quality by destination"), no major issues were observed in the metadata retrieval activity either.
    • Stripping: the workflow is now ready, the ancestor file information is now in the LHCb Bookkeeping, and Joel reports that a few more test jobs are OK from the application point of view. We plan to ramp up stripping to the targeted rate as well, but first need to sort out some more DIRAC-specific issues. We hope to run stripping at the nominal rate, fully automatically, next week.
    • Analysis: GANGA still needs to be interfaced with DIRAC3. Not expected to come soon!

  • CMS: Processing is going well at CERN for CSA. T0-export - mostly fine, with some Castor issues being followed up. Analysis tests running on T2 sites - the results from these will be collated and reported soon. T1-T1 tests worked reasonably well - still to analyse the interference (or not) with ATLAS. Will start T1-T2 transfers early next week (ex-regional ones), not the full commissioned matrix, but a subset. MC running well at T1 sites, no issues with CPU - more skimming processing next week at T1 sites.
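
Illustration (hedged), as referenced in the ATLAS report above: a minimal sketch for laying out per-channel FTS parameters side by side, so that differences in retries, concurrent files and TCP streams stand out for the post-mortem. All channel names and values below are invented placeholders, not the real settings of any Tier1.

    # Hedged sketch: tabulate FTS channel parameters per site to spot where
    # there is no common agreement. The dictionary would be filled from each
    # site's channel configuration; values here are invented placeholders.
    channel_settings = {
        "STAR-SITEA": {"retries": 3, "files": 30, "streams": 10},
        "STAR-SITEB": {"retries": 0, "files": 50, "streams": 1},
        "STAR-SITEC": {"retries": 1, "files": 20, "streams": 5},
    }

    params = ["retries", "files", "streams"]
    print("%-12s" % "channel" + "".join("%9s" % p for p in params))
    for channel in sorted(channel_settings):
        cfg = channel_settings[channel]
        print("%-12s" % channel + "".join("%9s" % cfg.get(p, "-") for p in params))

    for p in params:
        values = sorted({cfg.get(p) for cfg in channel_settings.values()})
        if len(values) > 1:
            print("no common setting for '%s': %s" % (p, values))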

Sites round table:

  • NIKHEF: In case you have not heard via other channels: all worker nodes of the Nikhef cluster have been turned off. Other grid services remain up and running (like our BDII, RB, DPM-SE, etc).
    We had a failure of a cooling unit yesterday. A provisional repair is in place, but this doesn't give us full capacity.
    A new (additional) cooling unit was already in the process of being placed, however it is not completely ready yet.
    The WNs will most likely remain off for the entire weekend.

  • BNL: (update from Tuesday evening): We observed a very low incoming rate following the start of the ATLAS Tier-1 - Tier-1 transfers (<30 MB/s, while outgoing was 400-500 MB/s) and found that the FTS channels pointing from the ATLAS Tier-1 centers to BNL were clogged with transfers related to Raw Data Object replication in preparation for FDR-2 in June (RDOs are produced in all ATLAS clouds and are needed for bytestream/mixing production at BNL). The FTS channel priority had to be adjusted from 3 to 5 at CERN and from 5 to 3 at BNL to increase the incoming rate for the CCRC stream into BNL to a decent level. In summary, CCRC transfers are competing with ATLAS mc08 dataset replication; changing the FTS priority to 5 for the CCRC-related transfers gave them priority over MC replication (a toy illustration of this priority effect follows this list). Other ATLAS Tier-1 centers do not observe this effect because their incoming MC-related transfer rate is small compared to what we observe at BNL. Following these adjustments the CCRC-related rate went up around 5pm (CDT) to 180-200 MB/s and was maintained at this level for more than 6 hours (continuing).

  • FTM installation problem reported at FZK - FTM is OK (you should make sure you install all 3 schema files!).

  • All sites are encouraged to install FTM as soon as they can. Apologies that it was not clearly specified in the baseline services document (to be updated).
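
Illustration (hedged), as referenced in the BNL report above: that report implies that queued jobs on an FTS channel are served in priority order, so with a fixed number of concurrent-transfer slots the higher-priority stream fills the channel first while its backlog lasts. The sketch below is a toy model of strict priority ordering under that assumption, not the real FTS scheduler, and all numbers are invented.

    # Toy model (not the real FTS scheduler): with a fixed slot limit on a
    # channel, serving queued jobs highest-priority-first lets the CCRC stream
    # (priority 5) crowd out MC replication (priority 3) while CCRC has backlog.
    def fill_slots(queued_jobs, slots):
        """Pick up to `slots` jobs, highest priority first (stable within a priority)."""
        return sorted(queued_jobs, key=lambda job: -job["priority"])[:slots]

    queue = ([{"stream": "mc08-replication", "priority": 3}] * 40 +
             [{"stream": "ccrc-t1t1", "priority": 5}] * 40)   # invented backlog

    active = fill_slots(queue, slots=30)
    for stream in ("ccrc-t1t1", "mc08-replication"):
        used = sum(1 for job in active if job["stream"] == stream)
        print("%-18s %2d of 30 slots" % (stream, used))
    # -> all 30 slots go to the priority-5 CCRC jobs while their backlog lasts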

Core services (CERN) report:

  • Castor ALICE instance upgraded to 2.1.7.
  • Garbage collection on t1transfer pool (for T0 export) had a problem yesterday - it was deleting files before they were exported. Problem was fixed during the night and this morning.
  • Various SRM2-related Castor issues causing service degradation (deadlocks in the DB). These will likely need code fixes to address. See: https://prod-grid-logger.cern.ch/elog/Data+Operations/2

DB services (CERN) report: No report.

Monitoring / dashboard report: No report.

Release update: No report.

AOB: -

Friday

Attendance: local(Patricia, Roberto, Nick, Flavia, Jamie, Miguel, Jan, Gavin, Ale, Sophie, Julia, Maria); remote(Derek, Stephen).

elog review:

Experiments round table:

  • ALICE: only 2 sites (Birmingham & Kosice) have yet to upgrade; for the others the migration is done or ongoing. Transfers - beginning to test channels. RAL - xrootd interface for CASTOR2 - no news. Can still transfer files using FTS but cannot read them! Derek: not expecting to get xrootd in, but the SRM endpoint should be ready from Monday.
  • LHCb: long observation + plots yesterday. All sites getting their share of data & jobs - running like a dream, all job slots filled. Problem with NIKHEF (cooling went down) - CIC broadcast? GridKA: hundreds of jobs with dcap problems; restarting the server? SARA: 'empty response' issue preventing rDST upload to the SE. The dCache upgrade to p3 fixed the problem but introduced another - all space token definitions lost! Thus all FTS transfers stopped. (Flavia - not lost, but not visible through SRM(?!)) CNAF: StoRM issue - authentication to the pure-disk endpoint. Stripping - a few tests / jobs per site as per yesterday; need to restart the test - ramp up stripping next week, then run the whole chain.
  • ATLAS: finished the T1-T1 tests yesterday. Went pretty well for almost all sites. Yesterday afternoon during the ATLAS ops meeting each T1 provided slides with observations: gridftp load, FTS channel settings etc. - very useful, could be useful for other VOs. CNAF: under investigation, will do another small but similar test to understand the problems further (this w/e). Detector data taking - this w/e dedicated to M7; ready to distribute the data if it arrives.
  • CMS: start T1-T2 transfers today. T1-T1 quite successful as reported yesterday.

Sites round table:

  • RAL: CASTOR update: misconfiguration - reverted. Took service out for a couple of hours.
  • NIKHEF: WN update?
  • GRIF: we proceeded with some transfer tests from CCIN2P3 to GRIF/LAL using DDM and the new 5 Gb/s link between Lyon and GRIF. Without any specific tuning, we were able to achieve a sustained throughput of 200 MB/s (several peaks at 250 MB/s) without any transfer error.
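
As a quick, hedged sanity check on the GRIF numbers above (assuming decimal units and ignoring protocol overhead), the reported rates use roughly a third of the new link's nominal capacity:

    # Back-of-the-envelope utilisation of the 5 Gb/s Lyon-GRIF link.
    # Assumes 1 Gb/s = 1e9 bit/s and 1 MB/s = 1e6 byte/s; ignores overhead.
    link_MBps = 5.0 * 1e9 / 8 / 1e6                    # ~625 MB/s ceiling
    for label, rate in (("sustained", 200.0), ("peak", 250.0)):
        print("%s: %.0f MB/s = %.0f%% of the %.0f MB/s ceiling"
              % (label, rate, 100 * rate / link_MBps, link_MBps))
    # sustained: 200 MB/s = 32%, peak: 250 MB/s = 40% of the 625 MB/s ceiling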

Core services (CERN) report:

  • SRM ATLAS suffered from instabilities during Thursday. Problems were fixed by restarting the srmServer processes on two servers at 5:00 p.m.; the problems had started even late Wednesday. No operator alarms - working on improving the infrastructure. No report from ATLAS, despite lots of transfers failing.
  • SRM problems - report from Shaun: close to a solution on one of the deadlock problems; will then focus on the 'stuck connections'.

DB services (CERN) report:

Monitoring / dashboard report:

  • Problem with T1-T1 transfers not being visible in GridView - T1s are installing FTM (required). Dashboards: new version for ATLAS following shifters' requests; CMS - the site status board has new info about available space on storage - probably not perfect, but better than a poke in the eye with a burnt stick. Admin interface for the GridMap for critical services - improvements to the LHCb view.

Release update:

  • High priority security update coming out today(!)

AOB:

Topic attachments

  • ATLAS-T1-T1overview.ppt (PowerPoint, 1013.5 K, attached 2008-05-15 by JamieShiers)