-- HarryRenshall - 30 Apr 2008

Week of 080505

Open Actions from last week:

Daily CCRC'08 Call details

To join the call, at 15.00 CET Tuesday to Friday inclusive (usually in CERN building 28-R-006), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

Monday:

See the weekly joint operations meeting minutes

Additional Material:

Tuesday:

Attendance
local (Maria Girone, Andrea Sciaba, Nick Thackray, Simone Campana, Antonio Retico, Roberto Santinelli, Patricia Mendez-Lorenzo, Jamie Shiers, Flavia Donno), remote (Derek Ross/RAL, Gonzalo Merino/PIC, JT/NL-T1, Daniele Bonacorsi/CMS&INFN)

elog review:

Experiments round table:

  • ATLAS: (Simone) all problems reported yesterday have been solved (very good!). New problems (2): RAL power outage affecting the whole RAL region; comment? Derek: CASTOR still down, hoping it will be up in the next 2 hours. SC: degradation of service - throughput to NDGF tape extremely poor - running with 1 tape drive. Target is 5% of RAW. Will not decrease throughput to NDGF. 2 days of grace time to recover the backlog at the end of the exercise (Thursday). Throughput to disk also affected (same FTS channel). Main problem yesterday (unable to deliver data for ~30') - some interaction between SRM & DB. --> Follow-up offline.

  • CMS: (Daniele) quite a few CASTOR problems. Mike Miller in contact with CASTOR operations. Problem fixed by mounting tapes by hand. Switched GC policy to FIFO(!) CASTOR also giving some difficulties at T1s: CASTOR stagers not on SL4 do not seem to work correctly with tape families (CNAF, also ASGC?); RAL ok - 2.1.6 on SLC4. Some issues setting up pools for more complex use cases (e.g. direction of some data to the CAF). Ramp-up going ok; CCRC waiting for things to stabilize before pushing for transfers. Would like (and are getting) attention on the CASTOR issues.

  • LHCb: (Roberto) all issues reported yesterday have been completely fixed (RAL, SARA, GridKA). Also happy with that(!) Next hours of activity: ramp-up should have started yesterday at 16:00. Ran a massive clean-up of February and last weekend's data: 2Hz deletion rate. Latest reconstruction & analysis s/w installed. BRUNEL jobs using the dCache 1.8 client are crashing - the work-around is to use the client installed on the WN, not the one shipped with the LHCb s/w. Problems with the SE at RAL, IN2P3 in downtime, NIKHEF dCache library inconsistency(?). JT: bug in 'passive mode' (??) Run jobs directly at SARA? Tentatively ramp up production from 16:00 today. Maybe CERN, PIC, CNAF & GridKA...

  • ALICE: (Patricia) quiet & will be until May 18 when the 3rd commissioning exercise starts. Migrating to AliEn 2.15. Getting rid of SLC3 VOBOXes; these will no longer be supported as of two weeks from now. Affects only Tier2s as Tier1s have already migrated - except NIKHEF?

Sites round table:

  • NL-T1: ROC North has not been very aggressive in assigning tickets from the region to the site. Request experiments to cc: the appropriate SARA/NIKHEF support e-mail when filing tickets. ATLAS - almost no jobs (SC to check); many 'file exists' errors in the DPM log. JT to send a log file snippet to SC.

Core services (CERN) report:

  • Note - atlas-operator-alarm@cern.ch mailing list now created. Members currently: Alexei Klimentov - PH/UAT <Alexei.Klimentov@cern.ch>, Birger Koblitz - IT/GS <Birger.Koblitz@cern.ch>, Armin Nairz - PH/ADP <Armin.Nairz@cern.ch>, Pavel Nevski - PH/UAT <Pavel.Nevski@cern.ch>, <jezequel@lapp.in2p3.fr>

DB services (CERN) report:

Monitoring / dashboard report:

  • nothing to report

Release update:

  • nothing special - 64bit WNs available in production later this week?

AOB:

  • Q for RAL - entry in elog; RAL channel setting of 1 active file per channel. Derek --> will ask the FTS admin to increase it.

Wednesday

Attendance: local(Ricardo, Julia, Roberto, Andrea, Jamie, Simone); remote(Stephen Gowdy, Daniele)

Follow-up from yesterday: URLs for RSS feeds are listed above. See also the atlas-operator-alarm@cern.ch mailing list. CMS & LHCb lists also exist - ownership & membership to be checked. An ALICE list has also been created (obsoleting alice-grid-alarm & atlas-grid-alarm).

elog review:

Experiments round table:

  • CMS: Daniele - CASTOR tape recall queue issue: need 3 actions (purge, block access to CRAB jobs at CERN, define better policies for the use cases in the coming weeks). Progress on the first 2. Stephen - moving files from the Tier1 transfer pool; cleared 1/2 of 24TB(?) Daniele - plenty of 'invalid state' files, to be dealt with by the CASTOR operations team. Known from CASTOR experience at Tier1s that this can happen and requires intervention. 2nd action: Ian and Oliver made some scripts, documented, blocking access to samples that are available at other sites and hence reducing load from users. Continue and report via GGUS. Reports are sent every 2 days and will be extracted and uploaded to the elog.
  • ATLAS: Simone - problems reported with several sites; many problems popping up but solved very fast. All Tier1s healthy except: NDGF - problems putting things on disk & tape, mainly due to lack of h/w; migration to tape not fast enough. The single FTS channel effectively limits transfers to both disk & tape. ATLAS DDM to throttle for this? 2 FTS channels - 1 for disk, 1 for tape. Feasible? Hacky? RFD. BNL - exhausted disk space(!)
  • LHCb: Roberto - all T1s healthy. Problem with the SRM endpoint at RAL. GGUS flow interesting(?): submitted 18:00 UTC, processed by TPM 07:00, fixed 11:00. Will adopt cc'ing sites when submitting GGUS tickets. Aborting ramp-up for the 2nd time. Problems with the online system; components work. dCache client library clash at NIKHEF solved. More news later...
  • ALICE: Patricia (via e-mail) - no news from ALICE. Due to ALICE's quiet period until the 18th, the CASTOR team has proposed to upgrade the CASTOR system for ALICE now, before the challenge. Besides this, and in order to push all sites to provide the updated VOBOX, I have submitted a GGUS ticket, cloned to all ROCs, to ensure this procedure.

Sites round table:

  • NDGF (via e-mail): I should clarify the problem. It isn't that we really lack tape drive capacity. The problem is that the current disk in front of the tape drives can't sustain more than 2-4 simultaneous transfers. This is throttled at the dCache level. What this means for FTS transfers is that if FTS schedules 50 tape transfers, the vast majority of these will get stuck in a queue. The interaction with disk transfers is that the "waiting" transfers eat up all the FTS transfer slots, leaving none for disk transfers. We used to have a problem with FTS timeouts killing throughput, but this should be worked around by now (at the cost of a few more transfer failures that need to be rescheduled). Faster disks in front of the tape are expected for next week; by then this should hopefully be not so bad. The "real" solution will be more tape libraries online in NDGF, which is scheduled for this summer. //Mattias Wadenstein (A small sketch illustrating this slot-starvation effect is included below, after the site reports.)

  • BNL (comments on yesterday's statement about disk-space from Michael Ernst):
    • Like many other Tier-1 sites, BNL has not yet received its FY-08 disk storage capacity (1.5 PB raw), even though the procurement process started as early as January
    • BNL is hosting almost all ATLAS data sets. As a consequence BNL is currently the most utilized ATLAS analysis center. We observe hundreds of thousands of jobs per month being submitted by users from all over the world, primarily from Europe
    • In anticipation of the upcoming FDR in early June ATLAS has asked BNL to do the so-called mixing step. These jobs have specific I/O requirements (reading from 50 input files per job simultaneously, 150-300 jobs running at all times) and it was found after lots of difficulties when running these jobs at CERN in preparation of FDR-1 that BNL is the only ATLAS Tier-1 that is capable of doing this for FDR-2. As a consequence BNL has to provide ~100TB of disk space to hold the input datasets (RDO’s) for the mixing jobs.
    • For production BNL has to provide all input datasets that are used for production jobs within the US cloud which requires about 400 TB of disk space.

      We are in the process of adding/freeing up ~50TB of disk space today. This is all we can do until either the new disk servers are deployed (were shipped last Friday) and/or the mixing step is completed (expected to last until end of May).
    • Two further updates:
    1. From Michael: I just received the information that our 31 thumpers have arrived at BNL. The storage management team is making every possible effort to bring up 3-5 units (or more than 100TB) asap. This may take until tomorrow noon. The installation work on the rest of them will proceed over the course of the next couple of weeks. // Michael
    2. From Simone: Just to mention that at 15:30 disk space was available at BNL (some cleanup was done, I believe). BNL started recovering from the backlog and kept 200MB/s for about 1h, by which time the backlog was recovered (there are some incomplete datasets because source files in CASTOR have been recycled and do not exist anymore). This means the CERN-BNL link seems to be healthy after the intervention by ESnet two weeks ago (we were observing a limit of 100MB/s some time before).

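The slot-starvation effect NDGF describes above can be shown with a small, purely conceptual simulation (this is not FTS code; the slot count, back-end limit and job numbers are illustrative assumptions): tape-bound transfers stuck waiting for staging keep their slots on the single shared channel, so disk-bound transfers queued behind them never get scheduled until the tape backlog drains.

from collections import deque

CHANNEL_SLOTS = 10        # concurrent-file limit on the single shared FTS channel (assumed)
TAPE_BACKEND_LIMIT = 3    # dCache-level throttle in front of the tape drives (assumed)

def simulate(tape_jobs, disk_jobs, ticks):
    """Toy model: one shared channel, FIFO slot assignment, throttled tape back-end."""
    queue = deque([("tape", i) for i in range(tape_jobs)] +
                  [("disk", i) for i in range(disk_jobs)])
    active = []                      # transfers currently holding a channel slot
    done = {"tape": 0, "disk": 0}
    for _ in range(ticks):
        # Fill free channel slots in FIFO order, regardless of destination.
        while queue and len(active) < CHANNEL_SLOTS:
            active.append(queue.popleft())
        # Only a few tape transfers can move per tick; the rest just sit on their
        # slot. Disk transfers complete as soon as they hold a slot.
        moving_tape = 0
        still_stuck = []
        for kind, jid in active:
            if kind == "disk":
                done["disk"] += 1
            elif moving_tape < TAPE_BACKEND_LIMIT:
                moving_tape += 1
                done["tape"] += 1
            else:
                still_stuck.append((kind, jid))   # stuck tape transfer keeps its slot
        active = still_stuck
    return done

# 50 queued tape transfers ahead of 20 disk transfers: after a few ticks the disk
# transfers have made no progress, because stuck tape transfers hold every slot.
print(simulate(tape_jobs=50, disk_jobs=20, ticks=5))

Splitting disk and tape traffic onto separate FTS channels, as raised in Tuesday's ATLAS report, removes exactly this coupling.
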
Core services (CERN) report:

  • Phone in the new Grid control room (513 R-068): "Your request has been taken into account. The work will be carried out at the beginning of week 20. Please excuse the inconvenience caused by this delay."

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

  • Open actions. NIKHEF not getting production jobs as disk full; file exists problem - DDM bug. "Fixed in next release". Available beginning of next week.

Thursday

Attendance: local(Olof, Luca, Roberto, Gavin, Ricardo, Daniele, Jamie, Miguel); remote (JT, Michael, Gonzalo, Stephen, Derek)

Carried over from yesterday:

elog review:

Experiments round table:

  • LHCb: elog issues - FTS shares on the CERN-GridKA channel too low - fixed (Gav). This also explained no jobs at GridKA. 2nd entry: 'severe' limitation on CASTOR for bulk removal requests: 20/25 files seems to be about the limit (Shauwn's recommendation after experience at RAL). dCache can handle ~1000 files in similar requests without the client dying. (A sketch of such batched removals is included below, after the experiment reports.) Data transfer - still not the full nominal rate: about 1 file / min = 30MB/s. After some problems on the LHCb side, finally managed to start. No problems. Next week will watch the system carefully and eventually ramp up to the final nominal rate.

  • CMS: concerning T0 workflows: still mainly CSA08. Pre-production of samples ~done. Transferring externally produced samples back. First samples to CAF with jobs there. Long staging times due to big tape queues at CERN. Investigated closely. Most CMS-side actions done: srmrm cleanup of 50% of files / disks; cleanup at DBS level to 'hide' from users data that are available elsewhere. -> Will sit down with CASTOR operations people next week to help optimize further. T1 workflows: 7 Tier1s (+PIC), all providing the needed # of CPUs (good!). 800 jobs at CERN, 3K at FNAL, several hundred each at other sites. gLite WMS in production at CNAF. Job tracking problem addressed on the CMS side. Reaching stable operations after 4 days of running. Tier2 workflows (analysis) - see elog entry. Distributed transfers T0-T1 ramping up, summed over the prod+debug FTS instances. Now arranging T1-T1 + T1-T2 traffic, trying to arrange superimposition of traffic with ATLAS (week after next). Some T2 transfers with the DDT team starting next week. T0-T1 tests in the 3rd week of May together with the full-chain exercise of ATLAS.

  • ATLAS: T1-T1 next week, throughput in week 3. Weekend between W2/W3: cosmic M7 data taking - priority! Data should start arriving in CASTOR next Friday. By the following Tuesday M7 data taking should be finished. Maybe postponed by 1 day? Throughput could continue into week 4. Current exercise (functional test) finished today at 10:00. Queue in LSF draining jobs; 2 days of 'grace' for sites to finish getting data according to the metric published earlier. Postmortem at the beginning of next week. Yesterday's issues solved - in a very short time - thanks! Ticket to castor.support about 2 glitches of 20-30', yesterday night and this morning. Not high priority but an answer would be appreciated. Sites recovering backlog, mostly SARA+NDGF.

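A minimal sketch of the batched bulk removals mentioned in the LHCb report above (this is not LHCb/DIRAC code): a large removal list is split into small bulk requests, since CASTOR-backed SRMs were reported to cope with only ~20-25 files per bulk request while dCache handles ~1000. The bulk_remove function, the pacing value and the example SURLs are hypothetical placeholders.

import time

CASTOR_BATCH = 20       # conservative per-request limit, per the report above
PAUSE_SECONDS = 1.0     # assumed pacing between bulk requests

def bulk_remove(surls):
    """Hypothetical stand-in for one SRM bulk-removal request."""
    print("removing %d files, first one: %s" % (len(surls), surls[0]))

def remove_in_batches(surls, batch_size=CASTOR_BATCH, pause=PAUSE_SECONDS):
    # Issue many small bulk requests instead of one huge one.
    for i in range(0, len(surls), batch_size):
        bulk_remove(surls[i:i + batch_size])
        time.sleep(pause)   # avoid hammering the stager with back-to-back bulks

if __name__ == "__main__":
    fake_surls = ["srm://example-se.cern.ch/castor/cern.ch/test/file%d" % i
                  for i in range(95)]
    remove_in_batches(fake_surls)
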
Sites round table:

  • NDGF - see comments above.

Core services report:

  • GOCDB (hosted at RAL): The GOCDB service was broken by a power outage at RAL on the morning of Tuesday 6th May. There was a corruption of the database journals which prevented recovery to the point of the break. It currently looks as if we have lost data from 12:50 GMT on Sunday 4th May to 06:00 GMT on Tuesday. The GOC apologises profusely for this break in service and loss of data. We assure you we are:
    1. working with Oracle to understand why the corruption occurred
    2. implementing a mirrored standby at another site to guard against future power or network breaks. This work had started but unfortunately was not yet fully in place.

      I (John Gordon) will issue another note when the service is restored.

  • Miguel: problem in a file application in CASTOR. Changed on the fly on the CMS instance after testing (but not the RPMs - to be scheduled). Patched version deployed for CASTOR as agreed this morning. This could explain this morning's problem: the scheduling plug-in was not working correctly. Will confirm with CMS tomorrow whether things are ok.

  • Roberto: does the computing model prevent users from staging files from tape? Daniele: every user can access data visible via DBS; CRAB jobs go to such sites.

  • JT: asks Simone to check what is needed wrt disk space in the coming 3 weeks. Simone: con-call yesterday with the developers; will be described at today's ATLAS operations meeting. Kors will send a mail describing space token, type and number. JT thanks LHCb for the jobs! (You're welcome!)

  • Gonzalo: still little load from LHCb. OK? Will increase share for PIC following today's production meeting.

  • Gonzalo: tape metrics - found tiny files from ATLAS being uploaded to tape. These are from the ATLAS SAM tests! (Every 2 hours.) Configured by hand so that the test directory is not migrated. Simone will follow up with Alessandro. A similar configuration is defined at CERN. Ale to follow up and contact the Tier1s.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

  • The release of gLite 3.1 Update 22, scheduled for today, was delayed due to the EGEEII-EGEEIII transition meeting. The software will be made available next Tuesday.

AOB:

Friday

Attendance: local(Luca, Julia, Ricardo, Roberto, Patricia, Simone, Nick, Jamie); remote(JT, Derek, Gonzalo, Stephen, Michael)

elog review:

Experiments round table:

  • ATLAS: Simone - 3 issues. 1) Problem with CASTOR@CERN - FTS transfers give many SRC timeouts; with the client tools, 2 out of 5 attempts to put files on CASTOR fail with a 30' timeout. elog + ticket -> castor support. 2) Bug in ATLAS DDM (previous release): many SURLs have the wrong format at the 8 T1s with LFC (+2). How many entries? Asked for a script from the LFC experts, both for Oracle & MySQL back-ends; will ask the T1s for help to run it. Applies also to the local LFC at CERN: 8.5M entries there. Michael: is BNL affected too? Simone: DDM 0.5 release; Miguel only mentioned LFC sites - will cross-check with Miguel. Clients can digest both 'good' and 'bad' formats. The 'file exists' problem is also (partially) caused by this. (An illustrative sketch of such a SURL rewrite is included below, after the experiment reports.) JT - Q: is there a real problem with the SE at SARA? Simone - this and another DDM bug. JT - there is an outstanding ticket; please put a comment in the ticket. 3) Throughput problem CERN-SARA. All sites basically finished within one hour after subscriptions ended; still a tail with NDGF & SARA. NDGF understood (see above). SARA? Throughput 30-40MB/s, even with a queue - not seen for other sites. Ricardo - comment on (1): too many threads talking to the CASTOR back-end. Could it be too many parallel bulk removes?? LHCb also observed similar problems at RAL - 20/25 concurrent files/request. Simone - the ATLAS deletion service does this. JT: on the service gridmap for ATLAS, looking at the T1 CEs' 'livestatus', NIKHEF is not mentioned. Julia - will investigate. For next week, the T1-T1 exercise will need 15TB of disk space free: NDGF, NIKHEF, INFN have problems. Now NIKHEF is ok, INFN today/Monday. NDGF? (Currently only 5TB free...)

  • ATLAS follow-up:
    1. Answer to Michael: the issue of SURLs with port numbers etc., according to the DDM developers, is not relevant for OSG and NDGF
    2. Activity on CASTOR: ATLAS is currently deleting data on CASTOR@CERN, but at quite low rate. In particular: one bulk deletion request every 10 seconds (synchronous, so requests do not overlap). Each bulk is 10 files.

  • ALICE: (Patricia, by e-mail) - coordinating the migration of all VOBOXes to the latest OS + middleware version before the commissioning exercise. Still 19 VOBOXes are running with the deprecated system. I have also submitted a ticket to GGUS, cloned to all ROCs, to ensure actions from the ROC responsible. All VOBOXes must have been upgraded, tested and validated before the 18th of May. Status of the VOBOX upgrades is available under: https://pcalimonitor.cern.ch:8443/admin/linux.jsp

  • LHCb: recons and data distribution running happily across all sites (at 50% of nominal for CCRC - next week will start increasing to the nominal rate, a factor >2 in the number of concurrent jobs at each T1). Small issues: short glitch on AFS, problem known & published on the service status board; CERN-GridKA transfer problem reported yesterday evening taken promptly by Doris - load problem in the SRM, now going better. SARA - data upload of recons output back to the rDST space at SARA gives 'empty response' - ticket. CERN-SARA transfers into the RAW space: initially a 'no free space' exception, then the 'file exists' problem. pnfs problem? GGUS ticket... LHCb issues: will be fixed by the DIRAC developers. Non-CCRC: many tape staging requests pending for 48h, under investigation. PIC (Gonzalo) - job load at PIC around 20 jobs, whereas there are 70 slots; plenty free... -> LHCb production meeting

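To illustrate the SURL-format issue in the ATLAS report above (this is not the LFC experts' script, which operated directly on the Oracle/MySQL back-ends): the 'bad' SURLs carry a port number (and, as an assumption here, possibly a web-service path), which a rewrite strips to give the 'good' short form. The hosts and paths below are invented examples.

import re

# Assumed "bad" form: srm://host:8443[/srm/managerv2?SFN=]/path  ->  srm://host/path
_BAD_SURL = re.compile(r"^srm://([^/:]+):\d+(?:/srm/managerv\d\?SFN=)?(/.*)$")

def normalise_surl(surl):
    """Strip the port (and an assumed web-service prefix) from a SURL, if present."""
    m = _BAD_SURL.match(surl)
    return "srm://%s%s" % (m.group(1), m.group(2)) if m else surl

examples = [
    "srm://srm.example-t1.org:8443/srm/managerv2?SFN=/pnfs/example-t1.org/data/atlas/file1",
    "srm://srm.example-t1.org:8443/pnfs/example-t1.org/data/atlas/file2",
    "srm://srm.example-t1.org/pnfs/example-t1.org/data/atlas/file3",   # already in the 'good' form
]
for s in examples:
    print(normalise_surl(s))

Since clients can digest both formats (as noted above), such a rewrite is a consistency clean-up of the catalogue rather than an urgent operational fix.
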
Sites round table:

  • NIKHEF: outage on the DPM SE - raid controller problem. The problem started on 5 May but was only just noticed! One of the disk servers and 1/2 of the partitions are inaccessible. ATLAS is the customer!

Core services (CERN) report:

  • CASTOR ALICE intervention on 15th May (not yet announced in the IT status board)
  • AFS problem on AFS22 on 8 May 2008: about 95% of all AFS volumes have been recovered; 232 (~5%) are currently too damaged and will very likely have to be restored from backup tape
  • CASTORCMS: Patch 2.1.7-6 applied to castorcms at 10am this morning
  • CASTOR - patch 2.1.7-6 might be applied to castorpublic on Tuesday (13th May)

DB services (CERN) report:

  • Streams monitoring currently down - being looked at.

Monitoring / dashboard report:

  • Job monitoring application affected by the AFS problem. Data not lost but 'delayed'.

Release update:

AOB:
