Week of 090525

WLCG Baseline Versions

WLCG Service Incident Reports

GGUS Team / Alarm Tickets during last week

Archive of Broadcasts

Weekly VO Summaries of Site Availability

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local( Jamie, Gang, Stephane, Nick, Harry, Andrea, Patricia, MariaD, Simone, Roberto, Diana, Dirk);remote(Andrea (CNAF), JT, Daniele ).

Experiments round table:

  • ATLAS - (Stephane/Simone) An issue with the LFC at PIC over the weekend triggered a discussion on the mechanism for sites to contact experiment experts. ATLAS proposes to set up an email list / egroup of experiment experts, which can be used by a small group of known T1 site contacts (10-20 in total) to reach the experiment expert on call (and possibly later Point 1 experts). MariaG: Could VOMS groups + email list be used? Or GGUS tickets? Simone: would need to be able to generate SMS. Jamie: traceability of this channel would be useful. On the technical side of the problem: ATLAS was trying to delete LFC entries at PIC with the central deletion tool. This triggered instability of the PIC LFC instance, which had not been seen under similar activities/load before. The full analysis is still going on, but the problem seems to be related to particular non-canonical ACL values rather than unusual load. PIC experts and LFC support are working on the remaining cleanup and understanding of the problem (ticket 48980). Simone: T2 transfers from Taiwan are enabled again as of Sunday. No problems observed so far. Jeff mentioned problems on Sunday with failing SAM tests due to expired proxies. Andrea saw similar problems for submissions via the WMS, even though direct submission is working fine. To be checked by ATLAS.

  • CMS reports - (Daniele) Daniele reminded sites to consult the wiki documents which have been put in place in preparation for STEP09. He asked if June 1st (CERN holiday) will be considered first full day of STEP09. Jamie: June 1st will be a best effort day.

  • ALICE - (Patricia) No major issues during the weekend - smooth running. Some problems at CERN T0 with jobs which required manual killing. Not clear yet why this happens apparently only at CERN. New T2 at Hiroshima is ramping up. Still some issues with submission via their VObox. Working on the issue together with another new site in Spain.

  • LHCb reports - (Roberto) MC09 production going on. Cooling problems at Lyon and problems with CRL at CNAF (being worked on by CNAF experts). Some jobs had been killed by NIKHEF watchdog - Jeff added that the problem has been analysed and was due to external database connections.

Sites / Services round table:

  • CNAF (Andrea) - working on LHCb certificate problems and also cross checking on other nodes to avoid similar problems. The site also saw some job submission problems (~20 jobs died) which are being analysed.

  • NIKHEF(Jeff): dCache upgrade to 1.9.2-5 showed problems with memory consumption (>8GB) for gsi-dcap doors. The issue has been forwarded to dCache support and the site is preparing for a possible downgrade to 1.9.0-10 in case the problem cannot be fixed quickly. Simone/Jeff warned other sites to take this into account in case of planned dcache upgrades.

  • ASGC(Gang): SRM problems last week have been identified as another instance of the Oracle "BigID" problem.

  • CERN-PROD ATLAS issue with Castor SRM last Thursday not yet understood. Service request opened to developers. SIR will follow once we understand better.

AOB:

Tuesday:

Attendance: local( Nick, Roberto, Julia, Gavin, MariaD, Gang, Dirk);remote( Daniele, JohnK (RAL), Andrea(CNAF), JT, Michael ).

Experiments round table:

  • ATLAS - (Simone) ASGC progress: ATLAS started to test recall-from-tape functionality. Recalls without space tokens result in the correct space. The access to a VO box for Simone has been appreciated and simplifies the recommissioning. Still an issue with the small number of available tape drives (2), which are shared with CMS. A further 6 drives are expected to become available soon (ASGC is waiting for IBM intervention). Simone remarked that some new recall requests have overtaken older ones and asked the site to investigate.

  • CMS reports - (Daniele) CMS CRL UI expired at CERN (ticket 49039). CASTOR upgrade for CMS finished smoothly with some minor issues. The upgrade broke some CMS download scripts as the stager query output format had changed. CMS has in the meantime adjusted their production PhEDEx to cope with the unintended change. The old output format will be restored with the upcoming minor upgrade to 2.1.8-8. Daniele reported that CMS got confirmation from ATLAS for multi-VO tape I/O tests. A more concrete schedule is expected after a meeting between CMS and ATLAS tomorrow. Prestaging tests have started at CNAF and will continue with other sites. Transfer test samples have also been prepared and tests will soon proceed.

  • ALICE -

  • LHCb reports - (Roberto) Monte Carlo production is continuing smoothly, but issues with gsi-dcap have been observed at Lyon and, during the weekend, at GridKA. Roberto reported an issue with the UI in CERN AFS. Gavin added that this issue is currently being looked at by the support team in FIO.

Sites / Services round table:

  • RAL (John): RAL had scheduled downtime for a network change which unfortunately failed. The intervention will have to be repeated.

  • CNAF (Andrea) : Several tickets solved and closed. Upgraded to latest glite version and did not observe any problems so far.

  • BNL (Michael) : BNL upgraded to the latest version of the ATLAS site services. Michael mentioned some issues but will provide more detail later. Simone added that this change was required because two weeks ago, while still using a candidate release of the site services, BNL did not properly mark locations in the DDM catalog. This issue is now fixed with the upgrade, but it is not yet clear what happened to the locations that failed before. ATLAS will follow up with the site.

  • NIKHEF/SARA (Jeff) : Discussions / investigations are still ongoing on whether to downgrade dCache or possibly upgrade to a new patch level, triggered by problems after the recent dCache upgrade.

  • CERN(Gavin): The ATLAS SRM issue of last Thursday has now been understood. Post mortem about the AFS UI problems can be found here.

AOB: (MariaDZ) Since the GGUS May 2009 Release (last week) one can now see the LHCOPN tickets linked from the GGUS homepage (bottom left). I am putting this link in the template for these minutes. When people need more info, a network person will be called to the wlcg-operations meeting. This was also discussed today at an LHCOPN meeting. We need feedback from OSG, or at least from BNL, on the provision of site contact and emergency email for use by GGUS. This was discussed in 2 dedicated meetings (last December and March). Details in http://savannah.cern.ch/support/?107531. About the ATLAS requirement presented yesterday, please check https://savannah.cern.ch/support/?108277; we should refine the requirement in this ticket.

Wednesday:

Attendance: local(Jamie, Nick, Gav, Harry, Gang, Antonio, Roberto, MariaDZ, Simone, Diana);remote(John Kelly (RAL), Andrea Chierici (CNAF), Michael, Daniele).

Experiments round table:

  • ATLAS (Simone) - follow up on previous issues: PIC - contacts reported that entries with corrupted ACLs have been cleaned. Deletion agents restarted and problem did not reoccur. Can confirm problematic ACLs - ticket still open but would like to understand from developers - should not crash server. BNL: exchange with Hiro who updated SS to latest stable release. SS DB - backlog of registrations in central catalog - being done by hand. Whole process now ok - close(d). Problem this morning with FTS@SARA - very slow, eventually timing out. VO boxes serving SARA & IT sites "disturbed" - have to check SS (DDM developers). Ron Trompert fixed quickly. Tomorrow datasets marked CCRC'08 will stop being produced. STEP'09 will be subscribed according to full pattern so that Mon/Tue can ramp up export.
    Brian (RAL) - in hand to have space in datadisk ready for STEP. Within UK looking at perceived R/W rate of CASTOR for ATLAS. Maybe due to small files? Simone - this is current data distribution? A: yes, for current transfers. Simone: T1 gets data which is later shipped to T2s but not fully synchronised. Might get T1 pileup. ?? Filling FTS channels etc?? Have to check. Brian - preparing T2 disk space as well.

  • CMS reports (Daniele) - CRL expiration errors now fixed - thanks Sophie for PM. Tickets opened by CMS on this are now closed & verified. News from last 24H: mainly a compact & good meeting with Graeme (ATLAS) yesterday. Discussion on details, all sorted. ATLAS+CMS overlapping T0 tape writing tests(!) Updated plans by other VOs on this would be good - still some possibility to shift to get multi-VO. T1: more tests with Andrea's script at CNAF. More work at IN2P3 to finalise. First tests with script by Andrea & Elisa on Twiki - migrating soon to main CMS twiki. Transfer: possibly today and before tomorrow - tables and all details on samples on T1 sites for T1-T1 tests. Analysis at T2: no progress to be reported. Meeting later in time for (report) tomorrow... Roberto - when is overlap in tape writing? Daniele: June 2-14 overlap period from ATLAS side. CMS: global run 2nd week and CRUZET first week. Hope for ~2 x 48H slots of full speed overlap. 2nd week of June should be a couple of slots with ATLAS+CMS. Simone - ATLAS test suite runs writing to tape in T0 at all times but tape family is recycled. This will run for full period of STEP so any time CMS can kick in will be ok! Harry - plan to go up to 1000Hz - still on? Daniele - depends on timing. T0 guys writing to tape now but not so obvious we can push to 1kHz. Roberto - LHCb will be writing to tape at nominal rate "fest'09" for full period. Harry - first pass processing data also to tape? A: Y.

  • ALICE - (site tutorial ongoing)

  • LHCb reports (Roberto) - MC09 production ongoing T1/T2 and merging at T1 and CERN. Issue accessing data at Lyon still ongoing - to be confirmed by local contact and site admin. CRL problem on AFS UI - impact but now closed, thanks.

Sites / Services round table:

  • NL-T1: We're downgrading dCache to 1.9.0-10, because we keep having problems with the gsi-dcap service of version 1.9.2-5.

  • ASGC (Gang) - 5+2 tape drives now online, i.e. 7 of 8, so 1 is still not online.

  • RAL (John) scheduled down tomorrow 8 - 11 local - make 3rd attempt to fix network!

  • CERN (Gav) - set of CASTOR interventions - another tomorrow - should be transparent!

AOB: (MariaDZ): USAG meeting tomorrow agenda http://indico.cern.ch/conferenceDisplay.py?confId=59811

Reminder - please join / attend so that we can start promptly at 15:00!

Thursday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

  • ASGC: Today is a public holiday

AOB:

Friday:

Attendance: local();remote().

Experiments round table:

  • ATLAS -

  • ALICE -

Sites / Services round table:

AOB:

-- JamieShiers - 25 May 2009


Topic revision: r11 - 2009-05-28 - GangQin