Week of 080630

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local();remote(). Cancelled due to EGI meeting.

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Harry, Nick, Jean-Philippe, Roberto, Simone, Maria, Jan, Gavin);remote(J.Coles).

elog review: Nothing new since 20 June

Experiments round table:

LHCb: 1) DIRAC3 is planned to be released tomorrow after a few more tests today. 2) The gsidcap problem seen at IN2P3 (multiple gsi connections) will be fixed by dCache patch level 8, expected next week; in the meantime there is a workaround of running with a reduced number of doors. 3) There are encouraging results from the xrootd tests being made at SARA. 4) Tests of the pre-glexec MUPJ (multi-user pilot jobs) are expected to go ahead soon at the T1. 5) LHCb believe the new gfal 1.10.14 will fix a thread-safety problem; J-PB thought it does not, so this is to be checked.

ATLAS: 1) Since yesterday ATLAS has been successfully running FTS transfers out of an SLC4 server in the PPS. 2) CNAF has now been down for more than one week (power switch problems), causing problems for ATLAS site services. Some 5000 files known to be online at CNAF (also on tape at CERN) are being queried for. This long downtime should trigger an MoU metric failure post-mortem. 3) Looking in GridView at T1-T1 traffic, where all sites report they are using FTM, we can only see traffic from TRIUMF and RAL. G. McCance will check inside GridView; otherwise we should raise GGUS tickets against the sites.

Sites round table:

Core services (CERN) report: FIO/FS have a first release of the SLC4 version of castorsrm but no date for the official release yet.

DB services (CERN) report: The CNAF streams synchronisation has been stopped since their power failure, but we have been told this morning that the ATLAS and LHCb LFCs will be coming back, so we will try to resynchronise them.

Monitoring / dashboard report:

Release update: There will be a first meeting about a pilot CREAM CE service today.

AOB: There will be a DB developers workshop next Tuesday, 8 July, 09.00 in the IT auditorium to which all are welcome. Current issues and use cases will be addressed.

Wednesday:

Attendance: local(Andrea, Jean-Philippe, Gavin, Simone, Harry, Jamie, Nick, Sophie, Patricia, Roberto);remote(Derek, Michael, Jeremy, Michel).

elog review:

Experiments round table:

  • CMS (Andrea) - preparing for the next cosmic run, CRUZET 3, starting next Monday; ~100 TB of data. Discussing which Tier1 will be custodial for this data. Report from CNAF: everything up and running except StoRM, which is still down, so PhEDEx debug transfers are stopped.

  • ATLAS (Simone) - request to distribute cosmics from the previous two weekends to Tier1s: done. Nothing subscribed to CNAF; RAW data distributed with relative weights to the other T1s. Background continuous file transfer testing ongoing. For ATLAS, StoRM is important as it is the production endpoint for disk; Ale was revalidating the site this morning. (Re-?)discussing shares and setup of the batch system at CERN - tomorrow morning, setup of shares for new groups. Q (JPB) - does the number of groups/subgroups increase? A - no.

  • ALICE (Patricia) - beginning to ramp up again. Affected by one of the MyProxy nodes being down, hence proxy renewals were not working and proxies became corrupted; all VO boxes went down. Had to delete the old proxies, renew them and restart services (a sketch of these recovery steps follows this list). Nick - the MyProxy service is supposed to be highly available - to be followed up. Post-mortem. Linux H/A.

  • LHCb (Roberto) - DIRAC3 restart in a couple of hours(!). gsidcap issue, reported yesterday as understood: patch level 8 fixes one problem but a new problem has now been spotted; info in the elog, details passed to the dCache developers. xrootd testing at SARA: yesterday going fine, today a small interruption - connections lost and not recovered. Fabrizio Furano has been involved.
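
For illustration, a minimal sketch of the VO-box recovery steps ALICE describes above (delete the corrupted proxy, fetch a fresh one from MyProxy, restart services). The proxy path, MyProxy host, service names and client flags are assumptions, not the actual ALICE configuration:

  import os
  import subprocess

  PROXY = "/tmp/x509up_u%d" % os.getuid()   # default grid proxy location
  MYPROXY_SERVER = "myproxy.cern.ch"        # assumed MyProxy host
  SERVICES = ["vobox-services"]             # hypothetical service names

  def proxy_seconds_left():
      # Remaining proxy lifetime in seconds; 0 if absent or corrupted.
      # (--timeleft / --file flag spellings vary between voms client versions.)
      try:
          out = subprocess.check_output(
              ["voms-proxy-info", "--timeleft", "--file", PROXY])
          return int(out.strip())
      except (subprocess.CalledProcessError, OSError, ValueError):
          return 0

  def recover():
      if proxy_seconds_left() > 3600:
          return                            # proxy still healthy
      if os.path.exists(PROXY):
          os.remove(PROXY)                  # 1. delete the corrupted proxy
      subprocess.check_call(                # 2. fetch a fresh one from MyProxy
          ["myproxy-logon", "-s", MYPROXY_SERVER, "-n", "-o", PROXY])
      for svc in SERVICES:                  # 3. restart dependent services
          subprocess.check_call(["/sbin/service", svc, "restart"])

  if __name__ == "__main__":
      recover()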

Sites round table:

  • RAL (Derek) - LFC crashing; implementing restart scripts (a minimal watchdog sketch follows this list). JPB - install 1.6.10-6: the only difference between -4 and -6 is a fix for this problem. This was a known problem but the information was not clearly spread (timeout or problem with the security handshake). --> OPS meeting.
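
As a rough illustration of the kind of restart script mentioned, a minimal watchdog sketch; the daemon and init-script names are assumptions, and the real fix remains the 1.6.10-6 upgrade:

  import subprocess
  import time

  DAEMON = "lfcdaemon"                    # assumed process name
  INIT_SCRIPT = "/etc/init.d/lfcdaemon"   # assumed init script
  CHECK_EVERY = 60                        # seconds between liveness checks

  def is_running():
      # True if at least one lfcdaemon process exists (checked via pgrep).
      return subprocess.call(["pgrep", "-x", DAEMON],
                             stdout=subprocess.DEVNULL) == 0

  def main():
      while True:
          if not is_running():
              stamp = time.strftime("%Y-%m-%d %H:%M:%S")
              print("%s lfcdaemon down - restarting" % stamp)
              subprocess.call([INIT_SCRIPT, "start"])
          time.sleep(CHECK_EVERY)

  if __name__ == "__main__":
      main()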

Core services (CERN) report:

  • Experiment scripts are pointing to an old version of the AFS UI; /current is recommended. Configuring the new VO (sixt). New VOs follow the new naming convention, existing ones do not.

  • Morning meeting - new kernel in the PPS. Based on the problems seen by ATLAS last time, it is recommended to test! -> Birger to see with Uli.

DB services (CERN) report:

Monitoring / dashboard report:

  • Problems with SAM monitoring this morning: a problem accessing the top-level BDII at CERN turned the grid red (not just central European sites). It was a network problem that affected all BDIIs(?!) at 07:30 this morning for ~30 minutes (a quick liveness probe a site could run is sketched below). Jeremy - will these results get passed through to site reports? Cases like this should be handled centrally. -> John Shade to correct centrally.
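
For sites wanting an independent check the next time SAM turns red, a minimal sketch of a liveness probe against the top-level BDII; the host alias is an assumption:

  import socket

  BDII_HOST = "lcg-bdii.cern.ch"   # assumed top-level BDII alias
  BDII_PORT = 2170                 # standard BDII LDAP port

  def bdii_reachable(host=BDII_HOST, port=BDII_PORT, timeout=10):
      # True if a TCP connection to the BDII LDAP port succeeds.
      try:
          sock = socket.create_connection((host, port), timeout=timeout)
          sock.close()
          return True
      except OSError:
          return False

  if __name__ == "__main__":
      print("BDII reachable: %s" % bdii_reachable())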

Release update:

AOB:

Thursday:

Attendance: local(Jamie, Gavin, Simone, Andrea, Roberto, Harry, Jean-Philippe);remote(Jeremy, Derek, Michel).

elog review:

  • Nothing new

Experiments round table:

  • LHCb (Roberto) - DIRAC3 status: still issues testing the simulation & reconstruction workflow; not ready to restart. Two further space tokens at CERN for MC simulation when ready to restart.

  • CMS (Andrea) - CRUZET 3 (cosmic run from Monday): custodial site = IN2P3. CERN CAF and FNAL will get a full copy; CNAF, PIC and Taiwan get partial copies. Full dataset = 100 TB.

  • ATLAS (Simone) - currently a problem running reprocessing at some dCache sites (reports from SARA, IN2P3), concerning staging from tape and the lifetime on disk after staging. Flavia is running tests to see whether the problem is in dCache / gfal / ATLAS (pin management? wrong workflow?). New space token association to a pool "ATLASSPECIALDISK" - T0D2(!?), i.e. the backup is on disk, with the 2nd copy triggered on first read. (Cannot currently write into this - tried various users with different privileges; reported to Jan & Castor operations.) A meeting is organised tomorrow to follow up (e.g. is it really necessary to trigger a read immediately after write to get the 2nd copy?). It is not known why this implementation was chosen. (Use case = calibration files.)

Sites round table:

  • Nothing to report (R.A.S.).

Core services (CERN) report:

  • Just to note an AFS problem in the last ~hour: problems logging into lxplus (afs22). See the status board.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday:

Attendance: local(Andrea, Simone, Jean-Philippe, Sophie, Flavia, Jan, Gavin, the mysterious 77139 :-) );remote(Michel, Jeremy, Derek).

elog review:

  • Nothing today

Experiments round table:

  • CMS: nothing to report. Michel: any activity in terms of production jobs? Will check: 13k production jobs submitted.

  • ATLAS: some decisions about disk space were made in the ATLAS ops meeting: see the slides from that meeting regarding what sites should implement for space tokens for MC, group analysis and users. Please look and comment if there are any issues. Prestaging / recall from tape: at 5 sites (2 Castor, 3 dCache) reprocessing is slow, with major recall problems; an effort to understand these problems is starting, driven from ATLAS ops - prestage on sites one by one in a controlled way (sketched after the Q&A below), starting on Monday with SARA. Flavia: the dCache developers have agreed to set up a test instance with a small disk partition to test migration (migration in dCache requires full disks); she will compile a document with recommendations for sites.

Q. (Michel): please clarify ATLAS space token permissions for DPM. A: yes, Simone will follow up with Jean-Philippe. Issue: ACLs on the nameserver can be multi-group, but ACLs on a space token are only for a single group. This requires a change in DPM: it is in the development plans.

Q. (Simone): strange behaviour at Taipei regarding ATLASDATADISK D1T0 (140 TB). It is completely full; about 5 TB seems to be in use, but it is not clear what the rest is. The reported space seems to change uncorrelated with ATLAS production activity.
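
A minimal sketch of the controlled prestaging exercise described above: submit tape recalls in small batches with pauses in between, rather than flooding the SRM. The client name, flags and SURL list file are assumptions for illustration:

  import subprocess
  import time

  SURL_LIST = "surls_sara.txt"   # hypothetical file, one SURL per line
  BATCH_SIZE = 50                # files recalled per batch
  PAUSE = 300                    # seconds to wait between batches

  def batches(items, size):
      for i in range(0, len(items), size):
          yield items[i:i + size]

  def main():
      surls = [line.strip() for line in open(SURL_LIST) if line.strip()]
      for n, batch in enumerate(batches(surls, BATCH_SIZE), 1):
          print("batch %d: recalling %d files" % (n, len(batch)))
          # Issue an SRM BringOnline request for this batch; the exact
          # client name and flags depend on the installed SRM tools.
          subprocess.call(["srm-bring-online"] + batch)
          time.sleep(PAUSE)      # throttle so the stager is not flooded

  if __name__ == "__main__":
      main()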

Sites round table:

  • RAL (Jeremy): there was a SAM problem last night that caused many alarms to be sent to RAL (the same issue was noted at CERN). RAL staff were called out for this by SMS. Can the CERN SAM service be put on piquet for these issues? --> follow up on this.

Core services (CERN) report:

  • LFC (Sophie): LHCb LFC configuration mistake - 20 threads set where it should have been 60 - fixed; all LFCs are now set to 60 threads. Issue with long-running LFC sessions from ATLAS (never closed) - this affects our ability to make interventions. Agreement with ATLAS: their code should check the status of each call and retry if necessary (see the retry sketch after this list); during an intervention we can then force-close sessions if necessary.

  • Castor: LHCb space tokens at CERN done. ATLAS 'special' disk - following up. SAM (MonBox at CERN) is making frequent null connections to SRM; awaiting feedback from the SAM team. New version of Castor and new Castor-SRM: the upgrade is intrusive and needs to be scheduled. CMS will be contacted to phase out usage of SRMv1.
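
A minimal sketch of the retry behaviour agreed with ATLAS above: check the status of each catalogue call and retry with back-off, so that a forced session close during an intervention is survivable. The wrapped LFC call is a hypothetical stand-in:

  import time

  def with_retries(call, attempts=5, delay=10):
      # Run call(); on failure wait and retry, doubling the delay each time.
      for attempt in range(1, attempts + 1):
          try:
              return call()
          except Exception as exc:     # e.g. connection closed by the server
              if attempt == attempts:
                  raise
              print("attempt %d failed (%s); retrying in %ds"
                    % (attempt, exc, delay))
              time.sleep(delay)
              delay *= 2               # exponential back-off

  # Usage with a hypothetical LFC helper:
  #   entries = with_retries(lambda: list_lfc_directory("/grid/atlas"))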

Monitoring / dashboard report:

  • Andreas: please use the WMS to submit CMS SAM tests -> wms113 (lcgadmin and production roles). Sophie: we need to understand the user mapping first.

AOB:

None.

-- JamieShiers - 27 Jun 2008
