Week of 080929

Open Actions from last week:

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Alessandro, Jamie, Markus, Ricardo, Harry, Miguel, Patricia, Jan, Jean-Philippe); remote(Gareth, Michael).

elog review:

Experiments round table:

  • ATLAS - weekend went pretty smoothly. Cosmic data taking with all subdetectors expected tonight - see the 16:00 ATLAS meeting for details. Replication is automatic, i.e. expect cosmics as announced. GGUS - providing a list of 5 Tier2s (the calibration T2s) - 4 are in EGEE, 1 is Michigan - these currently use the contact e-mails of the corresponding ATLAS Tier1s. OSG tickets only work for BNL - at WLCG ops we will ask GGUS to implement direct ticket routing for at least these 5 Tier2 sites (the ticket bypasses the ROC). Michael - would very much welcome it if this can be implemented - it would ease life! Q: cosmic data - a LAr sample? A: will follow up.

  • ALICE - the issue reported with the WMS at RAL is now solved. 2 WMSs at RAL are up and running - put into production for ALICE this morning (some auth issues). New ALICE queues at CERN - time limit increased to 20 hours (2 normalised days; the previous queues had 1 normalised day) (LSF ticket can be closed); queues to be provided as of tomorrow and will be tested tomorrow morning.
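A small illustrative sketch of the "normalised" time accounting behind these queue limits, assuming the usual convention that the limit is scaled by the worker-node speed relative to a reference CPU. The factor 2.4 below is simply what is implied by "20 hours = 2 normalised days"; it is not a value taken from the minutes.

# Minimal sketch (assumption, not an official CERN configuration): relate a
# wall-clock queue limit to a normalised CPU-time limit.

CPU_SCALING_FACTOR = 2.4  # assumed: worker nodes ~2.4x the reference CPU

def wall_hours_to_normalised_days(wall_hours, scaling=CPU_SCALING_FACTOR):
    """Convert a wall-clock limit in hours into normalised days."""
    return wall_hours * scaling / 24.0

print(wall_hours_to_normalised_days(20))  # -> 2.0, matching the new queue setting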

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

  • Friday network problems (router) affecting CMS online - streaming not working. Corrected, but a later DNS intervention broke it again - fixed this morning. Friday night the LFC stream to SARA aborted. Fixing some rows at the destination - data was changed at the destination but should be read-only!

  • SAM: intervention tomorrow (broadcast sent) - 1 hour of tests will not be published.

Monitoring / dashboard report:

Release update:

AOB:

Tuesday:

Attendance: local(Andrea, Patricia, James, Harry, Jamie, Gavin, Jean-Philippe, Ricardo, Alessandro, Miguel); remote(Gonzalo, Derek, Michael).

elog review:

Experiments round table:

  • CMS - nothing to report today. In contact with the CASTOR team to run transfers using the new SRM endpoint, which will probably be set up next week.

  • ALICE - the two queues mentioned yesterday have been put into production.

  • ATLAS - how many samples are written for each event - 5 (question from Michael yesterday). Each event is around 3 MB. Running cosmic data taking with all subdetectors. All sites show good efficiency except Lyon, which is in scheduled downtime.

  • LHCb: an issue with the myproxy server is preventing proxies from being renewed and hence jobs from running smoothly. As far as I know (Maarten reported this during EGEE08) it is a known reliability issue with the myproxy server software; a solution should be on the way, but it is not clear to me exactly where.
    Other VOs have experienced this in the past as well, and LHCb has now seen it too. Gavin - some hiccoughs with the myproxy service in the last few days but nothing chronic.
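As a rough illustration of how such renewal problems can be caught early, here is a minimal monitoring sketch (an assumption, not an LHCb or WMS tool): it checks how much lifetime is left on the local grid proxy with grid-proxy-info and warns when it drops below a threshold, which is the symptom one would see when MyProxy-based renewal is stuck. The threshold value is a placeholder.

# Minimal sketch (assumption): warn when the local proxy lifetime gets short,
# a typical symptom of a stalled MyProxy renewal.
# 'grid-proxy-info -timeleft' prints the remaining proxy lifetime in seconds.

import subprocess
import sys

WARN_BELOW_SECONDS = 6 * 3600  # placeholder: warn when less than 6 hours remain

def proxy_seconds_left():
    """Return the remaining lifetime of the current proxy in seconds."""
    out = subprocess.run(["grid-proxy-info", "-timeleft"],
                         capture_output=True, text=True, check=True)
    return int(out.stdout.strip())

if __name__ == "__main__":
    left = proxy_seconds_left()
    if left < WARN_BELOW_SECONDS:
        print(f"WARNING: proxy expires in {left} s - renewal may be stuck")
        sys.exit(1)
    print(f"Proxy OK: {left} s of lifetime remaining")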

Sites round table:

  • PIC (Gonzalo) - issue this morning around 11:00: the SRM was overloaded by LHCb requests. This also happened a couple of weeks ago - was in contact with LHCb to schedule a controlled test. It was arranged for yesterday but seems to have been launched today(!). The new parameters didn't help - srmGet requests time out; puts work, however. Now rebooting the SRM, asking LHCb to stop the tests, and will schedule a new test to try to fix this (a throttled load-test sketch follows after this list). Michael - what file transfer rate per hour? A: get requests pile up in an internal SRM queue - each generates a pin request and the internal queues get overloaded; the transfer rate itself is not so high. Michael - correlated with high pnfs load? No - an increase in SRM load is seen, but not 'too high'. Only seen 'now' - 13K srmGet requests in one shot - presumably this would also happen with other VOs, but it has not so far. Q for Michael - do you have tests showing how many requests you can stand? A: yes, launched about 10K requests at a time. It took some time [ to handle? ] but never got stuck. Gonzalo - get requests eventually succeed but take some 4-5 hours; SAM get requests, for example, time out during this period. Investigation ongoing with the dCache people.

  • RAL (Derek) - no update, still investigating. No decision yet as to whether separating out into different DBs would help.
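Referring to the PIC item above: a minimal sketch of a throttled load test, under the assumption that get requests are issued with lcg-gt (an SRM prepare-to-get that returns a transfer URL) in small concurrent batches rather than ~13K in one shot. The SURLs, host names and batch size are placeholders, and this is not the actual LHCb or PIC test harness.

# Minimal sketch (assumption): throttle srmGet-style requests so the SRM's
# internal pin queues are not flooded.

import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 20   # placeholder: at most 20 outstanding get requests
PROTOCOL = "gsiftp"   # transfer protocol requested from the SRM

def prepare_to_get(surl):
    """Ask the SRM for a transfer URL for one file; return (surl, ok)."""
    result = subprocess.run(["lcg-gt", surl, PROTOCOL],
                            capture_output=True, text=True)
    return surl, result.returncode == 0

def run_test(surls):
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        for surl, ok in pool.map(prepare_to_get, surls):
            print(("OK   " if ok else "FAIL ") + surl)

if __name__ == "__main__":
    # Placeholder SURLs - replace with a real file list for the site under test.
    test_surls = [f"srm://srm.example.es/pnfs/example.es/data/lhcb/file{i}"
                  for i in range(100)]
    run_test(test_surls)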

Core services (CERN) report:

  • FTS 2.1 (SLC4 version) deployed in production in parallel - experiments are encouraged to test it out and, once confident, move to it. The proposal would be to make 'an official' release - to be discussed and agreed at some board.
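For experiments that want to try the new service, a minimal test-transfer sketch using the standard glite-transfer CLI is shown below. The service endpoint URL, SURLs and the list of terminal states are assumptions/placeholders, not the actual production values.

# Minimal sketch (assumed endpoint and file URLs): submit one test transfer to
# the new FTS 2.1 service and poll it until it reaches a terminal state.

import subprocess
import time

# Placeholder service endpoint for the new SLC4 FTS instance.
FTS_ENDPOINT = "https://fts-new.example.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"

def submit(source_surl, dest_surl):
    """Submit a single-file transfer job and return its job ID."""
    out = subprocess.run(
        ["glite-transfer-submit", "-s", FTS_ENDPOINT, source_surl, dest_surl],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

def wait_for(job_id, poll_seconds=30):
    """Poll the job state until it is no longer active (state names are assumed)."""
    while True:
        out = subprocess.run(["glite-transfer-status", "-s", FTS_ENDPOINT, job_id],
                             capture_output=True, text=True, check=True)
        state = out.stdout.strip()
        print(job_id, state)
        if state in ("Done", "Finished", "FinishedDirty", "Failed", "Canceled"):
            return state
        time.sleep(poll_seconds)

if __name__ == "__main__":
    # Placeholder SURLs - replace with real test files.
    jid = submit("srm://srm-src.example.ch/castor/example.ch/test/file1",
                 "srm://srm-dst.example.es/pnfs/example.es/test/file1")
    wait_for(jid)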

DB services (CERN) report:

  • No news from our side. Complementing what was said yesterday, we could not find the cause of the LFC stream to SARA failure. Their DBA says no data was written on their side.
    The scheduled intervention on SAM this morning was postponed as the tests could not finish in due time. It will be rescheduled soon.

Monitoring / dashboard report:

  • Set up a new elog for Kors for tracking the outcome of regular management meetings. Nothing new on the CMS elog...

Release update:

AOB:

  • N.B. Friday's meeting will (exceptionally) be in 28 R-006 due to a conflict with the LCG GridFest that day.

Wednesday:

  • Rehearsal of LHC GridFest involving many people

Attendance: local();remote().

elog review:

Experiments round table:

  • ATLAS (Simone):

    1. ATLAS cosmic data (RAW and ESD) are currently being distributed to the T1s, both on disk and on tape. This is a lot of traffic, also for the CERN-CERN channel (pumping between 150 MB/s and 300 MB/s in the last 48h); a rough volume estimate is sketched after this list.
    2. I got a notification from Gavin about the new FTS on SLC4 being in production at CERN. I agreed with ATLAS that the new service will start being used for functional test transfers from next Monday.
    3. CASTOR operations asked whether it would be possible for ATLAS to test a new version of CASTOR (still 1.1.7) in "PPS mode". This is possible; I can include it in the next round of functional tests (Monday) if the CASTOR endpoint is ready, otherwise it can easily be introduced at a later time.
    4. Issues in the last 24h:
      • problem with postgres database in dCache at PIC, fixed this morning;
      • low incoming throughput into SARA, affecting the functional tests as well as MonteCarlo production. Possibly an FTS issue; being followed up.
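A quick back-of-envelope sketch for the traffic quoted in point 1 (purely illustrative arithmetic, assuming a constant rate over the whole 48 hours):

# Back-of-envelope sketch: total volume for a sustained 150-300 MB/s flow
# on the CERN-CERN channel over 48 hours.

def volume_tb(rate_mb_per_s, hours):
    """Total volume in TB for a constant rate over the given number of hours."""
    return rate_mb_per_s * hours * 3600 / 1e6

for rate in (150, 300):
    print(f"{rate} MB/s over 48 h ~= {volume_tb(rate, 48):.0f} TB")
# -> roughly 26 TB at 150 MB/s and 52 TB at 300 MB/s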

Sites round table:

  • PIC: incident this morning, again affecting the SRM service from 4:00 to 11:30. Due to an auto-update on the SRM node; recovered from and fixed on the node.

Core services (CERN) report:

  • FTS - problem discovered in new FTS service. Will hold up release of FTS for T1s.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Thursday:

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday:

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:
