-- HarryRenshall - 20 Jun 2008

Week of 080623

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CET Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jamie, Harry, Patricia, Roberto, Jan, Nick, Alessandro, Flavia, James, Ricardo);remote(IN2P3).

elog review:

Experiments round table:

  • LHCb (Roberto) - still some DIRAC issues preventing the restart of CCRC'08 activities. Debugging of the file access protocol at Lyon with site, storage and LHCb experts. Intermittent problems, not related to load. Marc - not fully aware of the problem. Serious problem this weekend with the A/C: about 300 WNs had to be stopped - waiting for action this week to repair the A/C unit. Info will be kept posted on the website.

  • ALICE (Patricia) - beginning redeployment of AliEn at all sites. Changes in the 'LCG module' to submit jobs to RB/WMS sites in a homogeneous manner.

  • ATLAS (Ale) - 'Activity never stops' for functional tests. Nothing special to report. CNAF - suffered serious problem. UPS too heavy & floor collapsed!

Sites round table:

  • RAL - a rogue CMS user submitted ~10K jobs, overloading the batch system - jobs killed, AOK.

Core services (CERN) report:

  • Ice OK for BBQ(!)

DB services (CERN) report:

Monitoring / dashboard report:

  • CMS dashboard load (see below). OSG - critical services for OPS being changed (tomorrow after MB).

  • Mail exchange Luca / Julia: The query we thought may have been the performance bottleneck is now using an index and running considerably faster. In my previous email I was suggesting to create an index only on ReportTimeStamp instead of the one you created; the index you have created also fixes the problem though. I don't know enough about the selectivity of the ReadWrite and ProtocolId columns to say for sure they are not needed for this query (any ideas)?
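
    A rough illustration of the two indexing options in the mail above (the table name, index names and connection details are hypothetical placeholders; only the column names ReportTimeStamp, ReadWrite and ProtocolId come from the exchange, and the composite form is only a guess at the index that was actually created):

      import cx_Oracle  # Oracle python driver, used here purely for illustration

      # Hypothetical account / DSN - placeholders, not the real dashboard DB details.
      conn = cx_Oracle.connect("dashboard_user", "secret", "lcg_dashboard")
      cur = conn.cursor()

      # Option suggested in the mail: a single-column index on the timestamp only
      # (table and index names are made up for this sketch).
      cur.execute("CREATE INDEX report_ts_idx ON monitoring_report (ReportTimeStamp)")

      # Assumed shape of the index that was actually created: a composite index that
      # also covers ReadWrite and ProtocolId. Whether those extra columns add useful
      # selectivity for this query is exactly the open question in the mail.
      # cur.execute("CREATE INDEX report_ts_rw_proto_idx "
      #             "ON monitoring_report (ReportTimeStamp, ReadWrite, ProtocolId)")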

Release update:

AOB:

Tuesday:

Attendance: local(Jamie, Ricardo, Harry, Miguel);remote(Derek).

elog review:

Experiments round table:

  • CMS (Andrea) - CMS week in Cyprus. Nothing to report.

  • ATLAS (Simone) - nothing to report - writing procedures / Twikis etc. Answer to the question from GRIF: the doc from Kors was presented by Alexei last week, re-discussed following questions from sites; a new version will be dispatched. Is the ATLAS DDM model robust enough against the failure of a service like the LFC? (Cloud blocked - 2 of the 4 calibration sites are in Italy and hence are blocked). R/O replication of catalogs? Being discussed...

Sites round table:

  • RAL - still talking to CMS to resolve problems that hit LHCb.

Core services (CERN) report:

  • Power tests tomorrow at 08:00 CEST - critical area. Announcement e-mail follows:

Dear Service Manager,

A test of the critical connections in the critical area will take place on Wed 25th of June at 08:00. The physics power supply will be switched off for 5 minutes in order to check that all servers in this area are properly connected to the critical supply. NOTE THAT THE FOLLOWING ZONES WILL BE FULLY OR PARTIALLY EXCLUDED FROM THE TEST TO AVOID INCIDENTS:

  • IS ZONE (due to specific hardware which hangs when the second power supply is reconnected)
  • AFS ZONE (due to a switch which is currently not connected to the critical distribution following an electrical problem; the switch cannot be properly reconnected without affecting the AFS service)
  • TELECOM ZONE (due to some on-going re-configuration work in this zone)
  • ROUTER ZONE (because the routers are, on purpose, only connected to the physics supply)

The risk of an incident is low as all the incidents detected during the last power cut have been followed up.

Vincent

  • Publishing of data from gstat to SAM is now fixed. The limit on the number of tuples (100) was being hit starting Friday - the work-around on the gstat side is to break the data into chunks (see the sketch below). Possibly due to the additional CEs added at CERN last Friday(?) The publishing problem was followed up here: https://gus.fzk.de/pages/ticket_details.php?ticket=37710
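
    A minimal sketch of the chunking work-around mentioned above (publish() and all_tuples are hypothetical placeholders; only the 100-tuple limit comes from the minutes):

      # Split the tuples into batches small enough to stay under the 100-tuple limit.
      def chunks(records, size=100):
          for start in range(0, len(records), size):
              yield records[start:start + size]

      # Publish each batch separately instead of one large message:
      # for batch in chunks(all_tuples):
      #     publish(batch)  # 'publish' stands in for the real gstat -> SAM call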

DB services (CERN) report:

  • CMS DBs (online & offline) being upgraded to Oracle 10.2.0.4

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

Attendance: local(Andrea, Simone, Harry, Roberto, Jamie, Nick, Ricardo, Miguel);remote(Jeremy, Jeff).

elog review:

Experiments round table:

  • CMS (Andrea) - CNAF recovered from the power cut: CASTOR ok, CEs ok, GPFS still down, the area for unmerged files unavailable, StoRM also down (PhEDEx transfers down). Several services added to the CMS Service Map.

  • ATLAS (Simone) - dev + ops meeting today to discuss recent changes to new space tokens @ Tier2s doc. Concerns of sites addressed - new version presented today. Presentation on activity for upcoming month. Quick summary: mostly cosmics from detector - sub-detector per week. Data -> CASTOR -> T1s. In 2-3 weeks from now, also cosmics from all sub-detectors, i.e. M8. FDR II-reloaded, i.e. re-test some parts of machinery. Mainly ATLAS-internal exercise + CASTOR@CERN, LFC@CERN. No export to Tier1s. Update tomorrow... (agenda page)

  • LHCb (Roberto) - phone call with Nick regarding the restart of CCRC. (Test new features in DIRAC3; see the document summarizing all development tasks: http://indico.cern.ch/materialDisplay.py?contribId=2&materialId=0&confId=36490). The goal is not to reproduce what was done in the past but to implement the lessons learned. Still quite a lot to do. Some MC workflows run successfully... During July will test the DIRAC framework also for handling generic pilots. Test the capability of DIRAC to run multiple payloads. Q for IN2P3: update on the issue of gsidcap access? Jonathan - updated in the ticket - very hard to reproduce the problem. Multiple files opened sequentially. Joel gave a procedure to run the LHCb application(?) Will try...

Sites round table:

  • NIKHEF (Jeff) - maintenance on a cluster on Friday. 400 cores - 1000 just over a week ago; will go back down to a few hundred for a day, then back up again.

Core services (CERN) report:

  • SRM CMS (Ricardo) upgrade this morning as per announcement.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

  • Simone - it would be nice to have a summary of the deployment status of FTM at the Tier1s -> to be raised at the weekly operations meeting.

Thursday

Attendance: local(Jamie, Harry, Jean-Philippe, Ricardo, Gavin, Roberto);remote(Derek, Andrew Sansum, Jeremy, JT, 00033478930880).

elog review:

Experiments round table:

  • LHCb (Roberto) - if the rogue user had been from LHCb they would have been banned immediately - Andrew: no, we would not discriminate against LHCb. (Much of the discussion was lost along with the edit session.) Should GGUS/elog have been used? R: at least an entry in the elog. Use of the gLite WMS from DIRAC3 should also avoid this... Torque upgrade??

  • LHCb report bis - gsidcap access at IN2P3. Jonathan working hard to debug. Hanging jobs in the gsidcap server. Roberto has tried to summarise the issue. Not yet reached a solution... SARA - nearline/online issue: configuration incorrect - a write-only disk pool instead of read-only(!) Will check with Ron if ok...

  • ATLAS (Simone, by e-mail): I would just like to report that CNAF seems functional now (an FTS delegation problem has been fixed in the morning). Other sites are running fine, with high efficiency.

  • CMS (Andrea, by e-mail) - I would have just mentioned an issue at RAL, where the queues were totally saturated by CMS user jobs due to the low limit on concurrent analysis jobs (50). The huge number of queued jobs was also affecting LHCb; as a temporary measure the limit was raised to drain the queues.

  • Comment on the above (John Gordon, by e-mail): The RAL limit on CMS jobs was originally imposed because CMS complained they were not getting enough production job slots. Since the user jobs asked for tapes, their efficiency was very low, so once they occupied a job slot new production work could not get its fair share. We don't guarantee that you won't see your production work squeezed again.

    It would have been good to have earlier discussion of Tier1 workflows. Were we supposed to have anticipated these job patterns from the various requirement talks pre-CCRC08?

Sites round table:

  • RAL (Derek) - As of noon (BST) we have removed the 50-job limit on CMS user jobs; this will allow all the currently queued jobs to run, reducing the ERT and allowing LHCb to submit jobs to RAL. We are also looking at temporarily moving CMS jobs into a queue which is unused by LHCb, so that if a backlog of CMS jobs did begin to reappear, LHCb jobs would still be submitted and would be prioritised above the CMS jobs.

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

  • GridView from yesterday (below) shows a 'wall' - is it understood why transfers collapsed at this time?

gv-25June.png

  • Problem may be with GridView - ATLAS Dashboard shows activity did not stop:

atlas-25June.png

Release update:

AOB:

  • Jeremy - one UK site moving to new SE. Attempt to migrate SE with space tokens intact.

Friday

Attendance: local(Jamie, Ricardo, Harry, Jean-Philippe, Nick);remote(Michael, Derek, IN2P3).

elog review:

Experiments round table:

  • ATLAS (Simone) - another downtime at CNAF. Power cut - broadcast yesterday. One of the side effects is that the pilot factories using the WMS at CNAF have stopped. (The agent which submits pilots to sites.) Andrea - why not also use the WMS at CERN? A: will(!) Names of the endpoints received from Maarten this morning - will start using them.

  • CMS (Andrea) - nothing to report!

  • ALICE (Patricia) - ALICE central services down during night. Recovered now and ramping up.

  • LHCb (Roberto - after the meeting, by e-mail): So far we do not yet have a working DIRAC3 production instance that would allow CCRC to restart, despite people working hard to get it.

Sites round table:

  • RAL (Derek) - situation resolved for now - keeping an eye on it. Andrea - from the CMS point of view, T1 sites will be masked so that no analysis jobs are scheduled at T1s. Triggered by this problem, now looking at datasets which only exist at RAL - they will be replicated to some Tier2(s).

  • BNL (Michael) - everything running well! Simone - still a problem with the Michigan calibration site - invalid space token. Michael - the Michigan people are working on it. Confusion over the mapping of space token(s) to VOMS role. Have to clarify this within ATLAS - properly document how the mapping is intended to work. Simone - the production role of ATLAS. Will send an FQAN example: /ATLAS/ROLE=PRODUCTION. Michael - will send more information.

Core services (CERN) report:

  • New "WLCG Operations" wiki - https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOperationsWeb. Attempt to split out (relatively) static information (pointers) from CCRC'08 wiki. Leave latter as "white board". Comments on content welcome. Idea is to keep it clean and simple! (A wiki is probably not the best for this sort of page, but its easier for "rapid prototyping".)

  • (Gav - by e-mail): Doing SL4 FTS testing on the PPS: dteam tests look good (65k files). We'll be starting a pilot with CMS and ATLAS on Monday (LHCb / ALICE to follow). Watch this space...

DB services (CERN) report:

Monitoring / dashboard report:

  • GridView / Dashboard discrepancy noticed yesterday (Phool Chand - by e-mail): Tomcat was running and receiving the tuples from the web service publishers, but these tuples were not being inserted into the database. I have restarted Tomcat and tuples are now being inserted into the database. The graphs will be OK when the data is re-summarised in the next few hours.

  • GridView (Simone) - when Nick sent the mail about FTM, nearly all sites responded that they had installed this component. However, GridView still doesn't seem to display this(?) (Use the FTS tab in GridView? - to be confirmed).

gv-26June.png

Release update:

AOB:

Topic attachments:

  • atlas-25June.jpg.shs (1372.0 K, 2008-06-26)
  • atlas-25June.png (4.4 K, 2008-06-26)
  • atlas-permanent-testing-June27.pdf (50.9 K, 2008-06-27)
  • gv-25June.png (7.9 K, 2008-06-26)
  • gv-26June.png (8.3 K, 2008-06-27)