Week of 080728

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Simone, Stephane Jezequel (Atlas expert on-call), Julia, Andrea, Harry, Jean-Philippe, Ulrich);remote(Derek, Gonzalo, Michael).

elog review: nothing new

Experiments round table:

ATLAS (SJ): Started cosmics data export on Friday with many T1 problems. A full PNFS log at NDGF stopped their dCache, FZK had a dCache problem, and IN2P3 asked ATLAS to stop central file deletion and were up and down all weekend. PIC has had delegated proxy corruption for some time and Gonzalo reported they currently believe this is due to a lack of clock synchronisation, possibly a kernel/hardware mismatch. They have now installed a frequent (every 2 minutes) crontab and asked if any corruption was seen over the weekend - the answer was no. ATLAS sends calibration data via FTS to a US T2 at Michigan called AGL T2 and this has stopped working. Michael reported this as a BDII problem: the AGL SE should be published in an OSG BDII to be recognised by the CERN BDII and apparently this has not happened for 2 weeks, though he thought the OSG BDII was correct. M.Litmaath at CERN is looking into this (a quick check later showed that around 17 July the site disappeared from the Indiana U. site reference file as used by CERN and there may have been a layout change/misinterpretation). Over the weekend two ATLAS T2 sites did not recognise the transfer certificate, there was a gridftp problem to ASGC, and on Sunday evening an ATLAS problem led to missing subscriptions of cosmics to BNL - now fixed.

CMS (AS): Nothing to report.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB: A.Sciaba reported CMS had requested a new castor pool some 10 days ago without any feedback yet. U.Schwickerath from FIO promised to look into this.

Tuesday:

Attendance: local(Roberto, Julia, Simone, Andrea, Harry, James, Jamie, Miguel, Gavin, Olof, Uli, Steve, Jean-Philippe);remote(Michael, Jeremy, Derek).

elog review:

Experiments round table:

  • LHCb (Roberto) - elog entries, 3 on CNAF/STORM. All developers on holiday! SRM flickering behaviour at FZK - no update since 23rd July. File transfer problem out of Lyon - Lionel following closely with Andrew. Same issue out of SARA. No update since 25 July from Julian. Last entry on FTS delegation at PIC. Update from Gonzalo - origin due to strange clock behaviour in FTS server. Seconds counter goes backwards! MC simulation running through DIRAC3. Recons/stripping only when new release in prod with new gfal. Same release would trigger ganga - dirac3. Beginning of usage of dirac3 for analysis and retiring of dirac2 (requires use of obsolete services).

  • CMS (Andrea) - problem with FTS exports from yesterday evening. Error 'server has problems contacting DB'. Problem disappeared at 20:15 and reappeared this morning. Gav: around time of first problem schema export. Was unannounced. Even bog standard R/O operations should be announced. Sequence on production DB reset to earlier value. Why it went away and reappeared not understood. Looking at it... Service currently back. Fixed at 14:20 today (dropped faulty sequence). (Affected all VOs). Rest of service continued to run but ran out of jobs as could not insert new jobs. Andrea - reminder of start of mid-week global run (2 days) at 09:00. CASTOR SRM upgrade 09:00 - 10:00. CMS would like to be informed immediately of any problem and when intervention ends. Miguel - was done yesterday on public and today on ATLAS without problems. Transparent! Pending request of 10TB castor pool for CMS. Miguel - analysing internally. Create pool.

  • ATLAS (Simone) - suffered same FTS problem just mentioned. For the rest things going relatively smoothly. Very good efficiency for all sites. Great Lakes publication problem in BDII. Does not seem to be solved. Harry - dashboard says datadisk datasets being created. Problem with calibration - no Q! dashboard issue? Not yet reported. Miguel - Tier0 sent email reporting delay on migrations on Tier0 atlas pool. Started yesterday - about 10K files to write to tape - investigating. Meeting tomorrow at 10:00.
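
Gonzalo's observation that the seconds counter on the PIC FTS server goes backwards could be checked with a small probe along the following lines. This is a minimal sketch, not the tool PIC used: the function name `find_backward_steps` and the input format (a list of wall-clock readings in seconds, in the order they were taken) are assumptions made for illustration.

```python
import time
from typing import List, Tuple


def find_backward_steps(samples: List[float],
                        tolerance: float = 0.0) -> List[Tuple[int, float]]:
    """Return (index, step) for every reading that is earlier than its predecessor.

    samples   -- wall-clock readings in seconds, in the order they were taken
    tolerance -- backward steps no larger than this (in seconds) are ignored
    """
    steps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < -tolerance:
            steps.append((i, delta))
    return steps


if __name__ == "__main__":
    # Sample the wall clock a few times and report any backward steps.
    readings = []
    for _ in range(5):
        readings.append(time.time())
        time.sleep(0.01)
    print(find_backward_steps(readings))
```

Run frequently (for example from the same crontab that resynchronises the clock), such a probe would give a record of when the clock stepped back, which could then be compared with the times of the proxy corruption.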

Sites round table:

  • RAL intervention on SE. Derek - paused longer qs last night. Intervention - no news. Good news?

Core services (CERN) report:

  • VOMS (Steve) - intervention early next week. Transparent for all except registrations - 30' interruption. Move to new h/w and quattor. Schedule and announce.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Testing interface to GGUS (Harry) - get some test tickets. No reply needed (operator should reply).

Wednesday:

Attendance: local(Jean-Philippe, Harry, Jamie, Julia, Simone, Andrea, Roberto, Uli, Jan);remote(Derek, Michael, Jeremy).

elog review:

Experiments round table:

  • CMS (Andrea) - little to report. CASTOR SRM intervention this morning lasted only a few minutes and was perfectly transparent. Pool for log files requested has been created. Global run going on - news tomorrow.

  • ATLAS (Simone) - Status of RAL? (Status after CASTOR migration). Derek - SRM ATLAS back up and functional. Extra load of transfers adversely affected FTS - currently down / under maintenance 'poke at DB'. One of filesystems full - knock on effect. Simone - transfers from CERN to RAL using CERN FTS should be ok? A - yes. Next point: Tier2s in US cloud (Great Lakes) now working. Unclear what problem was. BDII at CERN? BNL? Harry - site itself. Michael - yes, badly configured site. With help of James etc corrected. NDGF in 'sort of downtime' - at risk today and outage tomorrow. Extremely problematic in transfers. Interventions on some pools and not others?? To be confirmed - downtime should end tomorrow.

  • LHCb (Roberto) - elog entry CASTOR SRM intervention at CERN went fine (for record). Open issues in elog. 3 open issues with STORM endpoint, unchanged as developers all away. 4th problem transferring files out of STORM also in elog. Problem preparing source file with some error from FTS. Open issue since 23 July at FZK - SRM flickering. Still no reaction... SARA timeout problems seem to be understood by Ron. Comment in WLCG elog thread. Pin manager needs to be restarted from time to time. [ Missed ] Pin manager issue also at Lyon. Middleware problem? -> dCache developers. Summary of all WMS issues for LHCb. VDT 1.6 issue with proxies > 10 delegations. Starts to be problem - patch from globus already available - few days in cert, week in PPS, ready for production mid August(?) Touches everything that requires authentication! Default mapping in case given FQAN fails. gridmap mapping to default, e.g. s/w manager user. CMS requested ~6 months ago mapping to default user. On track... VDT change could have significant knock-on effects. In any case needs to be followed carefully.

Sites round table:

Core services (CERN) report:

  • Jan - CASTOR SRM upgraded successfully transparently for all production endpoints. CASTOR ATLAS b/e upgrade tomorrow - non transparent. Patch level 13. Will be scheduled in coming weeks also for other experiments.

DB services (CERN) report:

  • Dawid (by email):
    • Oracle CPU patch applied on all integrations and scheduled for ATLAS, LHCb and LCG in two weeks time
    • FTS problem during export/import traced down to an Oracle bug (post mortem conducted)
    • Atlas streaming from online to offline broke down yesterday evening, now fixed. Streaming to Tier1s affected, currently under investigation.

Monitoring / dashboard report:

  • Julia - started migration of SAM tests for ATLAS to production DB. Presentation today...

Release update:

AOB:

Thursday:

Attendance: local(Miguel, James, Simone, Harry, Jamie, Uli, Roberto, Julia, Gav);remote(Jeremy, Michael, Gonzalo, Derek, Daniele).

elog review:

Experiments round table:

  • CMS (Daniele) - still working on mid-week global runs. 2nd smoother than 1st. Fixing and refining things as exercises go by. Couple of open issues with CASTOR. Need to answer to some requests re creation of pools etc. Will follow up asap. Refining lists of T1s that will get CRUZET data and replay of CRUZET3 exercise. 1 custodial site + 1-2 other sites too.

  • ATLAS (Simone) - 2 things: 1st - RAL. Exports from T0 to RAL ~100% efficiency but T1-T1 from/to RAL looks problematic. Derek - SRM response time for 1 (round-robin) goes very high. Not sure why - srm expert on hols - back next week. Need to sit on problem till then? Stop activities? Continue - will help debugging. 2nd - elog/ticket about BNL-CERN channel. Reply from Gav looks like BNL endpoint name changed. (Published sitename changed) Follow-up? Michael - probably due to fact sitename not consistent (OSG/WLCG). Legacy BNL-WLCG2 or similar. Implication on availability reports. 0% avail of SE due to non-aligned site names. One of team members changed name published in OSG and propagated to WLCG. Notified team member - in process of fixing it. CERN and T1 sites should rerun the glite-sd2cache tool by hand to rebuild the local cache from the updated BDII (it only runs once a day by default) - Gavin will send details.
    2 requests for pools in castor at CERN (reprocessing at CERN & enduser scratch space). Miguel - getting lots of requests. Being reviewed. Maybe some change in way things are deployed.

  • LHCb (Roberto) - Still 4 entries against CNAF. Still GridKA SRM flickering behaviour. Good news - SARA & Lyon seem to have understood timeout on preparing files for transfer out. Work-around adopted at Lyon (restart pin manager when stuck) ok - would like SARA to adopt same work-around until patch available. FTS/dCache issue - Lyon/PIC failures. Corruption of SRM dCache DB?? More info in GGUS ticket. Savannah bug on WMS - delegated proxy credentials somehow mixed up with client creds.

Sites round table:

Core services (CERN) report:

  • CASTOR ATLAS upgraded transparently to latest production release. Start contacting other experiments to have all at same level (2.1.7-13) within next couple of weeks. Tier1s also upgrading in coming weeks...

DB services (CERN) report:

Monitoring / dashboard report:

  • (Julia) Problem with CMS dashboard update due to DB hang - reproducible? Ran smoothly after restart of collectors.

Release update:

AOB:

  • (Harry) Edoardo Martelli back and will resume contact with BNL on LCG OPN issues (if not already resolved)

Friday:

Attendance: local(Dirk, Dawid, Jacek, Harry, Jamie, Jean-Philippe, Roberto);remote(Derek, Gonzalo, Jeremy).

elog review:

Experiments round table:

  • ATLAS (Ale - by email) : Thanks to FTS support for help in identifying the cause of the overnight FTS failures from CERN to BNL (an apparent site name change in the BNL bdii). ATLAS have only taken a few GB of cosmics lately, currently in reconstruction at CERN, so there will not be a large flow of data to the Tier 1 this weekend.

  • LHCb (Roberto) - main point dirac3. Now migrated into production system. Will restart stripping / recons. This version adds one delegation to the chain. Reproduced WMS bug - filed Savannah bug. Another critical point for LHCb. Hope will be fixed soon!

Sites round table:

  • RAL (ATLAS) - regenerated index statistics on ATLAS CASTOR DB, doing full tablescan - now fixed.

Core services (CERN) report:

  • Due to a converter failure (between the fibre and the router), CMS (pit 5) lost network yesterday at 17:43. The network piquet was called and replaced the faulty part; everything came back ok at 19:27.

DB services (CERN) report:

  • (Dawid Wojcik - by email):
    • Atlas offline to Tier 1s replication was stuck between Saturday and Wednesday (4 days). The issue was caused by the occurrence of a gap in the archivelogs sequence propagated to the downstream capture database. The problem was not spotted by the monitoring as the latency counting from capture to apply was zero and the capture was not reporting any problems. The replication to Tier 1s was restarted on Wednesday afternoon and encountered a similar problem on Thursday morning. All was fixed on Thursday afternoon. Post-mortem required.
    • GRIDKA reported some intermittent problems that are still not understood. From the exchange of emails it seems that the problem is serious and cannot be fixed at the moment. Investigation is in progress. Seen since 10.2.0.4 upgrade but not clear if this is actually connected! This currently affects only replication of LHCb LFC to GRIDKA (conditions replication was never affected). Streams split for LHCb LFC will be needed if problem is not resolved today.

  • (Copied from CERN C5 report):
    • There was a problem submitting jobs to the CERN T0-export service (FTS-T0-EXPORT) on July 28 between 18.06 CEST and 20.15 and the day after 29-7 between 10.49 CEST to 14.42 CEST. During this period, all FTS job submission attempts failed with an Oracle error. The root cause of this was an unexpected behaviour of Oracle datapump (export/import). The problem has been fixed and is documented in https://twiki.cern.ch/twiki/bin/view/FIOgroup/FtsPostMortemJul29
    • Atlas offline cluster node 2 rebooted on Saturday morning (26-7). Services were relocated on other nodes and users were not affected. The root cause has been traced to a bug of Oracle clusterware, more investigations with Oracle support are in progress.
    • Streams replication from Atlas online to offline database stopped on Tuesday evening due to a DDL operation 'drop table' that triggered a known streams bug. The problem was identified and fixed on Wednesday afternoon.
    • Atlas offline to Tier 1s replication was stuck between Saturday and Wednesday (4 days). The issue was caused by the occurrence of a gap in the archivelogs sequence propagated to the downstream capture database. The problem was not spotted by the monitoring (as this looks like no source activity), it is now understood and the replication to Tier 1s has been restarted on Wednesday afternoon. We plan to change the monitoring to also issue warnings about extended periods without apparent replication load.
    • Oracle security patch CPU Jul08 is being tested on validation systems. The patch can be applied as a rolling upgrade and so far no issues have been reported by the validation tests. Production upgrade scheduling for CPU Jul08 has also been agreed with the experiments.
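
The planned monitoring change in the C5 report above (warn about extended periods without apparent replication load) could look roughly like the following. This is a minimal sketch under assumptions: the function name `idle_warning`, the six-hour default threshold and the idea of feeding it the time of the last replicated change are all hypothetical and not taken from the actual Streams monitoring.

```python
from datetime import datetime, timedelta
from typing import Optional


def idle_warning(last_change: datetime, now: datetime,
                 max_idle: timedelta = timedelta(hours=6)) -> Optional[str]:
    """Return a warning if no replicated change was seen for longer than max_idle.

    last_change -- time of the most recent change seen on the replication stream
    now         -- current time
    max_idle    -- longest quiet period tolerated before warning (assumed value)
    """
    idle = now - last_change
    if idle > max_idle:
        return f"WARNING: no replication activity for {idle}"
    return None


if __name__ == "__main__":
    # Hypothetical times: last change on Saturday 26 July, check on Wednesday 30 July.
    print(idle_warning(datetime(2008, 7, 26, 8, 0), datetime(2008, 7, 30, 14, 0)))
```

The point of a check like this is that zero capture-to-apply latency and a stalled source look the same to a latency-only monitor, whereas a long quiet period on a normally busy stream is flagged explicitly.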

Monitoring / dashboard report:

Release update:

  • VOMRS - for LHC VOs hardware migration (4th August 10 - 11 CEST (UTC+2)). The VOMRS service hosted on lcg-voms.cern.ch for the following VOs: alice, atlas, cms, dteam, geant4, lhcb, ops, test, unosat, vo.gear.cern.ch and vo.sixt.cern.ch will migrate to new hardware. There will be up to an hour's loss of service for VOMRS, thus blocking the registration process during this time. Problems -> GGUS

AOB:

  • Jeremy - Q re ATLAS jamboree + in August. Will add pointers and reminder at weekly operations meeting.

  • Kors - We will organize a last Jamboree before LHC turn-on on Thursday August 28 and a preliminary agenda can be found at: http://indico.cern.ch/conferenceDisplay.py?confId=38738 We would really appreciate if representatives of at least all Tier-1's but also of the major Tier-2's will be there, but of course everybody is welcome. The Friday we will use for tutorials and training but we can also organize some extra meetings if needed. The Monday through Wednesday of that same week there will be an Analysis workshop with a focus on tools and development. We have reserved the IT Aud. for that whole week and a video link will be set up also.

  • Xavi (by email) : We will organize a Tutorial and Training session on the 29th of August, the day after the ATLAS Tier-1&2&3 Jamboree. Preliminary agenda can be found at: http://indico.cern.ch/conferenceDisplay.py?confId=38864 I especially encourage potential future shifters, the current shift crowd and site contacts to attend. We will have tutorials for the fundamental services and systems in ATLAS, and also a special monitoring training session based on the ATLAS dashboards (especially interesting for site contacts, as one can see if a site is performing well - either in data management or in simulation production - which is very useful to spot, track and debug problems)

  • Massimo & Johannes - as discussed in the last month and presented in the last two ADC weekly meetings, we are going to have an ATLAS analysis workshop on August 25-27 at CERN. The outline of the session is online on indico ( http://indico.cern.ch/conferenceDisplay.py?confId=38560 ).
    The main goals of the workshop are:
    • Agree (reinforce) the ATLAS strategy for analysis. The central topics are streamlining the development across the different contributors and to discuss the deployment of an infrastructure for users.
    • Revisit the data access for the analysis use cases.
    • Set up a sustainable and efficient user support for ATLAS users. Users activity will critically depend on the quality of the support we will provide.
    • Discuss user feedback and requirements from power users and analysis experts.

      We feel that the workshop will be a good opportunity to consolidate the successful experience in grid analysis of our Ganga and pAthena and continue to build on that. We insist on the "workshop" format because we feel that the three days of the event will be best used in technical discussion (with little formal presentations).

Topic revision: r9 - 2008-08-01 - JamieShiers
 