Week of 080728

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Simone, Stephane Jezequel (Atlas expert on-call), Julia, Andrea, Harry, Jean-Philippe, Ulrich);remote(Derek, Gonzalo, Michael).

elog review: nothing new

Experiments round table:

ATLAS (SJ): Started cosmics data export on Friday with many T1 problems. A full PNFS log at NDGF stopped their dCache, FZK had a dCache problem, and IN2P3 asked ATLAS to stop central file deletion and were up and down all weekend. PIC has had delegated proxy corruption for some time and Gonzalo reported they currently believe this is due to a lack of clock synchronisation, possibly a kernel/hardware mismatch. They have now installed a crontab running every 2 minutes and asked if any corruption was seen over the weekend - the answer was no. ATLAS sends calibration data via FTS to a US T2 at Michigan (AGLT2) and this has stopped working. Michael reported this as a BDII problem: the AGLT2 SE should be published in an OSG BDII to be recognised by the CERN BDII, and apparently this has not been happening for 2 weeks, though he thought the OSG BDII was correct. M. Litmaath at CERN is looking into this (a later quick check showed that around 17 July the site disappeared from the Indiana U. site reference file as used by CERN; there may have been a layout change/misinterpretation). Over the weekend two ATLAS T2 sites did not recognise the transfer certificate, there was a gridftp problem to ASGC, and on Sunday evening an ATLAS problem led to missing subscriptions of cosmics to BNL - now fixed.
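
As an illustration of how such a publication problem can be cross-checked, the sketch below queries a BDII for the SE in question. The hostnames and SE name are placeholders rather than the real AGLT2 values, and the Glue attribute matched is only an assumption about what the site publishes.

    #!/usr/bin/env python3
    # Minimal sketch (hypothetical hostnames): check whether an SE is published
    # in a given BDII by searching the Glue information tree on port 2170.
    import subprocess

    def se_published(bdii_host, se_name):
        """Return True if the SE appears among the BDII's GlueSE entries."""
        cmd = [
            "ldapsearch", "-x", "-LLL",
            "-H", f"ldap://{bdii_host}:2170",   # standard BDII LDAP port
            "-b", "o=grid",                      # top of the Glue information tree
            f"(GlueSEUniqueID={se_name})",
            "GlueSEUniqueID",
        ]
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        return se_name in out

    # Placeholder names: compare what the OSG BDII and the CERN top-level BDII see.
    for bdii in ("osg-bdii.example.org", "lcg-bdii.example.org"):
        print(bdii, se_published(bdii, "se.aglt2.example.org"))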

CMS (AS): Nothing to report.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB: A. Sciaba reported that CMS had requested a new CASTOR pool some 10 days ago without any feedback yet. U. Schwickerath from FIO promised to look into this.

Tuesday:

Attendance: local(Roberto, Julia, Simone, Andrea, Harry, James, Jamie, Miguel, Gavin, Olof, Uli, Steve, Jean-Philippe);remote(Michael, Jeremy, Derek).

elog review:

Experiments round table:

  • LHCb (Roberto) - elog entries, 3 on CNAF/STORM; all developers are on holiday! SRM flickering behaviour at FZK - no update since 23rd July. File transfer problem out of Lyon - Lionel following closely with Andrew. Same issue out of SARA - no update from Julian since 25 July. Last entry concerns FTS delegation at PIC; update from Gonzalo - the origin is strange clock behaviour on the FTS server: the seconds counter goes backwards! (An illustrative check for this is sketched after this list.) MC simulation is running through DIRAC3. Reconstruction/stripping will start only when the new release with the new GFAL is in production; the same release would also trigger the Ganga-DIRAC3 switch. Beginning of use of DIRAC3 for analysis and retirement of DIRAC2 (which requires obsolete services).

  • CMS (Andrea) - problem with FTS exports since yesterday evening; error 'server has problems contacting DB'. The problem disappeared at 20:15 and reappeared this morning. Gavin: a schema export, which was unannounced, took place around the time of the first problem - even bog-standard read-only operations should be announced. A sequence on the production DB was reset to an earlier value; why the problem went away and reappeared is not understood. Service currently back; fixed at 14:20 today (the faulty sequence was dropped). (Affected all VOs.) The rest of the service continued to run but ran out of jobs as new jobs could not be inserted. Andrea - reminder of the start of the mid-week global run (2 days) at 09:00, with the CASTOR SRM upgrade 09:00 - 10:00; CMS would like to be informed immediately of any problem and when the intervention ends. Miguel - this was done yesterday on public and today on ATLAS without problems: transparent! Pending request of a 10TB CASTOR pool for CMS. Miguel - being analysed internally; will create the pool.

  • ATLAS (Simone) - suffered the same FTS problem just mentioned. For the rest things are going relatively smoothly, with very good efficiency for all sites. The Great Lakes (AGLT2) publication problem in the BDII does not seem to be solved. Harry - the dashboard says datadisk datasets are being created. Problem with calibration - no Q! A dashboard issue? Not yet reported. Miguel - the Tier-0 sent an email reporting a delay on migrations on the Tier-0 ATLAS pool; started yesterday - about 10K files to write to tape - investigating. Meeting tomorrow at 10:00.
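
Regarding the PIC clock behaviour mentioned under LHCb above, a minimal, purely illustrative sketch of how one might detect the seconds counter stepping backwards on a host (the duration and polling interval are arbitrary):

    #!/usr/bin/env python3
    # Illustrative sketch: watch the wall clock for backwards jumps, which would
    # confuse anything (like delegated proxies) that assumes time only moves forward.
    import time

    def watch_clock(duration_s=120, poll_s=0.01):
        """Poll time.time() and report any poll interval where it goes backwards."""
        last = time.time()
        for _ in range(int(duration_s / poll_s)):
            time.sleep(poll_s)
            now = time.time()
            if now < last:
                print(f"clock stepped back by {last - now:.6f} s")
            last = now

    if __name__ == "__main__":
        watch_clock()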

Sites round table:

  • RAL intervention on SE. Derek - paused the longer queues last night. No news from the intervention - good news?

Core services (CERN) report:

  • VOMS (Steve) - intervention early next week. Transparent for everything except registrations, which will see a ~30 minute interruption. Move to new hardware and Quattor. Will be scheduled and announced.

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Testing the interface to GGUS (Harry) - some test tickets will be submitted. No reply is needed (the operator should reply).

Wednesday:

Attendance: local(Jean-Philippe, Harry, Jamie, Julia, Simone, Andrea, Roberto, Uli, Jan);remote(Derek, Michael, Jeremy).

elog review:

Experiments round table:

  • CMS (Andrea) - little to report. CASTOR SRM intervention this morning lasted only a few minutes and was perfectly transparent. Pool for log files requested has been created. Global run going on - news tomorrow.

  • ATLAS (Simone) - status of RAL (after the CASTOR migration)? Derek - SRM ATLAS is back up and functional. The extra load of transfers adversely affected FTS - currently down / under maintenance ('poke at DB'). One of the filesystems was full - knock-on effect. Simone - transfers from CERN to RAL using the CERN FTS should be OK? A - yes. Next point: the Tier2s in the US cloud (Great Lakes) are now working; unclear what the problem was - the BDII at CERN? BNL? Harry - the site itself. Michael - yes, a badly configured site, corrected with the help of James etc. NDGF in a 'sort of downtime' - at risk today and an outage tomorrow, extremely problematic for transfers. Interventions on some pools and not others?? To be confirmed - the downtime should end tomorrow.

  • LHCb (Roberto) - elog entry: the CASTOR SRM intervention at CERN went fine (for the record). Open issues in the elog: 3 open issues with the STORM endpoint, unchanged as the developers are all away; a 4th problem transferring files out of STORM is also in the elog - a problem preparing the source file, with some error from FTS. Open issue since 23 July at FZK - SRM flickering; still no reaction... The SARA timeout problems seem to be understood by Ron (comment in the WLCG elog thread): the pin manager needs to be restarted from time to time. [ Missed ] The pin manager issue is also seen at Lyon - a middleware problem? -> dCache developers. Summary of all WMS issues for LHCb: the VDT 1.6 issue with proxies of > 10 delegations is starting to be a problem - a patch from globus is already available - a few days in certification, a week in PPS, ready for production mid August(?). This touches everything that requires authentication! (A rough check of a proxy's delegation depth is sketched after this list.) Default mapping in case a given FQAN fails: gridmap mapping to a default, e.g. the s/w manager user. CMS requested mapping to a default user ~6 months ago. On track... The VDT change could have significant knock-on effects and in any case needs to be followed carefully.
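
On the VDT delegation issue above, a rough way to see how deep a proxy's delegation chain already is would be to count the certificates bundled in the proxy file; the default path used below is the conventional one and only an assumption, as is the "count minus two" estimate of the number of delegations.

    #!/usr/bin/env python3
    # Rough sketch: estimate the delegation depth of a Grid proxy by counting the
    # PEM certificates in the proxy file (each delegation adds roughly one level).
    import os

    def chain_length(proxy_path=None):
        """Count certificates in a proxy file; path defaults to /tmp/x509up_u<uid>."""
        if proxy_path is None:
            proxy_path = os.environ.get("X509_USER_PROXY",
                                        f"/tmp/x509up_u{os.getuid()}")
        with open(proxy_path) as f:
            return f.read().count("-----BEGIN CERTIFICATE-----")

    if __name__ == "__main__":
        n = chain_length()
        print(f"{n} certificates in chain "
              f"(~{max(n - 2, 0)} delegations beyond the end-entity certificate)")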

Sites round table:

Core services (CERN) report:

  • Jan - CASTOR SRM upgraded successfully and transparently for all production endpoints. CASTOR ATLAS back-end upgrade tomorrow - non-transparent; patch level 13. Will be scheduled in the coming weeks also for the other experiments.

DB services (CERN) report:

  • Dawid (by email):
    • Oracle CPU patch applied on all integrations and scheduled for ATLAS, LHCb and LCG in two weeks' time
    • FTS problem during export/import traced to an Oracle bug (post mortem conducted)
    • ATLAS streaming from online to offline broke down yesterday evening, now fixed. Streaming to Tier1s was affected and is currently under investigation.

Monitoring / dashboard report:

  • Julia - started migration of SAM tests for ATLAS to production DB. Presentation today...

Release update:

AOB:

Thursday:

Attendance: local(Miguel, James, Simone, Harry, Jamie, Uli, Roberto, Julia, Gav);remote(Jeremy, Michael, Gonzalo, Derek, Daniele).

elog review:

Experiments round table:

  • CMS (Daniele) - still working on the mid-week global runs; the 2nd was smoother than the 1st, with things being fixed and refined as the exercises go by. A couple of open issues with CASTOR - need to answer some requests regarding creation of pools etc.; will follow up asap. Refining the lists of T1s that will get CRUZET data and a replay of the CRUZET3 exercise - 1 custodial site plus 1-2 other sites too.

  • ATLAS (Simone) - 2 things. 1st - RAL: exports from T0 to RAL run at ~100% efficiency but T1-T1 transfers from/to RAL look problematic. Derek - the SRM response time for one of the (round-robin) hosts goes very high; not sure why - the SRM expert is on holiday, back next week. (A simple per-host timing check is sketched after this list.) Need to sit on the problem till then? Stop activities? Continue - it will help debugging. 2nd - elog/ticket about the BNL-CERN channel; the reply from Gav suggests the BNL endpoint name changed (the published sitename changed). Follow-up? Michael - probably due to the fact that the sitename is not consistent between OSG and WLCG - legacy BNL-WLCG2 or similar - with implications for availability reports: 0% availability of the SE due to non-aligned site names. One of the team members changed the name published in OSG and it propagated to WLCG; the team member has been notified and is in the process of fixing it. Should rerun the thing [details << Gav]
    2 requests for pools in CASTOR at CERN (reprocessing at CERN & end-user scratch space). Miguel - getting lots of requests; being reviewed. Maybe some change in the way things are deployed.

  • LHCb (Roberto) - still 4 entries against CNAF. Still the GridKA SRM flickering behaviour. Good news - SARA & Lyon seem to have understood the timeout on preparing files for transfer out. The work-around adopted at Lyon (restart the pin manager when stuck) is OK - would like SARA to adopt the same work-around until a patch is available. FTS/dCache issue - Lyon/PIC failures; corruption of the SRM dCache DB?? More info in the GGUS ticket. Savannah bug on WMS - delegated proxy credentials somehow mixed up with the client credentials.
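
For the RAL observation above that one host behind the round-robin SRM alias responds very slowly, a simple illustration of how the individual hosts could be probed separately is sketched below. The alias, port and timeout are placeholders, and a plain TCP connect only probes the network/listener level, not the SRM service itself.

    #!/usr/bin/env python3
    # Illustrative sketch: resolve a round-robin alias into its individual hosts
    # and time a plain TCP connect to each, to spot a single slow or dead member.
    import socket
    import time

    def time_members(alias, port=8443, timeout=10.0):
        """Return {ip: connect time in seconds, or None} for every address behind alias."""
        ips = sorted({info[4][0]
                      for info in socket.getaddrinfo(alias, port,
                                                     proto=socket.IPPROTO_TCP)})
        results = {}
        for ip in ips:
            start = time.time()
            try:
                with socket.create_connection((ip, port), timeout=timeout):
                    results[ip] = time.time() - start
            except OSError:
                results[ip] = None   # unreachable or timed out
        return results

    if __name__ == "__main__":
        # Placeholder alias and port; the real SRM endpoint name/port may differ.
        for ip, t in time_members("srm-atlas.example.ac.uk").items():
            print(ip, "unreachable" if t is None else f"{t:.3f} s")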

Sites round table:

Core services (CERN) report:

  • CASTOR ATLAS upgraded transparently to latest production release. Start contacting other experiments to have all at same level (2.1.7-13) within next couple of weeks. Tier1s also upgrading in coming weeks...

DB services (CERN) report:

Monitoring / dashboard report:

  • (Julia) Problem with CMS dashboard update due to DB hang - reproducible? Ran smoothly after restart of collectors.

Release update:

AOB:

  • (Harry) Edoardo Martelli is back and will resume contact with BNL on LCG OPN issues (if not already resolved)

Friday:

Attendance: local();remote().

elog review:

Experiments round table:

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:
