-- HarryRenshall - 11 Jul 2008

Week of 080714

Open Actions from last week:

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(James,Jean-Philippe,Andrea,Harry,Alessandro);remote(Derek,Gareth,Michel).

elog review: Nothing new

Experiments round table:

ATLAS (AdG): 1) Alessandro reported that T1-to-T1 transfers from BNL were working better and asked M.Ernst for the status. Michel reported that a storage server had lost network connectivity yesterday and, after many components were changed, was working again. They still had inaccessible files that they were examining case by case and hoped to be able to export soon. 2) There was a problem with ATLAS central services on a testbed machine where a developer installation caused a software clash; this has since been fixed. 3) The retry policy for failed file transfers was found to be too aggressive at 60 minutes and has been throttled back to 90 minutes (a sketch of this kind of interval-based throttling follows after this report). 4) CNAF had several problems running the ATLAS functional tests. These were fixed on Saturday, so ATLAS will continue the functional tests through today and decide tomorrow morning whether or not to resume production.
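
Item 3 above concerns interval-based retry throttling. The following is a minimal, purely illustrative Python sketch of that idea - it is not the actual ATLAS or FTS retry code, and all names in it are invented for the example: a failed transfer is only retried once a configurable minimum interval has elapsed since its previous attempt.

    import time
    from dataclasses import dataclass, field

    # 90 minutes, mirroring the value quoted above (was 60 minutes before the change).
    RETRY_INTERVAL = 90 * 60  # seconds

    @dataclass
    class FailedTransfer:          # illustrative record of a failed transfer attempt
        source_surl: str
        dest_surl: str
        last_attempt: float = field(default_factory=time.time)

    def retry_due(transfer, now=None):
        """True once the minimum retry interval has elapsed since the last attempt."""
        now = time.time() if now is None else now
        return (now - transfer.last_attempt) >= RETRY_INTERVAL

    def select_retries(failed_queue):
        """One scheduling pass: pick only the transfers whose interval has expired."""
        return [t for t in failed_queue if retry_due(t)]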

CMS (AS): The CRUZET3 cosmics run has finished and was considered a good success, especially from the detector side.

Sites round table:

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB: JC reported that an incorrect link in the service map pointers, which gave a different view when reached via the SLS pages, was being corrected.

Tuesday:

Attendance: local(Julia, Jean-Philippe, Harry, Alessandro);remote(Derek, Michel, Daniele).

elog review: nothing new

Experiments round table:

ATLAS (AdG): We are seeing no more problems with the BNL transfers but currently have many errors to IN2P3. This is in fact due to a scheduled downtime ending at 16:00, so we need to improve our monitoring of and reaction to such events. Yesterday we had a problem importing data to CERN (export was working) due to a corrupted proxy inside a VO-box. This is under investigation (it is not the old FTS delegated-proxy problem).

CMS (Daniele Bonacorsi): The current CRUZET-3 repack is around 13 TB and is being inserted into the global DBS and injected into PhEDEx as we speak. CMS is starting a series of global runs tomorrow (Wednesday, 09:00 CERN time, through Thursday afternoon CERN time), continuing on a weekly basis until the CRUZET-4 global run. CMS is informing T1 sites of the conventions for these datasets so that T1 people can prepare. Reprocessing at T1 sites will also take place.

Sites round table: BNL (ME) have been investigating why some files remained inaccessible to file transfers after the repair of their failing file server. It is in fact not a residue of the hardware problems: these files have become permanently pinned in dCache and the SRM access method fails (other access methods do not). BNL recently upgraded to dCache patch level 8 and their interpretation is that transfers for these files failed for some reason and that dCache then pins them implicitly. There is no dCache fix yet, but BNL have deployed a workaround of periodically checking for such files and unpinning them (a sketch of this kind of sweep follows below).
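
As a purely illustrative aid, here is a minimal Python sketch of the kind of periodic sweep described above. The two operations marked "site-specific" are placeholders for whatever dCache admin-interface queries BNL actually use; nothing here is taken from the dCache documentation.

    import time

    CHECK_INTERVAL = 30 * 60  # seconds between sweeps (illustrative value)

    def find_implicitly_pinned_files():
        """Site-specific: query dCache for files left pinned by failed SRM transfers."""
        raise NotImplementedError("replace with the site's own dCache query")

    def release_pin(file_id):
        """Site-specific: unpin one file via the dCache admin interface."""
        raise NotImplementedError("replace with the site's own unpin command")

    def sweep_forever():
        """Periodically unpin any stuck files, as in the workaround described above."""
        while True:
            for file_id in find_implicitly_pinned_files():
                release_pin(file_id)
            time.sleep(CHECK_INTERVAL)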

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Wednesday

Attendance: local(James, Julia, Jean-Philippe, Jamie, Harry, Gavin, Patricia, 70532);remote(Daniele, Derek, Michael).

elog review:

Experiments round table:

  • CMS (Daniele) - CRUZET-3 finalized. Quite good experience; more mature in terms of data handling in general. Need a repacker replay to ensure all is ok. PIC had some pool problems. Some CMS workflow issues, merging problems etc. Data sent to all Tier1s, with IN2P3 as custodial site. Some problems transferring to the CAF, e.g. from PIC. Some timeout problems to ASGC, CNAF & RAL. PIC affected by closure of the gridftp connection. Transfers to FZK aborted at SRM level. IN2P3 in downtime. Some failures to contact the remote SRM. Processing: CRUZET-3 reco for the first time on REAL data, plus other steps. Reco submissions to all T1s starting now; will report back regularly. Data consistency campaigns: important to ensure all book-keeping and catalogs are in sync - this is what has to be worked on in the coming weeks. Small ramp-up in the number of T2s and T3s(!), including joining the PhEDEx topology. Transfer errors occur in preparing the source file, hence it is not clear that the target implementation is relevant.

  • ALICE (Patricia) - nothing for today.

  • ATLAS (Gav) - reported some problems with the new CASTOR 2.1.7-10 deployed at the T0 on Monday: a 20% failure rate on transfers to the worker nodes.

Sites round table:

  • BNL (Michael) - very high incoming rates from CERN - 300MB/s - to tape. A bit unusual - follow up. Alessandro by phone - will investigate.

Core services (CERN) report:

  • Investigating gridftp errors. A background error for some time, but more common now? Monitoring issue: the new gridftp monitoring uses the new messaging system from CASTOR, and the messaging component dies. Taken off all servers; back to 4 servers with some alarms. Hope to move back to wide-scale testing next week.

  • CERN Network Support announced that the security ACLs of the CERN LHCOPN routers will be changed on July 16th from 09:00 to 10:00. The changes will not affect LHCOPN traffic and the intervention will be transparent to users.

DB services (CERN) report:

Monitoring / dashboard report:

  • Servicemap followup - service map redirecting to SLS now.

Release update:

AOB:

Thursday

Attendance: local(Derek, Kors, Jeremy, Michael, Daniele);remote(Julia, Harry, Jamie, Miguel, James).

elog review:

Experiments round table:

  • ATLAS (Kors) - running relatively smoothly over the past 24h. One issue: suddenly a lot of data was shipped out of the T0 to the T1s, particularly BNL. Had not realised that as soon as cosmic data appears it is immediately shipped out. ATLAS is taking cosmic-ray data with a lot of test triggers. Data to CASTOR at 800 MB/s - 3x nominal! Huge events & datasets (assigned to specific T1s, e.g. BNL). Running tests concurrently. Nothing significant to report otherwise.

  • ATLAS (Miguel) - castor clients at T0 seeing timeouts. Network problem on switches used by some batch nodes. Partially fixed after lunch and fully at 18:00.

  • CMS (Daniele) - same as yesterday! Dealing with repacker replays at CERN & export to the T1s. Review of main problems and errors: easily categorised by dataops, mainly SRM unresponsiveness here & there. Some issues at ASGC, CNAF and RAL still under investigation. Focussing on the data consistency campaign (what is residing on the storage systems, ensuring the catalogs are consistent, etc.; a toy illustration of such a check follows below).
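
As a toy illustration of what such a consistency check amounts to (the real campaign works against the experiment catalogs and site storage dumps; the plain-text file lists and file names below are assumed only for the example), one can compare the set of files a storage system reports with the set the catalog knows about and flag the differences:

    def load_file_list(path):
        """Read one logical file name per line, ignoring blank lines."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def compare(storage_dump, catalog_dump):
        on_storage = load_file_list(storage_dump)
        in_catalog = load_file_list(catalog_dump)
        dark = sorted(on_storage - in_catalog)   # on storage, unknown to the catalog
        lost = sorted(in_catalog - on_storage)   # cataloged, but missing from storage
        print("dark files: %d, lost files: %d" % (len(dark), len(lost)))

    if __name__ == "__main__":
        compare("storage_dump.txt", "catalog_dump.txt")  # hypothetical dump files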

Sites round table:

  • CNAF (Daniele) - hoping to re-establish the production situation - many unfortunate incidents at the same time. Kors: there is another scheduled downtime of 3 days - is it a general downtime? Can CNAF give an update directly?

Core services (CERN) report:

DB services (CERN) report:

Monitoring / dashboard report:

Release update:

AOB:

Friday

Attendance: local(Andrea, Gavin, Harry, Jamie, Jean-Philippe, Jan, Miguel);remote(Derek, Michael, Michel, Luca).

elog review:

Experiments round table:

  • CMS (Daniele) - continuing as said previously. Replay of repacker tests started this morning. Soon Tier1s will start to get data.

  • ATLAS (Kors) - quiet night; the DQ system was blocked. Restarted this morning at 10:00 - a big boost. Not fully understood, but... Issue with the OPN - 3 incidents this/last week: last week BNL/TRIUMF, this week BNL/RAL, yesterday BNL/PIC. All traced down to OPN problems. Probably an issue at BNL -> David Foster. First time we have seen so many failures in such a short time. Will continue functional tests + M8 data over the w/e.

Sites round table:

  • CNAF (Luca) - after the various incidents last w/e the situation is now stable. All services have been recovered and are back in operation. A failure on a storage apparatus (under maintenance for 3 days) does not affect the LHC experiments. All systems are now working, including the infrastructure (UPS etc.). In principle normal service should now have resumed.

  • BNL (Michael) - failure last night regarding BNL-PIC connectivity: PIC could not get data from BNL. Tracked down to a routing problem - the PIC network was announced to BNL not through the OPN but through ESnet, which led to a hang-up of ~8 hours. There is no special relationship between these sites in the ATLAS model.

  • GRIF (Michel) - no LHCb activity is being seen - is this normal?

Core services (CERN) report:

  • CASTOR 2.1.7-10 (Miguel) - 4 bugs identified and fixed. Will upgrade ATLAS to -11 and then schedule the other experiments; -11 is expected in the next fortnight (or so). One bug had already been seen in a previous version but could not be reproduced in the last 2 weeks; it was seen yesterday - a core dump in one daemon.

DB services (CERN) report:

Monitoring / dashboard report:

  • elog (James) - will be taken down next week for an O/S & s/w upgrade, which should also fix the password-changing problems. The main reason is to get some SSL features. Harry: will it still allow the ATLAS python script to run? A: changed again - hourly release schedule, run on an ATLAS box. 11:00 Tuesday morning.

Release update:

AOB:

  • ftm (Gav) - the config page had an example pointing to a dev box; the config has been corrected. This probably explains why some sites had not been published 'correctly'.