-- HarryRenshall - 13 Feb 2009

Week of 090216

WLCG Baseline Versions

WLCG Service Incident Reports

  • This section lists WLCG Service Incident Reports from the previous weeks (new or updated only).
  • Status can be received, due or open

| Site | Date | Duration | Service | Impact | Report | Assigned to | Status |

GGUS Team / Alarm Tickets during last week

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information

See the weekly joint operations meeting minutes

Additional Material:

Monday:

Attendance: local(Jean-Philippe, Gavin, Miguel Anjo, Harry, Ricardo, MariaDZ, James, Patricia, Nick, Simone, Andrea);remote(Joel, Ron, Daniele, John Kelly).

Experiments round table:

  • ATLAS - Two points from the weekend. 1) The FTS at NDGF had several jobs not progressing, the oldest from Feb 12; this was fixed by midday today. 2) TRIUMF has installed FTS 2.1 (a completely new installation) and transfers are now running 3-4 times faster. This has exposed the problem of slow LFC registration due to the long RTT, and there is a backlog of 80000 registrations. ATLAS will try to weed out less useful transfers (Jean-Philippe reported that the requested new LFC bulk methods have been delivered).

ATLAS have a new site services candidate fixing one major and one minor bug. This has been installed on the IN2P3 cloud production box for a few days of testing before rollout.

  • CMS reports - 4 new Savannah tickets (internal to CMS). 3 new GGUS tickets, 2 for T1 and 1 for T2. Transfer of a specific MC dataset from US T2s to FNAL T1 has been going on for a long time. FZK has transfer errors to several T2s. Caltech lost a disk and hence 900 files; these are being audited so they can be retransferred there. Previous problems persist at Strasbourg (file exists errors) and Warsaw (no space on device), the latter assigned by the ROC since 12 Feb.

  • ALICE - Two new WMS are being set up at CERN to help debug the poor performance (Nick - will need help in getting the hardware). CNAF has found a recipe which improves the performance of their WMS and this will now be tried at CERN. About 6 sites are now running a CREAM CE for ALICE.

  • LHCb - DIRAC week so only limited activity planned. Normal dummy MC continues. From last week, still waiting for CNAF to schedule the CASTOR intervention. IN2P3 tried a new version of dCache but found some unexpected feature, so the site remains unusable for LHCb. Implementing the pilot role at CERN gave problems with the LSF shares (Ricardo reported this was related to shares for local users, now fixed) and there are intermittent problems accessing CASTOR files at CERN.
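
Note on the ATLAS LFC registration backlog above: a rough, illustrative estimate of why bulk methods help when the RTT is long. Only the backlog size (80000) is taken from the minutes. The RTT, the number of round trips per call and the bulk size are assumed values for the sketch, not measurements, and the function name is hypothetical.

```python
# Back-of-envelope model of network wait time for catalogue registrations.
# Only BACKLOG comes from the minutes; every other number is an assumption.


def registration_time_s(n_files, rtt_s, round_trips_per_call, files_per_call=1):
    """Seconds spent waiting on the network to register n_files."""
    calls = -(-n_files // files_per_call)  # ceiling division
    return calls * round_trips_per_call * rtt_s


BACKLOG = 80_000   # registrations waiting (from the minutes)
RTT_S = 0.15       # assumed round-trip time to the remote catalogue
ROUND_TRIPS = 5    # assumed network round trips per catalogue call

per_file = registration_time_s(BACKLOG, RTT_S, ROUND_TRIPS)
bulk = registration_time_s(BACKLOG, RTT_S, ROUND_TRIPS, files_per_call=100)

print(f"one file per call:  {per_file / 3600:.1f} hours")
print(f"100 files per call: {bulk / 60:.1f} minutes")
```

Under these assumptions the wait time scales with the number of calls, so grouping files per call reduces it by roughly the bulk size. Server-side processing time is ignored here.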

Sites / Services round table:

RAL (JK): are planning a series of upgrades to CASTOR and SRM which will entail service outages.

CERN (MA): ATLAS online to offline streams replication failed again due to a user error.

AOB: MariaDZ had two issues for ATLAS: 1) She would like to have a recipe published of how to avoid the case in GGUS ticket 44585 where the Tier 2 site LAPP was sent a job requiring Tier 0 software. 2) ATLAS have added 20 or more roles in their VO tags, all with the attribute that they are to be used for site configuration; could ATLAS please verify this (CMS may also have this).

Tuesday:

Attendance: local(Jamie, Harry, Julia, MariaDZ, Diana, Nick, Patricia, Eva);remote(Michael, John Kelly, Daniele).

Experiments round table:

  • ATLAS (Simone) - there is no standing issue and one "announcement": the replication of DPDs from cosmic data reprocessing started yesterday. DPDs are distributed to all T1s (not to TRIUMF for the moment, since there is already a backlog of 80K files to be registered in LFC, see yesterday's minutes) and to T2s in the requested shares. The full set of DPDs comprises 50K files, 6 TB data volume.

  • CMS reports - a few internal Savannah tickets, no need for breakdown; no new GGUS tickets. Some open tickets: issues related to MC transfer from US T2s to FNAL are understood a bit more. FTS may not be performing as well as srmcp; may summarize and report the issue to the FTS developers. Nothing (yet) in the Twiki but soon. T2 to T1 transfer issues are being followed up site by site. FZK-T2s: ticket opened yesterday, including GGUS. The FZK scheduled downtime may solve this. Caltech issue with disk loss - long list of files to be invalidated - done. PIC-IPHC(.fr) assigned to CMS contacts. Slow transfers IN2P3 - Purdue fixed by raising priority of subscriptions - done, about 1/3 of overall transfer now done. Last T2 ticket: Warsaw, thanks to MariaDZ for raising at the OPS meeting. Feedback received - should now be OK, will keep ticket open until verified. Could have saved some time by ticketing directly to Warsaw. Thanks for help! T3s: T2-T3 transfer links discussion - not yet an operational issue.

  • ALICE (Patricia) - setup of WMS nodes at CERN to try to resolve problem - issue is to provide 2 nodes and install latest version. Problems assigning h/w. Nick - spoke to Ewan, maybe possible to find h/w in next few days. This is the "CREAM week" - also CNAF and GSI providing CREAM CEs - should be in production by the end of the week. CERN not yet. Nick - will follow up.

  • LHCb -

Sites / Services round table:

  • BNL (Michael) - currently upgrading 1.9.0.9 of dCache - will last ~full day until 6pm Eastern.

  • FZK (Jos) - the scheduled SRM / pnfs intervention.

  • DB (Eva) - intervention to upgrade private switches for RAC3. Standby DBs & PDBR services running with fewer resources but up all the time.

AOB:

Wednesday:

Attendance: local(Harry, Sophie, Gavin, MariaDZ, James, Patricia, Nick, Simone);remote(Michael, Angela Poschlad(FZK), John Kelly).

Experiments round table:

  • ATLAS (SC) - The backlog of LFC registrations at TRIUMF is still slowly increasing. It will take some time for the new bulk methods to be deployed so we are asking TRIUMF to throttle back on their T1-T1 transfers. These are lower priority than the T2-T1 MC results as they are for data consolidation. There was a pending action on experiments to reduce the number of roles on their VO-id cards and it has been agreed with A. de Salvo to reduce ATLAS to 5 such roles. Michael reported they also see a backlog of LFC registrations at BNL, where there are currently 4000 (small) files in transfer but only half that number registered, and registering manually is a painful process. Simone thought this was due to the VObox site services and promised to send instructions on how to upgrade to the recently released new version. This will have to be scheduled as he estimated a half-day of downtime (the VObox has to be drained of activity).

  • ALICE (PM) - A new WMS at CERN running the 4.3 megapatch will go into production this afternoon to help debug the performance problems. There are now 8 sites having a CREAM-CE, with FZK, RAL and CNAF already being used. Also the GSI Tier 2 CREAM-CE has been in operation since yesterday. Sophie reported they are setting up a CREAM-CE at CERN for ALICE to test. The ALICE VObox at Nikhef is inaccessible today.

  • LHCb -

Sites / Services round table:

James reported on the plans for BNL and FNAL removing their remaining service registrations in GOCDB. They had recently discovered that SAM had not been calculating their site availability correctly, as it only considered EGEE (GOCDB) registered resources, and this will be corrected. The long term plan is for all OSG sites to be registered only in OSG and to use their OSG Information Management (OIM) services instead of the GOCDB. It would, however, help the transition if sites would keep their GOCDB names and he queried Michael on this. This was not Michael's preferred approach but he will discuss this with Dantong. He agreed with James that 1 April is a good target date to have completed this transition.

MariaDZ then asked about support for direct routing of GGUS tickets to OSG sites and noted that there might be confusion between GOCDB and OIM site names. James said these should reduce to one per site but there needs to be some cleanup. Maria also queried the status of automated interchange of tickets between the systems and offered to be available for any discussions. Michael confirmed that GGUS tickets directly routed to BNL would enter their ticketing system and welcomed the offer of discussions, adding that Dantong would be the right person on their side.

  • FZK (Artem) - yesterday during the dCache downtime 9:00 - 17:00 (extended from 13:00) Central European Time, the following operations were completed:
    • pnfs moved to postgres version 8.3.6
    • database files compacted during migration
    • some tuning settings applied
    • dCache upgraded to 1.9.1-5

AOB: MariaDZ reminded of the recent LHCb issue where jobs did not find needed software at a Tier 2 site. She queried how the information on such non-grid middleware software or rpms (e.g. the root package) is handled both from the point of view of experiment requirements and site publishing. An issue to be taken offline of this meeting.

Thursday:

Attendance: local(Eva, Harry, Nick, MariaDZ, Simone, Gavin, Ewan);remote(Michael, Angela, Jeremy, John).

Experiments round table:

  • ATLAS - 1) At TRIUMF Di has reduced, as requested, T1 to T1 transfers to one file at a time per channel in order to allow the backlog of LFC registrations to be cleared. This does not affect MC production, which comes in on T2-T1 channels. The backlog should clear over the coming weekend, then ATLAS will reassess the situation on Monday. 2) There is an authorisation failure in transferring files to CNAF. They had a VOMS server error yesterday but ATLAS users use the CERN VOMS server, so we think instead there may be an expired credential in a storage server. A ticket has been raised. 3) BNL have been given instructions on how to upgrade their VObox site services. 4) Site services at CERN will be upgraded on Monday and the two DDM dashboard services will also be merged into one.

  • ALICE -

  • LHCb -

Sites / Services round table:

FZK (AP): Had a problem yesterday evening with too many srmLs requests again. Thought to be from ATLAS as it was at a time when ATLAS production was active, but they cannot be sure. SC said it would be really helpful to know if this is from a production workflow or a rogue user (this must be a universal requirement). The overload stopped around 23.00.

BNL (ME): They will be upgrading to FTS 2.1 this morning (EST).

CERN databases (ED): Applied the latest Oracle CPU patch today to the ATLAS and LHCb downstream databases.

AOB: (MariaDZ) 1) There will be a major networking intervention at CERN on 18 March. See http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ServiceChangeArchive/ImortantDisruptiveNetworkIntervention18March.htm 2) CERN is organising a Platform LSF course at the end of April that should be of interest to CASTOR administrators. Please contact Maria.Dimou@cernNOSPAMPLEASE.ch if you are interested in participating. 3) Maria is organising a meeting to follow up on the repercussions for the GGUS mechanisms of removing BNL from the GOCDB. The proposed date is now 5 March and she would like ATLAS to participate.

Friday:

Attendance: local(Harry, Jean-Philippe, Eva, Ewan, Simone);remote(Angela, Jeremy, John).

Experiments round table:

  • ATLAS - Data deletion from CERN Castor via SRM (lcg-utils) went down to 1 per second from the middle of last week. Not understood why (was previously several/second) since ATLAS have not changed their framework. They suspect a recent SRM change and have opened a ticket. In the meantime they will switch to using local commands.

  • ALICE - Status of the CREAM-CE on 18/02/09
| Site | Queues | Status of the queues | 2nd VOBOX | VOBOX with clients | General Status |
| FZK | 4 | OK | YES | YES | READY |
| KOLKATA | 2 | OK | YES | YES | READY |
| ATHENS | 1 | OK | NO | NO | NOT READY |
| KISTI | 1 | OK | YES | YES | READY |
| GSI | 1 | OK | NO | YES | READY* |
| IHEP | 1 | NOT OK | NO | NO | NOT READY |
| RAL | 1 | NOT OK | NO | YES | NOT READY |
| CNAF | 1 | NOT OK | YES** | NO | NOT READY |

* ONLY 1 VOBOX IS NOT THE REQUIRED SITUATION, ALTHOUGH THE SITE IS READY FOR CREAM PRODUCTION

** VOBOX PROVIDED BUT STILL SUFFERING OF SOME CONFIG ISSUES

  • LHCb -

Sites / Services round table:

RAL (JK): Confirm their CASTOR, SRM and FTS outage for next Monday.

FZK (AP): Have installed FTS 2.1. Will run in parallel with the old FTS and start publishing it on Monday.

AOB:

Topic attachments

alice-cream.htm (r1, 2.6 K, 2009-02-20 15:59, HarryRenshall): CREAM-CE Status for ALICE 18/02/09
Topic revision: r10 - 2009-02-20 - HarryRenshall