Week of 090622

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Gavin, Patricia, Nick, Maria, Eva, Jean-Philippe, Alessandro, Simone, Diana, Dirk, Olof);remote(Jamie, Michael, Gonzalo, Angela, Gareth, Brian).

Experiments round table:

  • ATLAS - (Alessandro) Quiet weekend. ASGC has been taken out of data distribution because of the scheduled downtime tomorrow; the functional tests will continue. The second site being removed from data distribution is RAL, because of their long scheduled downtime of ~2 weeks, until 6th July. Glasgow will serve as UK Tier-1 in the meantime and this was successfully tested during the weekend. Brian: is the RAL LFC still used for logging in the catalogue? Yes, since it's only going to be down for a short period (~24 hours starting on Thursday); ATLAS will decide how to cope with the 24-hour downtime of the LFC. Michael: are there any updates to the data distribution plans for the cosmic run? Alessandro will attach the report to this twiki after the meeting: ATLAS DataTaking for June

  • ALICE - (Patricia) Production perspective: stable over the weekend, with ~9k concurrent jobs at all times. This morning the job submission was restarted, which caused a dip; this is under ALICE's control. The CREAM CE issue at CERN continues and is being investigated by the service managers. Transfers: tests were run site by site over the weekend and an aggregate average of 200 MB/s was achieved. Gavin: the issue about the speed of some transfers reported last week has been investigated and seems to be due to the channel settings for some of the sites being too low (e.g. only one channel for RAL). This is being tuned by the FTS operations team (see the sketch below).
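
A possible way to inspect such a channel configuration before tuning it is sketched below. This is a minimal sketch only, not the FTS operations team's actual procedure: the service endpoint URL and channel name are illustrative, and the exact options of the FTS 2.x admin CLI may differ between releases.

    # Minimal sketch: inspect an FTS 2.x channel's current settings (number of
    # concurrent files, streams, state, ...) via the glite-transfer-channel-list CLI.
    # The endpoint URL and channel name are illustrative; exact options may differ
    # between FTS releases.
    import subprocess

    FTS_SERVICE = "https://fts.example.cern.ch:8443/glite-data-transfer-fts/services/FileTransfer"  # illustrative
    CHANNEL = "CERN-RAL"  # illustrative channel name

    def show_channel_settings(service, channel):
        """Return the channel configuration as reported by the FTS CLI."""
        result = subprocess.run(
            ["glite-transfer-channel-list", "-s", service, channel],
            capture_output=True, text=True, check=True,
        )
        return result.stdout

    if __name__ == "__main__":
        print(show_channel_settings(FTS_SERVICE, CHANNEL))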

Sites / Services round table:

  • BNL - (Michael) NTR
  • PIC - (Gonzalo) NTR
  • FZK - (Angela) there were reports from ATLAS and LHCb about tape access problems over the weekend. The monitoring doesn't show any problem but the issue is being investigated. FZK has scheduled a 4-hour tape outage for this week, plus 2 hours for rebooting the dCache servers. The intervention is needed to address performance bottlenecks that were found during the STEP exercise: Thursday 7-11 UTC, followed by the reboot. Are the VOs OK with this? ATLAS and ALICE said yes; Angela will follow up with the others.
  • RAL - (Gareth) first day with contractors moving hardware to new building.
  • CERN - (Gavin) scheduled SRM upgrade for tomorrow morning. Transparent

  • Request to upgrade srm-atlas and srm-lhcb to SRM 2.7-18 on Tuesday 23rd at 09:00. Provisionally agreed by LHCb and ATLAS.

AOB: (MariaDZ) Please read the 'Summary and Conclusions' section of the last USAG notes: http://indico.cern.ch/getFile.py/access?resId=0&materialId=minutes&confId=59811 Experiments, please prepare for the analysis of the overdue tickets assigned to you at the next USAG on July 2nd. You can find all your tickets at https://gus.fzk.de/pages/metrics/download_escalation_reports_vo.php

Tuesday:

Attendance: local(Olof, Steve, Alessandro, Gavin, Eva, Nick, Andrea, Joel, Maria);remote(Angela, Michael, Fabio, Jeremy, Luca, Gareth, Michel, Vladimir).

Experiments round table:

  • ATLAS - (Alessandro) Started cosmic data taking yesterday evening. 70 datasets have been distributed to the Tier-1s, and distribution to the Tier-2s has also started. No major incidents except at FZK, where there were some problems with DATATAPE, which seem to have been solved now. Concerning the FZK intervention announced yesterday: there is no GOCDB entry for Thursday's intervention? Angela: it is not possible to put a 'Tape' downtime in GOCDB, so how do you want it to be entered? Disk will be available during the intervention. However, the reboot of the dCache pool nodes implies ~2 hours of downtime and that will be entered in GOCDB. Gavin: the Glasgow channel has now been set up on the CERN FTS service, please give it a try; will do.

  • ALICE - (Patricia, before the meeting) Nothing to report in terms of production, which is running smoothly after the new MC cycle setup reported yesterday. It would be worth having some news on ticket 49553, reported on the 17th of June, regarding the CREAM-CE services at CERN that are still not operational.

  • LHCb reports - (Joel) Continued activity from last week. Problem yesterday: the download of a file bigger than 27 GB caused problems at IN2P3. LHCb has informed IN2P3 and will fix the problem on their side.

Sites / Services round table:

  • FZK (Angela) The problem reported by ATLAS was due to a full disk. The server was rebooted and the data moved.
  • BNL (Michael) NTR
  • IN2P3 (Fabio) scheduled downtime of HPSS on June 30th for adding new tape drives. It will last around 4 hours, starting in the morning. The dCache service will be available but, of course, tape-resident files cannot be accessed. Is this intervention inconvenient for the ATLAS cosmic run?
  • GridPP (Jeremy): the UKI-LT2-IC-LeSC site will be decommissioned in July. This is one of the London Imperial College sites and it is running with old hardware. The IC-HEP site will continue. A GOCDB downtime and broadcast will be entered. The site has no storage.
  • INFN (Luca): no issues to report, but parts of the GPN will be switched to the OPN on Tuesday. The operation is done in coordination with the CERN networking team (Edoardo). Luca asked about the BigId patch released by Oracle and its deployment schedule at CERN. Olof: we are planning to roll it out here at CERN, starting with the preproduction and background stagers (repack). The CERN DBAs have tested the patch, which successfully passed the reproducible test-case. The upgrade is rolling and would therefore not require downtime. In principle there is no reason to hold back the deployment at the Tier-1s, but Olof recommends consulting the CERN DBAs (Nilo, Eric, Dirk) in case of questions.
  • RAL (Gareth): the move to the new building is under way. On Thursday all services at RAL will stop when the networking equipment is moved. Some of the services will come back the day after, while others (CASTOR) will come back the week after. Alessandro: concerning the RAL LFC - ATLAS has decided to suspend the UK updates during that time.
  • GRIF (Michel) NTR
  • CERN: (Gavin) Transparent upgrade of the SRM (public, LHCb and ATLAS) end-points (2.8-18) this morning. A problem with some of the switches in the CC started on Friday afternoon: the switches drop out for 30 seconds and this has started to cause problems for end-users ('No route to host'). So far the problem only affects switches in front of CASTOR disk servers. The problem is being investigated by the networking team.
  • ASGC (report submitted by Jason & Horng-Liang after the meeting):
    Regarding the ASGC data-centre relocation carried out today:

    The earlier time estimate turned out to be insufficient due to an unexpected cancellation by one of the contract vendors, which meant losing another four dedicated engineers who were to help with the facility relocation and installation from the IDC back to ASGC. The complex fibre cabling also delayed the process by another two hours. We were only able to finish the preparation of the rack-mount kits for the defined rack space today, and will resume the facility installation tomorrow. Full functionality of the grid services is expected to be restored before 6 pm, both for the T1/T2 core and for the regional services.

    Because of this, the intervention has been extended by another day. The facility installation is expected to finish before 1 pm tomorrow, and we hope the cable management (power cables, FC and UTP) can be done before 4 pm.

    Apologies for any inconvenience this may cause.

AOB:

  • Maria: tomorrow before 10am the new release of GGUS will be completed. This means that TEAMers and ALARMers will be taken from VOMS from now on. No mismatches were found for ALICE, some for ATLAS and many for CMS and LHCb. Fabio: in the past GGUS was configured so that people at the sites could submit alarm tickets for testing, which was useful; will this change in the future? Maria: good point! Will follow this up with Gunter and ensure that dteam ALARMers are added if necessary.
  • Joel: receiving mails about pool accounts whose password will expire on the 29th of June; hopes there is no problem. Steve: we tested with the dteam accounts yesterday, which didn't reveal any problems.

Wednesday

Attendance: local(Alessandro, Daniele, Eva, Jamie, Simone, Jean-Philippe, Diana, Antonio, Roberto, Olof (chair));remote(Michael, Fabio, Angela, Ron).

Experiments round table:

  • ATLAS - (Alessandro) A change in the ATLAS Panda monitoring code created an issue on the ATLR database, which propagated to other services depending on the same database. This morning it was found that the dashboard was stuck. From the ATLAS point of view this should be prevented in the future; a post-mortem will be created by ATLAS. From the cosmic data-taking point of view everything is OK.

  • CMS reports - (Daniele) Closely following the FZK tape situation, especially the unexpected downtime. Also monitoring the transfer quality to IN2P3. Have examples of files with wrong checksums, which are draining a lot of effort from the offline production team (see the checksum sketch after this list). Tier-2: minor issues in Florida (checksum issues), Nebraska (network problem), French Tier-2s lacking close-SE matching. Also Tier-3 issues in the UK and US. Angela: what unexpected downtime do you mean? There was a mail earlier today reporting problems or delays in staging files; Daniele will forward the information to Angela. Angela: there is currently a problem with CPU time reporting from PBS; a workaround has been put in place while investigating. Fabio: will investigate the transfer errors to IN2P3; he wasn't aware of the issue.

  • ALICE (Patricia, before the meeting) - The issue with the CREAM-CE at CERN reported through ticket 49553 has been successfully solved and the ticket has been closed; currently more than 500 ALICE jobs are running through ce202.cern.ch. A problem with the grid33.lal.in2p3.fr WMS in France was reported by Michel this morning (followed through ticket 49743). The problem is affecting ALICE job submission to all French sites. In addition, the second WMS placed in France for ALICE, ipngrid28.in2p3.fr, also seems to have problems at submission time. Experts at the sites are aware and following the problem. In the meantime an external WMS will be configured for all ALICE French sites by today. The issue is not currently affecting the production, since all French sites are still running the jobs they had in their queues, but it will in the following hours since no new jobs will be submitted to these sites; the definition of an external WMS should solve the problem. Fabio: this issue seems to have been triggered by the CA upgrade yesterday. It is apparently a known issue in the WMS and there is a GGUS ticket, 49760. It is affecting all WMSes in France, not only for ALICE but also for other VOs. Maarten is aware of the problem.

  • LHCb reports - (Roberto) The 1-billion-event MC production is ongoing and currently there are 6k jobs in the system. DIRAC interference is preventing other (user) jobs from going through. There is an issue with resolving input data for jobs running at CERN and GridKA. CERN issue: one CE (ce124) is known to be in downtime, which causes many pilot jobs to fail; there is a ticket asking to simply disable the ce124 server. Angela: concerning the problems reported with the shared software area at FZK: FZK would like to move the software area to faster disks, but this will require a few hours of (read-only) downtime. Is Monday OK? Yes; Roberto will forward the information. Fabio: very few LHCb jobs at IN2P3 (~90), should we worry? Roberto will check.
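
The checksum mismatches CMS mentions above are typically confirmed by recomputing a file's Adler32 checksum locally and comparing it with the value recorded in the catalogue. The sketch below is illustrative only; the file path and the expected checksum value are made up, not taken from the CMS reports.

    # Minimal sketch: recompute a file's Adler32 checksum and compare it with the
    # value recorded in the catalogue. The path and expected value are illustrative.
    import zlib

    def adler32_of_file(path, chunk_size=1024 * 1024):
        """Return the Adler32 checksum of a file as an 8-digit hex string."""
        checksum = 1  # Adler32 starts from 1
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                checksum = zlib.adler32(chunk, checksum)
        return "%08x" % (checksum & 0xFFFFFFFF)

    if __name__ == "__main__":
        path = "/tmp/some_transferred_file.root"  # illustrative
        expected = "0b3a4c5d"                     # illustrative catalogued value
        actual = adler32_of_file(path)
        print("OK" if actual == expected else "MISMATCH: %s != %s" % (actual, expected))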

Sites / Services round table:

  • BNL (Michael): there was a double fibre cut in US LHCNet. Two links were affected: FNAL reduced from 10 to 5 Gbit/s, affecting the primary (secondary down, 1.3 Gbit/s); BNL also reduced from 10 to 5 Gbit/s (secondary not affected). This seems to have caused problems for the cosmic data transfers: connections to the gridftp doors hang or go stale, with a subsequent problem of locked slots. The problem has been reported to dCache but there is no solution yet. The Michigan Tier-2 disk-space issue is being worked on.
  • IN2P3 (Fabio) NTR
  • FZK (Angela): the downtime notification from GOCDB was not sent out yesterday; GGUS ticket 49747 has been opened. Second issue: a hardware problem on an important monitoring machine; monitoring information will be missing while the problem is being worked on.
  • NL-T1 (Ron): one thing - during STEP, ALICE sent a continuous flow of 30 MB/s of data (mainly zeros) and this has continued after the challenge.
  • CERN: Simone: FTS configuration for Tier-2s - there is only one channel between CERN and Munich, while two are needed. A direct channel from CERN to the UK Tier-2s has also been put in place (using the FTS cloud concept) and seems to work fine. The Great Lakes Tier-2 is also served by the STAR channel; this is being cured. Otherwise the FTS setup for the Tier-2s is OK. Eva: LFC replication problem at GridKA: Oracle is working on the problem based on the information collected; no solution yet.

  • Middleware (Antonio) gLite 1.3.2.8 - fetch-crl script improvements, new dCache client (1.9.0.9) and server (1.9.1.7), CREAM clients in the VOBox. See the middleware link at the top of this twiki page for details.

AOB: (Maria) Please announce to Fabio and all that the site admins who periodically launch alarm tests will continue to be authorised alarmers, as before. See https://savannah.cern.ch/support/?104835#comment82

Thursday

Attendance: local(Edoardo, Jamie, Daniele, Eva, Alessandro, Gavin, Andrea, Gang, Patricia, Nick, Roberto, Simone, Olof (chair));remote(Angela, Michael, Fabio, Gareth, Brian, Michel).

Experiments round table:

  • ATLAS - (Alessandro) Cosmic data replication: replication within clouds has been stopped because wrongly reconstructed datasets were discovered; the cause has to be found before it is re-enabled. Another small issue: an 8-hour downtime at PIC, though correctly announced, had not appeared on the ATLAS dashboard. The UK cloud has been removed from DDM until the LFC is back. Simone: the problem found in the reconstruction code was that it was running with the magnetic field off; the data will be deleted from the sites. The University of Geneva has set up a special trigger/DAQ transfer from CERN to Geneva: the new datasets will be replicated, which is a new stream of data (~MB/s). A post-mortem for yesterday's Panda issue can be found at https://twiki.cern.ch/twiki/bin/view/Atlas/PandaAtlrJune2009.

  • CMS reports - (Daniele) In the last 24 hours the ticket situation in operations has changed a bit, with the focus shifting to ASGC as they come back. Still monitoring the quality of transfers to IN2P3, with good support from Fabio. Investigating the checksum problems caused by transfers is taking effort from the central operations team. Some focus on ramping up site availability for the Russian Tier-2 sites, with a 100% improvement (from 3 of 8 sites to 6 of 8) over the last couple of weeks. Internal discussions about Tier-2 to Tier-2 transfers and their potential implications.

  • ALICE - (Patricia) this morning ALICE announced two new production cycles. While the MC parameters are being tuned, sites might see some instabilities, which was the case this morning with ~1000 failed jobs. A new ticket concerning the CREAM CE at CERN (ce202) has been opened, ticket number 49793. The announced prototype virtualised cluster behind a CREAM CE at CERN will be tested by ALICE.

  • LHCb reports - (Roberto) All MC physics production has been temporarily halted because of inconsistencies in the DIRAC bookkeeping. Checks will be performed today and the production will stay off until this has been sorted out. Another issue, with the DIRAC prioritisation mechanism, is being investigated. Profiting from this break, the production system will be moved to a new VOBox. Many jobs are failing on the conditions DB; again it seems to be a DIRAC issue.

Sites / Services round table:

  • FZK (Angela) NTR
  • BNL (Michael) NTR. One comment: in the ATLAS elog there was a mention of a site service not working for some time. This was not the case, so it must be a monitoring problem. The information is communicated via AFS to CERN and there appears to be a hiccup in the information flow. Alessandro: the monitoring is based on the SLS information provided by BNL; in this case the problem was related to restarts of the site services. If the number of restarts is higher than 3-4, the availability is flagged as degraded (see the sketch after this list). Michael: Hiro said that there hadn't been any restart. To be followed up offline.

  • IN2P3 (Fabio) confirms that the scheduled HPSS intervention for next Tuesday is maintained. Would like to know whether any ATLAS reconstruction activity is planned for that morning and, if so, whether ATLAS prefers the jobs to be suspended or left to fail. Fabio posted the following announcement/request after the meeting:
           IN2P3: The 5-hour-long intervention on HPSS scheduled for the morning of June 30th is maintained,
           as announced at this meeting last Tuesday. Data transfers to the site won't be interrupted
           and data will be kept on disk during the intervention. However, tape-only resident files will be
           unavailable for local processing or for transfer to other sites.

           As this may affect some activities, such as the reprocessing of ATLAS cosmic data, it is possible
           for us to keep the reprocessing jobs in the queue while HPSS is unavailable, to prevent the jobs
           from crashing or hanging. We await confirmation from the experiments on any special actions that
           could be taken at the site level to make this interruption as painless as possible.
  • RAL (Gareth) in the middle of the move to the new building; no services running at all. Brian: has been working on storage and has been in contact with some US sites, and would like to know what the best communication channel to OSG is. Michael: what information? Brian: some information about BestMan. Michael: there is a storage forum led from FNAL; Michael will forward the contact to Brian.
  • GRIF (Michel) NTR
  • CERN (Gavin) two hot-fixes applied to CASTORCMS, CASTORLHCB and CASTORCERNT3 this morning. The intervention was transparent. The new version of the CA will be applied after the meeting. Same fixes as applied last time.
  • DB services (Eva): ASGC was split off from the ATLAS Streams replication because they were down and we were running out of space. If it is not back before the weekend we will be out of the Streams recovery window. More information submitted by Jason after the meeting:
           It was later confirmed with the central team that the replication was resumed late in the
           evening (14:10 UTC). Switching the network configuration between the distribution switch
           and the core switch had confused the earlier testing, which failed with TNS adapter errors.
           Thanks to Eva for the prompt reply after resuming the 3D services.
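
The degradation rule mentioned in the BNL item above (site services flagged as degraded once they restart more than 3-4 times) can be illustrated with the small sketch below. This is not the actual ATLAS/SLS monitoring code; the function name, threshold and time window are illustrative assumptions based only on the description given at the meeting.

    # Minimal sketch of the rule described above: flag the site services as
    # "degraded" if more than RESTART_THRESHOLD restarts are observed within a
    # one-hour window. Names, threshold and window are illustrative, not ATLAS code.
    from datetime import datetime, timedelta

    RESTART_THRESHOLD = 3           # "higher than 3-4" in the discussion above
    WINDOW = timedelta(hours=1)

    def is_degraded(restart_times, now):
        """Return True if the number of recent restarts exceeds the threshold."""
        recent = [t for t in restart_times if now - t <= WINDOW]
        return len(recent) > RESTART_THRESHOLD

    if __name__ == "__main__":
        now = datetime(2009, 6, 25, 15, 0)
        restarts = [now - timedelta(minutes=m) for m in (5, 15, 25, 40)]
        print("degraded" if is_degraded(restarts, now) else "ok")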

AOB:

  • Jamie: it's getting close to the deadline for the press-release. Michael: FNAL and BNL are working on this right now so please expect a quote later today.
  • STEP'09 post-mortem workshop agenda has been updated following input from ATLAS. Please register as soon as possible if you intend to come physically! (Currently 70/100 seats taken). EVO booking details are now also on the agenda page. Again for those physically present, a BBQ is being organised from 16:00 UTC on (18:00 CEST).
  • Edoardo: OPN connection to a Tier-1: there was a cut on a transatlantic cable yesterday; the backup route is currently in use. Tomorrow the routes will be changed at CERN. Next Tuesday the CNAF routes to TRIUMF, RAL and PIC will be changed to solve the performance problem seen.

Friday

Attendance: local(Daniele, Jamie, Eva, Ricardo, Gang, Jean-Philippe, Alessandro, Simone, Dirk, Gavin, Nick, Patricia, Robert, Olof (chair));remote(Michael, Michel, Gareth, Alexei).

Experiments round table:

  • ATLAS - (Simone) follow-up from previous meetings: Graeme put a post-mortem of the ATLR incident in the WLCG-ops post-mortem table (https://twiki.cern.ch/twiki/bin/view/Atlas/PandaAtlrJune2009). News for cosmic data taking: the ESD datasets that had been generated without magnetic field have been identified and will be removed; replication has been resumed for all the other datasets. The RAL LFC is back: this means that the replication from CERN to Glasgow can be resumed in the afternoon. ASGC still shows 100% failure getting data and the downtime has been extended; ATLAS would like to know when the site will come out of downtime. For the moment all ATLAS activities for ASGC have been stopped until ASGC has been declared and proven fully functional. The BNL VOBox is showing critical messages: it looks like the site services are restarting by themselves 2-3 times within an hour, which causes the services to be flagged as degraded. However, BNL is transferring files, so the problem doesn't seem to affect that.

  • CMS reports - (Daniele) next Monday all tickets to ASGC will be put on hold, giving them 7 days to clean things up; no new tickets will be opened by shifters. After a follow-up on the Monday of the week after, ticket submission will resume. The PIC ticket for stale transfers is not due to a PIC problem. Tier-2 to Tier-2 transfers: for a French Tier-2 the transfer quality is 0% for transfers to other Tier-2s, while OK for transfers to Tier-1s; Michel will follow up with the Tier-2. Minor thing: a non-urgent request to the CASTOR and tape operations team for post-mortem information.

  • ALICE - (Patricia) ticket 49793 regarding the CREAM CE at CERN is OK now and was closed this morning. Working with Ulrich to use the virtual machine cluster behind the CREAM CE: it works fine now.

  • LHCb reports - (Roberto) MC and physics production restarted this morning after the DIRAC developers fixed some problems. As usual, the merging production will run at the Tier-1s. The DIRAC task queue scheduling is still being investigated. A second activity: tests of the glExec wrapper script on the PPS systems at GridKa and Lancaster. These are still failing because of various configuration problems (wrong mapping, white-listing); work is ongoing with the site admins.

Sites / Services round table:

  • BNL (Michael) NTR
  • Grif (Michel) NTR
  • RAL (Gareth) still in the middle of the move, which is going well so far. Yesterday's network outage took a bit longer than planned, which meant the downtime had to be extended for some services (e.g. the BDII). CASTOR and batch will be down for another week.
  • ASGC (Gang) check of our services: CASTOR, the UI and several other services are online. Recovery of the database service was in time, so we don't need to resynchronize.
  • CERN (Gavin) would like to schedule an SRM upgrade for ALICE and CMS on Tuesday next week; same upgrade as the one already performed for ATLAS and LHCb this week. Simone: people are reporting problems with the lcg-util 1.6 version installed on the CERN UI. One problem is related to Lustre access, which affects several ATLAS Tier-2 sites. A more important problem is that 1.6 fails on copies from CASTOR@CERN to CASTOR@CERN, which has been fixed in 1.7 (see the sketch after this list). Simone would like to know when CERN will deploy 1.7; Gavin will follow up.
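
The CASTOR-to-CASTOR copy mentioned above is the kind of SURL-to-SURL transfer typically done with lcg-cp from lcg-util. The sketch below only illustrates such a call driven from Python; the SURLs, paths and VO are made-up examples, and the exact lcg-cp options may vary between lcg-util releases.

    # Minimal, illustrative sketch of the kind of CASTOR-to-CASTOR copy that is
    # reported to fail with lcg-util 1.6: an lcg-cp call between two SURLs.
    # The SURLs and VO are made up for illustration; exact options may differ
    # between lcg-util releases.
    import subprocess

    SRC = "srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/some/source/file"       # illustrative
    DST = "srm://srm-atlas.cern.ch/castor/cern.ch/grid/atlas/some/destination/file"  # illustrative

    def castor_to_castor_copy(src, dst, vo="atlas"):
        """Run a verbose lcg-cp between two SURLs; raises CalledProcessError on failure."""
        subprocess.run(["lcg-cp", "-v", "--vo", vo, src, dst], check=True)

    if __name__ == "__main__":
        castor_to_castor_copy(SRC, DST)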

AOB:

  • Jamie sent reminder about workshop. Registration closes on Sunday.
  • Olof: the VOs should review the INFN_T1 reference in their site availability dashboards. In some cases it refers to INFN_CNAF, which is a small Tier-3 site. ATLAS has corrected theirs and ALICE looks OK, but CMS and LHCb should check theirs.

-- JamieShiers - 19 Jun 2009
