Week of 091116

WLCG Service Incidents, Interventions and Availability

VO Summaries of Site Availability SIRs & Broadcasts
ALICE ATLAS CMS LHCb WLCG Service Incident Reports Broadcast archive

GGUS section

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

General Information
CERN IT status board M/W PPSCoordinationWorkLog WLCG Baseline Versions Weekly joint operations meeting minutes

Additional Material:

STEP09 ATLAS ATLAS logbook CMS WLCG Blogs


Monday:

Attendance: local(Daniele, Jean-Philippe, Jamie, Patricia, Roberto, Gang, Harry, Dirk, Miguel, David, Julia, Alessandro, Simone, Andrea, MariaD);remote(Kyle(OSG), Michael, Gareth, Angela, Jason, Daniele (CNAF), Fabio, Ron).

Experiments round table:

  • ATLAS (Alessandro)- Quiet weekend. A couple of bursts of errors with CASTOR at CERN, hopefully fixed by the SRM upgrade. The ATLAS stager at CERN was moved to new hardware and the intervention was not transparent. The Dashboard was down as announced, but other monitoring services such as SLS were up, so it was still possible to monitor the service.

  • CMS reports (Daniele)- The follow-up on the problem of low-quality transfers between CERN and FNAL is slower than expected. Disk-to-disk copy at CERN is very slow and needs follow-up. The bad performance of MC production connected to a Condor bug is still being investigated. Shallow retry, supposed to recover some of the failures, actually seems to make things worse; it is not known whether the WMS is responsible for some of the failures. Slow transfers to RAL. Globus error at Legnaro after an intervention on the CE; needs to be investigated with Globus experts. Four Tier2s have problems with the BDII. Miguel commented on the problems at CERN: the first problem is not easy to understand (it may be linked to an SRM upgrade?); the second problem was due to disk-to-disk copies done in parallel and should be OK now.

  • ALICE (Patricia)- Quiet weekend. The SRM upgrade this morning went well. The CREAM CEs are not working at CERN (GGUS 53286). The pending registration of DNs for the MyProxy service has been done. ALICE will publish the list of DNs that need to be authorized; this will help checks by FIO.

  • LHCb reports (Roberto)- Active weekend: MC production, stripping jobs at Tier1s and DST redistribution across Tier1s. No major issue. One GGUS ticket for Lyon and a few issues with shared areas at Tier2s.

Sites / Services round table:

  • Michael: The cause of the SRM exceptions was found by the developers thanks to the analysis done by BNL. The developers know how to fix it, and the fix could be of interest to other sites.
  • Gareth: NTR. Tomorrow outage for FTS and MyProxy (two hours including drain). The maintenance of the UPS has been cancelled but a UPS test will take place on Thursday.
  • Angela: one CE offline due to disk failures, but jobs are still coming in. Alessandro and Roberto will check why this CE is still being targeted.
  • Jason: 40 cartridges with incorrect labels. Those tapes are automatically disabled by CASTOR and are being relabeled. Robot mount errors have been fixed by recalibrating the second frame.
  • Fabio: NTR
  • Daniele (CNAF): NTR
  • Ron: SARA will be in scheduled maintenance on Wednesday for a network upgrade plus a dCache upgrade (to enable tape protection). The intervention will take at most the full day, but most probably only a few hours. The ALICE VO box will be moved to new hardware.
  • David: PIC is only reachable via GEANT because of a fibre cut between Madrid and Barcelona.
  • Julia: migration of the Dashboard DB for CMS tomorrow.
  • MariaD: NTR
  • Kyle (OSG): NTR

AOB:

Tuesday:

Attendance: local(Olof, Miguel, Jean-Philippe, Daniele, MariaD, Roberto, Jamie, Julia, Gang, Alessandro, Harry, Simone, Nicolo, Wei, MariaG);remote(Angela, Gonzalo, Jason, Jeremy, Daniele(CNAF), Michael, Ronald).

Experiments round table:

  • ATLAS (Alessandro)- Two GGUS tickets were incorrectly assigned by the ATLAS shifters: a burst of 1200 errors in a few minutes when transferring to TRIUMF, and one assigned to Boston University because the BU SRM does not publish its information in the BDII, which causes FTS failures (a sketch of how such a gap can be checked follows this list). Michael says that work is in progress and the problem should be fixed in the next few days. ASGC is back in cosmic data distribution and will get 5% of the data; they will get standard ESDs and AODs as well. Tonight there will be a raw data taking test for collision data; the test will send 600 MB/s to all Tier1s. BNL will get all ESDs for cosmic and collision data. Because of the high data rate ATLAS needs more disk servers to be installed at CERN, preferably before the end of this week. Simone will distribute the list of tags by the end of this week.

  • CMS reports (Daniele)- There will be a follow-up meeting just after the WLCG daily meeting to discuss transfer quality between CERN and FNAL. MC production is affected by infrastructure: a Condor problem (losing communication). A patch is available (version 6.8) but needs recompilation on SL3 (???). As some sites are suffering (France, Italy and Estonia for example), CMS has set the retry count to 2. For the Tier1 issues there is no progress and no update from Legnaro. Some Tier2-to-Tier2 transfer problems.

  • ALICE (Patricia)- (by mail) ALICE is continuing the reconstruction production at the T0, using LCG-CE resources only. The CREAM CEs at CERN are down and a ticket was submitted yesterday: 53286. There has been no update on the status of this ticket from the experts. Could ALICE get an update on this issue?

  • LHCb reports (Roberto)- High activity (13000 concurrent jobs) but no major problem. Disk space problem at SARA, but disks have been added (problem solved).
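
The Boston University case above, where an SRM not published in the BDII breaks FTS transfers, can be spotted by querying the information system directly. Below is a minimal sketch (not an official procedure): it shells out to ldapsearch against a site BDII on the standard port 2170 and reports whether any SRM service entry is published. The hostname, base DN and GlueServiceType filter value are assumptions and may need adjusting for a given site.

    #!/usr/bin/env python
    # Sketch: check whether a site BDII publishes any SRM service entry.
    # The BDII hostname below is a placeholder, not the real BU endpoint.
    import subprocess

    SITE_BDII = "ldap://site-bdii.example.edu:2170"   # hypothetical site BDII
    BASE_DN = "o=grid"                                # top of the GLUE 1.x tree

    def srm_endpoints(bdii_url, base_dn):
        """Return the GlueServiceEndpoint values of all published SRM services."""
        cmd = ["ldapsearch", "-x", "-LLL", "-H", bdii_url, "-b", base_dn,
               "(GlueServiceType=*srm*)", "GlueServiceEndpoint"]
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=30).stdout
        return [line.split(":", 1)[1].strip()
                for line in out.splitlines()
                if line.startswith("GlueServiceEndpoint:")]

    if __name__ == "__main__":
        endpoints = srm_endpoints(SITE_BDII, BASE_DN)
        if endpoints:
            print("SRM endpoints published:", endpoints)
        else:
            print("No SRM service published - FTS cannot match this endpoint")

If the second branch is taken for a site that is supposed to run an SRM, the fix is on the information-provider side rather than in FTS itself.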

Sites / Services round table:

  • Angela: NTR
  • Gonzalo: NTR
  • Jason: NTR
  • Jeremy: NTR
  • Daniele(CNAF): NTR
  • Michael: Tomorrow, transparent migration of the Condition DB. It will take a day but the DB will be available in read-only mode during that time; this is OK with ATLAS. Eva will be in contact with Carlos to do the migration. The migration is needed because BNL is running out of space in the DB.
  • Ronald: Scheduled downtime tomorrow for the network and dCache upgrade (going from 1.9.4 to golden release 1.9.5-8).

  • Miguel: problem with a tape robot last night; fixed this morning at 11:00. Data in this robot was inaccessible during that time.
  • Julia: Dashboard migration for CMS is taking longer than expected. Should be completed in the next hour.
  • MariaG: the latest Oracle security patch and the recommended set of patches will be applied this week; we only received the merged patches yesterday. The integration and test DBs are done; the dates for patching the production instances have been agreed with the experiments (see the Service Status Board).

AOB:

Wednesday

Attendance: local(Daniele, Andrea, Jean-Philippe, Eva, Patricia, Roberto, Giuseppe, Miguel, Gang, Olof, Jan, Alessandro, Jamie, MariaD, Harry, Julia, Antonio);remote(Michael, Angela, Gonzalo, Tiju, Jeremy, Daniele, Ron, Jason, Fabio).

Experiments round table:

  • ATLAS (Alessandro)- SARA is down as announced, so no DDM activity there. Problem with the Dashboard configuration file after the migration (fixed). SRM instability at CERN last evening (a GGUS ticket was submitted, although no new ticket was actually needed as the problem is similar to the previous one still being worked on). DB-level contention is being investigated as a possible cause. It seems to be a CASTOR stager issue (messages lost between the stager and the SRM) but a workaround can be put in the SRM. A new SRM release will be available today. Kors will decide if ATLAS wants this new version installed as soon as possible, as the load will increase and the problem seems to be load related. The current rate of failures seems to be 5-10 per hour.

  • CMS reports (Daniele)- About the bad performance of the transfers between CERN and FNAL: a meeting took place yesterday and another one will take place tomorrow. The SRM patch should go in today. Progress on the MC issue: a patch from Condor is being tested. There was a Dashboard issue: after the DB migration the Oracle execution plan was not correct (DB experts are working on it). Slow progress on the issues at Tier1s. 49 files have been waiting for migration to tape at CNAF for a few days (being worked on). The CE failure at BARI has been fixed. Too many WMS instances launched at IPHC: problem reported to the developers. ATLAS and CMS will get the same CASTOR SRM version.

  • ALICE (Patricia)- ALICE is waiting for the upgrade of the VO boxes at SARA. Non-working CREAM CEs at CERN (ticket 53286): ce201 is back in production, but not yet the others. The corresponding bug (48144) is fixed in CREAM 1.5. Sites that have migrated their VOBOXes to gLite 3.2 see problems with publication. A fix has been tested at Bari and will be propagated to the other sites.

  • LHCb reports (Roberto)- Big Monte Carlo production. No major problems. Redistribution to Tier1s (100 MB/s). dCache authorization problems (user space token) at SARA. The VDT release installed at CERN on the worker nodes has a bug: the file size reported is wrong and could lead to incorrect entries in the LFC. The Status Board is now linked to the LHCb wiki page, so that the activity at the different sites can be seen.

Sites / Services round table:

  • Michael: migration of the Condition DB will probably be done tomorrow but needs more preparation, so replication has not been stopped.
  • Angela: NTR
  • Gonzalo: scheduled intervention on the network last night (01:00-02:00) to try to solve the external connectivity problem.
  • Tiju: NTR
  • Jeremy: NTR
  • Daniele: NTR
  • Ron: today SARA is in maintenance to upgrade the network between compute and storage nodes to 40 Gb/s. dCache has been upgraded to 1.9.5-8 and is up and running. The VO boxes will be ready in about one hour.
  • Jason: 3000 files migrated. Still looking at wrong labels on tapes. Will add more LTO3s.
  • Fabio: NTR

  • Ricardo: scheduled intervention on CEs at CERN
  • Miguel: a few more name server, Cupv and Vmgr machines will be added to the DNS load-balanced CASTOR servers (transparent). 22 disk servers have been added for ATLAS.
  • Eva: one node for Atlas crashed with high load. Filesystem was corrupted after reboot. A new node has been added. Oracle security patches are being applied in rolling mode.
  • Antonio: we do not use the PPS anymore, but we have a staged rollout of the middleware. The rollout last week was painful and we had to roll back at all sites (problem with LB). Another staged rollout will take place this week (Torque, CREAM). BLAH vulnerability fix (CERN could wait for this fix). The ICE fix for the SQL syntax error could be put in the same bundle.

Release report: deployment status wiki page

AOB:

Thursday

Attendance: local(Daniele, Jean-Philippe, Jamie, Nick, Eva, Jacek, Roberto, Nicolo, Andrew, Simone, Jan, Miguel, Gang, Edoardo, MariaD);remote(Jeremy, Angela, Jason, Gareth, Ronald, Fabio, Alexei).

Experiments round table:

  • ATLAS (Simone)- Possible problem with FTS at CERN: some queries fail with "permission denied" while the same query by the same user just before or just after succeeds, and the transfers themselves succeed. This should be checked; a team ticket has been sent to CERN. Is the problem related to FTS 2.2? In that case ATLAS would be happy to go back to FTS 2.1. Daniele: PhEDEx should be resilient against this kind of failure. The SRM upgrade at CERN took place and SARA is back in production for ATLAS.

  • CMS reports (Daniele)- Dashboard problem: during the migration the DB schema was changed, which marked the statistics as stale, and the load was too high to collect new statistics. A manually generated execution plan was forced; this decreased the load so that new statistics could be collected to automatically generate a new execution plan (a generic sketch of refreshing optimizer statistics follows this list). The situation was back to normal around 21:00. The SRM patch at CERN was applied; the result will be discussed in the conference call just after the WLCG daily meeting. The Condor patch is being tested in production: very few failures are seen. CNAF has closed 2 tickets: transfers to FNAL and tape migration. PhEDEx will be modified to be more resilient in case the latter problem reoccurs. The problem at Legnaro was solved by doing a full reinstallation of the CE.

  • ALICE -

  • LHCb reports (Roberto)- Simulation: all the needed events have been produced. The rest of the jobs are being killed, which explains the 'red' color on the dashboard. The DC06 redistribution is going smoothly: it is finished for some sites and will be completed for the other sites in the next few hours. Shortage of disk space at SARA, but the site is actively working to add new disk resources. Reconstruction is being done at CERN to produce DSTs for LHCb users.
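
The Dashboard incident above is a textbook case of stale optimizer statistics after a schema change: until fresh statistics exist, Oracle keeps choosing a poor execution plan. Below is a minimal, generic sketch of regathering statistics for one table with cx_Oracle; the connection details, schema and table names are purely illustrative and this is not the Dashboard team's actual procedure.

    # Sketch: regather optimizer statistics after a schema migration so Oracle
    # can build a fresh execution plan. All names and credentials are placeholders.
    import cx_Oracle  # assumes the cx_Oracle client library is available

    def refresh_table_stats(dsn, user, password, owner, table):
        """Call DBMS_STATS.GATHER_TABLE_STATS for one table (indexes included)."""
        conn = cx_Oracle.connect(user, password, dsn)
        try:
            cur = conn.cursor()
            cur.callproc("DBMS_STATS.GATHER_TABLE_STATS",
                         keywordParameters={"ownname": owner,
                                            "tabname": table,
                                            "cascade": True})
        finally:
            conn.close()

    # Hypothetical usage:
    # refresh_table_stats("db.example.cern.ch/dashb", "dashb_admin", "***",
    #                     "DASHBOARD", "JOB_STATUS")

Once the statistics are current, the optimizer can regenerate its execution plans automatically, which matches the recovery described above.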

Sites / Services round table:

  • Jeremy: NTR
  • Angela: NTR
  • Jason: NTR
  • Gareth: RAL was marked at risk this morning because of the foreseen test on UPS. This did not take place and will be rescheduled. Some SAM failures. Nothing specific. Load issue? As SAM uses lcg_util, the problem has nothing to do with the FTS failures reported above. Link to FTM at CERN is not working. FTS support informed.
  • Ronald: new compute nodes have been installed at Nikhef, almost a factor of 3 increase in capacity. Now working on installation and configuration of the new storage hardware. The Nikhef network infrastructure between compute and storage was upgraded from 20 to 160 Gb/s. ATLAS HammerCloud tests are OK at SARA but there are some problems at Nikhef; they are currently under investigation.
  • Fabio: NTR

  • Eva: Oracle security patches on production instances: being done for Atlas now; will be done for CMS next week. Streams has been prepared for Condition database migration at BNL.
  • Jan: SRM upgraded for CMS and ATLAS at CERN.
  • Miguel: the load balanced service nodes for CASTOR at CERN have been added. The CE storage backend upgrade was not done. The next set of Linux upgrades is being prepared and will be deployed on 11th January.
AOB:

Friday

Attendance: local(Jamie, Olof, Jean-Philippe, Eva, Gavin, Julia, Roberto, Simone, Wei, Alessandro, Gang, Miguel);remote(Gonzalo, Michael, Angela, Jeremy, Daniele/CMS, Onno, Gareth, Daniele/CNAF, Brian, Jason).

Experiments round table:

  • ATLAS (Simone)- The FTS problem reported yesterday has been investigated: it is a bug in FTS, as an "Access denied" failure is reported instead of a DB access failure (there were rolling interventions on the DBs at CERN and IN2P3). Problems contacting the ATLAS central catalogue (the number of connections was exhausted due to the reduced number of DB servers); to solve the problem, the maximum number of connections to the DB has been increased from 400 to 600, and DDM will be modified to introduce more caching and better retries (a generic sketch of this caching-and-retry pattern follows this list). SRM problem at SARA. A team ticket was sent for a problem at NDGF; with the LHC startup these will now be alarm tickets. Today's accelerator plan: 5 splash events around 17:00. AODs and ESDs will be sent everywhere. All raw data will be sent to disk at 3 Tier1s: BNL, Lyon and SARA; the data will then be copied to tape. AODs and DESDs will be replicated to Tier2s. Hopefully the splash events will be in different files (it is better to have a few hot files rather than one super-hot file).

  • CMS reports (Daniele)- Preparing for the splash events. Not really working on Tier2 problems.

  • ALICE -

  • LHCb reports (Roberto)- A few reconstruction jobs at CERN with raw data. A few thousand user jobs at Tier1s. No problems except the disk space problem at SARA. 6 files lost at CNAF (the catalogues need to be updated). SQLite problems at RAL.
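
The catalogue overload mentioned in the ATLAS report above is being addressed by adding more caching and better retries on the client side. The sketch below shows that general pattern in a generic form; the function names, cache size and timings are illustrative and this is not the actual DDM code.

    # Generic sketch of client-side caching plus retry-with-backoff, used to
    # reduce pressure on a central catalogue. Names and timings are illustrative.
    import functools
    import time

    def retry(attempts=3, delay=2.0, backoff=2.0):
        """Retry a callable on exception, multiplying the wait between attempts."""
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                wait = delay
                for attempt in range(1, attempts + 1):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        if attempt == attempts:
                            raise
                        time.sleep(wait)
                        wait *= backoff
            return wrapper
        return decorator

    def query_central_catalogue(dataset_name):
        """Stand-in for the real catalogue query that opens a DB connection."""
        return ["srm://example-se.cern.ch/atlas/%s" % dataset_name]  # dummy reply

    @functools.lru_cache(maxsize=10000)   # cache successful lookups per dataset
    @retry(attempts=3, delay=2.0)
    def list_replicas(dataset_name):
        return query_central_catalogue(dataset_name)

    print(list_replicas("some.example.dataset.RAW"))  # repeated calls hit the cache

Caching only successful lookups (exceptions propagate and are not cached) keeps repeated queries for the same dataset off the central DB, while the backoff spreads the remaining retries in time.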

Sites / Services round table:

  • Gonzalo: NTR
  • Michael: power problem on one of the phases: 15% of the compute nodes are not available, the rest are OK; they should be up within the next hour. The migration of the Condition DB did not succeed: missing data. Replication to the old DB has been restarted; there was no service interruption. The migration will be attempted again next week.
  • Angela: NTR
  • Jeremy: NTR
  • Daniele: Condor patch tested with CMS: very good results. Will be able to close ticket next week.
  • Onno: at Nikhef, new WNs have problems accessing files in the SE: this may be due to the network upgrade. SRM problem at SARA: the system time is wrong. It could be an NTP problem related to changes in iptables (a clock-check sketch follows at the end of this round table).
  • Gareth: short "at risk" for CASTOR Atlas today. LSF partition filling up.
  • Brian: NTR
  • Jason: NTR

  • Eva: DB monitoring not working at CERN (problem of storage for monitoring data). Should be up before the weekend.
  • Julia: still performance problems with CMS Dashboard. Needs to be investigated.
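
The SARA time problem above (possibly NTP traffic blocked after the iptables change) can be confirmed by comparing the local clock against an NTP server directly. Below is a minimal sketch using a raw SNTP query; the server name is a placeholder and the packet layout follows RFC 4330.

    # Sketch: compare the local clock against an NTP server with a raw SNTP query.
    # If UDP port 123 is blocked (e.g. by iptables), the query simply times out.
    import socket
    import struct
    import time

    NTP_SERVER = "pool.ntp.org"       # placeholder; a site would use its own server
    NTP_EPOCH_OFFSET = 2208988800     # seconds between 1900-01-01 and 1970-01-01

    def ntp_offset(server=NTP_SERVER, timeout=5.0):
        """Return (server time - local time) in seconds, or raise on timeout."""
        request = b"\x1b" + 47 * b"\x00"          # SNTP v3 client request packet
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        try:
            sock.sendto(request, (server, 123))
            data, _ = sock.recvfrom(512)
        finally:
            sock.close()
        transmit = struct.unpack("!I", data[40:44])[0] - NTP_EPOCH_OFFSET
        return transmit - time.time()

    if __name__ == "__main__":
        try:
            print("clock offset: %.1f s" % ntp_offset())
        except socket.timeout:
            print("no NTP reply - check that UDP port 123 is allowed by iptables")

A large offset points at the clock itself; a timeout points at the firewall rules.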

AOB:

-- JamieShiers - 16-Nov-2009
