Week of 110411

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local (AndreaV, Graeme, Mattia, Ewan, Marcin, Jan, Maarten, Alessandro, Nicolo, Gavin, Stephan); remote (Michael, Dimitri, Gonzalo, Ulf, Jon, Ronald, Jhen-Wei, Rolf, Rob, Tiju, Claudia; Joel).

Experiments round table:

  • ATLAS reports -
    • ATLAS general
      • Calibration till 14:00. If stable beams are declared by the LHC, ATLAS will enter a combined run.
    • Central Ops
      • CentralCatalog glitch on voatlas72. This time the error was different from the ones seen last week.
    • Site issues
      • SARA deletion not working: this is now very urgent. GGUS:69544. (SARA now reports physical deletion underway - 2.9 million files, 256 TB.) [Ronald: deletion is in progress but will take time.]
      • Transfer failures from FZK-LCG2 to PIC, GGUS:69560. Not clear if it is a source or a destination problem. PIC said the files are online and accessible. Could FZK check? [Gonzalo: transfer failures are due to a problem on the primary OPN link; everything seemed rerouted except for PIC-FZK, the network team is having a look at this.]
      • FZK-LCG2 MCTAPE bringonline "failures" GGUS:69541.
      • IN2P3-CC MCTAPE bringonline "failures" GGUS:69552. The site answered that prestaging is working fine there, so the problem may simply be a high volume of data needing to be recalled.
      • IN2P3-CC ATLAS sw installation 16.6.3.5 problem on Friday. Installation was successful on Saturday. Propagation of the release tag to Panda seems to have failed, which is under investigation. The release has been tagged manually by production system experts.
      • CERN-PROD: please grant special stager permissions for ddmusr03 on voatlas161 (atladcops, ex atlddm16) GGUS:69575. [Ewan: will reroute this ticket to the CASTOR team, it was incorrectly assigned to us by SNOW. Maarten: when submitting a new SNOW ticket, we should make sure that there are enough hints to ensure correct assignment (e.g. was CASTOR mentioned here?).]
      • [Graeme: NDGF is blacklisted because of problems with tape libraries; this had been announced but was not properly recorded as a downtime in the database.]

  • CMS reports -
    • LHC / CMS detector
      • Possibility of 30 mins of STABLE BEAMS today / magnet off until end of scrubbing run
    • CERN / central services
      • Patch deployed to PhEDEx Datasvc for issue with T0 subscriptions introduced last week
    • Tier-0 / CAF
      • Tier-0 PromptRecoInjector component crashing with "ORA-00600: internal error code, arguments: [kcblasm_1], [103], [], [], [], [], [], []": due to a known bug in Oracle 10.2.0.5; currently running with workarounds, pending a decision on patch deployment. [Andrea: is this a CMS-specific problem? Marcin: this seems to be application-specific; a workaround was suggested, to be implemented in logon triggers to change the session properties (a sketch of such a trigger follows this report). Nicolo: actually this affected two different CMS applications, so it may not be a CMS-specific problem.]
      • Fri 8th: SAM/JobRobot test failures with "Maradona" errors - caused by misconfiguration on pre-prod WNs, fixed GGUS:69531
    • Tier-1
      • MC production in progress, many sites available for WMAgent testing
      • FNAL: 3 custodial files from AOD dataset missing: one file recovered from tape, two files invalidated, ticket closed - SAV:120192
      • Fri 8th: IN2P3: CE SAM tests failing at T1 and T2 - electrical problem, fixed - GGUS:69539
      • Sat 9th: RAL: CE SAM test failures: loss of network for squid servers, fixed - GGUS:69549
      • Sat 9th: SRM_COPY errors on T0-->FNAL link: due to overload on FNAL SRM, should be fixed now, monitoring - SAV:120298 [Jon: the root problem is that CMS started using a new CMSSW built with a new compiler, which pulled in a different version of the dCache library that did not contain the necessary optimizations; this was fixed by making sure that the correct dCache library is used.]
      • Sun 10th: "Invalid path" errors in imports to CNAF: caused by wrong subscription, will be removed - SAV:120306
      • Sun 10th: "Permission denied" errors in T2-->FNAL imports: caused by approval of subscription before site had the chance to create custodial tape families - SAV:120316
    • Tier-2
      • T2_IN_TIFR - SAM CE failures after end of downtime - SAV:120015 and SAV:120310
      • T2_UK_SGrid_RALPP draining queues for scheduled downtime
      • MC production and analysis in progress
    • Notes
      • CRC from Tuesday 12th: Peter Kreuzer
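
    A minimal sketch of the kind of logon-trigger workaround mentioned above, not the actual fix applied by the DBAs: the account name CMS_T0 and the session parameter below are placeholders for illustration only.

      -- Hedged sketch: account name and session setting are placeholders.
      -- A logon trigger lets the DBAs change session properties for one
      -- application without touching the application code itself.
      CREATE OR REPLACE TRIGGER cms_t0_logon_workaround
        AFTER LOGON ON DATABASE
      BEGIN
        IF SYS_CONTEXT('USERENV', 'SESSION_USER') = 'CMS_T0' THEN
          -- Illustrative only: e.g. fall back to the previous optimizer version
          -- until the Oracle patch is deployed.
          EXECUTE IMMEDIATE
            'ALTER SESSION SET optimizer_features_enable = ''10.2.0.4''';
        END IF;
      END;
      /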

  • ALICE reports -
    • T0 site
      • Only a few production jobs were running at CERN during the weekend, due to a human error
    • T1 sites
      • CNAF: no NAGIOS tests have been running for a week, but the dashboard only started showing "no tests running" two days ago. Until then, the site availability was green. Under investigation. [Maarten: this seems to be due to the renaming of the Italian ROC to the Italian NGI.]
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Reconstruction and MC simulation
      • Validation of the stripping.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 3
      • T2: 0
    • Issues at the sites and services
      • T0
        • Request to have more tape drives for LHCb
      • T1
        • PIC: Problem with FTS transfers to PIC-ARCHIVE [Gonzalo: is there a GGUS ticket? Joel: did not open a ticket yet because it is not yet clear where the problem is; having a look with Ricardo who is at CERN these days.]
        • NIKHEF: Outage [Ronald: is this for SARA? Joel: yes, this is for SARA actually.]
        • RAL: small network problem during the weekend which impacted the SRM for LHCb

Sites / Services round table:

  • BNL: ntr
  • KIT: ntr
  • PIC: incident this weekend with 5 hanging ATLAS disk servers, now recovered; under investigation, as this is similar to what happened one month ago.
  • NDGF: forgot to mark SRM as at risk, sorry for that; a downtime will still be ongoing for 9 more days, and this has been correctly entered in the database.
  • FNAL: nta
  • NLT1: nta
  • ASGC: ntr
  • IN2P3: ntr
  • OSG: problem with availability records, they are being recomputed
  • RAL:
    • problem with one network switch on Sat, caused downtime or degradation for LHCb and CMS
    • outage for LHCb tomorrow
    • cvmfs is now in production for LHCb
  • CNAF: ntr

  • Database services: two interventions for ATLAS conditions replication this week: tomorrow, discontinue replication to SARA, NDGF and CNAF as requested by ATLAS; on Thursday, add a new schema (geometry database) to the replication for the remaining sites.

AOB: ntr

Tuesday:

Attendance: local (Alessandro, Andrea V, Ewan, Graeme, Jan, Luca, Maarten, Mattia, Pedro, Peter); remote (Claudia, Gonzalo, Jan, Jhen-Wei, Joel, Kyle, Michael, Rolf, Ronald, Tiju, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • Central Ops
      • One CentralCatalog reader (voatlas05) was reported unavailable in SLS last night. Similar issue this morning with voatlas03. Experts are continuing their investigations.
        • DBAs report the reboot of a node in ADCR at 11:21, which is correlated with CC Oracle errors.
          • This may be a different issue from the SLS-reported problems.
      • Important T1 releases being tagged manually (e.g., RAL this morning). Experts are working on fixing the BDII->Panda loader.
      • Replication of muon reprocessed data underway.
    • SARA space recovery, GGUS:69544. Site reports faster version of clean-up scripts now running. We are monitoring the situation.
    • NDGF-T1. Unavailability of data because of the Danish tape system downtime is causing data access problems for ATLAS. The 9-day outage far exceeds the T1 MoU limits.
    • CERN:
      • Alarm ticket, GGUS:69626. T0 files unavailable. CASTOR Ops reported a server intervention, should now be fixed. (Verified.)
        • Jan: small-scale rolling interventions on disk servers are not announced in the GOCDB; it is OK for experiments to open alarm tickets when they see issues corresponding to such interventions; at the moment the amount of such tickets has been manageable
      • Alarm ticket, GGUS:69631. Users are experiencing problems with the atlasscratchdisk serviceclass. CASTOR Ops reported a server intervention, should now be fixed. (Verified.)
        • Graeme: for the scratch disk problems team tickets could have been opened instead
      • Scratchdisk files unavailable, GGUS:69593. Seems to be fixed in parallel with GGUS:69631...
    • ASGC: Possible corruption of heavy ion DESDs. Holding up finish of heavy ion reprocessing campaign. Site is investigating.

  • CMS reports -
    • LHC / CMS detector
      • Scrubbing beams + possibility of 20 mins of STABLE BEAMS (12 Bunches) during the night from Tuesday to Wednesday. CMS Magnet off until end of scrubbing.
    • CERN / central services
      • Patch deployment on the CMS Offline DB planned for early afternoon today (exact time to be specified), to address the CMS Tier-0 issue (ORA-00600) reported on Monday.
        • CMS Operations are getting prepared for a short downtime (PhEDEx and other components)
    • Tier-0 / CAF
    • Tier-1
      • MC production in progress
      • WMAgent testing on-going at various sites
    • Tier-2
      • MC production and analysis in progress
      • Scheduled downtimes today at MIT and RRC_KI
      • Unscheduled downtime at GRIF_LLR
    • Other
      • CMS submitted to CERN/IT a list of 55 POWER USERS who need full access to the Service NOW (SNOW) Tool, see RQF0008055

  • ALICE reports -
    • T0 site
      • NTR
    • T1 sites
      • CNAF: NAGIOS tests running again since yesterday afternoon after configuration was fixed (ROC_Italy --> NGI_IT)
        • Mattia: we will look into the dashboard behavior w.r.t. absence of test results, as the problem could have been reported earlier
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Experiment activities:
      • Validation of the stripping. Full reprocessing will be launched today
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 3
      • T2: 0
    • Issues at the sites and services
      • T0
        • ntr
      • T1
        • PIC: Implementation of a new space token today and tomorrow.
        • RAL: SRM CASTOR intervention. CVMFS is in production on all WNs.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT - ntr
  • NDGF
    • CEs will not be switched off because of the glibc bug announced today
    • SRM downtime April 19
    • Q for ATLAS: shouldn't the unavailable ATLAS data have copies at other sites?
      • Graeme: intermediate MC data are not copied to other sites
  • NLT1 - ntr
  • OSG
    • the corrected RSV records have been received by the SAM team
  • PIC - ntr
  • RAL
    • LHCb SRM upgrade went OK
    • CMS SRM upgrade tomorrow 11-13h local time

  • CASTOR - nta
  • dashboards - nta
  • databases
    • patch for CMS: see CMS report
    • problem with ATLAS ADCR node #1 affecting DQ2 was due to a vendor error during another intervention; the cluster remained available for application failover and the affected node was back within 2 minutes
  • grid services - ntr
  • GT group - ntr

AOB:

Wednesday

Attendance: local (AndreaV, Steve, Maarten, Graeme, Mattia, Edoardo, Peter, Ewan, Jan, Luca, Alessandro, MariaD); remote (Claudia, Tiju, Jon, Felix, Rolf, Dimitri, Jeremy, Rob, Ulf; Joel).

Experiments round table:

  • ATLAS reports -
    • Central Ops
      • BDII->Panda release tag problem has been fixed.
      • We believe that Central Catalog load imbalance is being caused by DNS sending all clients to a single instance - experts continue to investigate.
    • SARA space recovery, GGUS:69544. Clean up has completed. SARA in touch with dCache developers for a better fix.
    • ASGC corrupted files - investigations continue. Many datasets have now been verified as good, now concentrating on the pre-merged NTUP datasets. Hold up of HI reprocessing is now very serious.
    • FZK Oracle intervention - DE cloud was set offline last night. Savannah:120365.
    • CERN, one corrupted file reported by castor operations. This was obsolete data and has been deleted. [Jan: this issue is a leftover from the December 18 power cut, we are following up to check if there are any other leftovers.]

  • CMS reports -
    • LHC / CMS detector
      • LHC ramp tests + scrubbing beams. CMS Magnet off until end of scrubbing.
    • CERN / central services
      • CASTORCMS/DEFAULT pool degraded several times on Apr 12 (afternoon/evening) due to heavy user activity; not too dramatic, and the system recovered by itself, see GGUS:69649
        • Note : impatiently waiting for EOS !
      • the "lense" cannot be set in the CMS Site Status Board (http://dashb-ssb.cern.ch/dashboard/request.py/siteviewhome)
        • This feature is very useful to notify Computing Shifters on Duty and point them to already known issues/tickets (it was working with the previous version of the SSB)
        • Savannah:120367 opened to the Dashboard team
      • the NEW Service Maps portal (http://cms-critical-services.cern.ch/) was unreachable on April 12
        • Savannah:120358 opened to the Dashboard team
        • Problem has been solved, was due to a server crash.
    • Tier-0 / CAF
      • Heavy HI user activity on the cmscaf1nd queue (2300 pending jobs): the CAF Physics Group leader warned the user that his workflow was not appropriate
    • Tier-1
      • MC production in progress
      • WMAgent testing on-going at various sites
      • scheduled downtimes at RAL (10:00 - 12:00 UTC) and KIT (FTS) (11:00 - 15:00 UTC)
    • Tier-2
      • MC production and analysis in progress
      • Scheduled downtimes today at MIT and RRC_KI
      • Unscheduled downtime at UERJ (00:00 - 21:00 UTC)
    • Other
      • CMS expecting the list of 55 POWER USERS who need full access to the Service NOW (SNOW) Tool to be enabled, see RQF0008055

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Several operations

  • LHCb reports -
    • Experiment activities:
      • Validation of the stripping. Stripping problem with memory usage.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • Any news about the increase of the number of tape drives for LHCb? [Jan: yes, we should give you more tapes; will follow up offline, please open a ticket about this.]
      • T1
        • SARA: the big amount of 'dark data' has been eliminated by Ron Trompert cleaning the dCache instance at SARA. (The discrepancy in SARA-DST was 60 TB, and in SARA-USER about 35 TB.)

Sites / Services round table:

  • CNAF: several interventions going on as planned (LSF, GPFS, kernel), all should be transparent to the users
  • RAL: intervention for CMS yesterday was completed; one more intervention tomorrow for ATLAS and one on Friday for ALICE
  • FNAL: ntr
  • ASGC: ntr
  • IN2P3: new batch system is in preproduction (for local users only, for the moment); the old system will continue in parallel for Grid users
  • KIT: ntr
  • GridPP: ntr
  • OSG:
    • problems with gstat site, related to the ALICE nodes at Berkeley, causing problems in the reports
    • found another bug in the RSV infrastructure at T2 sites; false records were inserted into the monitoring; will recompute and resend corrected records
  • NDGF: ntr

  • Database services: will add geometry to ATLAS replication tomorrow
  • Network services: ntr
  • Storage services: nta
  • Grid services: ntr
  • Dashboard services: ntr
  • GGUS services: ntr

AOB: none

Thursday

Attendance: local (Alessandro, David, Graeme, Jan, Maarten, Maria D, Peter, Steve); remote (Claudia, Elizabeth, Foued, Gareth, Gonzalo, Joel, Jon, Michael, Rolf, Wei Jen).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • Data taken yesterday during stable beam period, data11_7TeV project
      • Looking forward to more stable beams today
    • Databases
      • Geometry database (ATLASDD) successfully added to streams replication.
    • CERN/T0
      • Some more corrupted files reported by CASTOR ops yesterday. Cleaned up.
    • T1s
      • KIT came out of downtime yesterday. Slight overrun on the FTS services, but online by 18:00 CEST. Cloud operations back to normal.
      • RAL SRM downtime, halt of activities is being managed by site.
        • RAL reported early end of downtime and resumption of ATLAS activities. Thanks!
      • ASGC still checking for corrupted outputs from HI reprocessing, BUG:80241.
        • Graeme: the HI physicists do not want to give up on this fraction of the data; to be decided where and when a reprocessing could be done

  • CMS reports -
    • LHC / CMS detector
      • Real collisions expected this afternoon/evening. CMS ramped up magnet to 2 Tesla, later will ramp to 3.5 Tesla.
    • CERN / central services
      • nothing to report
    • Tier-0 / CAF
      • Ready for collisions running this afternoon
    • Tier-1
      • MC production in progress
      • WMAgent testing on-going at various sites
      • getting ready (tape families) to receive data from upcoming collision running
    • Tier-2
      • MC production and analysis in progress
      • Unscheduled downtime at UERJ (00:00 - 21:00 UTC)
    • Other
      • While the CMS Site Readiness monitoring correctly takes CREAM CEs into account, the CMS SSB (Dashboard) is not reporting the CE status correctly, as it considers only the LCG-CE status. This may trigger alarms from the CMS Computing shifters to sites.
      • Work is in progress between CMS and the Dashboard team to fix this monitoring issue.
      • Today there is a meeting of the CMS Offline and Computing Monitoring Task Force, where all aspects of monitoring are being reviewed, in particular those depending on central CERN/IT services.

  • ALICE reports -
    • T0 site
      • The firewall saw a high load due to data replication between CERN and a very efficient T2; the security team are looking into a strategy for dealing better with such use cases.
    • T1 sites
      • Nothing to report
    • T2 sites
      • Several operations

  • LHCb reports -
    • Experiment activities:
      • Validation of the stripping. Some code has been removed and the stripping is running.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • modification of space token at CERN (GGUS:69709)
      • T1
        • PIC: transfer problem identified and solved. Only one stream was defined between PIC and CNAF; it has been increased to 10 and transfers are happily running.
          • Maarten: this issue should also affect ATLAS and CMS, but the exact number of streams matters a lot less when the channel is full with concurrent transfers
          • Joel: the issue was observed in functional tests, i.e. with 1 test file being transferred at a time

Sites / Services round table:

  • ASGC - ?
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - ntr
  • KIT
    • FTS/LFC intervention went OK
    • cream-3-fzk.gridka.de back in production
  • NDGF
    • The door of srm.ndgf.org was killed by too many connections yesterday after the meeting, but a restart fixed it.
    • A CASTOR file bug was found and RAL seems to have fixed it: GGUS:69706
  • OSG
    • WLCG service interactions (in particular RSV <--> SAM/Nagios) being discussed with SAM team leader David Collados
  • PIC - ntr
  • RAL
    • ATLAS SRM upgrade went OK
    • ALICE SRM upgrade tomorrow
    • NDGF issue was due to a corrupted file

  • CASTOR - nta
  • dashboards - nta
  • GGUS/SNOW - ntr
  • grid services - ntr

AOB:

Friday

Attendance: local (AndreaV, Graeme, Peter, Maarten, Ewan, David, Jan, Alessandro, Maria, Jamie); remote (Michael, Xavier, Jon, Gonzalo, Alexander, Jeremy, Andreas, Jhen-Wei, Rolf, Rob, Claudia, Gareth; Joel).

Experiments round table:

  • ATLAS reports -
    • ATLAS
      • Stable beams from 0200 CEST last night.
      • More stable fills expected today.
    • Central Services
      • Low activity alarm on DDM central catalogs observed again last night.
        • Now definitively correlated with DNS load balancing issues.
        • Corrective actions to be discussed on Monday at the operations meeting. [Maarten: why do you not make it round-robin? Graeme: to be discussed with the DNS experts on Monday.] (A sketch of a round-robin setup follows this report.)
    • T1s
      • NDGF-T1 partial tape system downtime continues.
      • ASGC data corruption understood and ESD check complete. BUG:80241. Some tasks will be rerun at ASGC.
      • RAL SRM down from 01:00 UTC. Alarm ticket sent this morning, GGUS:69726.
        • Site went into unscheduled downtime to investigate.
        • Site now 'at risk', but needs to keep a very careful eye on load and prioritise T0 export of custodial data.
    • gLite Upgrade Issue for DPM sites, BUG:80061
      • Please see the release notes for gLite WN 3.2.10: http://glite.cern.ch/R3.2/sl5_x86_64/glite-WN/3.2.10-0/
        • The workaround for 32-bit DPM libraries is required for ATLAS DPM sites. [Maarten: a new WN patch is being prepared, hopefully this should reduce the number of required workarounds.]
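
    For reference, a minimal sketch of what a round-robin alias for the DDM central catalogue readers could look like, assuming a BIND-style zone file; the alias name, TTL and addresses below are illustrative only, and any real change would be agreed with the CERN DNS experts on Monday.

      ; Hedged sketch: names and addresses are placeholders, not the real CERN setup.
      ; With several A records published under one alias, the name server can rotate
      ; the order of the records between replies (classic round-robin), spreading
      ; catalogue clients over all readers instead of sending them all to one instance.
      atlas-ddm-cat   300  IN  A  188.184.0.11   ; e.g. voatlas03
      atlas-ddm-cat   300  IN  A  188.184.0.12   ; e.g. voatlas05
      atlas-ddm-cat   300  IN  A  188.184.0.13   ; e.g. voatlas72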

  • CMS reports -
    • LHC / CMS detector
      • stable beams, gradually increasing intensity + number of bunches
    • CERN / central services
      • The large majority of CREAM CEs had not been tested by SAM since yesterday. This was because the last SAM jobs submitted yesterday were stuck in the "Submitted" status (according to the gLite WMS). The stale jobs have been deleted, so the situation should be back to normal within the hour.
    • Tier-0 / CAF
      • Tier-0 busy with data repacking on Thursday (peak 1.8k jobs).
      • Small issue with an online/offline SW incompatibility regarding a particular HLT stream, causing Express processing crashes in the Tier-0; currently under investigation.
      • [Maarten: 15 minutes ago Stephen Gowdy reported that a CRAB server ran out of pool accounts (all 999 were used). Maria: please remind Stephen that a ticket should always be opened in these cases. Peter: will follow up and submit a ticket. Done after the meeting: GGUS:69739.]
    • Tier-1
      • CCIN2P3: due to a misconfiguration of an LCG-CE at CC-IN2P3, several pilot jobs ended up taking all ten slots reserved for the lcgadmin role, thus causing SAM jobs to abort due to proxy expiration. The configuration has been fixed (see GGUS:69723).
      • MC production in progress
      • WMAgent testing on-going at various sites
    • Tier-2
      • T2_ES_IFCA: the gatekeeper at one CE died during the night and was fixed early in the morning
      • T2_PL_WARSAW: CE in error for 24h, no response from the local admin, hence the original Savannah ticket was bridged to GGUS:69735
      • T2_EE_Estonia: failing CREAM CE due to misconfiguration. Now fixed (see Savannah:120431). However, the site has now gone into unscheduled downtime to destage its LCG-CE.
        • Note that CMS still has the general issue that CREAM CE SAM tests are not reporting properly to the SSB, hence triggering wrong alarms, see Savannah:113192
      • MC production and analysis in progress
    • Other
      • CMS Offline and Computing Monitoring Task Force meeting yesterday was productive and helped identify potential areas of progress, in particular in the Computing Shift monitoring and potential automation/alarming features to be exploited in SLS and the SSB.

  • ALICE reports -
    • T0 site
      • Misconfiguration of a Lemon metric regarding xrootd triggered an alarm to the operator on voalice16 and voalice10. Problem identified and solved.
    • T1 sites
      • SARA: GGUS:69729. Since yesterday there was trouble accessing data at the site; it looks like the cause of the problem was a large number of read requests on the ALICE pools. They lowered the number of xrootd transfers from 150 to 100 per pool (a sketch of such a pool setting follows this report).
      • [David: NAGIOS tests failed yesterday for the ALICE CREAM CE at RAL. Maarten: yes, they timed out after 5h. This is unexpected because they should normally time out after the default 11h; this is being investigated. Maarten: a much bigger issue for ALICE is that SAM test jobs and normal user jobs cannot be distinguished by their credentials.]
    • T2 sites
      • Several operations
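
    A minimal sketch of how such a per-pool limit is typically expressed in a dCache pool's setup file; the dedicated xrootd mover queue and its name are assumptions, and the exact syntax may differ between dCache versions (this is not SARA's actual configuration).

      # Hedged sketch: illustrative pool setup lines only.
      # Lower the maximum number of concurrent movers so that a burst of
      # simultaneous read requests no longer overloads the pool.
      mover set max active 100
      # If xrootd transfers run in a dedicated mover queue (name assumed here):
      # mover set max active 100 -queue=xrootd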

  • LHCb reports -
    • Experiment activities:
      • Data taking restarted.
    • New GGUS (or RT) tickets:
      • T0: 0
      • T1: 0
      • T2: 0
    • Issues at the sites and services
      • T0
        • Tape drive allocation has been done. Modification of the space token at CERN (GGUS:69709).
      • T1
        • ntr

Sites / Services round table:

  • BNL: ntr
  • KIT: ntr
  • FNAL: unscheduled outage on FTS server for T2 yesterday due to an error in an intervention
  • PIC: reminder, electrical maintenance next Tue as scheduled, all queues will be drained the day before
  • NLT1: nta
  • GridPP: ntr
  • NDGF:
    • short outage next Tue for SRM
    • the Danish tape robot might be operational again tonight, apologies again to ATLAS
  • ASGC: some jobs failed due to server overload
  • IN2P3: ntr
  • OSG: ntr
  • CNAF: reminder, outage next Tue at 10am UTC for tape library and Oracle
    • tapes will be unavailable for all experiments, disks will remain available
    • question: how should these partial outages be recorded in GOCDB? [Maarten: true, this is a fundamental limitation in our procedures. You could announce it as 'at risk' giving details about the partial outage. Jan: similar situations for CASTOR at CERN are also recorded as 'at risk', although there are some differences.]
  • RAL: Graeme gave a good summary of the problems. The FTS is now reasonably OK, the batch system not yet (may need to wait for the weekend to fix it)

  • CASTOR services: problem with monitoring (SLS went blank), now fixed
  • Grid services: ntr
  • Dashboard services: ntr

AOB: none

-- JamieShiers - 08-Apr-2011
