Week of 110214

Daily WLCG Operations Call details

To join the call at 15:00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here.

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Massimo, Dirk, Eva, Jamie, MariaDZ, Andrea, Stephane, Douglas, Roberto, Maarten, Alessandro, Gavin, Mattia);remote(Rolf, Jon, Xavier, Ulf, Gonzalo, Daniele, Tiju, Suijan, Paolo, Rob, Michael).

Experiments round table:

  • ATLAS reports -
    • HI processing was supposed to start tomorrow but has been delayed - more news next week (so a quiet week is expected). Further details at the ATLAS ADC meeting.
    • Changes are coming for the ATLAS T1 Streams. The audit trail should be enabled for the ATLAS DBs at all Tier-1 sites; please get in touch with Florbela (request below; a sketch follows this report). [ Eva will review ]
      "As preparation for, and extra monitoring of, the decommissioning of some of the Tier-1 Streams sites (NDGF, SARA, CNAF), we'd like to ask for the audit trail to be enabled on all the ATLAS databases at the Tier-1s. This would allow us to monitor the behaviour of the access and failover to the sites while this decommissioning phase is going on. The audit trail will be used for weekly reports on database usage, covering machines and users, to check the origin of connections. Some of the sites already have the audit tables enabled, so this request is for the rest of the sites, so that we can monitor the correct failover of the processes. This is a temporary measure and can be disabled when the decommissioning is over. Thank you, Florbela"
    • Use a service certificate (instead of Kors Bos's personal one) for Functional Test DDM transfers:
      • Grant privileges to the new DN on Castor (GGUS:67302)
      • For all FTS servers (CERN and all T1s): please grant the same privileges to the DN referred to at the end of GGUS:62855
    • Migration of data from ATLASMCDISK to ATLASDATADISK: ongoing; inaccessible/corrupted files are being discovered in the process
    • CERN :
      • Detected 2 stuck files (GGUS:67278): the files are on a disk server being drained, but the draining is stuck. The servers are out of the usual warranty, so a special operation is needed.
      • 1 file not migrated to tape (reported by mail): probably an issue from 18 December
    • T1s
      • IN2P3-CC (GGUS:67274): production jobs affected by failing dccp. Related to the network issue at IN2P3-CC; one machine did not recover correctly
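
For reference, a minimal sketch of what enabling and reporting on the Oracle audit trail (as requested in the note above) could look like. This is not the Tier-1 DBAs' actual procedure: the connection string and account are placeholders, cx_Oracle is assumed to be available, and DBA privileges are required.

```python
# Minimal sketch (not the sites' actual procedure): enable Oracle session
# auditing and pull a weekly summary of connections per host/user, as
# requested for the Streams decommissioning monitoring.
# Assumes cx_Oracle and a DBA account; DSN and credentials are placeholders.
import cx_Oracle

conn = cx_Oracle.connect("system", "CHANGE_ME", "t1-atlas-db.example.org/ATLR")
cur = conn.cursor()

# audit_trail is a static parameter: the new value only takes effect after a restart.
cur.execute("ALTER SYSTEM SET audit_trail=DB SCOPE=SPFILE")
# Record logons/logoffs so the origin hosts and users can be reported on.
cur.execute("AUDIT SESSION")

# Weekly report: who connected from where in the last 7 days.
cur.execute("""
    SELECT userhost, username, COUNT(*) AS sessions
      FROM dba_audit_session
     WHERE timestamp > SYSDATE - 7
     GROUP BY userhost, username
     ORDER BY sessions DESC
""")
for userhost, username, sessions in cur:
    print(userhost, username, sessions)

conn.close()
```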

  • CMS reports -
    • CMS / CERN / central services
      • Global run with cosmics (CRAFT11) going on. No major troubles.
    • Tier-0
      • the HeavyIon zero-suppression (HI-ZS) run continues smoothly.
        • currently flat usage of ~2k T0 slots (plot); Castor load: out ~2.8 GB/s, in ~0.8 GB/s (plot). For resources, in contact with Berndt
      • the T0 Operations over last days will be discussed at today's Computing Ops meeting - more info tomorrow.
    • Tier-1's
      • KIT: JR efficiency at ~70% last week. Now back to full green in the CMS SiteReadiness. OK.
      • PIC: transfer errors to T2_IT_Pisa. Transfers seem to work when the action is reproduced by the CMS contact at PIC. Waiting for feedback from the Pisa admins (SAV:119194)
      • IN2P3: CMS-SAM availability down to 56% last Sunday. Symptoms: 2 CEs (cclcgceli{04,08}) showed low CMS-SAM availability due to failures in the CE-cms-prod and CE-sft-job tests overnight (apparently not at the same times for the 2 CEs). Possible cause: the CMS contact reported (SAV:119204) that there was an incident on a core network device that impacted HPSS; it forced the BQS resources to be drained, and the SAM failures could be related to proxy expiration due to the draining. Green now.
    • Tier-2's
      • NTR - business as usual
    • Miscellanea
      • CREAM-CE / LCG-CE checks of SAM site availability
        • It has been kept monitored for 1 week (today: day 7): DONE.
        • Grand summary: positive feedback, the algorithm seems reliable and working, very few observations - just 2 minor inconsistencies found (posted in my CRC report, to be discussed at today's CMS Computing Ops meeting).
    • [ CMS CRC-on-duty from Feb 8th to Feb 14th: Daniele Bonacorsi. Next CRC: Stefano Belforte ]

  • ALICE reports - General Information: ALICE analysis week ongoing 14th-17th of February, high activity on GRID (around 26K jobs running)
    • T0 site
      • Migration of xrootd servers postponed till next Monday 21st February (due to above analysis week)
    • T1 sites
      • SARA: GGUS:67265. Closed. The site was reporting ALICE jobs running while MonALISA was reporting none. We thought there were jobs doing nothing, but it finally turned out to be a problem with MonALISA.
    • T2 sites
      • Cyfronet: GGUS:67254. Closed.
      • Clermont: a false fire alarm prevented the site from working during the weekend.
      • Several operations

  • LHCb reports - MC and user activity smooth operations.
    • T0
      • Some AFS slowness observed. A GGUS ticket was opened (as requested) on Friday afternoon with all available information about the problem (GGUS:67264).
      • ce114.cern.ch: many pilots aborting with Globus error 12. This is symptomatic of some misconfiguration of the gatekeeper (GGUS:67253). No news since Friday. [ Maarten - it doesn't look any different from ce113 as far as the account to which jobs are mapped is concerned. The CE is also fine in SAM tests and there are no other complaints. The jobs really do fail, though, so something is not quite right. ]
      • Sysadmins approached me on Friday (after the WLCG ops meeting) reporting a huge number of pending jobs in the grid queues at CERN. This is similar to the problem experienced at GridKa, where the BDII reported inconsistent information for which the Rank was 0, erroneously attracting jobs (a toy illustration follows this report). Ulrich put a patch in pre-production on Saturday that seems to improve the situation (but not completely). [ Maarten - Ulrich explained that this is a long-standing problem made worse recently by the very high number of jobs; the issue with the info provider now becomes more of a problem. On Friday evening some emergency measures were taken, but problems remain at the ~5% level: the number of waiting jobs for a queue that does have waiting jobs is sometimes reported as 0, e.g. as the default when the info provider times out. This needs to be resolved, but it is quite tricky. ]
    • T1
      • CNAF: a bunch of jobs failing at around 4pm on Sunday setting up the runtime environment. Shared area issue (GGUS:67282)
    • T2 site issues:
      • NTR
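
A toy illustration of the BDII/Rank problem mentioned in the T0 item above: if the info provider times out and a busy queue is published with 0 waiting jobs, a rank that prefers empty queues funnels all new pilots to it. This is not the real DIRAC/WMS ranking code; all names and numbers below are invented.

```python
# Toy illustration of the BDII/Rank issue: a queue whose waiting-job count is
# falsely published as 0 looks empty and attracts all new submissions.
# NOT the real DIRAC/WMS ranking code; CE names and numbers are invented.

def rank(ce):
    """Higher is more attractive: prefer CEs with fewer waiting jobs."""
    return -ce["waiting_jobs"]

ces = [
    {"name": "ce-a.example.org", "waiting_jobs": 250},  # genuinely busy
    {"name": "ce-b.example.org", "waiting_jobs": 40},
    # Busy queue, but the info provider timed out and a default of 0 was published:
    {"name": "ce-c.example.org", "waiting_jobs": 0},
]

best = max(ces, key=rank)
print("All new pilots go to:", best["name"])  # ce-c, despite its real backlog
```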

Sites / Services round table:

  • IN2P3 - several points: the ATLAS request to change the DN for FTS is done. A network incident on Sunday, due to a defective CPU card on one of the core switches, disconnected 50% of the storage (01:00 - 04:00 this morning); the card was replaced during the night and a SIR is in preparation. Two SIRs have now been filed: 1) LHC s/w area, 2) ATLAS dCache, both for Q4 2010. Second pre-announcement for the HPSS outage from 20 to 24 February: a major version change (6.2 to 7.3); it is recommended that experiments pre-stage data to dCache so they can continue to work during this long period.
  • PIC - ntr
  • FNAL - ntr
  • KIT - from Friday afternoon until this morning one tape library was stuck - the main victim was CMS, who could not stage some files. The library is now running OK.
  • RAL - declared an outage tomorrow for the general instance of CASTOR, which will affect ALICE
  • ASGC - ntr
  • CNAF - ntr
  • BNL -
  • OSG - BNL items: 2 records we are trying to update in SAM. The first is a maintenance record from January 15 that did not get accounted for; the second is due to a bug that caused sites to go into unknown status - a patch will be released for this, but some records show "unknown status" instead of "success". A third item concerns the ATLAS DN change (GGUS:66052).

  • CERN batch - about 500 machines got stuck and had to be rebooted. Around 4000 - 5000 batch jobs were lost.
  • CERN storage - some files inaccessible - a known bug in SRM 2.9 that should be fixed in SRM 2.10. A slot for putting it into production has to be discussed.

AOB:

  • GGUS-SNOW: tomorrow is the target date for the change from Remedy PRMS to Service Now ("SNOW"); SNOW will enter production tomorrow. The return path SNOW->GGUS is not ready, so changes made in SNOW will not be reflected in GGUS as things stand. Discussion of tests of ALARM tickets is ongoing; only signed tickets from the production GGUS server are accepted. As the SNOW->GGUS path is not ready, CERN service managers will have to work in two systems.

  • GGUS stopped working half an hour ago; being investigated. After the meeting this was reported as fixed at 15:35, but it was still a problem, at least for ATLAS. Finally fixed at 16:05 UTC+1.

Tuesday:

Attendance: local(Maarten, Maria, Jamie, Douglas, Roberto, Mattia, Nilo, Luca, Massimo, Alessandro, Lola, MariaDZ);remote(Jon, Ronald, Ulf, Dimitri, Rob, Jeremy, Paolo, Gonzalo, Stefano, Suijan).

Experiments round table:

  • ATLAS reports -
    • T0, Central services
      • The service certificate change for functional test transfers was done yesterday and caused many transfer problems through the day, but most were fixed by early afternoon. One problem remained with RAL, now fixed. [ Rob - this was a personal cert that changed to a robot cert? This was changed OK at BNL. Ale - can confirm that BNL is working OK. ]
      • The database people are changing the use of Oracle Streams in ATLAS, and to help monitor this they would like the audit trail turned on. This is being managed by Florbela (florbela.tique.aires.viegas@cern.ch) and seems to be in hand.
    • T1
      • All functional tests of transfers to RAL were failing all day. This was eventually traced to the new service cert. in use, and changes to the permissions on the FTS service there. This was fixed by late morning. (GGUS:67381)
    • T2/3
      • All data transfers to WISC in the US cloud have been failing for a while now (GGUS:67495).
      • Data transfers between SFU and BNL have been failing for the past day (GGUS:67318).

  • CMS reports -
    • CMS / CERN / central services
      • Global run with cosmics (CRAFT11) going on. No major troubles.
    • Tier-0
      • the HeavyIon zero-suppression (HI-ZS) run continues smoothly.
    • Tier-1's
      • no issues
    • Tier-2's
      • NTR - business as usual
    • Miscellanea
      • CREAM-CE / LCG-CE checks of SAM site availability
        • It's being kept monitored for 1 week (today: day 7): DONE.
        • grand-summary: it works.
    • CMS CRC-on-duty : Stefano Belforte

  • ALICE reports - General Information: ALICE analysis week ongoing 14th-17th of February, high activity on GRID (around 31K jobs running)
    • T0 site
      • Nothing to report
    • T1 sites
      • FZK & CNAF: a solution was provided yesterday afternoon to the problem with the ClusterMonitor (failures of the services from time to time) that some sites were suffering from. [ It looks like it is working and it will be deployed also at the rest of the sites. ]
    • T2 sites
      • Several operations

  • LHCb reports -
    • After a period with MC running steadily at full steam (40-50K jobs/day), last night there was a drop in the number of jobs due to an internal DIRAC issue (incompatibility between pilot and central service versions after a recent patch release).
    • Defining the road map for the major DIRAC release, which hopefully will happen on Tuesday, when the power cut in the CC will also force a draining of the system. [ Maarten - do you really need to drain the system for just a 2-hour cut? A: to minimize the impact on users, the DIRAC release will be made at the same time. ]
    • T0
      • Shared area issue: provided all possible information about the problem to the AFS managers (GGUS:67264). Discussions on the LHCb side about how the situation can be improved and how the environment can be set up. [ Massimo - no smoking gun - would like a clear case like "we tried to open file xx and this hangs" and/or a list of problematic directories. ]
      • Issue with the BDII reporting inconsistent information for CERN queues (causing the Rank to be erroneously attractive). We still see big fluctuations in the Rank value for one of the CEs at CERN (http://santinel.home.cern.ch/santinel/cgi-bin/rank_last_day?test=ce111.cern.ch-jobmanager-lcglsf-grid_lhcb). [ Maarten - Ulrich seems convinced he has found the root cause but did not give details. ]
    • T1
      • GridKa: hundreds of jobs no longer visible in LHCb, but whose processes are left running as zombies absorbing 20-30 GB of virtual memory. Our contact person at GridKa is looking at these processes with strace/gdb and will eventually inform the core software developers.
    • T2 site issues:
      • NTR

Sites / Services round table:

  • FNAL - ntr
  • NL-T1 - ntr
  • NDGF - intervention announced yesterday still going to happen
  • KIT - ntr
  • RAL - our CASTOR intervention is progressing according to plan
  • IN2P3 - ntr
  • CNAF - ntr
  • PIC - ntr
  • ASGC - ntr

  • OSG - still trying to get hold of Wojcik about the accounting record from January - no response so far. Would like to get the BNL record input correctly: a downtime in the middle of January was missed, which affected the availability numbers. Ale + Maarten will discuss with him. Q: ATLAS availability or OPS? A: OPS
  • GridPP - ntr

  • CERN DB - overnight, node 3 of CMSR rebooted. At the same time there was a problem with Streams on CMSR: a very large transaction was running which exhausted a system tablespace. The two are thought to be related; in contact with CMS.

  • CERN storage - agreed with ATLAS on an intervention for tomorrow at 10:00 (transparent); risk assessment and announcement done. A rollback would create some measurable downtime.

  • CERN AFS UI - today the AFS UI CA certificates were updated from version 1.37 to 1.38, which is routine in itself; however, this set of CAs includes OpenSSL 1.0-compliant CA files as well as those we are using today. Probably from today, CRLs will be processed with a local copy of OpenSSL 1.0 so that CRLs are also installed for OpenSSL 1.0-based as well as OpenSSL 0.9.x-based software; in particular, SL6 uses OpenSSL 1.0. Without this update, SL6 client machines pointing to the AFS CA area would have "worked" but would have been insecure, with no CRLs. The opportunity was also taken to move from fetch-crl v2 to fetch-crl v3.

AOB:

  • GGUS-SNOW: went live today as per schedule. The preference was for CERN-ROC only (and not the new SUs, which will be handled later); this was not noted in the Savannah description. 3rd-level m/w support was migrated by mistake but corrected just before 09:00. However, the import of all open GGUS tickets on m/w issues created a number of notifications related to SLA criteria native to SNOW. A detailed explanation will follow. Routing and mapping have changed. A report will be given at the T1SCM next week.

Wednesday

Attendance: local(Massimo, Stefan, Simone, Maarten, Jamie, MariaDZ, Mattia, Douglas, Eva, Nilo, Gavin, Alessandro);remote(Kyle, Tiju, Onno, Suijan, Stefano, Michael, Jon, Ulf, Daniele Andreotti, Dimitri, Rolf).

Experiments round table:

  • ATLAS reports -
    • T0, Central services
      • The Castor upgrade today took longer than expected. This caused a 30 min outage of the CERN SRM, from about 10:05 to 10:35. It did not cause significant problems for ATLAS, but did cause some confusion/concern among shifters.
    • T1
      • The production grid cert used in handling transfers to the FR cloud was changed this morning. This caused problems and errors, and a GGUS ticket was sent to IN2P3 to check the permissions used there (GGUS:67521). People said they would look at this, but no answer yet at this time. [ Rolf - the ticket mentioned has an answer from 13:00 UTC with a question for ATLAS ]
    • Reminder: ATLAS new DN used for DataManagement is /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ddmadmin/CN=531497/CN=Robot: ATLAS Data Management

  • CMS reports -
    • CMS / CERN / central services
      • Global run with cosmics (CRAFT11) going on. No major troubles.
      • Sent a TEAM ticket last night because the Castor default pool was unavailable and there was no tape activity (GGUS:67513)
        • promptly answered, thanks
        • the default pool issue was simply heavy user activity
        • the lack of tape operations was a monitoring bug; the ticket stays open because of that
    • Tier-0
      • Processing cosmics
      • the HeavyIon zero-suppression (HI-ZS) run continues smoothly.
    • Tier-1's
      • no issues. completing tails. reprocessing to start in a week or so due to issues with CMSSW 3_11
    • Tier-2's
      • no production yet, waiting for 3_11 here as well. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Yesterday's drop in jobs has been explained by the recent upgrade of the CERN CA bundle (to version 1.38-1). The format of the certificates directory has changed: the [hash].0 files - which used to contain the CA certificates - are now links to [CA].pem files (with support for both OpenSSL 0.9.8 and OpenSSL 1.0 hashing, which has changed). The certificates bundled in DIRAC did not reflect this new structure, so pilot job submission was systematically failing. (A sketch of the new layout check follows this report.)
    • T0
      • Shared area issue: investigation ongoing, in close touch with LHCb people (GGUS:67264).
      • Issue with the BDII reporting inconsistent information for CERN queues: bottom line? [ Gav - requires a s/w change. Maarten - there was inadvertent cross-mounting that caused the data for the production CEs to be overwritten with data used by a special test setup. This also explains why the problem was so strangely intermittent. It has been fixed. The investigations have led to some substantial improvements to the info provider code, which had not been touched in a long time. ]
    • T1
      • NTR
    • T2 site issues:
      • NTR
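
A minimal sketch of how the new certificates-directory layout described above could be checked, assuming the conventional /etc/grid-security/certificates path and the openssl CLI (in OpenSSL 1.0, -subject_hash gives the new hash and -subject_hash_old the 0.9.x-style one). This is illustrative only, not the DIRAC fix itself.

```python
# Sketch: list the <hash>.0 entries in a CA directory, resolve the symlinks,
# and show the old- and new-style subject hashes of the target certificate.
# Assumes the openssl CLI is on the PATH; the directory is the conventional one.
import glob
import os
import subprocess

CERTDIR = "/etc/grid-security/certificates"

def subject_hash(pem_path, old=False):
    """Return the OpenSSL subject hash of a certificate (old or new style)."""
    flag = "-subject_hash_old" if old else "-subject_hash"
    out = subprocess.check_output(
        ["openssl", "x509", "-noout", flag, "-in", pem_path])
    return out.decode().strip()

for link in sorted(glob.glob(os.path.join(CERTDIR, "*.0"))):
    if not os.path.islink(link):
        continue  # old-style flat file containing the CA certificate itself
    target = os.path.join(CERTDIR, os.readlink(link))
    print("%s -> %s  (new hash %s, old hash %s)" % (
        os.path.basename(link), os.path.basename(target),
        subject_hash(target), subject_hash(target, old=True)))
```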

Sites / Services round table:

  • RAL - ntr
  • NL-T1 - ntr
  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • NDGF - ntr; the announced network intervention went OK, everything fine.
  • CNAF - ntr
  • KIT - downtime on Friday for maintenance on a tape library, 10:30 - 12:00. Tapes in this library will not be available for reading.
  • IN2P3 - nta
  • OSG - ntr

  • CERN DB - the ATLAS online DB will be disconnected several times tomorrow morning to do some tests; in the afternoon the failover to Safehost will be tested. Muon calibration data will be replicated from Michigan to CERN in production as of tomorrow. The CMS online DB was rebooted in a rolling way in the morning to add new disks.

  • CERN Dashboard - SAM availability reports look fine since Monday.

  • CERN storage: ATLAS - apologies that the validation procedure was not good enough, hence the delay; CMS - the ticket will be closed as the issue is being passed to internal follow-up and will be logged in Savannah; LHCb - waiting for more data

  • GGUS-SNOW: many issues for ROC-CERN. Many things depend on the development of the SNOW->GGUS path and on functional elements in SNOW where the right content is not yet there; the issues will be documented via a TWiki. Will follow up via the T1SCM and CERN C5 as appropriate.

AOB:

  • Action in preparation for ATLAS LFC central: got the green light from ATLAS, so the decommissioning can proceed. IT-PES: please schedule the decommissioning for Monday next week or later. Could the DB people please keep the DB for a few days, just in case.

  • Emergency broadcast yesterday evening about the new CA versions being incompatible with VOMS Admin installations. Most WLCG VOMS servers are at CERN, but there are also servers at FNAL and BNL. The message was forwarded - please look into it.

Thursday

Attendance: local(Eva, Manuel, Jamie, Maria, Douglas, Alessandro, Mike, Massimo, Stefan, Simone);remote(Ulf, Michael, Jon, Gonzalo, Rob, Sujijan, JT, Stefano, Foued, Paolo).

Experiments round table:

  • ATLAS reports -
    • T0, Central services
      • The grid cert for more central services was updated today, continuing the grid cert upgrades happening this week. There are more permission problems associated with this, and Alessandro Di Girolamo <Alessandro.Di.Girolamo@cern.ch> is tracking the issues. One problem is with the CERN FTS service; a ticket was created for this (GGUS:67571) and there has been no response since 9:00 UTC as of this time.
      • There were tests of an online database outage today, and the ATONR database for ATLAS was taken down from about 10:00 to 14:00. Various online systems in ATLAS used this to test outage procedures. Announcements of the outage from automated systems worked well all day (the AMOD phone received many SMS messages). [ Eva - around 13:00 ATONR switched to standby - it will be switched back at the end of the afternoon ]
    • T1
      • Problems with data transfers to Taiwan this morning, first noticed as timeouts in transfers from FZK. This turned out to be related to a broken undersea cable between Japan and the US, causing long transfer times to Taiwan. By late morning Taiwan had switched to another network link, and there have been no complaints since. We assume this caused problems for other sites as well, but no other complaints were heard.
      • The changes to the grid cert used yesterday caused a few GGUS tickets to be created about permission changes (GGUS:67555, GGUS:67556, GGUS:67557, GGUS:67558). Sites responded to these tickets well and in a timely fashion; the problems were gone within an hour in most cases.

  • CMS reports -
    • CMS / CERN / central services
      • Global run with cosmics (CRAFT11) going on. No major troubles.
      • update on TEAM ticket GGUS:67513 from yesterday
        • on closer reading it was not really something wrong in the monitoring; rather, a process needed for monitoring data collection had died. It was restarted and now all is OK. We understand IT is looking into how to prevent this from happening again.
      • Problem with the FNAL VOMS server having seeped into UI configurations before being fully deployed (GGUS:66403). A user managed to get a proxy from the FNAL VOMS which was then not accepted by services. Some details are still being investigated on our side; not clear anyone needs to do anything yet. The FNAL VOMS appears in the UI 3.2.8 config at CERN, but usually simply fails to give proxies. [ Maarten - the idea was that CMS would have a VOMS server in case the one at CERN was unreachable or had some other problem; by default it is configured on the UI for getting a proxy. It is probably a CMS-internal matter whether this FNAL VOMS server should be usable for this kind of service, as ATLAS have done with their server at BNL. If it is not ready, for whatever reason, it can be removed from the UI config and CMS could open a ticket for that to happen. To first order, FNAL should be contacted to figure out the status of this server. Stefano - do we try to close it here now or wrap up offline? Rob - we recently took this out of our config package due to errors that were being experienced. Some issues with registration in GOCDB? ]
    • Tier-0
      • Processing cosmics
      • the HeavyIon zero-suppression (HI-ZS) run continues smoothly.
    • Tier-1's
      • no issues. completing tails. reprocessing to start in a week or so due to issues with CMSSW 3_11
    • Tier-2's
      • no production yet, waiting for 3_11 here as well. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte

  • ALICE reports -
    • T0 site
      • Nothing to report
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations (some issues to be chased up with a couple of T2s that are not well behaved at the moment)

  • LHCb reports -
    • Smooth operations.
    • Focus on the certification of Dirac-v6r0. Meant to be done next Tuesday when all VO boxes will be off.
    • T0
      • Shared area issue: investigation ongoing, in close touch with LHCb people (GGUS:67264).
    • T1
      • Scheduled downtime at CNAF of the Oracle DB behind the LFC-RO. Finished at 12:00
    • T2 site issues:
      • Issue at Oxford (GGUS:67173): failing pilots - needs follow-up

Sites / Services round table:

  • NDGF - ntr
  • BNL - update regarding the BNL-CNAF network issue: yesterday an update was requested and messages were received from ESnet and USLHCNet. There are still issues even though the circuit has been re-engineered: an error rate is observed and experts are discussing whether it is good enough for production. Traffic has not been moved back onto this circuit. All related communication is in GGUS:61440.
  • FNAL - we are observing transfer failures between CERN and FNAL; investigating. We might call on the network people at CERN to help the FNAL network people run detailed tests.
  • PIC - ntr
  • IN2P3 - pre-announcement of a small outage on 22 February: the DBs for ATLAS (DBATL and DBAMI) will be out 10:00-12:00 local time. Announced as usual; ATLAS is already up to date on this.
  • RAL - 2 ongoing AT RISKS; one to bounce ATLAS streaming DB to turn on auditing and one to merge space tokens.
  • ASGC - nothing to add to network issue this morning that was mentioned earlier
  • NL-T1 - small issue: ATLAS HI jobs cannot stay within the 4 GB memory limit that was announced long ago. The limit has been temporarily raised to 5 GB, but not too many such jobs can run on a single node or the virtual memory is exhausted. If this causes problems the limit will have to be set back.
  • KIT - ntr
  • CNAF - confirm intervention on DB over. Services are back.
  • OSG - ntr

  • CERN storage - investigating issue from LHCb

AOB:

  • Maarten - we have observed significant error rates with the CRL service at CERN, affecting multiple services; the SAM machines have been affected by this. Any machine that runs the CRL update cron job normally goes via an HTTP proxy (squid), which is working OK most of the time, but we have seen nodes suffer from empty responses sufficiently often that this can cause problems. The CERN CRL has a lifetime of 5 days and can get out of date, which can affect the work that a machine can do. VO boxes might suffer. (A sketch of such a freshness check follows.)
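
A minimal sketch of the kind of local freshness check implied above, assuming the conventional /etc/grid-security/certificates location, the <hash>.r0 CRL file naming, and the openssl CLI; the one-day warning threshold is illustrative (the CERN CRL lifetime is about 5 days).

```python
# Sketch: warn when locally installed CRLs (<hash>.r0 files) are close to their
# nextUpdate time, e.g. because the CRL update cron job keeps getting empty
# responses from the HTTP proxy. Path and threshold are illustrative.
import glob
import subprocess
from datetime import datetime, timedelta

CERTDIR = "/etc/grid-security/certificates"
WARN_IF_EXPIRES_WITHIN = timedelta(days=1)

def next_update(crl_path):
    """Return the nextUpdate time of a CRL, parsed via the openssl CLI."""
    out = subprocess.check_output(
        ["openssl", "crl", "-noout", "-nextupdate", "-in", crl_path])
    # Output looks like: nextUpdate=Feb 23 10:00:00 2011 GMT
    stamp = out.decode().strip().split("=", 1)[1]
    if stamp.endswith(" GMT"):
        stamp = stamp[:-4]
    return datetime.strptime(stamp, "%b %d %H:%M:%S %Y")

now = datetime.utcnow()
for crl in sorted(glob.glob(CERTDIR + "/*.r0")):
    try:
        nu = next_update(crl)
    except Exception as exc:
        print("UNPARSEABLE %s: %s" % (crl, exc))
        continue
    if nu - now < WARN_IF_EXPIRES_WITHIN:
        print("EXPIRING SOON: %s (nextUpdate %s)" % (crl, nu))
```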

Friday

Attendance: local(Marcin, Maria, Mike, Jamie, Alessandro, Massimo, Maarten, Roberto);remote(Rolf, Jon, Michael, Xavier, Gonzalo, Tiju, Jeremy, Paolo, Suijan, Thomas Bellman, Rob, Onno).

Experiments round table:

  • ATLAS reports -
    • The pilot version will be patched due to an urgent and critical issue with RAL: the site mover used inside the pilot is not working after the migration from MCDISK to DATADISK. The update will be done today, even though it is a Friday; otherwise RAL would be stuck for the whole weekend.
    • Info: ATLAS foresees the migration to Nagios for the end of February. During the first week of March SAM submission will stop and only Nagios tests will be used to calculate site availability. If sites have some specific feature for the SAM tests, it will have to be adapted for the Nagios tests. More info at the ADC meeting a week on Monday.

  • CMS reports -
    • CMS / CERN / central services
      • Global run with cosmics completed. Getting prepared for beam.
      • Update on GGUS:66403 (yesterday): the item was sorted out offline. Opened GGUS:67609 to get CERN's UI corrected. CMS will pursue a deployment of new .lsc files so that voms.fnal.gov can be used as a valid VOMS server in production (a sketch of the .lsc layout follows this report).
    • Tier-0
      • Processing cosmics
      • the HeavyIon zero-suppression (HI-ZS) run continues smoothly.
    • Tier-1's
      • Pre-production with CMSSW 3_11 has started; also running backfills at T1s to test the new production tool
      • GGUS:67573 with ASGC: their Castor storage is overloaded to the point that SAM tests are timing out. Production jobs also affected. [ Suijan - caused by yesterday's network problem ]
    • Tier-2's
      • little production yet, waiting for 3_11 here as well. Analysis running fine
    • Miscellanea
      • NTR
    • CMS CRC-on-duty : Stefano Belforte
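
For context on the .lsc item above: an LSC file lists the DN of the VOMS server's host certificate followed by the DN of its issuing CA, so that services can validate attribute certificates from that server without shipping the server certificate itself. Below is a hedged sketch; the DNs are placeholders (not the real voms.fnal.gov values) and the vomsdir path is the conventional one.

```python
# Sketch: write an .lsc file for an additional VOMS server so that services
# accept its attribute certificates. The DNs below are PLACEHOLDERS, not the
# actual voms.fnal.gov subject/issuer; the vomsdir location is the usual one.
import os

VO = "cms"
HOSTNAME = "voms.fnal.gov"
VOMSDIR = "/etc/grid-security/vomsdir"

# First line: DN of the VOMS server host certificate.
# Second line: DN of the CA that issued it.
lsc_content = (
    "/DC=org/DC=example/OU=Services/CN=voms.fnal.gov\n"        # placeholder DN
    "/DC=org/DC=example/CN=Example Certification Authority\n"  # placeholder CA DN
)

target_dir = os.path.join(VOMSDIR, VO)
if not os.path.isdir(target_dir):
    os.makedirs(target_dir)
with open(os.path.join(target_dir, HOSTNAME + ".lsc"), "w") as f:
    f.write(lsc_content)
```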

  • ALICE reports - General information: yesterday afternoon we deployed a new AliEn version (v2-19.73) at all sites. Services at the sites are already running this new version successfully. The next thing to be put in place is a limit on the memory utilisation of jobs.
    • T0 site
      • Migration of xrootd servers will be done next Monday 21st February (SLC4 to SLC5)
    • T1 sites
      • Nothing to report
    • T2 sites
      • Usual operations

  • LHCb reports -
    • Another drop in the number of running jobs last night. Found that a feature introduced yesterday into the DIRAC pilot software was systematically crashing all pilots. It was discovered early this morning and fixed promptly. Jobs are slowly coming back to the sites.
    • Issues at the sites and services
      • To be verified, but it looks like some version of the CREAM CE introduced a problem in the execution of the pilot jobs; cream-3 at GridKa and ce204-205-206 at CERN are examples of CREAM CEs with this peculiar problem. The same is not present at other CREAM CE endpoints (like the one at PIC). It would be nice to know the version installed on these particular nodes. [ Maarten - those CEs are at the latest version, 1.6.4. ]
    • T0
      • NTR
    • T1
      • NTR
    • T2 site issues:
      • NTR

Sites / Services round table:

  • IN2P3 - small change to the MSS outage: it will start tomorrow at 18:00 instead of on the 20th at 00:00.
  • FNAL - replaced our PhEDEx node yesterday - back in service and working well
  • BNL - ntr
  • KIT - today's maintenance on the tape library was successful and finished within the scheduled time. The AT RISK on Monday is postponed to 28 February - unfortunately the announcement in GOCDB for Monday 21 February cannot be removed; not sure why. A corresponding ticket has been opened.
  • PIC - ntr
  • RAL - ntr
  • CNAF - ntr
  • ASGC - nta
  • NDGF - ntr
  • NL-T1 - ntr
  • GridPP - ntr
  • OSG - the January downtime for BNL has been resent and Wojcik has confirmed receipt. The downtime adjustments are all available for the SAM recalculation.

  • CERN DB - update about the ATLAS online testing. In the afternoon a switchover to standby and back was performed; ATLAS noticed some minor problems with reconnecting to the original. The switchover and switchback each took about 15 minutes. A problem was seen on the application side with COOL; the service was restarted and was then OK, but it is unclear what the problem was. More news from Rainer to follow.

  • CERN Storage - the investigation for LHCb continues: we have looked at the volumes used and, for the time interval concerned, no issues were found - possibly something else? A ticket from ATLAS was just received about intermittent errors on the T0. We have identified the daemon but have not yet restarted it, as we would like to find the root cause. Something similar is seen also for CMS - investigating.

AOB:

  • KIT - downtime announcement issue (see above), GGUS:67600. The wrong announcement is gone, but the bug is still present.

-- JamieShiers - 09-Feb-2011
