Week of 111121

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here
  3. The scod rota for the next few weeks is at ScodRota

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs, Open Issues & Broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
Change assessments: CASTOR Change Assessments

General Information

General Information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - Cooldown Status - News


Monday:

Attendance: local(Alexei, David, Eva, Ignacio, John, Maarten, Manuel, Maria D, Massimo, Pepe, Steve);remote(Gonzalo, Jhen-Wei, Joel, Kyle, Lisa, Michael, Onno, Paolo, Rolf, Tiju, Ulf).

Experiments round table:

  • ATLAS reports -
    • Tier-0
      • Heavy Ion running proceeding
      • We were informed Sat 7pm that Oracle ADCR instance 3 had been rebooted due to problems with the interconnect; it was fully functional after the reboot
        • Eva: a switch was replaced, but the redundancy was supposed to make that transparent; we will follow up with the network experts
    • Tier-1
      • INFN-T1 power failure Fri afternoon, site excluded from PanDA, DDM, and Tier 0 export pending restoration of SRM service on Monday. Tier 2 and Tier 3 sites remain operational.
        • INFN-T1 Mon Nov 21. All transfers failed. GGUS ticket (alarm ticket) : GGUS:76663
          • Paolo: no update yet from ATLAS contact at CNAF; we are looking into it
      • BNL transfer failures ~16:00-18:00 on Saturday due to SE problem at BNL
      • FZK-LCG2 high failure rate (40% production job failures on Sat) from input file staging Sat afternoon, site taken offline for production Sat night and excluded from analysis brokerage. [Problem still being worked on Sunday.] GGUS:76605
      • Sunday morning ticket on SARA-MATRIX SRM failure, Tier 0 exports failing so ticket escalated to ALARM at 8am. Alarm acknowledged by site 20min later. Site excluded from PanDA, DDM and Tier 0 export. [No action reported on ticket as of Sun evening.] GGUS:76628
        • Onno: dCache head node had a full partition, which caused the storage cluster to become unavailable; the problem was fixed yesterday evening; also LHCb was affected; further details in site report
        • Alexei: what is the status of the network maintenance?
        • Onno: so far ongoing as planned

  • CMS reports -
    • LHC / CMS detector
      • Heavy Ions collision data taking.
    • CERN / central services
      • CASTOR/T0TEMP instance overload: GGUS:76613. We opened the ticket just to cross-check that everything was OK; it was, and the ticket was closed. We learnt that the HI Tier-0 team is running many I/O-demanding parallel jobs in waves; they will tune their settings to run more smoothly. Additionally, we asked for some unused CMS resources to be added to this particular pool: GGUS:76649.
      • GGUS:76635: several CMS data files stuck in export from CASTOR at CERN to T1s. The files cannot be staged to the t1transfer pool; they may be stuck in the transferManager. As of today, some of the files are already STAGED and a handful are INVALID, but most are in the "Error 2/No such file or directory (File * (951635137@castorns) not on this service class)" state.
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Debugging network issues with US T2s, GGUS:75985. Iperf tests are ongoing with some US sites. The ticket was re-opened.
        • Pepe: we provided a link to a CMS tool that can be useful in debugging such issues
      • [T1_FR_CCIN2P3]: investigation on networking issues in progress GGUS:75983 GGUS:71864
        • Pepe: we provided links to a CMS tool that can be useful in debugging such issues
      • [T1_FR_CCIN2P3]: GGUS:75829 about proxy expiration in the JobRobot test: it happened again on Nov 5th and Nov 9th on cccreamceli02.in2p3.fr, also affecting T2_BE_IIHE (SAV:124597 and SAV:124471); CREAM developers are involved (GGUS:76208). CMS will drain cccreamceli02.in2p3.fr from its production tools over the next week, in sync with the decommissioning of the CE.
      • [T1_IT_CNAF]: GGUS:76597, CREAM again reporting dead jobs @ CNAF T1 as REALLY-RUNNING. The same problem was already observed a few weeks ago (GGUS:75648); it was decided at the time to open a new ticket whenever we encounter the same problem.
      • [T1_TW_ASGC]: GGUS:75377, needs confirmation from the CMS side, but the check will take some time: CMS is not running at full capacity at ASGC at the moment because of the CASTOR migration, and the ReReco is taking all available slots. The ReDigi jobs, which are the ones that use those "corrupted" files, are currently behind in the queue.
      • [T1_TW_ASGC]: GGUS:76648, expired CRL at T1_TW_ASGC. Transfers from some sites to Taiwan are failing, most likely due to invalid CRLs on some or all GridFTP servers (a minimal CRL expiry check is sketched after this report).
        • Jhen-Wei: the CRLs on the new disk servers are OK now
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR
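
Regarding the expired CRL at ASGC above (GGUS:76648), a minimal sketch of a CRL expiry check, assuming PEM-format CRL files named *.r0 under /etc/grid-security/certificates (the usual fetch-crl location) and a standard openssl client; the directory and warning threshold are illustrative only:

    #!/usr/bin/env python3
    # Minimal CRL expiry check: warn about CA CRLs that are expired or about
    # to expire. Directory, file pattern and threshold are illustrative only.
    import glob
    import subprocess
    import sys
    from datetime import datetime, timedelta

    CRL_DIR = "/etc/grid-security/certificates"   # usual fetch-crl target
    WARN = timedelta(hours=12)

    rc = 0
    for crl in sorted(glob.glob(CRL_DIR + "/*.r0")):
        out = subprocess.run(["openssl", "crl", "-in", crl, "-noout", "-nextupdate"],
                             capture_output=True, text=True)
        if out.returncode != 0:
            print("UNREADABLE %s" % crl)
            rc = 2
            continue
        # openssl prints e.g. "nextUpdate=Nov 28 10:00:00 2011 GMT"
        stamp = out.stdout.strip().split("=", 1)[1]
        expires = datetime.strptime(stamp, "%b %d %H:%M:%S %Y %Z")
        left = expires - datetime.utcnow()
        if left < timedelta(0):
            print("EXPIRED  %s (since %s)" % (crl, -left))
            rc = 2
        elif left < WARN:
            print("EXPIRING %s (in %s)" % (crl, left))
            rc = max(rc, 1)
    sys.exit(rc)

Run from cron on the GridFTP/disk servers, such a check would flag a stale CRL before transfers start failing with certificate verification errors.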

  • ALICE reports -
    • Two more ALICE users ran into the VOMRS Oracle bug that caused their valid certificates to be marked expired: GGUS:76654

  • LHCb reports -
    • Experiment activities
      • Next round of reprocessing starting to tail off. Backlog primarily now at IN2P3 which has most of the last round of data.
      • Still expect to launch next round of MC simulations by end of November.
        • Joel: expected LHCb activity is low in the coming days
      • Waiting for new templates from WLCG to give updated definitions of service criticality for LHCb.
    • T1
      • SARA : SE problem (GGUS:76629) fixed, but is it normal that it took 8 hours for an alarm ticket to be worked on? This illustrates the discussion at the last T1 coordination meeting about the difference between "response time" (in this case 7 minutes, good) and "max downtime" (in this case 13 hours, not good at all).
        • Onno: our SLA has only best effort during weekends
      • dCache : (GGUS:76561) opened for assistance migrating data from one space token to another.
    • T2
      • Shared software area problems at Auver (GGUS:76586)

Sites / Services round table:

  • ASGC - nta
  • BNL - ?
  • CNAF
    • the storage services are back in production now
  • FNAL - ntr
  • IN2P3 - ntr
  • NDGF
    • ~1 hour dCache downtime tomorrow starting 12:30 UTC for security updates
  • NLT1
    • SRM problem was due to the log partition filling up; the dCache logging configuration is delicate: we either get too little or way too much, and the configuration has many parameters that influence the behavior; we will discuss this with the developers (a minimal partition-usage check is sketched after this site list)
  • OSG
    • not attending Thu-Fri because of Thanksgiving holiday
  • PIC - ntr
  • RAL - ntr
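
Regarding the NLT1 log-partition incident above, a minimal sketch of a partition-usage check that could be run from cron on the affected node; the paths and thresholds are placeholders:

    #!/usr/bin/env python3
    # Minimal partition-usage check, e.g. for a dCache head node log area.
    # Paths and thresholds are placeholders.
    import os
    import shutil
    import sys

    CHECKS = {
        "/var/log": 90,     # warn when more than 90% used (hypothetical limit)
        "/var/spool": 85,
    }

    rc = 0
    for path, limit in CHECKS.items():
        if not os.path.isdir(path):
            continue
        usage = shutil.disk_usage(path)
        pct = 100.0 * usage.used / usage.total
        if pct >= limit:
            print("WARNING: %s is %.1f%% full (limit %d%%)" % (path, pct, limit))
            rc = 2
    sys.exit(rc)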

  • CASTOR/EOS
    • CMS CASTOR t0temp pool overload due to its requirement of having 2 disk copies of each file, i.e. replication on close; furthermore, tape recalls are triggered for an account that is not allowed to trigger a replication; we will follow up offline
  • dashboards
    • the ATLAS Job Monitoring Historical View job statistics were not being updated due to a DB deadlock; no data was lost, but there is a backlog of a few hours; an alarm will be added to detect such problems faster
  • databases
    • ATLAS and WLCG integration databases will be moved to new HW tomorrow
    • ATLAS Archive DB will be moved to Oracle 11g on Thu
    • downtimes are required and have been announced
  • grid services
    • CREAM ce203.cern.ch out of production due to disk HW problem
    • VOMS server lcg-voms.cern.ch accidentally got its certificate renewed by a watchdog process this morning, causing many WMS job failures for a few hours; currently the previous certificate is back until Wed 09:00 UTC, when it gets replaced with a new certificate that is part of the lcg-vomscerts-6.8.0 rpm; ensure your FTS and gLite 3.1 WMS nodes have that rpm installed
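
Regarding the lcg-vomscerts-6.8.0 requirement above, a minimal sketch of how an RPM-based FTS or gLite 3.1 WMS node could verify the installed package version; the version comparison is deliberately naive:

    #!/usr/bin/env python3
    # Check that lcg-vomscerts is installed and at least version 6.8.0 on an
    # RPM-based node. The version comparison is deliberately naive.
    import subprocess
    import sys

    REQUIRED = (6, 8, 0)

    proc = subprocess.run(["rpm", "-q", "--qf", "%{VERSION}", "lcg-vomscerts"],
                          capture_output=True, text=True)
    if proc.returncode != 0:
        sys.exit("lcg-vomscerts is not installed")

    version = proc.stdout.strip()
    installed = tuple(int(x) for x in version.split("."))
    if installed < REQUIRED:
        sys.exit("lcg-vomscerts %s installed, 6.8.0 or newer required" % version)
    print("lcg-vomscerts %s OK" % version)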

AOB:

Tuesday:

Attendance: local(Alexei, David, Eva, Jamie, John, Maarten, Manuel, Maria D, Oliver, Stephen);remote(Burt, Dimitri, Gonzalo, Jeremy, Jhen-Wei, Kyle, Michael, Paco, Paolo, Tiju, Ulf).

Experiments round table:

  • ATLAS reports -
    • Pb-Pb running proceeding
    • Problem with the ADCR instance between 9:30 and 10:00 (it looks like the intervention wasn't transparent; thanks to the IT DBAs for fixing it quickly)
      • Eva: a broken switch was changed, which should have been transparent, but a mistake was made; we will follow up with the network team; on Sat a standby switch reported errors, but by mistake the primary switch got rebooted instead
    • SARA and INFN-T1 are in DDM production starting from Nov 21, ~19:00
      • Info from Luca Dell'Angelo at 10:30am: 'There was still a problem on ATLAS storage.' No problems are observed from our side.
    • some issues with T2s in IL, RU and US
      • AGLT2 issue was fixed during Nov 22nd night CET
    • other issues
      • there are sites in ToA, but not in FT. Reported to DDM Ops and Regional teams

  • CMS reports -
    • LHC / CMS detector
      • Heavy Ions collision data taking.
    • CERN / central services
      • CASTOR/T0TEMP instance overload over the weekend (GGUS:76613 (solved)). Thresholds will be adapted to avoid overloads. Additionally, asked some disk servers to be added to this particular pool: GGUS:76649 (in progress).
        • John: the CMS instance of EOS is close to capacity, more servers need to be added there
        • Oliver: OK, but the CASTOR t0temp issue is much more urgent
        • John: we will look into it right away
      • GGUS:76635 (solved) Several CMS data files were stuck in export from CASTOR at CERN to T1s. The files could not be staged to the t1transfer pool; they may have been stuck in the transferManager. A ticket was opened with the CASTOR developers, SAV:124828, about "Suboptimal choice of d2dsrc".
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. Iperf tests are ongoing with some US sites (a minimal iperf sketch follows this report). GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; problems in the way RENATER/GEANT(?) treats IP packets tagged LBE. The default of tagging all packets from our file servers as LBE was changed and improvements are visible. GGUS:75983 (in progress, last update 11/21), GGUS:71864 (in progress, last updated 11/21)
        • Rolf: the LBE (Less than Best Effort) QoS (Quality of Service) flag is put on outgoing packets for WLCG services to allow WLCG traffic outside the OPN to be throttled in favor of other usage when the network is overloaded; this mechanism worked OK until May 27, when some change occurred that still is not understood today; for now we have suppressed the use of that flag as a mitigation, but we would like to have the correct behavior restored at some point
      • [T1_FR_CCIN2P3]: The CREAM CE for BQS will be drained and no longer used for production by CMS, GGUS:75829 (in progress, last update 11/17); the underlying problem with proxy propagation was traced back to a problem in the sudoers configuration, GGUS:76208 (solved)
      • [T1_IT_CNAF]: CREAM again reporting dead jobs @ CNAF T1 as REALLY-RUNNING. Workaround not yet released by EMI, GGUS:76597 (in progress, last update 11/22)
      • [T1_TW_ASGC]: read errors on CASTOR, GGUS:75377 (solved)
      • [T1_DE_KIT]: JobRobot problems: GGUS:76706 (in progress, last update 11/22)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR
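
Regarding the iperf tests mentioned in the KIT and CCIN2P3 items above, a minimal sketch of a throughput measurement, assuming iperf3 is installed on both ends and an iperf3 server ("iperf3 -s") is running on the remote host; the host name is a placeholder:

    #!/usr/bin/env python3
    # Short iperf3 throughput test against a remote host; reports Mbit/s.
    # Assumes "iperf3 -s" is running on the target; the host is a placeholder.
    import json
    import subprocess
    import sys

    HOST = "iperf.example.org"   # placeholder endpoint
    DURATION = 10                # seconds

    proc = subprocess.run(["iperf3", "-c", HOST, "-t", str(DURATION), "-J"],
                          capture_output=True, text=True)
    if proc.returncode != 0:
        sys.exit("iperf3 failed: " + proc.stderr.strip())

    result = json.loads(proc.stdout)
    bps = result["end"]["sum_received"]["bits_per_second"]
    print("throughput to %s: %.1f Mbit/s" % (HOST, bps / 1e6))

Repeating such a test in both directions, inside and outside the LBE-tagged paths, would help separate a QoS/tagging problem from a plain capacity problem.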

  • ALICE reports -
    • The VOMRS problem mentioned yesterday should be fixed now.

  • LHCb reports -
    • Experiment activities
      • Next round of reprocessing starting to tail off. Backlog primarily now at IN2P3 which has most of the last round of data.
      • Still expect to launch next round of MC simulations by end of November.
    • T1
    • T2

Sites / Services round table:

  • ASGC - ntr
  • BNL - ?
  • CNAF
    • the queues for ATLAS are open again, jobs are running
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3
    • yesterday between 13:18 and 17:18 CET errors were reported for various grid services; they went away by themselves; did any other site see peculiar errors during that time frame?
      • Maarten: no such reports were noticed from other sites; yesterday's lcg-voms certificate incident could have something to do with it, though the timeline does not match exactly and it would be remarkable that only 1 site (NGI?) was affected; please send an example error for further investigation
      • added on Wed: the trouble appears to have come from the lcg-voms incident after all, which caused a fraction of the SAM job submissions to be refused by some of the WMS nodes at CERN (the ones running gLite 3.1); since such errors are not critical, other sites may not have noticed them; furthermore, the other VOMS server (voms.cern.ch) was OK and on average it is used half of the time, so other sites may just have been "lucky"
      • added on Thu: the SAM team actually did see critical errors for other tests, but the impact was minor and they will correct the availability and reliability numbers of the sites that were affected
  • KIT - ntr
  • NDGF
    • the SE at the Estonia T2 for CMS is getting bombarded with transfer requests from the FTS at FNAL, essentially crippling that SE
      • Oliver: we will look into that
  • NLT1 - ntr
  • OSG - ntr
  • PIC
    • The batch scheduler (maui) had a problem yesterday, triggered by the reconfiguration of some WNs with a special setup to be able to certify CVMFS for ATLAS (NFS with write permission is still needed)
    • More or less at the same time the scheduler was fixed (21 Nov 14:00 CET), we saw a big increase in the number of files that new ATLAS jobs were requesting from dCache. This created problems in a number of pools where movers started to be queued. Several transfers and jobs failed because of this during yesterday evening (GGUS:76705); it was fixed this morning by raising the maximum number of active movers in the pools. Worth noting that the number of files being requested is still much larger than usual.
  • RAL
    • there was an Oracle problem last night for 1 h starting at 22:00 and affecting CASTOR for ATLAS and LHCb

  • CASTOR/EOS
    • around 11:00 CET castor-public was shown as being unavailable due to a DB update triggered by an SRM update
  • dashboards - ntr
  • databases
    • today's migration of the integration DBs went OK
    • on Thu the ATLAS Archive DB will be moved to Oracle 11g
  • GGUS/SNOW - ntr
  • grid services - ntr

AOB:

Wednesday

Attendance: local(Alexei, Edoardo, Maarten, Manuel, Maria D, Mike, Oliver);remote(Alexander, Gonzalo, Jhen-Wei, John, Kyle, Lisa, Michael, Paolo, Rolf, Ulf, Xavier).

Experiments round table:

  • ATLAS reports -
    • Pb-Pb running proceeding
    • GGUS ticket to CERN : GGUS:76747
    • some issues with T2s in DE, NL and US clouds

  • CMS reports -
    • LHC / CMS detector
      • Heavy Ions collision data taking.
    • CERN / central services
      • [T0TEMP CASTOR pool]: disk servers were moved to the pool yesterday; the remaining ~1k queued transfers, which seem to be stuck d2d (disk-to-disk) requests, are still under investigation. GGUS:76649 (in progress, last update 11/23)
      • [FTS]: increase the CERN-STAR channel in the fts-t2-service to raise rates from CERN to Vanderbilt, GGUS:76712 (in progress, last update 11/23)
        • Alexei: you transfer directly to Vanderbilt?
        • Oliver: yes, because they are funded for HI data processing
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. Iperf tests are ongoing with some US sites. GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; problems in the way RENATER/GEANT(?) treats IP packets tagged LBE. The default of tagging all packets from our file servers as LBE was changed and improvements are visible. GGUS:75983 (in progress, last update 11/21), GGUS:71864 (in progress, last updated 11/21)
        • Rolf: we will look at the new information; we still see an issue for the connectivity to Tokyo; we also noticed Czech and US sites heavily loading our PerfSonar setup, which may have influenced our conclusions; we need to repeat certain tests to ensure there is no overload due to that traffic
        • Oliver: does CMS need to do something?
        • Rolf: our network experts need to look further into it first
      • [T1_FR_CCIN2P3]: CREAM CE for BQS will be drained and not used for production in CMS anymore GGUS:75829 (in progress, last update 11/17)
        • underlying problem with proxy propagation was traced back to problem in sudoers configuration GGUS:76208 (solved)
      • [T1_IT_CNAF]: CREAM again reporting dead jobs @ CNAF T1 as REALLY-RUNNING. Workaround not yet released from EMI, jobs are running again, GGUS:76597 (in progress, last update 11/22)
      • [T1_IT_CNAF]: JobRobot problems, "BLAH_FAI BLAH error: submission command failed (exit code = 1)", GGUS:76739 (in progress, last update 11/23)
      • [T1_DE_KIT]: JobRobot problems, solved: GGUS:76706 (SOLVED, last update 11/23)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

  • LHCb reports -
    • Experiment activities
      • Low activity. Integration of a new framework in the production system
    • T1
      • CERN : alarm about high READ activity for the LFC. Can someone give us an explanation? We have no clue what the issue is.
      • SARA : CVMFS has been running for one day without any issue
      • CNAF : CVMFS has been running for one day, with one issue which has been fixed
    • T2

Sites / Services round table:

  • ASGC - ntr
  • BNL
    • US ATLAS T2 connectivity to KIT has the same problems as reported by CMS; ESnet experts have been contacted to help debug the matter
      • Oliver: is information from CMS desirable?
      • Michael: possibly at some point, but we first intend to look into the connectivity between SLAC and KIT
  • CNAF - ntr
  • FNAL
    • no attendance Thu-Fri because of Thanksgiving holiday
  • IN2P3
    • GGUS:75158 has been updated and is waiting for LHCb input
  • KIT
    • the lack of progress for some of the reported network issues is related to a political matter internal to KIT; more news ASAP
  • NDGF
    • yesterday's problem affecting the Estonia T2 has been fixed: their SE is no longer getting bombarded with transfer requests
    • a fiber maintenance scheduled for today has been canceled; the GOCDB still mentions the downtime, though
  • NLT1 - ntr
  • PIC - ntr
  • OSG
    • no attendance Thu-Fri because of Thanksgiving holiday
  • RAL - ntr

  • dashboards - ntr
  • grid services
    • Some 9k rows in the ATLAS LFC do not have parent file IDs; this is triggering a Lemon monitoring sensor which raises LFC DB related alarms. This matter will be followed up with the ATLAS LFC consolidation experts (a hedged sketch of counting such orphaned entries follows this round table).
  • GGUS/SNOW - ntr
  • networks - ntr
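
Regarding the orphaned LFC entries reported under grid services above, a hedged sketch of how such entries could be counted directly in the Oracle database; the table and column names follow the standard LFC name-server schema but should be verified, the credentials and DSN are placeholders, and the exact condition used by the Lemon sensor may differ:

    #!/usr/bin/env python3
    # Count LFC name-server entries whose parent directory entry is missing.
    # Schema names (cns_file_metadata, fileid, parent_fileid) are assumptions
    # based on the standard LFC schema; the root entry and NULL parents may
    # need special-casing depending on what the sensor actually checks.
    import cx_Oracle

    QUERY = """
        SELECT COUNT(*)
        FROM   cns_file_metadata c
        WHERE  NOT EXISTS (SELECT 1
                           FROM   cns_file_metadata p
                           WHERE  p.fileid = c.parent_fileid)
    """

    conn = cx_Oracle.connect("lfc_reader", "changeme", "LFC_DB_ALIAS")
    cur = conn.cursor()
    cur.execute(QUERY)
    print("entries with a missing parent:", cur.fetchone()[0])
    conn.close()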

AOB:

Thursday

Attendance: local(Alex, Alexei, David, John, Maarten, Manuel, Marcin, Maria D, Oliver);remote(Gareth, Giovanni, Jhen-Wei, Joel, Rolf, Ronald, Ulf).

Experiments round table:

  • ATLAS reports -
    • Pb-Pb running proceeding
    • Nov 23 ~21:45 - 23:30 ADCR problem.
      • disk failure, some records are corrupted, the issue was fixed by CERN-IT DB team
        • Marcin: 1 incident leading to another caused multiple disks to fail and the DB to become inaccessible; we moved it to the standby HW, but ran into a corrupted block affecting 8 rows in 1 PanDA table; 90% of the DB was back after 1 h; after 2 h we decided to stop trying to recover the bad rows and we made the rest of the DB available as well; ATLAS DB experts are fixing those rows now
        • Maarten: the cause of the failures will be looked into?
        • Marcin: the use of the HW is very I/O intensive and we already had another such incident this year, it will happen occasionally; in January we will move the DBs to new HW, in particular using NAS instead of SAN, which should make the DB operation more robust
    • LFC problem (prod-lfc-atlas.cern.ch is down). Fixed. Alarm ticket : GGUS:76770
      • Marcin: we had overlooked that the LFC was not configured on the standby HW; now it is OK
    • ADCR DB high access rate between 10:30am and 11:30am. Fixed by Luca (matview refresh issue)
      • Marcin: this issue was due to the corrupted rows
    • CERN-PROD groupdisk area CASTOR to EOS migration
    • T2 issues
    • David: ASGC had LCG-CE and CREAM test failures yesterday, we notified Alessandro (responsible for ATLAS SAM tests); now the CEs look OK

  • CMS reports -
    • LHC / CMS detector
      • Heavy Ions collision data taking.
      • Tomorrow, HeavyIon source regeneration (6-30 hours)
    • CERN / central services
      • [T0TEMP CASTOR pool]: disk servers were moved to the pool yesterday; the remaining ~1k queued transfers seemed to be stuck d2d, but the queue cleared overnight. Should the ticket be closed? GGUS:76649 (in progress, last update 11/23)
        • John: developers are looking into why it happened, but the ticket can already be closed
      • [CERN SRM]: srm-cms.cern.ch availability had some holes yesterday, any explanation? (see SLS)
        • John: we will check
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. IPerf tests ongoing with some US-sites, seems to be mostly political issues, GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements are visible, but not enough. IN2P3 asked to start low-level investigations with iperf; due to Thanksgiving we can only follow up on Monday. GGUS:75983 (in progress, last update 11/23), GGUS:71864 (in progress, last updated 11/23)
      • [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. The workaround is in the queue for the next EMI release; CNAF was asked to also clear the REALLY-RUNNING jobs left for debugging. This does not hurt processing right now; we will clean up on Monday due to Thanksgiving. GGUS:76597 (in progress, last update 11/23)
      • [T1_IT_CNAF]: JobRobot problems, "BLAH_FAI BLAH error: submission command failed (exit code = 1)", solved by restarting nscd, GGUS:76739 (SOLVED, last update 11/23)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR

  • LHCb reports -
    • Experiment activities
      • Low activity. Integration of a new framework in the production system
    • T1
      • CERN : alarm about high READ activity for the LFC. After investigation we do not understand why the plots do not reflect the real activity.
        • Manuel: we have started looking into the matter and are in contact with Jean-Philippe
    • T2

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • IN2P3 - ntr
  • NDGF - ntr
  • NLT1 - ntr
  • RAL - ntr

  • CASTOR/EOS
    • default CASTOR pool for LHCb is fairly unusable due to 6k disk-to-disk transfers swamping the queue (a generic client-side throttling sketch follows this round table)
      • Joel: we will discuss this with the user who originated those requests, extra HW should not be added to that pool
  • dashboards - nta
  • databases
    • ATLAS Archive DB has been moved to Oracle 11g but not to new HW, it will stay on the current HW
  • GGUS/SNOW - ntr
  • grid services - ntr
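
Regarding the LHCb pool swamped by disk-to-disk transfers above, a generic client-side sketch of bounding the number of concurrent copy requests so that bulk activity does not overwhelm a shared pool; the xrdcp command and the file list are placeholders for whatever tool the user actually employs:

    #!/usr/bin/env python3
    # Bound the number of concurrent copy requests issued against a shared pool,
    # so a bulk recall/copy does not swamp it. The xrdcp command and the file
    # list are placeholders for whatever client tool is actually used.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor, as_completed

    MAX_PARALLEL = 20   # keep well below what the pool can sustain

    def copy_one(src, dst):
        proc = subprocess.run(["xrdcp", "-s", src, dst],
                              capture_output=True, text=True)
        return src, proc.returncode

    def bulk_copy(pairs):
        failures = []
        with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
            futures = [pool.submit(copy_one, s, d) for s, d in pairs]
            for fut in as_completed(futures):
                src, rc = fut.result()
                if rc != 0:
                    failures.append(src)
        return failures

    if __name__ == "__main__":
        # replace with the real source/destination list
        todo = [("root://example.org//store/file%d" % i, "/tmp/file%d" % i)
                for i in range(100)]
        print("failed:", bulk_copy(todo))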

AOB:

Friday

Attendance: local(Alexei, David, Eva, Jamie, John, Maarten, Manuel);remote(Alessandro, Giovanni, Jhen-Wei, Joel, John, Onno, Roger, Rolf, Stephen, Xavier).

Experiments round table:

  • ATLAS reports -
    • Pb-Pb running proceeding
    • CERN-PROD groupdisk area CASTOR to EOS migration is postponed until Monday
    • false security alarm (many thanks to CERN IT security team for help)
    • issue with the python version or dq2.cfg on one of the VO boxes (production jobs are affected)
    • ADCR high load (most probably due to aggressive deletion service)
    • David: LCG-CE and CREAM tests for INFN-T1 have been failing with timeouts since Nov 18; this was first thought to be related to the ongoing matter of GGUS:76597 opened by CMS, but now appears to be a separate issue after all
      • Alessandro: we have asked Lorenzo Rinaldi (ATLAS contact at CNAF) to look into the matter

  • CMS reports -
    • LHC / CMS detector
      • Heavy Ions collision data taking as long as fill can be kept going
      • HeavyIon source regeneration (6-30 hours)
    • CERN / central services
      • [CERN SRM]: srm-cms.cern.ch availability had some holes yesterday, under investigation https://sls.cern.ch/sls/service.php?id=CASTOR-SRM_CMS
        • John: 1 of the head nodes was down, which only affected the SLS display
      • [CERN/Vanderbilt FTS]: create a dedicated channel; apologies for the earlier competing request to increase the STAR channel; the experts determined that it would be better to have a dedicated channel. GGUS:76783
        • Manuel: the ticket is in progress
        • added after the meeting: planned for Monday
      • [Nagios/SAM]: mismatch between sam-bdii and lcg-bdii (some sites are in lcg-bdii, but not in sam-bdii), INC:082162 (a minimal BDII comparison sketch follows this report)
        • Manuel: we are trying to collect some evidence that would help debugging this problem; we will restore sam-bdii later this afternoon (done)
    • T0
      • Running HI express and prompt reconstruction.
    • T1 sites:
      • MC production and/or reprocessing running at all sites.
      • Run2011 data reprocessing.
      • [T1_DE_KIT]: Reduced throughput of transfers from KIT to US T2s. IPerf tests ongoing with some US-sites, seems to be mostly political issues, GGUS:75985 (in progress, last update 11/21)
      • [T1_FR_CCIN2P3]: Reduced throughput of transfers from CCIN2P3; improvements are visible, but not enough. IN2P3 asked to start low-level investigations with iperf; due to Thanksgiving we can only follow up on Monday. GGUS:75983 (in progress, last update 11/23), GGUS:71864 (in progress, last updated 11/23)
      • [T1_IT_CNAF]: CREAM reporting dead jobs @ CNAF T1 as REALLY-RUNNING. The workaround is in the queue for the next EMI release; CNAF was asked to also clear the REALLY-RUNNING jobs left for debugging. This does not hurt processing right now; we will clean up on Monday due to Thanksgiving. GGUS:76597 (in progress, last update 11/23)
    • T2 sites:
      • NTR (at least, relevant for this meeting)
    • Other:
      • NTR
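
Regarding the sam-bdii/lcg-bdii mismatch above (INC:082162), a minimal sketch of comparing the site lists published by two top-level BDIIs, assuming the python ldap3 package, anonymous LDAP access on port 2170 with base "o=grid" (standard for a top-level BDII), and that the full host names are sam-bdii.cern.ch and lcg-bdii.cern.ch:

    #!/usr/bin/env python3
    # Compare the sites published by two top-level BDIIs (Glue 1.3 schema).
    # Assumes the ldap3 package, anonymous access on port 2170, base "o=grid";
    # the .cern.ch host names are assumptions based on the aliases in the report.
    from ldap3 import Connection, Server

    def sites(host):
        conn = Connection(Server(host, port=2170), auto_bind=True)
        conn.search("o=grid", "(objectClass=GlueSite)",
                    attributes=["GlueSiteUniqueID"])
        return {entry.GlueSiteUniqueID.value for entry in conn.entries}

    lcg = sites("lcg-bdii.cern.ch")
    sam = sites("sam-bdii.cern.ch")
    print("in lcg-bdii but not in sam-bdii:", sorted(lcg - sam))
    print("in sam-bdii but not in lcg-bdii:", sorted(sam - lcg))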

  • LHCb reports -
    • Experiment activities
      • Low activity. Integration of a new framework in the production system
    • T0
      • LFC SLS alarm : some queries are taking longer than expected (perhaps a problem with the Oracle RAC); under investigation with the DB group
      • Lemon shows 14 concurrent instances while in reality there are 1 or 2; IT is looking at this issue as well
      • the concurrent query limit was set to 50 in the old setup and it should be higher today with the new hardware
    • T2

Sites / Services round table:

  • ASGC
    • we are looking into the LCG-CE and CREAM test failures for ATLAS reported yesterday
  • CNAF - ntr
  • IN2P3
    • our batch system migration from BQS to SGE has been completed today
      • Maarten: congratulations!
  • KIT - ntr
  • NDGF
    • our mail server has a problem and is being replaced today
  • NLT1 - ntr
  • RAL - ntr

  • CASTOR/EOS - nta
  • dashboards - nta
  • databases - ntr
  • grid services - nta

AOB:

  • Alexei: what is the status of the 9k orphaned LFC entries reported on Wed?
  • Manuel: they have been fixed early afternoon today
  • Alexei: did they cause any performance problems?
  • Manuel: no, just warnings from the Lemon sensor

-- JamieShiers - 31-Oct-2011
