---+ Week of 110418

%TOC%

---++ Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
   1. Dial +41227676000 (Main) and enter access code 0119168, or
   1. To have the system call you, click [[https://audioconf.cern.ch/call/0119168][here]]
   1. The SCOD rota for the next few weeks is at ScodRota

---++ WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

| *VO Summaries of Site Usability* ||||*SIRs, Open Issues & Broadcasts*||| *Change assessments* |
| [[http://dashb-alice-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=101&sites=CERN-PROD&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&algoId=6&timeRange=lastWeek][ALICE]] | [[http://dashb-atlas-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=403&sites=CERN-PROD&sites=BNL-ATLAS&sites=FZK-LCG2&sites=IN2P3-CC&sites=INFN-T1&sites=NDGF-T1&sites=NIKHEF-ELPROD&sites=RAL-LCG2&sites=SARA-MATRIX&sites=TRIUMF-LCG2&sites=Taiwan-LCG2&sites=pic&algoId=161&timeRange=lastWeek][ATLAS]] | [[http://dashb-cms-sam.cern.ch/dashboard/request.py/historicalsiteavailability?siteSelect3=T1T0&sites=T0_CH_CERN&sites=T1_DE_KIT&sites=T1_ES_PIC&sites=T1_FR_CCIN2P3&sites=T1_IT_CNAF&sites=T1_TW_ASGC&sites=T1_UK_RAL&sites=T1_US_FNAL&timeRange=lastWeek][CMS]] | [[http://dashb-lhcb-sam.cern.ch/dashboard/request.py/historicalsiteavailability?mode=siteavl&siteSelect3=501&sites=LCG.CERN.ch&sites=LCG.CNAF.it&sites=LCG.GRIDKA.de&sites=LCG.IN2P3.fr&sites=LCG.NIKHEF.nl&sites=LCG.PIC.es&sites=LCG.RAL.uk&sites=LCG.SARA.nl&algoId=82&timeRange=lastWeek][LHCb]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGServiceIncidents][WLCG Service Incident Reports]] | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGOpenIssues][WLCG Service Open Issues]] | [[https://cic.gridops.org/index.php?section=roc&page=broadcastretrievalD][Broadcast archive]] | [[https://twiki.cern.ch/twiki/bin/view/CASTORService/CastorChanges][CASTOR Change Assessments]] |

---++ General Information

| *General Information* ||||| *GGUS Information* | *LHC Machine Information* |
| [[http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/][CERN IT status board]] | M/W PPSCoordinationWorkLog | [[https://twiki.cern.ch/twiki/bin/view/LCG/WLCGBaselineVersions][WLCG Baseline Versions]] | [[http://cern.ch/planet-wlcg][WLCG Blogs]] | | GgusInformation | [[https://espace.cern.ch/be-dep-op-lhc-machine-operation/default.aspx][Sharepoint site]] - [[http://lhc.web.cern.ch/lhc/][Cooldown Status]] - [[http://lhc.web.cern.ch/lhc/News.htm][News]] |

<HR>

---++ Monday

Attendance: local(Steven, Fernando, Alessandro, Jamie, Maria, Dan, Maarten, Dirk, Massimo, Zbyszek, Peter, David, MariaDZ); remote(Kyle, Onno, Gareth, Rolf, Federico, Michael, Gonzalo, Jon, Jhen-Wei, Andreas, Giovanni).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * ATLAS
         * Stable beams over the weekend, with 336 bunches.
         * Physics run, 336b, today.
      * GGUS
         * GGUS down Monday morning (Oracle connection problems). GGUS team contacted.
      * T1s
         * NDGF-T1 partial tape system downtime continued over the weekend.
         * ASGC failing all FTS transfers on Friday evening (CERN and ASGC FTS servers). Alarm ticket sent, GGUS:69743.
         * RAL batch system opened to 200 jobs on Saturday, allowing fast data11 reprocessing to finish. SRM stable over the weekend.
   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC / CMS detector
         * Good stable fill with 336 bunches on Sunday (unfortunately CMS ran without a fraction of the ECAL). Plan is to repeat the run today.
      * CERN / central services
         * Outage of LCGCVS (hosting the CMSSW code) during the morning, see http://it-support-servicestatus.web.cern.ch/it-support-servicestatus/ScheduledInterventionsArchive/110418-CVS.htm. CMS identified small issues that are being addressed at the moment by experts (doing a re-sync).
      * Tier-0 / CAF
         * Tier-0 farm was saturated with Sunday's run (pending + running jobs == 2.5k).
         * CAF getting filled with jobs now. [ Steve - loads of "swap full" alarms yesterday ]
      * Tier-1
         * Finishing one "older" data reprocessing request.
         * MC production in progress.
         * WMAgent testing on-going at various sites.
      * Tier-2
         * CMS Tier-2 readiness pretty good lately: e.g. in the last 2 weeks, 78% of sites had a CMS readiness fraction > 80%.
         * MC production and analysis in progress.
         * CMS still has the general issue that CREAM CE SAM tests are not reporting properly to the SSB, hence triggering wrong alarms, see Savannah:113192.
      * Other
         * New CRC-on-Duty from tomorrow on: Ian Fisk.

   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * T0 site
         * Strange behaviour of the WMS nodes used by ALICE led to the discovery of a few thousand jobs "lost and found" by LSF.
      * T1 sites
         * Nothing to report.
      * T2 sites
         * Usual operations.

   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - Data taking over the weekend, no data now. Reconstruction and stripping activities going on now.
      * T0
         * Tape drive allocation has been done. Modification of the space token at CERN (GGUS:69709).
      * T1
         * glexec test failures at IN2P3-CC (GGUS:69793). [ Maarten - the CEs that fail the tests are part of the T2 at IN2P3; the T1 CEs are fine. Will follow up with IN2P3. ]
      * T2
         * Pilots aborted at IN2P3-LPC (GGUS:69751).

Sites / Services round table:

   * NL-T1 - announcement: SARA SRM at risk on Thursday to update certificates.
   * RAL - comment on the problems with the ATLAS SRM Thu-Fri and the much reduced batch over the weekend. Not completely understood, but a workaround is in place. Batch work was increased during the morning - going OK; batch is almost back to normal. AT RISK tomorrow morning for a couple of hours due to a network problem with small breaks in connectivity. Will pause batch just before and drain FTS. Hopefully just a very short stop.
   * BNL - ntr
   * FNAL - ntr
   * PIC - ntr; reminder that there is a scheduled downtime tomorrow for electrical interventions, batch queues currently being drained.
   * IN2P3 - nta
   * NDGF - on Friday we mentioned a planned downtime tomorrow for a scheduled upgrade of the dCache head node. Scheduled until 17:00 but will probably finish earlier. No communication with the tape people regarding the Danish tapes - they look as if they are still out according to the ATLAS report. At noon we had a sudden failure of some network equipment - reading is unavailable from both tape and disk for some ALICE and ATLAS data. No ETA for when it will be fixed.
   * CNAF - the downtime announced last Friday is confirmed for tomorrow 10:00 - 14:00 UTC; the CNAF tape library will not be available.
   * ASGC - ntr
   * OSG - GGUS outage?
   * GGUS - 09:30 KIT unscheduled network downtime. Was this eventually published? Back available at 10:30.
   * CERN Storage - case of 1 file (raw data) late to tape seen by LHCb, fixed on the spot. A few files still hanging around in a strange status.
   * CERN DB - first upgrade to 11g on a TEST DB; all went smoothly. We will perform the upgrade of production in the same fashion at the end of the year. One incident with replication to T1s: due to a deadlock of Streams processes, replication was stuck for ATLAS conditions and LHCb conditions+LFC. Problem with the Oracle s/w; from time to time it happens... LFC and LHCb conditions were affected for 30'. ATLAS took a bit longer (about 2h) but all was up by 12:00.

AOB:

---++ Tuesday

Attendance: local(Fernando, Maria, Jamie, Maarten, Eva, Nilo, David, Alessandro, Andrea V, Dan, Massimo, MariaDZ); remote(Gonzalo, John, Jon, Ronald, Rolf, Felix, Rob, Federico, Giovanni, Andreas, Dimitri).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * ATLAS
         * Physics with 336b - 72b/injection but moderate intensities.
         * Calibration during the inter-fill period, then start physics run at injection (data11_7TeV).
      * T1s
         * RAL scheduled a downtime between 10:00 and 12:00 CERN time for a network interruption. The downtime was expected to be short, so the UK cloud was not excluded from DDM. However, during the interruption internal network problems appeared and as a consequence FTS is still not available - DDM Site Services are not able to contact the RAL FTS.
         * PIC: Outage of all services at PIC due to the yearly electrical maintenance. Panda queues for the ES cloud were set to offline on Monday around 6:00pm in order to drain. Later, at 1:00am, the cloud was set offline in DDM.
         * INFN-T1: Scheduled downtime at INFN-T1 between 10am and 6:00pm for a tape library upgrade. All data on tape will be unavailable, but data stored on disk will be available during the intervention. According to the accounting for INFN-T1 there is plenty of free space on the tape buffer, so INFN-T1 has not been excluded from Santa Claus.

   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC / CMS detector
         * Good filling.
      * CERN / central services
         * CMS CVS repository moved from the current LCGCVS service to the CERN CVS service on Monday 18th April, see announcement here. CMS identified small issues that are being addressed at the moment by experts (doing a re-sync).
      * Tier-0 / CAF
         * Tier-0 farm
         * CAF getting filled with jobs now.
      * Tier-1
         * Will hopefully launch reprocessing of 2010 data on Wednesday/Thursday.
         * MC production in progress. A number of open tickets about tape families for custodial MC (RAL, PIC, CCIN2P3).
      * Tier-2
         * MC production and analysis in progress.
      * Other
         * New CRC-on-Duty: Ian Fisk.

   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * T0 site
         * WMS behaviour back to normal after the "lost and found" jobs were cleaned up yesterday.
      * T1 sites
         * Nothing to report.
      * T2 sites
         * Usual operations.

   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - Data taking over the weekend, no data now. Reconstruction and stripping activities going on now. Hitting memory limits due to application problems: a temporary solution was found, special thanks to IN2P3 and CNAF for letting us complete.
      * T0
         * New space tokens have been set up - thanks to CERN! [ Massimo - the installation of the hardware required for this operation finished yesterday, hope the space will be available soon. ]
      * T1
         * Two GGUS tickets opened (GGUS:69812, GGUS:69813) and closed quickly by RAL and IN2P3 to increase memory there. Thanks for increasing the memory to run our stripping over the next days.
         * GGUS:69827 Slowness with FTS transfers at Lyon. Channel IN2P3-IN2P3 was closed.
Sites / Services round table:

   * PIC - ntr; scheduled downtime on-going.
   * RAL - FTS has just started again, about 20' ago. Had a switch that was misbehaving and had to be reconfigured. After this some ports were not working - it should now be OK. Roughly 30K ALICE jobs queued at RAL; Gareth is sending e-mail to ALICE asking them to look into it. [ Maarten - will do. ]
   * FNAL - ntr
   * NL-T1 - tomorrow 10:00 - 11:00 local time SARA storage at risk. The tape back-end machines need a reboot. Reading from tape will not work during that time, writing is OK.
   * IN2P3 - ntr
   * ASGC - ntr
   * CNAF - tape library downtime on-going; 14:00 UTC is the scheduled end of this downtime. Nothing else to report. [ Alessandro - it is very difficult for the experiment to verify that all is OK for tape libraries. Please can you send a confirmation when the intervention is over, e.g. to the cloud support team in IT. ]
   * NDGF - today's dCache upgrade worked fine, finished at noon. Network problem yesterday due to equipment breaking, fixed yesterday afternoon. The Danish tape libraries should have been online since late last night; can't confirm personally. ATLAS will look at it.
   * KIT - ntr
   * OSG - looks like the downtime for GGUS has been added to the AOB - this is what we still needed. Trying to understand whether this will have affected ticket exchange; pretty unlikely, as it was early morning OSG time. Will look at this time period to make sure all messages were transferred as needed.
   * CERN Storage - on top of what was mentioned for LHCb, a strange alarm on the ALICE side; a bug on our side only affecting probes, not the service. Have two files with problems for CMS - following up. A very strange, unusual case.

AOB:

   * (MariaDZ) Answer to OSG's question yesterday: the KIT network problems on Monday 2011/04/18 started around 9:15am CEST. The end-of-downtime mail reached the GGUS developers at 10:23. Not sure GGUS was completely unavailable during all of that period (~1 hour).
   * CNAF - question on the LCG CE in the calculation of availability - is it possible to exclude it from the calculation? [ Ale - working on this; trying to put an algorithm in place that works with the LCG or CREAM CE, as available and applicable. Not yet finalised. Maarten - the idea is that by end June it should no longer be necessary to have an LCG CE, either for the experiments or for the availability tests. If you have a working LCG CE keep it running for these remaining couple of months! ]

---++ Wednesday

Attendance: local(Ewan, Maarten, Fernando, David, Alessandro, Jan, Maria, Jamie, Edoardo, Luca, MariaDZ); remote(John, Federico, Rob, Ian, Gerard, Jon, Xavier, Jhen-Wei, Michael, Giovanni, Onno).

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * ATLAS
         * Start physics run at injection (data11_7TeV).
         * 16:00-19:00: Calibration period
            * Access for Pixel
            * BCM clock tests
         * Night: Physics (LHC aiming for the next step in intensity, 480 bunches).
      * T1s
         * PIC: Extended downtime until 8pm. The ES cloud continues offline in Panda and DDM and has been excluded temporarily from Santa Claus.
      * Central Services
         * DDM Central Catalogues: Alarms in the evening of Apr 19 due to known problems (load balancing and connections closed by the client throwing IOErrors).

   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC / CMS detector
         * Expecting physics after yesterday's cryo issues.
      * CERN / central services
         * We have a couple of tickets for stuck jobs that we can't kill. We have a repacker job stuck and holding the system. The operations team would like access to bkill -r from the cmsprod account. [ Ewan - CMSPROD should be able to issue this for any jobs run by CMSPROD already. Ian - can I put you in touch with the person in charge? Will get more details and submit a GGUS ticket if necessary. Incident report on the CMS report TWiki. ]
      * Tier-0 / CAF
         * Runs over the weekend working through Prompt Reco. Good utilization, but ramping down now.
      * Tier-1
         * Will hopefully launch reprocessing of 2010 data on Thursday.
         * MC production in progress. A number of open tickets about tape families for custodial MC (RAL, PIC, CCIN2P3).
      * Tier-2
         * MC production and analysis in progress.
      * Other
         * New CRC-on-Duty: Ian Fisk.

   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * T0 site
         * Nothing to report.
      * T1 sites
         * RAL: huge number of waiting jobs due to the CREAM CE often reporting zero waiting jobs instead of the real number, thereby inviting more jobs to be submitted (GGUS:69856); an illustrative sketch of this feedback loop is given after the AOB below. [ John - only just seen that the ticket was reopened. Have (probably) removed the jobs - will put it back to solved. ]
         * KIT: some 4k running jobs do not appear in MonALISA; they might be stuck on something; under investigation.
      * T2 sites
         * Usual operations.

   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] - Low rate of reconstruction and stripping activities going on now.
      * T1
         * Closed ticket GGUS:69827 (IN2P3-IN2P3 FTS transfers).
         * A number of tests - job submission in particular - failing at IN2P3. Seems to be a problem in the way that the CEs are advertised in the BDII.

Sites / Services round table:

   * CNAF - yesterday afternoon's downtime for the tape library is over, without problems.
   * NL-T1 - this morning we had an unscheduled downtime to fix a problem on the tape back-end, so some tape files were not available. Maintenance completed OK - now everything works. Since we had a downtime we did tomorrow's downtime today(!): certificates were updated this morning and tomorrow's downtime for the SARA SRM is cancelled.
   * RAL - nta
   * PIC - yesterday's downtime was extended to today due to big problems in the cooling system. Still a water leak, but we hope to get the green light at 16:00 and it will then take ~4 hours to get everything back online, i.e. by 20:00.
   * FNAL - ntr
   * KIT - ntr
   * ASGC - ntr
   * BNL - ntr
   * OSG - there was a change of the ESNET CRL location yesterday. Resolved and redirections put in place yesterday. Some OSG users with DOE certificates cannot sign on to CERN SSO. Will try to resolve this morning (local time).
   * GridPP - ntr
   * CERN Storage - 30' unavailability for CASTOR ALICE yesterday. S/w bug combined with network trouble (bad switch).

AOB: (MariaDZ)

   * There is a request to send email notification to sites on every ticket update, Savannah:120243. When 'direct site notification' was implemented in July 2008 for Tier0/1 and generalised in January 2009, sites *insisted* they don't want email floods, just to know there is a ticket for them. Should we change now? What do WLCG sites think? Please say now or comment in the ticket.
   * As already announced in https://twiki.cern.ch/twiki/bin/view/LCG/WLCGDailyMeetingsWeek101115#Wednesday (AOB), the email templates for submitting TEAM and ALARM tickets will be decommissioned with the GGUS May release on 2011/05/25.
   * WLCG T1SCM tomorrow: https://indico.cern.ch/conferenceDisplay.py?confId=136139
   * SNOW ticket for the conference phone problem in 513 R-068: https://cern.service-now.com/service-portal/view-incident.do?id=7317d0190a0a8c0800616f65a4eec4c0
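A note on the ALICE/RAL waiting-jobs issue reported above: pilot submitters typically keep a CE queue topped up based on the number of waiting jobs the CE publishes, so a CE that wrongly publishes zero keeps attracting new pilots while the real backlog grows. The sketch below is an illustrative assumption only - the function name, target value and numbers are invented for clarity and this is not the actual AliEn/VOBox submission code.

<verbatim>
# Illustration only (hypothetical logic, not AliEn code): a submitter that tops
# the CE queue up to a fixed target based on the waiting-job count the CE reports.
def pilots_to_submit(reported_waiting_jobs, target_waiting_jobs=50):
    """Return how many new pilots to submit to reach target_waiting_jobs at the CE."""
    return max(0, target_waiting_jobs - reported_waiting_jobs)

# A CE that reports its queue correctly makes the submitter back off:
print(pilots_to_submit(reported_waiting_jobs=50))   # -> 0, no new pilots
# A CE that always reports zero attracts a full batch every cycle, so the real
# (unreported) backlog grows without bound, as seen at RAL (GGUS:69856):
print(pilots_to_submit(reported_waiting_jobs=0))    # -> 50 more pilots each cycle
</verbatim>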
---++ Thursday

Attendance: local();remote().

Experiments round table:

   * ATLAS [[https://twiki.cern.ch/twiki/bin/view/Atlas/ADCOperationsDailyReports][reports]] -
      * ATLAS
         * Calibration period until injection of the new fill.
         * Start physics run at injection (data11_7TeV).
         * LHC planning physics with 480 bunches / 36 bunches per injection and 336 bunches / 72 bunches per injection.
      * T1s
         * PIC: Downtime ended fine and PIC was re-included in the usual production activity.
         * RAL: Many transfer errors (https://gus.fzk.de/ws/ticket_info.php?ticket=69895). RAL declared a short downtime and spotted performance issues with the Oracle database behind the SRM.

   * CMS [[https://twiki.cern.ch/twiki/bin/view/CMS/FacOps_WLCGdailyreports][reports]] -
      * LHC / CMS detector
         * Good run last night. Depending on when the online updates happen, we may be running at high rates throughout the weekend.
      * CERN / central services
         * Wrote directly to an LSF supporter. bkill -r was disabled, but is hopefully being re-enabled. Had the near-dead node that was holding 8 jobs put out of its misery.
      * Tier-0 / CAF
         * A number of jobs in a stuck state.
      * Tier-1
         * Will hopefully launch reprocessing of 2010 data on Thursday.
         * Open tickets about tape families for custodial MC (RAL, PIC, CCIN2P3). Delaying MC now.
         * PIC had PhEDEx down after the maintenance, but it was back this morning.
      * Tier-2
         * MC production and analysis in progress. Reduced effort over the holiday.
      * Other
         * CRC-on-Duty: Ian Fisk.

   * ALICE [[http://alien2.cern.ch/index.php?option=com_content&view=article&id=75&Itemid=129][reports]] -
      * T0 site
         * Nothing to report.
      * T1 sites
         * RAL: yesterday's ticket (GGUS:69856) was solved after the meeting.
         * KIT: jobs missing in MonALISA - problem mostly fixed by upgrading to the latest AliEn version; fewer than 500 such jobs remain, we may kill them at some point.
         * KIT: a configuration error caused the ALICE::FZK::Tape SE to fail write requests; quickly fixed by the KIT admins.
      * T2 sites
         * Usual operations.

   * LHCb [[https://twiki.cern.ch/twiki/bin/view/LHCb/ProductionOperationsWLCGdailyReports][reports]] -
      * RAW data distribution and its FULL reconstruction (magnet polarity UP) is going on without major problems at all T1s.
      * Decided to stop the stripping of reconstructed data with magnet polarity DOWN, having found a problem in the options used. Most likely will restart today with just one input file per job (instead of 3, due to a known memory leak issue at application level).
      * T0
         * Set up two new space tokens for LHCb: LHCb_Disk and LHCb_Tape. Migration of disk servers after Easter.
         * Problems staging files from tape (72 files). The files were requested to be staged yesterday and we would have expected them to be online by now. A GGUS ticket will most likely be filed; Philippe is looking into that.
      * T1
         * PIC: back from the downtime, no major problems to report.
         * RAL: reported 72 files corrupted, trying to recover them from CERN.

Sites / Services round table:

   * *CERN VOMS service* - The certificate for the LHC VOMS services on voms.cern.ch will be updated on Wednesday April 27th. The current version of lcg-vomscerts is 6.4.0 and was released 2 weeks ago. It should certainly be applied to gLite 3.1 WMS and FTS services.

AOB:

---++ Friday

   * *No meeting - CERN closed*

-- Main.JamieShiers - 15-Apr-2011