WorkflowTeamMeeting20131126 (2013-11-26, JenniferAdelmanMcCarthy)
https://indico.cern.ch/conferenceDisplay.py?confId=254682

---++++ Attending

Adli, Julian - CERN

John, Luis, Jen, SeangChan - FNAL

Congratulations to Dave on his new baby. We are still waiting for cute baby photos.

Andrew

---++++ Personnel
   * Coming off shift: Xavier
   * Coming on shift: Sunil
   * List of everyone's holidays? (CERN closeout)
      * US will be having the Thanksgiving holiday Thursday-Sunday
      * Jen will be pulling "best effort" days Wed-Sunday. In other words, I'll log in, run the closeout script and make sure the machines aren't on fire during US/Asia shifts, but will not be spending much time tracking down issues unless they cannot be ignored.
      * SeangChan will be taking off completely Wed-Sun
      * Julian's holidays: Dec 27th to 29th and Jan 7th to 11th
      * Dec 23-Jan 1: Xavier and Sunil will be on shift but working from home
      * CERN closed Dec 22-Jan 6
      * Luis will be gone Dec 11-13; Dec 16-20 Luis will be working remotely

---++++ Issues
   * Dashboard is being unreliable this week - Julian will make sure Sunil understands the Dashboard issues we are having so we are not relying on it for debugging
      * Time plots: the number of jobs is unreliable; we are running 60K jobs but the dashboard plots say 120K jobs
      * Efficiency plots - John will look into why we keep going green/yellow/green
   * Couch issues on 201, 216: couch keeps going down, we need to keep a close eye on it

---++++ Agents
   * vocms201: Issues with couch. This is becoming usual under heavy load.
   * vocms235: Sandbox problem solved. <a target="_top" href="https://cmslogbook.cern.ch/elog/Workflow+processing/11386">11386</a> fwjr with missing task field, manually added
   * vocms85: Workflows stuck in Acquired. Oracle connection problem, solved. <a target="_top" href="https://cmslogbook.cern.ch/elog/Workflow+processing/11359">11359</a>
   * Stuck MonteCarlos: Why are so many not reaching "complete"?
      * The acquired WFs were waiting for resources
      * 2-3 WFs have many queued jobs and everything is piled up behind them
      * When you make ACDCs they need to have higher priority so that they go in
      * Blocks were not being closed in DBS. Problem identified. There will be a procedure change
      * Procedure changes already discussed; we need to make sure the twiki is updated
      * When a workflow is force completed, its ACDCs also need to be force completed
      * Julian will update the twikis; Jen or Sunil will test

---++++ <a name="Site_Issues_that_affected_workfl"></a>Site Issues that affected workflows
   * Xavier - good job going through the sites with issues and working with the site support team!
   * We need to get the EU operators working closer with the EU site support team. Right now the EU site support team is still on a steep learning curve, but we aren't going to get them ramped up unless we get them working!
   * T2_US_UCSD: <a target="_top" href="https://savannah.cern.ch/support/index.php?140677">https://savannah.cern.ch/support/index.php?140677</a>
   * T2_EE_Estonia: <a target="_top" href="http://savannah.cern.ch/support/?140658">http://savannah.cern.ch/support/?140658</a>
   * T2_DE_RWTH: <a target="_top" href="http://savannah.cern.ch/support/?140734">http://savannah.cern.ch/support/?140734</a>, CLOSED
   * T1_FI_HIP: <a target="_top" href="https://savannah.cern.ch/support/?140733">https://savannah.cern.ch/support/?140733</a>
   * T2_IT_BARI: <a target="_top" href="http://savannah.cern.ch/support/?140875">http://savannah.cern.ch/support/?140875</a>
   * T2_PT_NCG_Lisbon: <a target="_top" href="https://savannah.cern.ch/support/?140775">https://savannah.cern.ch/support/?140775</a>
   * T1_US_FNAL
   * T1_RU_JINR: Can be used as a worker node, but not as a custodial site for datasets. (Almost like a T2)
   * T2_TH_CUNSTDA (80 slots) - solved links and HC (91%) - SAM problems
   * T2_BR_UERJ (200 slots) - new SE - fixing PhEDEx links and SAM availability
   * Pledges view - update: there are sites with n/a - automatic? Talk with Julian
      * https://dashb-ssb.cern.ch/dashboard/request.py/siteview?view=site_readiness#currentView=Pledges&highlight=true
   * Script to automatically change drain list with WR list in SSB

---++++ Workflows

---+++++ <a name="Monte_Carlo"></a>Monte Carlo
   * Issues when getting PhEDEx subscriptions.
   * Workflows stuck with no failure info, BTV-Fall13: <a target="_top" href="https://cmslogbook.cern.ch/elog/Workflow+processing/11371">https://cmslogbook.cern.ch/elog/Workflow+processing/11371</a>. Probably filter issues.
      * This week's task is to collect logs.
   * Failing ACDCs
   * WFs announced with ACDCs running: <a target="_top" href="https://cmslogbook.cern.ch/elog/Workflow+processing/11061">https://cmslogbook.cern.ch/elog/Workflow+processing/11061</a>

---+++++ ReDigi
   * Config issues causing 100% failure rates: https://cmslogbook.cern.ch/elog/Workflow+processing/11411

---+++++ Reprocessing
   * Almost all done: <a target="_top" href="https://cmslogbook.cern.ch/elog/Workflow+processing/11390">https://cmslogbook.cern.ch/elog/Workflow+processing/11390</a>

---+++++ Workload Summary/Problems in the config file
   * Workload Summary in config file pointing to local couch. <a target="_top" href="https://cmslogbook.cern.ch/elog/Workflow+processing/11357">11357</a>
   * https://cmslogbook.cern.ch/elog/Workflow+processing/11418
   * Basic summary of issue:
      * There are duplicate configs (one of which is not used)
      * https://twiki.cern.ch/twiki/bin/viewauth/CMS/WMAgentDeployment - needs to be updated (see the elog above)
      * SeangChan/Luis need to clean up WMAgentConfig.py, wmagent-mod-config and the deployment script.
      * Currently the instructions for deployment are to copy the old config file. Why don't we have a good config file in the agent so we are not copying things around? This seems dangerous.

---+++++ AOB
   * Dashboard Alarms (Adli and Julian working on it) 40%
   * Site status script, tested on vocms201. Successful for now; this week is the migration for vocms216 and vocms85.
   * Emails to the workflow HN: is the TaskChain working properly? Luis has in fact read the emails but hasn't replied or looked into it yet. All jobs are running over the same event.
   * Are we doing DQM harvesting on ACDCs?

-- Main.JenniferAdelmanMcCarthy - 25 Nov 2013
Topic revision: r7 - 2013-11-26 - JenniferAdelmanMcCarthy