WorkflowTeamMeeting20140529 (2014-05-29, JenniferAdelmanMcCarthy)
---+!! Workflow Team Meeting - May 29
%TOC{depth="3" title="Contents:"}%

---++ [[https://indico.cern.ch/event/322191/][Vidyo Link]]

---++ Attending
   * FNAL - Jen, Luis, Dave, Oli, SeangChan, John
   * CERN - Julian, Andrew L

---++ Personnel
| May 22 -> May 29 | Xavier |
| May 29 -> June 5 | Adli |
   * Please fill in vacation plans for July/August for the CSA14 campaign: https://twiki.cern.ch/twiki/bin/viewauth/CMS/CompOpsPlanningVacationSummer2014
      * (Or tell Oli | CW | CP)
   * Jen will be taking off June 13

---++ News
   * ACDC couch views need to be rebuilt
      * Projected for this weekend still?
         * Friday - maybe. We will be informed when this happens and when we are clear again.
      * What are the possible implications? What should we look out for?
         * We cannot create ACDCs while it happens.
      * SeangChan will delete the old ACDCs whose original WF has been announced, see how much that buys us, and then they will recreate the view
   * Need people to look into timeouts of merge jobs. Merge jobs should be short, so there is no reason they should be timing out. We have been blaming "network issues", but is there anything else that has changed over the last couple of weeks that could be causing this problem? Jen is seeing it a LOT in Redigi and Luis is noting issues in StoreResults as well.
      * I (Andrew Levin) created tickets about this: https://ggus.eu/index.php?mode=ticket_info&ticket_id=105836 and https://ggus.eu/index.php?mode=ticket_info&ticket_id=105821

---++ Site issues
   * Who put Caltech in drain? When you put a site in drain you must e-log and file tickets. They have 4k idle nodes and nobody knows why they are in drain.
      * They were having some file errors, I think Sara opened a ticket. https://cmslogbook.cern.ch/elog/Workflow+processing/14591
   * Drain list, sites ready to move out:
<pre>
 * 3 other sites : Caltech, ASGC, T2_Belgium_UCL
Caltech
ASGC
UCL
</pre>

---++ Xavier / Sara's Notes

---++ Agent Issues
   * 201 and 85 still in drain for upgrades - how are we doing in updating our documentation on drain issues?
      * Guys, I need a green light to redeploy vocms201 and vocms85: https://cmslogbook.cern.ch/elog/Workflow+processing/14770
      * I will redeploy them on Friday unless someone tells me not to. --> Workflows with missing information. --> More work for everyone.
   * SeangChan would like to move to couch 1.5 for better stability. Fixes Cert problem as well.
   * ErrorHandler is crashing a lot, hence the need for the ACDC view rebuild.

---++ Workflow issues

---+++ Store Results
   * Jen, Julian and Luis had a meeting last Friday to discuss handover of Store Results. We will have another meeting Fri May 30, 5 CERN time:
      * https://indico.cern.ch/event/322193/
   * Turns out that Store Results is having the same issue with merge timeouts as Redigi is. Luis reported that WFs he ran with no issues several weeks ago are now having timeout issues, and was going to investigate further. Luis, do you have an update?

---+++ MonteCarlo
   * Recovering a lot of old workflows, some of them are really huge (100K jobs or more) and last a while.
   * I need to be able to extend workflows, who can help me debug this? https://github.com/dmwm/WMCore/issues/5148

---+++ Redigi/Rereco
   * Working my way through the list of WFs in complete. Most of them are due to timeouts. At FNAL we were blaming the timeouts on network issues, but I am seeing them across the board. We need to figure this out, it's killing us in latency to have to make 2-4 ACDCs per workflow to get everything through.
      * Dave will post to Comp-ops to have the other T1s look at their network issues

---+++ RelVal
   * RelVal workflow assignment - Andrew's page
      * FWIW I'm agreeing with Dave, this sounds dangerous. We all know requestors can put "stupid things in" that could really break things badly. Having a bit of a buffer in there may slow things down, but fast is not always best.
      * What if we have an approval requirement, like we do for transfer requests?
      * Agreed that this isn't going to work.
   * Dave, please do not move relval data
   * All logcollect jobs are still failing at FNAL: https://cmslogbook.cern.ch/elog/Workflow+processing/14030 and https://github.com/dmwm/WMCore/issues/5076
      * Patch is in github but it hasn't been applied in the agent yet.
      * Patches need to be applied to the relval agent. Julian will patch them tomorrow.

---++ AOB
   * Closeout procedure and what are we doing with the MSS subscription
      * Now that we are saving more stuff at CERN it is increasing our latency
      * change the code so that the subscription has been made - Julian will change the code
   * SL6 - where do we stand - so far we've made the RPMs so we can deploy at CERN and FNAL machines, dependencies have been solved, Krista is getting us a machine
      * Get a bunch of SL6 machines, give it a team of "global" and start moving to it over the course of the next few months. Most of the worker nodes are SL6, we just need to move the agents. We need to get Condor functioning at FNAL. At CERN they need to connect to the condor pool
      * SeangChan will attend the Burt and Oli show today so we can discuss it with Krista.

-- Main.JenniferAdelmanMcCarthy - 28 May 2014
Topic revision: r7 - 2014-05-29 - JenniferAdelmanMcCarthy
Copyright © 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.