WorkflowTeamMeeting20131202 (2013-12-03, AndrewLevin)
https://indico.cern.ch/conferenceDisplay.py?confId=254683

---+++ Attendance
   * FNAL: John, SeangChan, Luis, Jen, Dave
   * CERN: Julian, Andrew, Adli

---+++ Personnel
   * Nov 26 --> Dec 3: Sunil
   * Dec 3 --> Dec 10: Sara

---+++ Review of last week's issues
   * 2011 legacy reprocessing:
      * Recovery is very close to complete: we are on the last workflow, waiting for its recovery to finish. Jobs were still running as of Monday night.
   * Where are we with the stuck workflow problems?
      * 40 stuck yesterday, 15 today.
      * Luis restarted some stuck components,
      * DBS component was restarted and that
   * Problem with couch in the MC agents: there is a TaskArchive query that is taking 24 hrs and crashing, it hasn't
      * When can we shut down the agent and hand it over to Dirk? We will give Dirk 216 the next time it crashes we wil
   * Problem with log file access, especially when there is very little time between jobs running and access to logfiles.
      * Suggestion: first store the logarchives on EOS and then on castor.
      * We don't change the physical location, so we can keep track of where they are for 1 week to 1 month. Dave is following up on this issue.
      * How do we give users access to these files? Will they be able to get them themselves, or will the workflow team have to fetch them?
   * T2_CH_CERN_HLT & AI: most likely a site issue; John will look into this. The WFs that are stuck due to this issue are all running on 227, which doesn't read from the drain list, so the fact that it is in drain isn't an issue. It's an actual site issue.
   * low latency agent

---++++ Agent Issues
   * OracleDB issues: same task archiver issue discussed above.
   * Datasets being assigned custodial in PhEDEx to T2_CH_CERN
   * Workflow running closed for long time: priority issue.
   * WorkloadSummary config: issue is closed.
   * vocms235 JobAccountant https://cmslogbook.cern.ch/elog/Workflow+processing/11539 - Luis is working on the info missing from the FWJR and knows how to finish it.

---++++ Site Issues that affected workflows
   * xrootd troubles at Bari and Nebraska
   * T2_TH_CUNSTDA: not ready yet.
   * T2_DE_RWTH http://savannah.cern.ch/support/?140734
   * T1_US_FNAL https://savannah.cern.ch/support/index.php?140959
   * T2_EE_ESTONIA https://savannah.cern.ch/support/index.php?140981
   * T2_US_Caltech https://savannah.cern.ch/support/?140709
   * We have 5 tickets that we need to follow up on: http://cms-project-relval.web.cern.ch/cms-project-relval/savannah/savannah.html - Sara should look at this.
   * T2_UK_Bristol: giving warnings for some jobs on the production view of the dashboard. Cream CE is down (unscheduled downtime). John will see how long they are down and then decide whether they need to be in drain. They should be out of downtime in a couple of hours.

<img alt="" src="https://dl.dropboxusercontent.com/u/188468487/CMS/CompOpsMeeting/131202/WR_Table.png" width="800"/>

---++++ Workflow Issues
   * Drop of running jobs on Tuesday-Wednesday: GlideIn Front-end.
   * Retrieving logs before force-complete.

---++++++ MonteCarlo
   * EXO-Fall13 with merge failures [[https://cmslogbook.cern.ch/elog/Workflow+processing/11474][11474]], Kill-and-clone policy.
   * BTV-Fall13 batch without failing info [[https://cmslogbook.cern.ch/elog/Workflow+processing/11508][11508]]
   * High priority WFs finished: [[https://cmslogbook.cern.ch/elog/Workflow+processing/11428][11428]]
   * Highest priority WF stuck: [[https://cmslogbook.cern.ch/elog/Workflow+processing/11489][11489]]. ACDC completed successfully but still at 34%.

---++++++ Reprocessing
   * franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576 - still running recovery; the rest of the ReReco WFs are done.
   * pdmvserv_SMP-Fall11R4-00002_T1_US_FNAL_MSS_00006_v0__131113_125318_6858
      * Monday's round of tests with Burt showed it is in fact a site issue, and Burt is working on it.
      * Still tracking down the issue, so I will run another round.
   * pdmvserv_EGM-UpgradePhase1Age1Kdr61SLHCx-00002_T1_US_FNAL_MSS_00002_v0__131125_160358_1243 - performance failures, 92%, so I will run ACDC to see if we get more.

---+++ Andrew's Questions
   * Requesters asked what sets the memory (RSS) limit of workflows, for heavy ion.
      * We do have sites where we could bump it up a little; the limit is set by the hardware limit of most sites.
      * They want 4 GB; I don't think we have any sites that have that kind of limit.
   * RE priority of agents
      * We need to set a date.
   * We made a ticket for the PhEDEx injector problems (400 error): https://github.com/dmwm/WMCore/issues/4863

-- Main.JenniferAdelmanMcCarthy - 03 Dec 2013