WLCGOpsMinutes200402 (2020-04-06, LorneL)
<font size="6"> %RED% *DRAFT* %BLACK% </font> <br /><br />

---+!! WLCG Operations Coordination Minutes, April 2, 2020

%TOC{depth="4"}%

---++ Highlights

   * [[WLCGOpsMinutes200402#COVID_19_impact_on_WLCG_operatio][COVID-19 impact on WLCG operations]]
   * [[WLCGOpsMinutes200402#WLCG_computing_resources_for_COV][WLCG computing resources for COVID-19 research]]
   * [[WLCGOpsMinutes200402#EGI_initiatives_HADDOCK_applicat][EGI initiatives. HADDOCK application.]]

---++ Agenda

   * [[https://indico.cern.ch/event/901439/][link to the agenda page]]

---++ Attendance

   * local:
   * remote: Aleksandr A (ATLAS), Alessandra D (Napoli), Alessandra F (ATLAS + Manchester), Alexander U (ATLAS), Alexandre Bonvin (Utrecht), Alexei, Andrea (WLCG), Andreas (KIT), Andrew (TRIUMF), Catalin (EGI), Cesare (MPCDF), Christoph (CMS), Concezio (LHCb), Costin (ALICE), Cécile Barbier, Dario (ATLAS), Dave M (FNAL), David B (!IN2P3-CC), David Cameron (ATLAS), David Cohen (Technion), David S (ATLAS), Doug (ATLAS), Eric (!IN2P3), Federico (LHCb), Felice (CMS), Giuseppe B (CMS), Giuseppe La Rocca (EGI), Ivan (ATLAS), James (ATLAS), Jeny (FNAL), Johannes (ATLAS), Julia (WLCG), Liz (FNAL + CMS), Maarten (ALICE + WLCG), Marco (Padova), Marian (monitoring + networks), Matt D (Lancaster), Matt V (EGI), Nicolo (ATLAS), Pepe (PIC), Peter (ATLAS), Petr (Prague + ATLAS), Renato (LHCb + CBPF + ROC_LA), Ricardo (SAMPA), Riccardo (WLCG), Rod (ATLAS), Ron (NLT1), Shawn (MWT2 + ATLAS), Stefano (CNAF), Stephan (CMS), Thomas (DESY), Torsten (Wuppertal), Victor (CMS), Vincent (security)
   * apologies:

---++ Operations News

   * the next meeting is planned for May 7
      * please let us know if that date would pose a major inconvenience

---++ Special topics

---+++ COVID-19 impact on WLCG operations

   * [[https://twiki.cern.ch/twiki/bin/view/LCG/Covid19SiteImpact][Twiki page where input from sites, experiments, central operations and infrastructures is collected]]

---+++ WLCG computing resources for COVID-19 research

*Note: FH denotes Folding@home*

   * Federico:
      * not convinced running FH would be the best approach; other initiatives might be better
      * would we run some amount of it alongside LHCb workloads?
      * it also depends on the perspective of the sites
      * we can furnish our expertise in running workloads across the grid
   * James:
      * sites can directly contribute to other initiatives
      * FH is easy to integrate into our workflows
      * experiments could direct such jobs to sites that agree
   * Thomas:
      * are we sure there will be enough work to run? Rosetta@home has not had enough so far
      * the FH client is incompatible with other BOINC work!
   * Federico: as we cannot know the queue, pilots may just die
   * Andreas:
      * we should provide a list of running projects
      * sites can then pick _before_ experiments try to do something
      * KIT is already doing that for resources above the pledge
   * James:
      * there are docs in a number of places
      * the CERN task force has concluded that FH would be the best option so far
   * David S: we should not just run what is possible, it has to be useful
   * Julia:
      * in principle we could even run a service creating such jobs
      * the usefulness of that is not known today
   * Dave M:
      * we would need to interact with experts in those domains
      * OSG and EGI are also running initiatives
      * FNAL is already involved there
   * Federico:
      * EGI are already running e.g. !WeNMR (see the presentation)
      * WLCG lacks expertise in those areas
   * Matt V:
      * Alexandre Bonvin will talk about !WeNMR
      * EGI will have a call with OSG and come back to WLCG
   * David Cohen:
      * sites will need to know what resource numbers we are talking about
      * they may need to get agreement from funding agencies
   * Julia: indeed, and we should find the most effective contributions
   * Pepe:
      * resources are to be used for official purposes
      * there is more flexibility for amortized and other HW beyond the pledge
   * Liz:
      * different countries and funding agencies will have different policies
      * sites should talk to their funding agencies
   * Alessandra F: WLCG cannot enforce anything
   * Christoph:
      * what sites do with resources beyond the pledge is their decision
      * for running the jobs in question on pledged resources we would need to know:
         * what fraction?
         * which application(s)?
         * through which channel(s)?
   * Alessandra F:
      * the best application is currently unknown
      * here we want to decide what we can do using the experiment infrastructures, and avoid unnecessary duplication of efforts
   * Costin: an experiment can reach all its sites
   * Federico:
      * it is not for us to operate the application(s)
      * biomed people should do that
   * Alessandra F: some interaction with people from the WHO etc. might be needed
   * James:
      * the CERN task force is doing that
      * for now, FH was the only concrete proposal
   * Costin: in order not to waste effort, can we go ahead?
   * Maarten:
      * we have to be careful there
      * small-scale proofs of concept are OK at this stage
      * bigger activities could e.g. lead to issues between sites and funding agencies
      * we do not have a full plan at this time
   * Johannes:
      * in the experiments we can control the scale of these activities
      * and we could already use unpledged resources like the online farm
   * Christoph:
      * experiments cannot control the use of unpledged resources at sites
      * several sites are already using unpledged resources for related purposes
   * David S:
      * we can come to a suggestion for how to run things
      * and avoid unnecessary duplication
   * Dave M: WLCG can do the communication part
   * Julia: we will follow up in our own task force

---++++ Follow-up comments after the meeting

Simone Campana could not join the meeting because of an overlap with another meeting he had to attend. There are a few follow-up comments/clarifications from Simone:

   * *Why FH?* As James Catmore pointed out, WLCG picked FH because the Fight-against-COVID TF at CERN recommended it, while more discussions are happening there.
   * *Concerns of sites using resources allocated to LHC for COVID-19 research.* At the moment we agreed to do this at Citizen Science level, again as recommended by the TF, i.e. a few thousand cores. Even at 10k cores this would be ~1% of WLCG, so we do not expect an impact on WLCG activities, considering also that the experiments normally benefit from ~20% of resources beyond the pledges. I will mention this activity at the next RRB and ask the Funding Agencies for feedback. The situation is different if a site or a country decides to dedicate a large fraction of its resources to some initiative; there we allow flexibility, but that site or group of sites should document it and explain it to their funding agency.

---+++ EGI initiatives. HADDOCK application.

[[https://indico.cern.ch/event/901439/#6-egi-initiatives-haddock-appl][presentation]]

   * Alexandre:
      * we are talking to OSG to see if our jobs can run there as well
      * our computing model has been opportunistic so far: sites decide if they want to support us, e.g. for backfilling
      * the work volume depends on the user activity; it is also limited by the scalability of the portal(s)
   * James: have you contacted the CERN task force?
   * Alexandre:
      * not yet
      * at this time we are not limited by computing resources
      * we can flag jobs that are related to COVID-19 research
   * Andreas: whom to approach for such jobs?
   * Alexandre:
      * first enable the =enmr.eu= VO on your resources
      * we do not depend on CVMFS today, as we found it unreliable at several sites; instead, our jobs bring their payload of 1 to 20 MB in their input sandbox
      * the job output is typically around 5 to 20 MB
      * jobs have typically been short; through DIRAC we can make them longer, with larger outputs
      * each site supporting these jobs will need to be enabled in DIRAC
      * if desired, the site can be tagged to receive only jobs related to COVID-19
   * Julia: in the meeting between EGI and OSG, is there a WLCG representative?
   * Matt V: not yet, but we will follow up on that

---++ Middleware News

   * Useful Links
      * WLCGBaselineTable
      * [[WLCGBaselineVersions#Issues_Affecting_the_WLCG_Infras][MW Issues]]
      * [[WLCGT0T1GridServices#Storage_deployment][Storage Deployment]]
   * Baselines/News

---++ Tier 0 News

---++ Tier 1 Feedback

---++ Tier 2 Feedback

---++ Experiments Reports

---+++ ALICE

   * Mostly business as usual so far, despite COVID-19 measures everywhere!
      * Thanks very much to the site admins!
   * Current emphasis is on data analysis, which requires little additional disk space.
      * Productions that need a lot of disk space are postponed until pledges are available.

---+++ ATLAS

   * no COVID-19 related problems so far
   * Smooth and stable grid production with ~430k concurrently running grid job slots, with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~90k slots from the CERN-P1 farm. Occasional additional bursts of ~100k jobs from NERSC/Cori.
   * Finishing the RAW/DRAW reprocessing campaign in data/tape carousel mode with data15 within the next week.
   * No other major issues apart from the usual storage or transfer related problems at sites.
   * Feedback on the APEL accounting question: keep it simple!
   * Grand unification of !PanDA queues is ongoing, and tests of non-gridFTP TPC are in production.
   * Feedback on the Google CA bundle for TPC to GCS: [[https://twiki.cern.ch/twiki/bin/view/LCG/CloudStorageIntegration][CloudStorageIntegration]] - will move ahead with it.
   * Would like to raise the criticality of the CEPH and DBoD services to 8 and 9, respectively.

---+++ CMS

   * no COVID-19 related interrupts to the CMS computing infrastructure so far
   * jumbo frame issue at CERN impacting several sites, INC:2355684
      * after network maintenance, March 11th, OTG:0054668
      * we expected this to be corrected quickly; does anybody know what the issue is?
   * running at about 250k cores during the last month
      * usual production/analysis mix (80%/20%)
   * ultra-legacy re-reconstruction of the 2016 data is in validation
   * Run 2 Monte Carlo production is the largest activity; a large batch of Phase-2 events was delivered

---++++ Discussion

   * Maarten:
      * the matter with the jumbo frames seems not so easy to resolve
      * the ticket is currently waiting for input from the affected site
   * Liz:
      * an issue with jumbo frames already hit us in the middle of Run 2
      * this could affect more than one site
   * Stephan:
      * at the moment this is not a big problem, affecting only a limited area of work
      * we would like to have a solution, even if it implies changes on our side

---+++ LHCb

see [[Covid19SiteImpact#Impact_on_experiment_operations][here]]

---++ Task Forces and Working Groups

---+++ GDPR and WLCG services

   * [[https://twiki.cern.ch/twiki/bin/view/LCG/GDPRandWLCG][Updated list of services]]
   * The detailed discussion of how to enable the privacy notice for all our services has been postponed. We will have a dedicated meeting with experiment contacts, most probably next Thursday.

---+++ Accounting TF

   * T1 reports generated by CRIC were sent around for validation; T2 reports will be sent for March

---+++ Archival Storage WG

---+++ Containers WG

---+++ !CREAM migration TF

Details [[FollowUpMigration][here]]

Summary:

   * 90 tickets
   * 5 done: 2 ARC, 3 HTCondor
   * 18 sites plan for ARC, 12 are considering it
   * 22 sites plan for HTCondor, 14 are considering it, 7 consider using SIMPLE
   * 15 tickets on hold, to be continued in a number of months
   * 14 tickets without reply
      * response times possibly affected by COVID-19 measures

---+++ dCache upgrade TF

   * 34 sites are running versions > 5.2.0: http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dcache&version=5.
   * 9 to go; some of them had planned an upgrade, but postponed it due to COVID-19
   * 2 plan to move to DPM

---++++ Discussion

   * Maarten: nowadays it does not seem a good idea for sites to move to DPM
   * Julia: we will follow up with them
   * Stephan:
      * one of those sites is a CMS site that already had a DPM
      * they want to consolidate their grid storage into just one system

---+++ DPM upgrade TF

   * 34 sites upgraded and reconfigured with DOME: http://wlcg-cric.cern.ch/core/service/list/?type=se&show_5=0&show_6=1&state=ACTIVE&impl=dpm&version=DOME&show_11=0&show_18=0
      * out of those, 15 are running 1.13.2 with DOME
   * 6 have upgraded, but DOME is still missing; they are working on it
   * 1 still to upgrade and reconfigure, in progress
   * 1 site is suspended for operations
   * 9 are moving away from DPM

---+++ Information System Evolution TF

   * REBUS has been in read-only mode since the beginning of April. Pages for editing information have been redirected to CRIC.
   * Thanks a lot to Federico for providing an API from DIRAC for the LHCb topology information. It will be used by CRIC and Storage Space Accounting.

---+++ IPv6 Validation and Deployment TF

Detailed status [[WlcgIpv6#IPv6Depl][here]].
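The dCache and DPM upgrade tallies above come down to comparing dotted version strings against a baseline. A minimal sketch of such a check, in Python; the site names and version strings below are hypothetical illustrations, only the 5.2.0 baseline comes from the minutes:

```python
# Compare dotted version strings against the dCache 5.2.0 baseline
# from the upgrade TF report. Sites and versions are hypothetical.
BASELINE = (5, 2, 0)

def parse_version(v):
    """Turn a dotted version string like '5.2.7' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

sites = {
    "site-a": "5.2.7",   # hypothetical, at baseline level
    "site-b": "6.2.11",  # hypothetical, ahead of baseline
    "site-c": "4.2.49",  # hypothetical, still to upgrade
}

# Tuple comparison is lexicographic, so 6.2.11 > 5.2.0 > 4.2.49 as expected.
compliant = [name for name, v in sites.items() if parse_version(v) >= BASELINE]
print(f"{len(compliant)}/{len(sites)} sites at or above baseline")  # 2/3 sites at or above baseline
```

Comparing tuples rather than raw strings avoids the classic pitfall that "4.2.49" > "5.2.0" is true as a string comparison.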
---+++ Machine/Job Features TF

---+++ Monitoring

---+++ MW Readiness WG

---+++ Network Throughput WG

<br />%INCLUDE{ "NetworkTransferMetrics" section="02042020" }%

---+++ Traceability WG

---++ Action list

%INCLUDE{ "WLCGOpsCoordActionList" }%

---++ AOB
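As a footnote, the scale argument in the COVID-19 follow-up comments above can be checked with back-of-envelope arithmetic. The ~1M-core WLCG total below is inferred from the stated "10k cores is 1% of WLCG" figure, not given explicitly in the minutes:

```python
# Back-of-envelope check of the "Citizen Science level" scale estimate.
covid_cores = 10_000            # "even if this is 10k cores"
wlcg_total_cores = 1_000_000    # inferred from the stated 1% figure
beyond_pledge_fraction = 0.20   # "normally there is a 20% beyond pledge"

fraction = covid_cores / wlcg_total_cores
print(f"COVID-19 share: {fraction:.1%}")  # COVID-19 share: 1.0%
print(f"Beyond-pledge headroom: {int(wlcg_total_cores * beyond_pledge_fraction):,} cores")
```

So the proposed contribution would sit well inside the ~200k cores of beyond-pledge capacity the experiments typically benefit from, which is the basis for expecting no impact on WLCG activities.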