March 2016 GDB notes

Agenda

http://indico.cern.ch/event/394780/

Introduction (Ian Collier)

Slides

NL EScience Centre Report (Daniela Remenska)

Slides

  • Oxana: you managed to create a common platform with people working together
  • Daniela: we gain from common approaches, generic solutions
  • Oxana: how do you impose that?
  • Daniela: project proposals are required to aim for generic results and partnerships
  • Jeff:
    • the predecessor project, BiG Grid, gave funding directly to projects
    • PhD students then did the work and not a lot was heard from them afterward
    • a culture shift was needed to allow the NLeSC paradigm to succeed

WLCG Workshop Report (Ian Bird)

Slides

  • Jeff:
    • a new Information System also needs agreement from sites
    • avoid a second place where information needs to be provided
    • the BDII will not go away for EGI sites
    • if the BDII is not used, some MW development may be needed to cover gaps

  • Peter:
    • the task to merge the accounting DB data is being finalized
    • some info is hard to get out; a discussion with WLCG Operations has been started
  • Ian B: let's wait for the April GDB accounting discussions to see what we need
  • Jeff: clearer reports will also help the Scrutiny Group

  • Jeremy: what other activities have started, e.g. lightweight sites?
  • Ian B:
    • most started activities concern the medium term
    • for the longer term the study group will start
    • there is the team for Understanding Performance, and the Tech Lab
    • Operations Coordination can do prototyping

  • Jeff: the Technical Forum should include site people
  • Ian B:
    • the Technical Forum essentially is the GDB
    • but we need to ensure all relevant areas are tracked by an advisory panel
    • GDB Advisory Panel = GAP, aptly named for the gap analysis

First Suggestions for a WLCG Fast Benchmark (Manfred Alef)

Slides

  • Manfred: unexposed compiler flags might get changed under the hood

  • Maarten: is it good or bad that the Haswell and Sandy Bridge CPUs look different in the ROOTmarks test?
  • Manfred: good for ALICE, bad for ATLAS and LHCb

  • Maarten: what are CMS doing?
  • Helge: they should have something, e.g. for the Amazon campaign run by FNAL

  • Helge:
    • the good correlation between Whetstone and HS06 is quite remarkable
    • maybe the WN variety at KIT is not wide enough to spoil the correlation?
    • this should be tested at other sites
    • the experiments should also be involved to check this further

  • Jeff: can we have the scripts?
  • Manfred: they will be published on the WG web site
  • Jeff: store them in CVMFS to allow running them in grid jobs?
  • Manfred: we will look into that

  • Jeff: these investigations could make a nice CHEP paper

  • Jeff: HS06 does not represent the experiments, maybe due to different compiler flags?

  • Helge: you also need to know the HW you run on, which may be impossible in clouds

  • Mattias:
    • ATLAS and others know the number of events handled in an MC job
    • that can then be correlated with the benchmark
    • we need complete coverage, i.e. simulation, reco and analysis, per experiment

  • Manfred: we can keep running all 5 benchmarks via the KitValidation framework
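
The check Mattias describes (correlating the events handled per MC job with a benchmark score) could be sketched roughly as below. This is only an illustration in plain Python: the job records, field layout and numbers are hypothetical placeholders, not measurements from any site or experiment, and a real study would need per-experiment data for simulation, reco and analysis.

```python
from math import sqrt


def event_rate(events, wall_seconds):
    """Events processed per second of wall-clock time for one job."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return events / wall_seconds


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical job records (placeholder values, NOT real data):
# (benchmark score of the worker node, events processed, wall-clock seconds)
jobs = [
    (10.0, 1000, 520.0),
    (12.0, 1000, 410.0),
    (15.0, 1000, 350.0),
    (18.0, 1000, 280.0),
]

scores = [score for score, _, _ in jobs]
rates = [event_rate(events, wall) for _, events, wall in jobs]

# A value close to 1 would suggest the benchmark tracks this workload well.
print("correlation:", pearson(scores, rates))
```

Running the same comparison separately per workload type and per benchmark would show which fast benchmark, if any, tracks each experiment's applications.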

Argus Central Suspension Update (Vincent Brillault)

Slides

  • Jeff: the VO frameworks definitely need to consume suspension rules

  • Ian B:
    • the VO needs to be able to ban its users
    • that responsibility has moved to them
    • if their reaction is untimely, sites can just ban the VO

  • Sven:
    • banned DNs should also be communicated upward to the central team
    • so that sites can ban direct access attempts by those DNs

  • Dave K: why didn't things work at sites?
  • Vincent: an NGI campaign was done; a site campaign has not been done yet
  • Ian N: in the UK there was also an Argus version mismatch
  • Sven: we need to have automatic monitoring
  • Ian N: work in progress in the UK

Improving Traceability - Introduction (Dave Kelsey)

Slides

WLCG Risk Assessment revisited (Ian Neilson)

Slides

  • Sven: misused identities are hard to detect if their activities stay under the radar

  • Jeff: admin identities and ordinary user identities may need different treatments

  • Vincent: attacks are propagated through common SW like ssh and possibly OpenStack
  • Ian C: standard components are better maintained, but also more popular for attacks
  • Vincent: a badly configured standard service is easier to attack than a non-standard service

  • Maarten: we can use the recently refreshed EGI Security Threats Assessment to update the one for WLCG

A new Model for traceability & separation (Vincent Brillault)

Slides

  • Maarten:
    • why would multi-user pilot-jobs be on the decline?
    • it would depend on whether each pilot commits itself to a single user
    • DIRAC used to do that and maybe still does today
    • the other pilot frameworks may not do that, e.g. AliEn does not

VO Perspective (Alessandro Di Girolamo)

Slides

  • Oxana: the ARC Control Tower also plays a role in traceability

  • Sven: in 2011 ATLAS said that a number of weeks might be needed to find the DN who submitted a particular job - how is it today?
  • Alessandro:
    • the bitcoin incident of last year was resolved within one morning
    • it took 6 or 7 people working together on it, though
  • Ian C:
    • by aggregating logs we can make the problem tractable
    • this is not yet easy for all cases today
    • we will need to use ElasticSearch etc.
    • and scale down the amount of information

  • Vincent: could the user payload kill the wrapper?
  • Alessandro: yes

Security Operations Centre update (David Crooks, Liviu Valsan)

Slides 1

Slides 2

Discussion

  • Jeff:
    • Bro would need to be installed on every node and monitor everything
    • that may be OK only where the batch infrastructure is owned by the WLCG site
    • at many sites WLCG jobs will run side by side with other jobs
    • many sites would anyway feel uncomfortable sending such data elsewhere
  • Ian C:
    • the idea is that each site should have its own SOC
    • at the moment such a facility is hard to deploy
    • therefore an appliance is being looked into
    • compare with perfSONAR
    • data would only be shared within a trust community
    • some information will then be forwarded to a central instance

  • Sven: did the IDS at CERN actually help detect incidents?
  • Liviu: so far we have only been flooded with false positives
  • Sven: it will not be easy to find a tool that avoids them for us

  • David C:
    • small sites need help in these matters
    • we will test possible solutions with artificial data
    • we can then identify which information can be shared

  • Vincent: what can we do with the central MISP data?
  • David C: we need to be cautious with the volume of that data
  • Vincent: the SOC must also be able to reprocess the past when more data arrives

Wrap up & next steps (Ian Collier)

  • Ian C:
    • a first WG needs to further explore SOC solutions
      • the ingestion of VO workflow data also needs to be looked into
    • a second WG needs to further explore traceability tools for jobs
      • containers, cgroups etc.
    • the WLCG Risk Assessment should be updated later this year

  • Jeff:
    • the SOC at CERN has an impressive infrastructure, yet it detected zero incidents
    • we should watch out for a hasty deployment before we are sure it will actually work
  • Vincent: Bro has already been useful for detecting compromised systems
  • Ian B: as CERN has an open infrastructure, we need to understand the network traffic
  • Ian C: we need to use big data tools to analyze network flows
  • Sven: as CERN attracts more attackers, its solution may be overkill for small sites
  • Ian C:
    • GridPP is well placed to tune solutions for small sites
    • when jobs run in opaque VMs, we need to analyze the network flows
Topic revision: r3 - 2016-04-15 - IanCollier
 