Introduction

So far, no security incident in our grid has involved an attacker exploiting a grid proxy. However, in some incidents many hundreds of proxies were exposed: the attacker had read/write access to them. All the corresponding certificates had to be revoked.

We have a number of issues to sort out on the worker nodes. In the past years we have made significant compromises in favor of making WLCG ready for the scale of LHC data processing. Now that we have proven the readiness during two full years of data taking, we need to revisit aspects of the security model where the current state of affairs does not comply with our own policies. Examples include the long proxy lifetimes and the lack of adequate traceability.

Traceability is an essential part of security operations and fulfills legal requirements in most countries. We need a level of traceability that makes it possible for an abuse to be traced back unambiguously to the originating user (whose proxy may have been abused by an intruder).

Both VOs and sites agree traceability is needed and both have appropriate handles in their own realms, but there is a gap between them that currently makes overall traceability difficult or even impossible in certain cases.

All organizations in WLCG are bound by security policies requiring them to store relevant information locally and participate in incident response. It is thus deemed acceptable for sites and VOs to collaborate and share data to identify the originating user, as long as the accuracy of the result depends on reasonable assumptions.

For example, only a small set of privileged persons will have administrative access to critical elements in the job submission chain, such as a VO's central task queue, a MyProxy server, or a CE. In the absence of bugs in the design or the implementation of such services, we can then assume their records for a given job to be reliable under typical circumstances.

Given such assumptions, we must strive to identify the originating user of an abuse without ambiguity. Correlations based on time are useful to narrow down investigations, but evidence thus acquired is circumstantial and often insufficient to pinpoint the originating user.
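
The limitation of time-based correlation can be illustrated with a minimal Python sketch. The record layout, the DNs, the node names and the `candidates_by_time` helper are all hypothetical and only serve the illustration; they do not describe any existing WLCG service.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class JobRecord:
    """Hypothetical record of a payload that ran on a worker node."""
    user_dn: str
    node: str
    start: datetime
    end: datetime


def candidates_by_time(records, node, event_time, slack=timedelta(minutes=5)):
    """Return the sorted DNs of all users with a payload on `node` around `event_time`."""
    return sorted({
        r.user_dn
        for r in records
        if r.node == node and r.start - slack <= event_time <= r.end + slack
    })


t0 = datetime(2011, 12, 6, 12, 0)
records = [
    JobRecord("/DC=example/CN=alice", "wn001", t0, t0 + timedelta(hours=2)),
    JobRecord("/DC=example/CN=bob", "wn001", t0 + timedelta(hours=1), t0 + timedelta(hours=3)),
    JobRecord("/DC=example/CN=carol", "wn002", t0, t0 + timedelta(hours=4)),
]

# An abuse is observed on wn001 ninety minutes after t0.
suspects = candidates_by_time(records, "wn001", t0 + timedelta(minutes=90))
print(suspects)
```

In this made-up data set two users had payloads on the same node at the time of the event, so the time window narrows the investigation but cannot single out the originating user.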

A reasonable level of traceability will:

  • help prevent security incidents from spreading or re-occurring;
  • ensure compliance with legal requirements, including due diligence;
  • provide deniability for users who were not involved with an incident.

Good traceability will protect sites against charges of negligence and subsequent repercussions. Fallout may include damage to our reputation, which in turn can impact funding for the project. Here is an example of media coverage of a real incident affecting the related TeraGrid project in 2004:

Without fine-grained traceability, sites will need to ban the whole affected VO on any security incident. Before the VO can be reinstated, the incident should first be resolved, but that is again hampered by the lack of fine-grained traceability and in the end may completely depend on circumstantial evidence, which may be inconclusive.

The security foundation of our infrastructure ought to do better.

Rationale

Note: some of the statements in this section are purposely thought-provoking and provide ideas for the longer term!

Traceability & Logging
Can a VO be trusted to provide the right payload and digital identity for MUPJ? What do the Sites' jurisdictions say about that?

  • If yes, then we don't need user proxy certificates here.
  • If no, then we need a new way of delegating Grid jobs. User proxy certificates are no good at all to prove accountability/traceability. This is detailed below.

Payload Separation and Pilot Job Protection
Should we enforce separation of payloads from different jobs/users on a WN?

  • If yes, then we need to find/pick a solution. Current candidates are identity-switch (e.g. gLExec) and VMs.
    • There are Computing Sites that don't like/allow identity-switching (setUID bit); they might have to be restricted to job types that are not concerned by this discussion.
  • If no, then we can go on with what we have?!

Grid User Accountability and Non-Repudiation
Do we need to protect a VO and its admins from potentially using or leaking its Grid users' credentials? Do we need to provide for plausible deniability in case of incidents?

  • If yes, then user proxy certificates may not leave the user's computer.
  • If no, then perhaps we should not worry about trust in the VO concerning traceability and logging on the WN.

Part 0 - Background, existing Policies

The Grid Acceptable Use Policy includes:

"4. You shall protect your access credentials (e.g. private keys or passwords)."

"5. You shall immediately report any known or suspected security breach or misuse of the Grid or
access credentials to the incident reporting locations specified by the Grid and to the relevant
credential issuing authorities."

The Policy on Grid Multi-User Pilot Jobs says:

"2. Each pilot job must be the responsibility of one of a limited number of authorised and registered
members of the VO. The VO is responsible for implementing a process for authorising pilot job
owners and ensuring that they accept the conditions laid down here. The pilot job owner and the
VO on behalf of whom the job is submitted are held responsible by the Grid and by the Site for
the safe and secure operation of the pilot job and its associated user job(s)."

"4. The pilot job framework must meet the fine-grained monitoring and control requirements defined
in the Grid Security Traceability and Logging policy. The use of gLexec in identity switching-
mode is one solution that meets these needs."

"5. The pilot job must use the approved system utility to map the application and data files to the
actual owner of the workload and interface to local Site authorization, audit and accounting
services. The owner of the user job is liable for all actions of that user job."

"6. The pilot job must respect the result of any local authorisation and/or policy decisions, e.g.
blocking the running of the user job."

"8. The pilot job framework must isolate user jobs from one another, including any local data files
created during execution and any inter-process communication."


The Grid Security Traceability and Logging Policy includes the following two paragraphs:

"The minimum level of traceability for Grid usage is to be able to identify the source of all actions
(executables, file transfers, pilot jobs, portal jobs, etc) and the individual who initiated them. In
addition, sufficiently fine-grained controls, such as blocking the originating user and monitoring to
detect abnormal behaviour, are necessary for keeping services operational. It is essential to be able to
understand the cause and to fix any problems before re-enabling access for the user."

"The aim is to be able to answer the basic questions who, what, where, and when concerning any
incident. This requires retaining all relevant information, including timestamps and the digital identity
of the user, sufficient to identify, for each service instance, and for every security event including at
least the following: connect, authenticate, authorize (including identity changes) and disconnect."
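
As an illustration of what such a record could contain, the following Python sketch models one log entry per security event with the who/what/where/when fields named in the policy text. The `SecurityEvent` class, its field names, the JSON encoding and the example DNs and host name are assumptions made for this sketch; the policy does not prescribe a format.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone

# Event types that the policy text lists as the minimum to be recorded.
REQUIRED_EVENTS = ("connect", "authenticate", "authorize", "disconnect")


@dataclass(frozen=True)
class SecurityEvent:
    when: str         # ISO 8601 timestamp in UTC
    who: str          # digital identity, e.g. a certificate DN
    what: str         # event type
    where: str        # service instance
    detail: str = ""  # free text, e.g. an identity change


def make_event(who, what, where, detail="", now=None):
    """Build an event, stamping it with the current UTC time unless `now` is given."""
    now = now or datetime.now(timezone.utc)
    return SecurityEvent(now.isoformat(), who, what, where, detail)


def to_log_line(event):
    """Serialize an event as one JSON line (hypothetical log format)."""
    return json.dumps(asdict(event), sort_keys=True)


def missing_events(trail):
    """Return the required event types that do not occur in `trail`."""
    seen = {event.what for event in trail}
    return [name for name in REQUIRED_EVENTS if name not in seen]


trail = [
    make_event("/DC=example/CN=pilot", "connect", "ce01.example.org"),
    make_event("/DC=example/CN=pilot", "authenticate", "ce01.example.org"),
    make_event("/DC=example/CN=alice", "authorize", "ce01.example.org",
               detail="identity change: pilot -> alice"),
]
print(missing_events(trail))
```

A check like `missing_events` could help spot service instances whose logs do not cover the minimum set of events; here it reports that the disconnect event was never recorded.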

The Grid Policy on the Handling of User-Level Job Accounting Data is relevant as well.

Part I - Multi User Pilot Jobs

  • What is a Multi User Pilot Job?
    • A job that is submitted to the site with an identity that is not the identity of the user whose payload the job may run
    • From the site point of view, conceptually there are two different MUPJ scenarios:
      • The PJ runs payloads of other identities without the WN's (Site's) awareness. (masquerading)
      • The other identities are exposed to the WN (Site)
    • From the VO point of view, conceptually there are three different MUPJ scenarios:
      • The PJ runs payloads of other identities without site WN (Site) support (no OS protections)
      • The PJ has WN support to switch UID so PJ and user job run under different UIDs (OS protection of PJ from user)
      • The PJ has WN support to switch UID so PJ and all jobs (at least for different users) on the same node are guaranteed to run under different UIDs (true OS process isolation)

  • Traceability in the experiment framework
    • Important, in particular to investigate unintentional misuse
    • Insufficient when the job's trail on the WN may have been tampered with

  • Traceability at the site
    • Needed on the WN for investigation of malicious activities
    • Examples:
      • Which user attacked www.example.com?
      • Which user uploaded illegal data?
    • As much evidence as possible should be available after the fact
      • Investigation should be narrowed down to as few users as possible
    • Concurrent jobs running under the same account
      • Which job was guilty?
      • Bad if they can steal each other's proxies
      • Access to pilot proxy. Only a DoS risk? Just a traceability problem? Compromise of the PJ infrastructure?
    • Sequential jobs running under the same account
      • One job can leave a Trojan horse or time bomb behind for another.
      • Can PJ prevent that? Under what circumstances?
      • How can the original culprit be found?
    • Banning
      • How can you be (fairly) sure the incident is over?
      • Typical scenario: need to revoke the compromised user credential

  • Legal issues
    • Site views would be desirable here!
    • Sites may need proof of who was using a resource at a certain time
      • The pilot owner may be held responsible in the absence of other evidence
    • Information supplied by the VO may be legally insufficient or too late
      • A VO is not a legal entity and has no legal obligations to sites
      • Users are required to sign the AUP, but their proxies might have been mishandled or compromised on VO services
      • Jobs should be signed by users with signatures verifiable by sites
      • Sites could trust the VO and let the pilot just log the user DN for now
        • But payloads of different users should still be separable!
    • Privacy laws might in principle hamper the flow of user information
      • But all users signed the AUP which allows user details to be made available to privileged parties (VO/site/NGI admins)
    • The VO ought not knowingly put its users (in particular the pilot owner) at risk of getting accused of someone else's actions

  • Virtual machines can simplify these matters
    • Run each pilot in its own VM
      • The VM is destroyed on job exit, after preserving the syslog etc.
    • Each pilot could only run payloads from a single user
      • If central task queue supports it, it would not allow the pilot to download payloads that were submitted by other users
        • How would this be guaranteed?
      • gLExec setuid mode could be avoided
        • An infected payload would be unable to affect payloads of others
        • Assumes hijacking the PJ is not a (major) risk
    • PJ running as superuser would solve the UID changing problem
      • Typical operation mode on commercial clouds
      • Potential problems for (some) sites? (e.g. traceability)
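
The idea of a pilot that only runs payloads from a single user can be sketched on the task queue side as follows. The `TaskQueue` class and its methods are hypothetical and not modelled on any existing VO framework. The sketch also leaves open the question raised above of how a site could obtain a guarantee: the rule is enforced by the queue, so the site still has to trust it.

```python
class TaskQueue:
    """Hypothetical central task queue that binds each pilot to one user."""

    def __init__(self):
        self._pending = []       # (user_dn, payload) in submission order
        self._pilot_owner = {}   # pilot_id -> user DN the pilot is bound to

    def submit(self, user_dn, payload):
        self._pending.append((user_dn, payload))

    def fetch(self, pilot_id):
        """Return the next (user_dn, payload) this pilot may run, or None."""
        bound = self._pilot_owner.get(pilot_id)
        for index, (user_dn, payload) in enumerate(self._pending):
            # An unbound pilot takes the oldest payload and is bound to its
            # user; a bound pilot only ever gets payloads of that user.
            if bound is None or user_dn == bound:
                del self._pending[index]
                self._pilot_owner[pilot_id] = user_dn
                return user_dn, payload
        return None


queue = TaskQueue()
queue.submit("/DC=example/CN=alice", "a1")
queue.submit("/DC=example/CN=bob", "b1")
queue.submit("/DC=example/CN=alice", "a2")

first = queue.fetch("pilot-1")    # binds pilot-1 to alice
second = queue.fetch("pilot-1")   # skips bob's payload
third = queue.fetch("pilot-1")    # nothing left for alice
other = queue.fetch("pilot-2")    # a different pilot picks up bob's payload
print(first, second, third, other)
```

With one pilot per VM, such a binding would mean that a VM only ever sees payloads of one user during its lifetime.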

Part II - Proxy Certificates as Grid credentials for delegation

(based on the current WLCG usage scenario of unrestricted X.509 proxies)

  • An X.509 proxy credential entitles a person or service to act in the name of the issuer/originator of the credential.

  • There is no limitation to the delegation, except in time.

  • In practice this means a proxy can be used by everybody able to acquire it and with all the privileges that were implicitly or explicitly conveyed to the proxy.

  • In contrast to e.g. an X.509 Grid credential, a proxy has no password or similar protection, but relies on proper handling by the middleware to avoid exposure. It may be copied/stolen at any time and place by people who manage to get access to it, usually through a file in which it is stored.

Part III - Proxy Certificates as credentials for Multi User Pilot Jobs

(consequences of the characteristics of unrestricted proxies for MUPJ)

An important goal of shipping a user proxy with a Grid job/payload to a WN in MUPJ scenarios is to make the owner/submitter of the job visible to the WN/Site.

  • This is done because the WN/Site may not trust the VO and its central services to deliver the appropriate user identity for the payloads; simply trusting the VO might even be deemed prohibited by local law.

  • However, if the proxy is delivered to the WN/Site by the VO/central service, there already is a high level of trust in the VO from the perspective of the Site. In other words, the delivery of the proxy does not constitute a proof at all in terms of accountability.

  • If a VO/central service delivers the proxy to the WN/Site, it means that in principle it can also use the proxy for any other actions than the supposed Grid job execution.

  • As a consequence, users have lost any chance for easy repudiation or dispute of their actions. Ultimately this means anybody with privileged access, legitimate or illegitimate, can masquerade as and operate in the name of arbitrary Grid users without their consent, with all their privileges and with full options to cover traces. The accountability of Grid users rather decreases as the trust in a central service by both users and Sites increases.

  • However, proxies on the WN are not only needed for MUPJ; many users need a proxy to interact with other network services (e.g. storage)

  • The question is if we can do better than sending user proxies around, viz. prevent their abuse and actually prove accountability.

It should be noted that very similar concerns exist for "traditional" or "PUSH" models where the user delegates the job brokering to the VO.

Part IV - A credential to stand up

In order to establish MUPJ accountability of users on the WN a user payload credential needs to...

  • Prove the actual submission by the user (actually by someone possessing the private key of the user's certificate)

  • Be unusable for other purposes

  • As such, implement a restricted and definite delegation of a task (the job)
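
The three properties above can be illustrated with a small Python sketch of a credential that is bound to one payload. Everything in it is an assumption made for illustration: the claim fields, the function names, and in particular the signature scheme. To stay self-contained, the sketch uses a Lamport one-time signature built from SHA-256 as a stand-in for a signature made with the private key of the user's certificate; it is not a proposal for a concrete mechanism.

```python
import hashlib
import json
import secrets


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def keygen():
    """Lamport one-time key pair: 256 pairs of secrets and their hashes."""
    priv = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    pub = [(_h(a), _h(b)) for a, b in priv]
    return priv, pub


def _bits(message: bytes):
    """The 256 bits of SHA-256(message), most significant bit first."""
    digest = _h(message)
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(256)]


def sign(priv, message: bytes):
    # Reveal one secret per digest bit; the key must not sign a second message.
    return [priv[i][bit] for i, bit in enumerate(_bits(message))]


def verify(pub, message: bytes, signature) -> bool:
    if len(signature) != 256:
        return False
    return all(_h(part) == pub[i][bit]
               for i, (part, bit) in enumerate(zip(signature, _bits(message))))


def make_job_credential(priv, user_dn, payload: bytes, not_after: int):
    """User side: sign a claim naming the user, the payload digest and an expiry."""
    claim = {"user": user_dn, "payload_sha256": _h(payload).hex(), "not_after": not_after}
    return claim, sign(priv, json.dumps(claim, sort_keys=True).encode())


def site_accepts(pub, claim, signature, payload: bytes, now: int) -> bool:
    """Site side: accept only this payload, signed by this user, before expiry."""
    message = json.dumps(claim, sort_keys=True).encode()
    return (verify(pub, message, signature)
            and claim["payload_sha256"] == _h(payload).hex()
            and now <= claim["not_after"])


priv, pub = keygen()
payload = b"#!/bin/sh\necho analysis\n"
claim, signature = make_job_credential(priv, "/DC=example/CN=alice", payload, not_after=1000)

print(site_accepts(pub, claim, signature, payload, now=500))
print(site_accepts(pub, claim, signature, b"something else", now=500))
```

Because the signature covers the payload digest, the credential proves that the key holder submitted this particular payload and is of no use for running anything else, which is the contrast with an unrestricted proxy that the list above asks for.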
Topic revision: r11 - 2011-12-06 - MaartenLitmaath
 