PanDA Security


Panda services that interact with other Panda components and with external clients use the standard GSI grid security model of authentication and authorization based on X509 grid certificates, implemented via OpenSSL and https. mod_gridsite is used in Panda Apache servers to support this. Clients use curl for http and https support.

Interactions with Panda that involve anything other than passive read actions require secure https. Any user of the system (e.g. job submission via client interface, pilot requesting work) must hold a valid certificate proxy. Panda also checks the VOMS attributes of the proxy to ensure that the user is a member of a VO authorized for Panda use.

The DN of the authenticated user is carried as part of the metadata of a Panda job, so user identity is known and tracked throughout Panda operations. Panda-internal accounting and quota/usage management also uses this information.

Production job execution and file management in Panda relies on production certificates, with the pilot carrying a specific 'pilot' VOMS role to control its rights. Analysis jobs also run under a production proxy unless glexec is employed in identity switching mode (see the glexec discussion).

We have also defined a means of securing the job workload specification (transformation) from tampering in the Panda DB (encryption of the transformation using RSA key pair, with decoding/validation in the pilot prior to execution).

Further security measures, existing and planned, are described in the security review answers below.

Answers to questions from the pilot job framework security review, Version 2.3, 7/25/2008

0. Describe in a schematic way all the components of the system.

If a component needs to use IPC to talk to another component for any reason, describe what kind of authentication, authorization, integrity and/or privacy mechanisms are in place. If configurable, specify the typical, minimum and maximum protection you can get.

Components of the Panda system are as follows. For more information see the PanDA main wiki.

a) The Panda Server, consisting of an apache-based Web interface and back end databases which contain job data and various metadata. It performs the functions of

  • accepting job submissions via https from python front-end client (see c)
  • managing the system-wide job queue
  • performing brokerage on the queue based on job assignment criteria such as data locality, job priority, realtime availability of processing resources
  • pre-staging input data to sites or out of mass store prior to release of jobs, via an associated data service
  • dispatching of jobs to pilots making job requests, based on brokerage decisions

b) The Panda Monitor, which is loosely coupled to a) via its network connection to the databases

c) A few variants of the client software (eg. the physics analysis interface called "pathena") which connects to the Panda server via https and is secured via a X509 proxy; the function of this component is to register a job request with the server. Another variant of the job submission front end, Bamboo, interfaces Panda to the ATLAS production database which records production jobs requested by the physics community

d) The Pilot Scheduler, with the main version being called "Autopilot", which governs submission of pilot jobs to target sites (condor-G is the job submitter in this case; another scheduler being developed uses gLite job submission tools)

e) The Pilot Job, originated on the target site by (d), which after launching on a worker node connects to (a) via https to request a payload job, manages site-local data handling and job execution, and reports job status and results to (a)

f) MyProxy server which is used to cache users' X509 credentials when necessary (encrypted TCP/IP communication)

1. Describe how user proxies are handled from the moment a user submits a task to the central task queue to the moment that the user task runs on a WN, through any intermediate storage.

In the distributed analysis scenario, when the user submits a job to the Panda system, the client software (pathena) will deposit the user proxy onto an instance of the MyProxy server, which is owned and managed by the Atlas organization. The Pilot Job (and the Pilot Job only) has the credentials to extract the concrete user's proxy from the MyProxy server, a process which happens once the Pilot connects to the server and is ready to execute the Transformation (the script defining the actual payload). Working under the assumption that we limit the lifetime of the user's proxy, the Pilot Job is responsible for its periodic renewal during the job's execution.

Possible security concerns related to misuse of a compromised pilot proxy are being addressed in the following way: an instance of the proxy being stored the MyProxy server can be assigned a unique key that will be required upon future retrieval. This is in addition to requiring X509 credentials from the client for authentication. It is possible, therefore, to associate each job ID registered with Panda, with a unique key formed at the time of job submission. The key could be based on the MD5 sum of the user proxy, so that it changes whenever the user creates a fresh VOMS proxy. If the key was unique for each user job, a lot more stress would be put onto the MyProxy server, as many more proxies would have to be managed.

In addition, we plan on using single (or few) use tokens that will be used by the Pilot Job in order to get a payload job from the Server. The key is to be generated by the pilot scheduler, which will store it in a database (together with the scheduler's own validation token) as well as send it along with the submitted pilot. The pilot in turn presents the token to the Panda server in the job request process. The server retrieves the token (and associated scheduler token) from the DB, so validating the pilot, and immediately deletes the token (or decrements a maximum usage count).

Note that the proxy itself is never cached in Panda. In this scheme, the ways for an intruder to take possession of a user's proxy are:

(a) to gain access to the Panda database, which is protected - in addition to the pre-requisite of having a valid pilot proxy. We consider this level of security breach highly unlikely as it involves intrusion of a few loosely coupled components, each with its own set of credentials, and would require that all of these be simultaneously compromised.

(b) to directly penetrate the Worker Node and impersonate pilot jobs. In case a Worker Node is compromised during a limited period (say a few hours or days), the attacker might try and imitate the pilot(s), requesting a series of jobs to obtain their associated user proxies, but the constraints on the token reuse ensure that for a given pilot any such abuse can only affect a small number of users. The attacker might kill the pilot itself, so that another job with another proxy (and possibly a token) gets started, but such an attack is not special to pilot-based workflow systems.

(c) to gain direct access to the pilot job scheduler. This kind of threat is not unique in that there are already components in the grid infrastructure which, if compromised, could provide the intruder with credentials of multiple users. Such a risk is minimized by having a very small number of machines performing this task, together with establishing a strict control over access to these. Indeed, this is exactly the current practice.

2. What happens around the identity change on the WN, e.g. how is each task sandboxed and to what extent?

The Pilot Job is started by a privileged user (who has the role=pilot VOMS role annotation), whose rights in this case are limited to being able to utilize glexec. The glexec utility is activated and uses the user's proxy obtained from the MyProxy server as explained in item 1. Effectively, each task is sandboxed by being executed under the end user account to which the user's proxy is mapped, with privileges limited to those of that account and proxy.

This identity change via glexec is an optional feature of Panda, activated only at sites which request its use.

3. How can running processes be accounted to the correct user?

As explained in the previous item, the identity switch via glexec leads to each user's processes running under specific UIDs traceable to their respective identities. At the Panda level, the DN of the job submitter is recorded permanently for each Panda job, such that Panda can trace and account usage.

4. How is a task spawned on the WN and how is it destroyed?

Pilot Jobs download the so-called "transformation" script (based on the job definition they obtain from the Panda server) which is the payload job script. The transformation script will be executed just like any other script submitted to the local batch queue, and its spawning and destruction would be identical to that. A process can either complete its execution normally, exit with errors or be terminated by the site administrator in case of a suspected security breach. Panda also has a mechanism by which Panda users and operators can request job termination: the Panda server messages the pilot managing the payload job to terminate the job.

5. How can a site be blocked?

There are two different kinds of site blocking.

(a) blocking the delivery and execution of the payload by the Panda server, on the suspect site. This is achieved by using database tables which support the token-based access control described above (in the second half of item 1). The schema for the table storing the tokens will also contain the DN of the pilot scheduler operator and pilot destination site. By disabling database entries corresponding to suspect site(s), we will be able to deny the delivery of jobs to suspect pilots on one or any number of participating sites.

(b) stopping submission of legitimate Panda Pilots to suspect sites. This is achieved by Panda maintaining a database of grid sites, which are kept in logical groups, that can be reconfigured at any time. Based on information stored in the database, Pilot submission is always targeting only sites chosen and approved for the specific set of tasks. It is trivial, therefore, to exclude a suspect site from Panda usage by modifying the relevant database entry.

6. Security of services

6.1 What site security processes are applied to the machine(s) running the WMS?

The machines running the Panda WMS at BNL are operated by personnel of "RHIC and ATLAS Computing Facility" (RACF) at Brookhaven National Laboratory and are subject to policies and processes specific to BNL, which comply with regulations set forth by the US Department of Energy. Likewise, instances of the Panda service deployed at CERN are subject to its security policies and processes.

6.2 Who is allowed access to the machine(s) on which the service(s) run, and how do they obtain access? How are authorized individuals authenticated on the machine(s)?

Only a limited number of qualified IT personnel are granted access to this machines, as defined by the management.

6.3 What is the process for keeping the service(s) and OS patched and up-to-date, especially with respect to security patches?

See 6.1

6.4 Do you have an identified security contact?

RACF does have the security contact as per 6.1

6.5 Describe the incident response plan to deal with security incidents and reports of unauthorized use?

See 6.1 - the Panda services are subject to a wider set of security policy implementations, and watched over by BNL cybersecurity specialists

6.6 What services (in general) run on the machine(s) that offer the WMS service?

Only Panda services run on the machines (apart from associated middleware, such as Condor for machines supporting pilot schedulers). Panda database services are hosted on machines that run no services other than the DB services (which use MySQL).

6.7 What processes exist to maintain audit logs (e.g. for use during an incident)?

All crucial components of Panda WMS maintain extensive log files, which are periodically rotated and archived for possible future use in case of an incident.

6.8 What monitoring exists on the machine(s) to aid detection of security incidents or unauthorized use?

There are a few layers of monitoring:

  • a web interface displaying the workload submitted by each user, including job summaries over specified time periods or job types, with drill-down to complete job definition and status details. Panda usage patterns are thereby transparent to both developers and the shift crews looking after production.
  • as mentioned in 6.7, there are extensive logs that can be consulted to aid in detection of a possible security incident

7. Can you limit the users that can submit jobs to the VO WMS? How?

There are a few layers of user access control in Panda

  • a DB table of Panda user attributes can be used to ban specific users from having their jobs executed by Panda
  • job submitters are checked (via their DN) for affiliation with a VO as defined in the VOMS system (VOMS data cached periodically), and jobs are rejected if there is no valid affiliation with a VO approved for Panda use
  • X509 proxies used in Panda have limited lifetimes, so in case of a certificate revocation a user would quickly lose job submission rights

Major updates:
-- TorreWenaus, MaximPotekhin - 06 Aug 2008
-- TorreWenaus - 06 Oct 2006

Responsible: TorreWenaus

Edit | Attach | Watch | Print version | History: r5 < r4 < r3 < r2 < r1 | Backlinks | Raw View | Raw edit | More topic actions...
Topic revision: r2 - 2008-08-06 - TorreWenaus
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    PanDA All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright & 2008-2020 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding TWiki? Send feedback