Adding Quality of Service to the Grid with DIANE.

Jakub Moscicki, CERN/IT

Currently, the mainstream usage of Grids resembles a very large batch system: the goal is to maximize computational throughput over long periods of time. This fits many applications, in particular the large data productions of the LHC experiments: a production manager submits thousands of jobs to the system and expects the results to come out after several days. However, this model does not support other usage scenarios very well. For example, in interactive analysis the response of the system should be much faster and aligned with the interactive activity of the user. Life Science applications often involve short-deadline jobs: a very large number of very short jobs which must finish within a certain time limit. In general, Quality of Service (QoS) characteristics are not present in current Grid systems.

On the other hand, there is also an effect of scale and complexity. EGEE is the world's largest Grid system to date, comprising over 20000 worker nodes, 200 computing sites and petabytes of storage. Such an impressive enterprise, connecting heterogeneous computing environments and organizations, comes at a cost: from the end-user perspective, tracking down possible problems may be very time consuming and, at times, the system may exhibit lower efficiency.

User-level scheduling is a very lightweight software technique which allows adding new capabilities, and improving QoS characteristics and reliability, on top of existing Grid middleware and infrastructure. DIANE (DIstributed ANalysis Environment) is an R&D project started at CERN/IT in 2001. Initially, the goal was to investigate distributed ntuple analysis for particle physics. Over time, however, DIANE has become an application-independent user-level scheduling tool for the Grid, and it has been interfaced to a number of applications in High Energy Physics, Medical Physics, Life Sciences and other fields.

DIANE is a Python framework based on the Master/Worker processing model, which is used on top of regular Grid middleware in a transparent way. Worker agents are sent to the Grid as regular Grid jobs and they register with the Master agent by opening a TCP/IP connection. The Master agent runs on the user's desktop computer and is the coordination point for the virtual Worker pool. Workers may dynamically join and leave the pool without disrupting the processing as a whole. The processing is composed of a large number of short tasks, which are the units of computation. The Master allocates the tasks to Workers directly, bypassing the middleware scheduling layer. This reduces the total job turnaround time and makes it possible to react much faster to errors in task execution by reallocating the affected tasks to other Workers. Splitting the processing into a large number of fine-grained tasks improves the load balancing, ensuring efficient utilization of the Workers. As a result, the computing resources may be returned to the Grid faster: the Worker agents are automatically terminated when the processing reaches the end.
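The allocation scheme described above can be illustrated with a minimal single-process sketch. It is not DIANE's actual API: the names `run_master`, `square` and `broken` are hypothetical, the Workers are plain Python callables rather than remote agents connected over TCP/IP, and the policy of removing a Worker from the pool after its first failure is an assumption made to keep the example short.

```python
from collections import deque


def run_master(tasks, workers):
    """Allocate tasks directly to workers and reallocate tasks that fail.

    tasks   -- iterable of hashable task descriptions
    workers -- dict mapping a worker name to a callable that executes a task
    Returns (results, unfinished): results maps each completed task to
    (worker_name, value); unfinished lists tasks left over if every
    worker has dropped out of the pool.
    """
    pending = deque(tasks)      # tasks waiting for a worker
    pool = dict(workers)        # workers currently registered with the master
    results = {}

    while pending and pool:
        # Iterate over a snapshot so workers can be removed during the round.
        for name in list(pool):
            if not pending:
                break
            task = pending.popleft()
            try:
                results[task] = (name, pool[name](task))
            except Exception:
                # Assumed policy: put the task back for another worker
                # and drop the failing worker from the pool.
                pending.append(task)
                del pool[name]

    return results, list(pending)


def square(task):
    """A healthy worker: the 'computation' is just squaring the task number."""
    return task * task


def broken(task):
    """A worker whose node is faulty: every task it receives fails."""
    raise RuntimeError("worker node lost")


results, unfinished = run_master(
    range(6), {"w1": square, "w2": broken, "w3": square}
)
print(results, unfinished)
```

In this run the task first handed to the faulty worker `w2` is put back in the queue and later completed by a healthy worker, so the processing as a whole finishes even though one Worker left the pool.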

DIANE's Python framework makes it easy to quickly integrate existing applications, even ones as complex as Athena, the analysis framework of the ATLAS experiment. Studies performed by members of the ATLAS collaboration showed that it is possible to use DIANE to integrate local and Grid resources, and even resources which come from different Grid infrastructures at the same time. A DIANE-based parallel Athena prototype has been demonstrated at a number of EGEE conferences and has been included in the ATLAS Technical Design Report (TDR 2005). Additionally, DIANE has been interfaced to Ganga, a user-friendly Grid interface created in the context of the ATLAS and LHCb experiments at CERN. In the future, physicists using Ganga will be able to choose the DIANE optimizer, which will be attached transparently to their jobs.

The statistical regression testing, which is part of the Geant-4 release validation procedure, is operated on the EGEE Grid using the DIANE scheduler. This cuts the turnaround time by a factor of several and provides a more stable and predictable job output rate, because the Worker agents which have been acquired at the beginning of the processing are held inside the pool and are shielded from instabilities in the Grid brokering. A stable job output rate is an important QoS feature because it allows the testing operations on the Grid to be planned more reliably.

DIANE has recently been used to perform a sizeable fraction of an in silico drug discovery activity on the EGEE infrastructure. The challenge was to analyse possible drug components against the avian flu virus H5N1. This activity, which addresses a current and socially important problem, has been covered in a number of press releases worldwide, including by the BBC and Liberation. It has been demonstrated that a User Level Scheduler such as DIANE may improve the distribution efficiency on the Grid from below 40% to above 80% by optimizing the allocation of the fine-grained computing tasks. The automatic error recovery mechanisms proved effective over an extended period of continuous work: the part of the in silico drug search activity performed with DIANE lasted around 30 days.

Over the months of May and June 2006, CERN successfully supported a series of large-scale data processing activities carried out by the International Telecommunications Union (ITU) as part of the ITU's Regional Radiocommunication Conference (RRC-06). Several sites of the EGEE infrastructure provided a computing grid of more than 400 PCs to work on each analysis in parallel. The processing on the EGEE infrastructure was conducted using the DIANE scheduling layer. The system completed more than 200 thousand very short frequency analysis jobs (clustered in around 40 thousand processing tasks) in around one hour, proving that on-demand computing with short deadlines is possible on the Grid. The frequency allocation plan optimized with the help of the Grid allowed over 1000 delegates from 104 countries to adopt the treaty agreement that will replace the analog broadcasting plans existing since 1961 for Europe and since 1989 for Africa.
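To give a feeling for how short these tasks are, the figures quoted above can be turned into a rough per-task time budget. This is only a back-of-the-envelope estimate: it assumes exactly 400 Workers (the text says "more than 400"), a wall-clock time of exactly one hour, an even distribution of tasks and no idle time or scheduling overhead.

```python
# Figures quoted in the text (all approximate).
jobs = 200_000          # very short frequency analysis jobs
tasks = 40_000          # processing tasks the jobs were clustered into
workers = 400           # assumed pool size; the text says "more than 400"
wall_clock_s = 3600     # "around one hour"

jobs_per_task = jobs / tasks                        # jobs clustered per task
tasks_per_worker = tasks / workers                  # tasks each worker handles
seconds_per_task = wall_clock_s / tasks_per_worker  # average time budget per task
seconds_per_job = seconds_per_task / jobs_per_task  # average time budget per job

print(jobs_per_task, tasks_per_worker, seconds_per_task, seconds_per_job)
```

Under these assumptions each task has a budget of a few tens of seconds, which is short compared with the several-day turnaround of the batch-style productions described at the beginning, and which helps explain why direct task allocation by the Master matters for this kind of workload.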

In the future, closer integration with Ganga will enable access to all DIANE capabilities. Ongoing activities in the context of PhD research aim at supporting hard QoS requirements with novel techniques such as a floating worker pool, extending scalability beyond 500 worker agents, and supporting inter-dependent tasks for workflow applications.

-- JakubMoscicki - 26 Jul 2006

Topic revision: r2 - 2006-12-16 - JakubMoscicki