DILIGENT Data Challenges


Description

The goal of this data challenge is to execute feature extraction on images. The Image Feature Extraction tool is composed of a Java application, some Perl scripts and a C application. The Java code implements a client that contacts the Flickr database (http://www.flickr.com/), downloads a set of users (limited to 5 per interaction) and the images that these users share over the Web. The Perl script and the C application are the core of the Feature Extraction process: they extract features from the images, create thumbnails and (using a Java client) store the results on a cluster located at CNR.
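
As an illustration of that chain, the sketch below shows how a per-job driver of this kind could be put together in Perl. Every concrete name in it is an assumption: the extract_features binary stands in for the C application, ImageMagick's convert is used for the thumbnails, and org.diligent.StorageClient stands in for the Java storage client; none of these names are taken from the actual DILIGENT release.

    #!/usr/bin/perl
    # Hypothetical per-job driver: walks a directory of downloaded images,
    # extracts features, builds a thumbnail and ships both to storage.
    use strict;
    use warnings;

    my $input_dir  = shift @ARGV or die "usage: $0 <image-dir>\n";
    my $output_dir = "results";
    mkdir $output_dir unless -d $output_dir;

    opendir(my $dh, $input_dir) or die "cannot open $input_dir: $!\n";
    my @images = grep { /\.(jpe?g|png|gif)$/i } readdir $dh;
    closedir $dh;

    foreach my $img (@images) {
        my $src = "$input_dir/$img";

        # Feature extraction is delegated to the C application (binary name assumed)
        if (system("./extract_features", $src, "$output_dir/$img.features") != 0) {
            warn "feature extraction failed for $img\n";
            next;
        }

        # Thumbnail creation (ImageMagick's convert, used here purely as an example)
        system("convert", $src, "-resize", "128x128", "$output_dir/$img.thumb.jpg");

        # Java client that stores the results on the remote cluster (class name assumed)
        system("java", "-cp", "fe-client.jar", "org.diligent.StorageClient",
               "$output_dir/$img.features", "$output_dir/$img.thumb.jpg");
    }

In the data challenge each such job was handed a batch of 250, 500 or 1000 images, depending on the phase (see the Schedule section below).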

Since the application requires Java 1.5 (not supported on gLite nodes), we have to download the Java binaries before launching the application. However, this overhead is not very penalising, and the overall Feature Extraction rate has still improved significantly. The input files are stored on our servers, and the PPS support is limited to the computing power required to process this collection.
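
A minimal sketch of that bootstrap step, again in Perl, might look as follows; the tarball URL, directory name and jar name are placeholders standing in for wherever the project actually publishes its Java 1.5 runtime and application.

    #!/usr/bin/perl
    # Hypothetical bootstrap: fetch a private Java 1.5 runtime on the worker node,
    # unpack it and launch the application with it.
    use strict;
    use warnings;
    use Cwd qw(getcwd);

    my $jre_url     = "http://example.org/diligent/jre-1_5_0.tar.gz";  # placeholder URL
    my $jre_tarball = "jre-1_5_0.tar.gz";
    my $jre_dir     = "jre1.5.0";

    system("wget", "-q", "-O", $jre_tarball, $jre_url) == 0
        or die "could not download the Java runtime\n";
    system("tar", "xzf", $jre_tarball) == 0
        or die "could not unpack the Java runtime\n";

    # Point the environment at the unpacked runtime and start the application
    $ENV{JAVA_HOME} = getcwd() . "/$jre_dir";
    $ENV{PATH}      = "$ENV{JAVA_HOME}/bin:$ENV{PATH}";
    exec("java", "-jar", "feature-extraction.jar", @ARGV)
        or die "could not launch the application\n";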

The characteristics of the data challenge are:

  • 1000 jobs submitted per day (through 2 WMSs), although this can be increased or decreased as needed
  • Each job processes 1000 images and requires at most 50 MB of disk space and at least 512 MB of RAM (see the submission sketch after this list)
  • Jobs consume between 20 minutes and 1 hour of CPU time (depending on the CPU)
  • Sites do not need to install any particular libraries or other software
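
As a rough illustration of how such limits can be stated at submission time, the sketch below (kept in Perl for consistency with the other examples) writes a gLite JDL description for one job. The script and sandbox file names are placeholders, and the GLUE attributes in the Requirements expression (main memory in MB, maximum CPU time in minutes) are quoted as an assumption about how the constraints above would typically be encoded.

    #!/usr/bin/perl
    # Hypothetical helper: writes a gLite JDL description for one feature-extraction
    # job. File names are placeholders; the Requirements expression mirrors the
    # limits listed above (>= 512 MB of RAM, >= 60 minutes of CPU time).
    use strict;
    use warnings;

    my $job_id = shift(@ARGV) || 1;

    my @jdl = (
        'Executable    = "run_feature_extraction.pl";',
        qq[Arguments     = "images-$job_id.list";],
        qq[StdOutput     = "fe-$job_id.out";],
        qq[StdError      = "fe-$job_id.err";],
        qq[InputSandbox  = {"run_feature_extraction.pl", "images-$job_id.list"};],
        qq[OutputSandbox = {"fe-$job_id.out", "fe-$job_id.err"};],
        'Requirements  = other.GlueHostMainMemoryRAMSize >= 512 &&',
        '                other.GlueCEPolicyMaxCPUTime >= 60;',
    );

    open(my $fh, '>', "fe-job-$job_id.jdl") or die "cannot write JDL file: $!\n";
    print $fh map { "$_\n" } @jdl;
    close($fh);

Submitting on the order of a thousand such descriptions per day through the two WMSs would reproduce the load described above.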


Schedule

The total duration of the data challenge was 116 days, organized in 3 different phases:

  • Preparation: From 16 June to 15 July (30 days)
  • 1st phase: From 16 July to 29 July (14 days)
  • 2nd phase: From 30 July to 9 October (72 days)

During the preparation phase, only 3 PPS sites were used to test the feature extraction application, and each job contained 250 images to process. The 1st and 2nd phases correspond to the real execution of the data challenge, in which 10 PPS sites were exploited. In the 1st phase each job contained 500 images to process, whereas in the 2nd phase each job contained 1000 images.


Results

The following tables present statistics collected during the 116 days of the data challenge:

Number of jobs

              Submitted   Processed       %
Preparation        7500        5200   69.33
1st phase          7500        5000   66.67
2nd phase         51440       34133   66.35
Total             66440       44333   66.73

Number of input images to be processed

              Images per job    Expected   Processed       %
Preparation              250     1875000     1300000   69.33
1st phase                500     3750000     2500000   66.67
2nd phase               1000    51440000    33408603   64.95
Total                           57065000    37208603   65.20

Number of output products generated

              Generated
Preparation     3900000
1st phase       7500000
2nd phase     100225809
Total         111625809

The 4.55 TB of output products generated (three per processed image) contain approximately 150 million features.


Images processed by site

DC_site.png
Note: some CEs were not used during some periods of time

Images processed by day

DC_day.png


Daily Statistics

Daily statistics for each grid node can be found at:



-- PedroAndrade - 30 Jul 2007
