Chapter 5: Using the Computing Resources



5.1 Chapter Overview -- Getting Started

Complete: 5
Detailed Review status

Goals of this page:

This page is intended to provide you with an overview of the entire Chapter, pointing out which parts are required reading to get physics analysis done on the CMS distributed analysis infrastructure, and which are meant to provide intellectual stimulation and broader context.

Contents

Introduction

CMS uses a globally distributed computing system for data analysis. The present Chapter has two objectives:

  1. Provide you with all the information required to use the global system for physics data analysis.
  2. Provide you with background information, and context, so that you start gaining some appreciation of the complexity of this system.

Those who really don't care about how things work, and just want to get their analysis off the ground, may want to skip all the material provided in the interest of our second goal above. The present section makes this easy for you by providing guidance on what to skip. However, let us warn you upfront that eventually you will need that more detailed background knowledge in order to understand, and react to, the failures of the distributed system that you will invariably be exposed to while using it. The complexity of this global system guarantees that an educated and intelligent user will often be more effective in getting things done than somebody who knows nothing but the basics.

Roadmap for Chapter 5

As a new user, you should read the "must read" chapters in the order listed, as concepts introduced in one will often be used in the next. This is especially true for Chapters 5.4, 5.5, and 5.6.

  • Chapter 5.1 is a must read. It not only provides this roadmap, but also a discussion of the requirements to get started.
  • Chapter 5.2 "Grid Computing Context" can be skipped by the impatient. It provides a general introduction to "grid" computing terms.
  • Chapter 5.3 "Analysis Workflow" can be skipped, except for the very beginning of it. It explains how CRAB works under the hood, at least conceptually.
  • Chapter 5.4 "Locating Data" is a must read. It explains how to find the datasets to run on and how to pull a single file to your desktop, so you can try out your executable interactively and do the bulk of your debugging.
  • Chapter 5.5 "Data Quality Monitor" can be skipped initially. It explains how to refine the data-finding process to include Data Quality information.
  • Chapter 5.6 "Data Analysis with CRAB" is a must read. It explains how to use CRAB, the tool to use for doing data analysis on the globally distributed CMS data analysis infrastructure.
  • Chapter 5.7 "Data Analysis with CMS Connect" is a must read. It explains how to use CMS Connect, the service complementary to CRAB for running user-defined scripts via Condor, intended for late-stage data analysis that does not depend on cmsRun (the CMSSW executable), e.g. making histograms and plots, analyzing trees, etc.
  • Chapter 5.8 "Dashboard Job Monitor" is a must read. It explains how to monitor the status of your jobs.
  • Chapter 5.9 "The role of the T2s" can be skipped initially. It provides essential background to understand the disk space organization at T2s in CMS. As T2s are the places where the vast majority of data analysis in CMS takes place, it will eventually be vital for you to read this chapter carefully.
  • Chapter 5.10 "Transferring Data" can be skipped initially. Once you have read Chapter 5.9 on the role of the T2s, you will understand how disk space is managed, and can then graduate to using it in style. This Chapter explains how to request datasets to be moved to T2s and T3s. Anybody in CMS can make such requests.
  • Chapter 5.11 "Data Organization Explained" can be skipped initially. It explains a variety of terms that CMS uses to describe how data is organized and managed.
  • Chapter 5.12 "Processing by Physics Groups". It talks about the privileges of priority users and the conveners' responsibility for such features.
  • Chapter 5.13 "cmssh tutorial". A very useful tool to easily find your favorite data from the command line, copy files transparently without knowing the Physical File Name location, etc.

Basic requirements for using the Grid

The remainder of this page deals with the essentials you need before you can even start doing anything on the globally distributed CMS data analysis infrastructure.

Note that initial testing and workbook exercises can be done on an LXPLUS machine (or another machine, properly configured), but proper analysis jobs and Monte Carlo production should be submitted to the globally distributed CMS data analysis infrastructure. Note: We will sometimes use the word "Grid" as a synonym to "globally distributed CMS data analysis infrastructure" for obvious reasons of brevity.

The basic requirements for using the Grid resources are described in the following subsections.

Obtaining and installing your Certificate

To obtain your certificate and join the CMS VO, follow the steps on this page.
That same page also has pointers to troubleshooting help if needed.

Note that it can take a few days for the certificate to be issued. The CA will give you instructions on how to load your certificate into your browser.

To set up the certificate on the User Interface from which you will work, you should:

  • Export the certificate from your browser to a file in p12 format. How to export the certificate is very browser dependent. It will be something like Edit or Tools -> Preferences or (Internet) Options -> Advanced -> Security or Encryption -> View Certificates -> Your Certificates. In modern Firefox you should “backup” rather than “export” the certificate. You can find more instructions and hints for various browsers in this CERN CA help page. You can give any name to your p12 file (in the example below the name is mycert.p12).
  • Place the p12 certificate file in the .globus directory of your home area. If the .globus directory doesn't exist, create it.
      cd ~
      mkdir .globus
      cd ~/.globus
      mv /path/to/mycert.p12 .
  • Execute the following shell commands:
      rm -f usercert.pem
      rm -f userkey.pem
      openssl pkcs12 -in mycert.p12 -clcerts -nokeys -out usercert.pem
      openssl pkcs12 -in mycert.p12 -nocerts -out userkey.pem
      chmod 400 userkey.pem
      chmod 400 usercert.pem
  • For the openssl commands, you need to enter the same password that you chose when importing the certificate into your browser, and you will also be asked to "Enter PEM pass phrase". You may choose to keep it the same, to avoid password confusion.
  • Verify that it all works by executing (n.b. you may need to setup a grid UI to execute this command, see below):
      voms-proxy-init --rfc --voms cms
  • Ignore a (possible) message about not being able to find a .glite/vomses directory.
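  • To inspect the resulting proxy, for instance to confirm that the CMS VO attributes are present, you can use voms-proxy-info (the exact list of attributes shown depends on the VOMS configuration):
      voms-proxy-info --all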

Some CAs provide the usercert.pem and userkey.pem files and then the user has to produce the p12 file to be imported to the browser. To convert the usercert.pem and userkey.pem files into a browser certificate mycert.p12 do the following:

openssl pkcs12 -export -in usercert.pem -inkey userkey.pem -out mycert.p12 -name "my browser cert for 2014"

To do CMS analysis on WLCG Grid resources, you will further require:

  • A CMS analysis software environment setup on your local computer.
  • Some sample datasets with local access (on a hard disk or other mass data storage system) so you can test your analysis code interactively before submitting your jobs on the grid. These local datasets are frequently subsets of one of the main CMS datasets resulting from a first-pass analysis job (RECO or AOD).
  • To stage user data back to CERN with a non-CERN certificate you need to map it to your CERN account (not yet enforced).

All CMS members using the Grid may benefit from subscribing to the Grid Announcements CMS.HyperNews forum.

Connecting your certificate to your account

Certain steps in running a CMS Analysis with CRAB (e.g. publication of the output dataset in DBS) require that the user's DN is mapped to the user's account, in SiteDB. SiteDB will use your primary CERN computing account as username and by default will map it to the corresponding certificate issued by CERN. If you are using a grid certificate issued by a Certification Authority other than CERN CA, then read and follow the instructions in the SiteDB for CRAB page to make sure your certificate is correctly mapped to your account.

Using your grid certificate

Each day you wish to use xrootd, CRAB, CMS Connect, or similar technologies, you will need to authenticate your grid certificate with the command:
      voms-proxy-init --rfc --voms cms
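
By default the proxy is only valid for a limited number of hours. If you plan a longer working session you can request a longer validity (up to what the VOMS server allows; the 192-hour value below is only an example) and check how much time is left on an existing proxy:
      voms-proxy-init --rfc --voms cms --valid 192:00
      voms-proxy-info --timeleft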

Grid User Interface

The recommended way to submit jobs on the Grid is to use CRAB. It will allow you to access both EGEE and OSG Grid resources in a fully transparent way. For this a full gLite UI is not needed, although it will work. A minimal client, as distributed by OSG or pre-installed on lxplus6, will do.

Preinstalled

  • At CERN:
    • LXPLUS6 already has the grid commands needed for CRAB; no need to issue any setup command.
    • users on SLC5 machines can access an LCG UI by sourcing the file /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.(c)sh.
  • Other affiliated sites and institutions may provide a generally available gLite UI in a similar way (see WorkBookRemoteSiteSpecifics to look for information for your institution).

Install your Own

This is strongly not recommended. Installing and maintaining an up-to-date, functional, secure grid UI is expert work. If you need or want to install your own gLite UI, see CERN's document gLite 3.1 UI tarball distribution. Alternate instructions are available from USCMS at LCG User Interface (UI) installation.

Review status

Reviewer/Editor and Date (copy from screen) Comments
StefanoBelforte - 20 Sep 2014 review and update grid documentation, remove duplications
StefanoBelforte - 14 Sep 2014 update reference to CERN Grid CA page
StefanoBelforte - 20 Aug 2014 remove reference to gLite UI
JohnStupak - Mar 2013 review with minor changes
NitishDhingra - 28-Mar-2012 See detailed comments below
StefanoBelforte - 22-Dec-2009 Complete Expert Review, minor changes
FrankWuerthwein - 04-Dec-2009 Complete Reorganization 1st draft ready for review
AndreaSciaba - 30 Nov 2009 Minor corrections (removed or replaced broken links)
SimonMetson - 30 Apr 2009 Updated the link to request a certificate (after a question from a user; advice from Andrea Sciaba)
MattiaCinquilli - 24 Nov 2008 added explicit commands to setup the certificate
AndreaSciaba - 24 Jan 2008 review with updated links and minor changes
StefanoLacaprara - 16 Nov 2006 review with minor changes
AnneHeavey - 03 Aug 2006 fairly substantial edits to Grid info

Review with minor additions in the grid certificate set-up instructions. The page accomplishes its goal.

Responsible: StefanoBelforte
Last reviewed by: Main.David L Evans - fill in date when done -



5.2 Grid Computing Context

Complete: 5
Detailed Review status

Goals of this page:

This page will provide you with a general context within which the CMS distributed analysis infrastructure is placed. The buzzword here is "Grid Computing", and this page will give you a rudimentary introduction to its terms, to the extent they are relevant to CMS.

Contents

Introduction

To satisfy emerging IT needs in the scientific, industrial, governmental and commercial arenas, Grid computing has been conceived as an expansion of distributed computing. Grid computing involves the distribution of computing resources among geographically separated sites (creating a "grid" of resources), all of which are configured with specialized software for routing jobs, authenticating users, monitoring resources, and so on. Shared, site-based computing resources may include computing and/or storage nodes, software, data, a variety of scientific instruments, and so on.

Grid computing aims to provide reliable and secure access to widely scattered resources for authorized users located virtually anywhere in the world. When a user submits a job, the Grid software controls where the job gets sent for processing. Think of a Grid as a utility, much like the electrical utility grid. A company may buy electric power from a variety of physically separate sources, pool it, and distribute it to all its customers with high reliability. The customers do not need to know where their electricity originates, just that their wall sockets always work. In Grid computing, the end user does not need to know where particular resources reside, just that they are available with high reliability.

As regards CMS, it is virtually impossible to store all of the data in a single location, and to contain all of the CPU power at the same site for data storage, analysis and Monte Carlo production. For this reason, Grid technology is being used. A 3-level Tier structure of computing resources has been organized to handle the vast storage and computational requirements of the CMS experiment. A CMS physicist may use Grid tools to submit a CMS analysis job to a "Workload Management System" (WMS), and does not need to worry about the details such as location of data and available computing power, which are handled transparently.

Worldwide LHC Computing Grid Project (WLCG)

The mission of the Worldwide LHC Computing Grid (WLCG) project is to build and maintain a data storage and analysis infrastructure for the entire high energy physics community that will use the LHC. The WLCG project aims to collaborate and interoperate with other major Grid development projects and production environments around the world. As such, WLCG has developed relationships with regional computing centres, which serve as T1 centres. These centres exist in a number of different countries in Europe, North America and Asia. Each T1 centre is part of at least one of the Grids EGEE, OSG, NorduGrid, and potentially others, and provides sharable resources. These resources become accessible to CMS through WLCG User Interfaces, such that any CMS user can potentially use their facilities.

Enabling Grid for E-sciencE (EGEE)

EGEE is a project of the European Union which provides a world-wide Grid infrastructure for several scientific communities, including High Energy Physics and the LHC experiments. The vast majority of the WLCG sites outside the US are part of EGEE. EGEE provides not only the infrastructure, but also a complete Grid middleware stack, gLite.

Open Science Grid (OSG)

The Open Science Grid is a US Grid computing infrastructure that supports scientific computing via an open collaboration of science researchers, software developers and computing, storage and network providers. CMS researchers (US-based or not) working from a WLCG User Interface (UI) can access both EGEE and OSG Tier1 and Tier2 resources in a fully transparent way. More info at US CMS Grid Services and Open Science Grid.

NorduGrid

NorduGrid is a project by several countries (mostly in Northern Europe) to develop and operate a grid infrastructure.

Grid Security, Authentication and Authorization

Maintaining security within a Grid is very important. Grid user authentication is based on Digital Certificates (sometimes called "Grid certificates" or just "certificates"). A digital certificate is a specialized file, issued by a trusted authority, that is used to verify a user's identity on a computer and/or over a computer network (e.g., on a Grid). Authenticated users must obtain authorisation to use particular Grid resources.

Authorisation is provided by membership in a Virtual Organisation (VO). A VO is a group of individuals or institutions who share the computing resources of a Grid for a common goal, e.g., the CMS collaboration. The VO must be able to verifiably identify applicants and members (i.e., to trust the certificates), control which individuals join the organisation, control what they are allowed to do, and make its list of members available to the software that controls and monitors the Grid. The VO keeps the list of authorized users in a VOMS server, and when a valid certificate is presented by a user a special file called a "proxy" is created on the user's disk. The proxy has a limited validity (while certificates usually last one year) and is used to issue grid commands.

In an analogy with traveling, the certificate acts as your passport (thus providing authentication). Your VO "stamps a visa in your passport", thus saying "I know who you are and where you come from, and I authorize you to visit such-and-such places and to do such-and-such things." The proxy is the equivalent of a temporary stay permit that lets you do some of those things for a limited time. Proxy renewal is as important in grid computing as permit renewal is for migrant workers.

Review status

Reviewer/Editor and Date (copy from screen) Comments
JohnStupak March 2013 Review with minor changes
NitishDhingra - 28-Mar-2012 See detailed comments below
StefanoBelforte - 22-Jan-2010 Complete Expert Review, minor changes
FrankWuerthwein - 04-Dec-2009 Complete Reorganization finished
AnneHeavey - 03 Aug 2006 fairly substantial edits to Grid info

Complete review. No changes. The page accomplishes its goal.

Responsible: StefanoBelforte
Last reviewed by: StefanoBelforte - 28 Feb 2008

-- FrankWuerthwein - 04-Dec-2009



5.3 Data Analysis Work Flow

Complete: 5
Detailed Review status

Goals of this page

When you finish this page, you should understand:
  • The steps that you need to follow in order to run an analysis job on grid resources.
  • The basics about how CRAB (CMS Remote Analysis Builder) works.

This page does not teach you how to use CRAB. It only provides background material on how things work.

To learn how to use CRAB see Chapter "Analysis with CRAB".

Contents

Introduction

Data Analysis in CMS involves the following steps:

  • Developing an executable to run on data.
  • Testing that executable locally on your desktop/laptop/lxplus by running it on at least one file from the dataset you want to run on.
    • See Chapter "Locating Data" for details of how to find data, and pull a few files to your desktop/laptop.
  • Doing the actual data analysis with CRAB

CRAB is a Python program intended to simplify the process of creation and submission of CMS analysis jobs into a grid environment. You'll use it to run your jobs on the grid (LCG or OSG). The remainder of this page will explain what CRAB does under the hood. This is background information for you to better understand what you are doing when you use CRAB.

Workflow Illustration

The figure below shows the flow of user code, physics data, and job- and resource-related information throughout the course of an analysis job. Although this figure was drawn in 2006, it is still correct: the analysis workflow has not changed since the start of CMS. When reading it, simply translate:
  • DLS (Data Location Service) as: PhEDEx
  • RB (Resource Broker) as: Grid Scheduler, i.e. something that submits jobs to Grid resources; as of 2016 we use the HTCondor global pool

[Figure: workflow for an analysis job on the Grid]

Task Formulation by the user

The first steps are required for any analysis:

 # make your working directory 
   mkdir MYDEMOANALYZER
   cd MYDEMOANALYZER

# create a new project area
   cmsrel CMSSW_%LOCALWBRELEASE% 
   cd CMSSW_%LOCALWBRELEASE%/src/
   cmsenv

Write a Framework Module

First, create a subsystem area. The actual name used for the directory is not important, we'll use Demo. From the src directory, make and change to the Demo area:

mkdir Demo
cd Demo

Note that if you do not create the subsystem area and instead create your module directly under the src directory, your code will not compile. Create the "skeleton" of an EDAnalyzer module (see SWGuideSkeletonCodeGenerator for more information):

mkedanlzr DemoAnalyzer

Further steps for running parallel jobs on the Grid or on any batch processing system are:

  1. Determine how to split your job into "chunks" that can run in parallel and finish in a reasonable amount of time (e.g., a few hours).
  2. Create your CRAB configuration file, crab.cfg. In it, you tell CRAB where to get the code and the data, and how to split the job.
  3. Submit the job to the Grid via CRAB.
  4. Monitor your job, as needed.
  5. Collect your output, create your plots, and make discoveries!

Job Preparation by CRAB

Data Discovery

CRAB performs a query to the Dataset Bookkeeping System (DBS) to find the right data to access. To select the data, the user can go to the DAS search page and select the data he/she is interested in by using the query functionalities. The result of this query is a list of datasetpaths, in the form /PrimaryDataset/ProcessedDataset/DataTier/, such as /DY1JetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/Summer12-PU_S7_START52_V9-v1/AODSIM. This datasetpath should be written in the CRAB configuration file crabConfig.py.

On task creation (a task is the collection of identical jobs which are created and eventually submitted to analyze a given set of data; the only difference among the jobs in a task is the events each job processes, as determined by the splitting), CRAB queries DBS for the datasetpath and gets back the details of the dataset, such as number of events, number of files, number of events per file, etc. The result of the query is a list of event collections, grouped by the underlying file blocks to which the data correspond. Note that at this stage the tool doesn't need to know about the exact data location or about the physical structure of the event collections; this will only be needed further down in the workflow. Also note that the user does not need to know the location(s) of the data at all; this is dealt with internally by CRAB.
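In a CRAB3-style crabConfig.py the datasetpath is given as a single parameter, for example (a minimal sketch of the relevant line; the parameter name is the one used by the CRAB3 client, the dataset is the example above):

   config.Data.inputDataset = '/DY1JetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/Summer12-PU_S7_START52_V9-v1/AODSIM'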

Job splitting

At this stage, CRAB can decide how (and if) to split the complete set of event collections among several jobs, each of which will access a subset of the event collections in the selected dataset, according to user requirements. The splitting mechanism will take care to configure each job with the proper subset of data blocks and event collections. The user's crab.cfg file must specify the criteria by which the job splitting will take place (e.g., maximum number of events per job, maximum number of jobs, etc.). The actual splitting might not follow the user's request precisely, due to the physical placement of data in files; in any case the total number of events processed will be the one requested by the user.
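As an illustration, the splitting-related part of a CRAB2 crab.cfg might look like the following sketch (the numbers and the JSON file name are purely illustrative; a complete crab.cfg is shown in the CRAB tutorial later in this Chapter):

   [CMSSW]
   # MC data: split by events
   total_number_of_events = 100000
   number_of_jobs         = 50
   # real data: split by luminosity sections instead
   # lumis_per_job = 100
   # lumi_mask     = Cert_GoodRuns.json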

Job configuration

The Workload Management System (WMS) will create job configurations for every job which is to be submitted. There are in fact two levels of job configuration: the first for the CMS software framework, the second for the Grid WMS. The Grid one is entirely dealt with by CRAB, while the CMS software one is the one setup by the user and CRAB just modifies it in order to access data on the Grid.

Job submission

After the previous step, two configuration files exist for every job in the task:
  • one for the application framework, and
  • one for the Grid WMS.

At submission time, the submission tool will have information about data location, and will pass this information to the Grid Workload Management System (as of 2014 we only use HTCondor via glideinWMS), which in turn can decide where to submit, according to some resource availability metrics. The CMS WM tools will submit the jobs to the Grid WM System, as a "job cluster" if necessary for performance or control reasons, and will interact with the job bookkeeping system to allow the tracking of the submitted job(s). The submission can be direct (for a small task) or via a CRAB server, a CMS-specific layer between the user and the grid. In the latter case, the CRAB client, the one used by the user on the user interface, will pass the task specs to a CRAB server, which in turn will take care of submission to the grid WMS (or local scheduler) on behalf of the user. The server will manage the task, monitor the jobs and eventually retrieve the output. The user will interact with the server rather than directly with the grid.

Job scheduling

The Grid WM System is responsible for scheduling the jobs to run on specific Computing Elements (CE) and dispatching them to the CE.

Job run-time

Job run-time takes place on a Worker Node (WN) of a specific Computing Element (CE). The jobs arrive on the WN with an application configuration which is still site-independent. The CE/WN is expected to be configured such that the job can determine the locations of necessary site-local services (local file replica catalogue, CMS software installation on the CE, access to CMS conditions, etc.).

Job completion

Once the job completes, it must store its output someplace. For very small outputs, the outputs may just be returned to the submitter as part of the output sandbox. For larger outputs, the output can be stored on the local Storage Element (SE) for subsequent retrieval by the user: given the limitation on the size of the output sandbox, any output larger than a few MB has to be copied to a remote SE and not returned via the sandbox. The job's only obligation is to either successfully store the outputs to the local SE or pass them to the data transfer agent. It is assumed that the Grid WM System will handle the task of making the output sandbox, log files, etc., available to the user.

Task monitoring

While processing is in progress, the user can monitor the progress of the jobs constituting his or her task by using the job bookkeeping and monitoring system (crab status). Additional information about the task status (including its history) can be found on the Dashboard.
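With the CRAB3 client the basic monitoring command looks like this (a sketch; the project directory is whatever directory CRAB created for your task, the name below is illustrative):

   crab status -d crab_projects/crab_MyAnalysis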

Task completion

As individual jobs finish (or after the entire set of jobs in the task has finished) the user will find the resulting output data coalesced to the destination specified during the "job completion" step, above. A list of the runs and luminosity sections read in input is also available to determine the luminosity this analysis corresponds to. If the user wishes to publish this data, the relevant provenance information must be extracted from the job bookkeeping system, etc., and published in DBS.
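
With the CRAB3 client the run/lumi summary mentioned above can be obtained with the report command (again a sketch; the project directory name is illustrative):

   crab report -d crab_projects/crab_MyAnalysis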

These pieces thus constitute a basic workflow using the CMS and Grid systems and services. The CMS WM tools are responsible for orchestrating the interactions with all necessary systems and services to accomplish each specified task.

Information Sources


Review status

Reviewer/Editor and Date (copy from screen) Comments

StefanoBelforte - 2017-07-04 Slightly update to make it valid in CRAB3 + HTCondor world
JohnStupak - 4-June-2013 Minor revisions and update to 5_3_5
NitishDhingra - 29-Mar-2012 See detailed comments below
StefanoBelforte - 11-Nov-2010 Add information on luminosity of results
StefanoBelforte - 22-Jan-2010 Complete Expert Review, no changes
FrankWuerthwein - 06-Dec-2009 Complete Reorganization 1st draft ready for review
SimonMetson - 28 Feb 2008 review: Updated DBS Discovery link. In the (near) future this page should be updated to refer to the CRAB server
StefanoLacaprara - 1 Feb 2008 review: fill uptodate information plus add link to DBS and dashboard
StefanoLacaprara - 16 Nov 2006 review: minor mods and add comments about what is not yet possible with CRAB
AnneHeavey - 23 Jun 2006 Significant editing; move this from Intro down to Using the Grid

Complete review. Added information on the deprecation of DBS, added link to DAS, fixed broken links. The information on the page is quite clear.

Responsible: DaveEvans
Last reviewed by: SimonMetson - 28 Feb 2008



5.4 Locating Data Samples

Complete: 5
Detailed Review status

Goals of this page:

This page describes how to find collision data and Monte Carlo (MC) samples. In particular, you will learn:

  • How to use the Data Aggregation System (DAS) to locate your samples.
  • How to transfer a few files to your desktop/laptop so you can test your executable interactively.
  • Where to find information on what collision data and MC samples exist, and are most recent.

Contents

How to find samples with DAS Interface

All published samples, whether official or unofficial, are searchable through the DAS web interface. In order to open it, you need to have a valid grid certificate installed in your browser.

Open the DAS web page.

As you will see, the DAS interface is quite simple:

[Figure: the DAS web interface (DAS_Interface_v2.png)]

To perform your search, you need to know the DAS query language. For an explanation and examples you can either refer to the DAS FAQ page or to the DAS documentation guide. Essentially, the query for a specific data set should be of the form:

dataset=/PrimaryDataset/ProcessedDataset/DataTier/
For example, to find Z->ee samples your query will look like:
dataset=/*Zee*/*/*
To find a specific file for a known dataset you will use:
file dataset=/PrimaryDataset/ProcessedDataset/DataTier/
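
A few other queries that often come in handy (the dataset name is the example used elsewhere in this Chapter):
site dataset=/DY1JetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/Summer12-PU_S7_START52_V9-v1/AODSIM
lists the sites currently hosting the dataset, and
run dataset=/DY1JetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/Summer12-PU_S7_START52_V9-v1/AODSIM
lists the runs it contains.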

Command line interface for DAS (das client)

You can copy/paste or download the DAS client script from the DAS web page using the "CLI" link in the upper menu. The DAS CLI is a simple Python script (you need Python 2.7 or above) and its usage is quite simple. Before using it, you need to create a proxy with the command voms-proxy-init -voms cms -rfc. Assuming you saved it as das_client.py:

python das_client.py --help

Usage: das_client.py [options]
For more help please visit https://cmsweb.cern.ch/das/faq

Options:
  -h, --help            show this help message and exit
  -v VERBOSE, --verbose=VERBOSE
                        verbose output
  --query=QUERY         specify query for your request
  --host=HOST           host name of DAS cache server, default is
                        https://cmsweb.cern.ch
  --idx=IDX             start index for returned result set, aka pagination,
                        use w/ limit (default is 0)
  --limit=LIMIT         number of returned results (default is 10), use
                        --limit=0 to show all results
  --format=FORMAT       specify return data format (json or plain), default
                        plain.
  --threshold=THRESHOLD
                        query waiting threshold in sec, default is 5 minutes
  --key=CKEY            specify private key file name, default
                        $X509_USER_PROXY
  --cert=CERT           specify private certificate file name, default
                        $X509_USER_PROXY
  --capath=CAPATH       specify CA path, default currently is /etc/grid-
                        security/certificates
  --retry=RETRY         specify number of retries upon busy DAS server message
  --das-headers         show DAS headers in JSON format (obsolete, keep for
                        backward compatibility)
  --base=BASE           specify power base for size_format, default is 10 (can
                        be 2)
  --cache=CACHE         a file which contains a cached json dictionary for
                        query -> files mapping
  --query-cache=QCACHE  a query cache value
  --list-attributes=KEYS_ATTRS
                        List DAS key/attributes, use "all" or specific DAS key
                        value, e.g. site
Thus, if you want to run your query you'll type (we'll use the same query example as shown above):
python das_client.py --query="dataset=/*Zee*/*/*"

To specify the dbs instance (that in the web interface would be selected with the drop-down menu), include it in the query command, for example:

python das_client.py --query="dataset=/*Zee*/*/* instance=prod/phys03"

Suggestion: one can store the das_client.py python script somewhere in the home directory and make an alias to it. For example, on lxplus:

alias dasCLI 'python /afs/cern.ch/user/x/xyz/tools/das_client.py'

and then run from anywhere as:

dasCLI --query="dataset=/*Zee*/*/* instance=prod/phys03" 
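
For instance, to dump the complete list of files of a dataset (the --limit=0 option shown in the help above removes the default cap of 10 results):

dasCLI --limit=0 --query="file dataset=/DY1JetsToLL_M-10To50_TuneZ2Star_8TeV-madgraph/Summer12-PU_S7_START52_V9-v1/AODSIM"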

Using DBS python client

DAS is a tool which aggregates data from several sources: DBS, PhEDEx, ReqMgr, SiteDB etc. However, not all details of the information stored in those databases are available, nor are the queries as efficient as asking one of those services directly. Therefore DAS should be the first choice when looking for dataset information, but sophisticated users who find the details or the performance inadequate for their needs can query DBS directly via its python client API. Instructions, examples and guidelines are in this twiki.

Accessing Remote Samples For interactive testing

The ability to access remote files (i.e. located at some Tier2) of various samples is essential to users for interactive testing and debugging. A remote file can be either copied to a local space (e.g. desktop/laptop) or directly opened inside cmsRun, using the Xrootd Service. Please refer to the dedicated chapter in this workbook: Using Xrootd Service (AAA) for Remote Data Access.
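
As a minimal sketch (the logical file name below is purely illustrative; cms-xrd-global.cern.ch is the commonly used global redirector, see the dedicated chapter linked above for details):

# copy one file to the local disk for interactive tests
xrdcp root://cms-xrd-global.cern.ch//store/mc/SomeCampaign/SomeSample/AODSIM/file.root .

or open the file directly in a cmsRun configuration:

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring('root://cms-xrd-global.cern.ch//store/mc/SomeCampaign/SomeSample/AODSIM/file.root'))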

Finding existing MC samples for various physics processes

A list of MC samples requested in the latest production campaign can be found at the MC co-ordination twiki. The collision data (MINI)AOD from the last year and all (MINI)AODSIM samples from the past couple of production campaigns will always be available at some disk site. If a sample is popular in CRAB, the Dynamic Data Management (DDM) team at CMS distributes additional replicas of such hot samples via an automatic procedure, which also removes extra copies of unused datasets.

As an example, to search for samples corresponding to the RunII Summer16 MINIAODSIM production campaign, one can use a query like:

dataset=/*/RunIISummer16DR80X*/MINIAODSIM

Availability of Samples

The recent collision data and MC samples are always on the disk at some site. If something is not found on disk or is already archived to tape, one can file a ticket at the JIRA link.

Release Validation (CMS.RelVal) samples

As new releases are integrated, and readied for large scale MC production, or data reprocessing, CMS goes through a process referred to as "Release Validation" (CMS.RelVal). As part of that, the Data Operations team makes a variety of samples with that release at small scale. These CMS.RelVal samples are often your best opportunity to develop analysis code for a new release, as they are the first to appear.

  • There are CMS.RelVal samples for all major releases.
  • These samples have been produced to validate CMSSW pre-releases and releases and the production workflow.
  • In general you should run on these with the release with which they were produced (in particular for K_L_M_preX releases)
  • One can easily find CMS.RelVal samples using the DAS interface. For example, to find ttbar samples your query will look like:
dataset=/*RelValTTbar*/*/*


Review status

Reviewer/Editor and Date (copy from screen) Comments
NitishDhingra - 2017-08-30 Revision with updated information on DAS interface, CLI, RelVal samples. Some modifications in subsection structure.
StefanoBelforte - 2015-08-19 point to DDM for data distribution
JohnStupak - 4-June-2013 Minor revisions
NitishDhingra - 01-Apr-2012 See detailed comments below
StefanoBelforte - 29-Jan-2010 Complete Expert Review, minor changes
FrankWuerthwein - 06-Dec-2009 complete reorganization
SudhirMalik- 4 Nov 2009 updated examples to CMSSW_3_3_1, updated DBS snapshots
KatiLassilaPerini - 28 Feb 2008 removed the LPC samples
CMSUserSupport - 05 Sep 2007 added CSA07 samples from Filip Moortgat's presentation August07 Physics Days
AlessandraFanfani - 21 Jun 2007 updated DM concepts and Data discovery examples
KatiLassilaPerini - 17 Apr 2007 updated DBS description, added a simple example search
JennyWilliams - 15 Sep 2006 Slight editing, added some comments
AnneHeavey - 30 Aug 2006 Created new page; info from Peter Elmer

Complete review, Some broken links have been fixed. The page provides complete information regarding data samples finding using DAS for physics analysis.

Responsible: StefanoBelforte
Last reviewed by: StefanoBelforte -

-- FrankWuerthwein - 04-Dec-2009




5.6 Data Analysis with CRAB

Complete: 3


Detailed Review status

Introduction and Editorial Note

This Workbook Chapter reproduces text from the CRAB guide, SWGuideCrab, and other SWGuide pages linked there. Text from the SWGuide twikis is included (rather than just linked) for easier reading, and is slightly reorganized; any change there will be reflected here. There should be no need to edit this twiki page to update instructions.

CRAB is a utility to submit CMSSW jobs to distributed computing resources. By using CRAB you will be able to:

  • Access CMS data and Monte-Carlo which are distributed to CMS aligned centres worldwide.
  • Exploit the CPU and storage resources at CMS aligned centres.

Prerequisites

To use CRAB to submit your CMSSW job to the Grid you must meet some prerequisites:

Get a Grid certificate and the registration to CMS VO

CRAB submits jobs to the Grid (LCG), so you need to run it from a User Interface, with a valid certificate issued by your appropriate Certification Authority, and have a valid proxy. You also need to be registered in the VOMS server. To get a certificate from the CERN CA and register to the CMS VO, you can find detailed instructions in the SWGuideLcgAccess page. If you get a certificate from another Certification Authority, the procedure to register to the CMS VO with your certificate should be the same.

Setup your certificate for LCG

See instructions in this Offline Workbook page

Test your grid certificate

  1. Is your personal certificate able to generate Grid proxies? To find out, after having set up your environment, run this command:
    grid-proxy-init -debug -verify 
    In case of failure, the possible causes are:
    • the certificate/key pair is not installed in $HOME/.globus/usercert.pem $HOME/.globus/userkey.pem (a.k.a. "pem files")
    • the certificate has expired
    • the certificate and the private key do not match
    In the first case, you either do not have a certificate at all or have to install it on the UI; in the second case, you should get a new certificate; in the third case you probably have incorrectly installed your certificate.
  2. Are you a member of the CMS VO? To see if this is the case, you can execute this command:
    voms-proxy-init -voms cms 
    If you get an error, chances are that you did not register to the CMS VO, or your registration expired. In this case, please follow the instructions in the SWGuideLcgAccess page
  3. You can verify the expiration date of your certificate with:
    openssl x509  -subject -dates -noout  -in $HOME/.globus/usercert.pem 
  4. see also: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SWGuideVomsFAQ

Test the code locally

Before launching a million-event analysis job on the Grid, be sure to test your code locally in a clean area.

  1. Build a new CMSSW area (for example, CMSSW_7_4_15_...; pick as appropriate to your job):
    cmsrel CMSSW_7_4_15
    cd CMSSW_7_4_15/src
    cmsenv
  2. Check-out from the cvs repository only the code or configuration files you need to modify, and build your local libraries including your analysis code.
  3. Make sure that the code you check-out is compatible with the CMSSW version you are using.
  4. Make sure that the CMSSW version you are using is compatible with the data you intend to read.
  5. Prepare a test job accessing the data you will access in your Grid job. There are several ways to read the proper data:
    • The easiest way is to use the xrootd service to read data directly from a remote site. How to do this is explained in Using Xrootd Service for remote Data Accessing.
    • You can also use the xrootd service to copy a data file from a suitable dataset to your local machine (to work w/o network e.g.), as explained in File download with command-line tools.
    • If no suitable files exist, you can generate some events using the configuration file which is available from the DAS service.
  6. Test your CMSSW configuration file locally in order to avoid problems with the ParameterSet parsing.
  7. Run the job interactively (e.g. at CERN on lxplus):
    cmsRun your-pset-config-file.py 

Validate a CMSSW config file

In CRAB2, a user can validate their CMSSW configuration file by launching crab -validateCfg after creating the task with crab -create. In this way the configuration file will be checked and validated by a corresponding python API. Note that it is not enough to check that the configuration file runs interactively, because in interactive mode CMSSW is too tolerant of python errors in that configuration file. At times a user may worry that the problem is in CRAB or CRAB validation rather than in the configuration file; in this case, one can use the following test, which does not involve CRAB:

edmConfigHash your-pset-config-file.py
Note that this is needed, but not necessarily sufficient, to have a valid CMSSW configuration file. Other problems could be related to hidden characters (^M) in the configuration file, especially if it was downloaded from the web. To discover them you can use the command
cat -v your-pset-config-file.py
and remove them with the command
perl -pi -e 'tr/\cM//d;' your-pset-config-file.py
Then you can revalidate the configuration file again.

Use CRAB at CERN

please see SWGuideCrab, in particular : https://twiki.cern.ch/twiki/bin/view/CMSPublic/CRAB3CheatSheet#Environment_setup

Use CRAB outside CERN

Preferred way: use CVMFS

  1. Setup the Grid UI according to your site directions
  2. Follow same instructions as at CERN (CVMFS is globally available)

Basic Crab Commands

Please see SWGuideCrab

Common operations with CRAB

Please see SWGuideCrab

Return results locally

Please see SWGuideCrab

Copy results to a Storage Element

Please see SWGuideCrab

Publish copied results in a Storage Element to a DBS instance

Please see SWGuideCrab

Analyse published results

Please see SWGuideCrab

Review status

Reviewer/Editor and Date (copy from screen) Comments
StefanoBelforte - 2016-06-01 remove references to CRAB2 documentation, point to CRAB3 guide
StefanoBelforte - 30-Aug-2013 overstrike crab server
JohnStupak - 4-June-2013 Review and minor revisions
NitishDhingra - 02-Mar-2012 See detailed comments below
DaveEvans - 10 March 2010 Update stageout examples

Complete Review, no changes. The information on page is quite clear.

Responsible: StefanoBelforte
Last reviewed by: Review Me



5.6.1 Running CMSSW code on the Grid using CRAB2

(for CRAB3 tutorial please click HERE )

Complete: 5
Detailed Review status

WARNING

  • You should always use latest production CRAB version
  • This tutorial is outdated since it was prepared for a live lesson at a specific time and thus refers to a particular dataset and CMSSW version that may not be available when you read this (and at the site where you try it).
    • as of 2014 you should be able to kickstart your Crab work using CMSSW 5_3_11 and the dataset /GenericTTbar/HC-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO as MC data and /SingleMu/Run2012B-13Jul2012-v1/AOD as real data.

Contents:

Prerequisites to run the tutorial

  • to have a valid Grid certificate
  • to be registered to the CMS virtual organization
  • to be registered to the siteDB
  • to have access to lxplus machines or to an SLC5 User Interface

Recipe for the tutorial

For this tutorial we will refer to CMS software:

  • CMSSW_5_3_11

and we will use an already prepared CMSSW analysis code to analyze the sample:

We will use the central installation of CRAB available at CERN:

  • CRAB_2_9_1

The example is written to use the csh shell family. If you want to use the Bourne Shell replace csh with sh.


Setup local Environment and prepare user analysis code

In order to submit jobs to the Grid, you must have access to an LCG User Interface (LCG UI). It will allow you to access WLCG-affiliated resources in a fully transparent way. LXPLUS users can get an LCG UI via AFS by:

source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.csh

Install a CMSSW project in a directory of your choice. In this case we first create a "TEST" directory:

mkdir TEST
cd TEST
cmsrel CMSSW_5_3_11
#cmsrel is an alias of scramv1 project CMSSW CMSSW_5_3_11
cd CMSSW_5_3_11/src/ 
cmsenv
#cmsenv is an alias for scramv1 runtime -csh

For this tutorial we are going to use as CMSSW configuration file, the tutorial.py:

import FWCore.ParameterSet.Config as cms
process = cms.Process('Slurp')

process.source = cms.Source("PoolSource", fileNames = cms.untracked.vstring())
process.maxEvents = cms.untracked.PSet( input       = cms.untracked.int32(10) )
process.options   = cms.untracked.PSet( wantSummary = cms.untracked.bool(True) )

process.output = cms.OutputModule("PoolOutputModule",
    outputCommands = cms.untracked.vstring("drop *", "keep recoTracks_*_*_*"),
    fileName = cms.untracked.string('outfile.root'),
)
process.out_step = cms.EndPath(process.output)

CRAB setup

Setup on lxplus:

In order to set up and use CRAB from any directory, source the script crab.(c)sh located in /afs/cern.ch/cms/ccs/wm/scripts/Crab/, which always points to the latest version of CRAB. After sourcing the script it is possible to use CRAB from any directory (typically you will use it in your CMSSW working directory).

source /afs/cern.ch/cms/ccs/wm/scripts/Crab/crab.csh

Warning: in order to have the correct environment, the environment files must always be sourced in this order:

  • source of UI env
  • setup of CMSSW software
  • source of CRAB env

Locate the dataset and prepare CRAB submission

In order to run our analysis over a whole dataset, we first have to find the dataset name and then put it in the crab.cfg configuration file.

Data selection

To select the data you want to access, use the Data Aggregation Service (DAS) web page, where the available datasets are listed. For this tutorial we'll use:

/RelValProdTTbar/JobRobot-MC_3XY_V24_JobRobot-v1/GEN-SIM-DIGI-RECO
 (MC data)
  • Beware: dataset availability at sites changes with time; if you are trying to follow this tutorial after the date it was given, you may need to use another one

CRAB configuration

Modify the CRAB configuration file crab.cfg according to your needs: a fully documented template is available at $CRABPATH/full_crab.cfg; a template with essential parameters is available at $CRABPATH/crab.cfg. The default name of the configuration file is crab.cfg, but you can rename it as you want.

Copy one of these files in your local area.

For guidance, see the list and description of configuration parameters in the on-line CRAB manual. For this tutorial, the only relevant sections of the file are [CRAB], [CMSSW] and [USER] .

Configuration parameters

The list of the main parameters you need to specify on your crab.cfg:
  • pset: the CMSSW configuration file name;
  • output_file: the output file name produced by your pset; if in the CMSSW pset the output is defined in TFileService, the file is automatically handled by CRAB, and there is no need to specify it on this parameter;
  • datasetpath: the full dataset name you want to analyze;
  • Jobs splitting:
    • By event: only for MC data. You need to specify 2 of these parameters: total_number_of_events, number_of_jobs, events_per_job
      • specify the total_number_of_events and the number_of_jobs: this will assign to each job a number of events equal to total_number_of_events/number_of_jobs
      • specify the total_number_of_events and the events_per_job: this will assign to each job events_per_job events and will calculate the number of jobs by total_number_of_events/events_per_job;
      • or you can specify the number_of_jobs and the events_per_job;
    • By lumi: real data require it. You need to specify 2 of these parameters: total_number_of_lumis, lumis_per_job, number_of_jobs
      • because jobs in split-by-lumi mode process entire files rather than partial files, you will often end up with fewer jobs processing more lumis than expected. Additionally, a single job cannot analyze files from multiple blocks in DBS. So these parameters are "advice" to CRAB rather than determinative.
      • specify the lumis_per_job and the number_of_jobs: the total number of lumis processed will be number_of_jobs x lumis_per_job
      • or you can specify the total_number_of_lumis and the number_of_jobs
      • lumi_mask: the filename of a JSON file that describes which runs and lumis to process. CRAB will skip luminosity blocks not listed in the file.
  • return_data: this can be 0 or 1; if it is one you will retrieve your output files to your local working area;
  • copy_data: this can be 0 or 1; if it is one you will copy your output files to a remote Storage Element;
  • local_stage_out: this can be 0 or 1; if this is one your produced output is copied to the closeSE in the case of failure of the copy to the SE specified in your crab.cfg
  • publish_data: this can be 0 or 1; if it is one you can publish your produced data to a local DBS;
  • scheduler: the name of the scheduler you want to use;
  • jobtype: the type of the jobs.

Run CRAB on MonteCarlo data copying the output to a Storage Element

The possibility to copy the output to an existing Storage Element allows you to bypass the output size limit, to publish the data in a local DBS and then to easily re-run over the published data. In order to make CRAB copy output to a Storage Element you have to add the following information to the CRAB configuration file:
  • that we want to copy our results adding copy_data=1 and return_data=0 (it is not allowed to have both at 1);
  • add the official CMS site name where we are going to copy our results; the name of official CMS sites can be found in the siteDB

CRAB configuration file for MonteCarlo data

You can find more details on this at the corresponding link on the CRAB FAQ page.

The CRAB configuration file (default name crab.cfg) should be located at the same location as the CMSSW parameter-set to be used by CRAB with the following content:

[CMSSW]
total_number_of_events  = 10
number_of_jobs          = 5
pset                    = tutorial.py
datasetpath             =  /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO

output_file              = outfile.root

[USER]
return_data             = 0
copy_data               = 1
storage_element        = T2_xx_yyyy (to change with the CMS name of site where you can write outputs)
user_remote_dir         = TutGridSchool

[CRAB]
scheduler = remoteGlidein
jobtype                 = cmssw

Run Crab

Once your crab.cfg is ready and the whole underlying environment is set up, you can start running CRAB. CRAB supports command-line help which can be useful the first time. You can get it via:
crab -h

Job Creation

The job creation checks the availability of the selected dataset and prepares all the jobs for submission according to the selected job splitting specified in the crab.cfg

  • By default the creation process creates a CRAB project directory (default: crab_0_date_time) in the current working directory, where the related crab configuration file is cached for further usage, avoiding interference with other (already created) projects

  • Using the [USER] ui_working_dir parameter in the configuration file, CRAB allows the user to choose the project name, so that it can be used later to distinguish multiple CRAB projects in the same directory.

crab -create  
which by default takes the configuration file called crab.cfg, associated for this tutorial with MC data.

The creation command could ask for proxy/myproxy passwords the first time you use it, and it should produce screen output similar to:

 
$ crab -create
crab:  Version 2.9.1 running on Fri Oct 11 15:33:18 2013 CET (13:33:18 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Contacting Data Discovery Services ...
crab:  Accessing DBS at: http://cmsdbsprod.cern.ch/cms_dbs_prod_global/servlet/DBSServlet
crab:  Requested dataset: /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO has 9513 events in 1 blocks.

crab:  SE black list applied to data location: ['srm-cms.cern.ch', 'srm-cms.gridpp.rl.ac.uk', 'T1_DE', 'T1_ES', 'T1_FR', 'T1_IT', 'T1_RU', 'T1_TW', 'cmsdca2.fnal.gov', 'T3_US_Vanderbilt_EC2']
crab:  May not create the exact number_of_jobs requested.
crab:  5 job(s) can run on 10 events.

crab:  List of jobs and available destination sites:

Block     1: jobs                  1-5: sites: T2_CH_CERN, T1_US_FNAL_MSS

crab:  Checking remote location
crab:  Creating 5 jobs, please wait...
crab:  Total of 5 jobs created.

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/log/crab.log

* the project directory called crab_0_131011_153317 is created

Job Submission

With the submission command it is possible to specify a combination of jobs and job-ranges separated by commas (e.g. 1,2,3-4); the default is all. To submit all jobs of the last created project with the default name, it is enough to execute the following command:

crab -submit 
to submit a specific project:
crab -submit -c  <dir name>

which should produce screen output similar to:

 
$ crab -submit
crab:  Version 2.9.1 running on Fri Oct 11 15:33:34 2013 CET (13:33:34 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Checking available resources...
crab:  Found  compatible site(s) for job 1
crab:  1 blocks of jobs will be submitted
crab:  remotehost from Avail.List = vocms83.cern.ch
crab:  contacting remote host vocms83.cern.ch
crab:  Establishing gsissh ControlPath. Wait 2 sec ...
crab:  Establishing gsissh ControlPath. Wait 2 sec ...
crab:  Establishing gsissh ControlPath. Wait 2 sec ...
crab:  COPY FILES TO REMOTE HOST
crab:  SUBMIT TO REMOTE GLIDEIN FRONTEND
                                                                      Submitting 5 jobs                                                                       
100% [====================================================================================================================================================]
                                                                         please wait                                                                          crab:  Total of 5 jobs submitted.
Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/log/crab.log

Job Status Check

Check the status of the jobs in the latest CRAB project with the following command:
crab -status 
to check a specific project:
crab -status -c  <dir name>

which should produce screen output similar to:

$ crab -status
crab:  Version 2.9.1 running on Fri Oct 11 15:42:49 2013 CET (13:42:49 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Checking the status of all jobs: please wait
crab:  contacting remote host vocms83.cern.ch
crab:  
ID    END STATUS            ACTION       ExeExitCode JobExitCode E_HOST
----- --- ----------------- ------------  ---------- ----------- ---------
1     N   Running           SubSuccess                           cmsosgce.fnal.gov
2     N   Running           SubSuccess                           cmsosgce.fnal.gov
3     N   Running           SubSuccess                           cmsosgce.fnal.gov
4     N   Running           SubSuccess                           cmsosgce.fnal.gov
5     N   Running           SubSuccess                           cmsosgce.fnal.gov

crab:   5 Total Jobs 
 >>>>>>>>> 5 Jobs Running 
        List of jobs Running: 1-5 

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=fanzago_crab_0_131011_153317_hg41w0
        Your task name is: fanzago_crab_0_131011_153317_hg41w0 

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/log/crab.log

Job Output Retrieval

For the jobs which are in the "Done" status it is possible to retrieve the log files of the jobs (just the log files, because the output files are copied to the Storage Element associated with the T2 specified in the crab.cfg; in fact return_data is 0). The following command retrieves the log files of all "Done" jobs of the last created CRAB project:
crab -getoutput 
to get the output of a specific project:
crab -getoutput -c  <dir name>

The job results (CMSSW_n.stdout, CMSSW_n.stderr and crab_fjr_n.xml) will be copied into the res subdirectory of your CRAB project:

$ crab -get
crab:  Version 2.9.1 running on Fri Oct 11 16:17:23 2013 CET (14:17:23 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  contacting remote host vocms83.cern.ch
crab:  Preparing to rsync 2 files
crab:  Results of Jobs # 1 are in /afs/cern.ch/user/f/fanzago/scratch0/TEST_RELEASE/TEST_PATC2/TEST_2_8_2/TUTORIAL/TUT_5_3_11/SLC6/crab_0_131011_153317/res/
crab:  contacting remote host vocms83.cern.ch
crab:  Preparing to rsync 8 files
crab:  Results of Jobs # 2 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/crab_0_131011_153317/res/
crab:  Results of Jobs # 3 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/crab_0_131011_153317/res/
crab:  Results of Jobs # 4 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/crab_0_131011_153317/res/
crab:  Results of Jobs # 5 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/crab_0_131011_153317/res/
Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/crab_0_131011_153317/log/crab.log

The stderr is an empty file, the stdout is the output of the wrapper of your analysis code (the output of the CMSSW.sh script created by CRAB), and the crab_fjr.xml is the FrameworkJobReport created by your analysis code.

Use the -report option

Print a short report about the task, namely the total number of events and files processed/requested/available, the name of the dataset path, a summary of the status of the jobs, and so on. A summary file of the runs and luminosity sections processed is written to res/. In principle -report should generate all the info needed for an analysis. Command to execute:

crab -report
Example of execution:

$ crab -report
crab:  Version 2.9.1 running on Fri Oct 11 17:02:17 2013 CET (15:02:17 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  --------------------
Dataset: /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO
Remote output :
SE: T2_CH_CERN srm-eoscms.cern.ch  srmPath: srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/user/fanzago/TutGridSchool_test/
Total Events read: 10
Total Files read: 5
Total Jobs : 5
Luminosity section summary file: /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/res/lumiSummary.json
   # Jobs: Retrieved:5

----------------------------

crab:  The summary file inputLumiSummaryOfTask.json about input run and lumi isn't created
crab:  No json file to compare

The message "The summary file inputLumiSummaryOfTask.json about input run and lumi isn't created" isn't an error but a message that means input data didn't provide lumi section info, as expected for the MC data.

The full srm path tells you where your data has been stored and allows you to perform operations on it by hand. As an example, you can delete the data using the srmrm command and check the content of the remote directory with srmls. In this case the remote directory is:

srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/user/fanzago/TutGridSchool_test

Depending on the shell you are using, it may be necessary to escape or quote the "?" in the srm path. Additional srm commands include srmrm, srmrmdir and srmmv (for moving files within an SRM system), and srmcp (which can copy files locally). Note that to copy files locally, srmcp may require the additional flag "-2" to ensure that the version 2 client is used.
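
As an illustration, here is a minimal sketch of listing the remote directory and copying one output file back locally with the srm client tools; the file name is one of those produced later in this tutorial (see the -copyData example below), and the exact form of the local file:// URL may depend on your srm client version:

srmls "srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/user/fanzago/TutGridSchool_test/"
srmcp -2 "srm://srm-eoscms.cern.ch:8443/srm/v2/server?SFN=/eos/cms/store/user/fanzago/TutGridSchool_test/outfile_1_1_HF3.root" file:///tmp/outfile_1_1_HF3.root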

Here is the content of the file containing the luminosity summary /crab_0_130220_173930/res/lumiSummary.json:

{"1": [[666666, 666666]]}

Copy the output from the SE to the local User Interface

This option can be used only if your output has been previously copied by CRAB to a remote SE. By default -copyData copies your output from the remote SE to the local CRAB working directory (under res). Otherwise you can copy the output from the remote SE to another one, specifying either -dest_se= or -dest_endpoint=. If dest_se is used, CRAB finds the correct path where the output can be stored (a sketch of copying to a different SE is given after the example below). The command to execute in order to retrieve the remote output files to your local user interface is:
crab -copyData 
## or crab -copyData -c <dir name>
An example of execution:

$ crab -copyData
crab:  Version 2.9.1 running on Fri Oct 11 17:08:38 2013 CET (15:08:38 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Copy file locally.
        Output dir: /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/res/
crab:  Starting copy...
directory/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/res/already exists
crab:  Copy success for file: outfile_4_1_Jlr.root 
crab:  Copy success for file: outfile_3_1_MsR.root 
crab:  Copy success for file: outfile_1_1_HF3.root 
crab:  Copy success for file: outfile_2_1_cVA.root 
crab:  Copy success for file: outfile_5_1_gAw.root 
Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131011_153317/log/crab.log
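
If instead you want to copy the output to a different Storage Element rather than locally, a sketch of the command is the following, where T2_XX_YYYY is a placeholder to be replaced with the official name of the destination site:

crab -copyData -dest_se=T2_XX_YYYY -c crab_0_131011_153317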

Publish your result in DBS

The publication of the produced data to DBS allows you to re-run over the produced data that has been published. The instructions to follow are below. You have to add more information to the CRAB configuration file, specifying that you want to publish and the name under which the data will be published.
[USER]
....
publish_data            = 1
publish_data_name       = what_you_want
....
Warning:
  • All the parameters related to publication have to be added to the configuration file before the creation of the jobs, even if the publication step is executed after retrieving the job output.
  • Publication is done in the phys03 instance of DBS3. If you belong to a PAG group, you have to publish your data to the DBS instance associated with your group, checking on the DBS access twiki page the correct DBS URL and which VOMS role you need in order to be an allowed user.
  • Remember to change the ui_working_dir value in the configuration file to create a new project (if you don't use the default name of the crab project), otherwise the creation step will fail with the error message "project already exists, please remove it before create new task" (see the sketch below).
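
For illustration, a minimal sketch of the [USER] fragment for a new publication task (ui_working_dir and publish_data_name are placeholder values to adapt to your own analysis):

[USER]
ui_working_dir          = crab_publish_test
publish_data            = 1
publish_data_name       = what_you_want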

Run Crab publishing your results

You can also run your analysis code publishing the results copied to a remote Storage Element. Here below is an example of the CRAB configuration file, consistent with this tutorial:

For MC data (crab.cfg)

[CMSSW]
total_number_of_events  = 50
number_of_jobs          = 10
pset                    = tutorial.py
datasetpath             = /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO
output_file              = outfile.root

[USER]
return_data             = 0
copy_data               = 1
storage_element         = T2_xx_yyyy
publish_data            = 1
publish_data_name       = FanzagoTutGrid

[CRAB]
scheduler               = remoteGlidein
jobtype                 = cmssw

And with this crab.cfg you can re-do the complete workflow as described before, plus the publication step:

  • creation
  • submission
  • status progress monitoring
  • output retrieval
  • publish the results

Use the -publish option

After having run the previous workflow up to the retrieval of your jobs, you can publish the output data that has been stored in the Storage Element indicated in the crab.cfg file using:

   crab -publish
or to publish the outputs of a specific project:
   crab -publish -c <dir_name>
It is not necessary that all the jobs are done and retrieved; you can publish your output at a later time.

It will look for all the FrameworkJobReport files (crab-project-dir/res/crab_fjr_*.xml) produced by each job and will extract from them the information to publish (e.g. number of events, LFN, etc.).

Publication output example

The output shown below corresponds to an old output using DBS2.

$ crab -publish
crab:  Version 2.9.1 running on Mon Oct 14 14:35:56 2013 CET (12:35:56 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/

crab:  <dbs_url_for_publication> = https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet
file_list =  ['/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_1.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_2.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_3.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_4.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_5.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_6.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_7.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_8.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_9.xml', '/afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/res//crab_fjr_10.xml']

crab:  --->>> Start dataset publication
crab:  --->>> Importing parent dataset in the dbs: /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO
crab:  --->>> Importing all parents level
-----------------------------------------------------------------------------------
Transferring path /RelValZMM/CMSSW_5_2_1-START52_V4-v1/GEN-SIM 
           block /RelValZMM/CMSSW_5_2_1-START52_V4-v1/GEN-SIM#24e1effb-0f0c-4557-bb46-3d5ecae691b8 
-----------------------------------------------------------------------------------

-----------------------------------------------------------------------------------
Transferring path /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-DIGI-RAW-HLTDEBUG 
            block /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-DIGI-RAW-HLTDEBUG#13e93136-29ed-11e2-9c63-00221959e7c0 
-----------------------------------------------------------------------------------

-----------------------------------------------------------------------------------
Transferring path /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO 
            block /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO#43683124-29f6-11e2-9c63-00221959e7c0 
-----------------------------------------------------------------------------------

crab:  --->>> duration of all parents import (sec): 552.62570405
crab:  Import ok of dataset /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO
crab:  PrimaryDataset = RelValZMM
crab:  ProcessedDataset = fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1
crab:  <User Dataset Name> = /RelValZMM/fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1/USER
 
crab:  --->>> End dataset publication
crab:  --->>> Start files publication
crab:  --->>> End files publication
crab:  --->>> Check data publication: dataset /RelValZMM/fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1/USER in DBS url https://cmsdbsprod.cern.ch:8443/cms_dbs_ph_analysis_02_writer/servlet/DBSServlet

=== dataset /RelValZMM/fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1/USER
=== dataset description =  
===== File block name: /RelValZMM/fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1/USER#787d164e-b485-4a23-b334-a8abde3fe146
      File block located at:  ['t2-srm-02.lnl.infn.it']
      File block status: 0
      Number of files: 10
      Number of Bytes: 33667525
      Number of Events: 50

 total events: 50 in dataset: /RelValZMM/fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1/USER

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_123645/log/crab.log

Warning: Some versions of CMSSW switch off the debug mode of CRAB, so a lot of duplicated information can be printed to the screen.

Analyze your published data

First note that:
  • CRAB by default publishes all files finished correctly, including files with 0 events
  • CRAB by default imports all dataset parents of your dataset

You have to modify your crab.cfg file, specifying the datasetpath of your dataset and the dbs_url where the data are published (we will assume the phys03 instance of DBS3):

[CMSSW]
....
datasetpath = your_dataset_path
dbs_url = phys03
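
Before creating the jobs, you can optionally verify that your published dataset is visible in DBS with a DAS command-line query. This is only a sketch, assuming the das_client.py tool is available in your environment:

das_client.py --query="dataset=/RelValZMM/fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1/USER instance=prod/phys03" --limit=0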

The creation output will be something similar to:

$ crab -create
crab:  Version 2.9.1 running on Mon Oct 14 15:49:31 2013 CET (13:49:31 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_154931/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Contacting Data Discovery Services ...
crab:  Accessing DBS at: https://cmsweb.cern.ch/dbs/prod/phys03/DBSReader
crab:  Requested dataset: /RelValZMM/fanzago-FanzagoTutGrid-f30a6bb13f516198b2814e83414acca1/USER has 50 events in 1 blocks.

crab:  SE black list applied to data location: ['srm-cms.cern.ch', 'srm-cms.gridpp.rl.ac.uk', 'T1_DE', 'T1_ES', 'T1_FR', 'T1_IT', 'T1_RU', 'T1_TW', 'cmsdca2.fnal.gov', 'T3_US_Vanderbilt_EC2']
crab:  May not create the exact number_of_jobs requested.
crab:  10 job(s) can run on 50 events.

crab:  List of jobs and available destination sites:

Block     1: jobs                 1-10: sites: T2_IT_Legnaro

crab:  Checking remote location
crab:  WARNING: The stageout directory already exists. Be careful not to accidentally mix outputs from different tasks
crab:  Creating 10 jobs, please wait...
crab:  Total of 10 jobs created.

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_154931/log/crab.log

The jobs will run at the site where your USER data have been stored.

CRAB configuration file for real data with lumi mask

You can find more details on this at the corresponding link on the Crab FAQ page.

The CRAB configuration file (default name crab.cfg) should be located at the same location as the CMSSW parameter-set to be used by CRAB. The dataset used is: /SingleMu/Run2012B-13Jul2012-v1/AOD

For real data (crab_lumi.cfg)

[CMSSW]
lumis_per_job           = 50
number_of_jobs          = 10 
pset                    = tutorial.py
datasetpath             = /SingleMu/Run2012B-13Jul2012-v1/AOD
lumi_mask             = Cert_190456-208686_8TeV_PromptReco_Collisions12_JSON.txt
output_file            = outfile.root

[USER]
return_data              = 0
copy_data                = 1
publish_data             = 1
publish_data_name       = FanzagoTutGrid_data

[CRAB]
scheduler               = remoteGlidein 
jobtype                 = cmssw

where the lumi_mask file can be downloaded with

wget --no-check-certificate https://cms-service-dqm.web.cern.ch/cms-service-dqm/CAF/certification/Collisions12/8TeV/Prompt/Cert_190456-208686_8TeV_PromptReco_Collisions12_JSON.txt

For the tutorial we are using a subset of runs and lumis (via a lumiMask.json file). The lumi_mask file (Cert_190456-208686_8TeV_PromptReco_Collisions12_JSON.txt) contains:

{"190645": [[10, 110]], "190704": [[1, 3]], "190705": [[1, 5], [7, 76], [78, 336], [338, 350], [353, 384]],
...
"208551": [[119, 193], [195, 212], [215, 300], [303, 354], [356, 554], [557, 580]], "208686": [[73, 79], [82, 181], [183, 224], [227, 243], [246, 311], [313, 463]]}

Job Creation

Creating jobs for real data is analogous to Monte Carlo data. In order not to overwrite the previous runs of this tutorial, it is suggested to use a dedicated cfg:

crab -create -cfg crab_lumi.cfg  
which takes as configuration file the one specified with the -cfg option, in this case the crab_lumi.cfg associated with the real-data part of this tutorial.

$ crab -create -cfg crab_lumi.cfg
crab:  Version 2.9.1 running on Mon Oct 14 16:05:18 2013 CET (14:05:18 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Contacting Data Discovery Services ...
crab:  Accessing DBS at: http://cmsdbsprod.cern.ch/cms_dbs_prod_global/servlet/DBSServlet
crab:  Requested (A)DS /SingleMu/Run2012B-13Jul2012-v1/AOD has 14 block(s).
crab:  SE black list applied to data location: ['srm-cms.cern.ch', 'srm-cms.gridpp.rl.ac.uk', 'T1_DE', 'T1_ES', 'T1_FR', 'T1_IT', 'T1_RU', 'T1_TW', 'cmsdca2.fnal.gov', 'T3_US_Vanderbilt_EC2']
crab:  Requested number of lumis reached.
crab:  9 jobs created to run on 500 lumis
crab:  Checking remote location
crab:  Creating 9 jobs, please wait...
crab:  Total of 9 jobs created.
Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/log/crab.log

  • The project directory called crab_0_131014_160518 is created.
  • As explained, the number of created jobs may not match the number of jobs requested in the configuration file (9 created but 10 requested).

Job Submission

Job submission is always analogous:

$ crab -submit
crab:  Version 2.9.1 running on Mon Oct 14 16:07:59 2013 CET (14:07:59 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Checking available resources...
crab:  Found  compatible site(s) for job 1
crab:  1 blocks of jobs will be submitted
crab:  remotehost from Avail.List = submit-4.t2.ucsd.edu
crab:  contacting remote host submit-4.t2.ucsd.edu
crab:  Establishing gsissh ControlPath. Wait 2 sec ...
crab:  Establishing gsissh ControlPath. Wait 2 sec ...
crab:  COPY FILES TO REMOTE HOST
crab:  SUBMIT TO REMOTE GLIDEIN FRONTEND
                                                                      Submitting 9 jobs                                                                       
100% [====================================================================================================================================================]
                                                                         please wait                                                                          crab:  Total of 9 jobs submitted.
Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/log/crab.log

Job Status Check

Check the status of the jobs in the latest CRAB project with the following command:
crab -status 
to check a specific project:
crab -status -c  <dir name>

which should produce a screen output similar to:

[fanzago@lxplus0445 SLC6]$ crab -status
crab:  Version 2.9.1 running on Mon Oct 14 16:23:52 2013 CET (14:23:52 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Checking the status of all jobs: please wait
crab:  contacting remote host submit-4.t2.ucsd.edu
crab:  
ID    END STATUS            ACTION       ExeExitCode JobExitCode E_HOST
----- --- ----------------- ------------  ---------- ----------- ---------
1     N   Running           SubSuccess                           ce208.cern.ch
2     N   Submitted         SubSuccess                           
3     N   Running           SubSuccess                           cream03.lcg.cscs.ch
4     N   Running           SubSuccess                           t2-ce-01.lnl.infn.it
5     N   Running           SubSuccess                           cream01.lcg.cscs.ch
6     N   Running           SubSuccess                           cream01.lcg.cscs.ch
7     N   Running           SubSuccess                           ingrid.cism.ucl.ac.be
8     N   Running           SubSuccess                           ingrid.cism.ucl.ac.be
9     N   Running           SubSuccess                           ce203.cern.ch

crab:   9 Total Jobs 
 >>>>>>>>> 1 Jobs Submitted 
        List of jobs Submitted: 2 
 >>>>>>>>> 8 Jobs Running 
        List of jobs Running: 1,3-9 

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=fanzago_crab_0_131014_160518_582igd
        Your task name is: fanzago_crab_0_131014_160518_582igd 

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/log/crab.log

and then ...

$ crab -status
crab:  Version 2.9.1 running on Tue Oct 15 10:53:33 2013 CET (08:53:33 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TEST_RELEASE/TEST_PATC2/TEST_2_8_2/TUTORIAL/TUT_5_3_11/SLC6/crab_0_131014_160518/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Checking the status of all jobs: please wait
crab:  contacting remote host submit-4.t2.ucsd.edu
crab:  Establishing gsissh ControlPath. Wait 2 sec ...
crab:  Establishing gsissh ControlPath. Wait 2 sec ...
crab:  
ID    END STATUS            ACTION       ExeExitCode JobExitCode E_HOST
----- --- ----------------- ------------  ---------- ----------- ---------
1     N   Done              Terminated    0          0           ce208.cern.ch
2     N   Done              Terminated    0          60317       cream03.lcg.cscs.ch
3     N   Done              Terminated    0          60317       cream03.lcg.cscs.ch
4     N   Done              Terminated    0          0           t2-ce-01.lnl.infn.it
5     N   Done              Terminated    0          60317       cream01.lcg.cscs.ch
6     N   Done              Terminated    0          60317       cream01.lcg.cscs.ch
7     N   Done              Terminated    0          0           ingrid.cism.ucl.ac.be
8     N   Done              Terminated    0          0           ingrid.cism.ucl.ac.be
9     N   Done              Terminated    0          0           ce203.cern.ch

crab:  ExitCodes Summary
 >>>>>>>>> 4 Jobs with Wrapper Exit Code : 60317 
         List of jobs: 2-3,5-6 
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:  ExitCodes Summary
 >>>>>>>>> 5 Jobs with Wrapper Exit Code : 0 
         List of jobs: 1,4,7-9 
        See https://twiki.cern.ch/twiki/bin/view/CMS/JobExitCodes for Exit Code meaning

crab:   9 Total Jobs 

crab:  You can also follow the status of this task on :
        CMS Dashboard: http://dashb-cms-job-task.cern.ch/taskmon.html#task=fanzago_crab_0_131014_160518_582igd
        Your task name is: fanzago_crab_0_131014_160518_582igd 

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TEST_RELEASE/TEST_PATC2/TEST_2_8_2/TUTORIAL/TUT_5_3_11/SLC6/crab_0_131014_160518/log/crab.log

Job Output Retrieval

For the jobs which are in the "Done" status it is possible to retrieve the log files of the jobs (just the log files, because the output files are copied to the Storage Element associated with the T2 specified in the crab.cfg, and in fact return_data is 0). The following command retrieves the log files of all "Done" jobs of the last created CRAB project:
crab -getoutput 
to get the output of a specific project:
crab -getoutput -c  <dir name>

The job results will be copied into the res subdirectory of your CRAB project:

$ crab -get
crab:  Version 2.9.1 running on Tue Oct 15 10:53:53 2013 CET (08:53:53 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  contacting remote host submit-4.t2.ucsd.edu
crab:  Preparing to rsync 2 files
crab:  Results of Jobs # 1 are in /afs/cern.ch/user/f/fanzago/scratch0/TEST_RELEASE/TEST_PATC2/TEST_2_8_2/TUTORIAL/TUT_5_3_11/SLC6/crab_0_131014_160518/res/
crab:  contacting remote host submit-4.t2.ucsd.edu
crab:  Preparing to rsync 16 files
crab:  Results of Jobs # 2 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
crab:  Results of Jobs # 3 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
crab:  Results of Jobs # 4 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
crab:  Results of Jobs # 5 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
crab:  Results of Jobs # 6 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
crab:  Results of Jobs # 7 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
crab:  Results of Jobs # 8 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
crab:  Results of Jobs # 9 are in /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/
Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/log/crab.log

Use the -report option

As for the Monte Carlo data example, it is possible to run the report command:

crab -report -c <dir name>
The report command returns info about correctly finished jobs, that is, jobs with JobExitCode = 0 and ExeExitCode = 0:

$ crab -report 
crab:  Version 2.9.1 running on Tue Oct 15 15:55:10 2013 CET (13:55:10 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  --------------------
Dataset: /SingleMu/Run2012B-13Jul2012-v1/AOD
Remote output :
SE: T2_IT_Legnaro t2-srm-02.lnl.infn.it  srmPath: srm://t2-srm-02.lnl.infn.it:8443/srm/managerv2?SFN=/pnfs/lnl.infn.it/data/cms/store/user/fanzago/SingleMu/FanzagoTutGrid_data/${PSETHASH}/
Total Events read: 264540
Total Files read: 21
Total Jobs : 9
Luminosity section summary file: /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/lumiSummary.json
   # Jobs: Retrieved:9

----------------------------

crab:  Summary file of input run and lumi to be analize with this task: /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/res/inputLumiSummaryOfTask.json

crab:  to complete your analysis, you have to analyze the run and lumi reported in the //afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/missingLumiSummary.json file

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TUTORIAL/crab_0_131014_160518/log/crab.log

The files containing the luminosity info about the task are the following. The original lumiMask.json file, specified in the crab.cfg file and used during the creation of your task:

$ cat Cert_190456-208686_8TeV_PromptReco_Collisions12_JSON.txt 
{"190645": [[10, 110]], "190704": [[1, 3]], "190705": [[1, 5], [7, 65], [81, 336], .... "208686": [[73, 79], [82, 181], [183, 224], [227, 243], [246, 311], [313, 463]]}
The lumi sections that your created jobs have to analyze (this information is passed as arguments to your jobs):

$ cat crab_0_131014_160518/res/inputLumiSummaryOfTask.json

{"194305": [[84, 85]], "194108": [[95, 96], [117, 120], [123, 126], [149, 152], [154, 157], [160, 161], [166, 169], [172, 174], [176, 176], [185, 185], [187, 187], [190, 191], [196, 197], [200, 201], [206, 209], [211, 212], [216, 221], [231, 232], [234, 235], [238, 243], [249, 250], [277, 278], [305, 306], [333, 334], [438, 439], [520, 520], [527, 527]], "194120": [[13, 14], [22, 23], [32, 33], [43, 44], [57, 57], [67, 67], [73, 74], [88, 89], [105, 105], [110, 111], [139, 139], [144, 144], [266, 266]], "194224": [[94, 94], [111, 111], [257, 257], [273, 273], [324, 324]], "194896": [[35, 35], [68, 69]], "194424": [[63, 63], [92, 92], [121, 121], [123, 123], [168, 173], [176, 177], [184, 185], [187, 187], [199, 200], [202, 203], [207, 207], [213, 213], [220, 221], [256, 256], [557, 557], [559, 559], [562, 562], [564, 564], [599, 599], [602, 602], [607, 607], [609, 609], [639, 639], [648, 649], [656, 656], [658, 658], [660, 660]], "194631": [[222, 222]], "193998": [[66, 113], [115, 119], [124, 124], [126, 127], [132, 137], [139, 154], [158, 159], [168, 169], [172, 172], [174, 176], [180, 185], [191, 192], [195, 196], [233, 234], [247, 247]], "194027": [[93, 93], [109, 109], [113, 115]], "194778": [[127, 127], [130, 130]], "195947": [[27, 27], [36, 36]], "195099": [[77, 77], [106, 106]], "196200": [[66, 67]], "194711": [[1, 4], [11, 17], [19, 19], [25, 30], [33, 38], [46, 49], [54, 55], [62, 62], [64, 64], [70, 71], [82, 83], [90, 91], [98, 99], [102, 103], [106, 107], [112, 115], [123, 124], [129, 130], [140, 140], [142, 142], [614, 617]], "195552": [[256, 256], [263, 263]], "195013": [[133, 133], [144, 144]], "195868": [[16, 16], [20, 20]], "194912": [[130, 131]], "194699": [[38, 39], [253, 253], [256, 256]], "194050": [[353, 354], [1881, 1881]], "194075": [[82, 82], [101, 101], [103, 103]], "194076": [[3, 6], [9, 9], [16, 17], [20, 21], [29, 30], [33, 34], [46, 47], [58, 59], [84, 87], [93, 94], [100, 101], [106, 107], [130, 131], [143, 143], [154, 155], [228, 228], [239, 240], [246, 246], [268, 269], [284, 285], [376, 377], [396, 397], [490, 491], [718, 719]], "195970": [[77, 77], [79, 79]], "195919": [[5, 6]], "194644": [[8, 9], [19, 20], [34, 35], [58, 59], [78, 79], [100, 100], [106, 106], [128, 129]], "196250": [[73, 74]], "195164": [[62, 62], [64, 64]], "194199": [[114, 115], [124, 125], [148, 148], [156, 157], [159, 159], [207, 208], [395, 395], [401, 402]], "194480": [[621, 622], [630, 631], [663, 664], [715, 716], [996, 997], [1000, 1001], [1010, 1011], [1020, 1021], [1186, 1187], [1190, 1193]], "196531": [[284, 284], [289, 289]], "195774": [[150, 150], [159, 159]], "196027": [[150, 151]], "193834": [[1, 35]], "193835": [[1, 20], [22, 26]], "193836": [[1, 2]]}                                    

The lumi sections actually analyzed by your correctly terminated jobs:

$ cat crab_0_131014_160518/res/lumiSummary.json
{"195947": [[27, 27], [36, 36]], "194108": [[95, 96], [119, 120], [123, 126], [154, 157], [160, 161], [166, 167], [172, 174], [176, 176], [185, 185], [187, 187], [196, 197], [211, 212], [231, 232], [238, 241], [249, 250], [277, 278], [305, 306], [333, 334], [438, 439], [520, 520], [527, 527]], "193998": [[66, 66], [69, 70], [87, 88], [90, 100], [103, 105], [108, 109], [112, 113], [115, 119], [124, 124], [126, 126], [132, 135], [139, 140], [142, 142], [144, 154], [158, 159], [168, 169], [172, 172], [174, 176], [180, 185], [191, 192], [195, 196], [233, 234]], "194224": [[94, 94], [111, 111], [257, 257]], "194424": [[63, 63], [92, 92], [121, 121], [123, 123], [168, 173], [176, 177], [184, 185], [187, 187], [207, 207], [213, 213], [220, 221], [256, 256], [599, 599], [602, 602], [607, 607], [609, 609], [639, 639], [656, 656]], "194631": [[222, 222]], "196250": [[73, 74]], "194027": [[93, 93], [109, 109], [113, 115]], "194778": [[127, 127], [130, 130]], "195099": [[77, 77], [106, 106]], "194711": [[140, 140], [142, 142]], "195552": [[256, 256], [263, 263]], "195868": [[16, 16], [20, 20]], "194912": [[130, 131]], "194699": [[253, 253], [256, 256]], "195970": [[77, 77], [79, 79]], "194076": [[3, 6], [29, 30], [33, 34], [58, 59], [84, 87], [93, 94], [106, 107], [130, 131], [154, 155], [228, 228], [239, 240], [246, 246], [268, 269], [284, 285], [718, 719]], "194050": [[353, 354], [1881, 1881]], "195919": [[5, 6]], "194644": [[34, 35], [78, 79]], "195164": [[62, 62], [64, 64]], "194199": [[114, 115], [124, 125], [148, 148], [156, 157], [159, 159], [207, 208]], "196531": [[284, 284], [289, 289]], "196027": [[150, 151]], "193834": [[1, 24], [27, 30], [33, 34]], "193835": [[19, 20], [22, 23], [26, 26]], "193836": [[1, 2]]}

And the missing lumis (the difference between the original lumiMask and the lumiSummary), which you can analyze by creating a new task that uses this file as the new lumiMask file (a sketch of how to recompute this difference by hand is given after the listing below):

$ cat crab_0_131014_160518/res/missingLumiSummary.json
{"190645": [[10, 110]],
 "190704": [[1, 3]],
 "190705": [[1, 5], [7, 65], [81, 336], [338, 350], [353, 383]],
 "190738": [[1, 130], [133, 226], [229, 355]],
.....
 "208541": [[1, 57], [59, 173], [175, 376], [378, 417]],
 "208551": [[119, 193], [195, 212], [215, 300], [303, 354], [356, 554], [557, 580]],
 "208686": [[73, 79], [82, 181], [183, 224], [227, 243], [246, 311], [313, 463]]}

To create a task to analyze the missing lumis of the original lumiMask, you can use the missingLumiSummary.json file as the new lumiMask.json file in your crab.cfg. As before, you can decide the splitting you want, and by using the same publish_data_name the new outputs will be published in the same dataset as the previous task:

[CMSSW]
lumis_per_job           = 50
number_of_jobs          = 4  
pset                    =  tutorial.py
datasetpath             = /SingleMu/Run2012B-13Jul2012-v1/AOD
lumi_mask             =  crab_0_131014_160518/res/missingLumiSummary.json  
output_file            = outfile.root

[USER]
return_data              = 0
copy_data                = 1
publish_data =1
storage_element          = T2_xx_yyyy
publish_data_name        = FanzagoTutGrid_data

[CRAB]
scheduler               = remoteGlidein 
jobtype                 = cmssw

$ crab -create -cfg crab_missing.cfg
[fanzago@lxplus0445 SLC6]$ crab -create -cfg crab_data.cfg
crab:  Version 2.9.1 running on Tue Oct 15 17:10:16 2013 CET (15:10:16 UTC)

crab. Working options:
        scheduler           remoteGlidein
        job type            CMSSW
        server              OFF
        working directory   /afs/cern.ch/user/f/fanzago/scratch0/TEST_RELEASE/TEST_PATC2/TEST_2_8_2/TUTORIAL/TUT_5_3_11/SLC6/crab_0_131015_171016/

crab:  error detecting glite version 
crab:  error detecting glite version 
crab:  Contacting Data Discovery Services ...
crab:  Accessing DBS at: http://cmsdbsprod.cern.ch/cms_dbs_prod_global/servlet/DBSServlet
crab:  Requested (A)DS /SingleMu/Run2012B-13Jul2012-v1/AOD has 14 block(s).
crab:  SE black list applied to data location: ['srm-cms.cern.ch', 'srm-cms.gridpp.rl.ac.uk', 'T1_DE', 'T1_ES', 'T1_FR', 'T1_IT', 'T1_RU', 'T1_TW', 'cmsdca2.fnal.gov', 'T3_US_Vanderbilt_EC2']
crab:  Requested number of jobs reached.
crab:  4 jobs created to run on 200 lumis
crab:  Checking remote location
crab:  WARNING: The stageout directory already exists. Be careful not to accidentally mix outputs from different tasks
crab:  Creating 4 jobs, please wait...
crab:  Total of 4 jobs created.

Log file is /afs/cern.ch/user/f/fanzago/scratch0/TEST_RELEASE/TEST_PATC2/TEST_2_8_2/TUTORIAL/TUT_5_3_11/SLC6/crab_0_131015_171016/log/crab.log

and submit them as usual. The created jobs will analyze part of the missing lumis of the original lumiMask.json file.

  • If you select total_number_of_lumis = -1 instead of lumis_per_job or number_of_jobs, the new task will analyze all the missing lumis, as in the sketch below.
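
A minimal sketch of the corresponding [CMSSW] fragment (all the other parameters stay as in the example above):

[CMSSW]
total_number_of_lumis   = -1
pset                    = tutorial.py
datasetpath             = /SingleMu/Run2012B-13Jul2012-v1/AOD
lumi_mask               = crab_0_131014_160518/res/missingLumiSummary.json
output_file             = outfile.root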

Run Crab retrieving your output (without copying to a Storage Element)

You can also run your analysis code without interacting with a remote Storage Element, retrieving the outputs instead to your workspace area (under the res dir of the project). Here below is an example of the CRAB configuration file, consistent with this tutorial:

[CMSSW]
total_number_of_events  = 100
number_of_jobs          = 10
pset                    = tutorial.py
datasetpath             =  /RelValZMM/CMSSW_5_3_6-START53_V14-v2/GEN-SIM-RECO
output_file              = outfile.root

[USER]
return_data             = 1

[CRAB]
scheduler               = remoteGlidein
jobtype                 = cmssw

And with this crab.cfg in place you can re-do the workflow as described before (apart from the publication step):

  • creation
  • submission
  • status progress monitoring
  • output retrieval (in this step you'll be able to retrieve directly the real output produced by your pset file)

Where to find more on CRAB

Note also that all CMS members using the Grid must subscribe to the Grid Announcements CMS HyperNews forum.

Review status

Reviewer/Editor and Date (copy from screen) Comments
JohnStupak - 4-June-2013 Review, minor revisions, updated real data dataset to an existing dataset
NitishDhingra - 2012-04-07 See detailed comments below.
MattiaCinquilli - 2010-04-15 Update for tutorial
FedericaFanzago - 18 Feb 2009 Update for tutorial
AndriusJuodagalvis - 2009-08-21 Added an instance of url_local_dbs

Complete Review, Minor Changes. Page gives a good idea of doing a physics analysis using CRAB

Responsible: FedericaFanzago



CRAB at CAF/LSF at CERN

Complete: 5
Detailed Review status

Contents:

Newsbox
under review


This document describes how to use CRAB at CERN for direct submission to the LSF batch system or the CERN Analysis Facility (CAF). The person responsible for the CAF is Peter Kreuzer. Useful information about the CAF can be found on the CAF twiki page.

Prerequisites

  • The dataset you want to access has to be available at the CAF, so it must be registered in the CAF DBS.
  • If you run on the CAF, you have to be authorized to do so. On this page: https://twiki.cern.ch/twiki/bin/view/CMS/CAF#User_Permissions you can find the sub-groups and the corresponding leaders. If you know your sub-group, you can contact the leader for the authorization.
  • CRAB StandAlone (direct submission)
    • Jobs have to be submitted from an AFS directory, from a node with LSF access, for example lxplus
    • Since in this case you are effectively using CRAB as a convenience tool to do LSF submission from your shell, you need to set up the environment as usual:
      • make sure you set up the environment in the following order:
        1. source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
        2. cmsenv
        3. source /afs/cern.ch/cms/ccs/wm/scripts/Crab/crab.sh

        • in the above, replace sh with csh if you are using tcsh
    • Please note that you must be sure to have enough quota in your AFS area. Large output should be put on Castor (see CAF stageout below).
    • Even if you decide to send the output to Castor, the stdout/err and the Framework Job Report will be returned to your AFS area in any case.
    • Removes the requirement to use an AFS directory and a host with LSF access, so can also submit from your desktop/laptop

Running

The workflow is exactly the same as the one you would follow to access data on the Grid (see: CRAB Tutorial). You set up your CMSSW area, develop your code, test it on a (small) part of a dataset, and then configure CRAB to create and submit identical jobs to the CAF to analyze the full dataset. In the crab.cfg configuration file, you just have to put under the [CRAB] section:
scheduler = caf

The available CAF queues are:

cmscaf1nh
cmscaf1nd
cmscaf1nw

When running on the CAF, using caf as scheduler instead of lsf, the longest queue (cmscaf1nw) is selected automatically. If you need to select a different queue, you can fill the parameter queue under the [CAF] section with either cmscaf1nh or cmscaf1nd (i.e. queue = cmscaf1nh). If you know that your jobs are short, it is more efficient to use the shorter queues, as in the sketch below.
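
A minimal crab.cfg fragment selecting the CAF scheduler and the one-hour queue (only the scheduler setting is strictly required; the queue line is optional):

[CRAB]
scheduler = caf
jobtype   = cmssw

[CAF]
queue = cmscaf1nh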

CAF stageout

If you are running jobs at CAF then the required stageout configuration is:

  • Stage out into the CAF user area (T2_CH_CERN is the official site name for the CAF):

[USER]
copy_data = 1
storage_element=T2_CH_CERN
user_remote_dir=xxx

The path where the data will be stored is /store/caf/user/<username>/<user_remote_dir>

There is no support for staging out to the CAF-T1 from the GRID. The above instructions only apply for jobs running on the CAF itself.

Further details on CRAB and Stage out configurations available at this page.

CAF publication

You need the following in crab configuration file:

  • (NOTE: the storage element where the data are copied has to be T2_CH_CERN):
 
[USER]
copy_data = 1
storage_element=T2_CH_CERN
publish_data=1
publish_data_name = data-name-to-publish  (e.g. publish_data_name = JohnSmithTestDataVersion666 )
dbs_url_for_publication = https://cmsdbsprod.cern.ch:8443/cms_dbs_caf_analysis_01_writer/servlet/DBSServlet

The path where data will be stored is /store/caf/user/<username>/<primarydataset>/<publish_data_name>/<PSETHASH>

Review status

Reviewer/Editor and Date (copy from screen) Comments
MarcoCalloni - 18 Dec 2008 linked to workbook, needs to be renamed swguide -> workbook
StefanoLacaprara - 13 Mar 2008 created the page

Responsible: MarcoCalloni
Last reviewed by:



5.8 Job Monitoring with CMS Dashboard

Complete: 5
Detailed Review Status

Contents

CMS Dashboard provides a web interface for Job Monitoring

Most of the CMS Job Submission Systems, including CRAB and PA, are instrumented to send monitoring information to the CMS Dashboard. In addition to the reports of the CMS Job Submission Systems, the Dashboard collects information from the Grid monitoring systems. Monitoring data is stored in a central database, and a web interface running on top of it allows CMS users to follow the progress of their jobs. For the Dashboard monitoring you do not need to set up any environment; the only thing you need is a web browser. In case you see a problem, please submit a bug in savannah (https://savannah.cern.ch/projects/dashboard) or mail dashboard-support@cern.ch

How to follow the progress of your tasks

  • After submission of your task to the CRAB server you get back the Dashboard task monitoring link, which you can follow in order to get information about the progress of your task.

  • If you want to find information about multiple tasks submitted during a particular time range you can enter the application via the entry task monitoring page:

http://dashb-cms-job-task.cern.ch/dashboard/request.py/taskmonitoring

  • Choose your identity in the "Select a User" window and the time window to define the tasks submitted during a given time range. You should see on the screen the list of all your tasks submitted over the time range you have chosen.
As a rule, on the Dashboard UI the user identity is defined by first name and family name, but this is not always the case: the user identity is retrieved from the Grid certificate subject and depends on its format. To check your Dashboard user identifier, go to the Dashboard interactive job monitoring page:

http://dashb-cms-job.cern.ch/dashboard/request.py/jobsummary

Click on the bar with the 'analysis' label and sort by user. Looking at the user names listed next to the bars, or in the table below, you will find your Dashboard identifier. On the task monitoring page you see the list of all your tasks with the distribution of jobs by their current status.

You can bookmark the link to the task monitoring application containing information only about your tasks:
http://dashb-cms-job-task.cern.ch/dashboard/request.py
/taskmonitoring#action=tasksTable&usergridname=USERNAME

where USERNAME is your Dashboard identifier

You can also bookmark the link to a particular task:
http://dashb-cms-job-task.cern.ch/dashboard/request.py/
taskmonitoring#action=taskJobs&usergridname=USERNAME&taskmonid=TASKMONITORID

where TASKMONITORID is the project name in CRAB.

  • The front page shows the overview of all tasks submitted during the selected time range (3 days by default) in table and graphical form.
Every page is reloaded every few minutes providing the most up-to-date information.

  • Clicking on the information icon next to the task name gives you the meta information of a particular task, like the name of the input dataset, the CMSSW version used, and the time stamp when the task was registered in the Dashboard.

  • Clicking on the number of jobs corresponding to a given status gives you detailed information about all jobs of the chosen category: Grid job id, identifier of the job inside the task, how many times the job has been resubmitted, the site where the job is processed, and the time stamps of the job processing (UTC).

  • In case a job was resubmitted multiple times, clicking on the number in the "Submission Attempts" column allows you to see all resubmissions corresponding to a given job inside the task. Here we are referring not to the resubmissions triggered by the resource broker, but to resubmissions done by the user.

  • The application provides various graphs showing the distribution of jobs by site, time and failure reason, graphs showing the time consumed by the task, and a graph showing the progress of the task in terms of processed events. The plots are either generated by default on the appropriate page or can be selected by clicking on the "Plot Selection" link in the table. Graphs can be zoomed in and out.

  • Jobs listed in the "Successful" category are those which completed properly from the application point of view and for which the Dashboard got no evidence that the job had been aborted by the Grid.

  • The jobs listed as "Failed" are those which either failed from the application point of view, were aborted by the Grid, or were cancelled by the user/crab_server. By failure from the application point of view we mean a non-zero status from the CMS application, problems related to saving the output files at the storage element, or problems while sourcing the CMS environment at the site. To discover the reason of failure of a particular job, click on the number of jobs in the "Failed" category; you get the list of all failed jobs with the Grid status and the application exit code. Moving the cursor over the status value gives you a more detailed reason of failure.

  • Please note: for navigating back between the task monitoring pages, do not use the browser's "back" button; use the buttons provided on the task monitoring pages.

  • On the task page showing the overview of all jobs belonging to a task, next to the task name you see the small Dashboard logo icon. Clicking on it gives you a bar plot of the distribution of the jobs of the task by a chosen attribute, like site, computing element or resource broker. Sometimes this distribution can help you identify a problem with a given site, CE or resource broker, and allows you to exclude the problematic service or site when you resubmit your jobs. See more details in the section "Using Dashboard Interactive Interface".

If you see any discrepancy between information in the Dashboard and the output of the Crab status command

Sometimes you may notice a discrepancy between the information in the Dashboard and the output of the Crab status command.

  • The Dashboard does not have user credentials and cannot directly query the Grid Logging and Bookkeeping system to get the status of a particular job. It relies on the job status reports sent to the Dashboard either from the jobs themselves or from Grid monitoring systems like RGMA or ICRTM. That is why, if you see inconsistent data in Crab and in the Dashboard regarding the Grid status of a given job, you should believe Crab.
  • On the other hand, the Dashboard gets real-time information reported from the jobs running on the worker nodes. So when you see that according to the Dashboard a job has terminated while Crab still considers it to be running, it means that the job has already finished on the worker node and sent its exit status, while the Grid Logging and Bookkeeping system has not yet updated the job status. If the delay in the update of the crab status for a job which has terminated according to the Dashboard takes too long (more than half an hour), the problem can be related to the Grid services.
  • For some sites, e.g. T3_US_FNALLPC, nothing is reported to the Dashboard until the job is done.

You can follow the CRAB3 Troubleshooting guide for more on how to troubleshoot your job and contact crab support.

Using Dashboard Interactive Interface

One of the purposes of the Dashboard Interactive Interface is to show the correlation of job failures or processing inefficiencies (for example, pending too long in the queue) with a particular site or Grid service like the Resource Broker.

When you see that all jobs of a particular task are failing and it is not clear to you whether it is a problem with your code or a problem related to a site misconfiguration, the Dashboard Interactive Interface can help you find out.

  • The first thing you can check is whether jobs of other users are failing at the same site with the same failure code. Go to:
http://dashb-cms-job.cern.ch/dashboard/request.py/jobsummary

By default, on the interactive user interface you see all jobs submitted during the selected time window. If you tick the 'terminated' checkbox, you will see all jobs which are currently either in pending or running status, or those which were terminated between the date selected as the beginning of the time range and now, regardless of when the jobs were submitted. Be aware that all dates in the Dashboard, including the UI, are UTC. You can sort the jobs by user, site, computing element, resource broker, application or task. Clicking on any bar of the plot allows you to sort the subset of jobs shown in that bar by various attributes. Clicking on any number in the table allows you to get detailed information about the selected subset of jobs, like processing time stamps, application exit code, Grid job id etc. If you sort your jobs by task and then click on a particular task name in the table, the task monitoring page for this task will be opened.

  • Trying to understand the reason of the failures

Click on the bar with the 'analysis' label and sort by site. The dark green colour corresponds to the jobs which finished properly, pink to the jobs which were properly handled by the Grid but failed from the application point of view, and red to the jobs aborted by the Grid. Clicking on the number corresponding to the failed or aborted jobs in the table below gives you the list of all failed or aborted jobs with their failure reason.

  • Example:
Looking at the plot provided via the link below, you can see that there are no jobs which succeeded in Taipei:

https://twiki.cern.ch/twiki/pub/CMSPublic/WorkBookMonitoringTutorial/tut0.pdf

Let's sort Taipei jobs by user

https://twiki.cern.ch/twiki/pub/CMSPublic/WorkBookMonitoringTutorial/Tut1.pdf

You see that several users were running their jobs at the site and nobody managed to run them properly. Clicking on the number of failed jobs in the table, we get the detailed view of the failed jobs with application exit code 8000, which very often indicates data access problems (beware: those images are from some time ago with older CMSSW releases; now a failure to open a file gives exit code 8020). So the failures of the jobs could be related to a site misconfiguration rather than to a problem in the user code. If the problematic site is not the only location of the data required by your task, you can put the site in the black list (ce_black_list=Site_Name) of the Crab configuration file and resubmit the task. If you feel you have to do this black listing, also contact the crab support team; more information is at the CRAB3 Troubleshooting page.

Review Status

Editor/Reviewer and date Comments
NitishDhingra - 04-October-2017 Review, minor revisions
JohnStupak - 9-June-2013 Review, minor revisions
NitishDhingra - 07-Apr-2012 See detailed comments below
StefanoBelforte - 29-Jan-2010 Complete Expert Review, minor changes
Main.julia - 07 February 2008 author

Complete review with minor fixes. The page gives a very good illustration of the Dashboard monitoring .

Responsible: JuliaAndreeva
Last reviewed by: DaveEvans 28 Feb 2008



5.9 The Role of the T2 Resources

Complete: 5
Detailed Review status

Goals of this page:

This page is intended to familiarize you with performing a large scale CMS analysis on the Grid. In particular, you will learn
  • the role of the Tier-2s for user analysis,
  • the organization of data at the Tier-2 sites,
  • how to find and request datasets,
  • where to store your job output,
  • and how to elevate, delete and deregister a private dataset.
It is important that you also become familiar with running a Grid analysis with Crab.

Contents

Introduction

The Tier-2 centers in CMS are the only location, besides the specialized analysis facility at CERN, where users are able to obtain guaranteed access to CMS data samples. The Tier-1 centers are used primarily for organized processing and storage. The Tier-2s are specified with data export and network capacity to allow the centers to refresh the data in disk storage regularly for analysis. A nominal Tier-2 will deploy 810 TB of storage for CMS in 2012. The CMS expectation for the global 2012 Tier-2 capacity is 27 PB of usable disk space. In order to manage such a large and highly distributed resource CMS has tried to introduce policy and structure to the Tier-2 storage and processing.

Storage Organisation at a Tier-2

[Figure T2_storage_2012: storage organization at a nominal Tier-2 site in 2012]

Apart from 30 TB storage space for central services, like MC production, and buffers, the main storage areas of interest for a user are:

  • 200 TB central space
    Here datasets of major interest for the whole collaboration, like primary skims or the main Monte Carlo samples, are stored. This space is controlled by AnalysisOperations.
  • 250 TB (125 TB * 2 groups) space for the detector and physics groups.
    Datasets which are of particular interest for the groups associated to a Tier-2 site, like sub-skims or special MC samples.
  • On the order of 160 TB (e.g. 40 users * 4 TB) of "Grid home space" for local/national users.
    This quota can be extended by additional local/national resources. Mainly the output files from Crab user analysis jobs will be stored in this area.
  • 170 TB local space.
    Data samples of interest for the local or national community. The movement and deletion of the data is fully under the responsibility and control of the site.

Sites larger than nominal will provide resources for more central space, three groups, and additional regional space. Sites smaller than nominal may provide resources for only one physics group, or only central space, or if sufficiently small, only for simulated event production.

How to find a dataset?

Once you have identified the physics processes which contribute to your signal and to the background of your analysis, you want to know over which datasets you have to run. This is usually not obvious from the dataset names alone. As a general tip, you should subscribe to your preferred detector & physics (PAG/POG/DPG) groups' Hypernews mailing lists and to hn-cms-physics-announcements. Sometimes your group provides this information on its information pages and documentation systems, like TWikis or webpages. Ask your colleagues! Once you have identified the names of the relevant datasets, you should check whether they are available for analysis by using the DAS (Data Aggregation System) or alternatively PhEDEx (Physics Experiment Data Export); see the sketch below.
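
For example, a DAS command-line query listing the sites hosting the real-data dataset used in the CRAB tutorial above; this is only a sketch, assuming the das_client.py tool is available in your environment:

das_client.py --query="site dataset=/SingleMu/Run2012B-13Jul2012-v1/AOD" --limit=0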

How to request a replication of a dataset?

The datasets you want to analyse have to be fully present at a Tier-2 (or at your local Tier-3) site. If they are shown to be present only at a Tier-1 center, you can request a PhEDEx transfer to copy the datasets to Tier-2 (and Tier-3) sites. Please consult the responsible persons of a Tier-2/3 site operated by your national community if the datasets will be accounted towards local Tier-2/3 space, or the data managers of the physics group you are associated with about whether they agree to store the datasets in their Tier-2 group space. After their agreement, please give a reasonable explanation in the PhEDEx request comment field and choose the appropriate group from the corresponding pull-down menu, or use "local" in case of transfers for the local/national community. With PhEDEx transfers you cannot copy datasets into your personal Grid home space.

Where to store your data output?

Usually your Crab analysis job produces an amount of output which is too large to be transferred by the Grid sandbox mechanism. Therefore you should direct your job output to your associated Grid user home storage space using the stage-out options in Crab (a sketch of the corresponding configuration is given at the end of this section). Using CERN resources like Castor pools will probably be restricted in the near future, so for the majority of CMS users a Tier-2 site will provide the output capacity. Usually your Grid home space will be at a Tier-2 site which your country operates for CMS; if more than one site is present, ask your country's IT contact persons how they distribute their users internally. In case your institute or lab operates a Tier-3 site which has sufficient capacity to receive CMS analysis output data over the Grid, such a site could also be used, however CMS support is only on a best-effort basis. Countries without their own CMS Tier-2 centers and with no functional Tier-3 should contact their country representatives, who have to negotiate with other sites to provide storage space for guest users.
Your associated Tier-2 provides you with on the order of 4 TB of space (the exact amount is to be negotiated with your Tier-2), usually only protected at the hardware level (e.g. RAID disks) and without a backup mechanism. If additional local or national resources are available it could be more; for details consult your Tier-2 contact persons.
Presently the Grid storage systems do not provide a quota system, therefore the local Tier-2 support will review the user space utilization regularly. Please be careful not to overfill your home area.
If you register the output of your Crab job in DBS, all CMS users can have access to your data.
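For orientation, the fragment below sketches how the stage-out destination is typically specified in a CRAB3 configuration. This is a minimal sketch only: the site name T2_DE_DESY and the /store/user path are placeholders for your own Grid home space, and the authoritative parameter list is given in the CRAB chapter.

# Illustrative CRAB3 configuration fragment (placeholders, not a complete config)
from CRABClient.UserUtilities import config
config = config()
config.Site.storageSite   = 'T2_DE_DESY'                  # placeholder: the Tier-2 hosting your Grid home space
config.Data.outLFNDirBase = '/store/user/<hn_username>/'  # placeholder: your user area on that storage
config.Data.publication   = True                          # optionally register the output in DBS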

How to move a private dataset into official space and how to delete and deregister a dataset?

CMS distinguishes between official datasets and user datasets. Whereas official datasets are produced centrally, users are allowed to produce and store their own datasets, containing any kind of data, at a Tier-2 centre. There are no requirements concerning data quality, usefulness or a size appropriate for tape storage. The data are located in the private user space at the user's home Tier-2 and can be registered in a local-scope bookkeeping so that the standard Grid tools can be used for distributed analysis. In principle such a dataset can be analysed by any user of the collaboration, however only at the Tier-2 centre hosting the dataset, which naturally has a limited number of job slots. Later the dataset created by the user may become important for many other users or even a whole analysis group. To provide better availability it is then reasonable to distribute the dataset to further Tier-2 centres or even to a Tier-1 centre for custodial storage on tape. However, the CMS data transfer system can only handle official data registered in the central bookkeeping. Therefore the user dataset has to become an official dataset fulfilling all the requirements of CMS. The StoreResults service provides a mechanism to elevate user datasets to the central bookkeeping by performing the following tasks:

  • Validation, through authentication and roles, ensuring that the data are generally useful.
  • Merging of the files into a size suitable for tape storage.
  • Injection of the data into the central bookkeeping and data transfer system.

The current system is ad hoc, based on a Savannah request/problem tracker for approvals and on the legacy CMS ProdAgent production framework. For the longer term, a complete rewrite based on the forthcoming new common CMS tools is presently under discussion. Further information can be found in URL1.

To delete data from the user's home space, Grid commands and knowledge of the physical file names are necessary. Please contact your local Tier-2 data manager and ask for advice and help. Invalidating private dataset registrations in a local-scope database, in order to synchronise with deleted data samples, is not a trivial action so far; a user-friendly tool may become available in the future. Until then, please consult the DBS removal instruction pages.
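To give an idea of the kind of Grid command involved, a single file can be removed from a storage element with the gfal tools once its physical file name is known. This is only a sketch; the SURL below is a placeholder, and any deletion should be coordinated with your Tier-2 data manager:

voms-proxy-init -voms cms     # a valid CMS grid proxy is required
gfal-rm "srm://se.example-t2.org/pnfs/example-t2.org/data/cms/store/user/jdoe/old_ntuple.root"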

Information sources

CMS computing Technical Design Report
Presentation (for 2009 storage resources)

Review status

Reviewer/Editor and Date (copy from screen) Comments
StefanoBelforte - 2015-08-19 remove reference to Filemover
-- JohnStupak - 2-July-2013 Review
-- ThomasKress - 22-Apr-2012 Major changes for 2012 T2 storage and in StoreResults section
-- NitishDhingra - 10-Apr-2012 See the detailed comments below
-- ThomasKress - 29-Apr-2010 temporary URLs to StoreResults service
-- StefanoBelforte - 08-Feb-2010 Complete Expert Review, minor changes
-- KatiLassilaPerini - 16 Feb 2009 Created the template page
-- ThomasKress - 18 Feb 2009 First draft
-- ThomasKress - 19 May 2009 200 TB "end of 2008" instead of "2009"
-- ThomasKress - 27 Nov 2009 Storage resources adapted for 2010, and minor mods.
-- ThomasKress - 03 Dec 2009 Minor mods.

Review with minor modifications. Link to monitor the status of different sites added. The page gives a good overview of the Tier2 resources.

Responsible: ThomasKress
Last reviewed by: IanFisk 19 May 2009



5.10 Transferring MC Sample/Data Files

Complete: 4
Detailed Review status

Note: this page is under construction.

Goals of this page:

This page is intended to familiarize you with making a PhEDEx subscription and monitoring the progress of transfers to a site. In particular, you will learn:

  • what PhEDEx is
  • why you should transfer data to your T2
  • how to make a PhEDEx subscription to a site
  • how to monitor PhEDEx transfers

Contents

What is PhEDEx?

PhEDEx is the CMS data placement tool. It sits above various grid middleware [SRM (Storage Resource Manager), FTS (File Transfer Service)] to manage large-scale transfers between CMS centres. CMS sites run a series of agents which carry out the requested transfers, verify that transfers have completed correctly, and publish what data are available for transfer.

A normal user does not interact with this machinery directly; they fill in a web form to request a data transfer to a site. Once the request is approved, the PhEDEx machinery takes over and makes sure that the transfer completes.

Why do I need to transfer MC sample or Data files?

For general analysis CMS uses the T2 centres, as the T0 and T1 sites are busy carrying out reconstruction, re-reconstruction, skimming and AOD production. This means that MC samples or data files need to be transferred from the T1 centres out to the T2s. It is up to the people working at a T2 site to choose which MC samples or data files go to that site, and to make the appropriate PhEDEx request.

If you run a CRAB job and find that your MC sample or data is not located at a T2 centre, you can request to have it transferred there using PhEDEx.

You do not need to have your MC sample or data at "your" T2 to run an analysis; CRAB will run on it at any location. However, you may find it useful to make a copy at your local T2, as this will increase the number of sites at which your analysis can run.

For PhEDEx requests, if you are working with an analysis group you can choose that group; note, however, that the files may then be deleted in a regular cleanup. Otherwise choose "local" for the User Group.


Copy files from other sites using gfal-copy command

Instructions on how to use the gfal tools to find and copy files from another site's Storage Element are on the CRAB3 FAQ twiki.
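As a rough illustration of such a copy (a sketch only; the source SURL and file name are placeholders, and the CRAB3 FAQ remains the authoritative reference):

# requires a valid CMS grid proxy (voms-proxy-init -voms cms)
gfal-copy "srm://se.example-t2.org/pnfs/example-t2.org/data/cms/store/user/jdoe/ntuple.root" "file://$PWD/ntuple.root"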

Instructions for FNAL-LPC

The copyfiles.py script can be used to copy single files or a directory of files using gfal-copy or xrdcp from another site to T3_US_FNALLPC.

The getSiteInfo.py script can be useful to obtain the information about a site's endpoint needed to fetch a single file through gfal-copy; it is used by the copyfiles.py script above.
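Alternatively, with a valid CMS grid proxy a single file can often be fetched directly by its LFN through an xrootd redirector, independently of the destination site. A sketch, using the example LFN from the Data Organization section below and assuming the global redirector cms-xrd-global.cern.ch is reachable from your location:

# copy one file by LFN via the global xrootd redirector into the current directory
xrdcp root://cms-xrd-global.cern.ch//store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root .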

Review status

Reviewer/Editor and Date (copy from screen) Comments
JohnStupak - 2-July-2013 Review
NitishDhingra - 07-Apr-2012 See the detailed comments below
KatiLassilaPerini - 20 Feb 2008 created the template page

Substantial modifications due to the deprecation of DBS. Instructions with snapshots for PhEDEx subscription using the DAS interface added.

Responsible: SimonMetson
Last reviewed by: YourName - date



5.10 Data Organization Explained

Complete: 5
Detailed Review status

Goals of this page:

This page is intended to provide you with an overview of the terms used in Data Management in CMS, thus giving you an appreciation of how data are organized. It is background information only.

Contents

Dataset Bookkeeping System (DBS): “Which data exist?”

The Dataset Bookkeeping System (DBS) provides the means to define, discover and use CMS event data. The main features that DBS provides are:

  • Data Description: keeps the dataset definition along with attributes characterising the dataset, such as the application that produced the data and the type of content resulting from the degree of processing applied (RAW, RECO, etc.). The DBS also provides information regarding the "provenance" of the data it describes.
  • Data Discovery: stores information about (real and simulated) CMS data in a queryable format. The supported queries allow users to discover the available data and how they are organized (logically) in terms of packaging units (files and file-blocks). Answers the question "Which data exist?"
  • The easiest way for a user to query this information is via the Data Aggregation Service (DAS), as described in the Chapter Locating Data Samples (see the example query below).
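For a flavour of what such a query looks like, the sketch below lists the datasets matching a pattern. The pattern reuses the example sample from the LFN discussion further down this page, and the command-line client may be called das_client or dasgoclient depending on your environment:

# "Which data exist?" -- list datasets matching a pattern
dasgoclient --query="dataset=/GenericTTbar/*/GEN-SIM-RECO"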

Data Location Service (DLS): "Where is the data?"

The Data Location Service (DLS) provides the means to locate replicas of data in the distributed computing system. The DLS provides the names of the Storage Elements of the sites hosting the data. It answers the question "Where is the data?"
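The same DAS interface also answers the location question. A sketch, where the dataset path is a placeholder to be replaced by one returned by the previous query:

# "Where is the data?" -- list the sites hosting a given dataset
dasgoclient --query="site dataset=/GenericTTbar/SAM-CMSSW_5_3_1_START53_V5-v1/GEN-SIM-RECO"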

The Event Data Model (EDM) in CMSSW is based on simple files. In data management you will see two terms used when discussing files:

Logical File Name (LFN)

  • This is a site-independent name for a file.
  • It contains neither the actual protocol used to read the file nor any site-specific information about where it is located.
  • It is preferred that you use the LFN for all production files, as this allows a site to change the specifics of access and location without breaking your config file.
  • A production LFN in general begins with /store and looks like this in a cmsRun cfg file:
process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
'/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root'
         )
)

Physical File Name (PFN)

  • This is a site-dependent name for a file.
  • It is used for local access to a file at a site. (Note that reading files at remote sites by specifying the protocol in the PFN does not work.)
  • The cmsRun application will automatically convert production LFNs into the appropriate PFN for the site where you are running, so you don't need to know the PFN yourself!
  • If you really want to know the PFN, the algorithm that converts an LFN to a PFN is site dependent and is defined in the so-called TrivialFileCatalog of the site (the TrivialFileCatalogs of the various sites are in CVS under COMP/SITECONF/SiteName/PhEDEx/storage.xml).

The edmFileUtil utility in your CMSSW environment can be used to get the PFN for a given LFN:

cd work/CMSSW_5_3_5/src/
cmsenv
edmFileUtil -d /store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root
 
will result in:
root://eoscms//eos/cms/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root?svcClass=default

For example, when accessing data locally at CERN the algorithm is:

  PFN = root://eoscms//eos/cms/ + LFN

and the cmsRun cfg file looks like:

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
     'root://eoscms//eos/cms/store/mc/SAM/GenericTTbar/GEN-SIM-RECO/CMSSW_5_3_1_START53_V5-v1/0013/CE4D66EB-5AAE-E111-96D6-003048D37524.root?svcClass=default'
    )
)

File Blocks

  • Files are grouped together into FileBlocks.
  • A file block is the minimum quantum of data that is replicated between sites.
  • Each given file block may be at one or more sites.

Dataset

  • Fileblocks are grouped into datasets.
  • A dataset is a set of fileblocks corresponding to a single sample and produced with a single cfg file.

DatasetPath

The DatasetPath is a string that identifies a dataset. It consists of 3 parts:

        /Primarydataset/Processeddataset/DataTier

where:

  • Primary dataset: a name that describes the physics channel

  • Processed dataset: a name that describes the kind of processing applied

  • Data Tier: describes the kind of event information stored from each step in the simulation and reconstruction chain. Examples of data tiers include RAW and RECO and, for MC, GEN, SIM and DIGI. A given dataset may consist of multiple data tiers, e.g. the term GEN-SIM-DIGI-RECO includes the generation (MC), simulation (Geant), digitisation and reconstruction steps.
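As a worked example, consider a hypothetical (purely illustrative) DatasetPath:

        /DYJetsToLL_M-50/Summer12-START53_V7A-v1/AODSIM

Here DYJetsToLL_M-50 would be the primary dataset (the physics channel), Summer12-START53_V7A-v1 the processed dataset (production campaign, conditions and version), and AODSIM the data tier.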

Review status

Reviewer/Editor and Date (copy from screen) Comments
StefanoBelforte - 29-Aug-2013 replace reference to DBS/DAS with ref. to Chapter 5.4
JohnStupak - 2-July-2013 Review and update to 5_3_5
NitishDhingra - 07-Apr-2012 See the detailed comments below
FrankWuerthwein - 04-Dec-2009 Complete Reorganization 1st draft ready for review

Complete review. Information regarding deprecation of DBS and migration to DAS has been added. Figures have been added for better understanding.

Last reviewed by: Main.David L Evans - fill in date when done - Responsible: StefanoBelforte

-- FrankWuerthwein - 06-Dec-2009
