---+!!<nop>Rucio in Hadoop

%TOC%

---+ Introduction

Rucio log files, traces, and database dumps are imported into Hadoop to allow large-scale analysis, in particular the popularity reports and the consistency checks described below. The main developers are Thomas Beermann and Mario Lassnig.

---+ The Data

The data is stored in the =analytix.cern.ch= cluster in the directory:
<verbatim>/user/rucio01/</verbatim>

Example of access (starting from lxplus):
<verbatim>
$ ssh analytix
$ hadoop fs -ls /user/rucio01/traces
$ hadoop fs -cat /user/rucio01/traces/rucio-server-prod-05.cern.ch.1416407134480 | head
$ hadoop fs -ls /user/rucio01/dq2/traces
$ hadoop fs -cat /user/rucio01/dq2/traces/2014-08 | head
</verbatim>

---++ Apache Server / Rucio Daemon Logs

This area stores log files for simple cat / grep analysis:
   * read directly from the log files and continuously streamed via Flume to HDFS
   * simple text log files
   * ~23 GB per day

---++ Traces

Traces contain updates of the last access time of files/datasets and are used for the popularity reports:
   * update of the last access time of files/datasets
   * sent to an ActiveMQ broker and continuously streamed via Flume to HDFS
   * text files with one JSON-encoded dictionary per trace
   * ~5 GB per day (~6M entries)

There are traces from both DQ2 (historical) and Rucio (current). Traces for DQ2 and Rucio events have the same fields; the only difference is that DQ2 traces are stored in a plain-text format, while Rucio traces are in JSON.

For the DQ2 traces description, see:
<verbatim>$ hadoop fs -cat /user/rucio01/dq2/traces/README.txt</verbatim>

Example of loading the DQ2 traces in a Pig script:
<verbatim>
dq2_traces = LOAD '/user/rucio01/dq2/traces/2014-*' USING PigStorage() AS (
    uuid:chararray, eventtype:chararray, eventversion:chararray, remotesite:chararray,
    localsite:chararray, timestart:chararray, timeend:chararray, duid:chararray,
    version:int, dataset:chararray, clientstate:chararray, protocol:chararray,
    filename:chararray, filesize:long, guid:chararray, timeentry:chararray,
    usr:chararray, relativestart:chararray, transferstart:chararray, catstart:chararray,
    validatestart:chararray, hostname:chararray, ip:chararray, suspicious:boolean,
    appid:chararray, usrdn:chararray, rucio_account:chararray, rucio_appid:chararray,
    errmsg:chararray);
</verbatim>

Example of loading the Rucio traces in a Pig script:
<verbatim>
rucio_traces = LOAD '/user/rucio01/traces/*' USING JsonLoader('uuid:chararray, eventtype:chararray, eventversion:chararray, remotesite:chararray, localsite:chararray, timestart:chararray, timeend:chararray, duid:chararray, version:int, dataset:chararray, clientstate:chararray, protocol:chararray, filename:chararray, filesize:long, guid:chararray, tracetimeentryunix:chararray, usr:chararray, relativestart:chararray, transferstart:chararray, catstart:chararray, validatestart:chararray, hostname:chararray, ip:chararray, suspicious:boolean, appid:chararray, usrdn:chararray, rucio_account:chararray, rucio_appid:chararray, errmsg:chararray');
</verbatim>

Important fields used from the trace records:

| *Field name* | *Type* | *Description* |
| dataset | chararray | dataset or container name |
| eventtype | chararray | type of the access event; we are interested only in _'get.*'_ |
| usrdn | chararray | DN of the user's certificate, used to group events in queries |
| remotesite | chararray | remote site name |
| localsite | chararray | local site name |
| tracetimeentryunix (Rucio) / timeentry (DQ2) | chararray | event timestamp |
| uuid | chararray | job UUID |
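For example, a minimal Pig sketch (an illustration, not one of the production scripts) that uses these fields to count _'get.*'_ access events per dataset from the =rucio_traces= relation loaded above:

<verbatim>
-- Illustrative sketch: count 'get.*' access events per dataset,
-- using the rucio_traces relation loaded above.
get_events    = FILTER rucio_traces BY eventtype MATCHES 'get.*';
by_dataset    = GROUP get_events BY dataset;
access_counts = FOREACH by_dataset GENERATE group AS dataset, COUNT(get_events) AS accesses;
sorted_counts = ORDER access_counts BY accesses DESC;
DUMP sorted_counts;
</verbatim>

Grouping by =usrdn= instead of =dataset= gives the per-user view mentioned in the table above.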
---++ Oracle Dumps

The dumps contain:
   * daily reports for operations / site admins for consistency checks
   * file replicas / unique files per storage endpoint
   * primary / custodial dataset replicas
   * number of replicas per dataset / last access times

Import and sizes:
   * daily Sqoop dumps of the most important tables to HDFS
   * bz2-compressed, tab-separated text files, ~16 GB compressed size
   * DIDs: 550.000.000 entries
   * Rules: 7.500.000 entries
   * Replicas: 690.000.000 entries
   * Dataset Locks: 8.000.000 entries
   * RSEs: 700 entries

DIDs and Dataset Locks are linked by both _scope_ and _name_.

Below are examples of loading the Rucio dump data in a Pig script. _$CURRENT_DAY_ is a parameter in the format 'YYYY-MM-DD' pointing to the latest dumps.

---+++ DIDs
<verbatim>
dids = LOAD '/user/rucio01/dumps/$CURRENT_DAY/dids' USING PigStorage('\t') AS (
    scope: chararray, name: chararray, account: chararray, did_type: chararray,
    hidden: chararray, is_open: chararray, complete: chararray, obsolete: chararray,
    bytes: long, length: long, events: long, project: chararray, datatype: chararray,
    run_number: chararray, stream_name: chararray, prod_step: chararray,
    version: chararray, task_id: chararray, panda_id: chararray, campaign: chararray,
    lumiblocknr: chararray, provenance: chararray, phys_group: chararray,
    transient: chararray );
</verbatim>

---+++ Dataset Locks
<verbatim>
dslocks = LOAD '/user/rucio01/dumps/$CURRENT_DAY/dslocks' USING PigStorage('\t') AS (
    scope: chararray, name: chararray, rule_id: chararray, rse_id: chararray,
    account: chararray, state: chararray, updated_at: chararray, created_at: chararray,
    length: long, bytes: long, accessed_at: chararray );
</verbatim>

---+++ RSEs (Rucio Storage Elements)
<verbatim>
rses = LOAD '/user/rucio01/dumps/$CURRENT_DAY/rses' USING PigStorage('\t') AS (
    id: chararray, rse: chararray, rse_type: chararray, deterministic: int,
    volatile: int );
</verbatim>

---+++ Rules
<verbatim>
rules = LOAD '/user/rucio01/dumps/$CURRENT_DAY/rules' USING PigStorage('\t') AS (
    id: chararray, subscription_id: chararray, account: chararray, scope: chararray,
    name: chararray, did_type: chararray, state: chararray, rse_expression: chararray,
    copies: int, expires_at: chararray, weight: chararray, locked: int,
    grouping: chararray, error: chararray, updated_at: chararray, created_at: chararray,
    locks_ok_cnt: int, locks_replicating_cnt: int, locks_stuck_cnt: int,
    source_replica_expression: chararray, activity: chararray, notification: chararray,
    stuck_at: chararray );
</verbatim>

---++++ Distinguishing _primary_ and _secondary_ data

   * rule =expires_at= IS NULL and rule locked THEN custodial
   * rule =expires_at= IS NULL and rule not locked THEN primary
   * rule =expires_at= IS NOT NULL THEN secondary
   * replica =expires_at= IS NOT NULL THEN tobedeleted
   * the _tobedeleted_ category will be changed to _secondary_

A minimal sketch of this classification is shown below.
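The following Pig sketch (an illustration, not a production script) applies the rule-level part of this classification to the =rules= relation loaded above. It assumes NULL values appear as empty strings or the literal string 'null' in the tab-separated dumps; verify against the actual dump format. The replica-level _tobedeleted_ case requires the replica dumps and is not covered here.

<verbatim>
-- Illustrative sketch of the rule classification above.
-- Assumption: NULLs are dumped as empty strings or the literal 'null';
-- check the actual dump encoding before relying on this.
rule_classes = FOREACH rules GENERATE scope, name, rse_expression,
    ((expires_at == '' OR expires_at == 'null') ?
        (locked == 1 ? 'custodial' : 'primary')
        : 'secondary') AS lifetime_class;
class_counts = FOREACH (GROUP rule_classes BY lifetime_class)
    GENERATE group AS lifetime_class, COUNT(rule_classes) AS n_rules;
DUMP class_counts;
</verbatim>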
Table with column descriptions:

| *Column name* | *Type* | *Description* |
| A1 | B2 | C2 |
| A3 | B3 | C3 |

---+ Analysis Code

Where is it?

Table with all the scripts and short descriptions of what they do:

| *Name* | *Type* | *Description* |
| FilePopularity.pig | pig | Calculates how many times a file has been accessed during... |
| A3 | B3 | C3 |

Table with all the UDFs and their descriptions:

| *UDF* | *Return type* | *Input parameters* | *Description* |
| A1 | B2 | C2 | |
| A3 | B3 | C3 | |

---+ Additional Information

Talks:
   * [[https://indico.cern.ch/event/276502/session/5/contribution/11/1/material/slides/0.pdf][daily dump data]]
   * [[https://indico.cern.ch/event/355167/contribution/3/material/slides/0.pdf][more details]]

-----

*Major updates*:%BR%
-- Main.IlijaVukotic - 2014-11-19 %BR%
-- Main.SergeyBelov - 2015-02-25