Rucio in Hadoop
Introduction
What was the initial reason for importing this data?
The main developers are Thomas Beermann and Mario Lassnig.
The Data
The data is stored on the analytix.cern.ch cluster in the directory /user/rucio01/.
Apache Server / Rucio Daemon logs
Stores log files for simple cat / grep analysis (an HDFS equivalent is sketched after this list)
- read directly from the log files and continuously streamed via Flume to HDFS
- simple text log files
- ~23GB per day
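For this kind of ad-hoc analysis the logs can be streamed straight out of HDFS with the hadoop command line client. A minimal sketch in Python, assuming the hadoop CLI is on the PATH; the sub-directory layout under /user/rucio01/ and the search pattern are placeholders, not the actual layout:

import subprocess

# Stream one day's log files out of HDFS and print matching lines,
# the HDFS equivalent of a simple cat / grep.
# ASSUMPTION: the path layout below /user/rucio01/ is hypothetical.
HDFS_GLOB = "/user/rucio01/logs/server/2014-11-19/*"
PATTERN = "ERROR"

proc = subprocess.Popen(["hadoop", "fs", "-cat", HDFS_GLOB],
                        stdout=subprocess.PIPE)
for raw in proc.stdout:
    line = raw.decode("utf-8", errors="replace")
    if PATTERN in line:
        print(line, end="")
proc.stdout.close()
proc.wait()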
Traces
Contain updates of the last access times of files/datasets; these will be used for the popularity reports (see the parsing sketch after this list)
- update of last access time of files/datasets
- sent to an ActiveMQ broker and continuously streamed via Flume to HDFS
- text files with one JSON-encoded dictionary per trace
- ~5GB per day - 6M entries
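Because each trace is a single JSON-encoded dictionary per line, a popularity-style aggregation needs nothing beyond the Python standard library. A minimal sketch, assuming the traces have been copied to a local file and that each dictionary carries 'scope' and 'filename' keys; the actual trace schema should be checked before relying on these names:

import json
from collections import Counter

def access_counts(trace_file):
    """Count accesses per file from a trace file (one JSON dict per line).

    ASSUMPTION: the 'scope' and 'filename' keys are guesses at the
    trace schema; adjust to the real field names.
    """
    counts = Counter()
    with open(trace_file) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                trace = json.loads(line)
            except ValueError:
                continue  # skip malformed traces
            counts[(trace.get("scope"), trace.get("filename"))] += 1
    return counts

if __name__ == "__main__":
    for (scope, name), n in access_counts("traces.txt").most_common(10):
        print(n, scope, name)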
Oracle Dumps
Contain:
- daily reports for operations / site admins for consistency checks
- file replicas / unique files per storage endpoint
- primary / custodial dataset replicas
- number of replicas per dataset / last access times
Import and sizes:
- daily Sqoop dumps of most important tables to HDFS
- bz2 compressed, tab-separated text files, ~16GB compressed size (readable directly, as sketched below)
- DIDs: 550.000.000 entries
- Rules: 7.500.000 entries
- Replicas: 690.000.000 entries
- Dataset Locks: 8.000.000 entries
- RSEs: 700 entries
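Since the dumps are plain bz2-compressed, tab-separated text, they can be scanned directly without any Hadoop tooling. A minimal sketch that counts rows per RSE in a replicas dump; the file name and the position of the RSE column are assumptions, so consult the column descriptions below before relying on them:

import bz2
from collections import Counter

def rows_per_rse(dump_file, rse_column=0):
    """Count rows per RSE in a bz2-compressed, tab-separated Sqoop dump.

    ASSUMPTION: the RSE identifier is taken from column `rse_column`;
    check the column description table for the real layout.
    """
    counts = Counter()
    with bz2.open(dump_file, "rt", errors="replace") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > rse_column:
                counts[fields[rse_column]] += 1
    return counts

if __name__ == "__main__":
    for rse, n in rows_per_rse("replicas.bz2").most_common(10):
        print(n, rse)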
Table with column descriptions.
Analysis Code
Where is it?
Table with all the scripts and short descriptions of what they do.
Table with all the UDFs and their descriptions.
Additional information
Talks:
Major updates:
-- IlijaVukotic - 2014-11-19
Responsible: IlijaVukotic
Last reviewed by: Never reviewed