HEP NoSQL landscape
Summary
ATLAS
Needs and Usage
Logging and monitoring
- High write rates
- Cassandra, MongoDB, HBase
Data analytics
- Complex computations over lots of data
- HBase, MongoDB, Cassandra
Content and summary retrieval
Application backend
- Low latency, schema-less design, “native” data structures
- Cassandra, MongoDB
Cassandra 1: PanDA WMS pilot job framework
- What: archived reference data in PanDA
- Why: data is largely de-normalized and does not require most RDBMS features
- classes of queries over time and date ranges, where Oracle typically does not perform well
- scales with the increasing number of jobs
- existing expertise; easy to configure and install
- interface: JSON (sketched below)
- scaling from 4 nodes to ?
Configuration:
- 3x 1 TB SSD, 48 GB RAM, 2 CPUs (6 cores)
- plus a monitoring machine
- Ganglia
- composite indexes
- pycassa, Amazon
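A minimal sketch of the access pattern described above, using the pycassa client; the keyspace, column family, host, and key layout are illustrative assumptions, not the actual PanDA setup:

    import json
    import pycassa

    # Illustrative names only -- the real PanDA keyspace/column family differ.
    pool = pycassa.ConnectionPool('PandaArchive', server_list=['cassandra01:9160'])
    jobs = pycassa.ColumnFamily(pool, 'jobs')

    # Store an archived job record as JSON (the "interface: JSON" above),
    # with a date-prefixed key so time/date-range scans stay cheap.
    job = {'pandaid': 1234567, 'status': 'finished'}
    jobs.insert('2011-11-08:1234567', {'data': json.dumps(job)})

    # Read it back.
    row = jobs.get('2011-11-08:1234567')
    print(json.loads(row['data'])['status'])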
Cassandra 2: DQ2 tracing
- grid tracer service: data access and usage
- real-time monitoring
- very high write rate
Tests:
- ~5 million traces per day
- write, row-by-row insertion (~3 KB/row): MongoDB and Cassandra perform better than Oracle (including Oracle NoSQL)
- read of one month's traces (90M rows, 34 GB): Cassandra in minutes, vs. Oracle production in hours and Oracle test in seconds
Cassandra 3: TDAQ online information service
- persistent storage of monitoring data that would otherwise be lost
- high-rate monitoring: 2500 Hz
Configuration:
- 3x 4 cores / 4 GB RAM, RAID
- 7200 obj/s => 45 batches/s (~160 objects per batch, ~800 bytes per object) => 5.6 MB/s (batching sketched below)
- DataStax monitoring software
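The batch figures above (~160 objects per mutation) correspond to something like pycassa's batch interface; a sketch with illustrative keyspace, column family, and payload names:

    import pycassa

    pool = pycassa.ConnectionPool('tdaq', server_list=['cassandra01:9160'])  # illustrative
    cf = pycassa.ColumnFamily(pool, 'is_objects')

    # Group ~160 monitored objects into one mutation,
    # matching 7200 obj/s at 45 batches/s.
    batch = cf.batch(queue_size=160)
    for i in range(160):
        batch.insert('obj:%06d' % i, {'value': 'payload-%d' % i})
    batch.send()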
The best solution is actually:
- Python scripts map-reducing flat CSV files on EC2 (sketched below)
- a fraction of a second to produce any number of summaries
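A minimal sketch of that pattern, assuming a flat trace CSV whose column names (dataset, bytes) are illustrative: map each row to a key and reduce by summing, in one pass.

    import csv
    from collections import defaultdict

    def summarize(path):
        """One pass over a flat CSV of traces: per-dataset access and byte totals."""
        totals = defaultdict(lambda: {'accesses': 0, 'bytes': 0})
        with open(path) as f:
            for row in csv.DictReader(f):        # column names are assumptions
                t = totals[row['dataset']]
                t['accesses'] += 1
                t['bytes'] += int(row['bytes'])
        return totals

    for dataset, t in summarize('traces.csv').items():
        print(dataset, t['accesses'], t['bytes'])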
MongoDB: DQ2 Accounting
- rich data model
- only one MapReduce job can run at a time: locking problem! (sketched below)
- summary retrieval performs better than the Oracle star schema
- works well as long as all indexes fit in memory: 40 million datasets fully indexed ~ 2 GB
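For illustration, a MongoDB MapReduce job of this kind via pymongo (the 1.x/2.x-era Connection API); database, collection, and field names are assumptions. At the time, such a job held MongoDB's global lock, hence the locking problem noted above.

    from pymongo import Connection
    from bson.code import Code

    db = Connection('localhost')['dq2']   # illustrative host and database

    # Sum dataset sizes per owner; field names are assumed for illustration.
    mapper = Code("function () { emit(this.owner, this.size); }")
    reducer = Code("""function (key, values) {
        var total = 0;
        for (var i = 0; i < values.length; i++) { total += values[i]; }
        return total;
    }""")

    result = db.datasets.map_reduce(mapper, reducer, out='owner_totals')
    for doc in result.find():
        print(doc['_id'], doc['value'])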
New application: popularity
- persistent genetic algorithm
HBase: DQ2 Accounting
- evolution of values over time: rolling summaries (sketched below)
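HBase keeps multiple timestamped versions per cell, which maps naturally onto rolling summaries. A minimal sketch using the happybase Python client; table, column family, and row key are illustrative assumptions, and the column family must be created with enough VERSIONS:

    import happybase

    conn = happybase.Connection('hbase-master')   # illustrative host
    table = conn.table('dq2_accounting')          # illustrative table

    # Each write becomes a new timestamped version of the same cell,
    # so the cell history is the evolution of the value in time.
    table.put('owner:fred', {'sum:bytes': '1024'}, timestamp=1320710400000)
    table.put('owner:fred', {'sum:bytes': '2048'}, timestamp=1320796800000)

    # Read the rolling history back, newest first.
    for value, ts in table.cells('owner:fred', 'sum:bytes',
                                 versions=10, include_timestamp=True):
        print(ts, value)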
Current state
- DDMLAB-hosted 10-node Hadoop cluster (with no special optimisation)
- Full migration of Oracle content (400M rows, 20 GB) into the HBase data model (20M rows, 4 GB): 2h 40m
- One MapReduce pass over the data: 40 min (random reads, ~4 KB blocks, ~2.5 MB/s per node)
- HDFS replication factor 4… (40*4)/60; with the replication factor raised to the full 10 nodes, the MapReduce pass drops to 15 min
- Compared with Oracle: 5-6 hours, if it finishes at all…
- Can do all accounting summaries in one MapReduce run
Still ToDo
- One-way synchronisation from Oracle to HBase - not trivial (Scibd)
- Summary retrieval - trivial
- Ad-hoc queries need a slightly different data model than the MapReduce one
Hadoop: DQ2 log analysis
Map-reducing CSV files on Hadoop HDFS:
- Dump the table from Oracle (DQ2 traces)
- Copy log files to HDFS (DQ2 Apache logs)
- Use the Hadoop Streaming API (see the sketch below)
- Write results directly to HBase (with Java/Jython)
Map-reduces 75 GB in 5 minutes
- with Java libraries it should be ~2.5 times faster (memory allocation savings)
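A minimal Hadoop Streaming sketch of the above: mapper and reducer are plain Python scripts reading stdin and writing tab-separated key/value lines; paths, file names, and the counted column are illustrative assumptions.

    # mapper.py -- emit one (key, 1) pair per CSV line.
    import sys, csv

    for row in csv.reader(sys.stdin):
        print('%s\t1' % row[0])      # key = first column (an assumption)

    # reducer.py -- sum counts per key; Streaming sorts mapper output by key.
    import sys

    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t')
        if key != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print('%s\t%d' % (current, count))

Submitted with the standard streaming jar (paths illustrative):

    hadoop jar hadoop-streaming.jar \
      -input /dq2/traces -output /dq2/summary \
      -mapper mapper.py -reducer reducer.py \
      -file mapper.py -file reducer.py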
CMS
MongoDB: Data Aggregation System
- What: search and aggregate information across different data services without knowing the specific data sources
- Why: DAS does not require data preservation or transaction capabilities; dynamic types of stored metadata objects (no predefined schema, sketched below)
- shows/propagates/replicates information (job history, workflow reports, ...) and provides an API for building other applications
- reduces the monitoring load on the core databases (MySQL, Oracle) and decouples it from them
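The schema-less point, illustrated with pymongo: documents of different shapes from different data services share one collection and stay queryable. All names below are illustrative, not the real DAS layout.

    from pymongo import Connection

    cache = Connection('localhost')['das']['cache']   # illustrative names

    # Records from different data services have different shapes,
    # yet live in the same collection -- no predefined schema.
    cache.insert({'service': 'dbs', 'dataset': '/A/B/RECO', 'nevents': 12345})
    cache.insert({'service': 'phedex', 'block': '/A/B/RECO#1', 'site': 'T1_CH_CERN'})

    # Aggregation across services works through ordinary queries.
    for rec in cache.find({'dataset': '/A/B/RECO'}):
        print(rec)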
KyotoCabinet/SQLite: IgProf (performance tuning of CMS software)
Hadoop: future use probably just HDFS
Issues
- MongoDB, CouchDB, KyotoCabinet (from "CMS Offline experience with NoSQL data stores", Valentin Kuznetsov, Database Futures Workshop, CERN, June 2011)
PES Batch Monitoring
Cassandra / Apache HBase: project phase unclear (design, implementation, testing?)
GT SAM
Cassandra: Hadoop, Apache Pig
--
FaustinRoman - 08-Nov-2011