Saarland University Saarland University | Department of Computer Science

Big Data Analytics Group

Database Systems WS 09/10

This core lecture covers fundamental data managing techniques which are used to build (not only) database systems. These techniques are also commonly used to build any other kind of information systems including search engines (Google), file systems, data warehouses, publish/subscribe systems (like Twitter and other Web 2.0 sites), streaming systems, map services (google maps), or Amazon's EC2 system. Therefore a better name for this lecture might be "Foundations of Data Management".

Topics will include:

  • storage media (tape, disk, flash, main memory)
  • data managing architectures (DBMS, streams, file systems, clouds, appliances)
  • storage management (DB-file systems, raw devices, write-strategies, differential files, buffer management)
  • data layout (horizontal and vertical partitioning, columns, hybrid mappings, compression, free memory management, defragmentation)
  • indexing (one- and multidimensional, tree-structured, hash-, partition-based, bulk-loading and external sorting, differential indexing, LSM and stepped merge trees, read- and write-optimized indexing, data warehouse indexing, text indexing, main-memory indexes, sparse and dense, direct and indirect, clustered and unclustered, main memory versus disk and/or flash-based)
  • operator models (push- and pull-model, block-based, vectorized)
  • operator implementations (join algorithms for relational, spatial, and multidimensional data, grouping and early aggregation, filtering)
  • query processing (scanning, plan computation, SIMD)
  • query optimization (query rewrite, cost models, cost-based optimization, join order, plan enumeration)
  • data recovery (single versus multiple instance, logging, ARIES, hot stand-by)
  • parallelization of data and queries (horizontal and vertical partitioning, shared-nothing, replication, distributed query processing, 2PC, map-reduce and hadoop, PNUTS, pig latin)
  • read-optimized system concepts (search engines, data warehouses, OLAP, ad-hoc analytics)
  • write-optimized system concepts (OLTP, streaming data, moving objects)
  • management of geographical data (GIS, google maps, computer games and simulations)
  • management of high-dimensional data (OLAP, multi-dimensional indexing trade-offs)

Administrative issues:

  • (27-04-10) Certificates (Scheine) are ready and may be fetched from our secretary (Mon-Thu from 9-12am).
  • (20-04-10) Result of repetition exam and final grades.
  • (12-04-10) Repetition exam inspection will be on Thursday, April 22: 12-13 (last name: A-J), 13-14 (last name: K-N), 14-15 (last name: O-Z). Results of the exam will be available approx. on April 20.
  • (07-04-10) Repetition exam will be in the new lecture hall (same place as for the final exam). Same rules as for the final exam.
  • (02-03-10) Result of final exam updated
  • (16-02-10) CORRECTED Exam inspection will be on Thursday, Feb 25: 10-11 (last name: A-J), 11-12 (last name: K-N), 12-13 (last name: O-Z)
  • (15-02-10) Result of final exam. Also consider a note on the outcome of this exam. If you want to participate in the repetition exam on April 8, 12:15pm, send a short email to my secretary not later than one week before the exam.
  • (19-01-10) The exam will be on Feb 4 from 12:15 to 14:15 in the new lecture hall. You may bring a single A4 sheet of paper with handwritten notes. You may use both sides. You may neither use a print-out nor a photocopy or a photocopy of handwritten notes.
  • (10-11-09) Please register in HISPOS until 03.12.09.
  • Students are expected to have successfully passed the Information System introductory lecture held every summer semester covering introductory relational database management concepts.
  • Exams: there will be two exams: one end term exam, and a repetition exam. You need to pass one of these exams. In case you do not pass the end-term exam, you will have to participate in a repetition exam. Your exam grade determine 70% of your overall grade for this course.
  • Exercises: you need to participate in the exercise groups regularly in order to pass the lecture. The material and exercises covered in the exercise sessions will be indispensable to successfully pass the exams. In the exercise sessions you will be grouped into small teams with other students and solve a small data managing task. You may use only Java for coding. Your code should be able to execute on a linux server. Your individual contribution to these teams will be evaluated twice during the semester and will be graded. You need to pass both of these evaluations to pass the lecture. The grade of these evaluations will influence your overall grade by 30%.
  • Registration closed: if you still want to register, please send an email to Angelika Scholl.
  • time and place: Tuesday and Thursday, 12:15pm til 2pm, E1.3 HS II

    Lecture Notes

    • 1a-Intro
    • 1b-Intro: Storage Media
    • 1c-Intro: Storage Management
    • 1d-Intro: DB-buffer (continued)
    • 2a-DataLayout: Tuple-IDs, Rows versus Columns
    • 2b-DataLayout: Column-Stores, Hybrid Stores, Compression, Free Memory Management, Defragmentation
    • 3a-Indexing: Indexes in real-life, Access Paths, Google, B-trees
    • 3b-Indexing: Project Milestones, split-operation, clustered, indirect, sparse, partitioning
    • 3c-Indexing: bulkloading, prefix-trees, large pages cache-optimized trees
    • 4a-HashIndexing: basics, hash functions, collisions, open addressing, dynamic hashing, extendible hashing
    • (12-11-09) Invited lecture on applying database technology to 3d video games and simulations, Dr. Marcos Vaz Salles, Cornell
    • 4b-HashIndexing: hybrid tree/hash, linear hashing, bitmaps
    • 5a-MultiDimensionalIndexing: intersection, key concatenation, prefixes, range queries, general query
    • 5b-MultiDimensionalIndexing: kd-trees, kd-tries, kdB-tree, quadtrees/-tries, linearized trees, z-codes, hilbert codes, range queries
    • 5c-MultiDimensionalIndexing: project: milestone clarification, query partitioning, trade-offs, grid indexes
    • 5d-MultiDimensionalIndexing: grids, order preserving hashing, logical versus physical partitioning, R-trees, priority-driven kNN, generalized indexes
    • 5e-MultiDimensionalIndexing: limits of multi-dimensional indexing
    • 6a-Operators: encyclopedia of DBMS, operators, ONC, Java iterators, pull versus push, coarse-granular (vectorized) iteration
    • (10-12-09) updated, bugfix on Slide 5
    • 6b-Operators: external sorting, optimizations, replacement selection, online merge, nested loops join, index nested loops join, sort-merge join
    • 6c-Operators: duplicates in sort-merge join, sweep area join, co-grouped join, simple hash-join, grace hash join, partitioning strategies
    • 6d-Operators: double-pipelined hash-join, grouping and aggregation, early aggregation, non-relational joins, grid join, plane sweep, z-code join
    • 7a-QueryOptimization: query optimization, rewrite rules
    • (15-12-09) Invited lecture on cost-based query optimization, Dr. Thomas Neumann, MPII
    • (17-12-09) mock exam (details announced in lecture on 10-12-09)
    • 8a-Recovery: transactions, ACID, WAL, fuzzy checkpoints, ARIES
    • 8b-Recovery: ARIES in detail, Franklin article
    • 9a-DistributedQP: motivation, multi-cores, GPUs, architectures, shared-nothing, speed-up, scale-up
    • 9b-DistributedQP: horizontal and vertical partitioning, replication, hybrid models, split vs merge, intra. vs inter-operator parallelism, distributed join processing, semijoin, repartitioning, co-partitioning
    • 9c-DistributedQP: partitioning examples, distributed query optimization, left-deep vs right-deep, 2PC, distributed concurrency control
    • 10a-PNUTS: large systems, eventual consistency, timeline consistency, PNUTS
    • 10b-PNUTS: PNUTS: related best practices in DBMSs, API, architecture, storage units, tablets, tablet controllers, routers, message brokers, recovery
    • 11a-MapReduce: map()-shuffle-reduce(), use-cases, splits, map tasks, reduce tasks, workers, fault tolerance, backup tasks, combiner functions, network locality
    • 11b-MapReduce: hadoop, sort benchmarks, multi-core, GPU
    • 11c-Pig: pig latin, data model, sequential specification vs execution, language definition, pig system
    • 13-WrapUp. Answer to DPT question

    Recommended Literature

    1. Books
    2. Papers


    1. Exercise Groups
      • (20-10-09) Exercise groups start on Oct 27
      • Group 1, Wednesday, 12:15-2:00pm, SR 15, Alekh Jindal
      • Group 2, Wednesday, 2:15-4:00pm, SR 15, Joerg Schad
      • Group 3, Thursday, 2:15-4:00pm, SR 14, Jorge Quiane
      • Group 4, Tuesday, 10:15am-12:00, MPII 0.23, Yagiz Kargin
    2. Assignments

    3. Sample Solutions
      • sample solutions for the assignments in pdf will be provided two weeks before the final exam. why "sample"?: for many assignments there may not only be a single right solution but multiple ones.
      • (01-02-10) Sample solutions for Assignment 11 available
      • (22-01-10) Sample solutions for Assignments 8 to 10 available
      • (18-12-09) Sample solutions for Assignments 1 to 7 available

    4. Project
      • (06-04-10) sources used for benchmark
      • (23-03-10) Final results of project eval
      • (11-03-10) For the final hand-in of the project, you have to submit two files:

        1. final_submission_groupXX.jar

        XX is your two-digit group number. Both files should be placed in the root of your SVN repository. the jar file will be used to run the tests, as was done for each milestone. The zip file will contain your sources. Please respect the submission guidelines.
      • (17-02-10) Please make an appointment for your project hand-in ASAP. See this forum post for details. Please bring a one page summary of your solution to that session.
      • (17-02-10) Code for milestone 4 (r369: bug fix in test: expectedResultSet changed to resultSet inside query loops in TransactionLayerTest)
      • (15-02-10) Code for milestone 4 (r368: Unit tests for the fourth milestone.)
      • (02-02-10) Code for milestone 4 (r354, bug fixes: (1) Missing connection between TransactionLayer and LogStore added, see Database.startSystem(LogStore logStore) and TransactionLayer.setLogStore(LogStore logStore) and the corresponding implementation, MyTransactionLayer (2) LogRecord is not an abstract class anymore. (3) Renamed some classes in logrecord.)
      • (28-01-10) Code (r333) and description for milestone 4 (last milestone)
      • (26-01-10) Clarification: Deadline for final hand-in postponed to March 12 (6:00 am) Due to construction work, power supply in E1 1 will be cut-off from March 12, 7am til approx. March 12, 10 am. This will effect all servers. Make sure to submit your project before.
      • (25-01-10) Deadline for final hand-in postponed to March 12.
      • (25-01-10) Deadline for Milestone 4 postponed to Feb 22.
      • (25-01-10) Code for milestone 3 (r321: minor bug fixes (1) Changed how operator is being used in unit tests to delete tables, based on: forum discussion (2) Corrected data/tableinfo.txt, Supplier relation's c_comment is now s_comment. )
      • (20-01-10) Code for milestone 3 (r283, Benchmark based on TPC-H (1) A new test, ThirdPhaseTest, using data from TPC-H (2) A new directory, dir/, with data for the new test (3) Removed some printing statements in previous release.)
      • (18-01-10) Code for milestone 3 (r273, bug fix and unit tests: (1) test/thirdmilestone/LogicalOperatorsTest (2) Input added to projection operator. (3) SampleRow has toString method)
      • (07-01-10) Code (r253) and description for milestone 3
      • (29-12-09) Code for milestone 2 (r224, new benchmark test and minor interface changes: (1) test/secondmilestone/SecondPhaseTest added, please check comments for the external library required. (2) Methods in IndexLayer that returned an Operator of type "Row" now return Operator an operator of type "? extends Row". (3) NoSuchTableException added to MyQueryLayer.updateRow)
      • (18-12-09) Deadline for Milestone 3 postponed to Jan 27
      • (18-12-09) Deadline for Milestone 2 postponed to Jan 4
      • (18-12-09) Code for milestone 2 (r216, unit tests and minor changes to the interface listed below. (1) QueryLayer.updateRow throws NoSuchTableException (2) IndexLayer.rangeQueryRowIDs and IndexLayer.deleteFromIndex(String, Object, Object) throw RangeQueryNotSupportedException (3) Database.getQueryInterface added. (4) Second milestone unit tests added and test directory reorganized. (5) IndexLayer.rebuildAllIndexes and storeIndexInformation now throw IOException)
      • (11-12-09) Sample Java CSB+-Tree made available
      • (11-12-09) Sample Java B+-Tree made available
      • (01-12-09) Description for milestone 2
      • (01-12-09) Code for milestone 2 (r1285)
      • (01-12-09) Milestone 1: Please submit your implementation classes in a file called milestone1_groupXX.jar, where XX is your two-digit group number. The file should only contain the new classes you provided and not the classes that are part of the framework. The file has to be placed in the root directory of your svn. Submissions by email will not be accepted.
      • (26-11-09) Code for milestone 1 (r166, (1) Bug fix in FirstPhaseTest.deleteRowInLocal, as pointed out in (2) TableTest.testViolateCharSize() removed as it was assuming that a row could be inconsistent - in that case a CHAR field of max size 5 storing a string of length 6 (3) RowID specified as being a row's unique identifier in a table. (4) Bug fix in TableTest.testGetRowsWithPredicate (5) Column.getDataAsObjectArray now uses MAX_VALUE of primitive types to represent deleted entries. Please see method's comment for more information.)
      • (24-11-09) New deadline for milestone 1 postponed to Dec 1. See slides
      • (23-11-09) forum and svn are up and running again
      • (21-11-09) To recover our cloud we need physical access to a specific server room. That server room will only be accessible again on Monday morning. Due to this inconvenience we will postpone the deadline by a day. New deadline for the storage layer is November 25.
      • (21-11-09) our private "cloud" serving svn and forum is down. We are working on it. Note that all sources are available from this web-site.
      • (20-11-09) Code for milestone 1 (r158, (1) All methods taking or returning Operator changed to use wildcards and extends. (2) Column.getDataArrayAsObject() has new comments on what to return and how to deal with primitives and nulls. (3) The use of the PersistentExtent Interface is now optional, but you should still be able to persist your database on disk and retrieve it.)
      • (18-11-09) Code for milestone 1 (r151, some fixes in JavaDoc and tests)
      • (17-11-09) Project page including link to forum published.
      • (17-11-09) Code for milestone 1 including functional tests (r143)
      • (10-11-09) Code for milestone 1 (r125, slightly updated)
      • (10-11-09) Description of milestone 1
      • (03-11-09) Code for milestone 1 (r83)
      • Responsible TA is Mohamed Yahya