Saarland University Saarland University | Department of Computer Science

Big Data Analytics Group

Database Systems WS 2012/13



We are flooded with data be it data on the Web (html pages, twitter, facebook, map services, ...), structured data in databases (your money on bank accounts, addresses, cell phone data, school and uni grades, flight information, taxes, medical records, ...), or data in scientific applications (gene data in bioinformatics, telescope data in astronomy, collider data in physics, measurements of seismic activity in geology, ...).

The way we access, manage, and process that data has tremendous impact on:
  1. performance. Though we sometimes think that a performance problem is due to particular algorithm requiring too much CPU time, it is often the data access patterns and retrieval times that slow down a program. The reason for bad performance may be that data cannot be accessed and shipped fast enough to the CPU. For instance you may be using unsuitable access methods to retrieve a single piece of information from a large data repository. Or you might be using an inefficient data layout ignoring the memory hierarchy and hardware capabilities of modern processors. In addition, even if the data was efficiently retrieved, performance may suffer due to picking the wrong analytical algorithms or not scaling your system correctly.
  2. reliability. What happens if your hard disk fails or your data center is flooded with water? How do you make sure that a consistent version of your data is accessible at all times? Can you afford to lose all data? How do you exploit multi-threading for accessing data without corrupting you data repository?

If you are interested in these questions, this might be the right lecture for you.

In this core lecture you will learn how to answer these questions. You will learn fundamental data managing algorithms and techniques which are used to build (not only) database systems but also search engines (Google), file systems, data warehouses, publish/subscribe systems (like Twitter), streaming systems, map services (google maps), or Amazon's Cloud (EC2), etc.

These techniques and algorithms will allow you to design, plan, and build (almost) any kind of data managing system.

Topics will include:

  • storage media (disk, flash, main memory, caches)
  • data managing architectures (DBMS, streams, file systems, clouds, appliances)
  • storage management (DB-file systems, raw devices, write-strategies, differential files, buffer management)
  • data layouts (horizontal and vertical partitioning, columns, hybrid mappings, compression, defragmentation)
  • indexing (one- and multidimensional, tree-structured, hash-, partition-based, bulk-loading and external sorting, differential indexing, LSM and stepped merge trees, read- and write-optimized indexing, data warehouse indexing, text indexing, main-memory indexes, sparse and dense, direct and indirect, clustered and unclustered, main memory versus disk and/or flash-based)
  • processing models (operator, push and pull, block-based, vectorized, compiled)
  • processing implementations (join algorithms for relational, spatial, and multidimensional data, grouping and early aggregation, filtering)
  • query processing (scanning, plan computation, SIMD)
  • query optimization (query rewrite, cost models, cost-based optimization, join order, plan enumeration)
  • data recovery (single versus multiple instance, logging)
  • parallelization of data and queries (horizontal and vertical partitioning, shared-nothing, replication, distributed query processing, NoSQL, MapReduce and Hadoop)
  • read-optimized system concepts (search engines, data warehouses, OLAP, ad-hoc analytics)
  • write-optimized system concepts (OLTP, streaming data, moving objects)
  • management of geographical data (GIS, google maps)

Place and Time:

  • E1 3, HS II
  • Tue 14:15 and Fri 10:15
  • Further information in HIS

Rules:

  • Requirements: Students are expected to have successfully passed the Information System introductory lecture held every summer semester covering introductory relational database management concepts or a comparable lecture.
  • Assignments: There will be weekly assignments (about 10 in total). Your performance in these assignments is not part of your overall grade. However, you have to obtain 50% of the points of all assignments overall. You may not have more than 3 assignments with 0 points (due to not handing-in, due to not presenting successfully, or due to not solving anything, ...).
    We strongly recommend that you team-up to build learning groups. You may hand-in as a group of 2-3 people. We expect one write-up per group. At the same time we expect that each member of the group contributes and understands each solution. If for whatever reason in any particular week you could not contribute to a particular question on an assignment sheet, it is fine to mark that on your hand-in. In that case you will not receive any points for that particular question, however you may receive points for all other questions. However, you do not risk to lose all points for this hand-in due to failed presentations (see Exercise Groups). Assignments will be made available on Tuesdays and have to be handed-in one week later on Tuesdays at 4:00pm latest (after the lecture).
  • Tutorials: We highly recommend to participate in the tutorials regularly in order to pass the lecture. However, attendance is not compulsary (with exceptions, see below). The material covered and hints given in the tutorials will be indispensable to successfully pass the exams. Think of the assignments as rehearsals for the actual exams (even though they are not officially called exams). The assignments will be very close to what you have to solve in the exams anyway.
    You have to successfully present the solution for one of the tutorials at least once throughout the semester. If you cannot present what you handed-in (in other words: if the TA has serious doubts that you understand what you are presenting even though you handed-in a correct solution...), you will not receive any points for that assignment sheet in that week. In that case you have to try again to successfully present another assignment in one of the three following weeks and must attend all three following tutorials. Similarly, if you do not show up at the tutorials frequently, the TA may make attendance for you compulsary to enable him to randomly pick you for presenting eventually. This will allow for a fair evaluation of your assignment performance.
  • Project: You should build small teams of 2-3 students (we will help you with that) and implement a small project throughout the semester: a main-memory database system. You may use only Java for coding plus a set of pre-defined libraries (see project definition). Notice that learning groups and project teams may be different.
    There will be three milestones (Dates: MS1: 18.11.; MS2: 16.12.; MS3: 20.1. at 23:59 each) plus a final hand-in (Date: 24.2. at 23:59) throughout the semester.
    For each milestone there will be a number of functional and performance tests. The outcome of the functional tests define 3% of your overall grade, the performance test another 3%, i.e. in total 6% for each milestone. Notice that performance tests for your program will only be executed in case the corresponding functional test does not produce any errors. Performance tests for the three milestone will be executed after the final hand-in only. This allows you to improve the performance of your entire system throughout the entire semester - even after the milestones. In summary, this means you can improve the performance of your system throughout the semester, but you cannot improve functional results after a milestone deadline. Only milestones with 100% functional correctness participate in the final performance evaluation. After each milestone already we will give you feedback on the performance of your solution. The final project hand-in defines another 12%, i.e. in total 6+6+6+12 = 30%.
    In case you do not hand-in a milestone or the final project on time, you do not receive any points for that milestone. Late hand-ins will not be considered. Please do not ask for extensions. Rather try to hand-in whatever you have on time than not handing-in at all. Only SVN hand-ins will be considered. Please do not send your results by email. The final hand-in of your project includes a brief oral presentation where each team member is expected to briefly show what she or he contributed to the project. You need to pass this evaluation to pass the lecture.
    Yet there will be individual project grades for the team members. Overall you have to obtain 50% of the points in the project. Then the project counts 30% of your overall grade.
    We will provide you with an svn-repository plus an automatic testing and performance evaluation environment.
  • Exams: There will be three exams: one 60 minute mid term exam (moved to Saturday , 24.11., 10.15am-11:15pm), one 120 minute end term exam (moved to Saturday, 9.2., 10:15am-12:15pm), and one 120 minute repetition exam (Date: 5.4., 10:15am-12:15pm). The midterm counts 20% of your overall grade. There is no minimal amount of assignment points you need to obtain in order to participate in the midterm exam. In addition, there is no minimal amount of points you have to obtain in the midterm exam. However, you cannot repeat the midterm exam.
    In contrast, you need to obtain at least 50% of the points in the assignments in order to be allowed to participate in the end term or repetition exam. You need to pass either the end term or the repetition exam. You need to obtain at least 50% of the points in one of these exams. The best grade of your performance in the end term and repetition exam grade determines 50% of your overall grade for this course. Yes, if you are unhappy with your grade in the end term exam, you may try the repetition exam.
  • Requirements for passing:
    • 50% of points in assignments,
    • one successful presentation in the tutorials,
    • 50% of points in project,
    • 50% of points in either end term or repetition exam.
  • Overall Percentage Computation: 20% midterm + 50% max(end term exam, repetition exam) + (6+6+6+12)% project.

Lecture Notes:

  • will be made available here after every lecture
  • (16-10-12) Teaser, Administrative Issues pdf
  • (19-10-12) Recap Tutorial pdf
  • (23-10-12) Hard Disks, Sequential vs. Random Access pdf
  • (26-10-12) SSDs, RAID, pulling up data (reading) pdf HD/SSD example products
  • (30-10-12) pushing down data (writing), disk-based versus main-memory systems pdf
  • (02-11-12) data layouts, row, column, hybrid, PAX, lightweight compression pdf
  • (06-11-12) lightweight compression (continued), paging, virtual memory pdf
  • (09-11-12) options for implementing MS1, indexing pdf
  • (13-11-12) the many faces of B+-trees pdf
  • (16-11-12) cache-efficient B+-trees, differential indexing, how to measure performance? pdf
  • (20-11-12) hashing: static, effective functions, order preserving, page-oriented, dynamic pdf
  • (27-11-12) dynamic hashing (continued), options for implementing MS2, query optimization pdf
  • (30-11-12) qp algorithms, join algorithms, outer joins, intersection pdf
  • (04-12-12) grouping and aggregation, external sorting, pipelining pdf (18-12-12) fixed bug in rs algo on page 5, last slide
  • (07-12-12) no lecture, 10:15-noon exam inspection at our group in E1 1, 3rd floor
  • (11-12-12) blocked pipelining, vectorization, code generation, code compilation, options for implementing MS3 pdf code
  • (14-12-12) Google Maps (continued), spatial indexing, Hadoop pdf (18-12-12) fixed bug in grid index example on page 6, slide 2
  • (18-12-12) course evaluation results, MapReduce use-cases, Hadoop in detail pdf eval result
  • (21-12-12) say No! No! and No!, =DB lecture (Limited Christmas Edition) pdf (password-free) and the winner is: Evica, congrats!
  • (08-01-13) Guest lecture by Sebastian Michel about Top-k Queries and locality sensitive hashing pdf
  • (11-01-13) No lecture!
  • (15-01-13) trade-offs in crash recovery pdf
  • (18-01-13) ARIES (continued), options for implementing final hand-in pdf
  • (22-01-13) parallelization of data and queries pdf
  • (25-01-13) fundamental techniques for parallelization (continued), updating a parallel DBMS, system examples pdf
  • (29-01-13) cloud computing, crowd computing pdf
  • (01-02-13) how big is big data?, HAIL | LIAH pdf
  • (05-02-13) Wrap-Up (Part 1), both parts in one pdf
  • (08-02-13) Wrap-Up (Part 2)
  • (09-02-13) Final-Exam: Seating is based on your first name, if it is in the range [A;Ke] -> HS II, E2.5; [Kh;O] -> HS III, E2.5; [P;Z] -> HS II, E1.3

Recommended Literature:

News:

  • (26-10-12) Please register in HIS until Nov 4.
  • (26-10-12) Please register for the tutorial groups on our moodle site until Oct 30.

Tutorials and Assignments:

  1. Tutorials
    • all tutorials are in room E1 1, 3.06
    • tutorials start on Oct 31
    • 1: Wed 10-12 Stefan Schuh
    • 2: Wed 16-18 Ahmad Ibrahim
    • 3: Thu 10-12 Endre Palatinus
    • 4: Thu 14-16 Ahmad Ibrahim
  2. Assignments:

  3. Sample Solutions:

Project Milestones:

Project Notes:

  • Responsible TAs are Alexander Bunte and Stefan Richter
  • We will provide a bulletin board for questions.
  • email: dbmsproject12@cs.uni-saarland.de