We are flooded with data, be it data on the Web (HTML pages, Twitter, Facebook, map services, ...), structured data in databases (your money in bank accounts, addresses, cell phone data, school and university grades, flight information, taxes, medical records, ...), or data in scientific applications (gene data in bioinformatics, telescope data in astronomy, collider data in physics, measurements of seismic activity in geology, ...).
The way we access, manage, and process that data has a tremendous impact on:
performance. Though we sometimes think that a performance problem is due to a particular algorithm requiring too much CPU time, it is often the data access patterns and retrieval times that slow down a program. The reason for bad performance may be that data cannot be accessed and shipped to the CPU fast enough. For instance, you may be using unsuitable access methods to retrieve a single piece of information from a large data repository. Or you might be using an inefficient data layout that ignores the memory hierarchy and hardware capabilities of modern processors (a small code sketch follows this list). In addition, even if the data is retrieved efficiently, performance may still suffer from picking the wrong analytical algorithms or from not scaling your system correctly.
reliability. What happens if your hard disk fails or your data center is flooded with water? How do you make sure that a consistent version of your data is accessible at all times? Can you afford to lose all your data? How do you exploit multi-threading for accessing data without corrupting your data repository?
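To make the data layout point above concrete, here is a minimal sketch in C (the table and field names are made up for illustration) contrasting a row layout with a column layout when a query scans a single attribute:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte record; the query below only needs `price`. */
struct row {
    int64_t id;
    int64_t price;
    char    payload[48];
};

/* Row layout: scanning one attribute drags every full record through
   the cache, so most of each fetched cache line is wasted. */
int64_t sum_prices_rows(const struct row *table, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += table[i].price;
    return sum;
}

/* Column layout: the same attribute stored contiguously, so every
   byte brought in from memory contributes to the result. */
int64_t sum_prices_column(const int64_t *prices, size_t n) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += prices[i];
    return sum;
}

Both loops perform the same number of additions, but for this 64-byte record the columnar scan moves only an eighth of the bytes through the memory hierarchy. Reasoning of exactly this kind runs through the data layout and indexing parts of the lecture.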
If you are interested in these questions, this might be the right lecture for you.
In this core lecture you will learn how to answer these questions. You will learn fundamental data management algorithms and techniques that are used to build not only database systems but also search engines (Google), file systems, data warehouses, publish/subscribe systems (like Twitter), streaming systems, map services (Google Maps), Amazon's cloud (EC2), etc.
These techniques and algorithms will allow you to design, plan, and build (almost) any kind of data management system. Topics include:
data layouts (horizontal and vertical partitioning, columns, hybrid mappings, compression, defragmentation)
indexing (one- and multidimensional, tree-structured, hash- and partition-based, bulk-loading and external sorting, differential indexing, LSM and stepped-merge trees, read- and write-optimized indexing, data warehouse indexing, text indexing, main-memory indexes, sparse and dense, direct and indirect, clustered and unclustered, main memory versus disk- and/or flash-based)
processing models (operator, push and pull, block-based, vectorized, compiled)
processing implementations (join algorithms for relational, spatial, and multidimensional data, grouping and early aggregation, filtering)
query processing (scanning, plan computation, SIMD)
data recovery (single versus multiple instance, logging)
parallelization of data and queries (horizontal and vertical partitioning, shared-nothing, replication, distributed query processing, NoSQL, MapReduce and Hadoop)
read-optimized system concepts (search engines, data warehouses, OLAP, ad-hoc analytics)
write-optimized system concepts (OLTP, streaming data, moving objects)
management of geographical data (GIS, Google Maps)
Teaching Assistants
Stefan Schuh (head tutor)
Endre Palatinus (head tutor)
Stefan Richter (project head tutor)
Xiao Chen
Muhammad Bilal Zafar
Tobias Frey (project tutor)
Time and Place
Weekly videos PLUS a lab on Tuesdays from 10:15 to noon in GHH (this is a flipped classroom); starting in June, there may be an additional second weekly meeting on Fridays from noon to 2pm.