Syllabus
Welcome to Intro to Big Data Systems! We'll deploy and use distributed systems to store and analyze large datasets. Unstructured and structured approaches to storage will be covered. Analysis will involve learning new query languages, processing streaming data, and training machine learning models. Systems covered include Docker, PyTorch, HDFS, Spark, Cassandra, Kafka, and more.
Revisions to Syllabus
- Sep 5, 2025: edited policy for missed exams/in-person quizzes
- Oct 28, 2025: fixing quiz format section of the syllabus to mostly multiple choice
Course Instructor
- Dr. Meenakshi Syamkumar (Teaching Faculty - Department of Computer Sciences) ms@cs.wisc.edu
Lecture (Meeting Time and Location)
- LEC002 MORGRIDGE 1570 MWF 11:00 AM - 11:50 AM
- We meet 3 times a week -- see the lecture schedule here.
- I'll ask questions during lecture via TopHat. Answering questions (and getting them correct) will help your participation score, though perfect attendance is not necessary for full credit.
Instructional Modality
- LEC002: in-person
Learning Objectives
- Deploy distributed systems for data storage and analytics
- Demonstrate competencies with tools and processes necessary for loading data into distributed storage systems
- Write programs that use distributed platforms to efficiently analyze large datasets
- Produce meaning from large datasets by training machine learning models in parallel or on distributed systems
- Measure resource usage and overall cost of running distributed programs
- Optimize distributed analytics programs to reduce resource consumption and program runtime
- Demonstrate competencies with cloud services designed to store or analyze large datasets
Readings
We'll be learning about many different big data systems, and so no textbook closely corresponds to the lecture content. Thus, attending lectures and taking notes will be your primary resource.
We will, however, have recommended (though optional) readings for many systems. We'll select from O'Reilly textbooks because you can read them free online via the Madison Public Library. You just need to do the following:
- get a library card (free)
- sign into the O'Reilly collection with your card number
- search for the assigned book
Here are some of the main texts we'll reference this semester:
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (1st edition), by Martin Kleppmann
- Learning Spark: Lightning-Fast Data Analytics (2nd edition), by Jules Damji et al.
- Cassandra: The Definitive Guide: Distributed Data at Web Scale (revised 3rd edition), by Jeff Carpenter et al.
- Kafka: The Definitive Guide: Real-Time Data and Stream Processing at Scale (2nd Edition), by Gwen Shapira et al.
Sometimes we may post lecture notes too.
Communication
We message the class regularly via Canvas announcements. We recommend updating your Canvas settings so that the "Announcement" option is "Notify immediately" so that you don't miss something important.
See the help page for details about how to contact us.
We have various forms for you to leave (optionally anonymous) feedback, report lab attendance, and thank TAs.
Course Components
Grading breakdown
There will be an opportunity to earn 104 points during the semester (100 regular points and 4 extra credit points).
- 3 In-person Exams, including final (15 points each, 45 points total)
- 2 In-person quizzes (10 points each, 20 points total)
- 2 In-person Hand-in worksheets (3 points each, 6 points total)
- 5 online quizzes (1 point each, 5 points total)
- 8 programming projects (3 points each, 24 points total)
- 4 extra credit points (for TopHat, Instructor Endorsements on Piazza, etc.)
Letter Grades
Grade thresholds will be applied to points (not percents!) as follows:
- A >= 94
- AB >= 90
- B >= 82
- BC >= 72
- C >= 65
- D >= 60
- F < 60
The extra credit opportunities will add up to more than 4 points, but nobody can earn more than 4 extra credit points in the course. Each TopHat point will be worth 0.2 course points in the extra credit category.
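As a rough illustration of how the thresholds and the extra credit cap interact, here is a minimal Python sketch. The function and parameter names (`total_points`, `letter_grade`, `other_extra`) are hypothetical, and the sketch assumes that TopHat and other extra credit simply add together and that no rounding is applied; the syllabus does not state either detail.

```python
# Thresholds in course points (not percents), copied from the list above.
THRESHOLDS = [(94, "A"), (90, "AB"), (82, "B"), (72, "BC"), (65, "C"), (60, "D")]
EXTRA_CREDIT_CAP = 4.0     # nobody can earn more than 4 extra credit points
TOPHAT_POINT_VALUE = 0.2   # one TopHat point is worth 0.2 course points


def total_points(regular, tophat_points=0, other_extra=0.0):
    """Regular points plus extra credit, with extra credit capped at 4.

    Assumption: TopHat credit and other extra credit are summed before capping.
    """
    extra = tophat_points * TOPHAT_POINT_VALUE + other_extra
    return regular + min(extra, EXTRA_CREDIT_CAP)


def letter_grade(points):
    """Map total course points to a letter grade (assumes no rounding)."""
    for cutoff, letter in THRESHOLDS:
        if points >= cutoff:
            return letter
    return "F"


# 91 regular points plus 25 TopHat points: 25 * 0.2 = 5, capped at 4, total 95.
example_total = total_points(91, tophat_points=25)
example_letter = letter_grade(example_total)  # "A"
```

Under these assumptions, extra credit beyond the cap has no effect: a student with 91 regular points ends at 95 whether they collect 20 or 40 TopHat points.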
Exams and In-Person Quizzes
These will be taken in person and will be mostly multiple choice. Exam 1 and Exam 2 will be in the evening (location will be announced later), and Exam 3 will be during finals week. The in-person quizzes will be 30 minutes, during lecture time. All exams/quizzes are cumulative.
If you must miss an exam or in-person quiz (e.g., due to illness), and the instructor approves this, the others will receive greater weight, without scaling. For example, a quiz is worth 10/65 of all the points. So if you miss a quiz, the other quiz will be worth 10 * (65 / 55) points and the exams will be worth 15 * (65 / 55) points.
If you take an exam or quiz, you cannot drop it and reweight after the fact.
Exam 3 cannot be skipped (only rescheduled to a later date, if necessary).
There will be alternate options for exams, but not in-person quizzes. If you must miss an in-person quiz (e.g., due to illness), and the instructor approves this, the in-person quizzes and exams will receive greater weight, without scaling. For example, a quiz is worth 10/65 of all the points. So if you miss a quiz, the other quiz will be worth 10 * (65 / 55) points and the exams will be worth 15 * (65 / 55) points. If you can make neither an exam nor its alternate, it will similarly be reweighted (with instructor approval).
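The reweighting arithmetic in the example above can be sketched in Python as follows. This is only an illustration: the `reweight` function and the assessment names (`exam1`, `quiz1`, and so on) are hypothetical, and applying the same proportional rule to a missed exam is an assumption based on the word "similarly" rather than a worked case from the syllabus.

```python
from fractions import Fraction

# Point values from the grading breakdown: 3 exams at 15, 2 in-person quizzes at 10.
ASSESSMENTS = {"exam1": 15, "exam2": 15, "exam3": 15, "quiz1": 10, "quiz2": 10}


def reweight(missed):
    """Return the new point value of each remaining exam/quiz.

    The points of the missed items are spread over the remaining items in
    proportion to their original values, so the total stays at 65.
    """
    if "exam3" in missed:
        raise ValueError("Exam 3 cannot be skipped, only rescheduled")
    total = sum(ASSESSMENTS.values())  # 65
    remaining = {k: v for k, v in ASSESSMENTS.items() if k not in missed}
    scale = Fraction(total, sum(remaining.values()))  # 65/55 when one quiz is missed
    return {name: value * scale for name, value in remaining.items()}


new_values = reweight({"quiz1"})
quiz2_points = round(float(new_values["quiz2"]), 2)  # 11.82, i.e. 10 * (65 / 55)
exam_points = round(float(new_values["exam1"]), 2)   # 17.73, i.e. 15 * (65 / 55)
```

Using `Fraction` keeps the arithmetic exact, so the reweighted values still sum to exactly 65 points.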
Quizzes
We'll have occasional online, multiple choice Canvas quizzes. There is no time limit, and they are open books/notes/AI. You may do them as a group with other CS 544 students if you like. Each student must submit the quiz individually.
Projects
There will be 8 substantial programming projects. AI policy will vary by project, to expose you to different tools (check the specific policy for each project). These can optionally be done with a single partner.
Academic Misconduct
Project Policies
Be sure to read and understand the full project collaboration policies here.
TopHat Policies
TopHat questions are intended for in-class participants. Students who submit any TopHat question remotely are not eligible for any extra credit for the course. We might notice this by passing around a sign-up sheet following a TopHat question.
Piazza Policies
Do not post project code snippets that are >5 lines long.
Exam and In-Person Quiz Policies
- students who have not taken an exam yet may study/prep with other students who have not taken it yet; it is fine to collaborate on creating note sheets
- students who have taken an exam may not discuss/share with a student who has not taken the exam yet; unless you have first-hand knowledge that another student has taken an exam, assume they have not taken it
- you may not sit adjacent to anybody you have met or know (even slightly)
Online Quiz Policies
Allowed
- however much time you need
- discussing answers with classmates who are taking the quiz at the same time
- referencing texts, notes, or provided course materials
- searching online for general information or using AI tools
- running code
NOT allowed
- taking it more than once
- discussing answers with anybody outside of the course
- posting anything publicly online about the quizzes
- using such material potentially posted by other students who broke the preceding rule
- getting TA/instructor help on quiz questions prior to the quiz deadline
Recommendation Letters
Earning a recommendation letter will require more effort than earning an A in this course. At a minimum, you'll have to work on a self-guided personal project. The topic for this project should be related to the content that you have learnt in this course. This project should not be part of any of your other course work or research work. I'll expect full documentation of the project in PDF form. I recommend generating this PDF using the notebook markdown documentation format. You'll see plenty of examples of this as part of the lecture notes.