Syllabus
Welcome to Data Management for Data Science! This course covers deployment of relational databases and NoSQL databases for managing and analyzing structured, semi-structured, and unstructured data. Analysis will involve learning new query languages and applying gradient boosting algorithms for time series prediction. Data engineering techniques, including data pipelines, data warehouses, ETL, and ELT, will be covered. We will also cover fine-tuning Large Language Models (LLMs) with unstructured data, along with the fundamentals of vector databases and Retrieval-Augmented Generation (RAG). Visualization techniques will be employed for dashboard creation and data storytelling.
Revisions to Syllabus
- None yet.
Course Instructor
- Dr. Meenakshi Syamkumar (Teaching Faculty - Department of Computer Sciences) ms@cs.wisc.edu
Lecture (Meeting Time and Location)
- LEC001 BIRGE 145 01:20 PM - 02:10 PM
- We meet 3 times a week -- see the lecture schedule here.
- I'll ask questions during lecture via TopHat. Answering questions correctly will help you earn extra credit. Answering TopHat questions remotely is not permitted. Even a single violation will cost you the opportunity to earn extra credit. To enforce this, the "Secure Attendance" option will be enabled.
Instructional Modality
- LEC001: in-person
Learning Objectives
- Deploy SQL and NoSQL databases to manage structured, semi-structured, and unstructured data and write programs to analyze the datasets
- Gain proficiency in data integration techniques such as ETL and ELT, and demonstrate competencies with each stage of the ETL process
- Apply and compare prebuilt implementations of popular gradient boosting algorithms for time series prediction
- Develop foundational skills in fine-tuning Large Language Model (LLM) implementations with custom unstructured data
- Understand the fundamentals of Retrieval-Augmented Generation (RAG) by learning to work with vector databases
- Demonstrate visualization competencies for creating interactive plots, dashboards and data stories
Readings
We'll be learning about many different data management systems, and so no textbook closely corresponds to the lecture content. Thus, attending lectures and taking notes will be your primary resource.
We will have recommended (though optional) readings.
Here are some of the main texts we'll reference this semester:
- Fundamentals of Data Engineering by Joe Reis & Matt Housley
- Big Book of Data Engineering by Databricks
- Data Engineering with Python by Paul Crickard
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (1st edition), by Martin Kleppmann
To access these books online through the O'Reilly collection:
- get a library card (free)
- sign into the O'Reilly collection with your card number
- search for the assigned book
Communication
We message the class regularly via Canvas announcements. We recommend setting the "Announcement" option in your Canvas settings to "Notify immediately" so that you don't miss anything important.
See the help page for details about how to contact us.
We have various forms for you to leave (optionally anonymous) feedback, report lab attendance, and thank TAs.
Course Components
Grading breakdown
- Midterm (15%)
- Final (20%)
- 12 quizzes with 2 drops (15% total, 1.5% each)
- 6 programming projects (8% each, 48% total)
- Participation (2% total)
- TopHat (2% extra credit)
At the end of the semester, you'll have a score out of 100, which will be mapped to a letter grade (see below).
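The arithmetic behind that score can be sketched as follows. This is only an illustration of the breakdown above, not the official grading script: the function names are hypothetical, every component is assumed to be expressed as a fraction between 0 and 1, and the sketch does not cap the total because the syllabus does not say whether extra credit can push it past 100.

```python
def quiz_points(quiz_scores, drops=2, points_each=1.5):
    """Points earned from quizzes after dropping the lowest `drops` scores.

    quiz_scores: one fraction in [0, 1] per quiz (a missed quiz counts as 0.0).
    """
    kept = sorted(quiz_scores)[drops:]  # discard the lowest scores
    return points_each * sum(kept)


def total_score(midterm, final, quiz_scores, project_scores,
                participation, tophat=0.0):
    """Weighted score using the percentages in the grading breakdown.

    All inputs are fractions in [0, 1]; project_scores holds one entry
    per project. Treating TopHat as up to 2 points added on top is an
    assumption about how the extra credit is applied.
    """
    return (15 * midterm
            + 20 * final
            + quiz_points(quiz_scores)
            + 8 * sum(project_scores)
            + 2 * participation
            + 2 * tophat)


# Perfect marks everywhere, no extra credit.
print(total_score(1.0, 1.0, [1.0] * 12, [1.0] * 6, 1.0))  # 100.0
```

Sorting before slicing means two missed quizzes (scored 0.0) cost nothing, which is the point of the drop provision.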
Letter Grades
At the end of the semester, we will assign final grades based on these thresholds:
- 93% - 100%: A
- 88% - 92%: AB
- 80% - 87%: B
- 75% - 79%: BC
- 70% - 74%: C
- 60% - 69%: D
Grade thresholds will be subject to mathematical rounding. For example, a total score of 92.5+ would be mapped to an A grade.
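A minimal sketch of that mapping, assuming "mathematical rounding" means rounding halves up to the nearest whole percent. The names `THRESHOLDS` and `letter_grade` are hypothetical, and the "F" returned for scores below 60% is an assumption, since the list above stops at D. Note that Python's built-in `round` rounds halves to the nearest even integer, so `round(92.5)` gives 92; the sketch avoids it for that reason.

```python
import math

# (minimum rounded score, letter), highest first; taken from the list above.
THRESHOLDS = [(93, "A"), (88, "AB"), (80, "B"), (75, "BC"), (70, "C"), (60, "D")]


def letter_grade(score):
    """Map a total score (in percent) to a letter grade."""
    rounded = math.floor(score + 0.5)  # round half up: 92.5 -> 93
    for minimum, letter in THRESHOLDS:
        if rounded >= minimum:
            return letter
    return "F"  # assumption: the syllabus does not list a grade below 60%


print(letter_grade(92.5))   # A
print(letter_grade(92.49))  # AB
```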
Exams
These will be multiple-choice exams taken in person. The midterm and the final will be held at a location different from the lecture room (to be announced).
The midterm will be held on Monday, March 3rd.
The final will be held on Tuesday, May 6th. I am unable to offer a make-up final before this assigned final exam date.
Quizzes
There will be a short Canvas quiz due at the end of most Wednesdays. You'll have the option to drop your two lowest quiz scores. The drop provision is in place to help with sickness or critical personal circumstances. Beyond these two drops, excuses for quizzes will not be provided. Please make sure to read the rules below regarding what is allowed and what is not.
Allowed
- however much time you need
- discussing answers with classmates who are taking the quiz at the same time
- referencing texts, notes, or provided course materials
- searching online for general information
- running code
NOT allowed
- taking it more than once
- discussing answers with anybody outside of the course
- discussing with classmates who have already completed the quiz when you haven't completed it yourself yet
- posting anything online about the quizzes
- using such material potentially posted by other students who broke the preceding rule
- getting TA/instructor help on quiz questions prior to the quiz deadline
Projects
See project policies here.
Participation
Participation will be calculated based on:
- Filling out class surveys within the provided deadlines
- Accepted pull requests fixing issues with project specifications
- Instructor-endorsed Piazza contributions: at least one endorsed question or one endorsed answer will earn you the credit
Academic Misconduct
Code copying between students is not allowed in this course, except between project partners. Copying includes emailing, taking photos, looking while typing line by line, etc. Copying code and then changing it is still copying and thus not allowed. Lock your computer when it's unattended.
Be sure to read and understand the full project collaboration policies here.
Citing ChatGPT (or other LLMs): using them is allowed with proper citation (see above link for details).
Citing Online Resources: you can copy small snippets of code from Stack Overflow (and other online references) if you cite them. For example, suppose I need to write some code that gets the median of a list of numbers. I might search for "how to get the median of a list in python" and find a solution at https://stackoverflow.com/questions/24101524/finding-median-of-list-in-python.
I could (legitimately) paste code from that page into my code, as long as it has a comment as follows:
# copied/adapted from https://stackoverflow.com/questions/24101524/finding-median-of-list-in-python
def median(lst):
    sortedLst = sorted(lst)
    lstLen = len(lst)
    index = (lstLen - 1) // 2
    if (lstLen % 2):
        return sortedLst[index]
    else:
        return (sortedLst[index] + sortedLst[index + 1])/2.0
In contrast, copying from a nearly complete project you find online (that accomplishes what you're trying to do for your project) is not OK. When in doubt, ask us! The best way to stay out of trouble is to be completely transparent about what you're doing.
Recommendation Letters
Earning a recommendation letter is much harder than earning an A in this course. At a minimum, I'll want to see you doing something complex and interesting beyond the assignments. For a typical letter, I'll have collaborated (independent study) with a student on some project for multiple months, with many iterations of feedback. Alternatively, you can earn a letter by working on a self-guided project that demonstrates the skills that you have obtained from this course. You must get in touch with me at least 6 weeks before the first recommendation letter deadline.
Most grad schools require recommenders to fill out long forms rating students on various abilities (see an example below). If you're asking me, make sure that I would be able to fill out such a form without needing to put "I don't know" as my answer to many of the questions.