Course: CS250: Python for Data Science

Course Introduction

Time: 67 hours
Free Certificate

This course attempts to strike a balance between presenting the vast set of methods within the field of data science and Python programming techniques for implementing them. Problem-solving and programming implementation will be emphasized throughout the course. All techniques presented will be introduced using real-world programming examples. A major goal of the course is to ensure that when you finish the course, you will have the programming and conceptual expertise you need to join the field of data science.

Several Python modules such as pandas, scikit-learn, scipy.stats, and statsmodels will be introduced that are useful for data analysis, data visualization, and data mining. The course will gradually shift from introductory topics such as a review of Python, matrix operations, and statistics to applications and implementing programs involving data mining, visualization, statistical models, and time series analysis.

Course Syllabus

First, read the course syllabus. Then, enroll in the course by clicking "Enroll me". Click Unit 1 to read its introduction and learning outcomes. You will then see the learning materials and instructions on how to use them.

Select activity Course Syllabus

Course Syllabus Page

Unit 1: What is Data Science?

This unit will introduce you to the field of data science. Before delving into the programming aspects of the course, it is important to have a clear view of what data science is. There are many techniques and computational methodologies for dealing with data science problems. The goal of this unit is to help put the rest of the course in context and help you understand how to conceptually organize various facets of the field.

When attempting to solve a data science problem, the overarching goal is to derive inferences and draw conclusions based on existing data sets. Such inferences are made through statistical, computational, and visualization techniques. Furthermore, even before computations can be made, data sets must often be curated and refined. This unit will help you to order and categorize your thinking to understand the flow of data science processes.

Completing this unit should take you approximately 7 hours.

Unit 2: Python for Data Science

This unit will introduce the Python IDE we will use in this course. We will also introduce installing Python modules relevant to upcoming units. The primary goals of this unit are to ensure that all required software is ready to run and to review the Python programming language.

Developing expertise in data science requires understanding drawn from a breadth of different subjects such as numerical methods, matrix computations, statistics, data processing, visualization, data mining, and statistical modeling. Python is an excellent vehicle for developing this expertise because of the availability of various modules capable of addressing these topics. In this course, we will use the numpy module to introduce numerical methods and matrix computations. The pandas module (built upon the numpy module) will be used for data processing and visualization. Basic statistical calculations will be accomplished using scipy and pandas. Data rendering and visualization applications will be addressed using the matplotlib and seaborn. Finally, data mining and statistical modeling will be introduced using sckit-learn and statsmodels. Mastering data science in the context of these Python modules will position you to delve into deeper subjects such as machine learning and deep learning.

You should leave this unit being able to write Python programs that can perform basic computational and data processing tasks. We will discuss core concepts such as data types, operators, functions, conditional statements, loops, and file handling. In addition, we will also give examples of Python data structures, such as lists and dictionaries, as well as basic object-oriented syntax. Understanding these data structures will enable you to implement basic plotting and data rendering instructions using the matplotlib module. This abbreviated (yet thorough) set of topics will serve as the programming vocabulary you will need to complete the course.

Completing this unit should take you approximately 6 hours.

Select activity Upon successful completion of this unit, you will ...
Upon successful completion of this unit, you will be able to:

create a Python notebook using Google Colab;

execute instructions using built-in Python data and control structures;

apply methods for random numbers within the random module; and

implement basic plotting and data rendering instructions using the matplotlib module.

2.1: Google Colaboratory
- Select activity Introduction to Google Colab
  Introduction to Google Colab URL
  
  Students must
  
  Mark as done
  
  The vast majority of coding examples in this course will be phrased in the form of a Python notebook. Google Colaboratory (or, just Google Colab) is an online interface for creating Python notebooks. It is extremely convenient because a vast set of Python modules (and, in particular, the modules used in this course) are already installed and, once imported, are ready for use.
  
  Assuming you have a Gmail account and are logged in, click the link provided to run Google Colab notebooks. In the menu bar, choose
  
  File --> New notebook
  
  After doing so, you should see an editable code cell appear in which you can enter a series of Python instructions. Your notebook is automatically linked to the Google Drive associated with your Gmail account. Your Google Colab notebooks are automatically created on your Google Drive in My Drive in a folder named Colab Notebooks. You can treat them as files within Google Colab and perform typical file operations such as renaming, opening, and saving.
  
  Upon entering some basic lines of Python code in the code cell such as:
  
  print('Hello world')
  a = 10
  print('a =', a)
  
  you can practice executing your code cell in one of several ways:
  
  Click the Play button on the left side of the cell.
  
  Type Alt+Enter: runs the cell and inserts a new, empty code cell immediately following it.
  
  Type Shift+Enter: runs the cell and moves attention to the next cell (adding a new cell if none exists).
  
  Type Cmd/Ctrl+Enter: runs the cell, but control does not advance to the next cell (control remains "in place").
  
  In this course, you will be introduced to several Python modules applicable for data science computations and program implementations:
  
  numpy: array operations
  
  pandas: dataframe operations
  
  scipy.stats: basic statistics
  
  matplotlib, pandas, seaborn: visualization
  
  scikit-learn: data mining
  
  statsmodels: time series analysis
  
  Upcoming units will explain and apply these modules. Before doing so, this unit will first review the basics of the Python programming language.
2.2: Datatypes, Operators, and the math Module
- Select activity Data Types in Python
  
  Data Types in Python Page
  
  Students must
  
  Mark as done
  
  Fundamental to all programming languages are datatypes and operators. This lesson provides a practical overview of the associated Python syntax. Key takeaways from this unit should be the ability to distinguish between different data types (int, float, str). Additionally, you should understand how to assign a variable with a value using the assignment operator. Finally, it is essential to understand the difference between arithmetic and comparison (or "relational") operators. Make sure to thoroughly practice the programming examples to solidify these concepts.
- Select activity Operators and the math Module
  Operators and the math Module Page
  
  Students must
  
  Mark as done
  
  The math module elevates Python to perform operations available on a basic calculator (logarithmic, trigonometric, exponential, etc.). At various points throughout the course, you will see this module invoked within programming examples using the import command:
  
  import math
  
  Commands from this module should be part of your repertoire as a Python programmer.
  
  We conclude this section by putting together concepts regarding datatypes, operators, and the math module. Upon completing this tutorial, you should be pretty comfortable with the basics of the Python programming language.
2.3: Control Statements, Loops, and Functions
- Select activity Functions, Loops, and Logic
  
  Functions, Loops, and Logic Book
  
  Students must
  
  Mark as done
  
  The core of Python programming is conditional if/else/elif control statements, loops, and functions. These are powerful tools that you must master to solve problems and implement algorithms. Use these materials to practice their syntax. If/else/elif blocks allow your program to make conditional decisions based upon values available at the time the block is executed. Loops allow you to perform a set of operations a set number of times based on a set of conditions. Python has two types of loops: for loops and while loops. For loops work by counting and terminating when the loop has executed the prescribed number of times. While loops work by testing a logical condition, and the loop terminates when the logical condition evaluates to a value of false. Functions are designed to take in a specific set of input values, operate on those values, and then return a set of output results.
- Select activity Functions and Control Structures
  
  Functions and Control Structures Page
  
  Students must
  
  Mark as done
  
  Use this video to reinforce the syntax and applications of if/else/elif statements, loops, and functions. Make sure you feel comfortable writing functions and testing values as they are input and returned from functions that you have written. Additionally, understanding program flow as you navigate through if-elif blocks implies that you can predict the outcome of the block before the code executes. This way, you will convince yourself that you understand how if/else/elif statements work. Finally, practice writing loops and predicting what values variables should take during each loop iteration.
2.4: Lists, Tuples, Sets, and Dictionaries
- Select activity Data Structures in Python
  
  Data Structures in Python Page
  
  Students must
  
  Mark as done
  
  Python has several built-in data structures. Lists allow you to create a collection of items that can be referenced using an index, and you can modify their values (they are "mutable"). Tuples are similar to lists except that, once created, you cannot modify their values (they are "immutable"). Dictionaries allow you to create a collection of items that can be referenced by a key (as opposed to an index). Use this section to learn about and practice programming examples involving lists, tuples, and dictionaries.
- Select activity Sets, Tuples, and Dictionaries
  
  Sets, Tuples, and Dictionaries Book
  
  Students must
  
  Mark as done
  
  Here is more practice with tuples and dictionaries. In addition, the Python built-in data structure known as a set is also covered. Sets are not ordered, and their elements cannot be indexed (sets are not lists). To understand Python set operations, remind yourself of basic operations such as the union and intersection. Use this tutorial to compare and contrast the syntax and programming uses for lists, tuples, sets, and dictionaries.
- Select activity Examples of Sets, Tuples, and Dictionaries
  
  Examples of Sets, Tuples, and Dictionaries Page
  
  Students must
  
  Mark as done
  
  Watch this video for more examples of tuples, sets, and dictionaries. Practice the examples and make sure you understand how to apply these data structures using functions. This video is designed to help you put together the concepts in this section.
2.5: The random Module
- Select activity Python's random Module
  
  Python's random Module Page
  
  Students must
  
  Mark as done
  
  Before diving into some aspects of random number generation using more advanced modules, review how to generate random numbers using the random module by practicing the code shown in these videos.
2.6: The matplotlib Module
- Select activity Visualization and matplotlib
  
  Visualization and matplotlib Page
  
  Students must
  
  Mark as done
  
  The matplotlib module is the first of several modules you will be introduced to for plotting, visualizing, and rendering data. Beginning at 20:40, follow this video to learn the basics of applying matplotlib. It is fundamental to understand how to annotate a graph with labels. We will revisit many of these concepts in upcoming units. Make sure to implement the programming examples to get used to the syntax.
- Select activity Precision Data Plotting with matplotlib
  
  Precision Data Plotting with matplotlib Page
  
  Students must
  
  Mark as done
  
  Although matplotlib can work with Python lists, it is often applied within the context of the numpy module and the pandas module. This tutorial briefly references these modules, which will be introduced and elaborated upon in upcoming units. For now, follow along with the programming examples to learn some more basic matplotlib plotting commands.
Unit 2 Assessment
- Select activity Unit 2 Assessment
  Unit 2 Assessment Quiz
  
  Students must
  
  Receive a grade
  
  Take this assessment to see how well you understood this unit.
  
  This assessment does not count towards your grade. It is just for practice!
  
  You will see the correct answers when you submit your answers. Use this to help you study for the final exam!
  
  You can take this assessment as many times as you want, whenever you want.

Unit 3: The numpy Module

Data science must deal with the various forms that data can take on when it is collected. While Python lists and dictionaries are powerful, data can often come in the form of numerical arrays or spreadsheets. This unit introduces the numpy module, which is useful for performing matrix operations on data contained within arrays. When you finish this unit, you will be able to implement a host of matrix operations relevant to data science.

Much of the syntax used for lists, such as indexing and slicing, naturally carries over to numpy arrays. Array operations in numpy are typical of what you would expect from basic linear algebra (such as matrix addition and multiplication, systems of linear equations, determinants, and matrix inverses). Additionally, methods within the matplotlib module can accept numpy arrays as input. Finally, we will also take a look at file handling using numpy, since having this skill is often useful for machine learning applications.

Completing this unit should take you approximately 6 hours.

Unit 4: Applied Statistics in Python

As data science can often involve making statistical inferences from data, many of the upcoming units will apply calculations rooted in probability and statistics. This unit is foundational in that it discusses various ways of generating random data, computing basic statistical measures, and performing statistical analyses in Python. When you finish this unit, you will be able to implement and apply Python methods from the scipy.stats module.

You have already seen that the random module can generate scalar random numbers and that numpy can generate arrays of random numbers. We will also find that many numpy methods extend quite naturally to the pandas module we will introduce in the next unit. Additionally, the scipy.stats module allows for statistical modeling and parameter calculations. These Python implementations will serve as a foundation for more sophisticated methods discussed we will use later in the course.

Completing this unit should take you approximately 13 hours.

Unit 5: The pandas Module

This unit introduces the pandas module, which is necessary for generalizing numpy array operations to dataframes containing things like spreadsheet data. When you finish this unit, you will be able to process pandas dataframes and perform the suites of calculations we outlined in the earlier units.

The pandas module contains many methods that greatly simplify processing data files relevant to data science. Data collected from real-world situations is often messy, and can contain observations you might want to discard. The pandas module offers useful methods that can deal with such situations. We will also discuss methods for visualizing data using pandas.

Completing this unit should take you approximately 4 hours.

Unit 6: Visualization

Now that you've mastered the basic statistical, array, and spreadsheet data processing techniques, it is natural to want to plot, render, and visualize that data. In this unit, we will discuss visualization techniques beyond those introduced using matplotlib and pandas. When you finish this unit, you will be able to implement and visualize data plots applied within the field of data science.

The matplotlib module is convenient for visualizing data formatted using numpy arrays. pandas is similarly equipped for pandas dataframes. The seaborn module is also designed to work with pandas data. It is extremely powerful for rendering data using methods professional data scientists would find useful. While matplotlib provides basic plotting capabilities, seaborn can be used to construct bar charts, violin plots, heat maps, and more. These more advanced forms of visualization for statistical data sets can enable one to immediately draw inferences based on patterns discerned within the plots.

Completing this unit should take you approximately 4 hours.

Unit 7: Data Mining I – Supervised Learning

Data mining attempts to find patterns and relationships within and between given data sets. The field of data mining is vast, so we have broken down its introduction into two units: supervised and unsupervised learning. We will then move on to statistical model-building. When you finish this unit, you will be able to implement learning systems fundamental to the field of data mining.

This unit discusses the basics of supervised learning, feature extraction, dimensionality reduction, and training and testing of supervised learning models. We will focus on benchmark models fundamental to data mining, such as Bayes' decision and K-nearest neighbor. We will implement them using the scikit-learn module. Understanding these methods will prepare you for future excursions into machine learning and deep learning.

Completing this unit should take you approximately 11 hours.

Unit 8: Data Mining II – Clustering Techniques

This unit extends the material in the previous unit to clustering techniques, which are useful for creating pattern classification models where the input classes are unknown (which we call unsupervised learning). When you finish this unit, you will be able to create programs capable of training and testing unsupervised learning models. As in the previous unit, we will implement these techniques using the scikit-learn module.

The clustering of input feature vectors can be accomplished in several different ways. This unit focuses on two techniques: K-means, which requires some knowledge of the number of classes, and hierarchical clustering, which allows the input data to gradually define the number of classes. Both methodologies have their place within the field of data science.

Completing this unit should take you approximately 5 hours.

Unit 9: Data Mining III – Statistical Modeling

There is more to data science than simply analyzing data. The ability to build a model from a data set implies that some deeper relationship amongst data observations has been captured. Now that we have covered the basics of data mining, it makes sense to consider statistical models that allow for inference and also have some predictive power.

This unit demonstrates how to apply the scikit-learn module to building regression models. We will also show how to interpret model parameters once a model has been constructed. This unit will teach you how to apply Python for creating regression models, drawing inferences, and making predictions using computed model parameters.

Completing this unit should take you approximately 5 hours.

Unit 10: Time Series Analysis

The last step in this introduction to data science requires us to deal with data derived from time series, such as stock prices as a function of time. All the tools from the earlier units will play a role in performing these analyses. As in the last unit, our goal is to build statistical models that allow for inference and prediction.

This unit introduces models for analyzing time-series data. The statsmodels module contains various analysis tools, including methods for handling stationary and nonstationary data. This unit will focus on constructing autoregressive, moving average, and autoregressive integrated moving average models. This unit will teach you how to implement Python programs capable of statistical inference and forecasting from time-series data.

Completing this unit should take you approximately 6 hours.

Study Guide

This study guide will help you get ready for the final exam. It discusses the key topics in each unit, walks through the learning outcomes, and lists important vocabulary. It is not meant to replace the course materials!

Select activity CS250 Study Guide

CS250 Study Guide Book

Course Feedback Survey

Please take a few minutes to give us feedback about this course. We appreciate your feedback, whether you completed the whole course or even just a few resources. Your feedback will help us make our courses better, and we use your feedback each time we make updates to our courses.

If you come across any urgent problems, email contact@saylor.org.

Select activity Course Feedback Survey

Course Feedback Survey URL

Certificate Final Exam

Take this exam if you want to earn a free Course Completion Certificate.

To receive a free Course Completion Certificate, you will need to earn a grade of 70% or higher on this final exam. Your grade for the exam will be calculated as soon as you complete it. If you do not pass the exam on your first try, you can take it again as many times as you want, with a 7-day waiting period between each attempt.

Once you pass this final exam, you will be awarded a free Course Completion Certificate.

Select activity CS250: Certificate Final Exam

CS250: Certificate Final Exam Quiz

Students must

Receive a grade

Receive a passing grade

Course Syllabus

Course Syllabus

Unit 1: What is Data Science?

1.1: Introduction to Data Science

A History of Data Science

Understanding Data Science

1.2: How Data Science Works

How Data Science Works

The Data Science Pipeline

The Data Science Lifecycle

1.3: Important Facets of Data Science

Data Scientist Archetypes

What is the Field of Data Science?

Thinking about the World

Unit 1 Assessment

Unit 1 Assessment

Unit 2: Python for Data Science

2.1: Google Colaboratory

Introduction to Google Colab

2.2: Datatypes, Operators, and the math Module

Data Types in Python

Operators and the math Module

2.3: Control Statements, Loops, and Functions

Functions, Loops, and Logic

Functions and Control Structures

2.4: Lists, Tuples, Sets, and Dictionaries

Data Structures in Python

Sets, Tuples, and Dictionaries

Examples of Sets, Tuples, and Dictionaries

2.5: The random Module

Python's random Module

2.6: The matplotlib Module

Visualization and matplotlib

Precision Data Plotting with matplotlib

Unit 2 Assessment

Unit 2 Assessment

Unit 3: The numpy Module

3.1: Constructing Arrays

Using Matrices

Creating numpy Arrays

numpy Fundamentals

numpy for Numerical and Scientific Computing

3.2: Indexing

numpy Arrays and Vectorized Programming

Advanced Indexing with numpy

3.3: Array Operations

A Visual Intro to numpy and Data Representation

Mathematical Operations with numpy

numpy with matplotlib

3.4: Saving and Loading Data

Storing Data in Files

Load Compressed Data using numpy.load

Saving a Compressed File with numpy

".npy" versus ".npz" Files

Unit 3 Assessment

Unit 3 Assessment

Unit 4: Applied Statistics in Python

4.1: Basic Statistical Measures and Distributions

Applying Statistics

Key Statistical Terms

Descriptive Statistics

Basic Probability

Distribution and Standard Deviation

Continuous Probability Functions and the Uniform Distribution

The Normal Distribution

Confidence Intervals

Hypothesis Testing

Linear Regression

4.2: Random Numbers in numpy

Using numpy

Random Number Generation

Using np.random.normal

A Data Science Example

4.3: The scipy.stats Module

Descriptive Statistics in Python

Statistical Modeling with scipy

Probability Distributions and their Stories

4.4: Data Science Applications

Statistics and Random Numbers

Statistics in Python