Lecture Notes: Machine Learning

I finished Andrew Ng’s Machine Learning course in November 2018 and earned the Certificate on Coursera. The course lasts 11 weeks and covers topics such as linear regression, logistic regression, neural networks, SVMs, etc.

The full lecture notes will be posted over the next few days, since the originals were handwritten in my notebook. In compliance with the Coursera Honor Code, I won’t post solutions to the quizzes or assignments on this blog.

Week 1 Introduction

Lecture 1: Introduction

Machine Learning

Grew out of work in A.I.

New capability for computers.


  • Database mining: large datasets from the growth of automation/web (web click data)
  • Applications that can’t be programmed by hand (natural language processing, computer vision)
  • Self-customizing programs (Amazon/Netflix recommendations)
  • Understanding human learning (the brain, or real A.I.)


Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel (1959)

Well-posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Tom Mitchell (1998)


Supervised Learning
Unsupervised Learning
Other: Reinforcement Learning, Recommender Systems

Supervised Learning

Programs are given a data set.
We know what the correct output looks like.
We are familiar with the relationship between input and output.

  • Regression (Example: predicting a house’s price from its square footage)
  • Classification (Example: predicting whether a tumor is malignant or benign)
Unsupervised Learning

Little or no idea what our results should look like.
We derive structure from data where we don’t know the effects of the variables.
No feedback based on the prediction results.

Lecture 2: Linear regression with one variable

Here is the notation for a data set:
m: number of training examples
x: ‘input’ variable / features
y: ‘output’ variable / target variable
\small (x,y): one training example
\small (x^{(i)},y^{(i)}): the ith training example

In order to describe the learning problem more formally, we give a data set to the program and let it find a function \small h: X\rightarrow Y that predicts the value of y; this is called the hypothesis function.
\small h_{\theta}(x)=\theta_{0}+\theta_{1}x (abbreviated as h(x))
This model is called linear regression with one variable, or univariate linear regression.
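The course itself uses Octave, but as a minimal sketch in Python (the θ values and sample input below are made-up for illustration):

```python
# Univariate hypothesis h_theta(x) = theta0 + theta1 * x.
def h(theta0, theta1, x):
    """Predict y for input feature x under a linear hypothesis."""
    return theta0 + theta1 * x

print(h(1.0, 2.0, 3.0))  # 1 + 2*3 = 7.0
```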

Cost Function

Cost Function: J(\theta_{0}, \theta_{1}) = \frac{1}{2m} \sum_{i=1}^{m}(h_{\theta }(x^{(i)})-y^{(i)})^{2}
(The factor \frac{1}{2} makes the gradient descent computation more convenient.)
\underset{\theta_{0},\theta_{1}}{\text{minimize}}\; J(\theta_{0},\theta_{1})
J(\theta_{0},\theta_{1}), with two parameters, is usually visualized as a contour plot (contour figure).
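A direct translation of J into Python (a sketch; the toy data set below is invented for illustration):

```python
def cost(theta0, theta1, xs, ys):
    """J(theta0, theta1) = (1/2m) * sum of squared prediction errors."""
    m = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# A hypothesis that fits the data perfectly (y = 2x) has zero cost.
print(cost(0.0, 2.0, [1, 2, 3], [2, 4, 6]))  # 0.0
```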

Gradient Descent

The gradient descent algorithm is an efficient method to find the minimum of the cost function J.


\small repeat\,until\,convergence \,\{ {\theta_{j} := \theta_{j} - \alpha\frac{\partial }{\partial\theta_{j}}J(\theta_{0},\theta_{1})} \}
The two parameters should be updated simultaneously:
temp\,0 :=\, \theta_{0}-\alpha\frac{\partial }{\partial\theta_{0}}J(\theta_{0},\theta_{1})
temp\,1 :=\, \theta_{1}-\alpha\frac{\partial }{\partial\theta_{1}}J(\theta_{0},\theta_{1})
\theta_{0}:=\,temp\,0 ;\,\theta_{1}:=\,temp\,1
In these equations, \alpha is the ‘learning rate’.


For a cost function with one parameter (\small J(\theta_{1})):
The derivative gives the slope of \small J(\theta_{1}); by subtracting it, \small \theta_{1} moves closer to the minimum point.

For the learning rate \alpha: if \alpha is too small, gradient descent can be slow; if \alpha is too large, it may fail to converge or even diverge.
The rate \alpha can stay fixed, since the derivative of \small J(\theta_{1}) gradually decreases as the slope flattens near the minimum, so the update steps shrink automatically.

Gradient Descent for Linear Regression

\small \frac{\partial}{\partial\theta_{j}}J(\theta_{0},\theta_{1})=\frac{\partial}{\partial\theta_{j}}\frac{1}{2m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})^{2}
\small j=0: \frac{\partial}{\partial\theta_{0}}J(\theta_{0},\theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})
\small j=1: \frac{\partial}{\partial\theta_{1}}J(\theta_{0},\theta_{1})=\frac{1}{m}\sum_{i=1}^{m}(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}
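Putting the update rule and these derivatives together, a minimal batch gradient descent sketch in Python (the learning rate, iteration count, and toy data below are arbitrary choices for illustration, not from the course):

```python
def gradient_descent(xs, ys, alpha=0.1, iters=1000):
    """Batch gradient descent for univariate linear regression.
    Each step uses all m training examples (hence 'batch')."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients are computed before either theta moves.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# On data generated by y = 2x, the parameters converge toward (0, 2).
theta0, theta1 = gradient_descent([1, 2, 3], [2, 4, 6])
```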

This algorithm is called ‘Batch Gradient Descent’, or BGD: ‘batch’ means that each step of gradient descent uses all the training examples. For linear regression the cost function is convex, so there are no local optima and the algorithm eventually finds the global minimum.

Lecture 3: Linear Algebra Review


A matrix is a rectangular array of numbers.
A=\begin{bmatrix}0&1 &2 &3\\4&5&6&7\end{bmatrix}

A is a 2*4 matrix.
The dimension of a matrix is the number of its rows by the number of its columns.
\small A_{i\,j} is the ‘i,j’ entry in the ith row, jth column.
In this example. \small A_{2\,3}=6.
We use \small \mathbb{R}^{2\times 4} to represent a 2×4 matrix.


A vector is an n*1 matrix.
\small y_{i} is the ith element of the vector.
In this course, we use 1-based indexing.


Matrix Addition:
Addition and subtraction are element-wise, so you simply add or subtract each corresponding element.
\begin{bmatrix} a & b \\ c & d \end{bmatrix} +\begin{bmatrix} w & x \\ y & z \end{bmatrix} =\begin{bmatrix} a+w & b+x \\ c+y & d+z \end{bmatrix}

Matrix Multiplication with scalar:
\begin{bmatrix} a & b \\ c & d \end{bmatrix} * x =\begin{bmatrix} ax & bx\\ cx & dx \end{bmatrix}

Matrix-Vector Multiplication:
\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} \cdot\begin{bmatrix} x \\ y \\ \end{bmatrix} =\begin{bmatrix} ax + by \\ cx + dy \\ ex + fy\end{bmatrix}

(An m*n matrix multiplied by an n*1 vector results in an m*1 vector.)
The number of columns of the matrix must equal the number of rows in the vector.
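A small Python sketch of this rule (the matrix and vector values are arbitrary examples):

```python
def mat_vec(A, v):
    """Multiply an m*n matrix (given as a list of rows) by a length-n vector."""
    assert all(len(row) == len(v) for row in A), "columns of A must match rows of v"
    # Each output entry is the dot product of one row of A with v.
    return [sum(a * x for a, x in zip(row, v)) for row in A]

A = [[1, 2], [3, 4], [5, 6]]   # 3*2 matrix
v = [10, 20]                   # length-2 vector
print(mat_vec(A, v))           # [50, 110, 170] -- a 3*1 result
```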

Matrix-Matrix Multiplication:
\small \begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} \cdot\begin{bmatrix} w & x \\ y & z \\ \end{bmatrix} =\begin{bmatrix} aw + by & ax + bz \\ cw + dy & cx + dz \\ ew + fy & ex + fz\end{bmatrix}

Divide the second matrix into column vectors and multiply each by the first matrix.
An m*n matrix multiplied by an n*o matrix results in an m*o matrix.
The number of columns of the first matrix must equal the number of rows of the second.


  • Not commutative: A \times B \neq B \times A
  • Associative: A \times B \times C = A \times (B \times C) = (A \times B)\times C
  • Identity matrix: A\cdot I=I\cdot A=A.
  • For example: \begin{bmatrix} 1 & 0 & 0 \newline 0 & 1 & 0 \newline 0 & 0 & 1 \newline \end{bmatrix}
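These three properties can be checked numerically; a quick NumPy sketch (the matrices A and B are arbitrary examples):

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
I = np.eye(2, dtype=int)

# Not commutative: A @ B generally differs from B @ A.
print(np.array_equal(A @ B, B @ A))              # False
# Associative: the grouping of the products does not matter.
print(np.array_equal((A @ B) @ A, A @ (B @ A)))  # True
# Identity: multiplying by I leaves A unchanged.
print(np.array_equal(A @ I, A))                  # True
```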

Matrix Inverse and Transpose

The inverse of a matrix is denoted A^{-1}, and A\cdot A^{-1}=I (the identity matrix).
Non-square matrices have no inverse. In practice, use Octave to compute inverses.

Transposing a matrix swaps its rows and columns (visually: rotate it 90 degrees clockwise, then mirror it). For example:
A =  \begin{bmatrix}  a & b \newline   c & d \newline   e & f \end{bmatrix}\,A^T = \begin{bmatrix} a & c & e \newline b & d & f \newline \end{bmatrix}
So that A_{i\,j}=A_{j\,i}^{T}
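A short NumPy sketch of both operations (the matrix A is an arbitrary invertible example):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])

# The inverse exists because det(A) = 4*6 - 7*2 = 10 != 0.
A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))  # True: A times its inverse is I

# The transpose swaps rows and columns: A^T[i, j] == A[j, i].
A_T = A.T
print(A_T[0, 1] == A[1, 0])               # True
```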

Published by

Siujoeng Lau

Liberty will never perish.
