Global alignments – review
Take two sequences X and Y

M[i, j] = max { M[i−1, j−1] ± 1,  M[i, j−1] − 2,  M[i−1, j] − 2 }

The best alignment score for X[1…i] and Y[1…j] is called M[i, j]
Initialization: M[0, 0] = 0
Apply the equation
Find the alignment with backtracking
[Figure: the DP matrix for X = GAGGCGA (columns 0–7) and Y = GAGTGA (rows 0–6); the final score is M[6, 7] = 2]
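The recurrence above can be sketched directly in Python. A minimal sketch, using the slide's scoring (the ±1 stands for +1 on a match and an assumed −1 on a mismatch, with −2 per gap) and the sequences from the matrix figure:

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, gap=-2):
    """Fill the global alignment score matrix M, as in the slide's recurrence."""
    n, m = len(y), len(x)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):      # initialization: gaps along the first row
        M[0][j] = j * gap
    for i in range(1, n + 1):      # ... and the first column
        M[i][0] = i * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + s,   # (mis)match: +-1
                          M[i][j - 1] + gap,     # gap: -2
                          M[i - 1][j] + gap)     # gap: -2
    return M

M = needleman_wunsch("GAGGCGA", "GAGTGA")
print(M[-1][-1])  # best global alignment score; backtracking would start here
```

Backtracking then walks from M[n][m] back to M[0][0], at each cell choosing a predecessor that produced the max.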
Algorithm time/space complexity – Big-O notation
A simple description of complexity:
– constant O(1), linear O(n), quadratic O(n²), cubic O(n³)...
Asymptotic upper bound; read: "order of"

f(n) = O(g(n)) iff ∃ x₀, c: ∀ x ≥ x₀, f(x) ≤ c·g(x)

(g simple, e.g. n²)
[Figure: plot of f(x) against ln x and 2·ln x; for x ≥ x₀ = 17, f(x) ≤ 2·ln x, so f = O(ln x)]
Big-O notation – example
Time complexity of global alignment:

M[i, j] = max { M[i−1, j−1] ± 1,  M[i, j−1] − 2,  M[i−1, j] − 2 }

init + calc M + print: at most 10·nm + 1 steps = O(nm)
Global alignment – linear space

M[i, j] = max { M[i−1, j−1] ± 1,  M[i, j−1] − 2,  M[i−1, j] − 2 }

We need O(nm) time, but only O(m) space
– how?
– the problem: backtracking needs the whole matrix
Global alignment – linear space, recursion
LSpace(i, i', j, j'):
  LSpace(i, n/2 − 1, j, k₁)
  LSpace(n/2 + 1, i', k₂, j')
[Figure: the matrix split at row n/2; the optimal path crosses row n/2 between columns k₁ and k₂]
space complexity: O(m)
Global alignment – linear space, algorithm
LSpace(i, i', j, j'):
  return if area(i, i', j, j') is empty
  h := (i' − i) / 2
  calc M using O(m) memory, plus find the path fragment L_h crossing row h (between columns k₁ and k₂)
  LSpace(i, h − 1, j, k₁)
  print L_h
  LSpace(h + 1, i', k₂, j')
[Figure: the matrix with the middle row h = n/2 and the path fragment L_h]
time complexity: ∑_{i=0}^{log₂ n} nm/2^i ≤ 2nm
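The score half of the trick can be shown on its own: keeping only the previous row of M gives the O(m)-space score computation that each LSpace call performs (recovering the path itself needs the recursion above). A sketch with the same toy scoring, where −1 for a mismatch is an assumption:

```python
def nw_score_linear(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score in O(m) space: keep only two rows of M."""
    m = len(x)
    prev = [j * gap for j in range(m + 1)]       # row i-1 of the matrix
    for i in range(1, len(y) + 1):
        cur = [i * gap] + [0] * m                # row i, first column init'd
        for j in range(1, m + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,        # diagonal
                         cur[j - 1] + gap,       # left
                         prev[j] + gap)          # up
        prev = cur
    return prev[m]

print(nw_score_linear("GAGGCGA", "GAGTGA"))
```

The two rows use O(m) memory regardless of n, while the double loop still costs O(nm) time.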
Alignments: Local alignment
Local alignment: Smith-Waterman algorithm
What's local?
– Allow only parts of the sequences to match
– Locally maximal: cannot make it better by trimming/extending the alignment
[Figure: seq X and seq Y aligned only along a short matching fragment]
Local alignment
Why local?
– Parts of sequences diverge faster: evolutionary pressure does not put constraints on the whole sequence
– Proteins have modular construction: sharing domains between sequences
[Figure: seq X and seq Y sharing locally similar fragments]
Domains – example: the Immunoglobulin domain
Global → local alignment
a) global align: take the global equation, look at the result of the global alignment
b) retrieve the result:
   CAGCACTTGGATTCTCG
   CACGATTCGTG
c) sum the score along the result
[Figure: the two sequences on the matrix axes, and a plot of the running score sum vs. alignment position]
Local alignment – breaking the alignment
A recipe:
– Just don't let the score go below 0
– Start a new alignment when that happens
– Where is the result in the matrix?
[Figure: "Before" and "After" plots of the running score sum vs. alignment position; after the change the sum never drops below 0]
Local alignment – the equation

M[i, j] = max { M[i−1, j−1] + score(X[i], Y[j]),  M[i, j−1] − g,  M[i−1, j] − g,  0 }

Init the boundaries with 0's
Great contribution to science!
Run the algorithm
Read the maximal value from anywhere in the matrix
Find the result with backtracking
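The local equation is the global one plus the 0 option. A minimal sketch, with assumed scores (+1 match, −1 mismatch, g = 2) since the slides leave the mismatch score open:

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Local alignment score: the global recurrence plus the 0 option."""
    n, m = len(y), len(x)
    M = [[0] * (m + 1) for _ in range(n + 1)]   # boundaries init'd with 0's
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[j - 1] == y[i - 1] else mismatch
            M[i][j] = max(0,                      # break the alignment here
                          M[i - 1][j - 1] + s,
                          M[i][j - 1] + gap,
                          M[i - 1][j] + gap)
    # read the maximal value from anywhere in the matrix
    return max(v for row in M for v in row)

print(smith_waterman("GAGGCGA", "GAGTGA"))
```

Backtracking starts at the maximal cell and stops at the first 0, rather than running corner to corner as in the global case.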
Finding the second best alignment
We can find the best local alignment in the sequence
But where is the second best one?
Scoring: +1 for a match, −2 for a gap
[Figure: fragment of the score matrix around the best alignment (…AGA… vs …AGA…, scores 13–20); the best alignment and the high-scoring cells around it form a clump]
Clump of an alignment
Alignments sharing at least one aligned pair
[Figure: the same matrix fragment with the clump of alignments sharing aligned pairs highlighted]
Clumps
[Figure: clumps of local alignments between gene X and gene Y]
Finding the second best alignment
Don't let any matched pair contribute to the next alignment
"Clear" the best alignment, then recalculate the clump
[Figure: the matrix fragment with the best alignment's trace zeroed and the clump scores recalculated]
Extraction of local alignments – the Waterman-Eggert algorithm
1. Repeat:
   a. calc M without taking the cleared cells into account
   b. retrieve the highest scoring alignment
   c. set its trace to 0
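The repeat-and-clear loop can be sketched in a simplified form: after each traceback the aligned cells are cleared (forced to 0) and the whole matrix is recomputed. The real Waterman-Eggert algorithm recomputes only the affected clump; the −1 mismatch score is an assumption.

```python
def sw_matrix(x, y, cleared, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman matrix, skipping ('clearing') the given cells."""
    n, m = len(y), len(x)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in cleared:      # cleared cells stay 0
                continue
            s = match if x[j - 1] == y[i - 1] else mismatch
            M[i][j] = max(0, M[i - 1][j - 1] + s,
                          M[i][j - 1] + gap, M[i - 1][j] + gap)
    return M

def best_trace(M, x, y, match=1, mismatch=-1, gap=-2):
    """Retrieve the highest scoring alignment and the cells on its trace."""
    score, i, j = max((M[i][j], i, j)
                      for i in range(len(M)) for j in range(len(M[0])))
    cells = set()
    while M[i][j] > 0:                 # local traceback stops at the first 0
        cells.add((i, j))
        s = match if x[j - 1] == y[i - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + s:
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return score, cells

def waterman_eggert(x, y, k):
    """Yield scores of the k best local alignments sharing no aligned pair."""
    cleared = set()
    for _ in range(k):
        M = sw_matrix(x, y, cleared)
        score, cells = best_trace(M, x, y)
        if score == 0:
            break
        yield score
        cleared |= cells               # "clear" the best alignment's trace

print(list(waterman_eggert("GAGGCGA", "GAGTGA", 2)))
```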
Low complexity regions
Local composition bias
– Replication slippage: e.g. triplet repeats
Results in spurious hits:
– Breaks down statistical models
– Different proteins reported as hits due to similar composition
– Up to ¼ of the sequence can be biased
Huntington's disease
– Huntingtin gene of unknown (!) function
– Repeat count: 6–35: normal; 36–120: disease
Pitfalls of alignments
Alignment is not a reconstruction of the evolution
(the common ancestor is extinct by the time of alignment)
Pitfalls of alignments
– Matches to the same fragment
– Arbitrarily poor alignment regions
[Figure: seq X and seq Y with different X fragments matching the same Y fragment, and a poorly aligned region]
What's the number of steps in these algorithms?
How much memory is used?
Summary
1. Global – a.k.a. the Needleman-Wunsch algorithm
2. Global-local
3. Local – a.k.a. the Smith-Waterman algorithm
4. Many local alignments – a.k.a. the Waterman-Eggert algorithm
[Figure: seq X vs. seq Y sketches of each alignment type]
Amino acid substitution matrices
Percent Accepted Mutations – distance and matrices
Accepted by natural selection:
– not lethal
– not silent
Def.: S₁ and S₂ are PAM 1 distant if on avg. there was one mutation per 100 aa
Q.: If the seqs are PAM 8 distant, how many residues may be different?
PAM matrix
Created from "easy" alignments:
– pairwise
– gapless
– > 85% id
f(proline) – frequency of occurrence of proline
f(proline, valine) – frequency of substitution of proline with valine, for PAM 1

f(a, b) = ∑_aligns count(a, b) / ∑_aligns ∑_{c, d} count(c, d)

M – a symmetric matrix, i.e. M = [ f(a,a)  f(a,b)
                                   f(b,a)  f(b,b) ]
PAM matrix
How to calculate M for the PAM 2 distance?
– Take more distant seqs
– or extrapolate...

M² = [ f(a,a)  f(a,b) ]²  =  [ f(a,a)f(a,a) + f(a,b)f(b,a)   f(a,a)f(a,b) + f(a,b)f(b,b) ]
     [ f(b,a)  f(b,b) ]      [ ⋯                              ⋯                           ]
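The extrapolation step is plain matrix squaring. A toy sketch for the slide's two-letter alphabet, with made-up PAM 1 frequencies:

```python
# Hypothetical two-letter alphabet {a, b}; the entries are made-up
# PAM 1 frequencies arranged as the symmetric matrix M of the slide.
M = [[0.98, 0.02],
     [0.02, 0.98]]

def matmul(A, B):
    """Plain matrix product, used here to extrapolate M to M^2."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

M2 = matmul(M, M)   # the PAM 2 matrix, by extrapolation
print(M2)
```

Repeating the multiplication N times gives M^N, the basis of the PAM N log-odds matrix below.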
PAM log-odds matrix
Making of the PAM N matrix:

PAM_N[a, b] = log₂( f(a)·M^N[a, b] / (f(a)·f(b)) ) = log₂( M^N[a, b] / f(b) )

observed vs. by chance alone → odds; take the log → log odds
Why log?
Mutations and chance:
• more frequent than by chance: PAM_N[a, b] > 0
• less frequent: PAM_N[a, b] < 0
PAM 250 matrix
BLOSUM matrix
BLOcks SUbstitution Matrix
Based on gapless alignments
More often used than PAM
BLOSUM N matrix: cluster together sequences ≥ N% identity
• f(a, b) – frequency of occurrence of a and b in the same column
  f(a) = f(a,a) + ∑_{b≠a} f(a,b)/2
• e(a, b) – chance alone:
  ∑_{a≤b} e(a, b) = 1
  e(a, a) = f(a)²
  e(a, b) = 2·f(a)·f(b) for a ≠ b

BLOSUM_N[a, b] = log₂( f(a, b) / e(a, b) )   (log odds)
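The definitions above fit in a few lines. A sketch for a hypothetical two-letter alphabet with made-up column pair frequencies:

```python
from math import log2

# Made-up pair frequencies f(a, b) observed in block columns.
f = {("a", "a"): 0.6, ("a", "b"): 0.2, ("b", "b"): 0.2}

# Single-letter frequencies: f(a) = f(a,a) + sum over b != a of f(a,b)/2
fa = f[("a", "a")] + f[("a", "b")] / 2      # 0.7
fb = f[("b", "b")] + f[("a", "b")] / 2      # 0.3

# Chance-alone expectations e(a, b); they sum to 1 over a <= b.
e = {("a", "a"): fa ** 2,
     ("b", "b"): fb ** 2,
     ("a", "b"): 2 * fa * fb}

# Log-odds scores: positive when a pair occurs more often than by chance.
score = {pair: log2(f[pair] / e[pair]) for pair in f}
print(score)
```

Here (a, a) pairs are over-represented relative to chance (0.6 vs. 0.49), so their score is positive, while (a, b) pairs score negative.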
BLOSUM 62 matrix
# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units
# Blocks Database = /data/blocks_5.0/blocks.dat
# Cluster Percentage: >= 62
# Entropy = 0.6979, Expected = -0.5209
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
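In use, a substitution matrix is just a lookup table. A minimal sketch scoring a gapless alignment with a handful of entries copied from the BLOSUM 62 table above:

```python
# A few BLOSUM 62 entries (in half-bit units), copied from the table above.
BLOSUM62 = {("A", "A"): 4, ("R", "R"): 5, ("A", "R"): -1,
            ("W", "W"): 11, ("F", "Y"): 3, ("C", "C"): 9}

def pair_score(a, b):
    """The matrix is symmetric, so try both orders of the pair."""
    return BLOSUM62.get((a, b), BLOSUM62.get((b, a)))

def gapless_score(x, y):
    """Score a gapless alignment column by column."""
    return sum(pair_score(a, b) for a, b in zip(x, y))

print(gapless_score("ARW", "RRW"))
```

The same lookup replaces the ±1 toy scoring in the alignment recurrences earlier in the lecture.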
PAM vs BLOSUM
http://www.ncbi.nih.gov/Education/BLASTinfo/Scoring2.html
PAM is an extrapolation from closely related seqs
We are interested in more distant relationships