
## 1 Objectives and scope

This exercise is predominantly concerned with Hopfield networks and associative memory. When you are finished you should be able to:

• explain the principles underlying the operation and functionality of autoassociative networks,

• train the Hopfield network,

• explain the attractor dynamics of Hopfield networks and the concept of an energy function,

• demonstrate how autoassociative networks can do pattern completion and noise reduction,

• investigate the question of storage capacity and explain features that help increase it in associative memories.

Most of the operations of the Hopfield type of networks can be seen as vector-matrix operations.

First, we will look at some simple networks using the Hebbian learning principle, which is often employed in recurrent networks. You will construct an autoassociative memory of the Hopfield type, and explore its capabilities, capacity and limitations. In most of the tasks, you will be asked to study the dynamics and analyse their origins (potentially explaining different behaviours you observe).

## 2 Background

A neural network is called recurrent if it contains connections allowing output signals to enter again as input signals. Recurrent networks are in some sense more general than feedforward networks: a feedforward network can be seen as a special case of a recurrent network where the weights of all non-forward connections are set to zero.

One of the most important applications for recurrent networks is associative memory, storing information as dynamically stable configurations. Given a noisy or partial pattern, the network can recall the original version it has been trained on (autoassociative memory) or another learned pattern (heteroassociative memory).

The most well-known recurrent network is the Hopfield network, which is a fully connected autoassociative memory network with two-state (bipolar -1,1 or binary 0,1) neurons. One reason that the Hopfield network has been so well studied is that it is possible to analyse it using methods from statistical mechanics, enabling exact calculation of its storage capacity, convergence properties and stability. In this assignment you will solely focus on the most popular discrete version of the Hopfield network with the Hebbian learning rule and both synchronous (simultaneous) and asynchronous (sequential) update during recall.

2.1 Hebbian learning

The basic idea suggested by Donald Hebb is briefly summarised here. Assume that we have a set of neurons connected to each other through synapses. When the neurons are stimulated with some input pattern, correlated activity causes the respective synapses to grow, thereby strengthening the connections between correlated pairs of neurons.

This will facilitate co-activations within the emerging groups of neurons, referred to as cell assemblies by Hebb. One of the implications of these co-activating populations is a functional feature of pattern completion. In particular, when only part of the memory pattern (a partial cue or stimulus) is used to cue a memory network, stimulated neurons will likely activate other neurons in their assembly through connections strengthened by earlier Hebbian learning (see figure 1).

If, on the other hand, two neurons are anti-correlated, the synapses between them get weakened or even become inhibitory. This way a pattern of activity precludes other learned patterns of activity. This makes the network noise resistant, since a neuron not belonging to the current pattern of activation will be inhibited by the active neurons.

This form of learning is called Hebbian learning, and it is one of the most used unsupervised forms of learning in neural networks. Bear in mind its local nature.

To efficiently implement Hebbian learning, the calculations should be performed in matrix notation. The units can be indexed by $i$ and the desired activations of the training patterns are vectors $\bar{x}^{\mu}$ with components $x_i^{\mu}$, where $\mu$ is the number of the pattern. We could use for instance 0 and 1 for the activities, but the calculations become easier if we choose -1 and 1 (bipolar units).

To measure the correlated activities we can use the outer product $W = \bar{x}^T \bar{x}$ of the activity vectors we intend to learn. If the components $x_i$ and $x_j$ are correlated, $w_{ij}$ will be positive, and if they are anti-correlated, $w_{ij}$ will be negative. Note that $W$ is a symmetric matrix; each pair of units will be connected to each other with the same strength.

The coefficients of the weight matrix can then be written as:

$$w_{ij} = \frac{1}{N} \sum_{\mu=1}^{P} x_i^{\mu} x_j^{\mu}$$

where $\mu$ is the index within a set of patterns, $P$ is the number of patterns, and $N$ is the number of units.
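The formula above can be sketched in a few lines of numpy (the choice of language is an assumption; the lab itself may expect MATLAB):

```python
import numpy as np

def hebbian_weights(patterns):
    """Hebbian weights: W = (1/N) * sum over mu of the outer products x^mu (x^mu)^T."""
    X = np.asarray(patterns, dtype=float)   # shape (P, N), bipolar +/-1 entries
    N = X.shape[1]
    return X.T @ X / N                      # symmetric N x N matrix

# correlated components give a positive weight, anti-correlated a negative one
W = hebbian_weights([[1, 1, -1],
                     [1, 1,  1]])
```

Note that `X.T @ X` computes the sum of all the per-pattern outer products in one matrix product, which is why the matrix form is efficient.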

Figure 1: A simple Hebbian fully connected auto-associative network. When three units are activated by a stimulus, their mutual connections are strengthened. The next time any of them gets activated, it will tend to activate the other neurons.

2.2 Hopfield network recall

To recall a pattern of activation $\bar{x}$ in this network we can use the following update rule:

$$x_i \leftarrow \mathrm{sign}\left(\sum_j w_{ij} x_j\right)$$

where

$$\mathrm{sign}(x) = \begin{cases} 1 & x \geq 0 \\ -1 & x < 0 \end{cases}$$

In fact, this variant of the Hopfield network, where all states are updated synchronously, is less popular and goes under the nickname of the Little model. In the original Hopfield model the states are updated one at a time, allowing each to be influenced by other states that might have changed sign in the previous update steps. The choice of asynchronous or synchronous update has some effects on the convergence properties (see section 3.3). The Little model is very easy to implement relying on matrix operations.

Enter three memory patterns

x1=[-1 -1 1 -1 1 -1 -1 1]

x2=[-1 -1 -1 -1 -1 1 -1 -1]

x3=[-1 1 1 -1 -1 1 -1 1]

The next step is to calculate a weight matrix. In fact, you do not need to scale the weight matrix with the $\frac{1}{N}$ factor (the scaling becomes important only if there is a bias term, a transfer function with a more complicated shape, or if we look at the energy as in section 3.3).

If the network has succeeded in storing the patterns x1, x2 and x3, they should all be fixed points when you apply the update rule. This simply means that when you apply the update rule to these patterns you should receive the same pattern back (content-addressable memory).

• Check if the network was able to store all three patterns.
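A minimal check of this fixed-point property, sketched here in numpy (an assumption about tooling; sign(0) is taken as +1, as defined in section 2.2):

```python
import numpy as np

x1 = np.array([-1, -1,  1, -1,  1, -1, -1,  1])
x2 = np.array([-1, -1, -1, -1, -1,  1, -1, -1])
x3 = np.array([-1,  1,  1, -1, -1,  1, -1,  1])

X = np.vstack([x1, x2, x3])
W = X.T @ X                     # unscaled Hebbian weights (1/N factor omitted)

def update(x, W):
    """One synchronous (Little-model) update step; sign(0) counts as +1."""
    return np.where(W @ x >= 0, 1, -1)

# each stored pattern should come back unchanged
stable = [np.array_equal(update(x, W), x) for x in (x1, x2, x3)]
```

All three patterns turn out to be fixed points here: the self-term contributes $N x_i$, which dominates the small cross-talk from the other two patterns.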

## 3 Tasks and questions

3.1 Convergence and attractors

Can the memory recall the stored patterns from distorted input patterns? Define a few new patterns which are distorted versions of the original ones:

x1d=[ 1 -1 1 -1 1 -1 -1 1]

x2d=[ 1 1 -1 -1 -1 1 -1 -1]

x3d=[ 1 1 1 -1 1 1 -1 1]

x1d has a one-bit error; x2d and x3d have two-bit errors.

• Apply the update rule repeatedly until you reach a stable fixed point. Did all the patterns converge towards stored patterns?

You will probably find that x1, x2 and x3 are attractors in this network.

• How many attractors are there in this network? Hint: automate the search.

• What happens when you make the starting pattern even more dissimilar to the stored ones (e.g. more than half is wrong)?

The number of iterations needed to reach an attractor scales roughly as log(N) with the network size, which means there are few steps for a network this small. If you want, you can train a larger network to get slower convergence.
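The attractor search can be automated by brute force for this 8-unit network; the sketch below (numpy, synchronous updates) simply runs each of the 2⁸ = 256 possible states to a fixed point:

```python
import itertools
import numpy as np

X = np.array([[-1, -1,  1, -1,  1, -1, -1,  1],
              [-1, -1, -1, -1, -1,  1, -1, -1],
              [-1,  1,  1, -1, -1,  1, -1,  1]])
W = X.T @ X

def update(x, W):
    return np.where(W @ x >= 0, 1, -1)

# run all 256 states to convergence and collect the fixed points
# (with synchronous updates a state may also end in a 2-cycle; those are skipped)
attractors = set()
for bits in itertools.product([-1, 1], repeat=8):
    x = np.array(bits)
    for _ in range(20):                 # plenty of steps for N = 8
        x_new = update(x, W)
        if np.array_equal(x_new, x):
            attractors.add(tuple(int(v) for v in x_new))
            break
        x = x_new
```

Besides the stored patterns you should expect their sign-reversed versions among the attractors: since $W(-\bar{x}) = -W\bar{x}$, the inverse of any stored fixed point is also a fixed point.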

3.2 Sequential Update

So far we have only used a very small 8-neuron network. Now we will switch to a 1024-neuron network and picture patterns. Load the file pict.dat, which contains nine 1024-dimensional patterns stored one after another. We can name them p1, p2, p3, p4, p5, p6, p7, p8 and p9. To start with, learn the first three. Since large patterns are hard to read as rows of numbers, please display these 1024-dimensional patterns as 32 × 32 images.

• Check that the three patterns are stable.

• Can the network complete a degraded pattern? Try the pattern p10, which is a degraded version of p1, or p11, which is a mixture of p2 and p3.

• Clearly convergence is practically instantaneous. What happens if we select units randomly, calculate their new state, and then repeat the process in the spirit of the original sequential Hopfield dynamics? Please display the image every hundredth iteration or so.
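A random-order sequential recall loop might look as follows (a numpy sketch; pict.dat and the plotting call are not reproduced here, so a tiny stand-in pattern is used for the demo):

```python
import numpy as np

def recall_async(W, x0, n_steps=5000, rng=None):
    """Sequential Hopfield recall: pick one unit at random, update it, repeat.

    For the picture patterns one would reshape x to 32x32 and display it
    every hundredth step or so (e.g. with matplotlib's imshow)."""
    rng = np.random.default_rng() if rng is None else rng
    x = x0.copy()
    for _ in range(n_steps):
        i = rng.integers(len(x))                 # pick a random unit
        x[i] = 1 if W[i] @ x >= 0 else -1        # update only that unit
    return x

# stand-in demo: store one 8-unit pattern, recover it from a one-bit distortion
p = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = np.outer(p, p)
noisy = p.copy()
noisy[0] = -noisy[0]
restored = recall_async(W, noisy, n_steps=200, rng=np.random.default_rng(0))
```

Unlike the Little model, each updated unit immediately sees the new states of previously updated units, which is what makes the convergence gradual enough to visualise.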

3.3 Energy

Can we be sure that the network converges, or will it cycle between different states forever?

For networks with a symmetric connection matrix it is possible to define an energy function or Lyapunov function, a finite-valued function of the state that always decreases as the states change. Since it has to have a minimum at least somewhere, the dynamics must end up in an attractor¹. A simple energy function with this property is:

$$E = -\sum_i \sum_j w_{ij} x_i x_j$$

• What is the energy at the different attractors?

• What is the energy at the points of the distorted patterns?

• Follow how the energy changes from iteration to iteration when you use the sequential update rule to approach an attractor.

• Generate a weight matrix by setting the weights to normally distributed random numbers, and try iterating an arbitrary starting state. What happens?

• Make the weight matrix symmetric (e.g. by setting w = 0.5*(w + w')). What happens now? Why?
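The energy function lends itself to a direct implementation; a brief numpy sketch (the pattern here is a stand-in, not one of the lab's pictures):

```python
import numpy as np

def energy(W, x):
    """Lyapunov energy E = -sum_ij w_ij x_i x_j = -x^T W x."""
    return -x @ W @ x

# for a symmetric W, sequential updates can never increase E; as a quick check,
# a stored pattern should sit at lower energy than a distorted copy of it
p = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W = np.outer(p, p) / len(p)              # scaled weights, as suggested above
noisy = p.copy()
noisy[:2] = -noisy[:2]                   # two flipped bits
```

Calling `energy(W, x)` after every sequential update gives the monotonically decreasing trace asked for in the third bullet point.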

3.4 Distortion Resistance

The key property studied here is the resistance of the Hopfield network to noise and distortion. In order to study the network's robustness you can generate noisy/distorted test stimuli by randomly flipping a selected number of units, and investigate whether the network recovers the original clean patterns that have been used for training.

In particular, train a network with p1, p2 and p3, add noise to a pattern, iterate it a number of times and check whether it has been successfully restored. Let the script run across 0 to 100% noise and plot the result. For speed, use the Little model rather than asynchronous updates.
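The noise sweep could be scripted roughly as follows (a numpy sketch; random bipolar patterns stand in for p1, p2 and p3, since pict.dat is not reproduced here):

```python
import numpy as np

def flip_units(x, fraction, rng):
    """Return a copy of x with the given fraction of units flipped."""
    x = x.copy()
    idx = rng.choice(len(x), size=int(round(fraction * len(x))), replace=False)
    x[idx] = -x[idx]
    return x

def recall_sync(W, x, n_iter=10):
    """Repeated synchronous (Little-model) updates, for speed."""
    for _ in range(n_iter):
        x = np.where(W @ x >= 0, 1, -1)
    return x

rng = np.random.default_rng(1)
patterns = np.where(rng.random((3, 1024)) < 0.5, 1, -1)  # stand-ins for p1-p3
W = patterns.T @ patterns

success = {}
for noise_pct in range(0, 101, 10):
    noisy = flip_units(patterns[0], noise_pct / 100, rng)
    success[noise_pct] = bool(np.array_equal(recall_sync(W, noisy), patterns[0]))
```

At 100% noise the input is exactly the inverted pattern, so the network converges to the sign-reversed attractor rather than the stored one.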

• How much noise can be removed?

• Is there any difference between the three attractors with regard to noise tolerance?

• Does the network always converge to the right attractor? Do the extra iterations (beyond a single-step recall) help?

¹ In the Little model the dynamics can actually end up alternating between two states with the same energy; in the Hopfield model with asynchronous updates the attractor will always be a single state.

3.5 Capacity

Now add more and more memories to the network to see where the limit is. Start by adding p4 into the weight matrix and check if moderately distorted patterns can still be recognized. Then continue by adding others such as p5, p6 and p7 in some order, checking the performance after each addition.

• How many patterns could safely be stored? Was the drop in performance gradual or abrupt?

• Try to repeat this with learning a few random patterns instead of the pictures and see if you can store more.

• It has been shown that the capacity of a Hopfield network is around 0.138N. How do you explain the difference between random patterns and the pictures?

Create 300 random patterns and train a 100-unit (or larger) network with them. After each new pattern has been added to the weight matrix, calculate how many of the earlier patterns remain stable (a single iteration does not cause them to change) and plot the result.

• What happens with the number of stable patterns as more are learned?

• What happens if convergence to the pattern from a noisy version (a few flipped units) is used instead? What does the different behaviour for large numbers of patterns mean?
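This experiment might be scripted as below (a numpy sketch; plotting is left out, and the list n_stable is what one would plot):

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 100, 300
patterns = np.where(rng.random((P, N)) < 0.5, 1, -1)  # 300 random bipolar patterns

W = np.zeros((N, N))
n_stable = []
for p in range(P):
    W += np.outer(patterns[p], patterns[p])           # add one memory at a time
    # one-step stability check for all patterns stored so far (W is symmetric,
    # so patterns @ W gives each row's update input)
    one_step = np.where(patterns[:p + 1] @ W >= 0, 1, -1)
    n_stable.append(int((one_step == patterns[:p + 1]).all(axis=1).sum()))
```

Adding `np.fill_diagonal(W, 0)` before the stability check lets you repeat the comparison without self-connections, as discussed below.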

The self-connections $w_{ii}$ are always positive and quite strong; they always support units in remaining at their current state. If you remove them and compare the curves from pure and noisy patterns for a large number of patterns, you will see that the difference goes away. In general it is a good idea to remove self-connections, even though it seems that this step lowers the memory performance. In fact, self-connections promote the formation of spurious patterns and have a negative effect on noise removal capabilities.

• What is the maximum number of retrievable patterns for this network?

• What happens if you bias the patterns, e.g. use sign(0.5+randn(300,100)) or something similar to make them contain more +1? How does this relate to the capacity results for the picture patterns?

3.6 Sparse Patterns

The reduction in capacity because of bias is troublesome, since real data usually is not evenly balanced.

Here we will use binary (0,1) patterns, since they are easier to use than bipolar (±1) patterns in this case, and it makes sense to view the ground state as zero and differing neurons as active. If the average activity

$$\rho = \frac{1}{NP} \sum_{\mu} \sum_{i} x_i^{\mu}$$

is known, the learning rule can be adjusted to deal with this imbalance:

$$w_{ij} = \sum_{\mu=1}^{P} (x_i^{\mu} - \rho)(x_j^{\mu} - \rho)$$

This produces weights that are still on average zero. When updating, please use the slightly modified rule:

$$x_i \leftarrow 0.5 + 0.5 \, \mathrm{sign}\left(\sum_j w_{ij} x_j - \theta\right)$$

where $\theta$ is a bias term.
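The adjusted rules can be sketched in numpy as below (with sign(0) again taken as +1, the update reduces to a threshold comparison; the pattern statistics and the list of θ values are illustrative assumptions):

```python
import numpy as np

def sparse_weights(patterns, rho):
    """w_ij = sum_mu (x_i^mu - rho)(x_j^mu - rho) for binary 0/1 patterns."""
    X = np.asarray(patterns, dtype=float) - rho
    return X.T @ X

def update_sparse(W, x, theta):
    """x_i <- 0.5 + 0.5*sign(sum_j w_ij x_j - theta), with sign(0) = +1."""
    return np.where(W @ x - theta >= 0, 1.0, 0.0)

# sketch of the theta sweep: 10%-active random binary patterns, counting how
# many are stable under a single update at each bias value
rng = np.random.default_rng(0)
N, P, rho = 100, 20, 0.1
patterns = (rng.random((P, N)) < rho).astype(float)
W = sparse_weights(patterns, patterns.mean())   # use the measured average activity
stable = {theta: int(sum(np.array_equal(update_sparse(W, x, theta), x)
                         for x in patterns))
          for theta in [0.0, 1.0, 2.0, 4.0, 8.0]}
```

Note that with 0/1 patterns only the active units contribute to the input sum, which is why the threshold θ rather than the sign boundary at zero does the work of separating active from inactive units.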

• Try generating sparse patterns with just 10% activity and see how many can be stored for different values of θ (use a script to check different values of the bias).

• What about even sparser patterns (ρ = 0.05 or 0.01)?

Good luck!