Combined transcript-first reference for the foundations and the single-solution metaheuristics: complexity, optimization framing, hill climbing, statistics, exploration vs exploitation, simulated annealing, tabu search, and escape strategies.
The doctor positions metaheuristics as what you reach for when polynomial algorithms do not exist. If your problem is linear, n log n, n², or even n³, do not even open this course. Only when the search space is exponential do metaheuristics earn their place.
| Worst-case complexity | What to do |
|---|---|
| Linear / polynomial / n log n | Use the polynomial algorithm. Don't go forward with this course. |
| n^k for moderate k | Still acceptable for small n. May not need a metaheuristic. |
| Exponential (e.g. SAT 2^n, bin packing / BPP) | Brute force is infeasible. A metaheuristic is justified. |
Three pieces define an optimization problem: a search space, an objective function, and the goal of finding s* such that f(s*) is best (min or max) among feasible s.
| Type | Known | Find |
|---|---|---|
| Optimization | Model + output target | Input (decision variables) |
| Modelling / ML | Input + output | Model (parameters) |
| Simulation | Model + input | Output (what-if) |
| Constraint satisfaction (CSP) | Model + constraints, no objective | Any feasible solution |
Every optimization problem can be reduced to a decision problem (does there exist s with f(s) ≤ k?). The 8-queens canonical example: the decision form asks "is there a configuration with zero attacks?"; the optimization form asks "minimize the number of attacks." We use optimization because metaheuristics need a graded improvement signal, not just yes/no.
Handling infeasible candidates - four strategies:
| Strategy | How | Trade-off |
|---|---|---|
| Reject | Discard the candidate, generate another. | Wastes iterations. Acceptable only when violations are rare. |
| Repair | Take the violating component out and place it elsewhere (Sharia's bin-packing fix). | Always feasible. Higher cost per iteration. |
| Penalty function | Add a penalty term proportional to the violation amount or percentage. | Most flexible. Watch the magnitude. |
| Decoding / by-design | Pick a representation where invalid solutions cannot exist (the gold standard). | Hardest to design; cheapest to run. |
Termination criteria - when do you stop?
| Criterion | How you know | Trap |
|---|---|---|
| Found ideal / optimal | You confirm via problem structure (e.g. zero attacks in 8-queens). | Often impossible to prove for NP-hard problems. |
| Out of time / max iterations | Wall-clock budget or iteration counter. | None - this is the most common. |
| No improvement (saturation) | Best-so-far has not improved in k iterations. | You may be stuck at a local optimum, not the global one. |
| Schedule completion | SA temperature reached zero (or below threshold). | Tied to SA only. |
Single-solution metaheuristics share the same loop: hold one solution, generate one (or several) candidates, decide who survives. The differences are in how the candidates are generated and how the survival decision is made.
| Notation | Meaning | Behaviour |
|---|---|---|
| 1+1 | One parent, one child, the better survives. | Pure greedy. Maximum exploitation. |
| 1,1 | One parent, one child, child always replaces parent. | Forced movement. Allows escape from local optima. |
| 1+λ | One parent, λ children, best of (parent and children) survives. | Sample-the-gradient with elitism. |
| 1,λ | One parent, λ children, parent dies, best of children continues. | Sample-the-gradient without elitism. Exploration-friendly. |
In the plus strategies, because the parent always competes, the algorithm never accepts a worse move - it cannot escape a basin without an external mechanism (restart, memory, perturbation). This is exactly the gap that SA, TS, ILS and friends fill.
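A minimal sketch of that shared loop (minimisation; `init`, `tweak`, `quality` are problem-specific hooks and the names are illustrative):

```python
def single_solution_search(init, tweak, quality, iters, elitist=True):
    """Generic s-metaheuristic skeleton for minimisation.
    elitist=True  -> (1+1): the better of parent and child survives.
    elitist=False -> (1,1): the child always replaces the parent."""
    s = init()
    best = s
    for _ in range(iters):
        r = tweak(s)                       # generate one candidate
        if not elitist or quality(r) < quality(s):
            s = r                          # (1,1): forced move; (1+1): improvements only
        if quality(s) < quality(best):
            best = s                       # best-so-far is tracked separately
    return best
```

Generating λ candidates per iteration instead of one turns the same skeleton into 1+λ / 1,λ with no other change.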
Small move = exploitation. One swap, one bit flip on Gray code.
Large move = exploration. Multi-component change, big perturbation.
Random radius = the doctor's recommended trick. Draw the move size from a uniform or Gaussian distribution every iteration. Sometimes you exploit, sometimes you explore - in the same run, with no manual schedule.
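A sketch of the random-radius tweak, assuming the solution is a list and using swaps as the unit move (the Gaussian parameters are illustrative knobs, not lecture values):

```python
import random

def random_radius_tweak(s):
    """Draw the move size fresh every call: small draws exploit,
    occasional large draws explore - in the same run, no schedule."""
    radius = max(1, round(abs(random.gauss(0, 1.5))))  # mostly 1-2, sometimes large
    r = s[:]
    for _ in range(radius):
        i, j = random.randrange(len(r)), random.randrange(len(r))
        r[i], r[j] = r[j], r[i]                        # one swap per unit of radius
    return r
```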
The single most-emphasised warning. If the operator cannot reach a target solution from your starting point, no number of iterations will save you.
For BPP: swap alone is insufficient because swapping items between bins never empties a bin - so the bin count never drops. You must include move (cross-bin transfer) for the operator to reach lower-bin-count states.
Strong locality means a small genotype change produces a small fitness change. This is what lets HC and SA "feel" the gradient.
Plain binary representations have weak locality by construction: flipping the high bit of a number can change its value by a huge amount. Gray code restores strong locality - consecutive integers differ in exactly one bit, so a ±1 step in the phenotype is always a single-bit move in the genotype.
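A sketch of the standard reflected-Gray conversion; the assert spells out the locality property being relied on:

```python
def to_gray(n: int) -> int:
    """Binary -> reflected Gray code."""
    return n ^ (n >> 1)

def from_gray(g: int) -> int:
    """Reflected Gray code -> binary."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n

# Locality: a +/-1 step in the phenotype is always a single-bit genotype move.
assert all(bin(to_gray(i) ^ to_gray(i + 1)).count("1") == 1 for i in range(256))
```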
Steepest-ascent: evaluate every neighbour, take the best. Expensive but informed.
First-improvement: walk neighbours in some order, take the first one that improves. Cheaper, less greedy, often surprisingly competitive.
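Both policies as one-step functions (minimisation; `neighbours` yields the candidate solutions and is an illustrative hook):

```python
def steepest_ascent_step(s, neighbours, quality):
    """Evaluate every neighbour; move only if the best one improves."""
    best = min(neighbours(s), key=quality)
    return best if quality(best) < quality(s) else s

def first_improvement_step(s, neighbours, quality):
    """Scan neighbours in order; commit to the first improving one."""
    q_s = quality(s)
    for r in neighbours(s):
        if quality(r) < q_s:
            return r
    return s
```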
Representation choice is the single most important design decision. Same algorithm + different representation = different algorithm in practice.
Genotype = the encoding the algorithm manipulates. Phenotype = what the encoding decodes to (the actual solution we evaluate). The mapping between them determines locality.
| Mapping | What it means | Consequence |
|---|---|---|
| One-to-one | Each genotype maps to exactly one phenotype. | Ideal. No waste. |
| One-to-many (broken) | Same genotype decodes to multiple phenotypes. | Decoding is ambiguous; the fitness of a genotype is ill-defined. Avoid. |
| Many-to-one | Multiple genotypes decode to the same phenotype. | Search space is artificially inflated; the algorithm wastes iterations on duplicates. |
The 8-queens representation ladder - each encoding bakes in more constraints:
| Representation | Constraint enforced | Search space |
|---|---|---|
| 2D 8x8 board (any cell) | None | C(64, 8) ≈ 4.4 billion |
| One queen per row (length-8 vector of column indices) | One per row | 8^8 ≈ 16.8 million |
| Permutation (length-8 vector with distinct values) | One per row AND one per column | 8! = 40,320 |
Random-restart HC: when a climb saturates, restart from a fresh random point. The restart interval is the knob:
| Restart interval | Behaviour | Tag |
|---|---|---|
| Very long | Most of the budget spent on one HC run. Few restarts. Each run gets to fully exploit a basin. | Exploitation |
| Very short | Many restarts, none of which has time to climb. Mostly random sampling. | Exploration |
| Moderate | Each restart has time to refine, then the budget moves to a fresh region. | Balanced |
The n-sample HC variant: at each step, generate n candidate tweaks, then commit to the best one. n=1 is pure greedy 1+1. n = full neighbourhood is steepest-ascent.
With n=1 you commit blindly to the only sample you saw: with no gradient information you cannot tell whether you are walking up the steep side of the hill or the gentle side.
Steepest-ascent is expensive (the per-iteration cost is proportional to the neighbourhood size) and is more prone to local-optima trapping because there is no randomness to escape with.
1+λ keeps the parent in the competition (elitism). 1,λ does not - the parent dies regardless. This forces the search to move every iteration, even if all children are worse than the parent. This deliberate willingness to "go down" is how HC variants escaped local optima even before SA was invented.
Always: average + variance + median + 95% confidence interval (CI). Box plots are the standard visualization. Non-overlapping CIs = significant difference; overlapping = inconclusive.
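A minimal reporting helper using the normal approximation (assumption: enough runs for 1.96 to be a fair quantile; with very few runs, substitute a t-quantile):

```python
import math
import statistics

def report(run_results):
    """Mean, variance, median, and a 95% CI over repeated stochastic runs."""
    m = statistics.mean(run_results)
    var = statistics.variance(run_results)
    half = 1.96 * math.sqrt(var / len(run_results))   # normal-approximation CI
    return {"mean": m, "variance": var,
            "median": statistics.median(run_results),
            "ci95": (m - half, m + half)}
```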
The objective is "number of bins" - but that is too flat. Most local moves do not change the bin count, so HC sees a plateau and cannot tell whether it is improving.
Add an empty-space variance term, a fill ratio, or a minimum-slack term. The new fitness is sensitive to changes that the bin count cannot see. (This idea is reused as internal vs external fitness in coevolution; same trick, different name.)
Sum of all weights divided by bin capacity, rounded up, = the theoretical minimum number of bins. Combined with an approximation guarantee (FFD uses at most 11/9 · OPT + 6/9 bins; the classic 17/10 · OPT + 2 bound belongs to plain First Fit), you can bound the optimality gap of any solution.
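Both bounds in a few lines (a sketch; `weights` is a list of item sizes):

```python
import math

def bpp_lower_bound(weights, capacity):
    """Theoretical minimum bin count: ceil(total weight / capacity)."""
    return math.ceil(sum(weights) / capacity)

def first_fit_decreasing(weights, capacity):
    """FFD: sort items descending, place each into the first bin with room."""
    bins = []
    for w in sorted(weights, reverse=True):
        for b in bins:
            if sum(b) + w <= capacity:
                b.append(w)
                break
        else:
            bins.append([w])          # no bin had room: open a new one
    return bins
```

Any solution that matches the lower bound is provably optimal; otherwise the heuristic's guarantee brackets the gap.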
The exploration vs exploitation levers so far:
| Lever | Pull toward exploitation | Pull toward exploration |
|---|---|---|
| Locality (representation) | Strong (Gray code) | Weak (plain binary) |
| Tweak magnitude | Small move (1 swap) | Large move (multi-swap) |
| Restart interval | Long | Short |
| Sampling / survival strategy | n = 1 greedy (1+1) | 1,λ (forced move) |
| Population (foreshadow) | Single solution | Population by design |
If current best hugs best-so-far, you are exploiting too much. If current best oscillates wildly while best-so-far barely improves, you are exploring too much. The doctor's framing: "By seeing my monitors, I can tell - I need more exploration, I need more exploitation."
SA is hill climbing with a probabilistic willingness to accept a worse move. The acceptance probability decreases with the size of the quality gap and increases with the temperature.
Inspired by physics, not biology - the doctor stresses this contrast. A hot metal lets atoms rearrange; cooling slowly lets them settle into a clean lattice. Cooling too fast freezes them in a random configuration.
| Temperature | Acceptance probability of worse moves | Algorithm reduces to | Tag |
|---|---|---|---|
| T = 0 | exp(-delta/T) -> 0 as T -> 0. Worse moves never accepted. | Hill climbing. | Pure exploitation |
| T = infinity | exp(-delta/T) -> 1 as T -> infinity. Every move accepted. | Random walk. | Pure exploration |
| Moderate, schedule cooling | Starts permissive, becomes strict. | Classical SA - explores then exploits. | Balanced over time |
T must be calibrated to the spectrum of objective-value differences in your problem. The number 250 or 500 or 5000 is meaningless on its own - what matters is the ratio of T to the typical |Q(R) - Q(S)|.
| T relative to the typical delta-Q magnitude | Behaviour |
|---|---|
| T much smaller | delta-Q dominates the exponent. Probability of accepting worse moves is essentially zero from iteration one. SA degenerates to HC immediately. |
| T comparable | Acceptance probabilities are meaningful and span a useful range. |
| T much larger | delta-Q / T is near zero. Every move is accepted. SA is indistinguishable from random walk for many iterations until cooling catches up. |
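One common calibration trick (a sketch, not a lecture-verbatim recipe): random-walk the landscape, measure the typical delta-Q, and solve the acceptance formula backwards for T_0. The target rate `p_accept` is an assumed design knob:

```python
import math
import statistics

def calibrate_t0(init, tweak, quality, samples=200, p_accept=0.8):
    """Choose T_0 so exp(-mean|delta-Q| / T_0) = p_accept at iteration one."""
    s = init()
    deltas = []
    for _ in range(samples):
        r = tweak(s)
        deltas.append(abs(quality(r) - quality(s)))
        s = r                                  # keep walking to sample the spectrum
    mean_delta = statistics.mean(deltas) or 1.0
    return -mean_delta / math.log(p_accept)    # solve exp(-d/T) = p for T
```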
Cooling schedules:
| Schedule | How | Behaviour |
|---|---|---|
| Linear | T = T_0 - (T_0 / N) * iter. Drops by a constant per step. | Predictable. Reaches zero exactly at the end of the budget. |
| Geometric (most common) | T_new = alpha * T_old, with alpha in [0.9, 0.99]. | Never reaches zero. Approaches it asymptotically. |
| Logarithmic | T = C / log(iter). | Theoretically convergent to the global optimum, but impractically slow. |
| Adaptive | Adjust T based on observed best-so-far behaviour. | The engineer's design choice. Tied to plot diagnostics. |
| Cooling speed | Behaviour | Tag |
|---|---|---|
| Very fast | Freezes early. Becomes HC quickly. Stuck at the first local optimum. | Exploitation |
| Very slow | Stays hot too long. Random walk for most of the budget. Never settles. | Exploration |
| Geometric (alpha around 0.95) | Classic balance. | Balanced |
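The whole algorithm in one sketch - Metropolis acceptance plus geometric cooling (minimisation; hooks as before):

```python
import math
import random

def simulated_annealing(init, tweak, quality, t0, alpha, iters):
    """Accept improvements always; accept worse moves with prob exp(-delta/T).
    Geometric cooling never reaches zero, so the exponent is always defined."""
    s = init()
    best = s
    t = t0
    for _ in range(iters):
        r = tweak(s)
        delta = quality(r) - quality(s)
        if delta < 0 or random.random() < math.exp(-delta / t):
            s = r                    # downhill moves slip through while T is high
        if quality(s) < quality(best):
            best = s                 # best-so-far is immune to the downhill moves
        t *= alpha                   # e.g. alpha = 0.95
    return best
```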
Tabu Search = hill climbing + memory of forbidden states/moves to escape local optima.
Without memory, HC after escaping a local optimum walks straight back to it on the next iteration. Tabu memory forbids re-entry for a number of iterations (the tenure).
TS is structured as (1, λ): generate λ neighbour tweaks, take the best non-tabu one - without comparing against the current S. If you compared back to S you would prevent escape, because S is the bully.
What the tabu list stores - four granularities:
| Store | How | Notes |
|---|---|---|
| Full solutions | Store the entire previous solution. | Maximally specific. Trap: each entry covers a negligible area relative to the landscape, so you cannot cover it and near-cycles slip through. |
| Value ranges | Store a range of values around the visited point; discretise continuous landscapes first. | Each tabu entry covers a whole neighbourhood. Much more memory-efficient. |
| Move attributes | Store the swap or move you just executed; forbid its inverse for the next k iterations. | Cheap and intuitive. Standard default. |
| Schemata / features | Store a structural fragment using wildcards (e.g. `8*5*`); forbid any solution matching the schema. | For BPP: which items are co-located in a bin. For TSP: which edge is used. For scheduling: "job 3 at position k". |
Tenure = the number of iterations a forbidden move stays in the list.
| Tenure | Behaviour | Tag |
|---|---|---|
| Zero | No memory at all. Forbidden move can be undone immediately. TS reduces to plain HC; cycling is possible. | No protection |
| Very small (1-2) | "Cycling - same step again, again." Almost no anti-cycling protection. | Cycling |
| Moderate (5-9 swaps) | Lets the search escape the local bully but releases moves in time so good ones can be re-tried. | Balanced |
| Very large | Almost no move can be repeated within the run. Forced diversification, but blocks legitimately good moves and can corner the search. | Over-constrained |
The standard aspiration rule: override the tabu list when the candidate move would yield a solution better than the global best so far. Refusing such a move because it is technically tabu would be foolish.
Aspiration is the safety valve that protects effectiveness regardless of the restriction style chosen for the tabu list.
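A sketch of the (1, λ) tabu loop with move-attribute memory and the aspiration override (`sample_move`, `apply_move`, `inverse` are problem-specific hooks with illustrative names):

```python
from collections import deque

def tabu_search(init, sample_move, apply_move, inverse, quality,
                tenure, lam, iters):
    """Take the best non-tabu candidate without comparing back to S;
    a candidate that beats the global best overrides the tabu list."""
    s = init()
    best, best_q = s, quality(s)
    tabu = deque(maxlen=tenure)              # sliding window of forbidden moves
    for _ in range(iters):
        moves = [sample_move(s) for _ in range(lam)]
        moves.sort(key=lambda m: quality(apply_move(s, m)))  # best first
        for m in moves:
            r = apply_move(s, m)
            q = quality(r)
            if m not in tabu or q < best_q:  # aspiration: new best overrides tabu
                s = r                        # forced move, even if worse than S
                tabu.append(inverse(m))      # forbid undoing it for `tenure` steps
                if q < best_q:
                    best, best_q = r, q
                break
    return best
```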
The tabu list itself is short-term memory (a sliding window of recent moves). Long-term memory tracks how often each component appears across all visited solutions, then uses that frequency to decide what to do next. The key insight: the same observation (a component appearing often) can be interpreted in two opposite ways, and the designer chooses which one based on what the search needs.
| Observation | Interpretation | Action |
|---|---|---|
| High frequency (a component keeps appearing in good solutions) | Strong signal -> intensify around it | Restart from a configuration where the high-frequency components are fixed. Doctor: "let me exploit them more." |
| High frequency (same observation) | Bully -> diversify away from it | Add the high-frequency components to the tabu list. Force every new candidate to use different components. |
| Low frequency (a component rarely tried) | Under-explored -> diversify into it | Restart from a configuration that contains the under-explored components. Doctor: "totally thinking outside the box." |
Why two rows for "high frequency"? The doctor explicitly described both uses in lecture - same data, opposite philosophy. Choose intensification when you trust the signal (the component is genuinely good); choose diversification when you suspect a bully (the component is dominating just because it appeared early, not because it is best).
Reactive Tabu does not fix the tenure or list size up front - it adapts both based on observed behaviour during the run.
| Observation | Reactive response |
|---|---|
| Solutions repeat / cycling detected | Tenure too short. Lengthen it. |
| Search is over-constrained, no useful moves available | Tenure too long. Shorten it. |
| No improvement for many iterations + tenure already long | Trigger a diversification step (random restart, perturbation, frequency-based restart). |
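The table as a rule of thumb (a sketch; the stagnation threshold is an illustrative choice, and the flags come from whatever cycle/constraint detectors the implementation has):

```python
def adapt_tenure(tenure, cycling, over_constrained, stagnating):
    """Return (new_tenure, trigger_diversification)."""
    if cycling:
        return tenure + 1, False          # tenure too short: lengthen it
    if over_constrained:
        return max(1, tenure - 1), False  # tenure too long: shorten it
    if stagnating and tenure > 5:         # long tenure yet no progress
        return tenure, True               # fire a diversification step
    return tenure, False
```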
Idea: instead of restarting from scratch, perturb the local optimum a little and run local search again. The perturbation should land you in a nearby basin with a (hopefully) better local optimum.
| Perturbation | Behaviour |
|---|---|
| Very weak / zero | Local search falls back into the same basin. ILS reduces to plain HC. No escape. |
| Moderate (keep common, change rest) | Doctor's recommended approach. Exploits accumulated structure while still escaping the current basin. |
| Very strong / random restart | "Complete change must be avoided - you lose all the work invested." Equivalent to random restart, no ILS benefit. |
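A minimal ILS sketch, assuming `perturb` keeps the common structure and changes the rest, and using "keep the better optimum" as the acceptance rule (one common choice):

```python
def iterated_local_search(init, local_search, perturb, quality, iters):
    """Climb, perturb the optimum moderately, climb again, keep the better."""
    best = local_search(init())
    for _ in range(iters):
        candidate = local_search(perturb(best))  # land in a nearby basin, climb it
        if quality(candidate) < quality(best):
            best = candidate
    return best
```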
Idea: the neighbourhood definition itself is variable. Sometimes you search close (exploit), sometimes far (explore). The radius schedule is the design knob.
| Schedule | Behaviour |
|---|---|
| Always close (small fixed radius) | Always exploiting. Stuck at local optima. |
| Always far (large fixed radius) | Always skipping local basins. Never converges. |
| Conservative growing (start small, expand on stagnation) | Classical VNS. Balanced. |
| Aggressive / random radius each iteration | Single algorithm sometimes exploits, sometimes explores. Doctor's preferred trick. |
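The conservative-growing schedule as a sketch (`shake(s, k)` is an assumed hook that perturbs s with magnitude k):

```python
def variable_neighbourhood_search(init, shake, local_search, quality,
                                  k_max, rounds):
    """Start close; expand the radius k on stagnation; snap back to k=1
    on any improvement."""
    s = local_search(init())
    for _ in range(rounds):
        k = 1
        while k <= k_max:
            candidate = local_search(shake(s, k))
            if quality(candidate) < quality(s):
                s, k = candidate, 1       # improvement: exploit close again
            else:
                k += 1                    # stagnation: look farther out
    return s
```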
Idea: change the objective function itself every time you visit a region. Add a penalty so the landscape deforms - the algorithm physically cannot return to the same place because that place no longer looks attractive.
Penalty is applied to features / components, not single solutions. For TSP each edge gets a penalty count; over-frequent edges accumulate penalty; the modified objective is:

g(s) = f(s) + lambda * sum_i p_i * I_i(s)

where p_i is the penalty count for feature i, I_i(s) is 1 if s uses feature i (0 otherwise), and lambda is the penalty strength.
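The deformed objective and a penalty-update rule as a sketch (`features`, `cost` are problem hooks; the utility cost_i / (1 + p_i) is the classic GLS choice, stated here as an assumption rather than lecture material):

```python
def gls_objective(s, f, features, penalties, lam):
    """g(s) = f(s) + lam * sum_i p_i * I_i(s), with I_i implicit in features(s)."""
    return f(s) + lam * sum(penalties[i] for i in features(s))

def penalise(s, features, penalties, cost):
    """At a local optimum, bump the penalty of the highest-utility feature
    so that region stops looking attractive."""
    worst = max(features(s), key=lambda i: cost(i) / (1 + penalties[i]))
    penalties[worst] += 1
```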
Idea: two-phase. (1) Construct a solution by greedy randomised choices. (2) Improve it with local search. Repeat for m iterations and report the average plus confidence interval.
At every construction step, rank all candidate components by quality and pick uniformly from the top P%.
| Top P% | Behaviour | Tag |
|---|---|---|
| P = top 1 (greedy) | Always nearest neighbour. Multi-start gives identical solutions. Stuck on the same local optimum every time. | Pure greedy |
| P = top 10-20% | Mostly greedy with stochasticity. Different restarts explore different paths. | Sweet spot |
| P = 100% (full random) | No greediness. Pure random construction. Equivalent to random restart. | Pure random |
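The construction phase as a sketch (`score` rates a component given the partial solution; names are illustrative):

```python
import random

def grasp_construct(components, score, p):
    """Rank remaining components, pick uniformly from the top fraction p
    (the restricted candidate list). p is the greediness knob."""
    solution, remaining = [], list(components)
    while remaining:
        remaining.sort(key=lambda c: score(c, solution))   # best (lowest) first
        rcl = remaining[:max(1, int(len(remaining) * p))]  # top p fraction
        choice = random.choice(rcl)
        solution.append(choice)
        remaining.remove(choice)
    return solution
```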
Idea: same family as GLS but cruder. Add random noise to the objective function so the bully's true fitness gets blurred. The algorithm sometimes pretends the bully is worse than it is, freeing the search to escape.
Unifying principle of GLS and noising: do not change the algorithm, change the landscape. Same algorithm runs on a different problem instance every iteration.
What is a metaheuristic? A general problem-solving template that explores the search space without a closed-form algorithm. It applies to any problem where a candidate can be evaluated as good or bad. No optimality guarantee - it returns the best found within the time budget.
Why use one? Because brute force on exponential search spaces is infeasible: SAT with 100 variables has 2^100 ≈ 10^30 assignments, and at 1000 evals/s, covering even 1% of them would take far more than 15 billion years.
What does SA become at T = 0? Hill climbing: worse moves are never accepted. Pure exploitation.
And at T = infinity? A random walk: every move is accepted. Pure exploration.
Is any fixed starting temperature universally good? No. T must be calibrated to the typical |delta-Q| of the problem: T much smaller than delta-Q degenerates to HC immediately; T much larger is a random walk for many iterations.
How does cooling speed map to behaviour? The doctor's framing: a slow cooling schedule keeps T high longer, so the algorithm resembles a random walk for more of the run; a fast schedule freezes early and behaves like concentrated HC.
What is the right restart interval? Moderate. Long wastes the budget after each plateau; short never gives any run time to exploit. (Midterm Q1.)
Why is plain binary a weak representation? Weak locality: flipping the high bit of a number can change it by a huge amount, so the algorithm cannot tell whether a single mutation was a small step or a large one. Gray code restores strong locality.
What must the tweak operator guarantee? Reachability: it must be able to reach the optimum from any starting point. If your BPP tweak is "swap two items", you can never reduce the number of bins - you must include "move" for connectivity.
Why is "number of bins" a bad fitness for BPP? Most local moves do not change the count, so the landscape is mostly plateau and HC is blind. Augment with empty-space variance, fill ratio, or max waste - same idea as internal vs external fitness in coevolution.
What should the tabu list store? Move attributes (cheap, intuitive) or feature components / schemata (more powerful for breaking bullies). Storing full solutions is naive - the area covered is negligible relative to the state space.
How long should the tenure be? Too short = cycling, no protection. Too long = blocks even legitimately good moves and the search corners itself. Sweet spot: 5-9 swaps.
What is aspiration? Override the tabu list when the move would give a new best-so-far - the protection valve that keeps effectiveness intact.
How is long-term frequency memory used? High-frequency components: either intensify (re-seed the search around them) or diversify (forbid them as bullies). Low-frequency: diversify (force them in to explore). Same data, opposite use depending on what the search needs.
How strong should the ILS perturbation be? Too weak - it falls back into the same basin. Too strong - equivalent to a random restart, losing the structure you just earned. Moderate = keep common, change rest.
What varies in VNS? The neighbourhood definition / radius. The schedule (fixed, growing, random) is the design choice.
Where does GLS intervene? Inside the objective function itself, via penalty terms on features. Best-so-far is tracked separately so the deformed landscape does not corrupt the recorded best.
What does GRASP's P control? The greediness knob. P = 1 is pure greedy (multi-start gives identical solutions); P = 100% is pure random; P around 10-20% is the sweet spot.
Why report statistics? Random selection means each run gives a different result. Report the mean plus a 95% confidence interval, not a single number.
What does memory buy tabu search? It breaks the bully's gravitational pull: a single-trajectory algorithm with memory can achieve diversity that previously required a population.
The pre-GA module is built on three engineer-controlled levers: representation (controls landscape shape and operator connectivity), tweak operator (controls move size and locality), and fitness function (controls sensitivity to small moves). Every algorithm here is a different way to dial exploration vs exploitation on top of these three knobs.

Hill climbing is the s-metaheuristic skeleton; its variants 1+1, 1,1, 1+lambda, 1,lambda differ only in how candidates are generated and who survives. Simulated annealing adds a probabilistic acceptance of worse moves, governed by temperature T - at T=0 SA reduces to HC, at T=infinity to random walk; the cooling schedule slides between them. Tabu search adds explicit memory so the bully cannot trap the search in a basin; tenure and list contents are the design knobs, with aspiration as the safety valve.

When HC variants get stuck, ILS perturbs the local optimum to land in a nearby basin, VNS expands the neighbourhood radius, GLS reshapes the objective itself with feature penalties, GRASP rebuilds solutions from scratch with a controlled greediness knob, and noising blurs the objective with random noise. Across all of them, statistics matter: a single run is non-deterministic, so report mean plus confidence interval over multiple runs, and design fitness functions sensitive enough to give the algorithm a usable gradient.