ICS503 - Final Revision Booklet
Reading this booklet. Every question follows the doctor's pattern of asking about extremes ("what if zero, what if very large?") rather than specific numerical values. The model answers below are skeletons: full marks come from observation -> mechanism -> exploration/exploitation tradeoff -> tie-back to the slide or the homework.
- "Coevolution will be in the exam" - explicit, transcript 23.
- "Cuckoo will not be in the exam" - skip cuckoo entirely.
- "Think about this fitness function until next time - it might be in the exam" - referring to multi-goal-state internal fitness on the 8-puzzle slide.
- He said he will not ask algorithm implementation; therefore the rest is conceptual / parameter / comprehension.
- The first question is likely 10 points and reuses the experiment / plot / parameter style from the midterm.
- I. Coevolution: 1-population (slides 7-9)
- II. Coevolution: 2-population (firewall / GANs)
- III. Coevolution: cooperative (soccer / project teams)
- IV. Coevolution: diversity maintenance
- V. Differential Evolution - parameter behaviour
- VI. Particle Swarm - parameter behaviour
- VII. AS / ACS - parameter behaviour
- VIII. GA / SA - parameter behaviour
- IX. Tabu & Meta1/Meta2 (likely revisits)
- X. 8-puzzle GA + coevolution overlay
I. Coevolution - 1-population (slides EC-05-7, 8, 9)
These three slides are the most likely comprehension target. EC-05-7 sets up internal vs external fitness and the difficulty of measuring external fitness when there is no ground truth. EC-05-8 puts the 8-puzzle in front of you with multiple goal states and individuals targeting different goals, and intentionally leaves two fitness values blank for you to design. EC-05-9 lifts the framework into Algorithm 76 and shows the BPP analogue: two solutions with the same number of bins but very different empty-space variance.
1. The two missing fitnesses
On slide EC-05-8, two of the four shown individuals have explicit fitness values and two are written as "Fitness = ??". Propose how internal fitness should be assigned to the two blank ones so that the population covers all goal states without collapsing onto one. Justify the proposal in terms of internal vs external fitness.
- External fitness stays as the absolute distance (Manhattan or misplaced tiles) to the closest goal. It is what the algorithm reports as Best.
- Internal fitness drives breeding and must be relative to the cluster of individuals targeting the same goal - not an absolute distance.
- Assign each individual to its closest goal, then rank or normalize within that goal-cluster (e.g. distance divided by the sum of distances inside the cluster).
- Reward unique coverage: an individual that is the only one heading for an underserved goal gets a coverage bonus, even if its raw distance is worse.
- Penalise redundancy: if many individuals are clustered on the same "easy" goal, share their credit so weaker outliers survive.
"This minus-2 here for Goal A is not bad - but Goal B has only one individual. If you pick blindly you take the two minus-2's and you LOSE goal B. Internal fitness has to be relative."
2. Why averaging is dangerous
A student proposes: "for each individual, compute its distance to every goal state and use the average as internal fitness." Explain in your own words why this idea is dangerous in 1-population competitive coevolution.
- The average destroys the information about which goal an individual is heading toward.
- An individual close to Goal 1 only and one moderately close to all goals can have the same average - even though only the first can actually reach Goal 1.
- Averaging ignores how many other individuals already cover each goal, so it cannot maintain coverage.
- It eliminates diversity: a unique candidate covering an underrepresented goal gets penalised the moment it is "average" overall.
- Better: assign each individual to its closest goal and normalise within that cluster.
3. Same external, different internal (slide 9)
Slide EC-05-9 shows two BPP solutions S1 and S2. Both achieve the same number of bins, but one has high empty-space variance across bins and the other has near zero variance. From the external (objective) view they are identical. Argue which one the GA should prefer to keep for breeding, and why two fitness functions are needed at all.
- External fitness equals the number of bins. By that measure both solutions are tied and the algorithm cannot choose between them; selection becomes random.
- Internal fitness must differentiate tied solutions so that breeding has a usable signal.
- The high-variance solution (S1) often has slack: one nearly-empty bin can be redistributed and the next iteration may drop one bin. So S1 is a better breeding seed if your tweak operator can collapse a near-empty bin.
- The low-variance solution (S2) is a more stable plateau - prefer it only if the search has already converged and you want robustness.
- Deeper point: the same idea appears in deep learning - low average error is not enough if the variance of the error is high. The doctor's analogy: a model with the same mean error but lower variance is more trustworthy.
"S1 has 580 bins with variance 0.9; S2 has 580 bins with variance zero. Externally they look the same, but internally they're very different. We did the same in your bin packing - that's why we added empty-space variance."
4. Why two fitness functions in Algorithm 76
Algorithm 76 calls AssessInternalFitness and AssessExternalFitness on every iteration. Where is each one used in the algorithm, and why can you not collapse them into a single function?
- Internal fitness feeds the Breed step - it drives parent selection and breeding probabilities.
- External fitness feeds the Best update - it decides whether each individual beats the best-so-far and is used to report progress.
- You cannot collapse them because internal must be relative / coverage-aware, while external must be absolute / comparable across iterations.
- If internal were used for Best, Best would change meaning every generation (it would depend on who else is in the population at that moment).
- If external were used for breeding, you would lose the diversity-preserving bias and collapse onto one goal or one bully.
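A minimal sketch of that split, assuming hypothetical stand-ins `assess_internal`, `assess_external`, and `breed` (not the slide's exact signatures; maximisation assumed):

```python
def coevolve(pop, iters, assess_internal, assess_external, breed):
    """1-population coevolution skeleton in the spirit of Algorithm 76.
    Internal fitness drives breeding; external fitness drives Best."""
    best = None
    for _ in range(iters):
        # Internal fitness is RELATIVE: it needs the whole population at once.
        internal = assess_internal(pop)            # one score per individual
        # External fitness is ABSOLUTE: comparable across iterations.
        external = [assess_external(ind) for ind in pop]
        for ind, ext in zip(pop, external):
            if best is None or ext > best[1]:
                best = (ind, ext)                  # Best uses EXTERNAL fitness only
        pop = breed(pop, internal)                 # breeding uses INTERNAL fitness only
    return best
```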
5. External fitness when there is no ground truth (slide 7)
In a 1-population competitive setting (e.g. checkers players), the slide gives three options for measuring external fitness: against a guru, against a panel of hand-crafted players, or against a sample of past generations. Briefly compare them.
- Against a guru. A single benchmark, stable, but expensive and may not exist for the problem.
- Against a panel. A small handcrafted suite covering different styles. Better coverage than one guru, but biased toward the panel's design.
- Against past generations. Cheap and automatic; relies on the assumption that later individuals are usually better than earlier ones - which can be wrong (a regression goes undetected).
- In all three cases the external fitness is still relative, so unstable systems can give odd results (the slide warns about this).
II. Coevolution - 2-population (firewall / GANs)
6. Map the firewall example onto P and Q
Define what each population represents in the firewall coevolution example, what each individual looks like, how internal fitness is computed for each side, and how P and Q coach each other.
- P (the solutions of interest). Each individual is a complete firewall ruleset; this is what we keep at the end.
- Q (the foils / test cases). Each individual is an attack scenario - a sequence of suspicious traffic.
- Fitness of a P individual: the fraction of Q's scenarios it correctly catches, evaluated against a sample of Q.
- Fitness of a Q individual: the fraction of P's defenders it manages to fool. A "good" attack defeats many defenders.
- Coaching loop. As P gets better, Q must evolve harder attacks to score points; as Q gets harder, P must evolve smarter rules. This is structurally identical to the GAN generator-discriminator dynamic.
- A good Q individual is balanced: if every defender catches it, it's too easy; if every defender fails, it's too hard. Same as the doctor's exam analogy - "easy quizzes give no discrimination, hard quizzes give no discrimination."
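A hedged sketch of one evaluation round, assuming a hypothetical `catches(defender, attack)` predicate (True if the ruleset blocks the scenario):

```python
import random

def evaluate(P, Q, catches, k=10):
    """One evaluation round of the coaching loop. Each side is scored
    against a SAMPLE of the other, not the full opposing population."""
    kP, kQ = min(k, len(Q)), min(k, len(P))
    fit_P, fit_Q = [], []
    for d in P:                                   # defender: fraction of sampled attacks caught
        sample = random.sample(Q, kP)
        fit_P.append(sum(catches(d, a) for a in sample) / kP)
    for a in Q:                                   # attack: fraction of sampled defenders fooled
        sample = random.sample(P, kQ)
        fit_Q.append(sum(not catches(d, a) for d in sample) / kQ)
    return fit_P, fit_Q                           # each side breeds on its own list
```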
7. Crossover inside a firewall ruleset
Two students disagree about how to do crossover inside a firewall ruleset.
(A) "Take complete rules from each parent - never split inside a rule."
(B) "Allow splitting inside a rule (mix source IP from one rule with destination port from another)."
Discuss the exploration / exploitation impact of each design and propose a sensible compromise.
- (A) Whole-rule crossover. Preserves the semantic meaning of every rule. Mainly exploitation - you recombine valid building blocks. Convergence is fast but limited to combinations that already exist.
- (B) Inside-rule crossover. Still apple-to-apple in the sense that each field is filled from the matching field slot of a rule with the same shape (source IP from a source-IP slot, destination port from a destination-port slot) - so it is permitted. But it can break the semantic glue inside a rule and produce nonsense. More exploration, but it can ruin what already works.
- Compromise. Run whole-rule crossover at the normal rate, plus inside-rule crossover at a much lower rate. You preserve exploitation but still inject occasional structural exploration.
- Mutation can be the main exploration knob: small per-field changes inside one rule, frequent enough to keep the search from stagnating.
"Hassan said no - take the whole rule. But you can also allow inside-rule mixing at a small rate. You design the crossover. Not a static crossover."
8. Why 2-pop competitive is structurally a GAN
Explain why 2-population competitive coevolution is structurally identical to a Generative Adversarial Network. Map the parts.
- P plays the role of the discriminator - solutions of interest that we ultimately keep.
- Q plays the role of the generator - it produces synthetic test cases that try to defeat P.
- Each side's fitness depends on the other's quality, so they coevolve. Q forces P to be robust; P forces Q to be hard.
- Used when you do not have enough labelled data: Q manufactures the test cases needed to coach P.
- The collapse mode is the same - if Q gets too easy P stops learning. The fix is the same - balance the sides.
III. Coevolution - cooperative (soccer / project teams)
9. Internal vs total fitness in a soccer team
In a multi-robot soccer team, you have one sub-population per role (goalkeeper, defender, midfielder, striker, playmaker). Each sub-population evolves with its own GA. Define the two fitness signals each individual receives and explain why a player with the highest individual fitness is not necessarily picked for the team.
- Individual fitness measures the player's own role performance in isolation.
- Team fitness measures the score the player produces when grouped with one selected individual from each of the other sub-populations.
- An individually strong player may have low team fitness if its style does not align with the partners chosen from other sub-populations. The team is what wins, not the individual.
- Selection / breeding inside one sub-population must use both signals; pure-individual selection collapses to selfish play.
- The doctor's analogy: in a course project, your individual progress matters, but if you and your teammate duplicated work, the team's joint score is what gets graded.
10. Combinatorial explosion of team formation
If you have several sub-populations and you must pick one individual per sub-population to form a complete team, the number of possible teams grows multiplicatively. Why is exhaustively evaluating every team a bad idea, and what does the cooperative coevolution approach do instead?
- The number of teams is the product of the sub-population sizes - it explodes quickly even for a few sub-populations.
- Team-evaluation is the expensive step - you must run a soccer game / project simulation per team. Exhaustive evaluation is infeasible.
- Idea: each sub-population keeps a small set of "team representatives". A new candidate is evaluated against a few teams formed from those representatives, and that score is used as its contribution.
- This is divide-and-conquer: the high-dimensional search is split into N smaller searches that talk to each other only at evaluation time.
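A sketch of the representative trick, assuming a hypothetical `play_match(team)` simulator and a `representatives` dict mapping each role to its stored elite individuals:

```python
import random

def contribution(candidate, role, representatives, play_match, n_teams=3):
    """Score one candidate by slotting it into a few teams built from the
    other sub-populations' representatives - never enumerating all teams."""
    scores = []
    for _ in range(n_teams):
        team = {r: random.choice(reps) for r, reps in representatives.items()}
        team[role] = candidate            # the candidate fills its own role
        scores.append(play_match(team))   # the expensive step: one simulated game
    return max(scores)                    # optimistic credit assignment
```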
IV. Coevolution - diversity maintenance
11. Five techniques to preserve diversity
Name five techniques that prevent premature convergence in a population metaheuristic. For each, explain in one line how it produces more exploration.
- Fitness sharing. Penalise individuals that are too close to each other in genotype/phenotype - the bully shares its credit with its neighbours, so weaker outliers survive.
- Crowding (replacement). When a child is born, kill the most similar individual; the population stays spread out.
- Larger population. Hopefully you get many bullies and many clusters instead of one bully - wider coverage.
- Add noise. Push individuals further from their current location with a high-spread perturbation, so they can jump basins.
- Pick less-fit individuals more often. A weaker selection pressure (the "Sicilian league" idea) lets low-quality schemata survive long enough to recombine.
- Bonus: random restarts of part of the population, or distance-based niching using Manhattan / centroid / single-linkage distance.
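As a concrete instance of the first technique, a minimal fitness-sharing sketch (the classic triangular sharing kernel with niche radius `sigma`; `distance` is problem-specific and assumed given):

```python
def shared_fitness(pop, raw_fitness, distance, sigma=1.0):
    """Divide raw fitness by a niche count so crowded regions share credit.
    An individual alone in its niche keeps its full fitness (niche == 1)."""
    shared = []
    for i, ind in enumerate(pop):
        niche = sum(max(0.0, 1.0 - distance(ind, other) / sigma) for other in pop)
        shared.append(raw_fitness[i] / niche)   # niche >= 1: self contributes 1
    return shared
```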
12. Three meanings of "similar"
When fitness sharing or crowding asks whether two individuals are similar, there are at least three meanings of similar. Name them and give an example for each in the bin-packing problem.
- Phenotype similarity. They look the same in the decoded solution - e.g. two BPP solutions with the same number of bins and the same per-bin fill levels.
- Genotype similarity. Their representations are close - e.g. two random-key vectors with small per-position difference.
- Behavioural / fitness similarity. Same fitness vector (same #bins and same empty-space variance) even though the chromosomes differ.
- Distance metric depends on which similarity you care about: Hamming / Manhattan / centroid / single-linkage.
V. Differential Evolution - parameter behaviour
DE's core idea is the deviation - the difference vector between two random members of the population. The mutant is built by taking that deviation, multiplying it by a scale, and adding it to a base vector. Knobs to know: the deviation itself (which is automatically adaptive - it carries the population's spread), the scale on top of the deviation, the crossover rate (per-coordinate inheritance from the mutant), the population size, and the choice of base vector (random member vs current best). All questions ask only about extremes - zero, very small, moderate, very large - not specific numerical values.
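To pin the vocabulary down, a minimal DE/rand/1/bin sketch (continuous vectors assumed; the names are mine, not the slides'):

```python
import random

def de_trial(pop, i, scale=0.8, cr=0.5):
    """Build one trial vector for target pop[i]. Needs at least 4 members.
    The deviation (b - c) carries the population's spread automatically."""
    target = pop[i]
    base, b, c = random.sample([x for j, x in enumerate(pop) if j != i], 3)
    mutant = [base[d] + scale * (b[d] - c[d]) for d in range(len(target))]
    # binomial crossover: each coordinate comes from the mutant with prob cr;
    # one forced coordinate guarantees the trial differs from the target
    forced = random.randrange(len(target))
    return [mutant[d] if (random.random() < cr or d == forced) else target[d]
            for d in range(len(target))]
```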
13. The deviation - DE's adaptive mutation
In DE, the mutation step is built from the deviation between two randomly picked population members. Why does the doctor describe this as an adaptive mutation? Discuss what the deviation looks like in the early generations vs the late generations and what that does to exploration vs exploitation. Why does this remove the need to schedule mutation magnitude by hand (unlike SA, which needs a cooling schedule)?
- The deviation IS the population's current spread. Two random members picked from a scattered population are far apart; two picked from a converged population are close.
- Early generations. Population is randomly scattered -> deviation is large -> the mutation step is large -> DE explores. The algorithm doesn't need to be told to explore - the deviation already does it.
- Late generations. Population has converged into one or a few basins -> deviation is small -> the mutation step is small -> DE refines / exploits. Again, no explicit signal needed.
- This is the doctor's "self-adaptive" property. SA needs a cooling schedule because temperature is an external parameter; DE needs no schedule because the deviation tracks the population's own state.
- Failure mode. If the population fully collapses (everyone becomes nearly identical), every deviation is near zero and DE stops moving altogether - even with a large scale on top. This is why diversity preservation matters even in a "self-adaptive" algorithm.
"If the population is spread out, the mutant will make major changes; if it's condensed in a region, the mutant will be small. Differentiation is an adaptive mutation - it is just based on how spread you are."
14. The deviation scale extremes
The deviation already carries the population's spread, but DE multiplies it by a scale before adding to the base vector. Discuss the behavioural effect of setting that scale to (a) zero, (b) very small, (c) moderate, (d) very large. How does the scale interact with the deviation - does a large scale rescue a converged population?
- Zero. The trial equals the base vector - no movement. Population stagnates; mutation cannot produce new candidates.
- Very small. Tiny perturbations - heavy exploitation, slow refinement around the current cloud, almost no exploration.
- Moderate. Classic balance - the difference vector encodes the local population structure, so DE exploits that structure while reaching slightly outside.
- Very large. Long jumps; aggressive exploration. Often overshoots, requires repair / clamping. In a BPP random-key encoding, big jumps may also cause sudden permutation flips that break partial orderings.
- Note: if population diversity collapses, all difference vectors shrink toward zero regardless of the chosen scale - DE then stops making progress.
15. Crossover rate extremes
Same kind of question for the crossover rate - the probability that each coordinate inherits from the mutant rather than the target.
- Zero. The trial gets only one coordinate from the mutant; the rest stay from the target. Per-coordinate exploration only - very local.
- Very low. Mostly target with a few coordinates updated. Good for separable problems where only a few dimensions matter.
- Very high. Almost every coordinate from the mutant - the trial is essentially a fresh vector with almost no inheritance from the target. Loses the per-coordinate refinement DE is famous for.
- Conceptual rule: low rate produces per-coordinate refinement; high rate produces whole-vector replacement.
16. Random base vs best base
DE can use a randomly chosen population member as base, or the current best as base. Compare their behaviour.
- Random base. Different parents every time - diversifying, slower convergence, more robust on multimodal landscapes.
- Best base. The whole population is pulled toward the current best. Fast convergence on smooth landscapes; high risk of premature convergence on multimodal ones.
- The bully intuition: best-base is DE's version of letting the strongest player dominate. Use random-base when the landscape probably has many basins.
17. Top-percentile parents only - which parents are allowed to breed
DE traditionally lets every parent generate one child. The doctor flagged a variant where only the top X% of parents are allowed to breed (GRASP-like elitism over breeders). Discuss the behavioural effect of (a) all parents allowed (X = 100%), (b) moderate fraction, (c) very small fraction (X = 3%).
- All parents allowed. Every parent breeds exactly one child - lazy / standard. Maximum exploration; even bad parents contribute to the next generation.
- Moderate fraction. Elite parents spawn multiple children. More exploitation of good regions while still allowing some diversity.
- Very small fraction (e.g. top 3%). Doctor's exact wording: "almost climbing - extreme exploitation." Population collapses around a tiny elite, premature convergence.
18. Selection of B and C - random vs fitness-biased
The two vectors that define the deviation can be picked uniformly at random or biased toward the fittest individuals. Discuss the behavioural impact of each.
- Random. The standard. (B - C) is the deviation between two random points - automatically large early when scattered, small late when converged. The "free" adaptive-mutation property the doctor loves.
- Fitness-biased (best with best). Both come from the same good region - (B - C) is small - exploitation. Risk: if both are stuck in the same local basin, the deviation pulls every new candidate INTO that basin - premature convergence.
- Fitness-biased when the best are scattered across multiple basins. (B - C) can be large enough to ESCAPE local optima - a form of niching that helps in multimodal landscapes.
VI. Particle Swarm - parameter behaviour
PSO has three pulls (old velocity, personal memory, informant memory) plus one design choice (informant topology: global, local, or none). All questions are about extremes.
19. The "old velocity" weight (inertia)
In PSO each particle's new velocity is a weighted sum of its old velocity, a pull toward its personal best, and a pull toward an informant best. Discuss the behavioural impact of setting the old-velocity weight to (a) zero, (b) very small, (c) moderate, (d) very large.
- Zero. Particles forget where they were heading - velocity is rebuilt every step using only memory pulls. No momentum, the swarm collapses quickly toward the line between p_best and informant best. Strong exploitation.
- Very small. Dampened motion; swarm converges quickly toward the memory line; risk of premature convergence.
- Moderate. Classic balance - keeps enough momentum to coast past local optima while still being attracted by memory.
- Very large. Velocity keeps growing each step - particles fly out of bounds; effectively pure exploration with little control. Common to start large and decay over iterations to shift exploration -> exploitation.
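To keep the pulls straight, a one-particle update sketch using the booklet's symbols (`a` = old-velocity weight; beta, delta, gamma as in questions 21-23; eps as in question 24). Randoms are drawn inside the dimension loop, anticipating question 25:

```python
import random

def pso_step(x, v, pbest, ibest, gbest,
             a=0.7, beta=1.5, gamma=1.5, delta=1.5, eps=1.0):
    """New velocity = weighted old velocity + three memory pulls;
    the particle then moves eps times that velocity."""
    new_v, new_x = [], []
    for d in range(len(x)):
        vd = (a * v[d]
              + beta  * random.random() * (pbest[d] - x[d])   # memory of self
              + delta * random.random() * (ibest[d] - x[d])   # informant memory
              + gamma * random.random() * (gbest[d] - x[d]))  # swarm memory
        new_v.append(vd)
        new_x.append(x[d] + eps * vd)   # eps decides how far we actually move
    return new_x, new_v
```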
20. Informants - global, local, or none
PSO can use the global best of the entire swarm, the local best within a small informant neighbourhood, or only the personal best. Discuss the exploration/exploitation behaviour and the failure mode of each.
- Global best. All particles pulled toward the same leader. Fast convergence, strong exploitation. Failure mode is the bully - one early lucky particle drags the swarm into its basin.
- Local best (informants). Each particle is pulled toward the best in its small neighbourhood. Information propagates slowly, so different parts of the swarm explore different basins for longer. Slower convergence but better diversity.
- Personal best only. Every particle is essentially an independent hill-climber with momentum. Maximum exploration, no swarm coordination - the budget is wasted on duplicated regions.
- Tradeoff intuition: global is "everyone follows the leader"; local is "everyone follows their team captain"; none is "everyone is on their own".
"If we only have global best, we have a bully and the swarm collapses. If we only use personal or local informants, exploration is preserved but convergence is slow."
21. Personal-best pull (β) - the "memory of self" weight
PSO has three pulls; β is the one toward the particle's own best position ever visited. Discuss the behavioural effect of setting β to (a) zero, (b) very small, (c) moderate, (d) very large.
- Zero. Particle has no memory of its own discoveries - it ignores the good place it visited and lets the swarm/informants decide everything. The personal experience is wasted.
- Very small. Particle is mostly a follower - briefly pulled back to its own best when the swarm yanks it away, but does not insist.
- Moderate. Healthy tension - particle drifts back to its own best when needed but still listens to the swarm.
- Very large. Particle becomes obsessed with its own best - effectively a parallel hill-climber. The doctor's exact warning: large beta breaks the swarm into separate hills because each one is going to do its own. The swarm fragments and stops cooperating.
"Large beta breaks the swarm into separate different hills - each one is going to do its own."
22. Global-best pull (γ) - the "follow the leader" weight
γ is the pull toward the single best position found by the entire swarm so far. Discuss the behavioural effect of setting γ to (a) zero, (b) very small, (c) moderate, (d) very large.
- Zero. No global coordination - the swarm becomes a bunch of independent searchers. Risk of splitting into multiple sub-swarms with no agreement.
- Very small. Slow, gentle pull toward the leader. Information about the global best diffuses slowly.
- Moderate. Swarm gradually converges on the best discovered region while still allowing local refinement.
- Very large. Premature convergence - every particle stampedes to the current global best. The doctor's wording: "large = everybody is going to one place." The classical bully failure mode of PSO.
"Large γ - everybody's going to one place and everybody's going to help."
23. Informant pull (δ) - the middle layer
δ is the pull toward the best position found in the particle's small randomly-chosen informant subset (the layer between "me" and "whole swarm"). Discuss the behavioural effect of setting δ to (a) zero, (b) very small, (c) moderate, (d) very large.
- Zero. Informants ignored - you collapse to the textbook 2-component PSO (personal + global). The middle balancing layer is gone.
- Very small. Local context barely matters - particles mostly follow personal/global memories.
- Moderate. Healthy local exploitation - small groups of particles co-exploit promising sub-regions while still feeling the global pull.
- Very large. Particles become slaves to their tiny random clique - the swarm fractures into many sub-swarms each chasing its own clique-best, which can drift far from the true global best.
24. Step / jump size (ε)
The doctor explicitly called out ε - the multiplier on the resultant velocity vector that decides how far the particle actually moves each step. Discuss the behavioural effect of setting ε to (a) very small, (b) moderate, (c) very large.
- Very small. Particle barely moves - the swarm stagnates. Extreme exploitation, no useful traversal of landscape.
- Moderate. Particle moves the full computed velocity, balanced.
- Very large. Reaches known good regions quickly, but fine-grained optimisation becomes hard: the particle overshoots good regions, oscillates, and never refines.
25. Stochastic-coefficient scope - inside vs outside the dimension loop
A subtle PSO design choice the doctor spent class time on: the random multipliers on each pull can be sampled inside the per-dimension loop (different per dimension) or outside (one scalar reused for all dimensions). What is the behavioural consequence of each?
- Outside the loop (one scalar per pull per step). The same value scales every dimension, so each pull vector keeps exactly its original direction - only its magnitude changes. Direction never rotates; pure exploitation along fixed lines.
- Inside the loop (one scalar per dimension). Each dimension is scaled independently - the particle's direction rotates stochastically each step. Allows both exploration and exploitation.
- The doctor's framing: "change direction means I can explore." Sampling outside kills the rotation; sampling inside enables it.
VII. Ant System & Ant Colony System - parameter behaviour
AS and ACS share the basic loop (build solutions, evaporate, deposit) but differ in who deposits and how greedy the construction step is. Knobs to know: evaporation rate, who is allowed to deposit pheromone, the local-update rule that ACS adds during construction, the greediness of the next-component choice, and the size of the candidate list (the GRASP-style restriction).
26. AS vs ACS - who deposits and what changes
Compare AS and ACS on (a) who deposits pheromone after a tour, (b) what local update (if any) happens during construction, (c) the resulting exploration/exploitation balance.
- AS. All ants deposit pheromone proportional to their solution quality after every iteration. Democratic - even mediocre tours leave a trace. More exploration, slower convergence; finds unusual co-locations.
- ACS. Only the best ant (best-so-far or iteration-best) deposits global pheromone. Plus a local decay applied to every component used during construction in the same iteration. Greedy / exploitative; fast convergence; risk of getting stuck.
- Both evaporate everywhere ("we decrement everybody"). The difference is who is rewarded.
- AS wins on instances where the obvious greedy structure is wrong; ACS wins where the greedy structure is right and you mostly need to refine.
27. Evaporation extremes
In any pheromone algorithm, what happens behaviourally when the evaporation rate is (a) zero, (b) very low, (c) moderate, (d) very high?
- Zero. No evaporation - pheromone keeps accumulating. The first iterations dominate forever; the algorithm cannot forget bad early choices.
- Very low. Long memory; rich-get-richer dominates; lock-in to early discovered structures.
- Moderate. Classic balance - old pheromone fades fast enough for the algorithm to react to new evidence.
- Very high. Short memory; pheromone signal washes out before it can guide construction; the algorithm collapses to a near-greedy / near-random heuristic.
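A sketch of the shared evaporate-then-deposit update, AS-style (all ants deposit; for ACS you would pass only the best ant's tour). Here `tau` maps edges to pheromone and `tours` pairs each ant's edge list with its cost:

```python
def pheromone_update(tau, tours, rho=0.1, Q=1.0):
    """Evaporate everywhere, then deposit proportional to tour quality."""
    for e in tau:
        tau[e] *= (1.0 - rho)        # evaporation: rho is the forgetting knob
    for edges, cost in tours:
        for e in edges:
            tau[e] += Q / cost       # shorter tour -> larger deposit
    return tau
```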
28. The ACS local-update trick
In ACS, every time an ant uses a component during construction, the pheromone on that component is pulled partway back toward its initial value. Explain (a) what this rule does behaviourally, (b) what happens when the pull is very strong vs very weak, (c) why ACS needs this when AS does not.
- The rule erases part of the pheromone on a just-used component during the same iteration. Later ants in that iteration see less attractive pheromone there and choose other components - intra-iteration diversification.
- Very strong pull (close to a full reset). Components are almost reset between ants - very strong intra-iteration diversity; the global elitism step must do all the lifting.
- Very weak pull. Almost no change between ants - within one iteration every ant sees the same greedy pheromone picture, everyone copies the best ant's trail, and the colony collapses.
- ACS needs this because only the best ant deposits global pheromone. Without local decay, every ant would converge on the best and the search would stagnate immediately.
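The standard ACS local-update rule, for reference - `phi` is the pull strength discussed above:

```python
def acs_local_update(tau, edge, tau0, phi=0.1):
    """Pull the just-used edge's pheromone part-way back toward tau0.
    phi near 1 = near-reset (strong intra-iteration diversity);
    phi near 0 = almost no change (everyone copies the best)."""
    tau[edge] = (1.0 - phi) * tau[edge] + phi * tau0
    return tau
```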
29. Candidate-list size (GRASP-style restriction)
ACS chooses the next component from a top-fraction of candidates. Discuss the behavioural effect of (a) a very small candidate list, (b) a moderate one, (c) using the full set.
- Very small list. Only the most promising components are considered - very greedy, near-deterministic, premature-convergence risk.
- Moderate list. Keeps the random-proportional flavour of AS but trims the long tail of poor components - balanced.
- Full set. Equivalent to AS-style proportional selection over all components - maximum exploration.
- Tighter list -> greedier; looser list -> more democratic. Sliding the list from loose to tight over iterations is one way to schedule exploration -> exploitation.
30. Pheromone vs fitness scale
The doctor warned you that pheromone values and fitness values must be on the same scale. What does that mean and what goes wrong if they are not?
- If the fitness values are much larger than the pheromone values, every deposit overwhelms the existing pheromone and the next iteration's edge probabilities collapse to one on the best edge and zero elsewhere.
- If the fitness is much smaller than the pheromone, the deposit barely changes anything and the algorithm becomes random.
- Fix: normalise the deposit (e.g. divide a constant by the cost so deposits stay on the same order of magnitude as pheromones), or use rank-based deposits.
31. Pheromone weight (α) vs heuristic weight (β) in the transition rule
In the next-edge probability, pheromone is raised to power α and the heuristic distance is raised to power β. Discuss the behavioural extremes: (a) α very high while β is zero, (b) α zero while β very high, (c) both moderate.
- α very high, β zero (pheromone-only). Ants chase pheromone blindly. Random at iteration 1 (no information yet); then locks into the first lucky path; premature convergence; ignores the geometric / heuristic shortcut even when it is obviously better.
- α zero, β very high (heuristic-only). Pure greedy randomised on distance - essentially GRASP. No learning across iterations; the colony memory is wasted; you may end up with the same answer every run.
- Both moderate. Heuristic gives a good starting bias; pheromone refines that bias over iterations; classic AS / ACS balance.
"If α dominates, they don't say 'oh, this is high pheromone, go' - they go for it greedily. The heuristic part is what allows GRASP-style exploration even when the pheromone signal is bad."
32. Initial pheromone value (τ₀)
Before any ant has run, every edge starts with the same pheromone value τ₀. Discuss the behavioural effect of choosing τ₀ (a) very small (or zero), (b) moderate, (c) very large.
- Very small / zero. First iteration is purely heuristic-driven (or fully random if no heuristic). The very first lucky tour gets disproportionate weight because all alternatives still sit near zero. Locks in extremely fast.
- Moderate. Reasonable starting signal; evaporation + deposits make differences emerge over a few iterations.
- Very large. Many iterations needed before any edge can stand out from the background noise; behaves uniformly random for too long; learning is slow.
- Practical rule: τ₀ should sit on the same order of magnitude as the per-iteration deposit.
33. Number of ants per iteration (the colony "population size")
For a fixed total budget, you can run many ants per iteration for few iterations, or one ant per iteration for many iterations. Discuss the behavioural effect.
- Very few ants (one). Looks like a single greedy-randomised solver; very little parallel exploration; pheromone signal is sparse and noisy; slow to form schemas.
- Moderate. Enough parallel trails per iteration to give a meaningful pheromone update without spending the whole budget per iteration.
- Very many ants per iteration. High exploration per iteration; common edges get heavily reinforced fast; but each iteration is expensive and the total iteration count drops, leaving less time for evaporation to filter out lock-in.
VIII. GA & SA - parameter behaviour
34. Population size vs number of generations
With a fixed budget of fitness evaluations, compare a small population with many generations against a large population with few generations.
- Small population, many generations. Deep exploitation per individual; faster drift; high risk of premature convergence with one bully.
- Large population, few generations. More diversity at start but few breeding cycles - selection pressure has little time to act; the population may end noisy without converging.
- Best practice: tune the population size so diversity is visibly maintained for the first part of the run and convergence happens in the second part.
- Connection to fitness sharing: a larger population is itself a diversity preserver - "five bullies, five clusters" rather than one.
35. GA crossover rate
In a GA, the crossover rate is the per-gene probability that the offspring inherits from the "other" parent rather than the main parent. Discuss the behavioural effect of setting that rate to (a) zero, (b) very small, (c) moderate, (d) one half (a 50/50 mixture).
- Zero. No crossover - offspring is an identical copy of one parent. The "communication between individuals" essence of GA is lost; only mutation can drive change. The algorithm degenerates toward self-breeding / evolution-strategy.
- Very small. Very few genes inherited from the other parent - the offspring is mostly a copy with a sprinkle of differences. Aggressive exploitation; schemata stay intact; slow exploration.
- Moderate. Balanced mixing - schemata are partially recombined without being shredded.
- One half (50/50). Aggressive exploration - the offspring inherits, in expectation, half its genes from each parent; the doctor's exact warning is that this "ruins the schemas" because good substrings have a coin-flip chance of being broken every generation.
"Very small crossover rate is very aggressive exploitation. Fifty-fifty is very aggressive exploration - random. Random always means exploration. Fifty-fifty most probably ruins the schemas."
36. Crossover style: 1-point vs 2-point vs uniform
The doctor compared four crossover styles: zero-point (no cut), 1-point, 2-point, and uniform (per-gene flip). Discuss what each one does to schemas and to the exploration/exploitation balance.
- Zero-point. Take one parent as-is - no recombination at all. Falls back to evolution-strategy / self-breeding.
- 1-point. One cut point preserves large blocks (schemas) from each parent. Mild structural disturbance. Cannot mix non-contiguous traits ("you can't take the eye colour from one and the hair colour from the other if they are not next to each other").
- 2-point. Two cut points - allows swapping a middle segment. More flexible than 1-point; can isolate a mid-chromosome chunk.
- Uniform. Coin flip per gene. Maximum flexibility - any combination is possible - but maximum schema destruction. Closer to random sampling than to recombination.
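The three recombining styles side by side (chromosomes as plain Python lists; zero-point is just `mom[:]`):

```python
import random

def one_point(mom, dad):
    cut = random.randrange(1, len(mom))        # one cut: large blocks survive
    return mom[:cut] + dad[cut:]

def two_point(mom, dad):
    i, j = sorted(random.sample(range(1, len(mom)), 2))
    return mom[:i] + dad[i:j] + mom[j:]        # swap a middle segment

def uniform(mom, dad, p=0.5):
    # coin flip per gene: any mix is reachable, but schemas get shredded
    return [d if random.random() < p else m for m, d in zip(mom, dad)]
```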
37. Selection method: roulette vs tournament vs rank
Compare the three selection methods the doctor lectured on: roulette wheel (probability proportional to fitness), tournament (best of k random picks), and rank-based (probability proportional to fitness rank). Which struggles with minimisation? Which gives the weakest individual the best chance to survive?
- Roulette wheel. Proportional to fitness; works naturally for maximisation; for minimisation you need a hack (subtract from a constant) and the choice of constant changes the bully effect drastically. A strong individual can dominate the wheel quickly.
- Tournament. Pick k random individuals, fittest wins. Doctor's preferred method - immune to maximisation/minimisation issues. Probabilistic version (chance proportional to quality within the tournament) gives weaker individuals a non-zero chance.
- Rank-based. Smooths selection pressure - the best is always preferred but the gap between best and second-best is smoothed, preventing the wheel from collapsing.
- Tournament with vs without replacement. Without replacement, the weakest is always out (zero survival chance); with replacement, it has a tiny chance only when picked twice against itself. The doctor preferred with-replacement to "give the weak a chance".
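A tournament-selection sketch covering the with/without-replacement distinction from the last bullet (`fitness` is a callable; negate it to minimise):

```python
import random

def tournament(pop, fitness, k=2, with_replacement=True):
    """With replacement, the weakest keeps a tiny chance: it can be drawn
    against itself. Without, it can never win a tournament of k >= 2."""
    if with_replacement:
        contestants = [random.choice(pop) for _ in range(k)]
    else:
        contestants = random.sample(pop, k)
    return max(contestants, key=fitness)
```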
38. Elitism count / fraction
Discuss the behavioural effect of setting the elitism fraction to (a) zero, (b) very small (the doctor's preferred range), (c) moderate, (d) very large. When is each appropriate given the mutation/crossover rates?
- Zero. Pure replacement - even strong parents must compete. Appropriate when mutation/crossover rates are low (the strong already self-breed and survive without help).
- Very small. A few percent - keeps good traits without letting bullies dominate. Doctor's preferred default.
- Moderate. Already considered too high by the doctor - five percent is "too high".
- Very large. Bully always survives, gets duplicated everywhere; the elite competes for normal pool too, doubling its presence; collapse to one cluster, premature convergence.
- Coupling with mutation/crossover. If mutation+crossover are HIGH, you may need MORE elitism because exploration would otherwise destroy the good guys. If they are LOW, set elitism to zero - exploitation is already strong.
"If I put elitism, this good guy will appear multiple times. He's not going to compete there and also he's going to compete for the rest. He'll be everywhere."
39. Generational vs steady-state replacement (μ,λ vs μ+λ)
Compare GA behaviour under (a) generational replacement (parents always die, children form the next generation - μ,λ) and (b) steady-state / μ+λ (parents and children compete together for slots). Which is more prone to premature convergence?
- Generational (μ,λ). Like nature - the parent could be the strongest ever but still has to die. Allows continuous turnover and exploration; risks losing good schemata if the children happen to be weak.
- Steady-state (μ+λ). Very greedy - if a strong parent exists, it keeps winning forever. The bully is institutionalised; even more aggressive than elitism alone. High risk of premature convergence.
- Mixing the two (small μ+λ component, mostly generational) gives a defendable best-of-both compromise.
40. Encoding choice and locality (binary, gray, value, permutation)
The choice of encoding is itself a behavioural knob because it determines whether a small change in the chromosome produces a small change in fitness (locality). Compare (a) plain binary, (b) gray code, (c) value / real, (d) permutation.
- Binary. Weak locality - flipping the leftmost bit changes the value massively (e.g. 14 to 6); flipping the rightmost barely changes it. You cannot tell whether a mutation is exploring or exploiting.
- Gray code. Strong locality by design - consecutive values always differ in exactly one bit, so the ±1 neighbour is always a single flip away (no Hamming cliffs). Mutation behaves as a controlled exploitation step.
- Value / real. Mutation can be a small additive perturbation (e.g. Gaussian); locality is whatever you make it.
- Permutation. Forces no-duplicates by design (good for 8-queens); but naive crossover breaks the permutation property, so you are restricted to permutation-preserving operators such as swap - a smaller move set, losing some of the usual recombination power.
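The standard binary-reflected Gray conversion, to make the locality claim checkable:

```python
def to_gray(b: int) -> int:
    """Binary -> Gray: consecutive integers differ in exactly one bit."""
    return b ^ (b >> 1)

def from_gray(g: int) -> int:
    """Gray -> binary: fold the bits back down."""
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# to_gray(13) == 0b1011 and to_gray(14) == 0b1001: one bit apart - no Hamming
# cliff, unlike plain binary where 13 = 1101 and 14 = 1110 differ in three bits.
```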
41. Handling invalid offspring: ignore vs penalty vs repair
A 1-point crossover on an 8-queens permutation can produce duplicates. Compare three strategies for handling such invalid offspring: (a) ignore (let them stay infeasible), (b) penalise their fitness, (c) repair on the fly (replace duplicates with the missing values).
- Ignore. The search space explodes to include all infeasible chromosomes. The GA wastes iterations on invalid regions; convergence is much slower; only acceptable if the operator can still reach feasible regions.
- Penalise. Add a fitness penalty proportional to the violation. The GA learns to avoid invalid regions; risk of an over-large penalty making all infeasibles look equally bad (no gradient back to feasibility).
- Repair. Detect duplicates and replace with missing values. The doctor's preferred approach. Adds extra computation per crossover but keeps the search space tight and feasible.
- Constrain by design. Drop crossover entirely and use a permutation-only operator like swap. Then exploration is limited to mutation only.
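A minimal repair sketch for the duplicated-genes case (left-to-right replacement with the values that went missing):

```python
def repair(child, values):
    """Restore a valid permutation after naive 1-point crossover:
    each repeated gene is swapped for one of the missing values."""
    missing = [v for v in values if v not in child]
    seen, fixed = set(), []
    for gene in child:
        if gene in seen:
            fixed.append(missing.pop(0))   # second occurrence -> a missing value
        else:
            seen.add(gene)
            fixed.append(gene)
    return fixed

# repair([1, 2, 2, 4], [1, 2, 3, 4]) -> [1, 2, 3, 4]
```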
42. SA cooling schedule shape
SA's behaviour does not depend only on the initial temperature but on the SHAPE of the cooling schedule (linear vs geometric vs logarithmic vs adaptive). Discuss what happens when the schedule (a) cools very fast, (b) cools very slowly.
- Very fast cooling. Temperature drops to near-zero in a few iterations; SA quickly turns into hill climbing; gets stuck at the first local optimum found. Concentrated hill-climb behaviour.
- Very slow cooling. Temperature stays high for a long time; SA resembles a random walk for most of the budget; never settles to exploit any region.
- The doctor's exact phrasing: "the longer you stretch out the schedule, the more the algorithm resembles random walk; the shorter, the more concentrated the hill climb."
- Adaptive schedules (slow when finding improvements, fast when stuck) try to get the best of both extremes.
43. SA initial temperature relative to fitness scale
The acceptance probability in SA is exp(ΔQ / T). What happens when the initial T is (a) very small relative to typical ΔQ values, (b) very large relative to typical ΔQ values?
- Very small T. ΔQ dominates from iteration one; the probability of accepting a worse move is essentially zero; SA behaves like hill climbing from the start, regardless of the cooling schedule.
- Very large T. exp(ΔQ / T) is close to 1 for every ΔQ; every move is accepted; SA is indistinguishable from a random walk until the schedule cools T enough.
- Practical rule: T₀ should be tied to the spread of objective values, not chosen as a magic constant.
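The acceptance test in code, using the booklet's exp(ΔQ / T) convention for maximisation:

```python
import math, random

def accept(q_new, q_old, T):
    """Always take improvements; take worse moves with prob exp(dQ / T).
    T -> 0 gives hill climbing, T -> infinity gives a random walk."""
    dq = q_new - q_old
    return dq >= 0 or random.random() < math.exp(dq / T)
```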
44. Representation locality - the underlying knob behind every algorithm
The doctor said: "The only thing I can really control is my representation." Compare (a) strong locality (e.g. gray code, swap on permutation) vs (b) weak locality (e.g. plain binary, large-step encoding) and explain why this affects every local-search algorithm, not just GAs.
- Strong locality. Neighbours have similar fitness - the landscape is smooth from the algorithm's point of view; HC, SA, TS can sense the gradient; exploitation works.
- Weak locality. One small change in the representation can move you anywhere in the fitness landscape; the algorithm cannot tell whether it is "close" to anything; behaves like random search; pure exploration with no exploitation possible.
- The locality is a property of the representation, not the algorithm. Picking a stronger representation is often more impactful than tuning algorithm parameters.
45. GA mutation rate extremes
In a GA, what happens when the mutation rate is (a) zero, (b) very small, (c) moderate, (d) very high?
- Zero. No exploration except via crossover; once diversity collapses the GA is stuck at one point.
- Very small. Schemata mostly intact, occasional novelty injected - the classical regime.
- Moderate. Aggressive exploration; fitness can drop generation to generation because schemata break too often.
- Very high. Chromosomes essentially randomised every step - the GA degenerates to random search.
46. Tournament size extremes
In tournament selection, discuss the behaviour at (a) tournament of one, (b) very small, (c) moderate, (d) tournament size equal to the entire population.
- Tournament of one. Uniform random selection - no pressure, diversity preserved, very slow convergence.
- Very small. Moderate selection pressure; the worst can still occasionally win.
- Moderate. High selection pressure; weakest never selected; resembles elitism over a sample.
- Tournament = whole population. Deterministic - the best of the population always wins; total elitism, premature convergence.
- Use a probabilistic tournament (the doctor's "winner gets a chance proportional to its quality" idea) to soften deterministic effects.
47. SA temperature - revisit from the midterm
Re-derive what SA becomes at temperature zero and at temperature infinity. Briefly say what behaviour you would observe in the current-best plot in each case.
- T = 0. Worse moves are never accepted - SA reduces to hill climbing. Pure exploitation. CB plot is monotonically improving (or flat once stuck).
- T = infinity. Worse moves are accepted with probability one - SA reduces to a random walk. CB plot oscillates wildly; BSF only improves when lucky.
- The cooling schedule is exactly the design knob that balances these two extremes over time.
IX. Tabu & Meta1/Meta2 (likely revisits)
48. Meta1 and Meta2 - hand-off
Recall the midterm: Meta1 converges fast then plateaus; Meta2 improves slowly but eventually wins. Beyond simply choosing one, propose a strategy that uses both. Justify in terms of efficiency and effectiveness.
- Run Meta1 first to quickly reach a good basin (efficient early). When its CB plateaus for a stretch, hand the best-so-far as the seed to Meta2.
- Meta2 starts from that warm seed and does the slower refinement / exploration around it (effective late).
- Could also be parallel: Meta1 budgeted on cheap exploitation, Meta2 budgeted on diverse multi-restart - then pick the best of two.
- Rationale: a single algorithm rarely is both efficient and effective; combining gives a Pareto-better outcome.
49. Tabu list - what to store
Should the tabu list store complete solutions, full moves, or only move attributes? Compare on memory, restriction strength, exploration impact, and aspiration handling.
- Complete solutions. Very expensive per entry; super-specific - only blocks exact revisits, near-cycles slip through.
- Full moves. Block reversing a specific move; cheap, intuitive, moderate restriction.
- Move attributes. Block a class of moves (e.g. any swap involving item i for the next few iterations). Cheapest, most restrictive - high diversification but risk of blocking good moves.
- Aspiration. Override tabu when a tabu move would yield a new best - protects effectiveness regardless of restriction style.
50. Reactive tabu - tenure adjustment
Explain how reactive tabu adjusts tenure based on observed behaviour: when does it shorten, when does it lengthen?
- Tracks repetition - hashes / fingerprints of recently visited solutions.
- If solutions start repeating quickly, tenure is too short - lengthen it.
- If no improvement for many iterations and tenure is already long, trigger a diversification (random restart, perturbation) and reset.
- Net effect: avoids manual tenure tuning; the algorithm self-paces between exploration and exploitation.
51. Tabu tenure length extremes
Tabu tenure is the number of iterations a move stays forbidden. Discuss the behavioural effect of setting the tenure to (a) zero, (b) very small, (c) moderate, (d) very large.
- Zero. No memory at all - the move can be undone on the very next iteration; pure HC behaviour, immediate cycling back to the local optimum is possible.
- Very small. The doctor's exact word: "cycling - same step again, again." The algorithm bounces between a small set of states; almost no anti-cycling protection.
- Moderate. Lets you escape the local bully but releases moves in time so genuinely good ones can be re-tried.
- Very large. Search becomes too constrained; even bad moves cannot be undone; the algorithm gets stuck wandering, forced to wait until forbidden moves expire, blocks legitimately good moves.
"Too few is cycling. Too many - very constrained. You're not changing because that was a bad swap, can I go back? No, you have to wait until this swap goes out."
52. Tabu list size (short-term memory capacity)
The tabu list size is how many recent solutions / moves / attributes are remembered. Discuss the behavioural effect of setting the list size to (a) zero, (b) very small, (c) moderate, (d) very large relative to total iterations.
- Zero. Memoryless - equivalent to plain hill climbing.
- Very small (one or two). Will fall back into the same local optimum repeatedly - "goes there, goes there again."
- Moderate. Long enough to remember the bullies but short enough that good zones can be revisited later for re-exploitation.
- Very large. "Once you get there, you cannot get out." Forces extreme diversification; every visited area becomes permanently forbidden; never revisits good zones; no exploitation. The doctor warned that for large state spaces a finite list can never cover everything anyway.
53. Tabu storage granularity (full solutions vs ranges vs move attributes vs feature components)
What you store in the tabu list is itself a behavioural knob. Compare four granularities: (a) full solutions, (b) regions / discretised ranges, (c) move attributes (e.g. last swap), (d) feature components / schemata (e.g. a particular sub-configuration).
- Full solutions. Most specific. Memory is wasted on near-duplicates; the list saturates in one tiny region; near-cycles slip through.
- Regions / ranges. Covers more landscape per memory slot - works for continuous problems too. Risk: too coarse a range bans legitimate solutions inside the region.
- Move attributes. Cheap to store; prevents immediate reversal of a specific move. Risk: a different path can lead to the same forbidden state.
- Feature components / schemata. Most powerful for breaking bullies that share structure. Risk: too aggressive masking forbids large families of legitimate solutions.
54. Neighbourhood sample size (sample-the-gradient n)
In sample-the-gradient TS / HC variants, you generate n candidate tweaks and pick the best. Discuss the behaviour at (a) n = 1, (b) very small, (c) moderate, (d) very large (full neighbourhood / steepest-ascent).
- n = 1. Degenerates to one-plus-one HC - you commit blindly to the only sample seen, no gradient information.
- Very small. Weakly informed direction; cannot reliably distinguish improving from worsening directions in multi-dimensional space.
- Moderate. Good gradient estimate; enough information to pick a sensible direction without spending the whole budget per iteration.
- Very large (full neighbourhood). Steepest-ascent / pure greedy. Doctor: "the second best could be better but I'm not willing to sacrifice." Very expensive per iteration; fewer total iterations possible; actually MORE prone to local-optima trapping because no chance of randomness.
55. Tweak operator strength / move radius
The number of components changed per tweak (1 swap vs 5 swaps) is itself a behavioural knob. Discuss (a) 1 move per tweak, (b) moderate, (c) many simultaneous moves, (d) random radius drawn from a uniform distribution each iteration.
- One move. Pure exploitation; converges to the nearest local optimum and gets stuck.
- Moderate. Mixes local refinement with occasional escapes.
- Many moves. "Seems like I've changed the whole state." Effectively a random restart - high exploration, no fine-tuning ability.
- Uniform random radius. Doctor's recommended trick - sometimes you exploit, sometimes you explore, in the same run, without any explicit schedule.
56. Frequency-based long-term memory (intensification vs diversification)
Long-term memory in TS counts how often each component appears in accepted solutions. Compare four ways to use it: (a) ignored, (b) intensification only (re-seed from high-frequency components), (c) diversification only (force in low-frequency components), (d) both, used dynamically.
- Ignored. TS reacts only to the recent past - cannot tell which features are persistently winning vs which were never tried; misses both intensification opportunities and unexplored regions.
- Intensification only. Re-seed search around the dominant features. Risk: doubles down on what may be a deep local-optimum bully.
- Diversification only. Force in components that have appeared rarely. Risk: throws away genuinely good components for noise.
- Both, dynamically. Diversify early in the run, intensify later on confirmed bullies. The doctor explicitly described running both in the same pipeline.
X. 8-puzzle GA + coevolution overlay (speculative double)
The doctor used the 8-puzzle as the running example for both the GA chapter (midterm Q7) and the multi-goal coevolution slide. He may stack a coevolution part on top of the standard 8-puzzle GA framing.
57. 8-puzzle GA, then make it multi-goal
Part (a): Propose a chromosome and fitness function for the standard single-goal 8-puzzle.
Justify which crossover and mutation operators are sensible.
Part (b): Now extend to several goal states. How does the chromosome change (or stay),
how does the fitness change, and what diversity mechanism would you add to ensure all
goals are covered?
- (a) Chromosome. A sequence of moves from {U, D, L, R} of length around the longest known optimal solution; optionally include a STOP symbol so the GA can decide when to stop.
- Fitness. Manhattan distance or misplaced-tiles count after executing the moves. Invalid moves (ones that would slide the blank off the board) are either ignored, repaired, or heavily penalised.
- Crossover. Single-point or two-point on the move sequence is acceptable - "apple to apple". Order-preserving / cycle crossovers are not needed because the chromosome is not a permutation.
- Mutation. Random move replacement; never produce U immediately followed by D (or L by R) - that just dances in place, undoing the previous move.
- (b) Multi-goal. Chromosome stays the same. External fitness becomes the distance to the closest of the goals. Internal fitness becomes the relative distance within the cluster of individuals targeting the same goal, with a coverage bonus for individuals targeting an underserved goal.
- Diversity mechanism. Fitness sharing on goal-cluster (penalise individuals targeting the same goal as many others), or quota-based selection (reserve a slot per goal in the next generation).
- Without these, the GA collapses everyone onto the easiest goal and most of the budget is wasted.
Cheat-sheet (qualitative only - no numbers)
| Algorithm | Knob | At zero / extreme low | At extreme high |
|---|---|---|---|
| DE | Deviation (B - C) | Population converged -> small step (auto exploit) | Population scattered -> large step (auto explore) |
| DE | Deviation scale (on top of B - C) | No movement, stagnation | Wild jumps, overshoot |
| DE | Crossover rate | Per-coord refinement only | Trial ignores target |
| DE | Base vector | Random base = explore | Best base = exploit / bully |
| DE | Top-percentile breeders | All allowed = explore | Top 3% only = "almost climbing", premature |
| DE | B and C selection | Random = adaptive deviation | Fitness-biased = same-basin lock-in |
| PSO | Old-velocity weight (inertia) | No momentum, exploitation | Velocity grows, particles escape |
| PSO | Personal-best pull (β) | Forgets own discoveries | Swarm fragments into independent hills |
| PSO | Global-best pull (γ) | No coordination, sub-swarms drift | Stampede to one place, premature convergence |
| PSO | Informant pull (δ) | Middle layer gone | Slaves to small clique-best |
| PSO | Step size (ε) | Stagnation | Overshoots, no fine-tuning |
| PSO | Random scope (in/out of dim loop) | Outside = direction never rotates | Inside = direction rotates, can explore |
| PSO | Informant topology | None = independent search | Global = collapse to one basin |
| AS | Evaporation | Lock-in, no forgetting | No memory, near-random construction |
| AS/ACS | Pheromone weight α vs heuristic weight β | α=0 = pure GRASP, no learning | α very high, β=0 = pheromone lock-in, premature |
| AS/ACS | Initial pheromone τ₀ | First lucky tour locks in | Random walk for many iterations |
| AS/ACS | Number of ants per iteration | 1 ant = noisy, slow learning | Many ants = exploration, expensive |
| ACS | Greediness in choice | AS-like proportional | Always greedy, premature convergence |
| ACS | Local decay during construction | Weak = everyone copies the best, colony collapses | Strong = components near-reset, intra-iter diversity |
| ACS | Candidate list | Tight = greedy | Full = AS-like |
| GA | Crossover rate | 0 = no recombination, self-breeding | 50/50 = random, schemas destroyed |
| GA | Crossover style | Zero-point = no recombination | Uniform = max schema destruction |
| GA | Selection method | Roulette = bully effect, minimisation problems | Tournament = robust, probabilistic |
| GA | Elitism count | 0 = strong already self-survive | Large = bully omnipresent, premature |
| GA | Replacement strategy | Generational = exploration | Steady-state μ+λ = bully institutionalised |
| GA | Encoding | Gray = strong locality, controlled | Plain binary = weak locality, uncontrolled jumps |
| GA | Invalid offspring handling | Ignore = wastes budget | Repair = clean but extra computation |
| GA | Mutation rate | No exploration, stuck | Random search |
| GA | Tournament size | k=1 = no pressure | k=N = total elitism |
| SA | Temperature | T=0 = hill climbing | T=infinity = random walk |
| SA | Cooling schedule shape | Very fast = HC behaviour, stuck | Very slow = random walk for whole budget |
| SA | Initial T relative to ΔQ | Too small = HC from start | Too large = random walk for many iterations |
| TS | Tabu tenure length | Cycling, no protection | Stuck, blocks good moves |
| TS | Tabu list size | 0 = HC, memoryless | Once visited, forbidden forever |
| TS | Storage granularity | Full solutions = wasted memory | Feature schema = aggressive masking |
| TS / HC | Sample-the-gradient n | n=1 = blind commit | Full = steepest-ascent, expensive, traps |
| TS / HC | Tweak operator radius | 1 move = pure exploitation | Many moves = random restart |
| TS | Long-term frequency memory | Ignored = no intensification or diversification | Used dynamically = self-paced |
| All | Representation locality | Strong = smooth landscape, exploitable | Weak = small change = anywhere, random search |