Category Properties: Helping the Optimizer Understand Your Categories

Attach numerical descriptors to categorical variables so the optimizer knows how similar your options are.

When you define a categorical variable like "Solvent" with options Ethanol, Methanol, and Toluene, the optimizer has no way to know that Ethanol and Methanol are chemically similar while Toluene is very different. Without that information, it treats all categories as equally distant, which can waste experiments.

Properties (also called descriptors) solve this. By attaching numerical values to each category, you give the optimizer a way to measure similarity and make smarter suggestions.


How It Works

  1. You define one or more properties on a categorical variable (e.g. Molecular Weight, Boiling Point).

  2. You assign a numerical value for each property on each category.

  3. The optimizer uses these values as a dense numerical vector to compute distances between categories.

Instead of treating categories as unrelated labels, the optimizer now sees each one as a point in a numerical space, and can infer that nearby points are likely to behave similarly.
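The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not the optimizer's actual implementation: the function names are hypothetical, the values come from the solvent example in this article, and plain Euclidean distance on unscaled values is an assumption made to keep the sketch short.

```python
import math
from itertools import combinations

# Steps 1-2: properties and their values per category (values from the
# solvent example in this article; two properties are enough for a sketch).
PROPERTY_NAMES = ["Molecular Weight", "XLogP"]
PROPERTY_VALUES = {
    "Ethanol": {"Molecular Weight": 46.07, "XLogP": -0.31},
    "Methanol": {"Molecular Weight": 32.04, "XLogP": -0.74},
    "Toluene": {"Molecular Weight": 92.14, "XLogP": 2.73},
}
CATEGORIES = list(PROPERTY_VALUES)


def one_hot_vector(name):
    """Label-only encoding: 1 in the category's own slot, 0 elsewhere."""
    return [1.0 if other == name else 0.0 for other in CATEGORIES]


def descriptor_vector(name):
    """Step 3: the category's property values as a dense numerical vector."""
    return [PROPERTY_VALUES[name][prop] for prop in PROPERTY_NAMES]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_distances(encode):
    """Distance between every pair of categories under a given encoding."""
    return {
        (a, b): euclidean(encode(a), encode(b))
        for a, b in combinations(CATEGORIES, 2)
    }


one_hot_distances = pairwise_distances(one_hot_vector)
descriptor_distances = pairwise_distances(descriptor_vector)

for pair in one_hot_distances:
    print(pair, round(one_hot_distances[pair], 2), round(descriptor_distances[pair], 2))
```

With the label-only encoding every pair is the same distance apart, so no category looks more similar to another. With descriptor vectors, Ethanol and Methanol end up much closer to each other than either is to Toluene. Because the values are unscaled here, molecular weight dominates the distance; scaling the properties first is usually a better choice.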


Example: Solvents

Without properties: Ethanol, Methanol, Acetone, and Toluene are four unrelated options. The optimizer must try all of them independently.

With properties:

| Solvent  | Molecular Weight | XLogP | TPSA  |
|----------|------------------|-------|-------|
| Ethanol  | 46.07            | -0.31 | 20.23 |
| Methanol | 32.04            | -0.74 | 20.23 |
| Acetone  | 58.08            | -0.24 | 17.07 |
| Toluene  | 92.14            | 2.73  | 0.00  |

Now the optimizer can see that Ethanol and Methanol are similar (close values), while Toluene is very different (high XLogP, zero TPSA). If Ethanol gives a good result, the optimizer is more likely to suggest Methanol next than Toluene.
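To see how the solvent values translate into similarity, the sketch below standardizes each property (so that molecular weight does not dominate just because its numbers are larger) and then ranks the other solvents by distance. The z-score scaling, the Euclidean distance, and the helper names are assumptions made for illustration; the article does not state which scaling or distance the optimizer uses.

```python
import math
from statistics import mean, pstdev

# Columns: Molecular Weight, XLogP, TPSA (values from the solvent example).
SOLVENTS = {
    "Ethanol": [46.07, -0.31, 20.23],
    "Methanol": [32.04, -0.74, 20.23],
    "Acetone": [58.08, -0.24, 17.07],
    "Toluene": [92.14, 2.73, 0.00],
}


def standardize(table):
    """Rescale every property to zero mean and unit standard deviation."""
    columns = list(zip(*table.values()))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1.0 instead of 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return {
        name: [(v - m) / s for v, m, s in zip(row, means, stds)]
        for name, row in table.items()
    }


scaled = standardize(SOLVENTS)


def neighbors(name):
    """Other solvents sorted from most to least similar to `name`."""
    others = [other for other in scaled if other != name]
    return sorted(others, key=lambda other: math.dist(scaled[name], scaled[other]))


for name in scaled:
    print(name, "->", neighbors(name))
```

In this sketch Toluene comes out as the least similar solvent to each of the other three, and the Ethanol to Methanol distance is far smaller than the Ethanol to Toluene distance.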


Example: Ligands

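| Ligand  | Cone Angle (°) | % Buried Volume | TEP (cm⁻¹) |
|---------|----------------|-----------------|------------|
| PPh3    | 145            | 27.6            | 2068.9     |
| PCy3    | 170            | 32.4            | 2056.4     |
| P(tBu)3 | 182            | 36.5            | 2056.1     |
| dppf    | 180            | 31.2            | 2064.3     |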

Steric (cone angle, buried volume) and electronic (TEP) descriptors let the optimizer explore the ligand space efficiently without testing every option.
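One common way for a model to turn descriptor distances into "how much does a result for one ligand tell me about another" is a similarity kernel. The sketch below uses a squared-exponential (RBF) kernel on standardized descriptors. This is an assumption for illustration only: the article does not say which kernel or model the optimizer uses, and `LENGTH_SCALE` is a hypothetical setting.

```python
import math
from statistics import mean, pstdev

# Columns: Cone Angle, % Buried Volume, TEP (values from the ligand example).
LIGANDS = {
    "PPh3": [145.0, 27.6, 2068.9],
    "PCy3": [170.0, 32.4, 2056.4],
    "P(tBu)3": [182.0, 36.5, 2056.1],
    "dppf": [180.0, 31.2, 2064.3],
}
LENGTH_SCALE = 1.0  # hypothetical value; larger means similarity decays more slowly


def standardize(table):
    """Rescale every descriptor to zero mean and unit standard deviation."""
    columns = list(zip(*table.values()))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # avoid dividing by zero
    return {
        name: [(v - m) / s for v, m, s in zip(row, means, stds)]
        for name, row in table.items()
    }


def rbf_similarity(a, b, length_scale=LENGTH_SCALE):
    """Squared-exponential kernel: 1.0 for identical points, near 0 for distant ones."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-squared_distance / (2.0 * length_scale ** 2))


scaled = standardize(LIGANDS)
similarity = {
    a: {b: rbf_similarity(scaled[a], scaled[b]) for b in scaled}
    for a in scaled
}

for a, row in similarity.items():
    print(a, {b: round(value, 3) for b, value in row.items()})
```

Under these assumptions, PCy3 and P(tBu)3 are much more similar to each other than PPh3 is to P(tBu)3, so a measurement on PCy3 would carry more information about P(tBu)3 than a measurement on PPh3 would.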


How to Choose Good Properties

Good descriptors capture the physical or chemical differences that matter for your experiment. Here are some guidelines:

For Chemical Compounds

  • Molecular weight: basic size descriptor, a reasonable default for most compounds.

  • XLogP (partition coefficient): captures hydrophobicity/polarity.

  • TPSA (topological polar surface area): captures polarity and hydrogen bonding.

  • Boiling point: relevant for reactions where temperature matters.

  • pKa: relevant for acid/base chemistry.

  • Steric descriptors (cone angle, % buried volume): crucial for catalysis.

  • Electronic descriptors (Hammett sigma, TEP): for electronic effects in reactions.

For Non-Chemical Categories

  • Reactor type: volume (mL), max pressure (bar), max temperature (°C)

  • Supplier: purity (%), lead time (days), cost ($/kg)

  • Protocol: duration (min), number of steps, temperature range

General Principles

  • 2–5 properties is typical. More is not always better — noisy or irrelevant descriptors can hurt performance.

  • Choose properties that differentiate. If all categories have the same value for a property, it adds no information.

  • Use properties that capture different aspects. A mix of size, polarity, and shape descriptors captures more information than three size descriptors.

  • Physical relevance matters. Properties related to the mechanism of your reaction work better than arbitrary numbers.
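Two of these principles can be checked before you run any experiments: whether a property differentiates at all, and whether two properties carry nearly the same information. The sketch below is one possible screening step, not a feature of the optimizer. The function names, the 0.95 correlation threshold, and the extra "Is Phosphine" column are hypothetical and exist only for illustration.

```python
import math
from itertools import combinations
from statistics import mean

# One list per property, one entry per ligand (PPh3, PCy3, P(tBu)3, dppf).
DESCRIPTORS = {
    "Cone Angle": [145.0, 170.0, 182.0, 180.0],
    "% Buried Volume": [27.6, 32.4, 36.5, 31.2],
    "TEP": [2068.9, 2056.4, 2056.1, 2064.3],
    # Hypothetical extra column: identical for every ligand, so it cannot
    # help the optimizer tell them apart.
    "Is Phosphine": [1.0, 1.0, 1.0, 1.0],
}


def constant_properties(table):
    """Properties whose value is the same for every category."""
    return [name for name, values in table.items() if max(values) == min(values)]


def pearson(xs, ys):
    """Pearson correlation of two equally long, non-constant sequences."""
    mx, my = mean(xs), mean(ys)
    covariance = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return covariance / spread


def redundant_pairs(table, threshold=0.95):
    """Pairs of non-constant properties that are almost perfectly correlated."""
    constant = set(constant_properties(table))
    varying = [name for name in table if name not in constant]
    return [
        (a, b)
        for a, b in combinations(varying, 2)
        if abs(pearson(table[a], table[b])) >= threshold
    ]


print("No information:", constant_properties(DESCRIPTORS))
print("Possibly redundant:", redundant_pairs(DESCRIPTORS))
```

A property reported by `constant_properties` can be dropped outright. A pair reported by `redundant_pairs` is worth a second look: with only a handful of categories, high correlation can be a coincidence, so treat the result as a prompt to think about the chemistry rather than as an automatic rule.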


Without Properties vs With Properties

|                         | Without Properties                 | With Properties                    |
|-------------------------|------------------------------------|------------------------------------|
| How categories are seen | Unrelated labels (one-hot encoded) | Points in a numerical space        |
| Similarity              | All pairs equally distant          | Distance reflects real differences |
| Exploration             | Must try every category            | Can infer from similar categories  |
| Efficiency              | More experiments needed            | Fewer experiments to find optimum  |


Further Reading

The approach of using physicochemical descriptors to inform Bayesian optimization of categorical variables is based on research in the field:
