### Computational Radiology Laboratory Harvard Medical School

### www.crl.med.harvard.edu

### Children’s Hospital Department of Radiology Boston Massachusetts

### Evaluation of Image Segmentation

### Simon K. Warfield, Ph.D.

### Associate Professor of Radiology

### Harvard Medical School

**Segmentation**

### • Segmentation

### – Identification of structure in images.

### – Many different algorithms and a wide range of principles upon which they are based.

### • Segmentation is used for:

### – Quantitative image analysis

### – Image guided therapy

### – Visualization

**Validation of Image Segmentation**

### • Spectrum of accuracy versus realism in reference standard.

### • Digital phantoms.

### – Ground truth known accurately.

### – Not so realistic.

### • Acquisitions and careful segmentation.

### – Some uncertainty in ground truth.

### – More realistic.

### • Autopsy/histopathology.

### – Addresses pathology directly; resolution.

### • Clinical data?

**Validation of Image Segmentation**

### • Comparison to digital and physical phantoms:

### – Excellent for testing the anatomy, noise and artifact which is modeled.

### – Typically lacks range of normal or pathological variability encountered in practice.

### MRI of brain

**Comparison To Higher Resolution**

### Figure panels: MRI, photograph, MRI.

**Comparison to Autopsy Data**

### • Neonate gyrification index

### – Ratio of length of cortical boundary to length of smooth contour enclosing brain surface
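The ratio above can be illustrated with a minimal sketch, assuming both contours are available as closed polygons of 2D points. The function names and the toy contours (a sinusoidally folded circle and a circle enclosing it) are illustrative assumptions, not data or code from the study.

```python
import numpy as np

def closed_contour_length(points):
    """Perimeter of a closed polygonal contour given as an (N, 2) array of vertices."""
    pts = np.asarray(points, dtype=float)
    # Pair each vertex with the next one, wrapping around to close the contour.
    edges = np.roll(pts, -1, axis=0) - pts
    return float(np.linalg.norm(edges, axis=1).sum())

def gyrification_index(cortical_contour, enclosing_contour):
    """Length of the cortical boundary divided by the length of the smooth enclosing contour."""
    return closed_contour_length(cortical_contour) / closed_contour_length(enclosing_contour)

# Toy contours (hypothetical): a "folded" boundary and a smooth circle that encloses it.
angles = np.linspace(0.0, 2.0 * np.pi, 2000, endpoint=False)
radius = 1.0 + 0.1 * np.cos(12 * angles)
folded = np.column_stack([radius * np.cos(angles), radius * np.sin(angles)])
smooth = 1.1 * np.column_stack([np.cos(angles), np.sin(angles)])

gi = gyrification_index(folded, smooth)
print(f"gyrification index of the toy contour: {gi:.3f}")
```

A contour with no folding gives an index of 1 against itself; deeper or more numerous folds lengthen the cortical boundary and raise the index.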

**Staging**

Figure panels: Stage 3, Stage 5.

**Stage 3: at 28 w GA**

Shallow indentations of inf. frontal and sup. temp. gyrus (1 infant at 30.6 w GA, normal range: 28.6 ± 0.5 w GA)

**Stage 4: at 30 w GA**

2 indentations divide front. lobe into 3 areas, sup. temp. gyrus clearly detectable (3 infants, 30.6 w GA ± 0.4 w, normal range: 29.9 ± 0.3 w GA)

**Stage 5: at 32 w GA**

Frontal lobe clearly divided into three parts: sup., middle and inf. frontal gyrus (4 infants, 32.1 w GA ± 0.7 w, normal range: 31.6 ± 0.6 w GA)

**Stage 6: at 34 w GA**

temporal lobe clearly divided into

**Neonate GI: MRI Vs Autopsy**

### GI Increase Is Proportional to Change in Age.

**GI Versus Qualitative Staging**

**Neonate Gyrification**

**Validation of Image Segmentation**

### • STAPLE (Simultaneous Truth and Performance Level Estimation):

### – An algorithm for estimating performance and ground truth from a collection of independent segmentations.

### – Warfield, Zou, Wells, IEEE TMI 2004.

### – Warfield, Zou, Wells, PTRSA 2008.

### – Commowick and Warfield, IEEE TMI 2010.

**Validation of Image Segmentation**

### • Comparison to expert performance; to other algorithms.

### • Why compare to experts?

### – Experts are currently doing the segmentation tasks that we seek algorithms for:

### • Surgical planning.

### • Neuroscience research.

### • Response to therapy assessment.

### • What is the appropriate measure for such comparisons?

**Measures of Expert Performance**

### • Repeated measures of volume

### – Intra-class correlation coefficient

### • Spatial overlap

### – Jaccard: Area of intersection over union.

### – Dice: increased weight of intersection.

### – Vote counting: majority rule, etc.

### • Boundary measures

### – Hausdorff, 95% Hausdorff.

### • Bland-Altman methodology:

### – Requires a reference standard.

### • Measures of correct classification rate:

### – Sensitivity, specificity ( Pr(D=1|T=1), Pr(D=0|T=0) )

### – Positive predictive value and negative predictive value
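Several of the measures listed above follow directly from the four voxel counts of a binary decision D against a reference T. The sketch below is a minimal illustration under the assumption that both masks are binary arrays of the same shape and that each class is present in both; the function name and the toy masks are hypothetical.

```python
import numpy as np

def overlap_and_classification_measures(decision, reference):
    """Compare a binary segmentation D against a binary reference T, voxel by voxel."""
    d = np.asarray(decision, dtype=bool)
    t = np.asarray(reference, dtype=bool)
    tp = np.sum(d & t)      # D=1, T=1
    fp = np.sum(d & ~t)     # D=1, T=0
    fn = np.sum(~d & t)     # D=0, T=1
    tn = np.sum(~d & ~t)    # D=0, T=0
    return {
        "jaccard": float(tp / (tp + fp + fn)),         # intersection over union
        "dice": float(2 * tp / (2 * tp + fp + fn)),    # intersection counted twice
        "sensitivity": float(tp / (tp + fn)),          # Pr(D=1 | T=1)
        "specificity": float(tn / (tn + fp)),          # Pr(D=0 | T=0)
        "ppv": float(tp / (tp + fp)),                  # Pr(T=1 | D=1)
        "npv": float(tn / (tn + fn)),                  # Pr(T=0 | D=0)
    }

# Toy masks (hypothetical): 4 foreground and 6 background voxels in the reference.
reference = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
decision = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])
measures = overlap_and_classification_measures(decision, reference)
print(measures)
```

All of these require a reference standard, which is the gap the simultaneous estimation approach is meant to address.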

**Measures of Expert Performance**

### • Our new approach:

### • Simultaneous estimation of hidden “ground truth” and expert performance.

### • Enables comparison between and to experts.

### • Can be easily applied to clinical data exhibiting range of normal and pathological variability.

**How to judge segmentations of the peripheral zone?**

### Figure panels: 1.5T MR of prostate; peripheral zone and segmentations.

**Estimation Problem**

### • Complete data density: $f(\mathbf{D}, \mathbf{T} \mid p, q)$

### • Binary ground truth $T_i$ for each voxel $i$.

### • Expert $j$ makes segmentation decisions $D_{ij}$.

### • Expert performance characterized by sensitivity $p$ and specificity $q$.

### – We observe expert decisions $\mathbf{D}$. If we knew ground truth $\mathbf{T}$, we could construct maximum likelihood estimates for each expert’s sensitivity (true positive fraction) and specificity (true negative fraction).

**Expectation-Maximization**

### • General procedure for estimation problems that would be simplified if some missing data were available.

### • Key requirements are specification of:

### – The complete data.

### – Conditional probability density of the hidden data given the observed data.

### • Observable data $\mathbf{D}$

### • Hidden data $\mathbf{T}$, prob. density $f(\mathbf{T} \mid \mathbf{D}, \hat{\theta})$

### • Complete data $(\mathbf{D}, \mathbf{T})$


**Expectation-Maximization**

### • Solve the incomplete-data log likelihood maximization problem

### • E-step: estimate the conditional expectation of the complete-data log likelihood function.

$$Q(\theta \mid \hat{\theta}) = E\left[\ln f(\mathbf{D}, \mathbf{T} \mid \theta) \mid \mathbf{D}, \hat{\theta}\right]$$

### • M-step: estimate parameter values

$$\arg\max_{\theta} Q(\theta \mid \hat{\theta})$$

**Expectation-Maximization**

### • Since we don’t know ground truth $\mathbf{T}$, treat $\mathbf{T}$ as a random variable, and solve for the expert performance parameters that maximize:

$$Q(\theta \mid \hat{\theta}) = E\left[\ln f(\mathbf{D}, \mathbf{T} \mid \theta) \mid \mathbf{D}, \hat{\theta}\right]$$

### • Parameter values $\theta_j = [p_j \; q_j]^T$ that maximize the conditional expectation of the log-likelihood function are found by iterating two steps:

### – E-step: Estimate probability of hidden ground truth $\mathbf{T}$ given a previous estimate of the expert quality parameters, and take the expectation.

### – M-step: Estimate expert performance parameters by
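The two-step iteration can be sketched for binary labels as follows. This is a minimal illustration, not the authors' implementation: it follows the update rules of the published binary STAPLE formulation as best understood here, and it assumes a single global prior for foreground (taken as the mean of all decisions), an initial value of 0.9 for every $p_j$ and $q_j$, and no Markov Random Field. The function name `staple_binary` and the toy data are hypothetical.

```python
import numpy as np

def staple_binary(D, prior=None, n_iter=100, tol=1e-8):
    """EM sketch. D is an (n_voxels, n_experts) array of 0/1 decisions.

    Returns W (estimated probability that each voxel is foreground) and the
    per-expert sensitivity p and specificity q.
    """
    D = np.asarray(D, dtype=float)
    n_experts = D.shape[1]
    if prior is None:
        prior = D.mean()              # assumption: global foreground prior
    p = np.full(n_experts, 0.9)       # assumption: initial sensitivities
    q = np.full(n_experts, 0.9)       # assumption: initial specificities
    eps = 1e-6
    for _ in range(n_iter):
        # E-step: probability of T_i = 1 given the decisions and current p, q.
        a = prior * np.prod(np.where(D == 1, p, 1 - p), axis=1)
        b = (1 - prior) * np.prod(np.where(D == 0, q, 1 - q), axis=1)
        W = a / (a + b)
        # M-step: weighted true positive and true negative fractions.
        p_new = (W @ D) / W.sum()
        q_new = ((1 - W) @ (1 - D)) / (1 - W).sum()
        change = max(np.abs(p_new - p).max(), np.abs(q_new - q).max())
        # Keep parameters strictly inside (0, 1) so the E-step never divides by zero.
        p = np.clip(p_new, eps, 1 - eps)
        q = np.clip(q_new, eps, 1 - eps)
        if change < tol:
            break
    return W, p, q

# Toy data (hypothetical): two experts agree everywhere; a third misses two
# foreground voxels and adds one false positive.
truth = np.array([1] * 10 + [0] * 10)
expert_c = truth.copy()
expert_c[[0, 1]] = 0
expert_c[10] = 1
D_demo = np.column_stack([truth, truth, expert_c])

W, p, q = staple_binary(D_demo)
print("p:", np.round(p, 3), "q:", np.round(q, 3))
```

In this toy case the third expert's estimates settle near the fractions one would compute by hand against the consensus of the other two.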

**STAPLE**

### • Consider binary labels:

### – foreground.

### – background.

### • Spatial correlation of the unknown true segmentation can be modelled with a Markov Random Field.

**To Solve for Expert Parameters:**

**True Segmentation Estimate**

**Expert Performance Estimate**

### Now we seek an expression for the conditional expectation of the complete-data log likelihood function that we can maximize.

**Expert Performance Estimate**

### Now, consider each expert separately:

**Expert Performance Estimate**

### p (sensitivity, true positive fraction): ratio of expert-identified class 1 to total class 1 in the image.

### q (specificity, true negative fraction): ratio of expert-identified class 0 to total class 0 in the image.

**Extension to Several Tissue Labels**

### • Complete data density: $f(\mathbf{D}, \mathbf{T} \mid \theta)$

### • True segmentation $T_i$ for each voxel $i$

### – May be binary

### – May be categorical

### • Expert $j$ makes segmentation decisions $D_{ij}$

### • Expert performance $\theta_{s's}$ characterizes probability of deciding label $s'$ when true label is $s$.
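For several labels, each expert's performance becomes a confusion matrix rather than a pair $(p, q)$. The sketch below is a hedged illustration of that idea, following the published multi-label STAPLE updates as best understood here. It assumes a global label prior taken from the pooled decisions, an initial confusion matrix with 0.9 on the diagonal, at least two labels, and that every label occurs in the data. The function name and the simulated experts are hypothetical.

```python
import numpy as np

def staple_multilabel(D, n_labels, n_iter=50):
    """EM sketch. D is an (n_voxels, n_experts) integer array with labels 0..n_labels-1.

    Returns W[i, s], the estimated probability that voxel i has true label s, and
    theta[j, s_dec, s_true], the probability that expert j decides s_dec when the
    true label is s_true.
    """
    D = np.asarray(D)
    n_experts = D.shape[1]
    # Assumption: global prior from pooled label frequencies.
    prior = np.bincount(D.ravel(), minlength=n_labels) / D.size
    # Assumption: start from mostly-correct experts.
    theta = np.full((n_experts, n_labels, n_labels), 0.1 / (n_labels - 1))
    theta[:, np.arange(n_labels), np.arange(n_labels)] = 0.9
    one_hot = np.eye(n_labels)[D]     # indicator of D_ij = s_dec, shape (voxels, experts, labels)
    for _ in range(n_iter):
        # E-step: W[i, s] proportional to prior[s] * prod_j theta[j, D_ij, s].
        lik = theta[np.arange(n_experts), D, :]
        W = prior * np.prod(lik, axis=1)
        W /= W.sum(axis=1, keepdims=True)
        # M-step: weight of voxels where expert j decided s_dec, normalized per true label.
        theta = np.einsum("ijk,is->jks", one_hot, W) / W.sum(axis=0)
    return W, theta

# Simulated experts (hypothetical): 5 experts, 3 labels, each correct with probability 0.9.
rng = np.random.default_rng(0)
n_voxels, n_experts, n_labels = 6000, 5, 3
truth = rng.integers(0, n_labels, n_voxels)
D_sim = np.empty((n_voxels, n_experts), dtype=int)
for j in range(n_experts):
    wrong = (truth + rng.integers(1, n_labels, n_voxels)) % n_labels
    D_sim[:, j] = np.where(rng.random(n_voxels) < 0.9, truth, wrong)

W_ml, theta_ml = staple_multilabel(D_sim, n_labels)
print(np.round(theta_ml[0], 3))
```

Each column of an expert's matrix sums to one, since for a given true label the expert must decide on some label.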

**Probability Estimate of True Labels**

**Expert Performance Estimate**

### Now, consider each expert separately:

**Parameter Estimation**

### Noting that

### We can formulate the constrained optimization problem:

**Parameter Estimation**

### Therefore

### And noting that

### We find that

**Results: Synthetic Experts**

### • Several experiments with known ground truth and known performance parameters.

### • Goal:

### – Determine if STAPLE accurately identifies known ground truth.

### – Determine if STAPLE accurately determines known expert performance parameters.

### – Understand sensitivity of STAPLE with respect to changes in prior hyper-parameters; requirements for number of observations to enable good estimation; convergence characteristics.

**Synthetic Experts**

### 10 observations of segmentation by expert with p=q=0.99

### STAPLE p,q estimates:

### mean p 0.990237

### std. dev p 0.000616

### mean q 0.990121

### std. dev q 0.00071

**Synthetic Experts**

### 10 segmentations by experts with p=0.95, q=0.90

### STAPLE p,q estimates:

### mean p 0.950104

### std. dev p 0.001201

### mean q 0.900035

### std. dev q 0.001685
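An experiment of this kind can be mimicked with simulated data. The sketch below is an illustration only and does not reproduce the numbers on the slide: the image size, the 50% foreground prevalence, the random seed, the global prior, and the starting values are all assumptions, and the binary EM updates are restated inline in compact form so that the block runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_voxels, n_experts = 20000, 10       # assumed sizes
p_true, q_true = 0.95, 0.90           # expert parameters named on the slide

# Known ground truth (assumed 50% foreground) and synthetic expert decisions:
# foreground voxels are marked 1 with probability p_true,
# background voxels are marked 0 with probability q_true.
truth = rng.random(n_voxels) < 0.5
u = rng.random((n_voxels, n_experts))
D = np.where(truth[:, None], u < p_true, u >= q_true).astype(float)

# Compact binary EM with a global prior and 0.9 starting values (assumptions).
prior = D.mean()
p = np.full(n_experts, 0.9)
q = np.full(n_experts, 0.9)
for _ in range(50):
    a = prior * np.prod(np.where(D == 1, p, 1 - p), axis=1)
    b = (1 - prior) * np.prod(np.where(D == 0, q, 1 - q), axis=1)
    W = a / (a + b)                               # E-step
    p = (W @ D) / W.sum()                         # M-step: sensitivities
    q = ((1 - W) @ (1 - D)) / (1 - W).sum()       # M-step: specificities

print(f"mean p {p.mean():.6f}  std. dev p {p.std(ddof=1):.6f}")
print(f"mean q {q.mean():.6f}  std. dev q {q.std(ddof=1):.6f}")
```

Because the true labels are known in a simulation, the thresholded estimate `W > 0.5` can also be compared against `truth` to check the first goal listed for these experiments.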

**Expert and Student Segmentations**

### Figure panels: test image, expert consensus, student 1.

**Phantom Segmentation**

### Figure panels: image, expert segmentation, student segmentations, voting, STAPLE.

**Prostate Peripheral Zone**

| Expert $j$ | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| $p_j$ | .879 | .991 | .937 | .918 | .895 |
| $q_j$ |