Evaluation of Image Segmentation

(1)

Computational Radiology Laboratory Harvard Medical School

www.crl.med.harvard.edu

Children’s Hospital Department of Radiology Boston Massachusetts

Evaluation of Image Segmentation

Simon K. Warfield, Ph.D.

Associate Professor of Radiology

Harvard Medical School

(2)

Segmentation

•  Segmentation

–  Identification of structure in images.

–  Many different algorithms and a wide range of principles upon which they are based.

•  Segmentation is used for:

–  Quantitative image analysis

–  Image guided therapy

–  Visualization

(3)

Validation of Image Segmentation

•  Spectrum of accuracy versus realism in reference standard.

•  Digital phantoms.

–  Ground truth known accurately.

–  Not so realistic.

•  Acquisitions and careful segmentation.

–  Some uncertainty in ground truth.

–  More realistic.

•  Autopsy/histopathology.

–  Addresses pathology directly; resolution.

•  Clinical data?

(4)

Validation of Image Segmentation

•  Comparison to digital and physical phantoms:

–  Excellent for testing the anatomy, noise and artifact which is modeled.

–  Typically lacks range of normal or pathological variability encountered in practice.

MRI of brain

(5)

Comparison To Higher Resolution

[Figure panels: MRI, Photograph, MRI]

(6)

Comparison To Higher Resolution

(7)

Comparison to Autopsy Data

•  Neonate gyrification index

–  Ratio of length of cortical boundary to length of smooth contour enclosing brain surface
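The ratio above can be illustrated on a 2D contour. This is a minimal sketch, not the measurement pipeline used in the study: it assumes the cortical boundary is given as a closed polygon, and it uses the convex hull as a stand-in for the smooth enclosing contour. The function names are hypothetical.

```python
import math


def perimeter(points):
    """Length of the closed polygon through `points` (list of (x, y) tuples)."""
    n = len(points)
    return sum(math.dist(points[i], points[(i + 1) % n]) for i in range(n))


def convex_hull(points):
    """Convex hull by Andrew's monotone chain, counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def gyrification_index(cortical_contour):
    """Cortical boundary length divided by enclosing contour length.

    Assumption: the convex hull approximates the smooth enclosing contour.
    """
    return perimeter(cortical_contour) / perimeter(convex_hull(cortical_contour))
```

A contour with no folds gives an index of 1; any indentation lengthens the boundary without changing the enclosing contour, so the index rises above 1.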

(8)

Staging

[Figure panels: Stage 3, Stage 5]

Stage 3: at 28 weeks gestational age (w GA). Shallow indentations of inf. frontal and sup. temp. gyrus (1 infant at 30.6 w GA; normal range: 28.6 ± 0.5 w GA).

Stage 4: at 30 w GA. 2 indentations divide front. lobe into 3 areas; sup. temp. gyrus clearly detectable (3 infants, 30.6 ± 0.4 w GA; normal range: 29.9 ± 0.3 w GA).

Stage 5: at 32 w GA. Frontal lobe clearly divided into three parts: sup., middle and inf. frontal gyrus (4 infants, 32.1 ± 0.7 w GA; normal range: 31.6 ± 0.6 w GA).

Stage 6: at 34 w GA. Temporal lobe clearly divided into

(9)

Neonate GI: MRI Vs Autopsy

(10)

GI Increase Is Proportional to Change in Age.

(11)

GI Versus Qualitative Staging

(12)

Neonate Gyrification

(13)

Validation of Image Segmentation

•  STAPLE (Simultaneous Truth and Performance Level Estimation):

–  An algorithm for estimating performance and ground truth from a collection of independent segmentations.

–  Warfield, Zou, Wells, IEEE TMI 2004.

–  Warfield, Zou, Wells, PTRSA 2008.

–  Commowick and Warfield, IEEE TMI 2010.

(14)

Validation of Image Segmentation

•  Comparison to expert performance; to other algorithms.

•  Why compare to experts?

–  Experts are currently doing the segmentation tasks that we seek algorithms for:

•  Surgical planning.

•  Neuroscience research.

•  Response to therapy assessment.

•  What is the appropriate measure for such comparisons?

(15)

Measures of Expert Performance

•  Repeated measures of volume

–  Intra-class correlation coefficient

•  Spatial overlap

–  Jaccard: Area of intersection over union.

–  Dice: increased weight of intersection.

–  Vote counting: majority rule, etc.

•  Boundary measures

–  Hausdorff, 95% Hausdorff.

•  Bland-Altman methodology:

–  Requires a reference standard.

•  Measures of correct classification rate:

–  Sensitivity, specificity: Pr(D=1|T=1), Pr(D=0|T=0)

–  Positive predictive value and negative predictive value
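The two spatial overlap measures in the list can be sketched as follows. This is an illustrative example that assumes each segmentation is a flat sequence of 0/1 voxel labels; the function name is hypothetical, and returning 1.0 when both segmentations are empty is a convention chosen here, not something the slides specify.

```python
def dice_jaccard(a, b):
    """Return (Dice, Jaccard) for two binary label sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("segmentations must have the same number of voxels")
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    size_a = sum(1 for x in a if x)
    size_b = sum(1 for y in b if y)
    union = size_a + size_b - intersection
    if union == 0:
        # Assumed convention: two empty segmentations agree perfectly.
        return 1.0, 1.0
    dice = 2.0 * intersection / (size_a + size_b)
    jaccard = intersection / union
    return dice, jaccard
```

Dice counts the intersection twice, which is the "increased weight of intersection" noted above; the two measures are related by Dice = 2J / (1 + J).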

(16)

Measures of Expert Performance

•  Our new approach:

•  Simultaneous estimation of hidden "ground truth" and expert performance.

•  Enables comparison between and to experts.

•  Can be easily applied to clinical data exhibiting range of normal and pathological variability.

(17)

How to judge segmentations of the peripheral zone?

[Figure panels: 1.5T MR of prostate; Peripheral zone and segmentations]

(18)

Estimation Problem

•  Complete data density:

•  Binary ground truth T_i for each voxel i.

•  Expert j makes segmentation decisions D_ij.

•  Expert performance characterized by sensitivity p and specificity q.

–  We observe expert decisions D. If we knew ground truth T, we could construct maximum likelihood estimates for each expert’s sensitivity (true positive fraction)
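If the ground truth were known, the estimate for each expert would reduce to simple counting. The sketch below illustrates that case, assuming flat sequences of 0/1 labels with at least one voxel of each class; the function name is hypothetical.

```python
def sensitivity_specificity(decisions, truth):
    """Estimate (p, q) for one expert from known ground truth.

    p = Pr(D=1 | T=1): fraction of true foreground the expert labels foreground.
    q = Pr(D=0 | T=0): fraction of true background the expert labels background.
    Assumes `truth` contains at least one voxel of each class.
    """
    true_pos = sum(1 for d, t in zip(decisions, truth) if d == 1 and t == 1)
    true_neg = sum(1 for d, t in zip(decisions, truth) if d == 0 and t == 0)
    n_pos = sum(1 for t in truth if t == 1)
    n_neg = len(truth) - n_pos
    return true_pos / n_pos, true_neg / n_neg
```

For example, an expert who marks 3 of 4 true foreground voxels and 4 of 6 true background voxels correctly has p = 0.75 and q = 4/6.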

(19)

Expectation-Maximization

•  General procedure for estimation problems that would be simplified if some missing data was available.

•  Key requirements are specification of:

–  The complete data.

–  Conditional probability density of the hidden data given the observed data.

•  Observable data D

•  Hidden data T, prob. density $f(T \mid D, \hat{\theta})$

•  Complete data (D,T)

(20)

Expectation-Maximization

•  Solve the incomplete-data log likelihood maximization problem

•  E-step: estimate the conditional expectation of the complete-data log likelihood function.

$Q(\theta \mid \hat{\theta}) = E\left[ \ln f(D, T \mid \theta) \mid D, \hat{\theta} \right]$

•  M-step: estimate parameter values

$\arg\max_{\theta} Q(\theta \mid \hat{\theta})$

(21)

Expectation-Maximization

•  Since we don’t know ground truth T, treat T as a random variable, and solve for the expert performance parameters that maximize:

$Q(\theta \mid \hat{\theta}) = E\left[ \ln f(D, T \mid \theta) \mid D, \hat{\theta} \right]$

•  Parameter values $\theta_j = [p_j \; q_j]^T$ that maximize the conditional expectation of the log-likelihood function are found by iterating two steps:

–  E-step: Estimate probability of hidden ground truth T given a previous estimate of the expert quality parameters, and take the expectation.

–  M-step: Estimate expert performance parameters by

(22)

STAPLE

•  Consider binary labels:

–  foreground.

–  background.

•  Spatial correlation of the unknown true segmentation can be modelled with a Markov Random Field.
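For the binary case, the EM iteration can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: voxels are treated as independent (no Markov Random Field), a single global prior Pr(T_i = 1) is used and defaults to the mean foreground fraction across raters, and p and q are initialized to 0.9. The update formulas follow the published binary STAPLE equations (Warfield, Zou, Wells, IEEE TMI 2004); the function name and arguments are hypothetical.

```python
def staple_binary(decisions, prior=None, max_iter=100, tol=1e-7):
    """Binary STAPLE sketch. decisions[j][i] is rater j's 0/1 label for voxel i.

    Returns (w, p, q): w[i] estimates Pr(T_i = 1 | D), and p[j], q[j] are the
    estimated sensitivity and specificity of rater j.
    Assumes independent voxels and that both classes are present.
    """
    n_raters = len(decisions)
    n_voxels = len(decisions[0])
    if prior is None:
        # Assumption: global prior from the mean foreground fraction.
        prior = sum(sum(d) for d in decisions) / (n_raters * n_voxels)
    p = [0.9] * n_raters  # assumed initial sensitivity
    q = [0.9] * n_raters  # assumed initial specificity
    w = [0.0] * n_voxels
    for _ in range(max_iter):
        # E-step: posterior probability that each voxel is foreground.
        for i in range(n_voxels):
            a = prior        # joint weight of decisions with T_i = 1
            b = 1.0 - prior  # joint weight of decisions with T_i = 0
            for j in range(n_raters):
                if decisions[j][i]:
                    a *= p[j]
                    b *= 1.0 - q[j]
                else:
                    a *= 1.0 - p[j]
                    b *= q[j]
            w[i] = a / (a + b)
        # M-step: re-estimate each rater's sensitivity and specificity.
        sum_w = sum(w)
        new_p = [sum(wi for wi, d in zip(w, decisions[j]) if d) / sum_w
                 for j in range(n_raters)]
        new_q = [sum(1.0 - wi for wi, d in zip(w, decisions[j]) if not d)
                 / (n_voxels - sum_w)
                 for j in range(n_raters)]
        change = max(abs(x - y) for x, y in zip(new_p + new_q, p + q))
        p, q = new_p, new_q
        if change < tol:
            break
    return w, p, q
```

Thresholding `w` at 0.5 gives a hard estimate of the true segmentation, while `p` and `q` allow raters to be compared with one another.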

(23)

To Solve for Expert Parameters:

(24)

True Segmentation Estimate

(25)

Expert Performance Estimate

Now we seek an expression for the conditional expectation of the complete-data log likelihood function that we can maximize.

(26)

Expert Performance Estimate

Now, consider each expert separately:

(27)

Expert Performance Estimate

p (sensitivity, true positive fraction) : ratio of expert identified class 1 to total class 1 in the image.

q (specificity, true negative fraction) : ratio of expert

(28)

Extension to Several Tissue Labels

•  Complete data density:

•  True segmentation T_i for each voxel i

–  May be binary

–  May be categorical

•  Expert j makes segmentation decisions D_ij

•  Expert performance θ_{s's} characterizes probability of deciding label s' when true label

(29)

Probability Estimate of True Labels

(30)

Expert Performance Estimate

Now, consider each expert separately:

(31)

Parameter Estimation

Noting that

We can formulate the constrained optimization

problem:

(32)

Parameter Estimation

Therefore

And noting that

We find that

(33)

Results: Synthetic Experts

•  Several experiments with known ground truth and known performance parameters.

•  Goal:

–  Determine if STAPLE accurately identifies known ground truth.

–  Determine if STAPLE accurately determines known expert performance parameters.

–  Understand sensitivity of STAPLE with respect to changes in prior hyper-parameters; requirements for number of observations to enable good estimation; convergence characteristics.

(34)

Synthetic Experts

10 observations of segmentation by expert with p=q=0.99

STAPLE p,q estimates:

mean p 0.990237

std. dev p 0.000616

mean q 0.990121

std. dev q 0.00071
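A synthetic expert with known performance can be simulated by corrupting a known ground truth. The sketch below shows one way to do this; the slides do not say how the synthetic segmentations were generated, so it assumes each voxel is decided independently, and the function name is hypothetical.

```python
import random


def simulate_rater(truth, p, q, rng):
    """Return a synthetic 0/1 segmentation of `truth` with sensitivity p and specificity q.

    Assumption: each voxel is decided independently. A foreground voxel is
    kept with probability p; a background voxel is kept with probability q.
    `rng` is a random.Random instance, so results are reproducible.
    """
    return [
        (1 if rng.random() < p else 0) if t == 1 else (0 if rng.random() < q else 1)
        for t in truth
    ]
```

Repeating the call with the same `truth` gives several independent observations from one expert, which can then be passed to an estimator to check that it recovers the known p and q.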

(35)

Synthetic Experts

10 segmentations by experts with p=0.95, q=0.90

STAPLE p,q estimates:

mean p 0.950104

std. dev p 0.001201

mean q 0.900035

std. dev q 0.001685

(36)

Expert and Student Segmentations

[Figure panels: Test image, Expert consensus, Student 1]

(37)

Phantom Segmentation

[Figure panels: Image, Expert segmentation, Student segmentations, Voting, STAPLE]

(38)

Prostate Peripheral Zone

Rater   1      2      3      4      5

p_j     .879   .991   .937   .918   .895

q_j     .998   .994   .999   .999   .999

Dice    .913   .951   .967   .955   .944

(39)

A Binary MRF Model for Spatial Homogeneity.

Include a prior probability for the neighborhood configuration:

(40)

MAP Estimation With MRF Prior

(41)

Synthetic Experts

Only three segmentations by different quality experts.

STAPLE p,q estimates:

p1, q1: 0.9505, 0.9494

p2, q2: 0.9511, 0.8987

p3, q3: 0.9000, 0.8987

p=0.95,q=0.95 p=0.95,q=0.90

(42)

Cryoablation of Kidney Tumor

Segmentations before training session with radiologist:

Rater frequency. STAPLE with MRF.

After training session:

Based on the STAPLE performance assessment, we found the training session created a statistically significant increase in

(43)

Newborn MRI Segmentation

(44)

Newborn MRI Segmentation

Summary of segmentation quality (posterior probability Pr(T=t|D=t)) for each tissue type for repeated manual segmentations.

(45)

STAPLE Summary

•  Key advantages of STAPLE:

–  Estimates "true" segmentation.

–  Assesses expert performance.

•  Principled mechanism which enables:

–  Comparison of different experts.

–  Comparison of algorithm and experts.

•  Extensions for the future:

–  Can we learn image features that lead to different levels of expert performance?

(46)

Acknowledgements

Colleagues contributing to this work:

•  Neil Weisenfeld.

•  Andrea Mewes.

•  Petra Huppi.

•  Olivier Clatz.

•  William Wells.

•  Olivier Commowick.

•  Arne Hans.

•  Heidelise Als.

•  Lianne Woodward.

•  Frank Duffy.

•  Kelly Zou.

This study was supported by:
