Data pre-processing

Center data

Calculate the mean of each variable (column) and subtract it from every value in that column.
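A minimal sketch of mean-centering in plain Python, assuming the data are stored as a list of rows (samples) with one entry per variable. The function name center_columns and the sample values are hypothetical, chosen only for illustration.

```python
from statistics import mean

def center_columns(rows):
    """Subtract each column's mean from every value in that column."""
    columns = list(zip(*rows))                 # transpose: one tuple per variable
    col_means = [mean(col) for col in columns]
    return [[value - m for value, m in zip(row, col_means)] for row in rows]

# Hypothetical data: 3 samples (rows) x 2 variables (columns).
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
centered = center_columns(data)
print(centered)
```

After centering, every column has mean 0, while the differences in spread between the columns are left unchanged.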

Unit variance (UV) scaling
  • If variables are measured in different units, the data are scaled to give each variable an equal chance to influence the model.
  • Divide each mean-centered variable by its standard deviation, so that every scaled variable has variance 1.
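The two bullets above can be sketched as follows. This is an illustration under stated assumptions: the sample standard deviation (n - 1 denominator) is used, no column is constant, and the function name uv_scale and the data are hypothetical.

```python
from statistics import mean, stdev

def uv_scale(rows):
    """Mean-center each column, then divide it by its sample standard deviation."""
    columns = list(zip(*rows))
    col_means = [mean(col) for col in columns]
    col_sds = [stdev(col) for col in columns]  # assumes no constant column (SD > 0)
    return [
        [(value - m) / sd for value, m, sd in zip(row, col_means, col_sds)]
        for row in rows
    ]

# Hypothetical data: the two columns are on very different scales.
data = [[1.0, 100.0], [2.0, 300.0], [6.0, 200.0]]
scaled = uv_scale(data)

# Standard deviation of each scaled column; both should be (numerically) 1.
scaled_sds = [stdev(col) for col in zip(*scaled)]
print(scaled_sds)
```

Because every scaled column ends up with the same variance, the original units no longer decide which variable dominates the model.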

Pareto (PAR) scaling
What happens if big features dominate, but we know medium features are also important?
  • Ctr (mean-centering only)
    RISK: medium peaks are masked by large peaks
  • UV (mean-centering and unit variance)
    RISK: baseline noise may be inflated
The alternative is Pareto scaling
  • Divide each mean-centered variable by the square root of its standard deviation (SD)
  • Intermediate between no scaling (Ctr) and UV
  • Weights up medium features without inflating baseline noise
  • Recommended option for NMR & MS metabonomics, gene chip, and proteomics data
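To make the comparison concrete, the sketch below applies all three options to the same data and reports the spread (SD) of each variable afterwards. It assumes the sample standard deviation and no constant columns; the function name scale and the intensities (a large peak, a medium peak, and baseline noise) are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def scale(rows, method):
    """Mean-center each column, then scale it by method: 'ctr', 'uv', or 'par'."""
    divisor = {
        "ctr": lambda sd: 1.0,       # centering only, no scaling
        "uv": lambda sd: sd,         # unit variance: divide by SD
        "par": lambda sd: sqrt(sd),  # Pareto: divide by square root of SD
    }[method]
    columns = list(zip(*rows))
    col_means = [mean(col) for col in columns]
    col_divs = [divisor(stdev(col)) for col in columns]
    return [[(v - m) / d for v, m, d in zip(row, col_means, col_divs)] for row in rows]

# Hypothetical intensities: large peak, medium peak, baseline noise.
data = [
    [1000.0, 40.0, 0.10],
    [1200.0, 48.0, 0.12],
    [800.0, 32.0, 0.08],
]

# SD of each variable after each pre-processing option.
spread = {m: [stdev(col) for col in zip(*scale(data, m))] for m in ("ctr", "uv", "par")}
for method, sds in spread.items():
    print(method, [round(sd, 3) for sd in sds])
```

With these values the SDs after Ctr are about 200, 8, and 0.02, so the large peak dominates. After UV all three are 1, so the noise carries as much weight as the peaks. After Pareto scaling they are about 14.1, 2.8, and 0.14: the medium peak gains weight relative to the large one, while the noise stays well below both.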

 
