# Data pre-processing

### 波比 / 2016-12-13

##### Center data

Calculate the mean of each variable (column) and subtract it from every value in that column.
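A minimal sketch of mean-centering, using only the Python standard library. The helper name `center` and the toy data are illustrative assumptions; each inner list stands for one variable (column).

```python
from statistics import mean


def center(columns):
    """Mean-center each column: subtract the column mean from every value."""
    centered = []
    for col in columns:
        m = mean(col)  # average of this variable
        centered.append([x - m for x in col])
    return centered


# Toy data (hypothetical): two variables, three observations each.
data = [[1.0, 2.0, 3.0], [10.0, 20.0, 60.0]]
centered = center(data)
```

After centering, every column has mean 0, but the columns keep their original spread.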

##### Unit variance (UV) scaling

- If variables are measured in different units, scale the data so that each variable has an equal chance to influence the model.
- Divide each variable by its standard deviation, so that every scaled variable has variance 1.
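The UV step can be sketched as follows. This is an illustration under two assumptions: the variable is mean-centered first, and the sample standard deviation (n - 1 denominator, `statistics.stdev`) is used. The function name `uv_scale` is hypothetical.

```python
from statistics import mean, stdev, variance


def uv_scale(col):
    """Mean-center a column, then divide by its standard deviation."""
    m = mean(col)
    s = stdev(col)  # assumption: sample SD (n - 1 denominator)
    return [(x - m) / s for x in col]


# Toy variable (hypothetical): mean 4, sample SD 2.
col = [2.0, 4.0, 6.0]
scaled = uv_scale(col)
scaled_variance = variance(scaled)  # sample variance of the scaled values
```

A constant column has a standard deviation of 0, so it would need to be dropped or handled separately before scaling.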

##### Pareto (PAR) scaling

What happens if big features dominate, but we know medium features are also important?

- Ctr (mean-centering only)
  - RISK: medium peaks are masked by large peaks
- UV (mean-centering and unit variance)
  - RISK: baseline noise may be inflated

##### The alternative is Pareto scaling

- Divide each variable by the square root of its SD
- Intermediate between no scaling (Ctr) and UV
- Weights up medium features without inflating baseline noise
- Recommended option for NMR and MS metabonomics, gene chip, and proteomics data
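To see why Pareto scaling sits between Ctr and UV, the sketch below compares how much a large feature outweighs a medium one under each option. The function names, the method labels `"ctr"`, `"par"`, `"uv"`, and the toy features are assumptions made for illustration; mean-centering before dividing and the sample SD are also assumed.

```python
from math import sqrt
from statistics import mean, stdev


def scale(col, method="par"):
    """Mean-center a column and divide by 1 (ctr), sqrt(SD) (par) or SD (uv)."""
    m = mean(col)
    s = stdev(col)  # assumption: sample SD (n - 1 denominator)
    divisor = {"ctr": 1.0, "par": sqrt(s), "uv": s}[method]
    return [(x - m) / divisor for x in col]


def spread_ratio(big, medium, method):
    """Ratio of the SDs of two features after scaling with the same method."""
    return stdev(scale(big, method)) / stdev(scale(medium, method))


# Toy features (hypothetical): a large peak (SD 100) and a medium peak (SD 10).
big = [100.0, 200.0, 300.0]
medium = [10.0, 20.0, 30.0]

ratios = {m: spread_ratio(big, medium, m) for m in ("ctr", "par", "uv")}
```

With these toy values the large feature has 10 times the spread of the medium one after centering only, about 3.16 times (the square root of 10) after Pareto scaling, and the same spread after UV scaling.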