Outlier Analysis Worker

Processes outlier detection jobs to identify statistical outliers in spatial data.

Overview

The outlier analysis worker flags features whose value in a chosen numeric field is statistically unusual, using either the z-score method or the MAD (Median Absolute Deviation) method.

Job Type

outlier_analysis

Input Parameters

{
  "dataset_id": 123,
  "value_field": "income",
  "method": "zscore",
  "threshold": 2.0
}

Parameters

  • dataset_id (required): Source dataset ID
  • value_field (required): Numeric field to analyze
  • method (optional): "zscore" or "mad" (default: "zscore")
  • threshold (optional): Cutoff applied to the absolute z-score ("zscore") or the absolute modified z-score ("mad") (default: 2.0)
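
The parameter rules above can be sketched as a small validation step. This is a minimal illustration, not the worker's actual code: the function name validate_params, the use of ValueError, and the rule that threshold must be positive are assumptions made for the example.

```python
ALLOWED_METHODS = {"zscore", "mad"}


def validate_params(params):
    """Return a normalized copy of the job parameters, or raise ValueError.

    Hypothetical helper: the required/optional split and the defaults follow
    the parameter list above; everything else is an assumption.
    """
    if "dataset_id" not in params:
        raise ValueError("dataset_id is required")
    if not params.get("value_field"):
        raise ValueError("value_field is required")

    # Optional parameters fall back to the documented defaults.
    method = params.get("method", "zscore")
    if method not in ALLOWED_METHODS:
        raise ValueError(f"method must be one of {sorted(ALLOWED_METHODS)}")

    threshold = float(params.get("threshold", 2.0))
    if threshold <= 0:
        # Assumption: a non-positive cutoff would flag every feature.
        raise ValueError("threshold must be positive")

    return {
        "dataset_id": int(params["dataset_id"]),
        "value_field": str(params["value_field"]),
        "method": method,
        "threshold": threshold,
    }
```

Calling validate_params({"dataset_id": 123, "value_field": "income"}) fills in "zscore" and 2.0 for the two optional parameters.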

Output

Creates a new dataset with outlier analysis results:

  • Original features marked as outliers
  • Outlier score (z-score or MAD score)
  • Outlier flag
  • Original attributes preserved

Methods

Z-Score Method

Calculates standardized z-scores:

  • Mean and standard deviation calculated
  • Z-score = (value - mean) / standard_deviation
  • Features with |z-score| > threshold are outliers
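
A minimal sketch of the z-score steps above, using only the Python standard library. Two points are assumptions, since the description does not specify them: the population standard deviation is used (rather than the sample one), and when every value is identical nothing is flagged. The function name zscore_outliers is hypothetical.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Return one (z_score, is_outlier) pair per input value."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation (divides by n, not n - 1).
    std = statistics.pstdev(values)
    if std == 0:
        # All values are identical, so no value deviates from the mean.
        return [(0.0, False) for _ in values]

    results = []
    for value in values:
        z = (value - mean) / std
        results.append((z, abs(z) > threshold))
    return results


incomes = [10, 12, 11, 13, 12, 11, 10, 12, 50]
flags = [is_outlier for _, is_outlier in zscore_outliers(incomes)]
# Only the last value (50) is flagged at the default threshold of 2.0.
```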

MAD Method

Uses Median Absolute Deviation:

  • Median calculated
  • MAD = median(|value - median|)
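  • Modified z-score = 0.6745 * (value - median) / MAD, where the constant 0.6745 rescales MAD so that the score is comparable to a z-score when the data are normally distributed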
  • Features with |modified z-score| > threshold are outliers
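
The MAD steps above can be sketched the same way. The function name mad_outliers is hypothetical, and the handling of MAD = 0 (which happens when more than half the values equal the median and leaves the modified z-score undefined) is an assumption: this sketch simply flags nothing in that case.

```python
import statistics


def mad_outliers(values, threshold=2.0):
    """Return one (modified_z_score, is_outlier) pair per input value."""
    median = statistics.median(values)
    mad = statistics.median([abs(v - median) for v in values])
    if mad == 0:
        # Assumption: with a zero MAD the score is undefined, so flag nothing.
        return [(0.0, False) for _ in values]

    results = []
    for value in values:
        score = 0.6745 * (value - median) / mad
        results.append((score, abs(score) > threshold))
    return results


scores = mad_outliers([10, 11, 12, 13, 100])
# Median is 12 and MAD is 1, so the last value scores about 59.4 and is the
# only one flagged.
```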

Example

# Enqueue an outlier analysis job via API
curl -X POST "https://example.com/api/analysis/outlier_run.php" \
  -H "Content-Type: application/json" \
  -d '{
    "dataset_id": 123,
    "value_field": "income",
    "method": "zscore",
    "threshold": 2.0
  }'

Background Jobs

This analysis runs as a background job. The worker:

  1. Fetches queued outlier_analysis jobs
  2. Validates input parameters
  3. Calculates statistics (mean/std or median/MAD)
  4. Identifies outliers
  5. Creates output dataset
  6. Marks job as completed
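
The six steps above can be sketched end to end with an in-memory queue. Everything here beyond the step order is an assumption made for illustration: jobs and datasets are plain dicts and lists, the status values "queued", "completed", and "failed" and the output field names outlier_score and is_outlier are hypothetical, and the z-score branch uses the population standard deviation.

```python
import statistics


def process_job(job, datasets):
    """Steps 2-5: validate, compute statistics, score, build the output."""
    params = job["params"]
    field = params["value_field"]
    method = params.get("method", "zscore")
    threshold = params.get("threshold", 2.0)
    if method not in ("zscore", "mad"):
        raise ValueError(f"unknown method: {method}")

    features = datasets[params["dataset_id"]]
    values = [f[field] for f in features]

    # Step 3: mean/std for z-score, median/MAD for the modified z-score.
    if method == "zscore":
        center, factor = statistics.fmean(values), 1.0
        scale = statistics.pstdev(values)
    else:
        center, factor = statistics.median(values), 0.6745
        scale = statistics.median([abs(v - center) for v in values])

    # Steps 4-5: score each feature and keep its original attributes.
    output = []
    for f in features:
        score = factor * (f[field] - center) / scale if scale else 0.0
        output.append({**f, "outlier_score": score,
                       "is_outlier": abs(score) > threshold})
    return output


def run_worker(jobs, datasets):
    """Steps 1 and 6: pick up queued jobs and record their final status."""
    for job in jobs:
        if job["type"] != "outlier_analysis" or job["status"] != "queued":
            continue
        try:
            job["output"] = process_job(job, datasets)
            job["status"] = "completed"
        except (KeyError, ValueError) as exc:
            job["status"] = "failed"
            job["error"] = str(exc)
```

A real worker would read jobs from its queue and write the output as a new dataset; the dict-based stand-ins only make the control flow visible.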

Performance Considerations

  • Processing time grows with the number of features in the dataset
  • Z-score method requires two passes (mean/std, then scoring)
  • The MAD method is more robust to extreme values: the median and MAD shift far less than the mean and standard deviation when outliers are present
  • Consider filtering null values before analysis
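
The last two points can be illustrated with a small worked example. The helper drop_nulls and the sample values are made up for this sketch, and the z-score again assumes the population standard deviation. After the null is removed, the single extreme value inflates the standard deviation enough that its own z-score stays under the default threshold, while the modified z-score flags it clearly.

```python
import statistics


def drop_nulls(features, field):
    """Keep features whose field holds a real number (not None, not NaN)."""
    kept = []
    for feature in features:
        value = feature.get(field)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value == value:  # NaN is the only value != itself
            kept.append(feature)
    return kept


features = [{"income": v} for v in (10, 11, None, 12, 13, 100)]
values = [f["income"] for f in drop_nulls(features, "income")]

# Z-score of the extreme value (100), population standard deviation assumed.
mean, std = statistics.fmean(values), statistics.pstdev(values)
z = (100 - mean) / std  # just under 2.0, so not flagged at the default

# Modified z-score of the same value.
median = statistics.median(values)
mad = statistics.median([abs(v - median) for v in values])
modified_z = 0.6745 * (100 - median) / mad  # about 59.4, clearly flagged
```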