
Pipeline Performance

Optimization techniques and best practices for fast, efficient pipelines.


Performance Fundamentals

What Makes Pipelines Fast or Slow?

Fast pipelines:

  • Small datasets (< 1MB)
  • Simple transformations (filter, pick fields)
  • Few steps (< 10)
  • Efficient utilities

Slow pipelines:

  • Large datasets (> 10MB)
  • Complex transformations (aggregate, sort)
  • Many steps (> 20)
  • Inefficient utility combinations

Key insight: Pipeline execution time is mostly about data size and step complexity.


Optimization Strategies

1. Filter Early

✨ Best practice: Reduce dataset size as early as possible.

Before (slow):

Input → Aggregate → Filter → Output

Aggregates entire dataset, then filters. Wasteful.

After (fast):

Input → Filter → Aggregate → Output

Filters first, aggregates smaller dataset. Much faster.

Impact: Can reduce execution time by 50-90%.
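The reordering above can be sketched in plain JavaScript (the array and its `region`/`amount` fields are hypothetical stand-ins for pipeline data, not the tool's internals):

```javascript
// Hypothetical dataset: 10,000 order records.
const orders = Array.from({ length: 10000 }, (_, i) => ({
  region: i % 4 === 0 ? 'EU' : 'US',
  amount: i % 100,
}));

// Fast shape: Filter first, so Aggregate only sees 2,500 EU items
// instead of all 10,000.
const euTotal = orders
  .filter((o) => o.region === 'EU')       // Filter step
  .reduce((sum, o) => sum + o.amount, 0); // Aggregate step (sum)
```

Here the expensive Aggregate step touches a quarter of the data; the earlier the Filter runs, the less every later step has to process.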


2. Pick Fields First

Remove unnecessary fields early:

Input → Pick Fields → [process with fewer fields] → Output

Why it helps:

  • Fewer fields = less data to process
  • Less memory usage
  • Faster downstream operations

Impact: 20-40% faster for wide datasets (50+ fields).
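A Pick Fields step amounts to a projection over each row; a minimal sketch in plain JavaScript (field names here are hypothetical):

```javascript
// Hypothetical wide rows: only `id` and `price` are needed downstream.
const rows = [
  { id: 1, name: 'a', price: 10, createdAt: '2024-01-01', notes: 'free text' },
  { id: 2, name: 'b', price: 20, createdAt: '2024-01-02', notes: 'free text' },
];

// Keep only the listed fields, shrinking every row before further steps.
const pickFields = (items, fields) =>
  items.map((item) => Object.fromEntries(fields.map((f) => [f, item[f]])));

const slim = pickFields(rows, ['id', 'price']);
```

Every step after the projection now copies, compares, and serializes two fields per row instead of five.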


3. Avoid Redundant Steps

Don't process the same data multiple times:

❌ Bad: Input → Filter Fields (keeps A,B) → Remove Fields (removes C) → Sort → Output

✅ Good: Input → Filter Fields (keeps A,B) → Sort → Output

Why: Two field-filtering steps are redundant. Combine them.


4. Use Efficient Utilities

Some utilities are faster than others:

Utility         Speed       Use Case
Pick Fields     Very fast   Keep specific fields
Filter Fields   Fast        Remove specific fields
Clean JSON      Fast        Basic cleanup
Format Values   Moderate    Type conversions
Aggregate       Slow        Calculations on arrays
Sort            Slow        Reorder arrays

Tip: Use faster utilities when possible.


5. Split Large Pipelines

Break into multiple smaller pipelines:

Pipeline 1: Input → Filter → Clean → Output
Pipeline 2: [Pipeline 1 output] → Transform → Output
Pipeline 3: [Pipeline 2 output] → Aggregate → Output

Benefits:

  • Easier to debug
  • Can optimize each pipeline separately
  • Easier to reuse components

Data Size Optimization

Estimate Execution Time

Rough estimates:

Dataset Size   Simple Pipeline   Complex Pipeline
< 1MB          < 1 sec           1-5 sec
1-10MB         1-5 sec           5-30 sec
10-50MB        5-30 sec          30-120 sec
> 50MB         30+ sec           2+ min

Your mileage may vary based on step complexity.

Reduce Data Size

Strategies:

  1. Sample data for development:

    Input → Filter (take first 100 items) → [dev pipeline]
  2. Filter early:

    Input → Filter (by date range) → [process recent data]
  3. Pick essential fields:

    Input → Pick Fields (only what you need) → [process]
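Strategy 1 is just a take-first-N filter; a minimal sketch (the dataset and helper name are hypothetical):

```javascript
// Hypothetical full dataset of 50,000 items.
const fullDataset = Array.from({ length: 50000 }, (_, i) => ({ id: i }));

// Dev-time sample: take the first 100 items so each run stays fast.
const takeFirst = (items, n) => items.slice(0, n);
const devData = takeFirst(fullDataset, 100);
```

Develop against `devData`, then swap the full dataset back in once the logic is verified.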

Utility-Specific Optimization

Aggregation

Aggregating large arrays is slow. Optimize:

1. Filter before aggregating:

❌ Input → Aggregate (sum of all) → Output
✅ Input → Filter (by condition) → Aggregate (sum of filtered) → Output

2. Pick fields before aggregating:

❌ Input → Aggregate (on all 50 fields) → Output
✅ Input → Pick Fields (only 5 needed) → Aggregate → Output

3. Use specific aggregation operations:

❌ Aggregate (get stats, then extract sum)
✅ Aggregate (operation: sum), which is faster
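The third point can be sketched like this (`fullStats` and `sumOnly` are hypothetical implementations, not the tool's actual code): computing one statistic does strictly less work than computing all of them and discarding most.

```javascript
const values = [3, 1, 4, 1, 5, 9];

// Slow shape: compute every statistic, then keep only the sum.
const fullStats = (xs) => ({
  sum: xs.reduce((a, b) => a + b, 0),
  min: Math.min(...xs),
  max: Math.max(...xs),
  mean: xs.reduce((a, b) => a + b, 0) / xs.length,
});

// Fast shape: run only the operation you need.
const sumOnly = (xs) => xs.reduce((a, b) => a + b, 0);

const total = sumOnly(values);
```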

Sorting

Sorting is slow for large arrays.

Optimizations:

  1. Sort filtered data:

    Input → Filter (reduce size) → Sort → Output
  2. Sort once, not multiple times:

    ❌ Input → Sort → [process] → Sort → Output
    ✅ Input → Sort → [process] → Output
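Sorting once is safe whenever the steps after the sort preserve order; a minimal sketch (field names hypothetical):

```javascript
const items = [{ score: 2 }, { score: 9 }, { score: 5 }];

// Sort once (the slow step), on a copy so the input stays untouched.
const sorted = [...items].sort((a, b) => b.score - a.score);

// A field rename is order-preserving, so no second sort is needed.
const processed = sorted.map(({ score }) => ({ points: score }));
```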

Find & Replace

Optimizing find/replace:

  1. Be specific — Narrow the search scope
  2. Use case-sensitive matching — Faster than case-insensitive
  3. Limit scope — Use target paths when possible

Memory Optimization

Memory Usage Patterns

Per execution:

  • Input data: 1x dataset size
  • Each step output: 1x dataset size (temporarily)
  • Peak usage: 2-3x dataset size

Example:

  • 10MB input → ~20-30MB peak memory usage

Reduce Memory Usage

Strategies:

  1. Simplify pipeline — Fewer steps = less memory
  2. Filter early — Reduce data size sooner
  3. Avoid multiple outputs — Don't branch if not needed
  4. Close other tabs — Free up browser memory

Execution Optimization

Worker Efficiency

Web Worker Architecture:

Main Thread (UI) ←→ Worker (Execution) ←→ Storage Worker (Persistence)

Optimizations:

  1. Worker reuse — Worker stays alive, avoiding recreation
  2. Efficient messaging — Minimize data transfer between threads
  3. Lazy loading — Load step outputs on demand, not all at once

Step Execution Order

Topological sort determines execution order:

Input → A → C → Output
         ↘ B ↗

Execution order: Input → A → (B, C in parallel) → Output
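A minimal sketch of how such an order can be computed (Kahn's algorithm; the step graph below is a hypothetical example, not the tool's internal representation):

```javascript
// Edges point from each step to the steps that consume its output.
const graph = {
  Input: ['A'],
  A: ['B', 'C'],
  B: ['Output'],
  C: ['Output'],
  Output: [],
};

function topoOrder(g) {
  // Count incoming edges for every node.
  const indegree = Object.fromEntries(Object.keys(g).map((n) => [n, 0]));
  for (const targets of Object.values(g))
    for (const t of targets) indegree[t] += 1;

  // Start from nodes with no inputs; whenever several nodes become ready
  // at once (B and C here), those are the ones that could run in parallel.
  const ready = Object.keys(g).filter((n) => indegree[n] === 0);
  const order = [];
  while (ready.length) {
    const n = ready.shift();
    order.push(n);
    for (const t of g[n]) if (--indegree[t] === 0) ready.push(t);
  }
  return order;
}

const order = topoOrder(graph);
```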

Optimization:

  • Place expensive steps later in the pipeline
  • Only if they don't affect upstream filtering

Caching Strategies

Step Output Caching

How it works:

  • Step outputs stored during execution
  • Loaded on-demand when viewed
  • Cached in memory for fast access
  • Cleared when pipeline changes

Benefits:

  • Faster initial execution (no rendering overhead)
  • Lower memory usage (only load viewed steps)
  • Better UX for large pipelines

Manual cache control:

  • Refresh page → clears cache
  • Pipeline change → clears cache
  • Can manually clear cache in DevTools

Browser Performance

Browser Differences

Fastest: Chrome, Edge (Blink engine)

  • OPFS support (fastest storage)
  • Efficient worker implementation
  • Good performance

Moderate: Firefox (Gecko engine)

  • No OPFS (uses IndexedDB)
  • Slower storage
  • Still good performance

Slowest: Safari (WebKit engine)

  • No OPFS (uses IndexedDB)
  • Slower worker performance
  • Higher overhead

Mobile Performance

Mobile browsers are slower:

  • Less memory available
  • Slower JavaScript execution
  • Limited storage

Optimizations for mobile:

  • Use smaller datasets
  • Simplify pipelines
  • Use desktop for complex work

Monitoring Performance

Measure Execution Time

In the UI:

  • Check step duration labels where available
  • Look for slow steps

In the console:

// Enable performance logging
localStorage.debug = 'pipeline:*';

Identify Bottlenecks

Find the slowest step:

  1. Run pipeline
  2. Review each completed step's duration
  3. Note execution times
  4. Focus on optimizing slowest step

Common bottlenecks:

  • Aggregation on large arrays
  • Sorting large arrays
  • Complex transformations (restructure, compute)

Performance Testing

Test with Sample Data

Development workflow:

  1. Test small — Use a 10-100 item sample
  2. Verify logic — Ensure correctness
  3. Test medium — Use 1K-10K items
  4. Measure performance — Check execution time
  5. Optimize if needed — Apply optimizations
  6. Test large — Use the full dataset (if needed)

Benchmark Your Pipeline

Create performance baseline:

  1. Run pipeline with representative data
  2. Record execution time
  3. Document baseline performance
  4. Re-measure after changes
  5. Compare to baseline
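A rough way to record that baseline from the console is a small timing harness (the sketch below is hypothetical, built on the standard `performance.now()` timer):

```javascript
// Time a pipeline-like function over a dataset and report the duration.
function benchmark(label, pipelineFn, data) {
  const start = performance.now();
  const result = pipelineFn(data);
  const durationMs = performance.now() - start;
  return { label, durationMs, result };
}

// Hypothetical pipeline: filter even values and count them.
const data = Array.from({ length: 1000 }, (_, i) => ({ value: i }));
const run = benchmark('baseline', (items) =>
  items.filter((x) => x.value % 2 === 0).length, data);
```

Record `run.durationMs` with representative data before optimizing, then re-run and compare after each change.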

Performance Trade-offs

Speed vs. Completeness

Faster (less complete):

  • Fewer steps
  • Simpler transformations
  • Less data validation

Slower (more complete):

  • More steps
  • Complex transformations
  • Thorough validation

Choose based on needs:

  • Development → Faster is better
  • Production → Completeness matters

Speed vs. Memory

Less memory (slower):

  • Stream processing (future feature)
  • Incremental loading
  • Frequent garbage collection

More memory (faster):

  • Load everything at once
  • Cache intermediate results
  • Less processing overhead

Advanced Optimization

Pipeline Parallelization

Future feature: Execute independent steps in parallel

Input → Filter → Sort → Output
              ↘ Aggregate ↗

Currently sequential, could be parallel in future.

Incremental Execution

Future feature: Only re-run changed steps

Benefit: Much faster iteration during development.

Streaming Execution

Future feature: Process data in chunks

Benefit: Start showing results before completion.


Performance Checklist

Use this checklist to optimize your pipelines:

Data Size:

  • Dataset < 10MB for optimal performance
  • Filter early to reduce size
  • Pick fields early to reduce width

Pipeline Structure:

  • Less than 20 steps
  • No redundant steps
  • Efficient utility combinations
  • Simple before complex

Execution:

  • Fast steps first, slow steps last
  • Minimal branching
  • No circular dependencies

Memory:

  • Close unnecessary tabs
  • Clear cache periodically
  • Use desktop for complex pipelines
