Machine Learning Operations Engineer


WindBorne Systems is supercharging weather forecasts with a unique proprietary data source: a global constellation of next-generation smart weather balloons targeting the most critical atmospheric data. We design, manufacture, and operate our own balloons, using the data they collect to generate otherwise unattainable weather intelligence.

Our mission is to eliminate weather uncertainty, and in the process help humanity adapt to climate change, be that predicting hurricanes or speeding the adoption of renewables. We are building a future in which the planet is instrumented by thousands of our microballoons, eliminating gaps in our understanding of the planet and giving people and businesses the information they need to make critical decisions. The founding team of Stanford engineers was named Forbes 2019 30 under 30 and is backed by top-tier investors, including Khosla Ventures and Footwork VC.

WindBorne builds AI weather models that run 24/7, producing global forecasts every 20 minutes. Our research team is small and moves fast, but too much of its time goes to operationalization and infrastructure firefighting instead of model development. We need someone to fix that.

Balloon flying over mountains
Snapshot of the balloon constellation on March 31, 2026

Responsibilities

What you'd own:

  • Research-to-operations pipelines — Our models serve real-time forecasts to customers with strict latency requirements. You'd own uptime end to end: build health monitoring, improve logging, and diagnose failures across nodes.
  • Inference scaling & compute strategy — We have an on-prem cluster but also use cloud providers, especially for production deployments. You'd evaluate cost/performance tradeoffs across cloud options as we scale, and also help manage growing on-prem resources for compute and storage.
  • Data pipelines & upstream reliability — Weather data comes from dozens of sources (satellites, government agencies, our own balloon observations) with varying schedules, incomplete documentation, and quality that sometimes fails or shifts. You'd build training and real-time data pipelines that gracefully handle upstream delays, run QC checks on incoming data, and add logging and alerting for a zoo of edge cases.
  • Training infrastructure — Make distributed training runs reliable. They die from silent OOMs, network faults, and storage issues. Build monitoring, auto-recovery, and job scheduling so researchers can launch experiments without babysitting them.

Skills and Qualifications

Requirements

  • Have run production systems end-to-end — you've been paged at 2am because a pipeline stopped serving and you know how to build systems so that stops happening
  • Experience working with petabyte-scale datasets
  • Understand cloud GPU economics and how to balance workloads across on-prem and cloud
  • Comfortable keeping up with fast-paced model releases and building reliable custom deployments for them
  • Experience with PyTorch, Docker, cursed memory management, compression, and debugging network saturation

Benefits

  • 401(k)
  • Dental insurance
  • Health insurance
  • Vision insurance
  • Unlimited PTO
  • Stock Option Plan
  • Office food and beverages

Salary

Location

1600 Bridge Pkwy, Redwood City, CA. In person required.

What our hardware looks like

Close-up of GSB