Machine Learning Operations Engineer


WindBorne Systems is supercharging weather forecasts with a unique proprietary data source: a global constellation of next-generation smart weather balloons targeting the most critical atmospheric data. We design, manufacture, and operate our own balloons, using the data they collect to generate otherwise unattainable weather intelligence.

Our mission is to eliminate weather uncertainty, and in the process help humanity adapt to climate change, be that predicting hurricanes or speeding the adoption of renewables. We are building a future in which the planet is instrumented by thousands of our microballoons, eliminating gaps in our understanding of the planet and giving people and businesses the information they need to make critical decisions. The founding team of Stanford engineers was named Forbes 2019 30 under 30 and is backed by top-tier investors, including Khosla Ventures and Footwork VC.

WindBorne builds AI weather models that run 24/7, producing global forecasts every 20 minutes. Our research team is small and moves fast, but too much of its time goes to operationalization and infrastructure firefighting instead of model development. We need someone to fix that.

Balloon flying over mountains
Snapshot of the balloon constellation on March 31, 2026

Responsibilities

What you'd own:

  • Research-to-operations pipelines — Our models serve real-time forecasts to customers with strict latency requirements. You'd own uptime end to end: build health monitoring, improve logging, and diagnose failures across nodes.
  • Inference scaling & compute strategy — We have an on-prem cluster but also use cloud providers, especially for production deployments. You'd evaluate cost/performance tradeoffs across cloud options as we scale, and also help manage growing on-prem resources for compute and storage.
  • Data pipelines & upstream reliability — Weather data comes from dozens of sources (satellites, government agencies, our own balloon observations) with varying schedules, incomplete documentation, and quality that sometimes fails or shifts. You'd build training and real-time data pipelines that gracefully handle upstream delays, run QC checks on incoming data, and add logging and alerting for a zoo of edge cases.
  • Training infrastructure — Make distributed training runs reliable. They die from silent OOMs, network faults, and storage issues. Build monitoring, auto-recovery, and job scheduling so researchers can launch experiments without babysitting them.

Skills and Qualifications

Requirements

  • Have run production systems end-to-end — you've been paged at 2am because a pipeline stopped serving and you know how to build systems so that stops happening
  • Experience working with petabyte-scale datasets
  • Understand cloud GPU economics and how to balance workloads across on-prem and cloud
  • Comfortable keeping up with fast-paced model releases and building reliable custom deployments for them
  • Experience with PyTorch, Docker, cursed memory management, compression, and debugging network saturation

Benefits

  • 401(k)
  • Dental insurance
  • Health insurance
  • Vision insurance
  • Unlimited PTO
  • Stock Option Plan
  • Office food and beverages

Salary

Location

1600 Bridge Pkwy, Redwood City, CA. In person required.

What our hardware looks like

Close-up of GSB