This page contains press release content distributed by XPR Media. Members of the editorial and news staff of the USA TODAY Network were not involved in the creation of this content.

Clockwork.io Introduces A New Class of Fault Tolerance to End Failure-Driven GPU Waste in AI Training

New TorchPass solution addresses a multi-million dollar challenge with AI infrastructure; uses Live GPU Migration to keep large-scale AI training running through hardware failures instead of forcing costly restarts

PALO ALTO, CA / ACCESS Newswire / March 11, 2026 / Clockwork.io, the leader in Software-Driven AI Fabrics (a programmable, vendor-neutral software layer that optimizes large-scale GPU clusters for real-time observability, fault tolerance, and deterministic performance), today announced the general availability of TorchPass Workload Fault Tolerance. This new class of software-driven fault tolerance eliminates one of the most costly failure modes in large-scale AI training: catastrophic job restarts caused by infrastructure faults.

Delivered as a core capability of the Clockwork.io FleetIQ platform, TorchPass applies the principles of Software-Driven AI Fabrics to distributed training, using Live GPU Migration to allow workloads to continue running through GPU failures, network disruptions, driver bugs, and even full node crashes, without checkpoint restarts or lost progress.

“Companies are investing billions in next-gen chips, yet the cost of running distributed AI jobs remains grossly inflated because the ecosystem has accepted failure as a constant,” said Suresh Vasudevan, CEO of Clockwork.io. “We built TorchPass to fundamentally reject that premise. Instead of treating failure as inevitable and restarting after the fact, TorchPass makes infrastructure faults invisible to the workload: training continues through failures transparently, in software. For a typical 2,048-GPU deployment, that translates into over $6 million a year in recovered compute. This is what our Software-Driven AI Fabric approach was designed to deliver: fault-tolerant AI infrastructure.”

Dylan Patel, Founder and CEO of SemiAnalysis, agreed that large-scale training jobs are limited by interruptions.

“As Blackwell clusters roll out with an NVL72 domain, and we look to the future with Rubin Ultra’s NVL576 domain, the idea that a single GPU error or network link flap can take down an entire run is totally unacceptable,” said Patel. “TorchPass solves a huge challenge with cluster reliability: it provides transparent failover and live workload migration that keeps MFU high, which in turn drives better GPU economics.”

Why AI Training Fails at Scale

Distributed AI training remains one of the most failure-prone workloads in modern infrastructure. As cluster sizes grow, fragility increases sharply. Research from Meta FAIR shows that mean time to failure drops to 7.9 hours in a 1,024-GPU cluster and to just 1.8 hours at 16,384 GPUs. This means that for most large, AI-focused enterprises or AI clouds, failure-driven restarts are inevitable, making this a major barrier to scaling AI’s impact.
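
As a rough illustration of what those figures imply, the sketch below converts the two quoted mean-time-to-failure values into an average number of failures per day. It assumes a constant failure rate, which is a simplification introduced here and not a claim from the cited research.

```python
HOURS_PER_DAY = 24.0

# Mean time to failure (hours) by cluster size, as quoted in the text above.
MTTF_HOURS = {1024: 7.9, 16384: 1.8}


def expected_failures_per_day(mttf_hours: float) -> float:
    """Average failures in 24 hours, assuming a constant failure rate."""
    return HOURS_PER_DAY / mttf_hours


for gpus, mttf in MTTF_HOURS.items():
    rate = expected_failures_per_day(mttf)
    print(f"{gpus:>6} GPUs: about {rate:.1f} failures per day")
```

Under that assumption, a 1,024-GPU cluster sees roughly three failures a day and a 16,384-GPU cluster roughly thirteen.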

Each failure forces training jobs to roll back to the most recent checkpoint, discarding minutes or hours of completed work and wasting additional time on manual intervention, reprovisioning resources and restarting training. These restarts silently cap GPU utilization, making reliability one of the largest hidden costs in AI infrastructure.

TorchPass tackles this problem proactively, resolving costly AI workload failures before the job stops or needs to restart. For enterprises running large AI workloads and for AI clouds alike, TorchPass dramatically improves workload reliability and cluster utilization. AI clouds can now address impacted GPUs while preserving the training run as planned, which translates into better customer SLAs and stronger overall economics, improving their ability to protect margin and deliver new models sooner.

“Managing compute output across large-scale GPU clusters is vital to ensuring we’re delivering reliable capacity to our customers. By using TorchPass we have the support of a company that focuses on resilience like it is a core business function: it replaces any specific failing GPU and keeps the rest of the job moving, rather than making one small problem impact our large-scale operations,” said David Power, CTO of Nscale. “In our evaluation, Live GPU Migration preserved both run continuity and throughput under real fault conditions, which is exactly what you need to deliver predictable time-to-train and a better customer experience at scale.”

How Live GPU Migration Works: Reliability Without Restart

TorchPass performs transparent, in-flight migration of impacted training ranks to spare resources when failures occur. TorchPass typically completes recovery in approximately three minutes while the training process continues uninterrupted.

It supports resilience across three failure scenarios:

  • Unplanned migration, handling sudden events such as kernel crashes, power failures, or GPU faults by reconstructing state from healthy replicas

  • Pre-emptive migration, triggered by early warning signals such as rising temperatures or ECC memory errors, enabling controlled migration before a hard failure

  • Planned migration, enabling maintenance, patching, and workload rebalancing without interrupting training
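
The three scenarios above can be pictured as a simple decision over node health signals. The sketch below is purely illustrative: the `NodeSignal` fields, the thresholds, and the `classify` function are hypothetical names invented for this example and do not describe TorchPass's actual interface or detection logic.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MigrationKind(Enum):
    UNPLANNED = "unplanned"
    PREEMPTIVE = "pre-emptive"
    PLANNED = "planned"


@dataclass
class NodeSignal:
    """Hypothetical health snapshot for one node."""
    hard_failure: bool = False           # e.g. kernel crash, power failure, GPU fault
    temperature_c: float = 0.0           # current GPU temperature
    ecc_errors: int = 0                  # ECC memory errors observed
    maintenance_requested: bool = False  # operator asked for patching or rebalancing


# Assumed warning thresholds, chosen only for illustration.
TEMP_WARN_C = 85.0
ECC_WARN_COUNT = 1


def classify(signal: NodeSignal) -> Optional[MigrationKind]:
    """Map a health snapshot to one of the three migration scenarios."""
    if signal.hard_failure:
        return MigrationKind.UNPLANNED
    if signal.temperature_c >= TEMP_WARN_C or signal.ecc_errors >= ECC_WARN_COUNT:
        return MigrationKind.PREEMPTIVE
    if signal.maintenance_requested:
        return MigrationKind.PLANNED
    return None  # healthy node, nothing to migrate


print(classify(NodeSignal(temperature_c=91.0)))
```

A hard failure takes priority over warning signals, and warning signals take priority over a maintenance request, mirroring the urgency order of the list.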

This approach reduces wasted training progress by 95%, cutting lost time from approximately three hours per day to under ten minutes in a 1,024-GPU cluster.
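
The two figures in that claim are consistent with each other, as a quick check shows. The sketch below only restates the quoted numbers (three hours lost per day, a 95% reduction). It does not model how the reduction is achieved.

```python
MINUTES_PER_DAY = 24 * 60


def remaining_lost_minutes(baseline_minutes: float, reduction: float) -> float:
    """Lost time left after applying a fractional reduction to the baseline."""
    return baseline_minutes * (1.0 - reduction)


baseline = 3 * 60  # "approximately three hours per day", in minutes
after = remaining_lost_minutes(baseline, 0.95)

print(f"Lost time before: {baseline} min/day ({baseline / MINUTES_PER_DAY:.1%} of the day)")
print(f"Lost time after:  {after:.0f} min/day ({after / MINUTES_PER_DAY:.2%} of the day)")
```

A 95% cut to 180 minutes leaves about 9 minutes, which matches the "under ten minutes" figure.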

Jordan Nanos, Member of Technical Staff and lead author of ClusterMAX, SemiAnalysis’ independent benchmark for large-scale AI training, stress-tested Clockwork.io TorchPass and found it delivered leading performance and efficiency for large-scale distributed training, enabling users to reduce checkpointing overhead in training. He shared the following results:

“In our testing, Clockwork.io TorchPass delivered the fastest and most efficient fault-tolerant performance for a gpt-oss-120B training run. We used TorchTitan on a Kubernetes cluster with 64x H200 GPUs. During our testing we measured job completion time (JCT) and Model FLOPs Utilization (MFU) against a standard approach (checkpoint-restart) and the leading open-source fault-tolerant training framework (TorchFT). We simulated multiple hardware failures on the cluster in order to stress test the fault-tolerant training frameworks.

When compared to checkpoint-restart, TorchPass was significantly faster to recover from failures. This reduced overall JCT and maintained high MFU. And when compared to TorchFT, TorchPass had a significantly higher MFU. This reduced overall JCT while also maintaining an equal time to recover from failures.

Using TorchPass also has a downstream effect where it provides users with an opportunity to reduce or even remove checkpointing from their training code. This means larger effective batch sizes, lower risk of out of memory errors (OOMs), and less time spent thinking about storage. For a research organization, this can ultimately mean a faster time to reach their training objective,” concluded Nanos.

Measurable Business Impact from Software-Driven Fault-Tolerance

For customers operating large AI clusters, the impact is immediate and measurable. In a typical 2,048-GPU H200 deployment, TorchPass Workload Fault Tolerance delivers over $6 million in annual savings by preventing wasted compute.

These savings come from eliminating hundreds of thousands of GPU-hours that would otherwise be lost to failure-driven restarts, cascading retries, and idle recovery time. By keeping training jobs running through infrastructure faults instead of restarting them, TorchPass converts lost GPU time into productive training, significantly improving the return on GPU investments that today often operate at just 30-50% of theoretical performance.

Enabling the Next Generation of AI Infrastructure

By making reliability a software-defined capability rather than a hardware constraint, TorchPass provides the operational confidence required to deploy next-generation, tightly coupled systems such as NVIDIA GB200 and GB300 NVL72 and future rack-scale systems, where dense architectures amplify the cost of even small failures.

TorchPass builds on Clockwork.io’s prior release of Network Fault Tolerance, which applies the same Software-Driven AI Fabric principles to network resilience by transparently rerouting traffic around link failures.

Together, these capabilities form Clockwork.io’s Software-Driven AI Fabric, a vendor-neutral software layer spanning network, compute, and storage. As modern AI workloads run on tightly coupled clusters where hundreds or thousands of processors must operate in coordinated lockstep, infrastructure behaves as a single system, where reliability and performance directly determine overall efficiency. By managing this complexity in software, Clockwork.io enables operators to run heterogeneous AI infrastructure as a unified platform, maintaining high utilization, predictable performance, and resilience while preserving the flexibility to evolve hardware and improve the economics of large-scale AI deployments.

To learn more about the launch of TorchPass, visit the Clockwork.io team in person at NVIDIA GTC from March 16-19, Booth #205, or visit https://clockwork.io.

About Clockwork.io
Clockwork.io pioneers Software-Driven AI Fabrics™, delivering a programmable software layer that makes large-scale AI clusters observable, deterministic, and resilient by design to drive continuous workload progress and peak cluster utilization. Its FleetIQ platform enables enterprises to train, deploy, and serve the world’s most demanding AI workloads faster, more reliably, and at lower cost. Companies including Uber, Wells Fargo, DCAI, Nebius, Nscale, and White Fiber trust Clockwork.io to power their AI infrastructure. Learn more at www.clockwork.io.

Media Contact
Dana Trismen
clockwork@unshakablemarketinggroup.com
650-269-7478

SOURCE: Clockwork
