Published on August 22, 2024
GPU Utilization is a Misleading Metric
Tags: Cluster-Management, Logging, Cluster, GPUs
Most ML teams use GPU Utilization as their main performance metric, but we found this can be quite misleading.

Published on July 11, 2024
Automatic GPU Node Health and Pod Scheduling
Tags: Cluster, Cluster-Management, GPUs, Logging, Metrics
Automatically isolate faulty nodes and schedule only on healthy ones.

Published on June 15, 2024
Zero Instrumentation GPU Metrics with DCGM on Trainy: Konduktor
Tags: Training, Clusters, GPUs
Monitor your GPU clusters with metrics.

Published on June 7, 2024
GPU Fabric Preflight Checks for MultiNode Training
Tags: Training
Run these benchmarks before training on multiple machines.

Published on November 21, 2023
Function Calling with Gorilla LLM
Tags: Serving, Function-Calling
Host your own function calling LLM.