This page describes a reference architecture for a W&B deployment and outlines the recommended infrastructure and resources to support a production deployment of the platform. Depending on your chosen deployment environment for W&B, various services can help to enhance the resiliency of your deployment. For instance, major cloud providers offer robust managed database services which help to reduce the complexity of database configuration, maintenance, high availability, and resilience. This reference architecture addresses some common deployment scenarios and shows how you can integrate your W&B deployment with cloud vendor services for optimal performance and reliability.

Before you start

Running any application in production comes with its own set of challenges, and W&B is no exception. While we aim to streamline the process, certain complexities may arise depending on your unique architecture and design decisions. Typically, managing a production deployment involves overseeing various components, including hardware, operating systems, networking, storage, security, the W&B platform itself, and other dependencies. This responsibility extends to both the initial setup of the environment and its ongoing maintenance. Consider carefully whether a Self-Managed approach with W&B is suitable for your team and specific requirements. A strong understanding of how to run and maintain production-grade applications is an important prerequisite before you deploy Self-Managed W&B. If your team needs assistance, our Professional Services team and partners offer support for implementation and optimization. To learn more about managed solutions for running W&B instead of managing it yourself, refer to W&B Multi-tenant Cloud and W&B Dedicated Cloud.

Infrastructure

W&B infrastructure diagram

Application layer

The application layer consists of a multi-node Kubernetes cluster, with resilience against node failures. The Kubernetes cluster runs and maintains W&B’s pods.

Storage layer

The storage layer consists of a MySQL database and object storage. The MySQL database stores metadata and the object storage stores artifacts such as models and datasets.

Infrastructure requirements

Kubernetes

The W&B Server application is deployed as a Kubernetes Operator that deploys multiple pods. For this reason, W&B requires a Kubernetes cluster with:
  • A fully configured and functioning Ingress controller.
  • The capability to provision Persistent Volumes.
W&B supports deployment on OpenShift Kubernetes clusters in cloud, on-premises, and air-gapped environments. For specific configuration instructions, see the OpenShift section in the Operator guide.

MySQL

W&B stores metadata in a MySQL database. The database’s performance and storage requirements depend on the shapes of the model parameters and related metadata. For example, the database grows in size as you track more training runs, and load on the database increases based on queries in run tables, user workspaces, and reports. W&B strongly recommends using managed database services (such as AWS RDS Aurora MySQL, Google Cloud SQL for MySQL, or Azure Database for MySQL) for production deployments. Managed services provide automated backups, monitoring, high availability, patching, and significantly reduce operational complexity. See the Cloud provider instance recommendations section below for specific service recommendations. If you choose to deploy a self-managed MySQL database, consider the following:
  • Backups: You should periodically back up the database to a separate facility. W&B recommends daily backups with at least 1 week of retention.
  • Performance: The disk the server is running on should be fast. W&B recommends running the database on an SSD or accelerated NAS.
  • Monitoring: The database should be monitored for load. If CPU usage stays above 40% for more than 5 minutes, the server is likely resource starved.
  • Availability: To meet your availability and durability requirements, W&B recommends configuring a hot standby Server deployment on a separate machine that streams all updates in real time from the primary deployment and is ready to fail over if the primary server crashes, becomes corrupted, or experiences sustained downtime. Note that W&B does not support a multi-master topology or read-only replicas.
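
The monitoring guideline above can be expressed as a small check. The following is a minimal sketch, not part of W&B: the function name and the `(timestamp, cpu_percent)` sample format are assumptions, and how you collect the samples depends on your monitoring stack.

```python
from datetime import datetime, timedelta

CPU_THRESHOLD_PERCENT = 40.0             # threshold from the guideline above
SUSTAINED_WINDOW = timedelta(minutes=5)  # duration from the guideline above


def is_resource_starved(samples):
    """Return True if CPU stays above the threshold for more than the window.

    `samples` is an iterable of (timestamp, cpu_percent) pairs in
    chronological order (a hypothetical format chosen for this sketch).
    """
    streak_start = None  # when the current above-threshold run began
    for timestamp, cpu_percent in samples:
        if cpu_percent > CPU_THRESHOLD_PERCENT:
            if streak_start is None:
                streak_start = timestamp
            if timestamp - streak_start > SUSTAINED_WINDOW:
                return True
        else:
            # Any sample at or below the threshold resets the run.
            streak_start = None
    return False


# Usage: seven one-minute samples at 55% CPU span six minutes above threshold.
start = datetime(2024, 1, 1, 12, 0)
busy = [(start + timedelta(minutes=m), 55.0) for m in range(7)]
print(is_resource_starved(busy))
```

A single dip to or below 40% resets the run, so short spikes do not trigger the check.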

MySQL database creation

For instructions to manually create the MySQL database and user, see the bare-metal guide MySQL database section.

MySQL configuration parameters

If you are running your own MySQL instance, configure MySQL with these settings:
binlog_format = 'ROW'
binlog_row_image = 'MINIMAL'
innodb_flush_log_at_trx_commit = 1
innodb_online_alter_log_max_size = 268435456
max_prepared_stmt_count = 1048576
sort_buffer_size = '67108864'
sync_binlog = 1
These settings have been validated by W&B for optimal performance and reliability.
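
To confirm that a self-managed instance matches these values, you could compare them against the output of MySQL's `SHOW VARIABLES` statement. The sketch below assumes you have already loaded that output into a dictionary. The function name and the dictionary format are illustrative, not a W&B tool.

```python
# Recommended values copied from the settings listed above, kept as strings
# because SHOW VARIABLES returns every value as text.
RECOMMENDED_MYSQL_SETTINGS = {
    "binlog_format": "ROW",
    "binlog_row_image": "MINIMAL",
    "innodb_flush_log_at_trx_commit": "1",
    "innodb_online_alter_log_max_size": "268435456",
    "max_prepared_stmt_count": "1048576",
    "sort_buffer_size": "67108864",
    "sync_binlog": "1",
}


def find_mismatched_settings(actual):
    """Return {name: (expected, observed)} for settings that differ.

    `actual` maps variable names to observed values (hypothetical input,
    for example rows of SHOW VARIABLES collected by your own tooling).
    A missing variable is reported with an observed value of None.
    """
    mismatches = {}
    for name, expected in RECOMMENDED_MYSQL_SETTINGS.items():
        observed = actual.get(name)
        observed_text = None if observed is None else str(observed)
        # Compare case-insensitively so "row" and "ROW" are treated alike.
        if observed_text is None or observed_text.upper() != expected.upper():
            mismatches[name] = (expected, observed_text)
    return mismatches


# Usage: one value differs, every other value matches.
observed = dict(RECOMMENDED_MYSQL_SETTINGS, sync_binlog="0")
print(find_mismatched_settings(observed))
```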

Redis

W&B depends on a single-node Redis 7.x deployment used by W&B’s components for job queuing and data caching. For convenience during testing and development of proofs of concept, W&B Self-Managed includes a local Redis deployment that is not appropriate for production deployments. W&B can connect to a Redis instance in the following environments:

Object storage

W&B requires object storage with pre-signed URL and CORS support, deployed in one of:
  • CoreWeave AI Object Storage is a high-performance, S3-compatible object storage service optimized for AI workloads.
  • Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance.
  • Google Cloud Storage is a managed service for storing unstructured data at scale.
  • Azure Blob Storage is a cloud-based object storage solution for storing massive amounts of unstructured data like text, binary data, images, videos, and logs.
  • S3-compatible storage such as MinIO Enterprise (AIStor), NetApp StorageGRID, or other enterprise-grade solutions hosted in your cloud or on-premises infrastructure.

Versions

| Software | Minimum version |
| --- | --- |
| Kubernetes | v1.32 or newer (Supported Kubernetes versions) |
| Helm | v3.x |
| MySQL | v8.0.x is required, v8.0.32 or newer; v8.0.44 or newer is recommended. Aurora MySQL 3.x releases must be v3.05.2 or newer. |
| Redis | v7.x |
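
As an illustration of the MySQL version rule above, the following sketch classifies a version string such as the one returned by `SELECT VERSION()`. The function names and the three labels are assumptions made for this example. It covers community MySQL version numbers only, not Aurora's separate numbering.

```python
import re

MYSQL_MINIMUM = (8, 0, 32)      # lowest supported 8.0.x release per the table
MYSQL_RECOMMENDED = (8, 0, 44)  # recommended floor per the table


def parse_version(text):
    """Extract (major, minor, patch) from strings like 'v8.0.36' or '8.0.36-log'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.groups())


def classify_mysql_version(text):
    """Return 'unsupported', 'supported', or 'recommended' (labels are illustrative)."""
    version = parse_version(text)
    # Only the 8.0.x series qualifies, and only from 8.0.32 upward.
    if version[:2] != (8, 0) or version < MYSQL_MINIMUM:
        return "unsupported"
    if version < MYSQL_RECOMMENDED:
        return "supported"
    return "recommended"


# Usage
for candidate in ("8.0.31", "8.0.36", "8.0.44"):
    print(candidate, classify_mysql_version(candidate))
```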

Networking

For a networked deployment, egress to these endpoints is required during both installation and runtime:
Additional container registries may be required depending on your deployment configuration:
  • https://gcr.io is needed when deploying Bufstream and etcd for Weave online evaluations.
To learn about air-gapped deployments, refer to Kubernetes operator for air-gapped instances. Access to W&B and to the object storage is required for the training infrastructure and for each system that tracks experiments.

DNS

The fully qualified domain name (FQDN) of the W&B deployment must resolve to the IP address of the ingress/load balancer using an A record.
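
A quick way to sanity-check this requirement is to resolve the FQDN and compare the result with the load balancer address. The sketch below is an assumption-laden helper, not a W&B utility: the function name and the injectable `resolver` parameter are illustrative. `socket.gethostbyname_ex` returns IPv4 addresses and follows CNAMEs, so a match does not by itself prove that the name carries an A record directly.

```python
import socket


def fqdn_resolves_to(fqdn, expected_ip, resolver=socket.gethostbyname_ex):
    """Return True if `fqdn` resolves to `expected_ip` (IPv4).

    `resolver` must return (hostname, aliases, addresses), as
    socket.gethostbyname_ex does; it is injectable so the check can be
    exercised without real DNS.
    """
    try:
        _hostname, _aliases, addresses = resolver(fqdn)
    except (socket.gaierror, socket.herror):
        # The name does not resolve at all.
        return False
    return expected_ip in addresses


# Usage with a stand-in resolver; wandb.example.com and 203.0.113.10 are
# placeholder values, not real endpoints.
def fake_resolver(name):
    return (name, [], ["203.0.113.10"])


print(fqdn_resolves_to("wandb.example.com", "203.0.113.10", resolver=fake_resolver))
```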

Load balancer and ingress

The W&B Kubernetes Operator exposes services using a Kubernetes ingress controller, which routes to service endpoints based on URL paths with different ports. The ingress controller must be accessible by all machines that execute machine learning payloads or access the service through web browsers.

Ingress controller requirements

Your Kubernetes cluster must have an IngressClass available. Common ingress controller options include:

W&B service routing

The W&B Operator automatically routes requests to multiple backend services based on path:
| Path | Service | Default port | Purpose |
| --- | --- | --- | --- |
| / | wandb-app | 8080 | Main web application UI |
| /api | wandb-api | 8081 | API service |
| /graphql | wandb-api | 8081 | GraphQL API endpoint |
| /graphql2 | wandb-api | 8081 | GraphQL API v2 endpoint |
| /console | wandb-console | 8082 | System Console |
| /traces | wandb-weave-trace | 8722 | Weave tracing service (if enabled) |
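
To make the routing table concrete, the sketch below models how a request path maps to a backend under Kubernetes `Prefix` path matching, which compares paths element by element and prefers the longest match. It is an illustration of the table only. The `resolve_backend` function is hypothetical and is not how the Operator or any ingress controller is implemented.

```python
# Routes copied from the table above: path prefix -> (service, port).
ROUTES = {
    "/": ("wandb-app", 8080),
    "/api": ("wandb-api", 8081),
    "/graphql": ("wandb-api", 8081),
    "/graphql2": ("wandb-api", 8081),
    "/console": ("wandb-console", 8082),
    "/traces": ("wandb-weave-trace", 8722),
}


def _elements(path):
    """Split a URL path into its non-empty '/'-separated elements."""
    return [part for part in path.split("/") if part]


def resolve_backend(request_path):
    """Return the (service, port) whose prefix matches the most path elements."""
    request_parts = _elements(request_path)
    best_backend = None
    best_length = -1
    for prefix, backend in ROUTES.items():
        prefix_parts = _elements(prefix)
        matches = request_parts[: len(prefix_parts)] == prefix_parts
        if matches and len(prefix_parts) > best_length:
            best_backend, best_length = backend, len(prefix_parts)
    return best_backend  # "/" matches everything, so this is never None


# Usage
for path in ("/", "/api/v1/runs", "/graphql2", "/console/settings"):
    print(path, resolve_backend(path))
```

Because matching is per path element, a request to `/apix` does not match the `/api` prefix and falls through to the `/` route.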

Example ingress configuration

The following shows an example ingress resource created by the W&B Operator:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: wandb
  namespace: wandb
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "0"
spec:
  ingressClassName: nginx
  rules:
  - host: wandb.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: wandb-app
            port:
              number: 8080
      - path: /api
        pathType: Prefix
        backend:
          service:
            name: wandb-api
            port:
              number: 8081
      - path: /graphql
        pathType: Prefix
        backend:
          service:
            name: wandb-api
            port:
              number: 8081
      - path: /graphql2
        pathType: Prefix
        backend:
          service:
            name: wandb-api
            port:
              number: 8081
      - path: /console
        pathType: Prefix
        backend:
          service:
            name: wandb-console
            port:
              number: 8082
  tls:
  - hosts:
    - wandb.example.com
    secretName: wandb-tls
The W&B Operator creates and manages the ingress configuration automatically. You typically do not need to create ingress resources manually. Ensure your cluster has a functioning ingress controller and the appropriate IngressClass configured.

SSL/TLS

W&B requires a valid signed SSL/TLS certificate for secure communication between clients and the server. SSL/TLS termination must occur on the ingress/load balancer. The W&B Server application does not terminate SSL or TLS connections. Important: W&B does not support self-signed certificates or custom CAs. Using self-signed certificates causes problems for users. If possible, use a service like Let’s Encrypt to provide trusted certificates to your load balancer. Services like Caddy and Cloudflare manage SSL for you. If your security policies require SSL communication within your trusted networks, consider using a tool like Istio with sidecar containers.
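
One way to see whether clients will trust the certificate presented by your load balancer is to attempt a TLS handshake with Python's default verification settings. The following is a minimal sketch. The function names are illustrative, and the result reflects the trust store of the machine running the check, which may differ from the trust store on your training machines.

```python
import socket
import ssl


def build_client_context():
    """Create a context that verifies the certificate chain and the hostname."""
    # create_default_context() loads the system trust store and enables
    # CERT_REQUIRED plus hostname checking.
    return ssl.create_default_context()


def certificate_is_trusted(host, port=443, timeout=5.0):
    """Return True if the TLS handshake with host:port passes verification.

    Self-signed certificates and certificates from a CA missing in the
    local trust store fail verification and return False. Network errors
    such as refused connections are not caught and propagate to the caller.
    """
    context = build_client_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host):
                return True
    except ssl.SSLCertVerificationError:
        return False
```

For example, `certificate_is_trusted("wandb.example.com")` would check the placeholder hostname used elsewhere on this page.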

Supported CPU architectures

W&B runs on Intel and AMD 64-bit architecture. ARM is not supported.

Deployment method

Recommended: W&B Kubernetes Operator with Helm

The recommended installation method for W&B Self-Managed is using the W&B Kubernetes Operator, deployed via Helm. This approach provides:
  • Automated updates and management of W&B components
  • Simplified configuration and deployment
  • Support for all deployment scenarios (cloud, on-premises, air-gapped)
For detailed installation instructions, see:

Infrastructure provisioning

Terraform is the recommended way to provision infrastructure for W&B production deployments. Using Terraform, you define the required resources, their references to other resources, and their dependencies. W&B provides Terraform modules for the major cloud providers. For details, refer to Deploy W&B Server within Self-Managed cloud accounts.

Sizing

Use the following general guidelines as a starting point when planning a deployment. W&B recommends that you monitor all components of a new deployment closely and that you make adjustments based on observed usage patterns. Continue to monitor production deployments over time and make adjustments as needed to maintain optimal performance.

Models only

Kubernetes

| Environment | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Test/Dev | 2 cores | 16 GB | 100 GB |
| Production | 8 cores | 64 GB | 100 GB |
Numbers are per Kubernetes worker node.

MySQL

| Environment | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Test/Dev | 2 cores | 16 GB | 100 GB |
| Production | 8 cores | 64 GB | 500 GB |
Numbers are per MySQL node.

Weave only

Kubernetes

| Environment | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Test/Dev | 4 cores | 32 GB | 100 GB |
| Production | 12 cores | 96 GB | 100 GB |
Numbers are per Kubernetes worker node.

MySQL

| Environment | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Test/Dev | 2 cores | 16 GB | 100 GB |
| Production | 8 cores | 64 GB | 500 GB |
Numbers are per MySQL node.

Models and Weave

Kubernetes

| Environment | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Test/Dev | 4 cores | 32 GB | 100 GB |
| Production | 16 cores | 128 GB | 100 GB |
Numbers are per Kubernetes worker node.

MySQL

| Environment | CPU | Memory | Disk |
| --- | --- | --- | --- |
| Test/Dev | 2 cores | 16 GB | 100 GB |
| Production | 8 cores | 64 GB | 500 GB |
Numbers are per MySQL node.
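
If you want to check existing nodes against these guidelines programmatically, the sizing figures can be captured as data. The sketch below is one possible encoding. The `NodeSize` type, the workload and environment keys, and the helper function are assumptions made for this example, and the numbers are the starting points from the tables above rather than hard limits.

```python
from typing import NamedTuple


class NodeSize(NamedTuple):
    cpu_cores: int
    memory_gb: int
    disk_gb: int


# Per Kubernetes worker node, keyed by (workload, environment).
K8S_NODE_SIZING = {
    ("models", "test"): NodeSize(2, 16, 100),
    ("models", "production"): NodeSize(8, 64, 100),
    ("weave", "test"): NodeSize(4, 32, 100),
    ("weave", "production"): NodeSize(12, 96, 100),
    ("models+weave", "test"): NodeSize(4, 32, 100),
    ("models+weave", "production"): NodeSize(16, 128, 100),
}

# Per MySQL node; the guideline is the same for every workload.
MYSQL_NODE_SIZING = {
    "test": NodeSize(2, 16, 100),
    "production": NodeSize(8, 64, 500),
}


def node_meets_guideline(actual, workload, environment):
    """Return True if `actual` is at least the guideline in every dimension."""
    target = K8S_NODE_SIZING[(workload, environment)]
    return all(have >= need for have, need in zip(actual, target))


# Usage: a node sized for Models-only production is too small for Weave.
node = NodeSize(cpu_cores=8, memory_gb=64, disk_gb=200)
print(node_meets_guideline(node, "models", "production"))
print(node_meets_guideline(node, "weave", "production"))
```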

Cloud provider instance recommendations

Services

| Cloud | Kubernetes | MySQL | Object Storage |
| --- | --- | --- | --- |
| AWS | EKS | RDS Aurora | S3 |
| Google Cloud | GKE | Google Cloud SQL for MySQL | Google Cloud Storage (GCS) |
| Azure | AKS | Azure Database for MySQL | Azure Blob Storage |

Machine types

These recommendations apply to each node of a Self-Managed deployment of W&B in cloud infrastructure.

AWS

| Environment | K8s (Models only) | K8s (Weave only) | K8s (Models and Weave) | MySQL |
| --- | --- | --- | --- | --- |
| Test/Dev | r6i.large | r6i.xlarge | r6i.xlarge | db.r6g.large |
| Production | r6i.2xlarge | r6i.4xlarge | r6i.4xlarge | db.r6g.2xlarge |

Google Cloud

| Environment | K8s (Models only) | K8s (Weave only) | K8s (Models and Weave) | MySQL |
| --- | --- | --- | --- | --- |
| Test/Dev | n2-highmem-2 | n2-highmem-4 | n2-highmem-4 | db-n1-highmem-2 |
| Production | n2-highmem-8 | n2-highmem-16 | n2-highmem-16 | db-n1-highmem-8 |

Azure

| Environment | K8s (Models only) | K8s (Weave only) | K8s (Models and Weave) | MySQL |
| --- | --- | --- | --- | --- |
| Test/Dev | Standard_E2_v5 | Standard_E4_v5 | Standard_E4_v5 | MO_Standard_E2ds_v4 |
| Production | Standard_E8_v5 | Standard_E16_v5 | Standard_E16_v5 | MO_Standard_E8ds_v4 |