One platform, 1,930 algorithms, every backend: how we built the complete stack for discrete AI
Your AI development stack is broken. PyTorch for research, TensorFlow for deployment, CUDA for NVIDIA, ROCm for AMD, separate tools for mobile, FPGA, and browser. Each backend needs custom optimization. Each framework has its own API. Dweve Core replaces all of it with a single declarative platform: 1,930 algorithms, 6 backends, one codebase.
The fragmentation problem
Your AI development environment is fragmented.
You prototype in PyTorch because researchers prefer it. You deploy with TensorFlow because production teams want Google's tooling. You write CUDA kernels for NVIDIA GPUs. You port to ROCm for AMD hardware. You rewrite everything for mobile using TensorFlow Lite or Core ML. You use ONNX to convert between frameworks, hoping nothing breaks. You maintain separate codebases for cloud, edge, and browser deployment.
Ten different tools. Thousands of dependencies. Version compatibility nightmares. Breaking changes every release cycle.
Update PyTorch? Hope your CUDA version matches. Want to deploy on AMD? Rewrite your kernels. Need browser inference? Start over with WebAssembly. Switching from NVIDIA to AMD GPUs? Good luck porting that codebase. Deploying on FPGAs? Learn an entirely different toolchain.
This fragmentation is not an accident. It's the natural result of each framework optimizing for its specific use case while ignoring interoperability. PyTorch excels at research but treats deployment as an afterthought. TensorFlow targets production but the research experience is painful. CUDA locks you into NVIDIA hardware. Each tool solves one problem while creating three more.
There's a better way.
Dweve Core: The complete platform for discrete AI
Dweve Core is a unified platform for building discrete neural networks from binary to 8-bit precision. It's not another framework. It's a complete replacement for your fragmented stack.
One installation. One API. One codebase. Automatic deployment to CPU, GPU, FPGA, or WebAssembly: anywhere you need to run.
Here's what complete means:
1,930 algorithms covering every operation you need:
- 415 primitives: Atomic operations like XNOR, popcount, bit manipulation, quantization, conversion between formats
- 500 kernels: Optimized composite operations for common patterns
- 191 layers: Complete neural network building blocks (convolution, dense, normalization, activation, attention)
- 674 algorithms: High-level methods including transforms, training procedures, evolutionary search, knowledge distillation
- 30 interop utilities: Optional floating-point bridges for hybrid approaches
- 120 model architectures: Pre-optimized network templates from ResNets to Transformers
This isn't a subset. It's mathematical completeness. We analyzed every major neural network architecture and built every operation they require, optimized for discrete computation from first principles.
6 backends with automatic compilation:
- SIMD CPU: Hand-optimized kernels for x86 and ARM with automatic ISA detection (SSE2, AVX2, AVX-512, NEON, SVE/SVE2)
- CUDA: NVIDIA GPU optimization with warp-level primitives and Tensor Core utilization
- Rust-HDL: Direct FPGA and ASIC synthesis from algorithm descriptions
- WebAssembly: Browser-based inference with SIMD128 support for GDPR-compliant on-device processing
- ROCm: AMD GPU optimization with wavefront-level operations
- Metal: Apple Silicon optimization using unified memory architecture
6 bit-widths with adaptive precision:
- Binary (1-bit): Maximum efficiency with 16× compression versus FP16
- Ternary: {-1, 0, +1} with explicit sparsity
- 2-bit: Four levels for balanced compression
- 3-bit: Eight levels for quality-sensitive layers
- 4-bit: Sixteen levels approaching FP16 quality
- 8-bit: Near-full-precision for critical operations
The platform learns optimal bit-width per layer during training through gradient-based selection. Not heuristics. Not guesswork. Actual optimization based on how accuracy improves with added precision.
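The exact mechanism isn't shown here, but one common way to make bit-width selection gradient-friendly looks like the sketch below (the struct, field, and penalty names are illustrative, not Dweve Core's API): each layer keeps a continuous precision parameter, the forward pass snaps it to a supported width, and a straight-through gradient plus a per-bit cost term updates it during training.

// Sketch of differentiable bit-width selection (illustrative names, not Dweve Core's API).
struct LayerPrecision {
    soft_bits: f32, // learned continuous precision, e.g. initialised around 4.0
}

impl LayerPrecision {
    // Width actually used by the quantizer in the forward pass,
    // snapped to the platform's supported precisions: 1, 2, 3, 4, or 8 bits.
    fn forward_bits(&self) -> u32 {
        let b = self.soft_bits.round().clamp(1.0, 8.0) as u32;
        if b > 4 { 8 } else { b }
    }

    // d_loss_d_bits is a straight-through gradient of the task loss with respect to the
    // (rounded) bit-width; lambda prices each extra bit so the optimum balances accuracy
    // against precision cost.
    fn update(&mut self, d_loss_d_bits: f32, lambda: f32, lr: f32) {
        self.soft_bits = (self.soft_bits - lr * (d_loss_d_bits + lambda)).clamp(1.0, 8.0);
    }
}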
Write once, deploy everywhere
The platform uses a declarative Rust API. You describe what you want, and the compiler figures out how to run it optimally on your target hardware.
Example network definition:
let model = NetworkBuilder::new()
    .input(BinaryTensor::new([1024, 784]))      // packed binary input, shape [1024, 784]
    .dense(784, 512, BinaryActivation::Sign)    // binary dense layers with sign activations
    .dense(512, 256, BinaryActivation::Sign)
    .output(256, 10)
    .build();

That's it. Write this once, and the compiler automatically generates optimized implementations for every backend.
The compilation pipeline works through four levels:
Level 1: Neural Network IR - Your high-level network description gets parsed into a computational graph with operations, data flow, and hyperparameters.
Level 2: Graph Optimization - Standard compiler passes eliminate dead code, fold constants, deduplicate expressions, and fuse sequential operations. A dense layer followed by batch normalization and activation becomes a single fused kernel that loads input once and produces final output.
Level 3: BitOps Dialect - The graph lowers to bit-level operations explicitly typed with precision. This intermediate representation is hardware-agnostic but close to actual machine operations. Built on MLIR (Multi-Level Intermediate Representation) infrastructure for industrial-strength optimization.
Level 4: Hardware Lowering - Final code generation produces platform-specific implementations. For AVX-512, bit operations become VPXORQ and VPOPCNTQ instructions. For CUDA, they become warp-level intrinsics with coalesced memory access. For FPGA, they become XNOR gates and adder trees synthesized to Verilog.
The same source code runs on your laptop CPU during development, deploys to cloud GPUs in production, and compiles to FPGAs for deterministic real-time inference. No translation. No porting. No platform-specific code.
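To make the Level 2 fusion concrete: for a binary dense layer, batch normalization followed by a sign activation reduces to a per-neuron threshold test on the XNOR-popcount result, so all three operations can collapse into one pass over the packed weights. The hand-written sketch below illustrates the shape of such a fused kernel; it is not the code the compiler actually emits.

// Minimal sketch of a fused binary-dense + batchnorm + sign kernel.
// Weights and inputs are packed 64 values per u64 word; names are illustrative.
fn fused_dense_bn_sign(
    input: &[u64],          // packed binary input row
    weights: &[Vec<u64>],   // one packed weight row per output neuron
    thresholds: &[f32],     // per-neuron threshold folded from the BN parameters
                            // (assumes a positive BN scale; a negative scale flips the test)
) -> Vec<u64> {
    let n_out = weights.len();
    let mut out = vec![0u64; (n_out + 63) / 64];
    for (j, row) in weights.iter().enumerate() {
        // XNOR-popcount dot product: matches minus mismatches.
        let matches: u32 = row
            .iter()
            .zip(input)
            .map(|(w, x)| (!(w ^ x)).count_ones())
            .sum();
        let total_bits = (row.len() * 64) as i32;
        let dot = 2 * matches as i32 - total_bits;
        // BatchNorm + sign collapses into a single comparison against the folded threshold.
        if dot as f32 >= thresholds[j] {
            out[j / 64] |= 1 << (j % 64);
        }
    }
    out
}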
Built for customers to build with
Dweve Core powers Dweve Loom, our constraint-based reasoning system. But it's not just for us. It's a complete platform for anyone building discrete neural networks.
You can build:
- Custom architectures using the 191 layer types and 674 algorithms
- Domain-specific models optimized for your exact requirements
- Hybrid approaches mixing discrete and continuous computation via the 30 interop utilities
- Novel training methods using evolutionary search, knowledge distillation, or custom gradient estimators
The platform provides:
Complete training infrastructure: Six straight-through estimator variants for binary/ternary gradient flow. Binary-aware optimizers that maintain full-precision weights internally while binarizing for forward passes. Automatic bit-width selection through gradient-based optimization. Progressive multi-stage distillation from FP32 through INT8, INT4, ternary, to binary.
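For readers new to straight-through estimators, the core trick is small. A generic clipped-STE sketch (not one of the six shipped variants): binarize in the forward pass, and in the backward pass let the gradient through only where the latent full-precision weight sits inside the clipping range.

// Generic clipped straight-through estimator sketch, not Dweve Core's API.
// Forward: binarize the latent full-precision weight.
fn ste_forward(w: f32) -> f32 {
    if w >= 0.0 { 1.0 } else { -1.0 }
}

// Backward: pass the gradient through only where |w| <= 1, otherwise block it.
fn ste_backward(w: f32, upstream_grad: f32) -> f32 {
    if w.abs() <= 1.0 { upstream_grad } else { 0.0 }
}

// Latent weights stay in full precision; only the forward pass sees binary values.
fn ste_update(w: &mut f32, upstream_grad: f32, lr: f32) {
    *w -= lr * ste_backward(*w, upstream_grad);
}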
Automatic hardware optimization: Runtime detection of CPU capabilities (CPUID on x86, system registers on ARM) and automatic dispatch to fastest available SIMD implementation. GPU kernel variants selected based on warp size, shared memory, and register count. FPGA synthesis with automatic pipeline register insertion based on timing constraints.
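Conceptually, the dispatch looks like the sketch below, which uses Rust's standard std::arch::is_x86_feature_detected! macro. The AVX kernel bodies are placeholders that fall back to the scalar loop, standing in for real intrinsics-based implementations; none of the names are Dweve Core's actual symbols.

// Sketch of runtime SIMD dispatch; the AVX kernels are stand-ins for core::arch code.
fn popcount_scalar(words: &[u64]) -> u64 {
    words.iter().map(|w| w.count_ones() as u64).sum()
}

fn popcount_avx2(words: &[u64]) -> u64 {
    // Placeholder body; a real kernel would process 256-bit lanes via core::arch::x86_64.
    popcount_scalar(words)
}

fn popcount_avx512(words: &[u64]) -> u64 {
    // Placeholder body; a real kernel would issue VPOPCNTQ via core::arch::x86_64.
    popcount_scalar(words)
}

fn select_popcount_kernel() -> fn(&[u64]) -> u64 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512vpopcntdq") {
            return popcount_avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return popcount_avx2;
        }
    }
    popcount_scalar // portable fallback, available everywhere
}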
Flexible quantization: Symmetric and asymmetric quantization with per-tensor, per-channel, or per-group scales. Dynamic quantization with runtime range detection or static quantization with pre-computed scales from calibration data. MSE-optimal and KL-divergence-based scale computation for minimal accuracy loss.
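To ground the terminology, the simplest of these variants, symmetric per-tensor INT8 with max-abs calibration, fits in a few lines (a generic sketch, not the platform's implementation; the MSE-optimal and KL-based options replace only the scale computation, not the quantize/dequantize mechanics).

// Simplified symmetric per-tensor INT8 quantization with max-abs calibration.
fn quantize_symmetric_int8(values: &[f32]) -> (Vec<i8>, f32) {
    // Calibrate: one scale for the whole tensor from its absolute maximum.
    let max_abs = values.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let scale = if max_abs > 0.0 { max_abs / 127.0 } else { 1.0 };
    let quantized = values
        .iter()
        .map(|v| (v / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (quantized, scale)
}

fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&x| x as f32 * scale).collect()
}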
Why discrete computation matters
Traditional neural networks compute everything in 16-bit or 32-bit floating point, then make discrete decisions at the end. We operate directly in discrete space from binary to 8-bit, eliminating intermediate continuous computation.
This isn't simple quantization of existing models. The framework is architected from first principles around discrete operations:
Hardware reality: Modern processors consist of billions of transistors in two states. An XNOR gate requires 6 transistors. A 32-bit floating-point multiplier requires thousands and consumes orders of magnitude more energy. Discrete operations align with hardware fundamentals.
Memory efficiency: Binary weights pack 64 values per 64-bit word. A ResNet-50 with 25.6 million parameters occupies 3.1MB in binary versus 50MB in FP16. The entire model fits in CPU L3 cache (typical: 36-64MB). You become compute-bound instead of memory-bound.
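The packing itself is simple, which is part of why binary inference is fast: 25.6 million 1-bit weights occupy roughly 3 MB once packed 64 per word, against roughly 50 MB at 2 bytes per weight in FP16. A sketch (bit order and padding are illustrative, not necessarily the platform's layout):

// Pack binary weights (+1 -> bit set, -1 -> bit clear), 64 per u64 word.
fn pack_weights(signs: &[i8]) -> Vec<u64> {
    signs
        .chunks(64)
        .map(|chunk| {
            chunk.iter().enumerate().fold(0u64, |word, (i, &s)| {
                if s > 0 { word | (1u64 << i) } else { word }
            })
        })
        .collect()
}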
Deployment flexibility: Small models enable on-device inference. No cloud dependency. No network latency. No data privacy concerns. Process sensitive information entirely on user devices without ever transmitting to servers.
The complete algorithm matrix
The platform provides comprehensive coverage across three dimensions: algorithms, backends, and bit-widths.
Fundamental operations (415 primitives):
- 46 bit operations: logic gates, manipulation, shifts, counting, finding
- 16 field operations: extract, deposit, pack, scatter, gather, mask creation
- 25 reductions: logical AND/OR/XOR, arithmetic sum/product, voting and consensus
- 17 distance metrics: Hamming, Jaccard, Dice, Tanimoto, cosine, Manhattan, Euclidean (two of these are sketched after this list)
- 12 interleaving ops: 2-way through 4-way interleave, Morton and Hilbert space-filling curves
- 44 arithmetic: addition, subtraction, multiplication, division, fixed-point operations
- 39 transforms: Walsh-Hadamard, FFT, DCT, NTT, wavelets (Haar, Daubechies, CDF97)
- 11 hashing: SHA-256, Blake3, xxHash, MinHash, SimHash, locality-sensitive hashing
- 22 random number generation: LFSR, Mersenne Twister, ChaCha20, Sobol sequences
- 15 quantization: symmetric, asymmetric, per-tensor, per-channel, scale computation
- 30 binary math: XNOR-popcount, ternary encoding, weight updates, gradient operations
- 48 format conversions: FP32/FP16/FP8/INT8/INT4/INT2 conversions in all directions
- 42 fixed-point operations: arithmetic, transcendentals, saturation across all bit-widths
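To make the granularity concrete, here are two of the distance metrics above hand-rolled over bit-packed vectors (sketches, not the platform's tuned SIMD implementations):

// Hamming and Jaccard distance over bit-packed binary vectors.
fn hamming_distance(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

fn jaccard_distance(a: &[u64], b: &[u64]) -> f32 {
    let intersection: u32 = a.iter().zip(b).map(|(x, y)| (x & y).count_ones()).sum();
    let union: u32 = a.iter().zip(b).map(|(x, y)| (x | y).count_ones()).sum();
    if union == 0 { 0.0 } else { 1.0 - intersection as f32 / union as f32 }
}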
Composite operations (500 kernels):
Optimized kernels combining primitives for common patterns. Matrix multiplication variants (standard, transposed, blocked). Convolution types (2D, 3D, depthwise, grouped, dilated). Normalization (batch, layer, group, instance). Activation functions (sign, hard tanh, piecewise linear). Attention mechanisms (self-attention, cross-attention, multi-head). Pooling (max, average, stochastic).
Network layers (191 layers):
Complete building blocks for constructing networks. Dense layers with binary, ternary, and multi-bit weights. Convolutional layers with all common variants. Recurrent layers (LSTM, GRU with binary gates). Attention layers (scaled dot-product, multi-head, relative position). Normalization layers with batch statistics and learned parameters. Residual connections with dimension matching.
High-level algorithms (674 algorithms):
Complete methods for training, inference, and optimization. Knowledge distillation with progressive multi-stage refinement. Evolutionary constraint discovery with genetic programming. Neural architecture search for discrete networks. Gradient estimators (straight-through, clipped, adaptive, momentum, hypernetwork). Optimizers (BinaryAdam, TernaryAdam, adaptive quantization). Distributed training with Byzantine-robust aggregation.
Backend implementation depth
Each algorithm exists in multiple optimized variants per backend. Not generic implementations. Hardware-specific code exploiting every architectural feature.
CPU SIMD: SSE2 provides universal x86-64 compatibility (every processor since 2001). AVX2 delivers 4-8× speedup on Haswell and newer (2013+). AVX-512 reaches 10-16× with mask registers for predication and VPTERNLOG for any 3-input Boolean function. NEON brings 3-4× speedup to all ARMv8 processors including mobile and Apple Silicon. SVE/SVE2 provides vector-length-agnostic code that automatically utilizes wider vectors on newer hardware.
CUDA: Warp-level primitives organize 32 threads executing in lockstep. Each thread processes 32 binary values packed in uint32. Full warp processes 1,024 binary values in parallel. Hardware intrinsics include __popc for population count, __ballot_sync for warp voting, __shfl_sync for fast communication without shared memory. Coalesced memory access ensures bandwidth utilization. Tensor Core utilization for matrix operations even with binary data.
Rust-HDL: Direct hardware synthesis from annotated Rust code. The framework generates Verilog/VHDL automatically. Binary XNOR-popcount operations map to XNOR gates (combinatorial logic, zero propagation delay) plus adder trees. Pipeline registers inserted automatically based on timing constraints. Synthesizes to both FPGAs (Xilinx, Intel) and ASICs.
WebAssembly: SIMD128 provides 128-bit vector operations in all modern browsers (Chrome 91+, Firefox 89+, Safari 16.4+). Operations include v128.and/or/xor for bitwise logic and i8x16.popcnt for population counting. Combined with Web Workers for multi-threading and SharedArrayBuffer for shared memory, achieves 60-80% of native CPU performance. On-device browser inference enables GDPR-compliant processing without server uploads.
ROCm: Wavefront-level optimization for AMD architectures with 64 threads per wavefront (double NVIDIA's 32). Each thread processes 32 binary values for 2,048 values per wavefront. Similar intrinsics to CUDA with __builtin_popcount, __ballot, and ds_swizzle. Programming model close enough that CUDA developers can write ROCm code immediately.
Metal: Apple Silicon optimization using unified memory architecture where CPU and GPU share physical RAM with cache coherency. Eliminates data copying overhead. Binary operations leverage Apple's custom matrix engines. M3 Max Neural Engine delivers 50-80 TOPS on binary inference using dedicated accelerators built into the SoC.
What we don't do (and why focus matters)
Important to clarify: we don't do everything. Focus enables excellence.
No floating-point inference: Discrete computation only from binary to 8-bit. If you need FP32/FP16/BFloat16 for deployment, use PyTorch or JAX instead. We optimize exclusively for discrete operations, enabling specializations impossible with mixed-precision floating-point. You can't be excellent at everything. We chose discrete AI and optimized ruthlessly.
No dynamic inference graphs: Models compile to static graphs for deployment. Training supports dynamic computation (necessary for research flexibility), but production inference is static. This enables ahead-of-time optimization: kernel fusion across entire network, memory layout optimization with known tensor shapes, prefetch instruction insertion with predictable access patterns.
Focused data preprocessing: We provide 8 specialized algorithms for neural network input preparation (normalization, adaptive scaling, learned quantization, binary augmentation), not general-purpose ETL. For feature engineering pipelines and data loading, use existing tools (Pandas, Polars, DuckDB). We're excellent at discrete neural network inference from binary to 8-bit. We're not replacing your entire data stack.
These aren't limitations. They're focus. By constraining scope to discrete neural networks with static inference graphs, we achieve optimization depth that comprehensive frameworks can't match.
Building on Dweve Core
The platform is ready to use. You can start building discrete neural networks today.
Complete toolchain:
- Declarative API: Rust DSL with NetworkBuilder for model definition
- Compiler infrastructure: MLIR-based optimization pipeline with four IR levels
- Training framework: Six STE variants, binary-aware optimizers, automatic bit-width selection
- Backend code generators: C with intrinsics for CPU, CUDA/HIP kernels for GPU, Verilog for FPGA
- Deployment tools: Export to ONNX, Core ML, TensorFlow Lite, or standalone binaries
Example workflow:
1. Define your network using NetworkBuilder
2. Train using binary-aware optimizers with adaptive bit-width selection
3. Compile to target hardware with automatic backend selection
4. Deploy as optimized binary or export to standard format
5. Run anywhere: cloud servers, edge devices, browsers, FPGAs
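In code, the workflow might look roughly like the sketch below. NetworkBuilder, BinaryTensor, BinaryActivation, and BinaryAdam appear elsewhere in this article; every other type and method name is a hypothetical placeholder used to illustrate the flow, not the actual API.

// Hypothetical end-to-end sketch; placeholder names marked in the comments.
let model = NetworkBuilder::new()
    .input(BinaryTensor::new([1024, 784]))
    .dense(784, 512, BinaryActivation::Sign)
    .output(512, 10)
    .build();

let trained = Trainer::new(model)                 // Trainer: placeholder
    .optimizer(BinaryAdam::default())             // binary-aware optimizer
    .adaptive_bit_width(true)                     // per-layer precision selection
    .fit(&train_data);                            // placeholder training call

let artifact = trained.compile(Target::detect()); // placeholder backend selection
artifact.save("model.bin");                       // or export to ONNX / Core ML / TFLite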
One platform. One codebase. Every backend. Complete discrete AI.
Stop juggling ten frameworks. Stop rewriting code for each deployment target. Stop fighting version compatibility. Build once on Dweve Core and deploy everywhere.
Dweve Core powers Dweve Loom, our constraint-based reasoning system launching in 2026. The framework implements the complete 1,930-algorithm stack across 6 backends with adaptive multi-bit quantization from binary to 8-bit. Built by a Dutch engineering team over three years of development.
About the Author
Marc Filipan
CTO & Co-Founder
Building the future of AI with binary neural networks and constraint-based reasoning. Passionate about making AI accessible, efficient, and truly intelligent.