Dweve Core Documentation
Complete documentation with 421+ pages across 27 categories will be available in all supported languages upon our public launch. This preview demonstrates the documentation structure and core concepts.
Dweve Core
A production-ready framework for building artificial intelligence systems using discrete computation, developed over three years by a Dutch engineering team.
What is Dweve Core?
Most artificial intelligence today runs on continuous mathematics: floating-point numbers, smooth gradients, differentiable functions. Dweve Core takes a radically different path. We build intelligence on discrete computation, using values from finite sets rather than the real number line. Binary. Ternary. 2-bit, 3-bit, 4-bit, 8-bit. Adaptive multi-bit switching between precision levels as needed. Why does this matter? Because intelligence in biology is fundamentally discrete. Neurons fire or don't fire. Synapses strengthen or weaken in quantized steps. The brain achieves staggering capability consuming just 20 watts, while modern AI devours kilowatts. Discrete computation aligns with how both biological intelligence and digital hardware actually work.
This isn't theoretical. Dweve Core provides 1,930 production-ready algorithms: 415 primitive operations forming the computational foundation; 500 kernels optimized for SIMD instruction sets (AVX-512, AVX2, NEON, SVE); 191 neural network layers spanning every major architecture; 674 constraint-solving algorithms across 46 categories (SAT, CSP, SMT, MaxSAT, ASP, ILP); 30 interop utilities bridging discrete and continuous representations; and 120 complete model architectures ready for deployment. The framework handles binary (1-bit), ternary (−1, 0, +1), 2-bit (quaternary), 3-bit, 4-bit, 8-bit, and adaptive multi-bit quantization where precision adjusts layer by layer based on sensitivity analysis. Everything runs on hardware you already own: x86 CPUs with AVX extensions, ARM processors with NEON or SVE, GPUs through CUDA or ROCm, FPGAs, even WebAssembly for browser deployment.
The efficiency gains are dramatic. A binary tensor packs 32 values where float32 stores one, delivering 32× memory compression. Cache utilization soars. Memory bandwidth pressure evaporates. Models that couldn't fit now run entirely in L3 cache. Inference throughput increases 10-100× over float32 implementations on the same silicon. Energy consumption drops proportionally. The same model that required cloud GPUs now runs on edge devices, mobile phones, embedded systems. This changes deployment economics fundamentally. But Dweve Core transcends quantization alone. The framework integrates six computational paradigms, each contributing unique capabilities: multi-bit neural networks, constraint-based reasoning, hyperdimensional computing, Tsetlin machines, cellular automata, and stochastic computing. These paradigms compose, enabling hybrid systems that combine their complementary strengths.
Why discrete computation?
Floating-point arithmetic emerged from the limitations of early computers, not the requirements of intelligence. Discrete computation aligns with three fundamental realities: biological precedent, hardware architecture, and energy physics. Biological neural networks operate through discrete events. Action potentials are all-or-nothing signals. Synaptic vesicles release neurotransmitters in integer quanta. Neural encoding uses spike timing and population codes, not continuous activation values. The brain achieves human-level intelligence consuming 20 watts because discrete computation fundamentally requires less energy than continuous computation at the same fidelity.
Hardware tells the same story. CPUs and GPUs excel at integer operations, bit manipulations, logical operations. Floating-point units, while fast, consume more power and silicon area than equivalent integer units. Memory systems transfer data in discrete blocks. Caches work at fixed granularity. Network packets carry integer byte counts. The entire computing stack from transistors through networks operates discretely. Forcing continuous mathematics onto discrete hardware introduces conversion overhead, numerical instability, and energy waste. Discrete computation removes this impedance mismatch.
The energy argument proves decisive at scale. Every bit flip, every addition, every memory access consumes power. Reducing operand precision from 32 bits to 1 bit cuts dynamic power consumption proportionally. Reducing memory footprint 32× means 32× fewer DRAM accesses, each of which costs thousands of picojoules. Fitting models in cache eliminates main memory traffic entirely. These savings compound. A model consuming kilowatts in float32 can run on watts when implemented discretely. When you need to deploy millions of inference endpoints, process billions of requests daily, or operate on battery power, energy efficiency stops being academic and becomes mission-critical. Discrete computation makes previously impossible deployments practical.
Core components
Dweve Core integrates six computational paradigms. Each offers distinct capabilities. Together they enable sophisticated hybrid systems combining continuous learning, discrete reasoning, symbolic constraints, and hardware-efficient inference.
Computational substrates
Multi-bit quantized neural networks form the primary learning substrate. The framework supports seven precision levels: binary (XNOR-popcount operations, maximum throughput, minimal memory), ternary (adding zero enables sparse activation, learned sparsity patterns), 2-bit (four discrete values, good expressiveness/efficiency balance), 3-bit (eight values, approaching float16 accuracy in many tasks), 4-bit (sixteen values, often matches float16 quality), 8-bit (256 values, near float32 quality, still 4× smaller), and adaptive multi-bit (per-layer or per-channel precision based on sensitivity analysis, optimal efficiency/accuracy tradeoffs). Straight-through estimators enable gradient flow during training. The evolution pipeline optimizes quantization thresholds, scale factors, and clipping ranges during or after training.
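To make the quantization step concrete, here is a minimal, dependency-free Rust sketch of threshold-based ternary quantization with a per-tensor scale. It is illustrative only: the function name, the 0.7 × mean|w| threshold heuristic, and the scale rule are assumptions drawn from the ternary-network literature, not the Dweve Core API.

```rust
/// Illustrative threshold-based ternary quantizer (a sketch, not the Dweve Core
/// API): each weight maps to {-1, 0, +1} plus one per-tensor scale, so the
/// dequantized value is approximately scale * code.
fn quantize_ternary(weights: &[f32]) -> (Vec<i8>, f32) {
    // Common heuristic: threshold at a fraction of the mean absolute weight.
    let mean_abs = weights.iter().map(|w| w.abs()).sum::<f32>() / weights.len() as f32;
    let threshold = 0.7 * mean_abs;

    let codes: Vec<i8> = weights
        .iter()
        .map(|&w| if w > threshold { 1 } else if w < -threshold { -1 } else { 0 })
        .collect();

    // Scale = mean magnitude of the weights that survived the threshold,
    // so that scale * code best matches the original values.
    let mut sum = 0.0f32;
    let mut kept = 0usize;
    for (w, c) in weights.iter().zip(codes.iter()) {
        if *c != 0 {
            sum += w.abs();
            kept += 1;
        }
    }
    let scale = if kept == 0 { 0.0 } else { sum / kept as f32 };
    (codes, scale)
}

fn main() {
    let w = [0.9f32, -0.05, 0.4, -0.8, 0.01, -0.3];
    let (codes, scale) = quantize_ternary(&w);
    println!("codes = {:?}, scale = {:.3}", codes, scale); // [1, 0, 1, -1, 0, -1]
}
```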
Constraint-solving substrates provide 674 algorithms across 46 categories for discrete reasoning. SAT solvers (DPLL, CDCL, local search) handle Boolean satisfiability problems foundational to formal verification, planning, and combinatorial optimization. CSP engines (backtracking, arc consistency, forward checking) solve constraint satisfaction problems common in scheduling, resource allocation, and configuration. SMT solvers (DPLL(T), Z3-style combination) handle satisfiability modulo theories, enabling reasoning about integers, arrays, bitvectors with logical structure. MaxSAT and weighted CSP algorithms optimize over constraint violations. Answer set programming (ASP) provides declarative problem specification. Integer linear programming (ILP) solves optimization problems with discrete variables. These algorithms enable precise symbolic reasoning impossible with purely neural approaches.
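As a flavour of the simplest end of this algorithm family, the sketch below implements a naive DPLL-style SAT search: branching plus clause evaluation under a partial assignment, with no unit propagation or clause learning. It is a pedagogical toy rather than one of the framework's 674 solvers, and the DIMACS-style ±integer literal encoding is an assumption.

```rust
/// Minimal DPLL-style SAT sketch (illustrative only, not a Dweve Core solver).
/// A literal is an i32: +v means variable v is true, -v means it is false.
/// A formula is a conjunction (Vec) of clauses, each a disjunction of literals.
type Clause = Vec<i32>;

fn solve(clauses: &[Clause], n_vars: usize, assignment: &mut Vec<Option<bool>>) -> bool {
    // Evaluate every clause under the current partial assignment.
    let mut all_satisfied = true;
    for clause in clauses {
        let mut satisfied = false;
        let mut undecided = false;
        for &lit in clause {
            let var = lit.unsigned_abs() as usize;
            match assignment[var] {
                Some(value) if value == (lit > 0) => { satisfied = true; break; }
                None => undecided = true,
                _ => {}
            }
        }
        if !satisfied {
            if !undecided { return false; } // clause falsified: backtrack
            all_satisfied = false;
        }
    }
    if all_satisfied { return true; }

    // Branch on the first unassigned variable (real solvers use CDCL,
    // activity heuristics, and unit propagation here).
    let var = (1..=n_vars).find(|&v| assignment[v].is_none()).unwrap();
    for value in [true, false] {
        assignment[var] = Some(value);
        if solve(clauses, n_vars, assignment) { return true; }
    }
    assignment[var] = None;
    false
}

fn main() {
    // (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
    let clauses = vec![vec![1, 2], vec![-1, 3], vec![-2, -3]];
    let mut assignment = vec![None; 4]; // index 0 unused
    println!("satisfiable: {}", solve(&clauses, 3, &mut assignment));
}
```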
Hyperdimensional computing represents information in 10,000-dimensional binary vectors where distance encodes similarity. The 16 included algorithms perform binding (combining concepts), bundling (creating superposition representations), permutation (sequence encoding), and similarity queries. This brain-inspired approach exhibits remarkable properties: representations tolerate massive noise, operations compose naturally, learning requires few examples, and inference uses simple bitwise operations. Hyperdimensional computing excels at few-shot learning, robust pattern recognition, and compositional reasoning with minimal computational overhead.
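The sketch below illustrates binding, permutation, and similarity on packed binary hypervectors, assuming a 10,000-bit dimensionality stored as 157 u64 words. It is a dependency-free illustration of the operations described above, not the Dweve Core API.

```rust
/// Illustrative hyperdimensional-computing primitives on packed binary
/// hypervectors (a sketch, not the Dweve Core API). 10,000 bits are stored
/// in ceil(10000 / 64) = 157 u64 words.
const WORDS: usize = 157;
type Hypervector = [u64; WORDS];

/// Binding combines two concepts into a vector dissimilar to both inputs;
/// for binary vectors this is element-wise XOR, which is its own inverse.
fn bind(a: &Hypervector, b: &Hypervector) -> Hypervector {
    let mut out = [0u64; WORDS];
    for i in 0..WORDS {
        out[i] = a[i] ^ b[i];
    }
    out
}

/// Permutation (here a cyclic rotation of the packed words) encodes order,
/// so "A then B" gets a different representation from "B then A".
fn permute(v: &Hypervector) -> Hypervector {
    let mut out = *v;
    out.rotate_right(1);
    out
}

/// Similarity is the Hamming distance: fewer differing bits, closer concepts.
fn hamming(a: &Hypervector, b: &Hypervector) -> u32 {
    (0..WORDS).map(|i| (a[i] ^ b[i]).count_ones()).sum()
}

fn main() {
    // xorshift64: a tiny dependency-free PRNG to generate random hypervectors.
    let mut seed = 0x1234_5678_9abc_def1u64;
    let mut random_hv = || {
        let mut hv = [0u64; WORDS];
        for word in hv.iter_mut() {
            seed ^= seed << 13; seed ^= seed >> 7; seed ^= seed << 17;
            *word = seed;
        }
        hv
    };
    let (role, filler) = (random_hv(), random_hv());
    let pair = bind(&role, &filler);
    // Unbinding with the role recovers the filler exactly (XOR is self-inverse),
    // while unrelated random vectors sit near half of all bits apart.
    assert_eq!(hamming(&bind(&pair, &role), &filler), 0);
    let _shifted = permute(&role);
    println!("distance between random vectors: {} / {}", hamming(&role, &filler), WORDS * 64);
}
```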
Tsetlin machines learn through propositional logic, building interpretable models using Boolean clauses. The 15 algorithms span classification, regression, and reinforcement learning. Unlike neural networks' black-box decisions, Tsetlin machines produce human-readable explanations: which features triggered which clauses for each prediction. They train efficiently on small datasets, handle concept drift naturally, and provide guarantees about learning dynamics. Tsetlin machines bridge symbolic AI (explicit logical rules) and statistical learning (data-driven adaptation), offering interpretability critical for regulated domains.
Cellular automata compute through local interaction rules on discrete grids. The 9 algorithms implement various CA types: elementary (1D, two-state), Game of Life (2D), totalistic (updates depend on the sum of neighborhood states), and continuous (real-valued cells with discrete update rules). Cellular automata model spatial dynamics, simulate physical processes, generate patterns, and solve certain problem classes efficiently through massive parallelism. They provide a fundamentally different computation model: no central control, purely local interactions, emergent global behavior.
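A minimal elementary-CA step shows how little machinery local update rules need. The sketch assumes Wolfram-style rule numbering and wrap-around borders; it is independent of the framework's CA kernels.

```rust
/// Illustrative elementary (1-D, two-state) cellular automaton step (not tied
/// to Dweve Core's CA kernels). `rule` is the Wolfram rule number, e.g. 110;
/// each cell's next state is looked up from the 3-cell neighborhood
/// (left, self, right) with wrap-around borders.
fn ca_step(cells: &[u8], rule: u8) -> Vec<u8> {
    let n = cells.len();
    (0..n)
        .map(|i| {
            let left = cells[(i + n - 1) % n];
            let center = cells[i];
            let right = cells[(i + 1) % n];
            let neighborhood = (left << 2) | (center << 1) | right; // 0..=7
            (rule >> neighborhood) & 1
        })
        .collect()
}

fn main() {
    // Single live cell; rule 110 is a classic Turing-complete elementary CA.
    let mut cells = vec![0u8; 32];
    cells[16] = 1;
    for _ in 0..8 {
        let row: String = cells.iter().map(|&c| if c == 1 { '#' } else { '.' }).collect();
        println!("{row}");
        cells = ca_step(&cells, 110);
    }
}
```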
Stochastic computing represents values as bitstream probabilities, trading precision for massive parallelism. The 19 algorithms perform arithmetic (addition, multiplication), complex functions (exponentiation, square roots), and signal processing (filtering, correlation). Operations become trivial: AND gates multiply probabilities, and multiplexers perform scaled addition (a plain OR gate only approximates addition when the streams are sparse). This enables hardware implementations with minimal gate counts, high fault tolerance, and inherent error resilience. Stochastic computing suits approximate workloads where probabilistic answers suffice and hardware efficiency matters critically.
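The sketch below shows the unipolar encoding and AND-gate multiplication in plain Rust, using an inline xorshift generator so it stays dependency-free. The names and stream length are illustrative assumptions, not the framework's API.

```rust
/// Illustrative unipolar stochastic computing sketch (not the Dweve Core API):
/// a value p in [0, 1] becomes a bitstream whose bits are 1 with probability p,
/// and an AND gate over two independent streams multiplies the probabilities.
fn bitstream(p: f64, len: usize, seed: &mut u64) -> Vec<bool> {
    (0..len)
        .map(|_| {
            // xorshift64: tiny dependency-free PRNG for the sketch.
            *seed ^= *seed << 13; *seed ^= *seed >> 7; *seed ^= *seed << 17;
            (*seed as f64 / u64::MAX as f64) < p
        })
        .collect()
}

/// Decode a stream back to a probability estimate (fraction of ones).
fn decode(stream: &[bool]) -> f64 {
    stream.iter().filter(|&&b| b).count() as f64 / stream.len() as f64
}

fn main() {
    let mut seed = 0x9e37_79b9_7f4a_7c15u64;
    let (a, b) = (0.6, 0.5);
    let sa = bitstream(a, 1 << 16, &mut seed);
    let sb = bitstream(b, 1 << 16, &mut seed);
    // Bitwise AND of the streams ≈ a * b; precision grows with stream length.
    let product: Vec<bool> = sa.iter().zip(&sb).map(|(&x, &y)| x && y).collect();
    println!("expected {:.3}, estimated {:.3}", a * b, decode(&product));
}
```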
Hardware optimization
SIMD vectorization exploits data-level parallelism through processor vector extensions. AVX-512 kernels process 512-bit vectors, enabling 512 1-bit operations, 128 4-bit operations, or 64 8-bit operations per instruction. AVX2 handles 256-bit vectors on older x86 processors. NEON accelerates ARM mobile devices. SVE targets ARM server chips with scalable vector lengths. The compiler selects optimal kernels at runtime based on detected CPU features, ensuring maximum throughput on every platform without manual architecture-specific code.
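For reference, the XNOR-popcount inner loop looks like this in portable Rust. The framework's actual kernels use architecture-specific intrinsics and runtime dispatch, which this sketch deliberately omits; the function name and ±1 bit encoding are assumptions.

```rust
/// Illustrative binary dot product via XNOR + popcount over bit-packed words
/// (a portable sketch; optimized kernels would use AVX-512/AVX2/NEON/SVE).
/// Each u64 word packs 64 values encoded as 1 -> +1, 0 -> -1, and `total_bits`
/// must equal the number of valid (non-padding) bits.
fn binary_dot(a: &[u64], b: &[u64], total_bits: u32) -> i32 {
    assert_eq!(a.len(), b.len());
    // XNOR marks positions where the two ±1 values agree; popcount counts them.
    let matches: u32 = a.iter().zip(b).map(|(&x, &y)| (!(x ^ y)).count_ones()).sum();
    // With m matching and (total - m) differing positions, the ±1 dot product
    // is m - (total - m) = 2m - total.
    2 * matches as i32 - total_bits as i32
}

fn main() {
    // Two identical 128-bit vectors (2 words each) -> dot product = +128.
    let a = [u64::MAX, 0];
    let b = [u64::MAX, 0];
    println!("{}", binary_dot(&a, &b, 128)); // prints 128
}
```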
Memory layout optimization arranges data to maximize cache efficiency and minimize bandwidth. Bit-packing compresses binary tensors 32×. Structure-of-arrays layouts improve vectorization by separating tensor dimensions. Cache-blocking tiles operations to fit working sets in L1/L2 cache. Prefetching brings data into cache before computation needs it. These optimizations frequently matter more than computational throughput, especially for memory-bound operations where DRAM bandwidth limits performance.
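A minimal sign-based packing routine, assuming the same 1 → +1, 0 → −1 encoding as the dot-product sketch above, illustrates where the 32× figure comes from: 32 bits per float32 value shrink to 1 bit per binary value.

```rust
/// Illustrative sign-based bit packing (not the Dweve Core memory layout):
/// 64 float32 values collapse into one u64 word.
fn pack_signs(values: &[f32]) -> Vec<u64> {
    values
        .chunks(64)
        .map(|chunk| {
            let mut word = 0u64;
            for (bit, &v) in chunk.iter().enumerate() {
                // Encode non-negative as 1 (+1), negative as 0 (-1).
                if v >= 0.0 {
                    word |= 1u64 << bit;
                }
            }
            word
        })
        .collect()
}

fn main() {
    let x: Vec<f32> = (0..128).map(|i| if i % 3 == 0 { -1.0 } else { 0.5 }).collect();
    let packed = pack_signs(&x);
    println!("{} floats -> {} words ({} bytes)", x.len(), packed.len(), packed.len() * 8);
}
```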
GPU implementations exploit massive parallelism for large batches. CUDA kernels target NVIDIA GPUs. ROCm kernels target AMD GPUs. Both implement binary matrix multiplication through bit-parallel operations, leveraging thousands of concurrent threads. Quantized convolutions distribute spatial regions across thread blocks. Memory coalescing patterns optimize global memory access. Shared memory caching reduces global memory traffic. GPU implementations excel when batch sizes are large enough to saturate the available parallelism.
FPGA and ASIC compilation generates hardware designs from algorithm descriptions. The synthesis pipeline produces Verilog/VHDL for FPGA deployment or ASIC tape-out. Binary operations map naturally to hardware gates. Quantized datapaths use narrow bit-widths. Pipelines exploit temporal parallelism. Specialized accelerators achieve 100× efficiency gains over general-purpose processors for fixed workloads. Custom silicon makes economic sense at scale, and Dweve Core provides the necessary compilation infrastructure.
Training and optimization
Straight-through estimators enable gradient-based training despite non-differentiable quantization. The forward pass uses discrete values for efficiency. The backward pass approximates gradients through quantization boundaries using various estimators: hard tanh STE clips gradients to prevent explosion, soft STE applies smooth approximations, stochastic STE adds gradient noise for exploration, and learned STE parameterizes estimation functions. STEs make standard optimizers (Adam, SGD, RMSprop) applicable to discrete networks.
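A minimal hard-tanh STE, written as explicit forward/backward functions rather than inside an autograd engine, illustrates the idea. The clipping interval [−1, 1] and the function names are assumptions for this sketch.

```rust
/// Illustrative hard-tanh straight-through estimator (a sketch, not the
/// framework's autograd): the forward pass uses the discrete sign, the
/// backward pass passes gradients through unchanged inside [-1, 1] and
/// zeroes them outside, preventing gradient explosion.
fn ste_forward(x: &[f32]) -> Vec<f32> {
    x.iter().map(|&v| if v >= 0.0 { 1.0 } else { -1.0 }).collect()
}

fn ste_backward(x: &[f32], upstream_grad: &[f32]) -> Vec<f32> {
    x.iter()
        .zip(upstream_grad)
        .map(|(&v, &g)| if v.abs() <= 1.0 { g } else { 0.0 })
        .collect()
}

fn main() {
    let x = [0.3f32, -0.7, 1.8, -2.5];
    let grad = [0.1f32, 0.2, 0.3, 0.4];
    println!("forward:  {:?}", ste_forward(&x));          // [1.0, -1.0, 1.0, -1.0]
    println!("backward: {:?}", ste_backward(&x, &grad));  // [0.1, 0.2, 0.0, 0.0]
}
```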
Constraint evolution optimizes discrete structures using evolutionary algorithms, simulated annealing, and combinatorial search. The evolution pipeline handles SAT clauses, CSP variable orders, hyperdimensional binding patterns, and Tsetlin automaton clause selection. Fitness-guided search discovers configurations satisfying objectives while maintaining structural constraints. Population-based methods explore solution spaces intractable for gradient descent. This complements neural training for hybrid systems combining learned and engineered components.
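As one concrete instance of fitness-guided search, the sketch below anneals a Boolean assignment against a small clause set, using the number of violated clauses as the cost. It is a toy illustration of simulated annealing under assumed parameters, not the framework's evolution pipeline.

```rust
/// Illustrative simulated annealing over a Boolean assignment (a sketch, not a
/// Dweve Core pipeline): single-bit flips are accepted if they reduce the number
/// of violated clauses or, with temperature-dependent probability, if they increase it.
type Clause = Vec<i32>; // +v / -v literals, DIMACS-style

fn violated(clauses: &[Clause], assign: &[bool]) -> usize {
    clauses
        .iter()
        .filter(|c| !c.iter().any(|&l| assign[l.unsigned_abs() as usize] == (l > 0)))
        .count()
}

fn anneal(clauses: &[Clause], n_vars: usize, steps: usize, seed: &mut u64) -> Vec<bool> {
    // xorshift64: tiny dependency-free PRNG for proposals and acceptance tests.
    let mut rand = || { *seed ^= *seed << 13; *seed ^= *seed >> 7; *seed ^= *seed << 17; *seed };
    let mut assign = vec![false; n_vars + 1];
    let mut cost = violated(clauses, &assign);
    for step in 0..steps {
        let temperature = 1.0 - step as f64 / steps as f64; // linear cooling schedule
        let var = 1 + (rand() as usize) % n_vars;           // propose one bit flip
        assign[var] = !assign[var];
        let new_cost = violated(clauses, &assign);
        let accept = new_cost <= cost
            || (rand() as f64 / u64::MAX as f64)
                < (-((new_cost - cost) as f64) / temperature.max(1e-6)).exp();
        if accept { cost = new_cost; } else { assign[var] = !assign[var]; } // revert
    }
    assign
}

fn main() {
    let clauses = vec![vec![1, 2], vec![-1, 3], vec![-2, -3], vec![2, 3]];
    let mut seed = 42u64;
    let assign = anneal(&clauses, 3, 10_000, &mut seed);
    println!("violated clauses: {}", violated(&clauses, &assign));
}
```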
Hyperdimensional learning trains through vector bundling and associative binding rather than gradient descent. Few-shot learning adds new concepts by bundling example vectors. Retraining adjusts vector associations without forgetting previous knowledge. Concept hierarchies build through recursive binding. Query-based retrieval finds semantically similar patterns. This learning paradigm scales to massive concept spaces, handles noisy data gracefully, and requires minimal computational resources compared to backpropagation.
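A minimal few-shot classifier built from bundled prototypes, reusing the packed-hypervector representation from the earlier sketch, illustrates this workflow. The prototype and classification routines are assumptions for illustration, not the framework's API.

```rust
/// Illustrative few-shot classification with hyperdimensional prototypes
/// (a sketch, not the Dweve Core API): each class prototype is the bitwise
/// majority bundle of its example vectors, and a query is assigned to the
/// class whose prototype is nearest in Hamming distance.
const WORDS: usize = 157; // 10,000-bit hypervectors packed into u64 words
type Hypervector = [u64; WORDS];

/// Bundling: per-bit majority vote over the example vectors.
fn bundle(examples: &[Hypervector]) -> Hypervector {
    let mut proto = [0u64; WORDS];
    for i in 0..WORDS {
        for bit in 0..64 {
            let ones = examples.iter().filter(|v| (v[i] >> bit) & 1 == 1).count();
            if 2 * ones > examples.len() { proto[i] |= 1u64 << bit; }
        }
    }
    proto
}

fn hamming(a: &Hypervector, b: &Hypervector) -> u32 {
    (0..WORDS).map(|i| (a[i] ^ b[i]).count_ones()).sum()
}

/// Nearest-prototype retrieval; adding a new class later only requires
/// bundling its examples, without retraining existing prototypes.
fn classify(prototypes: &[Hypervector], query: &Hypervector) -> usize {
    prototypes
        .iter()
        .enumerate()
        .min_by_key(|&(_, p)| hamming(p, query))
        .map(|(i, _)| i)
        .unwrap()
}

fn main() {
    let class_a = [[0u64; WORDS]; 3];     // three all-zero example vectors
    let class_b = [[u64::MAX; WORDS]; 3]; // three all-one example vectors
    let prototypes = [bundle(&class_a), bundle(&class_b)];
    let mut query = [u64::MAX; WORDS];
    query[0] = 0;                         // a noisy, mostly-ones query
    println!("query assigned to class {}", classify(&prototypes, &query)); // 1
}
```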
Tsetlin training adjusts clause inclusion through reinforcement feedback. Type I feedback strengthens clauses producing correct outputs. Type II feedback weakens clauses producing incorrect outputs. Stochastic automaton state transitions implement exploration-exploitation tradeoffs. Team voting aggregates multiple Tsetlin automata for ensemble decisions. The training process requires no gradient computation, handles sparse data efficiently, and produces interpretable Boolean rules explaining each decision.
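The sketch below implements the underlying two-action Tsetlin automaton with reward and penalty transitions. The clause-level Type I / Type II logic that decides which signal each automaton receives is omitted, and all names and the state count are illustrative assumptions.

```rust
/// Illustrative two-action Tsetlin automaton (a sketch of the building block,
/// not the full Tsetlin machine trainer): states 1..=n choose "exclude" the
/// literal, states n+1..=2n choose "include" it. Rewards push the state deeper
/// into its current half; penalties push it toward, and eventually across,
/// the decision boundary.
struct TsetlinAutomaton {
    state: u32,
    n: u32, // number of states per action
}

impl TsetlinAutomaton {
    fn new(n: u32) -> Self {
        Self { state: n, n } // start on the "exclude" side of the boundary
    }

    fn includes_literal(&self) -> bool {
        self.state > self.n
    }

    fn reward(&mut self) {
        // Confidence grows: move away from the decision boundary.
        if self.includes_literal() {
            self.state = (self.state + 1).min(2 * self.n);
        } else {
            self.state = (self.state - 1).max(1);
        }
    }

    fn penalize(&mut self) {
        // Confidence shrinks: move toward, and possibly across, the boundary.
        if self.includes_literal() {
            self.state -= 1;
        } else {
            self.state += 1;
        }
    }
}

fn main() {
    let mut ta = TsetlinAutomaton::new(100);
    assert!(!ta.includes_literal());
    ta.penalize(); // pushed across the boundary: the decision flips to "include"
    assert!(ta.includes_literal());
    for _ in 0..5 { ta.reward(); } // rewards now entrench the "include" decision
    println!("state = {}, includes = {}", ta.state, ta.includes_literal());
}
```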
Neural architectures
Quantized transformers bring discrete computation to modern language models across all bit-widths. Binary attention uses XNOR-popcount for maximum efficiency. Ternary adds structured sparsity to attention maps. 2-bit and 4-bit quantization balances speed and expressiveness in feed-forward layers. 8-bit maintains near-full precision where needed. Multi-head attention, position encodings, and feed-forward networks all support the full spectrum from binary through adaptive multi-bit. The result: transformer capability at a fraction of traditional computational cost.
Quantized CNNs perform image processing at every precision level. Binary convolutions achieve maximum throughput through bitwise operations. Ternary enables learned sparsity. 4-bit and 8-bit convolutions balance efficiency with representational capacity. The framework supports all modern variants: arbitrary kernel sizes, dilated convolutions, grouped convolutions, depthwise separable convolutions. Pooling, normalization, and activation layers adapt to each quantization level. Precision can vary per layer, automatically optimized during training.
Quantized RNNs, LSTMs, and GRUs handle sequential processing with multi-bit hidden states. Binary variants maximize memory efficiency. Ternary adds expressiveness through three-valued states. 4-bit and 8-bit variants provide graduated capacity. Gates, state transitions, and memory cells all operate at the chosen precision level, with automatic gradient handling across quantization boundaries.
Hybrid and adaptive architectures mix precision levels optimally. Feature extraction runs at low bit-width for efficiency. Decision layers maintain higher precision where it matters. Adaptive multi-bit adjusts per-layer or even per-channel based on sensitivity. The framework seamlessly manages precision boundaries, quantization, dequantization, and gradient flow throughout mixed-precision networks.
Pre-built model architectures
The framework includes 120 pre-built model architectures across multiple domains:
Computer vision
- 25 image classification models (BinaryNet, XNOR-Net, Bi-Real Net variants)
- 18 object detection models (binary YOLO, RetinaNet, Faster R-CNN)
- 15 segmentation models (U-Net, DeepLab, Mask R-CNN)
Natural language & beyond
- 20 NLP models (binary transformers, text classifiers, NER systems)
- 10 speech processing models (ASR, TTS, keyword spotting)
- 8 time series models (forecasting, anomaly detection)
- 12 generative models (VAEs, GANs, diffusion models)
- 8 reinforcement learning architectures
Deployment infrastructure
Binary neural networks shine when deployed. A model that fits entirely in L3 cache changes deployment economics fundamentally. Edge devices that struggled with traditional networks suddenly become capable inference platforms. The framework supports deployment from resource-constrained embedded systems through mobile devices (iOS, Android), edge gateways, cloud VMs, and Kubernetes clusters. The same binary model runs everywhere, automatically selecting hardware-optimized implementations at runtime.
Model serving infrastructure handles production workloads through request batching, auto-scaling based on load, intelligent load balancing, and A/B testing for model comparison. The monitoring system captures inference latency distributions, throughput metrics, resource utilization patterns, and model performance characteristics. Detailed profiling reveals bottlenecks. Observability ensures you understand system behavior in production.
Getting started
The documentation is organized into 27 major categories covering everything from foundational concepts to advanced optimization techniques.
Begin with chapter 1 (Getting started) in the navigation menu, or jump to specific topics using the comprehensive chapter structure.
API languages
Rust (native)
The framework core lives in Rust, chosen for zero-cost abstractions, guaranteed memory safety, and predictable performance. The native API exposes complete control: memory layout decisions, computation graph construction, hardware dispatch policies, and low-level optimization. When you need maximum performance and full control, you work in Rust. Available under proprietary licensing.
Python bindings
Python bindings deliver a NumPy-style interface for researchers and rapid prototyping. PyO3 enables zero-copy data exchange between Python and Rust, eliminating serialization overhead. High-level operations feel Pythonic while performance-critical paths execute in compiled Rust. Full access to underlying capabilities through a familiar interface. Available under proprietary licensing.
Join the Dweve Waitlist
Get early access to AI that respects your privacy, the planet, and your wallet.