An open-source pipeline for archival-grade data storage in oligonucleotides, with triple-layer error correction and multi-stage compression. Written in Rust.
From arbitrary binary data to synthesis-ready FASTA output. Each stage is independently tested and composable.
Measured values from 151 automated tests (70 unit, 81 integration).
All numbers are reproducible via `cargo test`.
Compression ratios are measured with BWT+MTF+ZRLE preprocessing → BPE tokenization → parallel ZSTD-22 and Brotli-11 trials. The range depends on data redundancy. The pipeline is optimized for text-based formats; binary or pre-compressed data sees minimal or no compression.
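One stage of the preprocessing chain, the move-to-front (MTF) transform, can be sketched as follows. This is a minimal textbook illustration, not DATA2DNA's actual code; the function names `mtf_encode` and `mtf_decode` are hypothetical. It shows why MTF helps after a BWT pass: runs of identical bytes turn into runs of zeros, which a zero-run-length stage can then collapse.

```rust
/// Move-to-front encode: each byte is replaced by its current index in a
/// recency list, and that byte is then moved to the front of the list.
/// (Hypothetical helper, not taken from the DATA2DNA source.)
pub fn mtf_encode(input: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0..=255).collect();
    let mut out = Vec::with_capacity(input.len());
    for &byte in input {
        // The table always holds all 256 byte values, so the lookup cannot fail.
        let idx = table.iter().position(|&b| b == byte).unwrap();
        out.push(idx as u8);
        let value = table.remove(idx);
        table.insert(0, value);
    }
    out
}

/// Inverse transform: replay the same recency-list updates from the indices.
pub fn mtf_decode(indices: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0..=255).collect();
    let mut out = Vec::with_capacity(indices.len());
    for &idx in indices {
        let value = table.remove(idx as usize);
        out.push(value);
        table.insert(0, value);
    }
    out
}
```

For example, encoding `b"aaab"` yields `[97, 0, 0, 98]`: the repeated `a` bytes become zeros after the first occurrence.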
Theoretical limit: tolerable loss = 1 − 1/redundancy, which is 50% at 2.0× redundancy. Practical recovery is lower because the peeling decoder needs extra droplets beyond the source block count under the Robust Soliton distribution (c=0.025, δ=0.001, per Erlich & Zielinski 2017). At 2.0× redundancy, DATA2DNA survives ~30% oligo loss in tests.
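The loss bound and the degree distribution named above can be sketched in a few lines. This is an illustrative sketch using the standard Robust Soliton definition, not the project's implementation; the function names and the choice of `k = 1000` source blocks in the usage note are assumptions. It assumes parameters for which R/δ > 1, which holds for c=0.025, δ=0.001.

```rust
/// Theoretical maximum fraction of oligos that may be lost: 1 - 1/redundancy.
pub fn theoretical_loss_limit(redundancy: f64) -> f64 {
    1.0 - 1.0 / redundancy
}

/// Robust Soliton probabilities mu(d) for degrees d = 1..=k.
/// Index 0 of the returned vector is unused and stays 0.
/// (Hypothetical helper; standard textbook definition.)
pub fn robust_soliton(k: usize, c: f64, delta: f64) -> Vec<f64> {
    assert!(k >= 1);
    let kf = k as f64;
    // R = c * ln(k / delta) * sqrt(k)
    let r = c * (kf / delta).ln() * kf.sqrt();
    // Position of the spike, k / R, clamped to a valid degree.
    let pivot = ((kf / r).floor() as usize).clamp(1, k);

    let mut mu = vec![0.0; k + 1];
    for d in 1..=k {
        let df = d as f64;
        // Ideal Soliton component rho(d).
        let rho = if d == 1 { 1.0 / kf } else { 1.0 / (df * (df - 1.0)) };
        // Robustness component tau(d).
        let tau = if d < pivot {
            r / (df * kf)
        } else if d == pivot {
            r * (r / delta).ln() / kf
        } else {
            0.0
        };
        mu[d] = rho + tau;
    }
    // Normalize so the probabilities sum to 1.
    let z: f64 = mu.iter().sum();
    for p in mu.iter_mut() {
        *p /= z;
    }
    mu
}
```

Calling `robust_soliton(1000, 0.025, 0.001)` gives a distribution dominated by degree 2, with a secondary spike near degree k/R. `theoretical_loss_limit(2.0)` returns 0.5, the upper bound against which the ~30% measured figure should be read.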
DNA synthesis, storage, and sequencing each introduce distinct error types. Three independent correction layers ensure integrity under realistic conditions.
9,563 lines of Rust across 15 modules. No unsafe code. Parallel computation via Rayon. Actix-Web 4 HTTP server with SSE progress reporting.
| Parameter | Value | Notes |
|---|---|---|
| Oligo length | 300 bp | Twist/IDT synthesis compatible |
| Payload per oligo | 228 bp | 300 − 72 bp overhead |
| Payload efficiency | 76% | 228 / 300 |
| RS code | RS(255,223) | 32 parity symbols, 16-error correction per block |
| GF polynomial | 0x11D | x^8 + x^4 + x^3 + x^2 + 1 |
| Fountain distribution | Robust Soliton | c=0.025, δ=0.001 (DNA Fountain params) |
| Block size | 64 bytes | RS alignment |
| Default redundancy | 2.0× | Survives ~30% loss |
| Primers | 20 bp × 2 | Standard PCR amplification |
| GC content target | 40–60% | Synthesis optimization |
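To make the Reed-Solomon rows of the table concrete, here is a minimal sketch of multiplication in GF(2^8) reduced by the polynomial 0x11D, plus the parity arithmetic for RS(255,223). It is an illustration under stated assumptions, not DATA2DNA's code: the names `gf_mul` and `rs_correctable_errors` are hypothetical, and a production encoder would typically use log/antilog tables instead of this bitwise loop.

```rust
/// Field polynomial x^8 + x^4 + x^3 + x^2 + 1, written as 0x11D.
const GF_POLY: u16 = 0x11D;

/// Multiply two elements of GF(2^8) using shift-and-add with reduction
/// by GF_POLY. (Hypothetical helper, not taken from the DATA2DNA source.)
pub fn gf_mul(a: u8, b: u8) -> u8 {
    let mut a = a as u16;
    let mut b = b;
    let mut product: u16 = 0;
    while b != 0 {
        if b & 1 != 0 {
            product ^= a; // addition in GF(2^8) is XOR
        }
        a <<= 1;
        if a & 0x100 != 0 {
            a ^= GF_POLY; // reduce once the degree reaches 8
        }
        b >>= 1;
    }
    product as u8
}

/// Number of symbol errors an RS(n, k) code can correct per block: (n - k) / 2.
pub fn rs_correctable_errors(n: usize, k: usize) -> usize {
    (n - k) / 2
}
```

For instance, `gf_mul(0x02, 0x80)` is 0x1D, because x · x^7 = x^8 reduces to x^4 + x^3 + x^2 + 1, and `rs_correctable_errors(255, 223)` is 16, matching the table.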
Published DNA storage systems and how DATA2DNA relates. Note: direct comparison is limited because the other systems listed were validated in the wet lab, whereas DATA2DNA is currently simulation-only.
| System | Year | Bits/nt | Error Correction | Validation |
|---|---|---|---|---|
| Church, Gao & Kosuri | 2012 | ~0.83 | Repetition encoding | Wet lab |
| Goldman et al. | 2013 | ~0.33 | Fourfold redundancy | Wet lab |
| Erlich & Zielinski (DNA Fountain) | 2017 | 1.57 | RS + Fountain codes | Wet lab |
| Organick et al. (Microsoft/UW) | 2018 | ~1.10 | RS + Repetition | Wet lab, 200 MB |
| DATA2DNA | 2025 | 0.76* | CRC + IRS + Fountain | Simulation only |
* 0.76 bits/nt effective with 2.0× redundancy (2.00 bits/nt raw encoding, 76% payload efficiency, halved by redundancy). Compression can multiply effective throughput on text data but does not change physical nt density. DATA2DNA has not yet been validated with physical DNA synthesis and sequencing.
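The footnote's arithmetic can be checked with a short sketch. The function name `effective_bits_per_nt` is hypothetical, and the formula simply restates the footnote: raw density times payload efficiency, divided by redundancy.

```rust
/// Effective information density in bits per nucleotide:
/// raw encoding density x payload fraction of each oligo / redundancy.
/// (Hypothetical helper restating the footnote's arithmetic.)
pub fn effective_bits_per_nt(
    raw_bits_per_nt: f64,
    payload_bp: u32,
    oligo_bp: u32,
    redundancy: f64,
) -> f64 {
    let payload_efficiency = payload_bp as f64 / oligo_bp as f64;
    raw_bits_per_nt * payload_efficiency / redundancy
}
```

With the documented parameters, `effective_bits_per_nt(2.0, 228, 300, 2.0)` evaluates to approximately 0.76. Lowering redundancy to 1.0 would give about 1.52 bits/nt, at the cost of no tolerance for oligo loss.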
DNA is the densest known information storage medium. At ambient temperature in a sealed container, it requires zero energy to maintain.
| Medium | Density | Durability | Storage Energy | Source |
|---|---|---|---|---|
| Hard drive | ~1 TB / 100g | 5–10 years | 6–8W continuous | |
| LTO-9 Tape | 18 TB / cartridge | ~30 years | Climate-controlled | |
| DNA | 215 PB / gram | 10,000+ years | Zero (ambient) | Erlich 2017, Zhirnov 2016 |
These density figures come from peer-reviewed literature. The theoretical maximum is 455 EB/gram (Zhirnov et al., Nature Materials, 2016). Practical density achieved in lab settings is 215 PB/gram (Erlich & Zielinski, Science, 2017). DNA storage remains expensive to write (~$0.01–0.10/nucleotide at scale) and slow to read (hours for sequencing). It is best suited for cold archival data that is written once and read rarely.
What works, what doesn't, and what's next.