DATA2DNA

Encoding Digital Data
Into Synthetic DNA

An open-source pipeline for archival-grade data storage in oligonucleotides, with triple-layer error correction and multi-stage compression. Written in Rust.

9,563 Lines of Rust
151 Tests Passing
300bp Oligo Length
2.0× Redundancy
RS(255,223) Error Correction
01

8-Stage Encoding Pipeline

From arbitrary binary data to synthesis-ready FASTA output. Each stage is independently tested and composable.

01 HyperCompress BWT + MTF + ZRLE preprocessing, then parallel ZSTD‑22 / Brotli‑11 trials. Best result wins.
02 Interleaved RS Reed‑Solomon RS(255,223) with cross‑oligo interleaving. Converts burst losses to single‑symbol errors.
03 Fountain Codes Hybrid systematic/LT code: systematic phase for baseline coverage, then Robust Soliton LT droplets (c=0.025, δ=0.001). Operates on binary data.
04 Transcoder 2‑bit encoding (A=00, C=01, G=10, T=11) with rotation cipher for GC balance. Converts RS‑protected binary to DNA bases.
05 Oligo Builder 300bp structured oligos: primers + index + payload + CRC‑32. Synthesis‑ready format.
06 Constraint Check GC content 40–60%, homopolymer ≤3, restriction enzyme screening, primer compatibility.
07 FASTA Output Standard FASTA format with embedded decode metadata. Compatible with Twist, IDT, GenScript.
08 Cost Estimation Per‑oligo pricing at current commercial rates. Projects cost under vendor scenarios.
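The stage 04 base mapping (A=00, C=01, G=10, T=11) is simple enough to sketch in std-only Rust. This is an illustrative reimplementation, not the pipeline's transcoder.rs, and it omits the rotation cipher used for GC balance:

```rust
// 2-bit transcoding sketch: A=00, C=01, G=10, T=11, MSB-first per byte.
const BASES: [char; 4] = ['A', 'C', 'G', 'T'];

/// Encode bytes to DNA, two bits per base.
fn encode(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 4);
    for &b in bytes {
        for i in (0..4).rev() {
            s.push(BASES[((b >> (i * 2)) & 0b11) as usize]);
        }
    }
    s
}

/// Decode DNA back to bytes; None on a non-ACGT character
/// or a length that is not a multiple of 4 bases.
fn decode(dna: &str) -> Option<Vec<u8>> {
    if dna.len() % 4 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(dna.len() / 4);
    let mut acc = 0u8;
    for (i, c) in dna.chars().enumerate() {
        let v = match c {
            'A' => 0, 'C' => 1, 'G' => 2, 'T' => 3,
            _ => return None,
        };
        acc = (acc << 2) | v;
        if i % 4 == 3 {
            out.push(acc);
            acc = 0;
        }
    }
    Some(out)
}

fn main() {
    let dna = encode(b"Hi"); // 'H' = 0x48 = 01 00 10 00 -> "CAGA"
    assert_eq!(dna, "CAGACGGC");
    assert_eq!(decode(&dna).as_deref(), Some(&b"Hi"[..]));
}
```

Four bases per byte gives exactly the 2.00 bits/nucleotide raw density quoted above.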

Oligo Structure — 300bp

FWD PRIMER
INDEX
PAYLOAD
CRC-32
REV PRIMER
20 bp
16 bp
228 bp (76% payload efficiency)
16 bp
20 bp
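Every assembled oligo must pass the stage 06 constraints (GC content 40–60%, homopolymer runs ≤ 3). A minimal sketch of those two checks, omitting restriction-enzyme screening and primer compatibility; these helpers are illustrative, not the pipeline's dna_constraints.rs:

```rust
// Stage 06 sketch: the two sequence-level constraints described above.

/// Fraction of G/C bases in the oligo.
fn gc_fraction(oligo: &str) -> f64 {
    let gc = oligo.chars().filter(|c| matches!(c, 'G' | 'C')).count();
    gc as f64 / oligo.len() as f64
}

/// Length of the longest run of identical bases.
fn max_homopolymer(oligo: &str) -> usize {
    let (mut max, mut run, mut prev) = (0, 0, None);
    for c in oligo.chars() {
        run = if Some(c) == prev { run + 1 } else { 1 };
        prev = Some(c);
        max = max.max(run);
    }
    max
}

/// GC content 40-60% and homopolymer runs of at most 3.
fn passes_constraints(oligo: &str) -> bool {
    let gc = gc_fraction(oligo);
    (0.40..=0.60).contains(&gc) && max_homopolymer(oligo) <= 3
}

fn main() {
    assert!(passes_constraints("ACGTACGTACGT"));  // GC = 50%, runs of 1
    assert!(!passes_constraints("AAAAGCGCGCGC")); // homopolymer run of 4
    assert!(!passes_constraints("ATATATATATAT")); // GC = 0%
}
```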
02

Performance Data

Measured values from 151 automated tests (70 unit, 81 integration). All numbers are reproducible via cargo test.

Compression Ratio by Data Type (HyperCompress Engine)
CSV / TSV: 8–16×
JSON: 6–12×
SQL dump: 10–14×
Source code: 4–8×
Plain text: 3–6×
(observed ranges; typical values fall inside each band)

Compression ratios measured with BWT+MTF+ZRLE preprocessing → BPE tokenization → parallel ZSTD-22 and Brotli-11 trials. Range depends on data redundancy. Optimized for text-based formats; binary/pre-compressed data sees minimal or no compression.
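The MTF step in that preprocessing chain is compact enough to sketch. An illustrative std-only version, not the pipeline's hypercompress.rs: after BWT clusters similar symbols, MTF maps recently seen bytes to small indices, producing the long runs of zeros that ZRLE then collapses.

```rust
// Move-to-front transform: self-inverse encode/decode pair over a
// 256-entry symbol table.

fn mtf_encode(data: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0..=255).collect();
    data.iter()
        .map(|&b| {
            let i = table.iter().position(|&x| x == b).unwrap();
            let sym = table.remove(i);
            table.insert(0, sym); // most recent symbol moves to index 0
            i as u8
        })
        .collect()
}

fn mtf_decode(codes: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0..=255).collect();
    codes.iter()
        .map(|&i| {
            let sym = table.remove(i as usize);
            table.insert(0, sym);
            sym
        })
        .collect()
}

fn main() {
    let input = b"aaabbbccc";
    let coded = mtf_encode(input);
    // Repeated bytes become zeros -- exactly what ZRLE compresses.
    assert_eq!(coded, vec![97, 0, 0, 98, 0, 0, 99, 0, 0]);
    assert_eq!(mtf_decode(&coded), input.to_vec());
}
```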

Redundancy vs. Recoverable Oligo Loss (Fountain Codes)
[Chart: maximum recoverable oligo loss (0–100%) vs. fountain-code redundancy factor (1.0×–3.5×). Two curves: theoretical limit 1 − 1/r and practical Robust Soliton recovery; DATA2DNA default marked at 2.0× redundancy, ~30% loss.]

Theoretical limit: loss = 1 − 1/redundancy. Practical recovery is slightly lower due to peeling decoder overhead in Robust Soliton distribution (c=0.025, δ=0.001, per Erlich & Zielinski 2017). At 2.0× redundancy, DATA2DNA survives ~30% oligo loss in tests.
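The theoretical ceiling quoted above is one line of code; `theoretical_max_loss` is an illustrative helper, not part of the codebase:

```rust
// With redundancy factor r, at most a fraction 1 - 1/r of oligos can be
// lost while still leaving one full copy's worth of droplets to decode.
fn theoretical_max_loss(redundancy: f64) -> f64 {
    1.0 - 1.0 / redundancy
}

fn main() {
    assert_eq!(theoretical_max_loss(2.0), 0.5); // ceiling: 50% at 2.0×
    assert_eq!(theoretical_max_loss(1.0), 0.0); // no redundancy, no losses
    // Practical Robust Soliton recovery at 2.0× lands near 30%,
    // below the 50% ceiling, matching the test results above.
}
```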

Test Suite — 151 / 151 Passing
Unit: 70 · Integration: 81 · Total: 151 (all passing, 0 failures). Runtime ~21 s (unit 8 s + integration 13 s).
Encoding Efficiency
Raw encoding density: 2.00 bits / nucleotide
Payload efficiency (228 / 300 bp): 76%
Effective density with 2.0× redundancy: 0.76 bits / nucleotide stored
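The effective figure follows directly from the other two; a one-line check (illustrative helper, not from the codebase):

```rust
// Effective stored density = raw density × payload efficiency ÷ redundancy.
fn effective_density(raw_bits_per_nt: f64, payload_eff: f64, redundancy: f64) -> f64 {
    raw_bits_per_nt * payload_eff / redundancy
}

fn main() {
    let d = effective_density(2.00, 228.0 / 300.0, 2.0);
    assert_eq!(d, 0.76); // bits per nucleotide actually stored
}
```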
03

Triple-Layer Error Correction

DNA synthesis, storage, and sequencing each introduce distinct error types. Three independent correction layers ensure integrity under realistic conditions.

I
CRC-32 Per-Oligo Detection
Flags corrupt oligos before RS decoding. 16 bp field per oligo. False positive rate < 2.3 × 10⁻⁸.
< 10⁻⁸ false positive rate
II
Interleaved Reed-Solomon RS(255,223)
GF(2⁸) arithmetic with Berlekamp–Massey decoder. Cross-oligo interleaving converts burst oligo losses into single-symbol errors; RS then corrects up to 16 per 255-symbol block.
16 errors / block
III
Fountain / LT Codes
Robust Soliton distribution (c=0.025, δ=0.001) with peeling decoder. 2.0× redundancy tolerates ~30% oligo loss. Based on Erlich & Zielinski 2017.
~30% oligos lost
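Layer II's cross-oligo interleaving amounts to a transposition: codewords are written row-major and each oligo carries one column. A toy sketch (`interleave` is illustrative, not interleaved_rs.rs):

```rust
/// Spread codewords of `width` symbols across `width` oligos:
/// oligo j holds symbol j of every codeword, so losing one whole
/// oligo costs each codeword exactly one symbol instead of a burst.
fn interleave(codewords: &[Vec<u8>], width: usize) -> Vec<Vec<u8>> {
    (0..width)
        .map(|j| codewords.iter().map(|cw| cw[j]).collect())
        .collect()
}

fn main() {
    let codewords = vec![vec![1u8, 2, 3], vec![4, 5, 6]];
    let oligos = interleave(&codewords, 3);
    assert_eq!(oligos, vec![vec![1, 4], vec![2, 5], vec![3, 6]]);
    // Dropping oligos[1] erases symbol 1 from each codeword: one erasure
    // per 255-symbol block, well inside RS(255,223)'s 16-error budget.
}
```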
// Redundancy math: fraction of the original data still covered after loss
let redundancy = 2.0;
let loss_rate = 0.30;
let surviving = redundancy * (1.0 - loss_rate); // 2.0 × 0.7 = 1.40
// 1.40 ≥ 1.0 → decodable, with a 40% safety margin
04

Codebase

9,563 lines of Rust across 15 modules. No unsafe code. Parallel computation via Rayon. Actix-Web 4 HTTP server with SSE progress reporting.

Lines of Code by Module
hypercompress.rs: 2,480 lines
main.rs: 1,584 lines
pipeline.rs: 1,122 lines
compressor.rs: 663 lines
dna_constraints.rs: 574 lines
oligo_builder.rs: 544 lines
fountain.rs: 538 lines
reed_solomon.rs: 458 lines
interleaved_rs.rs: 387 lines
fasta.rs: 328 lines
cost_estimator.rs: 295 lines
transcoder.rs: 283 lines
chaos.rs: 185 lines
consensus.rs: 99 lines

Technical Specifications

Parameter | Value | Notes
Oligo length | 300 bp | Twist/IDT synthesis compatible
Payload per oligo | 228 bp | 300 − 72 bp overhead
Payload efficiency | 76% | 228 / 300
RS code | RS(255,223) | 32 parity symbols, 16-error correction per block
GF polynomial | 0x11D | x⁸ + x⁴ + x³ + x² + 1
Fountain distribution | Robust Soliton | c=0.025, δ=0.001 (DNA Fountain params)
Block size | 64 bytes | RS alignment
Default redundancy | 2.0× | Survives ~30% loss
Primers | 20 bp × 2 | Standard PCR amplification
GC content target | 40–60% | Synthesis optimization
05

Context in the Field

Published DNA storage systems and how DATA2DNA relates. Note: direct comparison is limited since the following systems include wet-lab validation and DATA2DNA is currently simulation-only.

System | Year | Bits/nt | Error Correction | Validation
Church, Gao & Kosuri | 2012 | ~0.83 | Repetition encoding | Wet lab
Goldman et al. | 2013 | ~0.33 | Fourfold redundancy | Wet lab
Erlich & Zielinski (DNA Fountain) | 2017 | 1.57 | RS + Fountain codes | Wet lab
Organick et al. (Microsoft/UW) | 2018 | ~1.10 | RS + repetition | Wet lab, 200 MB
DATA2DNA | 2025 | 0.76* | CRC + IRS + Fountain | Simulation only

* 0.76 bits/nt effective with 2.0× redundancy (2.00 bits/nt raw encoding, 76% payload efficiency, halved by redundancy). Compression can multiply effective throughput on text data but does not change physical nt density. DATA2DNA has not yet been validated with physical DNA synthesis and sequencing.

06

Why DNA Storage?

DNA is the densest known information storage medium. At ambient temperature in a sealed container, it requires zero energy to maintain.

Medium | Density | Durability | Storage Energy | Source
Hard drive | ~1 TB / 100 g | 5–10 years | 6–8 W continuous | –
LTO-9 tape | 18 TB / cartridge | ~30 years | Climate-controlled | –
DNA | 215 PB / gram | 10,000+ years | Zero (ambient) | Erlich 2017; Zhirnov 2016

These density figures come from peer-reviewed literature. The theoretical maximum is 455 EB/gram (Zhirnov et al., Nature Materials, 2016). Practical density achieved in lab settings is 215 PB/gram (Erlich & Zielinski, Science, 2017). DNA storage remains expensive to write (~$0.01–0.10/nucleotide at scale) and slow to read (hours for sequencing). It is best suited for cold archival data that is written once and read rarely.
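A back-of-envelope using the figures above (0.76 bits/nt effective density, 300 nt oligos, $0.01–0.10 per nucleotide) shows why write cost dominates. `nucleotides_needed` is a hypothetical helper for illustration, not the pipeline's cost_estimator.rs:

```rust
// Rough synthesis sizing: nucleotides and oligos needed to store a
// payload at a given effective density, plus the quoted cost band.

fn nucleotides_needed(payload_bytes: u64, bits_per_nt: f64) -> u64 {
    ((payload_bytes as f64 * 8.0) / bits_per_nt).ceil() as u64
}

fn main() {
    let nt = nucleotides_needed(1_000_000, 0.76); // 1 MB payload
    let oligos = (nt + 299) / 300;                // 300 nt per oligo
    let (lo, hi) = (nt as f64 * 0.01, nt as f64 * 0.10);
    println!("{nt} nt, {oligos} oligos, ${lo:.0}-${hi:.0}");
    // ~10.5M nt and a six- to seven-figure synthesis bill for a single
    // megabyte: why DNA storage targets cold archives, not hot data.
    assert!(nt > 10_000_000 && nt < 11_000_000);
}
```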

07

Current Status & Limitations

What works, what doesn't, and what's next.

What Works

Limitations

Roadmap

08

References

[1]
Erlich, Y. & Zielinski, D. (2017). DNA Fountain enables a robust and efficient storage architecture. Science, 355(6328), 950–954. doi:10.1126/science.aaj2038
[2]
Church, G.M., Gao, Y. & Kosuri, S. (2012). Next-generation digital information storage in DNA. Science, 337(6102), 1628. doi:10.1126/science.1226355
[3]
Goldman, N. et al. (2013). Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435), 77–80. doi:10.1038/nature11875
[4]
Organick, L. et al. (2018). Random access in large-scale DNA data storage. Nature Biotechnology, 36(3), 242–248. doi:10.1038/nbt.4079
[5]
Grass, R.N. et al. (2015). Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie, 54(8), 2552–2555. doi:10.1002/anie.201411378
[6]
Zhirnov, V. et al. (2016). Nucleic acid memory. Nature Materials, 15(4), 366–370. doi:10.1038/nmat4594
[7]
Luby, M. (2002). LT codes. Proc. 43rd Annual IEEE Symposium on Foundations of Computer Science, 271–280. doi:10.1109/SFCS.2002.1181950
[8]
Ceze, L., Nivala, J. & Strauss, K. (2019). Molecular digital data storage using DNA. Nature Reviews Genetics, 20(8), 456–466. doi:10.1038/s41576-019-0125-3