DATA2DNA

Encoding Digital Data
Into Synthetic DNA

An open-source pipeline for archival-grade data storage in oligonucleotides, with triple-layer error correction and multi-stage compression. Written in Rust.

9,563 Lines of Rust
151 Tests Passing
300bp Oligo Length
2.0× Redundancy
RS(255,223) Error Correction
01

8-Stage Encoding Pipeline

From arbitrary binary data to synthesis-ready FASTA output. Each stage is independently tested and composable.

01 HyperCompress BWT + MTF + ZRLE preprocessing, then parallel ZSTD‑22 / Brotli‑11 trials. Best result wins.
02 Interleaved RS Reed‑Solomon RS(255,223) with cross‑oligo interleaving. Converts burst losses to single‑symbol errors.
03 Fountain Codes Hybrid systematic/LT code: systematic phase for baseline coverage, then Robust Soliton LT droplets (c=0.025, δ=0.001). Operates on binary data.
04 Transcoder 2‑bit encoding (A=00, C=01, G=10, T=11) with rotation cipher for GC balance. Converts RS‑protected binary to DNA bases.
05 Oligo Builder 300bp structured oligos: primers + index + payload + CRC‑32. Synthesis‑ready format.
06 Constraint Check GC content 40–60%, homopolymer ≤3, restriction enzyme screening, primer compatibility.
07 FASTA Output Standard FASTA format with embedded decode metadata. Compatible with Twist, IDT, GenScript.
08 Cost Estimation Per‑oligo pricing at current commercial rates. Projects cost under vendor scenarios.
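The stage 04 base mapping (A=00, C=01, G=10, T=11) is simple enough to sketch in std-only Rust. This is an illustrative reimplementation, not the pipeline's transcoder.rs, and it omits the rotation cipher used for GC balance:

```rust
// 2-bit transcoding sketch: A=00, C=01, G=10, T=11, MSB-first per byte.
const BASES: [char; 4] = ['A', 'C', 'G', 'T'];

/// Encode bytes to DNA, two bits per base.
fn encode(bytes: &[u8]) -> String {
    let mut s = String::with_capacity(bytes.len() * 4);
    for &b in bytes {
        for i in (0..4).rev() {
            s.push(BASES[((b >> (i * 2)) & 0b11) as usize]);
        }
    }
    s
}

/// Decode DNA back to bytes; None on a non-ACGT character
/// or a length that is not a multiple of 4 bases.
fn decode(dna: &str) -> Option<Vec<u8>> {
    if dna.len() % 4 != 0 {
        return None;
    }
    let mut out = Vec::with_capacity(dna.len() / 4);
    let mut acc = 0u8;
    for (i, c) in dna.chars().enumerate() {
        let v = match c {
            'A' => 0, 'C' => 1, 'G' => 2, 'T' => 3,
            _ => return None,
        };
        acc = (acc << 2) | v;
        if i % 4 == 3 {
            out.push(acc);
            acc = 0;
        }
    }
    Some(out)
}

fn main() {
    let dna = encode(b"Hi"); // 'H' = 0x48 = 01 00 10 00 -> "CAGA"
    assert_eq!(dna, "CAGACGGC");
    assert_eq!(decode(&dna).as_deref(), Some(&b"Hi"[..]));
}
```

Four bases per byte gives exactly the 2.00 bits/nucleotide raw density quoted above.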

Oligo Structure — 300bp

FWD PRIMER
INDEX
PAYLOAD
CRC-32
REV PRIMER
20 bp
16 bp
228 bp (76% payload efficiency)
16 bp
20 bp
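Every assembled oligo must pass the stage 06 constraints (GC content 40–60%, homopolymer runs ≤ 3). A minimal sketch of those two checks, omitting restriction-enzyme screening and primer compatibility; these helpers are illustrative, not the pipeline's dna_constraints.rs:

```rust
// Stage 06 sketch: the two sequence-level constraints described above.

/// Fraction of G/C bases in the oligo.
fn gc_fraction(oligo: &str) -> f64 {
    let gc = oligo.chars().filter(|c| matches!(c, 'G' | 'C')).count();
    gc as f64 / oligo.len() as f64
}

/// Length of the longest run of identical bases.
fn max_homopolymer(oligo: &str) -> usize {
    let (mut max, mut run, mut prev) = (0, 0, None);
    for c in oligo.chars() {
        run = if Some(c) == prev { run + 1 } else { 1 };
        prev = Some(c);
        max = max.max(run);
    }
    max
}

/// GC content 40-60% and homopolymer runs of at most 3.
fn passes_constraints(oligo: &str) -> bool {
    let gc = gc_fraction(oligo);
    (0.40..=0.60).contains(&gc) && max_homopolymer(oligo) <= 3
}

fn main() {
    assert!(passes_constraints("ACGTACGTACGT"));  // GC = 50%, runs of 1
    assert!(!passes_constraints("AAAAGCGCGCGC")); // homopolymer run of 4
    assert!(!passes_constraints("ATATATATATAT")); // GC = 0%
}
```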
02

Performance Data

Measured values from 151 automated tests (70 unit, 81 integration). All numbers are reproducible via cargo test.

Compression Ratio by Data Type (HyperCompress Engine)
CSV / TSV: 8–16×
JSON: 6–12×
SQL dump: 10–14×
Source code: 4–8×
Plain text: 3–6×
(observed ranges; typical values fall inside each band)

Compression ratios measured with BWT+MTF+ZRLE preprocessing → BPE tokenization → parallel ZSTD-22 and Brotli-11 trials. Range depends on data redundancy. Optimized for text-based formats; binary/pre-compressed data sees minimal or no compression.
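The MTF step in that preprocessing chain is compact enough to sketch. An illustrative std-only version, not the pipeline's hypercompress.rs: after BWT clusters similar symbols, MTF maps recently seen bytes to small indices, producing the long runs of zeros that ZRLE then collapses.

```rust
// Move-to-front transform: self-inverse encode/decode pair over a
// 256-entry symbol table.

fn mtf_encode(data: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0..=255).collect();
    data.iter()
        .map(|&b| {
            let i = table.iter().position(|&x| x == b).unwrap();
            let sym = table.remove(i);
            table.insert(0, sym); // most recent symbol moves to index 0
            i as u8
        })
        .collect()
}

fn mtf_decode(codes: &[u8]) -> Vec<u8> {
    let mut table: Vec<u8> = (0..=255).collect();
    codes.iter()
        .map(|&i| {
            let sym = table.remove(i as usize);
            table.insert(0, sym);
            sym
        })
        .collect()
}

fn main() {
    let input = b"aaabbbccc";
    let coded = mtf_encode(input);
    // Repeated bytes become zeros -- exactly what ZRLE compresses.
    assert_eq!(coded, vec![97, 0, 0, 98, 0, 0, 99, 0, 0]);
    assert_eq!(mtf_decode(&coded), input.to_vec());
}
```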

Redundancy vs. Recoverable Oligo Loss (Fountain Codes)
[Chart: maximum recoverable oligo loss (0–100%) vs. fountain-code redundancy factor (1.0×–3.5×). Two curves: theoretical limit 1 − 1/r and practical Robust Soliton recovery; DATA2DNA default marked at 2.0× redundancy, ~30% loss.]

Theoretical limit: loss = 1 − 1/redundancy. Practical recovery is slightly lower due to peeling decoder overhead in Robust Soliton distribution (c=0.025, δ=0.001, per Erlich & Zielinski 2017). At 2.0× redundancy, DATA2DNA survives ~30% oligo loss in tests.
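The theoretical ceiling quoted above is one line of code; `theoretical_max_loss` is an illustrative helper, not part of the codebase:

```rust
// With redundancy factor r, at most a fraction 1 - 1/r of oligos can be
// lost while still leaving one full copy's worth of droplets to decode.
fn theoretical_max_loss(redundancy: f64) -> f64 {
    1.0 - 1.0 / redundancy
}

fn main() {
    assert_eq!(theoretical_max_loss(2.0), 0.5); // ceiling: 50% at 2.0×
    assert_eq!(theoretical_max_loss(1.0), 0.0); // no redundancy, no losses
    // Practical Robust Soliton recovery at 2.0× lands near 30%,
    // below the 50% ceiling, matching the test results above.
}
```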

Test Suite — 151 / 151 Passing
Unit: 70 · Integration: 81 · Total: 151 (all passing, 0 failures). Runtime ~21 s (unit 8 s + integration 13 s).
Encoding Efficiency
Raw encoding density: 2.00 bits / nucleotide
Payload efficiency (228 / 300 bp): 76%
Effective density with 2.0× redundancy: 0.76 bits / nucleotide stored
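The effective figure follows directly from the other two; a one-line check (illustrative helper, not from the codebase):

```rust
// Effective stored density = raw density × payload efficiency ÷ redundancy.
fn effective_density(raw_bits_per_nt: f64, payload_eff: f64, redundancy: f64) -> f64 {
    raw_bits_per_nt * payload_eff / redundancy
}

fn main() {
    let d = effective_density(2.00, 228.0 / 300.0, 2.0);
    assert_eq!(d, 0.76); // bits per nucleotide actually stored
}
```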
03

Triple-Layer Error Correction

DNA synthesis, storage, and sequencing each introduce distinct error types. Three independent correction layers ensure integrity under realistic conditions.

I
CRC-32 Per-Oligo Detection
Flags corrupt oligos before RS decoding. 16 bp field per oligo. False positive rate < 2.3 × 10⁻⁸.
< 10⁻⁸ false positive rate
II
Interleaved Reed-Solomon RS(255,223)
GF(2⁸) arithmetic with Berlekamp–Massey decoder. Cross-oligo interleaving converts burst oligo losses into single-symbol errors; RS then corrects up to 16 per 255-symbol block.
16 errors / block
III
Fountain / LT Codes
Robust Soliton distribution (c=0.025, δ=0.001) with peeling decoder. 2.0× redundancy tolerates ~30% oligo loss. Based on Erlich & Zielinski 2017.
~30% oligos lost
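Layer II's cross-oligo interleaving amounts to a transposition: codewords are written row-major and each oligo carries one column. A toy sketch (`interleave` is illustrative, not interleaved_rs.rs):

```rust
/// Spread codewords of `width` symbols across `width` oligos:
/// oligo j holds symbol j of every codeword, so losing one whole
/// oligo costs each codeword exactly one symbol instead of a burst.
fn interleave(codewords: &[Vec<u8>], width: usize) -> Vec<Vec<u8>> {
    (0..width)
        .map(|j| codewords.iter().map(|cw| cw[j]).collect())
        .collect()
}

fn main() {
    let codewords = vec![vec![1u8, 2, 3], vec![4, 5, 6]];
    let oligos = interleave(&codewords, 3);
    assert_eq!(oligos, vec![vec![1, 4], vec![2, 5], vec![3, 6]]);
    // Dropping oligos[1] erases symbol 1 from each codeword: one erasure
    // per 255-symbol block, well inside RS(255,223)'s 16-error budget.
}
```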
// Redundancy math: fraction of the original data still covered after loss
let redundancy = 2.0;
let loss_rate = 0.30;
let surviving = redundancy * (1.0 - loss_rate); // 2.0 × 0.7 = 1.40
// 1.40 ≥ 1.0 → decodable, with a 40% safety margin
04

Codebase

9,563 lines of Rust across 15 modules. No unsafe code. Parallel computation via Rayon. Actix-Web 4 HTTP server with SSE progress reporting.

Lines of Code by Module
hypercompress.rs: 2,480 lines
main.rs: 1,584 lines
pipeline.rs: 1,122 lines
compressor.rs: 663 lines
dna_constraints.rs: 574 lines
oligo_builder.rs: 544 lines
fountain.rs: 538 lines
reed_solomon.rs: 458 lines
interleaved_rs.rs: 387 lines
fasta.rs: 328 lines
cost_estimator.rs: 295 lines
transcoder.rs: 283 lines
chaos.rs: 185 lines
consensus.rs: 99 lines

Technical Specifications

Parameter | Value | Notes
Oligo length | 300 bp | Twist/IDT synthesis compatible
Payload per oligo | 228 bp | 300 − 72 bp overhead
Payload efficiency | 76% | 228 / 300
RS code | RS(255,223) | 32 parity symbols, 16-error correction per block
GF polynomial | 0x11D | x⁸ + x⁴ + x³ + x² + 1
Fountain distribution | Robust Soliton | c=0.025, δ=0.001 (DNA Fountain params)
Block size | 64 bytes | RS alignment
Default redundancy | 2.0× | Survives ~30% loss
Primers | 20 bp × 2 | Standard PCR amplification
GC content target | 40–60% | Synthesis optimization
05

Context in the Field

Published DNA storage systems and how DATA2DNA relates. Note: direct comparison is limited since the following systems include wet-lab validation and DATA2DNA is currently simulation-only.

System | Year | Bits/nt | Error Correction | Validation
Church, Gao & Kosuri | 2012 | ~0.83 | Repetition encoding | Wet lab
Goldman et al. | 2013 | ~0.33 | Fourfold redundancy | Wet lab
Erlich & Zielinski (DNA Fountain) | 2017 | 1.57 | RS + Fountain codes | Wet lab
Organick et al. (Microsoft/UW) | 2018 | ~1.10 | RS + repetition | Wet lab, 200 MB
DATA2DNA | 2025 | 0.76* | CRC + IRS + Fountain | Simulation only

* 0.76 bits/nt effective with 2.0× redundancy (2.00 bits/nt raw encoding, 76% payload efficiency, halved by redundancy). Compression can multiply effective throughput on text data but does not change physical nt density. DATA2DNA has not yet been validated with physical DNA synthesis and sequencing.

06

Why DNA Storage?

DNA is the densest known information storage medium. At ambient temperature in a sealed container, it requires zero energy to maintain.

Medium | Density | Durability | Storage Energy | Source
Hard drive | ~1 TB / 100 g | 5–10 years | 6–8 W continuous | –
LTO-9 tape | 18 TB / cartridge | ~30 years | Climate-controlled | –
DNA | 215 PB / gram | 10,000+ years | Zero (ambient) | Erlich 2017; Zhirnov 2016

These density figures come from peer-reviewed literature. The theoretical maximum is 455 EB/gram (Zhirnov et al., Nature Materials, 2016). Practical density achieved in lab settings is 215 PB/gram (Erlich & Zielinski, Science, 2017). DNA storage remains expensive to write (~$0.01–0.10/nucleotide at scale) and slow to read (hours for sequencing). It is best suited for cold archival data that is written once and read rarely.
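A back-of-envelope using the figures above (0.76 bits/nt effective density, 300 nt oligos, $0.01–0.10 per nucleotide) shows why write cost dominates. `nucleotides_needed` is a hypothetical helper for illustration, not the pipeline's cost_estimator.rs:

```rust
// Rough synthesis sizing: nucleotides and oligos needed to store a
// payload at a given effective density, plus the quoted cost band.

fn nucleotides_needed(payload_bytes: u64, bits_per_nt: f64) -> u64 {
    ((payload_bytes as f64 * 8.0) / bits_per_nt).ceil() as u64
}

fn main() {
    let nt = nucleotides_needed(1_000_000, 0.76); // 1 MB payload
    let oligos = (nt + 299) / 300;                // 300 nt per oligo
    let (lo, hi) = (nt as f64 * 0.01, nt as f64 * 0.10);
    println!("{nt} nt, {oligos} oligos, ${lo:.0}-${hi:.0}");
    // ~10.5M nt and a six- to seven-figure synthesis bill for a single
    // megabyte: why DNA storage targets cold archives, not hot data.
    assert!(nt > 10_000_000 && nt < 11_000_000);
}
```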

07

Current Status & Limitations

What works, what doesn't, and what's next.

What Works

Limitations

Roadmap

08

References

[1]
Erlich, Y. & Zielinski, D. (2017). DNA Fountain enables a robust and efficient storage architecture. Science, 355(6328), 950–954. doi:10.1126/science.aaj2038
[2]
Church, G.M., Gao, Y. & Kosuri, S. (2012). Next-generation digital information storage in DNA. Science, 337(6102), 1628. doi:10.1126/science.1226355
[3]
Goldman, N. et al. (2013). Towards practical, high-capacity, low-maintenance information storage in synthesized DNA. Nature, 494(7435), 77–80. doi:10.1038/nature11875
[4]
Organick, L. et al. (2018). Random access in large-scale DNA data storage. Nature Biotechnology, 36(3), 242–248. doi:10.1038/nbt.4079
[5]
Grass, R.N. et al. (2015). Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angewandte Chemie, 54(8), 2552–2555. doi:10.1002/anie.201411378
[6]
Zhirnov, V. et al. (2016). Nucleic acid memory. Nature Materials, 15(4), 366–370. doi:10.1038/nmat4594
[7]
Luby, M. (2002). LT codes. Proc. 43rd Annual IEEE Symposium on Foundations of Computer Science, 271–280. doi:10.1109/SFCS.2002.1181950
[8]
Ceze, L., Nivala, J. & Strauss, K. (2019). Molecular digital data storage using DNA. Nature Reviews Genetics, 20(8), 456–466. doi:10.1038/s41576-019-0125-3