Encoding method | Explanation

Fountain codes (Online): Online codes, proposed by Maymounkov, improve on LT codes. They address the main problem of LT codes, namely that LT codes cannot guarantee that after (1 + ε)·n generated packets, all n parts of the original message have been encoded sufficiently. Online codes follow a two-stage approach. First, auxiliary blocks are created in the so-called outer encoding; then, together with the message chunks, they are encoded into packets during the inner-encoding step. In particular, during outer encoding, M auxiliary (AUX) packets are created, and each chunk is randomly mixed into q different AUX packets using XOR. Like the chunks of the file, the resulting M AUX packets are treated as regular inputs in the inner-encoding step. In this phase, a packet is generated by randomly selecting a degree d from the distribution function; d unique, uniformly distributed elements are then selected from the union of the set of chunks and AUX blocks and mixed into this packet. This step is repeated indefinitely or until a predefined condition occurs, e.g., reaching a certain number of created packets or successful decoding by a decoder running simultaneously.

Fountain codes (Raptor): Raptor codes, developed by Shokrollahi, are the first fountain codes with (theoretically) linear encoding and decoding time. As the name Raptor (for "rapid tornado") suggests, they are based on tornado codes, which can correct erasures at a fixed error rate. These codes require a constant number of blocks more than Reed–Solomon codes but are much faster in encoding and decoding. Meanwhile, there are several variants of the original method, e.g., R10 or the RaptorQ code. Depending on the procedure, purpose, and area of use, various patents may apply.
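The two-stage Online-code encoder described above can be sketched as follows. This is a minimal, illustrative version over integer chunks: the degree distribution, M, and q are toy values, and packet indices stand in for the reconstruction seed.

```python
import random

def online_encode(chunks, m_aux=2, q=1, num_packets=8, seed=7):
    """Two-stage Online-code sketch over integer chunks (XOR-combinable)."""
    rng = random.Random(seed)
    # Outer encoding: mix each chunk into q of the M auxiliary blocks via XOR.
    aux = [0] * m_aux
    for c in chunks:
        for i in rng.sample(range(m_aux), q):
            aux[i] ^= c
    # Inner encoding: the composite message is the chunks plus the AUX blocks.
    blocks = list(chunks) + aux
    packets = []
    for _ in range(num_packets):
        d = rng.choice([1, 1, 2, 3])              # toy degree distribution
        picked = rng.sample(range(len(blocks)), d)
        payload = 0
        for i in picked:
            payload ^= blocks[i]
        packets.append((tuple(picked), payload))  # indices play the role of the seed
    return packets
```

In a real deployment the loop would run until a stop condition, e.g., a concurrently running decoder reports success.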
(l, e)-constrained encoder: We present an (l, e)-constrained encoder that translates binary data into DNA strands that are run-length limited and e-balanced for arbitrary e, l > 0. These novel and efficient encoders translate binary data into strands of nucleotides that satisfy the RLL constraint and the GC-content constraint and are capable of correcting a single deletion, single insertion, or single substitution, the most common errors in the DNA storage channel. Our proposed codes achieve higher rates than previous results, approach the capacity, have low encoding/decoding complexity, and limit error propagation.

Wukong algorithm: To convert data into DNA sequences, a novel codec algorithm called "Wukong" was developed. It is capable of encoding DNA into diverse lengths with controlled homopolymer runs and regionally even GC content while maintaining a high coding potential and offering a unique codec pin library. On top of this, an error-correction function, achieved by integrating "XOR" redundancy with an error-correction code, was added. Finally, a specially devised "flanking sequences" generation process was used to achieve random-access functionality. In addition, a sequencing analysis pipeline for decoding the DNA-stored data was established and attached to this platform. The "Wukong" codec system starts from the construction of a codec rule library. A pool of 20,922,789,888,000 codec rules (16!, the number of bijections between the 16 four-bit values and the 16 dinucleotides) was generated from the permutations of the codec mapping between four binary bits and two nucleotides. Then, two coding rules were randomly selected from the library to encode the data. Four binary bits were regarded as one coding unit, whether they came from a single binary data string or from several distinct binary data strings. One coding rule was used for the (2n − 1)th coding unit, and the other was used for the 2nth coding unit.
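The alternating two-rule mapping just described can be sketched as below; this is an illustrative reconstruction (rule generation and function names are assumptions, not the published implementation), with each rule a random bijection from four-bit values to dinucleotides.

```python
import itertools
import math
import random

DINUCLEOTIDES = ["".join(p) for p in itertools.product("ACGT", repeat=2)]  # 16 pairs

def random_rule(rng):
    """One codec rule: a random bijection from the 16 four-bit values to dinucleotides."""
    perm = DINUCLEOTIDES[:]
    rng.shuffle(perm)
    return {format(v, "04b"): perm[v] for v in range(16)}

def wukong_encode(bits, rule_a, rule_b):
    """Apply rule_a to odd-numbered coding units and rule_b to even-numbered ones."""
    units = [bits[i:i + 4] for i in range(0, len(bits), 4)]
    return "".join((rule_a if i % 2 == 0 else rule_b)[u] for i, u in enumerate(units))
```

Note that 16! = 20,922,789,888,000, matching the stated size of the rule pool.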
Notably, in this way, two, three, or four binary fragments could be encoded into one DNA sequence string, which creates a rather free conversion between binary data and DNA sequences. To encode data with "Wukong," computer files were first converted into binary bits and then chopped into fragments of the desired lengths. Binary fragments were randomly selected and encoded using two rules drawn from the "Wukong" codec reservoir. If the encoded DNA sequence met the preset GC and homopolymer-run constraints, the selected binary fragments were deleted from the original binary fragment pool; fragments that failed to generate desirable sequences were put back into the pool. Occasionally, the selected binary fragments could not generate a desirable sequence within 30 repeats, in which case a virtual fragment was added to produce one. Eventually, DNA sequences with the desired sequence properties were designed.

15-ary Huffman: The Base64 algorithm can transform binary information into DNA sequences with controlled GC content and reduced homopolymers. However, for information containing a large number of repeats, the Base64 algorithm cannot shorten the DNA sequence to improve coding efficiency, and homopolymers may still appear. Therefore, the 15-ary Huffman algorithm was developed to compress the information, reducing the length of the DNA sequence and avoiding homopolymers.

ASCII: The DNA data encoding and decoding processes were developed using MATLAB 2021a (MathWorks, Inc., Natick, MA, USA). For the ASCII letter encoding of words, the text was represented as ASCII codes converted into binary information and combined with RS ECC encoding for seven-bit error correction. The binary information of the words was translated into the four DNA bases (A, T, C, and G).
The encoded DNA letters were aligned with primer information for amplification and ligation. For the codon-based encoding of pixel images (two pixels per codon), 16 × 16 pixel image frames with four color states per pixel were decoded and linearly aligned; each image frame was decoded using the same process. The four color states were represented as 00, 01, 10, and 11, and this binary information was translated into codon DNA bases covering two pixels each. The encoded DNA sequences were fragmented with index sequences and aligned using primer information for the DNA data oligo pools.

Base64: The encoding algorithm includes three steps: I) converting the text into Base64 code, which contains 64 different printable characters; II) reshaping and converting the Base64 code into two groups of 8-bit binary data, one of which is balanced by a specific code; III) a hybrid encoding that maps the balance code and the binary code into DNA sequences according to a customized mapping rule. The homopolymers and the GC content of the DNA sequence are controlled by the balance code and the customized mapping rule, which effectively reduces the sequencing error rate.

Bite-base: We encoded the book by converting the draft to HTML format (with embedded JPG images). Once made, we read the book in bit form and converted individual bits to A or C for 0 and T or G for 1. Bases were chosen randomly while disallowing homopolymer runs greater than three. Addresses of the bitstream were 19 bits long and numbered consecutively, starting from 0000000000000000001. The script Bits2DNA.pl (see code section) is the program we wrote for encoding the HTML file into DNA segments.

Bitmap image encoding: In a standard 24-bit bitmap image file, each pixel corresponds to 3 bytes of information: the red, green, and blue intensities on a scale of 0 to 255. Five nucleotides are needed to encode each byte, so 15 nt are needed to encode each pixel color.
Given the synthesis limitations of Twist, the lengths of the library primers, and the length of the truth-marker binding site, we have roughly 70 nt per DNA oligonucleotide for the address and the message. This corresponds well to a 2-byte address (10 nt) and a block of 4 pixels (60 nt). Consequently, we organize each bitmap image into at most 255 × 255 blocks of 2 × 2 pixels. For the pixel block corresponding to the four pixels with coordinates (99, 99), (99, 100), (100, 99), and (100, 100), we would encode both the X and the Y as 50.

BMVO: BMVO is a new algorithm based on Brownian motion, the Nelder–Mead method, and the MVO algorithm. BMVO uses Brownian motion and the Nelder–Mead method to remedy the slow iteration and the single update mechanism of the MVO algorithm. A quantitative comparison with mainstream heuristic algorithms (DMVO, GA, GWO, GSA, and PSO) and the results on the test functions illustrate the superior performance of the BMVO algorithm.

BO-DNA: This work proposes mapping rules, optimized with a heuristic algorithm, for constructing highly reliable and stable DNA sequences. The proposed mechanism is called BO-DNA (Biologically Optimized encoding model for DNA). The encoding is based on the proposed mapping rules, a chaotic-based customized moth-flame optimizer (CMF), and biological combinatorial constraints. BO-DNA's main contribution can be categorized into the following three stages: 1) DNA mapping rules: conversion of digital data into binary bits and mapping of these bits onto DNA nucleotides by the proposed mapping rules; 2) optimization of encoding: use of a tent chaotic map with the customized optimizer CMF to improve the lower bounds of the DNA sequences within a maximum number of iterations; 3) bio-coding constraints: calculation of biological combinatorial constraints in order to generate optimal and reliable DNA sequences. The digital data are converted into binary format using a Python function.
The binary data are divided into N groups based on their size, and each group is then divided into n segments based on the group size. Each data group can therefore be viewed independently, making it easy to read with the proposed mapping rules.

Bose–Chaudhuri–Hocquenghem (BCH): The original information (binary numbers) was first divided into 224 groups. In each group, BCH(255, 207) was used to generate a 255-bit BCH codeword from 207 information bits based on a generator polynomial. Finally, one parity bit was added to the end of each group's codeword so that the Hamming weight (number of 1s) of the codeword in each group was even. In this way, each group's codeword corresponded to exactly 128 bases, and the 224 groups of BCH code used to store the information resulted in 128 × 224 = 28,672 bp of DNA sequence.

Chamaeleo coding, Galois field and base triplet association method: The encoding process first reads in each test file as binary data, then divides it into fragments, adds an index to identify the position of each fragment, encodes it into a DNA sequence with the selected encoding algorithm, and finally adds the designed primers at both ends. As the error problem is not the focus of this paper, no error-correction code field is added to the fragment.

CLGBO: Ahmadianfar et al. (2020), inspired by the gradient-based Newtonian approach, developed GBO, a powerful and efficient algorithm that combines gradient-based and metaheuristic approaches. GBO is mainly composed of a gradient search rule (GSR) and a local escape operator (LEO). The GSR uses a gradient-based approach to enhance the exploration capability of the algorithm and speed up convergence, obtaining better positions in the search space. The LEO is mainly used to improve the efficiency of GBO on complex problems and to escape from local optima.
The basic GBO algorithm is extended with two innovations, a Cauchy mutation strategy and a Lévy mutation strategy, to improve its overall optimization performance. Mutation is a common method for evolutionary algorithms to produce new individuals and can effectively enrich the population; however, a single mutation operator can hardly balance the exploration and exploitation capabilities of the algorithm. Therefore, an algorithm combining two mutation operators is proposed to alleviate the lack of population diversity, the imbalance between exploitation and exploration, and the premature and slow convergence of the GBO algorithm.

Code book: The code-book strategy encodes the common characters with 30 six-base strings, unlike previous works that encode 0/1 binary strings directly with A, T, C, and G. Four shadowed codes are uniquely assigned to the frequently used space character, and three shift keys switch to upper-case letters, digits, and punctuation, respectively. This code book can encode at most 26 × 3 + 4 = 82 characters, which corresponds to a code density of 1.06 bit/base.

Concatenated codes: Our encoding process starts by randomizing the data to reduce the chance of secondary structures and of primer–payload non-specific binding, and to improve properties during decoding. It then breaks the data into fixed-size payloads, adds addressing information (Addr), and applies outer coding, which adds redundant sequences using a Reed–Solomon code to increase robustness to missing sequences and errors. The level of redundancy is determined by the expected errors in sequencing and synthesis, as well as DNA degradation. Next, it applies inner coding, which ultimately converts the bits to DNA sequences. The resulting set of sequences is surrounded by a primer pair chosen from the library based on its (low) level of overlap with the payloads.
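The randomize / split / address / inner-code stages of the concatenated pipeline above can be sketched as follows. This is a toy sketch under stated assumptions: the Reed–Solomon outer code is omitted, the scrambler is a seeded XOR mask, and payload and address sizes are illustrative.

```python
import random

def encode_pipeline(data: bytes, payload_len=3, seed=1):
    """Toy concatenated pipeline: randomize, split into payloads, prepend a
    2-byte address, then inner-code 2 bits per nucleotide. The Reed-Solomon
    outer code of the real pipeline is omitted for brevity."""
    rng = random.Random(seed)
    mask = bytes(rng.randrange(256) for _ in data)
    scrambled = bytes(a ^ b for a, b in zip(data, mask))   # randomization step
    strands = []
    for addr, start in enumerate(range(0, len(scrambled), payload_len)):
        framed = addr.to_bytes(2, "big") + scrambled[start:start + payload_len]
        bits = "".join(format(byte, "08b") for byte in framed)
        # Inner coding: map each 2-bit group to one of the four bases.
        strands.append("".join("ACGT"[int(bits[i:i + 2], 2)]
                               for i in range(0, len(bits), 2)))
    return strands
```

Each strand would then be flanked by a primer pair selected for low overlap with the payloads.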
Constrained coding: The encoding stage involves compression, representation conversion, encoding into DNA, and subsequent synthesis. Each synthesized DNA codeword is equipped with one or two addresses. The encoding phase entails constrained coding, which limits the occurrence of the address block to one predefined position in the codeword, and GC-content balancing of each substring of eight bases. Additional homopolymer checks are added directly into the string or stored on classical media; they correspond to only 0.02% of the data content.

Constrained WMU code: Previous studies did not consider the maximum-run-length constraint for k-WMU codes, which can affect the accuracy and reliability of primers in DNA storage systems. To address this issue, our work designs codes that combine the k-WMU code with the maximum-run-length constraint. Moreover, previous studies indicated that errors during primer synthesis and sequencing can be reduced to a greater extent when multiple beneficial constraints are satisfied simultaneously. Thus, based on the k-WMU code with the maximum-run-length constraint, we present simple ways to design a class of WMU codes that satisfy the maximum-run-length constraint and the other proposed constraints simultaneously. We also provide explicit constructions for the proposed codes, considering that certain explicit code constructions were not given in earlier work. We summarize the main contributions of our work as follows. 1) As the WMU codes proposed in previous works for primer design did not consider the crucial maximum-run-length constraint, we propose new k-WMU codes with a maximum-run-length constraint for both k = 2 and general integers k, and we provide explicit constructions for the proposed codes.
2) Based on our proposed WMU codes with a maximum-run-length constraint, we can enhance the codes through simple methods to satisfy combinations of the maximum-run-length constraint with other constraints beneficial for primer design, such as GC balance and Hamming distance. 3) To apply our codes to primer design in DNA storage, we modify our proposed WMU code with the maximum-run-length constraint to support various code lengths.

Content-based barcodes (CBBs) encoding: For an object of a table, a CBB is derived from the corresponding row and column. For example, the CBB encoding Id=100 would correspond to the value 100 of column Id, where Id is the primary key; a CBB encodes the value and its attribute's name. Our encoding algorithm consists of the following five steps. 1) Mapping: the first step is to merge (and reverse) the desired value and its attribute's name into a single string. For a value 100 in the column Id, we create the string "001=dI". The reversal is performed because the DNA converters (step 3) are sensitive to changes in the first characters and thus produce different DNA even if the inputs are similar. 2) Compression: the string obtained from the mapping step is then compressed byte by byte with a static Huffman code to produce a string of Huffman symbols. Since CBBs are unambiguous, the same attribute value must lead to the same CBB, and the frequencies used are fixed for all CBBs; our frequencies are available in the supplementary material. 3) DNA conversion: the resulting Huffman symbols are then mapped to DNA bases. There are many ways to convert these symbols to DNA bases, depending on the Huffman code's base; we discuss the examined DNA converters RotatingTre, RotatingQuattro, Bin, and NaiveQuattro in the supplementary material. Figure 4 illustrates our converter RotatingQuattro, which requires compressing "001=dI" with a Huffman code of base 4. 4) DNA padding (GC-content optimization):
In this step, the sequence seq obtained from step 3 is padded using our function Pad(seq, L). This is illustrated for L = 12 in Fig. 4; note that in this example L_CBB = 14 because the permutation's seed offset is added (see next step). 5) Permutation (avoidance of long homopolymers): after the padding step, the obtained sequence has a GC content of ≈50% but could still suffer from long homopolymers and a non-uniform distribution of Gs and Cs. To further fulfill constraint C2, we permute the sequence as detailed above.

Converter of digital information and genetic code: To fully use the large data storage capacity of DNA storage technology, it is crucial to realize an efficient conversion between DNA code and digital information, as well as to improve the storage density of the DNA container. To this end, we developed a Python program that efficiently converts digital information into DNA code and vice versa.

Damping Multi-Verse Optimizer (DMVO) algorithm: The Damping Multi-Verse Optimizer (DMVO) algorithm constructs a DNA storage coding set that satisfies the combined constraints. The DMVO algorithm improves the Multi-Verse Optimizer (MVO) algorithm by employing a damping factor and Lévy flight: the damping factor improves the diversity of the population, and Lévy flight reduces the dependence on the current optimal individual during updating.

Derrick: Derrick (de-error-rick) utilizes the RS code, the most commonly used ECC for DNA data storage (DDS), and introduces the soft-decision decoding strategy from communication engineering. Because of the differences between DNA sequences and binary sequences, the soft-decision decoding strategies used in communication cannot be applied to DDS directly.
As mentioned in a previous review, "unlike other storage channels, which have only substitution errors, DNA channels can also manifest base insertions and deletions, which makes coding more challenging." To address this issue, an error model was designed to describe the different types of errors, including substitutions, insertions, and deletions, and a series of novel strategies was proposed in Derrick.

DMOS encoder: DMOS coding and encoding scheme. The DMOS encoder converts binary files into byte arrays with codewords. It then adds error correction, transduces the coding protocol to the DMOS writer, and changes the state of domains from 0 to 1 on the DNA tape.

DNA palette code: The DNA palette code is a "bit-to-oligo" mapping approach grounded in combinatorial theory. Its fundamental idea is to establish a bijection between X^N, the range of the raw information, and O, the family of oligo sets. Here, we use X = {0, 1} to denote the binary alphabet and D = {A, T, G, C} to represent the four natural DNA nucleobases: adenine (A), cytosine (C), guanine (G), and thymine (T). The codeword corresponding to x ∈ X^N is a subset of the oligo set. Metaphorically, the DNA palette code regards each oligo as a pigment, distinguishing different binary strings by coloring them with distinct colors, and it enables the mixing of different pigments to create new colors.

DNA prefix-synchronized codes: In this section we describe the data format used for encoding two files of size 17 KB containing the introductory sections of the Wikipedia pages of six universities: Berkeley, Harvard, MIT, Princeton, Stanford, and the University of Illinois Urbana-Champaign. There were 1,933 words in the text, of which 842 were distinct. Note that in our context, words are elements of the text separated by a space: for example, "university" and "university." are counted as two different words, while "Urbana-Champaign" is counted as a single word.
These 1,933 words were mapped to ⌈1933/72⌉ = 27 DNA blocks of length 1,000 bp, as we grouped six words into a fragment and combined 12 fragments for prefix-synchronized encoding.

DNA shift algorithm: In the fountain code, strands that do not meet the biological constraints are directly discarded, resulting in a lower information density of the encoding. We add "mod" and "iterate" bits to enhance the information density: through DNA shift encoding, as many strands as possible are transformed into ones that satisfy the biological constraints, thereby increasing the coding information capacity. Additionally, the core of DNA shift encoding lies in cyclic modulo verification, which keeps the algorithmic complexity low; the time complexity of the encoding process is O(n), making it efficient and practical.

DNA-Aeon: Our code consists of an outer fountain code and an inner code that resembles an arithmetic code with a switched encoder and decoder, i.e., our encoder uses principles of arithmetic decoding. The outer code uses the Raptor fountain-code implementation of NOREC4DNA in binary mode, which can generate all possible packets in a given seed range. Users can define either the number of packets generated from a file or the size of the individual packets. Furthermore, an optional header chunk is supported, containing meta-information such as the filename, permissions, and the padding of the last packet. The inner encoder takes as input the packets generated by the outer code, a FASTA file, and a concatenation scheme of codewords. One option for generating this additional input is the mCGR approach of the ConstrainedKaos tool, which generates codewords with user-defined constraints, namely GC content, homopolymers, and undesired motifs. A model with transition probabilities is generated using the codebook.
In essence, the inner encoder is an arithmetic decoder, treating the input as a compressed representation of a constraint-adhering DNA sequence. Using source decoding to encode data into a run-length-limited representation was previously described by Dubé et al. [30]. Each fountain-encoded packet is logically split into subpackets of the same, user-defined size, and a cyclic redundancy check (CRC) is assigned to each subpacket. These CRCs serve as verification and synchronization markers for the decoder, increasing the substitution-correction performance and allowing the correction of synchronization errors (insertions and deletions). The approach of using marker symbols for error detection in joint source and channel coding was previously described by Elmasry.

DNA-Crypt: The DNA-Crypt algorithm is based on small redundant regions, comparable to least-significant bits in image steganography: the least-significant bits encode a difference of just one step on the colour scale, invisible to the human eye, and can be used to hide information in images. DNA-Crypt encodes binary information using a substitution cipher E : X → Y with X = {xy : x, y ∈ {0₂, 1₂}} and Y = {A, C, G, T}. For example, the binary sequence 0111 0010 0100 1111₂ would be encoded as E(0111001001001111₂) = GATCGTAA. Two bits are encoded by one base, so one byte needs four bases.

DNA-GB2312 code table: Currently, the widely adopted character encoding for computers is UTF-8, which encodes common Chinese characters with 24 bits, a space of 16,777,216 values. However, the commonly used Chinese characters number only about 5,000 in total, so relying solely on the UTF-8 character set would lead to low information density when encoding Chinese characters. To address this issue, we have chosen the Information Exchange Chinese Character Encoding Character Set GB2312 (which includes 6,763 Chinese characters and 682 non-Chinese graphic characters).
We establish the DNA-GB2312 code table by using a seven-base codeword (4^7 = 16,384 ≥ 7,445 characters) to represent one Chinese character.

DNA-LC: Constructed on the basis of the Levenshtein code, the DNA-LC code is GC-balanced and avoids homopolymers. It also has relatively stable error-correction ability, low time complexity, and high recognition ability for insertion and deletion errors; however, it is insensitive to consecutive errors. The proposed DNA-LC code is highly scalable and can easily be extended to hachimoji DNA with eight building bases to improve the storage density.

DNA-QLC: In the DNA-QLC encoding scheme, to improve the coding density of DNA storage systems, the scheme first applies QRes-VAE (quantized ResNet VAE) to compress images into several bitstreams. Then, the Levenshtein code (LC) is used to add check bits to the bitstreams, enabling correction of substitution, deletion, and insertion errors during DNA storage and thereby addressing weak error-correction ability. Finally, mapping rules encode the bitstreams into DNA sequences that meet combinatorial constraints (local GC content of 50% and homopolymers with a size of less than 2) and avoid undesired motifs (GAATTC and GGC), thus improving the robustness of the DNA sequences.

DRRC: There are still difficulties in developing coding algorithms, i.e., ensuring coding efficiency while satisfying specific constraints and establishing personalized coding for arbitrary constraints. In this article, we propose a DNA dual-rule rotational coding (DRRC) in which the two coding rules complement each other: the GC content can be stabilized at 50% through the control of chaotic sequences, and the rotational coding mode effectively avoids long homopolymers while keeping the time complexity low.
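The DNA-Crypt substitution cipher above can be stated compactly. This is a minimal sketch: the bit-to-base table is inferred from the entry's worked example, E(0111 0010 0100 1111) = GATCGTAA, and is not stated explicitly in the entry.

```python
# Substitution table inferred from the DNA-Crypt worked example:
# E(0111 0010 0100 1111) = GATCGTAA  implies  00->T, 01->G, 10->C, 11->A.
E = {"00": "T", "01": "G", "10": "C", "11": "A"}

def dna_crypt_encode(bits: str) -> str:
    """Encode two bits per base; one byte therefore costs four bases."""
    return "".join(E[bits[i:i + 2]] for i in range(0, len(bits), 2))
```

Decoding is the inverse table lookup, base by base.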
EORS: EORS is a combination of the Equilibrium Optimizer (EO) algorithm and the Random Search (RS) algorithm. The EO algorithm is inspired by a control-volume mass balance in hybrid dynamics, where the mass-balance equation describes the concentration of non-reactive substances in a control volume; the equation provides a fundamental physical account of the conservation of mass entering, leaving, and produced within a volume. Although the EO algorithm uses parameters such as α1 to enhance the exploration ability of the population, its population richness still decreases in later iterations, which greatly increases the probability of falling into a local optimum, especially when solving practical problems. For this reason, another mainstream search algorithm was added for the DNA code. Random search is a simple algorithm suitable for large-scale data, although it is slow to compute accurate results; RS refers to a family of numerical optimization methods that do not require the gradient of the optimization problem, so it can be used for discontinuous or non-differentiable functions. RS works by iteratively moving to better positions in the search space, sampled from the hypersphere around the current position. In this paper, the random search algorithm is used to process the output S_EO of the EO algorithm: RS expands the search scope of the EO algorithm and obtains a larger coding set. After initializing a set S, all codes in S and S_EO are checked one by one as to whether they satisfy the combinatorial constraints.

Explorer: Explorer is an efficient and low-energy DNA encoding algorithm based on the De Bruijn graph that can encode sequential data under arbitrary local and global biochemical constraints. We express the subsequences of length n within the encoded DNA sequence as vertices in a De Bruijn graph.
By ensuring that each vertex complies with the specified biochemical constraints, we can guarantee that the obtained DNA sequences adhere to the local biochemical constraints. Numerical experiments demonstrate that Explorer is more stable under various local constraint lengths, significantly reducing memory occupancy and improving encoding efficiency.

Fountain codes: The encoding algorithm works in three computational steps: (a) preprocessing, (b) Luby transform, and (c) screening. Its overall aim is to convert input files into a collection of valid DNA oligos that pass the biochemical constraints and can be sent to synthesis.

Fountain codes (LT): The LT (Luby transform) code proposed by Luby is considered the most fundamental and pioneering fountain code. LT divides the original file into n equally long "chunks" that are then combined into packets using the XOR operator. The encoding process can be summarized as follows: 1) choose the degree d from the probability distribution function p(·); 2) select d distinct chunks uniformly at random; 3) generate the resulting packet ti by combining all selected chunks using the XOR operator; 4) add reconstruction information (the selected chunk numbers or a seed) to the packet.

Fountain codes (RaptorQ): The RaptorQ (RQ) code is a fountain code that encodes a k-symbol message into a potentially infinite number of encoded symbols called packets (a packet can contain multiple encoded symbols). It allows efficient forward error correction and is typically used to send messages over lossy communication channels, where some packets of a sender might never reach the receiver.
To achieve successful communication, RQ introduces error correction such that the sender can send a potentially infinite sequence of packets until it receives a stop signal from the receiver stating that the entire message was successfully decoded. The main advantage of RQ is that the receiver can decode the message with high probability if any k + ε packets have arrived; the term ε is called the overhead or redundancy and is typically very small. For a source message with k source symbols (s1, s2, ..., sk), RQ first attempts to send k packets (p1, p2, ..., pk) that are equal to the source symbols, i.e., pi = si for 1 ≤ i ≤ k. If the first k packets have been sent and the sender has not received a stop signal, it continues sending repair packets computed from a range of source symbols; the repair packets compensate for any lost packets. Even if the first k packets are lost and only subsequent repair packets are received, it is still possible to decode the message with the same high probability; this is called a non-systematic code. For example, with ε = 2, the probability of successfully restoring the k-symbol source message from k + 2 packets is greater than 99.9999%.

Fountain codes (RQPAP): Luby transform (LT) codes, published by Michael Luby in 2002, were the first fountain codes and are near-optimal erasure codes. LT codes were further improved, and the latest refinement of this class is the RaptorQ (RQ) code. A typical use case for RQ codes is data transmission over lossy networks: if the data are transmitted successfully, the receiver is very likely to decode and fully recover the message, and even if part of the message is lost during transmission, the receiver can still recover it if sufficient overhead was provided.
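The four LT encoding steps listed above can be sketched as a single packet generator. This is an illustrative sketch: the degree weights here are arbitrary toy values, not Luby's robust soliton distribution, and chunks are plain integers so that XOR is directly applicable.

```python
import random

def lt_encode_packet(chunks, degree_weights, rng):
    """One LT packet: draw a degree d from the distribution, pick d distinct
    chunks uniformly at random, and XOR them together."""
    d = rng.choices(range(1, len(chunks) + 1), weights=degree_weights)[0]
    picked = sorted(rng.sample(range(len(chunks)), d))
    payload = 0
    for i in picked:
        payload ^= chunks[i]
    return picked, payload   # reconstruction info + packet body
```

In practice the chunk indices are not shipped verbatim; a PRNG seed from which the receiver can re-derive them is attached instead.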
Frequency matrix game graph (FMG) coding In mathematics, chaos games originally referred to a method of creating fractals using a polygon and a randomly selected starting point. The next sequence of points is created iteratively starting from the initial point, and each point in the sequence is generated by moving a given fraction of the distance from the previous point to a vertex of the polygon. Selecting vertices at random in each iteration and repeating the process up to the maximum number of times usually produces a fractal shape. For example, the Sierpinski triangle is generated with an equilateral triangle and fraction 1/2. When the number of vertices increases to n, the construction generalizes to higher dimensions, and a Sierpinski simplex in (n − 1) dimensions is obtained. If the chaos game is played in a square, there will be no fractals; instead, the points uniformly fill the inside of the square. However, restrictions can be added to the vertex-selection step, such as forbidding the current vertex from being selected in the next iteration, resulting in a fractal. Also, in a chaos game, starting at each vertex and traversing all possible paths leads to the same image as taking only one random path. This method is deterministic and can be used for deterministic coding and classification. By increasing the code length n, the CGR can be extended to an order-2^n FCGR, and each element in the matrix can represent a certain code. Fully connected neural network encoder The sequence encoder is a fully connected neural network. The 4096-dimensional FC2 vectors are fed into a 2048-dimensional hidden layer with a rectified linear activation, followed by an output layer with a "one-hot" sequence representation that is 80 nucleotides in length. In this representation, each sequence position has four channels, one for each base. A softmax activation function is applied that forces each position's channels to sum to 1.
A DNA sequence can be read off by picking the channel with the maximum activity at each position. The one-hot representation can produce indeterminate bases (for example, if all four channels at a position have a value of 0.25). Because of this, a regularization is applied during encoder training to minimize the entropy at each position. This encourages each position to have a well-defined maximum, which improves the accuracy of the yield predictor. The yield predictor takes a pair of one-hot sequence representations and produces an estimate of the yield of the hybridization reaction between the first sequence and the reverse complement of the second sequence. It is structured as a convolutional neural network. The network makes use of a local match layer that produces vectors of possible matches between each window of 3-mers. This encourages the predictor to make use of any unaligned matches between the two sequences. Galois field and base triplet association method We first map the data to be stored to B many k×m information blocks, denoted by Mb, b = 1, …, B, with elements in GF(p), where GF(p) denotes the finite field (Galois field) with p elements, and p is a prime (we set p = 47; this specific choice is explained below). Typically, the data to be stored is simply a file consisting of a certain number of bytes (1 byte = 8 bits); in this sense a file is a number in base 2^8 = 256. Mapping the data to information blocks with elements in GF(p = 47) thus simply corresponds to converting this number to a number in base 47. A column of an information block consists of m elements in GF(p), and thus corresponds to an element in the extension field GF(p^m). Each block is encoded using an RS code (over the extension field GF(p^m)) to a codeword Cb of length n. We will refer to the corresponding code as the "outer" code. Each of the n symbols of the codeword Cb will correspond to a DNA segment, and can be represented by an m-dimensional vector with elements in GF(p).
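The base conversion described above (a file viewed as one large base-2^8 number, rewritten in base 47) can be sketched as follows. The function name and the simplified length handling are illustrative, not the authors' implementation.

```python
def bytes_to_base47(data: bytes):
    """Interpret the file as one base-256 number and convert it to base-47
    digits, i.e., elements of GF(47) with p = 47 as in the text.
    Leading-zero/length bookkeeping is simplified for illustration."""
    n = int.from_bytes(data, "big")
    digits = []
    while n:
        n, r = divmod(n, 47)
        digits.append(r)
    return digits[::-1] or [0]

digits = bytes_to_base47(b"\x01\x00")   # the number 256: 256 = 5*47 + 21
```

Each resulting digit lies in {0, …, 46} and can then be grouped into the k×m information blocks over GF(47).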
In order to specify the membership of a symbol to a codeword Cb, b = 1, …, B, and to specify the position within the codeword Cb, we append a unique index consisting of l elements in GF(p) to each symbol. This yields the vectors mb,1, …, mb,n of dimension K = m + l each. Next, we encode each mb,j using an RS code over GF(p) in order to obtain the, henceforth called, "inner" codewords cb,j. Each inner codeword has length N > K. Finally, each inner codeword cb,j is mapped to a string with symbols in {A, C, G, T} (corresponding to nucleotides), which is then synthesized as DNA. The mapping, described below, is such that no nucleotide (i.e., symbol in {A, C, G, T}) repeats more than three times. GCNSA To solve the problem of the low efficiency of DNA storage encoding, a dataset was established by screening DNA storage codes that met the constraints, and the GCNSA was used for prediction after training. In the GCNSA, the original code graph is reduced to an equivalent smaller graph to reduce the time complexity of the encoding process. Second, the GCN was used to extract features, and the self-attention mechanism was used to capture the relationships between local DNA code words. Weighted aggregation of the corresponding vectors of DNA code words was carried out to improve the efficiency of DNA storage coding. Finally, the performance of the GCNSA is verified by comparing it with baseline methods on DNA storage coding datasets. To further evaluate the GCNSA, DNA storage codes were predicted under a variety of combinatorial constraints. HEDGES HEDGES stands for Hash Encoded, Decoded by Greedy Exhaustive Search. The simplest case is a half-rate code (1 bit encoded per nucleotide) with no constraints on the output DNA sequence. The basic plan is a variant of a centuries-old "text auto-key encoding" cryptographic technique.
We generate a stream of pseudorandom characters Ki ∈ {0, 1, 2, 3} ≡ {A, C, G, T}, where each Ki depends deterministically, via a hash function, on a fixed number of previous message bits {bj}, on the bit position index i of the current message bit bi, and on the strand ID. We then emit a character Ci in {A, C, G, T} with Ci = Ki + bi, the addition performed modulo 4. Redundancy for error correction occurs because, at each i, Ci can take on only two out of four values, since bi ∈ {0, 1}. The output DNA is always completely (pseudo)random because the hash is pseudorandom. Thus, the DNA is, in this sense, independent of the encoded message. We note that nothing limits this scheme to the four "natural" nucleotides; generalization to an increased number is straightforward. Huffman code Five files of digital information stored on a hard disk drive were encoded using purpose-written computer software. Each byte of each file to be encoded was represented as a sequence of DNA bases via base-3 digits ('trits' 0, 1 and 2) using a purpose-designed Huffman code [10]. Each of the 256 possible bytes was represented by five or six trits. IAOA In 2021, Abualigah et al. proposed the AOA and used it to solve practical problems. The AOA uses four arithmetic operators: multiplication, division, subtraction, and addition. The AOA search process can be divided into two main phases: the exploration phase and the exploitation phase. Before the AOA starts the optimization process, it judges and selects the search phase (exploration or exploitation). The arithmetic optimization acceleration function (Math Optimizer Accelerated) is an important factor that is used to control the exploration and exploitation phases. Due to the advantages of having few parameters and fast execution, the AOA has been widely used to solve many practical problems, such as engineering optimization and path planning.
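The HEDGES emission rule Ci = (Ki + bi) mod 4 can be sketched as follows. The SHA-256 hash, the window of eight previous bits, and the tuple serialization are stand-ins for the paper's actual hash construction; only the structure (key depends on strand ID, position, and bit history) follows the text.

```python
import hashlib

def hedges_emit(bits, strand_id, history=8):
    """Sketch of HEDGES emission: each pseudorandom key K_i is a deterministic
    hash of the strand ID, the bit position i, and a window of previous
    message bits; the emitted character is C_i = (K_i + b_i) mod 4."""
    out = []
    for i, b in enumerate(bits):
        prev = tuple(bits[max(0, i - history):i])
        digest = hashlib.sha256(repr((strand_id, i, prev)).encode()).digest()
        k = digest[0] % 4                  # pseudorandom K_i in {0, 1, 2, 3}
        out.append("ACGT"[(k + b) % 4])    # C_i = K_i + b_i (mod 4)
    return "".join(out)

dna = hedges_emit([1, 0, 1, 1, 0, 0, 1, 0], strand_id=7)
```

Because the hash is deterministic, a decoder can hypothesize bit values, recompute Ki, and check whether the observed character is consistent, which is what enables the greedy exhaustive search of the name.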
The arithmetic optimization algorithm proposed in the literature has improved solution accuracy and algorithm stability to a certain extent. However, the AOA can easily fall into a local optimum and converges slowly. This study proposes the IAOA based on two strategies to overcome these two drawbacks. In the IAOA, perturbations generated by elementary functions are added to the MOA and MOP, respectively, and double adaptive weights are added to the position update formula. This enables the algorithm to perform an adequate global search, avoid falling into locally optimal solutions, and improve both the convergence speed of the AOA and the accuracy of the obtained solutions. Linear-time quaternary encoder The linear-time quaternary encoder can correct a single indel with ⌈log n⌉ + 2 bits of redundancy. This construction improves the encoder of Tenengolts by reducing the redundancy by at least four bits. Building on Tenengolts' encoder, the linear-time quaternary encoder yields a more efficient construction for codes capable of correcting a burst of indels with log n + O(log log n) bits of redundancy. Intermediate "binary" coding, constrained quaternary coding We have compared two coding approaches for constraint-based coding of DNA strings. In the first approach, an intermediate, 'binary', coding step is used, while in the second approach we 'directly' translate source data into constrained quaternary sequences. The binary approach is attractive as it yields a lower complexity of the encoding and decoding look-up tables. The redundancy of the binary approach is higher than that of the quaternary approach for generating combined weight- and run-length-constrained sequences. The redundancy difference is small for larger values of the maximum homopolymer run.
ISOWOA In dealing with optimization problems, the whale optimization algorithm has the drawback that it is prone to premature convergence to local optima, i.e., a solution is assumed to be the best when the global optimum has not yet been found. Based on this, the proposed algorithm integrates the strategies of enhanced quasi-opposition learning, a nonlinear convergence factor, and a new time-varying inertia weight with a collaborative convergence factor, giving it a stronger search ability throughout the optimization stage. The adaptability and convergence precision of each iteration are improved, and the inherent defect of premature convergence is eliminated, so that it becomes a well-balanced algorithm. Through 38 benchmark functions, the proposed algorithm was compared to modern metaheuristic algorithms and variants of WOA on test cases based on multiple dimensions, numbers of iterations, mean values, and variances. The experiments show that the results of 17 functions of CEC 2005 obtained by ISOWOA were clearly better than those of EWOA in terms of mean and standard deviation, and the results of another four functions were the same. In addition, the algorithm is superior to other algorithms on 11 functions in the test function set of CEC 2015, which again demonstrates the superiority of the ISOWOA algorithm. In biological computing, based on the same combinatorial constraints, the multi-constraint DNA storage coding set constructed by ISOWOA has a better solving effect than DMVO. The results show that the number of DNA storage codes filtered by ISOWOA increased by 2–18%, and the coding set can realize access to DNA files in biological experiments. JPEG codec The JPEG codec is based on an original, efficient encoding algorithm that is highly robust to sequencing errors and deals with problematic sequences containing pattern repetitions. The construction of the code is strongly linked to some biological constraints presented in Section 3.
This code is embedded in a coding workflow for the storage of digital images onto synthetic DNA. The proposed solution optimizes the rate-distortion trade-off thanks to a nucleotide allocation algorithm and is able to compress an input image without strongly affecting its visual quality. JPEG's progressive encoding JPEG's progressive encoding allowed image data to be partitioned into scans by color band and by frequency. Through careful organization of the file, a low-resolution grayscale image was constructed from a small percentage of the file's data, or an increasingly higher-resolution image was obtained by reading a greater percentage of the file. For the File Preview operations, the JPEG information was arranged in such a way that a 0 HD access pulled out a small amount of data and produced a low-resolution image. By tuning the access conditions as described, more of the file was accessed and a greater-resolution image was produced. The encoding process: each partition was encoded into DNA using a multilevel approach. First, the JPEG file was partitioned into scans. Then, each partition was divided into blocks of 1665 bytes, which were interpreted as a matrix with 185 rows and 9 columns with one byte per position. Blocks smaller than 1665 bytes at the end of a file or partition were padded out with zeros. An RS outer code with parameters [n = 255, k = 185, d = 71] added additional rows to each block to compensate for the possible loss of strands within a block. Each row was given a unique index that was two bytes long. Then, each row was appended with error-correction symbols using an RS inner code given as [n = 14, k = 11, d = 4] that protected both the data and index bytes. Each row of bytes was converted into a DNA sequence using a comma-free code that mapped each byte to a unique codeword sequence. The codewords were designed using a greedy algorithm to be GC-balanced and have an edit distance of at least two to all other codewords.
Each codeword had a length of 8 nt. The last step was the appending of primers to each end of the sequence and the insertion of a restriction enzyme cut site in the middle of the strand. Each partition of the JPEG file used different primer binding sites, so these primer sequences were given as inputs for each partition as it was encoded. An additional set of flanking primers was added to each strand to enable the entire library to be amplified at once using a single common primer. The final set of strands for each file was synthesized into a DNA library. LCG random encoding Twenty 26×26-pixel, black-and-white icon bitmaps were generated as representations of 20 high-resolution colour images encompassing a broad range of subject matter. Each black-and-white icon was converted to a length-676 bitstring in column-first order, with each black or white pixel encoded as a 0 or 1, respectively. This bitstring was compressed via run-length encoding, replacing each stretch of consecutive 0s or 1s with a two-tuple (value, length) to generate a list of two-tuples describing the entire bitstring. The maximum length is 15; runs longer than 15 bits are encoded as multiple consecutive runs of the same value. This run-length encoding was converted to a sequence of base-4 digits as follows: (1) Begin with an empty string, and set the current run to the first run of the list. (2) Append the value of the current run (0 or 1). (3) Using two base-4 digits, append the length of the current run. (4) If the length of this run was 15, encode the next run starting with step (2); otherwise, encode starting at step (3). If no runs remain, the process is complete. This process produces a quaternary string describing the entire run-length encoding of the image.
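The numbered conversion steps can be sketched in Python. One point is interpretive: in step (4), a run's value digit is re-emitted only after a length-15 run (otherwise runs alternate values and the value is implied), which is my reading of the text rather than a stated rule.

```python
def runlength_to_quaternary(bits):
    """Convert a bit list to the quaternary run-length representation above:
    runs are capped at length 15; a run's value (0/1) is written at the start
    and after any length-15 run, and each length takes two base-4 digits."""
    # Run-length encode with maximum run length 15.
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i] and j - i < 15:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    digits, explicit_value = [], True
    for value, length in runs:
        if explicit_value:
            digits.append(value)               # step (2): run value
        digits += [length // 4, length % 4]    # step (3): two base-4 digits
        explicit_value = (length == 15)        # step (4)
    return digits

q = runlength_to_quaternary([0] * 20 + [1] * 3)
```

For twenty 0s followed by three 1s, the runs are (0,15), (0,5), (1,3): the second run keeps an explicit value digit because it follows a length-15 run, while the third run's value is implied.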
To avoid homopolymer runs and repeated subsequences in the final nucleotide sequence, each digit is offset by a random number generated from a linear congruential random number generator (LCG) beginning with a random seed (that is, the ith value generated by the LCG is added, modulo 4, to the ith base-4 digit of the quaternary string). We used an LCG with multiplier 5, modulus 2^31 − 1 and increment 0, although any LCG parameters with a period longer than the length of the sequence would be suitable. The final 'randomized' quaternary string is converted to nucleotides by a direct mapping (0=G, 1=A, 2=T, 3=C). The number used to seed the LCG is prepended to this sequence by converting it into a ternary string of length 20, whose digits are encoded in nucleotides via a base transition table, as done previously by Goldman et al. [7] (0=GA, AT, TC, CG; 1=GT, AC, TG, CA; 2=GC, AG, TA, CT). The first digit is encoded directly (0=A, 1=T, 2=C). This sequence was modified with additional flanking sequences added to the beginning and end. LDPC code The key point in the encoding procedure is that we use only inter-oligo codes, unlike previous schemes that considered both inter-oligo and intra-oligo codes. When LDPC codes are applied across all DNA sequences, insertion, deletion, or erasure errors are treated as a single erasure or substitution error from the viewpoint of each codeword in the inter-oligo direction. In other words, no additional intra-oligo code or redundancy is required to handle all errors in DNA-based data storage in the proposed scheme. LDPC codes are capacity-approaching codes that nearly achieve the Shannon capacity with low complexity. Thus, employing LDPC codes can achieve high error-correction performance even with low computational complexity compared to other error-correcting codes. Levy Equilibrium Optimizer The Levy Equilibrium Optimizer algorithm is a combination of Levy, an intermittent scale-free search model, and the Equilibrium Optimizer (EO) algorithm.
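The LCG randomization step described earlier, with the stated parameters (multiplier 5, modulus 2^31 − 1, increment 0) and the direct mapping 0=G, 1=A, 2=T, 3=C, can be sketched as follows; the function names and example digits are illustrative.

```python
def lcg(seed, a=5, m=2**31 - 1, c=0):
    """The stated LCG: multiplier 5, modulus 2^31 - 1, increment 0."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x

def randomize_and_map(digits, seed):
    """Offset the i-th base-4 digit by the i-th LCG value (mod 4), then map
    0=G, 1=A, 2=T, 3=C as in the text."""
    gen = lcg(seed)
    return "".join("GATC"[(d + next(gen)) % 4] for d in digits)

dna = randomize_and_map([0, 1, 2, 3, 0, 1], seed=12345)
```

Decoding is the inverse: regenerate the same LCG stream from the prepended seed and subtract it, modulo 4, from each mapped digit.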
The equilibrium optimizer (EO) algorithm is inspired by a control volume in a hybrid dynamic mass balance, where the mass balance equation is used to describe the concentration of non-reactive substances in a control volume under dynamic balance. The mass balance equation provides a fundamental physical description of the conservation of mass entering, leaving, and produced within a volume. A study by Reynolds et al. showed that Drosophila flies explore their environment and search for food during foraging through a series of straight-line flight paths that are often interspersed with abrupt right-angle turns. An intermittent scale-free search model, called Levy, was proposed based on the scale-free flight of Drosophila. The model was applied to the optimization process and optimal search by the researchers, and preliminary results showed it to have good search performance. In the LEO algorithm, the Levy flight update strategy is used to replace the random update based on the current global optimum, which reduces the influence of extreme pool individuals on the update mechanism. Therefore, the Levy flight operation was added in the later iterations of the algorithm in this paper to accelerate the convergence of the EO algorithm and jump out of local optima. In this paper, the Levy flight operation is carried out on the pool in the late iterations of the EO algorithm and processes the output of the EO algorithm, which can expand the search scope of the EO algorithm and obtain a larger code set. After initializing the set S, all codes in the sets S and SEO are checked one by one against the combination constraints. Matrix chaos game representation (mCGR) The chaos game representation (CGR) was first described by Barnsley. It is an iterated function system to construct fractals. The underlying idea of the algorithm is to assign the numbers one to three to the vertices of a triangle. The algorithm starts from a randomly chosen vertex.
Then the next point is drawn by randomly choosing a number from one to three, representing a vertex of the triangle. The new point is set half of the distance (the so-called scaling factor) from the current point towards this vertex of the triangle. After several iteration steps, the Sierpinski triangle appears. Variations of the algorithm, such as a change of the angle, the scaling factor, or the number of vertices, result in differently shaped fractals. The resulting patterns are called chaos game representations (CGR). Jeffrey was the first to use the CGR algorithm for DNA, where the four nucleotides are assigned to the corners of a square. There are several applications based on CGR, e.g., the analysis and comparison of whole-genome sequences or phylogenetic predictions. Moreover, the CGR has some interesting properties, as pointed out by Almeida et al. The CGR patterns are unique, the sequence can be reconstructed from the coordinates, and the distributions of the points can be described as a generalized Markov chain model. The frequency matrix chaos game representation (FCGR) is an extension of the CGR, in which the CGR is represented as a count matrix, where the number of dots of the CGR in each section of a grid is counted. The FCGR is particularly useful for machine learning approaches as it provides a fixed input dimension for any sequence length and has already been successfully applied in protein classification. An FCGR of order 2^n represents a matrix of n-mers. Based on that, we can consider it a representation of all possible DNA words of length n. We will refer to this matrix representation of order 2^n as the matrix chaos game representation (mCGR). Mixed Error Processing Coding for Arbitrary Length (MEPCAL) MEPCAL employs a layered structure combining a nested Reed–Solomon (RS) code, the RaptorQ code, and an anchoring approach.
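The FCGR construction, an order-2^n count matrix over all n-mers, can be sketched as follows. The corner-to-index convention used here is one common choice, not necessarily the authors' exact layout.

```python
def fcgr(seq, n):
    """Frequency-matrix CGR of order 2^n: count every overlapping n-mer and
    place the counts in a 2^n x 2^n matrix. Each n-mer's cell is derived
    bit-by-bit from its nucleotides (A, C, G, T as the four square corners)."""
    size = 2 ** n
    grid = [[0] * size for _ in range(size)]
    corner = {"A": (0, 0), "C": (0, 1), "G": (1, 0), "T": (1, 1)}
    for i in range(len(seq) - n + 1):
        row = col = 0
        for ch in seq[i:i + n]:
            r, c = corner[ch]
            row, col = row * 2 + r, col * 2 + c
        grid[row][col] += 1
    return grid

m = fcgr("ACGTACGT", 2)   # 4x4 matrix counting the seven overlapping 2-mers
```

Because the matrix has a fixed 2^n × 2^n shape regardless of sequence length, it provides the fixed input dimension that makes the FCGR convenient for machine learning.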
In this study, its processing unit was base-256 numbers (information symbols), which are directly convertible to 4-bp DNA symbols. The MEPCAL encoder proceeds in four steps. First, information symbols were generated from the original information and appended with RS repair symbols at an 8.6% redundancy rate before they were partitioned into encoding groups. Second, each encoding group was further organized into 16 encoding sets (packets), from which RaptorQ code, a variant of the fountain code, was applied to generate a surplus of repair packets. Regardless of their origin, the same number of packets suffices for the restoration of the encoding group, providing coding flexibility at minimal redundancy cost. In step 3, all packets were transcoded to DNA sequences, and a designated "leading base" was inserted between DNA symbols. These leading bases served as anchors for indel detection in the decoding process. Packets were then subjected to five filters to reject sequences with high error rates, low signal-to-noise ratios in the nanopore ionic current streams, extremely error-prone subsequences, inappropriate GC ratios, and potential biological activities (potential open reading frames (ORFs), RSGE recombination sites, repetitive sequences, Golden Gate excision sites, etc.). These filters were customizable to the specific sequencing platform, the host organism, and the genetic cloning method, to collectively enhance stability in DNA synthesis and transformation and reduce error rates ab initio in sequencing. Next, from all packets, 16 qualified sequences were selected, and RS code was applied again at a 50% redundancy rate to provide the second level of error protection. Finally, all symbols were organized linearly according to the order of encoding sets and then encoding groups, and indexed by 5-bp and 10-bp interval indices, respectively.
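The base-256-to-DNA correspondence mentioned above (one information symbol equals one 4-bp DNA symbol, since 4^4 = 256) can be sketched as follows; the particular digit-to-base assignment is illustrative, not MEPCAL's actual table.

```python
def byte_to_symbol(b):
    """Map one base-256 information symbol to a 4-nucleotide DNA symbol
    by reading the byte as four base-4 digits, most significant first."""
    bases = "ACGT"
    return "".join(bases[(b >> (2 * k)) & 3] for k in (3, 2, 1, 0))

def symbol_to_byte(s):
    """Inverse mapping: four nucleotides back to one base-256 symbol."""
    bases = "ACGT"
    b = 0
    for ch in s:
        b = b * 4 + bases.index(ch)
    return b

sym = byte_to_symbol(0xB1)   # 0xB1 = base-4 digits 2,3,0,1
```

In the described scheme, a designated leading base would then be inserted between such 4-bp symbols to anchor indel detection during decoding.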
Through the MEPCAL encoder, the compressed 5.56 KB text data was coded into a DNA sequence of 50,540 bp (0.886 bit/base), of which 27.87% was RS redundancy, 18.04% was leading bases, and 9.77% was indexing. Modulation encoding In the field of telecommunication, modulation has accomplished reliable signal transmission by superimposing baseband signals (or low-frequency signals) onto carrier signals (or high-frequency signals). The modulated signal is not only resistant to interference but also has no effect on the modulating signal. In fact, the carrier signals serve as the backbone that protects the modulating signal from external disturbances. In this paper, we develop a new DNA storage architecture based on modulation encoding and decoding. Results from computer simulation and real data demonstrate that this storage architecture has three advantages: First, modulation encoding provides a storage-friendly encoding architecture that satisfies biological constraints and yields similar thermodynamic properties. Second, modulation provides an effective way to resolve synchronization errors, and the proposed method can tolerate up to ∼40% errors. The decoding process is very time-efficient and suitable for large-scale applications. In addition, it is robust to imperfect clustering reconstruction, which is very common in practice. MOPE We propose the MOPE encoding algorithm, which supports the encoding process of DNA storage systems. First, a combination of the MBMO algorithm, with better optimization capability, and combinatorial constraints is utilized to code the non-payload. The non-payload coding set obtained by the MBMO algorithm is increased by 3%–18% compared with the current best result. The MBMO algorithm is an effective improvement of the BMO algorithm achieved through a weight-based location update rule and a hybrid mutation strategy.
By comparison with mainstream metaheuristic algorithms in both qualitative and quantitative aspects, it is shown that the MBMO algorithm can strike a balance between the exploitation and exploration phases and also has the advantages of avoiding local optima and fast convergence. Second, the PE algorithm is proposed for encoding the payload, encoding different types of data (such as text and images) into DNA sequences. This algorithm is not only simple in structure but can also effectively prevent long homopolymer runs and control the GC content within the specified range. Motif encoder In order to demonstrate the feasibility of using composite motifs, we developed a DNA storage system that uses composite motifs as building blocks. Figure 1 presents the read/write pipeline of our system. On the writing side, digital data is encoded into oligos containing composite motifs using a motif encoder. Writing a composite motif at any given position of an oligo sequence is done by mixing multiple motifs during the synthesis procedure to synthesize multiple DNA molecules that contain the corresponding combinations of motifs using Bridged Oligonucleotide Assembly (BOA). Data was encoded using a library of 96 payload motif sequences and 8 address motif sequences. The sequence design rules for base motifs that are used to derive composite motifs are similar to those of DNA barcode design. Thus, we started with DNA sequences designed to select 8 motifs for addressing and 96 motifs for composite payloads. Using a combination factor of 32, we developed a composite motif set of 3 × 10^25 composite motifs (C(96, 32)). Thus, each composite motif, and hence each synthesis cycle, can store 84 bits of data (log2 C(96, 32)).
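The stated capacity can be checked directly: choosing 32 of the 96 payload motifs gives roughly 3 × 10^25 composite motifs, i.e., about 84 usable bits per synthesis cycle.

```python
import math

# Capacity of one composite motif: choose 32 of the 96 payload motifs.
n_composite = math.comb(96, 32)            # ~3 x 10^25 composite motifs
bits_per_cycle = math.log2(n_composite)    # ~84.6, i.e. 84 whole bits per cycle
```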
Movable type encoding This section outlines the procedures for converting digital files into DNA-MT storage files, emphasizing fragmentation, data strip formation, binary conversion, error detection and correction, and final encoding into DNA-MTs in blocks. These steps ensure accurate data encoding and robust error handling to optimize the integrity of the DNA-MT stored files. Fragmentation, strip formation, and binary conversion: To encode various types of files (e.g., text, images, audio, and video) into DNA-MT storage files, each target file was initially segmented into 100-byte fragments. Each fragment was further subdivided into twenty 5-byte data strips. Subsequently, each strip was appended with 4 bytes of address information to facilitate mapping within the storage file, and then subjected to binary conversion. This address allocation was designed to support a total DNA-MT storage capacity of ≈2 GB. NOL-HHO "ROL" is a new random opposition-based learning strategy, given by Equation (9): x̂_j = a_j + b_j − rand() × x_j, j = 1, 2, …, n. "ROL" can move the current candidate to a new search space to jump out of a local optimum; that is, we can calculate the fitness function values of individuals before and after the random opposition-based learning step, and use the best-selection mechanism to select the N better individuals. The Harris hawks optimization (HHO) algorithm is a population-based metaheuristic optimization algorithm that was proposed by Mirjalili et al. We refer to the HHO algorithm that combines the above two strategies as the "NOL-HHO" algorithm. The former nonlinear control parameter has a good ability to balance global and local searches, while the latter "ROL" strategy guides the population to jump out of a local area to exploit more promising areas, both of which greatly improve the performance of HHO.
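The ROL update of Equation (9), combined with the best-selection mechanism, can be sketched as follows; the bounds, test point, and fitness function are illustrative.

```python
import random

def random_opposition(x, lower, upper, rng=random):
    """ROL step: x_hat_j = a_j + b_j - rand() * x_j, where [a_j, b_j] are the
    search bounds of dimension j (Equation (9) above)."""
    return [a + b - rng.random() * xj for xj, a, b in zip(x, lower, upper)]

def rol_step(x, lower, upper, fitness):
    """Best-selection mechanism: keep the opposite candidate only if it
    improves the fitness (lower is better here)."""
    x_hat = random_opposition(x, lower, upper)
    return min(x, x_hat, key=fitness)

best = rol_step([2.0, -1.0], [-5.0, -5.0], [5.0, 5.0],
                fitness=lambda v: sum(t * t for t in v))
```

Because the better of the original and the opposite candidate is always kept, a ROL step can never worsen an individual's fitness, which is what lets the population probe new regions without losing progress.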
Nonbinary SIDEC code We segment the q-ary sequence with the maximum run-length constraint r into multiple sequences of a specified length; these are the message sequences of the proposed SIDEC code Ca,b(n + 1). The message sequences used as inputs pass through the algorithm of the systematic proposed SIDEC code in Section III.C, and the output codewords have the properties of single insertion/deletion error correction and the maximum run-length constraint r. The maximum run-length of the output codewords is r = 3, since we design the fixed maximum run-length r to be three in the message and parity sequences, and the last symbol of the parity sequence is always different from the first symbol of the message sequence. The codewords with maximum run-length r = 3 can be used for DNA synthesis given the constraints of DNA storage systems. Optimized (OPT) 6-bit encoding table To transform text messages into binary code, we utilized an encoding strategy where each 'byte' maps to a 6-bit character code (for 2^6 = 64 possible characters) built from two concatenated 3-bit data units over two barcoded cell populations. Because our classifier performance was not equal across all 3-bit data profiles, we examined different encoding schemes to optimize the character-to-byte mapping. Two schemes were explored: (1) the classic DEC 6-bit encoding table for 64 basic ASCII characters and (2) an optimized (OPT) 6-bit encoding table, designed to take into account letter usage frequency and classifier decoding performance bias. In the OPT encoding scheme, more frequently used characters (based on letter frequency in English text) are assigned to 6-bit bytes with higher decoding performance. Therefore, OPT encoding was expected to generally outperform DEC encoding for text messages. Panda Codec To achieve practical data storage, we developed a codec algorithm, named "Panda", for converting data into DNA sequences and vice versa.
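The OPT assignment idea (the most frequent characters get the most reliably decoded codes) can be sketched as below. Both the frequency order and the per-code accuracy ranking are illustrative stand-ins, not the study's measured values.

```python
def build_opt_table(chars_by_freq, codes_by_accuracy):
    """Pair the i-th most frequent character with the i-th most reliably
    decoded 6-bit code; both inputs are assumed pre-sorted best-first."""
    return dict(zip(chars_by_freq, codes_by_accuracy))

# Hypothetical inputs: English letter-frequency order and a made-up
# accuracy ranking of 6-bit codes (values 0..63).
chars_by_freq = list("ETAOINSHR")
codes_by_accuracy = [0b010110, 0b000011, 0b101101, 0b000111,
                     0b110000, 0b111111, 0b010101, 0b001001, 0b100001]
opt_table = build_opt_table(chars_by_freq, codes_by_accuracy)
```

Under this pairing, the expected per-character decoding accuracy of typical English text is maximized relative to an arbitrary (e.g., DEC) assignment, which is the rationale the text gives for OPT.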
The essential logic of classical LZW compression was integrated into this algorithm to store as much data as possible. Moreover, to simplify the data codec process, we process the data by reducing the binary conversion steps. First, character compositions from text, or digit (0, 1, 2, …, 9) compositions of pixel values from images, were extracted and taken as the "element sequences". Each sequence was first used to generate a code table, and then a compression mapping process was employed to encode the data into decimal values, which were then converted into quaternary numbers (0, 1, 2, 3) corresponding to A/T/C/G sequences. To control the bio-constraints, including GC content and homopolymer runs in the sequence, a two-step extension mapping process was employed. In the first extension mapping step, each five-nucleotide unit was mapped to a six-nucleotide unit with the GC content strictly controlled at 50%. Then, the homopolymer repeats "AAAA", "TTTT", "GGGG", and "CCCC" were replaced by "ATGCA", "ATGCT", "ATGCG" and "ATGCC", respectively, in the second extension mapping step. By doing this, we strictly controlled the GC content of the encoded sequence to ≈50% and the homopolymer runs to below 4 nucleotides at both the whole-sequence and regional levels. This should make Panda very suitable for encoding data into the long genes used for in vivo data storage, which generally employ a gene assembly process that requires even GC content at the regional level (e.g., 100 bp). QRSS-MPA Faramarzi et al. proposed a new nature-inspired heuristic optimization algorithm in 2020, called the Marine Predator Algorithm (MPA). It follows rules governed by nature in the optimal foraging strategy and the predator-to-prey encounter rate strategy in marine ecosystems. The inspiration for MPA came from a wide range of foraging strategies.
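The second extension-mapping step, with the replacement rules exactly as listed, can be sketched as follows. This is a simplified sketch that applies the four replacements once, left to right, without re-scanning for runs that a substitution might newly create.

```python
def panda_second_extension(seq):
    """Second extension-mapping step of the Panda codec: replace each 4-nt
    homopolymer with its fixed 5-nt substitute, as listed in the text."""
    for run, sub in (("AAAA", "ATGCA"), ("TTTT", "ATGCT"),
                     ("GGGG", "ATGCG"), ("CCCC", "ATGCC")):
        seq = seq.replace(run, sub)
    return seq

out = panda_second_extension("GGAAAACCTTTT")
```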
For example, when the prey concentration is low, the predator uses the Lévy strategy, whereas when the prey is abundant, the predator uses Brownian motion. The speed ratio V between prey and predator is used to weigh Lévy motion against Brownian motion and determine the best encounter strategy. Analysis of the MPA algorithm reveals two main shortcomings. First, randomly initializing the population reduces the convergence rate. Second, the Lévy predation strategy may miss promising search areas in the exploitation stage, decreasing the optimization accuracy of the algorithm. Therefore, this paper proposes the QRSS-MPA algorithm, which combines a quasi-opposite learning strategy and a spiral search strategy with MPA. The quasi-opposite learning strategy is used to initialize the candidate solutions, which maintains the diversity of the population and enhances the exploration ability of the algorithm. In the exploitation stage, the spiral search strategy makes full use of the previous location and the locations around the elite to conduct a concentrated search in the effective area. This not only enriches the diversity of the population to a certain extent but also avoids falling into local optima. In addition, the QRSS-MPA optimizer better balances exploration and exploitation of the search space and delivers superior performance. Quadtree partitioning encoding Quadtree partitioning encoding can convert image and text data directly into DNA base sequences for storage. The innovation is that image and text data need not be converted into binary data before compression encoding; they can be encoded directly into DNA base data by quadtree encoding. This algorithm makes the biosynthesis step of the overall DNA storage scheme more practical and increases the flexibility and compatibility of the overall DNA storage scheme.
DNA quadtree coding can effectively reduce the code length by assigning shorter codes to data with higher probability and longer codes to data with lower probability. Specifically, during DNA encoding, the proposed algorithm generates a quadtree based on the probability of occurrence of each character in the source data. Characters with the highest occurrence probability are encoded at the top of the quadtree, and characters with lower occurrence probability are placed at the bottom of the quadtree. After quadtree construction, the mapping 0-A, 1-G, 2-C, 3-T is used to convert digits into DNA bases. Once the A/G/C/T data are obtained, the biological processes for DNA-based storage follow. For retrieval, quaternary numeric information is obtained by comparing the base data of the entire DNA chain vector with the map. In the decoding process, the quaternary digital information is decoded symbol by symbol using the DNA quadtree decoding algorithm to obtain the original data. Random code (RC) The RC system consists of three main components: a random matrix, Gaussian preprocessing, and random equilibrium. First, we use a pseudo-random number generator to create a random matrix, which is then subjected to Gaussian preprocessing using the XOR elimination algorithm. This preprocessing step results in a generated matrix, a submatrix of the original random matrix with optimal decoding success rates. This ensures that all chunks of data are included in the generated matrix. Additionally, we propose a random equilibrium algorithm to ensure that the generated DNA sequence can successfully pass the biological-constraint screening. RC algorithm steps: (1) Split the target storage document into K sub-packets based on the data capacity of the generated DNA strands. (2) Use adapters as seeds for a pseudo-random generator to produce 0/1 random matrices of specified dimensions. (3) Perform Gaussian preprocessing on the random matrices with Gaussian XOR elimination, and select the generated matrix.
(4) Based on the generated matrix, select the chunks whose corresponding matrix elements are 1 for the XOR operation, thus generating the droplet. (5) Randomly equilibrate the data according to biological constraints such as homopolymer runs and GC content, to obtain the final storable DNA strand. Random coding 1) Randomization: We start by splitting the information to be stored into K = 10,977 fragments of length 14 × 6 bits. We then multiply the bits of the information with a pseudorandom sequence. This step is invertible (since we know the pseudorandom sequence). After this randomization step, the bits appear random for statistical purposes. 2) Outer-encoding: We consider each fragment ci = [ci1, …, ci6], i = 1, …, K, as consisting of 6 blocks cij, j = 1, …, 6, each of length 14 bits. For each j, a Reed-Solomon code of length 16383 with symbols in {0, …, 2^14 − 1} is used to encode the blocks c1j, …, cKj into the codeword c1j, …, cMj. This yields the new, redundant sequences ci = [ci1, …, ci6], i = K + 1, …, M. 3) Indexing: Next, we add a unique index of length 24 bits to each of the M = 16383 sequences. The sequences ci now have length 24 + 14 ⋅ 6 = 18 ⋅ 6 bits. 4) Inner-encoding: We now view each sequence ci as consisting of 18 symbols of length 6 bits and add two parity symbols by inner-encoding with a Reed-Solomon code with symbols in {0, …, 2^6 − 1}. This yields sequences of length 20 ⋅ 6 bits. 5) Mapping to DNA: Finally, we map each sequence of length 20 ⋅ 6 bits to a DNA sequence of length 60 nt, simply by using the map 00 → A, 01 → C, 10 → G, 11 → T. Randomly fragmented encoding 1. Randomization. The first step of encoding entails randomization of the original data by mathematical addition with a pseudorandom sequence (specifically, we add the sequence of original data to a pseudorandom sequence with the exclusive-or (XOR) operation).
Because the pseudorandom sequence is known and can also be generated from a given seed value by the decoder, this step is reversible. 2. Fragmentation of digital data. It is difficult, inefficient and expensive to synthesize strands of DNA that are longer than 100–300 nucleotides. Consequently, the digital data have to be fragmented and mapped to a multitude of short DNA sequences, where two bits can be stored per nucleotide position (four possibilities, as each of the four bases of DNA can be mapped to a combination of two bits). 3. Appending redundant sequences. A Reed–Solomon code is used to generate additional redundant sequences of the same length, which are appended to the original information. 4. Indexing sequences. Index information is added to each sequence, and the length of the index has to be chosen so that it enables indexing of all sequences that are written to DNA. Currently, a good choice for the length of the index is 24 bits, as this allows for 2^24 = 16,777,216 indexed sequences. To account for larger DNA pools in the future, the length of the index can be increased. 5. Inner code redundancy. Our approach uses a Reed–Solomon error-correcting code, which takes the original information in each sequence (inner code message length = K) and adds redundant symbols. Similar to the outer code, this has the constraint that the inner code length (N) is limited by the inner code symbol size (mi) as N = 2^mi − 1. 6. Converting bits into DNA sequences. Following the preceding steps, the original digital file is now present in the form of many binary sequences, each comprising an index and inner code redundancy, and is ready for translation to DNA. For this, the binary data are mapped as 00 ↔ A, 01 ↔ C, 10 ↔ G and 11 ↔ T, allowing two bits of digital information per nucleotide.
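The final bit-to-nucleotide mapping described above can be sketched as follows (a minimal illustration of the 2-bits-per-base map; the function names are ours, not the authors'):

```python
# Minimal sketch of the final mapping step: 2 bits per nucleotide,
# using 00 <-> A, 01 <-> C, 10 <-> G, 11 <-> T as described above.
BIT2NT = {"00": "A", "01": "C", "10": "G", "11": "T"}
NT2BIT = {v: k for k, v in BIT2NT.items()}

def bits_to_dna(bits: str) -> str:
    """Map an even-length bit string to a DNA sequence, 2 bits per base."""
    assert len(bits) % 2 == 0, "bit string must have even length"
    return "".join(BIT2NT[bits[i:i + 2]] for i in range(0, len(bits), 2))

def dna_to_bits(seq: str) -> str:
    """Inverse mapping: recover the bit string from a DNA sequence."""
    return "".join(NT2BIT[base] for base in seq)
```

For example, bits_to_dna("00011011") returns "ACGT", and a 120-bit inner codeword maps to a 60 nt sequence, consistent with the scheme above.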
RaptorQ-Arithmetic-Base64-RS (RABR) The two encoding systems comprise tandem encoders: RaptorQ encoding for correcting DNA sequence loss, arithmetic encoding for reducing redundancy, improved Base64 encoding (or improved Lempel–Ziv–Welch (LZW) encoding) for controlling homopolymer and specific nucleotide content, and an RS code for correcting errors within a DNA sequence. The RaptorQ-Arithmetic-Base64-RS (RABR) and RaptorQ-Arithmetic-LZW-RS (RALR) encoding systems can ensure high coding efficiency and reliability despite the sequence loss and readout errors encountered in the DNA data storage channel. The two encoding systems can be extended from four nucleotides to six and eight nucleotides containing both natural and artificial nucleotides, and are applicable to text, pictures, and video. The absence of arithmetic compression in the coding systems facilitates an objective evaluation of the coding efficiency. RaptorQ-Arithmetic-LZW-RS (RALR) The RALR encoding system shares the tandem-encoder architecture described for RABR above, substituting improved LZW encoding for improved Base64 encoding to control homopolymer and specific nucleotide content; the remarks on coding efficiency, reliability, extensibility to six and eight nucleotides, and applicability to text, pictures, and video apply equally. RBS In this study, we designed RBS methods, with which the generated codewords have the advantages of higher storage density and better coding quality.
First, the data are rearranged by random XOR to eliminate the variability of file contents and make the bit distribution more even. Second, the file is partitioned into first, second, and third-fourth frequency word files according to the frequency words, and three methods, namely, data block merge, RH hybrid encode, and character merge, are proposed to encode the contents of the frequency word files, since the word files follow different content patterns. The combination of these three methods not only improves the storage density of the information but also compensates for the defect that rotational coding cannot handle duplicate data and interval data, thereby extending the applicable range of the coding. The generated codewords of interval data were also evaluated in terms of entropy change, free energy, and Hamming distance; the original GC content and homopolymer length constraints were satisfied along with the free energy and Hamming distance constraints, which implies a more stable sequence structure and better coding quality. The base distribution of the generated codewords is more in line with the expected distribution, and the relative entropy approaches zero, which indicates that the error rate of the generated codewords in sequencing will be lower. A reasonable base distribution facilitates sequencing, reduces sequencing errors due to the overlapping of optical signals, and improves the correct-read rate. The block-rotation encoding algorithm achieves a storage density of 1.85 bits/nt, offers high-quality coding and efficient storage, and generates high-quality codewords. Recursive formula We have presented a recursive expression for Br(n, w), i.e., the number of quaternary words with length n, GC-weight w, and runlength constraint r.
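As a sanity check on the quantity Br(n, w) just defined, it can be enumerated by brute force for small n (the recursive expression itself is not reproduced here; the function name is ours):

```python
from itertools import product

def brute_force_Br(n, w, r):
    """Count quaternary words of length n with exactly w G/C symbols
    and no run of identical symbols longer than r (brute force; small n only)."""
    count = 0
    for word in product("ACGT", repeat=n):
        # GC-weight: number of G or C symbols must equal w.
        if sum(c in "GC" for c in word) != w:
            continue
        # Runlength constraint: reject any run longer than r.
        run, ok = 1, True
        for a, b in zip(word, word[1:]):
            run = run + 1 if a == b else 1
            if run > r:
                ok = False
                break
        if ok:
            count += 1
    return count
```

For instance, with n = 2, w = 1 and r = 1 (no identical adjacent symbols), the count is 8: one symbol from {G, C}, one from {A, T}, in either order.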
Furthermore, we have derived a recursive expression for B1(n, w, 2), i.e., the size of the largest quaternary code with length n, GC-weight w, minimum Hamming distance 2, and no identical symbols next to each other in any codeword. An interesting research challenge is to find expressions or improve bounds for Br(n, w, d) with other values of r and/or d, i.e., for cases with a more relaxed runlength constraint and/or stronger error correcting/detecting capabilities. The recursive formula determines the size of the set of quaternary words with any fixed length, GC-weight, and runlength constraint. This simple recursive expression has the advantage that it can be easily evaluated for large lengths. Squiggle-based construction of concatenated DNA codes The squiggle-based construction of concatenated DNA codes for nanopore sequencers: Three aspects set our concatenated codes apart. First, our concatenated codes embed digital information in a single DNA molecule and are decoded from a single noisy read. Second, our codes are simply concatenated and thus do not suffer a loss in storage efficiency from the use of an outer code. Finally, our decoder uses squiggle inputs, while prior work operates only on basecalled sequences. Three-layer encoding Text encoding: ASCII text was converted into binary and encoded into DNA using the lemonjuice script. In brief, binary messages are entered into rows of a matrix, and base pairs are assigned according to the values in a given column. Positions that have identical, stacked binary values are assigned A/T (000) or T/A (111). Conversely, sites that require simultaneous encoding of different bits are assigned as C/G (011), G/C (100), 5mC/G (001), G/5mC (110), 5hmC/G (010), or G/5hmC (101). Following this matrix, canonical DNA bases suffice for two-layer encoding, while three-layer encoding requires modified cytosine analogs.
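The three-layer assignment matrix just described can be sketched in code (a hedged illustration: the bit-triplet-to-base-pair table is taken from the text, while the function and its name are ours):

```python
# Three-layer assignment matrix from the text: each column of stacked bits
# (layer 1, layer 2, layer 3) maps to one (top strand, bottom strand) pair.
# "5mC" and "5hmC" denote the modified cytosine analogs.
PAIR = {
    "000": ("A", "T"),    "111": ("T", "A"),
    "011": ("C", "G"),    "100": ("G", "C"),
    "001": ("5mC", "G"),  "110": ("G", "5mC"),
    "010": ("5hmC", "G"), "101": ("G", "5hmC"),
}

def encode_columns(layers):
    """layers: three equal-length bit strings, one message per layer.
    Returns the top- and bottom-strand base lists, column by column."""
    top, bottom = [], []
    for col in zip(*layers):
        t, b = PAIR["".join(col)]
        top.append(t)
        bottom.append(b)
    return top, bottom
```

For example, encode_columns(["01", "10", "01"]) reads the columns "010" and "101" and yields the top strand [5hmC, G] paired with the bottom strand [G, 5hmC].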
As pointed out in the main text, it can be advantageous to avoid long homopolymer runs by substituting A/T and T/A with the equivalent 5mC/G and G/5mC base pairs. Image encoding: Portable bitmap format image files were converted into DNA sequences (DFT-1 to -56) using the bakingsoda script. In brief, the images were first compressed with run-length encoding, where runs of data holding the same binary value are stored as a count. Scanning across the horizontal axis gave the best compression factor. The obtained counts were further encoded with the universal Elias gamma code. In this code the value 1 remains as such, but for a number x > 1, define N as the highest power of 2 it contains. This value N is encoded in unary (N zeroes followed by a 1), and the remaining N binary digits of x are appended to the unary representation of N. To signal the end of the image, a unique stop code was appended to the Elias gamma encoded binary string. This string was then split into fragments of the desired length – 72 in the present study. If the string did not divide into an integer number of fragments, or the images yielded unequal numbers of fragments, 0s and 1s were randomly inserted to pad the strings. A list equal in length to the number of fragments was generated, with every element containing a matrix where each row contained the code from one image. Bases were then assigned depending on the bit pattern in a column, using the same assignment matrix as for text encoding. The generated 72mers were then barcoded (7 nucleotides in length) and an additional A was added to the 3’ end of each resulting oligonucleotide. Triplet code Triplet code scheme: the first, second, and third bits of the triplet code are encoded by the FAM, HEX, and ROX signals, respectively. We divided the 48 × 48 array into 16 12 × 12 subarrays, each of which can accommodate a pixelated image of one letter (A-P).
The FAM channel encodes the letters A-D and I-L, the HEX channel encodes the letters A, B, E, F, I, J, M, and N, and the ROX channel encodes the letters A, C, F, H, I, K, N, and P. Single triplet codes 100, 010, and 001 are colored red, green, and blue; double triplet codes 110, 011, and 101 are colored yellow, cyan, and magenta; and the triple and zero triplet codes 111 and 000 are colored white and gray, respectively. Trits coding These transitions between non-identical nucleotides encode user-defined information. Given three possible transitions from each nucleotide, we use trits, a ternary instead of binary representation of information, to maximize information capacity. To convert information to DNA, information in trits is mapped to a template sequence that represents the corresponding transitions between non-identical nucleotides, starting with the last nucleotide of the initiator. Enzymatic DNA synthesis of each template sequence produces "raw strands", or strandsR, which can be physically stored. To retrieve information stored in DNA, strandsR are sequenced and the transitions between non-identical nucleotides extracted, resulting in "compressed strands", or strandsC. If a strandC is equivalent to the template sequence, the strand (compressed or raw) is considered "perfect" and the information is retrieved by mapping its sequence of transitions between non-identical nucleotides back to trits. Two-step encoding The two-step encoding procedure first translates an image file into 24 binary strings (if 3-bit quantization is used) and then converts the binary strings into DNA oligos for storage and amplification. 1) Quantization and Conversion into 2D Color Arrays. 2) Conversion of 2D Arrays into 1D Vectors. 3) Partitioning of Vectors According to Levels. 4) Differential Encoding and Huffman Encoding for Lossless Compression. 5) Conversion of Binary Strings into DNA Oligos.
6) Encoding of Granular and Highly Relevant Image Features via LDPC Codes (Optional). XOR based encoding algorithm The encoding process uses two separate inputs: (i) the data to be encoded (the value) and (ii) a primer (the key). In order to retrieve the required data correctly, the key is created from the index of the data in the database, the type of the data, and the size of the file in bits. This key is then encoded using the proposed scheme. The significance of inserting the key is that it makes indexing possible in the synthetic DNA as well. Further, the key and the value are stitched together, sequenced, and stored in the form of a 3D-printed material. The process is reversible, as the targeted data can be identified using the primer and decoded by the retrieval algorithm. Yin-yang codec Unlike other coding schemes developed using fixed mapping rules, the YYC provides dynamic combinatory coding schemes and can thus generate optimal DNA sequences, addressing the DNA synthesis and sequencing difficulties caused by sequences with long homopolymers, extreme GC content or complex secondary structure. The general principle of the YYC algorithm is to incorporate two independent encoding rules, called ‘yin’ and ‘yang’, into one DNA sequence (called ‘incorporation’), thereby compressing two bits into one nucleotide. Here, we use N1, N2, N3 and N4 to represent the four nucleotides A, T, C and G, respectively. For one selected combinatory coding scheme, an output DNA sequence is generated by the incorporation of two binary segments of identical length. In the first step, the yang rule is applied; it admits six different coding combinations. Then, in the yin rule, N1 and N2 are mapped to different binary digits, while N3 and N4 are also mapped to different binary digits independently of N1 and N2, leading to a total of 256 different coding combinations. Application of the yin and yang rules at one position yields one and only one consensus nucleotide.
Meanwhile, for each of the four options for the previous nucleotide, the two groups (N1/N2 and N3/N4) independently choose their mappings to 0 and 1. Therefore, the incorporated yin and yang rules provide a total of 1,536 (6 × 256) combinations of transcoding schemes for encoding the binary sequence.
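The 1,536-combination count can be verified with a short enumeration (a sketch based on our reading of the scheme; the variable names are ours, not the authors'):

```python
from itertools import combinations

# Yang rule: choose which 2 of the 4 nucleotides represent bit 0
# (the other 2 represent bit 1): C(4, 2) = 6 combinations.
yang_rules = len(list(combinations("ATCG", 2)))  # 6

# Yin rule: for each of the 4 possible previous nucleotides, the pair
# N1/N2 picks which member maps to 0 (2 ways) and N3/N4 does the same
# independently (2 ways): (2 * 2) ** 4 = 256 combinations.
yin_rules = (2 * 2) ** 4  # 256

total = yang_rules * yin_rules
print(total)  # 1536
```

The product reproduces the 6 × 256 = 1,536 transcoding schemes stated above.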