Transcriptomic landscape of Dendrobium huoshanense and its genes related to polysaccharide biosynthesis

Dendrobium huoshanense has long been used to treat various diseases in oriental medicine. In order to study its gene expression profile, transcripts involved in the biosynthesis of precursors of polysaccharides, as well as mechanisms underlining morphological differences between wild and cultivated plants, three organs of both wild type and cultivated D. huoshanense were collected and sequenced by Illumina HiSeq4000 platform, yielding 919,409,540 raw reads in FASTQ format. After Trinity de novo assembly and quality control, 241,242 nonredundant contigs with the average length of 967.5 bp were generated. qRT-PCR experiment on the selected transcripts showed the transcriptomic data were reliable. BLASTx was conducted against NR, SwissProt, String, Pfam, and KEGG. Gene ontology annotation revealed more than 40,000 contigs assigned to catalytic activity and metabolic process, suggesting its dynamic physiological activities. By searching KEGG pathway, six genes potentially involved in mannose biosynthetic pathway were retrieved. Gene expression analysis for stems between wild and cultivated D. huoshanense resulted in 956 genes differentially expressed. Simple sequence repeats (SSRs) analysis revealed 143 SSRs with the unit size of 4 and 3,437 SSRs the size of 3. The obtained SSRs are the potential molecular markers for discriminating distinct cultivars of D. huoshanense.


Digital signature
This PDF has been certified using digital signature with a trusted timestamp to assure its origin and integrity. A verification trust dialog appears on the PDF document when it is opened in a compatible PDF reader. Certificate properties provide further details such as certification time and a signing reason in case any alterations made to the final content. If the certificate is missing or invalid it is recommended to verify the article on the journal website. also made [6]. On the other hand, a characteristic alkaloid found in D. huoshanense, dendrobine, which was first isolated by H. Suzuki in 1932 from D. nobile and comprised the sesquiterpene skeleton structure with 15 carbons, acts as an antagonist of taurine and beta-alanine [7]. Moreover, dendrobine also exhibits antipyretic, hypertensive, and antiviral effects [8]. Although several groups have reported racemic and enantioselective syntheses for this compound during the past five decades [9][10][11][12], study on its biosynthetic pathway remains limited [13,14].
Due to its rigorous requirement for growth environment, long growth cycle, and illogical development by human beings, resource for D. huoshanense in China is now endangered. Employing transcriptomic approaches to collect and analyze bioinformation for understanding its genetic makeup and biosynthetic pathways of important secondary metabolites is a practical way to better protect and utilize this medicinal resource by using biotechnologies to produce products of interest in the future. By far, studies on genomic and transcriptomic level of Dendrobium plants have been confined to a few species such as D. officinal and D. nobile [13,15,16]. To our best knowledge, no transcriptomic report on D. huoshanense was retrieved up to date.
Analyses on transcripts before the dawn of microarray usually deployed radioactive DNA probes to hybridize targets in addition to common protocols such as homogenization and nucleic acid extraction. It was slow, expensive, laborious, and environmentally harmful [17]. With the development of DNA microarray, studying transcriptome became possible by fixing custom-designed DNA, cDNA, or oligonucleotides probes on glass surface or silicon chips for hybridization [18]. Shortcomings accompanying this powerful approach are mainly the requirement of existing knowledge regarding genomic information and the limited dynamic range of microarray chips themselves. In the early days of this century, RNA-Seq (RNA sequencing), also known as deep or high-throughput sequencing, offered a brand-new solution for those nonmodel plants, including D. huoshanense, which lack reference genomes. Meanwhile, RNA-Seq technologies intensively read amplified cDNA sequences and the coverage showing the times at certain locations where a nucleotide is detected could be as many as several thousand. Main stream software for de novo assembly of the data resulted from sequencers include SOAPdenovo2 [19] (Short Oligonucleotide Analysis Package 2), AbySS [20] (Assembly by Short Sequences), and Trinity [21] with numerous downstream applications to facilitate individual tasks.
In this study, 919,409,540 raw reads in FASTQ format from six organs of D. huoshanense were obtained by Illumina HiSeq4000 sequencing platform. De novo assembly was fulfilled by Trinity software. Transcripts possibly involved in the biosynthesis of important secondary metabolites, polysaccharides, and dendrobine, were proposed by analyzing findings from KEGG [22] (Kyoto Encyclopedia of Genes and Genomes) pathway and Gene Ontology databases. Assembled contigs, annotation, as well as differently expressed genes between distinct samples were uploaded to Github (https:// github.com/rongchunhan/Dendrobium-huoshanense).

Plant material and mRNA purification
All fresh samples were collected from healthy D. huoshanense plants in Huoshan County, Anhui Province, China in June 2016. Three biological replicates were prepared for wild flower of this species, three for wild leaf and three for wild stem. As far as cultivated type is concerned, the biological replicates for flower, leaf, and stem are two, three and three, respectively. All the subjects were submerged in RNAlater solution (Invitrogen, USA) right after sampling in the fields, with each sample trimmed to approximately 5 mm in every dimension. For total RNA extraction, RNAlater was removed from the sample surface by using a Kimwipe and then materials were frozen in liquid nitrogen. Subsequently, plant RNA purification reagent (Invitrogen) was used according to the manufacturer's instructions. Agilent 2100 Bioanalyzer was also applied to assess RNA integrity number (RIN), which should be over 7.5. mRNA was gathered by magnetic beads with oligo(dT) complementary to polyA tail of mRNA.

mRNA fragmentation, library preparation
Fragmentation buffer from Truseq RNA Sample Prep Kit (Illumina, USA) helped to randomly break mRNA, which could be several thousand kilobases long, into 300-bp fragments. With the presence of reverse transcriptase and random primers, first strand cDNA was synthesized. After the formation of double-stranded cDNA, end repair mix was added to make blunt end followed by attachment of adaptors.
High throughput sequencing and de novo assembly cDNA libraries were enriched through PCR amplification for 15 cycles and then certified low range ultra agarose was used for electrophoresis to recover target fragments. After quantification by TBS380 Picogreen (Invitrogen), samples were loaded to cBot for bridge PCR to form clusters. Finally, original data were generated from Illumina instrument through sequencing by synthesis approach.
By examining A/T/G/C base content distribution, distribution of base quality, and base error rate distribution, raw data were trimmed by using trimmomatic [23] to generate clean data for following processes. We adopted Trinity method to conduct de novo assembly and yielded contigs, which were the basis for downstream biological function analysis.

Transcripts annotation and KEGG pathway analysis
Prior to gene annotation, we used ORF prediction package provided by Trinity to do gene prediction for all the transcripts obtained. For eukaryotes, each sequence should contain UTR and then best ORF region is predicted according to Markov models [24]. Pfam database serves the purpose of revising predicted results and only protein sequences which can be matched to this database will be retained.
For BLASTx (Basic Local Alignment Search Tool), all the contigs were used to search NCBI NR (nonredundant), String, SwissProt [25], and KEGG databases for functional annotation with E-value set to 1e−5. NCBI NR contains nonredundant information from SwissProt, PIR (Protein Information Resource), PRF (Protein Research Foundation), and PDB (Protein Data Bank) databases and the translated protein data from GenBank as well as RefSeq. By comparing D. huoshanense transcriptomic data with NR, information for homologous sequences and similarity between related species could be attained.
By standardizing biological jargons across different databases, gene ontology [26] (GO) aims to categorize the results related to genetic research all over the world. We analyzed GO classification for D. huoshanense to study biological meanings and its distribution of all gene products. Currently, information from 66 unicellular organisms in COG (Clusters of Orthologous Groups of proteins) is generally divided into functional classifications [27]. By extracting COG number when comparing D. huoshanense sequence information against String database, we studied the functional attributes of this medicinal plant.
In living organisms, gene products do not exist or take effect in isolation. Different factors interact orderly to exhibit biological function, which makes KEGG, a database focusing on tackling genetic information in a systematic view, more valuable. It integrates sophisticated biological functions such as metabolic pathway, genetic information transfer, cellular process, and so forth to solve problems theoretically and practically. By BLASTx against KEGG database [28], we extracted KEGG orthology numbers which were used to retrieve biological pathways.

Gene expression analysis
RSEM [29] software calculates number of read counts for each transcript in a specific library and then FPKM (fragments per kilobase of exon per million mapped reads) values were obtained to show gene expression level. Three ingredients are essential: total exon fragments, mapped reads (millions), and exon length (Kb). To perform differential gene expression (GE) analysis, edgeR [30] was used based on RSEM results. For D. huoshanense, screening criteria for differential GE were FDR (false discovery rate) < 0.05 and log 2 |FC (fold change)| ≥ 1. We also analyzed upregulated and downregulated information in terms of GO and be aware that for a single transcript, it may correspond to several GO entries and vice versa.

Simple sequence repeats extraction
SSRs stands for simple sequence repeats, also called microsatellite or STRs (simple tandem repeats). SSR is widely distributed in eukaryotic genomes and because number and type of repeats vary in different individuals, it is very useful as molecular marker in the form of SSR or ISSR (inter SSR). We applied MISA (ver. 1.0) to detect SSR in transcripts [31].
Gene expression verification using quantitative RT-PCR Total RNA was extracted from different samples by the same method described in section "Plant materials and mRNA purification". RNA quality was then evaluated by ScanDrop 200 (Jena, Germany). Corresponding cDNA was synthesized with Superscript III Reverse Transcriptase (Invitrogen). Primers for qRT-PCR were designed by using NCBI/Primer-BLAST and tubulin was adopted as the internal control gene (Tab. S1). Amplification was performed using Stepone Plus Real-Time PCR System (Applied Biosystems, USA) with the method described previously [32].

Plant materials and cDNA library construction
Dendrobium huoshanense is perennial herb with succulent stems ranging from 3 to 9 cm long. Basal parts of its stem are typically inflated. The stems of cultivated D.
huoshanense are commonly longer than that of wild ones (Fig. 1) and after transplanting the cultivated plants to wild environment, this feature is preserved. In this study, we intended to collect information for the mechanisms underlining this morphological difference. After RNA extraction and purification, quality control was conducted to assure values of OD 260/280 and OD 260/230 were in the range of 1.8-2.0. A total of 17 cDNA libraries were constructed and sequenced by Illumina system. Library 1 to 3 belonged to D. huoshanense flower (wild), 4-6 leaf (wild), 7-9 stem (wild), 10-11 flower (cultivated), 12-14 leaf (cultivated), and 15-17 stem (cultivated). Fig. 2 illustrates the experimental flowchart for D. huoshanense analysis.

Trinity de novo assembly
After high-throughput sequencing, trimming as well as quality validation of the raw reads are indispensable because they may include adapters or data with poor quality. The software of Trimommatic was used to detect and discard adapters, subsequences of ambiguous nucleotides, and low-quality reads. The resultant clean reads were used for downstream de novo assembly. Clean reads from all 17 libraries were put together to be processed by Trinity and the output contigs longer than 200 bp were retrieved. To remove duplicates of contigs, CD-HIT-EST [33] was applied and its threshold of 0.9 was set to reduce noise interference and computational time. Finally, 241,242 nonredundant contigs were obtained. Tab. 1 shows statistics of sequencing and assembly. Percent of GC in eukaryotes varies from 20% to 60% [34]. GC content of D. huoshanense was found to be 43.77%, which is in the middle. Sequence length distribution of nonredundant contigs was summarized in Fig. S1. Number of contigs with the length from 201 to 1,000 bp was 161,876, which covered 67.1% of all transcripts.

Annotation of assembled transcripts
The number of contigs predicted to contain ORF by using ORF prediction package was 142,224 and this equaled 58.9% of the total assembled contigs, suggesting reliable results of our analysis. BLASTx is an algorithm to search a protein database with a translated nucleotide query. We adopted this strategy to do BLASTx against NCBI NR and other four public repositories. As a result, 32.99% of the transcripts had hit(s) in Pfam, 28.28% in KEGG, 38.20% in Swis-sProt, 22.09% in String, and 65.15% in NR. A detailed breakdown is presented in Fig. 3. By analyzing the hits against NCBI NR, transcripts from D. huoshanense that had similar sequences with different species could be studied. To our surprise, 39,861 contigs showed significant homology with Phoenix dactylifera, commonly known as date palm from family Arecaceae. And only 1,034 contigs were found closely similar to Arabidopsis thaliana (Fig. 4). Gene ontology classification provides overall insights into expression pattern of organisms under certain circumstances. We used Blast2GO online facility (http://www.blast2go.com/ b2ghome) to conduct GO annotation. Due to the nature of GO categories, one contig may correspond to several GO numbers from the three distinct classes: biological process, cellular component, and molecular function. From Fig. 5, over 40,000 contigs were assigned to catalytic activity and metabolic process, suggesting dynamic physiological properties taking place in D. houshanense, which is responsible for its medicinal efficacy. Nevertheless, the implications by using such strategy to analyze GO have obvious limitations. KEGG database is mostly dedicated for human and animal models and thus GO terms related to animal diseases may not be relevant to the study on plants. We turned to agriGO v2.0 [35], which focused on agricultural species, for more information and compared D. huoshanense GO data with that of A. thaliana. The result (Fig. S2) showed representative GO terms of the two species had similar distribution pattern. Percent of genes for A. thaliana was higher than that for D. huoshanense and this may attribute to its abundant short contigs that could not hit corresponding GO terms. We retrieved COG numbers for D. huoshanense transcripts by comparing sequence similarity against String database and subsequently acquired their COG function classification. Within

Items Numbers
Total bases of clean reads   the 25 classes, 3,909 transcripts were assigned to the type of information storage and processing with the functional label as translation, ribosomal structure, and biogenesis. Of the 12,472 genes that are related to metabolism, 1,144 transcripts were linked to secondary metabolites biosynthesis, transport, and catabolism (Fig. S3). KEGG pathway enrichment provides information regarding genes' interaction in a metabolic web. We listed 20 biosynthetic pathways in KEGG, which involved the most transcripts from D. huoshanense in Fig. S4 and 3,679 genes were assigned to ribosome biosynthesis. Dendrobium huoshanense is famous for its high contents of polysaccharides. By referring to KEGG database, the following is a possible biosynthetic pathway for D. huoshanense to produce mannose as the precursor of polysaccharides. After α-d-glucose is synthesized via photosynthesis, xylose isomerase catalyzes it to produce d-fructose, which is then turned into β-d-fructose-6P with the help of hexokinase or fructokinase. Subsequently, mannose-6-phosphate isomerase catalyzes β-d-fructose-6P to yield d-mannose-6P which is catalyzed by phosphomannomutase to form d-mannose-1P. Finally, with the presence of mannose-1-phosphate guanylyltransferase, GDP-d-mannose comes into being. The six above-mentioned enzymes that may be involved in the biosynthesis of polysaccharides were all presented in D. huoshanense transcriptomic dataset (Fig. 6).

Differential expression and SSR analysis
We were interested in analyzing differentially expressed (DE) genes between the wild and cultivated plants. Fig. 7 demonstrates that 956 genes were differentially expressed between stems of distinct types of D. huoshanense and this was a solid foundation to rely on for studying their morphological traits. Compared to leaf and flower, the morphological differences of stem between the wild and cultivated were more obvious. Generally, the stem of D. huoshanense has four-seven nodes and the protruding character for the cultivar is its longer internode compared with the wild type (Fig. 1), whereas it is hard to distinguish the two types solely by morphology of their leaves and flowers. As a new species, D. huoshanense was named fairly recently in 1984 [36] and cultivation for this herb started even later. In 2005, the seeds were collected by our lab in the wild and then suitable medium were prepared to help them germinate. After a few generations, the traits from the selected cultivar showed genetic stability. The relatively short period of breeding and inherent stability of the propagative organ may explain the reason why there were only eight DE genes found between wild and cultivated flowers. For DE genes between leaves of cultivated and wild type, there were 250 of them.
Gene ontology differential expression enrichment was also conducted for stems of both wild type and cultivar (Fig. 8). GO terms related to growth, multiorganism process and negative regulation of biological process were downregulated, which meant certain FPKM values in the cultivar were lower. On the contrary, receptor activity, structural molecular activity, and some others were higher.  SSR is a practical tool to discriminate different cultivars within a species. By examining a total of 171,111 contigs of D. huoshanense, we identified 25,424 SSRs and 4,367 contigs contained more than one SSR. In terms of unit size, it ranged from 1 to 6. SSRs in a number of 143 were retrieved with the unit size of 4 and 3,437 SSRs the size of 3. Fig. S5 demonstrates the distribution of SSRs in sequences.

Gene expression validation
To conduct qRT-PCR, three biological replicates were utilized for each reaction and a negative control comprising template without primers was also included. One gene involved in flavonoids biosynthesis, the other related to sugar metabolism, and the third annotated as class-1 low-molecular-weight heat shock protein (LMW HSP) were chosen as the target genes. Tab. 2 shows similar trend as far as the fold change between results of bioinformatics and qRT-PCR is concerned. For LMW HSP, it was differently expressed according to edgeR analysis, which coincided with our qPCR results. This indicated the RNA-Seq output was reproducible and reliable.

Discussion
Dendrobium huoshanense possesses remarkable pharmaceutical attributes and according to ancient materia medica books, it has been applied to treat various diseases for a very long time in China. However, due to its high morphological similarity with species from the same genus Dendrobium, it was often confused with others and researchers made judgments according to its very narrow distribution area which was recorded in ancient books. It was not until 1984 [36] that its scientific name was published and Chinese pharmacopeia included several Dendrobium species under the name Dendrobii Caulis. Commercial cultivation of D. huoshanense began in late twentieth century and the limited time span may explain the reason why gene expression was so similar concerning flower from the cultivar and wild type. Fig. 8 demonstrates downregulated GO terms related to growth in cultivar despite its longer stem and internode. This suggests the underling mechanisms of D. huoshanense morphology are complicated. Despite the effort to tackle genetic information for the medicinal plants in genus Dendrobium, RNA-Seq analysis for D. huoshanense has not been conducted so far. To provide transcriptomic data for this plant will facilitate the study on the biosynthesis of its abundant polysaccharides and other important secondary metabolites. By collecting 17 samples of wild and cultivated plants, we used Illumina platform to sequence its flower, stem, and leaf. Through qRT-PCR verification, the resultant data were reliable. RNA-Seq analysis provides not only the overall expression profile of this plant, but also detailed information of sequences that are involved in various biosynthetic pathways. The transcripts identified in fructose and mannose biosynthesis is particularly valuable because the genes will help to certify the actual pathway in D. huoshanense and by gene expression correlation, new genes related to downstream polysaccharide biosynthesis may pop up. The results of this experiment will serve as a stepping stone for D. huoshanense future research.

Fig. S1
Sequence length distribution of the assembled contigs.