FAQ - Nucleosome Positioning Map

Frequently Asked Questions

What is NucMap database?
How can I search interested samples in NucMap database?
What is the rule of sample accession id?
What is Nucleosome Browser?
What kind of analysis can be done in NucMap Database?
Can I upload my own data to NucMap for side-by-side analysis?
In the analysis module, is there any restriction in interested gene names?
What’s the format for user defined gene list in analysis?
What’s the format for genomic regions in analysis?
Which assembly versions were used for all species?
Can I download data in NucMap database?
What is the data processing pipeline in NucMap database?

1. What is NucMap database?

NucMap is a database of genome-wide nucleosome positioning maps across species. It is dedicated to collecting, analyzing, and storing nucleosome positioning data from all organisms. NucMap provides services for querying and visualizing nucleosome positioning information at single-nucleotide resolution. It also provides enrichment analysis for genes or samples of interest.

2. How can I search interested samples in NucMap database?

On the "Sample Search" page, once a species is selected, all samples for that species are listed by default. Users can narrow the sample list by searching with keywords describing the sample material or assay. All raw data in NucMap were collected from the GEO and ENCODE databases, so a GEO accession ID can also be used to narrow the sample list.

3. What is the rule of sample accession id?

Sample accession IDs in NucMap follow the format XXNucYYYZZ. XX denotes the species. YYY is the biosample number within that species; samples sharing the same YYY may come from the same or similar biosamples. ZZ is the assay number within biosample YYY. Different ZZ values indicate that different assay conditions were applied to the samples, such as the MNase concentration, the digestion time, or the capture antibody used.
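As a rough illustration, an accession could be split into its three parts as sketched below. The FAQ does not state how many digits YYY and ZZ occupy, so this sketch assumes that XX is two lowercase letters and that ZZ is the final two digits; both are assumptions, and the function name parse_accession is hypothetical rather than part of NucMap.

```python
import re
from typing import NamedTuple


class NucMapAccession(NamedTuple):
    species: str    # XX, species prefix, e.g. "mm"
    biosample: str  # YYY, biosample number within the species
    assay: str      # ZZ, assay number within the biosample


# Assumption (not stated in the FAQ): XX is two lowercase letters and
# ZZ is the last two digits; everything in between is the biosample number.
_ACCESSION_RE = re.compile(r"^([a-z]{2})Nuc(\d+)(\d{2})$")


def parse_accession(accession: str) -> NucMapAccession:
    """Split an accession such as 'mmNuc0020101' into species, biosample, assay."""
    match = _ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"not a recognised NucMap accession: {accession!r}")
    return NucMapAccession(*match.groups())


print(parse_accession("mmNuc0020101"))
```

Keeping the digits as strings preserves leading zeros, so the parts can be joined back into the original ID.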

4. What is Nucleosome Browser?

The Nucleosome Browser is an interactive browser built to visualize nucleosome positioning information in NucMap. It is based on JBrowse. For each species, tracks for the raw read signal and the analyzed nucleosome peaks are available for all samples. Users can select the tracks they are interested in and view any region of the genome.

5. What kind of analysis can be done in NucMap Database?

With the analysis modules in NucMap, users can analyze nucleosome enrichment at transcription start sites (TSSs) across samples or across gene groups of interest. Both a heatmap and an enrichment curve are available in the analysis module.

6. Can I upload my own data to NucMap for side-by-side analysis?

Yes. Users can upload all types of data for side-by-side analysis in the nucleosome browser. The only prerequisite is that the uploaded file(s) must be based on the same reference genome as the one used in NucMap. The genome assembly version used for each species is listed in the table under question 10.

7. In the analysis module, is there any restriction in interested gene names?

Generally, there is no restriction on gene names. However, the choice of name type matters. Enrichment analysis focuses on nucleosome density around the transcription start site (TSS), and one gene may have multiple transcripts with different TSSs. If a gene name such as Tp53 or Med14 is used, the most upstream TSS is used in the enrichment analysis. Users who want to examine specific TSSs should use transcript names instead of gene names. All species in NucMap use RefSeq annotation, and Excel files mapping gene names to transcript names are available for all species on the download page. If any transcript or gene names do not match the names in the NucMap database, a warning is shown on the analysis module page.
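The "most upstream TSS" rule can be sketched as follows. This is a minimal illustration, not NucMap's implementation: it assumes that "upstream" is strand-aware (smallest coordinate on the "+" strand, largest on the "-" strand), and the gene names, transcript names, and coordinates in the example are made up.

```python
from typing import Dict, Iterable, Tuple

# (transcript name, gene name, strand, TSS position)
Transcript = Tuple[str, str, str, int]


def most_upstream_tss(transcripts: Iterable[Transcript]) -> Dict[str, Transcript]:
    """Return, for each gene, the transcript whose TSS lies furthest upstream.

    Assumption: upstream means the smallest coordinate on "+" and the
    largest coordinate on "-".
    """
    best: Dict[str, Transcript] = {}
    for tx in transcripts:
        _, gene, strand, tss = tx
        current = best.get(gene)
        if current is None:
            best[gene] = tx
        elif strand == "+" and tss < current[3]:
            best[gene] = tx
        elif strand == "-" and tss > current[3]:
            best[gene] = tx
    return best


# Hypothetical transcripts for two hypothetical genes.
example = [
    ("tx1", "GeneA", "+", 1000),
    ("tx2", "GeneA", "+", 800),
    ("tx3", "GeneB", "-", 5000),
    ("tx4", "GeneB", "-", 5200),
]
for gene, tx in most_upstream_tss(example).items():
    print(gene, tx[0], tx[3])
```

Supplying a transcript name instead of a gene name bypasses this selection and targets one TSS directly.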

8. What’s the format for user defined gene list in analysis?

The customized gene list should be a tab-separated table. The first column contains gene names or transcript names, and the second column contains the gene group name. If users want to classify genes into different groups, the second column provides the group labels. Group names are chosen by the user, for example “high expressed gene”, “low expressed gene”, or “housekeeping gene”. The second column is optional; if it is not provided, all genes are treated as one group. Here is an example of customized gene list (Species: Mus musculus, Sample: mmNuc0020101).
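For checking a gene list before submitting it, the format described above could be read as in the sketch below. The function name read_gene_list and the fallback label "all_genes" are hypothetical; the FAQ only says that genes without a second column are treated as one group and does not name that group.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO

# Hypothetical label for genes without a group column.
DEFAULT_GROUP = "all_genes"


def read_gene_list(handle: TextIO) -> Dict[str, List[str]]:
    """Read a tab-separated gene list into {group name: [gene or transcript names]}."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        name = row[0].strip()
        has_group = len(row) > 1 and row[1].strip()
        group = row[1].strip() if has_group else DEFAULT_GROUP
        groups[group].append(name)
    return dict(groups)


# Two genes named in this FAQ, with example group labels.
text = "Tp53\thigh expressed gene\nMed14\tlow expressed gene\n"
print(read_gene_list(io.StringIO(text)))
```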

9. What’s the format for genomic regions in analysis?

Genomic regions should be formatted as a BED file. One difference from the standard BED format is that the 4th column is used as the group name. Each region file may contain at most 6 different group names. During computation, the center of each region is taken and extended, according to the strand, by the upstream and downstream sizes defined by the region range. If only the first three columns are provided, all regions are assigned to “default_group”. If the 6th column is missing, the strand is treated as “+”. Here are three valid examples:

Example 1: full BED file.

chr1  3451234 3459876 group1  0 +
chr2  1242357 1248769 group1  0 +
chr4  6548912 6550123 group2  0 +
chr2  3251945 3254589 group3  0 +

Example 2: strands are missing. All strands will be treated as “+”.

chr1  3451234 3459876 group1
chr2  1242357 1248769 group1
chr4  6548912 6550123 group2
chr2  3251945 3254589 group3

Example 3: only the first three columns are provided. All group names will be taken as “default_group” and all strands will be treated as “+”.

chr1  3451234 3459876
chr4  6548912 6550123
chr2  3251945 3254589
chr2  1242357 1248769
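The defaults described above can be sketched in a small parser. This is an illustration under assumptions, not NucMap's code: the names parse_regions and window are hypothetical, fields are split on any whitespace (so group names containing spaces are not supported), and the center is assumed to be the integer midpoint of start and end.

```python
from typing import List, NamedTuple, Tuple

MAX_GROUPS = 6  # limit stated in the FAQ


class Region(NamedTuple):
    chrom: str
    start: int
    end: int
    group: str
    strand: str


def parse_regions(text: str) -> List[Region]:
    """Parse region lines, filling in the documented defaults."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 3:
            raise ValueError(f"need at least 3 columns: {line!r}")
        group = fields[3] if len(fields) > 3 else "default_group"
        strand = fields[5] if len(fields) > 5 else "+"
        regions.append(Region(fields[0], int(fields[1]), int(fields[2]), group, strand))
    if len({r.group for r in regions}) > MAX_GROUPS:
        raise ValueError(f"more than {MAX_GROUPS} group names")
    return regions


def window(region: Region, upstream: int, downstream: int) -> Tuple[int, int]:
    """Extend the region center by the upstream/downstream range, respecting strand."""
    center = (region.start + region.end) // 2  # assumed: integer midpoint
    if region.strand == "-":
        # On the minus strand, upstream lies at higher coordinates.
        return center - downstream, center + upstream
    return center - upstream, center + downstream


for r in parse_regions("chr1  3451234 3459876 group1  0 +\nchr4  6548912 6550123"):
    print(r, window(r, upstream=1000, downstream=500))
```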

10. Which assembly versions were used for all species?

Species	Genome assembly id and alias	RefSeq assembly accession
Arabidopsis thaliana	TAIR10	GCF_000001735.3
Caenorhabditis elegans	WBcel235, ce11	GCF_000002985.6
Candida albicans	ASM18296v3	GCF_000182965.3
Danio rerio	GRCz10, danRer10	GCF_000002035.5
Drosophila melanogaster	Release 6 plus ISO1 MT, dm6	GCF_000001215.4
Homo sapiens	GRCh38, hg38	GCF_000001405.38
Mus musculus	GRCm38, mm10	GCF_000001635.26
Neurospora crassa	NC12	GCF_000182925.2
Oryza sativa	IRGSP Build 4.0	GCF_000005425.2
Plasmodium falciparum	ASM276v1	GCF_000002765.3
Saccharomyces cerevisiae	R64, sacCer3	GCF_000146045.2
Schizosaccharomyces pombe	ASM294v2	GCF_000002945.1
Trypanosoma brucei	ASM21029v1	GCF_000210295.1
Xenopus laevis	Xenopus_laevis_v2, Xenla9.1	GCF_001663975.1
Zea mays	AGPv4	GCF_000005005.2

11. Can I download data in NucMap database?

In NucMap, all processed read signals (in bigWig format) and nucleosome peaks can be freely downloaded from the Download page.

12. What is the data processing pipeline in NucMap database?

Details of each data processing step:

SRA data were converted to FASTQ data with fastq-dump (SRA Toolkit).
FASTQ data were aligned to the corresponding reference genome with bwa.
Potential multiple alignments were removed by discarding entries with MAPQ < 10, and duplicates were removed with Picard.
BAM files were converted to BED files with bedtools.
Nucleosome peaks were called with DANPOS.
Nucleosome peaks were called with iNPS.
BED files were converted to bigWig files with bedtools and bedGraphToBigWig (UCSC utilities). Read counts were normalized to reads per million (RPM).
All reads were extended toward the 3' end to a fixed length of 73 bp, and the resulting BED files were converted to bigWig files; these files carry the marker "ext73bp" in their filenames. The extended reads were then shifted toward the 3' end by 73 bp to increase the signal-to-noise ratio, and bigWig files were made from the shifted locations; these files carry the marker "shift" in their filenames. Read counts were normalized to reads per million (RPM).
Nucleosome peaks were annotated to the nearest gene (by searching for the nearest TSS) with a custom Perl script.
Read density at each TSS was calculated with bedtools and a custom Perl script.
A binary peak matrix (whether a peak covers a specific position) was calculated at each TSS with a custom Perl script.
The peak count at each TSS was calculated with a custom Perl script.
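The read extension, shifting, and RPM normalization steps above can be illustrated with a short sketch. It is one reading of the description, not the pipeline's actual code: reads are assumed to be BED-style intervals (0-based start, exclusive end), the 3' direction is taken to be increasing coordinates on "+" and decreasing coordinates on "-", and clipping to chromosome boundaries is left out. All function names are hypothetical.

```python
from typing import Tuple

# (chrom, start, end, strand), BED-style: 0-based start, exclusive end
Read = Tuple[str, int, int, str]


def extend_read(read: Read, length: int = 73) -> Read:
    """Keep the 5' end fixed and set the read length to `length` toward the 3' end."""
    chrom, start, end, strand = read
    if strand == "-":
        # 5' end is `end`; the 3' direction is toward lower coordinates.
        return chrom, end - length, end, strand
    return chrom, start, start + length, strand


def shift_read(read: Read, offset: int = 73) -> Read:
    """Move the whole read toward its 3' end by `offset` bp."""
    chrom, start, end, strand = read
    if strand == "-":
        return chrom, start - offset, end - offset, strand
    return chrom, start + offset, end + offset, strand


def rpm(count: float, total_reads: int) -> float:
    """Normalize a raw read count to reads per million (RPM)."""
    return count * 1_000_000 / total_reads


read = ("chr1", 100, 150, "+")
extended = extend_read(read)    # corresponds to the "ext73bp" files
shifted = shift_read(extended)  # corresponds to the "shift" files
print(extended, shifted, rpm(5, 2_000_000))
```

A real pipeline would also clip intervals at position 0 and at the chromosome end before writing bedGraph input for bedGraphToBigWig.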