## HA412HO bronze annotation -- FILE_SPECIFICATIONS.txt
##
## Description: This file describes the Helianthus annuus bronze annotation file specifications.
##              The files are listed under the directory in which they are found.
##              Should you have a question or comment, please use the contact listed below.
##
## Contact: Evan Staton (statonse@gmail.com)
##
## Date last modified: May 24, 2016
## Modified by: SES

-- genes/
----------------------------------------------------------------------------------------------
* Ha412v1r1_CDS_v1.0.fasta.gz
  - Description: Nucleotide CDS for each gene
  - Header format: >TranscriptID GeneID Chromosome_start_end_strand
  - Example: >Ha1_00044280-RA Ha1_00044280 Ha1_94590914-94591860_-

* Ha412v1r1_CDS_iprscan_v1.0.tsv.gz
  - Description: Functional annotations in tab-delimited format (TSV)
  - Format specification: ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/4/README.html#3
  - Example (one of many annotations for a single transcript):

    HA_00012216-RA 6de45283509e39aeac36aa955b5fa739 476 ProSiteProfiles PS51375 Pentatricopeptide (PPR) repeat profile. 290 324 10.687 T 12-04-2016 IPR002885 Pentatricopeptide repeat

    This file is where you will find GO terms and all other functional information associated
    with a gene. The annotations describe the functional products of the genes, that is, the
    transcripts. You can easily go back to the gene from which a transcript was derived by
    removing the "-RA" suffix. Note that these annotations are most easily found in the genome
    browser, where all of this information is linked to the relevant database entries.
    For example, click the gene or transcript in the following link and take a look:
    http://sunflowergenome.org/jbrowse_current/?data=extdata%2Fbronze&loc=Ha1%3A76989221..76999930&tracks=Genes&highlight=

* Ha412v1r1_prot_v1.0.faa.gz
  - Description: Translated CDS, or protein sequence
  - Header format (same as above): >TranscriptID GeneID Chromosome_start_end_strand
  - Example: >Ha1_00044004-RA Ha1_00044004 Ha1_75475361-75475747_-

* Ha412v1r1_genes_v1.0.gff3.gz
  - Description: Annotated gene features in GFF3 format
  - Format specification: http://www.sequenceontology.org/gff3.shtml
  - Example (one gene, excluding declaration lines):

    Ha1 maker gene 72490 73753 . - . ID=Ha1_00043025;Name=maker-Ha1-snap-gene-0.20;Alias=gene1;Dbxref=Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473;Ontology_term=GO:0005215,GO:0006810,GO:0016020
    Ha1 maker mRNA 72490 73753 . - . ID=Ha1_00043025-RA;Parent=Ha1_00043025;Name=maker-Ha1-snap-gene-0.20-mRNA-1;Alias=mRNA1;_AED=0.09;_QI=0|0|0|1|0|0.25|4|0|297;_eAED=0.11;Dbxref=Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473;Ontology_term=GO:0005215,GO:0006810,GO:0016020
    Ha1 . stop_codon 72490 72492 . - . Parent=Ha1_00043025-RA
    Ha1 maker exon 72490 72939 . - . Parent=Ha1_00043025-RA
    Ha1 maker CDS 72490 72939 . - 0 ID=CDS1;Parent=Ha1_00043025-RA
    Ha1 . intron 72940 72952 . - . Parent=Ha1_00043025-RA
    Ha1 maker exon 72953 73065 . - . Parent=Ha1_00043025-RA
    Ha1 maker CDS 72953 73065 . - 2 ID=CDS1;Parent=Ha1_00043025-RA
    Ha1 . intron 73066 73399 . - . Parent=Ha1_00043025-RA
    Ha1 maker exon 73400 73455 . - . Parent=Ha1_00043025-RA
    Ha1 maker CDS 73400 73455 . - 1 ID=CDS1;Parent=Ha1_00043025-RA
    Ha1 . intron 73456 73481 . - . Parent=Ha1_00043025-RA
    Ha1 maker exon 73482 73753 . - . Parent=Ha1_00043025-RA
    Ha1 maker CDS 73482 73753 . - 0 ID=CDS1;Parent=Ha1_00043025-RA
    Ha1 . start_codon 73751 73753 . - . Parent=Ha1_00043025-RA

    Note that the 9th field, the attributes, contains the functional annotations for the gene.
    This information is also available in the simple tab-delimited table described above
    (Ha412v1r1_CDS_iprscan_v1.0.tsv.gz). Importantly, each gene and mRNA has an 'Alias' that
    refers to a previous version of the annotations, so all work can be carried forward. The
    alias can be ignored if you are not working with older annotations. All of these terms are
    defined on the Sequence Ontology website. These annotations are viewable in JBrowse:
    http://sunflowergenome.org/jbrowse_current/?data=extdata/bronze

* Ha412v1r1_genes_v1.0.gtf.gz
  - Description: Annotated gene features in GTF format

    This format is nearly identical to GFF3 except for the formatting of the attributes. The
    main difference is that it is used only for describing the coding features. The file is a
    bit verbose because the functional annotations are associated with each exon for exon-level
    analyses.

  - Format specification: http://mblab.wustl.edu/GTF22.html
  - Example (just one exon):

    Ha1 maker exon 72490 72939 . - . transcript_id "Ha1_00043025-RA"; gene_id "Ha1_00043025"; gene_name "maker-Ha1-snap-gene-0.20"; Name "maker-Ha1-snap-gene-0.20-mRNA-1"; Alias "mRNA1"; _AED "0.09"; _QI "0|0|0|1|0|0.25|4|0|297"; _eAED "0.11"; Dbxref "Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473"; Ontology_term "GO:0005215,GO:0006810,GO:0016020"; Alias "gene1"; Dbxref "Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473"; Ontology_term "GO:0005215,GO:0006810,GO:0016020";

* Ha412v1r1_genes_v1.0.fasta.gz
  - Description: Full-length nucleotide sequence for each gene
  - Header format: >GeneID Chromosome_start_end
  - Example: >Ha10_00000001 Ha10_61357-62306

The gene, CDS, and protein files have spaces in the header to allow easier interpretation of
analyses (e.g., with BLAST).

-- transposons/
----------------------------------------------------------------------------------------------
* Ha412v1r1_transposons_v1.0.gff3.gz
  - Description: Annotated transposons in GFF3 format
  - Format specification: http://www.sequenceontology.org/gff3.shtml
  - Example (one transposon, excluding declaration lines):

    Ha1 LTRharvest repeat_region 62089 64299 . + . ID=repeat_region1
    Ha1 LTRharvest target_site_duplication 62089 62093 . + . Parent=repeat_region1
    Ha1 LTRharvest LTR_retrotransposon 62094 64294 . + . ID=LTR_retrotransposon1;Parent=repeat_region1;family=RLX_singleton_family0;ltr_similarity=93.49;seq_number=8
    Ha1 LTRharvest long_terminal_repeat 62094 62378 . + . Parent=LTR_retrotransposon1
    Ha1 LTRdigest protein_match 63296 63494 7.4e-09 + . Parent=LTR_retrotransposon1;name=ARID;reading_frame=2
    Ha1 LTRharvest long_terminal_repeat 64003 64294 . + . Parent=LTR_retrotransposon1
    Ha1 LTRharvest target_site_duplication 64295 64299 . + . Parent=repeat_region1

    This example is an LTR retrotransposon. You can see the annotation information listed in
    the attributes field (from Tephra), the LTR regions flanked by target site duplications,
    and any protein matches contained within the element. The entire region is defined as a
    repeat region. Note that the target site duplications are not part of the element but a
    feature of the host genome. All of these terms are defined on the Sequence Ontology
    website. The annotations are viewable in JBrowse:
    http://sunflowergenome.org/jbrowse_current/?data=extdata/bronze

* Ha412v1r1_transposons_v1.0.fasta.gz
  - Description: Full-length nucleotide sequence for each transposon

    Note that there is a defined nomenclature of TE types. That is, codes such as "DTA" (hAT
    transposon, Class II) or "RLG" (Gypsy LTR retrotransposon) are defined elsewhere
    (Wicker et al., 2009), not by me.

  - Header format: >transposonID_Chromosome_start_end
  - Example: >DTA_terminal_inverted_repeat_element2540_Ha10_281456270_281457637

    This tells us that hAT transposon number 2540 is found on chromosome 10 from
    281456270-281457637. If the family is known, it will be in the header as well.

  - Example: >RLG_family0_LTR_retrotransposon39667_Ha16_101255621_101274871

    This tells us that a Gypsy LTR retrotransposon (element 39667, which is part of family0)
    is found on chromosome 16 from 101255621-101274871.

-- complete/
----------------------------------------------------------------------------------------------
* Ha412v1r1_genes_transposons_v1.0.gff3.gz
  - Description: Annotated gene and transposon features in GFF3 format

    The format and specification of this file are described above. This is a file of combined
    features (found in other files) sorted by coordinate for convenience. Features are
    separated by "###" lines to make visual inspection easier.
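
    The "###" separators and the attributes field can also be used programmatically. The
    following is only a sketch, not part of the annotation pipeline: the function names are
    made up for illustration, the demo rows are abbreviated from the examples in this
    document, and it assumes the file is tab-delimited as the GFF3 specification requires.

```python
from typing import Dict, Iterable, Iterator, List


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn a GFF3 attributes string ('key=value;key=value') into a dict."""
    return dict(pair.split("=", 1) for pair in field.split(";") if "=" in pair)


def iter_feature_blocks(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield one list of tab-split rows for each '###'-delimited block."""
    block: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("###"):
            # A separator closes the current block (if it has any rows).
            if block:
                yield block
                block = []
        elif line and not line.startswith("#"):
            # Skip blank lines and declaration lines such as '##gff-version 3'.
            block.append(line.split("\t"))
    if block:
        yield block


# Abbreviated demo rows (attributes shortened); not a copy of the real file.
DEMO = [
    "##gff-version 3\n",
    "Ha1\tLTRharvest\trepeat_region\t62089\t64299\t.\t+\t.\tID=repeat_region1\n",
    "###\n",
    "Ha1\tmaker\tgene\t72490\t73753\t.\t-\t.\tID=Ha1_00043025;Alias=gene1\n",
    "Ha1\tmaker\tmRNA\t72490\t73753\t.\t-\t.\tID=Ha1_00043025-RA;Parent=Ha1_00043025\n",
    "###\n",
]

if __name__ == "__main__":
    # For the real file, something like gzip.open(path, "rt") could supply the lines.
    for block in iter_feature_blocks(DEMO):
        first = block[0]
        print(first[2], first[3], first[4], parse_attributes(first[8]).get("ID"))
```

    Reading block by block keeps a gene (or transposon) together with its child features,
    which is usually what you want before filtering on the attributes.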
* Ha412v1r1_genes_transposons_v1.0.fasta.gz
  - Description: Full-length gene and transposon sequences

    This is a combination of other files described in this document. The combined file of
    full-length genes and transposons is provided for convenience.

-- genome/
----------------------------------------------------------------------------------------------
* Ha412v1r1_genome_no_cp-mt-rd_chr-q.fasta.gz
  - Description: Full-length nucleotide sequence for each pseudomolecule
  - Header format: >ChromosomeID
  - Example: >Ha1

    This file contains the sequences for the 17 pseudomolecules (labeled Ha1-Ha17).

* Ha412v1r1_organelles_rDNA.fasta.gz
  - Description: Full-length nucleotide sequence for organelle and rDNA
  - Header format: >organelleII_gi_GInumber
  - Actual headers in file (description):

    >cp_gi_88656873 (chloroplast)
    >mt_gi_571031384 (mitochondria)
    >rDNA_gi_563582565 (rDNA)

    For information on these sequences, look up the GI number on GenBank (or contact Chris
    Grassa for more information).
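
The FASTA header formats described in this document (genes/ and transposons/) can be split
into their parts with a few lines of code. This is a minimal sketch based only on the example
headers shown above; the function and class names are hypothetical, and it assumes every
header in the files follows the same pattern as those examples.

```python
import re
from typing import NamedTuple


class CdsHeader(NamedTuple):
    transcript_id: str
    gene_id: str
    chromosome: str
    start: int
    end: int
    strand: str


class TransposonHeader(NamedTuple):
    element_id: str
    code: str  # three-letter TE code, e.g. "DTA" or "RLG"
    chromosome: str
    start: int
    end: int


# Location token as in 'Ha1_94590914-94591860_-' (Chromosome_start-end_strand).
_LOCATION = re.compile(r"^(?P<chrom>[^_]+)_(?P<start>\d+)-(?P<end>\d+)_(?P<strand>[+-])$")


def parse_cds_header(header: str) -> CdsHeader:
    """Parse '>TranscriptID GeneID Chromosome_start_end_strand' (CDS and protein files)."""
    transcript_id, gene_id, location = header.lstrip(">").split()
    match = _LOCATION.match(location)
    if match is None:
        raise ValueError(f"unexpected location field: {location!r}")
    return CdsHeader(transcript_id, gene_id, match["chrom"],
                     int(match["start"]), int(match["end"]), match["strand"])


def parse_transposon_header(header: str) -> TransposonHeader:
    """Parse '>transposonID_Chromosome_start_end' (transposon FASTA file)."""
    # The last three underscore-separated tokens are chromosome, start, and end;
    # everything before them is the element ID (which itself contains underscores).
    element_id, chromosome, start, end = header.lstrip(">").strip().rsplit("_", 3)
    code = element_id.split("_", 1)[0]
    return TransposonHeader(element_id, code, chromosome, int(start), int(end))


def gene_id_from_transcript(transcript_id: str) -> str:
    """Remove the '-RA' suffix to get back to the gene, as described for the iprscan file."""
    suffix = "-RA"
    return transcript_id[:-len(suffix)] if transcript_id.endswith(suffix) else transcript_id


if __name__ == "__main__":
    print(parse_cds_header(">Ha1_00044280-RA Ha1_00044280 Ha1_94590914-94591860_-"))
    print(parse_transposon_header(
        ">RLG_family0_LTR_retrotransposon39667_Ha16_101255621_101274871"))
    print(gene_id_from_transcript("Ha1_00043025-RA"))
```

Splitting the transposon header from the right avoids having to know how many underscores
the element ID itself contains.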