## HA412HO bronze annotation -- FILE_SPECIFICATIONS.txt
##
## Description: This file describes the Helianthus annuus bronze annotation file specifications.
##              The files are listed under the directory in which they are found.
##              Should you have a question or comment, please use the contact listed below.
##
## Contact: Evan Staton (statonse@gmail.com)
##
## Date last modified: May 24, 2016
## Modified by: SES

-- genes/
----------------------------------------------------------------------------------------------
* Ha412v1r1_CDS_v1.0.fasta.gz

  - Description: Nucleotide CDS for each gene
  - Header format: >TranscriptID GeneID Chromosome_start-end_strand
  - Example: >Ha1_00044280-RA Ha1_00044280 Ha1_94590914-94591860_-
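
 As a minimal sketch, a header like the one above could be split into its parts in Python as follows.
 This assumes the coordinate field always has the form Chromosome_start-end_strand, as in the example.
 The function name parse_cds_header is hypothetical and not part of any released tool.

```python
def parse_cds_header(header):
    """Split a CDS/protein FASTA header into its components."""
    transcript_id, gene_id, location = header.lstrip(">").split()
    # The location looks like Ha1_94590914-94591860_-. Split from the right
    # first so the strand (which may itself be "-") is separated cleanly.
    rest, strand = location.rsplit("_", 1)
    chromosome, span = rest.split("_", 1)
    start, end = (int(value) for value in span.split("-"))
    return {
        "transcript_id": transcript_id,
        "gene_id": gene_id,
        "chromosome": chromosome,
        "start": start,
        "end": end,
        "strand": strand,
    }
```

 The same function should work for the protein file, since its headers use the same format.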

* Ha412v1r1_CDS_iprscan_v1.0.tsv.gz
 
  - Description: Functional annotations in tab-delimited format (TSV)
  - Format specification: ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan/4/README.html#3
  - Example (one of many annotations for a single transcript):

HA_00012216-RA	6de45283509e39aeac36aa955b5fa739	476	ProSiteProfiles	PS51375	Pentatricopeptide (PPR) repeat profile.	290	324	10.687	T	12-04-2016	IPR002885	Pentatricopeptide repeat

 This file is where you will find GO terms and all other functional information associated with a gene. The
 annotations are made on transcripts, which are the functional products of the genes. You can easily go back
 to the gene from which a transcript was derived by removing the "-RA" suffix.
 
 Note that these annotations are most easily found in the genome browser where all of this information is linked
 to the relevant database entries. For example, click the gene or transcript in the following link and 
 take a look: http://sunflowergenome.org/jbrowse_current/?data=extdata%2Fbronze&loc=Ha1%3A76989221..76999930&tracks=Genes&highlight=
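
 As an illustration of going from transcripts back to genes, the sketch below collects InterPro accessions
 per gene from lines of the TSV file. It assumes the file is tab-delimited and that the InterPro accession
 sits in the 12th field, as in the example row above. The function names are hypothetical.

```python
from collections import defaultdict

def gene_id_from_transcript(transcript_id, suffix="-RA"):
    """Return the gene ID by dropping the transcript suffix, if present."""
    if transcript_id.endswith(suffix):
        return transcript_id[: -len(suffix)]
    return transcript_id

def interpro_by_gene(lines):
    """Map each gene ID to the set of InterPro accessions on its transcripts."""
    accessions = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Field 12 (index 11) holds the InterPro accession in the example row.
        # Rows without an InterPro match (missing, empty, or "-") are skipped.
        if len(fields) > 11 and fields[11] not in ("", "-"):
            accessions[gene_id_from_transcript(fields[0])].add(fields[11])
    return dict(accessions)

# Hypothetical usage on the compressed file:
# import gzip
# with gzip.open("Ha412v1r1_CDS_iprscan_v1.0.tsv.gz", "rt") as handle:
#     per_gene = interpro_by_gene(handle)
```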

* Ha412v1r1_prot_v1.0.faa.gz

  - Description: Translated CDS, or protein sequence
  - Header format (same as above): >TranscriptID GeneID Chromosome_start-end_strand
  - Example: >Ha1_00044004-RA Ha1_00044004 Ha1_75475361-75475747_-
    
* Ha412v1r1_genes_v1.0.gff3.gz

  - Description: Annotated gene features in GFF3 format 
  - Format specification: http://www.sequenceontology.org/gff3.shtml
  - Example (one gene, excluding declaration lines):

Ha1	maker	gene	72490	73753	.	-	.	ID=Ha1_00043025;Name=maker-Ha1-snap-gene-0.20;Alias=gene1;Dbxref=Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473;Ontology_term=GO:0005215,GO:0006810,GO:0016020
Ha1	maker	mRNA	72490	73753	.	-	.	ID=Ha1_00043025-RA;Parent=Ha1_00043025;Name=maker-Ha1-snap-gene-0.20-mRNA-1;Alias=mRNA1;_AED=0.09;_QI=0|0|0|1|0|0.25|4|0|297;_eAED=0.11;Dbxref=Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473;Ontology_term=GO:0005215,GO:0006810,GO:0016020
Ha1	.	stop_codon	72490	72492	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	exon	72490	72939	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	CDS	72490	72939	.	-	0	ID=CDS1;Parent=Ha1_00043025-RA
Ha1	.	intron	72940	72952	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	exon	72953	73065	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	CDS	72953	73065	.	-	2	ID=CDS1;Parent=Ha1_00043025-RA
Ha1	.	intron	73066	73399	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	exon	73400	73455	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	CDS	73400	73455	.	-	1	ID=CDS1;Parent=Ha1_00043025-RA
Ha1	.	intron	73456	73481	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	exon	73482	73753	.	-	.	Parent=Ha1_00043025-RA
Ha1	maker	CDS	73482	73753	.	-	0	ID=CDS1;Parent=Ha1_00043025-RA
Ha1	.	start_codon	73751	73753	.	-	.	Parent=Ha1_00043025-RA

 Note that the 9th field, the attributes, contains the functional annotations for the gene.
 This information is also available in the simple tab-delimited table described above
 (Ha412v1r1_CDS_iprscan_v1.0.tsv.gz). Importantly, there is an 'Alias' for each gene and mRNA which
 refers to a previous version of the annotations so that all work can be carried forward. The alias
 can be ignored if you are not working with older annotations.
 All of the feature types are defined on the Sequence Ontology website.

 These annotations are viewable in JBrowse: http://sunflowergenome.org/jbrowse_current/?data=extdata/bronze
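
 To pull the functional annotations out of the 9th field, a small parser along these lines may be enough.
 This is only a sketch: it assumes well-formed tag=value pairs separated by ";" with multiple values
 separated by ",", as the GFF3 specification describes. The function name is hypothetical.

```python
from urllib.parse import unquote

def parse_gff3_attributes(field):
    """Turn the 9th GFF3 column into a dict of tag -> list of values."""
    attributes = {}
    for pair in field.strip().split(";"):
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        # GFF3 percent-encodes reserved characters inside values, so each
        # value is decoded after splitting on the "," separator.
        attributes[tag] = [unquote(item) for item in value.split(",")]
    return attributes
```

 For the gene line in the example, attributes["Ontology_term"] would hold the three GO terms and
 attributes["Alias"] the identifier from the previous annotation version.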

* Ha412v1r1_genes_v1.0.gtf.gz

  - Description: Annotated gene features in GTF format 

  This format is nearly identical to GFF3. The differences are the formatting of the attributes and 
  the fact that GTF is only used for describing the coding features. The file is a bit verbose 
  because the functional annotations are associated with each exon for exon-level analyses.

  - Format specification: http://mblab.wustl.edu/GTF22.html
  - Example (just one exon):

Ha1 maker exon 72490 72939 . - . transcript_id "Ha1_00043025-RA"; gene_id "Ha1_00043025"; gene_name "maker-Ha1-snap-gene-0.20"; Name "maker-Ha1-snap-gene-0.20-mRNA-1"; Alias "mRNA1"; _AED "0.09"; _QI "0|0|0|1|0|0.25|4|0|297"; _eAED "0.11"; Dbxref "Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473"; Ontology_term "GO:0005215,GO:0006810,GO:0016020"; Alias "gene1"; Dbxref "Gene3D:G3DSA:1.20.1250.20,InterPro:IPR000109,InterPro:IPR020846,Pfam:PF00854,Phobius:CYTOPLASMIC_DOMAIN,Phobius:NON_CYTOPLASMIC_DOMAIN,Phobius:TRANSMEMBRANE,SUPERFAMILY:SSF103473"; Ontology_term "GO:0005215,GO:0006810,GO:0016020";
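
  Because the gene-level and mRNA-level annotations are both attached to each exon, some keys (Alias,
  Dbxref, Ontology_term) appear twice in the attributes of the example. A sketch of a parser that keeps
  every value is shown here. It assumes each attribute is written as key "value"; and that values
  contain no double quotes. The function name is hypothetical.

```python
import re
from collections import defaultdict

# One attribute: a key, a space, a double-quoted value, then a semicolon.
_GTF_PAIR = re.compile(r'(\S+) "([^"]*)";')

def parse_gtf_attributes(field):
    """Turn the GTF attributes column into a dict of key -> list of values."""
    attributes = defaultdict(list)
    for key, value in _GTF_PAIR.findall(field):
        # Repeated keys are kept in file order rather than overwritten.
        attributes[key].append(value)
    return dict(attributes)
```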

* Ha412v1r1_genes_v1.0.fasta.gz

 - Description: Full-length nucleotide sequence for each gene
 - Header format: >GeneID Chromosome_start-end
 - Example: >Ha10_00000001 Ha10_61357-62306

 The Gene, CDS, and protein files have spaces in the header to allow easier interpretation of analyses
 (e.g., with BLAST).

-- transposons/
----------------------------------------------------------------------------------------------
* Ha412v1r1_transposons_v1.0.gff3.gz

 - Description: Annotated transposons in GFF3 format
 - Format specification: http://www.sequenceontology.org/gff3.shtml
 - Example (one transposon, excluding declaration lines):

Ha1     LTRharvest      repeat_region   62089   64299   .       +       .       ID=repeat_region1
Ha1     LTRharvest      target_site_duplication 62089   62093   .       +       .       Parent=repeat_region1
Ha1     LTRharvest      LTR_retrotransposon     62094   64294   .       +       .       ID=LTR_retrotransposon1;Parent=repeat_region1;family=RLX_singleton_family0;ltr_similarity=93.49;seq_number=8
Ha1     LTRharvest      long_terminal_repeat    62094   62378   .       +       .       Parent=LTR_retrotransposon1
Ha1     LTRdigest       protein_match   63296   63494   7.4e-09 +       .       Parent=LTR_retrotransposon1;name=ARID;reading_frame=2
Ha1     LTRharvest      long_terminal_repeat    64003   64294   .       +       .       Parent=LTR_retrotransposon1
Ha1     LTRharvest      target_site_duplication 64295   64299   .       +       .       Parent=repeat_region1

 This example is an LTR retrotransposon. You can see the annotation information listed in 
 the attributes field (from Tephra), the LTR regions flanked by target site duplications, and any protein 
 matches contained within the element. The entire region is defined as a repeat region. Note that the 
 target site duplications are not part of the element but a feature of the host genome. All of these 
 terms are defined on the Sequence Ontology website.

 The annotations are viewable in JBrowse: http://sunflowergenome.org/jbrowse_current/?data=extdata/bronze

* Ha412v1r1_transposons_v1.0.fasta.gz

 - Description: Full-length nucleotide sequence for each transposon

 Note that there is a defined nomenclature for TE types. Codes such as "DTA" for a Class II hAT 
 transposon or "RLG" for a Gypsy LTR retrotransposon are defined elsewhere 
 (Wicker et al., 2009), not by me.

 - Header format: >transposonID_Chromosome_start_end
 - Example: >DTA_terminal_inverted_repeat_element2540_Ha10_281456270_281457637
 
   This tells us that hAT transposon number 2540 is found on chromosome 10 from 281456270-281457637.
   If the family is known this will be in the header as well.

 - Example: >RLG_family0_LTR_retrotransposon39667_Ha16_101255621_101274871

   This tells us that a Gypsy LTR retrotransposon (element 39667, which is part of family0) 
   is found on chromosome 16 from 101255621-101274871.
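
 Since the transposon ID itself contains underscores, it is easiest to read these headers from the right.
 The sketch below assumes the last three underscore-separated parts are always chromosome, start, and end,
 and that the three-letter type code comes first, as in both examples. The function name is hypothetical.

```python
def parse_transposon_header(header):
    """Split a transposon FASTA header into ID, type code, and coordinates."""
    # Split from the right: the ID may contain any number of underscores,
    # but chromosome, start, and end are always the last three parts.
    element_id, chromosome, start, end = header.lstrip(">").rsplit("_", 3)
    return {
        "element_id": element_id,
        "type_code": element_id.split("_", 1)[0],  # e.g. DTA or RLG
        "chromosome": chromosome,
        "start": int(start),
        "end": int(end),
    }
```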

-- complete/
----------------------------------------------------------------------------------------------

* Ha412v1r1_genes_transposons_v1.0.gff3.gz
 
  - Description: Annotated gene and transposon features in GFF3 format

 The format and specification of this file are described above. This is a file of combined features
 (found in other files) sorted by coordinate for convenience. All features are separated by "###" 
 to make visual inspection easier.
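
 The "###" separators also make it simple to read the file one feature at a time. The sketch below assumes
 tab-delimited feature lines and treats any other line starting with "#" as a declaration or comment.
 The function name is hypothetical.

```python
def iter_feature_blocks(lines):
    """Yield one list of split GFF3 rows per "###"-delimited feature."""
    block = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("###"):
            # End of the current feature: emit it and start a new one.
            if block:
                yield block
                block = []
        elif line and not line.startswith("#"):
            block.append(line.split("\t"))
    # Emit a trailing feature that is not followed by a separator.
    if block:
        yield block

# Hypothetical usage on the compressed file:
# import gzip
# with gzip.open("Ha412v1r1_genes_transposons_v1.0.gff3.gz", "rt") as handle:
#     for block in iter_feature_blocks(handle):
#         print(block[0][2], len(block))
```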

* Ha412v1r1_genes_transposons_v1.0.fasta.gz

  - Description: Full-length gene and transposon sequences

 This is a combination of other files described in this document. The combined file of full-length gene
 and transposon sequences is provided for convenience. 

-- genome/
----------------------------------------------------------------------------------------------
* Ha412v1r1_genome_no_cp-mt-rd_chr-q.fasta.gz

 - Description: Full-length nucleotide sequence for each pseudomolecule
 - Header format: >ChromosomeID
 - Example: >Ha1
 
 You will find sequences for the 17 pseudomolecules (labeled Ha1-Ha17) in this file.
 
* Ha412v1r1_organelles_rDNA.fasta.gz

 - Description: Full-length nucleotide sequence for organelle and rDNA
 - Header format: >organelleID_gi_GInumber
 - Actual headers in file (description):

 >cp_gi_88656873 (chloroplast)
 >mt_gi_571031384 (mitochondria)
 >rDNA_gi_563582565 (rDNA)

 For information on these sequences, look up the GI number in GenBank (or contact Chris Grassa).