Genotype files must be tabular, with the samples as columns and the SNPs as rows. They can also be
zipped or gzipped.
Since there appears to be an unlimited array of different formats for genotype files, we specify here
those that can be imported into AutozygosityMapper without any further manipulation.
In every file, lines starting with the number sign (#) are ignored. In each line, the SNP ID
(Affymetrix ID or, for Illumina files, dbSNP ID) must be directly followed by the genotypes. The
genotypes must be written in one of the following ways:
SNP ID | Sample01 | Sample02 | Sample03 | Sample04 | Sample08 | Sample09 |
SNP_A-1513509 | BB | BB | AB | BB | AB | BB |
SNP_A-1518411 | BB | BB | BB | BB | BB | BB |
SNP_A-1511066 | AB | NoCall | AA | AA | AA | AA |
SNP_A-1517367 | AA | AB | AB | AA | AA | AB |
DBSNP* | Sample01 | Sample02 | Sample03 | Sample05 | Sample06 |
rs10000010 | 3 | 0 | 3 | 2 | 1 |
rs10000023 | 3 | 3 | 2 | 1 | 2 |
rs10000030 | 3 | 3 | 0 | 2 | 3 |
rs1000007 | 0 | 3 | 1 | 0 | 0 |
rs10000092 | 3 | 0 | 1 | 3 | 0 |
rs10000121 | 1 | 1 | 1 | 2 | 2 |
Instead of 1/2/3/0, the character format (AA, AB, BB, --) can also be used.
Additionally, real genotypes are allowed. Please note that this will drastically reduce the upload speed.
*) As dbSNP IDs are largely specific to humans, the column 'SNP NAME' is used instead for other species.
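The rules above can be checked before uploading. The following is a minimal sketch, not part of AutozygosityMapper: it assumes the file is tab-separated, that the first non-comment line is the header, and that the only accepted genotype tokens are the ones shown above (AA, AB, BB, NoCall, --, and 0 to 3). The sample file contents and the variable name `bad_lines` are hypothetical.

```shell
#!/bin/sh
# Build a small, hypothetical genotype file (tab-separated is an assumption).
geno=$(mktemp)
{
  printf 'SNP ID\tSample01\tSample02\n'
  printf '# comment line, ignored\n'
  printf 'SNP_A-1513509\tBB\tAB\n'
  printf 'SNP_A-1511066\tAB\tNoCall\n'
  printf 'SNP_A-1517367\tAA\tXY\n'
} > "$geno"

# Report lines whose column count differs from the header or whose
# genotype fields are not one of the tokens listed in this section.
bad_lines=$(awk -F'\t' '
  /^#/ { next }                                   # comment lines are ignored
  !seen_header { seen_header = 1; ncol = NF; next }
  {
    if (NF != ncol) {
      print NR ": expected " ncol " columns, found " NF
      next
    }
    for (i = 2; i <= NF; i++)
      if ($i !~ /^(AA|AB|BB|NoCall|--|[0-3])$/)
        print NR ": unexpected genotype " $i " in column " i
  }
' "$geno")

echo "${bad_lines:-no problems found}"
rm -f "$geno"
```

For the sample data in the sketch, the only reported problem is the `XY` entry on line 5.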
The VCF file must have the following columns:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 (...)
chr1 14930 . A G . . . GT:DP 1/1:31 0/1:30 0/0:23
The content of the columns 'ID', 'QUAL', 'FILTER', and 'INFO' is ignored. The FORMAT attribute is
used to determine which part of each sample's entry is the genotype and which is the coverage.
Please note that the DP flag must be included in the FORMAT string (not only in INFO!) unless you
set the minimum coverage value in the upload interface to 0. Without the DP flag in FORMAT, it is
impossible to exclude genotypes with low coverage, because the DP information in INFO aggregates
the coverage over all samples!
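To see whether a VCF file carries per-sample coverage, one can inspect the FORMAT column (the ninth column) of each record. The sketch below is only an illustration under the assumption that the file is an uncompressed, tab-separated VCF; the sample records and the variable name `missing_dp` are made up for the example.

```shell
#!/bin/sh
# Build a tiny, hypothetical VCF: the first record has DP in FORMAT,
# the second has DP only in INFO.
vcf=$(mktemp)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\tSample2\n'
  printf 'chr1\t14930\t.\tA\tG\t.\t.\t.\tGT:DP\t1/1:31\t0/1:30\n'
  printf 'chr1\t15211\t.\tT\tG\t.\t.\tDP=61\tGT\t1/1\t0/1\n'
} > "$vcf"

# List every record whose FORMAT string does not contain the DP key.
missing_dp=$(awk -F'\t' '
  /^#/ { next }                       # skip header and meta lines
  {
    n = split($9, keys, ":")          # FORMAT keys are colon-separated
    found = 0
    for (i = 1; i <= n; i++)
      if (keys[i] == "DP") found = 1
    if (!found) print $1 ":" $2
  }
' "$vcf")

echo "Records without DP in FORMAT: ${missing_dp:-none}"
rm -f "$vcf"
```

If any records are listed, the coverage filter cannot be applied to them, so the minimum coverage would have to be set to 0 as described above.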
The file must be sorted by chromosome.
Sites at which the genotype is uncertain (two alt alleles) are skipped.
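Both requirements can be checked roughly before uploading. The sketch below rests on two assumptions that the text does not spell out: "sorted by chromosome" is taken to mean that all records of a chromosome form one contiguous block, and "two alt alleles" is taken to mean a comma-separated ALT column. The sample records and the variable name `report` are hypothetical.

```shell
#!/bin/sh
# Build a tiny, hypothetical VCF with one multi-allelic site and
# one chromosome (chr1) that reappears after chr2.
vcf=$(mktemp)
{
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSample1\n'
  printf 'chr1\t100\t.\tA\tG\t.\t.\t.\tGT:DP\t0/1:30\n'
  printf 'chr1\t200\t.\tA\tG,T\t.\t.\t.\tGT:DP\t1/2:25\n'
  printf 'chr2\t50\t.\tC\tT\t.\t.\t.\tGT:DP\t1/1:40\n'
  printf 'chr1\t300\t.\tG\tA\t.\t.\t.\tGT:DP\t0/1:22\n'
} > "$vcf"

report=$(awk -F'\t' '
  /^#/ { next }
  $5 ~ /,/ { multi++ }                # assumption: comma in ALT = two alt alleles
  $1 != prev {
    # A chromosome seen earlier that starts a new block breaks contiguity.
    if ($1 in seen) print "chromosome " $1 " reappears at line " NR
    seen[$1] = 1
    prev = $1
  }
  END { print "multi-allelic sites: " (multi + 0) }
' "$vcf")

echo "$report"
rm -f "$vcf"
```

A file that passes this check reports no reappearing chromosomes; the count of multi-allelic sites simply indicates how many sites are likely to be skipped under the assumption stated above.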
Here is a sample file.
(Cases: Sample1, Sample2; controls Sample3, Sample4 - should yield a hit on chr6.)
# all BAM files in the same directory
samtools mpileup -D -gf /path/to/genome.fa *.bam | bcftools view -c -g - > filename.vcf
# BAM files in different directories
samtools mpileup -D -gf /path/to/genome.fa /path/to/bam1.bam /path/to/bam2.bam | bcftools view -c -g - > filename.vcf
# reference genome: /path/to/genome.fa
# output file: filename.vcf
GATK offers a similar option.