
Input file formats


SNP genotype files

Genotype files must be tabular, with the samples as columns and the SNPs as rows. They can also be zipped or gzipped.
Because genotype files come in a seemingly unlimited variety of formats, we specify here those that can be imported into AutozygosityMapper without any further manipulation.
In every file, lines starting with the number sign (#) will be ignored. In each line, the SNP ID (Affymetrix ID or, with Illumina files, dbSNP ID) must be directly followed by the genotypes. The genotypes must be written in one of the following ways:

Affymetrix (example [Chip: Mapping50K_Hind240])

SNP ID Sample01 Sample02 Sample03 Sample04 Sample08 Sample09
SNP_A-1511066 AB NoCall AA AA AA AA

Instead of AA/AB/BB/NoCall, the 'number format' (0, 1, 2, -1) can also be used.
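
To illustrate the layout described above, here is a minimal Python sketch that reads an Affymetrix-style table, skips lines starting with '#', and rewrites the calls in the number format. It is not part of the tool. The function names are hypothetical, the columns are assumed to be tab-separated, only gzipped (not zipped) files are handled, and the pairing AA=0, AB=1, BB=2, NoCall=-1 is an assumption read off the order in which the two notations are listed.

```python
import gzip

# Assumed pairing, inferred from the order AA/AB/BB/NoCall vs. 0,1,2,-1.
CALL_TO_NUMBER = {"AA": "0", "AB": "1", "BB": "2", "NoCall": "-1"}


def open_genotype_file(path):
    """Open a plain or gzipped genotype file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def parse_affymetrix(lines):
    """Return (sample names, {SNP ID: list of calls}) from text lines."""
    samples, calls = None, {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # comment lines are ignored
        fields = line.split("\t")  # assumption: tab-separated columns
        if samples is None:
            samples = fields[1:]  # first data line is the header
            continue
        snp_id, genotypes = fields[0], fields[1:]
        if len(genotypes) != len(samples):
            raise ValueError(f"{snp_id}: expected {len(samples)} genotypes")
        calls[snp_id] = genotypes
    return samples, calls


def to_number_format(genotypes):
    """Translate AA/AB/BB/NoCall into the assumed number codes."""
    return [CALL_TO_NUMBER[g] for g in genotypes]


example = [
    "# exported from genotyping software\n",
    "SNP ID\tSample01\tSample02\tSample03\n",
    "SNP_A-1511066\tAB\tNoCall\tAA\n",
]
samples, calls = parse_affymetrix(example)
numeric = to_number_format(calls["SNP_A-1511066"])
print(samples, numeric)
```

For a file on disk, the lines could be passed in as `parse_affymetrix(open_genotype_file(path))`.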

The following columns will be ignored and do not have to be removed from the file:


Illumina (example)

DBSNP* Sample01 Sample02 Sample03 Sample05 Sample06
rs10000010 3 0 3 2 1
rs10000023 3 3 2 1 2
rs10000030 3 3 0 2 3
rs1000007 0 3 1 0 0
rs10000092 3 0 1 3 0
rs10000121 1 1 1 2 2

Instead of 1/2/3/0, the character format (AA, AB, BB, --) can also be used.
Additionally, real genotypes are allowed. Please note that this will drastically reduce the upload speed.
*) As dbSNP IDs are largely specific to human, the column 'SNP NAME' is used instead for other species.
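
Before uploading, it can help to tally the calls per sample, for instance to spot a sample with many missing genotypes. The following Python sketch does this for the table above. It is only an illustration: the function name is hypothetical, the columns are assumed to be whitespace-separated, and the pairing 1=AA, 2=AB, 3=BB, 0=-- is an assumption read off the order in which the two notations are listed.

```python
from collections import Counter

# Assumed pairing, inferred from the order 1/2/3/0 vs. AA, AB, BB, --.
NUMBER_TO_CHAR = {"1": "AA", "2": "AB", "3": "BB", "0": "--"}
VALID_CALLS = set(NUMBER_TO_CHAR.values())


def summarise_calls(lines):
    """Return {sample name: Counter of character-format calls}."""
    samples, tallies = None, {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # comment lines are ignored
        fields = line.split()
        if samples is None:
            samples = fields[1:]  # header: ID column, then sample names
            tallies = {name: Counter() for name in samples}
            continue
        snp_id, codes = fields[0], fields[1:]
        if len(codes) != len(samples):
            raise ValueError(f"{snp_id}: expected {len(samples)} genotypes")
        for name, code in zip(samples, codes):
            call = NUMBER_TO_CHAR.get(code, code)  # accept either notation
            if call not in VALID_CALLS:
                raise ValueError(f"{snp_id}: unexpected genotype {code!r}")
            tallies[name][call] += 1
    return tallies


table = """DBSNP Sample01 Sample02 Sample03 Sample05 Sample06
rs10000010 3 0 3 2 1
rs10000023 3 3 2 1 2
rs10000030 3 3 0 2 3
rs1000007 0 3 1 0 0
rs10000092 3 0 1 3 0
rs10000121 1 1 1 2 2
"""
tallies = summarise_calls(table.splitlines())
for name, counts in tallies.items():
    print(name, dict(counts))
```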

VCF file (Next Generation Sequencing genotypes)

The VCF file must have the following columns:

#CHROM POS    ID  REF  ALT   QUAL  FILTER  INFO  FORMAT  Sample1  Sample2  Sample3 (...) 
chr1   14930  .   A    G     .     .       .     GT:DP   1/1:31   0/1:30   0/0:23

The content of the columns 'ID', 'QUAL', 'FILTER' and 'INFO' is ignored. The FORMAT column is used to determine which part of each sample's entry is the genotype and which is the coverage. Please note that the DP tag must be included in the FORMAT string (not only in INFO), unless you set the minimum coverage value in the upload interface to 0. Without the DP tag in FORMAT it is impossible to exclude genotypes with low coverage, because the DP value in INFO aggregates the coverage over all samples.
The file must be sorted by chromosome.
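
A quick way to check the ordering before upload is sketched below in Python. It assumes that "sorted by chromosome" means that all records of one chromosome form a single contiguous block; the exact order the server expects is not stated here, and the function name is hypothetical.

```python
def unsorted_chromosomes(lines):
    """Return chromosomes that reappear after another chromosome has started."""
    seen, current, offenders = set(), None, []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip header and comment lines
        chrom = line.split("\t", 1)[0]
        if chrom != current:
            if chrom in seen and chrom not in offenders:
                offenders.append(chrom)  # block for this chromosome was interrupted
            seen.add(chrom)
            current = chrom
    return offenders


sorted_records = ["#CHROM\tPOS\n", "chr1\t100\n", "chr1\t200\n", "chr2\t50\n"]
mixed_records = ["chr1\t100\n", "chr2\t50\n", "chr1\t200\n"]
print(unsorted_chromosomes(sorted_records))
print(unsorted_chromosomes(mixed_records))
```

An empty list means every chromosome occurs in one block; any name in the list marks a chromosome whose records are scattered.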

Sites at which the genotype is uncertain (two alt alleles) are skipped.
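
The handling of FORMAT, coverage and skipped sites described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the server's actual code: the function name and the `min_coverage` default are hypothetical, a site with "two alt alleles" is taken to be one whose ALT column contains a comma, and genotypes below the coverage threshold are replaced by "./." here merely to mark them as excluded.

```python
def parse_vcf_record(line, min_coverage=10):
    """Return (chrom, pos, genotypes per sample), or None for a skipped site."""
    fields = line.rstrip("\r\n").split("\t")
    chrom, pos, alt, fmt = fields[0], int(fields[1]), fields[4], fields[8]
    if "," in alt:
        return None  # assumption: more than one ALT allele means "uncertain"
    keys = fmt.split(":")
    gt_idx = keys.index("GT")
    dp_idx = keys.index("DP") if "DP" in keys else None
    if dp_idx is None and min_coverage > 0:
        raise ValueError("DP missing from FORMAT; set the minimum coverage to 0")
    genotypes = []
    for entry in fields[9:]:
        parts = entry.split(":")
        genotype = parts[gt_idx]
        if dp_idx is not None and min_coverage > 0:
            depth = parts[dp_idx] if dp_idx < len(parts) else "."
            if not depth.isdigit() or int(depth) < min_coverage:
                genotype = "./."  # per-sample coverage too low or unknown
        genotypes.append(genotype)
    return chrom, pos, genotypes


record = "chr1\t14930\t.\tA\tG\t.\t.\t.\tGT:DP\t1/1:31\t0/1:30\t0/0:23"
multi_alt = "chr1\t15000\t.\tA\tG,T\t.\t.\t.\tGT:DP\t1/2:40\t0/1:35\t0/0:28"
print(parse_vcf_record(record, min_coverage=25))
print(parse_vcf_record(multi_alt, min_coverage=25))
```

With a threshold of 25, the third sample of the example record (coverage 23) is excluded, while the first two are kept. The per-sample DP value is what makes this filtering possible.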

Here is a sample file.
(Cases: Sample1, Sample2; controls Sample3, Sample4 - should yield a hit on chr6.)

You can generate such a file from your aligned NGS data with SAMtools like this:
# all BAM files in the same directory
samtools mpileup -D -gf /path/to/genome.fa *.bam | bcftools view -c -g - > filename.vcf
# BAM files in different directories 
samtools mpileup -D -gf /path/to/genome.fa /path/to/bam1.bam /path/to/bam2.bam | bcftools view -c -g - > filename.vcf
# reference genome: /path/to/genome.fa
# output file: filename.vcf
GATK offers a similar option.
Please read the manuals of SAMtools / bcftools to find the appropriate settings for your data.