DNA Raw Data

Understanding Your DNA Raw Data
All of the major genealogical DNA testing companies allow you to download your DNA raw data so you can analyze it or upload it to other companies' websites. If you open the raw data file, you will see a long list of letters and numbers that can look complicated. The goal of this wiki page is to explain what all of that information means.

General Format
A raw data file will typically have four or five columns. The first column records the "rsid." This is an identifier for the SNP being tested. It is usually rs followed by a number, and each SNP has its own unique identifier, although some companies also use their own internal identifiers for certain SNPs. The second column gives the chromosome that the SNP is located on, as a number from 1 to 26. The X chromosome can be referred to as X or 23, the Y chromosome as Y or 24, the pseudoautosomal region shared by the X and Y chromosomes as PAR or 25, and the mitochondrial DNA as MT or 26.

The third column is a number that records where on the chromosome the SNP is located. The first nitrogenous base on a chromosome is always 1, the next is 2, the next is 3, and so on up into the millions. The higher the number, the farther along the chromosome the SNP is located. Since not all base pairs are tested, you may notice large jumps in the numbers, but the numbers keep increasing until the end of the chromosome is reached. The fourth and fifth columns record the two values you have at each SNP. Some companies combine both values in one column (the four-column template) and some put each value in its own column (the five-column template).
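
The column layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions, not any company's official format: the names SnpRecord and parse_raw_data are hypothetical, the rsids in the usage example are made up, and the code assumes that columns are separated by tabs or commas, that comment lines start with #, and that any row whose third column is not a number is a header row.

```python
from typing import Iterable, Iterator, NamedTuple


class SnpRecord(NamedTuple):
    """One row of a raw data file (hypothetical structure for illustration)."""
    rsid: str
    chromosome: str  # kept as text because files may use "X" or "23", "MT" or "26"
    position: int
    allele1: str
    allele2: str


def parse_raw_data(lines: Iterable[str]) -> Iterator[SnpRecord]:
    """Yield one SnpRecord per data row, accepting four- or five-column rows."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        # Assumption: fields are separated by tabs or commas and may be quoted.
        fields = [f.strip().strip('"') for f in line.replace(",", "\t").split("\t")]
        if len(fields) == 4:
            # Four-column template: both values share the last column.
            rsid, chrom, pos, genotype = fields
            allele1, allele2 = genotype[:1], genotype[1:2]
        elif len(fields) == 5:
            # Five-column template: each value has its own column.
            rsid, chrom, pos, allele1, allele2 = fields
        else:
            continue
        # Rows whose position is not a number are treated as header rows.
        if not pos.isdigit():
            continue
        yield SnpRecord(rsid, chrom, int(pos), allele1, allele2)


# Usage with made-up rows, one in each template:
sample = [
    "# example comment line",
    "rsid\tchromosome\tposition\tgenotype",
    "rs0000001\t1\t1000\tAG",
    "rs0000002,1,2500,C,C",
]
records = list(parse_raw_data(sample))
```

Keeping the chromosome as text avoids having to decide up front whether a file uses letters or numbers for the X, Y, and mitochondrial entries.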

Nitrogenous Bases (ATCG)
Chromosomes look like long, twisted ladders. The longest chromosome (1) has over 249 million rungs and the smallest (21) has over 48 million. In total there are over 3 billion of these rungs in human DNA. Each rung in the ladder contains a pair of nitrogenous bases. There are four bases: Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). A is always paired with T, and C is always paired with G. Although two bases are always found together at each spot, only one of them needs to be recorded, because the other can be worked out from the pairing rule. By convention, one side of the ladder is called the + (forward) strand and the other side is called the - (reverse) strand. Sometimes in an A T pair the A will be on the + strand and the T on the - strand, other times it will be the reverse, and the same is true for C and G pairs. For simplicity, DNA companies typically record only the value on a person's + strand at each spot they test.
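
Because A always pairs with T and C always pairs with G, the bases on the opposite side of each rung can be worked out from the reported ones. The short Python sketch below illustrates this. The name opposite_strand is hypothetical, and any character other than A, T, C, or G (such as a no-call marker) is assumed to pass through unchanged.

```python
# Pairing rule for the four nitrogenous bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def opposite_strand(genotype: str) -> str:
    """Return the bases found across the rung from each reported base."""
    # Characters that are not A, T, C, or G are left as they are.
    return "".join(COMPLEMENT.get(base, base) for base in genotype)


# Usage: a person reported as A G on one strand has T C on the other.
flipped = opposite_strand("AG")
```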

Single-Nucleotide Polymorphisms (SNPs)
All human beings are about 99.9% identical in their genetic makeup, meaning that out of the roughly 3 billion base pairs we all have, about 99.9% are the same in all humans. The places where a variation can occur are called SNPs, which stands for single-nucleotide polymorphisms. SNPs are what give DNA testing its genealogical value. When two individuals have enough matching SNPs in a row, this becomes a matching segment. The more matching SNPs there are, the bigger the segment is. If a segment is big enough (bigger than 15 centimorgans, or cM), then the segment is almost certainly identical by descent (IBD), which means the two individuals share that segment because they both descend from a common ancestor who passed on that segment of DNA to both of them. The more matching segments there are and the bigger they are, the more closely two test takers are probably related. By testing a sample of a person's SNPs and then comparing them to everyone else in the database, it is possible to identify a person's genetic relatives. Most major companies test roughly 500,000 to 600,000 SNPs.

In theory each SNP can be any one of the four nitrogenous bases (A, C, G, or T), but in practice only two are found at any specific spot the vast majority of the time. There is usually a major allele (the more common value) and a minor allele (the less common value) that is present in at least 5% of test takers. In autosomal DNA, each person has two nitrogenous bases at each spot, one inherited from their mother and one inherited from their father. This means that at each SNP tested a person can have one of three combinations: two copies of the major allele (homozygous), one copy of the major allele and one copy of the minor allele (heterozygous), or two copies of the minor allele (also homozygous). When two people share at least one value at a SNP, it is a half match; if both values match, it is a full match; and if neither value matches, it is no match. Since there are only three possible combinations at each spot, many people will be either a full or half match at any given SNP by coincidence even though they are not related. In fact, somebody who is heterozygous will be at least a half match to everybody on earth at that spot. This is why hundreds to thousands of SNPs in a row must match before you can be confident that a matching segment is identical by descent and not just a coincidence.
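
The full, half, and no match rules can be sketched in Python as follows. This is an illustration only, not how any particular company compares kits. The names match_type and longest_matching_run are hypothetical, each genotype is assumed to be a two-letter string such as "AG" with no missing values, and both kits are assumed to list the same SNPs in the same order.

```python
from typing import Sequence


def match_type(genotype_a: str, genotype_b: str) -> str:
    """Classify one SNP as a 'full', 'half', or 'none' match."""
    # Sort the letters so that "AG" and "GA" are treated as the same genotype.
    a, b = sorted(genotype_a), sorted(genotype_b)
    if a == b:
        return "full"
    if set(a) & set(b):
        return "half"
    return "none"


def longest_matching_run(kit_a: Sequence[str], kit_b: Sequence[str]) -> int:
    """Count the longest run of consecutive SNPs that are at least a half match."""
    longest = current = 0
    for genotype_a, genotype_b in zip(kit_a, kit_b):
        if match_type(genotype_a, genotype_b) == "none":
            current = 0  # a non-matching SNP ends the run
        else:
            current += 1
            longest = max(longest, current)
    return longest


# Usage with made-up genotypes: half, half, none, full, half.
kit_a = ["AA", "AG", "CC", "GG", "TT"]
kit_b = ["AG", "GG", "TT", "GG", "CT"]
run_length = longest_matching_run(kit_a, kit_b)
```

Note that a heterozygous genotype such as "AG" never returns "none" against "AA", "AG", or "GG", which is why short runs of matching SNPs carry little weight on their own.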

Phasing
The order in which the two values at each SNP are listed within your raw data file is arbitrary, and it is impossible to know which value came from which parent without comparing the raw data to another relative. If both values are the same (A A for example), then one A came from mom and one from dad. If your DNA is heterozygous at a certain SNP (C G for example), the only way to know which parent gave you the C and which gave you the G is by comparing against other relatives. Sorting out the paternal and maternal values is called phasing.

For example, if you compared your DNA against your mom's and, at the spot where you have C G, your mom has G G, then you must have inherited the G from your mom and the C from your dad. If you are C G and your mom is also C G, then it is still unclear which value came from which parent, and comparing against your dad or another relative would be necessary to figure it out.

Programs such as GedMatch.com offer the ability to phase your DNA by comparing it against one or both parents. Using phased kits reduces the number of false segments identified between you and a match and is a valuable tool for people interested in small DNA segments. However, in cases where you and the parent being compared against are both heterozygous (like C G), the value becomes a no call and is discarded from the comparison. For this reason, comparing your DNA against both parents produces better results than comparing against just one. For instance, if you and your mom are both C G at a spot but your father is C C, it can be concluded that you inherited the C from your father and the G from your mother.
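
The reasoning above can be sketched for a single SNP in Python. This is a simplified illustration and not the method used by GedMatch.com or any other tool. The name phase_snp is hypothetical, genotypes are assumed to be two-letter strings with no missing values, and the function returns None where the text above describes a no call.

```python
from typing import Optional, Tuple


def phase_snp(
    child: str,
    mother: Optional[str] = None,
    father: Optional[str] = None,
) -> Optional[Tuple[str, str]]:
    """Return (maternal value, paternal value), or None if it cannot be determined."""
    a, b = child[0], child[1]
    if a == b:
        # Homozygous: one copy came from each parent, so no comparison is needed.
        return (a, a)
    if mother is not None:
        shared = {a, b} & set(mother)
        if len(shared) == 1:
            # The mother could only have passed on this one value.
            maternal = shared.pop()
            return (maternal, b if maternal == a else a)
    if father is not None:
        shared = {a, b} & set(father)
        if len(shared) == 1:
            # The father could only have passed on this one value.
            paternal = shared.pop()
            return (a if paternal == b else b, paternal)
    # Every available parent could have passed on either value: a no call.
    return None


# Usage, following the examples in the text:
from_mother_only = phase_snp("CG", mother="GG")            # G from mom, C from dad
still_unclear = phase_snp("CG", mother="CG")               # no call
with_both = phase_snp("CG", mother="CG", father="CC")      # G from mom, C from dad
```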