Sunday, December 2, 2012

D-statistics on ADMIXTURE components

I have implemented the method of D-statistics as an R function. This will allow you to take your raw genotype data and calculate various D-statistics of the form:

D(Pop1, YOU; Pop3, Outgroup)

Please read the original post for details on how to use this tool.
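
To make the form of the statistic concrete, here is a minimal Python sketch (not the R function itself) of how a D-statistic can be computed from allele frequencies, using the frequency-based definition of Patterson et al. (2012). The function name `d_statistic` and its list-based inputs are assumptions made for illustration; for a single individual such as YOU, the "frequency" at a SNP is 0, 0.5 or 1.

```python
def d_statistic(p1, p2, p3, p4):
    """D(Pop1, Pop2; Pop3, Pop4) from per-SNP frequencies of the same allele.

    Each argument is a list with one allele frequency per SNP, all in the
    same SNP order. Hypothetical helper for illustration only.
    """
    numerator = 0.0
    denominator = 0.0
    for w, x, y, z in zip(p1, p2, p3, p4):
        numerator += (w - x) * (y - z)
        denominator += (w + x - 2 * w * x) * (y + z - 2 * y * z)
    if denominator == 0:
        raise ValueError("no informative SNPs")
    return numerator / denominator


# Toy frequencies (made up) for D(Pop1, YOU; Pop3, Outgroup)
pop1 = [0.2, 0.8, 0.5]
you = [0.0, 1.0, 0.5]
pop3 = [0.1, 0.9, 0.4]
outgroup = [0.0, 0.0, 1.0]
print(d_statistic(pop1, you, pop3, outgroup))
```

Under this sign convention, a negative value suggests that YOU share more alleles with Pop3 than Pop1 does. The sketch does not estimate a standard error, so it says nothing about statistical significance.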

Friday, November 30, 2012

Geno 2.0 patch for DIYDodecad

(See important update at the end of this post)

People who have tested with the Genographic Project's Geno 2.0 test can now use the DIYDodecad tool with their data. The raw data download from this test has a slightly different format from the ones from 23andMe and Family Finder, so it is necessary to convert your data into a format that DIYDodecad can interpret.

So, after you have downloaded and extracted the DIYDodecad software as per its instructions, you should also download a couple of extra files into your working directory; these files are included in this patch:

  • standardize.r which replaces the standardize.r in the DIYDodecad software bundle, and allows you to convert your Geno 2.0 formatted data
  • hgdp.base.txt which includes additional information about SNP markers that is not found in your Geno 2.0 raw data download, and which is necessary to complete the conversion process.
Once these two files have been extracted into your working directory, the process of using DIYDodecad is exactly the same as for any other user of the software.

The only difference is that at the step where you convert your data using the standardize command (see DIYDodecad README file), you will use the command:


standardize('johndoe.csv', company='geno2')

where johndoe.csv is your unzipped raw data download. This will write a genotype.txt file in the working directory, and you can proceed the rest of the way as per the instructions.

You can use all ancestry calculators released by the Project (or indeed other projects); the most recent one is globe13.

You should be aware that, because the Geno 2.0 test includes a smaller number of SNPs, and because globe13 and other calculators were developed using the common SNP set of 23andMe and Family Finder, the analysis using globe13 will only include ~34 thousand SNPs and will be "noisier" than usual. In the future, I might develop new calculators that make use of the SNP set of the Geno 2.0 test itself.

PS: Feel free to post a comment below if you experienced any difficulty converting your data; also thanks to CeCe Moore for graciously sharing a raw data file with me, which allowed me to build this converter.

UPDATE:

Apparently, the data format has been changed for some Geno 2.0 data downloads.
If your data includes a [Header] ... [Data] preamble followed by a list of 5 comma-separated values, you can ignore this update.
If it includes a header "SNP,Chr,Allele1,Allele2" followed by a list of 4 comma-separated values, you should follow the instructions as above, but use company='geno2new' instead.
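
As a convenience, the check described in this update can be sketched in a few lines of Python. This is only an illustration: the function name `guess_geno2_company` is hypothetical, and it assumes that the [Header] layout is the one handled by company='geno2' and that the header line is the first non-empty line of the file.

```python
def guess_geno2_company(path):
    """Guess which 'company' value to pass to standardize() for a Geno 2.0 file.

    Hypothetical helper: inspects only the first non-empty line of the file.
    """
    # utf-8-sig drops a byte-order mark if the download starts with one
    with open(path, encoding="utf-8-sig", errors="replace") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith("[Header]"):
                return "geno2"  # assumed: the older layout
            if line.replace(" ", "") == "SNP,Chr,Allele1,Allele2":
                return "geno2new"
            break
    raise ValueError("unrecognized Geno 2.0 layout")
```

The returned string would then be used as the company argument of the standardize command shown above.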

Wednesday, October 31, 2012

'globe13' participant results

Project participant results for the globe13 calculator can be found in the spreadsheet. Population median results and Fst divergences are also included.

Below, you can see the first two dimensions of an MDS plot of the 13 components:

A neighbor-joining tree of the 13 components based on the Fst divergences:
I have also created a TreeMix plot using Palaeo_African as an outgroup, and allowing as many as 5 migration edges:
The actual tree is:


((West_African:0.00448794,(East_African:0.00506576,(((((East_Asian:0.0173284,Siberian:0.00732773):0.0027852,(Amerindian:0.026174,Arctic:0.0118342):0.00742092):0.0114738,Australasian:0.0488974):0.00266559,South_Asian:0.00734044):0.008089,(Southwest_Asian:0.00541405,((West_Asian:0.00620657,North_European:0.00657599):0.00311587,Mediterranean:0.00798949):0.00650328):0.0118925):0.0299627):0.00597674):0.00671186,Palaeo_African:0.0215931);
0.0640319 NA NA NA Palaeo_African:0.0215931 Australasian:0.0488974
0.270468 NA NA NA Australasian:0.0488974 East_Asian:0.0173284
0.185213 NA NA NA South_Asian:0.00734044 ((West_Asian:0.00620657,North_European:0.00657599):0.00311587,Mediterranean:0.00798949):0.00650328
0.129883 NA NA NA North_European:0.00657599 Amerindian:0.026174
0.138757 NA NA NA Arctic:0.0118342 (West_Asian:0.00620657,North_European:0.00657599):0.00311587
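
The lines after the tree are the migration edges. A minimal Python sketch for reading them is given below; it assumes the usual TreeMix layout in which the first field is the edge weight, the next three are jackknife fields (NA here), and the last two are the subtrees below the origin and the destination of the edge. The function names are hypothetical.

```python
import re


def leaf_names(subtree):
    """Return the population labels found in a Newick subtree string."""
    return re.findall(r"([A-Za-z_]+):", subtree)


def parse_migration_edges(lines):
    """Parse migration edge lines into (weight, source leaves, destination leaves)."""
    edges = []
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skips the Newick tree line and blank lines
        weight = float(fields[0])
        edges.append((weight, leaf_names(fields[4]), leaf_names(fields[5])))
    return edges


example = [
    "0.0640319 NA NA NA Palaeo_African:0.0215931 Australasian:0.0488974",
    "0.129883 NA NA NA North_European:0.00657599 Amerindian:0.026174",
]
for weight, source, destination in parse_migration_edges(example):
    print(f"{weight:.1%} from {source} to {destination}")
```

Read this way, the second example line is an edge of weight about 13% from the North_European branch to the Amerindian branch; the direction depends on the assumed column order.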

Monday, October 29, 2012

'globe13' calculator

The globe13 calculator is based on the K=13 analysis. It includes the following components:


  • Siberian
  • Amerindian
  • West_African
  • Palaeo_African
  • Southwest_Asian
  • East_Asian
  • Mediterranean
  • Australasian
  • Arctic
  • West_Asian
  • North_European
  • South_Asian
  • East_African

Fst divergences between ancestral components can be found here.

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'globe13' instead of 'dv3' in these instructions. You can consult the spreadsheet for proportions of the 13 components in different world populations.

Terms of use: 'globe13', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Tuesday, October 23, 2012

'globe10' calculator

As part of the on-going analysis of the world dataset, I am releasing the 'globe10' calculator, which is based on the K=10 analysis. This calculator includes the following ancestral components:
  • Amerindian
  • West_Asian
  • Australasian
  • Palaeo_African
  • Neo_African
  • Siberian
  • Southern
  • East_Asian
  • Atlantic_Baltic
  • South_Asian
The names may be the same as the ones from previous calculators released by the Project, but you should always consult the spreadsheet to see how they might differ. In this case, inclusion of Amerindian, Australasian populations, African hunter-gatherers, dealing with the Paniya issue, and inclusion of data of Schlebusch et al. (2012) and Pagani et al. (2012) have all combined to change components in subtle ways, although their modalities remain largely unchanged, and hence so do the names.

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'globe10' instead of 'dv3' in these instructions. You can consult the spreadsheet for proportions of the 10 components in different world populations.

Terms of use: 'globe10', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Friday, October 19, 2012

'globe4' calculator

Patterson et al. (2012) recently published evidence for admixture in northern Europeans between a population resembling modern Sardinians (and the Neolithic Tyrolean Iceman, whose genome was published earlier this year), and, surprisingly, Native Americans. The authors attribute the Amerindian-like ancestry element to a North Eurasian population that spawned Native Americans, and which also contributed ancestry to northern Europeans. They propose two possibilities for the origin of this admixture: (i) the Mesolithic Europeans resembled Amerindians, or (ii) there was an influx of Amerindian-like populations from the east during late prehistory. A palimpsest of these two processes may explain parts of the observed signal of admixture.

In a recent K=4 admixture experiment, I demonstrated that ADMIXTURE software produces an Amerindian ancestral component that closely tracks the signal of admixture using the D-statistic test. I have decided to make this test available for download and use with DIYDodecad.

The test has four ancestral populations:
  • European
  • Asian
  • African
  • Amerindian
It is important to remember that some of these components track different aspects of ancestry that are better resolved at higher resolution. There are also populations that "don't fit well" in this 4-partite scheme (e.g., certain African or Australasian populations).

For example, the Amerindian component of this test may indicate (i) real recent Native American ancestry, (ii) East Eurasian ancestry found in Siberia and East Asia, (iii) the common signal of admixture differentiating most European groups from Sardinians and Near Eastern Caucasoid groups. Similarly, the Asian component may indicate Australasian, South Asian, or East Eurasian ancestry. And the European component tracks the ancestry of individuals from West Eurasia in general, although it reaches its maximum in Sardinians.

This test may, however, be useful to Old World individuals who want to get an idea about the signal of admixture discovered by Patterson et al., so I decided to make it available. For individuals who don't suspect recent Amerindian or Siberian/East Asian ancestry, and who don't belong to populations with recent such ancestry, the Amerindian component will most likely represent the aforementioned signal.

You need to extract the contents of the RAR file to the working directory of DIYDodecad. You use it by following exactly the instructions of the DIYDodecad README, but always type 'globe4' instead of 'dv3' in these instructions. You can consult the spreadsheet for proportions of the 4 components in different world populations.

Terms of use: 'globe4', including all files in the downloaded RAR file, is free for non-commercial personal use. Commercial uses are forbidden. Contact me for non-personal uses of the calculator.

Saturday, October 13, 2012

Geno 2.0 data request

If anyone has received results from the Geno 2.0 test of the Genographic Project and wants to share them with me, feel free to send them to dodecad@gmail.com. I will not distribute them or share them with anyone. I want to see what SNPs are tested, what format the data is in, and what its intersection with other available datasets is. This way, I can update my DIYDodecad software so that Geno 2.0 testees can use the various calculators released by the project to get an alternative ancestry assessment.

In time, and if there is interest, I may release additional calculators that make use of the particular SNP set tested by Geno 2.0.

Sunday, August 12, 2012

fastIBD analysis of Africans and African Americans

Individuals from the following populations have been included in this analysis:
African_American_D Somali_D Moroccan_D Algerian_D North_African_Jews_D Tunisian_D East_African_Various_D Yoruba_D Sudan_D Egyptian_D Chad_D
These were analyzed in the context of a large set of African populations. CEU European Americans were also added to account for the European admixture present in some African American individuals.
This is the first time I have included African American Dodecad participants in this type of analysis.

A few quick points:
  • fastIBD was run with default parameters over a dataset of 679 individuals/255020 SNPs
  • fastIBD identifies segments of relatively recent origin that are shared by individuals. These results should not be construed as measures of overall genetic similarity or origins. Rather, they suggest which populations have exchanged genes in the relatively recent past.
With that said, you can get:
  • Spreadsheet of numeric results, showing median sharing (in centi-Morgans, cM)
  • Population-level graphical results, showing an ordering of other populations based on median IBD sharing.

IBD sharing was assessed only for populations with 5+ individuals.
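
The summary in the spreadsheet can be approximated as follows. This is a minimal Python sketch, not the code used for the analysis: the function name and the input dictionaries are assumptions, and it expects one total (in cM) per pair of individuals, with pairs that share nothing entered as 0 so that the median is not biased upwards.

```python
from collections import defaultdict
from statistics import median


def median_ibd_sharing(pair_cm, population_of, min_size=5):
    """Median pairwise IBD (cM) between populations with at least min_size members.

    pair_cm: dict mapping (id_a, id_b) -> total shared IBD length in cM
    population_of: dict mapping individual id -> population label
    """
    sizes = defaultdict(int)
    for pop in population_of.values():
        sizes[pop] += 1
    kept = {pop for pop, n in sizes.items() if n >= min_size}

    values = defaultdict(list)
    for (a, b), cm in pair_cm.items():
        pop_a, pop_b = population_of[a], population_of[b]
        if pop_a in kept and pop_b in kept:
            # sort the labels so (A, B) and (B, A) land in the same bucket
            values[tuple(sorted((pop_a, pop_b)))].append(cm)
    return {pops: median(cms) for pops, cms in values.items()}
```

Within-population sharing appears under keys such as ("Somali_D", "Somali_D"), which is what the high within-population values in the plots below refer to.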

The following heat map allows for a quick appraisal of populations that show an excess of IBD sharing (read row-by-row). The grouping of populations by language group and/or region is clearly manifested. There are some interesting details that jump off the screen (but do consult the spreadsheet for details). For example, notice that:
  • Within the Bantu group (Bantu_NE, LWK/Luhya, and Bantu_S), only the South Bantu have an excess of IBD sharing with San.
  • Of the North Africans, Egyptians show an excess of IBD sharing with Tigray.
  • Of the Ethiopians/East Africans, it is the Omotic-speaking Wolayta that seem to especially share IBD with the Ari people, who are also Ethiopian Omotic speakers.




Some visualizations (see graphical results above for full set):

Mozabites showing a high degree of within-population IBD sharing, and secondarily with other NW African groups.

The Dodecad Project Somali sample shows a high degree of sharing within itself and also with the Pagani et al. Somali and Ethiopian Somali samples, and then with various other East African groups.
Sources of data are listed at the bottom left of this blog.

Saturday, August 11, 2012

On the so-called "Calculator Effect"

The genome blogger Polako recently announced a calculator effect (May 2012) affecting admixture estimates:
However, many people are getting skewed results, despite doing everything right. For instance, users from the UK often come out much more continental European than they should. Some of them actually believe that this is because they're genetically more Norman or Saxon than the average Brit. Nope, the real reason is what I call the "calculator effect". This is when the algorithm produces different results for people who are part of the original ADMIXTURE runs that set up the allele frequencies used by the calculators, than those who aren't, even though both sets of users are of exactly the same origin, and should expect basically identical results.
This, however, was described by myself many months prior, in November 2011, following up on observations made during my first analysis of Yunusbayev et al. Armenians in September 2011. It has been listed in the Technical Stuff at the bottom of this blog ever since.

I had observed at the time that the newly available Yunusbayev et al. Armenian sample appeared more "European" using the Dodecad v3 calculator tool, which had been built using the Project Armenians (Armenian_D) as well as the Armenian sample of Behar et al.

I then explained why this was happening, and released new versions of the Dodecad tools, such as K12a, and K12b, and more recently K10a as new scientific and project participant samples became available.

Polako also proposes a "solution" to the problem:
I actually designed my Eurogenes ancestry tests for Gedmatch with this problem in mind, by only using academic references to source the allele frequencies. This means that test results for Eurogenes project members and non-members are directly comparable. Perhaps other genome bloggers can eventually do the same?
The only effect of this "solution" is to ensure that there is a "calculator effect" for everyone using his tools. For example, if he uses only published Finns and Lithuanians to build his calculator, then every Finn and Lithuanian who takes his test will wonder why he is "different" from the published Finns and Lithuanians, because they will all suffer a "calculator effect" with respect to the reference populations. So, perhaps they will all be on equal footing with respect to each other, but their results will all be biased because of the issue I had identified.

Moreover, their results will never improve as more people join his Project, because these new people will not be included in newer versions of calculators: all users of DIY Eurogenes tools will continue to receive sub-par results. Well, small consolation, at least they'll all receive comparable sub-par results.

The solution to this problem was also described in my original post, and it's not an unimaginative quick fix of biasing everyone's results with respect to the reference populations:
What can we do to solve this problem? Sample, sample, sample. There is no shortcut. The gross details of the genetic landscape (such as the relationship between major continental groups) are easy to infer, but the details will always have room for improvement.
It is only by adequate sampling, that is by including more and more people, rather than excluding even the ones we have, that ever more accurate admixture estimators can be devised. As sample sizes grow (= more scientists publish their data, and more people join projects such as this one), allele frequencies of the different components will become ever more secure, and deviations of individuals who did not contribute to the inference of the genetic components will converge to zero.

I am already quite confident that inclusion biases amount to only a few percent for Dodecad Project tools and only for the closely related components (e.g., West Asian vs. North European); as mentioned in my original post, these biases are trivial for more distantly related components (e.g., European vs. East Asian).

And, the way to further reduce biases that do persist is to foster participation, rather than consign everyone to a sort of fossilized mediocrity, excluding whole populations of active direct-to-consumer customers (e.g., Norwegians, or Assyrians, or Iraqis, or Germans, or Koreans, or, ...) on the basis that no "academic reference" has made dense genotype data on them freely and publicly accessible.

Friday, August 10, 2012

fastIBD analysis of East/Central Eurasians and select West Eurasians


Individuals from the following populations have been included in this analysis:
Philippines_D Turkish_D Iranian_D Russian_D Finnish_D Turkish_Cypriot_D Ukrainian_D Belorussian_D Chinese_D Korean_D Japanese_D Tatar_Various_D Kazakh_D Szekler_D Hungarian_D Estonian_D Azeri_D Udmurt_D Mixed_Turkic_D 
These were analyzed in a context of a complete set of Central/East Eurasian populations; West Eurasian populations included were mostly Uralic and Turkic speaking groups, and a few others (such as East Slavs or Iranians).

A few quick points:
  • fastIBD was run with default parameters over a dataset of 627 individuals/255020 SNPs
  • fastIBD identifies segments of relatively recent origin that are shared by individuals. These results should not be construed as measures of overall genetic similarity or origins. Rather, they suggest which populations have exchanged genes in the relatively recent past.
With that said, you can get:
  • Spreadsheet of numeric results, showing sharing (in centi-Morgans, cM)
  • Population-level graphical results, showing an ordering of other populations based on mean IBD sharing.
IBD sharing was assessed only for populations with 5+ individuals.

The following heat map allows for a quick appraisal of populations that show an excess of IBD sharing (read row-by-row):

And, a few visualizations of mean IBD sharing:

Notice high levels of within-population IBD sharing for Finns, consistent with a population that experienced expansion from a small number of founders (small ancestral population size).
Compare with Turks, who are a much more diverse population.
These two plots (you can check the spreadsheet for exact numbers) indicate different sources for the East Eurasian element in Turks and Finns. 

The top eastern populations for Turks are: Turkmen, Chuvash, Uzbek, Uygur, all of which are Turkic speakers, followed by Hazara, Yukagir, and Selkup. For Finns, there is a high degree of sharing with various Siberian groups of different languages, including Uralic Selkups (16.4cM) and Nganassan (9.6cM). Turks share less with these Uralic speakers (6.4 and 2.8cM respectively). So, these are strong hints of common shared ancestry within the Turkic and Uralic language families.

The Chuvash population is also quite interesting, as it shares more with Selkup and Nganassan, contrasting with other Turkic speakers. This makes excellent sense, and is in agreement with other recent findings:
Results from this study maintain that the Chuvash are not related to Altaic or Mongolian populations along their maternal line, thus supporting the “Elite” hypothesis that their language was imposed by a conquering group —leaving Chuvash mtDNA largely of Eurasian origin. Their maternal markers appear to most closely resemble Finno-Ugric speakers rather than Turkic speakers.
Sources of data are listed at the bottom left of this blog.

Tuesday, June 12, 2012

'K10a' calculator

The 'K10a' calculator represents an intermediate stage between the K7 and K12 analyses released so far from the Project. The following components have been inferred:
  • Palaeoafrican 
  • South_Asian 
  • West_Asian 
  • Southeast_Asian 
  • Sub_Saharan 
  • Atlantic_Baltic
  • Red_Sea 
  • East_Asian 
  • Mediterranean 
  • Siberian 
There are a couple of points of interest; first, the Red_Sea component relates Arabians to East Africans. At a higher level of resolution the "Southwest_Asian" and "East_African" (K12) components emerge. The "Red_Sea" component is not very closely related to any other components, but is somewhat related to the "Mediterranean" and "Atlantic_Baltic" components.

So, using the different calculators of the Dodecad Project, we first have (K7) a contrast between Africa and West Eurasia, then a signal of the shared ancestry between Arabia and East Africa (K10), and finally, strong signals of local ancestry in the two regions.

Second, the Mediterranean component here is modal in Sardinians as usual, but also projects into North Africa. Again, this is intermediate between K7, which shows a predominance of West Eurasian ancestry in North Africa + an African component, and K12, in which there are "Atlantic_Med" and "Northwest_African" regional components.

These are strong hints that the West Eurasian element in Africa differs between NW and E Africa. In the former region, it is most related to Sardinians, and in the latter it is most related to Arabians. Of course, ultimately the two elements are related to each other.

Table of Fst distances between components:


MDS plots of the first few dimensions:


Downloads: 
Project participants can find their results in the spreadsheet. Non-participants can use DIYDodecad to calculate their results, but they should place all the calculator files in the same directory as the DIYDodecad software, and replace 'dv3' with 'K10a' in all the instructions of the README file.

Component labels are indicative, and you should compare your results against the normalized median results for different populations included in the spreadsheet.

Terms of Use

You are free to use 'K10a', including all downloaded files for any non-commercial purpose, as long as you attribute them to the Dodecad Project and to Dienekes Pontikos as follows:

The 'K10a' admixture calculator is courtesy of Dienekes Pontikos and was developed as part of the Dodecad Ancestry Project; more information here.

Saturday, June 9, 2012

'weac2' calculator

I have made a new version of the 'weac' calculator (West Eurasian cline). This is based on a large Old World dataset at K=7 and includes the following ancestral components:
  • Palaeoafrican 
  • Atlantic_Baltic 
  • Northeast_Asian 
  • Near_East 
  • Sub_Saharan 
  • South_Asian 
  • Southeast_Asian 

The West Eurasian cline is formed between the Near_East and Atlantic_Baltic components.

Here is the table of Fst distances between components:

MDS plots of the first few dimensions:


Downloads:
Project participants can find their results in the spreadsheet. Non-participants can use DIYDodecad to calculate their results, but they should place all the calculator files in the same directory as the DIYDodecad software, and replace 'dv3' with 'weac2' in all the instructions of the README file.

(NOTE: Some IDs may have wrong results in the spreadsheet because of a misalignment of IDs with results; I'll fix this and update this notice. UPDATE: Results should be correct in spreadsheet now - 9 Jun 2012)


Component labels are indicative, and you should compare your results against the normalized median results for different populations included in the spreadsheet.

Terms of Use

You are free to use 'weac2', including all downloaded files for any non-commercial purpose, as long as you attribute them to the Dodecad Project and to Dienekes Pontikos as follows:

The 'weac2' admixture calculator is courtesy of Dienekes Pontikos and was developed as part of the Dodecad Ancestry Project; more information here.

Friday, April 27, 2012

Estimating your Gök4-related ancestry


I have taken Table S15 from Skoglund et al. (2012), and the Dodecad Project K7b admixture proportions in order to investigate possible relationships.

In Table S15 the authors estimate the Neolithic farmer ancestry in several populations on the basis of a single Neolithic individual from the Funnel Beaker (TRB) culture which was found in a megalithic burial in Gökhem parish.

Most of these populations are already part of the Dodecad Ancestry Project, except the three Swedish samples; given the intermediacy of the Central_Sweden sample, I have decided to use my Swedish_D sample of Project participants as a stand-in for it.

Below, you can see a scatterplot relating Gök4-related ancestry with K7b "Southern" component:



The correlation between the two variables is very strong (R-squared = 0.93).

Dodecad Project participants who already have K7b results, as well as customers of DTC testing companies who can use DIYDodecad together with the K7b calculator can approximately estimate their Gök4-related ancestry by plugging in their "Southern" value (in %) into the following equation:

Gök4-related ancestry = 1.721*Southern+19.736
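
As a worked example of the equation above, here is a short Python sketch; the function name is hypothetical and the input of 20% is made up.

```python
def gok4_related_ancestry(southern_percent):
    """Approximate Gök4-related ancestry (%) from the K7b "Southern" value (%)."""
    return 1.721 * southern_percent + 19.736


# A "Southern" value of 20% gives 1.721 * 20 + 19.736 = 54.156
print(round(gok4_related_ancestry(20.0), 3))
```

Because this is a linear fit to population-level values, it should not be pushed far outside the range of the populations in the scatterplot: "Southern" values above roughly 46.6% would yield estimates over 100%.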

I anticipate that when I am able to study the Neolithic Swedish genomes directly, the Neolithic farmer from Sweden will turn up "Southern" in a K=7 resolution experiment.

Sunday, March 11, 2012

ChromoPainter/fineSTRUCTURE analysis of Italy/Balkans/Anatolia

This was done on the same dataset as the previous fastIBD analysis.

The population assignments:



The heatmap, showing relationship between inferred populations:


The principal components analysis:



The correspondence between inferred populations and K12b components:



Results for Project participants can be found in this spreadsheet; remember that in the chunkcounts tabs, columns represent donor and rows recipient populations.

Monday, March 5, 2012

fastIBD analysis of Italy/Balkans/Anatolia

I have included the new Turkish data from Hodoğlugil & Mahley (2012) in this analysis. Additionally, there are now 5 participants in the Serb_D and Turkish_Cypriot_D sub-populations, as well as a Bosnian Muslim. There are now project participants from many Balkan countries, although Albania, the fYROM, and Croatia remain as "black holes" in the map.

Still, I am hopeful that there will be more project participants from currently under-represented populations. I have already started processing the same dataset with ChromoPainter (which takes much longer), and hopefully that analysis will be posted at the end of this week or the beginning of the next one.

First, the heatmap of inter-population IBD:

Remember that the tree groups similar populations together, and for each row in the matrix, the red end of the spectrum indicates lots of IBD sharing, and the blue end low IBD sharing. Additionally, I have now calculated the median IBD sharing, which is more robust to the presence of potential relatives in the data.

The results appear fairly reasonable, with the Balkan, Anatolian, and Italian populations of the title forming separate branches, and the mainland Greek sample joining with Central/South Italians and Sicilians.

The Clusters Galore can be seen below; 28 clusters were inferred with 21 dimensions:



Results for Project participants can be found in the spreadsheet, and include the probabilities that each ID is assigned to each of the 28 clusters, as well as the Z-scores comparing each individual against all populations with 5+ individuals. The Z-score should be read as follows: for each row, high values indicate a high degree of IBD sharing, while low values indicate a low degree of IBD sharing.
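
The exact Z-score computation is not spelled out here, so the following Python sketch is only one plausible reading: each individual's sharing values are standardized within that individual's own row, which is consistent with the advice to read the table row by row. The function name and input layout are assumptions.

```python
from statistics import mean, pstdev


def row_z_scores(sharing_row):
    """Standardize one individual's IBD sharing values across populations.

    sharing_row: dict mapping population label -> sharing with that population (cM)
    """
    values = list(sharing_row.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # identical sharing with every population: nothing stands out
        return {pop: 0.0 for pop in sharing_row}
    return {pop: (value - mu) / sigma for pop, value in sharing_row.items()}
```

With this kind of normalization, a value of +2 in one row and +2 in another do not correspond to the same amount of sharing in cM, which is why values are meaningful within a row rather than across rows.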

Of course, I encourage Project participants to leave a message in the Information about Project samples thread.

Wednesday, February 15, 2012

Correspondence between ChromoPainter clusters and ADMIXTURE components in Balkans/West Asia

I took the 25 different inferred clusters from my recent ChromoPainter analysis, and calculated their normalized median components in terms of the K12b calculator. This is a quite useful exercise, since it can show in what sense clusters are different from each other.
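
The calculation can be sketched in Python as below. This is an illustration under stated assumptions, not the code that produced the table: the function name is hypothetical, and "normalized" is taken to mean that the per-component medians of a cluster are rescaled to sum to 100%.

```python
from statistics import median


def normalized_median_components(individuals):
    """Median of each admixture component over a cluster, rescaled to sum to 100.

    individuals: non-empty list of dicts mapping component name -> percentage,
    one dict per individual assigned to the cluster.
    """
    components = list(individuals[0].keys())
    medians = {c: median(ind[c] for ind in individuals) for c in components}
    total = sum(medians.values())
    if total == 0:
        raise ValueError("all component medians are zero")
    return {c: 100.0 * m / total for c, m in medians.items()}
```

The rescaling step is needed because medians taken component by component do not generally add up to 100%, even when every individual's proportions do.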




Here are two ways in which you may use this correspondence.

1. Different clusters of a single population

For example, the Turks with partial Balkan ancestry tend to belong to pop10, whereas those of Anatolian ancestry to pop13, and those from northeastern Anatolia to pop22. If we compare the admixture proportions of these three groups, we notice e.g.,

  • An excess of Atlantic_Baltic and North_European in pop10
  • An excess of Caucasus in pop22
Or, there is a group of 5 Iranians that belong to pop12, whereas the overwhelming majority of Iranians and Kurds belong to pop21. Strikingly, pop12 differs from all other populations in having substantial levels of East_African and Sub_Saharan. So, it seems that fineSTRUCTURE was able to infer that some Iranian individuals had this feature in common. These individuals were already evident in the Iranian population portrait (right), but fineSTRUCTURE was able to group them even though there were no African populations in the ChromoPainter analysis; presumably, the software was able to detect that these individuals shared a set of chunks that were quite different than is the norm for the Balkan/West Asian area.

2. Related clusters


fineSTRUCTURE grouped the different populations in a tree structure. For example, it grouped pop18, the "North Balkan" cluster with pop23, the "Bulgarian-Romanian" one.

Looking at the admixture proportions, we can tell that the two clusters do indeed seem quite similar, but there are some differences, e.g., an excess of North_European in pop18, and an excess of Caucasus in pop23. This makes sense given the geographical origin of individuals belonging to the two clusters.

Tuesday, February 14, 2012

ChromoPainter/fineSTRUCTURE analysis of Balkans/West Asia

I have carried out a ChromoPainter/fineSTRUCTURE analysis of Balkans/West Asia. This is a slightly different dataset than the one used in the previous fastIBD analysis of the same region. It also took much longer (about a week, with two CPUs dedicated to the task) to complete, so it is not something that can be done routinely.

Technical details (skip if you want)


413 individuals from 33 populations were studied, on 258,100 SNPs, after --geno 0.03 --maf 0.01 filters were applied. Data were phased in Beagle with the default 10 iterations. Genetic maps from the HapMap were used. fineSTRUCTURE was used on ChromoPainter output, with 500,000 burnin/runtime iterations each.
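
For readers unfamiliar with the two PLINK filters mentioned above, here is a minimal Python sketch of what they do: --geno 0.03 drops SNPs with more than 3% missing calls, and --maf 0.01 drops SNPs whose minor allele frequency is below 1%. The function name and the in-memory genotype layout are assumptions for illustration; the actual filtering was done with PLINK.

```python
def filter_snps(genotypes, max_missing=0.03, min_maf=0.01):
    """Return the indices of SNPs that pass missingness and MAF thresholds.

    genotypes: list with one entry per SNP; each entry is a list of calls,
    coded as 0, 1 or 2 copies of one allele, or None for a missing call.
    """
    kept = []
    for index, calls in enumerate(genotypes):
        observed = [g for g in calls if g is not None]
        missing_rate = 1 - len(observed) / len(calls)
        if not observed or missing_rate > max_missing:
            continue
        freq = sum(observed) / (2 * len(observed))
        maf = min(freq, 1 - freq)  # frequency of the rarer allele
        if maf >= min_maf:
            kept.append(index)
    return kept
```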

25 Inferred Populations


fineSTRUCTURE imposes a tree structure on a number of inferred populations. The following heatmap shows this tree structure; columns represent donor populations, rows, recipient ones.


There was a total of 25 populations, labeled pop0, pop1, ..., pop24.

The following table summarizes how many individuals from each original population were assigned to each inferred population:


I will limit myself to populations which include Dodecad Project members:

  • pop6 includes a Project North Ossetian, as well as all Yunusbayev et al. North Ossetians
  • pop7 is mainly Armenian
  • pop16 is also mainly Armenian; it would be interesting to see whether this bipartite division of Armenians is in agreement with the one inferred in the previous fastIBD analysis
  • pop8 is mainly Greek, and appears to be "continental Greek"; it also includes some other Balkan individuals
  • pop14 is also Greek, and includes a variety of people with ancestry from Crete, the Aegean, Cyprus, Asia Minor, Cappadocia, and the Pontus as well as continental Greek. It could be labeled "eastern Greek"
  • pop11 is Cypriot, including the single 100% Greek Cypriot of the Project, all 3 100% Turkish Cypriots, as well as a Turkish individual of partial Turkish_Cypriot ancestry
  • pop10 is Turkish, and includes people with some ancestry from the Balkans, as well as Anatolia. It could be labelled "Balkan Turkish"
  • pop13 is also Turkish, and seems to include people with ancestry exclusively from Anatolia, including almost all the Behar et al. Turks
  • pop15 is Assyrian; some Assyrians also fall on the aforementioned pop16 which includes mainly Armenians
  • pop18 could be labelled "North Balkan"; there is probably structure to be uncovered within this cluster, once more participants from the Balkans join the Project
  • pop20 is "Georgian-Abkhazian"
  • pop21 is "Kurdish-Iranian"
  • pop22 could be labeled "Northeastern Anatolia" or (more classically) "Pontus-Colchis". It appears to unite various individuals from Northeastern Turkey and neighboring Georgia, having Karadeniz Turkish, Armenian, Pontic Greek, and Kartvelian ancestry. I strongly encourage participants from this region to join the Project, especially Pontic Greeks, as there are no 100% Pontic Greeks currently in the Project.
  • pop23 is "Bulgarian-Romanian" mainly, and also includes one Serb. Once again, I emphasize that the power of this approach using haplotypes depends on participation, so I encourage all people from the Balkans to consider joining the Project.

Principal Components Analysis


I have also used the PCA feature of fineSTRUCTURE to carry out principal components analysis. I am plotting the first two dimensions of this PCA, using my own visualization code that places labels in the average position on the plane:
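The idea of placing each label at the average position of its population can be sketched as follows. This is not the actual visualization code used for the figure; the data frame `coords` and its values are hypothetical:

```r
# Toy PCA coordinates (hypothetical values, not Project results)
set.seed(1)
coords <- data.frame(
  pop = rep(c("A", "B", "C"), each = 5),
  PC1 = c(rnorm(5, -2), rnorm(5, 0), rnorm(5, 2)),
  PC2 = c(rnorm(5, 1), rnorm(5, -1), rnorm(5, 0))
)

# Average position of each population on the plane
centroids <- aggregate(cbind(PC1, PC2) ~ pop, data = coords, FUN = mean)

# Individuals as grey dots, one label per population at its mean position
plot(coords$PC1, coords$PC2, col = "grey", pch = 16, xlab = "PC1", ylab = "PC2")
text(centroids$PC1, centroids$PC2, labels = centroids$pop)
```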


Results


Results for Project participants are included in the spreadsheet.

  • Population matrix, shows how many individuals from each population were assigned to each cluster
  • Z score population matrix, shows the normalized number of "chunks" from each donor population (columns) to each recipient (row). Do not compare across rows! The way to read this table is the following: for each row, higher values indicate more sharing. For example, the "Cypriots" population has pop11 as its main donor.
  • Individual assignments: the pop number that all Project and reference IDs were assigned to
  • Individual Chunkcounts: the number of chunks copied from each donor population (column) to each individual
  • Individual PCA: your PCA co-ordinates that can help you find your dot on the Principal Components Analysis graphic (see above)
Averaged results were included only for populations with >=5 members.
The raw chunkcounts for all 413x413 individuals can be found here.
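To illustrate the row-wise reading of the Z score population matrix, here is a small sketch in R. The matrix `Z` and its values are invented for the example; only the idea of looking for the highest value within each row comes from the description above:

```r
# Toy Z-score matrix: rows are recipients, columns are donors (hypothetical values)
Z <- matrix(c( 2.1, -0.4, -0.9,
              -0.3,  1.8, -0.7,
              -0.8,  0.2,  1.5),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("Cypriots", "Greek", "Armenians"),
                            c("pop11", "pop8", "pop7")))

# For each recipient (row), pick the donor (column) with the highest value
main_donor <- colnames(Z)[apply(Z, 1, which.max)]
names(main_donor) <- rownames(Z)
print(main_donor)
```

Values are only compared within a row, never between rows, in keeping with the warning above.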

Monday, February 6, 2012

Other testing companies

The Dodecad Project is not affiliated with any genetic testing companies. Until now, I have included Project participants from 23andMe and FamilyTreeDNA "Family Finder" tests, but it has come to my attention that there are new players in the field, such as Ancestry.com (see post on Your Genetic Genealogist) and Lumigenix (see post on GenomesUnzipped).

If you have data from any company entering this field, please contact me at dodecad@gmail.com (do not send data right away!). That way, I can find out how many markers are in common between the new tests and my existing datasets, and figure out how easy it will be to convert them for use in the Project and in DIYDodecad.

Tuesday, January 31, 2012

'K12b' and 'K7b' calculators

I am releasing two new calculators with K=12 and K=7 components, named 'K12b' and 'K7b'. You can scroll down to the bottom if you are just interested in the downloads, or read on.


New Features

The new 'K12b' calculator is an update of the previous K12a one, that was inferred using all the new samples submitted during the last submission opportunity. The 12 components are still roughly the same, although their allele frequencies may have changed by a bit, so existing participants can expect to have slightly altered results, and new participants in the Project more so, since their data are now contributing to the creation of the new tool. Non-participants can, of course, use the new calculator with DIYDodecad.

I have also taken the opportunity to do some minor tweaks. I am releasing population portraits for K12b (which were lacking in K12a); I've changed my visualization code so that the sample IDs of non-Dodecad populations can now be seen in the barplots. This may be useful for anyone else using these reference populations, by quickly identifying potential outliers in them.

I have also decided to use normalized median admixture proportions for the populations. For example, if 5 individuals in a population have 0, 0, 0.2, 0.5, 10.0% of a particular component, then the average is 2.14%, but the median is 0.2%. By using the median, the proportions become less susceptible to the presence of outliers (such as the 10%). However, if the median is calculated over every component separately, it is no longer guaranteed that the components will add up to 100%; this can be addressed by re-normalizing them (scaling them by a constant factor) so that they do. I believe that use of the normalized median will not only give better proportions that are less susceptible to outliers, but will also improve results of the new Dodecad Oracle for K12b.
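The calculation can be sketched in R as follows. The first column reuses the five example values above; the other two columns (and the names `comp1` to `comp3`) are made up so that each individual adds up to 100%:

```r
# Rows: individuals; columns: components (percentages, each row sums to 100)
# comp1 holds the example values from the text; comp2 and comp3 are made up
props <- cbind(
  comp1 = c(0, 0, 0.2, 0.5, 10.0),
  comp2 = c(60, 55, 58.8, 59.5, 50.0),
  comp3 = c(40, 45, 41.0, 40.0, 40.0)
)

# The mean is pulled upwards by the 10% outlier; the median is not
component_mean   <- colMeans(props)
component_median <- apply(props, 2, median)

# Medians taken per component need not sum to 100, so rescale them
normalized_median <- 100 * component_median / sum(component_median)

print(round(component_mean, 2))
print(component_median)
print(round(normalized_median, 2))
```

In this toy case the raw medians add up to 99%, and the re-normalization scales every component by the same factor so that they add up to 100% again.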

At the same time I am also releasing 'K7b' which is an update of the existing 'eurasia7' calculator and which has been built on exactly the same dataset as 'K12b' but at a lower (K=7) level of detail.

Information on K7b


Information spreadsheet.

Normalized median admixture proportions barplot for all included populations (a high resolution version of this is included in the download bundle):


Table of Fst divergences:

Neighbor-joining tree (based on above):

Information on K12b


Information spreadsheet.


Normalized median admixture proportions barplot for all included populations (a high resolution version of this is included in the download bundle):

Table of Fst divergences:

Neighbor-joining tree (based on above):

Multidimensional Scaling Plots of K12b and K7b


I have created MDS plots using synthetic individuals representing the 12 ancestral components of K12b and the 7 ancestral components of K7b. By including both in the same plot, one gets an idea of the relationship of the components at different resolution. The first 10 dimensions can be seen below:
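For readers who want to experiment with the general idea, here is a rough sketch in R. It is not the procedure used to make the plots in this post: the allele frequencies are random numbers rather than K12b/K7b values, the component names are placeholders, and the choice of Euclidean distance with classical MDS (`cmdscale`) is my assumption for the example.

```r
set.seed(42)
n_snps <- 500

# Hypothetical allele frequencies for three components (not K12b/K7b values)
freqs <- cbind(compA = runif(n_snps, 0.05, 0.95),
               compB = runif(n_snps, 0.05, 0.95),
               compC = runif(n_snps, 0.05, 0.95))

# Draw n synthetic individuals from one component: each genotype is 0, 1 or 2 alleles
simulate <- function(p, n) t(replicate(n, rbinom(length(p), size = 2, prob = p)))
geno <- do.call(rbind, lapply(colnames(freqs), function(k) simulate(freqs[, k], 10)))
labels <- rep(colnames(freqs), each = 10)

# Classical MDS on Euclidean distances between genotype vectors
mds <- cmdscale(dist(geno), k = 2)
plot(mds, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(mds, labels = labels)
```

Synthetic individuals drawn from the same component should land close together, while different components separate along the leading dimensions.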

Here is a blowup of the main West Eurasian groups from the plot of the first two dimensions:

Some observations:

  • The Atlantic_Med component which is bi-modal in Basques and Sardinians occupies the apex of the figure; this makes sense, since Southwest Europe is quite distant (along land routes) from both Asia and Africa.
  • The Caucasus component is surrounded by most of the others; this is consistent with my theory elaborated in The womb of nations: how West Eurasians came to be.
  • The Atlantic_Baltic component (from K=7) is intermediate between the Atlantic_Med and North_European components.
  • Similarly, the West_Asian component (from K=7) is intermediate between the Caucasus and Gedrosia components; the Gedrosia component diverges in the direction of the Asian groups (not shown in this figure), and in particular of South Asians. This divergence can also be seen in the plot of dimension #3.
  • The Northwest_African component diverges in the direction of Sub-Saharan Africans.

Technical Details


A dataset of 268 populations/3,115 individuals was assembled. A total of 265,519 SNPs are common to the various source datasets as well as the 23andMe v2/v3 and Family Finder platforms. Distant relatives were removed iteratively: one individual was dropped from each pair within a population if that pair had a RATIO of 2.5 or greater, or a RATIO above the mean plus two standard deviations, in the IBD analysis performed in PLINK 1.07. A total of 2,675 individuals remained. 4 individuals were removed for low genotyping rate (less than 97%). 264,328 SNPs remained after removal of SNPs with less than 97% genotyping rate or less than 1% minor allele frequency. 166,770 SNPs remained after linkage disequilibrium-based pruning (--indep-pairwise 200 25 0.4). The final set thus consisted of 2,671 individuals/268 populations/166,770 SNPs. Ancestral populations (components) were inferred using ADMIXTURE 1.21, with K=7 and K=12 and default parameters.

No individuals were removed from the source datasets, except in the case of the Armenians_Y sample, where one individual (ID: armenia3) was dropped because he/she was the same as a Dodecad Project participant.

Downloads


K7b population portraits, spreadsheet, and DIYDodecad files.
K12b population portraits, spreadsheet, and DIYDodecad files.

Dodecad Oracle (K12b edition) can be downloaded from here. Please read the instructions of the previous Oracle on how to use this tool. Note that the number of populations is now 223.

To use either calculator with DIYDodecad, with your 23andMe or Family Finder data, follow the instructions in the README file, but substitute 'K12b' or 'K7b' for 'dv3'.

Project participant results for both K7b and K12b are found in the spreadsheets in the Individual Results tab.

Terms of Use


You are free to use K12b and K7b, including all downloaded files for any non-commercial purpose, as long as you attribute them to the Dodecad Project and to Dienekes Pontikos as follows:

The [K7b/K12b] admixture calculator is courtesy of Dienekes Pontikos and was developed as part of the Dodecad Ancestry Project; more information here.

Tuesday, January 24, 2012

Submission Opportunity is OVER

Thank you, everyone, for submitting your data. I will not accept any more data at this time. A couple of submissions came in at the last second, so I accepted one more than I promised, who got the brand new DPD001 ID.

Those who submitted in time will get their IDs and their results will be posted in the K12a spreadsheet.
Additionally, I will run all participants over world9, so that spreadsheet will also include everybody.

From now on, I will be reworking some of the Project tools to make use of newer samples submitted during this submission opportunity.

If you wish to submit your data during this off period, note that you must contact me at dodecad@gmail.com. Do not send data at this time, unless I indicate that I can accept it! I will let you know if I can process it, and note that I will normally only consider those who matched the eligibility criteria of the most recent submission period.

Monday, January 23, 2012

Open submission for everybody until DOD999

SUBMISSION OPPORTUNITY IS NOW OVER

Everyone on the planet is invited to submit their data, regardless of their ancestry.

All other rules apply, especially the no relatives clause. Additionally, I will accept a single submission from each submitter, so don't submit all your friends. Moreover, regardless of your ancestry, you should let me know the origin of your four grandparents.

There are 35 spots open, so hurry, since last time I had a free-for-all I had to close it down after about 12 hours due to overwhelming demand. I will close project submission after I assign DOD999.

All submissions after I post the end-of-submission announcement on the blog will be ignored. If you post this in any forums or mailing lists, include this post link so that people will know whether the opportunity is over.

Saturday, January 21, 2012

fastIBD analysis of Afroasiatic groups (Jews, Arabs, Assyrians, Berbers, Somalis, Amharas, etc.)

Please refer to the previous analysis on the Balkans/West Asia for more information about the interpretation of this type of analysis.

I am very pleased with the way this analysis of Afroasiatic groups has turned out, revealing an exceptional degree of resolution. I invite individuals from the Near East and Africa who are eligible, to submit their data, so that they can be included in future runs of this kind.

Clusters Galore


45 clusters were inferred with 29 dimensions.


I can't comment on all 45 clusters, so I'll just limit myself to the ones that are significantly represented among Project participants: 1. Ashkenazi, 4. Assyrian/Mandaean, 6. Somali, 7. Moroccan, 8. Algerian/Tunisian, 9. Sephardic, 10. Morocco Jews, 11. Iran/Iraq Jews, 12. Non-Jewish Ethiopians, 13. Saudi, 14. Arab #1, 15. Arab #2, 16. Egyptian

Inter-Population IBD


Results for Project Participants


The results can be found in the spreadsheet.

I have also added the full IBD sharing matrix which lists how many Morgans of sequence are estimated to be IBD with probability greater than 10^-6 between all pairs of individuals.

You can google any non-Project sample IDs to get some more information about their origin. For example, GSM536710 is an Iraqi Jew who shares about half his genome with GSM536714, also an Iraqi Jew. These two samples are almost certainly first-degree relatives. Or, GSM537032, a Samaritan, shares 740-1,480 cM with the other 2 Samaritans, an exceptional amount in this small and probably highly inbred population.

You can manipulate this matrix in R. After you download it and unzip it, you can load it into R as follows:

X<-read.table('afroasiatic_ibd_sharing.txt',row.names=1,header=T)

Then, you can, for example, sort the IBD sharing for a particular individual, as follows:

sort(X['DOD026',])
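If you want to try this out before downloading the real file, here is a self-contained sketch. It writes a tiny made-up matrix to a temporary file and reads it back with the same `read.table` call; the IDs `DOD100`, `REF001` and `REF002` and all the values are hypothetical, and I am assuming the real file has the same row/column layout. It also converts the row to a plain named vector with `unlist` before sorting, which avoids relying on how `sort` treats a one-row data frame.

```r
# Toy symmetric IBD sharing matrix (hypothetical IDs and values, in Morgans)
ids <- c("DOD026", "DOD100", "REF001", "REF002")
M <- matrix(c(0.00, 0.12, 0.03, 0.40,
              0.12, 0.00, 0.05, 0.02,
              0.03, 0.05, 0.00, 0.01,
              0.40, 0.02, 0.01, 0.00),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Write it to a temporary file and read it back as described above
toy_file <- tempfile(fileext = ".txt")
write.table(M, toy_file, quote = FALSE)
X <- read.table(toy_file, row.names = 1, header = TRUE)

# One individual's row as a named numeric vector, without the self-comparison
sharing <- unlist(X["DOD026", ])
sharing <- sharing[names(sharing) != "DOD026"]

# Highest sharing first
print(sort(sharing, decreasing = TRUE))
```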

fastIBD analysis of Central/Eastern Europe

Please refer to the previous analysis on the Balkans/West Asia for more information about the interpretation of this type of analysis.

Clusters Galore


The Clusters Galore can be found in the spreadsheet. After inspection of the 23 clusters inferred with 21 dimensions, they could be described as:

  1. Mordvin
  2. East Slavic
  3. Polish-Ukrainian
  4. East Balkan
  5. Vologda Russians
  6. Lithuanian
  7. Central European (combining many groups with small sample sizes)
  8. A couple of related (?) individuals
  9. Anatolian
  10. Greek
  11. Chuvash
  12. Ossetian
  13. A couple of related individuals
  14. A couple of related individuals
  15. Balkar
  16. A couple of related individuals
  17. Chechen
  18. Kumyk
  19. A couple of related individuals
  20. Adygei
  21. Lezgin #1 (main)
  22. Lezgin #2
  23. Lezgin #3
If you belong to a population with few other participants, you might end up latching onto a cluster dominated by a bigger group. This does not mean that your population is not distinctive, only that there are not enough samples to reveal its distinctiveness if it exists.

Inter-Population IBD


Results for Dodecad Participants

Results can be found in the spreadsheet.

If you have joined the Project, please consider leaving a comment in the Information about Project samples thread. That will help others make better sense of their results, e.g., if you find that you belong in the same cluster with some other individual, you might want to know something about their origins.

UPDATE: I have added the IBD sharing matrix. See here on how to use it.

Thursday, January 19, 2012

fastIBD analysis of South Asia

Please refer to the previous analysis on the Balkans/West Asia for more information about the interpretation of this type of analysis.

Clusters Galore


The Clusters Galore analysis can be found in the spreadsheet. 59 clusters were inferred with 47 MDS dimensions. The very fine-scale structure (I only considered the first 50 dimensions, but many more seemed significant than in any previous experiment) is probably the result of the size of the South Asian population, as well as the practice of endogamy associated with the caste system. High intra-population IBD sharing is also evident in the following (notice how well-defined the diagonal is):

Inter-Population IBD




Results for Dodecad participants

They can be found in the spreadsheet. Many Project participants belong to a population with 1 or 2 individuals, so cluster #1 seems to be a generalized catch-all for many such individuals. Individuals from the two sub-populations that I've identified recently, Iyer_D and Jatt_D, all belong to the same cluster. The Iyer_D cluster (#4) also seems to include the Iyengar project participants, as might be expected.

It is also interesting how all Dodecad participants fall in just 7 of the 59 clusters. This goes to show how truly diverse people from the Indian subcontinent are. I fully expect that with more participation further structure will be revealed, since it seems that due to endogamy it only takes a few participants from each ethnic group for a specific cluster pertaining to that group to be identified. So, I invite people from South Asia to join the Project during this submission opportunity.

Tuesday, January 17, 2012

fastIBD analysis of Iberia, France, Italy, Balkans, Anatolia and European Jews

On the heels of the previous analysis of Balkans/West Asia, here is a new experiment on a different set of populations. Please refer to the earlier post for some thoughts/explanations about this type of analysis; I'll stick to "just the data" for this post.

Clusters Galore




24 clusters inferred with 17 MDS dimensions.

The Galore analysis provides increased resolution within Iberia (#6-9, 11), Italy, and the Ashkenazi Jewish group (#14-16).

The Iberian results are particularly interesting, showing the power of this approach compared to the one with unlinked data. There appear to be:

  • a Spanish Basque cluster (#6),
  • a French Basque cluster (#11),
  • a Portuguese/Galician/Castilla Y Leon cluster (#9),
  • a complementary Castilla La Mancha/Cantabria/Andalucia/Murcia cluster (#7), and
  • a smaller Aragon/Cataluna cluster (#8).

There is overlap between these clusters, but the geographical contrasts are quite evident. I did not go through the results of Spanish Project participants (all the Portuguese fall in the Galician cluster, and our Basque member in the Basque cluster as expected), so it would be interesting to hear whether they fall in the cluster(s) which exist in their regions of origin.

Inter-Population IBD




Results for Project Participants


The results can be found in the spreadsheet.