1 00:00:01,000 --> 00:00:01,480 Hello, 2 00:00:01,546 --> 00:00:05,090 I would like to welcome you to a presentation on the topic 3 00:00:05,160 --> 00:00:10,130 "Fine-scale analysis of population structure based on genomic data 4 00:00:10,200 --> 00:00:15,090 and quantification of selection effect on livestock genome", which was prepared 5 00:00:15,160 --> 00:00:19,050 for the third degree of education. 6 00:00:19,120 --> 00:00:25,200 This presentation is a part of the ISEGREED project, which is supported 7 00:00:25,266 --> 00:00:28,090 by the European Union. 8 00:00:28,160 --> 00:00:33,760 This presentation belongs to module number 2: Conservation and sustainable 9 00:00:33,826 --> 00:00:37,170 use of animal genetic resources. 10 00:00:37,240 --> 00:00:41,640 My name is Nina Moravčíková. This presentation was also 11 00:00:41,706 --> 00:00:44,010 prepared by Professor Kasarda. 12 00:00:44,080 --> 00:00:48,280 We work at the Slovak University of Agriculture in Nitra, 13 00:00:48,346 --> 00:00:51,680 the Faculty of Agrobiology and Food Resources and Institute 14 00:00:51,746 --> 00:00:55,610 of Nutrition and Genomics. 15 00:00:55,680 --> 00:01:02,640 This presentation is divided into four parts: quality control of genomic data, 16 00:01:02,706 --> 00:01:06,050 approaches and tools for population structure analysis, 17 00:01:06,120 --> 00:01:10,960 approaches and tools for evaluating the impact of selection on the livestock 18 00:01:11,026 --> 00:01:15,160 genome, and the last part is functional annotation 19 00:01:15,226 --> 00:01:20,570 of regions significantly affected by selection pressure. 20 00:01:20,640 --> 00:01:27,010 Quality control of genomic data is a really important step 21 00:01:27,080 --> 00:01:30,160 before any type of analysis. 
22 00:01:30,226 --> 00:01:34,200 I would like to speak mainly about 23 00:01:34,266 --> 00:01:38,970 the quality control of data which is related to the genomic data 24 00:01:39,040 --> 00:01:42,370 obtained by using SNP chips. 25 00:01:42,440 --> 00:01:46,770 If we have incorrect or low quality data, 26 00:01:46,840 --> 00:01:49,560 this can usually lead to errors 27 00:01:49,626 --> 00:01:56,760 in analysis and mainly errors related to the interpretation of results. 28 00:01:57,400 --> 00:02:04,040 Data quality indicators which are usually used are the call rate of SNP markers overall 29 00:02:04,106 --> 00:02:10,920 in the meta-population, and then also the call rate of SNP markers 30 00:02:10,986 --> 00:02:14,130 within individuals in the population, 31 00:02:14,200 --> 00:02:20,160 then minor allele frequency and also deviation 32 00:02:20,226 --> 00:02:23,090 from the Hardy-Weinberg equilibrium. 33 00:02:23,160 --> 00:02:30,650 Sometimes it's also good to apply quality control for linkage disequilibrium. 34 00:02:30,720 --> 00:02:37,440 The type of quality control which is used before the analysis depends mainly 35 00:02:37,506 --> 00:02:45,010 on the type of the analysis and also the main objective of the analysis. 36 00:02:45,080 --> 00:02:51,160 On this slide, you can see the standard quality control which is used if 37 00:02:51,226 --> 00:02:56,050 we would like to analyze population genetic structure. 38 00:02:56,120 --> 00:03:02,410 Usually, this quality control of genomic data covers call rate across SNPs 39 00:03:02,480 --> 00:03:08,960 and across animals, whose minimum value is usually set to 90%, 40 00:03:09,026 --> 00:03:12,210 and then also minor allele frequency. 41 00:03:12,280 --> 00:03:16,880 The minimum value for minor allele frequency is based on the 42 00:03:16,946 --> 00:03:20,690 Mendelian Inheritance Law. 43 00:03:20,760 --> 00:03:25,570 Then we also apply the Hardy-Weinberg equilibrium test. 
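The quality control indicators named above (call rate, minor allele frequency, Hardy-Weinberg equilibrium) can be sketched for a single SNP as follows. This is only a minimal illustration, not the procedure used in PLINK: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with None for a missing call, the 90% call rate comes from the talk, and the minor allele frequency and chi-square cut-offs are hypothetical defaults chosen for the example.

```python
def call_rate(genotypes):
    """Share of non-missing genotype calls (None marks a missing call)."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(genotypes):
    """Minor allele frequency from genotypes coded as 0, 1 or 2 allele copies."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))
    return min(p, 1 - p)


def hwe_chi2(genotypes):
    """Chi-square statistic: observed vs Hardy-Weinberg expected genotype counts."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    p = sum(called) / (2 * n)  # frequency of the counted allele
    q = 1 - p
    expected = {0: n * q * q, 1: 2 * n * p * q, 2: n * p * p}
    chi2 = 0.0
    for genotype, exp_count in expected.items():
        if exp_count > 0:
            chi2 += (called.count(genotype) - exp_count) ** 2 / exp_count
    return chi2


def passes_qc(genotypes, min_call_rate=0.90, min_maf=0.05, max_hwe_chi2=3.84):
    """Apply the three filters in turn; min_maf and max_hwe_chi2 are assumed values."""
    if not genotypes or call_rate(genotypes) < min_call_rate:
        return False
    if minor_allele_frequency(genotypes) < min_maf:
        return False
    return hwe_chi2(genotypes) <= max_hwe_chi2


snp = [0, 1, 2, 1, 1, 0, 2, 1, None, 1]
print(call_rate(snp), minor_allele_frequency(snp), passes_qc(snp))
```

The same three functions could be applied per animal instead of per SNP to obtain the call rate within individuals.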
44 00:03:25,640 --> 00:03:32,040 Sometimes it's also good to control the level of linkage disequilibrium across 45 00:03:32,106 --> 00:03:37,360 SNPs because if we would like to analyze population 46 00:03:37,426 --> 00:03:41,050 structure, it will be good to have only 47 00:03:41,120 --> 00:03:45,130 information about neutral genetic markers. 48 00:03:45,200 --> 00:03:52,330 For this purpose, we can use several types of programs and web-based tools. 49 00:03:52,400 --> 00:03:55,690 For example, we can use the program PLINK. 50 00:03:55,760 --> 00:04:00,800 On the right side of this slide, you can see a graphical visualization 51 00:04:00,866 --> 00:04:07,160 of quality control of SNP chip data in the case of horses. 52 00:04:07,920 --> 00:04:14,920 Which types of analysis can we perform if we are speaking about population 53 00:04:14,986 --> 00:04:20,050 structure and utilization of SNP data? 54 00:04:20,120 --> 00:04:25,610 We can analyze genetic differentiation within and between populations, 55 00:04:25,680 --> 00:04:31,970 then we can also evaluate or estimate the degree of genetic admixture 56 00:04:32,040 --> 00:04:36,640 within or between them, as well as changes in their gene pool 57 00:04:36,706 --> 00:04:42,610 which have arisen, for example, due to selection, migration, or genetic drift. 58 00:04:42,680 --> 00:04:47,240 But we can also estimate other parameters, for example, 59 00:04:47,306 --> 00:04:52,570 the genomic relationship matrix, and based on the results optimize mating plans. 
60 00:04:52,640 --> 00:04:58,080 The most common types of methods which can be used for the analysis of population 61 00:04:58,146 --> 00:05:01,730 structure are calculation of Wright's FST index, 62 00:05:01,800 --> 00:05:05,770 calculation of genetic distance and relationship matrices, 63 00:05:05,840 --> 00:05:10,120 principal component analysis, discriminant analysis of principal 64 00:05:10,186 --> 00:05:14,330 components, and Bayesian analysis of genetic admixture and gene flow 65 00:05:14,400 --> 00:05:18,520 between populations, and also construction of phylogenetic 66 00:05:18,586 --> 00:05:22,170 trees and genetic networks. 67 00:05:22,240 --> 00:05:27,210 Wright's fixation index FST is one of the most commonly used 68 00:05:27,280 --> 00:05:32,360 parameters for evaluation of the degree of genetic differentiation 69 00:05:32,426 --> 00:05:34,770 between and within populations. 70 00:05:34,840 --> 00:05:37,600 Its value ranges from zero to one. 71 00:05:37,666 --> 00:05:42,530 If the value is equal to zero, then the populations are genetically identical. 72 00:05:42,600 --> 00:05:46,640 But if the value is equal to one, we can say that the populations 73 00:05:46,706 --> 00:05:49,930 are genetically totally different. 74 00:05:50,000 --> 00:05:55,720 The interpretation of Wright's FST index is relatively easy, and also the time 75 00:05:55,786 --> 00:05:58,850 for the computation is relatively short. 76 00:05:58,920 --> 00:06:03,160 But the FST index cannot be used for the quantification of genetic 77 00:06:03,226 --> 00:06:05,730 relationship between individuals, 78 00:06:05,800 --> 00:06:08,090 that means on the individual level. 79 00:06:08,160 --> 00:06:14,090 Also, if the level of diversity in the population is low, then also 80 00:06:14,160 --> 00:06:18,770 the reliability of the results is relatively low. 
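To make the zero-to-one range of FST concrete, here is a minimal sketch for one biallelic SNP and two populations. It uses the textbook heterozygosity form FST = (HT - HS) / HT and assumes equal population sizes; the function name is hypothetical, and dedicated tools typically use sample-size-corrected estimators rather than this simple version.

```python
def fst_biallelic(p1, p2):
    """FST for one biallelic locus from the allele frequencies of two populations.

    Assumes equal population sizes and no sample-size correction.
    """
    # Mean expected heterozygosity within the two populations.
    hs = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled population.
    p_bar = (p1 + p2) / 2
    ht = 2 * p_bar * (1 - p_bar)
    if ht == 0:
        return 0.0  # locus is monomorphic in both populations
    return (ht - hs) / ht


print(fst_biallelic(0.5, 0.5))  # same frequencies in both populations
print(fst_biallelic(0.0, 1.0))  # populations fixed for different alleles
print(fst_biallelic(0.2, 0.8))
```

Identical allele frequencies give zero and populations fixed for different alleles give one, which matches the interpretation given in the talk.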
81 00:06:18,840 --> 00:06:24,800 For the calculation of the FST index, we can use many tools, for example, 82 00:06:24,866 --> 00:06:30,080 Arlequin, Genepop, and GenAlEx, but these three programs are limited 83 00:06:30,146 --> 00:06:34,040 mainly with respect to the number 84 00:06:34,106 --> 00:06:38,330 of SNPs for which we have genetic data. 85 00:06:38,400 --> 00:06:43,850 But we can also use many R packages, for example, StAMPP. 86 00:06:43,920 --> 00:06:50,970 In the figure on the left side, you can see a dendrogram, which was 87 00:06:51,040 --> 00:06:57,810 made based on the FST matrix for the 16 cattle breeds. 88 00:06:57,880 --> 00:07:03,770 This visualization is relatively nice because we see that we have 89 00:07:03,840 --> 00:07:09,850 two genetic clusters composed of breeds which are somehow connected 90 00:07:09,920 --> 00:07:16,490 from a historical point of view or from a phylogenetic point of view. 91 00:07:16,560 --> 00:07:21,330 Relationship matrices express genetic similarities and also 92 00:07:21,400 --> 00:07:24,760 kinship between individuals within a population. 93 00:07:24,826 --> 00:07:30,840 That means these matrices can be used for the quantification of the level of genetic 94 00:07:30,906 --> 00:07:34,210 relationship between individuals. 95 00:07:34,280 --> 00:07:38,600 Each element of the matrix represents a measure of genetic similarity 96 00:07:38,666 --> 00:07:41,050 between a pair of individuals. 97 00:07:41,120 --> 00:07:45,840 Relationship matrices are most often calculated based on the frequency 98 00:07:45,906 --> 00:07:51,320 of alleles in the population, while the calculation itself can be based 99 00:07:51,386 --> 00:07:56,730 on various approaches, for example, calculation of the IBD matrix 100 00:07:56,800 --> 00:07:59,890 or Nei's genetic distances. 
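As one illustration of a distance built from allele frequencies, the sketch below computes Nei's standard genetic distance, D = -ln(I), between two groups for biallelic SNPs. The input format (one allele frequency per SNP and group) and the function name are assumptions made for the example; this is not a description of how StAMPP implements it.

```python
import math


def nei_distance(freqs_x, freqs_y):
    """Nei's standard genetic distance between two groups.

    freqs_x and freqs_y hold the frequency of the same allele at each
    biallelic SNP; the second allele has frequency 1 - p.
    """
    jx = jy = jxy = 0.0
    for px, py in zip(freqs_x, freqs_y):
        jx += px * px + (1 - px) * (1 - px)    # homozygosity in group X
        jy += py * py + (1 - py) * (1 - py)    # homozygosity in group Y
        jxy += px * py + (1 - px) * (1 - py)   # shared identity between groups
    if jxy == 0:
        return math.inf  # no shared alleles at any locus
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)


print(nei_distance([0.1, 0.5], [0.1, 0.5]))  # identical frequencies, distance near zero
print(nei_distance([1.0, 0.5], [0.0, 0.5]))
```

Filling a matrix with this value for every pair of groups gives a distance matrix of the kind that can be visualized as a heatmap or a tree.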
101 00:07:59,960 --> 00:08:05,480 The calculation of relationship matrices is also relatively easy, 102 00:08:05,546 --> 00:08:09,800 and after calculation, we have relatively accurate estimates 103 00:08:09,866 --> 00:08:13,610 of the relationship between animals in the population. 104 00:08:13,680 --> 00:08:20,040 But sometimes, if we have information about a high number of individuals or 105 00:08:20,106 --> 00:08:25,850 animals in the population, this type of analysis is time consuming. 106 00:08:25,920 --> 00:08:30,640 For the calculation of relationship matrices, we can use, for example, 107 00:08:30,706 --> 00:08:33,890 PLINK, if you would like to calculate the IBD matrix, 108 00:08:33,960 --> 00:08:38,720 or we can also use different R packages, for example, StAMPP, if you would like 109 00:08:38,786 --> 00:08:42,090 to calculate Nei's genetic distance matrix. 110 00:08:42,160 --> 00:08:46,040 On the left side, you can see an example of visualization 111 00:08:46,106 --> 00:08:51,370 of a genetic distance matrix, which is valid for the 112 00:08:51,440 --> 00:08:53,490 five breeds of dogs. 113 00:08:53,560 --> 00:08:58,010 Based on the obtained result, we can say that 114 00:08:58,080 --> 00:09:01,280 animals which belong to the same breed 115 00:09:01,346 --> 00:09:08,930 are connected together and create one genetic cluster. 116 00:09:09,000 --> 00:09:13,600 Another type of method which can be used for the evaluation of population 117 00:09:13,666 --> 00:09:17,490 structure is principal component analysis. 118 00:09:17,560 --> 00:09:23,090 PCA is a multivariate statistical method that decomposes a covariance matrix 119 00:09:23,160 --> 00:09:28,800 of genetic data and extracts the principal components that reflect the variability 120 00:09:28,866 --> 00:09:31,290 of the data in the dataset. 
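The idea of extracting a principal component from genotype data can be sketched without any libraries. Note the swap: instead of a full eigendecomposition of the covariance matrix, this sketch uses power iteration to find only the first component, which is enough to show how individuals get their scores. The genotype coding (0, 1, 2 per SNP), the toy data and the function name are assumptions for the example.

```python
import math


def first_principal_component(rows, iterations=200):
    """First principal component of an individuals x SNPs genotype matrix.

    Uses power iteration on the (unscaled) covariance of the centered data.
    Returns the SNP loadings and the score of each individual.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    x = [[r[j] - means[j] for j in range(m)] for r in rows]  # center each SNP

    # Deterministic, non-uniform start vector (it must not be orthogonal
    # to the leading component for the iteration to converge).
    v = [float(j + 1) for j in range(m)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]

    for _ in range(iterations):
        scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0:
            break  # no variation in the data
        v = [c / norm for c in w]

    scores = [sum(xi[j] * v[j] for j in range(m)) for xi in x]
    return v, scores


# Toy data: two animals from one group, two from another.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [2, 2, 0, 0],
    [2, 2, 0, 1],
]
loadings, scores = first_principal_component(genotypes)
print(scores)
```

In the toy data the first two animals receive scores of one sign and the last two of the opposite sign, which is the kind of separation a PCA plot displays.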
121 00:09:31,360 --> 00:09:36,640 For the visualization of the result, usually the first two principal 122 00:09:36,706 --> 00:09:42,170 components are used because these two first principal components 123 00:09:42,240 --> 00:09:46,810 explain the highest proportion of variability in the dataset. 124 00:09:46,880 --> 00:09:51,170 PCA provides basic information about the genetic structure, 125 00:09:51,240 --> 00:09:56,530 which is useful when testing databases with a large number of individuals. 126 00:09:56,600 --> 00:10:00,720 PCA is a time-saving method for assessing the state 127 00:10:00,786 --> 00:10:04,080 of genetic differentiation. 128 00:10:04,720 --> 00:10:07,640 Visualization of PCA components is 129 00:10:07,706 --> 00:10:12,770 really simple and easy to interpret. 130 00:10:12,840 --> 00:10:15,000 But what is the disadvantage 131 00:10:15,066 --> 00:10:16,840 of PCA? 132 00:10:16,906 --> 00:10:18,400 It's mainly low sensitivity if 133 00:10:18,466 --> 00:10:22,730 we would like to estimate the degree of genetic admixture 134 00:10:22,800 --> 00:10:26,570 within and between populations. 135 00:10:26,640 --> 00:10:32,920 For the calculation of principal component analysis, many tools can also be used, 136 00:10:32,986 --> 00:10:37,840 for example, PLINK or the R package adegenet. 137 00:10:37,906 --> 00:10:43,240 On the left side, you can see an example of visualization 138 00:10:43,306 --> 00:10:48,970 of principal component analysis in the case of 16 sheep breeds. 139 00:10:49,040 --> 00:10:55,410 In the figure, you can see that by using this method, we found three genetic 140 00:10:55,480 --> 00:10:58,250 groups, and deeper 141 00:10:58,320 --> 00:11:04,570 evaluation of the groups showed us that 142 00:11:04,640 --> 00:11:10,120 the obtained differentiation is connected 143 00:11:10,186 --> 00:11:13,010 mainly to the origin of each breed. 
144 00:11:13,080 --> 00:11:16,920 Discriminant analysis of principal components is 145 00:11:16,986 --> 00:11:23,800 a method of discriminant analysis, which is usually used for the evaluation 146 00:11:23,866 --> 00:11:28,210 of genetic structure between predefined groups or clusters. 147 00:11:28,280 --> 00:11:33,760 It uses PCA to reduce the dimension of the data and then discriminant analysis 148 00:11:33,826 --> 00:11:37,530 to maximize the resolution between populations. 149 00:11:37,600 --> 00:11:42,160 Discriminant analysis of principal components provides a more accurate 150 00:11:42,226 --> 00:11:46,610 representation of the genetic structure between predefined clusters, 151 00:11:46,680 --> 00:11:49,970 compared to, for example, classical PCA. 152 00:11:50,040 --> 00:11:55,800 But sometimes this analysis is sensitive to a low level of diversity 153 00:11:55,866 --> 00:11:57,850 in the population. 154 00:11:57,920 --> 00:12:02,010 If we use discriminant analysis of principal components, we can 155 00:12:02,080 --> 00:12:06,080 expect relatively high accuracy in detecting differences between 156 00:12:06,146 --> 00:12:09,840 populations, and also results which are 157 00:12:09,906 --> 00:12:14,520 relatively simple and easy to interpret. 158 00:12:15,440 --> 00:12:21,010 On this slide, on the left side, you can see representative results 159 00:12:21,080 --> 00:12:25,050 from the discriminant analysis of principal components. 160 00:12:25,120 --> 00:12:28,570 In this case, genomic data were used 161 00:12:28,640 --> 00:12:32,800 for red deer populations, seven farmed 162 00:12:32,866 --> 00:12:36,210 and two wild red deer populations. 163 00:12:36,280 --> 00:12:41,890 By applying this method, we found three clusters. 164 00:12:41,960 --> 00:12:48,410 The first two clusters were composed of wild populations, Slovak and Spanish, 165 00:12:48,480 --> 00:12:55,170 and the third cluster was composed of the populations of farmed animals. 
166 00:12:55,240 --> 00:13:00,000 For the calculation of discriminant analysis of principal components, we can 167 00:13:00,066 --> 00:13:03,530 use, for example, R package Adegenet. 168 00:13:03,600 --> 00:13:09,960 If you would like to estimate the proportion of genetic admixture within 169 00:13:10,026 --> 00:13:14,130 the gene pool of population, we can use Bayesian approach. 170 00:13:14,200 --> 00:13:18,810 Bayesian approach allows the identification of genetic groups 171 00:13:18,880 --> 00:13:23,290 and the degree of admixture within individuals without the need 172 00:13:23,360 --> 00:13:26,960 to predefine groups or clusters. 173 00:13:27,160 --> 00:13:33,250 Bayesian approach provides relatively accurate identification of genetic 174 00:13:33,320 --> 00:13:37,680 clusters, and this method is flexible if we are speaking about 175 00:13:37,746 --> 00:13:39,730 the complex structures. 176 00:13:39,800 --> 00:13:45,250 But mainly if we have information for high number of animals, this method is 177 00:13:45,320 --> 00:13:48,840 time consuming compared to others. 178 00:13:49,160 --> 00:13:54,600 For analysis or for testing of degree of genetic admixture, based on the 179 00:13:54,666 --> 00:13:57,570 Bayesian approach, we can use many tools. 180 00:13:57,640 --> 00:14:02,930 We can use, for example, program Structure, Admixture, or Faststructure. 181 00:14:03,000 --> 00:14:06,450 On the left side, you can see representative results 182 00:14:06,520 --> 00:14:12,320 from the estimation of genetic admixture between seven farmed and two 183 00:14:12,386 --> 00:14:16,090 wild populations of red deer. 184 00:14:16,160 --> 00:14:23,210 Similarly to discriminant analysis of principal components, we found that two 185 00:14:23,280 --> 00:14:29,610 wild populations from Slovakia and Spain were totally differentiated 186 00:14:29,680 --> 00:14:31,890 from farmed populations. 
187 00:14:31,960 --> 00:14:38,530 As you can see in the figure, farmed populations were relatively admixed. 188 00:14:38,600 --> 00:14:43,690 That means we found a relatively high degree of admixture between 189 00:14:43,760 --> 00:14:50,610 farmed populations of red deer, mainly due to the migration of animals 190 00:14:50,680 --> 00:14:55,520 and also artificial insemination. 191 00:14:55,640 --> 00:15:00,120 We can use the Bayesian approach also for estimation of gene 192 00:15:00,186 --> 00:15:02,490 flow between populations. 193 00:15:02,560 --> 00:15:05,770 We can use, for example, the program TreeMix. 194 00:15:05,840 --> 00:15:11,760 The program TreeMix is based on the allele frequencies, and it creates phylogenetic 195 00:15:11,826 --> 00:15:15,400 trees with the possibility of testing the intensity of migration 196 00:15:15,466 --> 00:15:17,610 between populations. 197 00:15:17,680 --> 00:15:23,000 This method is based on maximum likelihood and allows estimation 198 00:15:23,066 --> 00:15:27,850 of phylogenetic relationships and migration between populations. 199 00:15:27,920 --> 00:15:32,880 The program TreeMix allows the detection of the intensity of migration and gene 200 00:15:32,946 --> 00:15:36,160 flow in the past, but sometimes the reliability 201 00:15:36,226 --> 00:15:42,410 of the results depends on the amount of available genomic data as well as 202 00:15:42,480 --> 00:15:48,090 on the reliability of the allele frequency estimation. 203 00:15:48,160 --> 00:15:53,760 On this slide, on the right side, you can see results from the analysis 204 00:15:53,826 --> 00:15:58,240 of gene flow intensity between red deer populations based 205 00:15:58,306 --> 00:16:00,010 on the Bayesian approach. 206 00:16:00,080 --> 00:16:04,010 In this case, we used the program BayesAss, 207 00:16:04,080 --> 00:16:07,880 which allows us to determine the intensity 208 00:16:07,946 --> 00:16:12,130 of gene flow between and also within populations. 
209 00:16:12,200 --> 00:16:18,880 This program, compared to TreeMix, provides us information about the recent 210 00:16:18,946 --> 00:16:23,200 migration rate, not the migration rate in the past. 211 00:16:24,160 --> 00:16:30,890 Population structure can also be evaluated by constructing genetic networks, 212 00:16:30,960 --> 00:16:34,930 for example, by using the package NetView. 213 00:16:35,000 --> 00:16:40,160 This package is a visualization tool that uses genetic networks to show 214 00:16:40,226 --> 00:16:44,090 relationships between individuals or populations. 215 00:16:44,160 --> 00:16:48,650 It creates genetic networks that show genetic relationships 216 00:16:48,720 --> 00:16:52,050 and gene flow between populations. 217 00:16:52,120 --> 00:16:57,520 This package, NetView, is really suitable for assessing complex 218 00:16:57,586 --> 00:17:01,410 relationships as well as the impact of migration. 219 00:17:01,480 --> 00:17:07,560 Its visualization is intuitive and suitable for displaying 220 00:17:07,626 --> 00:17:10,170 admixture and differentiation. 221 00:17:10,240 --> 00:17:16,130 But if we have information about a large number of individuals, 222 00:17:16,200 --> 00:17:20,490 its utilization is relatively limited. 223 00:17:20,560 --> 00:17:25,400 On this slide, you can see a graphical visualization of the results of testing 224 00:17:25,466 --> 00:17:30,280 three different scenarios of development of intra-population 225 00:17:30,346 --> 00:17:37,450 and inter-population genetic relationships within 16 cattle breeds using NetView. 226 00:17:37,520 --> 00:17:42,680 Compared to the results from, for example, PCA or discriminant analysis 227 00:17:42,746 --> 00:17:48,890 of principal components, we found that animals are clustered 228 00:17:48,960 --> 00:17:53,890 together if they have a common historical background. 229 00:17:53,960 --> 00:17:57,680 That means if there is a really high 230 00:17:57,746 --> 00:18:03,050 intensity of gene flow between them. 
231 00:18:03,120 --> 00:18:08,610 Another type of graphical visualization 232 00:18:08,680 --> 00:18:12,530 of genetic relationships between animals 233 00:18:12,600 --> 00:18:18,330 or between populations is the construction of phylogenetic trees. 234 00:18:18,400 --> 00:18:23,250 Phylogenetic trees are graphical representations 235 00:18:23,320 --> 00:18:28,690 of evolutionary relationships between populations or species 236 00:18:28,760 --> 00:18:31,250 derived from the genetic data. 237 00:18:31,320 --> 00:18:37,570 They are usually used to visualize genealogical or genetic relationships, 238 00:18:37,640 --> 00:18:41,960 model evolutionary processes, and also track population 239 00:18:42,026 --> 00:18:44,730 differentiation and migration. 240 00:18:44,800 --> 00:18:49,570 They can be created using a variety of algorithms and models, 241 00:18:49,640 --> 00:18:55,160 but the most commonly used models are based on genetic distances, for example, 242 00:18:55,226 --> 00:19:00,200 Nei's genetic distance, or probabilistic models like maximum 243 00:19:00,266 --> 00:19:04,120 likelihood and Bayesian methods. 244 00:19:04,560 --> 00:19:09,250 For the preparation of a phylogenetic tree, we can use different tools, 245 00:19:09,320 --> 00:19:13,250 for example, SplitsTree or various R packages. 246 00:19:13,320 --> 00:19:20,770 On this slide, you can see an example of a phylogenetic tree, and this tree was 247 00:19:20,840 --> 00:19:27,610 derived from Nei's genetic distance matrix calculated for eight horse breeds. 248 00:19:27,680 --> 00:19:34,160 This study was mainly oriented to the analysis of the genetic relationship of the Slovak 249 00:19:34,226 --> 00:19:39,720 warmblood horse to other historically connected horse 250 00:19:39,786 --> 00:19:44,210 breeds which can be found in Europe. 
251 00:19:44,280 --> 00:19:48,400 Now, we are going to the next part of this presentation, 252 00:19:48,466 --> 00:19:52,650 which is related to the approaches and tools that can be used 253 00:19:52,720 --> 00:19:58,450 for the evaluation of the impact of selection on the livestock genome. 254 00:19:58,520 --> 00:20:02,600 Genomic regions under strong selection pressure are usually 255 00:20:02,666 --> 00:20:05,330 called selection signals. 256 00:20:05,400 --> 00:20:10,130 Analysis of selection signals distribution in the genome allows 257 00:20:10,200 --> 00:20:15,560 for a better understanding of evolutionary processes and also the impact 258 00:20:15,626 --> 00:20:19,960 of domestication, and then also the impact of natural 259 00:20:20,026 --> 00:20:25,610 and intensive artificial selection on specific genomic regions which control 260 00:20:25,680 --> 00:20:31,600 preferred phenotypic traits in terms of adaptability, resilience, 261 00:20:31,666 --> 00:20:38,410 or performance of individuals, populations, and also livestock species. 262 00:20:38,480 --> 00:20:41,170 Analysis of selection signals or selection 263 00:20:41,240 --> 00:20:46,440 signatures also allows us to identify 264 00:20:46,506 --> 00:20:51,200 genomic regions showing a decrease or increase in genetic variability 265 00:20:51,266 --> 00:20:53,250 or genetic diversity. 266 00:20:53,320 --> 00:20:58,760 In this type of analysis, we don't need to have information 267 00:20:58,826 --> 00:21:02,050 about the phenotype of animals. 268 00:21:02,120 --> 00:21:06,930 Approaches and methods for evaluation of the selection signals distribution 269 00:21:07,000 --> 00:21:11,690 in the livestock genome can be divided into two groups. 270 00:21:11,760 --> 00:21:18,360 The first group of methods is based on the evaluation of inter-population 271 00:21:18,426 --> 00:21:21,210 or inter-breed differences. 
272 00:21:21,280 --> 00:21:26,720 The second one is a group of methods for evaluation of variability 273 00:21:26,786 --> 00:21:29,250 at the intra-population level. 274 00:21:29,320 --> 00:21:34,520 In the case of the first group, we can speak about the calculation 275 00:21:34,586 --> 00:21:38,810 of Wright's FST index at the genome-wide level, 276 00:21:38,880 --> 00:21:43,570 quantification of differences in linkage disequilibrium, 277 00:21:43,640 --> 00:21:47,720 which is a method based on the analysis of 278 00:21:47,786 --> 00:21:52,770 haplotype structure, and also PCA analysis. 279 00:21:52,840 --> 00:21:58,760 In the case of the second group of methods, we can speak about the distribution 280 00:21:58,826 --> 00:22:05,130 of runs of homozygosity or heterozygosity-rich regions in the genome, 281 00:22:05,200 --> 00:22:10,570 and also the level of linkage disequilibrium, 282 00:22:10,640 --> 00:22:16,080 RDA analysis, or Tajima's D statistics. 283 00:22:16,280 --> 00:22:20,520 Similarly, as in the case of analysis of population structure, 284 00:22:20,586 --> 00:22:25,440 also in this case, Wright's FST is one of the most commonly used 285 00:22:25,506 --> 00:22:32,450 approaches for analysis of selection signals distribution in the genome. 286 00:22:32,520 --> 00:22:37,400 In this case, selection signals are identified based on the differences 287 00:22:37,466 --> 00:22:40,730 in allelic frequencies between populations, 288 00:22:40,800 --> 00:22:45,080 which arose as a result of, for example, different breeding goals 289 00:22:45,146 --> 00:22:47,170 or breed standards. 
290 00:22:47,240 --> 00:22:54,490 We can obtain two basic types of signals if we use this approach: 291 00:22:54,560 --> 00:23:00,080 regions subject to different types of selection are represented 292 00:23:00,146 --> 00:23:05,210 by several loci or SNP markers with a high value of the FST index, 293 00:23:05,280 --> 00:23:09,810 and on the other hand, regions with a low value 294 00:23:09,880 --> 00:23:13,760 represent genomic regions that were subject to the same type 295 00:23:13,826 --> 00:23:16,800 of selection in the given breeds. 296 00:23:17,400 --> 00:23:22,320 The threshold value defining the signal is usually set as the top 1% 297 00:23:22,386 --> 00:23:25,250 of FST values. 298 00:23:25,320 --> 00:23:31,330 This method is relatively simple to calculate and is widely 299 00:23:31,400 --> 00:23:34,730 used in population genetics. 300 00:23:34,800 --> 00:23:39,570 But this method cannot be used if you would like to analyze selection 301 00:23:39,640 --> 00:23:43,810 signals at the intra-population level. 302 00:23:43,880 --> 00:23:49,610 For the calculation of Wright's FST index on the genome-wide level, 303 00:23:49,680 --> 00:23:54,770 we can use, for example, PLINK and for the visualization the program R. 304 00:23:54,840 --> 00:23:58,520 On the left side, you can see an example of the visualization 305 00:23:58,586 --> 00:24:04,330 of Wright's FST distribution in the autosomal genome. 306 00:24:04,400 --> 00:24:10,450 This study was based on the genomic data for beef cattle breeds. 307 00:24:10,520 --> 00:24:16,960 What is typical for this type of study is also a description of selection signals. 308 00:24:17,026 --> 00:24:22,570 That means we usually analyze the start and end position of the selection signals, 309 00:24:22,640 --> 00:24:29,480 protein coding genes which are located directly in or very close to the selection 310 00:24:29,546 --> 00:24:36,890 signals, and also QTLs, which are located in the region of the selection signal. 
311 00:24:36,960 --> 00:24:42,330 In the table on the right side, you can see that we found 312 00:24:42,400 --> 00:24:47,400 many QTLs, which were previously associated with important 313 00:24:47,466 --> 00:24:51,010 phenotypic traits in cattle. 314 00:24:51,080 --> 00:24:56,560 Another approach for estimation of selection signals distribution 315 00:24:56,626 --> 00:25:02,210 in the livestock genome is an approach which is based on the variability in 316 00:25:02,280 --> 00:25:09,370 linkage disequilibrium, or we can say, differences in linkage disequilibrium 317 00:25:09,440 --> 00:25:11,330 between breeds. 318 00:25:11,400 --> 00:25:15,720 In this case, I would like to speak about the integrated haplotype score, 319 00:25:15,786 --> 00:25:19,640 which is very frequently used for the analysis of selection 320 00:25:19,706 --> 00:25:23,330 signals distribution in the genome. 321 00:25:23,400 --> 00:25:28,360 In this case, selection signals are derived from a change in the linkage 322 00:25:28,426 --> 00:25:32,210 disequilibrium in the genome of the evaluated breeds 323 00:25:32,280 --> 00:25:39,050 and the emergence of specific haplotypes due to the linkage disequilibrium. 324 00:25:39,120 --> 00:25:45,690 The integrated haplotype score value can be defined simply as a measure of how 325 00:25:45,760 --> 00:25:51,010 unusual a haplotype consisting of a specific SNP marker is 326 00:25:51,080 --> 00:25:53,200 compared to the rest of the genome. 327 00:25:53,266 --> 00:25:58,240 The integrated haplotype score is a particularly 328 00:25:58,306 --> 00:26:02,810 sensitive method for detecting the effect of recent selection that led 329 00:26:02,880 --> 00:26:06,050 to an increase in the frequency of a certain 330 00:26:06,120 --> 00:26:10,960 allelic variant in a population, but has not yet eliminated 331 00:26:11,026 --> 00:26:14,080 other variants at a given locus. 
332 00:26:14,160 --> 00:26:20,000 The analysis begins with the calculation of extended haplotype homozygosity, 333 00:26:20,066 --> 00:26:24,250 which quantifies the decrease in homozygosity of the haplotype 334 00:26:24,320 --> 00:26:29,800 from a certain SNP marker, and then continues with the calculation 335 00:26:29,866 --> 00:26:34,880 of the integrated haplotype score value, which is based on the logarithm 336 00:26:34,946 --> 00:26:39,850 of the ratio of integrated extended haplotype homozygosity values 337 00:26:39,920 --> 00:26:43,250 for two allelic variants. 338 00:26:43,320 --> 00:26:48,960 The integrated haplotype score can reach positive values when the haplotype carrying 339 00:26:49,026 --> 00:26:54,080 one allele is longer and has a higher extended haplotype 340 00:26:54,146 --> 00:26:59,490 homozygosity, indicating a significant effect of positive selection, 341 00:26:59,560 --> 00:27:04,320 or negative values when the alternative allele has a higher 342 00:27:04,386 --> 00:27:08,440 extended haplotype homozygosity, which can also reflect selection 343 00:27:08,506 --> 00:27:11,010 but in the opposite direction. 344 00:27:11,080 --> 00:27:15,210 The threshold value defining the signal is set 345 00:27:15,280 --> 00:27:18,250 similarly to the previous approach, for example, 346 00:27:18,320 --> 00:27:24,850 as 1% of the highest positive values of the integrated haplotype score. 347 00:27:24,920 --> 00:27:30,040 This approach is suitable for detecting the effect of recent selection 348 00:27:30,106 --> 00:27:35,730 and identification of signals which can arise as a result of adaptation, 349 00:27:35,800 --> 00:27:42,760 but it is also sensitive to data quality, and if you would like to obtain 350 00:27:42,826 --> 00:27:48,330 reliable estimates, you need high quality and robust genomic data. 351 00:27:48,400 --> 00:27:52,880 For the calculation of the integrated haplotype score, we can use, for example, 352 00:27:52,946 --> 00:27:57,250 the program Haploview or R packages. 
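The two steps described for the integrated haplotype score (extended haplotype homozygosity first, then the logarithm of the ratio of its integrated values) can be sketched on toy phased haplotypes. Everything here is a simplified assumption: haplotypes are tuples of 0/1 alleles, the area under the EHH curve is taken with the trapezoid rule over all listed SNPs, the score is left unstandardized, and the sign convention (allele 1 over allele 0) is chosen for the example. Real implementations usually truncate the EHH curve at a cut-off and standardize scores within allele frequency classes.

```python
import math
from collections import Counter


def ehh(haplotypes, core, end):
    """Probability that two random haplotypes are identical from core to end."""
    n = len(haplotypes)
    if n < 2:
        return 0.0
    lo, hi = min(core, end), max(core, end)
    counts = Counter(tuple(h[lo:hi + 1]) for h in haplotypes)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def integrated_ehh(haplotypes, core, positions):
    """Area under the EHH curve on both sides of the core SNP (trapezoid rule)."""
    total = 0.0
    for step in (-1, 1):
        i = core
        previous = ehh(haplotypes, core, core)
        while 0 <= i + step < len(positions):
            j = i + step
            current = ehh(haplotypes, core, j)
            total += (previous + current) / 2 * abs(positions[j] - positions[i])
            previous, i = current, j
    return total


def unstandardized_ihs(haplotypes, core, positions):
    """ln(iHH of allele 1 / iHH of allele 0) at the core SNP (assumed convention)."""
    carriers_1 = [h for h in haplotypes if h[core] == 1]
    carriers_0 = [h for h in haplotypes if h[core] == 0]
    return math.log(integrated_ehh(carriers_1, core, positions)
                    / integrated_ehh(carriers_0, core, positions))


positions = [0, 100, 200, 300, 400]   # hypothetical SNP coordinates
haplotypes = [
    (0, 1, 1, 0, 1), (0, 1, 1, 0, 1), (0, 1, 1, 0, 1),  # allele 1: one long shared haplotype
    (1, 0, 0, 1, 0), (0, 1, 0, 0, 1), (1, 1, 0, 1, 1),  # allele 0: diverse backgrounds
]
print(unstandardized_ihs(haplotypes, core=2, positions=positions))
```

In the toy data allele 1 sits on a single long haplotype while allele 0 sits on varied backgrounds, so the score comes out positive under the convention assumed here.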
353 00:27:57,320 --> 00:28:00,200 On the left side, you can see an example 354 00:28:00,266 --> 00:28:04,800 from the analysis of variability in linkage disequilibrium in the genome 355 00:28:04,866 --> 00:28:09,760 of milk and beef cattle breeds, and on the right side, 356 00:28:09,826 --> 00:28:12,650 you can see a description of identified 357 00:28:12,720 --> 00:28:17,880 selection signals and also genes and QTLs, 358 00:28:17,946 --> 00:28:23,760 which were located directly in the region of the signals. 359 00:28:24,640 --> 00:28:30,320 Evaluation of the inter-population or inter-breed differences and the following 360 00:28:30,386 --> 00:28:38,090 analysis of selection signatures can also be performed by using PCA analysis. 361 00:28:38,160 --> 00:28:43,040 In this case, this analysis assumes that the signals in the genome arose as 362 00:28:43,106 --> 00:28:46,360 a result of the local adaptation of individuals to the 363 00:28:46,426 --> 00:28:49,490 environmental conditions. 364 00:28:49,560 --> 00:28:55,800 PCA analysis is in this context an alternative method for identifying 365 00:28:55,866 --> 00:28:59,570 selection signals to Wright's FST index. 366 00:28:59,640 --> 00:29:04,880 Detection of selection signals is based on the assumption of the existence 367 00:29:04,946 --> 00:29:11,080 of a correlation between genetic variants and principal components which reflects 368 00:29:11,146 --> 00:29:15,810 the local adaptation of populations to the production environment. 369 00:29:15,880 --> 00:29:20,690 To identify selection signals, different tests can be used, 370 00:29:20,760 --> 00:29:23,690 for example, the Mahalanobis distance test. 
371 00:29:23,760 --> 00:29:29,370 In this case, the identification of SNP markers showing association with positive 372 00:29:29,440 --> 00:29:34,370 selection is based on the construction of a z-score vector 373 00:29:34,440 --> 00:29:39,530 obtained by regression analysis of the relationship between SNP markers 374 00:29:39,600 --> 00:29:42,770 and the K principal components. 375 00:29:42,840 --> 00:29:48,050 The threshold value which defines the signal of selection can be, 376 00:29:48,120 --> 00:29:54,250 in this case, determined, for example, based on the false discovery rate. 377 00:29:54,320 --> 00:29:59,730 This method is really efficient in terms of visualization. 378 00:29:59,800 --> 00:30:06,130 But because this method is an alternative one, it is not so often used for the 379 00:30:06,200 --> 00:30:11,530 quantification of selection signals in the genome. 380 00:30:11,600 --> 00:30:16,480 For the analysis of the distribution of selection signals in the genome 381 00:30:16,546 --> 00:30:22,850 by using PCA, we can use, for example, the R package pcadapt. 382 00:30:22,920 --> 00:30:28,200 This method also allows you to quantify 383 00:30:28,266 --> 00:30:31,330 genetic differentiation in the data set 384 00:30:31,400 --> 00:30:36,880 and then provides you with information about the distribution of selection signals, as you 385 00:30:36,946 --> 00:30:40,730 can see on the slide in figure 13. 386 00:30:40,800 --> 00:30:46,120 Then the last step of the analysis is usually the description of the selection signals, 387 00:30:46,186 --> 00:30:50,530 that means a description of the start and end position of the signal, 388 00:30:50,600 --> 00:30:55,280 and the number of genes and QTLs which are located 389 00:30:55,346 --> 00:30:58,890 directly in or very close to the signal.
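To make the outlier test described above more concrete, here is a minimal Python sketch under simplifying assumptions. It takes a table of z-scores (one row per SNP, one column per principal component), fixes K = 2 so that the covariance matrix can be inverted by hand, converts the squared Mahalanobis distance into a p-value using the chi-square distribution with 2 degrees of freedom (whose upper tail is exactly exp(-x/2)), and applies the Benjamini-Hochberg procedure to control the false discovery rate. All function names are hypothetical. This is not the implementation used by any particular package, and dedicated software typically uses more robust estimates of the mean and covariance.

```python
import math


def mahalanobis_sq_2d(z):
    """Squared Mahalanobis distance of each SNP's z-score pair from the
    mean, using the ordinary sample covariance (K = 2 assumed)."""
    n = len(z)
    m0 = sum(r[0] for r in z) / n
    m1 = sum(r[1] for r in z) / n
    s00 = sum((r[0] - m0) ** 2 for r in z) / (n - 1)
    s11 = sum((r[1] - m1) ** 2 for r in z) / (n - 1)
    s01 = sum((r[0] - m0) * (r[1] - m1) for r in z) / (n - 1)
    det = s00 * s11 - s01 ** 2
    out = []
    for r in z:
        d0, d1 = r[0] - m0, r[1] - m1
        out.append((s11 * d0 * d0 - 2.0 * s01 * d0 * d1 + s00 * d1 * d1) / det)
    return out


def chi2_sf_2df(x):
    """Upper-tail probability of a chi-square variable with 2 df."""
    return math.exp(-x / 2.0)


def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of the tests rejected at false discovery rate `alpha`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff = rank  # largest rank passing its threshold
    return sorted(order[:cutoff])


def outlier_snps(z, alpha=0.05):
    """SNP indices flagged as candidate selection signals."""
    pvals = [chi2_sf_2df(d) for d in mahalanobis_sq_2d(z)]
    return benjamini_hochberg(pvals, alpha)
```

A call such as `outlier_snps(z_scores, alpha=0.05)` returns the row indices of the SNPs that would be treated as candidate signals at a 5% false discovery rate.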
390 00:30:58,960 --> 00:31:03,520 If we would like to analyze the distribution of selection signals 391 00:31:03,586 --> 00:31:08,960 at the intra-population level, we can use a method which is based 392 00:31:09,026 --> 00:31:13,530 on the identification of runs of homozygosity in the genome. 393 00:31:13,600 --> 00:31:18,440 This approach assumes that regions in the genome showing strong selection 394 00:31:18,506 --> 00:31:24,250 signals are the result of an increase in local homozygosity due to intensive 395 00:31:24,320 --> 00:31:28,240 breeding for traits defined in the breed standard of each breed. 396 00:31:28,306 --> 00:31:31,640 Runs of homozygosity forming 397 00:31:31,706 --> 00:31:36,730 selection signals in the genome consist of alleles derived 398 00:31:36,800 --> 00:31:41,360 from common ancestors, which can be inherited from generation 399 00:31:41,426 --> 00:31:46,370 to generation in unchanged form. 400 00:31:46,440 --> 00:31:52,680 Selection signals are then identified based on the frequency of SNP 401 00:31:52,746 --> 00:31:57,960 markers in runs of homozygosity in a specific region across 402 00:31:58,026 --> 00:32:00,410 individuals in the population. 403 00:32:00,480 --> 00:32:07,450 The threshold value for defining the signal is, similarly to the other approaches, set 404 00:32:07,520 --> 00:32:10,570 as the top 1% of values. 405 00:32:10,640 --> 00:32:16,970 This method allows us to detect regions where there has been a decrease in diversity. 406 00:32:17,040 --> 00:32:23,200 Because of this, this method can also serve as a good indicator 407 00:32:23,266 --> 00:32:25,730 of the effect of positive selection. 408 00:32:25,800 --> 00:32:31,520 But if we would like to obtain reliable estimates or reliable results, 409 00:32:31,586 --> 00:32:36,970 we also need to have high-quality and robust genomic data.
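The counting step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the runs of homozygosity are taken as already detected and given per animal as pairs of first and last SNP index, and the function names and toy data are hypothetical.

```python
def snp_roh_frequency(roh_segments, n_snps):
    """Proportion of animals in which each SNP lies inside a run of
    homozygosity. `roh_segments` holds one list of (first_snp, last_snp)
    index pairs per animal, both ends inclusive."""
    counts = [0] * n_snps
    for segments in roh_segments:
        covered = set()
        for first, last in segments:
            covered.update(range(first, last + 1))
        for snp in covered:  # count each animal at most once per SNP
            counts[snp] += 1
    n_animals = len(roh_segments)
    return [c / n_animals for c in counts]


def top_fraction(values, fraction=0.01):
    """Indices of the highest `fraction` of values (at least one SNP)."""
    k = max(1, int(len(values) * fraction))
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return sorted(order[:k])


# Toy data: four animals, eight SNPs.
toy_segments = [[(2, 4)], [(3, 5)], [(3, 3)], []]
freq = snp_roh_frequency(toy_segments, n_snps=8)
print(freq)
print(top_fraction(freq, fraction=0.01))
```

With real data, the SNPs returned by `top_fraction` would then be merged into regions, and those regions described by their start and end positions.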
410 00:32:37,040 --> 00:32:40,810 On this slide, you can see results from the analysis 411 00:32:40,880 --> 00:32:46,130 of the distribution of runs of homozygosity segments in the genome 412 00:32:46,200 --> 00:32:49,090 of the Slovak Warmblood horse. 413 00:32:49,160 --> 00:32:53,610 Based on the threshold value, we found 414 00:32:53,680 --> 00:32:57,170 selection signals on chromosomes 1, 2, 6, 415 00:32:57,240 --> 00:33:03,050 9, 11, 15, and 16, and we also identified many genes 416 00:33:03,120 --> 00:33:10,120 inside the regions of selection signals which are involved in the formation or 417 00:33:10,186 --> 00:33:15,050 in the genetic control of important phenotypic traits in horses. 418 00:33:15,120 --> 00:33:19,640 On the other hand, we can also analyze the distribution of selection signals 419 00:33:19,706 --> 00:33:24,840 in the genome based on regions showing a high 420 00:33:24,906 --> 00:33:27,850 level of heterozygosity. 421 00:33:27,920 --> 00:33:34,440 This method is usually used to detect regions which may be important, 422 00:33:34,506 --> 00:33:40,200 for example, in terms of adaptability or response to environmental changes 423 00:33:40,266 --> 00:33:42,530 or the occurrence of pathogens. 424 00:33:42,600 --> 00:33:47,720 This method is based on the assumption that heterozygous individuals 425 00:33:47,786 --> 00:33:52,840 usually have higher fitness than homozygous ones. 426 00:33:52,960 --> 00:33:56,800 In this case, a high level of heterozygosity may be 427 00:33:56,866 --> 00:34:02,320 the result of balancing selection, which means the preservation of genetic 428 00:34:02,386 --> 00:34:05,280 diversity within a population.
429 00:34:05,346 --> 00:34:10,930 Similarly to the analysis of 430 00:34:11,000 --> 00:34:12,930 runs of homozygosity, 431 00:34:13,000 --> 00:34:18,450 selection signals are derived from the frequency of SNP markers 432 00:34:18,520 --> 00:34:24,000 in heterozygosity-rich regions in a specific genomic region across 433 00:34:24,066 --> 00:34:26,370 individuals in the population. 434 00:34:26,440 --> 00:34:29,120 The threshold value is usually set based 435 00:34:29,186 --> 00:34:34,250 on the top 1% of values. 436 00:34:34,320 --> 00:34:39,480 This approach allows us to detect regions in which there is an increased 437 00:34:39,546 --> 00:34:42,530 proportion of heterozygous genotypes. 438 00:34:42,600 --> 00:34:49,440 That means it can also serve as an indicator of genomic regions which can 439 00:34:49,506 --> 00:34:54,610 be important in terms of adaptation or evolutionary potential. 440 00:34:54,680 --> 00:35:00,130 But if we would like to have reliable results, we also need to analyze high-quality 441 00:35:00,200 --> 00:35:04,130 and robust genomic data. 442 00:35:04,200 --> 00:35:08,440 On this slide, you can see results from the analysis 443 00:35:08,506 --> 00:35:12,050 of the distribution of heterozygosity-rich 444 00:35:12,120 --> 00:35:15,650 regions in five horse breeds, 445 00:35:15,720 --> 00:35:20,760 and this study focused especially on the distribution 446 00:35:20,826 --> 00:35:26,280 of heterozygosity-rich regions within the genomic coordinates of the major 447 00:35:26,346 --> 00:35:29,480 histocompatibility complex. 448 00:35:29,960 --> 00:35:36,640 Another interesting approach is the identification of selection signals 449 00:35:36,706 --> 00:35:40,130 in the genome based on RDA. 450 00:35:40,200 --> 00:35:44,680 RDA tests the relationship between genetic variability and 451 00:35:44,746 --> 00:35:46,690 environmental factors.
452 00:35:46,760 --> 00:35:53,650 That means it quantifies the influence of natural selection on the genome structure. 453 00:35:53,720 --> 00:35:59,810 This approach is basically a method of evaluating genotype-environment 454 00:35:59,880 --> 00:36:05,240 association that evaluates the percentage of genomic variability explained 455 00:36:05,306 --> 00:36:10,280 by environmental variables and also detects loci under strong 456 00:36:10,346 --> 00:36:12,360 selection pressure. 457 00:36:12,480 --> 00:36:15,120 This method is a two-step analysis 458 00:36:15,186 --> 00:36:19,810 in which genetic and environmental data are evaluated using 459 00:36:19,880 --> 00:36:23,720 multivariate linear regression. 460 00:36:24,480 --> 00:36:30,290 Among the advantages of this method, we can 461 00:36:30,360 --> 00:36:33,170 mention that it is a really 462 00:36:33,240 --> 00:36:36,920 good approach to evaluate the relationships between genetic 463 00:36:36,986 --> 00:36:41,410 variability within a population and environmental factors. 464 00:36:41,480 --> 00:36:47,050 But similarly to the previous approaches, if you would like to have 465 00:36:47,120 --> 00:36:51,530 good results or results with high 466 00:36:51,600 --> 00:36:54,600 reliability, you also need to have information 467 00:36:54,666 --> 00:36:59,410 about a high number of SNP markers and animals. 468 00:36:59,480 --> 00:37:04,170 For RDA, we can use, for example, 469 00:37:04,240 --> 00:37:07,880 the R package vegan or the DeepGenomeScan program. 470 00:37:08,960 --> 00:37:15,690 The last approach which I would like to mention is Tajima's D statistic, 471 00:37:15,760 --> 00:37:21,130 which evaluates population diversity and can be used 472 00:37:21,200 --> 00:37:25,920 as an indicator of balancing selection. 473 00:37:26,120 --> 00:37:30,210 Tajima's D can reach positive or negative values.
474 00:37:30,280 --> 00:37:36,290 Positive values indicate a significant effect of balancing selection, 475 00:37:36,360 --> 00:37:41,880 and negative values, on the other hand, can be associated with the effect 476 00:37:41,946 --> 00:37:47,890 of positive selection on the genome of the analyzed population or breed. 477 00:37:47,960 --> 00:37:53,890 The threshold value is defined similarly to the other approaches, for example, 478 00:37:53,960 --> 00:37:59,730 as the top 1% of positive values. 479 00:37:59,800 --> 00:38:04,400 This method allows us to detect regions in which there is an increased 480 00:38:04,466 --> 00:38:06,610 proportion of heterozygous genotypes. 481 00:38:06,680 --> 00:38:12,640 That means it is a relatively good indicator of regions important, 482 00:38:12,706 --> 00:38:17,330 for example, in terms of adaptation. 483 00:38:17,400 --> 00:38:21,930 But also, if we would like to have results 484 00:38:22,000 --> 00:38:25,200 of good quality, 485 00:38:25,266 --> 00:38:29,650 we also need to have information about a high number of markers 486 00:38:29,720 --> 00:38:33,720 and a high number of animals. 487 00:38:34,320 --> 00:38:39,400 Here you can see the results from the analysis of the distribution of selection signals 488 00:38:39,466 --> 00:38:43,480 derived from Tajima's D statistic 489 00:38:43,546 --> 00:38:50,370 across the genome of five horse breeds coming from the Czech Republic and Slovakia. 490 00:38:50,440 --> 00:38:54,760 As you can see, we found that selection 491 00:38:54,826 --> 00:38:58,080 signals were distributed non-uniformly 492 00:38:58,146 --> 00:39:04,490 across the genome of the tested horse breeds, but we also found that in some 493 00:39:04,560 --> 00:39:10,840 genomic regions, selection signals overlapped across breeds. 494 00:39:12,320 --> 00:39:16,640 The next step after the identification of selection signals 495 00:39:16,706 --> 00:39:22,530 in the genome is usually the description of the regions of selection signals.
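For reference, the Tajima's D statistic discussed above can be computed for a single genomic window with the standard formula, which compares the average number of pairwise differences with the number of segregating sites. The Python sketch below is a minimal illustration. The function name, the 0/1 allele coding, and the toy windows are assumptions, and it expects at least four haplotypes because the variance term is zero for smaller samples.

```python
import math


def tajimas_d(haps):
    """Tajima's D for one window. `haps` is a list of haplotypes, each a
    list of 0/1 alleles of equal length. Returns None when the statistic
    is undefined (fewer than 4 haplotypes or no segregating site)."""
    n = len(haps)
    allele_counts = [sum(column) for column in zip(*haps)]
    segregating = [k for k in allele_counts if 0 < k < n]
    s = len(segregating)
    if n < 4 or s == 0:
        return None

    # Average number of pairwise differences between haplotypes.
    pi = sum(k * (n - k) for k in segregating) / (n * (n - 1) / 2)

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))


# Toy windows: only rare variants versus only intermediate frequencies.
rare = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
intermediate = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
print(tajimas_d(rare), tajimas_d(intermediate))
```

The window with only rare variants gives a negative value and the window with intermediate-frequency variants gives a positive value, matching the interpretation of negative and positive Tajima's D given in the presentation.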
496 00:39:22,600 --> 00:39:29,130 This description is usually based on searching for quantitative trait 497 00:39:29,200 --> 00:39:36,040 loci or protein-coding genes located directly in or very close to the 498 00:39:36,106 --> 00:39:39,160 region of selection signals. 499 00:39:39,360 --> 00:39:44,330 Then it is also important to analyze 500 00:39:44,400 --> 00:39:49,120 the biological function of the QTLs or genes. 501 00:39:49,186 --> 00:39:51,770 For this purpose, we can use 502 00:39:51,840 --> 00:39:56,800 several databases or tools, for example, 503 00:39:56,866 --> 00:40:02,040 GO, which is Gene Ontology, or KEGG, which is the Kyoto Encyclopedia 504 00:40:02,106 --> 00:40:05,560 of Genes and Genomes. 505 00:40:06,120 --> 00:40:13,810 Here you can see really good databases for the identification of QTLs or genes. 506 00:40:13,880 --> 00:40:18,760 For the identification of QTLs, you can use the Animal QTL Database, 507 00:40:18,826 --> 00:40:24,610 in which you can find information about different livestock species. 508 00:40:24,680 --> 00:40:30,200 A really good and simple web-based tool for obtaining information 509 00:40:30,266 --> 00:40:34,640 about the genes in a certain region is 510 00:40:34,960 --> 00:40:37,520 BioMart, provided 511 00:40:37,586 --> 00:40:40,370 by the Ensembl database. 512 00:40:40,440 --> 00:40:43,560 If you would like to analyze 513 00:40:44,200 --> 00:40:48,210 the biological function of genes or the biological 514 00:40:48,280 --> 00:40:51,640 pathways in which the genes are involved, 515 00:40:51,706 --> 00:40:56,080 you can use, for example, the web-based tool DAVID. 516 00:40:58,760 --> 00:41:03,360 What are the advantages of functional annotation of regions significantly 517 00:41:03,426 --> 00:41:05,850 affected by selection pressure?
518 00:41:05,920 --> 00:41:12,160 The main advantage is the fact that the detailed analysis of regions 519 00:41:12,226 --> 00:41:16,290 in the genome significantly affected by selection pressure 520 00:41:16,360 --> 00:41:21,680 allows the identification of specific genes and biological pathways 521 00:41:21,746 --> 00:41:24,530 responsible for phenotypic traits. 522 00:41:24,600 --> 00:41:29,360 The results of research on identified genes or 523 00:41:29,426 --> 00:41:33,240 QTLs in regions under strong selection 524 00:41:33,306 --> 00:41:39,890 pressure can potentially be used in breeding programs in the future. 525 00:41:39,960 --> 00:41:44,640 But functional annotation also has 526 00:41:44,706 --> 00:41:46,090 disadvantages. 527 00:41:46,160 --> 00:41:50,760 The most important problem is the fact 528 00:41:50,826 --> 00:41:53,560 that the overlap between selection signals 529 00:41:53,626 --> 00:42:00,000 and functional regions does not always imply a causal relationship, 530 00:42:00,066 --> 00:42:06,530 and also the fact that the information in the available databases is 531 00:42:06,600 --> 00:42:12,640 limited to current knowledge and may not always cover all 532 00:42:12,706 --> 00:42:16,840 relevant genes or QTLs. 533 00:42:17,800 --> 00:42:23,400 On this slide, you can find the list of the papers which were 534 00:42:23,466 --> 00:42:27,610 used for the preparation of this presentation, 535 00:42:27,680 --> 00:42:31,520 and the full texts of the papers are also 536 00:42:31,586 --> 00:42:36,040 available in the folder Study Materials. 537 00:42:36,920 --> 00:42:41,410 With this slide, I would like to thank you for your attention. 538 00:42:41,480 --> 00:42:48,360 If you have questions or if you would like to continue with this topic 539 00:42:48,426 --> 00:42:53,680 in the future and need help, please contact me at my email address, 540 00:42:53,746 --> 00:42:56,130 which you can see on the slide.
541 00:42:56,200 --> 00:42:59,930 There is also a QR code on the slide. 542 00:43:00,000 --> 00:43:03,120 By scanning this QR code, 543 00:43:03,186 --> 00:43:07,280 you can obtain access to other modules 544 00:43:07,346 --> 00:43:10,960 which were prepared within the project ISAGREED.