Journal of Genetics and Genomics (Formerly Acta Genetica Sinica) April 2007, 34(4): 373-380
An Entropy-based Index for Fine-scale Mapping of QTL
Yang Xiang1,3, Yumei Li 2,3, ①, Zaiming Liu1, Zhenqiu Sun2
1. College of Mathematics, Central South University, Changsha 410081, China ; 2. School of Public Health, Central South University, Changsha 410078, China; 3. Mathematics Department of Huaihua College, Huaihua 418000, China
Abstract: By comparing the entropy and the conditional entropy of a marker, an entropy-based index for fine-scale linkage-disequilibrium (LD) gene mapping is presented that uses high-density marker maps in extreme samples for a quantitative trait. The entropy-based index is a function of the LD between the marker and the trait locus and does not depend on marker allele frequencies across loci. It parallels the Hardy-Weinberg disequilibrium (HWD) measure for QTL fine mapping, but its power for fine mapping QTL is higher than that of the HWD measure. Through simulations, the fine-mapping performance of this entropy-based index is investigated extensively under various genetic parameters. The results show that the index presented here is both robust and powerful.
Keywords: entropy; entropy-based index; QTL; fine mapping
When designing a fine-scale mapping study, it is assumed that a region linked to a putative disease-susceptibility locus (DSL) or quantitative trait locus (QTL) has already been established. Here, fine mapping refers to narrowing what may be a 10 cM region indicated by linkage analysis to a ≤1 cM region containing a DSL or a QTL. The simplest method of fine mapping is to calculate a linkage disequilibrium (LD) measure between the trait locus and a marker locus. Examples include the recently proposed Hardy-Weinberg disequilibrium (HWD) measures for dichotomous traits that use affected individuals [1], and HWD measures for quantitative traits that use extreme samples of populations [2,3]. HWD indices compare the observed and the expected homozygosities at a marker. Although HWD measures are robust because they are independent of marker allele frequencies, their power is not high.
Shannon entropy [4,5], originally defined in information theory, measures the uncertainty in a variable. Conditional entropy measures the average uncertainty in a variable given knowledge of a second variable. When applied to characterize DNA variation, entropy measures genetic diversity and extracts the maximal amount of information for a set of SNP markers [5,6]. The difference between the entropy of a marker and the conditional entropy of that marker is therefore a measure of the association of the marker with the disease.
In this article, an entropy-based index for fine-scale mapping of QTL is developed using high-density marker maps in extreme samples. The entropy of a marker is defined first, followed by the conditional entropy of a marker in extreme samples. The entropy-based index is compared with the HWD measure proposed previously [3]. Through computer simulations, the fine-mapping performance of the entropy-based index is investigated.
Received: 2006-06-16; Accepted: 2006-08-04
This work was supported by the Scientific Research Fund of Huaihua University and the National Natural Foundation of China (No. 10371133).
① Corresponding author. E-mail: [email protected] www.jgenetgenomics.org
1 Methods
Consider a QTL with two alleles, A (with frequency $p$) and a (with frequency $q = 1 - p$). Let the phenotypic value be $s = \mu + G + e$, where $\mu$ is the mean baseline value, $G$ is the genotypic value at the QTL, and $e$ is the residual due to polygenic effects of the remaining QTLs and random environmental effects. Let $v$, $d$, and $-v$ be the genotypic values ($G$) for individuals with genotypes AA, Aa, and aa, respectively. Without loss of generality, we assume $\mu = 0$ and $e \sim N(0, \sigma_e^2)$. Let $\phi_U$ be the proportion of the population with phenotypic values $s > U$ ($U$ is an upper threshold value chosen from the continuous distribution of the quantitative trait under study), and let $\phi_T$ be the proportion of the population with phenotypic values $s < T$ ($T$ is a lower threshold value). Let
$$\gamma_{11} = \mathrm{pr}(s > U \mid AA), \quad \gamma_{12} = \mathrm{pr}(s > U \mid Aa), \quad \gamma_{22} = \mathrm{pr}(s > U \mid aa),$$
$$\varphi_{11} = \mathrm{pr}(s < T \mid AA), \quad \varphi_{12} = \mathrm{pr}(s < T \mid Aa), \quad \varphi_{22} = \mathrm{pr}(s < T \mid aa).$$
Let $p^T_{\cdot}$ and $p^U_{\cdot}$ denote the frequencies in individuals with $s < T$ and $s > U$, respectively.
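The tail probabilities defined above follow directly from the normal residual model. The following is a minimal sketch, not the authors' code: the function names, the example parameter values, and the use of Hardy-Weinberg genotype frequencies in the population (consistent with the random-mating derivation in the Appendix) are assumptions made for illustration.

```python
import math

def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of N(mean, sd^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def tail_probabilities(p, v, d, sigma_e, T, U):
    """Genotype-specific and population-wide tail probabilities.

    Genotypic values are v (AA), d (Aa), -v (aa); the baseline mean is 0.
    Assumption: genotype frequencies follow Hardy-Weinberg proportions.
    """
    q = 1.0 - p
    geno_value = {"AA": v, "Aa": d, "aa": -v}
    geno_freq = {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}
    # varphi[g] = pr(s < T | g), gamma[g] = pr(s > U | g)
    varphi = {g: normal_cdf(T, geno_value[g], sigma_e) for g in geno_value}
    gamma = {g: 1.0 - normal_cdf(U, geno_value[g], sigma_e) for g in geno_value}
    phi_T = sum(geno_freq[g] * varphi[g] for g in geno_value)
    phi_U = sum(geno_freq[g] * gamma[g] for g in geno_value)
    return varphi, gamma, phi_T, phi_U

# Hypothetical additive example (d = 0) with symmetric thresholds.
varphi, gamma, phi_T, phi_U = tail_probabilities(
    p=0.2, v=1.0, d=0.0, sigma_e=1.0, T=-1.0, U=1.0)
```

In a study that samples, say, the bottom and top 10% of the distribution, $T$ and $U$ would instead be chosen so that $\phi_T$ and $\phi_U$ equal 0.1; that root-finding step is omitted here.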
1.1 Entropy and conditional entropy of a marker
The (Shannon) entropy [5] of a variable $X$ is defined as
$$H(X) = E[-\log \mathrm{pr}(x)] = -\sum_x \mathrm{pr}(x) \cdot \log \mathrm{pr}(x) \tag{1}$$
where $\mathrm{pr}(x)$ is the probability that $X$ is in the state $x$, and $\mathrm{pr}(x) \cdot \log \mathrm{pr}(x)$ is defined as 0 if $\mathrm{pr}(x) = 0$. The conditional entropy of $X$ given $Y = y$ is the entropy of the probability distribution $\mathrm{pr}(x \mid Y = y)$,
$$H(X \mid Y = y) = E[-\log \mathrm{pr}(x \mid Y = y)] = -\sum_x \mathrm{pr}(x \mid Y = y) \cdot \log \mathrm{pr}(x \mid Y = y) \tag{2}$$
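Equations (1) and (2) translate into a few lines of code. The sketch below is illustrative only: the function names and the joint distribution are hypothetical, and the natural logarithm is assumed because the Appendix relies on the expansion of $\log(1+x)$.

```python
import math

def entropy(probs):
    """Shannon entropy -sum p*log(p), with 0*log(0) taken as 0 (Eq. 1)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def conditional_entropy_given(joint, y):
    """Entropy of pr(x | Y = y), computed from a joint table joint[x][y] (Eq. 2)."""
    p_y = sum(row[y] for row in joint.values())
    return entropy([row[y] / p_y for row in joint.values()])

# Hypothetical joint distribution of marker allele X and phenotype group Y.
joint = {"M": {"low": 0.35, "high": 0.10},
         "m": {"low": 0.15, "high": 0.40}}

h_marker = entropy([sum(row.values()) for row in joint.values()])  # entropy of X
h_low = conditional_entropy_given(joint, "low")                    # H(X | Y = low)
```

Here the marker is more skewed within the "low" group (allele frequencies 0.7 and 0.3) than overall (0.45 and 0.55), so `h_low` is smaller than `h_marker`; this kind of difference is what the index developed in this article exploits.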
The concept of entropy can be used to study DNA variation at a marker locus and patterns of LD [6]. A variable $X$ describing the state of a marker can be defined by
$$\mathrm{pr}(X = M) = p_M, \quad \mathrm{pr}(X = m) = p_m, \quad p_M + p_m = 1.$$
Therefore, the entropy of a marker, $H$, is
$$H = E[-\log \mathrm{pr}(x)] = -p_M \cdot \log p_M - p_m \cdot \log p_m \tag{3}$$
Let $H_T$ and $H_U$ denote the entropy of a marker in individuals with $s < T$ and $s > U$, respectively. Then
$$H_T = -p_M^T \cdot \log p_M^T - p_m^T \cdot \log p_m^T, \quad H_U = -p_M^U \cdot \log p_M^U - p_m^U \cdot \log p_m^U \tag{4}$$
If the marker allele is in LD with the QTL, the marker-allele frequencies in individuals with $s < T$ or $s > U$ will differ from those in the population. The entropy of a marker in individuals with $s < T$ or $s > U$ will then also differ from that in the population, and the difference can quantify the level of LD between the marker and the QTL.
The difference between the entropy of a marker in the population and that in individuals with $s < T$ can be written as $\Delta H_T = H - H_T$, and the corresponding difference for individuals with $s > U$ as $\Delta H_U = H - H_U$. From the Appendix,
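$$\Delta H_T \approx \delta_{MA} \cdot b_1 \cdot \log\frac{p_M}{p_m} + \delta_{MA}^2 \cdot b_1^2 \cdot \frac{1}{2 p_M \cdot p_m} \tag{5}$$
$$\Delta H_U \approx \delta_{MA} \cdot b_2 \cdot \log\frac{p_M}{p_m} + \delta_{MA}^2 \cdot b_2^2 \cdot \frac{1}{2 p_M \cdot p_m} \tag{6}$$
Here $\delta_{MA}$ is the LD coefficient between the marker allele M and the QTL allele A, and $b_1$ and $b_2$ are constants defined in the Appendix. Clearly, $\Delta H_T$ and $\Delta H_U$, the differences in the entropy of a marker between the population and individuals with $s < T$ or $s > U$, are functions of the measure of LD between the marker and the QTL.
1.2 Entropy-based index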
Now, an index can be defined as
$$L_M = \left| \frac{2 p_M \left[ \Delta H_T / b_1 - \Delta H_U / b_2 \right]}{p_m} \right| \tag{7}$$
From Eqs. (5)–(6),
$$L_M \approx \frac{\delta_{MA}^2 \cdot b}{p_m^2}$$
where $b = |b_1 - b_2|$. $L_M$ is a function of LD between the marker and the QTL. Assume that there is an initial complete association between the QTL allele A and the marker allele M at the 0th generation, when the allele A is initially introduced into the study population. After $n$ generations, $\delta_{AM}^{(n)} = (1-\theta)^n \delta_{AM}^{(0)} = (1-\theta)^n \, p \cdot p_m$ [7], where $\theta$ is the recombination fraction between the QTL and the marker locus, and $\delta_{AM}^{(0)} = p \cdot p_m$ is the initial complete LD between the alleles A and M at the 0th generation. Then (Appendix)
$$L_M \approx \frac{\delta_{MA}^2 \cdot b}{p_m^2} = (1-\theta)^{2n} \cdot p^2 \cdot b$$
Apparently, $L_M$ is a decreasing function of the recombination fraction $\theta$ and achieves its maximum at $\theta = 0$. Also, $L_M$ is independent of the marker allele frequency.
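The relation between the index of Eq. (7) and its closed-form approximation can be checked numerically. The sketch below is a hedged illustration rather than the authors' implementation: the function names are invented, the values of $b_1$ and $b_2$ are hypothetical constants (in the article they depend on the QTL parameters through the Appendix formulas), and the extreme-sample marker frequencies are taken as $p_M + b_1 \delta_{MA}$ and $p_M + b_2 \delta_{MA}$ as derived in the Appendix.

```python
import math

def entropy2(p_M):
    """Entropy of a biallelic marker with allele frequencies p_M and 1 - p_M (Eq. 3)."""
    return -sum(x * math.log(x) for x in (p_M, 1.0 - p_M) if x > 0.0)

def index_L_M(p_M, p_M_T, p_M_U, b1, b2):
    """Entropy-based index of Eq. (7) from population and extreme-sample frequencies."""
    p_m = 1.0 - p_M
    dH_T = entropy2(p_M) - entropy2(p_M_T)
    dH_U = entropy2(p_M) - entropy2(p_M_U)
    return abs(2.0 * p_M * (dH_T / b1 - dH_U / b2) / p_m)

def ld_after(n, theta, p, p_m):
    """LD after n generations, starting from complete association (delta0 = p * p_m)."""
    return (1.0 - theta) ** n * p * p_m

# Illustrative values only: b1 and b2 are hypothetical, not estimated from data.
p, p_M, b1, b2, n = 0.2, 0.4, -0.8, 0.9, 50
for theta in (0.0, 0.001, 0.005, 0.01):
    delta = ld_after(n, theta, p, 1.0 - p_M)
    exact = index_L_M(p_M, p_M + b1 * delta, p_M + b2 * delta, b1, b2)
    approx = (1.0 - theta) ** (2 * n) * p ** 2 * abs(b1 - b2)
    print(f"theta={theta:.3f}  L_M={exact:.4f}  approximation={approx:.4f}")
```

Dividing each entropy difference by its own $b$ cancels the first-order term in $\delta_{MA}$, which is the only term involving $\log(p_M/p_m)$; what remains is the second-order term, and the factor $2p_M/p_m$ removes its dependence on the marker allele frequency.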
2 Simulation
To substantiate the properties of the proposed measure, extensive simulations were performed under a wide range of parameter values. The simulations were implemented similarly to those in Deng et al. and Jiang et al. [1-3]. Fourteen dense marker loci positioned at 0.1–0.2 cM intervals were considered, covering a span of 2 cM on both sides of a putative QTL. The simulation parameters include the frequency of the allele A ($p$) at the QTL and that of the allele M ($p_M$) at a marker locus, the ratio $d/v$, the thresholds $T$ and $U$, the heritability ($h^2$) of the QTL, and the sample size ($2n$). The marker allele frequencies were drawn at random from the range 0.35–0.65, and $p = 0.2$. Three models were considered: recessive, additive, and dominant (corresponding to the ratio $d/v$ = -1, 0, and 1, respectively).
Under a specific genetic model, a population with an effective size of 15,000 is simulated starting from the 0th generation, with an initial complete association between alleles A and M [$p(M \mid A) = 1$]. In the case of initial incomplete association, $p(M \mid A) = 0.95$. The population then evolves for 50 generations under random mating and genetic drift. One hundred populations were simulated for analysis.
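One way to picture this population step is a two-locus Wright-Fisher model on haplotype frequencies. The sketch below is an assumption-laden illustration and not the simulation program used in the study: it tracks only one marker and the QTL, applies recombination deterministically to the expected frequencies, and models drift by resampling $2N$ haplotypes per generation. All function names are hypothetical.

```python
import random
from collections import Counter

HAPLOTYPES = ("MA", "Ma", "mA", "ma")

def initial_frequencies(p, p_M, assoc=1.0):
    """Haplotype frequencies at generation 0 with pr(M | A) = assoc."""
    x_MA = assoc * p
    x_mA = (1.0 - assoc) * p
    x_Ma = p_M - x_MA
    x_ma = 1.0 - x_MA - x_mA - x_Ma
    return {"MA": x_MA, "Ma": x_Ma, "mA": x_mA, "ma": x_ma}

def next_generation(freq, theta, n_haplotypes, rng):
    """One generation: recombination on expected frequencies, then drift."""
    d = freq["MA"] * freq["ma"] - freq["Ma"] * freq["mA"]  # LD coefficient
    expected = {"MA": freq["MA"] - theta * d, "Ma": freq["Ma"] + theta * d,
                "mA": freq["mA"] + theta * d, "ma": freq["ma"] - theta * d}
    draws = rng.choices(HAPLOTYPES,
                        weights=[expected[h] for h in HAPLOTYPES],
                        k=n_haplotypes)
    counts = Counter(draws)
    return {h: counts[h] / n_haplotypes for h in HAPLOTYPES}

def simulate(p=0.2, p_M=0.5, theta=0.001, generations=50, pop_size=15000,
             assoc=1.0, seed=1):
    """Evolve haplotype frequencies for a number of generations."""
    rng = random.Random(seed)
    freq = initial_frequencies(p, p_M, assoc)
    for _ in range(generations):
        freq = next_generation(freq, theta, 2 * pop_size, rng)
    return freq

# Smaller population than in the article, to keep the example fast.
final = simulate(pop_size=2000)
```

Setting `assoc=0.95` gives the initial incomplete association case. A full replication would extend this to all 14 markers and then draw phenotypes to select the extreme samples.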
3 Results
3.1 The properties of $L_M$
Each of the 100 simulated populations was sampled 1,000 times. Each sample consists of $2n$ individuals, including $n$ individuals with $s < T$ and $n$ individuals with $s > U$. For each sample, the measure $L_M$ was calculated first. Then, the average value of the measure over the 100,000 samples was obtained. For comparison, the measure was standardized by scaling the observed maximum value to one. Fig. 1 shows the results for $L_M$ under the dominant, additive, and recessive models, respectively, where the QTL is located midway between markers 7 and 8. The maximum of the index $L_M$ is at marker 7 or marker 8. The values are roughly symmetric with respect to the QTL, and decrease as the genetic distance between a marker and the QTL increases.

Fig. 1 Standardized average values of $L_M$ for 100,000 samples under the dominant model, the additive model, and the recessive model ($h^2$ = 0.2). In simulations, the bottom 10% and the top 10% of the population distribution were sampled. $p$ = 0.2, $2n$ = 400.

In practice, the position of a QTL is fine mapped by the peaks of the entropy measures that reflect the degrees of disequilibrium. To indicate the likelihood of successful fine mapping from these peaks, the probability (here referred to as the power) that the peaks fall within a certain distance from the putative QTL position was calculated. To guard against noisy distributions of the measure, as has occurred elsewhere [2,3], the peaks were located by means of the 5-point moving-average method. This method reduces variability due to fluctuations of the entropy-based index, because the peak is located from averages of neighboring points [2]. Fig. 2 shows the probability of QTL fine mapping when there is an initial complete association and an initial incomplete association between QTL and marker, under three models, for $2n$ = 400 and extreme samples from the bottom 10% and the top 10% of the population distribution. When there is an initial complete association between QTL and marker, the probability of fine mapping the QTL within 0.1 cM is approximately 70%, and within 0.3 cM it is approximately 95%, for $L_M$. Using the 5-point moving-average method, the probability increases. Under the additive and dominant models, the probabilities are generally higher than those under the recessive model. Under initial incomplete association, the probability of fine mapping the QTL is somewhat lower than under initial complete association.

Fig. 2 Probability of QTL fine mapping for $L_M$ using the five-point moving average (solid line) and using the raw measure itself (dotted line) under three models ($h^2$ = 0.2) when there is initial complete LD (○) or initial incomplete LD (*) between QTL and marker. In simulations, the bottom 10% and the top 10% of the population distribution were sampled. $p$ = 0.2, $2n$ = 400.

In addition, the effects of different sample sizes (e.g., $2n$ = 100, 200, and 400) and various sample-selection criteria (e.g., extreme samples from the bottom and/or top 5%, 10%, and 20%) on the power of these measures were investigated (data not shown). As expected, the power of QTL fine mapping increases with increasing sample size ($2n$), and also increases with the stringency of the sample selection.

3.2 Comparison with the HWD measure for QTL fine mapping
Four HWD measures that use dense markers in extreme samples to fine map a QTL have been proposed previously [3]. HWD indices compare the observed and the expected homozygosities at a marker. Although HWD measures have the excellent property of being independent of marker allele frequencies, their power is not high. The index $L_M$ corresponds to the HWD measure LCD, which uses mixed samples with $s < T$ and $s > U$ [3]. The entropy-based index was compared with this HWD measure. Table 1 shows the power comparison of the two indices when there is initial complete LD between QTL and marker, for $2n$ = 400 and extreme samples from the bottom 10% and the top 10% of the population distribution. The simulation results show that the power of the index $L_M$ is higher than that of the HWD measure LCD.

Table 1 The power comparison of the entropy-based index and HWD measure

| Measure | Recessive, $h^2$ = 0.05 | Recessive, $h^2$ = 0.1 | Recessive, $h^2$ = 0.2 | Additive, $h^2$ = 0.05 | Additive, $h^2$ = 0.1 | Additive, $h^2$ = 0.2 | Dominant, $h^2$ = 0.05 | Dominant, $h^2$ = 0.1 | Dominant, $h^2$ = 0.2 |
|---|---|---|---|---|---|---|---|---|---|
| $L_M$ | 0.56 | 0.59 | 0.63 | 0.68 | 0.70 | 0.75 | 0.59 | 0.71 | 0.74 |
| LCD | 0.30 | 0.39 | 0.43 | 0.45 | 0.64 | 0.66 | 0.43 | 0.50 | 0.64 |

$p$ = 0.2, $2n$ = 400, and extreme samples from the bottom 10% and the top 10% of the population distribution were used.
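The peak search and the power calculation described in Section 3.1 can be sketched as follows. This is a minimal illustration under stated assumptions: the article does not say how the moving average treats the first and last two markers, so the window is simply truncated at the ends here, and the function names and example values are hypothetical.

```python
def moving_average(values, window=5):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half):i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

def peak_position(positions, values, window=5):
    """Map position (cM) of the largest smoothed index value."""
    smoothed = moving_average(values, window)
    best = max(range(len(smoothed)), key=smoothed.__getitem__)
    return positions[best]

def power(positions, samples, qtl_position, max_distance, window=5):
    """Fraction of samples whose peak lies within max_distance of the QTL."""
    hits = sum(
        abs(peak_position(positions, v, window) - qtl_position) <= max_distance
        for v in samples)
    return hits / len(samples)

# Hypothetical index profile over 11 markers spaced 0.2 cM apart.
positions = [0.2 * i for i in range(11)]
profile = [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
estimated = peak_position(positions, profile)
```

Passing `window=1` reproduces the raw-measure peak, which makes it easy to compare the two curves reported in Fig. 2.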
4 Discussion
Fine mapping of genes for complex traits is a challenge after significant linkage to a genomic region has been robustly identified. Robust methods for relating genomic information are urgently needed. In this report, an entropy-based index ($L_M$) is presented for fine-scale LD gene mapping using high-density marker maps in extreme samples for a quantitative trait. The advantages of the index are that (1) it is a function of LD between the marker and the trait locus and does not depend on marker allele frequencies across loci, and thus can eliminate "noise" and even bias introduced by varying marker allele frequencies across loci; and (2) it parallels the HWD measure previously proposed [3] for QTL fine mapping, but its power is higher than that of the HWD measure. The simulation results show that the entropy-based index is valid and powerful for fine mapping of human QTL on the basis of significant linkage results.
It should be noted that the results are based on a simple evolutionary model, in which the QTL allele A was initially introduced into the general population $n$ generations ago, and there was complete LD between the allele A at the QTL and an allele M at the marker locus. In the case of initial incomplete association, it is assumed that $p(M \mid A) = 0.95$ in the simulation. The simulation results show that under initial incomplete association, the power of the entropy-based index is considerably lower than under initial complete association. In addition, it is assumed in this report that there are two alleles at the marker locus and at the trait locus. If there are multiple alleles at both the marker locus and the trait locus, a similar theory can still be developed.
In the present study, it is assumed that the marker allele frequency in the studied population is known. However, the marker allele frequency may not be known in a general population. If individuals from the general population can be sampled, a random sample of people may be used to estimate marker frequencies. When subjects are sampled only from the extreme ends ($s < T$ or $s > U$) of the phenotypic distribution, the index will depend on marker allele frequencies across loci if the population marker frequencies are replaced with the marker frequencies of the combined samples of these extreme individuals. A further study is expected to develop an index based on Shannon entropy for fine mapping of human QTL when the marker allele frequency is known only in the extreme samples.
Acknowledgements: The authors thank Dr. Hong-wen Deng, for helpful discussions at the initial stage of this study. The authors also appreciate Dr. Miaoxin Li for discussion in simulation.
Appendix:
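Firstly,
$$H_T = -p_M^T \cdot \log p_M^T - p_m^T \cdot \log p_m^T, \qquad p_M^T = \mathrm{pr}(M \mid s < T) = \frac{\mathrm{pr}(M, s < T)}{\phi_T},$$
where
$$\begin{aligned}
\mathrm{pr}(M, s < T) &= \mathrm{pr}(M, AA, s < T) + \mathrm{pr}(M, Aa, s < T) + \mathrm{pr}(M, aa, s < T) \\
&= p_{MA} \cdot p \cdot \mathrm{pr}(s < T \mid AA) + (p_{MA} \cdot q + p_{Ma} \cdot p) \cdot \mathrm{pr}(s < T \mid Aa) + p_{Ma} \cdot q \cdot \mathrm{pr}(s < T \mid aa) \\
&= p_{MA} \cdot p \cdot \varphi_{11} + (p_{MA} \cdot q + p_{Ma} \cdot p) \cdot \varphi_{12} + p_{Ma} \cdot q \cdot \varphi_{22}
\end{aligned}$$
Therefore,
$$p_M^T = a_1 \cdot p_{MA} + a_2 \cdot p_{Ma},$$
where
$$a_1 = (\varphi_{11} \cdot p + \varphi_{12} \cdot q) / \phi_T, \qquad a_2 = (\varphi_{22} \cdot q + \varphi_{12} \cdot p) / \phi_T.$$
Note that
$$p_{MA} = p_M \cdot p + \delta_{MA}, \quad p_{Ma} = p_M \cdot q + \delta_{Ma}, \quad \delta_{MA} = -\delta_{Ma}, \quad a_1 \cdot p + a_2 \cdot q = 1.$$
Then,
$$\begin{aligned}
p_M^T &= a_1 \cdot (p_M \cdot p + \delta_{MA}) + a_2 \cdot (p_M \cdot q + \delta_{Ma}) \\
&= p_M + \delta_{MA} (a_1 - a_2) \\
&= p_M + b_1 \cdot \delta_{MA}
\end{aligned}$$
where $b_1 = a_1 - a_2$. Similarly,
$$p_m^T = p_m + b_1 \cdot \delta_{mA}.$$
Hence
$$\begin{aligned}
H_T &= -p_M^T \cdot \log p_M^T - p_m^T \cdot \log p_m^T \\
&= -(p_M + b_1 \delta_{MA}) \cdot \log(p_M + b_1 \delta_{MA}) - (p_m + b_1 \delta_{mA}) \cdot \log(p_m + b_1 \delta_{mA}) \\
&= -(p_M + b_1 \delta_{MA}) \cdot \log\left[ p_M \left( 1 + \frac{b_1 \delta_{MA}}{p_M} \right) \right] - (p_m + b_1 \delta_{mA}) \cdot \log\left[ p_m \left( 1 + \frac{b_1 \delta_{mA}}{p_m} \right) \right] \\
&\approx -(p_M + b_1 \delta_{MA}) \cdot \log p_M - (p_M + b_1 \delta_{MA}) \cdot \left( \frac{b_1 \delta_{MA}}{p_M} - \frac{b_1^2 \delta_{MA}^2}{2 p_M^2} \right) \\
&\quad - (p_m + b_1 \delta_{mA}) \cdot \log p_m - (p_m + b_1 \delta_{mA}) \cdot \left( \frac{b_1 \delta_{mA}}{p_m} - \frac{b_1^2 \delta_{mA}^2}{2 p_m^2} \right) \qquad [\text{by } \log(1+x) \approx x - x^2/2] \\
&\approx -p_M \log p_M - p_m \log p_m - b_1 \delta_{MA} (\log p_M + 1) - b_1 \delta_{mA} (\log p_m + 1) - \frac{b_1^2 \delta_{MA}^2}{2 p_M} - \frac{b_1^2 \delta_{mA}^2}{2 p_m}
\end{aligned}$$
Note that $\delta_{MA} = -\delta_{mA}$. Then,
$$H_T \approx H - b_1 \delta_{MA} \log\frac{p_M}{p_m} - b_1^2 \cdot \delta_{MA}^2 \cdot \frac{1}{2 p_M \cdot p_m}$$
Similarly,
$$H_U \approx H - b_2 \delta_{MA} \log\frac{p_M}{p_m} - b_2^2 \cdot \delta_{MA}^2 \cdot \frac{1}{2 p_M \cdot p_m},$$
where $b_2 = c_1 - c_2$,
$$c_1 = (\gamma_{11} p_A + \gamma_{12} p_a) / \phi_U, \qquad c_2 = (\gamma_{22} p_a + \gamma_{12} p_A) / \phi_U,$$
with $p_A = p$ and $p_a = q$. Then
$$\Delta H_T \approx \delta_{MA} \cdot b_1 \cdot \log\frac{p_M}{p_m} + \delta_{MA}^2 \cdot b_1^2 \cdot \frac{1}{2 p_M \cdot p_m},$$
$$\Delta H_U \approx \delta_{MA} \cdot b_2 \cdot \log\frac{p_M}{p_m} + \delta_{MA}^2 \cdot b_2^2 \cdot \frac{1}{2 p_M \cdot p_m}.$$
Next, $L_M$ is calculated. From the above equations,
$$L_M = \left| \frac{2 p_M \cdot \left[ \Delta H_T / b_1 - \Delta H_U / b_2 \right]}{p_m} \right| \approx \frac{\delta_{MA}^2 \cdot b}{p_m^2} = \frac{(1-\theta)^{2n} \, p^2 \cdot p_m^2 \cdot b}{p_m^2} = (1-\theta)^{2n} \cdot p^2 \cdot b,$$
where $b = |b_1 - b_2|$.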
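The first step of the derivation, $p_M^T = p_M + b_1 \delta_{MA}$, is an exact identity and can be verified numerically. The sketch below uses hypothetical function names and input values chosen only to exercise the algebra; it is not part of the original analysis.

```python
def extreme_sample_frequency(p_MA, p_Ma, p, tail):
    """pr(M | tail) from haplotype frequencies and genotype tail probabilities.

    tail = (t_AA, t_Aa, t_aa) holds pr(tail | genotype), e.g. pr(s < T | genotype).
    """
    q = 1.0 - p
    t_AA, t_Aa, t_aa = tail
    numerator = p_MA * p * t_AA + (p_MA * q + p_Ma * p) * t_Aa + p_Ma * q * t_aa
    tail_prob = p * p * t_AA + 2.0 * p * q * t_Aa + q * q * t_aa
    return numerator / tail_prob

def b_coefficient(p, tail):
    """b = a1 - a2, with a1 = (t_AA p + t_Aa q)/phi and a2 = (t_aa q + t_Aa p)/phi."""
    q = 1.0 - p
    t_AA, t_Aa, t_aa = tail
    tail_prob = p * p * t_AA + 2.0 * p * q * t_Aa + q * q * t_aa
    return ((t_AA * p + t_Aa * q) - (t_aa * q + t_Aa * p)) / tail_prob

# Hypothetical inputs: QTL allele frequency, marker allele frequency, LD.
p, p_M, delta = 0.2, 0.45, 0.05
low_tail = (0.02, 0.08, 0.12)  # pr(s < T | AA), pr(s < T | Aa), pr(s < T | aa)
p_MA = p_M * p + delta
p_Ma = p_M * (1.0 - p) - delta
direct = extreme_sample_frequency(p_MA, p_Ma, p, low_tail)
via_b1 = p_M + b_coefficient(p, low_tail) * delta
```

With these inputs the allele A is rarer in the lower tail, so $b_1$ is negative and the marker allele M, which is positively associated with A, is also rarer in the lower tail than in the population.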
References
1 Jiang R, Dong J, Wang D, Sun FZ. Fine-scale mapping using Hardy-Weinberg disequilibrium. Ann Hum Genet, 2001, 65(2): 207-219.
2 Deng HW, Chen WM, Recker RR. QTL fine mapping by measuring and testing for Hardy-Weinberg and linkage disequilibrium at a series of linked marker loci in extreme samples of populations. Am J Hum Genet, 2000, 66(3): 1027-1045.
3 Deng HW, Li YM, Li MX, Liu PY. Robust indices of Hardy-Weinberg disequilibrium for QTL fine mapping. Hum Hered, 2003, 56(4): 160-165.
4 Zhao JY, Boerwinkle E, Xiong MM. An entropy-based statistic for genomewide association studies. Am J Hum Genet, 2005, 77(1): 27-40.
5 Shannon CE. A mathematical theory of communication. MD Comput, 1997, 14(4): 306-317.
6 Hampe J, Schreiber S, Krawczak M. Entropy-based SNP selection for genetic association studies. Hum Genet, 2003, 114(1): 36-43.
7 Hartl DL. A primer of population genetics. 3rd ed. Sinauer, Sunderland, Massachusetts, 1999.
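An Entropy-based Index for Fine Mapping of Quantitative Trait Loci
Yang Xiang1,3, Yumei Li2,3, Zaiming Liu1, Zhenqiu Sun2
1. College of Mathematics, Central South University, Changsha 410081; 2. School of Public Health, Central South University, Changsha 410078; 3. Mathematics Department of Huaihua College, Huaihua 418008
Abstract: For fine mapping of quantitative trait loci, this paper uses extreme samples of a population and dense marker loci and, by comparing the entropy and the conditional entropy of a marker, proposes an entropy-based index. The index is a function of the linkage disequilibrium coefficient between the marker and the trait locus and does not depend on the marker allele frequencies. It corresponds to the Hardy-Weinberg disequilibrium (HWD) index that we proposed earlier for fine mapping of quantitative trait loci, but its power for fine mapping quantitative trait loci is higher than that of the HWD index. Through computer simulation, the properties of the index under different genetic parameters are investigated. The simulation results show that the index is effective for fine mapping.
Keywords: fine mapping; quantitative trait; entropy; entropy-based index
About the author: Yang Xiang (1970-), male, from Hunan, doctoral candidate; research interests: risk theory, statistical genetics. E-mail: [email protected]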