This note concerns the use of data transformations in calculations of the Transcriptome Age Index or TAI. Here I argue that the TAI, when calculated as an expression weighted sum of phylostratigraphic indices is a vector projection and thus the length of the relative expression vector should be normalized to account for the fact that expression vectors are shorter when more genes are expressed. Mathematically, the square root transform achieves this normalization because of a property of histograms (transforming an n-simplex into an n-sphere).
The Square Root Transform
Let’s begin with the definition of the \(TAI\) at stage \(s\): \[TAI(s) = \sum_i^N p_i \frac{e_i(s)}{\sum_i^Ne_i(s)}\] where, \(e_i(s)\) is the expression of gene \(i\) in stage \(s\) and \(p_i\) is the measure of gene age (phylostratum) of gene \(i\) and where there are \(N\) genes total.
If we define the normalized expression \[n_i(s) = \frac{e_i(s)}{\sum_i^Ne_i(s)}\] This can be rewritten as \[TAI(s) = \langle p_i,n_i(s)\rangle\]where \(\langle\cdot,\cdot\rangle\) is the inner (dot) product. Thus, we are taking the dot product of 2 vectors, the phylostratum vector and the normalized expression vector. By construction, the normalized expression vector satisfies the property \(\sum_i^N n_i(s)=1\). The space of possible normalized expression vectors is then the N-simplex.
Code
usingPlotsvertices = [ [1.0, 0.0, 0.0], # Unit vector in x direction [0.0, 1.0, 0.0], # Unit vector in y direction]# Create the plotp =plot( xlabel="X", ylabel="Y", title="Standard 1-Simplex", legend=false, aspect_ratio=:equal,)# Plot all edges of the tetrahedronfor i in1:2for j in (i+1):2 v1, v2 = vertices[i], vertices[j]plot!(p, [v1[1], v2[1]], [v1[2], v2[2]], color=:blue, linewidth=2)endend# Plot vertices as pointsscatter!(p, [v[1] for v in vertices], [v[2] for v in vertices], color=:blue, markersize=5)# Set the axis limitsplot!(p, xlim=(-0.1, 2.5), ylim=(-0.1, 2.5))# Display the plotvector = [1, 2]quiver!(p, [0], [0], quiver=([vector[1]], [vector[2]]), color=:purple, linewidth=2, arrow=arrow(:closed, 0.1))exp_vec = [ [1, 0], [0, 1], [0.5, 0.5]]for v in exp_vecquiver!(p, [0], [0], quiver=([v[1]], [v[2]]), color=:red, linewidth=2, arrow=arrow(:closed, 0.1))enddisplay(p)
This diagram depicts an example with 2 genes, in this case the possible normalized expression vectors are shown as the blue simplex, with 3 example expressions shown as the red vectors. A hypothetical phylostratum vector with gene age 1 for gene X and age 2 for gene Y is shown in purple, but this analysis does not depend on the particular form of \(p_i\). What happens when we calculate the TAI for different possible normalized expression vectors along the simplex (blue)? In general, the dot product can be calculated as \[\langle p_i,n_i(s) \rangle = \left\lVert p_i\right\rVert \left\lVert n_i(s)\right\rVert cos(\theta)\]
When comparing the TAI between different stages, the phylostratum vector is fixed, so \(\left\lVert p_i\right\rVert\) is constant. As expression patterns change between stages however, we would like to see how these changes affect the projection of \(n_i(s)\) onto \(p_i\). This projection has 2 components, \(\left\lVert n_i(s)\right\rVert\) and \(cos(\theta)\). However, the magnitude of \(n_i(s)\) is not constant, in fact near the vertices, the magnitude is larger than near the center of the simplex (equal expression of all genes). This implies that in this formulation, stages which have fewer genes expressed or a small number of more highly expressed genes (and thus normalized expression vectors nearer to the vertices) will have a necessarily larger TAIs regardless of which genes are expressed. This is not a feature we would like to have in the TAI, and in fact this feature gets much worse the more genes we have. In the 2 gene example, the magnitude of \(n_i\) at the center is \(0.7071\) times the magnitudes at the vertices. As the number of genes goes up, this factor decreases even more.
Code
usingLinearAlgebra# For the norm() functionusingPlots# For plotting# Initialize an array to store the normsl =zeros(999)# Main loopfor i in2:1000 x =ones(i) # In Julia, this creates a vector, not a matrix x = x /sum(x) l[i-1] =norm(x)end# Plot the resultsplot(l, xlabel="Number of Genes", ylabel="Magnitude of Center", title="Norm of Centered Expression Vector")
The simple solution is to transform the expression vectors so that they all have unit length. This is easy enough to do, but because these values fall on a simplex, taking the square root of the normalized expression vector is a convenient way to carry out this transformation. This works because \(\sum_i^N n_i(s)=1\), thus if we take the square root at this stage, and let \(r_i(s)= \sqrt{n_i(s)}\).
\[\sum_i^N r_i(s)^2=1\] and thus the \[\left\lVert r(s)\right\rVert = \sqrt{\sum_i^N r_i(s)^2} = 1\]
Now the set of possible transformed expression vectors is the unit N-sphere rather than the simplex.
Code
usingPlotsvertices = [ [1.0, 0.0, 0.0], # Unit vector in x direction [0.0, 1.0, 0.0], # Unit vector in y direction]# Create the plotp =plot( xlabel="X", ylabel="Y", title="Standard 1-Sphere", legend=false, aspect_ratio=:equal,)# Plot n-spherefunctionquarter_circle(t) x =cos.(t) y =sin.(t)return x, yend# Generate pointst =range(0, π/2, length=100)x, y =quarter_circle(t)# Create the plotplot!(x, y, aspect_ratio=:equal, label="Quarter Circle", linewidth=2, color=:blue)# Plot vertices as pointsscatter!(p, [v[1] for v in vertices], [v[2] for v in vertices], color=:blue, markersize=5)# Set the axis limitsplot!(p, xlim=(-0.1, 2.5), ylim=(-0.1, 2.5))# Display the plotvector = [1, 2]quiver!(p, [0], [0], quiver=([vector[1]], [vector[2]]), color=:purple, linewidth=2, arrow=arrow(:closed, 0.1))exp_vec = [ [1, 0], [0, 1], [0.7071, 0.7071]]for v in exp_vecquiver!(p, [0], [0], quiver=([v[1]], [v[2]]), color=:red, linewidth=2, arrow=arrow(:closed, 0.1))enddisplay(p)
The issue of the form of \(p_i\), for example whether it should be a quantile rank, is separate from this issue.