Chi-squared analysis on multiple sequence alignments with python

T. Jeffrey Cole

Last updated on Nov 3, 2021 2 min read tutorial, python, bioinformatics

Overview

For a general chi-square analysis for an alignment it is calulated $$chi2 = sum[i from 1 to k] (O_i - E_i)^2 / E_i$$

where k is the size of the alphabet (e.g. 4 for DNA, 20 for amino acids) and the values 1 to k correspond uniquely to one of the nucleotides or amino acids. O_i is the nucleotide or amino acid frequency in the sequence tested. E_i is the nucleotide or amino acid frequency expected from the ‘master’ distribution (e.g. the overall frequencies - depends on what one is using).

Whether the nucleotide (or amino acid) composition deviates significantly for the ‘master’ distribution is done by testing the chi2 value using the chi2-distribution with k-1 degrees of freedom (df=3 for DNA or df=19 for amino acids).

Python functions

Loading multiple sequence alignment

Using the biopython AlignIO function, we can load in a multiple sequence alignment file of a variety of formats.

alignment_file = "path/to/alignment.fasta"
alignment = AlignIO.read(open(alignment_file), "fasta")

Create composition matrix

import pandas as pd
def compositionMatrix(aln):
	compDict = {}
	fixedCharacters = ["-","A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R","S","T","V","W","Y"]
	for record in aln:
		header = record.id
		seq = record.seq
		currentSeqMat = [0]*21
		for i in range(len(seq)):
			aa = seq[i]
			try:
				characterPos = fixedCharacters.index(aa)
				currentSeqMat[characterPos]+= 1
			except:
				print("ERROR:", header, "contains character ("+aa+") not in the list:",fixedCharacters)
		compDict[header] = currentSeqMat
	compDF = pd.DataFrame.from_dict(compDict, orient='index',
                       columns=fixedCharacters)
	return compDF

run chi-squared analysis

from scipy import stats
import numpy as np
import pandas as pd 

def chi2test(compDF):
	seqTotals = compDF.sum(axis=1)
	gaps = compDF["-"]
	gapsPerSeq = gaps/seqTotals

	nonGap = compDF.loc[:, 'A':'Y']
	nonGapTotals = nonGap.sum().to_frame()
	nonGapSeqTotals = nonGap.sum(axis=1).to_frame()
	numCharacters = nonGapTotals.sum()
	expectedFreq = nonGapTotals / numCharacters

	expectedCountArray = np.dot(nonGapSeqTotals,expectedFreq.transpose())
	expectedCountDF = pd.DataFrame(expectedCountArray,columns =nonGap.columns, index =nonGap.index.values )
	chi2DF = ((expectedCountDF - nonGap)**2)/expectedCountDF
	chi2Sum = chi2DF.sum(axis=1)

	pValueDf = 1 - stats.chi2.cdf(chi2Sum, 19)
	outDF = pd.DataFrame({"Gap/Ambiguity":gapsPerSeq,"p-value":pValueDf})
	outDF.index.name='header'
	
	return outDF

python protein bioinformatics