GengLee's Blog

😏 做個迷人的混蛋 桀驁而不失鋒範

0%

python 实战-根据已知蛋白名称从基因组提取蛋白序列

实战目的


Pkinase_maize_hmm.txt 是一个含有所需要提取蛋白名称(target name)的 txt 文件,Zea_mays.B73_RefGen_v4.pep.all.fa 文件是一个含有玉米基因组所有蛋白序列信息的 fa 文件,目的就是从 fa 文件提取我们所需蛋白名称的序列。


#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Zm00001d052584_P001 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 1 3 3e-45 7.5e-44 152.9 0.0 2 201 38 253 37 302 0.87 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P001 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 2 3 1.1e-42 2.8e-41 144.5 0.0 1 260 366 642 366 644 0.90 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P001 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 3 3 2.5e-44 6.3e-43 149.9 0.0 8 260 688 965 681 967 0.86 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P021 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 1 3 3e-45 7.5e-44 152.9 0.0 2 201 38 253 37 302 0.87 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P021 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 2 3 1.1e-42 2.8e-41 144.5 0.0 1 260 366 642 366 644 0.90 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P021 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 3 3 2.5e-44 6.3e-43 149.9 0.0 8 260 688 965 681 967 0.86 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P023 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 1 3 3e-45 7.5e-44 152.9 0.0 2 201 38 253 37 302 0.87 pep chromosome:B73_RefGen_v4:4:193545969:193587690:1 gene:Zm00001d052584 transcript:Zm00001d052584_T023 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P023 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0 2 3 1.1e-42 2.8e-41 144.5 0.0 1 260 366 642 366 644 0.90 pep chromosome:B73_RefGen_v4:4:193545969:193587690:1 gene:Zm00001d052584 transcript:Zm00001d052584_T023 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
>Zm00001d035916_P001 pep chromosome:B73_RefGen_v4:6:60137069:60138570:1 gene:Zm00001d035916 transcript:Zm00001d035916_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description: UMP/CMP kinase1
MPTVLCFYRANVYDHIFCLGGPGSGKGTQCSKIVRHFGFTHLSAGDLLRQQVQSDTEHGA
MIKNLMHEGKLVPSDIIVRLLLTAMLQSGNDRFLVDGFPRNEENRRAYESVIGIEPELVL
FIDCPREELERRILHRDQGRDDDNVDTIRKRFQVFHDSTLPVVLYYDRMGKVRRVDGAKS
ADAVFDDVKAIFTQLLTTQVHSLTHIYLPFFFPIDCSLLIKP
>Zm00001d048284_P001 pep chromosome:B73_RefGen_v4:9:153880862:153883850:-1 gene:Zm00001d048284 transcript:Zm00001d048284_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:YlmG homolog protein 2 chloroplastic
MAASSADPAQHHASSRPPLLLAVRHLPFPGVPRTRTFPVPGPDVLAPLARRLEELASAAA
AHPLLKPLFAAHSHLSSFSQRSRREPTPLWFDGASHEQGRRRLVAARRATVLSSGELCFA
AVLGDSVAGTVVASGINNFLNLYNTVLVVRLVLTWFPNTPPAIVAPLSTILAFLVLNAFT
STAAALPAELPSCSATAQHHQRTAVSPCSAPHEATLSQRKWMRRMRSQKPQGDDGDH
>Zm00001d048284_P002 pep chromosome:B73_RefGen_v4:9:153880862:153883850:-1 gene:Zm00001d048284 transcript:Zm00001d048284_T002 gene_biotype:protein_coding transcript_biotype:protein_coding description:YlmG homolog protein 2 chloroplastic
MAASSADPAQHHASSRPPLLLAVRHLPFPGVPRTRTFPVPGPDVLAPLARRLEELASAAA
AHPLLKPLFAAHSHLSSFSQHDTRFPRPQRLHQHGRRASRGAAKLLGNGAASPEDCCLTL
LGAA

思 路


提取 Pkinase_maize_hmm.txt 文件中的蛋白名称(去重)

从 fa 文件提取对应蛋白的序列

步 骤


1. 提取 Pkinase_maize_hmm.txt 文件中的蛋白名称

方法1:循环遍历去除重复项后写入到一个列表

gene_list = [] #设置空列表用于存储蛋白名称
with open(r'D:\data\Pkinase_maize_hmm.txt','r') as a: #读取含有蛋白名称的文件
for i in a: #for循环逐行遍历
if not i.startswith('#'): #跳过文件中的注释行
gene_id = i.split(' ')[0] #按空格将每行信息分开并选取第一列蛋白名称
if gene_id not in gene_list: #如果临时列表中没有当前元素则追加
gene_list.append(gene_id) #append函数用于将符合条件的蛋白名称添加到列表
print(len(gene_list)) #查看蛋白数目

方法2:转换为集合(set)再转换为列表(list)

gene_list1 = []
gene_list2 = []
with open(r'D:\data\Pkinase_maize_hmm.txt','r') as a:
for i in a:
if not i.startswith('#'):
gene_list1.append(i.split(' ')[0])
gene_list2 = list(set(gene_list1))
print(len(gene_list2))

判断两种方法结果是否一致:

#把列表转换为集合,利用集合操作符求出交集,然后再转换回列表类型
print(len(list(set(gene_list) & set(gene_list2))))

2. 从 fa 文件提取对应蛋白的序列

final_seq = {} #设置一个空字典用于存放结果
parse_check = False #设置变量用于判断读取的行是否为所需蛋白名称所在行,首先设为flase
with open(r'D:\yanglab_data\bioinformation_file\protein\blastdb\Zea_mays.B73_RefGen_v4.pep.all.fa','r') as a:
for i in a:
if i.startswith('>'): #判断是否以>开头
seq_ID = i.split(' ')[0][1:] #读取fa文件的蛋白名
if seq_ID in gene_list: #判断fa文件中的蛋白名称是否在所需蛋白列表中
parse_check = True #如果包含在gene_list中则将parse_check重新赋值为True
seq=[] #设置空列表用于储存序列
final_seq[seq_ID] = seq #设置字典中key对应的value为列表形式
else:
parse_check = False #如果fa的蛋白不在gene_list中,parse仍为false
elif parse_check == True: #如果parse为true(即fa蛋白在gene_list中)
seq_part= i.split('\n')[0] #将序列添加到序列列表seq中
seq.append(seq_part)

#将字典中的结果输出到results.fa中
outfile = open('results.fa', 'w')

for id in final_seq:
outfile.write('>%s\n' % id) #设置字符串格式提取id信息(即蛋白名称)
sequence = final_seq[id] #获取id对应的序列
for i in sequence:
outfile.write('%s' % (i)) #获取序列信息并设置格式输出
outfile.write('\n')
outfile.close()

判断结果是否准确

file = []
with open(r'D:\data\results.fa') as a:
for i in a:
if i.startswith('>'):
file.append(i.split()[0][1:])
print(len(file))
print(len(list(set(gene_list) & set(file))))
-------------The End-------------
Buy Me A Coffee