python 实战-根据已知蛋白名称从基因组提取蛋白序列

实战目的

Pkinase_maize_hmm.txt 是一个含有所需要提取蛋白名称（target name）的 txt 文件，Zea_mays.B73_RefGen_v4.pep.all.fa 文件是一个含有玉米基因组所有蛋白序列信息的 fa 文件，目的就是从 fa 文件提取我们所需蛋白名称的序列。

#                                                                            --- full sequence --- -------------- this domain -------------   hmm coord   ali coord   env coord
# target name        accession   tlen query name           accession   qlen   E-value  score  bias   #  of  c-Evalue  i-Evalue  score  bias  from    to  from    to  from    to  acc description of target
#------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- ---------------------
Zm00001d052584_P001  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   1   3     3e-45   7.5e-44  152.9   0.0     2   201    38   253    37   302 0.87 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P001  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   2   3   1.1e-42   2.8e-41  144.5   0.0     1   260   366   642   366   644 0.90 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P001  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   3   3   2.5e-44   6.3e-43  149.9   0.0     8   260   688   965   681   967 0.86 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P021  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   1   3     3e-45   7.5e-44  152.9   0.0     2   201    38   253    37   302 0.87 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P021  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   2   3   1.1e-42   2.8e-41  144.5   0.0     1   260   366   642   366   644 0.90 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P021  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   3   3   2.5e-44   6.3e-43  149.9   0.0     8   260   688   965   681   967 0.86 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P023  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   1   3     3e-45   7.5e-44  152.9   0.0     2   201    38   253    37   302 0.87 pep chromosome:B73_RefGen_v4:4:193545969:193587690:1 gene:Zm00001d052584 transcript:Zm00001d052584_T023 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
Zm00001d052584_P023  -           1006 Pkinase              PF00069.25   264  1.1e-134  450.7   0.0   2   3   1.1e-42   2.8e-41  144.5   0.0     1   260   366   642   366   644 0.90 pep chromosome:B73_RefGen_v4:4:193545969:193587690:1 gene:Zm00001d052584 transcript:Zm00001d052584_T023 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed

>Zm00001d035916_P001 pep chromosome:B73_RefGen_v4:6:60137069:60138570:1 gene:Zm00001d035916 transcript:Zm00001d035916_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description: UMP/CMP kinase1
MPTVLCFYRANVYDHIFCLGGPGSGKGTQCSKIVRHFGFTHLSAGDLLRQQVQSDTEHGA
MIKNLMHEGKLVPSDIIVRLLLTAMLQSGNDRFLVDGFPRNEENRRAYESVIGIEPELVL
FIDCPREELERRILHRDQGRDDDNVDTIRKRFQVFHDSTLPVVLYYDRMGKVRRVDGAKS
ADAVFDDVKAIFTQLLTTQVHSLTHIYLPFFFPIDCSLLIKP
>Zm00001d048284_P001 pep chromosome:B73_RefGen_v4:9:153880862:153883850:-1 gene:Zm00001d048284 transcript:Zm00001d048284_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:YlmG homolog protein 2 chloroplastic
MAASSADPAQHHASSRPPLLLAVRHLPFPGVPRTRTFPVPGPDVLAPLARRLEELASAAA
AHPLLKPLFAAHSHLSSFSQRSRREPTPLWFDGASHEQGRRRLVAARRATVLSSGELCFA
AVLGDSVAGTVVASGINNFLNLYNTVLVVRLVLTWFPNTPPAIVAPLSTILAFLVLNAFT
STAAALPAELPSCSATAQHHQRTAVSPCSAPHEATLSQRKWMRRMRSQKPQGDDGDH
>Zm00001d048284_P002 pep chromosome:B73_RefGen_v4:9:153880862:153883850:-1 gene:Zm00001d048284 transcript:Zm00001d048284_T002 gene_biotype:protein_coding transcript_biotype:protein_coding description:YlmG homolog protein 2 chloroplastic
MAASSADPAQHHASSRPPLLLAVRHLPFPGVPRTRTFPVPGPDVLAPLARRLEELASAAA
AHPLLKPLFAAHSHLSSFSQHDTRFPRPQRLHQHGRRASRGAAKLLGNGAASPEDCCLTL
LGAA

思路

提取 Pkinase_maize_hmm.txt 文件中的蛋白名称（去重）

从 fa 文件提取对应蛋白的序列

步骤

1. 提取 Pkinase_maize_hmm.txt 文件中的蛋白名称

方法1：循环遍历去除重复项后写入到一个列表

gene_list = [] #设置空列表用于存储蛋白名称
with open(r'D:\data\Pkinase_maize_hmm.txt','r') as a:  #读取含有蛋白名称的文件
    for i in a: #for循环逐行遍历
        if not i.startswith('#'): #跳过文件中的注释行
            gene_id = i.split(' ')[0] #按空格将每行信息分开并选取第一列蛋白名称
            if gene_id not in gene_list: #如果临时列表中没有当前元素则追加
                gene_list.append(gene_id) #append函数用于将符合条件的蛋白名称添加到列表
print(len(gene_list)) #查看蛋白数目

方法2：转换为集合(set)再转换为列表(list)

gene_list1 = []
gene_list2 = []
with open(r'D:\data\Pkinase_maize_hmm.txt','r') as a:
    for i in a:
        if not i.startswith('#'):
            gene_list1.append(i.split(' ')[0])
            gene_list2 = list(set(gene_list1))
print(len(gene_list2))

判断两种方法结果是否一致：

#把列表转换为集合,利用集合操作符求出交集,然后再转换回列表类型
print(len(list(set(gene_list) & set(gene_list2))))

2. 从 fa 文件提取对应蛋白的序列

final_seq = {} #设置一个空字典用于存放结果
parse_check = False #设置变量用于判断读取的行是否为所需蛋白名称所在行，首先设为flase
with open(r'D:\yanglab_data\bioinformation_file\protein\blastdb\Zea_mays.B73_RefGen_v4.pep.all.fa','r') as a:
    for i in a:
        if i.startswith('>'): #判断是否以>开头
            seq_ID = i.split(' ')[0][1:] #读取fa文件的蛋白名
            if seq_ID in gene_list: #判断fa文件中的蛋白名称是否在所需蛋白列表中
                parse_check = True #如果包含在gene_list中则将parse_check重新赋值为True
                seq=[] #设置空列表用于储存序列
                final_seq[seq_ID] = seq #设置字典中key对应的value为列表形式
            else:
                parse_check = False #如果fa的蛋白不在gene_list中，parse仍为false
        elif parse_check == True: #如果parse为true（即fa蛋白在gene_list中）
            seq_part= i.split('\n')[0] #将序列添加到序列列表seq中
            seq.append(seq_part)

#将字典中的结果输出到results.fa中
outfile = open('results.fa', 'w')

for id in final_seq:
    outfile.write('>%s\n' % id) #设置字符串格式提取id信息（即蛋白名称）
    sequence = final_seq[id] #获取id对应的序列
    for i in sequence:
        outfile.write('%s' % (i)) #获取序列信息并设置格式输出
    outfile.write('\n')
outfile.close()

判断结果是否准确

file = []
with open(r'D:\data\results.fa') as a:
    for i in a:
        if i.startswith('>'):
            file.append(i.split()[0][1:])
print(len(file))
print(len(list(set(gene_list) & set(file))))

实战目的

思 路

步 骤

1. 提取 Pkinase_maize_hmm.txt 文件中的蛋白名称

2. 从 fa 文件提取对应蛋白的序列

思路

步骤