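Goal
Pkinase_maize_hmm.txt is a text file containing the names (the "target name" column) of the proteins to extract, and Zea_mays.B73_RefGen_v4.pep.all.fa is a FASTA file holding every protein sequence of the maize genome. The goal is to pull the sequences of the named proteins out of the fa file.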
# --- full sequence --- -------------- this domain ------------- hmm coord ali coord env coord # target name accession tlen query name accession qlen E-value score bias # of c-Evalue i-Evalue score bias from to from to from to acc description of target #------------------- ---------- ----- -------------------- ---------- ----- --------- ------ ----- --- --- --------- --------- ------ ----- ----- ----- ----- ----- ----- ----- ---- --------------------- Zm00001d052584_P001 - 1006 Pkinase PF00069.252641.1e-134450.70.0133e-457.5e-44152.90.0220138253373020.87 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed Zm00001d052584_P001 - 1006 Pkinase PF00069.252641.1e-134450.70.0231.1e-422.8e-41144.50.012603666423666440.90 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed Zm00001d052584_P001 - 1006 Pkinase PF00069.252641.1e-134450.70.0332.5e-446.3e-43149.90.082606889656819670.86 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1 gene:Zm00001d052584 transcript:Zm00001d052584_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed Zm00001d052584_P021 - 1006 Pkinase PF00069.252641.1e-134450.70.0133e-457.5e-44152.90.0220138253373020.87 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed Zm00001d052584_P021 - 1006 Pkinase PF00069.252641.1e-134450.70.0231.1e-422.8e-41144.50.012603666423666440.90 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 
transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed Zm00001d052584_P021 - 1006 Pkinase PF00069.252641.1e-134450.70.0332.5e-446.3e-43149.90.082606889656819670.86 pep chromosome:B73_RefGen_v4:4:193545969:193587645:1 gene:Zm00001d052584 transcript:Zm00001d052584_T021 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed Zm00001d052584_P023 - 1006 Pkinase PF00069.252641.1e-134450.70.0133e-457.5e-44152.90.0220138253373020.87 pep chromosome:B73_RefGen_v4:4:193545969:193587690:1 gene:Zm00001d052584 transcript:Zm00001d052584_T023 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed Zm00001d052584_P023 - 1006 Pkinase PF00069.252641.1e-134450.70.0231.1e-422.8e-41144.50.012603666423666440.90 pep chromosome:B73_RefGen_v4:4:193545969:193587690:1 gene:Zm00001d052584 transcript:Zm00001d052584_T023 gene_biotype:protein_coding transcript_biotype:protein_coding description:Protein kinase domain containing protein expressed
>Zm00001d035916_P001 pep chromosome:B73_RefGen_v4:6:60137069:60138570:1 gene:Zm00001d035916 transcript:Zm00001d035916_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description: UMP/CMP kinase1
MPTVLCFYRANVYDHIFCLGGPGSGKGTQCSKIVRHFGFTHLSAGDLLRQQVQSDTEHGA
MIKNLMHEGKLVPSDIIVRLLLTAMLQSGNDRFLVDGFPRNEENRRAYESVIGIEPELVL
FIDCPREELERRILHRDQGRDDDNVDTIRKRFQVFHDSTLPVVLYYDRMGKVRRVDGAKS
ADAVFDDVKAIFTQLLTTQVHSLTHIYLPFFFPIDCSLLIKP
>Zm00001d048284_P001 pep chromosome:B73_RefGen_v4:9:153880862:153883850:-1 gene:Zm00001d048284 transcript:Zm00001d048284_T001 gene_biotype:protein_coding transcript_biotype:protein_coding description:YlmG homolog protein 2 chloroplastic
MAASSADPAQHHASSRPPLLLAVRHLPFPGVPRTRTFPVPGPDVLAPLARRLEELASAAA
AHPLLKPLFAAHSHLSSFSQRSRREPTPLWFDGASHEQGRRRLVAARRATVLSSGELCFA
AVLGDSVAGTVVASGINNFLNLYNTVLVVRLVLTWFPNTPPAIVAPLSTILAFLVLNAFT
STAAALPAELPSCSATAQHHQRTAVSPCSAPHEATLSQRKWMRRMRSQKPQGDDGDH
>Zm00001d048284_P002 pep chromosome:B73_RefGen_v4:9:153880862:153883850:-1 gene:Zm00001d048284 transcript:Zm00001d048284_T002 gene_biotype:protein_coding transcript_biotype:protein_coding description:YlmG homolog protein 2 chloroplastic
MAASSADPAQHHASSRPPLLLAVRHLPFPGVPRTRTFPVPGPDVLAPLARRLEELASAAA
AHPLLKPLFAAHSHLSSFSQHDTRFPRPQRLHQHGRRASRGAAKLLGNGAASPEDCCLTL
LGAA
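In both sample files the protein name is the first whitespace-delimited token: the first column of each data row in the hmm table, and the text right after `>` on each FASTA header. A minimal sketch of pulling the name out of one line of each kind follows; the two helper functions are hypothetical names introduced here for illustration, and the example row is shortened to its first few columns.

```python
def id_from_domtblout_row(row):
    """Return the target name (first column) of a table row, or None."""
    if row.startswith('#') or not row.strip():
        return None  # comment or blank line carries no protein name
    return row.split()[0]


def id_from_fasta_header(header):
    """Return the protein name from a FASTA header such as '>ID pep ...'."""
    return header[1:].split()[0]


# Shortened copies of lines from the two sample files.
row = "Zm00001d052584_P001 - 1006 Pkinase PF00069.25 264 1.1e-134 450.7 0.0"
header = ">Zm00001d052584_P001 pep chromosome:B73_RefGen_v4:4:193545827:193587762:1"

# The two names must be identical for the later lookup to work.
print(id_from_domtblout_row(row) == id_from_fasta_header(header))
```

Because the two files agree on this token, the names collected from the table can be used directly as lookup keys for the FASTA records.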
Approach
Extract the protein names from Pkinase_maize_hmm.txt, removing duplicates
Extract the sequences of those proteins from the fa file
Steps
1. Extract the protein names from Pkinase_maize_hmm.txt
Method 1: loop over the lines and append each name to a list only if it is not already there
    gene_list = []  # empty list that will hold the protein names
    with open(r'D:\data\Pkinase_maize_hmm.txt', 'r') as a:  # file containing the protein names
        for i in a:  # iterate over the file line by line
            if not i.startswith('#'):  # skip the comment lines
                gene_id = i.split(' ')[0]  # split on spaces and take the first column (protein name)
                if gene_id not in gene_list:  # append only names not seen before
                    gene_list.append(gene_id)
    print(len(gene_list))  # number of distinct proteins
Method 2: collect every name, convert the list to a set, then back to a list
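    gene_list1 = []  # every name, duplicates included
    gene_list2 = []  # names after deduplication
    with open(r'D:\data\Pkinase_maize_hmm.txt', 'r') as a:
        for i in a:
            if not i.startswith('#'):  # skip the comment lines
                gene_list1.append(i.split(' ')[0])
    gene_list2 = list(set(gene_list1))  # a set keeps one copy of each name (order is not preserved)
    print(len(gene_list2))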
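The two methods trade off differently: the list membership test in Method 1 keeps the names in file order but rescans the list for every row, while the set in Method 2 is fast but loses the order. A possible middle ground, not used in the original post, is `dict.fromkeys`, which deduplicates in one pass and keeps first-seen order on Python 3.7+. The function name `unique_ids` and the toy rows below are made up for illustration.

```python
def unique_ids(lines):
    """Return first-column names from table rows, deduplicated, in file order."""
    ids = (
        line.split()[0]
        for line in lines
        if line.strip() and not line.startswith('#')  # skip blank and comment lines
    )
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(ids))


# Toy rows standing in for the hmm table; the names are placeholders.
toy_rows = [
    "# target name accession\n",
    "P2 - 1006 Pkinase\n",
    "P1 - 1006 Pkinase\n",
    "P2 - 1006 Pkinase\n",
]
print(unique_ids(toy_rows))
```

With the real file, the same function could be called as `unique_ids(open(path))`, where `path` points at Pkinase_maize_hmm.txt.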
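2. Extract the sequences of those proteins from the fa file
    # final_seq maps each protein name to its sequence pieces; outfile is the open output file
    for seq_id in final_seq:
        outfile.write('>%s\n' % seq_id)  # write the header line with the protein name
        sequence = final_seq[seq_id]  # sequence pieces stored for this name
        for i in sequence:
            outfile.write('%s' % (i))  # write the pieces back to back on one line
        outfile.write('\n')
    outfile.close()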
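The writing loop above relies on `final_seq` and `outfile`, which are not defined in the code shown. One way the whole step could look is sketched below. It is an assumption about the missing part, not the author's code: `read_fasta` and `write_selected` are hypothetical helpers, and the toy records use made-up names so the example runs without the real fa file.

```python
import io


def read_fasta(handle):
    """Map each protein name to the list of its sequence lines."""
    records = {}
    current = None
    for line in handle:
        line = line.rstrip()
        if line.startswith('>'):
            current = line[1:].split()[0]  # name is the first token after '>'
            records[current] = []
        elif current is not None and line:
            records[current].append(line)
    return records


def write_selected(records, wanted, out):
    """Write the wanted records to out, one sequence per line; return the count."""
    written = 0
    for protein_id in wanted:
        if protein_id in records:  # silently skip names absent from the FASTA
            out.write('>%s\n' % protein_id)
            out.write(''.join(records[protein_id]) + '\n')
            written += 1
    return written


# Toy input with made-up names, standing in for the real fa file.
toy_fasta = io.StringIO(">toyA pep desc\nMPTV\nLCFY\n>toyB pep desc\nMAAS\n")
final_seq = read_fasta(toy_fasta)
outfile = io.StringIO()
count = write_selected(final_seq, ["toyB", "missing_id"], outfile)
print(count)
print(outfile.getvalue())
```

With real data, `read_fasta` would receive the opened Zea_mays.B73_RefGen_v4.pep.all.fa file, `wanted` would be `gene_list`, and `out` would be a file opened for writing. Looking names up in a dictionary avoids rescanning the large fa file once per protein.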
Check that the result is correct
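    file = []  # protein names found in the output file
    with open(r'D:\data\results.fa') as a:
        for i in a:
            if i.startswith('>'):  # header lines only
                file.append(i.split()[0][1:])  # first token without the leading '>'
    print(len(file))  # number of sequences written
    print(len(list(set(gene_list) & set(file))))  # names shared with gene_list; should equal len(gene_list)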
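Comparing counts shows whether something went wrong but not which proteins are affected. A small extension, sketched here with a hypothetical helper `compare_ids` and placeholder names, uses set differences to list the names that were requested but not written, and the names that were written but never requested.

```python
def compare_ids(expected, found):
    """Return (missing, unexpected) name lists, each sorted alphabetically."""
    expected, found = set(expected), set(found)
    missing = sorted(expected - found)      # requested but absent from the output
    unexpected = sorted(found - expected)   # present in the output but not requested
    return missing, unexpected


# Placeholder names; with real data pass gene_list and file instead.
missing, unexpected = compare_ids(["P1", "P2", "P3"], ["P1", "P3", "P9"])
print(missing)
print(unexpected)
```

Both lists being empty means the output file contains exactly the requested proteins.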