BUSCO Assessment

1. Introduction to BUSCO

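BUSCO stands for Benchmarking Universal Single-Copy Orthologs. It builds core gene sets for several major evolutionary lineages from the single-copy genes of many species in the OrthoDB database. These conserved core genes are searched for in an assembly, and the proportion and completeness of the matches are used to assess the completeness of genome assemblies, transcriptome assemblies, and gene annotations.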

2. Definition of core genes

1. High universality: orthologs present in > 90% of the species (considered universal)

2. Low duplicability: orthologs that are single-copy in > 90% of the species
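The two thresholds above can be illustrated with a small sketch. The copy-number table, the group names, and the function `is_core_gene` are all hypothetical, and the single-copy fraction is assumed to be measured over the species that carry the gene; this is not how OrthoDB or BUSCO actually builds its datasets.

```python
# Toy copy-number table: ortholog group -> gene copies per species.
# All names and numbers here are made up for illustration.
copy_counts = {
    "OG_A": [1] * 20,             # present and single-copy in every species
    "OG_B": [1] * 17 + [0] * 3,   # present in only 85% of the species
    "OG_C": [1] * 16 + [2] * 4,   # duplicated in 20% of the species
    "OG_D": [1] * 19 + [0],       # present in 95%, always single-copy
}


def is_core_gene(counts, min_fraction=0.90):
    """Return True if a group exceeds both the universality and single-copy thresholds."""
    present = [c for c in counts if c > 0]
    if not present:
        return False
    universality = len(present) / len(counts)
    # Assumption: the single-copy fraction is taken over species that carry the gene.
    single_copy = sum(1 for c in present if c == 1) / len(present)
    return universality > min_fraction and single_copy > min_fraction


core = sorted(name for name, counts in copy_counts.items() if is_core_gene(counts))
print(core)
```

Only the groups that clear both thresholds are kept, so a gene that is widespread but frequently duplicated is rejected just like a gene that is missing from too many species.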

3. Installing and using BUSCO

# Installing BUSCO
conda create -n busco_env -c conda-forge -c bioconda busco=5.4.7

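conda info -e                # list environments to confirm that busco_env was created
conda activate busco_env     # enter the environment before running BUSCO
conda deactivate             # leave the environment when you are done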

# To display all available datasets
busco --list-datasets

# Running BUSCO
busco -m genome -i INPUT.nucleotides -o OUTPUT -l LINEAGE -c 20    # Genome mode: assessing a genome assembly
busco -m proteins -i INPUT.amino_acids -o OUTPUT -l LINEAGE -c 20    # Protein mode: assessing a gene set
busco -m transcriptome -i INPUT.nucleotides -o OUTPUT -l LINEAGE -c 20    # Transcriptome mode: assessing assembled transcripts

# Key parameters
-i  defines the input file to analyse which is either a nucleotide fasta file or a protein fasta file, depending on the BUSCO mode. As of v5.1.0 the input argument can now also be a directory containing fasta files to run in batch mode.
-o  defines the folder that will contain all results, logs, and intermediate data
-m  sets the assessment MODE: genome, proteins, transcriptome
-l  specifies the lineage dataset. It can be a dataset name, e.g. bacteria_odb10, or a path, e.g. /path/to/bacteria_odb10. In the former case, which is the recommended usage, BUSCO will automatically download and version the corresponding dataset. In the latter case, the dataset found in the given path will be used.
    Generally the lineage to select for your assessments should be the most specific lineage available, e.g. for assessing fish data one would select the *actinopterygii* lineage rather than the *metazoa* lineage.
    BUSCO downloads the specified dataset automatically at run time. If the download is slow, fetch the dataset manually from https://busco-data.ezlab.org/v5/data/lineages/ and pass its path to -l.
-c  Specify the number of threads to use.
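The three invocations above differ only in the mode and the input file, so a script that runs BUSCO on many inputs can assemble the command from its parts. The following is a minimal sketch using only the flags documented above; the helper name `build_busco_command` is hypothetical and the placeholder values are the same ones used in the commands above.

```python
import shlex

# Mode names as listed for the -m parameter.
VALID_MODES = ("genome", "proteins", "transcriptome")


def build_busco_command(mode, input_path, output_name, lineage, threads=20):
    """Assemble a BUSCO command line as an argument list (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {VALID_MODES}")
    return [
        "busco",
        "-m", mode,
        "-i", str(input_path),
        "-o", output_name,
        "-l", lineage,
        "-c", str(threads),
    ]


cmd = build_busco_command("genome", "INPUT.nucleotides", "OUTPUT", "bacteria_odb10")
print(shlex.join(cmd))
```

The returned list can be handed to `subprocess.run(cmd, check=True)` inside the activated busco_env environment; building a list rather than a string avoids quoting problems with unusual file names.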

4. Output

The assessment results are written to OUTPUT_NAME/short_summary.specific.dataset.label.txt, where OUTPUT_NAME is the folder given with -o.
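To read the scores programmatically, the one-line score string in the short summary can be parsed with a regular expression. This sketch assumes the usual BUSCO v5 notation `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`; the sample text, the numbers in it, and the function name `parse_busco_summary` are made up for illustration.

```python
import re

# Assumed layout of the score line in a BUSCO v5 short summary.
SUMMARY_PATTERN = re.compile(
    r"C:(?P<complete>[\d.]+)%"
    r"\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)


def parse_busco_summary(text):
    """Extract percentages and the total BUSCO count from short summary text."""
    match = SUMMARY_PATTERN.search(text)
    if match is None:
        raise ValueError("no BUSCO score line found")
    fields = match.groupdict()
    scores = {key: float(value) for key, value in fields.items() if key != "total"}
    scores["total"] = int(fields["total"])
    return scores


# Made-up example text; real files contain additional header and count lines.
example = "\t***** Results: *****\n\n\tC:98.4%[S:97.6%,D:0.8%],F:0.4%,M:1.2%,n:255\n"
scores = parse_busco_summary(example)
print(scores)
```

C, S, D, F, and M stand for complete, single-copy, duplicated, fragmented, and missing BUSCOs, and n is the number of BUSCOs in the lineage dataset.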

5. Summarizing results in batch

After running BUSCO on a large number of genomes, you can use summary_BUSCO_results.py by ypchan to collect the BUSCO results of all genomes into one table for downstream analysis.

summary_BUSCO_results.py is available in the ypchan/Phylogenomics repository on GitHub (file summary_BUSCO_results.py on the main branch).

# Make the script executable
chmod 755 summary_BUSCO_results.py

# Place all BUSCO short summary files (short_summary.specific.dataset.label.txt) in a single folder.

# Recommended usage (assumes summary_BUSCO_results.py is on your PATH; otherwise call it as ./summary_BUSCO_results.py)
find . -name 'short_summary.specific*txt' | summary_BUSCO_results.py - > busco_statistics.tsv
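The file-collection step done by `find` above can also be sketched in Python, which makes it easy to recover the dataset and label from each file name. This is not the code of summary_BUSCO_results.py; the function names are hypothetical, and the sketch assumes that the dataset name itself contains no dot.

```python
from pathlib import Path

PREFIX = "short_summary.specific."
SUFFIX = ".txt"


def find_summary_files(root):
    """Recursively collect short summary files, like the find command above."""
    return sorted(Path(root).rglob("short_summary.specific*txt"))


def describe_summary_file(path):
    """Split short_summary.specific.dataset.label.txt into (dataset, label)."""
    name = Path(path).name
    if not (name.startswith(PREFIX) and name.endswith(SUFFIX)):
        raise ValueError(f"not a BUSCO short summary file: {name}")
    middle = name[len(PREFIX):-len(SUFFIX)]
    # Assumption: the dataset name has no dot, so the first dot separates it from the label.
    dataset, _, label = middle.partition(".")
    return dataset, label


for summary in find_summary_files("."):
    dataset, label = describe_summary_file(summary)
    print(f"{label}\t{dataset}\t{summary}")
```

Run from the folder that holds the summaries, this prints one tab-separated line per file with the label, the lineage dataset, and the path.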

6. Visualizing results with generate_plot.py

If generate_plot.py is not available in your conda-installed BUSCO, download it from the ezlab/busco repository on GitLab (scripts/generate_plot.py on the master branch).

# Run generate_plot.py
python3 generate_plot.py -wd [WORKING_DIRECTORY] [OTHER OPTIONS]

# required arguments:
  -wd       Define the location of your working directory
# optional arguments:
  --no_r    Do not run R; only write the R script file to the working directory

# With --no_r, only the plotting R script busco_figure.R is generated; you can then customize it
# Draw the figure with RStudio on Windows
> install.packages("tidyverse")
> library(ggplot2)
> source("path/to/busco_figure.R", encoding = 'UTF-8')
[1] "Plotting the figure ..."
[1] "Done"
> my_output
[1] "plot//busco_figure.png"  # the image is written to the Documents\plot folder

Example of the visualization: