数据来源
样本基本信息
一、数据下载
使用来自https://doi.org/10.1093/biolre/ioae007的原始测序数据(GSE247382)
Step 1:在GEO数据库检索GSE247382,获取BioProject编号:PRJNA1037470,再到ENA数据库检索PRJNA1037470,选择fastq_aspera,下载tsv,获得所有样本的下载链接。
Step 2:把下载列表保存为一个文件 例如文件名:list.txt,包含所有样本的下载链接
(fasp.sra.ebi.ac.uk:/vol1/fastq/SRR267/038/SRR26767238/SRR26767238.fastq.gz fasp.sra.ebi.ac.uk:/vol1/fastq/SRR267/038/SRR26767238/SRR26767238.fastq.gz )
Step 3:批量下载脚本 batch_download.sh
#!/bin/bashLIST=list.txtwhile read line; do echo "Downloading: $line" ascp -v -QT -l 300m -P33001 -k1 \ -i ~/.aspera/connect/etc/asperaweb_id_dsa.openssh \ era-fasp@$line ./ echo "Done: $line" echo "---------------------------"done < "$LIST"
Step 4:运行脚本
chmod +x batch_download.sh./batch_download.sh
所有文件会自动下载到当前目录,花费时间约4h,下载完成后记得核对是否所有样本均已下载。
二、数据质控
使用fastp进行数据质控,本数据为单端测序,代码如下:
for fq in *.fastq.gz; do fastp -i $fq -o clean_$fq -h ${fq%.fastq.gz}.html -j ${fq%.fastq.gz}.json; done获得质控报告SRR26767283.html,在浏览器打开查看。
最核心需要关注的 3 个指标:
三、比对到GRCh38参考基因组
Step 1:在ensembl下载GRCh38参考基因组
# 下载基因组序列 wget https://ftp.ensembl.org/pub/release-115/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz# 下载注释文件 wget https://ftp.ensembl.org/pub/release-115/gtf/homo_sapiens/Homo_sapiens.GRCh38.115.gtf.gz # 解压 gunzip *.gz
Step 2:构建hisat2参考基因组注释文件索引
#构建参考基因组注释文件索引hisat2-build Homo_sapiens.GRCh38.dna.primary_assembly.fa GRCh38_index
Step 3:将文件夹内所有clean_xxx.fastq.gz比对到GRCh38人类参考基因组
#!/bin/bash# HISAT2 index pathINDEX=/home/yqx/2_reference/ensembl/GRCh38_index# Number of threadsTHREADS=48for fq in clean_*.fastq.gzdo # get sample name base=$(basename $fq .fastq.gz) echo "=== Processing $fq ===" # Step1: HISAT2 alignment → SAM hisat2 -p $THREADS \ -x $INDEX \ -U $fq \ -S ${base}.sam # Step2: SAM → BAM samtools view -@ $THREADS -bS ${base}.sam > ${base}.bam # Step3: sort BAM samtools sort -@ $THREADS -o ${base}.sorted.bam ${base}.bam # Step4: index BAM samtools index ${base}.sorted.bam # Optional: remove intermediate files rm ${base}.sam rm ${base}.bam echo "Output: ${base}.sorted.bam and index created." echo "---------------------------------------------"doneStep 3:检查mapping质量
#!/bin/bashfor bam in clean_SRR*.sorted.bamdo echo "Processing $bam ..." samtools flagstat "$bam" > "${bam%.bam}.flagstat.txt"done输出结果如下:
查看是否有mapping rate低于90%的样本
ounter(lineounter(linegrep -H "mapped (" clean_SRR*.sorted.flagstat.txt \| awk -F'[()% ]+' '{ if ($4 < 90) print $1, $4"%"}'四、定量
使用featureCounts对所有clean_SRR*.sorted.bam进行定量(单端测序数据)
ounter(lineounter(lineounter(lineounter(lineounter(linefeatureCounts -T 48 \ -a /home/yqx/2_reference/ensembl/Homo_sapiens.GRCh38.115.gtf \ -o gene_counts.txt \ -t exon -g gene_id \ clean_SRR*.sorted.bam