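First, query GDC for the TCGA-BRCA gene expression data.
library(TCGAbiolinks)          # GDCquery, getResults, GDCdownload, GDCprepare
library(SummarizedExperiment)  # assay, assayNames, colData
library(dplyr)                 # provides %>%
query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  sample.type = c("Primary Tumor", "Solid Tissue Normal"))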
Check the sample types.
df <- getResults(query)
df$sample_type %>% table() %>% data.frame()
Primary Tumor: 1111
Solid Tissue Normal: 113
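The sample type can also be cross-checked from the barcodes themselves, because characters 14-15 of a TCGA barcode encode it (01 = Primary Tumor, 11 = Solid Tissue Normal). The following is a minimal base-R sketch: `sample_type_from_barcode` is a hypothetical helper, not a TCGAbiolinks function, and the barcodes are invented for illustration.

```r
# Hypothetical helper: decode the sample-type code (characters 14-15)
# of a TCGA barcode. Codes other than 01 and 11 come back as NA.
sample_type_from_barcode <- function(barcodes) {
  code <- substr(barcodes, 14, 15)
  lookup <- c("01" = "Primary Tumor", "11" = "Solid Tissue Normal")
  unname(lookup[code])
}

# Illustrative barcodes only; not taken from the query results.
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",
              "TCGA-BH-A0B3-11B-21R-A089-07",
              "TCGA-BH-A0B3-01A-11R-A056-07")
types <- sample_type_from_barcode(barcodes)
table(types)
```

The second and third barcodes share a patient ID, which shows how one patient can contribute both a tumor and a normal sample.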
Download the data, then save the prepared object.
# Download and prepare data
# `dir` is the output directory defined elsewhere; with paste0() it must end in "/"
GDCdownload(query, directory = dir)
data <- GDCprepare(query, directory = dir)
# data %>% saveRDS(paste0(dir, "rds/TCGA_BRCA_transcriptome.rds"))
data <- readRDS(paste0(dir, "rds/TCGA_BRCA_transcriptome.rds"))
assayNames(data)
## [1] "unstranded" "stranded_first" "stranded_second" "tpm_unstrand"
## [5] "fpkm_unstrand" "fpkm_uq_unstrand"
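The download-then-reload step above amounts to a simple cache: build the object once, save it as `.rds`, and read it back on later runs. A minimal sketch of that pattern follows. `load_or_build` is a hypothetical helper, and a toy matrix stands in for the expensive download.

```r
# Hypothetical helper: return the cached object if the .rds file exists,
# otherwise call build(), save the result, and return it.
load_or_build <- function(path, build) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  obj <- build()
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  saveRDS(obj, path)
  obj
}

# Toy usage: the expensive step is replaced by a stand-in function.
cache_file <- file.path(tempdir(), "rds", "toy_object.rds")
unlink(cache_file)  # make sure the example starts uncached
n_builds <- 0
build_toy <- function() {
  n_builds <<- n_builds + 1
  matrix(1:6, nrow = 2)
}
first  <- load_or_build(cache_file, build_toy)  # builds and saves
second <- load_or_build(cache_file, build_toy)  # reads from disk
```

In this workflow, `build` would presumably be a function that runs `GDCdownload(query, directory = dir)` and returns `GDCprepare(query, directory = dir)`.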
Available TCGA DB assay types
Assay Type | Features | Use Case |
---|---|---|
unstranded | Raw read counts per gene, counted without regard to strand. | Input for count-based differential expression tools; suitable when the library is unstranded or strand is not a concern. |
stranded_first | Raw counts of reads whose first mate aligns to the same strand as the gene's RNA. | Appropriate for stranded libraries of that orientation; helps separate overlapping genes transcribed in opposite directions. |
stranded_second | Raw counts of reads whose second mate aligns to the same strand as the gene's RNA (the reverse orientation). | Appropriate for reverse-stranded libraries; likewise helps resolve overlapping genes on opposite strands. |
tpm_unstrand | TPM (Transcripts Per Million) computed from the unstranded counts. Normalizes for gene length and sequencing depth, so each sample sums to one million. | Comparing relative expression levels within a sample and, with some caution, across samples. |
fpkm_unstrand | FPKM (Fragments Per Kilobase of transcript per Million mapped reads) computed from the unstranded counts. | Best for comparing gene expression levels within a single sample; less suitable for cross-sample comparisons because the per-sample totals differ. |
fpkm_uq_unstrand | FPKM-UQ (Upper Quartile normalized FPKM) computed from the unstranded counts. Scales by the sample's upper-quartile (75th percentile) count instead of the total read count. | Useful for cross-sample comparisons because a few very highly expressed genes have less influence on the scaling factor. |
# unstranded: non-strand-specific expression data
# stranded_first: strand-specific expression data (first strand)
# stranded_second: strand-specific expression data (second strand)
# tpm_unstrand: non-strand-specific expression data normalized as TPM (Transcripts Per Million)
# fpkm_unstrand: non-strand-specific expression data normalized as FPKM (Fragments Per Kilobase of transcript per Million mapped reads)
# fpkm_uq_unstrand: non-strand-specific expression data normalized as FPKM-UQ (Upper Quartile normalized FPKM)
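To make the three normalizations concrete, here is a minimal base-R sketch that computes FPKM, FPKM-UQ, and TPM from a toy count matrix using the textbook definitions. The counts and gene lengths are invented, and the exact GDC pipeline may differ in detail (for example, in which genes enter the upper-quartile calculation), so treat this as an illustration rather than a reproduction of the GDC values.

```r
# Toy raw counts: 4 genes x 2 samples (values invented for illustration).
counts <- matrix(c(100, 200, 300, 400,
                    50, 100, 150, 700),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))
gene_length <- c(1000, 2000, 1500, 4000)  # assumed gene lengths in bp

# Counts per base of gene length (each row divided by its gene's length).
rate <- counts / gene_length

# FPKM: scale by gene length (kb) and total counts per sample (millions).
fpkm <- sweep(rate, 2, colSums(counts), "/") * 1e9

# FPKM-UQ: same, but scale by each sample's 75th-percentile count.
uq <- apply(counts, 2, quantile, probs = 0.75)
fpkm_uq <- sweep(rate, 2, uq, "/") * 1e9

# TPM: rescale the length-normalized rates so each sample sums to one million.
tpm <- sweep(rate, 2, colSums(rate), "/") * 1e6

colSums(tpm)
```

Every TPM column sums to one million by construction, which is what makes TPM values easier to compare between samples than FPKM values.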
rows: genes
columns: samples (a patient with both a tumor and a normal sample contributes more than one column)
# 'unstranded'
counts <- assay(data, "unstranded")
# ENSG to Symbols
library(org.Hs.eg.db)
# Strip the version suffix from the gene IDs
ensembl_ids <- gsub("\\..*", "", rownames(counts))
# Handle duplicated IDs
unique_ids <- unique(ensembl_ids)
# Map the de-duplicated IDs to gene symbols
gene_symbols <- mapIds(org.Hs.eg.db,
                       keys = unique_ids,
                       column = "SYMBOL",
                       keytype = "ENSEMBL",
                       multiVals = "first")
# De-duplication makes gene_symbols shorter than the number of rows in counts.
# Assign the mapped gene symbols back to the original data:
# use the de-duplicated lookup to assign a symbol to every original ID.
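symbol_names <- unname(gene_symbols[ensembl_ids])
# Keep the Ensembl ID where no symbol was found, so no row name is NA
unmapped <- is.na(symbol_names)
symbol_names[unmapped] <- ensembl_ids[unmapped]
rownames(counts) <- symbol_names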
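Several Ensembl IDs can map to the same symbol, so the row names may contain duplicates after this step. One possible way to handle them is to sum the counts of rows that share a symbol. This is an assumption about what is wanted downstream, not something the workflow above does. A minimal sketch on a toy matrix, using base R's `rowsum()`:

```r
# Toy count matrix in which the symbol "TP53" appears twice.
toy_counts <- matrix(c(10, 5, 3,
                        1, 2, 4),
                     nrow = 3,
                     dimnames = list(c("TP53", "BRCA1", "TP53"),
                                     c("s1", "s2")))

# Sum the rows that share a row name; rowsum() returns one row per symbol,
# sorted by symbol.
collapsed <- rowsum(toy_counts, group = rownames(toy_counts))
collapsed
```

Summing suits raw counts; for normalized values such as TPM, keeping the row with the highest mean expression is another common choice.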
# Metadata
filtered_colData <- colData(data)[colnames(counts), ]
TCGA_BRCA_countsMeta <- list(counts = counts,
                             meta = filtered_colData)
TCGA_BRCA_countsMeta %>% saveRDS(paste0(dir, "TCGA_BRCA_countsMeta.rds"))
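After saving, it can be worth confirming that the counts and the metadata still line up when the file is read back. The sketch below does this with toy stand-ins and a temporary file. The object names are hypothetical, and a plain data.frame replaces the `colData` table.

```r
# Toy stand-ins for the counts matrix and the sample metadata.
counts_small <- matrix(1:6, nrow = 2,
                       dimnames = list(c("geneA", "geneB"),
                                       c("s1", "s2", "s3")))
meta_small <- data.frame(sample_type = c("Primary Tumor", "Primary Tumor",
                                         "Solid Tissue Normal"),
                         row.names = c("s1", "s2", "s3"))

# Save both objects in one list, then read the file back.
path <- tempfile(fileext = ".rds")
saveRDS(list(counts = counts_small, meta = meta_small), path)
reloaded <- readRDS(path)

# Columns of the counts must match rows of the metadata, in the same order.
aligned <- identical(colnames(reloaded$counts), rownames(reloaded$meta))
aligned
```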
Saved file: TCGA_BRCA_countsMeta.rds