TCGA-BRCA data preparation

Download TCGA-BRCA data

Query first.

query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  sample.type = c("Primary Tumor","Solid Tissue Normal"))

Check the sample type.

df= getResults(query)
df$sample_type %>% table() %>% data.frame()

Primary tumor : 1111
Solid Tissue Normal : 113


Save the downloaded data.

# Download and prepare data
GDCdownload(query, directory = dir)
data <- GDCprepare(query, directory = dir)
# data %>% saveRDS(paste0(dir,"TCGA_BRCA_transcriptome.rds"))


Import the prepared data

data = readRDS(paste0(dir,"rds/TCGA_BRCA_transcriptome.rds"))

Available assays names

assayNames(data)
## [1] "unstranded"       "stranded_first"   "stranded_second"  "tpm_unstrand"    
## [5] "fpkm_unstrand"    "fpkm_uq_unstrand"

RNA-sequencing data (assay) types

Available TCGA DB assay types

TCGA DB Assay Types

Assay Type Features Use Case
unstranded Combines expression signals from both DNA strands, no strand distinction. Suitable when the directionality of transcription is not a concern.
stranded_first Expression data specific to the first DNA strand, allowing transcription direction determination. Useful for accurate gene expression profiling, especially in overlapping genes transcribed in opposite directions.
stranded_second Expression data specific to the second DNA strand, enabling clear insights into transcription. Used when identifying transcriptional direction for the second strand or resolving overlapping gene expression.
tpm_unstrand TPM (Transcripts Per Million) normalized unstranded data. Accounts for sequencing depth and gene length. Ideal for comparing expression levels across samples due to its robust normalization method.
fpkm_unstrand FPKM (Fragments Per Kilobase per Million) normalized unstranded data. Best for comparing gene expression levels within a single sample; less suitable for cross-sample comparisons due to sequencing depth differences.
fpkm_uq_unstrand FPKM-UQ (Upper Quartile normalized FPKM) unstranded data. Adjusts normalization using the top 25% expressed genes. Useful for cross-sample comparisons by minimizing the impact of highly expressed genes and improving data comparability.

Key Points

  • Unstranded: Directionality is not distinguished → Suitable for simple expression analysis.
  • Stranded: First/Second strand distinction → Essential for direction-sensitive transcription analysis or overlapping genes.
  • TPM/FPKM: Both are normalization methods; TPM is better for cross-sample comparison, FPKM is better for within-sample comparisons.
  • FPKM-UQ: Designed for reliable cross-sample comparisons with reduced impact of highly expressed genes.

unstranded

  1. unstranded: Expression data that does not distinguish between the two DNA strands. This type of data aggregates the expression signals from both strands, which is useful when the directionality of transcription is not a concern.

stranded_first

  1. stranded_first: Expression data that is specific to the first strand of the DNA. This type of data allows for the determination of the specific DNA strand from which the RNA was transcribed, enhancing the accuracy of gene expression profiling, especially in areas of the genome with overlapping genes transcribed in opposite directions.

stranded_second

  1. stranded_second: Expression data that is specific to the second strand of the DNA. Like data from the first strand, this helps in accurately mapping RNA reads to their originating strand, providing clear insights into the transcriptional landscape.

tpm_unstrand

  1. tpm_unstrand: Transcripts Per Million (TPM) normalized unstranded expression data. TPM is a normalization method used in RNA sequencing data analysis that accounts for both the depth of sequencing and the gene length, enabling comparison across samples.

fpkm_unstrand

  1. fpkm_unstrand: Fragments Per Kilobase of transcript per Million mapped reads (FPKM) normalized unstranded expression data. FPKM normalization takes into account the length of the fragments and the total number of reads, facilitating comparison across genes within a sample but not always between different samples due to potential differences in sequencing depth.

fpkm_uq_unstrand

  1. fpkm_uq_unstrand: Upper Quartile normalized FPKM (FPKM-UQ) unstranded expression data. This normalization method adjusts for differences in the distribution of gene expression data, using the upper quartile (top 25% of expressed genes) as a scaling factor. This is particularly useful for minimizing the influence of highly expressed genes and improving the comparability between samples.

INFO ALL

# unstranded: Strand 비특이적 발현 데이터
# stranded_first: Strand 특이적 (첫 번째 strand) 발현 데이터
# stranded_second: Strand 특이적 (두 번째 strand) 발현 데이터
# tpm_unstrand: TPM (Transcripts Per Million) 방식으로 정규화된 strand 비특이적 발현 데이터
# fpkm_unstrand: FPKM (Fragments Per Kilobase of transcript per Million mapped reads) 방식으로 정규화된 strand 비특이적 발현 데이터
# fpkm_uq_unstrand: FPKM-UQ (Upper Quartile normalized FPKM) 방식으로 정규화된 strand 비특이적 발현 데이터

Modify data for the downstream analysis

Select the raw reads

rows: Gene
columns: patients

# 'unstranded' 
counts <- assay(data, "unstranded")

Convert ENSG to Symbols

# ENSG to Symbols
library(org.Hs.eg.db)

# 유전자 ID에서 버전 제거
ensembl_ids <- gsub("\\..*", "", rownames(counts))

# 중복된 ID 처리
unique_ids <- unique(ensembl_ids)

# 중복 제거된 ID에 대해 유전자 심볼 매핑
gene_symbols <- mapIds(org.Hs.eg.db,
                       keys = unique_ids,
                       column = "SYMBOL",
                       keytype = "ENSEMBL",
                       multiVals = "first")

# 중복 제거하면서 미스매치 발생.
# 원래 데이터에 매핑된 유전자 심볼 할당
# 여기서 중복 제거된 목록을 이용해 각각의 원본 ID에 매핑된 심볼 할당
symbol_names <- gene_symbols[ensembl_ids]
rownames(counts) <- symbol_names

Extract meta data

# Metadata 
filtered_colData <- colData(data)[colnames(counts),]

Merge counts and meta data

TCGA_BRCA_countsMeta = list(counts = counts,
                            meta = filtered_colData)
TCGA_BRCA_countsMeta %>% saveRDS(paste0(dir,"TCGA_BRCA_countsMeta.rds"))

TCGA_BRCA_countsMeta.rds

TCGA BRCA project meta data information

List
  1. barcode: Unique identification code for each sample within the TCGA project.
  2. patient: Unique identifier assigned to each patient.
  3. sample: Unique identifier for the sample.
  4. shortLetterCode: Simple code representing the sample.
  5. definition: Describes the definition of the sample type.
  6. sample_submitter_id: ID of the submitted sample.
  7. sample_type_id: ID of the sample type.
  8. tumor_descriptor: Provides a description of the tumor.
  9. sample_id: Sample ID.
  10. sample_type: Type of sample (e.g., Primary Tumor).
  11. composition: Describes the composition of the sample.
  12. days_to_collection: Days until sample collection.
  13. state: Indicates the state of the sample.
  14. initial_weight: Initial weight of the sample.
  15. preservation_method: Sample preservation method.
  16. pathology_report_uuid: Unique identifier for the pathology report.
  17. submitter_id: Submitter ID.
  18. oct_embedded: Whether included in OCT (Optical Coherence Tomography).
  19. specimen_type: Type of specimen.
  20. is_ffpe: Whether treated with Formalin-Fixed Paraffin-Embedded (FFPE).
  21. tissue_type: Type of tissue.
  22. synchronous_malignancy: Presence or absence of synchronous malignancy.
  23. ajcc_pathologic_stage: AJCC pathological stage.
  24. days_to_diagnosis: Days until diagnosis.
  25. treatments: Treatment details.
  26. last_known_disease_status: Last known disease status.
  27. tissue_or_organ_of_origin: Tissue or organ of origin.
  28. days_to_last_follow_up: Days to last follow-up.
  29. age_at_diagnosis: Age at diagnosis.
  30. primary_diagnosis: Primary diagnosis.
  31. prior_malignancy: Presence or absence of prior malignancy.
  32. year_of_diagnosis: Year of diagnosis.
  33. prior_treatment: Previous treatment details.
  34. ajcc_staging_system_edition: Edition of the AJCC staging system.
  35. ajcc_pathologic_t: AJCC pathological T rating.
  36. morphology: Morphology.
  37. ajcc_pathologic_n: AJCC pathological N rating.
  38. ajcc_pathologic_m: AJCC pathological M rating.
  39. classification_of_tumor: Tumor classification.
  40. diagnosis_id: Diagnosis ID.
  41. icd_10_code: International Classification of Diseases code (ICD-10).
  42. site_of_resection_or_biopsy: Site of resection or biopsy.
  43. tumor_grade: Tumor grade.
  44. progression_or_recurrence: Whether there is progression or recurrence.
  45. alcohol_history: Alcohol history.
  46. exposure_id: Exposure ID.
  47. race: Race.
  48. gender: Gender.
  49. ethnicity: Ethnicity.
  50. vital_status: Vital status.
  51. age_at_index: Age at index.
  52. days_to_birth: Days to birth.
  53. year_of_birth: Year of birth.
  54. demographic_id: Demographic ID.
  55. days_to_death: Days to death.
  56. year_of_death: Year of death.
  57. bcr_patient_barcode: BCR patient barcode.
  58. primary_site: Primary site.
  59. project_id: Project ID.
  60. disease_type: Type of disease.
  61. name: Name.
  62. releasable: Whether releasable.
  63. released: Whether released.
  64. days_to_sample_procurement: Days to sample procurement.
  65. paper_patient: Patient data used in the paper.
  66. paper_Tumor.Type: Tumor type described in the paper.
  67. paper_Included_in_previous_marker_papers: Whether included in previous marker papers.
  68. paper_vital_status: Vital status described in the paper.
  69. paper_days_to_birth: Days to birth described in the paper.
  70. paper_days_to_death: Days to death described in the paper.
  71. paper_days_to_last_followup: Days to last follow-up described in the paper.
  72. paper_age_at_initial_pathologic_diagnosis: Age at initial pathological diagnosis described in the paper.
  73. paper_pathologic_stage: Pathological stage described in the paper.
  74. paper_Tumor_Grade: Tumor grade described in the paper.
  75. paper_BRCA_Pathology: BRCA pathology described in the paper.
  76. paper_BRCA_Subtype_PAM50: BRCA subtype PAM50 described in the paper.
  77. paper_MSI_status: MSI status described in the paper.
  78. paper_HPV_Status: HPV status described in the paper.
  79. paper_tobacco_smoking_history: Tobacco smoking history described in the paper.
  80. paper_CNV Clusters: CNV clusters described in the paper.
  81. paper_Mutation Clusters: Mutation clusters described in the paper.
  82. paper_DNA.Methylation Clusters: DNA Methylation clusters described in the paper.
  83. paper_mRNA Clusters: mRNA clusters described in the paper.
  84. paper_miRNA Clusters: miRNA clusters described in the paper.
  85. paper_lncRNA Clusters: lncRNA clusters described in the paper.
  86. paper_Protein Clusters: Protein clusters described in the paper.
  87. paper_PARADIGM Clusters: PARADIGM clusters described in the paper.
  88. paper_Pan-Gyn Clusters: Pan-Gyn clusters described in the paper.