Department of Clinical Genetics, Gothenburg University, Gothenburg, Sweden

Department of Mathematical Statistics, Chalmers University of Technology, Gothenburg, Sweden

Genomics Core Facility, Gothenburg University, Gothenburg, Sweden

Center for Medical Genetics, Ghent University Hospital, Ghent, Belgium

Department of Cancer Genetics, Royal College of Surgeons in Ireland and Children's Research Centre, Our Lady's Children's Hospital, Dublin, Ireland

Childhood Cancer Research Unit, Karolinska Institute, Astrid Lindgren Children's Hospital Q6:05, S-171 76 Stockholm, Sweden

Children's Hospital of Philadelphia, Division of Oncology, The University of Pennsylvania, Philadelphia, PA

Abstract

Background

There are currently three postulated genomic subtypes of the childhood tumour neuroblastoma (NB): Type 1, Type 2A, and Type 2B. The most aggressive forms of NB are characterized by amplification of the oncogene

Results

The present study explores subgroup discrimination by gene expression profiling using three published microarray studies on NB (47 samples). Four distinct clusters were identified by Principal Components Analysis (PCA) in two separate data sets, which could be verified by an unsupervised hierarchical clustering in a third independent data set (101 NB samples) using a set of 74 discriminative genes. The expression signature of six NB-associated genes

Conclusions

Based on expression profiling, we have identified four molecular subgroups of neuroblastoma, which can be distinguished by a 6-gene signature. The fourth subgroup has not been described elsewhere, and efforts are under way to further investigate this group's specific characteristics.

Background

Neuroblastoma (NB) is a childhood tumour of the sympathetic nervous system, and is the most common cancer diagnosed during infancy. The prognosis of NB patients depends upon clinical factors such as stage

Studies show that sporadic NB tumours can be assigned to three major subtypes based on their genomic profile, and these molecular signatures also categorize risk groups of NB patients

Genome-wide transcriptome microarray analysis makes it possible to investigate the expression of all genes in a tumour simultaneously. De Preter and colleagues established a 132-gene classifier that discriminates the three major genomic NB subtypes, reflecting inherent differences in gene expression between these subtypes

In the present study, we explored subtype discoveries by unsupervised expression profiling using Principal Components Analysis (PCA). The analyses identified four distinct PCA clusters in two independent data sets, which were verified in a third larger data set by PCA and unsupervised hierarchical clustering. This study presents a new alternative way of subtype discrimination which will hopefully facilitate the search for subtype-specific therapeutic targets and the development of personalized medicine for children with neuroblastoma.

Results

Subtype discovery by PCA

Principal Components Analysis (PCA) was performed on Affymetrix HU133A expression profiles from 17

**PCA plots of two parallel cluster analyses**. Principal Components Analysis (PCA) of the two test groups using a variance cut-off of 0.4. **A**. HU133A De Preter data set (414 variables, 17 tumour samples, **B**. HU133A McArdle/Wilzén data sets (716 variables, 30 tumour samples,

**PCA loadings from the De Preter and McArdle/Wilzén data set**. Column 1-6: Variables (genes/probe-sets) and their PCA loadings for Principal components 1, 2, and 3 (PC1, PC2, PC3) in data-set 1 and 2 (De Preter and McArdle/Wilzén respectively). Common variables: Genes/probe-sets that were present in the PCA analysis of both data-sets.

**PCA of cortex, neuroblast and NB samples**. Unfiltered Principal Components Analysis (PCA) of the De Preter data set (7438 variables, 23 samples). Colour codes of spheres: Red = neuroblastoma tumour specimens; Blue = Neuroblasts; Green = Cortex.

MNA and del1p were found to be significantly more frequent in PCA cluster p3 (p = 0.018 and p = 3.9E-04, Fisher's exact test, table

Characteristics of the four PCA clusters (p1-p4)

| **Fisher's exact test** | **p1**, data set 1 (n = 4) | **p1**, data set 2 (n = 10) | **p2**, data set 1 (n = 4) | **p2**, data set 2 (n = 10) | **p3**, data set 1 (n = 6) | **p3**, data set 2 (n = 5) | **p4**, data set 1 (n = 3) | **p4**, data set 2 (n = 5) |
|---|---|---|---|---|---|---|---|---|
| High stage (3 or 4) | 0.053* | 0.056 | 0.330 | 0.691 | 0.075 | 0.355 | 0.324 | 1.000 |
| Stage 4 | 0.029* | 0.001** | 0.241 | 0.062 | 0.088 | 0.304 | 0.124 | 0.696 |
| Outcome | 0.139 | 0.051* | 0.555 | 1.000 | 0.072 | 1.000 | 0.728 | 0.070 |
| MNA | 0.208 | 0.375 | 0.208 | 0.141 | 0.001** | 0.003** | 0.324 | 1.000 |
| Del1p | 0.088 | 0.210 | 0.088 | 0.067 | 0.018* | 3.9E-04** | 0.360 | 0.589 |
| Del11q | 0.441 | 0.235 | 0.559 | 0.013* | 0.160 | 0.622 | 0.051* | 0.622 |
| Gain17q | 0.063 | 0.128 | 0.365 | 0.109 | 0.058 | 1.000 | 0.458 | 0.651 |

| **ANOVA one-way** | **ANOVA significance**, data set 1 | **ANOVA significance**, data set 2 | **FC p1**, data set 1 | **FC p1**, data set 2 | **FC p2**, data set 1 | **FC p2**, data set 2 | **FC p3**, data set 1 | **FC p3**, data set 2 | **FC p4**, data set 1 | **FC p4**, data set 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| | 1.1E-03** | 0.277 | -1.8 | -0.5 | 1.2 | 0.0 | 3.6 | 1.4 | -1.0 | -0.6 |
| | 6.1E-06** | 1.4E-05** | -2.9 | -1.7 | 2.9 | 2.0 | 7.8 | 2.1 | -3.0 | -2.6 |
| | 0.011* | 0.003** | 0.5 | 1.1 | 2.0 | 0.4 | 1.0 | 0.1 | -1.9 | -2.6 |
| | 2.0E-04** | 7.0E-04** | -0.6 | -0.3 | 0.4 | -0.4 | 12.6 | 2.4 | -3.5 | -1.4 |
| | 6.7E-04** | 1.1E-05** | 3.7 | 3.4 | 3.6 | 0.7 | 0.1 | -4.0 | -0.3 | -2.6 |
| | 0.006** | 3.4E-08** | 0.5 | 0.6 | 2.0 | 1.0 | 1.2 | 0.7 | -2.3 | -3.2 |

Fisher's exact test: Each cluster (p1, p2, p3, or p4) versus the other clusters combined. n = the number of samples in each subgroup. High stage = INSS stage 3 and 4; outcome = dead of disease; MNA =

**Frequency of prognostic factors and survival probability in PCA clusters**. **A**. The frequency of prognostic factors in the four PCA clusters p1-p4 for the De Preter data set (n = 17) and McArdle/Wilzén (n = 30). High stage (3-4) = INSS stage 3 and 4; DOD = Dead of disease; MNA = MYCN amplification; Del = deletion. Significant occurrence of prognostic factors in one PCA cluster compared to the others is marked with stars (Fisher's exact test). n = the number of samples in each subgroup. **B**. Kaplan Meier survival curves of PCA clusters (p1-p4) from three data sets (De Preter, McArdle, and Wilzén). OS = Overall survival (n = 43). EFS = Event-free survival (n = 35). Chi-square significance by Log-rank (Mantel-Cox).

To explore the expression of genes that have previously been associated with NB, we mined gene lists from the literature. Starting with 15 gene lists

**Workflow of the study**. Step 1: Subtype discovery by unsupervised PCA of two data sets (De Preter and McArdle/Wilzén) from three microarray expression studies (upper panel). Data-mining of gene lists from literature, resulting in the selection of 6 NB-associated genes (lower panel). Step 2: Defining the 74-gene subtype discrimination gene set by SAM (upper panel). Verification of subgroup existence by hierarchical clustering and PCA in a third data set (Wang) using the 74-gene set (lower panel).

**Gene lists from literature & hits in PubMed**. **A**. List of 15 expression studies used for the data-mining. **B**. PubMed searches of 157 and 30 genes respectively. PubMed searches were performed as follows: Search 1 (left): 157 genes, search term "Gene Symbol"[TIAB] AND "Gene expression"[MeSH Terms] AND "neuroblastoma"[MeSH Terms]. Search 2 (right): 30 genes, search term ("Gene Symbol"[TIAB] OR "Alias name"[TIAB]) AND "Gene expression"[MeSH Terms] AND "neuroblastoma"[MeSH Terms]. The six NB-associated genes
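The two search patterns in this caption can be generated per gene with a small helper. The sketch below is illustrative only: the function name `build_pubmed_query` and the alias `NMYC` are assumptions introduced here, not taken from the study, and the code only builds the search string without contacting PubMed.

```python
def build_pubmed_query(gene_symbol, aliases=()):
    """Build a PubMed search term for one gene (hypothetical helper).

    Without aliases this follows the Search 1 pattern; with aliases it
    follows the Search 2 pattern, where names are OR-ed inside parentheses.
    """
    names = [gene_symbol, *aliases]
    name_clause = " OR ".join(f'"{name}"[TIAB]' for name in names)
    if len(names) > 1:
        # Parentheses keep the OR group separate from the AND terms.
        name_clause = f"({name_clause})"
    return (
        f'{name_clause} AND "Gene expression"[MeSH Terms] '
        'AND "neuroblastoma"[MeSH Terms]'
    )


# Search 1 pattern: gene symbol only.
print(build_pubmed_query("MYCN"))
# Search 2 pattern: gene symbol plus an alias (alias chosen for illustration).
print(build_pubmed_query("MYCN", aliases=["NMYC"]))
```

Counting the records returned for each term would then give a per-gene hit score of the kind used to rank candidate genes.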

**Multiple comparisons by Post hoc test (Tukey)**. Gene expression of

Characteristics of cluster p4

| **Progn. factor** | **p4 vs p1** | **p4 vs p2** | **p4 vs p3** |
|---|---|---|---|
| High stage (3-4) | 0.047988* ↑ | 0.48968 | 0.375645 |
| Stage 4 | 0.002127** ↑ | 0.583591 | 0.506192 |
| DOD | 0.009569** ↑ | 0.296618 | 0.600782 |
| Del1p | 0.039341* ↑ | 0.009569* ↑ | 0.071035 |
| MNA | 0.566667 | 0.35 | 0.00905** ↓ |
| Del11q | 0.181631 | 0.291194 | 0.165635 |
| Gain17q | 0.212934 | 0.428148 | 0.31448 |

| **Transcript** | **p4 vs p1** | **p4 vs p2** | **p4 vs p3** |
|---|---|---|---|
| | 0.696703 | 0.403066 | 0.005545** ↓ |
| | 0.042222* ↓ | 1.06E-06** ↓ | 4.78E-06** ↓ |
| | 5.43E-05** ↓ | 0.000243** ↓ | 0.046452* ↓ |
| | 0.036238* ↓ | 0.055719** ↓ | 0.002016** ↓ |
| | 0.001017** ↓ | 0.03105* ↓ | 0.093904 |
| | 0.004037** ↓ | 0.002773** ↓ | 0.006196** ↓ |

Progn. factor: Fisher's exact test of the frequency of prognostic factors in group p4 compared to each of the other groups separately. High stage = INSS stage 3 and 4; DOD = Dead of disease; Del = deletion; MNA = MYCN amplification. Transcript: Welch's t-test of differential expression in p4 compared to each of the other groups separately. p-values from each data set are combined (see text for details). The arrows mark the direction: ↑ Higher frequency or higher expression in p4, ↓ Lower frequency or lower expression in p4.

According to Kaplan-Meier analysis, overall survival (OS) and event-free survival (EFS) rates were significantly different between the four clusters (OS p = 2.24E-04, EFS p = 0.019, Log-rank, Mantel-Cox). The lowest survival probabilities were found in PCA clusters p3 and p4, with a 5-year OS rate of 50% and 62.5% respectively, and an EFS rate of 22.2% and 25% at 5 years from diagnosis (Figure

Verification by hierarchical clustering and PCA

In order to verify the existence of the four groups, a discriminative gene set was defined and applied to a third independent data set. First, the p-clusters in data sets 1 and 2 were integrated by reassignment of tumours based on their 6-gene expression profile (r1-r4, Additional file

**Rules and assignments of r-groups**. Rules for r-group assignments (upper table): Groups (r1-r4) were defined based on the standard deviation (sd) of expression for the six NB-associated genes. R-Assignments of samples from data set 1 and 2 into r-groups (lower table): Expression sd intervals of 5 out of 6 genes had to be in agreement with the rules for each r-group in order to be categorized.

r-groups vs. p-groups

| | **r1** | **r2** | **r3** | **r4** | **ND** |
|---|---|---|---|---|---|
| **p1** | **9** | 1 | 0 | 0 | 4 |
| **p2** | 1 | **10** | 0 | 0 | 3 |
| **p3** | 0 | 0 | **10** | 0 | 1 |
| **p4** | 0 | 0 | 0 | **6** | 2 |

Contingency table of r-groups and p-groups of the two test data sets (De Preter and McArdle/Wilzén), 47 samples in total. The r-group assignment was based upon MNA status, INSS stage and expression levels of the six NB-associated genes with the highest PubMed search score (see additional file

The unsupervised hierarchical clustering of the 101 NB samples clearly divided tumour cases into four distinct subgroups (Figure

**Hierarchical clustering using a 74 discriminative gene set**. **A**. Unsupervised hierarchical clustering of the Affymetrix HGU95Av2 Wang data set (102 tumour samples in total **B**. PCA plot of the 74-classifier gene-set. Left panel: the four hierarchical clusters colour-coded according to the hierarchical clustering (see above). Right panel: three INSS stages: Green = stage 1, Blue = stage 3, Red = stage 4, Grey = Human fetal brain. **C**. Kaplan Meier survival curves of hierarchical clusters (h1-h4) from the Wang data set. OS = Overall survival (n = 92). EFS = Event-free survival (n = 92). Chi-square significance by Log-rank (Mantel-Cox).

Expression subgroups vs. Genomics subgroups

| **Data sets 1 & 2 (De Preter & McArdle/Wilzén)** | **p1** | **p2** | **p3** | **p4** |
|---|---|---|---|---|
| **Type 1** | **10** | 4 | 0 | 1 |
| **Type 2A** | 3 | **10** | 1 | **4** |
| **Type 2B** | 1 | 0 | **9** | 1 |
| **Other** | 0 | 0 | 1 | 2 |

In total: 47

| **Data set 3 (Wang)** | **h1** | **h2** | **h3** | **h4** |
|---|---|---|---|---|
| **Type 1** | **21** | 4 | 0 | **9** |
| **Type 2A** | 15 | **15** | 0 | **7** |
| **Type 2B** | 0 | 1 | **18** | 1 |
| **Other** | 2 | 5 | 0 | 3 |

In total: 101

Contingency table of p-group assignment and genomic subgroup assignment of the two test data sets (De Preter and McArdle/Wilzén). The p-group assignment was based upon the PCA clusters. The genomic subtypes were established based on INSS stage, MNA status, and Del11q status (see text for details). The highest numbers of cases of the p- and h-group assignments falling into a specific genomic subgroup category are shown in bold.

Patients in cluster h3 showed the worst outcome, with 10 out of 16 dead of disease (2 were lost to follow-up) and a survival probability of 36.5% at 5 years from diagnosis (Figure

Five specific gene clusters (g1-g5) showing differential expression in the four h-groups were noted (Figure

Validation of PCA clusters and the 6-gene signature

A PCA of unfiltered global transcripts of the three data sets clearly confirmed the existence of four distinct subgroups (Additional file

**PCA validation of p- and h-groups using unfiltered expression data**. Principal Components Analysis (PCA) of unfiltered global expression data (4728 genes) from three data sets (De Preter, McArdle/Wilzén, and Wang). **A**. PCA plotted by loadings generated from the McArdle/Wilzén data set. **B**. PCA plotted by loadings generated from the De Preter data set. Cases (spheres) are coloured by their group assignments: Green = p1/h1, Orange = p2/h2, Red = p3/h3, Blue = p4/h4.

Also, the 6-gene signature (

**Verification of 6-gene signature**. **A**. PCA of the six NB associated genes **B**. Expression heat map of the 6-gene signature of the Wang data set. The colour scale is based on standard deviations (SD) and ranges from +2 SD (red) to -2 SD (green). Samples are presented in the same order as in Figure 3A.

In order to check the previously reported relationship of the

**Expression heat map of MYCN, c-MYC and MYCN/c-MYC downstream targets**. The two test data sets De Preter (n = 17, Upper left panel) and McArdle/Wilzén (n = 30, lower left panel) are divided into four PCA clusters (p1-p4), and the verification data set Wang (n = 102, right panel) is divided into four hierarchical clusters (h1-h4). The heat-map colour scale is based on standard deviations (sd) and ranges from +2 sd (red) to -2 sd (green). Status of prognostic factors is shown by black and white squares to the right of each panel. Stage/DOD: Black = INSS stage 4 or dead of disease, Dark grey = INSS stage 3, White = Low INSS stage (stage 1 or 2) and alive, Light grey = Not determined.

Correlations

| Data set | PCC | Sign. | N |
|---|---|---|---|
| De Preter | -0.470 | 0.057 | 17 |
| McArdle/Wilzén | -0.420* | 0.021 | 30 |
| Wang | -0.366** | 0.000 | 101 |

Pearson Correlations of

Discussion

A large number of publications have shown that cancers can be classified through gene expression profiling. Principal components analysis (PCA) is a useful tool for reducing the dimensionality of data so that hidden patterns can be identified and visualized. PCA has been widely used in genome expression studies to discriminate tumour subtypes. For example, Yeoh and colleagues identified prognostically important subtypes and a novel subgroup of pediatric acute lymphoblastic leukaemia (ALL) by PCA of gene expression data

In the current study, subtypes of neuroblastoma were explored by expression profiles from four microarray studies

In the second step, the existence of four groups could be verified by an unsupervised hierarchical clustering and PCA of a third data set (Wang data set

The hierarchical clustering also identified five gene clusters. Nervous system developmental genes, including

The fourth novel tumour group (h4) was found to be characterized by high expression of several brain-specific and nervous system developmental genes. The Erbb receptors (

The validation test of the four PCA clusters using unsupervised and unfiltered global transcripts clearly shows that the four subgroups exist in all three data sets (Additional file

**Del11q groups of the Wang data set**. Unfiltered PCA plots of the Wang data set (101 NB samples, 7542 genes). The four hierarchical clusters (h-groups) in panels A-C (left) are colour-coded as follows: Green = h1, Orange = h2, Red = h3, Blue = h4. Del11q genetic aberrations in panels A-C (right) are colour-coded as follows: Black = 11q-deletion, off-white = No 11q-deletion, White = Undetermined. **A**. PCA of all 101 neuroblastoma cases. **B**. PCA of 74 cases without MNA and Del1p. **C**. PCA of 55 cases of the hierarchical groups h1 and h2 without MNA and Del1p.

The discriminative power of the six NB genes strengthens the view that these genes are important in neuroblastoma development.

Conclusions

In conclusion, by expression profiling of 148 NB tumours from four different Affymetrix-based microarray studies, our data suggest the existence of at least four molecular subgroups of neuroblastoma tumours. Three of the expression-based tumour groups corresponded well to the previously postulated genomic subtypes and a fourth novel group was identified which has not been described elsewhere. The novel tumour group comprised high-stage 11q-deleted tumours with low expression of

Materials and methods

Data pre-processing

Raw data files from four published neuroblastoma expression microarray studies generated from two different platforms (

Principal Components Analysis (PCA)

Principal Components Analysis (PCA) was performed using Omics Explorer 2.0 Beta from Qlucore

PubMed gene list

In order to identify known genes which have previously been identified as predictive or differentially expressed in NB disease, the literature was reviewed and gene lists from 15 neuroblastoma expression studies were selected according to the following: 16 differentially expressed genes from Albino et al., 2008

Statistical analysis and subtype discrimination

The frequency of prognostic marker,

The genomics subtypes were defined based on INSS stage, MNA status, and del11q status (table
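The one-versus-rest comparison reported for each prognostic factor (each cluster against the other clusters combined) reduces to a Fisher's exact test on a 2x2 table. The following is a minimal sketch using only the Python standard library; the function names and the toy labels are assumptions made for illustration and are not data from the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x marker-positive cases in row 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are no more probable than the observed one.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


def cluster_vs_rest(labels, has_marker, cluster):
    """Test one cluster against all other clusters combined."""
    a = sum(1 for l, m in zip(labels, has_marker) if l == cluster and m)
    b = sum(1 for l, m in zip(labels, has_marker) if l == cluster and not m)
    c = sum(1 for l, m in zip(labels, has_marker) if l != cluster and m)
    d = sum(1 for l, m in zip(labels, has_marker) if l != cluster and not m)
    return fisher_exact_two_sided(a, b, c, d)


# Toy example (invented labels): marker present in all three p3 cases only.
labels = ["p1", "p1", "p3", "p3", "p3", "p2"]
marker = [False, False, True, True, True, False]
p_value = cluster_vs_rest(labels, marker, "p3")
print(p_value)
```

With such small groups the test stays exact rather than relying on a chi-square approximation, which is why it suits cluster sizes of 3 to 10 samples.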

Verification by hierarchical clustering

In order to identify a subgroup discriminative gene set, the 98 most differentially expressed genes between subgroups were identified by SAM. First, the p-group assignments from the two data sets were translated by reassignment into four integrated groups (r1-r4) defined by rules for expression levels of the six NB genes (Additional file

The existence of molecular clusters was verified by an unsupervised hierarchical clustering of a third independent data set (Wang data set, comprising 102 samples,

Validation of PCA

In order to verify that the four identified groups could be recognized and discriminated in all three data sets, we performed PCA using the same Principal Components loadings. PCA was performed using the R function prcomp on unfiltered expression data, and PCA plots were visualized in 3D using MATLAB R2009a. Prior to the analyses, the three pre-processed data sets were filtered to contain the same set of genes (4728 genes in total) and each gene was normalized to center around zero with unit variance.

In the first test, a PCA was performed using the McArdle/Wilzén data (data set 2) and the loading scores from the first three Principal Components were plotted using different colours for each previously identified group (p1-p4 in data sets 1 and 2, and h1-h4 in data set 3, see Additional file

In the second test, we repeated the analysis starting with a PCA on the De Preter data set, and the loadings from the De Preter data were then applied to the McArdle/Wilzén and Wang data sets to check if the same pattern appeared (Additional file
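The idea of fitting loadings on one data set and applying them to the others can be sketched as follows. This is an illustration in Python with NumPy rather than the R prcomp workflow used in the study; the function names and the random stand-in matrices are assumptions, and each data set is standardized gene by gene before projection, as described above.

```python
import numpy as np


def standardize(x):
    """Center each gene (column) on zero and scale it to unit variance."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


def fit_loadings(x, n_components=3):
    """Return loadings (genes x components) of the leading components."""
    # Rows of vt are the principal axes of the standardized data.
    _, _, vt = np.linalg.svd(standardize(x), full_matrices=False)
    return vt[:n_components].T


def project(x, loadings):
    """Score samples of any data set with previously fitted loadings."""
    return standardize(x) @ loadings


# Stand-in data (random numbers, not study data): samples x genes.
rng = np.random.default_rng(0)
reference = rng.normal(size=(30, 50))
other = rng.normal(size=(17, 50))

loadings = fit_loadings(reference)
scores_reference = project(reference, loadings)
scores_other = project(other, loadings)
print(scores_reference.shape, scores_other.shape)
```

Because both data sets are scored on the same axes, samples from different studies can be plotted together and coloured by group to check whether the same pattern appears.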

Survival analyses by Kaplan Meier

The overall survival (OS) and event-free survival (EFS) of patients assigned to the four PCA subgroups (p1-p4) from the two test data sets (De Preter, McArdle/Wilzén) were analysed by Kaplan-Meier. OS included a total of 43 samples, and 4 patients were lost to follow-up (3 in p2, and 2 in p3). EFS included a total of 35 samples, and 11 patients were lost to follow-up (5 in p1, 3 in p2, 2 in p3, and 2 in p4). The OS and EFS of patients assigned to the four hierarchical subgroups (h1-h4) from the Wang data set were also analysed by Kaplan-Meier. These analyses included a total of 92 samples, and 9 patients were lost to follow-up (1 in h1, 2 in h3, and 6 in h4). The OS significance was calculated by chi-square Log-rank (Mantel-Cox), and the five-year survival significance was calculated by Fisher's exact test.
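For readers unfamiliar with the estimator, the sketch below shows how a Kaplan-Meier survival curve is computed from follow-up times and event indicators. It is a minimal standard-library illustration, not the software used in the study; the function name and the toy follow-up data are assumptions, and the log-rank comparison between groups is not included.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with an event.

    events[i] is True when the event (e.g. death) was observed at times[i]
    and False when the patient was censored at that time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored cases leave the risk set after time t.
        at_risk -= removed
    return curve


# Toy follow-up in months (invented values): True = event, False = censored.
times = [6, 12, 12, 30, 60, 60]
events = [True, True, False, True, False, False]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored patients contribute to the risk set up to their last follow-up time, so they lower the denominator for later events without producing a step in the curve themselves.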

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

FA formulated the study design, performed the microarray analysis, PCA, and hierarchical clustering. FA also drafted the manuscript. DD performed programming and cluster calculation, and revised the manuscript. MN verified groups by PCA using unfiltered data. RJ supervised the study design. KD, JV, RS, and JM provided clinical data on the status of prognostic markers and survival, and revised the manuscript. SN supervised the study design, statistical analysis, and interpretation of results. All authors read and approved the final manuscript.

Acknowledgements

This work was supported by grants from the Swedish Medical Council and the Swedish Children's Cancer Foundation.