Is It Ok for Geo Upload Before Article Available
NCBI GEO: archive for high-throughput functional genomic data
Tanya Barrett,
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 45 Center Drive, Bethesda, MD 20892, USA
*To whom correspondence should be addressed. Tel: +1 301 402 4057; Fax: +1 301 480 0109; Email: barrett@ncbi.nlm.nih.gov
Received: 26 September 2008
Accepted: 06 October 2008
Published: 21 October 2008
Abstract
The Gene Expression Omnibus (GEO) at the National Center for Biotechnology Information (NCBI) is the largest public repository for high-throughput gene expression data. Additionally, GEO hosts other categories of high-throughput functional genomic data, including those that examine genome copy number variations, chromatin structure, methylation status and transcription factor binding. These data are generated by the research community using high-throughput technologies like microarrays and, more recently, next-generation sequencing. The database has a flexible infrastructure that can capture fully annotated raw and processed data, enabling compliance with major community-derived scientific reporting standards such as 'Minimum Information About a Microarray Experiment' (MIAME). In addition to serving as a centralized data storage hub, GEO offers many tools and features that allow users to effectively explore, analyze and download expression data from both gene-centric and experiment-centric perspectives. This article summarizes the GEO repository structure, content and operating procedures, as well as recently introduced data mining features. GEO is freely accessible at http://www.ncbi.nlm.nih.gov/geo/.
INTRODUCTION
The Gene Expression Omnibus (GEO) repository was established in 2000 (1) to host and freely disseminate high-throughput gene expression data. The database is built and maintained by the National Center for Biotechnology Information (NCBI), a division of the National Library of Medicine (NLM), located on the campus of the National Institutes of Health (NIH) in Bethesda, MD, USA. The data are contributed by the research community, frequently in compliance with grant or journal directives that require data to be made publicly available, thus allowing others to access, evaluate and re-analyze results.
The three principal objectives of this project are to:
- provide a robust, versatile database in which to efficiently store high-throughput functional genomic data;
- offer the simplest submission procedures and formats that support complete and well-annotated data deposits from the research community; and
- provide user-friendly mechanisms that allow users to query, locate, review and download experiments of interest.
The fulfillment of these goals can be assessed in terms of database growth and usage. Today the database holds over 10 000 experiments comprising 300 000 samples, 16 billion individual abundance measurements, for over 500 organisms, submitted by 5000 laboratories from around the world. The database typically receives over 60 000 query hits and 10 000 bulk FTP downloads per day, and has been cited in over 5000 manuscripts.
The 'Omix' division
In recent years, microarray technology has seen an explosion of applications that go beyond analyzing gene expression levels. Examples of such studies include those that examine genome single nucleotide polymorphism (SNP) and copy number variations (commonly called 'array comparative genomic hybridization', or aCGH), genome–protein binding surveys (commonly called 'chromatin immunoprecipitation on chip', or ChIP–chip), and various epigenomic factors such as nucleosomal positioning and genome methylation status. Additionally, non-array-based methodologies such as next-generation sequencing are increasingly being applied to such genome-wide investigations (2). Despite the fact that GEO was initially set up to store gene expression data generated by microarrays and serial analysis of gene expression (SAGE), the flexible design of the database allows these non-expression or alternative high-throughput data types to be similarly hosted with little extra effort or overhead. Thus, we have been accommodating to requests to accept such data and extended the minimal standards to fit these types. In fact, at the time of writing, over 15% of the data in GEO are non-expression data. Consequently, the name of the database, Gene Expression Omnibus, has become somewhat misleading and perhaps confusing to users. To address this concern, the non-expression data have recently been placed under a new division called 'Omix', which denotes a mixture of 'omic data. Other than a handful of minor issues specific to certain data types, the submission and download procedures and formats for Omix data are largely the same as for GEO. Additionally, equivalent levels of data reporting standards as established for expression data (3,4) are being applied to these other 'omic types. This includes requiring raw data, processed data, protocols and adequate sample and experiment descriptive information.
All experiments in GEO and Omix have been newly assigned into broad experiment types, making it much easier for users to locate specific data or technology types.
High-throughput sequence data
GEO recently began processing high-throughput sequence data submissions (5–7). It can be expected that next-generation sequencing technologies will become widely used and have a considerable impact on genome-wide surveys (2). GEO accepts sequence data for studies that examine gene expression, gene regulation, epigenetics, or other studies where measuring sequence abundance is central to the experiment design. GEO hosts the processed and analyzed sequence data, together with sample and experiment metadata; raw data files are brokered to NCBI's Short Read Archive sequence database, ensuring that these sequence data are integrated with NCBI's collection of sequence-specific resources (8).
DATABASE STRUCTURE AND DATA FLOW
As discussed in the Introduction section, the GEO database archives a wide variety of rapidly evolving, large-scale, functional genomic experiment types. These experiments generate data of many different file types, formats and content which consequently present considerable challenges in terms of data handling and querying. The GEO database has built-in flexibility to accommodate diverse data types. Notably, tabular data are not fully granulated in the core database. Rather, they are stored as plain text, tab-delimited tables that have no restrictions on the number of rows or columns allowed. Some columns, however, reserve special meaning, and data from these are extracted to secondary databases and used in downstream query and analysis applications as described in the GEO DataSets section. Accompanying supplementary and native file types are linked from each record and stored on an FTP server. Contextual biological and other descriptive metadata are stored in designated fields within database tables with appropriate relations and restrictions.
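The idea of a free-form tab-delimited table in which only a few columns carry special meaning can be illustrated with a short Python sketch. The reserved column names `ID_REF` and `VALUE`, the extra column `DETECTION_P` and the probe identifiers are assumptions chosen for illustration; the text above does not list the reserved names, and this is not a description of GEO's internal code.

```python
import csv
import io

# Assumption: these two column names are treated as "reserved" for this sketch.
RESERVED_COLUMNS = ("ID_REF", "VALUE")


def split_reserved(table_text, reserved=RESERVED_COLUMNS):
    """Parse a tab-delimited table and pull out the reserved columns.

    Returns (extracted_rows, other_columns): the reserved values for each row,
    and the names of the remaining, unrestricted columns.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    other_columns = [c for c in (reader.fieldnames or []) if c not in reserved]
    extracted_rows = []
    for row in reader:
        extracted_rows.append({name: row[name] for name in reserved if name in row})
    return extracted_rows, other_columns


# Hypothetical table: any number of extra columns may follow the reserved ones.
example = (
    "ID_REF\tVALUE\tDETECTION_P\n"
    "1007_s_at\t8.12\t0.001\n"
    "1053_at\t5.40\t0.020\n"
)
extracted, other = split_reserved(example)
```

Here `extracted` holds only the values that a downstream index would need, while the full table text can be stored unchanged.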
Submitter-supplied data
The overall structure of the core GEO database remains as described previously (1,9). Briefly, data submitted to GEO are stored in a relational MSSQL database partitioned into three entity types:
A Platform record is composed of a summary description of the array and a data table defining the array template. For sequence-based technologies, the Platform lists the sequences detected and identified in that experiment. Each row in the table corresponds to a single feature, and includes sequence annotation and tracking information as provided by the submitter. The table may contain any number of columns allowing thorough annotation of the array. Each Platform record is assigned a unique and stable GEO accession number with prefix GPL. A Platform may reference many Samples that have been submitted by multiple submitters.
A Sample record is composed of a description of the biological material and the experimental protocols to which it was subjected, and a data table containing abundance measurements for each feature on the corresponding Platform. The table may contain any number of columns in which to comprehensively present results. The metadata fields may hold very large volumes of text to allow elaborate descriptions of the biological source and protocols. Each Sample record is assigned a unique and stable GEO accession number with prefix GSM. A Sample must reference only one Platform and may be included in multiple Series. A Sample record will typically include supplementary files containing the raw (non-normalized) measured data, linked with the main record.
A Series record defines a set of related Samples considered to be part of a study, and describes the overall study aim and design. Each Series record is assigned a unique and stable GEO accession number with prefix GSE. Series records may contain one or more summary tables and supplementary files.
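The relations among the three entity types can be sketched as a small in-memory model. This is an illustration under stated assumptions, not GEO's schema: the class and function names are hypothetical, the accession numbers are invented, and the check that an accession is a prefix followed by digits is an assumption about the format.

```python
from dataclasses import dataclass, field

# Prefixes as given in the text: GPL (Platform), GSM (Sample), GSE (Series).
PREFIX_TO_ENTITY = {"GPL": "Platform", "GSM": "Sample", "GSE": "Series"}


def entity_type(accession):
    """Return the entity type implied by an accession prefix."""
    prefix, number = accession[:3].upper(), accession[3:]
    if prefix not in PREFIX_TO_ENTITY or not number.isdigit():
        raise ValueError(f"not a Platform, Sample or Series accession: {accession!r}")
    return PREFIX_TO_ENTITY[prefix]


@dataclass
class Sample:
    accession: str                 # GSM...
    platform: str                  # exactly one Platform accession (GPL...)
    series: list = field(default_factory=list)   # may belong to several Series


@dataclass
class Series:
    accession: str                 # GSE...
    samples: list = field(default_factory=list)  # related Sample accessions


def add_to_series(series, sample):
    """Record that a Sample is part of a Series, in both directions."""
    series.samples.append(sample.accession)
    sample.series.append(series.accession)


# Invented accessions, for illustration only.
sample = Sample(accession="GSM900001", platform="GPL900001")
first, second = Series("GSE900001"), Series("GSE900002")
add_to_series(first, sample)
add_to_series(second, sample)
```

The single `platform` field and the list-valued `series` field mirror the rule that a Sample references only one Platform but may be included in multiple Series.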
GEO DataSets
The submitter-supplied objects described above are very heterogeneous with regards to the style, content and level of detail with which the experiments are described. But despite this diversity, all expression-based submissions share a common core set of elements:
- sequence identity tracking data of each feature on the Platform;
- normalized expression measurements; and
- text describing the biological source and experiment aim.
Through a procedure that employs both automated data extraction and manual curation, these three categories of data are captured from the submitter-supplied records and organized into an upper-level object called a GEO DataSet. A DataSet represents a collection of consistently processed, experimentally related Sample records, summarized and categorized according to experimental variables. Each DataSet is assigned a unique GEO accession number with prefix GDS. DataSets are a means for transforming diverse styles of incoming data into a relatively standardized format upon which downstream data analysis and data display tools can be built. At this time, only expression-based DataSets are being generated.
DataSets provide two distinct renderings of the data (Figure 1):
- an experiment-centered representation that encapsulates the entire study. This information is presented as a DataSet record which comprises a synopsis of the experiment, a breakdown of the experimental variables, access to auxiliary objects, several data display and analysis tools and download options; and
- a gene-centered representation that presents quantitative gene expression measurements for one gene across a DataSet. This data is presented as a GEO Profile, and comprises gene identity annotation, DataSet title, links to auxiliary data and a chart depicting the expression level and rank of that gene across each Sample in the DataSet. Gene annotation is derived from querying sequence identifiers (e.g. GenBank accessions, clone IDs) with the latest Entrez Gene and UniGene databases, an important point given the dynamic nature of gene annotation (10).
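The gene-centered rendering described above can be sketched as a function that slices one gene out of an experiment-centered data structure and reports its value and rank in each Sample. The rank here is a plain ordinal rank within each Sample (1 = highest value), which is an assumption; the text does not specify how GEO computes the rank shown in a Profile. The sample and gene names are made up.

```python
def gene_profile(dataset, gene):
    """Build a gene-centered view from an experiment-centered DataSet.

    dataset maps each Sample name to a {gene: value} dictionary.
    Returns a list of (sample, value, rank) tuples, where rank 1 means the
    gene has the highest value in that Sample (assumed ranking scheme).
    """
    profile = []
    for sample, values in dataset.items():
        # Order genes from highest to lowest value within this Sample.
        ordered = sorted(values, key=values.get, reverse=True)
        rank = ordered.index(gene) + 1
        profile.append((sample, values[gene], rank))
    return profile


# Hypothetical two-Sample DataSet with three genes.
dataset = {
    "control": {"geneA": 5.0, "geneB": 9.0, "geneC": 1.0},
    "treated": {"geneA": 7.0, "geneB": 2.0, "geneC": 3.0},
}
profile = gene_profile(dataset, "geneA")
```

Reading `profile` left to right gives the same kind of information as the chart in a Profile: how the level and relative rank of one gene change from Sample to Sample.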
Figure 1.
A selection of GEO screenshots. The DataSet Browser (A) enables simple keyword searches for DataSets. When a DataSet is selected, a window appears (B) which contains detailed information about that DataSet, download options, and links to analysis features including gene expression profiles (C). Each expression profile can be viewed in more detail to see the activity of that gene across all Samples in the DataSet (D).
SUBMISSION PROCEDURES, FORMATS AND STANDARDS
Great emphasis has been placed on making data deposit procedures as simple as possible for submitters, while not compromising the level of experimental annotation required (11). Four submission options are offered: web forms, spreadsheets, a plain text format and an XML format (Table 1). All these formats are designed to capture all components of the MIAME checklist (4). Deciding which method to use depends on the volume, type of data to be submitted and current data format. To further ease the submission procedure, native files are requested where possible (e.g. Affymetrix CHP and CEL files). No matter what deposit method is used, the final GEO records will look similar and contain equivalent information. A skilled team of curators is on hand to assist researchers should any questions arise about submission procedures (email: geo@ncbi.nlm.nih.gov).
Table 1.
GEO deposit options and formats
Option | Format | Key Points |
---|---|---|
Web deposit | Web forms | Deposit of individual records. Simple step-by-step interactive web forms. |
GEOarchive | Spreadsheets (e.g. Excel) | Batch deposit. Good option for most users who have many Samples to submit. |
SOFT (Simple Omnibus Format in Text) | Plain text | Batch deposit. A simple, line-based, tab-delimited format that can be readily generated, particularly if the data are already in a database. |
MINiML (MIAME Notation in Markup Language) | XML | Batch deposit. Basically an XML rendering of SOFT format, and similarly suitable if data are already in a database. The XML schema definition is available at the GEO website. |
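
Because SOFT is described as line-based and tab-delimited, a record held in a local database can be written out with a few lines of code. The sketch below is an approximation: the label names (`^SAMPLE`, `!Sample_title`, `!Sample_platform_id`, `!sample_table_begin`, `!sample_table_end`) follow SOFT conventions as commonly documented, but none of them appear in this article, and a real submission needs further fields, so the format documentation on the GEO website remains the reference. The function name, sample values and the placeholder Platform accession are hypothetical.

```python
def sample_to_soft(name, title, platform_id, rows):
    """Serialize one Sample as SOFT-like text (illustrative subset only).

    rows is an iterable of (id_ref, value) pairs forming the data table.
    """
    lines = [
        f"^SAMPLE = {name}",                     # entity indicator line
        f"!Sample_title = {title}",              # metadata attribute lines
        f"!Sample_platform_id = {platform_id}",
        "#ID_REF = identifier matching a row of the Platform table",
        "#VALUE = normalized abundance measurement",
        "!sample_table_begin",
        "ID_REF\tVALUE",                         # tab-delimited header row
    ]
    lines += [f"{id_ref}\t{value}" for id_ref, value in rows]
    lines.append("!sample_table_end")
    return "\n".join(lines) + "\n"


# Hypothetical record; "GPL0000" is a placeholder, not a real Platform.
soft_text = sample_to_soft(
    name="ctrl_1",
    title="Control replicate 1",
    platform_id="GPL0000",
    rows=[("probe_1", 8.1), ("probe_2", 5.4)],
)
```

Generating the text from structured rows, rather than editing it by hand, keeps the header and data columns aligned, which matters for the validation step applied on upload.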
All data undergo syntactic validation upon upload. A member of the curation team reviews each record to ensure that data are organized correctly and contain sufficient information to interpret the experiment. If content or structural problems are identified, or if critical MIAME components are missing, the curator works with the submitter until the issue is resolved or explained. Submissions are typically approved within 2–5 days, but expedited approval can be performed on request. Researchers are provided the capability to update their records at any time. Records may be kept private until a manuscript describing the data is published. Submitters may generate a temporary reviewer URL that grants anonymous, confidential access to their private data, typically via a journal editor. Guidelines for reviewers and editors regarding how to access and evaluate private data are provided at http://www.ncbi.nlm.nih.gov/projects/geo/info/reviewer.html.
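The text does not say which checks make up syntactic validation, so the following is only a sketch of the kind of test that could run before a human curator looks at a record. The two rules used here (required column names present, every row as wide as the header) and the column names themselves are assumptions.

```python
def validate_table(table_text, required=("ID_REF", "VALUE")):
    """Return a list of syntactic problems found in a tab-delimited table.

    An empty list means the table passed these (assumed) checks.
    Line numbers count non-blank lines, with the header as line 1.
    """
    lines = [line for line in table_text.splitlines() if line.strip()]
    if not lines:
        return ["table is empty"]

    problems = []
    header = lines[0].split("\t")
    for name in required:
        if name not in header:
            problems.append(f"missing required column: {name}")

    for number, line in enumerate(lines[1:], start=2):
        cells = line.split("\t")
        if len(cells) != len(header):
            problems.append(
                f"line {number}: expected {len(header)} cells, found {len(cells)}"
            )
    return problems


good = validate_table("ID_REF\tVALUE\nprobe_1\t8.1\n")
bad = validate_table("ID_REF\tVALUE\nprobe_1\n")
```

Automated checks of this kind catch structural slips early, leaving the curator free to judge whether the descriptive content is sufficient to interpret the experiment.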
TOOLS TO RETRIEVE, EXPLORE AND VISUALIZE DATA
Given the wide scope of biological projects and organisms represented in GEO experiments, it is crucial to provide effective query tools so users can quickly locate, analyze and visualize data relevant to their specific interests. A summary of the main query features, and their location and purpose, is provided in Table 2. Figure 2 depicts a schematic overview of the query workflow and how the various features and tools are interlinked.
Figure 2.
A schematic overview of query workflow, and how various features and tools are interlinked. A description of the location and purpose of many of these features is provided in Table 2.
Table 2.
A summary of the location and purpose of various GEO data mining tools and features
Features introduced within the last two years are labeled NEW.
NCBI's powerful Entrez (PubMed-like) search and linking system serves as the basis for most queries; Entrez GEO DataSets contains experiment-centered information and Entrez GEO Profiles contains gene-centered expression data. Relevant material is located simply by typing in keywords or fielded Boolean (AND, OR and NOT) phrases. Additionally, several auxiliary tools feed into Entrez, including the cluster heat maps and the 'Query group A versus B' tool.
A rich complement of Entrez links is generated to connect data to related information: inter-database links reciprocally connect GEO to other NCBI resources such as PubMed, GenBank and Gene; intra-database links connect genes related by expression pattern, chromosomal position or sequence. Entrez search retrievals can be sorted and filtered by various flags and criteria, and downloaded by various mechanisms. Advanced Entrez features allow generation of multipart fielded queries, or can join multiple queries to identify overlapping results.
GEO provides several graphical renderings that greatly facilitate interpretation and visualization of expression data, including:
- Interactive pre-calculated hierarchical and on-the-fly K-means/K-medians cluster heat map images that may hint at groups of coordinately regulated genes.
- Expression profile charts that track the activity of one gene across all Samples in a DataSet. A breakdown of the experimental design is provided along the bottom of the chart, helping the user to rapidly assess whether expression levels are shifting with experimental variables.
- Thumbnail chart images provided on Entrez Profile retrievals that enable rapid visual profile scanning and comparison.
- Value and probability distribution charts that provide rapid indication of how well normalized the data are.
Limited programmatic access is supported using a suite of programs called the Entrez Programming Utilities, or E-Utils. For users who need to perform more robust analyses, all GEO data are available for bulk download via anonymous FTP at ftp://ftp.ncbi.nih.gov/pub/geo/DATA/ and can be imported into external third-party software applications, e.g. the freely available GEOquery package for BioConductor (12).
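A fielded Boolean query and an E-Utils request can be assembled as in the sketch below. Several details are assumptions that do not come from this article: the ESearch endpoint URL, the `db`, `term` and `retmax` parameter names, the database name `gds` for Entrez GEO DataSets, and the `[Organism]` field tag. The helper names are hypothetical, and the code only builds the request URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed endpoint for the E-Utils ESearch program (not given in the text).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_term(*clauses, operator="AND"):
    """Join query clauses with one Entrez Boolean operator (AND, OR or NOT)."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unsupported operator: {operator!r}")
    return f" {operator} ".join(f"({clause})" for clause in clauses)


def esearch_url(term, db="gds", retmax=20):
    """Build an ESearch URL; 'gds' is assumed to name Entrez GEO DataSets."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


# Keyword plus a fielded clause; the field tag is an assumption.
term = boolean_term("smoking", "Homo sapiens[Organism]")
url = esearch_url(term)
```

Keeping query construction separate from the network call makes it easy to inspect the exact term being sent, and the same `term` string can be pasted into the Entrez search box to compare results.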
CONCLUSIONS
NCBI's GEO public archive stores massive volumes of published high-throughput functional genomic data generated by the international research community. In addition to archiving data, tools are provided to assist users of all levels of expertise to quickly search, query, analyze, visualize and download these data. These features employ classic data reduction and filtering methods, succinct displays designed for human scanning, and extensive linking with disparate but related data sources.
Looking at the literature, it is apparent that GEO is used routinely as a primary data resource by the research community. Hundreds of third-party publications cite GEO data as evidence to support or complement independent studies, or use GEO data as the basis of statistical or analytical hypotheses or tools (http://www.ncbi.nlm.nih.gov/projects/geo/info/ucitations.html).
As high-throughput technologies advance, large-scale functional genomic datasets are becoming easier and cheaper to generate. However, a major challenge remains in translating various sets of functional genomic data into context, i.e. integrating these datasets with each other and, ultimately, making correlations with observable phenotypes. Collecting and archiving comprehensive 'omic datasets in common formats at one public location like GEO is an important first step in facilitating such large-scale integrative analyses. It can be anticipated that a continued increase in availability of these datasets will contribute to our understanding of how the genome regulates and specifies cellular types, states and processes.
The GEO database and tools continue to undergo intensive development and modification aimed at enhancing the experiences of both data submitters and data consumers. The submission pipeline and data transfer mechanisms will continue to be upgraded, and we plan to develop additional data retrieval and mining features, particularly for the novice user.
FUNDING
Funding for open access charge: Intramural Research Program of the National Institutes of Health; National Library of Medicine.
Conflict of interest statement. None declared.
REFERENCES
1. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository, Nucleic Acids Res., 2002, vol. 30 (pg. 207-210)
2. Sequence census methods for functional genomics, Nat. Methods, 2008, vol. 5 (pg. 19-21)
3. Standards for microarray data: an open letter, Environ. Health Perspect., 2004, vol. 112 (pg. A666-A667)
4. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data, Nat. Genet., 2001, vol. 29 (pg. 365-371)
5. Integration of external signaling pathways with the core transcriptional network in embryonic stem cells, Cell, 2008, vol. 133 (pg. 1106-1117)
6. Endogenous siRNAs derived from transposons and mRNAs in Drosophila somatic cells, Science, 2008, vol. 320 (pg. 1077-1081)
7. Genome-scale DNA methylation maps of pluripotent and differentiated cells, Nature, 2008, vol. 454 (pg. 766-770)
8. Database resources of the National Center for Biotechnology Information, Nucleic Acids Res., 2008, vol. 36 (pg. D13-D21)
9. NCBI GEO: mining tens of millions of expression profiles—database and tools update, Nucleic Acids Res., 2007, vol. 35 (pg. D760-D765)
10. Reannotation of array probes at NCBI's GEO database, Nat. Methods, 2008, vol. 5 pg. 117
11. NCBI GEO standards and services for microarray data, Nat. Biotechnol., 2006, vol. 24 (pg. 1471-1472)
12. GEOquery: a bridge between the Gene Expression Omnibus (GEO) and BioConductor, Bioinformatics, 2007, vol. 23 (pg. 1846-1847)
Published by Oxford University Press 2008
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Source: https://academic.oup.com/nar/article/37/suppl_1/D885/1009447