How THOR Achieves Maximum
Performance
The THOR MPP Data Server platform consists of four to 288
processing nodes arranged in a shared-nothing configuration.
THOR was designed from the ground up to be the first MPP solution
that is both massively parallel and inherently object oriented.
Of all MPP implementations, only THOR's SQL objectManager is truly
object-oriented: all data is encapsulated with the behaviors or
functions needed to operate on the data.
In this section of the White Paper, we describe how THOR's hardware
and software models support maximum performance in a number of
ways, including:
- Shared-nothing, MPP hardware configuration
- Unique toroidal mesh interconnect among processing nodes
- Uniform data distribution among nodes
- Underlying data-driven, object-oriented design
- Deadlock prevention via serialization instead of costly row
and table locking
- Partitioned and pipelined parallelism
Why THOR uses a shared-nothing, MPP hardware
configuration
DSS/OLAP requires better scaling
There are significant advantages to an MPP architecture for processing
the complex queries which characterize the DSS/OLAP market. These
queries tend to be quite different from those in transaction processing.
They often touch large amounts of data, are normally quite I/O
intensive, and can run for very long times. They freely use SQL
operators such as joins and aggregations that require processing
large numbers of inter-row relationships, and thus they demand
much more in the way of computing resources than typical OLTP
queries. The result is that an architecture such as a shared-nothing
MPP system, which isolates processors, scales much better than
SMP for this workload.
Shared-nothing: definition
In a shared-nothing organization, the hardware is a set of processor
nodes connected by a communication network. Each node contains
its own CPU, memory, I/O bus, and disk storage. All communication
between processors is switched through the communications network.
The shared-nothing organization can be compared to two other MPP
hardware organizations:
- The shared-disk model, where each node has its
own private memory, but can directly access all disks.
- The shared-memory model, where all nodes can
directly access common memory and all disks.
Shared-memory or shared-disk systems don't scale
At first sight, common access to shared data should make it easier
to build a system, because any processor can have access to any
data item that it wants. In fact, this is the main source of bottleneck
problems. The intensive I/O activity in DSS/OLAP applications
must compete with the CPU for memory bandwidth. In SMP and shared-disk
or shared-memory MPP systems, as the number of CPUs increases,
contention for resources such as memory becomes more intense,
and performance suffers a severe impact.
Shared-nothing makes near-linear speedup and scale-up
possible
Because of THOR's shared-nothing architecture, the addition of
more processing nodes does not lead to memory, disk, or CPU bottlenecks.
In THOR's MPP implementation, each node has its own resources,
so contention remains essentially constant no matter how many
CPUs are added. In fact, the only contention in THOR is for communication
bandwidth, and because communication bandwidth increases as nodes
are added, it never becomes a bottleneck.
This means that there is very little overhead relative to the
size of a system; if system A has twice as many nodes as system
B, system A will process queries and perform administrative
tasks roughly twice as fast as system B. This capability
ensures that managers can always scale a system appropriately,
and add processing power simply by adding additional nodes.
THOR's unique toroidal
mesh interconnect
For maximum performance, SQL objectManager must have homogeneous
connectivity among its nodes, with low latency, minimum interference
among the traffic on the network, and minimum path length between
source and destination nodes. THOR achieves this objective by
using a unique toroidal interconnect, managed by a dedicated set
of communication processors, one in each node.
Envisioning the toroidal mesh
The toroidal interconnect can be visualized as follows. First,
imagine a rectangular connection grid with a processor node at
each intersection, as shown in Figure 4(a). Then connect
the processing nodes on the right to those on the left with added
network lines, forming a cylindrically connected system as shown
in Figure 4(b). Finally, envision connecting the nodes
in the top and bottom planes as shown in Figure 4(c).
The nodes are connected as if they were laid out on the surface
of a torus, or a doughnut shape.
Figure 4. Envisioning THOR's Toroidal Mesh (panels a, b, and c)
How nodes interconnect
Each node interconnects with the other nodes in a "nearest
neighbor" fashion. This connection employs special communication
processors to offload the CPUs and is more cost-effective than
other MPP systems that rely on multistage or crossbar switches
or require the CPU to participate in the data routing. It also
provides enhanced system availability through multiple path routing.
The toroidal mesh topology has the important property that the
array of processors is symmetric with respect to every node. Any
function can be assigned to any node, and it will perform as well
as if it had been assigned to any other node. This homogeneity
is the most efficient communication design for optimal execution
of the SQL objectManager software.
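For readers who want the wrap-around wiring made concrete, the short
Python sketch below is purely illustrative: the grid coordinates and
the function name are assumptions for the example, not part of THOR.
It computes the four nearest neighbors of a node on a width-by-height
toroidal grid, showing why every node sees the same connectivity.

    # Illustrative sketch of nearest-neighbor connectivity on a 2-D toroidal mesh.
    # Node (x, y) links to its four neighbors, wrapping at the grid edges so that
    # every node is topologically identical to every other node.

    def torus_neighbors(x, y, width, height):
        """Return the (x, y) coordinates of the four nearest neighbors."""
        return [
            ((x - 1) % width, y),   # west (wraps to the right edge)
            ((x + 1) % width, y),   # east (wraps to the left edge)
            (x, (y - 1) % height),  # north (wraps to the bottom row)
            (x, (y + 1) % height),  # south (wraps to the top row)
        ]

    # Example: on a 4 x 3 torus, the corner node (0, 0) still has four neighbors,
    # just like an interior node -- the symmetry described above.
    print(torus_neighbors(0, 0, 4, 3))   # [(3, 0), (1, 0), (0, 2), (0, 1)]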
Uniform data distribution among nodes
THOR uses processor affinity, combined with a proprietary
hashing algorithm, to automatically and uniformly
distribute database rows among processor nodes.
Why is uniform data distribution important?
For parallel processing to achieve optimum performance, table
rows must be distributed evenly among a system's processors. If
four processors each contain 1000 rows, for example, and if the
processors work in parallel, the time it takes to process all
4000 rows is roughly comparable to the time it takes to process
1000 rows. However, if one processor contains 2500 rows and the
others contain 500 rows each, the time it takes to process all
4000 rows is roughly comparable to the time it takes to process
2500 rows, or all the rows on the overloaded processor.
Uniform distribution, therefore, maximizes parallel processing
performance for many operations.
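The arithmetic behind this example can be stated compactly: with nodes
working in parallel, elapsed time is governed by the busiest node. The
following Python fragment is illustrative only (the rows-per-second
figure is an arbitrary assumption) and simply restates the 4000-row
example above.

    # Illustrative arithmetic only: with parallel nodes, total elapsed time is
    # set by the busiest node, so skewed row counts leave the other nodes idle.

    def parallel_time(rows_per_node, rows_per_second=1000):
        """Elapsed time is set by the node holding the most rows."""
        return max(rows_per_node) / rows_per_second

    print(parallel_time([1000, 1000, 1000, 1000]))  # 1.0 -- uniform: ~time for 1000 rows
    print(parallel_time([2500, 500, 500, 500]))     # 2.5 -- skewed: ~time for 2500 rows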
Affinity: definition
A technique sometimes used to distribute data among nodes is called
affinity. The term comes from the fact that each data item
has affinity for a processor. Each processor is given a part of
the database to manage, and all requests for that data are routed
through that processor. THOR's shared-nothing system implements
affinity; each processor accesses only the data on its own disks
and in its own memory.
Hashing: definition
To determine which row has affinity with which node, THOR uses
a proprietary hashing algorithm. A hashing algorithm is simply
a formula that generates a number indicating to which node a row
should be assigned.
How THOR generates hash values
THOR's hashing algorithm uses a large random number, in conjunction
with the primary index of a row, to generate the row's hash value.
Bits from the hash value are used to select the node where the
row resides.
To ensure even data distribution, the THOR SQL objectManager requires
that the primary index for a table be unique. If a table has no
primary unique index, THOR generates a synthetic hash value that
ensures "round-robin" distribution of data to nodes.
Data-driven, object-oriented design
The software design of THOR is based on
a data-driven model of execution, implemented using object-oriented
technology. Data-driven processing and object orientation combine
to yield a product which is truly unique today, but which will
be the paradigm for the next generation of implementations by
other vendors.
A brief description of data-driven processing
Data-driven processing finds multiple operations that can be undertaken
concurrently within the evaluation of one or more expressions.
Data-driven processing is inherently parallel, and overcomes the
limitations of procedural programming, in which each statement
must complete execution before the next statement begins. This
model lets THOR process and evaluate many queries at the same
time, further increasing performance.
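To make the contrast with procedural execution concrete, the toy
evaluator below is a sketch under our own simplifying assumptions, not
THOR's engine: it fires each operation as soon as the operation's
inputs exist, rather than in the order the operations were written.

    # A toy data-driven evaluator: each operation fires as soon as all of its
    # inputs exist, regardless of the order in which the operations were written.
    # (Illustrative only -- THOR's internal execution engine is not public.)

    def run_data_driven(operations, values):
        """operations: list of (output_name, input_names, function)."""
        pending = list(operations)
        while pending:
            for op in pending:
                output, inputs, fn = op
                if all(name in values for name in inputs):      # inputs ready?
                    values[output] = fn(*(values[n] for n in inputs))
                    pending.remove(op)
                    break
            else:
                raise RuntimeError("no runnable operation -- unsatisfied inputs")
        return values

    # Independent sub-expressions of one query can be evaluated in any order
    # (and, on real hardware, concurrently).
    ops = [
        ("total",    ("subtotal", "tax"), lambda s, t: s + t),
        ("tax",      ("subtotal",),       lambda s: s * 0.08),
        ("subtotal", (),                  lambda: 100.0),
    ]
    print(run_data_driven(ops, {}))   # subtotal and tax resolve before total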
How queries are executed
As each user query comes in, it is assigned to the next available
node in a "round robin" fashion. It is possible to do
this only because the node array is symmetric with respect to
each node, and because every node in the system can perform all
the functions of the database system.
The assigned node is referred to as the initiating node or query
captain for that query, and is responsible for:
- Parsing the query
- Optimizing the query
- Breaking the query down into tokens, or messages
- Obtaining serialization tags
- Accumulating the final results and performing the final aggregations
- Coordinating the transaction commit
Multiple nodes act as query captains for multiple queries in parallel,
which leads to greatly improved performance. Figure 5 shows
a simplified schematic of THOR's toroidal mesh, illustrating how
the query captain and the nodes communicate.
Figure 5. How nodes communicate during a query
As discussed in "Uniform data distribution among nodes"
on page 11, each node is given a part of the database to
manage, and all requests for that data are routed through that
processor. This feature means that each node contains specific
data, in the form of row objects.
Following is a simplified description of the procedure THOR uses
to process and execute a query.
Step 1 - The query captain prepares and distributes the query
At the initiating node, the query is parsed, analyzed, and
optimized. This process defines a plan of execution for the database
node software. The plan is then broadcast via messages to all nodes
for execution by their row objects.
Step 2 - The row objects process the query
Each row object determines whether it matches the query criteria
and, if so, how it needs to compute and distribute data to successive
objects or back to the query captain. The row objects communicate
among themselves via messages to complete execution. As results for a set of rows on a node are available, the
nodes distribute interim results to other row objects or back
to the query captain.
Step 3 - The query captain assembles and returns results
As results are pipelined in from the processing nodes, the
query captain performs any final aggregations and calculations.
When all results have been returned and all processing has
been completed, the query captain returns the final results to
the requester.
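The three steps can be pictured with a small sketch. The following
Python fragment is illustrative only: the plan is reduced to a single
predicate, nodes are plain lists of rows, and the function names are
assumptions, but the flow of broadcast, parallel local scans, and final
aggregation by the query captain mirrors the description above.

    # Hedged sketch of the three-step flow: the query captain broadcasts a plan,
    # each node scans only the rows it owns in parallel, and the captain performs
    # the final aggregation as partial results come back.
    from concurrent.futures import ThreadPoolExecutor

    def node_worker(local_rows, predicate):
        """Step 2: a node evaluates the query against the rows it owns."""
        return [row for row in local_rows if predicate(row)]

    def run_query(nodes, predicate):
        # Step 1: the captain "broadcasts" the plan (here, just the predicate).
        with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
            partials = pool.map(node_worker, nodes, [predicate] * len(nodes))
            # Step 3: final aggregation as partial results are pipelined back.
            return sum(len(p) for p in partials)

    nodes = [[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]]     # rows per node
    print(run_query(nodes, lambda row: row > 6))                # SELECT COUNT(*) WHERE value > 6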
Summary
As the steps above illustrate, the only tasks that occur sequentially
during a query are those performed by the query captain. All primary
database processing takes place in parallel on all nodes at once.
THOR's unique method of distributing query initiation tasks among
all nodes in the system, coupled with its data-driven and object-oriented
approach to query processing at the node level, means that a large
number of queries can be processed in parallel (inter-query parallelism).
It also means that a single query can be processed in parallel
on all nodes at once (intra-query parallelism). In this way, THOR
fully maximizes the parallel capabilities of its hardware and
software system.
Inter- and intra-query parallel processing are discussed in more
detail in "Partitioned and pipelined parallelism" on
page 15.
Preventing deadlock
In the past, parallel DBMS implementations have been severely
affected by deadlock contention. Reducing deadlock contention
has been a problem both for efforts to process multiple queries
in parallel and for efforts to process a single query in parallel
across multiple nodes.
Lock table can lead to bottlenecks
A common solution to this problem is for the software to mark
pages as locked so that one processor can finish making its update
before other processors are allowed to use the data. Unfortunately,
the lock table is often centralized, and as more processors are
added, the lock table becomes a bottleneck that limits system
performance.
Serialization replaces the lock table
With a data-driven execution model, as with any other, it is
necessary to ensure that operations take place in the correct order. In
place of row-level or table-level locking, THOR uses a data-driven
technique called serialization for most operations. The
function of serialization is to make explicit which operations
must precede other operations, and which can proceed at the same
time.
Serialization tickets
Each operation that needs to be carried out (INSERT, UPDATE, SELECT,
etc.) is assigned a serialization ticket. For each row
object that requires a ticket, the row object is looked up and
its serialization counters are incremented. Operations are then
executed in the order that the tickets were issued. Once an operation
has been issued tickets, the statement is serialized.
Tickets have two important properties:
- Tickets are just numbers, and thus are very fast and easy
to issue.
- Tickets allow a row object to make a local decision about
the order in which operations are executed.
Ticket generation is the only centralized function in THOR's software.
Read and write processes isolated
Each operation receives both a read and a write ticket, so that
any number of reads can be carried out until a ticket for a write
operation comes along. At that point, other read and write operations
must wait only for the brief instant while the write operation
is completing.
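A highly simplified model of ticketing, shown below, captures the two
properties listed earlier: tickets are just numbers issued by one
central counter, and each row object can order its own operations
locally. The class and function names are illustrative assumptions,
not THOR internals.

    # Illustrative model of serialization tickets: the only centralized step is
    # handing out monotonically increasing ticket numbers; each row object then
    # orders its own operations locally, letting reads proceed freely until a
    # write ticket arrives.
    import itertools

    ticket_counter = itertools.count(1)          # the single centralized function

    def issue_ticket():
        return next(ticket_counter)

    class RowObject:
        def __init__(self):
            self.operations = []                 # (ticket, kind) pairs

        def register(self, kind):
            """kind is 'read' or 'write'; the ticket fixes its place in line."""
            ticket = issue_ticket()
            self.operations.append((ticket, kind))
            return ticket

        def execution_order(self):
            # Local decision: execute strictly in ticket order; consecutive reads
            # could run together, but every write acts as a brief barrier.
            return sorted(self.operations)

    row = RowObject()
    row.register("read"); row.register("read"); row.register("write"); row.register("read")
    print(row.execution_order())   # reads before the write run freely; the later read waits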
The major benefits of serialization
Serializing a statement has several important benefits:
- Many transactions that are deadlocked on other RDBMS systems
are deadlock-free and restart-free on THOR.
- Even though the query may execute on many processors in parallel,
all decisions about ordering operations can be made locally by
the row objects. This minimizes inter-node communication, thus
speeding up query processing.
- Serialization provides distributed concurrency control, and
replaces the lock manager. There is no need for any other concurrency
control mechanism.
Serialization and ACID principles
Serialization also assures that transactions meet the ACID principles
for proper transaction processing. ACID stands for the following
properties:
- Atomicity - The transaction is an atomic unit, consisting
of a single collection of actions that is either completed in
full (committed) or aborted in full (rolled back).
- Consistency - All integrity constraints of the database
are fulfilled after the transaction has finished, thus leaving
the database in a consistent state.
- Isolation - A transaction must be processed in apparent
isolation. Data used by a transaction cannot change while the
transaction is being processed. Similarly, other transactions
can't use the results of the current transaction until it is committed
or aborted.
- Durability - Once a transaction commits, its results
must be durable, even in the case of software or hardware failures.
Partitioned and pipelined parallelism
All of the features discussed above work in concert to enable
two forms of parallel processing, partitioned and pipelined. Each
of these is briefly described in this section.
Partitioned parallelism improves intra- and inter-query
performance
Partitioned parallelism refers to a system's ability to
distribute a single query across multiple nodes at the same time.
For example, imagine you have to go through the phone book to
find the name of someone whose phone number you have. Obviously,
this could take a very long time. Now imagine that the phone book
is divided up into 100 sections of equal size, and 100 people
are simultaneously looking for that name. The result will be found
in approximately 1/100 the time. This is analogous to one of the
primary ways THOR maximizes intra-query processing performance.
Partitioned parallelism also refers to a system's ability to perform
multiple queries on a node at the same time. Returning to our
phone book example, this time imagine that your 100 people have
to find two names instead of one. Instead of looking for only
one name, each person can be looking for both names at the same
time. Because both queries ("find a name") are being
executed in parallel, results will be returned in less time than
if the queries were processed in serial, one after the other.
This is analogous to one of the primary ways THOR maximizes inter-query
processing performance.
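The phone-book analogy translates directly into a short sketch. In the
illustrative Python below (worker counts, names, and numbers are
invented for the example), the book is partitioned across workers, and
each worker can search its partition for one number or for several
numbers at once.

    # Sketch of the phone-book analogy: the "book" is partitioned across workers,
    # and each worker searches only its own partition -- for one lookup or for
    # several lookups at once. (Purely illustrative; not THOR code.)
    from concurrent.futures import ThreadPoolExecutor

    def search_partition(partition, wanted_numbers):
        """Each worker scans its slice of the book for every number of interest."""
        return {num: name for name, num in partition if num in wanted_numbers}

    def partitioned_lookup(book, wanted_numbers, workers=4):
        size = (len(book) + workers - 1) // workers
        partitions = [book[i:i + size] for i in range(0, len(book), size)]
        results = {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            for partial in pool.map(search_partition, partitions,
                                    [wanted_numbers] * len(partitions)):
                results.update(partial)          # merge partial answers
        return results

    book = [("Ada", "234-0001"), ("Bo", "555-0002"), ("Cy", "234-0003"), ("Di", "777-0004")]
    print(partitioned_lookup(book, {"234-0003", "555-0002"}))   # two queries, one pass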
Pipelined parallelism improves intra-query performance
Pipelined parallelism refers to a system's ability to begin
feeding results back as they are available. Continuing our phone
book example, imagine this time that you have to produce a sorted
list of all people whose phone number begins with 234. If you
are the query captain (as described in "How queries are executed"
on page 12), you are responsible for sorting the names your
100 workers find.
Instead of waiting for each of your workers to find all the matching
names, you ask them to hand you their list of names in sorted
batches of 20 at a time. Each time you receive another sorted
20 names, you inter-sort them with the other names you have already
received and sorted. In this way, names are being sorted almost
as fast as they are being found; there is no significant delay
between finding all the names and producing a sorted list. This
is analogous to another primary way THOR maximizes intra-query
processing performance.
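The same analogy can be sketched for pipelining. In the illustrative
Python below (batch size and data are invented for the example), each
worker yields its matches in sorted order and the "query captain"
inter-sorts the streams as they arrive, rather than waiting for all
workers to finish.

    # Sketch of pipelined parallelism in the phone-book analogy: workers yield
    # sorted batches as they find matches, and the "query captain" inter-sorts
    # the streams on the fly instead of waiting for everyone to finish.
    import heapq

    def worker_stream(section, prefix, batch_size=20):
        """Yield matching names from one section in sorted order."""
        matches = sorted(name for name, number in section if number.startswith(prefix))
        for i in range(0, len(matches), batch_size):
            yield from matches[i:i + batch_size]

    def pipelined_sorted_results(sections, prefix):
        # heapq.merge consumes the already-sorted streams lazily, so results are
        # produced almost as fast as the workers find them.
        streams = [worker_stream(section, prefix) for section in sections]
        return heapq.merge(*streams)

    sections = [
        [("Baker", "2345551"), ("Ames", "2340002")],
        [("Cole", "2340100"), ("Dunn", "9990000")],
    ]
    for name in pipelined_sorted_results(sections, "234"):
        print(name)          # Ames, Baker, Cole -- streamed in sorted order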