Column-oriented VS Row-oriented databases
A database management system (DBMS) is a system that stores data tables. Those tables are composed of rows and columns. A table can be stored in a column-oriented type or row-oriented. Depending on the type of applications and workloads you better choose one or the other. However, the usage is the same in both orientation and SQL queries do not differ.
|Column 1||Column 2||Column 3|
The row orientation is the standard one, where you work with the data “one” row/record at a time. When you read the row you retrieve all the columns of this one. This is perfectly suited for applications where you want to write (INSERT or UPDATE) or as mentionned, retrieve one row at a time.
To illustrate this point, let’s say you have a table describing employees of a company. This table contains the name of the employees and several columns describing their projects and their successes. When a manager wants to check the status of an employee in this table it will only read one row, the one concerning this employee. The manager doesn’t need to get the informations about the other employees and would be interested in getting all the information (columns) about this particular employee. The same applies if the manager wants to add a project in the employee’s row.
In this situation a row oriented database would be beneficial to have better performance. This type of workload can be defined as a transaction and more precisely as an OLTP system. OLTP standing for Online Transactional Processing is focused on transaction oriented tasks involving INSERT, UPDATE, DELETE with small amounts of data.
|Row 1||John Doe||3||…||NO|
|Row 2||Julia Smith||5||…||YES|
|Row 3||Max Berti||6||…||YES|
Column oriented or columnar databases are getting increasingly popular with the rise of analytics and big data. In fact, it reduces drastically the I/O requirements for analytical applications and increase the speed by only retrieving the wanted columns.
Let’s say you have a huge table containing terabytes of data describing a large amount of products. If you want to get the average price of all of the products, retrieving all the columns of all the rows instead of retrieving only the price column doesn’t seem like a good idea. A column-oriented database would only provide the price column with optimized access really fast. This type of database performs incredibly well for OLAP (Online Analytical Processing) usecases.
Moreover, as the data is stored by columns the cardinality (i.e the range of different possible values) of the column is easily computed. This allows the database to use this information to compress a complete column much more efficiently than a row-oriented one. You need much less storage space in a column store than a row store database because of this compression.
The recent applications are more and more about training datasets for machine learning, analytics studies, visualization and so on. Most of the time, you do not need all the columns of a table for this kind of usecases, but you select only some of them instead. You usally read the name Data Warehouse for this kind of database, which is defined by the column-oriented characteristics.
Which one to choose ?
The choice is important and can affect the performance of your overall applications with your database.
If you are interested in inserting, updating, deleting and retrieving single record then a row-oriented database would be more efficient. Let’s assume a bank has a database describing its users. When a user wants to know his current money balance, it will create a data access to retrieve the row containing this information. When he withdraws money, this row is updated. In this situation it is important to be performant in single row access, to provide information to the users really fast.
On the other hand, if you are interested in analytics to compute statistics and operations using multiple data records then a column-oriented database is much more performant. Indeed, if your queries are about aggregating single column (e.g. sum all the values) or retrieving a few columns among the complete table, a columnar database is perfectly suited. Additionally, the compression is more important here and the tables are efficiently stored.
Can we combine both column-oriented and row-oriented databases characteristics ?
It is possible but as always the performance is about tradeoff. A system fully dedicated to one of the orientation will be more efficient than one mixing both, depending on the usecases types.