What is data replication? Efficient database synchronization
What is data replication
Data replication is the process and technology used to create and maintain copies of a database efficiently. It dynamically distributes and synchronizes data across different data platforms such as cloud instances, mainframes, and physical servers. Data replication improves data availability, accessibility, and reliability.
You can configure data replication to fit your specific needs. Data can be replicated either synchronously or asynchronously. It can be copied on demand, or transferred in bulk or in batches according to a schedule. You can also replicate data in real time, whenever data is inserted, deleted, or updated in the original database.
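To make the difference between the two modes concrete, here is a minimal in-memory sketch. The `Primary` class, its `pending` queue, and the `flush` method are hypothetical names chosen for illustration; a real database replicates over a network and handles failures, which this sketch ignores.

```python
from collections import deque


class Primary:
    """Toy primary store with a single replica (illustrative only)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.data = {}          # the original database
        self.replica = {}       # the replicated copy
        self.pending = deque()  # changes not yet applied to the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Synchronous: the replica is updated before the write returns.
            self.replica[key] = value
        else:
            # Asynchronous: the change is queued and applied later.
            self.pending.append((key, value))

    def flush(self):
        """Apply all queued changes to the replica and return how many were applied."""
        applied = 0
        while self.pending:
            key, value = self.pending.popleft()
            self.replica[key] = value
            applied += 1
        return applied


sync_db = Primary(synchronous=True)
sync_db.write("order:1", "paid")
print(sync_db.replica == sync_db.data)    # replica is already up to date

async_db = Primary(synchronous=False)
async_db.write("order:1", "paid")
print(async_db.replica == async_db.data)  # replica lags until flush() runs
async_db.flush()
print(async_db.replica == async_db.data)
```

The synchronous variant never lets the replica fall behind, at the price of slower writes. The asynchronous variant keeps writes fast but leaves a window in which the replica is stale.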
Data replication has multiple use cases, but one of the main ones is to create a backup of the data. If the original database stops working after a disaster, the replicated database can take over. Another use case is to replicate data to the cloud to benefit from the analytics applications available in that environment.
Benefits of data replication
It may sound expensive to pay for another database system or a large cloud instance just to hold the same data as the original database. But data replication offers many advantages and can end up saving both money and computational power.
It improves:
- availability and reliability, by enabling High Availability and Disaster Recovery (HADR). You always have a working system available, even if the primary one goes down. If a disaster happens, a standby system that already holds the same data thanks to replication takes over. In the end, even if the main system goes down, users see almost no difference.
- latency and network performance, by replicating data to different geographical locations. For example, a team in France accessing a database located in California experiences much higher latency than a team accessing a replica of the same data located in France.
- analytics performance, by replicating data into a data warehouse optimized for Online Analytical Processing (OLAP). If the main database is optimized for Online Transactional Processing (OLTP) and handles a huge number of transactions, it is poorly suited to heavy analytical queries. A data warehouse holding replicated data lets data scientists run extensive analytics tasks without loading the main database.
- the performance of the original database system, by distributing requests across several database systems.
How does data replication work
Data replication methods
Data replication can be handled in different ways:
- Full table replication copies everything from the source to the destination, including new, updated, and existing data. This method requires more processing power and generates larger network loads than copying only changed data. It is useful if records are hard deleted from a source periodically, or if the source doesn’t have a suitable column for key-based replication.
- Key-based incremental replication, also known as incremental data capture or incremental loading, copies only the data changed since the previous update. It is much more efficient than full table replication, as it copies only a few rows instead of a full table. However, it cannot detect hard deletes: a row deleted from the original database no longer appears when the source is queried on the replication key, so the deletion is never applied to the replicated database.
- Log-based incremental replication replicates data based on the original database's log file, which lists all the changes. This method is the most efficient, but not all databases provide a suitable log file. The major databases, such as Oracle, DB2, MySQL, and PostgreSQL, support this method.
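The contrast between full table and key-based incremental replication can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not a production design: the `orders` table, its `updated_at` column (used here as the replication key), and the function names are assumptions made for the example.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical 'orders' table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)"
    )
    return db


def full_table_replicate(source, replica):
    """Full table replication: wipe the replica and copy every source row."""
    rows = source.execute("SELECT id, amount, updated_at FROM orders").fetchall()
    replica.execute("DELETE FROM orders")
    replica.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    replica.commit()
    return len(rows)


def key_based_replicate(source, replica, last_seen):
    """Key-based incremental replication: copy only rows whose replication
    key (updated_at) is greater than the last value already replicated."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    replica.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)
    replica.commit()
    bookmark = rows[-1][2] if rows else last_seen
    return len(rows), bookmark


source, replica = make_db(), make_db()
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, 1), (2, 20.0, 2), (3, 30.0, 3)],
)
source.commit()
copied, bookmark = key_based_replicate(source, replica, last_seen=0)

# One row is updated and one is hard deleted in the source.
source.execute("UPDATE orders SET amount = 25.0, updated_at = 4 WHERE id = 2")
source.execute("DELETE FROM orders WHERE id = 3")
source.commit()

copied, bookmark = key_based_replicate(source, replica, bookmark)
print("rows copied incrementally:", copied)
print("replica ids:", [r[0] for r in replica.execute("SELECT id FROM orders ORDER BY id")])

full_table_replicate(source, replica)
print("replica ids after full copy:", [r[0] for r in replica.execute("SELECT id FROM orders ORDER BY id")])
```

The incremental pass transfers only the updated row, but the hard-deleted row stays in the replica until a full table copy runs. This is why full table replication remains useful when the source hard deletes records.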
Data replication schemes
Data replication in distributed systems can be carried out through different schemes:
- Full data replication enables the complete database to be replicated at every site of the distributed system. This scheme maximizes data availability and redundancy across a wide area network.
- Partial replication replicates only some parts of the database to each site of the distributed system. For example, the original database could replicate only the tables related to Sales to one replica database and only the tables related to Expenses to another.
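The two schemes can be sketched as follows, modelling a database as a plain dictionary that maps table names to rows. The table names, site names, and the `plan` structure are hypothetical and serve only to mirror the Sales and Expenses example above.

```python
from copy import deepcopy


def full_replication(source, sites):
    """Full replication: every site receives a copy of the complete database."""
    return {site: deepcopy(source) for site in sites}


def partial_replication(source, plan):
    """Partial replication: each site receives only the tables listed for it.

    'plan' maps a site name to the table names it should hold.
    """
    return {
        site: {table: deepcopy(source[table]) for table in sorted(tables) if table in source}
        for site, tables in plan.items()
    }


# Hypothetical source database: table name -> list of rows.
source = {
    "sales_orders": [{"id": 1, "total": 120.0}],
    "sales_customers": [{"id": 10, "name": "Acme"}],
    "expense_reports": [{"id": 7, "amount": 45.5}],
}

# Hypothetical plan: Sales tables go to one replica, Expenses tables to another.
plan = {
    "sales_replica": ["sales_orders", "sales_customers"],
    "expenses_replica": ["expense_reports"],
}

full = full_replication(source, ["site_a", "site_b"])
partial = partial_replication(source, plan)

for site, tables in partial.items():
    print(site, "holds", sorted(tables))
```

Full replication gives every site the whole database at the cost of more storage and more synchronization traffic. Partial replication keeps each replica smaller, but a site can only answer queries about the tables it holds.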