Many services have a job that is executed at regular intervals and does some sort of processing or aggregation. In most cases you read data from the database, do the needed calculations and write back the result. This is pretty straightforward and normally does not pose any major problems. You have @Scheduled in Spring, Scheduled Tasks in Play or a simple CRON job that runs at regular intervals. Everything works fine if you have a single instance doing the processing! But what if you have multiple instances that start at the same time? How do you avoid the same entry being processed twice, by two different instances of the same job?
Case Study
Let’s assume that you have a job that needs to process files which contain information related to users. They can be static files that are generated each day, files uploaded by third-party services to your system or just some files on disk that hold data. When a file is uploaded, you write a few things in your database that make it easy to identify each file: upload path, file name, upload timestamp, whether it has been processed or not, maybe who uploaded it. What the files contain is not that important; what matters for this case study is that a file can’t be processed twice in the same day.
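For reference, here is a minimal sketch of what the FileInfo entity could look like. The exact metadata field names (uploadPath, uploadedBy, and so on) are my assumptions; only id, processed and lockId are actually used by the code in this article, and I’m assuming UUID primary keys and the javax.persistence annotations:

import java.time.Instant;
import java.util.UUID;
import javax.persistence.*;

@Entity
@Table(name = "file_info")
public class FileInfo {

    @Id
    private UUID id;                  // matches getId() used later in the article

    private String uploadPath;        // hypothetical metadata fields described above
    private String fileName;
    private Instant uploadTimestamp;
    private String uploadedBy;

    private Boolean processed;        // referenced as "processed" in the queries

    @Column(name = "lock_id")
    private UUID lockId;              // added later in the article for the locking approaches

    public UUID getId() { return id; }
    public Boolean getProcessed() { return processed; }
    public void setProcessed(Boolean processed) { this.processed = processed; }
    public UUID getLockId() { return lockId; }
    public void setLockId(UUID lockId) { this.lockId = lockId; }
}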
Next, you write a job that retrieves the file information from the database and starts processing it. At the end, the result is persisted and the file is marked as ‘Processed’. A simplified version can look something like the code below. I explicitly wrote the criteria for the select here and not in a DAO to make the errors more obvious.
public class CSVProcessorJob {

    @Autowired
    private EntityManager entityManager;

    @Transactional(readOnly = false) // we merge updated entities, so the transaction cannot be read-only
    @Scheduled(cron = "0 0 1 * * *") // 1AM each night
    public void processFiles() {
        CriteriaBuilder cb = entityManager.getCriteriaBuilder();
        CriteriaQuery<FileInfo> criteriaQuery = cb.createQuery(FileInfo.class);
        Root<FileInfo> root = criteriaQuery.from(FileInfo.class);
        Predicate notProcessedPredicate = cb.equal(root.get("processed"), Boolean.FALSE);
        criteriaQuery.select(root).where(notProcessedPredicate);
        TypedQuery<FileInfo> query = entityManager.createQuery(criteriaQuery);
        List<FileInfo> files = query.getResultList();
        for (FileInfo fileInfo : files) {
            processFile(fileInfo);
            fileInfo.setProcessed(Boolean.TRUE);
            entityManager.merge(fileInfo);
        }
    }
}
Everything works fine for a few months, but as your application grows, you find that there are too many files to process and you are falling behind. So, you create a new instance of your job, on a completely different machine. Twice the power means that you will be processing twice as fast.
After the first night, the complaints start to come in. Not only did the jobs not finish faster, your files also got processed twice.
Wrong Approach – How NOT to do it
Now, it may be obvious what the problem is. You could fetch the file info entries one by one, as in the code below, but that means too many trips to the database:
public class CSVProcessorJob {

    @Autowired
    private EntityManager entityManager;

    @Transactional(readOnly = false)
    @Scheduled(cron = "0 0 1 * * *") // 1AM each night
    public void processFiles() {
        CriteriaBuilder cb = entityManager.getCriteriaBuilder();
        CriteriaQuery<FileInfo> criteriaQuery = cb.createQuery(FileInfo.class);
        Root<FileInfo> root = criteriaQuery.from(FileInfo.class);
        Predicate notProcessedPredicate = cb.equal(root.get("processed"), Boolean.FALSE);
        criteriaQuery.select(root).where(notProcessedPredicate);
        TypedQuery<FileInfo> query = entityManager.createQuery(criteriaQuery);
        query.setMaxResults(1);
        FileInfo fileInfo = query.getSingleResult();
        while (fileInfo != null) {
            processFile(fileInfo);
            fileInfo.setProcessed(Boolean.TRUE);
            entityManager.merge(fileInfo);
            // This in reality throws an exception when nothing is left, but we can
            // assume it returns null to simplify the example
            fileInfo = query.getSingleResult();
        }
    }
}
You could reduce the number of trips to the database by retrieving the files in batches of 10, 100 or even 1000. For a while everything may seem to work correctly, until one day you see a few files that were again processed twice. The risk of processing the same file multiple times increases with the number of processing nodes and the size of the batch.
Why is this happening?
The double-processing occurs because you have a race condition in your code. Not at thread level, but between the actual instances of the job. Each instance independently retrieves a list of files that have not yet been processed. A file is marked as ‘processed’ only after the job finishes with it, but by then the other instance may have already retrieved it from the database; it does NOT know about the status update, so it processes the file again.
You could mark all the retrieved files as processed right away, and this does indeed shrink the window (and the odds of it happening), but the problem can still occur. If the update from the first instance comes after the retrieval by the second, you will still process some of the files twice:
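One possible interleaving (a sketch, with A and B being two instances of the same job):
1. Instance A selects all files with processed = FALSE and gets files 1–10.
2. Instance B runs the same select before A’s updates are committed and also gets files 1–10.
3. Instance A marks files 1–10 as processed and starts working on them.
4. Instance B knows nothing about A’s updates and processes files 1–10 a second time.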
Approach 2 – Double-select with a lock
The above approach won’t work in a multi-instance environment. We need to make sure that the files we are trying to process are not already being processed by another instance. To do this, we add another column to our entity, called “lock_id”. We select the entries that we want to process with a “WHERE lock_id IS NULL” condition and, after we retrieve them, we populate the column for those entries with a unique ID. The code looks something like this:
@Transactional(readOnly = false)
@Scheduled(cron = "0 0 1 * * *") // 1AM each night
public void processFiles() {
    CriteriaBuilder cb = entityManager.getCriteriaBuilder();
    CriteriaQuery<FileInfo> criteriaQuery = cb.createQuery(FileInfo.class);
    Root<FileInfo> root = criteriaQuery.from(FileInfo.class);
    Predicate notProcessedPredicate = cb.equal(root.get("processed"), Boolean.FALSE);
    Predicate lockIdNullPredicate = cb.isNull(root.get("lockId"));
    criteriaQuery.select(root).where(notProcessedPredicate, lockIdNullPredicate);
    TypedQuery<FileInfo> query = entityManager.createQuery(criteriaQuery);
    query.setMaxResults(10);
    UUID generatedLockId = UUID.randomUUID();
    List<FileInfo> files = query.getResultList();
    files.forEach(file -> {
        file.setLockId(generatedLockId);
        entityManager.merge(file);
    });
    processRetrievedFiles(files);
}
This approach still won’t work, even though it is NOT as bad as the previous one. There is still a small window between when we get the files and when we update them with the lock ID. To fix this, we need to do a double select.
The first select retrieves the list of files that we assume we can process. Next, we do a batch update that sets our lock_id on all those entries, using an IN statement and keeping the lock_id IS NULL condition. Once the update is done, we do another select using our lockId as the search key. With the code below, we can guarantee that no files already reserved by another instance will be given to us.
@Transactional(readOnly = false)
@Scheduled(cron = "0 0 1 * * *") // 1AM each night
public void processFiles() {
    CriteriaBuilder cb = entityManager.getCriteriaBuilder();
    // Retrieve the files that are not yet reserved by another instance
    CriteriaQuery<FileInfo> criteriaQuery = cb.createQuery(FileInfo.class);
    Root<FileInfo> root = criteriaQuery.from(FileInfo.class);
    Predicate notProcessedPredicate = cb.equal(root.get("processed"), Boolean.FALSE);
    Predicate lockIdNullPredicate = cb.isNull(root.get("lockId"));
    criteriaQuery.select(root).where(notProcessedPredicate, lockIdNullPredicate);
    TypedQuery<FileInfo> query = entityManager.createQuery(criteriaQuery);
    query.setMaxResults(10);
    List<FileInfo> files = query.getResultList();
    // Set our lockId on the entries retrieved above AND that have not yet been reserved by another instance
    UUID generatedLockId = UUID.randomUUID();
    List<UUID> fileIds = files.stream().map(FileInfo::getId).collect(Collectors.toList());
    CriteriaUpdate<FileInfo> criteriaUpdate = cb.createCriteriaUpdate(FileInfo.class);
    Root<FileInfo> updateRoot = criteriaUpdate.from(FileInfo.class);
    CriteriaBuilder.In<UUID> inPredicate = cb.in(updateRoot.<UUID>get("id"));
    for (UUID id : fileIds) {
        inPredicate.value(id);
    }
    criteriaUpdate.set(updateRoot.get("lockId"), generatedLockId);
    criteriaUpdate.where(cb.isNull(updateRoot.get("lockId")), inPredicate);
    entityManager.createQuery(criteriaUpdate).executeUpdate();
    // Retrieve only the ones where we managed to set our lockId
    criteriaQuery = cb.createQuery(FileInfo.class);
    root = criteriaQuery.from(FileInfo.class);
    Predicate lockIdIsOursPredicate = cb.equal(root.get("lockId"), generatedLockId);
    criteriaQuery.select(root).where(cb.equal(root.get("processed"), Boolean.FALSE), lockIdIsOursPredicate);
    query = entityManager.createQuery(criteriaQuery);
    query.setMaxResults(10);
    files = query.getResultList();
    processRetrievedFiles(files);
}
This approach now works, but it has two disadvantages. First, you need to make multiple trips to the database: you get a list of files, update them and then retrieve them again. Also, you may only manage to lock a subset of the files you initially retrieved, resulting in fewer files being processed in one batch even though there are still unprocessed files left.
Secondly, you need a smarter stop condition. Normally, you run the above code in a loop that keeps going as long as the last SELECT statement returned files. However, this may finish prematurely. If all the files you retrieve are reserved by other instances before your instance manages to reserve them, the update won’t do anything and the final getResultList() will return an empty list, leaving the impression that all the files were processed, even though there may still be files left, just not from the initial list you got back. Yes, eventually all of them will get processed, but one node may end up doing more work than the others. The naive loop sketched below illustrates the problem.
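To make this concrete, here is a minimal sketch of such a naive loop. processBatch() is a hypothetical method that wraps the select, the lock update and the final select from the code above, processes the files it managed to lock and returns them:

// Naive driver loop for the double-select approach
List<FileInfo> lockedFiles = processBatch();   // hypothetical wrapper around the code above
while (!lockedFiles.isEmpty()) {
    lockedFiles = processBatch();
}
// If every file from the initial SELECT was reserved by other instances first, the
// update matches nothing, processBatch() returns an empty list and this instance
// stops, even though unprocessed files may still exist in the table.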
Correct Approach – Locking the entry before retrieving it (UPDATE with limit)
Now, here is where we can improve things a lot. The first improvement is making sure that there is no way of retrieving more entries than we can actually lock. The second is doing only one SELECT. To achieve this, we need to lock all the entries first and only do the selection after they are locked. This is called an ‘UPDATE with limit’ and sadly, it is NOT supported by Hibernate’s query language. However, we can write a simple native SQL query that does this just fine:
UPDATE file_info SET lock_id = ? WHERE id IN (
    SELECT id FROM file_info WHERE lock_id IS NULL AND processed = FALSE LIMIT ?
)
We can use a native query to execute this statement, which will lock our entries. After that, we just retrieve them by searching for the lockId that we generated. The code now works properly across multiple instances that process at the same time, and we can repeat it for as long as we get results back. When no more files are returned by the select statement, we know for sure that ALL the files were processed, and no node stops while there is still work left.
@Transactional(readOnly = false)
@Scheduled(cron = "0 0 1 * * *") // 1AM each night
public void processFiles() {
    CriteriaBuilder cb = entityManager.getCriteriaBuilder();
    // Our update SQL statement
    String sql = "UPDATE file_info SET lock_id = ? WHERE id IN (\n" +
            "\tSELECT id FROM file_info WHERE lock_id IS NULL AND processed = FALSE LIMIT ?\n" +
            ")\n";
    // Do the update in its own JPA transaction (jpaApi is Play's JPAApi, assumed to be
    // injected) so that the locks are committed and visible to the other instances
    UUID generatedLockId = UUID.randomUUID();
    jpaApi.withTransaction(em -> {
        Query query = em.createNativeQuery(sql);
        query.setParameter(1, generatedLockId);
        query.setParameter(2, 10);
        return query.executeUpdate();
    });
    // Retrieve the entries that we just locked
    CriteriaQuery<FileInfo> criteriaQuery = cb.createQuery(FileInfo.class);
    Root<FileInfo> root = criteriaQuery.from(FileInfo.class);
    Predicate notProcessedPredicate = cb.equal(root.get("processed"), Boolean.FALSE);
    Predicate lockIdPredicate = cb.equal(root.get("lockId"), generatedLockId);
    criteriaQuery.select(root).where(notProcessedPredicate, lockIdPredicate);
    TypedQuery<FileInfo> typedQuery = entityManager.createQuery(criteriaQuery);
    List<FileInfo> files = typedQuery.getResultList();
    processRetrievedFiles(files);
}
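To drain the whole table, you wrap this in a loop and keep going until the select returns nothing. Here is a minimal sketch, assuming the body of processFiles() above is extracted into a hypothetical lockAndFetchBatch(lockId, batchSize) method that runs the native UPDATE and then returns the files carrying that lockId:

// Driver loop for the UPDATE-with-limit approach. lockAndFetchBatch(lockId, batchSize)
// is a hypothetical refactoring of the method above: it executes the native UPDATE
// with the given lock ID and limit, then selects and returns the locked files.
List<FileInfo> files = lockAndFetchBatch(UUID.randomUUID(), 10);
while (!files.isEmpty()) {
    processRetrievedFiles(files);
    // generate a fresh lock ID for every batch
    files = lockAndFetchBatch(UUID.randomUUID(), 10);
}
// An empty result means there is nothing left to lock: every file is either processed
// or already reserved by another instance.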
*As a note, if each file is processed only once (and not again at regular intervals), you don’t even need the ‘processed’ column. Just leave the lock_id in place after processing and generate a new UUID for every batch; any row with a non-NULL lock_id is, by definition, already taken.
Conclusions
In this article, you learned how to do an UPDATE with limit using Hibernate (even though it is a native query) and how to properly lock entries before processing them so that you avoid double processing in a multi-instance job. The approach is easy to implement, and it works on any database that supports LIMIT in a subquery. The schema change is a single extra column and the code stays simple.