1 – Investigation to zero in on the real culprit
We started by looking at the resource utilization graphs for our dynos. Worker dynos do most of the heavy lifting, since running background tasks is exactly what they are for, so they are more prone to memory errors than other dynos. We therefore turned our focus to the worker dynos and tried to zero in on the specific tasks that were consuming the most memory. Whenever a memory-hungry task started, the memory usage graph on Heroku showed a large spike and R14 errors appeared under Events. We made a list of the tasks that produced these spikes.
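The Heroku graphs show when a spike happens; to double-check which task is behind it, one option is to compare tasks locally. The following is a minimal sketch, not what we ran in production: `measure_allocations` and the two stand-in tasks are hypothetical, and the count of allocated objects is only a rough proxy for memory usage.

```ruby
# Hypothetical helper: runs a task and reports how many objects it allocated.
def measure_allocations(name)
  GC.start # settle the heap so earlier garbage does not skew the numbers
  before = GC.stat(:total_allocated_objects)
  yield
  allocated = GC.stat(:total_allocated_objects) - before
  { task: name, allocated_objects: allocated }
end

# Stand-in tasks; in a real app these would be the actual background jobs.
tasks = {
  "small_report" => -> { Array.new(1_000) { |i| "row #{i}" } },
  "big_export"   => -> { Array.new(100_000) { |i| "row #{i}" } }
}

results = tasks.map { |name, work| measure_allocations(name) { work.call } }
ranked  = results.sort_by { |r| -r[:allocated_objects] }

ranked.each { |r| puts "#{r[:task]}: #{r[:allocated_objects]} objects allocated" }
```

The task at the top of the ranking is the first candidate to compare against the spikes in the Heroku graph.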
2 – Fixed issues with the background jobs
Once we knew exactly which jobs were responsible for the R14 errors, we could begin fixing them. We tackled them on two fronts.
- A. Optimized memory usage
Writing code to fetch data from the database is a cakewalk with ActiveRecord. But ActiveRecord, if not used carefully, can lead to heavy memory usage. We needed to spot the memory bloat and fix it. We used the following gems to address memory issues:
MemoryProfiler gem to profile memory usage by different background jobs.
Bullet gem to track and reduce the number of queries used by jobs to fetch data.
Oj gem to optimize the JSON parsing process, as our application used JSON parsing very often.
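A common source of memory bloat in jobs is loading every row at once when the work could be done in batches. The sketch below illustrates the idea in plain Ruby so it runs without a database; `each_row` and `total_in_batches` are hypothetical names, and the data is made up. In ActiveRecord, `find_each` (with its `batch_size:` option) plays the role that `each_slice` plays here. Whether this applies to a given job is something the profiler output has to confirm.

```ruby
# Stand-in data source that yields rows one at a time, like a database cursor.
def each_row(count)
  return enum_for(:each_row, count) unless block_given?

  count.times { |i| yield({ id: i + 1, amount: (i + 1) % 100 }) }
end

# Processes rows in fixed-size batches so only one batch is held at a time.
def total_in_batches(rows, batch_size: 1_000)
  total = 0
  largest_batch = 0
  rows.each_slice(batch_size) do |batch|
    largest_batch = [largest_batch, batch.size].max
    total += batch.sum { |row| row[:amount] }
  end
  { total: total, largest_batch: largest_batch }
end

result = total_in_batches(each_row(10_000), batch_size: 1_000)
puts "total=#{result[:total]} largest_batch=#{result[:largest_batch]}"
```

However many rows there are, the job never holds more than `batch_size` of them at once, which keeps peak memory flat as the table grows.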
- B. Fixed errors
For the background jobs in question, we reviewed the errors reported by our error reporting tool, New Relic in our case. We tracked down and fixed those errors to cut the job retries they were causing.
3 – Rearrangements
We rearranged the job schedule to reduce congestion at any given time. When scheduling tasks, we generally make sure that no two big jobs run close to each other, so that they do not overlap and push up memory utilization. Over time, though, we may find that the schedule no longer works as intended, so it needs to be rearranged to reflect the changed circumstances and memory loads.
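Checking a schedule for overlaps by eye gets harder as jobs are added, so a small script can help. This is a minimal sketch under assumed data: the job names, start times, and durations are invented, and `overlapping_pairs` is a hypothetical helper. It also assumes no job runs past midnight.

```ruby
# Hypothetical schedule entry: start in minutes after midnight, duration in minutes.
Job = Struct.new(:name, :start_min, :duration_min) do
  def end_min
    start_min + duration_min
  end
end

# Returns the name pairs of jobs whose run windows overlap.
def overlapping_pairs(jobs)
  clashes = jobs.combination(2).select do |a, b|
    a.start_min < b.end_min && b.start_min < a.end_min
  end
  clashes.map { |a, b| [a.name, b.name] }
end

schedule = [
  Job.new("nightly_export", 120, 45), # 02:00 to 02:45
  Job.new("reindex_search", 150, 30), # 02:30 to 03:00
  Job.new("send_digests",   240, 20)  # 04:00 to 04:20
]

overlapping_pairs(schedule).each { |a, b| puts "#{a} overlaps #{b}" }
```

Feeding in the durations actually observed in production, rather than the ones assumed when the schedule was first written, is what makes a check like this useful.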
4 – Limited the retry attempts of jobs to a maximum of 3
(This was suitable for our application; yours may need more or fewer.) Background jobs may fail for reasons that have nothing to do with our code, e.g. something goes wrong at a remote microservice. That is why background processing libraries let a failed job be tried again. This is good practice, because some failures are unavoidable. The downside is that if we don’t limit the retries, the job keeps being retried many times over, which may be completely unnecessary. If you are sure that the failure is not caused by external factors, retrying makes no sense at all. For example, one of our jobs was trying to fetch data from a given URL and was failing because a wrong URL had been provided. This clearly needs no retries. So we can restrict retries to save our resources.
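The distinction between failures worth retrying and failures that will never succeed can be sketched in plain Ruby. This is an illustration only, not the API of any particular background processing library: `TransientError`, `PermanentError`, and `run_with_retries` are hypothetical names, and real libraries expose the retry limit as their own setting.

```ruby
# Hypothetical error classes: one for outside failures, one for bad input.
class TransientError < StandardError; end
class PermanentError < StandardError; end

# Runs the block, retrying transient failures at most max_retries times.
# Any other error is not rescued, so it propagates on the first failure.
def run_with_retries(max_retries: 3)
  retries = 0
  begin
    yield
  rescue TransientError
    raise if retries >= max_retries

    retries += 1
    retry
  end
end

# A remote service that recovers on the third call is worth retrying.
calls = 0
result = run_with_retries do
  calls += 1
  raise TransientError, "remote service unavailable" if calls < 3

  "fetched on call #{calls}"
end
puts result

# A wrong URL will never succeed, so the job gives up immediately.
bad_url_calls = 0
begin
  run_with_retries do
    bad_url_calls += 1
    raise PermanentError, "wrong URL provided"
  end
rescue PermanentError => e
  puts "gave up after #{bad_url_calls} call: #{e.message}"
end
```

With `max_retries: 3`, a job that keeps failing for transient reasons runs four times in total: the first attempt plus three retries.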
5 – Installed Jemalloc addon
Heroku states that using malloc in a multithreaded environment may lead to excessive memory usage. As suggested, we installed jemalloc, an alternative malloc implementation that tries to save memory by avoiding memory fragmentation. It helped us save a considerable amount of memory.
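After installing jemalloc, it is worth confirming that the Ruby process is really using it. The check below is a heuristic sketch, not a guarantee: `jemalloc_hint` is a hypothetical helper, and it assumes jemalloc is either injected through the `LD_PRELOAD` environment variable or linked in when Ruby was compiled. Other setups may not be detected.

```ruby
require "rbconfig"

# Heuristic check: looks for jemalloc in the places it usually shows up.
def jemalloc_hint(env: ENV, main_libs: RbConfig::CONFIG["MAINLIBS"].to_s)
  return :preloaded if env.fetch("LD_PRELOAD", "").include?("jemalloc")
  return :compiled_in if main_libs.include?("jemalloc")

  :not_detected
end

puts "jemalloc: #{jemalloc_hint}"
```

Running this inside a worker dyno, rather than locally, is what tells you whether the dyno itself picked up the allocator.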
6 – Increased worker memory size
If the R14 errors don’t go away even after trying all the other weapons in our arsenal :), we really need to think about increasing the memory size of the dynos concerned. But this should be the last resort, and we must keep in mind that it adds to our billing costs. In our case, we increased the worker dyno’s memory from 512 MB to 1024 MB; ours was a small application :).