Published Mar 8, 2020
At the time of this writing, we run a couple thousand machine learning tasks in production in the background every few hours — from predictive analytics to anomaly detection to dynamic pricing.
For example, we run predictions for a majority of our 2,000+ locations every 3 hours, and we run over a hundred dynamic pricing tasks every hour. Oftentimes it is best for these tasks to run within a small time window. A good number of our machine learning models are trained with hour-of-the-day as a key feature, and they yield the best results when run near the beginning of each hour.
When we first got started, I wrote Python scripts that ran these models in threads, then slept until the end of the hour. However, this quickly became untenable as the number of locations grew — threads failed all the time, retry logic became more and more complicated, CPU and memory needs went through the roof due to bursty concurrency, and other services on the cluster crashed at the beginning of each hour… Not to mention it was almost impossible to monitor whether each task succeeded or failed, how long it ran, or to get its full logs. Babysitting these tasks was hard!
After several painful iterations, I finally found a perfect pair to take care of these kids — Airflow + AWS ECS Fargate. This setup solves most of my headaches, and runs our tasks in a much more reliable and scalable fashion.
There are three key components in this setup:
The scheduler continuously checks all tasks, and dispatches the ones that are scheduled to run to workers. Workers call executors to send these tasks to Fargate to run. After a task is sent to Fargate, the sending worker continuously monitors its status and health. Once the task is done, the worker reports back to the scheduler, then moves on to its next task. Note that the worker is occupied and won’t execute other tasks while waiting for the Fargate task to finish.
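The worker loop described above can be sketched in a few lines. This is a simulation, not our production code: `submit_to_fargate` and `get_task_status` are hypothetical stubs standing in for the real ECS API calls (boto3's `run_task` and `describe_tasks` in an actual deployment).

```python
import time

# Hypothetical stubs standing in for real ECS API calls. Here they just
# simulate a Fargate task that finishes after a few status polls.
_poll_count = {}

def submit_to_fargate(task_id):
    _poll_count[task_id] = 0
    return task_id  # pretend this is the Fargate task ARN

def get_task_status(arn):
    _poll_count[arn] += 1
    return "STOPPED" if _poll_count[arn] >= 3 else "RUNNING"

def run_task_on_fargate(task_id, poll_interval=0.01):
    """What each Airflow worker does: dispatch, then block until done."""
    arn = submit_to_fargate(task_id)
    # The worker is occupied here -- it polls and waits
    # instead of picking up other tasks.
    while get_task_status(arn) != "STOPPED":
        time.sleep(poll_interval)
    return "success"

print(run_task_on_fargate("predict-location-42"))  # → success
```

The blocking `while` loop is exactly the limitation discussed later: one worker slot is held for the full duration of each Fargate task.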
This setup solves so many of my problems!
Now the tasks are very scalable, limited by just two factors: how many workers we have, and how many concurrent Fargate tasks AWS allows us to run. For workers, we can easily increase their number to handle extra load. For Fargate, AWS resources are virtually infinite to us.
Now we can delegate retry logic to Airflow: how long before a task times out, how many times to retry, how long to wait between retries, etc. I no longer have to handle any of this in my code!
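These knobs all map to task-level arguments in Airflow. A sketch of the kind of settings we pass — the keys match Airflow's task arguments, while the values here are purely illustrative:

```python
from datetime import timedelta

# Illustrative retry settings; pass these as default_args to a DAG
# (or per operator) and Airflow handles the rest.
default_args = {
    "retries": 3,                             # how many times to retry
    "retry_delay": timedelta(minutes=5),      # how long to wait between retries
    "retry_exponential_backoff": True,        # back off progressively on repeats
    "execution_timeout": timedelta(hours=1),  # how long before a task times out
}
```

Because these live in the DAG definition rather than in the model code, tuning them requires no change to the machine learning tasks themselves.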
We can now easily control how many tasks to run at the same time through Airflow pools. Goodbye to the old threading days!
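Conceptually, an Airflow pool is a named set of slots capping how many tasks run at once. A rough stdlib analogy (the pool name and size here are made up; in Airflow you configure pools in the UI and tag tasks with a `pool` argument):

```python
import threading

# An Airflow pool with 5 slots behaves like a semaphore:
# tasks beyond the cap simply wait for a slot to free up.
fargate_pool = threading.Semaphore(5)

def run_in_pool(task):
    with fargate_pool:  # blocks until one of the 5 slots is available
        return task()

print(run_in_pool(lambda: "done"))  # → done
```

The difference, of course, is that Airflow tracks pool slots centrally across all workers, not within a single process.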
We can now see which tasks succeeded, which failed, which retried, and how long they ran, all through the Airflow UI. With our custom Fargate executor, we can even see all task logs in Airflow. Failures and bugs can now be spotted with ease.
With all the non-functional components extracted to Airflow, I can focus on doing just machine learning in the code: Train or run models, and write results to databases.
We no longer need to reserve EC2 instances to handle the peak load. We let AWS handle the bursty demand, and we simply pay for what we use. Of course, we pay an extra AWS tax for this flexibility, but the costs are clearly associated with each task, so we can better estimate them and include them in our product pricing.
I am just being picky here. There are several places in this setup I wish could be better.
I hinted at this issue earlier. Each worker needs to wait for its Fargate task to finish, and it sits idle most of the time while the Fargate task is running. It would be more scalable if this were async: imagine if a worker could send a task to Fargate, then immediately continue with the next task, periodically checking the statuses of ongoing tasks later. Such a mechanism would allow us to run many concurrent Fargate tasks with a small number of workers.
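The fire-then-poll pattern can be sketched with `asyncio`. Again this is a simulation under assumed names — `submit` and `check` stand in for the real run-task and describe-tasks calls — but it shows how a single worker (here, one event loop) can shepherd many in-flight Fargate tasks at once:

```python
import asyncio

# Simulated Fargate: each task "finishes" after a given number of polls.
statuses = {}

def submit(task_id, polls_needed):
    statuses[task_id] = polls_needed

def check(task_id):
    statuses[task_id] -= 1
    return "STOPPED" if statuses[task_id] <= 0 else "RUNNING"

async def watch(task_id, poll_interval=0.01):
    # The worker is NOT blocked here: while one watcher sleeps between
    # polls, the event loop advances the others.
    while check(task_id) != "STOPPED":
        await asyncio.sleep(poll_interval)
    return task_id

async def main():
    for i, polls in enumerate([2, 3, 4]):
        submit(f"task-{i}", polls)
    # Three concurrent "Fargate tasks", one worker process.
    return await asyncio.gather(*(watch(f"task-{i}") for i in range(3)))

print(asyncio.run(main()))  # → ['task-0', 'task-1', 'task-2']
```

With the blocking version shown earlier, those three tasks would hold three worker slots for their full duration; here one loop handles all of them.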
Most of the time, I have to set up Airflow pools to control the number of concurrent Fargate tasks. This is because most of our machine learning tasks need to query APIs for data, which in turn query databases. If we do not limit concurrency, the number of requests hitting our API gateways and databases can spike sharply, leading to crashes. The scalability bottleneck thus shifts to the API servers and databases.
In our setup, we use Celery for Airflow workers, and ECS for executors. It would be better if we used Kubernetes for both workers and executors. This would allow us to deploy exclusively to EKS Fargate, instead of maintaining two different systems.
To wrap up, I’d like to share some thoughts on AWS Lambda vs. Fargate. The key differences between the two are that Lambda does not require Docker, and that it imposes a maximum execution duration. I believe Fargate is the better solution because it uses Docker as the interface to set up operating systems, runtime environments, and execution commands. This makes the interface platform-agnostic: one could move to Azure or GCP by writing a custom Airflow executor, without making other substantial changes to Airflow.