Retreaver is fully aware of the enormous impact downtime has on our customers. Any time a phone system goes down, customers lose money. Beyond revenue, unquantifiable things like trust are lost, and the repercussions can be long-lasting.
We take nothing more seriously than providing a dependable service, because we want you to trust us with your call traffic. We work tirelessly to ensure that Retreaver will always be available, responsive, and fast.
Importantly, all call traffic is handled by Twilio, which has a strong reputation for worldwide reliability. Retreaver does not self-host phone systems or rely on “race-to-the-bottom” providers. Like Retreaver, Twilio has a strong focus on uptime and quality, and we trust only Twilio’s proven reliability with your voice traffic.
With that in mind, the Retreaver API must be responsive to requests 24/7 from both customers and Twilio. Our high-availability architecture has been designed with this requirement at the forefront.
Warning: nerdiness & technical specifics!
Retreaver is architected for high availability, hosted on Amazon Web Services (AWS). Our primary systems are in the us-east-1 region, with redundant systems across multiple availability zones for every service. We currently use Postgres managed by RDS for our database, with automatic failover enabled should a single availability zone go down, or should the server itself crash. Our Redis and Memcached servers are managed by ElastiCache and all have redundant failover systems online.
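RDS performs the Multi-AZ failover above automatically, but the decision it makes is simple enough to sketch. The function below is a hypothetical illustration of ours (the names `pick_endpoint`, `primary`, and `standbys` are not part of any AWS API): keep using the primary while its availability zone is healthy, otherwise promote a standby in a healthy zone.

```python
# Hypothetical sketch of the failover decision that RDS Multi-AZ
# automates: prefer the primary, fall back to a healthy standby.

def pick_endpoint(primary, standbys, healthy_azs):
    """Return the primary's endpoint if its AZ is healthy, otherwise
    the first standby in a healthy AZ. Raise if none is available."""
    if primary["az"] in healthy_azs:
        return primary["endpoint"]
    for standby in standbys:
        if standby["az"] in healthy_azs:
            return standby["endpoint"]
    raise RuntimeError("no healthy database replica available")
```

With `us-east-1a` down, a standby in `us-east-1b` would be selected without any human involvement, which is the behavior we rely on from RDS.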
Retreaver achieves multi-region high availability by also maintaining an online standby cluster in the us-west-2 region of AWS. The database server in this cluster is kept in sync with our main cluster in us-east-1 using Bucardo.
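Because Bucardo replicates asynchronously, the standby cluster can lag slightly behind the primary. One common way to watch that lag (a sketch of our own, not part of Bucardo itself) is a heartbeat row the primary updates every few seconds; comparing the standby's copy of that timestamp against the primary's tells you how far behind replication is.

```python
from datetime import timedelta

def replication_lag(primary_heartbeat, standby_heartbeat):
    """Lag between clusters, measured from a heartbeat row the primary
    updates regularly and replication copies to the standby."""
    return primary_heartbeat - standby_heartbeat

def lag_acceptable(lag, threshold=timedelta(seconds=30)):
    """Hypothetical alert threshold: flag when the standby falls
    more than 30 seconds behind the primary."""
    return lag <= threshold
```

The 30-second threshold here is illustrative only; an acceptable lag depends on how much recent data you can tolerate losing in a cross-region failover.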
In the unlikely event of a major catastrophe affecting all of us-east-1, we fail over to the us-west-2 cluster. Many providers do not have a contingency plan in place for this event, but we remember all too clearly the outage of October 2012.
- During a failover event, all asynchronous external callbacks (‘pixel fires’) that are normally proxied through a us-east-1 server are delayed until normal operations are restored.
- Some call-log features and all dashboard features are unavailable during a failover event, since our search servers are currently only in us-east-1. Call traffic continues to be handled normally.
The failover process is automated and tested weekly.
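The pixel-fire delay described above amounts to buffering outbound callbacks while failed over and flushing them once normal operations resume. The class below is a simplified sketch of that behavior (the name `PixelFireProxy` and its methods are ours, not actual Retreaver internals):

```python
from collections import deque

class PixelFireProxy:
    """Sketch: buffer outbound callbacks ('pixel fires') during a
    failover event and deliver them once the primary region recovers."""

    def __init__(self, send):
        self.send = send            # callable that delivers one callback URL
        self.failed_over = False
        self.pending = deque()

    def fail_over(self):
        self.failed_over = True

    def fire(self, url):
        if self.failed_over:
            self.pending.append(url)  # delayed until normal operations resume
        else:
            self.send(url)

    def restore(self):
        self.failed_over = False
        while self.pending:           # flush in original order
            self.send(self.pending.popleft())
```

Buffering rather than dropping is the key design choice: partners still receive every callback, just later than usual.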
We monitor uptime and key operational information such as queue lengths via Pingdom. Our developers are on-call to respond to emergencies and are automatically paged by Pingdom in the event of an unexpected issue.
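At its core, alerting on operational metrics like queue lengths reduces to comparing each metric against a threshold. The check below is a hypothetical illustration (Pingdom's alerting is configured through its service, not written as code like this):

```python
def queues_over_threshold(queue_lengths, thresholds):
    """Return the names of queues whose current length exceeds their
    alert threshold; an unlisted queue never triggers an alert."""
    return [name for name, length in queue_lengths.items()
            if length > thresholds.get(name, float("inf"))]
```

A non-empty result is what would page the on-call developer.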
Errors, response times, and servers are monitored via New Relic. Additionally, we pipe all this information and much more into Datadog to provide our developers with a single point of reference, ensuring a fast, targeted response.
Although unlikely, in the case of database corruption, human error, or another scenario where failing over to a second set of servers is inadequate, Retreaver maintains database backups in S3 via RDS, allowing point-in-time recovery going back up to a month.
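Point-in-time recovery means we can restore the database to any moment within the retention window, not just to nightly snapshots. The helper below is an illustrative sketch (the function name and the exact 31-day figure are our assumptions for the example) of checking whether a restore target falls inside that window:

```python
from datetime import timedelta

def restorable(target, latest_restorable_time, retention_days=31):
    """True if `target` falls within the point-in-time recovery window,
    i.e. no older than the retention period and no newer than the
    latest restorable time reported by the backup system."""
    earliest = latest_restorable_time - timedelta(days=retention_days)
    return earliest <= target <= latest_restorable_time
```

For example, with a month of retention, a target two weeks back is restorable, while one from two months ago is not.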
Whenever maintenance is scheduled, or during an outage, we keep our customers abreast of what’s happening so that they can make important business decisions. Status updates are broadcast to all Retreaver users through alerts on all pages of our app.
If you have any questions, don’t hesitate to contact us. Just submit a support ticket and we’ll respond quickly to address your concerns.