The backups we are making serve as an independent copy we can use to rebuild the table if something unexpected happens, from data pipeline failures to accidental insert/update/delete queries. We hope we'll never need to actually use these files, but we take comfort in knowing we can rebuild from scratch back to the beginning of CD2 if needed.
Our primary CD2 job constantly appends new data to our web_logs table. I assume the DAP works this way too? Or does it only keep the data that is currently available in the API in its tables? We chose to build our own API scripts so that we could maintain tight control over each step of the process, and for platform compatibility, since we are not on Postgres for our data operations. Our goal is for our primary web_logs table to hold all of our data back to the beginning of CD2, and possibly back to the beginning of CD1 once we build a conversion process.
At a high level, our backups run as AWS Glue ETL Jobs. We chose Glue over AWS Lambda because of the 15-minute cap on Lambdas: every now and then when we were on Lambda, we would get stuck in the API's "waiting" status long enough for that cap to prematurely end the backup job. Since we did not want to maintain a datastore of timestamps for CD2's since/until mechanic, we take our web_logs backups in 24-hour increments, always using a since/until of 00:00:00 to 23:59:59 for the date we want to back up.

We take the gzipped JSON files generated by the CD2 API and send them directly into an S3 bucket organized by year/year-month/year-month-day. This layout also makes a partial restore possible if only a certain range of dates is affected: we can delete all of the affected days from the primary table and reimport the corresponding days from the backup files, without having to figure out the exact record to cut over on to avoid duplication.
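Roughly, the per-day step looks like the sketch below. This is a simplification, not our actual Glue code: the bucket name is a placeholder, and fetch_parts stands in for our own CD2 API scripts (whatever wraps the query/poll/download flow and yields gzipped JSON parts). The point is just the fixed since/until window and the year/year-month/year-month-day key layout.

```python
from datetime import datetime, time, timezone
import boto3

BUCKET = "example-cd2-backups"  # placeholder bucket name


def backup_web_logs_for_date(target_date, fetch_parts):
    """Back up one calendar day of web_logs as gzipped JSON parts in S3.

    fetch_parts is a stand-in for our own CD2 API scripts: given a table
    name plus since/until datetimes, it should yield (filename, gzipped
    bytes) pairs for that window.
    """
    # Fixed 24-hour window: since 00:00:00, until 23:59:59 on the given date.
    since = datetime.combine(target_date, time(0, 0, 0), tzinfo=timezone.utc)
    until = datetime.combine(target_date, time(23, 59, 59), tzinfo=timezone.utc)

    # year/year-month/year-month-day prefix, e.g. 2024/2024-05/2024-05-17/
    prefix = f"{target_date:%Y}/{target_date:%Y-%m}/{target_date:%Y-%m-%d}/"

    s3 = boto3.client("s3")
    for filename, gz_bytes in fetch_parts("web_logs", since, until):
        s3.put_object(Bucket=BUCKET, Key=prefix + filename, Body=gz_bytes)
```

A restore for a bad date range is then just the reverse: delete those days from the primary table and reload every object under the matching prefixes.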
As a secondary safeguard, we have a script that checks the last 30 days of the S3 bucket's year/year-month/year-month-day structure and, if any day's prefix is empty, emails us to run the backup for the missing date.
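Conceptually that check is just a prefix scan; here is a minimal sketch under the same placeholder bucket name, with the actual email notification left out (ours is a separate step, so the sketch just returns the missing dates).

```python
from datetime import date, timedelta
import boto3

BUCKET = "example-cd2-backups"  # placeholder bucket name


def find_missing_backup_days(lookback_days=30):
    """Return dates in the last N days whose date prefix holds no objects."""
    s3 = boto3.client("s3")
    missing = []
    for offset in range(1, lookback_days + 1):
        d = date.today() - timedelta(days=offset)
        prefix = f"{d:%Y}/{d:%Y-%m}/{d:%Y-%m-%d}/"
        resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix, MaxKeys=1)
        if resp.get("KeyCount", 0) == 0:
            missing.append(d)
    return missing


if __name__ == "__main__":
    # In our job, a non-empty result triggers the reminder email.
    print(find_missing_backup_days())
```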
My team and I will need to look through the code, but I imagine that we should have a lot of sharable components with our backup process.
Edit: I didn't realize the forum login grabbed my non-admin account; I am the original poster.