CD2: initialize table schema only with incremental data
10-30-2023 07:19 AM
`dap initdb` isn't working for us with the web_logs table. As far as I can tell this is due to the sheer quantity of data, so I figured we could initialize the table schema only and then load the data incrementally, but I don't see how to do that with the existing commands, and the Python code is a bit beyond me.
Would appreciate any pointers.
Thanks
1 Solution
10-30-2023 12:45 PM
Hi @dtod, we had a similar problem with our submissions table. We used the relevant CREATE TABLE statements, which you can find here, and then the shell script below, which works on the files downloaded by `dap snapshot`. Hope that helps!
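For context, the download step looks roughly like this. This is a hedged sketch: the credentials are placeholders, the table name is ours, and the exact flags may differ by CLI version, so check `dap snapshot --help` on yours.

```bash
# Authenticate via environment variables, then pull a full snapshot of the
# table as gzipped TSV. The CLI writes its output files into a job*/
# subdirectory, which the script below picks up with the job*/ glob.
export DAP_API_URL="https://api-gateway.instructure.com"
export DAP_CLIENT_ID="..."       # placeholder: your DAP client id
export DAP_CLIENT_SECRET="..."   # placeholder: your DAP client secret
dap snapshot --namespace canvas --table submissions --format tsv
```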
```bash
#!/bin/bash

#list gz files in job subdirectory - there should be just one - to make array
gzips=($(ls job*/*.gz))

#loop through files in array
for archive in ${gzips[@]}
do
    echo "$(date) starting on $archive"

    #field 47 is meta.ts, which isn't stored in the table and causes errors
    gzip -dc $archive | cut -f1-46 >> $archive.txt

    #splitting archives into 500K line parts
    split -d -l 500000 $archive.txt subm_

    #ls parts by the filename stem to make array
    parts=($(ls subm_*))

    #loop through files in array
    for part in ${parts[@]}
    do
        #test filename to see if it's the first part, which contains the header row
        if [ $(echo $part | cut -f2 -d_) == "00" ]; then
            psqlcommand="\copy canvas.submissions FROM $part WITH (HEADER) ;"
        else
            psqlcommand="\copy canvas.submissions FROM $part ;"
        fi
        psql -c "$psqlcommand"
        echo "$(date) $part done"
        rm $part
    done

    rm $archive.txt
done
```
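Once the backfill is done, the ongoing deltas are small enough for the standard commands. Something like the following kept us current; again a sketch, since the timestamp is illustrative (use the one reported by your snapshot job) and I haven't checked every format/flag combination, so see `dap incremental --help`.

```bash
# Pull only the rows changed since the snapshot; the --since value below
# is illustrative - use the timestamp your snapshot job reported
dap incremental --namespace canvas --table submissions --format tsv \
    --since 2023-10-30T00:00:00Z
```

Note that the incremental files carry extra meta columns (including whether each row is an upsert or a delete), so loading them takes more handling than the plain `\copy` above.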