Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding table_prefix option to prevent database table collisions for auto-generated tables #495

Merged
merged 10 commits into from
May 20, 2021

Conversation

gureckis
Copy link
Member

@gureckis gureckis commented Apr 27, 2021

A somewhat undocumented features is that psiturk now creates two additional book-keeping tables in your database besides the one specified by table_name option. These are amt_hit and campaign and help coordinate the campaigns and hit listings for the dashboard and command line. In some cases though multiple psiturk users might share one database for lab coordination purposes. If so then these tables will conflict across multiple instances and cause problems. This PR adds an option called table_prefix (default value false). Whenn it is set to true this will prepend the value specified by table_name to those auto-generated tables. For instance if table_name is exp1 then the two tables in question become exp1_amt_hit and exp1_campaign. This at least provides a mechanism for avoiding collisions.

@coveralls
Copy link

coveralls commented Apr 27, 2021

Coverage Status

Coverage increased (+1.07%) to 61.118% when pulling 48db427 on exp_table_names into 956ef9f on master.

@deargle
Copy link
Collaborator

deargle commented Apr 27, 2021

Hmm, the current default for table_name in psiturk 3 is "assignments" instead of the former "turkdemo" because I didn't understand the multi lab user use case. Prepending "assignments" probably would be weird.

Can you describe the shared use case approach? In this approach, is each lab user running a unique experiment? If so, I have been reading the sqlalchemy docs and they support "schemas," where each schema can have its own copy of a table, and collisions are avoided. The schema name becomes a prefix of sorts in the underlying query. Like "exp1.assignments," "exp2.assignments." a single database can have many schemas, and heroku says having up to 50 schemas for a given db should be fine.

One thing that schemas would not work for would be for people who want to share one single table of assignments for purposes of blocking participants, differentiating studies by prefixes to values they sert for code_version. For that reason, I might want to avoid the explicit schemas route. But maybe have a table_prefix config var that is blank by default, but if set to a string, then that string becomes the prefix for the two tables you mentioned. Does that sound okay? It's more complex and less intuitive than schemas, but maybe that's okay.

@gureckis
Copy link
Member Author

Yes the use case I think it pretty typical. A lab might have one mysql server running at a fixed location like a lab server. Multiple researchers might be running experiments on that database at once or a researcher might be running more than one psiturk experiment at once, but using the same database. In some cases, an IRBs might require a particular database system. For instance an IRB my say you have to use the Mysql Database infrastructure managed by the university and can't put PPI on random servers. The overhead of making multiple databases for each experimenter might be non-zero depending on the management tools provided by the university say. This just gives people that option to use 1 DB and multiple psiturk tasks.

Re: the prefix option as opposed to the flag, that sounds good....

@gureckis
Copy link
Member Author

actually if assignments is the default table name it feels like they should all just be autogenerated and the table_name used as the prefix? would make more sense that amt_assignments amt_hit and amt_campaigns are sort of standardized perhaps with the option to prefix them as needed?

@deargle
Copy link
Collaborator

deargle commented Apr 27, 2021 via email

@gureckis
Copy link
Member Author

gureckis commented Apr 27, 2021

well the most straightforward options then are

  1. a prefix_table string that applies to all three tables which defaults to empty
  2. a prefix_table string that applies to only the amt_hit and campaign tables (again defaulting to empty)
  3. a prefix_table flag that prefixes the amt_hit and campaign tables with the contents of table_name (this is in the current PR)
  4. explicitly configure the amt_hit and campaign tables however you want, again defaulting to the existing values

now that I write it, perhaps 4 is the most sensible because it also calls attention to the fact that these tables are going to be created. it wasn't clearly mentioned anywhere obvious in the new version and it took a bit of debugging to realize this was happening. if it is listed in the default config it raises awareness.

@jacob-lee
Copy link
Collaborator

As it happens I encountered this issue a few days ago. It is not a shared psiturk environment, but we run longitudinal or multi-step studies, and it makes a lot of sense to keep things in the same database, particularly when recruiting participants and avoiding participants who have already done the study (one could go a complicated qualifications route I suppose).

The table prefix seems a reasonable idea. In a shared environment its still not going to stop clobbering, because users may not be sufficiently careful (for example, because they borrow config files from each other).

btw, and just putting this out there, but I hardly use the base table any more to store experimental data (i.e. what used to be turkdemo). Psiturk saves each subjects data in one row, and the task data in one cell. That means each PUT has to have all the data in it each time. For long experiments, or experiments where participants have kept buttons pressed or used scripts to press buttons repeatedly, these requests have gotten so large that the system bogs down significantly, or times out, or in one cases, violated my databases request size limit. Its also generally inefficient, makes it difficult to do other things like base their bonus on performance. So its custom tables almost all the way. One of the big problems with this is having to worry about race conditions, and repeated requests--you need some kind of nonce and the server side code to support it.

Also, there really should be no reason psiturk shouldn't be able to support multiple experiments simultaneously from one server. The main thing is the routes (including the ad route) need to be prefixed with the experiment name, and the ads too when creating the HIT. psiturk create hit foo_experiment 3.00 10 1

@deargle
Copy link
Collaborator

deargle commented Apr 27, 2021

@gureckis number 4 sounds good. So that would be:

  • add a new config var called assignments_table_name and deprecate table_name. If both are set, prefer assignments_table_name If only one is set, prefer that one.

    I already wrote a way to deprecate config vars here

  • add new config vars campaigns_table_name and amt_hits_table_name

  • add entries for all three round about here, then rerun create_sample_config_from_defaults.py to get a new default user config file.

  • update docs

@jacob-lee your psiturk use case is interesting, you've evolved to a new plane :-)

@deargle
Copy link
Collaborator

deargle commented May 7, 2021

There's another table we need to be mindful of -- apscheduler_jobs. It gets created if the dashboard is used to try to add jobs.

@gureckis
Copy link
Member Author

Ok I think this implements option 4 from above. The exact syntax for deprecating config options is not clear to me but in this example I describe that table_name is being retired in preference for assignments_table_name. The code still refers to table_name internally.

@deargle deargle force-pushed the exp_table_names branch from d6cbcd9 to 3629a9d Compare May 20, 2021 04:50
@deargle
Copy link
Collaborator

deargle commented May 20, 2021

As ugly as it is, I think we might need to leave the default for the jobs table to "apscheduler_jobs" so that upgrading psiturk doesn't break anyone's currently-running jobs by creating a new table all of a sudden.

Btw I rebased this PR and force pushed.

@deargle
Copy link
Collaborator

deargle commented May 20, 2021

Same for changing back to amt_hit

Edit: the currently used table names are:

  • campaign
  • amt_hit
  • apscheduler_jobs
  • assignments

@gureckis
Copy link
Member Author

darn it! ok. I suspect this effects between N=0 and N=2 people but we're running a professional shop here.

@deargle
Copy link
Collaborator

deargle commented May 20, 2021

It's purely an academic exercise! :-)

@deargle
Copy link
Collaborator

deargle commented May 20, 2021

I'm working on the changes rn

@deargle
Copy link
Collaborator

deargle commented May 20, 2021

Oop okay, just got your changes. I'll finish merging in a few other tweaks.

@gureckis
Copy link
Member Author

gureckis commented May 20, 2021

I'm still not 100% sure on the assignments table... if a future person only provides assignment_table_name in their config will the lookup via the psiturkconfig object still resolve? Sorry, I know I could read but also could just ask what you remember.

@deargle
Copy link
Collaborator

deargle commented May 20, 2021

if a future person only provides assignment_table_name in their config will the lookup via the psiturkconfig object still resolve?

Yes -- psiturk_config.py will set table_name to the value of assignments_table_name. But I'm changing the config file to have no default for assignments_table_name -- because if it did have a default, then the default would always override whatever the user set for table_name, even if they never set assignmenst_table_name

to prevent it _always_ overriding any `table_name` that the user sets

also, update docs
@deargle deargle merged commit d56d26c into master May 20, 2021
@deargle
Copy link
Collaborator

deargle commented May 20, 2021

🎉 woo, high five!

@gureckis
Copy link
Member Author

Thanks! high five. I feel so accomplished, I'm going to go try to understand experiment.py just for fun

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants