### Generic implementation of a processing graph

Remove explicit mentions of /splits or /first-rows from the code, and move them to the "processing graph":

```json
{
  "/splits": {"input_type": "dataset", "required_by_dataset_viewer": true},
  "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": true}
}
```

This JSON (see libcommon.config) defines the *processing steps* (here /splits and /first-rows) and their dependency relationships (here /first-rows depends on /splits). It also defines whether a processing step is required by the Hub dataset viewer (used to fill /valid and /is-valid). A processing step is identified by its endpoint (/splits, /first-rows), from which the result of the processing step can be downloaded. The endpoint value is also used as the cache key and the job type. A sketch of how this specification can be consumed generically is shown below.

After this change, adding a new processing step should consist of:

- creating a new worker in the `workers/` directory
- updating the processing graph
- updating the CI, tests, docs and deployment (docker-compose files, e2e tests, openapi, helm chart)

This also means that the services (API, admin) no longer contain any code that mentions /splits or /first-rows directly, and the /splits worker contains no direct reference to /first-rows.

### Other changes

- code: the libcache and libqueue libraries have been merged into libcommon
- the code that checks whether a dataset is supported (it exists, is not private, and access can be obtained programmatically if it is gated) has been factored out; it now runs before every processing step, and even before accepting to create a new job (through the webhook or through the /admin/force-refresh endpoint). See the sketch below.
- add a new endpoint, /admin/cancel-jobs, which replaces the last admin scripts. It's easier to send a POST request than to call a remote script.
- simplify the code of the workers by factoring some code out into libcommon (sketched below):
  - the code that tests whether a job should be skipped, based on the versions of the dataset git repository and of the worker
  - the logic that catches errors and writes to the cache

  This way, the code of every worker now only contains what is specific to that worker.

### Breaking changes

- env vars `QUEUE_MAX_LOAD_PCT`, `QUEUE_MAX_MEMORY_PCT` and `QUEUE_SLEEP_SECONDS` are renamed to `WORKER_MAX_LOAD_PCT`, `WORKER_MAX_MEMORY_PCT` and `WORKER_SLEEP_SECONDS`.
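
To illustrate how the processing graph specification above can be consumed generically, here is a minimal Python sketch. The dictionary and function names are illustrative only, not the actual libcommon API: it resolves the dependency chain of a step and lists the steps the dataset viewer needs.

```python
from typing import Any, Dict, List

# Illustrative in-memory copy of the processing graph specification shown above
# (names are hypothetical, not the actual libcommon objects).
PROCESSING_GRAPH: Dict[str, Dict[str, Any]] = {
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": True},
}


def ancestors(step: str, graph: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the chain of steps that must have run before `step`, in order."""
    chain: List[str] = []
    current = graph[step].get("requires")
    while current is not None:
        chain.insert(0, current)
        current = graph[current].get("requires")
    return chain


def required_by_dataset_viewer(graph: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the steps whose cache entries back /valid and /is-valid."""
    return [step for step, spec in graph.items() if spec.get("required_by_dataset_viewer")]


print(ancestors("/first-rows", PROCESSING_GRAPH))    # ['/splits']
print(required_by_dataset_viewer(PROCESSING_GRAPH))  # ['/splits', '/first-rows']
```

With this kind of generic consumption, registering a new step is just one more entry in the specification, plus the worker and deployment updates listed above.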
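
The dataset-support check mentioned under "Other changes" could look roughly like the sketch below, assuming the huggingface_hub client is used; the function name and the error handling are simplified assumptions, not the actual factored-out helper.

```python
from typing import Optional

from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError


def is_dataset_supported(dataset: str, hf_token: Optional[str] = None) -> bool:
    """Sketch: the dataset exists on the Hub, is not private, and is readable
    with the provided token (which covers gated datasets whose access can be
    obtained programmatically)."""
    try:
        info = HfApi().dataset_info(dataset, token=hf_token)
    except RepositoryNotFoundError:
        # Missing or inaccessible repositories end up here; a real
        # implementation would distinguish the error cases (not found,
        # private, gated without access).
        return False
    return not info.private
```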
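
Cancelling jobs then only requires an HTTP call rather than a remote script. A hedged example with the requests library; the host and any parameters accepted by the endpoint are assumptions here:

```python
import requests

# Hypothetical base URL; the exact parameters accepted by the endpoint
# (job type, filters, ...) depend on the deployment and are not shown here.
ADMIN_BASE_URL = "https://datasets-server.example.org/admin"

response = requests.post(f"{ADMIN_BASE_URL}/cancel-jobs", timeout=10)
response.raise_for_status()
```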
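
The factoring out of the skip logic and of the error-to-cache logic can be pictured as a base class along these lines. This is a rough sketch with illustrative names and an in-memory cache; the real libcommon code differs.

```python
from abc import ABC, abstractmethod
from http import HTTPStatus
from typing import Any, Dict, Mapping, Tuple

# In-memory stand-in for the cache collection (illustrative only).
_CACHE: Dict[Tuple[str, str], Dict[str, Any]] = {}


class WorkerBase(ABC):
    """Shared logic: skip already-done jobs, catch errors, write to the cache."""

    endpoint: str = "/example"  # hypothetical endpoint name
    version: str = "1.0.0"

    @abstractmethod
    def compute(self, dataset: str) -> Mapping[str, Any]:
        """Endpoint-specific processing, implemented by each worker."""

    def should_skip_job(self, dataset: str, git_revision: str) -> bool:
        # Skip if a successful entry already exists for the same worker
        # version and the same git revision of the dataset repository.
        cached = _CACHE.get((self.endpoint, dataset))
        return (
            cached is not None
            and cached["http_status"] == HTTPStatus.OK
            and cached["worker_version"] == self.version
            and cached["dataset_git_revision"] == git_revision
        )

    def process(self, dataset: str, git_revision: str) -> None:
        if self.should_skip_job(dataset, git_revision):
            return
        try:
            content: Mapping[str, Any] = self.compute(dataset)
            status = HTTPStatus.OK
        except Exception as err:
            content = {"error": str(err)}
            status = HTTPStatus.INTERNAL_SERVER_ERROR
        _CACHE[(self.endpoint, dataset)] = {
            "content": content,
            "http_status": status,
            "worker_version": self.version,
            "dataset_git_revision": git_revision,
        }
```

A concrete worker then only implements `compute()` for its own endpoint, which is what "the code for every worker now only contains what is specific to that worker" refers to.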

---

* feat:🎸 add /cache-reports/parquet endpoint and parquet reports
* feat:🎸 add the /parquet endpoint
* feat:🎸 add parquet worker. Note that it will not pass the CI because:
  - the CI token is not allowed to push to refs/convert/parquet (should be in the "datasets-maintainers" org)
  - the refs/convert/parquet does not exist and cannot be created for now
* ci:🎡 add CI for the worker
* feat:🎸 remove the hffs dependency. We don't use it, and it's private for now
* feat:🎸 change the response format: associate each parquet file with a split and a config (based on path parsing)
* fix:🐛 handle the fact that "SSSSS-of-NNNNN" is "optional" (thanks @lhoestq)
* fix:🐛 fill two fields to known versions in case of error
* feat:🎸 upgrade datasets to 2.7.0
* ci:🎡 fix action
* feat:🎸 create ref/convert/parquet if it does not exist
* feat:🎸 update pytest. See pytest-dev/py#287 (comment)
* feat:🎸 unlock access to the gated datasets. Gated datasets with extra fields are not supported. Note also that only one token is used now.
* feat:🎸 check if the dataset is supported only for existing one
* fix:🐛 fix config
* fix:🐛 fix the branch argument + fix case where ref is created
* fix:🐛 fix logic of the worker, to ensure we get the git sha. Also fix the tests, and disable gated+private for now
* fix:🐛 fix gated datasets and update the tests
* test:💍 assert that gated with extra fields are not supported
* fix:🐛 add controls on the dataset_git_revision
* feat:🎸 upgrade datasets
* feat:🎸 add script to refresh parquet response
* fix:🐛 fix the condition to test if the split exists in lis
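
The "change the response format" and '"SSSSS-of-NNNNN" is "optional"' commits above refer to deriving the config and the split from the parquet file paths. A rough sketch of such parsing, where the path layout itself ("config/split-SSSSS-of-NNNNN.parquet", with an optional shard suffix) is an assumption:

```python
import re
from typing import Optional, Tuple

# Hypothetical path layout: "<config>/<split>(-SSSSS-of-NNNNN)?.parquet".
# The shard suffix is optional, as noted in the commit messages above.
FILENAME_PATTERN = re.compile(
    r"^(?P<config>[^/]+)/(?P<split>[^/]+?)(?:-\d{5}-of-\d{5})?\.parquet$"
)


def parse_parquet_path(path: str) -> Optional[Tuple[str, str]]:
    """Return (config, split) for a parquet file path, or None if it does not match."""
    match = FILENAME_PATTERN.match(path)
    if match is None:
        return None
    return match.group("config"), match.group("split")


print(parse_parquet_path("default/train-00000-of-00003.parquet"))  # ('default', 'train')
print(parse_parquet_path("default/test.parquet"))                  # ('default', 'test')
```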
