### Generic implementation of a processing graph

Remove explicit mentions of /splits or /first-rows from the code, and move them to the "processing graph":

```json
{
  "/splits": {"input_type": "dataset", "required_by_dataset_viewer": true},
  "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": true}
}
```

This JSON (see libcommon.config) defines the *processing steps* (here /splits and /first-rows) and their dependency relationships (here /first-rows depends on /splits). It also defines whether a processing step is required by the Hub dataset viewer (used to fill /valid and /is-valid). A processing step is identified by its endpoint (/splits, /first-rows), from which the result of the processing step can be downloaded. The endpoint value is also used as the cache key and the job type. A sketch of how this specification can be consumed generically is shown below.

After this change, adding a new processing step should consist of:

- creating a new worker in the `workers/` directory
- updating the processing graph
- updating the CI, tests, docs and deployment (docker-compose files, e2e tests, openapi, helm chart)

This also means that the services (API, admin) no longer contain any code that mentions /splits or /first-rows directly, and the /splits worker contains no direct reference to /first-rows.

### Other changes

- code: the libcache and libqueue libraries have been merged into libcommon
- the code that checks whether a dataset is supported (it exists, is not private, and access can be obtained programmatically if it is gated) has been factored out; it now runs before every processing step, and even before accepting to create a new job (through the webhook or through the /admin/force-refresh endpoint). See the sketch below.
- add a new endpoint, /admin/cancel-jobs, which replaces the last admin scripts. It's easier to send a POST request than to call a remote script.
- simplify the code of the workers by factoring some code out into libcommon (sketched below):
  - the code that tests whether a job should be skipped, based on the versions of the dataset git repository and of the worker
  - the logic that catches errors and writes to the cache

  This way, the code of every worker now only contains what is specific to that worker.

### Breaking changes

- env vars `QUEUE_MAX_LOAD_PCT`, `QUEUE_MAX_MEMORY_PCT` and `QUEUE_SLEEP_SECONDS` are renamed to `WORKER_MAX_LOAD_PCT`, `WORKER_MAX_MEMORY_PCT` and `WORKER_SLEEP_SECONDS`.
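
To illustrate how the processing graph specification above can be consumed generically, here is a minimal Python sketch. The dictionary and function names are illustrative only, not the actual libcommon API: it resolves the dependency chain of a step and lists the steps the dataset viewer needs.

```python
from typing import Any, Dict, List

# Illustrative in-memory copy of the processing graph specification shown above
# (names are hypothetical, not the actual libcommon objects).
PROCESSING_GRAPH: Dict[str, Dict[str, Any]] = {
    "/splits": {"input_type": "dataset", "required_by_dataset_viewer": True},
    "/first-rows": {"input_type": "split", "requires": "/splits", "required_by_dataset_viewer": True},
}


def ancestors(step: str, graph: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the chain of steps that must have run before `step`, in order."""
    chain: List[str] = []
    current = graph[step].get("requires")
    while current is not None:
        chain.insert(0, current)
        current = graph[current].get("requires")
    return chain


def required_by_dataset_viewer(graph: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return the steps whose cache entries back /valid and /is-valid."""
    return [step for step, spec in graph.items() if spec.get("required_by_dataset_viewer")]


print(ancestors("/first-rows", PROCESSING_GRAPH))    # ['/splits']
print(required_by_dataset_viewer(PROCESSING_GRAPH))  # ['/splits', '/first-rows']
```

With this kind of generic consumption, registering a new step is just one more entry in the specification, plus the worker and deployment updates listed above.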
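
The dataset-support check mentioned under "Other changes" could look roughly like the sketch below, assuming the huggingface_hub client is used; the function name and the error handling are simplified assumptions, not the actual factored-out helper.

```python
from typing import Optional

from huggingface_hub import HfApi
from huggingface_hub.utils import RepositoryNotFoundError


def is_dataset_supported(dataset: str, hf_token: Optional[str] = None) -> bool:
    """Sketch: the dataset exists on the Hub, is not private, and is readable
    with the provided token (which covers gated datasets whose access can be
    obtained programmatically)."""
    try:
        info = HfApi().dataset_info(dataset, token=hf_token)
    except RepositoryNotFoundError:
        # Missing or inaccessible repositories end up here; a real
        # implementation would distinguish the error cases (not found,
        # private, gated without access).
        return False
    return not info.private
```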
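
Cancelling jobs then only requires an HTTP call rather than a remote script. A hedged example with the requests library; the host and any parameters accepted by the endpoint are assumptions here:

```python
import requests

# Hypothetical base URL; the exact parameters accepted by the endpoint
# (job type, filters, ...) depend on the deployment and are not shown here.
ADMIN_BASE_URL = "https://datasets-server.example.org/admin"

response = requests.post(f"{ADMIN_BASE_URL}/cancel-jobs", timeout=10)
response.raise_for_status()
```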
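
The factoring out of the skip logic and of the error-to-cache logic can be pictured as a base class along these lines. This is a rough sketch with illustrative names and an in-memory cache; the real libcommon code differs.

```python
from abc import ABC, abstractmethod
from http import HTTPStatus
from typing import Any, Dict, Mapping, Tuple

# In-memory stand-in for the cache collection (illustrative only).
_CACHE: Dict[Tuple[str, str], Dict[str, Any]] = {}


class WorkerBase(ABC):
    """Shared logic: skip already-done jobs, catch errors, write to the cache."""

    endpoint: str = "/example"  # hypothetical endpoint name
    version: str = "1.0.0"

    @abstractmethod
    def compute(self, dataset: str) -> Mapping[str, Any]:
        """Endpoint-specific processing, implemented by each worker."""

    def should_skip_job(self, dataset: str, git_revision: str) -> bool:
        # Skip if a successful entry already exists for the same worker
        # version and the same git revision of the dataset repository.
        cached = _CACHE.get((self.endpoint, dataset))
        return (
            cached is not None
            and cached["http_status"] == HTTPStatus.OK
            and cached["worker_version"] == self.version
            and cached["dataset_git_revision"] == git_revision
        )

    def process(self, dataset: str, git_revision: str) -> None:
        if self.should_skip_job(dataset, git_revision):
            return
        try:
            content: Mapping[str, Any] = self.compute(dataset)
            status = HTTPStatus.OK
        except Exception as err:
            content = {"error": str(err)}
            status = HTTPStatus.INTERNAL_SERVER_ERROR
        _CACHE[(self.endpoint, dataset)] = {
            "content": content,
            "http_status": status,
            "worker_version": self.version,
            "dataset_git_revision": git_revision,
        }
```

A concrete worker then only implements `compute()` for its own endpoint, which is what "the code for every worker now only contains what is specific to that worker" refers to.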

---

* feat:🎸 add /cache-reports/parquet endpoint and parquet reports
* feat:🎸 add the /parquet endpoint
* feat:🎸 add parquet worker. Note that it will not pass the CI because:
  - the CI token is not allowed to push to refs/convert/parquet (should be in the "datasets-maintainers" org)
  - the refs/convert/parquet does not exist and cannot be created for now
* ci:🎡 add CI for the worker
* feat:🎸 remove the hffs dependency. We don't use it, and it's private for now
* feat:🎸 change the response format: associate each parquet file with a split and a config (based on path parsing)
* fix:🐛 handle the fact that "SSSSS-of-NNNNN" is "optional" (thanks @lhoestq)
* fix:🐛 fill two fields to known versions in case of error
* feat:🎸 upgrade datasets to 2.7.0
* ci:🎡 fix action
* feat:🎸 create ref/convert/parquet if it does not exist
* feat:🎸 update pytest. See pytest-dev/py#287 (comment)
* feat:🎸 unlock access to the gated datasets. Gated datasets with extra fields are not supported. Note also that only one token is used now.
* feat:🎸 check if the dataset is supported only for existing one
* fix:🐛 fix config
* fix:🐛 fix the branch argument + fix case where ref is created
* fix:🐛 fix logic of the worker, to ensure we get the git sha. Also fix the tests, and disable gated+private for now
* fix:🐛 fix gated datasets and update the tests
* test:💍 assert that gated with extra fields are not supported
* fix:🐛 add controls on the dataset_git_revision
* feat:🎸 upgrade datasets
* feat:🎸 add script to refresh parquet response
* fix:🐛 fix the condition to test if the split exists in lis
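
The "change the response format" and '"SSSSS-of-NNNNN" is "optional"' commits above refer to deriving the config and the split from the parquet file paths. A rough sketch of such parsing, where the path layout itself ("config/split-SSSSS-of-NNNNN.parquet", with an optional shard suffix) is an assumption:

```python
import re
from typing import Optional, Tuple

# Hypothetical path layout: "<config>/<split>(-SSSSS-of-NNNNN)?.parquet".
# The shard suffix is optional, as noted in the commit messages above.
FILENAME_PATTERN = re.compile(
    r"^(?P<config>[^/]+)/(?P<split>[^/]+?)(?:-\d{5}-of-\d{5})?\.parquet$"
)


def parse_parquet_path(path: str) -> Optional[Tuple[str, str]]:
    """Return (config, split) for a parquet file path, or None if it does not match."""
    match = FILENAME_PATTERN.match(path)
    if match is None:
        return None
    return match.group("config"), match.group("split")


print(parse_parquet_path("default/train-00000-of-00003.parquet"))  # ('default', 'train')
print(parse_parquet_path("default/test.parquet"))                  # ('default', 'test')
```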
