Fix BNCI2020_002 reshape, MartinezCagigal2023 stim_trial, and MAMEM Figshare rate-limiting#1068
Merged
Merged
Conversation
- BNCI2020_002 (`_convert_attention_shift`): the F-contiguous `bciexp.data` array was being flattened with the default C-order reshape, which produced a trial-fastest interleaved layout (`flat[c, k] = data[c, k // n_trials, k % n_trials]`) instead of the trial-major layout that every per-trial stim marker assumes. Every per-stimulus epoch was therefore sampled from the wrong trial and the canonical N2pc analysis sat at chance. Reshape via `transpose(0, 2, 1).reshape(...)` so `flat[c, t * n_samples + s]` equals `data[c, s, t]`. - MartinezCagigal2023Checker / MartinezCagigal2023Pary: `stim_trial` was carrying the per-recording trial index `0..n_trials-1` instead of the attended command id, breaking multi-recording multiclass classification (the same marker value pointed at different commands across recordings). Resolve `command_ids` from `cvep_data["command_idx"]` for train mode and from the `commands_info` label map for test mode, pass them to `add_stim_channel_trial`, and add `command_id` to the `_trial_meta` annotation extras so the attended symbol survives BIDS export. Codebook indexing (`add_stim_channel_epoch`) keeps using the per-recording trial id since `codes` is built per recording. - MAMEM SSVEP loaders: `fs_get_file_list` was being hammered once per subject inside both `data_path` and `_get_single_subject_data`, which triggered Figshare's 403 rate limit from datacenter IPs. Add a process-level `lru_cache` keyed by `(article_id, version)` on `fs_get_file_list`, and persist the listing as JSON next to the data in `BaseMAMEM` so subsequent runs of an already-downloaded dataset skip the API entirely. `force_update=True` bypasses both layers. Add `moabb/tests/test_dataset_fixes.py` with offline regression tests for all three fixes (mocking `loadmat`, `_fs_paginated_file_list`, and `fs_get_file_list` so the suite stays network-free).
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 118ff462df
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Tighten the on-disk filelist cache so an already-downloaded MAMEM dataset never contacts Figshare: - Centralise local-path resolution in `BaseMAMEM._dataset_root`, so `data_path` and `_get_single_subject_data` never disagree on where the filelist cache should live. - Document `_load_or_fetch_filelist`'s resolution order (disk cache first, API only as a fallback) and add INFO/DEBUG logs so users can see "loaded from cache" vs "fetching from Figshare" without enabling network tracing. - Surface the cache failure path: if the JSON cache is unreadable (corrupt / wrong version) we now log the underlying error before falling back to the API. Add `test_mamem_already_downloaded_does_not_ping_figshare` -- an end-to-end test that pre-creates local files and the cache JSON, then iterates `data_path` over three subjects with `fs_get_file_list` patched to raise on call. Confirms zero Figshare API hits in the "already downloaded" case (the user's primary concern).
- Hoist the c-VEP train/test command-id resolution into a shared ``resolve_cvep_command_ids`` helper in ``moabb/datasets/utils.py``; both MartinezCagigal files now call it instead of carrying their own copy of the train/test branch. - Replace the per-trial ``np.where(trial_idx == t)[0][0]`` scans (one for command_ids, one for first_trial_onsets — both O(n_trials × n)) with a single ``np.unique(trial_idx, return_index=True)`` pass that feeds both arrays. - Drop the redundant ``osp.exists`` TOCTOU check in ``BaseMAMEM._load_or_fetch_filelist``; let the open() attempt and ``FileNotFoundError`` decide. - Trim verbose docstrings, comments, and log messages across the changed files (BNCI2020_002 reshape comment, ``_load_or_fetch_filelist`` / ``_dataset_root`` / ``_filelist_cache_path``, ``fs_get_file_list``, whats_new entries, and the test docstrings).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes three independent loader bugs surfaced by
neuralfetch's MOABB wrappers, where the datasets were either silently corrupted or unusable from datacenter IPs:BNCI2020_002reshape produced wrong EEG layout (moabb/datasets/bnci/bnci_2020.py).bciexp.datais(n_channels, n_samples, n_trials)and F-contiguous out ofscipy.io.loadmat; the default C-orderreshape(n_channels, -1)produced a trial-fastest interleaved layout (flat[c, k] = data[c, k // n_trials, k % n_trials]) instead of the trial-major layout that the per-trial stim markers and downstream epoching both assumed. Every per-stimulus epoch was therefore sampled from the wrong trial and the canonical N2pc analysis sat at chance. Fix:transpose(0, 2, 1).reshape(...)soflat[c, t * n_samples + s] == data[c, s, t].MartinezCagigal2023Checker/MartinezCagigal2023Parystim_trialcarried the wrong values (moabb/datasets/martinezcagigal2023_checker_cvep.py,..._pary_cvep.py). The channel was passedtrial_labels = unique_trials.astype(int)(per-recording trial index0..n_trials-1) instead of the attended command id. The same marker value therefore pointed at different attended symbols across recordings and broke multiclass classification. The other c-VEP loaders (Castillos*,Thielen*) already encode the command id. Fix: resolvecommand_idsfromcvep_data["command_idx"](train) or from thecommands_infolabel map (test), pass them toadd_stim_channel_trial, and addcommand_idto the_trial_metaannotation extras so the attended symbol survives BIDS export. The epoch-level codebook keeps using per-recording trial id sincecodesis built per recording.MAMEM SSVEP loaders 403'd on the Figshare API (
moabb/datasets/ssvep_mamem.py,moabb/datasets/download.py).fs_get_file_listwas called once per subject inside bothdata_pathand_get_single_subject_data, which triggered the public endpoint's rate limit from datacenter IPs whenever a user iterated subjects. Two layers of caching now apply: (1) a process-level@functools.lru_cache(maxsize=None)onfs_get_file_listkeyed by(article_id, version), and (2) an on-disk JSON cache (figshare_filelist_<id>.json) inside the dataset directory, so once a MAMEM dataset has been downloaded the loader never contacts Figshare again.force_update=Truebypasses both layers.Adds
moabb/tests/test_dataset_fixes.pywith three offline regression tests (mockingloadmat,_fs_paginated_file_list, andfs_get_file_list) so the suite stays network-free.Test plan
python -m ruff checkandpython -m ruff format --checkclean on all touched files.pytest moabb/tests/test_dataset_fixes.py(3 passed) — covers the trial-major layout assertion, thelru_cacheshort-circuit, and the on-disk filelist cache +force_updatebypass.pytest moabb/tests/test_martinezcagigal_verbosity.pystill passes.Notes
The
BNCI2020_002reshape fix is independently verified byneuralfetch'stest_reichert2020_impact_eeg_layout(which previously had to override_load_rawto work around the bug). With this PR, the override can be removed downstream.