Recent Activity
vLLM v0.11.1 seems to work, but v0.11.2 fails
Max model len check
OK, but what were the gains over untrained Qwen-VL 4B and 8B? You don't provide a baseline.
Also check the validation set in a very granular way. What works and what doesn't? Is it the same for both, with only a few slim differences? Is there a part of the validation that is always wrong: orientation, relative position, identification, inter-relation, etc.? You have question_category; add question_type, object_type, object, other_object_type, other_object...
Can density vs. diversity help for a specific section of the validation questions?
I think you have too many variables to be able to get the answer you want.
Take one type of question and have it formulated in different ways (does A touch B, is A adjacent to B, is A over B, ...), then validate on checkpoints every X steps, starting at step 0. Do you see different changes for diversity vs. density? Does accuracy plateau, or does it keep improving (more epochs needed)? Once you have a working evaluation pipeline, test a few more types of question. Save your validation tests to a dataset: type (diverse, dense), step X, generated answer, reference answer, pass/fail value. Then you can reprocess your validation tests with different code or an LLM judge to inspect them, etc.; see the sketch below.
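A minimal sketch of what saving those runs could look like with the `datasets` library; the field names and dummy rows are my own guesses, not anything from your setup:

```python
from datasets import Dataset

# Dummy rows standing in for real evaluation output; the schema is an
# assumption, not something from the original experiment.
records = [
    {"type": "dense", "step": 0, "question": "Does A touch B?",
     "generated_answer": "yes", "reference_answer": "no", "result": "fail"},
    {"type": "diverse", "step": 500, "question": "Is A adjacent to B?",
     "generated_answer": "no", "reference_answer": "no", "result": "pass"},
]

Dataset.from_list(records).save_to_disk("validation_runs")

# Later: reload and re-judge with different code (or an LLM judge) without
# re-running the model, since the raw generated answers were kept.
reloaded = Dataset.load_from_disk("validation_runs")
rejudged = reloaded.map(
    lambda r: {"result": "pass" if r["generated_answer"] == r["reference_answer"] else "fail"}
)
```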
I would really recommend that you group your validation by task/item/orientation/interaction labels. This could show where you had specific gains for diversity vs. density. Try different labeling strategies, as they may influence the results greatly. This can be done on validation data that you already saved. Generate lots of plots to visualize, e.g. as sketched below.
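For instance, something like this could plot per-label accuracy over steps, assuming the saved schema above plus a per-question `task` label:

```python
import matplotlib.pyplot as plt
from datasets import load_from_disk

# Assumes the schema sketched above, plus an extra "task" label column.
df = load_from_disk("validation_runs").to_pandas()
df["passed"] = df["result"] == "pass"

# One accuracy-vs-step curve per (task, set type) pair.
acc = df.groupby(["task", "type", "step"])["passed"].mean().reset_index()
for (task, set_type), group in acc.groupby(["task", "type"]):
    plt.plot(group["step"], group["passed"], marker="o", label=f"{task} / {set_type}")
plt.xlabel("training step")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```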
Also consider removing validation items that already pass on the baseline; it will make your evaluation faster, but you will not be able to test for regressions.
Before going even bigger, I would try a few more micro trainings. Select a small set of 100 images and check gains on a validation set specific to those images (questions on the same subjects, actions, ...), to see what gains a small set gives. Also try up to 15 epochs, or until accuracy plateaus.
Also, for validation, try absolute answers like left/right/yes/closer/farther/over, depending on your validation function. Or do you use an LLM to check the generated answer against the reference answer in the dataset? For example, checking that "right hand" is present in the answer. Did you validate that your answer handling introduces no accuracy errors? A deterministic checker is sketched below.
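A deterministic checker could be as simple as this sketch; the answer vocabulary is an assumption, adjust it to your dataset:

```python
import re

# Minimal sketch of a checker for closed-vocabulary answers.
VOCAB = {"left", "right", "yes", "no", "closer", "farther", "over", "under"}

def extract_answer(text: str) -> str | None:
    # Tokenizing first avoids substring traps like matching "over" in "overall".
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in VOCAB:
            return token
    return None

def is_correct(generated: str, reference: str) -> bool:
    return extract_answer(generated) == reference.strip().lower()

# Quick sanity checks on the checker itself:
assert is_correct("The right hand is raised.", "right")
assert not is_correct("It is slightly to the left.", "right")
```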
Good luck
Quick ideas after a super fast read.
I would look at performance per $ (per token) during training and during inference. Thinking may have lower returns than instruct, within +/- 2%, depending on token usage.
At first look, it seems that instruct could be more efficient, as it needs fewer tokens per % accuracy; see the back-of-the-envelope sketch below.
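A back-of-the-envelope way to compare them (all numbers below are made up for illustration):

```python
# If a thinking model gains ~2% accuracy but spends far more tokens per answer,
# the expected token cost per correct answer can still be much worse.
def tokens_per_correct_answer(accuracy: float, avg_tokens: float) -> float:
    # Expected generated tokens spent per correct answer.
    return avg_tokens / accuracy

print(tokens_per_correct_answer(0.82, 450.0))  # thinking: long traces -> ~549 tok/correct
print(tokens_per_correct_answer(0.80, 60.0))   # instruct: short answers -> 75 tok/correct
```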
You should have validation on different levels of questions (stratified), since your real-world task may require high accuracy on precise, hard things, or high accuracy on simple things. Also check whether there are similarities among the failed validation tests (categorize by task, question type, object, ...); see the sketch below. This is the only way to investigate what is causing trouble, and perhaps to focus on it a bit more to see if there are improvements. For example, after 1 epoch, if some things fail validation, train on a separate set with a higher proportion of that specific thing and check whether accuracy gets better. If it plateaus, you have found the issue to tackle; if it continues to get better, that thing is under-represented in your base dataset.
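For the failure categorization, a quick sketch (the file and column names are assumptions):

```python
import pandas as pd

# Rank strata by failure rate after an epoch, assuming validation results were
# saved with "question_type" and "result" columns.
df = pd.read_csv("epoch1_validation.csv")
ranking = (
    df.assign(failed=df["result"] == "fail")
      .groupby("question_type")["failed"]
      .agg(["mean", "count"])
      .sort_values("mean", ascending=False)
)
print(ranking.head(10))  # top rows = candidates for an oversampled follow-up set
```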
Also validate against the standard Qwen VLM to check accuracy changes (+/-) in and out of domain after training. Also check other VLMs (Kimi VL, Mistral, ...) for a comparison against Qwen's initial accuracy.
Perhaps another VLM has better validation accuracy before SFT, making fine-tuning easier?
Also validate after 1 epoch, after 2 epochs, etc.
Also check merging the models (dense + diverse) using mergekit; a config sketch is below.
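Something like this linear-merge config might be a starting point (paths and weights are placeholders, and whether mergekit handles your exact VLM architecture is worth verifying first):

```python
# Writes a mergekit config for a linear merge of the two checkpoints.
config = """\
merge_method: linear
models:
  - model: ./checkpoint-dense
    parameters:
      weight: 0.5
  - model: ./checkpoint-diverse
    parameters:
      weight: 0.5
dtype: bfloat16
"""
with open("merge_config.yml", "w") as f:
    f.write(config)
# Then run: mergekit-yaml merge_config.yml ./merged-model
```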
Also check that what you train on is well supported by the vision encoder (SigLIP-2 is not the only one); it may require some training itself if it has problems discerning certain things you ask about.
Just the first things that came to mind.
Thanks for the article
using `assistant` instead of `system` in first role before `user`
Double system role?
Possible quality enhancement for the dataset
Great, but what about different languages? What minimal mix is required for bilingual, trilingual, ... models?
Also, what about adding a new language to an existing model? Or coding skills?
If you took your GPT-2 model and tried to make it speak 16+ languages, would continued training on a new static mix work, or would it need to be trained from scratch? Are there catastrophic forgetting risks?
Thanks