Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix Bug: KeyError: 'text' Corresponding to issue #296 #300

Merged
merged 3 commits into from
Apr 18, 2024

Conversation

shiweijiezero
Copy link
Contributor

Solution for issue #296
File data_juice/config/config.py lines 418-429 did not consider the situation when arg: text_key was initialized to 'text', resulting in arg: text_key not being updated properly and always being initialized to the value of 'text'
Thus, I just supplement the set of "text_key".

File data_juice/config/config.py lines 418-429 did not consider the situation when arg: text_key was initialized to 'text', resulting in arg: text_key not being updated properly and always being initialized to the value of 'text'
@shiweijiezero
Copy link
Contributor Author

shiweijiezero commented Apr 17, 2024

Original Code:

    for op in cfg.process:
        for op_name in op:
            args = op[op_name]
            if args is None:
                args = {
                    'text_key': text_key,
                    'image_key': cfg.image_key,
                    'audio_key': cfg.audio_key,
                    'video_key': cfg.video_key,
                }
            elif args['text_key'] is None:
                args['text_key'] = text_key
                args['image_key'] = cfg.image_key
                args['audio_key'] = cfg.audio_key
                args['video_key'] = cfg.video_key

            op[op_name] = args

Fixed:

    for op in cfg.process:
        for op_name in op:
            args = op[op_name]
            if args is None:
                args = {
                    'text_key': text_key,
                    'image_key': cfg.image_key,
                    'audio_key': cfg.audio_key,
                    'video_key': cfg.video_key,
                }
            elif args['text_key'] is None:
                args['text_key'] = text_key
                args['image_key'] = cfg.image_key
                args['audio_key'] = cfg.audio_key
                args['video_key'] = cfg.video_key
            if text_key:
                args['text_key'] = text_key
                args['image_key'] = cfg.image_key
                args['audio_key'] = cfg.audio_key
                args['video_key'] = cfg.video_key

            op[op_name] = args

@yxdyc yxdyc requested review from HYLcool and yxdyc April 18, 2024 02:25
@HYLcool
Copy link
Collaborator

HYLcool commented Apr 18, 2024

Hi @shiweijiezero , thanks for your contribution!

You're correct👍🏻. According to your report and modification, we found out that a previous PR #165 made the code snippet useless because args for each OP are initialized by the default args and command line args before.

if args is None:
args = {
'text_key': text_key,
'image_key': cfg.image_key,
'audio_key': cfg.audio_key,
'video_key': cfg.video_key,
}
elif args['text_key'] is None:
args['text_key'] = text_key
args['image_key'] = cfg.image_key
args['audio_key'] = cfg.audio_key
args['video_key'] = cfg.video_key
if text_key:
args['text_key'] = text_key
args['image_key'] = cfg.image_key
args['audio_key'] = cfg.audio_key
args['video_key'] = cfg.video_key

In the data-juicer config modules, args of OPs are expected to be overwritten in this priority order: default args -> config args -> command line args. Following this rule and based on your modification, we suggest you continue to fix this bug completely by:

  1. swapping these two lines 271 and 272 of code. Function init_setup_from_cfg updates OP args from config files and function update_op_process updates them from command line args. So we need to apply init_setup_from_cfg then update_op_process.

    try:
    cfg = parser.parse_args(args=args)
    cfg = update_op_process(cfg, parser)
    cfg = init_setup_from_cfg(cfg)

  2. decoupling the updates for text/audio/image/video keys. These keys are updated only by the changes of text_key before, which would cause the same issue you met on audio/image/video keys. So we suggest you modify this part (lines 425 to 429) of the code like the below:

            ......
            else:
                if 'text_key' not in args or args['text_key'] is None:
                    args['text_key'] = text_key
                if 'image_key' not in args or args['image_key'] is None:
                    args['image_key'] = cfg.image_key
                if 'audio_key' not in args or args['audio_key'] is None:
                    args['audio_key'] = cfg.audio_key
                if 'video_key' not in args or args['video_key'] is None:
                    args['video_key'] = cfg.video_key
            op[op_name] = args
            ......

Thanks for your contribution again! Feel free to discuss the fix suggestions with us if you have any further considerations!

@HYLcool HYLcool added the bug Something isn't working label Apr 18, 2024
@HYLcool HYLcool linked an issue Apr 18, 2024 that may be closed by this pull request
3 tasks
@shiweijiezero
Copy link
Contributor Author

Yeah,
Following your suggestion, everything seems to be OK.
And I had committed it again.

@HYLcool
Copy link
Collaborator

HYLcool commented Apr 18, 2024

One last thing, you might need to apply pre-commit checking according to coding style doc.

Normalize Format
Copy link
Collaborator

@HYLcool HYLcool left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. Thanks for your contribution!

@HYLcool HYLcool merged commit 33f72b1 into modelscope:main Apr 18, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: 运行tools/analyze_data.py报错,出现 KeyError: 'text'
2 participants