Compound words in NER #156

Open
torava opened this issue Mar 14, 2019 · 2 comments

Comments


torava commented Mar 14, 2019

Compound words don't work well with NER in nlp.js. For example, nlp.js treats Spiderman as a spider but not as a man, whereas Spider-Man or Spider Man is recognized as both a spider and a man. Whitespace (or a hyphen) seems to carry excessive significance, and I don't see the point of this distinction. This is a critical issue especially in Finnish, which has a huge number of compound words. It would be nice if nlp.js could be configured to behave differently in this case.

const { NerManager } = require('node-nlp');

// A low threshold so that weak partial matches are reported too
const manager = new NerManager({ threshold: 0.1 });
manager.addNamedEntityText('species', 'spider', 'en', ['spider']);
manager.addNamedEntityText('species', 'man', 'en', ['man']);
// Compare the closed, open, and hyphenated spellings
manager.findEntities('spiderman', 'fi').then(entities => console.log(entities));
manager.findEntities('spider man', 'fi').then(entities => console.log(entities));
manager.findEntities('spider-man', 'fi').then(entities => console.log(entities));

spiderman returns

[ { start: 0,
    end: 8,
    len: 9,
    levenshtein: 3,
    accuracy: 0.5,
    option: 'spider',
    sourceText: 'spider',
    entity: 'species',
    utteranceText: 'spiderman' } ]

spider man and spider-man both return

[ { start: 0,
    end: 5,
    len: 6,
    levenshtein: 0,
    accuracy: 1,
    option: 'spider',
    sourceText: 'spider',
    entity: 'species',
    utteranceText: 'spider' },
  { start: 7,
    end: 9,
    len: 3,
    levenshtein: 0,
    accuracy: 1,
    option: 'man',
    sourceText: 'man',
    entity: 'species',
    utteranceText: 'man' } ]
@jesus-seijas-sp
Contributor

Hello, in this case it happens because of this function:
https://github.com/axa-group/nlp.js/blob/master/lib/util/similar-search.js#L135

This function extracts the word positions based on whether each character is alphanumeric or not. So perhaps a per-language strategy could be implemented, as is already done in the tokenizers for each language.
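To make the effect concrete, here is a minimal sketch of alphanumeric-based word positions. It is a hypothetical stand-in written for this thread, not the code from similar-search.js; the name `getWordPositions` and the returned field names are assumptions chosen to mirror the `start`/`end`/`len` fields in the outputs above.

```javascript
// Hypothetical stand-in, NOT the library's implementation.
// Records the position of every run of letters or digits in the text.
function getWordPositions(text) {
  const positions = [];
  // \p{L} = any letter, \p{N} = any number; the `u` flag enables these classes
  const wordPattern = /[\p{L}\p{N}]+/gu;
  for (const match of text.matchAll(wordPattern)) {
    positions.push({
      start: match.index,
      end: match.index + match[0].length - 1, // inclusive end index
      len: match[0].length,
    });
  }
  return positions;
}

console.log(getWordPositions('spiderman'));  // a single word
console.log(getWordPositions('spider-man')); // two words, split at the hyphen
```

With a rule like this, "spiderman" is one candidate word while "spider-man" and "spider man" are two, which would explain why only the separated spellings match both entities exactly.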

One question: can you explain what result you expect in each case? "spider man", "spiderman" and "spider-man".

Also, "spiderman" returns "spider" because the threshold provided is very low; with a threshold >0.5 it will return an empty array.
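The numbers in the output can be reproduced with a small calculation. The sketch below assumes that accuracy is the entity text length minus the Levenshtein distance, divided by the entity text length. That formula is inferred from the values reported in this thread and is not confirmed to be what nlp.js does internally; `levenshtein` and `accuracy` are illustrative helpers.

```javascript
// Classic dynamic-programming Levenshtein distance (row by row).
function levenshtein(a, b) {
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i += 1) {
    const curr = [i];
    for (let j = 1; j <= b.length; j += 1) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[b.length];
}

// ASSUMED formula, inferred from the outputs in this thread.
function accuracy(entityText, word) {
  const distance = levenshtein(entityText, word);
  return (entityText.length - distance) / entityText.length;
}

console.log(levenshtein('spider', 'spiderman')); // 3
console.log(accuracy('spider', 'spiderman'));    // 0.5
```

Under this assumption, "spider" against the single word "spiderman" scores 0.5, which passes a threshold of 0.1 but not one above 0.5.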


torava commented Mar 15, 2019

Hi, I would expect it to return both spider and man. Then you would have a better chance of detecting whatever new supermen are invented. However, that is not as practical an example as the ones found in Finnish:

kana-caesarsalaatti = chicken caesar salad
tonnikalapastasalaatti = tuna pasta salad
pasta-kinkkusalaatti = pasta salad with ham
savukalasalaatti = smoked fish salad
savuporosalaatti = smoked reindeer salad
lohisalaatti = salmon salad
savulohi-vihannessalaatti = salad with smoked salmon and vegetables
savustettu lohi ja vihannessalaatti = salad with smoked salmon and vegetables
kylmäsavulohisalaatti = cold smoked salmon salad
sipuli-perunasalaatti = onion potato salad
tomaatti-mozzarellasalaatti = tomato mozzarella salad
peruna-broileri-juustosalaatti = potato broiler cheese salad
grillikasvis-couscoussalaatti = salad with grilled vegetables and couscous
savuhärkä-pastasalaatti = pasta salad with smoked beef
lohi-avokadosalaatti = salmon avocado salad
kinkku-nuudelisalaatti = ham noodle salad
seesamiahvensalaatti = sesame perch salad
riisinuudelisalaatti = rice noodle salad
valkosipulisalaattikastike = garlic salad dressing
yrtti-balsamicosalaattikastike = herb balsamico salad dressing
tomaatti-chilisalaattikastike = tomato chili salad dressing

I could try to list all salads and salad dressings, or just agree that if a compound word ends with salad, it's a salad, and if it ends with salad dressing, it's a salad dressing. Then, if I also have a list of ingredients, I could tell what kind of salad or salad dressing it is. However, that's not possible now.

const { NerManager } = require('node-nlp');

const manager = new NerManager({ threshold: 0.1 });
// Register the parts of the compound as separate entities
manager.addNamedEntityText('animal', 'poro', 'fi', ['poro']);
manager.addNamedEntityText('dish', 'salaatti', 'fi', ['salaatti']);
manager.addNamedEntityText('food', 'savuporo', 'fi', ['savuporo']);
manager.addNamedEntityText('process', 'savu', 'fi', ['savu']);
manager.findEntities('savuporosalaatti', 'fi').then(entities => console.log(entities));

returns []

manager.findEntities('porosalaatti', 'fi').then(entities => console.log(entities));

returns

 [{ start: 0,
    end: 11,
    len: 12,
    levenshtein: 4,
    accuracy: 0.5,
    option: 'salaatti',
    sourceText: 'salaatti',
    entity: 'dish',
    utteranceText: 'porosalaatti' } ]

But I still don't know what kind of salad it is. I could add savuporosalaatti as a named entity, but that's an endless path to take. The same work would have to be done for every kind of dish: bread, porridge, soup, stew, omelette, sushi, burrito... It would be easier to prepare for any kind of dish than to tell people what kinds of dishes they can have.
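The suffix rule described earlier (a compound that ends with a dish word is that dish, and the rest names the ingredients) can be sketched outside nlp.js as a plain dictionary lookup. Everything here is hypothetical: `classifyCompound`, the two word lists, and the greedy longest-match strategy are assumptions for illustration, not an nlp.js feature.

```javascript
// Illustrative word lists; a real application would supply its own.
const DISHES = ['salaatti', 'salaattikastike'];
const INGREDIENTS = ['savuporo', 'poro', 'savulohi', 'lohi', 'tonnikala', 'pasta', 'valkosipuli'];

// Longest entries first, so 'salaattikastike' wins over 'salaatti'.
const longestFirst = (words) => [...words].sort((a, b) => b.length - a.length);

// Hypothetical helper: peel the dish off the end, then greedily match
// known ingredients from the left, skipping hyphens and whitespace.
function classifyCompound(word, dishes = DISHES, ingredients = INGREDIENTS) {
  const text = word.toLowerCase();
  const dish = longestFirst(dishes).find((d) => text.endsWith(d));
  if (!dish) return null;

  let rest = text.slice(0, text.length - dish.length);
  const found = [];
  const candidates = longestFirst(ingredients);
  while (rest.length > 0) {
    if (/^[\s-]/.test(rest)) {
      rest = rest.slice(1); // separator between parts
      continue;
    }
    const hit = candidates.find((c) => rest.startsWith(c));
    if (!hit) break; // unknown part: stop and report it as unmatched
    found.push(hit);
    rest = rest.slice(hit.length);
  }
  return { dish, ingredients: found, unmatched: rest };
}

console.log(classifyCompound('savuporosalaatti'));
console.log(classifyCompound('tonnikalapastasalaatti'));
```

Greedy matching is the simplest choice but can pick the wrong split when one ingredient is a prefix of another part, so a production version would probably need backtracking or scoring.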

This gets even trickier because savuporo is smoked reindeer, but savu alone is smoke and savustettu is smoked, and almost every meat can be smoked. savustettu lohi ja vihannessalaatti would be a salad with smoked salmon and vegetables. savustettu lohi-vihannessalaatti would mean that the whole salmon vegetable salad is smoked, which is unusual, but I won't stop you from doing that either.

Although grammar rules say that the correct form is valkosipulisalaattikastike, you can sometimes see it written as valkosipuli salaattikastike. Therefore, it would be best to be able to find entities of the form /(food[\s-]?)*dish/.
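As a rough illustration of that food/dish pattern, the sketch below builds a regular expression from two word lists: any number of foods, each optionally followed by whitespace or a hyphen, and then a dish. `buildCompoundPattern` and the sample lists are hypothetical; nlp.js does not provide this function as far as this thread shows.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: (food, optional separator)* followed by a dish.
function buildCompoundPattern(foods, dishes) {
  // Longest alternatives first, so longer words are tried before their prefixes
  const alternatives = (words) => [...words]
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join('|');
  return new RegExp(
    `^(?:(?:${alternatives(foods)})[\\s-]?)*(?:${alternatives(dishes)})$`,
    'i',
  );
}

const pattern = buildCompoundPattern(
  ['valkosipuli', 'savulohi', 'vihannes', 'lohi'],
  ['salaatti', 'salaattikastike'],
);

console.log(pattern.test('valkosipulisalaattikastike'));  // true
console.log(pattern.test('valkosipuli salaattikastike')); // true
console.log(pattern.test('savulohi-vihannessalaatti'));   // true
console.log(pattern.test('porosalaatti'));                // false: 'poro' is not in the food list
```

This only answers whether a string fits the pattern; extracting which foods were matched would need a separate pass over the string.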
