
Thread: Project: create a linguistic IM analyzer

  1. #1
    Banned
    Join Date
    Oct 2005
    TIM
    TiNe
    Posts
    7,967
    Mentioned
    11 Post(s)
    Tagged
    0 Thread(s)

    Default Project: create a linguistic IM analyzer

    I would like to write a program that would use the vocabulary correlates Rick got from the Russians to take written text and put an IM element tag after every instance of its corresponding vocabulary.

    For example, if the text contains the word "history", then the program would put the corresponding element tag after the word "history".

    I think this could be done very easily; however, design is half the work of programming.

    I believe such a program could illustrate with very interesting clarity just how much influence Model A has on a person's speech.

  2. #2
    reyn_til_runa's Avatar
    Join Date
    Feb 2006
    Location
    new jersey
    Posts
    1,009
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    Default

    Quote Originally Posted by tcaudilllg View Post
    I would like to write a program that would use the vocabulary correlates Rick got from the Russians to take written text and put an IM element tag after every instance of its corresponding vocabulary.

    For example, if the text contains the word "history", then the program would put the corresponding element tag after the word "history".

    I think this could be done very easily; however, design is half the work of programming.

    I believe such a program could illustrate with very interesting clarity just how much influence Model A has on a person's speech.
    neat idea, but it might confuse something like history channel for history.

    (i.e. the computer's natural difficulty with literal versus metaphorical meaning and man's natural tendency to litter conversation with meaningless/misused words).
    whenever the dog and i see each other we both stop where we are. we regard each other with a mixture of sadness and suspicion and then we feign indifference.

    Jerry, The Zoo Story by Edward Albee

  3. #3
    Banned
    Join Date
    Oct 2005
    TIM
    TiNe
    Posts
    7,967
    Mentioned
    11 Post(s)
    Tagged
    0 Thread(s)

    Default

    Quote Originally Posted by reyn_til_runa View Post
    neat idea, but it might confuse something like history channel for history.

    (i.e. the computer's natural difficulty with literal versus metaphorical meaning and man's natural tendency to litter conversation with meaningless/misused words).
    Never suggested it would be perfect... just something to give us a picture of what's going on. It'd be a long-term topic of interest to make it (ever more) accurate.

    From my programming experience, what's needed is:
    - the list, stored across a string array.
    - the list would have to be broken down into an object for each element, each with its own list of words.
    - a parser to break up the input text into tokens
    - a control loop to test each token against the entire list. If a match is found, look up the index of the match in an element correspondence table to get the element.
    - Append the token and its corresponding element (if any) to an output string.
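
    A minimal sketch of the pieces listed above, in JavaScript. The tiny vocabulary and the elements assigned to its words are made-up placeholders, not entries from the real list, and `buildLookup`, `tokenize` and `annotate` are hypothetical names. It also swaps the string array plus correspondence table for a single word-to-element object, which is one possible design rather than the one described.

```javascript
// Placeholder vocabulary: these word/element pairings are invented for
// illustration and are NOT taken from the real vocabulary list.
var elements = [
  { name: "Ni", words: ["history", "time"] },
  { name: "Se", words: ["might", "force"] }
];

// Flatten the per-element word lists into one word -> element table so
// each token needs a single lookup instead of a scan of the whole list.
function buildLookup(elementList) {
  var table = {};
  for (var i = 0; i < elementList.length; i++) {
    for (var j = 0; j < elementList[i].words.length; j++) {
      table[elementList[i].words[j]] = elementList[i].name;
    }
  }
  return table;
}

// Very rough parser: split on whitespace only (punctuation stays attached).
function tokenize(text) {
  return text.split(/\s+/).filter(function (t) { return t.length > 0; });
}

// Append each token, followed by its element in brackets if it has one.
function annotate(text, table) {
  var out = [];
  var tokens = tokenize(text);
  for (var i = 0; i < tokens.length; i++) {
    var key = tokens[i].toLowerCase();
    var known = Object.prototype.hasOwnProperty.call(table, key);
    out.push(known ? tokens[i] + "[" + table[key] + "]" : tokens[i]);
  }
  return out.join(" ");
}
```

    With these placeholder tables, `annotate("History might repeat", buildLookup(elements))` tags the first two words and leaves the third alone. Because the parser only splits on whitespace, a word with punctuation attached ("history,") would not match.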

  4. #4

    Join Date
    Aug 2007
    Posts
    75
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    Default

    I haven't coded in a few years, but it sounds like you could just do something like this in Perl or something...

    - assign a variable to each IM element.
    - read a file containing the conversation log line by line...
    - for each line, loop through a list of key words and phrases. Look for matches with regular expressions and increment corresponding IM variable. And you could append the phrases to arrays if you wanted to know which phrases were being counted.
    - compare variables. Variable with highest count --> leading function.

    Would probably take like <10 mins. to code. You could even use CGI or something to create an online version. Hope you make this -- it'd be interesting to see.
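
    Roughly the same counting idea, sketched in JavaScript rather than Perl. The key words and their elements are invented placeholders, and `countElements` and `topElement` are hypothetical names. Whether the highest count really points at the leading function is the conjecture being tested, not something the code shows.

```javascript
// Placeholder key words per element; invented for illustration only.
var keywords = {
  Ne: ["maybe", "possibility"],
  Te: ["efficient", "method"]
};

// Escape characters that have a special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count whole-word, case-insensitive matches of each element's key words,
// reading the log line by line.
function countElements(log, keywordMap) {
  var counts = {};
  var lines = log.split("\n");
  for (var element in keywordMap) {
    counts[element] = 0;
    for (var k = 0; k < keywordMap[element].length; k++) {
      var pattern = new RegExp("\\b" + escapeRegExp(keywordMap[element][k]) + "\\b", "gi");
      for (var n = 0; n < lines.length; n++) {
        var found = lines[n].match(pattern);
        if (found) counts[element] += found.length;
      }
    }
  }
  return counts;
}

// Return the element with the highest count (the first one wins on ties).
function topElement(counts) {
  var best = null;
  for (var element in counts) {
    if (best === null || counts[element] > counts[best]) best = element;
  }
  return best;
}
```

    Multi-word phrases work as key words too, as long as the whole phrase sits on one line of the log.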
    delta nf (?) ... 4w5 (?)

  5. #5
    Banned
    Join Date
    Oct 2005
    TIM
    TiNe
    Posts
    7,967
    Mentioned
    11 Post(s)
    Tagged
    0 Thread(s)

    Default

    I think it would go something like this (JS code follows):
    Code:
    var vocab = new Array("[list]");
    var elementCorrespondenceTable = new Array(vocab.length);
    // [fill elementCorrespondenceTable: entry i holds the element for vocab[i]]

    // [parse tokens into tokenList -- I think you can use regular expressions for that]

    var Output = "";

    for (var tokenIndex = 0; tokenIndex < tokenList.length; tokenIndex++) {

      // Append the token, then its element if the token is in the vocabulary
      Output = Output + tokenList[tokenIndex];

      for (var vocabIndex = 0; vocabIndex < vocab.length; vocabIndex++) {
        if (tokenList[tokenIndex] == vocab[vocabIndex]) {
          Output = Output + elementCorrespondenceTable[vocabIndex];
        }
      }
    }
    (to the programmers here this is obviously terribly inefficient but it'll do for illustration.)

    Link to the vocabulary:
    http://wikisocion.org/en/index.php?title=Vocabulary

    Now I just need a parser. Can someone give me a regular expression right quick with all the syntax usually used in English? I never got much into them myself.
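
    One possible regular expression for this, as a rough sketch: it treats runs of ASCII letters (with inner apostrophes) as words and the six punctuation marks , . ? ! : ; as separate tokens. The function name `tokenizeWithPunctuation` is made up for the example. Digits, quotes, hyphens and non-ASCII letters are simply skipped, so this covers common cases, not all English syntax.

```javascript
// Split text into word tokens and punctuation tokens with one regular
// expression. Words may contain apostrophes ("don't"); a run of periods
// ("...") stays together as a single token.
function tokenizeWithPunctuation(text) {
  var tokens = text.match(/[A-Za-z]+(?:'[A-Za-z]+)*|\.+|[,?!:;]/g);
  // String.prototype.match returns null when nothing matches at all.
  return tokens === null ? [] : tokens;
}
```

    Keeping the punctuation as tokens in the same list means the original order is preserved, so the text can be rebuilt by walking the list once.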
    Last edited by tcaudilllg; 03-16-2008 at 03:05 AM.

  6. #6
    Banned
    Join Date
    Oct 2005
    TIM
    TiNe
    Posts
    7,967
    Mentioned
    11 Post(s)
    Tagged
    0 Thread(s)

    Default

    Maybe this'll do it:
    Code:
    var syntaxTokenList = new Array(0);
    var syntaxTokenPoint = new Array(0);
    var syntaxTokenIndex = 0;

    // Make a buffer of all the syntax characters, along with their positions in the text

    for (var parserIndex = 0; parserIndex < Input.length; parserIndex++) {

      if (Input.charAt(parserIndex) == "," || Input.charAt(parserIndex) == "."
      || Input.charAt(parserIndex) == "?" || Input.charAt(parserIndex) == "!"
      || Input.charAt(parserIndex) == ":" || Input.charAt(parserIndex) == ";") {

        // Store the syntax character

        syntaxTokenList[syntaxTokenIndex] = Input.charAt(parserIndex);
        syntaxTokenPoint[syntaxTokenIndex] = parserIndex;
        syntaxTokenIndex++;

        // Replace the syntax character with a space.

        var Input1 = Input.substring(0, parserIndex);
        var Input2 = Input.substring(parserIndex + 1);
        Input = Input1 + " " + Input2;
      }

    }
    This is getting complicated.

    Here's the parser.
    Code:
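    var tokenList = new Array(0);
    var tokenIndex = 0;

    for (var parserIndex = 0; parserIndex < Input.length; parserIndex++) {

      if (Input.charAt(parserIndex) == " ") {

        // A space ends the current token, unless it follows another space
        // or is the very first character.
        if (parserIndex > 0 && Input.charAt(parserIndex - 1) != " ") {
          tokenIndex++;
        }
      }
      else {
        // Start each new token as an empty string before appending to it
        if (tokenList[tokenIndex] == undefined) {
          tokenList[tokenIndex] = "";
        }
        tokenList[tokenIndex] =
          tokenList[tokenIndex] + Input.charAt(parserIndex);
      }
    }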
    I'd say a regular expression could have eliminated most of the syntax code, but I digress.

    Now I'm wondering how to put the elements in with the reinstated syntax. (to restore the information to readability.) Si, you tricksy fox.
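
    A different way around the reinstating problem, sketched below: don't strip the syntax at all. `String.prototype.replace` with a callback rewrites only the word runs and leaves everything between them untouched. This is a swapped-in technique, not the buffer approach above; the table `elementOf` and the function `tagInPlace` are hypothetical, and the word/element pairings are placeholders.

```javascript
// Placeholder word -> element table; pairings invented for illustration.
var elementOf = { history: "Ni", might: "Se" };

// Tag words in place. Only the word runs are rewritten; punctuation and
// spacing are never removed, so nothing has to be put back afterwards.
function tagInPlace(text, table) {
  return text.replace(/[A-Za-z']+/g, function (word) {
    var key = word.toLowerCase();
    if (Object.prototype.hasOwnProperty.call(table, key)) {
      return word + "[" + table[key] + "]";
    }
    return word;
  });
}
```

    An ellipsis stays solid with this approach, because the periods are never matched by the word pattern.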

    Let's put 'em together.

    Code:
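    var syntaxTokenList = new Array(0);
    var syntaxTokenPoint = new Array(0);
    var syntaxTokenAssociate = new Array(0);
    var syntaxTokenIndex = 0;
    var tokenList = new Array(0);
    var tokenIndex = 0;

    for (var parserIndex = 0; parserIndex < Input.length; parserIndex++) {

      if (Input.charAt(parserIndex) == "," || Input.charAt(parserIndex) == "."
      || Input.charAt(parserIndex) == "?" || Input.charAt(parserIndex) == "!"
      || Input.charAt(parserIndex) == ":" || Input.charAt(parserIndex) == ";") {

        // Store the syntax character, its position, and the token it follows

        syntaxTokenList[syntaxTokenIndex] = Input.charAt(parserIndex);
        syntaxTokenPoint[syntaxTokenIndex] = parserIndex;
        syntaxTokenAssociate[syntaxTokenIndex] = tokenIndex;
        syntaxTokenIndex++;

        // Replace the syntax character with a space.

        var Input1 = Input.substring(0, parserIndex);
        var Input2 = Input.substring(parserIndex + 1);
        Input = Input1 + " " + Input2;
      }

      if (Input.charAt(parserIndex) == " ") {

        // A space ends the current token, unless it follows another space
        if (parserIndex > 0 && Input.charAt(parserIndex - 1) != " ") {
          tokenIndex++;
        }
      }
      else {
        // Start each new token as an empty string before appending to it
        if (tokenList[tokenIndex] == undefined) {
          tokenList[tokenIndex] = "";
        }
        tokenList[tokenIndex] =
          tokenList[tokenIndex] + Input.charAt(parserIndex);
      }
    }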
    Which puts us in position to conduct the following solution:
    Code:
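    var Output = "";
    syntaxTokenIndex = 0; // start again from the first stored syntax character

    for (tokenIndex = 0; tokenIndex < tokenList.length; tokenIndex++) {

      Output = Output + tokenList[tokenIndex];

      // Append the element if the token is in the vocabulary
      for (var vocabIndex = 0; vocabIndex < vocab.length; vocabIndex++) {
        if (tokenList[tokenIndex] == vocab[vocabIndex]) {
          Output = Output + elementCorrespondenceTable[vocabIndex];
        }
      }

      // Reinstate every syntax character associated with this token
      while (syntaxTokenAssociate[syntaxTokenIndex] == tokenIndex) {
        Output = Output + syntaxTokenList[syntaxTokenIndex];
        syntaxTokenIndex++;
      }

      Output = Output + " ";
    }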
    Not complete by any means (ellipsis, for example, would not remain solid), but it'll do for a test, I think. (nevermind, changing "if" to "while" did the trick.)
    Last edited by tcaudilllg; 03-16-2008 at 07:44 AM.

  7. #7
    Banned
    Join Date
    Oct 2005
    TIM
    TiNe
    Posts
    7,967
    Mentioned
    11 Post(s)
    Tagged
    0 Thread(s)

    Default

    Here it is:
    http://lordgalbalan.atspace.com/IManalyzer.html

    It's completely self-contained and without dependencies. This is a JavaScript version, not PHP.
    Just copy the text into the top box and hit process. An IM analysis of your words will appear in the bottom box. (Note that the vocabulary is currently very, very small, and the program does not distinguish between words that have different meanings depending on their usage: for example, "might" vs "I might or I might not".)

    It also has some syntax foibles, so don't expect a flawless performance (although it will be stable).

    It's public domain so go nuts.

  8. #8
    I'm back, assholes! Herzy's Avatar
    Join Date
    May 2005
    TIM
    SLE
    Posts
    5,098
    Mentioned
    44 Post(s)
    Tagged
    7 Thread(s)

    Default

    I typed in a paragraph of something I wrote, and it gave me every single IM element except Fi, and it gave me Ne three times as much as any other IM element.

    IMO common words like "if" shouldn't be analyzed in your program. But it is a neat idea overall.
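
    Skipping common words could be done with a stop list applied before the vocabulary lookup. A minimal sketch, assuming the text has already been split into tokens; the list contents and the name `removeStopWords` are hypothetical, and which words belong on the list is a judgment call.

```javascript
// Hypothetical stop list; which words belong here is a judgment call.
var stopWords = ["if", "the", "a", "and", "of", "to"];

// Drop tokens that appear in the stop list (case-insensitive).
function removeStopWords(tokens, stopList) {
  return tokens.filter(function (token) {
    return stopList.indexOf(token.toLowerCase()) === -1;
  });
}
```

    Only the tokens that survive the filter would then be tested against the vocabulary.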
    , Se-sub
    8w8-3w8-7w8 sx/sx

  9. #9
    snegledmaca's Avatar
    Join Date
    Sep 2005
    Posts
    1,900
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)

    Default

    This is fascinating. Ok, so I took 9 large segments I made from the site, in total 18 full pages in Word, around 12000 words, 55000 characters, and counted how many times each element popped up. The results are:

    Te – 160
    Fi – 134
    Ne – 120
    Se – 74
    Ni – 74
    Ti – 69
    Si – 50
    Fe – 15

  10. #10
    Banned
    Join Date
    Oct 2005
    TIM
    TiNe
    Posts
    7,967
    Mentioned
    11 Post(s)
    Tagged
    0 Thread(s)

    Default

    Two points:
    1) this is probably the most effective means of proving the existence of type to skeptics
    2) the system itself has a lot of problems that need to be worked out

    I'm putting this on Sourceforge, and I encourage anyone with programming experience to take a look at it and ask themselves what they can contribute to it. Additionally, I think we would all benefit from a discussion of the terms we associate with the information aspects.

    Issues:
    - syntax problems requiring the implementation of a better parsing algorithm
    - sometimes words alone are insufficient to capture an element. A phrase parser would better capture these idioms. Phrase parsers require sophisticated algorithms and technology which may not now exist, to be honest. (at the very least we don't see it in Systrans translations of Russian articles: current translators lack any comprehension of context).
    - not all words can be correlated to IM aspects. Some words describe relations between aspects. (the aspects and the relations between them being the sum total of all human cognition and linguistic processing).

    I don't think I have any obligation to take on all of these responsibilities myself. I think we have a community obligation, each of us doing our part, to create the most capable IM analysis program we can manage.

  11. #11
    Banned
    Join Date
    Oct 2005
    TIM
    TiNe
    Posts
    7,967
    Mentioned
    11 Post(s)
    Tagged
    0 Thread(s)

    Default

    On phrase parsing:
    I think the best way to implement it is to create a sort of "turtle" language, like the ones used in early programming languages for graphics functions, that can be used to define the relationship between individual words in a phrase.

    The algorithm goes something like this:
    - if a word (we will call it the "key") is a part of a phrase that describes an IM aspect, pass it to a phrase processor
    - the phrase processor has a list of words associated with the key word. Alongside this list it has the positions of these words relative to the key word. (For example, in "seems like", where "like" is the key word, the entry would be "-1", indicating the position of "seems".) When a phrase is recognized, the element associated with the phrase is used instead of the elements of its constituent words. (The complicated part is that words which already had elements attached to them would need to be re-examined.)
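
    A rough sketch of the phrase processor described above, in JavaScript. Each key word maps to a list of phrase rules, and each rule lists the other words with their offsets from the key. All tables, element pairings and the name `tagTokens` are invented for illustration. Tagging single words first and then letting phrase matches overwrite those tags is one way to handle the re-examination problem, not necessarily the only one.

```javascript
// Placeholder tables; every word/element pairing here is invented.
var wordElements = { like: "Fi", seems: "Ti" };
var phraseRules = {
  // key word -> list of phrases it can anchor
  like: [
    { parts: [{ offset: -1, word: "seems" }], element: "Ne" }
  ]
};

// Return one element (or null) per token. Phrase matches override the
// single-word elements of the words they cover, including earlier words.
function tagTokens(tokens, words, rules) {
  var lower = tokens.map(function (t) { return t.toLowerCase(); });
  // First pass: tag every token as a single word.
  var tags = lower.map(function (t) {
    return Object.prototype.hasOwnProperty.call(words, t) ? words[t] : null;
  });
  // Second pass: look for phrases anchored at each key word.
  for (var i = 0; i < lower.length; i++) {
    if (!Object.prototype.hasOwnProperty.call(rules, lower[i])) continue;
    var candidates = rules[lower[i]];
    for (var r = 0; r < candidates.length; r++) {
      var parts = candidates[r].parts;
      var matched = parts.every(function (p) {
        return lower[i + p.offset] === p.word;
      });
      if (!matched) continue;
      // Clear the tags of the covered words, then tag the key word
      // with the element of the whole phrase.
      parts.forEach(function (p) { tags[i + p.offset] = null; });
      tags[i] = candidates[r].element;
      break;
    }
  }
  return tags;
}
```

    With these placeholder tables, "seems like" gets the single phrase element on "like" and nothing on "seems", while a bare "like" keeps its own element.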

    I'll do this, but I need the rest of you to take on the task of actually going through the dictionary to find all the words to associate with the elements. I'm sure the Wikisocion people are up for that, am I right?

  12. #12
    Creepy-bg

    Default

    Quote Originally Posted by tcaudilllg View Post
    Two points:
    1) this is probably the most effective means of proving the existence of type to skeptics
    2) the system itself has a lot of problems that need to be worked out

    I'm putting this on Sourceforge, and I encourage anyone with programming experience to take a look at it and ask themselves what they can contribute to it. Additionally, I think we would all benefit from a discussion of the terms we associate with the information aspects.

    Issues:
    - syntax problems requiring the implementation of a better parsing algorithm
    - sometimes words alone are insufficient to capture an element. A phrase parser would better capture these idioms. Phrase parsers require sophisticated algorithms and technology which may not now exist, to be honest. (at the very least we don't see it in Systrans translations of Russian articles: current translators lack any comprehension of context).
    - not all words can be correlated to IM aspects. Some words describe relations between aspects. (the aspects and the relations between them being the sum total of all human cognition and linguistic processing).

    I don't think I have any obligation to take on all of these responsibilities myself. I think we have a community obligation, each of us doing our part, to create the most capable IM analysis program we can manage.
    I know this doesn't really help but Systrans licensed translation software does phrases (so the technology does exist)

  13. #13
    emeye's Avatar
    Join Date
    Mar 2006
    Posts
    255
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)

    Default

    Quote Originally Posted by Bionicgoat View Post
    I know this doesn't really help but Systrans licensed translation software does phrases (so the technology does exist)
    Yes, but in the same manner Debbie does Dallas: without imagination and relatively poorly.
    XXXx <-- almost a beer

  14. #14
    I had words here once, but I didn't feed them Khola's Avatar
    Join Date
    Nov 2007
    TIM
    ESE
    Posts
    3,535
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    Default

    Hey, don't knock Debbie does Dallas! My hubby says it was 'aaiiight!
    Hello, my name is Bee. Pleased to meet you .



  15. #15
    Creepy-bg

    Default

    Quote Originally Posted by bee View Post
    Hey, don't knock Debbie does Dallas! My hubby says it was 'aaiiight!
    I liked the 2000 version... the old one just doesn't do it for me

  16. #16
    I had words here once, but I didn't feed them Khola's Avatar
    Join Date
    Nov 2007
    TIM
    ESE
    Posts
    3,535
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    Default

    I wouldn't know
    Hello, my name is Bee. Pleased to meet you .



  17. #17
    Creepy-bg

    Default

    Quote Originally Posted by bee View Post
    I wouldn't know
    people should watch more porn...

  18. #18
    I had words here once, but I didn't feed them Khola's Avatar
    Join Date
    Nov 2007
    TIM
    ESE
    Posts
    3,535
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    Default

    This is why moving back into my parent's house was a bad idea. Damn I gotta get outta here!
    Hello, my name is Bee. Pleased to meet you .


