Project: create a linguistic IM analyzer

**tcaudilllg** · 03-15-2008, 10:50 PM

I would like to write a program that would use the vocabulary correlates Rick got from the Russians to take written text and put an IM element tag after every instance of its corresponding vocabulary.

For example, if the text contains the word "history", then the program would put

after the word "history".

I think this could be very easily done, however design is half the work of programming.

I believe such a program could illustrate with very interesting clarity just how much influence Model A has on a person's speech.

**reyn_til_runa** · 03-15-2008, 11:06 PM

Originally Posted by tcaudilllg

I would like to write a program that would use the vocabulary correlates Rick got from the Russians to take written text and put an IM element tag after every instance of its corresponding vocabulary.

For example, if the text contains the word "history", then the program would put

after the word "history".

I think this could be very easily done, however design is half the work of programming.

I believe such a program could illustrate with very interesting clarity just how much influence Model A has on a person's speech.

neat idea, but it might confuse something like history channel for

history.

(i.e. the computer's natural difficulty with literal versus metaphorical meaning and man's natural tendency to litter conversation with meaningless/misused words).

**tcaudilllg** · 03-15-2008, 11:50 PM

Originally Posted by reyn_til_runa

neat idea, but it might confuse something like history channel for

history.

(i.e. the computer's natural difficulty with literal versus metaphorical meaning and man's natural tendency to litter conversation with meaningless/misused words).

Never suggested it would be perfect... just something to give us a picture of what's going on. It'd be a long term topic of interest to make it (ever more) accurate.

From my programming experience, what's needed is:
- the list, stored across a string array.
- the list would have to be broken down into an object for each element, each with its own list of words.
- a parser to break up the input text into tokens
- a control loop to test each token against the entire list. If a match is found, look up the index of the match in an element correspondence table to get the element.
- Append the token and its corresponding element (if any) to an output string.
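
The steps above can be sketched roughly as follows. This is only an illustration, not the program itself: the names `wordsByElement`, `buildLookup`, and `tagText` are made up for the example, the word-to-element pairings are placeholders rather than the real vocabulary, and the square-bracket tag format is an assumption.

```javascript
// Hypothetical sample data: placeholder pairings, not the actual vocabulary.
const wordsByElement = {
  Te: ["work", "method"],
  Fi: ["trust", "friend"],
};

// Flatten the per-element word lists into one word -> element lookup table.
function buildLookup(lists) {
  const lookup = new Map();
  for (const [element, words] of Object.entries(lists)) {
    for (const word of words) {
      lookup.set(word.toLowerCase(), element);
    }
  }
  return lookup;
}

// Split on whitespace, then append each token plus its element tag (if any).
function tagText(text, lookup) {
  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const output = [];
  for (const token of tokens) {
    const element = lookup.get(token.toLowerCase());
    output.push(element ? `${token}[${element}]` : token);
  }
  return output.join(" ");
}

const lookup = buildLookup(wordsByElement);
console.log(tagText("Trust the method and work", lookup));
// -> Trust[Fi] the method[Te] and work[Te]
```

Using a single lookup table keyed by word avoids testing each token against the entire list, at the cost of allowing only one element per word.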

**Sabo** · 03-16-2008, 01:09 AM

I haven't coded in a few years, but it sounds like you could just do something like this in Perl or something...

- assign a variable to each IM element.
- read a file containing the conversation log line by line...
- for each line, loop through a list of key words and phrases. Look for matches with regular expressions and increment corresponding IM variable. And you could append the phrases to arrays if you wanted to know which phrases were being counted.
- compare variables. Variable with highest count --> leading function.
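
A rough sketch of that counting approach, written in JavaScript rather than Perl so it matches the other code in this thread. The phrase lists are placeholders, and `countElements`, `leadingElement`, and `escapeRegExp` are names invented for the example.

```javascript
// Hypothetical key words and phrases per element (placeholders only).
const phrasesByElement = {
  Te: ["efficient", "get it done"],
  Fi: ["i feel", "trust"],
};

// Escape characters that have special meaning inside a regular expression.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Read the log line by line and count phrase matches for each element.
function countElements(log, phrases) {
  const counts = {};
  const hits = {};
  for (const element of Object.keys(phrases)) {
    counts[element] = 0;
    hits[element] = [];
  }
  for (const line of log.split("\n")) {
    for (const [element, list] of Object.entries(phrases)) {
      for (const phrase of list) {
        // \b keeps "trust" from matching inside "distrustful".
        const pattern = new RegExp("\\b" + escapeRegExp(phrase) + "\\b", "gi");
        const found = line.match(pattern) || [];
        counts[element] += found.length;
        hits[element].push(...found); // remember which phrases were counted
      }
    }
  }
  return { counts, hits };
}

// Element with the highest count; ties go to the first one listed.
function leadingElement(counts) {
  let best = null;
  for (const [element, n] of Object.entries(counts)) {
    if (best === null || n > counts[best]) best = element;
  }
  return best;
}

const log = "I feel we should get it done.\nI trust you. I feel fine.";
const result = countElements(log, phrasesByElement);
console.log(result.counts, leadingElement(result.counts));
```

With this sample log the counts come out as Te 1 and Fi 3, so `leadingElement` returns "Fi".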

Would probably take like <10 mins. to code. You could even use CGI or something to create an online version. Hope you make this -- it'd be interesting to see.

**tcaudilllg** · 03-16-2008, 01:32 AM

I think it would go something like this (JS code follows):

Code:

vocab = new Array("[list]");
elementCorrespondenceTable = new Array(vocab.length);
// [fill elementCorrespondenceTable]

// [parse tokens -- I think you can use regular expressions for that]

Output = "";

for (tokenIndex = 0; tokenIndex < tokenList.length; tokenIndex++) {

  // Append the token, then its element if the token is in the vocabulary
  Output = Output + tokenList[tokenIndex];

  for (vocabIndex = 0; vocabIndex < vocab.length; vocabIndex++) {
    if (tokenList[tokenIndex] == vocab[vocabIndex]) {
      Output = Output + elementCorrespondenceTable[vocabIndex];
    }
  }

  Output = Output + " ";
}

(to the programmers here this is obviously terribly inefficient but it'll do for illustration.)

Link to the vocabulary:
http://wikisocion.org/en/index.php?title=Vocabulary

Now I just need a parser. Can someone give me a regular expression right quick with all the syntax usually used in English? I never got much into them myself.
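
One possible pattern, as a sketch only. It assumes that a word is a run of ASCII letters (optionally joined by apostrophes, as in "don't") and that the only punctuation worth keeping is `. , ? ! : ;`. Digits, hyphens, and quotation marks are simply dropped. `TOKEN_PATTERN` and `tokenize` are names made up for the example.

```javascript
// Words: runs of ASCII letters, optionally joined by apostrophes ("don't").
// Punctuation: each of . , ? ! : ; becomes its own token.
const TOKEN_PATTERN = /[A-Za-z]+(?:'[A-Za-z]+)*|[.,?!:;]/g;

// Return the words and punctuation marks of the text, in order.
function tokenize(text) {
  return text.match(TOKEN_PATTERN) || [];
}

console.log(tokenize("Wait, don't go: history repeats!"));
// -> ["Wait", ",", "don't", "go", ":", "history", "repeats", "!"]
```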

**tcaudilllg** · 03-16-2008, 02:47 AM

Maybe this'll do it:

Code:

syntaxTokenList = new Array(0);
syntaxTokenPoint = new Array(0);
syntaxTokenIndex = 0;

// Make a buffer of all the syntax characters, along with their positions in the text

for (parserIndex = 0; parserIndex < Input.length; parserIndex++) {

  if (Input.charAt(parserIndex) == "," || Input.charAt(parserIndex) == "."
  || Input.charAt(parserIndex) == "?" || Input.charAt(parserIndex) == "!"
  || Input.charAt(parserIndex) == ":" || Input.charAt(parserIndex) == ";") {

   // Store the syntax character

    syntaxTokenList[syntaxTokenIndex] = Input.charAt(parserIndex);
    syntaxTokenPoint[syntaxTokenIndex] = parserIndex;
    syntaxTokenIndex++;

  // Replace the syntax character with a space.

    Input1 = Input.substring(0, parserIndex);
    Input2 = Input.substring(parserIndex + 1);
    Input = Input1 + " " + Input2;
  }

}

This is getting complicated.

Here's the parser.

Code:

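tokenList = new Array(0);
tokenIndex = 0;

for (parserIndex = 0; parserIndex < Input.length; parserIndex++) {

  if (Input.charAt(parserIndex) == " ") {

    // Advance only at the end of a word, so runs of spaces count once
    if (parserIndex > 0 && Input.charAt(parserIndex - 1) != " ") {
      tokenIndex++;
    }
  }
  else {
    // Start each token as an empty string before appending to it
    if (tokenList[tokenIndex] == undefined) {
      tokenList[tokenIndex] = "";
    }
    tokenList[tokenIndex] = tokenList[tokenIndex] + Input.charAt(parserIndex);
  }
}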

I'd say a regular expression could have eliminated most of the syntax code, but I digress.

Now I'm wondering how to put the elements in with the reinstated syntax. (to restore the information to readability.) Si, you tricksy fox.

Let's put 'em together.

Code:

syntaxTokenList = new Array(0);
syntaxTokenPoint = new Array(0);
syntaxTokenAssociate = new Array(0);
tokenList = new Array(0);
syntaxTokenIndex = 0;
tokenIndex = 0;

for (parserIndex = 0; parserIndex < Input.length; parserIndex++) {

  if (Input.charAt(parserIndex) == "," || Input.charAt(parserIndex) == "."
  || Input.charAt(parserIndex) == "?" || Input.charAt(parserIndex) == "!"
  || Input.charAt(parserIndex) == ":" || Input.charAt(parserIndex) == ";") {

    // Store the syntax character, its position, and the token it follows

    syntaxTokenList[syntaxTokenIndex] = Input.charAt(parserIndex);
    syntaxTokenPoint[syntaxTokenIndex] = parserIndex;
    syntaxTokenAssociate[syntaxTokenIndex] = tokenIndex;
    syntaxTokenIndex++;

    // Replace the syntax character with a space.

    Input1 = Input.substring(0, parserIndex);
    Input2 = Input.substring(parserIndex + 1);
    Input = Input1 + " " + Input2;
  }

  if (Input.charAt(parserIndex) == " ") {

    // Advance only at the end of a word, so runs of spaces count once
    if (parserIndex > 0 && Input.charAt(parserIndex - 1) != " ") {
      tokenIndex++;
    }
  }
  else {
    // Start each token as an empty string before appending to it
    if (tokenList[tokenIndex] == undefined) {
      tokenList[tokenIndex] = "";
    }
    tokenList[tokenIndex] = tokenList[tokenIndex] + Input.charAt(parserIndex);
  }
}

Which puts us in position to conduct the following solution:

Code:

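Output = "";
syntaxTokenIndex = 0;  // start again from the first stored syntax character

for (tokenIndex = 0; tokenIndex < tokenList.length; tokenIndex++) {

  Output = Output + tokenList[tokenIndex];

  // Tag the token if it is in the vocabulary
  for (vocabIndex = 0; vocabIndex < vocab.length; vocabIndex++) {
    if (tokenList[tokenIndex] == vocab[vocabIndex]) {
      Output = Output + elementCorrespondenceTable[vocabIndex];
    }
  }

  // Reinstate every syntax character that followed this token
  while (syntaxTokenAssociate[syntaxTokenIndex] == tokenIndex) {
    Output = Output + syntaxTokenList[syntaxTokenIndex];
    syntaxTokenIndex++;
  }

  Output = Output + " ";
}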

Not complete by any means (ellipsis, for example, would not remain solid), but it'll do for a test, I think. (nevermind, changing "if" to "while" did the trick.)
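
For comparison, here is a sketch of a different approach that sidesteps the punctuation bookkeeping altogether. It is not the code above: it swaps in `String.prototype.replace` with a callback, so only the words are rewritten and everything else (spaces, commas, ellipses) stays where it was. The table `elementOf`, the function `tagInPlace`, the placeholder pairings, and the bracket tag format are all assumptions made for the example.

```javascript
// Hypothetical word -> element table (placeholder pairings).
const elementOf = new Map([
  ["work", "Te"],
  ["trust", "Fi"],
]);

function tagInPlace(text, table) {
  // Only runs of letters reach the callback; spaces and punctuation
  // (including "...") are never touched, so nothing needs reinstating.
  return text.replace(/[A-Za-z]+/g, (word) => {
    const element = table.get(word.toLowerCase());
    return element ? word + "[" + element + "]" : word;
  });
}

console.log(tagInPlace("Work... then trust, right?", elementOf));
// -> Work[Te]... then trust[Fi], right?
```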

**tcaudilllg** · 03-18-2008, 10:04 AM

Here it is:
http://lordgalbalan.atspace.com/IManalyzer.html

It's completely self-contained and without dependencies. This is a JavaScript version, not PHP.
Just copy the text into the top box and hit process. An IM analysis of your words will appear in the bottom box. (Note that the vocabulary is currently very, very small, and the program does not distinguish between words that have different meanings depending on their usage. For example,

"might" vs

"I might or I might not").

It also has some syntax foibles, so don't expect a flawless performance (although it will be stable).

It's public domain so go nuts.

**snegledmaca** · 03-18-2008, 11:13 AM

This is fascinating. OK, so I took 9 large segments I made from the site, in total 18 full pages in Word, around 12,000 words and 55,000 characters, and counted how many times each element popped up. The results are:

Te – 160
Fi – 134
Ne – 120
Se – 74
Ni – 74
Ti – 69
Si – 50
Fe – 15

**tcaudilllg** · 03-18-2008, 10:43 PM

Two points:
1) this is probably the most effective means of proving the existence of type to skeptics
2) the system itself has a lot of problems that need to be worked out

I'm putting this on Sourceforge, and I encourage anyone with programming experience to take a look at it and ask themselves what they can contribute to it. Additionally, I think we would all benefit from a discussion of the terms we associate with the information aspects.

Issues:
- syntax problems requiring the implementation of a better parsing algorithm
- sometimes words alone are insufficient to capture an element. A phrase parser would better capture these idioms. Phrase parsers require sophisticated algorithms and technology which may not now exist, to be honest. (at the very least we don't see it in Systrans translations of Russian articles: current translators lack any comprehension of context).
- not all words can be correlated to IM aspects. Some words describe relations between aspects. (the aspects and the relations between them being the sum total of all human cognition and linguistic processing).

I don't think I have any obligation to take on all of these responsibilities myself. I think we have a community obligation, with each of us doing our part, to create the most capable IM analysis program we can manage.

**tcaudilllg** · 03-19-2008, 04:37 AM

On phrase parsing:
I think the best way to implement it is to create a sort of "turtle" language, like the ones used in early programming languages for graphics functions, that can be used to define the relationship between individual words in a phrase.

The algorithm goes something like this:
- if a word (we will call it the "key") is a part of a phrase that describes an IM aspect, pass it to a phrase processor
- the phrase processor has a list of words associated with the key word. Alongside this list it has the positions of these words relative to the key word. (For example, "seems like", where "like" is the key word, would be described by "-1", indicating the position of "seems".) In this case the element associated with the recognized phrase would be used instead of the elements of its constituent words. (The complicated part of this is that the words which already had elements attached to them would need to be re-examined.)
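
A minimal sketch of that phrase processor, under some loud assumptions: `phraseTable`, `wordTable`, and `tagTokens` are names invented for the example, the element pairings are placeholders, and the input is assumed to be already split into tokens. Word-level tags are assigned first and then overridden wherever a phrase matches, which is one simple way of handling the re-examination problem.

```javascript
// Hypothetical phrase table, keyed by the key word. Each entry lists the
// neighbouring words required at positions relative to the key word, plus
// the element assigned to the whole phrase. Pairings are placeholders.
const phraseTable = new Map([
  ["like", [{ neighbours: [{ offset: -1, word: "seems" }], element: "Ne" }]],
]);

// Hypothetical single-word table, used when no phrase matches.
const wordTable = new Map([
  ["like", "Fi"],
  ["seems", "Ni"],
]);

function tagTokens(tokens) {
  // First pass: tag every token on its own.
  const tags = tokens.map((t) => wordTable.get(t.toLowerCase()) || null);

  // Second pass: where a key word's neighbours match, the phrase's element
  // replaces the word-level tags of all the words in the phrase.
  tokens.forEach((token, i) => {
    const candidates = phraseTable.get(token.toLowerCase()) || [];
    for (const phrase of candidates) {
      const matched = phrase.neighbours.every(({ offset, word }) => {
        const other = tokens[i + offset];
        return other !== undefined && other.toLowerCase() === word;
      });
      if (matched) {
        tags[i] = phrase.element;
        for (const { offset } of phrase.neighbours) tags[i + offset] = null;
        break;
      }
    }
  });

  return tokens.map((t, i) => (tags[i] ? `${t}[${tags[i]}]` : t)).join(" ");
}

console.log(tagTokens(["It", "seems", "like", "rain"]));
// -> It seems like[Ne] rain
console.log(tagTokens(["I", "like", "rain"]));
// -> I like[Fi] rain
```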

I'll do this, but I need the rest of you to take on the task of actually going through the dictionary to find all the words to associate with the elements. I'm sure the Wikisocion people are up for that, am I right?

**Creepy-bg** · 03-19-2008, 10:57 AM

Originally Posted by tcaudilllg

Two points:
1) this is probably the most effective means of proving the existence of type to skeptics
2) the system itself has a lot of problems that need to be worked out

I'm putting this on Sourceforge, and I encourage anyone with programming experience to take a look at it and ask themselves what they can contribute to it. Additionally, I think we would all benefit from a discussion of the terms we associate with the information aspects.

Issues:
- syntax problems requiring the implementation of a better parsing algorithm
- sometimes words alone are insufficient to capture an element. A phrase parser would better capture these idioms. Phrase parsers require sophisticated algorithms and technology which may not now exist, to be honest. (at the very least we don't see it in Systrans translations of Russian articles: current translators lack any comprehension of context).
- not all words can be correlated to IM aspects. Some words describe relations between aspects. (the aspects and the relations between them being the sum total of all human cognition and linguistic processing).

I don't think I have any obligation to take on all of these responsibilities myself. I think we have a community obligation, with each of us doing our part, to create the most capable IM analysis program we can manage.

I know this doesn't really help but Systrans licensed translation software does phrases (so the technology does exist)

**emeye** · 03-19-2008, 11:36 AM

Originally Posted by Bionicgoat

I know this doesn't really help but Systrans licensed translation software does phrases (so the technology does exist)

Yes, but in the same manner Debbie does Dallas: without imagination and relatively poorly.

**Khola aka Bee** · 03-19-2008, 11:38 AM

Hey, don't knock Debbie does Dallas! My hubby says it was 'aaiiight!

**Creepy-bg** · 03-19-2008, 11:51 AM

Originally Posted by bee

Hey, don't knock Debbie does Dallas! My hubby says it was 'aaiiight!

I liked the 2000 version... the old one just doesn't do it for me

**Khola aka Bee** · 03-19-2008, 11:52 AM

I wouldn't know

**Creepy-bg** · 03-19-2008, 11:53 AM

Originally Posted by bee

I wouldn't know

people should watch more porn...

**Khola aka Bee** · 03-19-2008, 11:55 AM

This is why moving back into my parent's house was a bad idea. Damn I gotta get outta here!
