Lexical Processing
There are three steps in the text mining process:
- Inputs: unstructured data such as plain text or HTML
- Text Pre-Processing
- Model Building
1. Text Pre-Processing:
- Text encoding (ASCII, Unicode, etc.)
- Converting to lowercase
- Removing symbols and punctuation
- Handling numbers
- Stop-word removal
- Tokenization
- Stemming and lemmatisation
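The steps above (apart from encoding) can be sketched end-to-end in plain Python. This is only an illustration: the stop-word list is hand-picked, and the stemmer is a toy suffix-stripper rather than a real algorithm such as Porter's:

```python
import re

# Tiny hand-picked stop-word list, for illustration only
STOP_WORDS = {"is", "an", "the", "a", "of", "and", "to", "in"}

def toy_stem(token):
    # Toy suffix stripping; real pipelines use a proper stemmer or lemmatiser
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    text = text.lower()                                   # converting to lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)              # removing symbols and punctuation
    text = re.sub(r"\b\d+\b", " ", text)                  # handling numbers: drop pure digits
    tokens = text.split()                                 # whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop-word removal
    return [toy_stem(t) for t in tokens]                  # stemming

print(preprocess("The 2 dogs are running in the park!"))
# → ['dog', 'are', 'runn', 'park']
```

Note how crude stemming can produce non-words like "runn"; lemmatisation would instead map "running" to the dictionary form "run".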
1.1 Text Encoding:
Computers understand only numbers, stored in binary, so text must be encoded: each letter and symbol is mapped to a bit pattern. For example, US-ASCII assigns the decimal values 0-127 to characters.
Each character fits in a single byte (a group of 8 bits), but ASCII uses only 7 of them, so the values 128-255 are left unassigned.
For example, in Python you can get a character's code point with ord('x') == 120 and recover the character with chr(120) == 'x'. You can check an ASCII table for the full mapping.
A short script to print the whole ASCII table:
for i in range(128):
    ch = chr(i)
    print(str(i) + " == '" + ch + "'")
Common text encodings:
- ASCII: English letters and basic symbols --> 1 byte per character (7 bits used)
- UTF-8: all Unicode characters and symbols --> 1 to 4 bytes per character (variable width)
- UTF-16: all Unicode characters and symbols --> 2 or 4 bytes per character
# create a string
amount = u"$"
print('Default string:', amount, '\n', 'Type:', type(amount), '\n')
# encode the string into UTF-8 bytes
amount_encoded = amount.encode('utf-8')
print('Encoded to UTF-8:', amount_encoded, '\n', 'Type:', type(amount_encoded), '\n')
# decode the UTF-8 bytes back into a string
amount_decoded = amount_encoded.decode('utf-8')
print('Decoded from UTF-8:', amount_decoded, '\n', 'Type:', type(amount_decoded), '\n')
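Because UTF-8 is variable width, different characters encode to different numbers of bytes. A quick sketch, using the rupee sign '₹' (U+20B9) purely as an example of a multi-byte character:

```python
# '$' is in the ASCII range, so UTF-8 stores it in a single byte
print(len("$".encode("utf-8")))    # → 1

# '₹' (U+20B9) is outside the ASCII range and needs three bytes in UTF-8
print(len("\u20b9".encode("utf-8")))  # → 3

# Python's 'utf-16' codec prepends a 2-byte byte-order mark (BOM),
# so the single 2-byte character '$' comes out as 4 bytes in total
print(len("$".encode("utf-16")))   # → 4
```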
1.2 Converting text to Lowercase, Removing Symbols and punctuations & Handling numbers:
These three steps standardize the text so that every document enters the pipeline in a consistent form.
Handling numbers is trickier: depending on the task we may drop them, keep them, or keep them only when they appear inside alphanumeric tokens (e.g. a product code like "A42").
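The number-handling choices can be sketched as follows (the sample sentence and token names are purely illustrative):

```python
# Lowercase and strip the trailing period, then tokenize on whitespace
text = "Order 66 shipped 3 units of model A42 to Room 7B."
tokens = text.lower().replace(".", "").split()

# Option 1: drop every token that contains a digit
no_numbers = [t for t in tokens if not any(c.isdigit() for c in t)]

# Option 2: drop only pure numbers, keeping alphanumeric tokens like 'a42'
keep_alnum = [t for t in tokens if not t.isdigit()]

print(no_numbers)  # → ['order', 'shipped', 'units', 'of', 'model', 'to', 'room']
print(keep_alnum)  # → ['order', 'shipped', 'units', 'of', 'model', 'a42', 'to', 'room', '7b']
```

Option 2 is the "keep alphanumeric tokens" case mentioned above: "a42" and "7b" survive because they are not pure digits.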
1.5 Stop-Word Removal
There are three kinds of words:
- Highly frequent words called stop words, such as “is,” “an,” and “the”
- Significant words, which are typically more important than the other words to understand the text
- Rarely occurring words, which are less important than significant words
There are two reasons to remove stop words:
- They carry very little information about the content of the text.
- Their frequency is so high that removing them significantly reduces both the data size and the feature size.
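A minimal sketch of both points, using a small hand-picked stop list (real pipelines usually take the list from a library such as NLTK):

```python
from collections import Counter

# Hand-picked stop list for illustration; libraries ship much longer ones
stop_words = {"is", "an", "the", "a", "and", "of", "to", "it"}

text = "the cat sat on the mat and the dog watched it"
tokens = text.split()

# Stop words dominate the raw frequency counts ...
print(Counter(tokens).most_common(1))  # → [('the', 3)]

# ... so filtering them shrinks the data while keeping the significant words
filtered = [t for t in tokens if t not in stop_words]
print(filtered)  # → ['cat', 'sat', 'on', 'mat', 'dog', 'watched']
```

Here 5 of 11 tokens are removed, yet every word that carries meaning survives.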