Assignment

Instructions & Brief

Task A

We will analyse the top emoticons found in the messages of tweets, from the 'msgraw_sample.txt' data used in the tutorial. Note this should be done on a Linux machine or similar where bash is supported.

Task A.1

The first sub-task is to extract the top 20 emoticons and their counts from the tweets. This must not be done entirely manually, and it can only be done using a single shell script. So you need to write a single shell script 'tweet2emo.sh' that will input 'msgraw_sample.txt' from stdin and produce a CSV file 'potential_emoticon.csv' giving a list of candidate emoticons with their occurrence counts. The important word here is "candidate". Perhaps only 1 in 5 of your candidates are emoticons. Then you need to edit this by hand, deleting non-emoticons and less frequent ones, to get your final list, 'emoticon.csv'.

So for this task, you must submit:

(1) a single bash script, 'tweet2emo.sh' : this must output, one per line, a candidate emoticon and a count of occurrence, and cannot have any Python or R programmes embedded in it. More details on how to do this below.

(2) the candidate list of emoticons generated by the script, 'potential_emoticon.csv' : CSV file, TAB delimited file with (count, text-emoticon).

(3) the final list of emoticons selected, 'emoticon.csv' : CSV file, TAB delimited file with (count, text-emoticon); these should be the 20 most frequent emoticons from 'potential_emoticon.csv', but you will have to select yourself, manually by editing, which are actually emoticons. To do this, you may use an externally provided list of recognised emoticons, but this should not be used in step (2).

(4) a description for this task is included in your final PDF report describing the method used for the bash script, and then the method used to edit the file, to get the file for step (3).

Your bash scripts might take anywhere from 2 to 10 lines and might require storing intermediate files.

The following single line commands, which process a file from stdin and generate stdout should be useful for this task:

perl -p -e 's/\s+/\n/g;'

-- tokenise each line of text by converting space characters to newlines;

NOTE: this reportedly also works on Windows, where the newline character is different

perl -p -e 's/&gt;/>/g; s/&lt;/</g;'

-- convert embedded HTML escapes for '>' and '<'

-- you need to do this if you want to capture emoticons using the '<' or the '>' characters

sort | uniq -c | perl -p -e 's/^\s+//; s/ /\t/; '

-- assumes the input file has one item per line
-- sort and count the items and generates TAB delimited file with (count, item) entries

Specifically, in order to recognise potential emoticons, you will need to write suitable greps. Here are some examples:

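grep -e '\^_^'

-- match lines containing the string "^_^"; the leading "^" is escaped with a backslash because an unescaped "^" at the start of a pattern is an anchor, see next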

grep -e '^^_^'

-- match lines starting with the string "^_^", the initial "^", called an anchor, says match start of line

grep -e '\^_^$'

-- match lines ending with the string "^_^", the final "$", called an anchor, says match end of line; the leading "^" is escaped so that it is not read as a start anchor

grep -e '^^_^$'

-- match lines made exactly of the string "^_^", using beginning and ending anchors

grep -e '^0_0$'

-- match lines made exactly of the string "0_0"

grep -e '^^_^$' -e '^0_0$'

-- match lines made exactly of the string "^_^" or "0_0"; so two match strings are ORed

grep -e '^[.:^]$'

-- match lines made of exactly one of the characters in the set ".:^"

-- the construction "[ ... ]" means "characters in the set ...", but be warned that some characters used inside have strange effects, like "-", see next

grep -e '^[0-9ABC]$'

-- match lines made of exactly one digit ("0-9" means in the range "0" to "9") or one of the characters "ABC"

grep -e '^[-0-9ABC]$'

-- match lines made of exactly one dash "-", digit, or one of the characters "ABC"
-- we place "-" at the front to stop it meaning "range"
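
To see how anchors and character sets behave, the sketch below runs three such patterns over a handful of invented tokens, one per line. The token list is an assumption for illustration only, and the patterns follow grep's default basic regular expression syntax.

```shell
# Made-up tokens, one per line, standing in for the output of the tokenising step.
tokens=$(printf '%s\n' '^_^' '0_0' 'x^_^' '7' '-' 'AB')

# Lines containing "^_^" anywhere; the leading caret is escaped so it is not an anchor.
printf '%s\n' "$tokens" | grep -e '\^_^'

# Lines that are exactly "^_^" or exactly "0_0".
printf '%s\n' "$tokens" | grep -e '^^_^$' -e '^0_0$'

# Lines that are a single dash, a single digit, or one of A, B, C.
printf '%s\n' "$tokens" | grep -e '^[-0-9ABC]$'
```

The first grep keeps '^_^' and 'x^_^', the second keeps '^_^' and '0_0', and the third keeps '7' and '-'. A pattern that lets "7" through is probably too weak a filter for emoticons.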

For more detail on grep see:

But my advice is "keep it simple" and stick with the above constructs. Remember you get to edit the final results by hand anyway. But if your grep match strings say "7" is an emoticon, it probably isn't a strong enough filter.

Task A.2

We would like to compute word co-occurrence with emoticons. So suppose we have the tweet:
loved the results of the game ;-)
then this means that the emoticon ';-)' co-occurs once with each of the words in the list 'loved the results of the game'.

You can use the supplied Python program 'emoword.py', which takes a single emoticon as its argument, reads 'msgraw_sample.txt' from stdin and outputs a raw list of co-occurring tokens.

./emoword.py ':))' < msgraw_sample.txt

Note the emoticon is enclosed in single quotes because the punctuation can cause bash to do weird things otherwise.

You can also put this in a bash loop to run over your emoticon list like so:

for E in ';)' ':)' ; do
echo running this emoticon "$E"
done

or counting them too using

CNT=1
for E in ';)' ':)' ; do
echo running this emoticon "$E" > $CNT.out
CNT=$(( $CNT + 1)) # this is arithmetic in bash
done

But be warned, bash does strange things with punctuation ... it treats it differently as it plays a role in the language. So while you can have a loop doing this:

for E in ';)' ':)' ; do

where you have edited in your emoticons, and used the single quotes to tell bash the quoted text is a single token, if instead you try and be clever and read them from a file

for E in `cat emoticons.txt` ; do

then bash will split the file contents on whitespace and treat characters such as "*" and "?" as filename patterns, so it will probably fail to work in the way you want.
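
One common way around this, if you do want to keep the emoticons in a file, is a 'while read' loop, which hands each line to the loop body unchanged. This is a sketch under assumptions: the temporary file and its three entries are invented so that the example runs on its own.

```shell
# Hypothetical emoticon list, one per line; created here only so the sketch runs.
list=$(mktemp)
printf '%s\n' ';)' ':)' ':*' > "$list"

# IFS= keeps leading and trailing whitespace, read -r keeps backslashes literal,
# and quoting "$E" stops bash from word-splitting or glob-expanding the emoticon.
CNT=1
while IFS= read -r E; do
  echo "running emoticon $CNT: $E"
  CNT=$(( CNT + 1 ))
done < "$list"

rm -f "$list"
```

Because the loop reads through a redirection rather than a pipe, it runs in the current shell, so CNT keeps its final value after 'done'.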

For each emoticon in your list 'emoticon.csv', find a list of the 10-20 most commonly occurring interesting words. Report on these words in your final PDF report. Note that words like "the" and "in" are called stop words, see https://en.wikipedia.org/wiki/Stop_words, and are uninteresting, so try and exclude these from your report.
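
Stop-word filtering can be done with the same grep and sort tools. The sketch below is one possible approach, not a required method. The five-word stop list, the function name 'top_words' and the sample tokens are all invented for illustration, and a realistic stop list would be far longer.

```shell
# Tiny hypothetical stop-word list written as an extended regular expression.
STOP='the|of|in|a|and'

# Print the N most frequent non-stop-words from a one-token-per-line stream.
# Steps: lower-case everything, drop lines that are exactly a stop word,
# count the remaining tokens, rank by count, keep the first N lines.
top_words() {
  tr '[:upper:]' '[:lower:]' |
    grep -v -x -E -e "$STOP" |
    sort | uniq -c | sort -k1,1nr -k2,2 | head -n "$1"
}

# Made-up tokens standing in for the output of emoword.py.
printf '%s\n' loved the results of the game Game loved | top_words 2
```

The '-x' option makes grep compare whole lines, so a word such as "there" is not discarded just because it contains "the".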

So for this task, you must submit:

(1) a single bash script, 'emowords.sh' : as used to support your answers, perhaps calling 'emoword.py'; this should output for each of your 20 emoticons the most frequent words co-occurring with it (in tweets); use whatever format suits, as the results will be transferred and written up in your report.

(2) a description for this task is included in your final PDF report describing the method used for the bash script, and then the final list of selected interesting words per emoticon, and how you got them.

Task A.3

See if there is other interesting information you can get about these emoticons. For instance, is there anything about countries/cities and emoticons? Which emoticons have long or short messages? What sorts of messages are attached to different emoticons?

You can use the Python program 'emodata.py', which reads your 'emoticon.csv' file, takes 'msgraw_sample.txt' as stdin and outputs selected data from the tweet file.

./emodata.py < msgraw_sample.txt

Report on this in your final PDF report. Use any technique or coding you like to get this information. Your report should describe what you did and your results.

Task B

Consider the two files 'training.csv' and 'test.csv'.

Task B.1

Plot histograms of X1, X2, X3 and X4 in train.csv respectively and answer: which variable(s) is(are) most likely samples drawn from normal distributions?

Task B.2

Fit two linear regression models using train.csv.

Model 1: Y~X1+X2+X3+X4
Model 2: Y~X2+X3+X4
Which model has the higher Multiple R-squared value?

Task B.3

Now use the coefficients of Model 1 and Model 2 respectively to predict the Y values of test.csv, then calculate the Mean Squared Error (MSE) between each model's predictions and the true values. Which model has the smaller MSE? Which model is better? More complex models always have a higher R-squared on the training data, but are they always better?
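
For reference, the MSE is the average of the squared differences between true and predicted values. The sketch below computes it with awk on three invented pairs. The two-column layout (true value, then prediction), the function name 'mse' and the numbers are assumptions for illustration and are not taken from test.csv.

```shell
# Mean squared error over lines of the form "true_value predicted_value".
# mse is a hypothetical helper name used only in this sketch.
mse() {
  awk '{ d = $1 - $2; sum += d * d; n++ }
       END { if (n > 0) printf "%.4f\n", sum / n }'
}

# Made-up numbers: the differences are 1, 0 and -2, so MSE = (1 + 0 + 4) / 3.
printf '%s\n' '3 2' '5 5' '1 3' | mse
```

This prints 1.6667. The same calculation applies to each model's predictions in turn, and the two results can then be compared.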
