@xmruibi 2015-11-06T06:26:36.000000Z

Projects Script

Interview_Preparation

0 My Introduction:

I graduated this May from the University of Pittsburgh with a Master's degree in Information Science. I received my Bachelor's degree in Computer Science two years earlier. Over the last five years, I have accumulated a lot of computer science knowledge as well as a strong interest in coding, especially Java programming. I am someone who wants to broaden my knowledge and learn cutting-edge techniques. I also like to share what I have learned from my projects. If you're interested, you can visit my personal blog.

My earlier experience (undergraduate and the first year of graduate school) focused on web application development, using JavaScript for the frontend and Java for the backend with frameworks like Spring and Hibernate, and SQL databases like MySQL. I also have some experience with more advanced techniques like cloud computing and machine learning: I joined our school's research team working on an innovative search engine and took a course on cloud computing. I have practiced with cloud technologies such as Hadoop, AWS, and distributed databases. I really like doing practical projects out of my own interest to pursue new techniques. Most of my projects are uploaded to GitHub. If you're interested, you can check the link on the last line of my resume.

0 Why machine learning?

At the very beginning I had only heard the term in the media and knew it as a tool for making predictions from big data. Then, when I took the Data Mining course in my second graduate semester, I became interested in machine learning. Last summer, a friend who wanted to apply for a PhD in the big data field invited me to join his TREC conference project. That was my first project combining machine learning and information retrieval. After that, I joined a research project on a collaborative search engine at our school last semester, where I made significant progress in learning information retrieval. Most recently, I worked on the stock prediction project. I have found that today's information technology is almost always combined with machine learning models, from the basic to the complicated.

0 Why software engineering:

First of all, I really like learning technical things and anything in the engineering world. Beyond software engineering, I also like making models, like cars or battleships, and doing DIY work on mechanical stuff. I think all of this engineering work fills you with a sense of achievement.
I remember getting my first computer when I was 8 years old. That is when my interest in this area started. At first I liked hardware DIY, like upgrading the memory or the graphics card to build a customized computer. Then, after I chose computer science as my undergraduate major, I developed lots of web applications at first….

0 Why backend?

Actually, at first I preferred the front end, because it looks more interesting and more visible. You can see your work and show it to anybody, even people who don't know much about IT. As you can see, my earlier experience focused on this area for a while. But as I did more frontend work, I realized the backend is more important. It is like the heart of the software. Especially in this big data era, we need to deal with complex backend architectures and services, and also smarter algorithms, to handle such big data. During my graduate study, I tried some information retrieval work and some cloud computing techniques. After this experience, I think backend work requires more technical knowledge: not only programming, but a broad range of computer science. It's quite a challenge, but you can learn a lot. That is what I want in the future.

0 What’s your favorite language?

I think my favorite is Java. It's the first programming language I learned, and most of my experience relies on Java. Java is a mature, object-oriented, compiled language that can bear heavy workloads. As you can see, much of the big data stack relies on Java, like Hadoop. But Java also has its drawbacks: unlike C++, it cannot reach low-level things as directly. It is also heavyweight, so nowadays many web applications just use a JavaScript, Python, or PHP MVC framework. So I think I should learn more languages to meet more flexible requirements.

Internship Experience:

1. Intern at Yonyou;

My previous internship was at Yonyou Software. The company is the largest management software provider in East Asia, similar to SAP. During this internship, we built an ERP platform for a company in the tobacco industry. Yonyou had a very mature solution for developing such platforms. They had a prototype, but we needed to modify its modules to fit the target user's requirements. Our task was to build the module for searching merchandise in storage on the ERP platform. At first, they gave me a week of training to learn their development process and technical framework. Then I started my task: implementing merchandise search with different filter conditions on both the backend and the frontend, using Spring MVC, Hibernate, and JSP. Our team members were all interns. My supervisor had me lead this small team because of my performance during the training session.

Challenge:

That was my first internship and the first time I learned about enterprise-level development, so there were many firsts for me. I was only a junior at the time. It was quite a big challenge to fit into the environment of a big company, learn quickly, and catch up with their development speed. There are several ways to handle this, but my way is to communicate with the people around you, ask your advisor, and stay calm while learning.

Conflict:

Because our tasks focused on searching and retrieval, we ran into some problems with database indexing. Some people believed we needed to index more columns, but others disagreed: adding more indexes on table columns speeds up search, but it slows down updates, and some merchandise information is updated frequently. So we held a meeting to list all the trade-offs, compared all the possible situations, and evaluated which columns should be indexed.

Favorite Project Intro:

I'd like to talk about my research project on a Query Classifier. Basically, when you give it a query, it returns the most likely category or topic of that query. For example, if you give it "Apple", it will tell you that you're asking about the company rather than the fruit. This query classifier is a subproject of a collaborative search engine, which is a search engine platform for a team rather than an individual (as you know, most current search engines are personalized only for individuals). When a team is doing a search task, like planning a trip, one person focuses on searching for hotels or airline tickets while another focuses on studying the route between attractions. My classifier tries to find out what each person is focusing on. Then the collaborative search engine can give each person differently ranked results.

Why do you like this project?

Because this is a very practical project. I really like doing practical things and putting knowledge into practice. This project applied a lot of what I learned during my graduate study, like machine learning, data mining, and information retrieval. It was also quite an individual project: although it is a subproject, a supporting component of a bigger one, and I still talked with my professor and the teammates who were building other supporting components, most of the code and design work was done by myself. So it was a good experience for testing my own abilities.

What did you learn from this project?

The first thing is a lot of academic knowledge about search engines and some machine learning algorithms. This project is quite complicated, and you need to talk with your advisor and communicate with the teammates working on other subprojects. We had to coordinate the subprojects so that they ran well together and supported the entire collaborative search engine. I also learned about system design: although it is a subproject, I still needed to do some architecture design to reduce the time spent on querying.

Challenge

My challenge was designing the system architecture, because this system combines both indexing and retrieval. The indexing process involves many machine learning algorithms and statistical methods, which slowed the system down considerably. So when I had almost finished the system, I needed to restructure its architecture. I solved it with a user-profile and action-driven pattern. The data concerning the user profile is loaded when the user logs in, and indexing of the task and topic data starts when the user chooses a search task. So when the user starts to search, the indexed data is already loaded, and only the partial data that supports the current task is loaded rather than a lot of unnecessary data.
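The staged loading described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the project's actual code: `Session`, `load_profile`, and `build_task_index` are hypothetical names, and the two loader functions are stand-ins for the real profile store and the slow indexing pipeline.

```python
class Session:
    """Loads data in stages, driven by user actions (hypothetical sketch)."""

    def __init__(self, load_profile, build_task_index):
        self._load_profile = load_profile          # stand-in for the profile store
        self._build_task_index = build_task_index  # stand-in for the slow indexing step
        self.profile = None
        self.task_index = None

    def login(self, user_id):
        # Profile data is loaded once, at login time.
        self.profile = self._load_profile(user_id)

    def choose_task(self, task_id):
        # Indexing starts when the task is chosen, and only for that task.
        self.task_index = self._build_task_index(task_id)

    def search(self, query):
        # By the time the user searches, the index for the current task is ready.
        if self.task_index is None:
            raise RuntimeError("choose a task before searching")
        return self.task_index.get(query, [])


session = Session(
    load_profile=lambda user_id: {"user": user_id},
    build_task_index=lambda task_id: {"hotel": ["booking subtopic"]},
)
session.login("alice")
session.choose_task("travel")
print(session.search("hotel"))
```

The point of the design is that the expensive work is tied to the action that makes it necessary, so a search never pays for data that belongs to other tasks.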

Conflict

I think conflicts always exist in any project. Most conflicts are about comparing different technical solutions.
For backend problems, most situations involve arguing over different techniques or algorithms. I think it is good to present all the trade-offs of the different solutions: make a list, evaluate all the trade-offs with teammates, and listen to their opinions. I am not a very stubborn person. I like to learn from others' suggestions.
For frontend problems, I have one more method: implement your idea and show the rendered page, then compare it with the others. I think showing a real result is better than just describing what your page might look like.

1. Query Classifier:

Basically

A query classifier is something that, when you give it a query, returns the most likely category or topic of that query.

But

Things are a little bit different here, because this query classifier is a subproject of a collaborative search engine.

As for Collaborative Search Engine

That is a search engine platform for a team rather than an individual (as you know, most current search engines are personalized only for individuals). Here we set up experiments modeled on a real situation in which a team is addressing problems around a certain task. For example, we assume a team is planning a trip: one person focuses on booking hotels or airline tickets, and another focuses on studying the route between attractions. The collaborative search engine should then give each person differently ranked results.

Why do we need a query classifier?

We assume that while the collaborative search engine is in use, a team is working on a certain task, and there is a task statement written in natural language. In the task statement, the subtopics are also indicated by some sentences (in the initial stage of the experiment). My query classifier tries to figure out which subtopic under this task the input query belongs to.

Difficulty:

Procedure:

Step 0. The basic idea is to build a language model for each subtopic in a task and compute the relevance score between the query and each subtopic model.

Step 1. Set up a corpus for the task and each subtopic

Keyword extraction from the task statement
- We assume a keyword should be a noun, proper noun, or noun phrase from the task statement.
- We then use the Stanford NLP parser to get part-of-speech tags.
- Finding noun phrases is a little tricky. The NLP parser generates the tagged sentence as a tree, where every leaf is a word in the sentence. The phrase tag, however, sits on the parent or grandparent nodes of the leaves. So even when we find a word that is a noun or proper noun, we still need to check its upper-level nodes and consider their tags.
Build queries for Google from the keywords
- A subtopic can be represented by a combination of keywords.
- Those keywords can be formed into queries.
- Fetch the titles and snippets of the top 20 Google results as the subtopic model.
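The noun-phrase lookup in the parse tree can be illustrated with a small sketch. It is written in Python for brevity and makes several assumptions: the parse tree is represented as nested tuples `(label, child, ...)` with plain strings as leaves, which is a hypothetical stand-in for the parser's own tree objects, and the function names are made up for this example. The tags are the Penn Treebank tags (`NN`, `NNS`, `NNP`, `NNPS`, `NP`).

```python
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}  # Penn Treebank noun tags


def leaves(tree):
    """Return the words under a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def contains_label(tree, label):
    """True if the node or any of its descendants carries the given label."""
    if isinstance(tree, str):
        return False
    return tree[0] == label or any(contains_label(c, label) for c in tree[1:])


def extract_keywords(tree, keywords=None):
    """Collect lowest-level noun phrases, plus nouns that are not inside one."""
    if keywords is None:
        keywords = []
    if isinstance(tree, str):
        return keywords
    label, children = tree[0], tree[1:]
    has_nested_np = any(contains_label(c, "NP") for c in children)
    if label == "NP" and not has_nested_np:
        # The phrase tag lives on the parent: keep only its noun words.
        nouns = [leaves(c)[0] for c in children
                 if not isinstance(c, str) and c[0] in NOUN_TAGS]
        if nouns:
            keywords.append(" ".join(nouns))
        return keywords
    if label in NOUN_TAGS:
        keywords.append(" ".join(leaves(tree)))
        return keywords
    for child in children:
        extract_keywords(child, keywords)
    return keywords


# "We plan a road trip to New York"
tree = ("S",
        ("NP", ("PRP", "We")),
        ("VP", ("VBP", "plan"),
         ("NP", ("DT", "a"), ("NN", "road"), ("NN", "trip")),
         ("PP", ("TO", "to"), ("NP", ("NNP", "New"), ("NNP", "York")))))
keywords = extract_keywords(tree)
print(keywords)
```

Grouping by the enclosing `NP` node is what keeps "road trip" and "New York" together instead of yielding four separate nouns.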

Step 2. Evaluate Query and Subtopic Relevance

I consider both a classical statistical model and a language model, and take the relevance score from both parts with different weights on the two models (0.8 and 0.2). The language model is known to perform better, because it relies more on the rules of real language than on purely mathematical statistics, and the two differ in the details of how statistics like term frequency and document length are used.

Query Likelihood Model: Dirichlet Smoothing Algorithm
- Considers both term frequency and collection frequency
- Uses a reference model (the collection language model) to give unseen words a non-zero probability.
  $P(w|D) = \frac{c(w,M_D)+\mu\cdot P(w|M_C)}{|D|+\mu}$
  $|D|$ is the length of the current document
  $c(w,M_D)$ is the term frequency in the document
Vector Space Model: using TF-IDF scores
- Build vectors for the subtopic model and the query and score their similarity
- The query is first expanded: the top 5 snippets from the Google results are used as the expansion
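A rough sketch of how the two scores could be computed and combined is shown below. Several parts are assumptions rather than the project's method: the source gives the weights 0.8 and 0.2 but does not say which model gets which, so the larger weight is assigned to the language model here; the log-likelihoods are turned into a distribution over subtopics so that they are on a scale comparable to the cosine score; the IDF is a smoothed variant; and `mu=2000` and all function names are illustrative. Query words that never occur in the collection are skipped.

```python
import math
from collections import Counter


def dirichlet_log_likelihood(query, doc, collection, mu=2000.0):
    """log P(query | doc) with Dirichlet smoothing against the collection model."""
    doc_tf, coll_tf = Counter(doc), Counter(collection)
    score = 0.0
    for w in query:
        p_coll = coll_tf[w] / len(collection)             # P(w | M_C)
        if p_coll == 0.0:
            continue                                      # unseen everywhere: skip
        score += math.log((doc_tf[w] + mu * p_coll) / (len(doc) + mu))
    return score


def tfidf_cosine(query, doc, docs):
    """Cosine similarity between TF-IDF vectors (smoothed IDF, an assumption)."""
    n = len(docs)

    def vector(tokens):
        return {w: c * (math.log((1 + n) / (1 + sum(w in d for d in docs))) + 1)
                for w, c in Counter(tokens).items()}

    q, d = vector(query), vector(doc)
    dot = sum(q[w] * d[w] for w in q if w in d)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0


def classify(query, subtopics, w_lm=0.8, w_vsm=0.2, mu=2000.0):
    """Return (best subtopic, scores); `subtopics` maps a name to its token list."""
    docs = list(subtopics.values())
    collection = [w for doc in docs for w in doc]
    log_lm = {name: dirichlet_log_likelihood(query, doc, collection, mu)
              for name, doc in subtopics.items()}
    top = max(log_lm.values())
    weights = {name: math.exp(s - top) for name, s in log_lm.items()}
    total = sum(weights.values())
    scores = {name: w_lm * weights[name] / total
              + w_vsm * tfidf_cosine(query, subtopics[name], docs)
              for name in subtopics}
    return max(scores, key=scores.get), scores


subtopics = {
    "hotel": "cheap hotel booking room hotel price".split(),
    "route": "route map attractions museum walking route".split(),
}
best, scores = classify("hotel price".split(), subtopics)
print(best)
```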

2. TREC Conference:

This was a project for the Text REtrieval Conference (TREC). Basically, we were assigned a task: build a retrieval model that retrieves the tweets most related to a given query. The organizers provided 260 million tweets and fifty-five topics. Through the official conference tweets API, we could get 10 thousand tweets with a basic ranking for each topic. Our main goal in this project was to produce the best ranking of 1000 results for each topic used as a query.

The entire project can be separated into 3 parts: query expansion, document expansion, and the retrieval model.

Query Expansion:
We use the WordNet API to get synonyms for the topic description, then combine the query with these synonyms and search on Google. We take the titles of the top 10 Google results, use TF-IDF to extract 10 keywords from them, and add these 10 words to the original query words to form the new query for our final retrieval model.
Procedure: original query -> synonyms -> Google result titles -> TF-IDF keyword extraction -> new query
We didn't use the synonyms directly because some synonyms may make the query more ambiguous and raise perplexity. Google acts as a kind of filter that makes the meaning of the query terms more concentrated.
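The last two steps of the procedure (TF-IDF keyword extraction and building the new query) might look like the sketch below. It assumes the result titles have already been fetched and does not call WordNet or Google. The source does not say which collection the IDF was computed over, so the document frequencies here come from a hypothetical background table (`background_df`, `n_background`); the function name and sample data are also made up.

```python
import math
from collections import Counter


def expand_query(query_terms, titles, background_df, n_background, k=10):
    """Append the k highest TF-IDF words from the result titles to the query."""
    tf = Counter(w for title in titles for w in title.lower().split())

    def tfidf(word):
        # IDF from a background collection (assumption, see the lead-in).
        idf = math.log((1 + n_background) / (1 + background_df.get(word, 0)))
        return tf[word] * idf

    ranked = sorted(tf, key=lambda w: (-tfidf(w), w))
    new_terms = [w for w in ranked if w not in query_terms][:k]
    return list(query_terms) + new_terms


titles = ["Apple iPhone review", "Apple iPhone price", "the Apple store"]
background_df = {"the": 1000, "apple": 50, "iphone": 10,
                 "review": 200, "price": 300, "store": 250}
new_query = expand_query(["apple"], titles, background_df, n_background=1000, k=3)
print(new_query)
```

Very common words such as "the" get an IDF near zero and drop out, which is the filtering effect the expansion relies on.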
Document Expansion:
Because tweets are too short, we need document expansion. For each tweet, we use the VSM to get the top 10 most similar tweets and combine these eleven tweets into a new document.
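As a small illustration of document expansion, the sketch below compares one tweet against all others with cosine similarity and concatenates it with its nearest neighbours. Raw term counts are used as vector weights, which is an assumption because the weighting is not stated; the function names are hypothetical, and the pairwise comparison is only meant for a toy list, not for a collection of hundreds of millions of tweets.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def expand_tweet(index, tweets, k=10):
    """Concatenate a tweet with its k most similar tweets into one document."""
    vectors = [Counter(t.lower().split()) for t in tweets]
    others = [i for i in range(len(tweets)) if i != index]
    others.sort(key=lambda i: cosine(vectors[index], vectors[i]), reverse=True)
    return " ".join([tweets[index]] + [tweets[i] for i in others[:k]])


tweets = [
    "storm hits city",
    "city storm warning",
    "new phone released",
    "storm damage city center",
]
expanded = expand_tweet(0, tweets, k=2)
print(expanded)
```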
Retrieval Model:
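Why? The first reason is that we did query expansion, so the number of terms in each query is now fairly large and we need to weight the terms in the query.
BM25: The BM25 score, $score(q|D)$, is the sum of the scores of each term in the query: $score(q|D) = \sum_i W_i \cdot R(q_i, D)$. For each term, we consider two aspects: the weight of the term and the relevance between the term and the document.
- Part One (Term Weight):
  In the standard form of BM25, the weight is the inverse document frequency, $W_i = \log\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$, where $N$ is the number of documents in the collection and $n(q_i)$ is the number of documents containing $q_i$. So the more documents contain a term, the lower its weight.
- Part Two (Term Relevance):
  In the standard form, $R(q_i, D) = \frac{f_i \cdot (k_1+1)}{f_i + K} \cdot \frac{qf_i \cdot (k_2+1)}{qf_i + k_2}$ with $K = k_1 \cdot (1 - b + b \cdot \frac{d_l}{avgdl})$.
  Here $k_1$, $k_2$, and $b$ are tuning factors (regulatory factors), usually set empirically, typically $k_1=2$, $b=0.75$; $f_i$ is the frequency of $q_i$ in document $D$, and $qf_i$ is the frequency of $q_i$ in the query. $d_l$ is the length of document $D$, and $avgdl$ is the average length of all documents. In the vast majority of cases $q_i$ appears only once in the query, i.e. $qf_i=1$, so the formula simplifies to: $R(q_i, D) = \frac{f_i \cdot (k_1+1)}{f_i + K}$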
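The scoring can be sketched in a few lines. This is an illustration of the standard simplified BM25 (each query term counted once, so $qf_i = 1$), using the typical values $k_1=2$ and $b=0.75$ mentioned above; the function name and the toy documents are made up. Note that with this form of the term weight, a term that occurs in more than half of the documents gets a negative weight.

```python
import math


def bm25_score(query, doc, docs, k1=2.0, b=0.75):
    """Simplified BM25: sum over query terms of weight * relevance."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    big_k = k1 * (1 - b + b * len(doc) / avgdl)       # length normalisation
    score = 0.0
    for term in set(query):                           # each term counted once
        n_q = sum(term in d for d in docs)            # documents containing term
        weight = math.log((n_docs - n_q + 0.5) / (n_q + 0.5))
        f = doc.count(term)                           # frequency in this document
        score += weight * f * (k1 + 1) / (f + big_k)
    return score


docs = [
    ["storm", "city"],
    ["storm", "warning", "city"],
    ["phone"],
    ["music"],
    ["news", "today"],
]
ranked = sorted(docs, key=lambda d: bm25_score(["phone"], d, docs), reverse=True)
print(ranked[0])
```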
Evaluation:

3. Stock Prediction:

Description

We used the Yahoo Finance API to retrieve 10 years of historical daily data for eight thousand stocks on the US market. The data processing is based on Hadoop MapReduce, and the storage is a distributed MongoDB. We used the Mahout API for data analysis. So far we have not obtained very precise predictions; the goal was to learn how to handle large-scale data analysis on a distributed system, especially in the Hadoop ecosystem. We are still continuing this project out of our own interest, and there is a lot of future work in two directions: applying and comparing more models, and extracting keywords from finance news to add as factors in our model. In this project, my work mainly focused on the cloud platform configuration and

The focus is not on machine learning but on learning to handle large, scalable datasets on a distributed system.

My tasks are:

Set up the MongoDB and Hadoop clusters on 3 remote VMs;
Design the virtual servers for MongoDB sharding on 15 ports across the different VMs:
- mongos, config servers, shard main sets, replica sets, and arbiters
Implemented the MapReduce stock information crawler;
- First, we have a list of the stock symbols on the US market
- Each mapper is assigned to read the symbol list
- Each reducer is assigned to get the symbols' data from the Yahoo API
- Once the historical data is fetched from the API, we convert it into a serializable object and design a hash function to set the hash code used as the shard key.
Implemented the Hadoop and MongoDB connector;
- Need to consider the formats MongoDB accepts, like BSON and MongoDBUpdateWritable
Mahout API:
- At first, we treat the prediction problem as a binary classification problem:
  - Use the historical daily quote data (from 10 years ago to the present) as the training data
  - The idea is very simple: use the previous day's data as the input X
  - The target value Y is (Close minus Open) of the next day, which indicates whether tomorrow's stock price goes down or up.
  - It is a very practical strategy. After the stock market closes, we have all the data up to today. Then we can use the prediction to decide whether to buy the stock, or to sell it if we already hold it.
  - Predictor selection: each record has several attributes. After consideration and comparison, we chose Open, Close, Daily Low, Daily High, and the Close / Open ratio as the predictors. Open, Close, Low, and High are already in the data, so we just need to calculate Close / Open in the pre-processing step.
  - For this classification problem there are several models to choose from, like Logistic Regression, Random Forest, and Naive Bayes.
  - In the end we only chose the most basic model, logistic regression. (Close to the project deadline we had an incident: our remote virtual machines were hit by a network flood and the provider shut all of them down. We applied for 3 new VMs afterwards, but all of our previous work had to be redone.) However, the Mahout API did not support this model on multiple machines; it allows Random Forest and Naive Bayes on multiple machines.
  - Format problem: our data is in JSON format, which Mahout does not support, so we need to convert the data from JSON to txt or CSV. (Open question: why CSV?)
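The pre-processing for the binary classifier (previous day's values as X, the sign of the next day's Close minus Open as Y, and the JSON-to-CSV conversion) can be sketched as follows. This is a plain single-machine Python illustration, not the MapReduce job or a Mahout call; the JSON field names, the column names, and the choice to label a flat day as 0 are assumptions.

```python
import csv
import io
import json


def build_training_rows(quotes):
    """Pair each day's predictors with the next day's up/down label."""
    rows = []
    for today, tomorrow in zip(quotes, quotes[1:]):
        features = [
            today["open"], today["close"], today["low"], today["high"],
            today["close"] / today["open"],           # Close / Open ratio
        ]
        # Label: 1 if the next day closes above its open, else 0 (ties count as 0).
        label = 1 if tomorrow["close"] - tomorrow["open"] > 0 else 0
        rows.append(features + [label])
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["open", "close", "low", "high", "close_open_ratio", "up_next_day"])
    writer.writerows(rows)
    return buffer.getvalue()


json_lines = [
    '{"open": 10.0, "close": 11.0, "low": 9.5, "high": 11.5}',
    '{"open": 11.0, "close": 10.5, "low": 10.0, "high": 11.2}',
    '{"open": 10.5, "close": 12.0, "low": 10.4, "high": 12.1}',
]
quotes = [json.loads(line) for line in json_lines]  # oldest day first
rows = build_training_rows(quotes)
print(to_csv(rows))
```

The last day in the series produces no row, because its next-day label is not known yet.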

4. CarFinder

Description:

First, we crawled data from Cars.com. We then use a collaborative filtering algorithm to recommend second-hand cars to users.

My tasks are:

Build the front end with Bootstrap
Design the recommendation algorithm
Tweet sentiment analysis

Collaborative Filtering

Item-based CF, which is based on the history data of the target user
- Use cookies to record the user's browsing history.
- Recommend related cars according to that history.
- Calculate the similarity between two cars using their attributes
  - The Vector Space Model score comes from two parts:
    - Car specifications index (we weight it at 75 percent);
    - Car description (weighted less, because descriptions come from different authors and have different lengths, which can make the comparison imprecise);
User-based CF, which is based on the known data of similar users
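The item-based part can be sketched as a weighted sum of two cosine similarities, 75 percent for specifications and the remaining 25 percent for the description. The record layout (`id`, `specs`, `description`), the use of raw term counts, and ranking each candidate by its best match against the browsing history are assumptions made for this Python illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def car_similarity(car_a, car_b, spec_weight=0.75):
    """Weighted similarity: specifications count more than free-text description."""
    spec_sim = cosine(Counter(car_a["specs"]), Counter(car_b["specs"]))
    desc_sim = cosine(Counter(car_a["description"].lower().split()),
                      Counter(car_b["description"].lower().split()))
    return spec_weight * spec_sim + (1 - spec_weight) * desc_sim


def recommend(history, catalog, k=3):
    """Rank catalog cars by their best similarity to any car in the history."""
    if not history:
        return []
    scored = [(max(car_similarity(car, seen) for seen in history), car["id"])
              for car in catalog]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [car_id for _, car_id in scored[:k]]


seen = {"id": "a", "specs": ["make=honda", "body=sedan", "year=2012"],
        "description": "clean sedan one owner"}
catalog = [
    {"id": "b", "specs": ["make=ford", "body=truck", "year=2009"],
     "description": "work truck"},
    {"id": "c", "specs": ["make=honda", "body=sedan", "year=2013"],
     "description": "one owner sedan low miles"},
]
print(recommend([seen], catalog, k=1))
```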

Rating Cars by Sentiment Analysis on Tweets

We get all the recent tweets related to a car model and use an NLP tool to judge whether each tweet has negative or positive sentiment. Then we calculate the percentage of positive tweets. Finally, we derive a rating between zero and five from that percentage.
The rating is shown with each item when the recommended list is returned on the user's page.
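The mapping from sentiment labels to a rating can be illustrated as below. The sketch assumes the labels have already been produced by some sentiment tool, and it uses a simple linear mapping from the positive share to the zero-to-five scale, which is an assumption since the exact mapping is not stated. The function name is hypothetical.

```python
def sentiment_rating(labels, max_rating=5.0):
    """Map the share of positive tweets to a rating between 0 and max_rating."""
    labels = list(labels)
    if not labels:
        return None  # no tweets, so no rating can be derived
    positive_share = sum(label == "positive" for label in labels) / len(labels)
    return round(positive_share * max_rating, 1)


rating = sentiment_rating(["positive", "positive", "negative", "positive", "positive"])
print(rating)
```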

5. Web Service

Description

Basically, it is a project that applies Service Oriented Architecture to an original course registration system. The idea is to separate each module into a JAX-WS (Java API for XML Web Services) service and to use Servlets on the client side to call these services on the server side. The client and server communicate using SOAP messages.

Summary of the Web Service Final Project by Rui Bi

Description of my work:

My work focused on the permission part (several services and client-side servlets):
The permission request services follow this procedure:
a) Ask for permission;
b) Put the permission info into the faculty queue;
c) Send an email to the faculty member with the Servlet URL;
d) The faculty member checks the permission;
e) Put the accept or reject result into the student queue (once delivered, it triggers the Listener Bean);
f) Send an email to the student as a reminder of the permission status update;
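The flow of steps a) to f) can be traced with a small in-memory sketch. It is written in Python with `queue.Queue` objects standing in for the two message queues and a list standing in for the mail service, so it only illustrates the order of events; it is not JMS code, and the function names, message fields, and status strings are hypothetical.

```python
from queue import Queue

faculty_queue = Queue()  # stand-in for the faculty message queue
student_queue = Queue()  # stand-in for the student message queue
sent_mail = []           # stand-in for the mail service


def request_permission(student, course):
    # a) + b) the request is placed on the faculty queue
    faculty_queue.put({"student": student, "course": course, "status": "PENDING"})
    # c) the faculty member is notified by email
    sent_mail.append(("faculty", f"Permission request from {student} for {course}"))


def on_student_message(message):
    # f) the student is notified about the status update
    sent_mail.append((message["student"],
                      f"Permission for {message['course']}: {message['status']}"))


def review_next_request(accept):
    # d) the faculty member checks the oldest pending request
    request = faculty_queue.get()
    request["status"] = "ACCEPTED" if accept else "REJECTED"
    # e) the result goes to the student queue; delivery triggers the listener
    student_queue.put(request)
    on_student_message(student_queue.get())
    return request


request_permission("alice", "CS101")
result = review_next_request(accept=True)
print(result["status"])
```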

3 JMS Design and Construction:
Queue
Two JMS message queues for permission module
Two JMS Destination Resources: JNDI Name: jms/permsStu, jms/permsFal
One JMS Connection Factory:
JNDI Name: jms/permsPool
Resource Type: javax.jms.ConnectionFactory

Object Message: the serialized XML object

Created some utility services for the entire team to use:
JMS Services Message Producer for setting message in queue. (msgUtil package)
JMS Services Message Receiver for getting message from queue. (msgUtil package)
JMS Message-Driven Bean for listening message arrived in queue. (msgListener package)
Java Mail Service as a unified interface to the mail service. (mail package)
Constants for distinguishing different statuses. (constant package)

Major challenges in my work:
1. JMS Listener setup: if you want to use a JMS MDB listener, the JMS resources shouldn't be created in the NetBeans IDE; they must be created in the GlassFish console with a physical name for each destination resource, otherwise the MDB cannot tell the different queues apart.
2. Mail service: an email containing a complete Servlet URL (like localhost:8080/xxxxx/xxxxServlet?permid=1), which faculty use to check a student's permission info, can be rejected by some mail servers like Gmail and Hotmail.
3. Serializable objects: several additional schemas were added to generate the objects that store permission information as object messages in the queue.
4. Status code design: these codes are designed for checking message queue status and mail service status.

Refactoring class names and reorganizing all the packages in the project: classes or services created by other teammates may have unclear names and untidy package names, so refactoring and reordering them is tough.
