@xmruibi 2015-11-06T06:26:36.000000Z

Projects Script

Interview_Preparation


0 My Introduction:

I graduated this May from the University of Pittsburgh with a Master’s degree in Information Science. I earned my Bachelor’s degree in Computer Science two years ago. Over the last five years, I’ve accumulated a lot of computer science knowledge as well as a strong interest in coding, especially Java programming. I’m someone who wants to broaden my knowledge and learn cutting-edge techniques. I also like to share what I have learned from my projects. If you’re interested, you can visit my personal blog.

My earlier experience (undergraduate and the first year of graduate school) focused on web application development, using JavaScript for the frontend and Java for the backend with frameworks like Spring and Hibernate and SQL databases like MySQL. I also have some experience with more advanced techniques like cloud computing and machine learning: I joined our school’s research team working on an innovative search engine and took a course on cloud computing. I’ve practiced quite a few cloud techniques, like Hadoop, AWS, and distributed databases. I really like doing practical projects out of my own interest and to pursue new techniques. Most of my projects are uploaded to GitHub. If you’re interested, you can check the link on the last line of my resume.

0 Why machine learning?

At the very beginning I had only heard the term from the media and knew it as a tool for predicting things with big data. Then, when I took the Data Mining course during my second graduate semester, I became interested in machine learning. Last summer, a friend who wanted to apply for a PhD in the big data field invited me to join his TREC conference project. That was my first project combining machine learning and information retrieval. After that, I joined a research project on a collaborative search engine at our school last semester, where I made significant progress in learning information retrieval. And recently, I did the stock prediction project. I’ve found that information technology today is almost always combined with machine learning models, from basic to complicated ones.

0 Why software engineering:

First of all, I really like learning technical things and anything in the engineering world. It’s not only software engineering: I also like making crafts, like model cars or battleships, and doing DIY work on mechanical stuff. I think all of this engineering work gives you a sense of achievement.
I remember getting my first computer when I was 8 years old. That is when my interest in this area started. At first I liked doing hardware DIY, like upgrading the memory or graphics card to build my own customized computer. Then, after I chose computer science as my undergraduate major, I developed lots of web applications at first….

0 Why backend?

Actually, at first I preferred front-end work, because it looks more interesting and is more visible. You can see your work and show it to anyone, even people who don’t know much about IT. As you can see, my earlier experience focused on this area for a while. But the more front-end work I did, the more I realized that the backend is more important. It is like the heart of the software. Especially in this big data era, we need to deal with complex backend architectures and services and also smarter algorithms to handle such large amounts of data. During my graduate study, I tried some information retrieval work and some cloud computing techniques. After this experience, I think backend work requires more technical knowledge: not only programming, but a broad range of computer science. It’s quite a challenge, but you can learn a lot. That is what I want in the future.

0 What’s your favorite language?

I think my favorite is Java. It’s the first programming language I learned, and most of my experience relies on it. Java is a mature, object-oriented, compiled language, and it can bear heavy workloads. As you can see, much of the big data stack relies on Java, like Hadoop. But Java also has its drawbacks: unlike C++, it can’t reach low-level things, and it’s heavyweight enough that many web applications now just use a JavaScript, Python, or PHP MVC framework. So I think I should learn more languages to meet more flexible requirements.


Internship Experience:

1. Intern at Yonyou

My previous internship was at Yonyou Software. The company is the largest management software provider in East Asia, similar to what SAP does. During this internship, we built an ERP platform for a company in the tobacco industry. Yonyou had a very mature solution for developing such platforms: they had a prototype, but we needed to modify its modules to fit the target user’s requirements. Our task was to build the module for searching merchandise in storage on the ERP platform. At first, they gave me a week of training to learn their development process and technical framework. Then I started implementing the merchandise search function with different filter conditions on both the backend and the frontend, using Spring MVC, Hibernate, and JSP. Our team members were all interns, and my supervisor asked me to lead this small team because of my performance during the training session.

Challenge:

That was my first internship and the first time I learned about enterprise-level development, so there were many firsts for me. I was only a junior at the time. It was quite a big challenge to fit into the environment of a big company, learn quickly, and catch up with their development speed. There are several ways to handle that, but my way is to communicate with the people around me, ask my advisor, and stay calm while learning.

Conflict:

Because our tasks focused on search and retrieval, we ran into some problems with database indexing. Some people believed we needed to index more columns, but others disagreed: adding an index on a column speeds up search and retrieval, but it slows down updates, and some merchandise information is updated frequently. So we held a meeting to list all the trade-offs, compared the possible situations, and evaluated which columns should be indexed.
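If I need to make that trade-off concrete in an interview, a rough back-of-the-envelope model helps. This is only a sketch: the class name, the workload numbers, and the per-operation costs are all hypothetical and were not measured in the Yonyou project.

```java
// Hypothetical sketch: estimate whether indexing a column pays off.
// All numbers used with it are illustrative assumptions, not project data.
class IndexTradeOff {
    // Net benefit = time saved on reads minus extra index maintenance on writes.
    static double netBenefit(long readsPerHour, long writesPerHour,
                             double msSavedPerRead, double msAddedPerWrite) {
        return readsPerHour * msSavedPerRead - writesPerHour * msAddedPerWrite;
    }

    // Index the column only when the reads gain more than the writes lose.
    static boolean shouldIndex(long readsPerHour, long writesPerHour,
                               double msSavedPerRead, double msAddedPerWrite) {
        return netBenefit(readsPerHour, writesPerHour,
                          msSavedPerRead, msAddedPerWrite) > 0;
    }

    public static void main(String[] args) {
        // A column that is searched often and rarely updated.
        System.out.println(shouldIndex(5000, 50, 2.0, 0.5));
        // A column that is updated constantly and rarely filtered on.
        System.out.println(shouldIndex(100, 20000, 2.0, 0.5));
    }
}
```

In a real system the costs would come from query plans and profiling rather than fixed constants, but the structure of the argument is the same.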

Favorite Project Intro:

I’d like to talk about my research project, the Query Classifier. Basically, when you give it a query, it returns the most likely category or topic of that query. For example, if you give it Apple, it will tell you that you’re asking about the company instead of the fruit. This query classifier is a subproject of a collaborative search engine, which is a search platform for a team rather than an individual (as you know, most current search engines are personalized for individuals). When a team is doing a search task, like planning a trip, one person focuses on searching for hotels or airline tickets while another focuses on studying the route between attractions. My classifier tries to find out what each person is focusing on, so that the collaborative search engine can give each person differently ranked results.

Why do you like this project?

Because this is a very practical project. I really like doing practical things and putting knowledge into practice. This project applied a lot of what I learned during my graduate study, like machine learning, data mining, and information retrieval. It was also quite an individual project: although it is a subproject, a support component for a bigger one, and I still talked with my professor and the teammates who were building other support components, most of the code and design work was done by myself. So it was a good test of my own abilities.

What did you learn from this project?

The first thing is a lot of academic knowledge about search engines and some machine learning algorithms. This project is quite complicated, and you need to talk with your advisor and communicate with the teammates who are doing other subprojects. We had to coordinate the subprojects so that each one runs well and supports the entire collaborative search engine. I also learned about system design: although it is a subproject, I still needed to do some architecture design to reduce the time spent during querying.

Challenge

My challenge was designing the system architecture, because this system combines both indexing and retrieval. The indexing process involves many machine learning algorithms and statistical methods, which slowed the system down considerably. So when I had almost finished the system, I needed to restructure its architecture. I solved the problem with a user-profile and action-driven pattern: the data related to the user profile is loaded when the user logs in, and the indexing of task and topic data starts when the user chooses a search task. So by the time the user starts to search, the indexed data is already loaded, and only the part of the data that supports the current task is loaded, instead of a lot of unnecessary data.
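A minimal sketch of that action-driven loading idea, assuming a simple in-memory session object. The class, method names, and the toy index contents are hypothetical stand-ins, not the project’s actual code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each user action triggers only the loading it needs.
class SearchSession {
    private final Map<String, String> profile = new HashMap<>();
    private final Map<String, List<String>> taskIndex = new HashMap<>();
    final List<String> loadLog = new ArrayList<>(); // records what was loaded, and when

    // On login, load only the user profile.
    void onLogin(String userId) {
        profile.put("userId", userId);
        loadLog.add("profile:" + userId);
    }

    // On task selection, build the index for that task only (once per task).
    void onTaskSelected(String taskId) {
        taskIndex.computeIfAbsent(taskId, id -> {
            loadLog.add("index:" + id);
            return buildIndex(id);
        });
    }

    // At query time the index is already in memory, so no indexing cost is paid here.
    List<String> search(String taskId, String query) {
        List<String> index = taskIndex.get(taskId);
        if (index == null) {
            throw new IllegalStateException("task not selected: " + taskId);
        }
        List<String> hits = new ArrayList<>();
        for (String entry : index) {
            if (entry.contains(query)) {
                hits.add(entry);
            }
        }
        return hits;
    }

    // Stand-in for the expensive indexing step (ML and statistics in the real system).
    private List<String> buildIndex(String taskId) {
        List<String> entries = new ArrayList<>();
        entries.add(taskId + "/hotel");
        entries.add(taskId + "/flight");
        return entries;
    }
}
```

The point of the design is that the slow work happens before the first query and covers only the selected task, so selecting the same task twice does not rebuild anything.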

Conflict

I think conflicts always exist in any project. Most conflicts are about comparing different technical solutions.
For backend problems, most of the time the argument is about different techniques or algorithms. I think it is good to present all the trade-offs of the different solutions: make a list, evaluate the trade-offs with teammates, and listen to their opinions. I am not a very stubborn person, and I like to learn from other people’s suggestions.
For front-end problems, I have one more method: implement your idea and show the rendered page, then compare it with the others. I think showing the real result is better than just describing what your page might look like.


1. Query Classifier:

Basically, a query classifier is something that, when you give it a query, returns the most likely category or topic of that query.

But things are a little bit different here, because this query classifier is a subproject of a collaborative search engine.

As for Collaborative Search Engine

This is a search engine platform for a team rather than an individual (as you know, most current search engines are personalized for individuals). Here we set up experiments modeled on real situations in which a team addresses problems around a certain task. For example, assume a team is planning a trip: one person focuses on booking hotels or airline tickets while another focuses on studying the route between attractions. The collaborative search engine should then give each person differently ranked results.

Why do we need a query classifier?

We assume that while the collaborative search engine is in use, a team is working on a certain task, and there is a task statement written in natural language. In the task statement, the subtopics are also indicated by some sentences (at the initial stage of the experiment). So my query classifier tries to figure out which subtopic under this task the input query belongs to.

Difficulty:

Procedure:

Step 0. The basic idea is to build a language model for each subtopic in a task and compute the relevance score between the query and each subtopic model.

Step 1. Set up the corpus for the task and for each subtopic

Step 2. Evaluate Query and Subtopic Relevance

I consider both a classical statistical model and a language model and take a relevance score from each, with different weights on the two models (0.8 and 0.2). As we know, the language model has better performance, because it relies more on real language rules than on mathematical statistical rules, and the details of how statistics like term frequency and document length are used differ.
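The scoring in Steps 0 to 2 can be sketched roughly as follows. This is an illustration under stated assumptions, not the project’s implementation: the language model here is a unigram query-likelihood model with Jelinek-Mercer smoothing (my choice for the sketch), the background probability uses add-one smoothing, and I assume the 0.8 weight goes to the language model and 0.2 to the statistical model, with both scores already normalized to a comparable range.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of subtopic scoring; all names are illustrative.
class SubtopicScorer {
    // Term counts for a tokenized corpus.
    static Map<String, Integer> counts(List<String> tokens) {
        Map<String, Integer> c = new HashMap<>();
        for (String t : tokens) {
            c.merge(t, 1, Integer::sum);
        }
        return c;
    }

    static int total(Map<String, Integer> c) {
        int n = 0;
        for (int v : c.values()) {
            n += v;
        }
        return n;
    }

    // log P(query | subtopic), mixing the subtopic model with a background model.
    static double logLikelihood(List<String> query, Map<String, Integer> subtopic,
                                Map<String, Integer> background, double lambda) {
        double subLen = total(subtopic);
        double bgLen = total(background);
        double sum = 0.0;
        for (String term : query) {
            double pSub = subLen == 0 ? 0.0 : subtopic.getOrDefault(term, 0) / subLen;
            // Add-one smoothing keeps unseen terms from producing log(0).
            double pBg = (background.getOrDefault(term, 0) + 1.0)
                    / (bgLen + background.size() + 1);
            sum += Math.log(lambda * pSub + (1 - lambda) * pBg);
        }
        return sum;
    }

    // Assumed weighting: 0.8 on the language model, 0.2 on the statistical model.
    static double combine(double lmScore, double statScore) {
        return 0.8 * lmScore + 0.2 * statScore;
    }
}
```

To classify a query, compute the combined score against every subtopic of the task and return the subtopic with the highest score.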


2. TREC Conference:

This was a project for the Text REtrieval Conference (TREC). Basically, we were assigned a task to set up a retrieval model that retrieves the most relevant tweets for a given query. The organizers provided 260 million tweets and fifty-five topics. Through the official conference tweets API, we could get 10 thousand tweets with a basic ranking under a topic. In this project, our main target was to get the best ranking of 1000 results for each topic used as a query.

The entire project can be separated into 3 parts: query expansion, document expansion, and the retrieval model.


3. Stock Prediction:

Description

We used the Yahoo Finance API to retrieve 10 years of historical daily stock data for eight thousand stocks on the US market. The data processing is based on Hadoop MapReduce, the storage is on distributed MongoDB, and we use the Mahout API for data analysis. So far we haven’t gotten very precise predictions; the goal was to learn how to handle large data set analysis on a distributed system, especially in the Hadoop ecosystem. However, we are still continuing this project out of our own interest, and there is a bunch of future work in two directions: applying and comparing more models, and extracting finance news keywords and adding them as factors in our model. In this project, my work mainly focused on the cloud platform configuration and

The project does not focus on machine learning but on learning to deal with scalable datasets on a distributed system.

My tasks are:


4. CarFinder

Description:

First, we crawled data from Cars.com. Then we used a collaborative filtering algorithm to recommend second-hand cars to users.
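As an illustration of the collaborative filtering idea, here is a minimal user-based sketch with cosine similarity. These notes don’t record which variant CarFinder used, so the user-based approach, the rating scale, and all names below are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of user-based collaborative filtering.
class CarRecommender {
    // Cosine similarity between two users' rating vectors (car -> rating).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            nb += v * v;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Predict a score for each car the user has not rated, as the
    // similarity-weighted average of other users' ratings; return the best car.
    static String recommend(String user, Map<String, Map<String, Double>> ratings) {
        Map<String, Double> mine = ratings.get(user);
        Map<String, Double> weighted = new HashMap<>();
        Map<String, Double> simSum = new HashMap<>();
        for (Map.Entry<String, Map<String, Double>> other : ratings.entrySet()) {
            if (other.getKey().equals(user)) continue;
            double sim = cosine(mine, other.getValue());
            if (sim <= 0) continue;
            for (Map.Entry<String, Double> r : other.getValue().entrySet()) {
                if (mine.containsKey(r.getKey())) continue; // already rated
                weighted.merge(r.getKey(), sim * r.getValue(), Double::sum);
                simSum.merge(r.getKey(), sim, Double::sum);
            }
        }
        String best = null;
        double bestScore = -1;
        for (String car : weighted.keySet()) {
            double score = weighted.get(car) / simSum.get(car);
            if (score > bestScore) {
                bestScore = score;
                best = car;
            }
        }
        return best; // null when there is nothing to recommend
    }
}
```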

My task is:

Collaborative Filtering

Rating Cars by Sentiment Analysis on Tweets

We get all recent tweets related to one car model and use an NLP tool to judge whether each tweet has negative or positive sentiment. Then we calculate the percentage of positive tweets in this sentiment analysis. Finally, we derive a rating between zero and five from that percentage.
The rating is shown with each item when the recommended list is returned on the user’s page.
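The mapping from sentiment labels to a rating can be sketched as below. I assume a linear mapping (rating = 5 × positive share) and a rating of zero when no tweets are found; the notes only say the rating is derived from the percentage, so both choices are assumptions, and the NLP tool itself is replaced by precomputed boolean labels.

```java
import java.util.List;

// Hypothetical sketch: turn per-tweet sentiment labels into a 0-5 rating.
class SentimentRating {
    // Each flag is true when the NLP tool judged the tweet positive.
    static double rating(List<Boolean> positiveFlags) {
        if (positiveFlags.isEmpty()) {
            return 0.0; // assumption: no tweets means no rating
        }
        long positives = positiveFlags.stream().filter(flag -> flag).count();
        return 5.0 * positives / positiveFlags.size();
    }
}
```

For example, three positive tweets out of four is a 75% positive share, which maps to a rating of 3.75.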


5. Web Service

Description

Basically, it is a project that applies Service Oriented Architecture to the original course registration system. The idea is to separate each module into a JAX-WS (Java API for XML Web Services) service and use Servlets on the client side to call these services on the server side. The client and server communicate using SOAP messages.

Summary of Web Service Final Project by Rui Bi

Description of my work:

My work focused on the permission part (several services and client-side Servlets):
The permission request services follow this procedure:
a) Ask for permission;
b) Put the permission info into the faculty queue;
c) Send an email to the faculty with the Servlet URL;
d) The faculty checks the permission;
e) Put the accept or reject result into the student queue (once delivered, it triggers the Listener Bean);
f) Send an email to the student as a reminder of the permission status update;

3 JMS Design and Construction:
Queue
Two JMS message queues for permission module
Two JMS Destination Resources: JNDI Name: jms/permsStu, jms/permsFal
One JMS Connection Factory:
JNDI Name: jms/permsPool
Resource Type: javax.jms.ConnectionFactory

Object Message: the serialized XML object
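The permission procedure and the two queues can be sketched without a JMS provider. This is a simplified stand-in, not the project’s code: plain in-memory queues replace the JMS destinations jms/permsFal and jms/permsStu, a direct method call replaces the message-driven bean, a list of strings replaces the mail service, and the class and status names are hypothetical.

```java
import java.io.Serializable;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical sketch of the permission workflow with in-memory queues.
class PermissionFlow {
    // Serializable so it could travel as the body of an object message.
    static class PermissionMessage implements Serializable {
        private static final long serialVersionUID = 1L;
        final int permId;
        final String student;
        String status = "PENDING";

        PermissionMessage(int permId, String student) {
            this.permId = permId;
            this.student = student;
        }
    }

    // Stand-ins for the faculty queue (jms/permsFal) and student queue (jms/permsStu).
    final Queue<PermissionMessage> facultyQueue = new ArrayDeque<>();
    final Queue<PermissionMessage> studentQueue = new ArrayDeque<>();
    final List<String> sentMail = new ArrayList<>(); // stand-in for the mail service

    // Steps a) to c): queue the request for faculty and notify them by email.
    void requestPermission(int permId, String student) {
        facultyQueue.add(new PermissionMessage(permId, student));
        sentMail.add("faculty: review permission " + permId);
    }

    // Steps d) and e): faculty decides; the result goes to the student queue.
    void decide(boolean accept) {
        PermissionMessage msg = facultyQueue.poll();
        if (msg == null) {
            throw new IllegalStateException("no pending request");
        }
        msg.status = accept ? "ACCEPTED" : "REJECTED";
        studentQueue.add(msg);
        onStudentMessage(msg); // in the real system the listener bean fires on delivery
    }

    // Step f): remind the student that the permission status changed.
    private void onStudentMessage(PermissionMessage msg) {
        sentMail.add("student " + msg.student + ": permission "
                + msg.permId + " " + msg.status);
    }
}
```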

Created some utility services for the entire team to use:
JMS Services Message Producer for putting messages in a queue. (msgUtil package)
JMS Services Message Receiver for getting messages from a queue. (msgUtil package)
JMS Message-Driven Bean for listening for messages arriving in a queue. (msgListener package)
Java Mail Service as a unified interface to the mail service. (mail package)
Constants for distinguishing the different statuses. (constant package)

Major challenges in my work:
1. JMS Listener setup: if you want to use a JMS MDB Listener, the JMS resources shouldn’t be created in the NetBeans IDE. They must be created in the GlassFish console with a physical name for each destination resource; otherwise the MDB cannot tell the different queues in the JMS resources apart.
2. Mail service: an email containing a complete Servlet URL (like localhost:8080/xxxxx/xxxxServlet?permid=1) for the faculty to check a student’s permission info can be rejected by some mail servers, like Gmail and Hotmail.
3. Handling serializable objects: several additional schemas were added to generate the objects that store permission information as object messages in the queue.
4. Status code design: these codes are designed for checking the message queue status and the mail service status.
5. Refactoring class names and reorganizing all the packages in the project: classes or services created by other teammates may have unclear names and untidy package names, so refactoring and reordering these things is tough.

6. Col*fusion
