
Tuesday, January 14, 2014

Relational Knowledge Transfer

With relational transfer, it is the relationship among data from a source domain to a target domain that is transferred [1]. In our experiments so far, we are looking at synonym learning (source and target with different features) based on relational transfer.

Experiment


Data 


The source data is a combination of the following OO data: poi-3.0, ant-1.7, camel-1.6, ivy-2.0, and jEdit-4.1. The target data is jm1 (Halstead metrics).

Procedure

  • x% of the target data is labelled; the rest is unlabelled.
  • Only 50% of the target data are used as test instances (these come from the unlabelled portion).
  • BORE is applied separately to the labelled x% of the target and to the source data.
  • Each instance now has a score that is the product of the ranks from the power ranges (the scores are normalized).
  • Each target instance gets a BORE score using the ranks learned from the x%.
  • These scores are then matched to their nearest instance scores from the source data, and the majority defect label is assigned to the target instance (a sketch of this matching step follows the list).
  • For the within-project experiment, the labelled x% of the target data is the train set and the 50% test instances are the test set.
  • The above is also benchmarked against a 10 x 10 cross-validation experiment on jm1 with Naive Bayes.
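
To make the matching step concrete, here is a minimal Clojure sketch, assuming the BORE power-range ranks have already been computed; k, the map keys, and the function names are illustrative assumptions, not the experiment's actual code.

;; Toy sketch: an instance's score is the product of its power-range ranks;
;; a target instance takes the majority defect label of its k nearest source
;; scores. Data shapes and names are illustrative assumptions.
(defn bore-score [ranks]
  (reduce * ranks))

(defn nearest-k [source-scored target-score k]
  ;; source-scored: seq of maps like {:score 0.42 :label :defective}
  (take k (sort-by #(Math/abs (double (- (:score %) target-score)))
                   source-scored)))

(defn majority-label [instances]
  (key (apply max-key val (frequencies (map :label instances)))))

(defn predict [source-scored target-score k]
  (majority-label (nearest-k source-scored target-score k)))

For example, (predict source-scored 0.42 5) returns the majority label of the five source instances whose scores sit closest to 0.42.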

Initial Results

Initial results are available in syns.pdf.

So far, four things are offered:

  1. Synonyms: if the technology, data collection methods, or metrics change, can we still use previous projects?
  2. A cross-prediction method for synonyms based on relational transfer between different data sets.
  3. The percentage of labelled data used: the second-opinion paper goes as low as 6%, while the mixed paper experiments with 10%.
  4. The method closely resembles that of the second-opinion paper, and BORE is linear.

Reference

[1] S. J. Pan and Q. Yang, "A Survey on Transfer Learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345-1359, 2010.

Wednesday, February 27, 2013

Not Global vs Local

I'm playing with the idea of a global-and-local rule family, where local rules are the children, global rules are the grandparents, and the middle rules are the parents. In other words, the local rules are subsets of the global and middle rules.

The goal: provide the project manager with a family of options. If a local rule is best but the manager would prefer a more global solution, find its global and middle relatives and see if they work just as well.

My initial experiment is a little flawed by my simple rule-score metric, P(defects), but the results are useful.
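
Here is a toy sketch of the family idea in Clojure: a rule is a set of attribute constraints, dropping constraints yields its more global relatives, and rules are scored with the simple P(defects) metric mentioned above. The data shape and names are my own assumptions.

;; A rule is a map of attribute -> required value. Dropping a constraint
;; gives a more global relative. Rules are scored by P(defects) over the
;; rows they select.
(defn matches? [rule row]
  (every? (fn [[attr value]] (= (get row attr) value)) rule))

(defn p-defects [rule rows]
  (let [selected (filter #(matches? rule %) rows)]
    (if (empty? selected)
      0.0
      (/ (count (filter :defective? selected)) (double (count selected))))))

(defn relatives [rule]
  ;; parents: drop one constraint; apply again to reach the grandparents
  (map #(dissoc rule %) (keys rule)))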

Tuesday, January 26, 2010

1987 Evett Fiber Model

;; 1987 Evett fibre model in Clojure. The matrix operations (nrow, div, plus,
;; matrix, pow, minus, mult, mmult, trans, det) come from Incanter.
(require '[incanter.core :refer [nrow div plus matrix pow minus mult mmult trans det]])

(defn numerator [y x]
  (let [m     (nrow x)
        ;; fnc1 and fnc2: helper terms of the model
        fnc1  (fn [t u]
                (- (/ (* (Math/exp (- u)) (- t 1) (Math/pow u (- t 2)))
                      (- t 2))
                   (* (- (Math/exp (- u))) (Math/pow u (- t 1)))))
        fnc2  (fn []
                (/ (* (- m 1) (+ m 1)) m))
        ;; smean: mean of the control sample x
        smean (fn [x]
                (div (reduce plus x) m))
        ;; Sx: comat is the diagonal matrix of per-feature variances of x;
        ;; icomat is the same diagonal with its two entries swapped
        Sx    (fn [x]
                (let [vals   (div (reduce plus
                                          (matrix (map #(pow (minus % (smean x)) 2) x)))
                                  m)
                      comat  (matrix [[(first vals) 0] [0 (second vals)]])
                      icomat (matrix [[(second vals) 0] [0 (first vals)]])]
                  [comat icomat]))]
    (/ (/ (fnc1 (/ m 2) (first y))
          (* Math/PI (fnc1 (/ (- m 2) 2) (first y))))
       (mult (pow (det (mult (fnc2) (first (Sx x)))) 0.5)
             (Math/pow (+ 1 (mmult (mmult (mult (minus y (smean x)) (fnc2))
                                          (second (Sx x)))
                                   (trans (minus y (smean x)))))
                       (/ m 2))))))

(defn denominator [y z lamb]
  (let [;; Gaussian kernel density estimate over the background population z
        Normal-zlamb (fn [zi]
                       (* (/ 1 (Math/sqrt (* 2 Math/PI (Math/pow lamb 2))))
                          (Math/exp (- (/ (mmult (minus y zi) (trans (minus y zi)))
                                          (* 2 (Math/pow lamb 2)))))))
        SumZ         (/ (apply + (map #(Normal-zlamb %) z)) (count z))]
    SumZ))

(defn likelihood [x y z lamb]
  (/ (numerator y x) (denominator y z lamb)))

Tuesday, October 6, 2009

MESO

In their paper, "MESO: Supporting Online Decision Making in Autonomic Computing Systems", Kasten and McKinley showcase their novel approach to pattern classification. Called MESO (Multi-Element Self-Organising tree), this pattern classifier is designed to support online, incremental learning and decision making in autonomic systems.

Using supervised data, MESO takes a novel approach to clustering. It also introduces a new tree structure to organise the resulting clusters, which the authors call sensitivity spheres.

To create their sensitivity spheres, Kasten and McKinley improved on the long-standing leader-follower algorithm, which creates small clusters of patterns. Basically, a training pattern within a specified distance of an existing cluster is assigned to that cluster; otherwise, a new cluster is created.

What is the problem with this algorithm? The distance measure that determines the size of the clusters is fixed throughout the clustering process.

In their paper, the authors propose the use of a growth function to remedy this problem:
grow_{\delta} = \frac{(d - \delta)\,\frac{\delta}{d}\,f}{1 + \ln(d - \delta + 1)^{2}}

where:

  • d \rightarrow the distance between the new pattern and the nearest sensitivity sphere
  • \frac{\delta}{d} \rightarrow scales the result relative to the difference between the current \delta and d

Note: the denominator serves to limit the growth rate based on how far the current \delta is from d.
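
To make the idea concrete, here is a minimal Clojure sketch of a leader-follower step with an adaptive threshold of this shape. The Euclidean distance, the growth factor f, and all names are my own assumptions, not the MESO implementation.

;; Leader-follower with an adaptive delta. Usage:
;; (reduce add-pattern {:spheres [] :delta 0.3 :f 0.5} training-patterns)
(defn distance [a b]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) a b))))

(defn grow-delta [d delta f]
  (/ (* (- d delta) (/ delta d) f)
     (+ 1 (Math/pow (Math/log (+ (- d delta) 1)) 2))))

(defn add-pattern [{:keys [spheres delta f] :as state} pattern]
  (if (empty? spheres)
    (assoc state :spheres [{:center pattern :members [pattern]}])
    (let [nearest (apply min-key #(distance (:center %) pattern) spheres)
          d       (distance (:center nearest) pattern)]
      (if (<= d delta)
        ;; within the nearest sensitivity sphere: join that cluster
        (update state :spheres
                (fn [ss] (mapv #(if (identical? % nearest)
                                  (update % :members conj pattern)
                                  %)
                               ss)))
        ;; otherwise: start a new sphere and grow delta toward d
        (-> state
            (update :spheres conj {:center pattern :members [pattern]})
            (update :delta + (grow-delta d delta f)))))))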

Once the data is assigned to these clusters, or sensitivity spheres, it is organised into a novel tree structure. Rather than focusing on putting individual patterns into large clusters close to the root of the tree, Kasten places the focus on the sensitivity spheres themselves. He then builds a MESO tree starting at the root node, which is home to all the sensitivity spheres. He further explains:

The root node is then split into subsets of similar spheres which produces child nodes. Each child node is further split into subsets until each child contains only one sphere.
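
That recursive split reads naturally as a small sketch; here split-spheres stands in for whatever similarity-based partitioning MESO actually uses, and is assumed to return at least two non-empty subsets whenever it is given more than one sphere.

;; Recursively split the spheres into child nodes until each leaf holds one.
(defn build-meso-tree [spheres split-spheres]
  (if (<= (count spheres) 1)
    {:spheres spheres :children []}
    {:spheres  spheres
     :children (mapv #(build-meso-tree % split-spheres)
                     (split-spheres spheres))}))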



Results

Using eight datasets, the MESO results show its superiority in terms of speed and accuracy over other classifiers.

Thursday, September 3, 2009

Fayola Peters


Fayola's journey into the world of Computer Science began in 2004 as an undergrad at Coppin State University in Baltimore.

Currently, she is working toward her Master's degree at West Virginia University and has jumped onto the Forensic Interpretation Model research path.

CLIFF

CLIFF is a plotting tool for visualizing data. The software features four (4) of the current models used in the forensic community, namely the 1995 Evett model, the 1996 Walsh model, the Seheult model, and the Grove model. For each model, data can be generated randomly and plotted. Other features of CLIFF include:
  • the ability to perform dimensionality reduction by applying the Fastmap algorithm
  • the ability to determine if the results gained from a region of space are important and well supported.

Monday, August 24, 2009

Creating the New Generation of Forensic Interpretation Models

In their report, "Strengthening Forensic Science in the United States: A Path Forward", the National Academy of Sciences (NAS) laid out not only the challenges facing the forensic science community but also put forward recommendations for the improvements required to alleviate the problems, disparities, and lack of mandatory standards currently facing the community.

Highlighted in this report is the concern that faulty forensic science can contribute to the wrongful conviction of innocent people. On the other side of the coin, someone who is guilty of a crime can be wrongfully acquitted.

Take, for instance, a hypothetical case of a suspect being connected to a crime scene by trace evidence such as glass. It would be very easy to challenge the forensic analysis by simply requesting that it be done by at least five (5) different forensic labs. One could be almost 100% certain that the court would have no choice but to throw out the evidence because of the inconsistency of the respective results.

It is for this reason that we have taken up the NAS's challenge to develop standard forensic interpretation methods that convincingly "demonstrate a connection between evidence and a specific individual or source". It is our hope that this project will not only supply the analyst with an interpretation of the data, but also a measure of what minimal changes in the data can result in a different interpretation.

For this project, we studied current methods of forensic interpretation of glass evidence. In this subfield of forensics we have so far identified at least three separate branches of research leading to three different 'best' models: the 1995 Evett model, the 1996 Walsh model, and the 2002 Koons model. To date, our research has shown that the aforementioned models are 'broken'. There are two root causes for the broken mathematical models:

  1. 'Brittle' interpretations where small input changes can lead to large output changes.
  2. Inappropriate assumptions about the distribution of data.

To deal with these two issues, our project proposes the use of (a) clustering algorithms (MESO and ant) and (b) treatment learning. The clustering algorithms will allow us to reason about crime-scene data without knowledge of standard statistical distributions, while treatment learning offers the analyst a measure of how strongly an interpretation should be believed.
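
As a rough illustration of the treatment-learning side, here is a toy, single-constraint version of the idea in Clojure (not the actual treatment learner, and the data shape is assumed): a candidate treatment is scored by how much selecting on it shifts the proportion of a preferred class.

;; Score a treatment (a single attribute = value constraint) by the lift it
;; gives to the proportion of a preferred class. Data shape is assumed:
;; each row is a map of attributes plus a :class key.
(defn class-ratio [rows preferred]
  (if (empty? rows)
    0.0
    (/ (count (filter #(= (:class %) preferred) rows))
       (double (count rows)))))

(defn best-treatment [rows preferred]
  (let [candidates (distinct (for [row rows
                                   [attr value] (dissoc row :class)]
                               [attr value]))
        lift       (fn [[attr value]]
                     (- (class-ratio (filter #(= (get % attr) value) rows) preferred)
                        (class-ratio rows preferred)))]
    (apply max-key lift candidates)))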

Our project will also boast a plotting tool (CLIFF) for visualizing data. The software features four (4) of the current models used in the forensic community, namely the 1995 Evett model, the 1996 Walsh model, the Seheult model, and the Grove model. For each model, data can be generated randomly and plotted. Other features of CLIFF include:

  • the ability to perform dimensionality reduction by applying the Fastmap algorithm (a sketch of a single Fastmap projection follows this list)
  • and the ability to determine if the results gained from a particular region of space are important and well supported.
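
Here is a minimal sketch of a single Fastmap projection (after Faloutsos and Lin), for flavour only; the pivot heuristic, the distance function, and the names are my own assumptions and this is not CLIFF's code.

;; Project each point onto the axis defined by two far-apart pivots, using
;; the cosine law: x_i = (d(a,x)^2 + d(a,b)^2 - d(b,x)^2) / (2 d(a,b)).
;; Assumes the pivots are distinct so that d(a,b) > 0.
(defn euclidean [a b]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) a b))))

(defn farthest-from [points p dist]
  (apply max-key #(dist p %) points))

(defn fastmap-axis [points dist]
  (let [a   (farthest-from points (first points) dist)
        b   (farthest-from points a dist)
        dab (dist a b)]
    (map (fn [x]
           (/ (+ (Math/pow (dist a x) 2)
                 (Math/pow dab 2)
                 (- (Math/pow (dist b x) 2)))
              (* 2 dab)))
         points)))

For example, (fastmap-axis points euclidean) returns one new coordinate per point; repeating the process on the residual distances yields further dimensions.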

In the end, it must be made clear that our purpose is not to determine innocence or guilt, or to give a 100% guarantee of a match/non-match. Instead, our goal is to aid in the decision-making process of forensic interpretation. We want to provide the analyst with a standard, dependable tool/model that reports an interpretation of the crime-scene data as well as the treatment learner's analysis of what minimal 'treatment' could change that interpretation. What happens after this is in the hands of the analyst.

Ultimately, we want to be able to rely on the results presented in a court of law: results based on correct models accepted and used by the entire forensic community. So, if a suspect demands that tests be done by five (5) different forensic labs, there should be no dramatic differences in the results.

Tuesday, August 11, 2009

New version of Cliff

Fayola and Zach coded an amazing new version of the forensics tool, which Timm used to write an NIJ proposal. So fingers crossed!

(Short) Paper accepted to ASE'09

Assessing the Relative Merits of Agile vs Traditional Software Development.

Bryan Lemon, Aaron Riesbeck, Tim Menzies, Justin Price, Joseph D’Alessandro, Rikard Carlsson, Tomi Prifiti, Fayola Peters, Hiuhua Lu, Dan Port

We implemented Boehm-Turner’s model of agile and plan-based software development. That tool is augmented with an AI search engine to find the key factors that predict for the success of agile or traditional plan-based software developments. According to our simulations and AI search engine: (1) in no case did agile methods perform worse than plan-based approaches; (2) in some cases, agile performed best. Hence, we recommend that the default development practice for organizations be an agile method. The simplicity of this style of analysis begs the question: why is so much time wasted on evidence-less debates on software process when a simple combination of simulation plus automatic search can mature the dialogue much faster?