
Apple Disease Dataset Analysis

Description
A code-based data exploration that computes user prediction metrics from a test system built at UniBz.
Transcript
Apple Disease Dataset Analysis
Analysing user prediction metrics
Erdal Domi, Gabriela Boyadjiyska

In [1]:
# Erdal Domi, Gabriela Boyadjiyska
# Group number 2
# Assignment 4

import pandas as pd
from matplotlib import pyplot as plt
import numpy as np

# import all log data
log_data = pd.read_csv("ChallengeData_10-31.csv", delimiter=";")

# remove image information columns
clean_data = log_data.loc[:, "challenge_id":"time"]

# convert final rank array to python data structure
# the final rank is an array of computed scores for each disease
score_tuples = clean_data.loc[:, 'final_rank']
final_ranks = []
for score in score_tuples:
    final_ranks.append(eval(score))

# collect target disease for each round
target_diseases = []
for index, row in clean_data.iterrows():
    target_diseases.append(row['target_disease'])

# This function checks the presence of the target disease in the positive scoring diseases in the final ranking
# Effectively, this checks if the target disease was shown to the user in the disease selection phase
# round_ranking is an array containing computed scores for the diseases in this particular round
def disease_presence(target_disease, round_ranking):
    for current_ranking in round_ranking:
        # current_ranking[0] holds the name of the disease
        # current_ranking[1] holds the score of the disease
        if current_ranking[1] > 0 and current_ranking[0] == target_disease:
            return 1
    return 0

# This function counts the number of shown results to the user
def shown_diseases(round_ranking):
    positive_count = 0
    for current_disease_ranking in round_ranking:
        if current_disease_ranking[1] > 0:
            positive_count += 1
    if positive_count > 0:
        return positive_count
    return 0

# This function returns the 1-based position of the target disease in the ranking
def disease_position(target_disease, round_rank):
    indices = [i for i, tupl in enumerate(round_rank) if tupl[0] == target_disease]
    return indices[0] + 1

# the dcg calculation is made ad-hoc for this example since we only consider dcg@3
def calculate_dcg(disease_presence, disease_position):
    dcg = 0
    if disease_presence:
        # if the disease is present in the results shown to the user
        if (disease_position == 1 or disease_position == 2):
            # if the disease is the first or the second, the dcg will be 1
            dcg = 1
        elif (disease_position == 3):
            # if the disease is in the third place, dcg is 1/log2(3)
            dcg = 0.6309
    else:
        return 0
    return dcg

dcg_sum = 0
for i in range(0, len(target_diseases)):
    if shown_diseases(final_ranks[i]) > 0:
        dcg_sum = dcg_sum + calculate_dcg(disease_presence(target_diseases[i], final_ranks[i]),
                                          disease_position(target_diseases[i], final_ranks[i]))

'''Since iDCG in our system is always 1, DCG=nDCG '''
print("Average nDCG@3 for all interactions: ", dcg_sum/len(target_diseases))

'''Now we need to extract the relevant data for the different interfaces and apply the same formula '''
dataset_4 = log_data[log_data['round_images'] == 4]
score_tuples_4 = dataset_4.loc[:, 'final_rank']
final_ranks_4 = []
for score in score_tuples_4:
    final_ranks_4.append(eval(score))

# collect target disease for each round of 4
target_diseases = []
for index, row in dataset_4.iterrows():
    target_diseases.append(row['target_disease'])

dcg_sum = 0
for i in range(0, len(target_diseases)):
    if shown_diseases(final_ranks_4[i]) > 0:
        dcg_sum = dcg_sum + calculate_dcg(disease_presence(target_diseases[i], final_ranks_4[i]),
                                          disease_position(target_diseases[i], final_ranks_4[i]))
print("\nAverage nDCG@3 for 4 image interface: ", dcg_sum/len(target_diseases))

'''----------------------'''
dataset_8 = log_data[log_data['round_images'] == 8]
score_tuples_8 = dataset_8.loc[:, 'final_rank']
final_ranks_8 = []
for score in score_tuples_8:
    final_ranks_8.append(eval(score))

# collect target disease for each round of 8
target_diseases = []
for index, row in dataset_8.iterrows():
    target_diseases.append(row['target_disease'])

dcg_sum = 0
for i in range(0, len(target_diseases)):
    if shown_diseases(final_ranks_8[i]) > 0:
        dcg_sum = dcg_sum + calculate_dcg(disease_presence(target_diseases[i], final_ranks_8[i]),
                                          disease_position(target_diseases[i], final_ranks_8[i]))
print("Average nDCG@3 for 8 image interface: ", dcg_sum/len(target_diseases))

'''----------------------'''
dataset_12 = log_data[log_data['round_images'] == 12]
score_tuples_12 = dataset_12.loc[:, 'final_rank']
final_ranks_12 = []
for score in score_tuples_12:
    final_ranks_12.append(eval(score))

# collect target disease for each round of 12
target_diseases = []
for index, row in dataset_12.iterrows():
    target_diseases.append(row['target_disease'])

dcg_sum = 0
for i in range(0, len(target_diseases)):
    if shown_diseases(final_ranks_12[i]) > 0:
        dcg_sum = dcg_sum + calculate_dcg(disease_presence(target_diseases[i], final_ranks_12[i]),
                                          disease_position(target_diseases[i], final_ranks_12[i]))
print("Average nDCG@3 for 12 image interface: ", dcg_sum/len(target_diseases))

nDCG
For the first section, we compute the nDCG@3 for the overall number of interactions, and for each interface separately. To do that, we build on the code submitted for Assignment 2. First we extract the data and save it in Python variables. We also keep the helper functions. Computing nDCG@3 for all interactions requires us to evaluate DCG@3/IDCG@3 for the whole data set. DCG@3 itself is computed by a helper function which follows the formula given on the slides.

Average nDCG@3 for all interactions: 0.5775975308641975
Average nDCG@3 for 4 image interface: 0.5217391304347826
Average nDCG@3 for 8 image interface: 0.5187
Average nDCG@3 for 12 image interface: 0.6753933333333334

In [2]:
'''The log data contained errors in labeling successes, perhaps due to concurrency issues.
We do not fix this data, but we simply count the instances'''

inspection_data = log_data.loc[:, "target_disease":"selected_disease"].values
error_count = 0
ok_count = 0
for rounder in inspection_data:
    # rounder[0] is the target disease, rounder[2] the success label, rounder[3] the selected disease
    if rounder[2] == True and rounder[0] != rounder[3]:
        # print('false True', rounder[2], 'target: ', rounder[0], 'selected: ', rounder[3])
        error_count += 1
    elif rounder[2] == False and rounder[0] == rounder[3]:
        # print('false False', rounder[2], rounder[0], rounder[3])
        error_count += 1
    else:
        # print('supposed ok ', rounder[2], 'target: ', rounder[0], 'selected: ', rounder[3])
        ok_count += 1
print('\nRows with false success labels: ', error_count)

'''Log data user accuracy as computed by number of successes over the total number of attempts.'''
success_labels = log_data.loc[:, 'success']
success_count = 0
for row in success_labels:
    if row == True:
        success_count += 1
print("The user accuracy of the log data is: ", success_count/81)
print("The user accuracy not considering user 294 is: ", success_count/(81-4))

User independent variable
We analyze how the observed accuracy changes when considering independent variables. The first variable we consider is the user. The approach is that if a user gives noisy input or has only false guesses, we can remove those instances from the calculations. We sorted the data in Excel and found that there is only one user who gave completely wrong answers, user ID 294. With 4/4 wrong answers, we can decide to exclude this user from the computation and see if the accuracy changes. Excluding this user changes the accuracy by only about 2 percentage points (from roughly 0.358 to 0.377), so we conclude that it is not meaningful enough to draw a conclusion. Another independent variable that we can analyze is the target disease.

Rows with false success labels: 5
The user accuracy of the log data is: 0.35802469135802467
The user accuracy not considering user 294 is: 0.37662337662337664
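The nDCG@3 and accuracy computations above repeat the same loop for every subset of the log. The following is a minimal sketch of how they could be consolidated, not the authors' implementation. It assumes the rankings are stored best-first as (disease, score) tuples and that exactly one disease is relevant per round, so iDCG@3 is 1. The DCG variant used here, rel_1 + sum over i >= 2 of rel_i / log2(i), is the one consistent with the constants in calculate_dcg (1, 1, 0.6309); the slides themselves are not reproduced in the transcript. The column name `user_id` and the toy rows are hypothetical. Two substitutions are made on purpose: `ast.literal_eval` replaces `eval` for parsing the ranking strings, and accuracy is derived from comparing the target and selected disease rather than from the `success` label, which the inspection above found to be wrong in some rows.

```python
import ast
import math

import pandas as pd


def dcg_at_3(target, ranking):
    """DCG@3 for one relevant disease: rel_1 + sum_{i>=2} rel_i / log2(i).

    `ranking` is a best-first list of (disease, score) tuples. A disease
    counts as shown to the user only if its score is positive.
    """
    for position, (disease, score) in enumerate(ranking[:3], start=1):
        if disease == target and score > 0:
            return 1.0 if position == 1 else 1.0 / math.log2(position)
    return 0.0


def mean_ndcg_at_3(df):
    """Average nDCG@3 over all rows (iDCG@3 is 1, so nDCG equals DCG)."""
    ranks = df["final_rank"].map(ast.literal_eval)
    scores = [dcg_at_3(t, r) for t, r in zip(df["target_disease"], ranks)]
    return sum(scores) / len(scores)


def accuracy(df, exclude_users=()):
    """Share of rounds in which the selected disease equals the target.

    `user_id` is a hypothetical column name for the user identifier.
    """
    kept = df[~df["user_id"].isin(list(exclude_users))]
    return float((kept["target_disease"] == kept["selected_disease"]).mean())


# Toy rows, invented for illustration only (not taken from the log file).
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "round_images": [4, 4, 8, 8],
    "target_disease": ["scab", "mildew", "scab", "canker"],
    "selected_disease": ["scab", "scab", "scab", "mildew"],
    "final_rank": [
        "[('scab', 0.9), ('mildew', 0.4), ('canker', 0.0)]",
        "[('scab', 0.7), ('canker', 0.2), ('mildew', 0.1)]",
        "[('mildew', 0.5), ('scab', 0.5), ('canker', 0.0)]",
        "[('scab', 0.6), ('mildew', 0.3), ('canker', 0.0)]",
    ],
})

# One pass per interface instead of three copied loops.
per_interface = {
    int(n): mean_ndcg_at_3(group) for n, group in toy.groupby("round_images")
}

print("Average nDCG@3 for all interactions:", mean_ndcg_at_3(toy))
print("Average nDCG@3 per interface:", per_interface)
print("Accuracy, all users:", accuracy(toy))
print("Accuracy, excluding user 3:", accuracy(toy, exclude_users=[3]))
```

Because the third-place weight is computed as 1/log2(3) instead of the rounded constant 0.6309, results from this sketch can differ from the notebook's figures in the later decimal places.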