Google Play Store Apps

Apps

Hypotheses:

- The category with the highest rating_count is Music or Games;

- The category with the most apps is also the most popular (has the highest rating_count);

- There is a correlation between rating_count and the rating itself.

 

Some columns have incorrect data types: Released and Size. Released should be a datetime. Size is probably stored as a string because each value contains the letter 'M' to indicate megabytes. These issues will be added to the list too.

All numerical columns look realistic; for example, rating falls between 0 and 5.
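The two type fixes for Released and Size could be sketched with pandas as follows. This is a minimal illustration on a toy frame: the column names, the date format "%b %d, %Y", and the "Varies with device" placeholder are assumptions, not confirmed details of the dataset.

```python
import pandas as pd

# Toy frame; column names and value formats are assumptions for illustration.
apps = pd.DataFrame({
    "Released": ["Feb 26, 2020", "May 21, 2020", None],
    "Size": ["10M", "2.9M", "Varies with device"],
})

# Parse release dates; entries that cannot be parsed become NaT instead of raising.
apps["Released"] = pd.to_datetime(
    apps["Released"], format="%b %d, %Y", errors="coerce"
)

# Remove the trailing 'M' and convert to float megabytes;
# text that is not numeric becomes NaN.
size_text = apps["Size"].str.replace(r"M$", "", regex=True)
apps["Size"] = pd.to_numeric(size_text, errors="coerce")

print(apps.dtypes)
```

Using errors="coerce" keeps the conversion from failing on unexpected values, at the cost of turning them into missing values that then need to be handled.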

 

But the maximum value for price is $400, which is a bit suspicious. We will dig into that later.

Some categories of interest like Music and Education are given with different labels: there are both 'Music & Audio' and 'Music' labels as well as 'Education' and 'Educational' for education.

They should be merged together to represent a single category.
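Merging the duplicate labels can be done with a simple mapping. The sketch below assumes the column is called Category and keeps 'Music' and 'Education' as the surviving labels; which label to keep is an arbitrary choice made here for illustration.

```python
import pandas as pd

# Toy frame; the column name "Category" is an assumption.
apps = pd.DataFrame({
    "Category": ["Music & Audio", "Music", "Education", "Educational", "Business"],
})

# Map each variant label onto the single label we want to keep.
label_map = {
    "Music & Audio": "Music",
    "Educational": "Education",
}
apps["Category"] = apps["Category"].replace(label_map)

print(apps["Category"].value_counts())
```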

Later, we will subset for the top 8 columns after finishing cleaning.

Before we explore further, let's deal with the issues we highlighted. Here is the final list:

Issues List For the Dataset:

 

Missing values in several columns: 

  • Rating, 
  • rating count,
  • Installs, 
  • minimum and maximum installs, 
  • currency and more
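A quick way to see which columns have missing values, and how many, is to count nulls per column. The frame below is a toy example with assumed column names, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; column names are assumptions.
apps = pd.DataFrame({
    "App Name": ["A", "B", "C"],
    "Rating": [4.5, np.nan, 3.0],
    "Rating Count": [100, np.nan, np.nan],
    "Currency": ["USD", "USD", None],
})

# Count missing values per column and keep only the columns that have any.
missing = apps.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)

print(missing)
```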

 

Drop these columns: 

  • App ID, 
  • minimum android version, 
  • developer ID, 
  • website and email, 
  • privacy policy link.
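Dropping the columns listed above might look like the sketch below. The exact column names are assumptions about how the dataset spells them; errors="ignore" is used so that a name that does not exist is skipped rather than raising.

```python
import pandas as pd

# Empty toy frame that only carries column names (assumed spellings).
apps = pd.DataFrame(columns=[
    "App Name", "App Id", "Category", "Minimum Android", "Developer Id",
    "Developer Website", "Developer Email", "Privacy Policy",
])

# Columns that are not needed for the analysis.
to_drop = [
    "App Id", "Minimum Android", "Developer Id",
    "Developer Website", "Developer Email", "Privacy Policy",
]
apps = apps.drop(columns=to_drop, errors="ignore")

print(list(apps.columns))
```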

 

Incorrect data types for release date and size.

Music and Education are each represented by different labels.

Drop unnecessary categories.

Looking at the median price for each category, Business seems to be the winner, closely followed by Books & Reference.

 

Now, let's see if more ratings mean higher ratings. Again, we will only look at apps with fewer than 100k ratings and exclude the ones with no ratings.

This supports our earlier notion that there is a non-linear positive relationship between rating and rating count, although the coefficient of r=0.019 shows that the linear correlation is very weak.
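The filtering and correlation step can be sketched as follows. The numbers are made up and the column names are assumptions; the point is only to show the filter (more than zero and fewer than 100k ratings) and the Pearson coefficient. Because a non-linear relationship is not captured well by Pearson's r, the sketch also computes a rank-based (Spearman-style) coefficient by correlating the ranks.

```python
import pandas as pd

# Made-up values for illustration; column names are assumptions.
apps = pd.DataFrame({
    "Rating": [0.0, 4.1, 3.5, 4.8, 4.4],
    "Rating Count": [0, 1200, 50, 250000, 98000],
})

# Keep apps that have at least one rating and fewer than 100k ratings.
mask = (apps["Rating Count"] > 0) & (apps["Rating Count"] < 100_000)
subset = apps[mask]

# Pearson correlation measures the linear association.
r = subset["Rating"].corr(subset["Rating Count"])

# Correlating the ranks gives a Spearman-style coefficient,
# which is sensitive to any monotonic relationship.
rho = subset["Rating"].rank().corr(subset["Rating Count"].rank())

print(f"Pearson r = {r:.3f}, rank-based rho = {rho:.3f}")
```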
